Internet Text Files: End Of Line Characters
Version 1.3
HTML format of texteol.stf file

© 1994-1999 by Peter Benjamin
All Rights Reserved



You may distribute this paper by any means for non-profit, non-commercial applications. Please send it to your friends and local BBSes and FTP archives. No charge for distribution is allowed. All text in this file is copyrighted and no changes are allowed. Please email corrections, suggestions and paper ideas to
pete@peterbenjamin.com

This paper is one of a series on cross platform and operating system portability issues.


Abstract

The intended reader of this paper is the novice to intermediate computer user who needs to share text files across multiple operating systems. This paper provides a simple explanation of man-readable computer files. The difference in the EOL, or end-of-line, character between platforms is explained for DOS, Macintosh and Unix. The binary and ASCII transfer modes of FTP (File Transfer Protocol) are explained briefly.



Introduction

Many computer users are confused and disappointed with the results when they transfer a man-readable text file to another platform. The reason is that DOS, Macintosh and Unix each use a different End-Of-Line (EOL) character. This paper explains these differences and mentions some conversion methods. ASCII is the character set discussed most; other character sets are mentioned, and their EOL method is usually the same.

"Fixed length" record files do not use the EOL concept. Every record (line) has the same length, so the record length alone determines where each line ends.

With the increasing use of the Internet and its FTP service, incorrectly set FTP options prevent accurate transfers and commonly result in garbage files. Binary files must be transferred in "binary", or "integer", mode, which is different from the mode used for text files. However, text files can be transferred in binary mode with no harm to the data.




Text File Facts

These files are known by many names including, but not limited to:

  ASCII text file,   flat text file,   flat file,  man-readable file,
  ascii file,        text file,        notepad file (for Windows)
  ascii,             simple text,      teachtext (for Macintosh)

Their filenames can end in the following, but a filename ending in these letters does not guarantee that the file is man-readable text.

  *.txt  *.lst  *.doc  *.man  *.1    *.inf  *.eps  *.ps   *.rtf   *.htm
  *.asc  *.stf  *.ls   *.lsr  *.tex  *.cfg  *.pst  *.mal  *.rme   *.mbx   

ASCII stands for American Standard Code for Information Interchange. ASCII is pronounced "as-key".

The man-readable portions are the same on all ASCII platforms and can be ported between them with little or no conversion. There are other character sets, like IBM EBCDIC, and files in these formats can be converted. The displayed part of the ASCII character set consists of the characters you see on the keyboard.

  abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ
  0123456789`~!@#$%^&*()_+-=[]{};':",.<>?/|\

These are the "displayed" characters. Special "non-display" characters also exist, like "space" (a blank), "tab" and the "End-Of-Line" or EOL. These characters are supposed to be invisible to the reader; that is, they are in the class of "non-displayed" characters. ASCII is a 7-bit code with 128 characters: the 94 display characters above and 34 non-display characters (the space plus 33 control characters). A byte can hold 256 possible values, but the upper 128 are not ASCII; each platform assigns them its own extended characters. Some edit software will display non-display characters as special symbols, most commonly as a period. Other display types, such as "hex" or "octal", are meant for programmers.
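The count of 94 display characters can be checked with a short Python sketch. It assumes the display characters are exactly the letters, digits and punctuation of 7-bit ASCII (codes 0 to 127); the variable names are illustrative only.

```python
import string

# The 128 seven-bit ASCII codes, as one-character strings.
ascii_codes = [chr(code) for code in range(128)]

# "Display" characters: letters, digits and punctuation (the space is excluded).
visible = string.ascii_letters + string.digits + string.punctuation
display = [c for c in ascii_codes if c in visible]

# Everything else in the 7-bit range is "non-display": space, tab, <cr>, <lf>, ...
non_display = [c for c in ascii_codes if c not in visible]

print(len(display))       # 94
print(len(non_display))   # 34
print(ord(" "), ord("\t"), ord("\r"), ord("\n"))   # 32 9 13 10
```

The last line prints the numeric codes of the space, tab, <cr> and <lf> characters.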



End-of-Line Characters

The three platforms, DOS, Macintosh and Unix, each use a different end-of-line character or characters (EOL) to indicate the start of a new line. The EOL represents two actions the computer should take when displaying the lines of a text file, both taken from common typewriter functions: carriage return and line feed. These terms are commonly abbreviated as <cr> and <lf>, and the abbreviations are used from now on.

<cr> Carriage return moves the cursor, or current display location for the next character, back to the beginning of the same line.

<lf> Line feed moves the display location to directly below the current position; in other words, it goes to the next line down.

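The conventions differ for historical reasons: each platform's designers chose their own encoding of these two typewriter actions, and the choices were never unified.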

FIGURE 1. Simplified Platform Text File EOL Representations

  DOS:        text of the line<cr><lf>
  Macintosh:  text of the line<cr>
  Unix:       text of the line<lf>

Figures of this kind are often drawn with "fixed length" lines of text to increase clarity; that is, the end of each line and entire blank lines are "padded" with blanks. In actuality, to save storage space, these trailing padding blanks are not present in the stored file: the EOL follows directly after the last character of the line.

Old versions of DOS also use <^Z>, or "Control-Z", a non-displayed ASCII character, to indicate the end of the file.
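As a minimal sketch of how a program might cope with this marker, the Python function below drops one trailing Control-Z (ASCII code 26) from the raw bytes of a file. The function name and the sample data are made up for illustration.

```python
CTRL_Z = b"\x1a"  # Control-Z, ASCII code 26


def strip_dos_eof(data: bytes) -> bytes:
    """Drop a single trailing Control-Z end-of-file marker, if present."""
    if data.endswith(CTRL_Z):
        return data[:-1]
    return data


sample = b"first line\r\nsecond line\r\n\x1a"
print(strip_dos_eof(sample))   # b'first line\r\nsecond line\r\n'
```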

If trailing blanks were stored, they could increase the size of the file by a factor of 2 or 3. Sometimes the display of a text file makes it appear to have them, even though they are not stored.

On some systems, like mainframes, the blanks really can be there. These are known as fixed length record files. No EOL characters are used.
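A short Python sketch shows how lines can be recovered from such a file using the record length alone. The record length of 10 and the sample data are assumptions chosen for illustration; real systems use whatever length the file was defined with.

```python
def split_fixed_records(data: bytes, record_length: int) -> list:
    """Cut data into records of record_length bytes and drop the padding blanks."""
    if record_length <= 0:
        raise ValueError("record_length must be positive")
    records = []
    for start in range(0, len(data), record_length):
        record = data[start:start + record_length]
        records.append(record.decode("ascii").rstrip(" "))
    return records


# Three 10-byte records, padded with blanks, with no EOL characters at all.
padded = b"Hello     " + b"          " + b"World!    "
lines = split_fixed_records(padded, 10)
print(lines)   # ['Hello', '', 'World!']
```

The middle record is an entire blank line: it occupies ten bytes in the file but becomes an empty string once the padding is removed.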

A DOS file will display correctly on both Macintosh and Unix, except for an extra character that may or may not be displayed at the beginning or the end of each line. On Macintosh the extra character is the <lf>, which falls at the beginning of the next line. On Unix it is the <cr> at the end of the line, commonly displayed as "^M" or Control-M.
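One way to tell which convention a file uses is to count the EOL sequences in its raw bytes. The Python sketch below assumes the usual conventions (DOS ends lines with <cr><lf>, Macintosh with <cr>, Unix with <lf>) and simply reports the most frequent one; the function name is illustrative and the approach is a heuristic, not a guarantee.

```python
def detect_eol(data: bytes) -> str:
    """Guess the EOL convention of a text file from its raw bytes."""
    crlf = data.count(b"\r\n")
    cr = data.count(b"\r") - crlf   # <cr> not followed by <lf>
    lf = data.count(b"\n") - crlf   # <lf> not preceded by <cr>
    counts = {"DOS": crlf, "Macintosh": cr, "Unix": lf}
    best = max(counts, key=counts.get)
    return best if counts[best] > 0 else "none"


print(detect_eol(b"one\r\ntwo\r\n"))   # DOS
print(detect_eol(b"one\rtwo\r"))       # Macintosh
print(detect_eol(b"one\ntwo\n"))       # Unix
```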

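A Macintosh ASCII file carries extra information that the Macintosh operating system (OS) uses to tell the Finder which application made the file (e.g. SimpleText, BBEdit, TeachText). Depending on how the file is transferred, this information can show up as an extra "header" before the ASCII text begins.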



Conversions

There are many public domain programs that will convert between these formats. Many editors and word processors can also handle these EOL variations.

DOS

In DOS there is no way to convert without a special program. Neither the EDIT nor the EDLIN command lets you change or insert the EOL characters. A recommended shareware program is crlf.exe, found as crlf###.zip, where ### is the version number, at many Internet software archive sites (e.g. oak.oakland.edu).

Windows

In Windows, use the programs "Write", "Word Pad" or "Quick View" to display the text properly. These programs also print the text correctly, but they do not convert the file. To convert it, see the Unix section below on "sed", for which DOS versions are available.

Macintosh

On Macintosh, "Apple Exchange" is commonly used. There are many third party Macintosh products for this function, most of them marketed as being able to read and write DOS diskettes. These programs do the conversions invisibly to you.

Unix

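Unix comes with many commands and filters that will do the conversions in any direction. Here are some examples using "sed" and "tr". Because "sed" works one line at a time and a Macintosh file contains no <lf>, "tr" is used wherever a <cr> or <lf> itself must be replaced. The DOS to Unix example removes the last character of every line, so it assumes every line really ends in <cr>.

  For DOS to Unix: sed 's/.$//' infilename >outfilename
  For Mac to Unix: tr '\r' '\n' <infilename >outfilename
  For Unix to DOS: sed "s/$/$(printf '\r')/" infilename >outfilename
  For Unix to Mac: tr '\n' '\r' <infilename >outfilename
  For DOS to Mac:  tr -d '\n' <infilename >outfilename
  For Mac to DOS:  tr '\r' '\n' <infilename | sed "s/$/$(printf '\r')/" >outfilename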

Not all Unix versions of these commands behave the same, so a method may not work on your system; test it on a copy of the file first. Files can also be converted ahead of time, before they are transferred.
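Where the Unix commands are unavailable or behave differently, the same conversions can be sketched in a few lines of Python. This is a minimal example, not a replacement for a tested tool: it assumes the file fits in memory, and the names EOL and convert_eol are made up for illustration.

```python
EOL = {"dos": b"\r\n", "mac": b"\r", "unix": b"\n"}


def convert_eol(data: bytes, target: str) -> bytes:
    """Rewrite every line ending in data using the target platform's EOL."""
    # Normalize to <lf> first. Order matters: handle <cr><lf> before a lone <cr>.
    normalized = data.replace(b"\r\n", b"\n").replace(b"\r", b"\n")
    return normalized.replace(b"\n", EOL[target])


print(convert_eol(b"a\r\nb\r\n", "unix"))   # b'a\nb\n'
print(convert_eol(b"a\rb\r", "dos"))        # b'a\r\nb\r\n'
print(convert_eol(b"a\nb\n", "mac"))        # b'a\rb\r'
```

Normalizing to one convention first means the source platform does not have to be known in advance. To convert a real file, read it with open(name, "rb") and write the result with open(name, "wb") so that Python itself performs no EOL translation.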

Internet FTP

This section deals only with one FTP subcommand "binary." For more information see the share text file FTPBEST.STF for exact commands to enter to ease your use of FTP and ensure consistent and best results.

The "binary" or "integer" mode ensures the file is transferred without any conversion of its internal data. Because nothing is converted, the integrity of the file's data is assured and the file will perform as advertised.

Conversions are meant for ASCII text files only, and deal with the character set and the End-Of-Line method. Not all FTP software packages are the same. Different vendors and release levels offer different sets of commands, and even commands with the same spelling can work differently. It is best to stay with a limited set of commands that past experience has shown to work.

The "binary" command takes no arguments or other parameters (no other words on the same line). In the common command-line FTP clients it is not a toggle: "binary" always selects binary mode and the companion command "ascii" always selects text mode, so entering "binary" a second time does no harm. A client that does treat the command as a toggle would switch back to the opposite mode on the second use.

So, ALWAYS read the output of the binary command. Most FTP clients start in ASCII mode, but some default to binary. Binary mode is on when, after entering the command, you see a confirmation such as:

  FTP> binary
  Integer mode
  FTP>

The exact wording varies between clients; many report "Type set to I" instead. The "I" stands for "image", the FTP specification's name for binary mode, in which all data is treated as raw values that must not be changed. Text, on the other hand, might be converted to other character sets or EOL methods with no loss of information.

Since most FTP users transfer both text files and binary files, the preferred setting is always binary mode. It is recommended that you get in the habit of always using the binary command right after typing FTP or entering FTP mode.
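The same habit carries over to scripted transfers. The sketch below uses Python's standard ftplib module, whose retrbinary method selects binary (image) type before the download. The helper name fetch_binary, the host and the file names are placeholders, not part of any real site.

```python
from ftplib import FTP


def fetch_binary(ftp, remote_name, local_path):
    """Download remote_name in binary mode and return the number of bytes written."""
    written = 0
    with open(local_path, "wb") as out:
        def save(block):
            nonlocal written
            out.write(block)
            written += len(block)
        # retrbinary switches the connection to binary (image) type first.
        ftp.retrbinary("RETR " + remote_name, save)
    return written


# Hypothetical usage; the host and file names below are placeholders:
#   with FTP("ftp.example.org") as ftp:
#       ftp.login()   # anonymous login
#       fetch_binary(ftp, "crlf.zip", "crlf.zip")
```

Opening the local file with "wb" matters as much as the FTP mode: it keeps Python from translating EOL characters while writing.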

Having binary mode always on ensures that none of the internal data fields of the files you transfer are changed. Thus, all binary files will work as advertised and all text files will arrive intact; at most, the text files will need to be run through a simple conversion.

A binary file that has been converted cannot reliably be converted back, because it contains byte values that look like EOL characters but are not, and a reverse conversion cannot tell the two apart. That is also why converted binary files will not work as advertised: the conversion changes these look-alike bytes, and with them the "instructions" for the computer that are stored in the binary file.
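A small Python sketch illustrates the problem. It mimics a text-mode transfer that turns every <cr><lf> pair into <lf>, then tries to undo it. The six bytes of "binary" data are made up; in them the value 0x0A and the pair 0x0D 0x0A are ordinary data, not line endings.

```python
def text_mode_dos_to_unix(data: bytes) -> bytes:
    """Mimic a text-mode transfer from DOS to Unix: <cr><lf> becomes <lf>."""
    return data.replace(b"\r\n", b"\n")


def reverse_conversion(data: bytes) -> bytes:
    """Attempt to undo it: every <lf> becomes <cr><lf> again."""
    return data.replace(b"\n", b"\r\n")


# Made-up binary data containing a lone 0x0A and a 0x0D 0x0A pair.
original = bytes([0x01, 0x0A, 0x02, 0x0D, 0x0A, 0x03])

damaged = text_mode_dos_to_unix(original)
restored = reverse_conversion(damaged)

print(len(original), len(damaged), len(restored))   # 6 5 7
print(restored == original)                         # False
```

After the first conversion the lone 0x0A and the former 0x0D 0x0A pair look identical, so the reverse step cannot know which of them to expand.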

Good Luck and Cool cruisin'!




Caveat

The information provided is provided without warranty of any kind, either expressed or implied, including but not limited to the implied warranties of merchantability and fitness for a particular purpose.




HTML editing by Randy Clemens

All Trademarks are the respective property of their owners.