File formats and encoding

CWBillow · Jul 10, 2007 · #1 · 2007-07-10T00:43+00:00

What's the difference between ANSI and DOS encoding?

Regards,
Chuck Billow

Mofi · Jul 10, 2007 · #2 · 2007-07-10T11:32+00:00

If by DOS you mean the OEM character set, then the main difference is in the upper 128 characters of the code page. The lower 128 characters are identical in the ANSI and OEM character sets: they are the ASCII characters (ignoring the control codes).

If you write only in English, you will not see any difference in normal text.

Go to the Wikipedia page Code page, open the page about OEM code page 437, and open the page about ANSI code page 1252 in a second tab or window. Compare the characters and you will see the difference. The Wikipedia pages also explain code pages very well and are worth reading.

The OEM code page 437 is often used for creating small drawings with characters in a text file because it contains lots of "graphic" characters.
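The difference can also be checked with a few lines of code. The following is only a small sketch in Python, assuming its built-in `cp437` and `cp1252` codecs stand in for the OEM and ANSI code pages discussed here; the variable names are made up for illustration.

```python
# Decode the same bytes once as OEM (cp437) and once as ANSI (cp1252).

# The lower 128 byte values decode to the same characters in both code pages.
lower = bytes(range(128))
same_lower_half = lower.decode("cp437") == lower.decode("cp1252")  # True

# Four byte values from the upper half of the table.
upper = bytes([0xC9, 0xCD, 0xCD, 0xBB])

as_oem = upper.decode("cp437")    # '╔══╗'  box-drawing characters
as_ansi = upper.decode("cp1252")  # 'ÉÍÍ»'  accented letters and a quote mark
```

The bytes in the file are the same in both cases; only the table used to interpret them changes.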

CWBillow · Jul 10, 2007 · #3 · 2007-07-10T12:09+00:00

Mofi:

And nobody has thought to put all these in one code page, why? It would be nice, clean and certainly useful. I guess that answers why not, huh?

Thanks for the help.

Regards,
Chuck Billow

Mofi · Jul 10, 2007 · #4 · 2007-07-10T18:38+00:00

CWBillow wrote: And nobody has thought to put all these in one code page, why?

Because a single-byte code page can only contain 256 characters. One byte has 8 bits, so you can encode only 2^8 = 256 characters with 1 byte. That's the reason why different code pages exist.

As the Wikipedia page I linked to also mentions, there is now the Unicode system, which encodes all characters for all languages and even graphic, mathematical and symbol characters.

But most text files are still single-byte encoded files and not Unicode files. Also, most fonts contain only certain code pages, or even only parts of a code page, and not the full Unicode character table. So there is always a problem when a conversion must be done from a single-byte text file (1 byte per character) to a Unicode file (UTF-16, 2 bytes per character for most characters) or vice versa.
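To illustrate the conversion problem, here is a minimal sketch in Python, again assuming the built-in `cp437` and `cp1252` codecs. It takes one character that exists in the OEM code page but not in the ANSI code page and tries to store it in both; the variable names are hypothetical.

```python
# GREEK CAPITAL LETTER THETA exists in cp437 (OEM) but not in cp1252 (ANSI).
text = "\u0398"

oem_bytes = text.encode("cp437")  # b'\xe9', one byte in the OEM code page

# A strict conversion to the ANSI code page fails for this character.
try:
    text.encode("cp1252")
    fits_ansi = True
except UnicodeEncodeError:
    fits_ansi = False

# A lossy conversion substitutes a question mark instead.
lossy = text.encode("cp1252", errors="replace")  # b'?'
```

Going the other way is also ambiguous: the byte E9 read back as cp1252 becomes é, not Θ, so the code page of the source file has to be known.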

UTF-8 and ASCII escaped Unicode files are a mixture of text and Unicode files: they can use the full range of Unicode characters, but still encode the most frequently needed ASCII characters with just a single byte per character. Only the non-ASCII characters are encoded with special multi-byte or escape sequences. That reduces the file size a lot, which is why UTF-8 is used heavily for web pages: it supports all characters, but for many (especially European) languages the HTML files are only a little larger than when encoded in ANSI.
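The size difference is easy to measure. The sketch below is in Python and assumes `cp1252` for ANSI and `utf-16-le` (without a byte order mark) for a 2-byte Unicode file; `encoded_sizes` is a helper name invented for this example.

```python
def encoded_sizes(text):
    """Return the number of bytes the text needs in each encoding."""
    return {
        "cp1252": len(text.encode("cp1252")),
        "utf-8": len(text.encode("utf-8")),
        "utf-16-le": len(text.encode("utf-16-le")),
    }

# Pure ASCII text: UTF-8 is exactly as small as ANSI.
english = encoded_sizes("plain text")
# {'cp1252': 10, 'utf-8': 10, 'utf-16-le': 20}

# "Grüße" written with escapes: two of the five characters are not ASCII.
german = encoded_sizes("Gr\u00fc\u00dfe")
# {'cp1252': 5, 'utf-8': 7, 'utf-16-le': 10}
```

Only the two non-ASCII letters cost an extra byte each in UTF-8, while UTF-16 doubles the size of every character in these samples.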

CWBillow · Jul 10, 2007 · #5 · 2007-07-10T21:31+00:00

Now I AM worried: It's BEGINNING to make sense!

Thanks,
Chuck Billow

Peter · Jul 11, 2007 · #6 · 2007-07-11T19:15+00:00

Hello

I have a problem with two kinds of files, both created by other software. Both have blank "text content", but one is opened by UE without comment (maybe a DOS file?), while on opening the other file UE always asks "Convert to DOS?"

Where is the difference? The code page behind the files?

Best regards

Peter

Mofi · Jul 11, 2007 · #7 · 2007-07-11T19:54+00:00

No, this message has nothing to do with the character set in a code page.

This message is shown when the file does not use the DOS (Windows) line termination, carriage return (hex 0D) + line feed (hex 0A), which are written in many programming languages as \r and \n.

Your "not DOS file" is a Unix file, because it uses only the line feed (= \n) as line termination, which is the standard on Unix/Linux operating systems. On older Mac systems, only the carriage return (= \r) is used as line termination.

The handling of a file with non-DOS line terminations can be configured at Configuration - File Handling - DOS/UNIX/MAC Handling. Read the help page for this dialog for further details.
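To check which line termination a file uses outside of the editor, the bytes can simply be counted. This is a rough sketch in Python, not what UltraEdit does internally; `detect_line_ending` is a hypothetical helper and expects the raw file content, for example from `open(path, "rb").read()`.

```python
def detect_line_ending(data: bytes) -> str:
    """Classify raw file bytes as DOS, UNIX, MAC, mixed or none."""
    crlf = data.count(b"\r\n")         # DOS/Windows: 0D 0A
    lf = data.count(b"\n") - crlf      # Unix: 0A not preceded by 0D
    cr = data.count(b"\r") - crlf      # old Mac: 0D not followed by 0A

    counts = {"DOS": crlf, "UNIX": lf, "MAC": cr}
    kinds = [name for name, n in counts.items() if n > 0]
    if not kinds:
        return "none"                  # no line termination at all
    return kinds[0] if len(kinds) == 1 else "mixed"


dos_kind = detect_line_ending(b"line 1\r\nline 2\r\n")   # 'DOS'
unix_kind = detect_line_ending(b"line 1\nline 2\n")      # 'UNIX'
mac_kind = detect_line_ending(b"line 1\rline 2\r")       # 'MAC'
```

A file reported as UNIX or MAC here is the kind that would trigger the "Convert to DOS?" question.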

Peter · Jul 12, 2007 · #8 · 2007-07-12T00:09+00:00

Thanks Mofi.

Peter