ANSI/ASCII conversion not working or what does UTF-8 to ASCII?

Newbie

    Dec 19, 2013 #1

    Hi,

    UltraEdit version 16.00.0.1025.

    I found that a text file of mine was UTF-8 encoded, so I converted it with UTF-8 to ASCII (as the menu calls it), but that converted it to ANSI (Windows CP1252). Fine, so then I tried ANSI to OEM (which should be ASCII), and it doesn't convert correctly (it seems to be some kind of 7-bit conversion). So do your latest versions correctly support converting ANSI to ASCII, and do they use the correct terms for ANSI and ASCII?

    For example 0xE9 in CP1252 (ANSI) should be 0x82 in ASCII (CP437).

    TIA!!

    Grand Master

      Dec 20, 2013 #2

      Well, the term ASCII (short for American Standard Code for Information Interchange) is not really correct in any of the conversion menu items.

      Strictly, UTF-8 to ASCII should be named "UTF-8 to 8-bit encoding according to the code page currently set for the file", but that would be a little bit long.

      UTF-8 to ANSI would also not be correct, as not all code pages are standardized by the American National Standards Institute. Some code pages are standardized by other organizations or companies.

      The ASCII table contains only 128 characters with code values 0 to 127 decimal, as you can read on ASCII Table and Description and in the Wikipedia article about ASCII. There is also an extended ASCII table, but neither code page 437 (North America) nor code page 850 (Western Europe) is identical to that extended ASCII table.
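
      As a small illustration of that difference, here is a minimal Python sketch. Python is only my choice for the illustration and has nothing to do with UltraEdit; the codec names `ascii`, `cp437` and `cp850` are those of Python's standard library, and the helper `is_ascii` is made up for this example. It shows that plain ASCII stops at code value 127 and that the two OEM code pages disagree about the same byte in the upper half:

```python
# Plain ASCII defines only the code values 0 to 127.
def is_ascii(text):
    """Return True if every character of text fits into 7-bit ASCII."""
    try:
        text.encode("ascii")
        return True
    except UnicodeEncodeError:
        return False

# The same byte above 127 maps to a different character per code page.
byte = bytes([0xE9])
as_cp437 = byte.decode("cp437")   # Greek capital theta in code page 437
as_cp850 = byte.decode("cp850")   # capital U with acute in code page 850

print(is_ascii("e"), is_ascii("\u00e9"))            # True False
print(hex(ord(as_cp437)), hex(ord(as_cp850)))       # 0x398 0xda
```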

      I think the vast majority of computer users writing text do not know anything about the various standards for encoding text. Only a minority of computer users know about the main difference between Unicode and ASCII/ANSI encoded characters: multi-byte versus single-byte encoding.

      ASCII and ANSI are commonly used as synonyms for characters encoded with 8 bits, and Unicode is usually understood as an encoding for a wide range of characters.


      Okay, after this small lesson about various text encoding standards, back to the command UTF-8 to ASCII in UltraEdit.

      This command converts a file encoded in UTF-8 (which is converted to UTF-16 Little Endian in memory when the file is opened) to the code page currently set for the active file. The code page is set via View - Set Code Page (since UE v12.10) or via the encoding item in the standard (non-basic) status bar in UE v19.00 or later. Therefore it is possible to convert a UTF-8 encoded file directly to an OEM code page by first selecting 437 (OEM - United States) or 850 (OEM - Multilingual Latin I) in the Code Page Selection dialog, or by selecting the same item in group OEM via the status bar.

      The same is true for all other UltraEdit file conversion commands containing the term ASCII. ASCII always means: the code page currently selected for the file to convert.
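
      In other words, the conversion behaves roughly like the following Python sketch. This is only my illustration of the idea and not UltraEdit's code: the function name `utf8_to_code_page` is made up, and the code page names are those of Python's standard codecs.

```python
def utf8_to_code_page(data, code_page):
    """Decode UTF-8 bytes and re-encode them in the given 8-bit code page."""
    text = data.decode("utf-8")      # UTF-8 -> Unicode text in memory
    return text.encode(code_page)    # Unicode -> selected code page

utf8_bytes = "\u00e9".encode("utf-8")    # e with acute, two bytes in UTF-8

print(utf8_bytes.hex())                                # c3a9
print(utf8_to_code_page(utf8_bytes, "cp1252").hex())   # e9
print(utf8_to_code_page(utf8_bytes, "cp437").hex())    # 82
print(utf8_to_code_page(utf8_bytes, "cp850").hex())    # 82
```

      So the result for the same character depends only on which code page is selected before running the command.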


      Hint for users with UE 19.00 or later:
      The standard status bar of UE v19.00 or later supports selecting the code page for the active file via a list item which opens a menu with the sublists Default, Unicode, MAC, IBM, ANSI, ISO, OEM and Others, each containing the appropriate items. The list item changes the code page, i.e. the text encoding, for the active file, but not the display font. So the user also has to select a font and/or font script supporting the selected text encoding/code page to get the text displayed correctly.

      The default code page for 8-bit encoded files is set via Advanced - Set Code Page/Locale and is usually the code page defined by the operating system, which on Windows means the code page set for non-Unicode programs in the regional and language settings of Windows.

      UltraEdit since v12.10 supports automatic detection of code pages other than the default system code page for 8-bit encoded text files according to the charset (HTML, XHTML) or encoding (XML) declaration at the top of HTML, XHTML and XML files.


      I don't know for sure, but I think the commands ANSI to OEM and OEM to ANSI just run CString::AnsiToOem and CString::OemToAnsi, respectively, on the entire file. At least for North American and Western European countries (and perhaps also other countries) that means a conversion between code page 1252 (ANSI) and code page 437 or 850 (OEM). I have been using those commands often for several years in various UltraEdit versions since v8.00 and have never seen a mistake in the conversion.

      Please note that there are characters in code page 1252 which are not available in code page 437 or 850 and vice versa. Those characters can therefore not be converted to the other code page in a way that displays them correctly after the conversion.
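
      Which characters are affected can be listed with a short Python sketch like the one below. It is only an illustration based on Python's own `cp1252` and `cp437` codec tables, and the helper name `missing_in` is made up. The Windows conversion functions may substitute a similar looking character instead of failing, so the real result in UltraEdit can differ for such characters.

```python
def missing_in(source_cp, target_cp):
    """Characters of source_cp (bytes 0x80-0xFF) that target_cp cannot encode."""
    missing = []
    for value in range(0x80, 0x100):
        try:
            char = bytes([value]).decode(source_cp)
        except UnicodeDecodeError:
            continue                  # byte value not assigned in source_cp
        try:
            char.encode(target_cp)
        except UnicodeEncodeError:
            missing.append(char)      # no code value in the target code page
    return missing

ansi_only = missing_in("cp1252", "cp437")
oem_only = missing_in("cp437", "cp1252")

print("\u20ac" in ansi_only)   # True: euro sign exists only in code page 1252
print("\u2554" in oem_only)    # True: box drawing corner exists only in code page 437
print("\u00e9" in ansi_only)   # False: e with acute exists in both code pages
```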

      And of course, after making the conversion, the font for display must be changed via View - Set Font (or just the Script of the font, if the used font also supports OEM code pages) to get the characters displayed correctly again.

      Special hint: By using the OEM Character Set command, added manually to a customized toolbar or menu as explained at Manual customization of OEM Character Set command, it is possible to quickly switch the display of characters between ANSI and OEM without changing the display font setting. Non-ASCII characters entered by keyboard are then also added to the file correctly according to the OEM Character Set state.
      Best regards from an UC/UE/UES for Windows user from Austria

      Newbie

        Dec 21, 2013 #3

        Hi,

        Thanks...

        Typically when someone talks about the ANSI character set, it's the one Windows made popular, essentially CP1252. The old DOS character set was ASCII (the full 8-bit version), or what has become CP437. The only 7-bit ASCII was from the old teletype days; since the IBM PC came out, everyone recalls the full 256 characters. There is also CP850, which some may find useful.

        It would be nice to just have the option to convert to a given CPxxx character set; it just needs some translation tables to xlat the values. For me it's useful because I need to convert the ANSI or UTF-8 version (depending on where it came from) to ASCII. The version I used didn't work on that file. It converted E9 to the letter T, if I recall, instead of 82. I converted the file another way, but that's more of a hassle than using UltraEdit.
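
        Such a table-based conversion could be sketched in Python roughly as follows. This is only an illustration under my own assumptions: the helper name `build_table` and the `?` fallback byte are made up, and the tables come from Python's `cp1252` and `cp437` codecs rather than from UltraEdit or Windows.

```python
def build_table(source_cp, target_cp, fallback=b"?"):
    """Build a 256-byte lookup table mapping source_cp bytes to target_cp bytes."""
    table = bytearray()
    for value in range(256):
        try:
            char = bytes([value]).decode(source_cp)
            table += char.encode(target_cp)
        except UnicodeError:
            table += fallback         # no equivalent in the other code page
    return bytes(table)

ANSI_TO_OEM = build_table("cp1252", "cp437")

ansi_bytes = b"caf\xe9"               # "cafe" with e acute, stored as CP1252
oem_bytes = ansi_bytes.translate(ANSI_TO_OEM)
print(oem_bytes.hex())                # 63616682
```

        The table is built once and then applied byte by byte, so E9 ends up as 82 and anything without a CP437 equivalent becomes the fallback byte.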

        Grand Master

          Dec 21, 2013 #4

          For testing the conversion commands I created a UTF-8 encoded file containing the text: Character with code value 0xE9 is: é

          I restored UE v16.00.0.1025 from my archives and opened the UTF-8 file in this version of UltraEdit.

          Then I used File - Conversions - UTF-8 to ASCII and could see the same string. Character é still has the hexadecimal code value 0xE9, as Search - Character Properties reports with the caret blinking on this character.

          Next I used File - Conversions - ANSI to OEM and now saw the string as: Character with code value 0xE9 is: ‚

          Setting the caret on character ‚ and executing Search - Character Properties showed me the hexadecimal value 0x82.

          Therefore I activated OEM Character Set and saw the same string as: Character with code value 0xE9 is: é

          I disabled OEM Character Set and again saw ‚ at the end of the text. So I opened View - Set Font and enabled Use OEM Fix Pitch Font. On closing the dialog with button OK, this resulted in the font Terminal being selected instead of Courier New, and é was again displayed at the end of the text.

          While still using the OEM/DOS font Terminal, I executed File - Revert to Saved, and the file was reloaded as a UTF-8 encoded file after discarding all the modifications made to the file by the 2 conversion commands.

          With the wrong font Terminal I now saw the UTF-8 encoded text as: Character with code value 0xE9 is: Ú

          This time I first selected 437 (OEM - United States) in View - Set Code Page and then executed File - Conversions - UTF-8 to ASCII.

          Now I directly got displayed with font Terminal: Character with code value 0xE9 is: é

          And Search - Character Properties executed with caret set on character é resulted in a dialog showing hexadecimal value 0x82 for this character.

          So everything is working correctly in UltraEdit v16.00.0.1025, just as in the currently latest v20.00.0.1056.
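
          For readers without UltraEdit at hand, the byte values of this test can be reproduced with a minimal Python sketch. It only mimics the steps with Python's `cp1252`, `cp437` and `cp850` codecs (assuming code page 1252 as the ANSI code page) and says nothing about how UltraEdit implements the commands.

```python
text = "Character with code value 0xE9 is: \u00e9"

utf8_bytes = text.encode("utf-8")                          # file as saved in UTF-8
ansi_bytes = utf8_bytes.decode("utf-8").encode("cp1252")   # like UTF-8 to ASCII (ANSI)
oem_bytes = ansi_bytes.decode("cp1252").encode("cp437")    # like ANSI to OEM

print(hex(ansi_bytes[-1]))   # 0xe9
print(hex(oem_bytes[-1]))    # 0x82

# The OEM byte 0x82 shown with an ANSI font is a low quotation mark (U+201A).
shown_as_ansi = oem_bytes.decode("cp1252")[-1]
# The OEM byte 0x82 shown with an OEM font is e with acute again.
shown_as_oem = oem_bytes.decode("cp437")[-1]
print(hex(ord(shown_as_ansi)), hex(ord(shown_as_oem)))     # 0x201a 0xe9

# Code value 0xE9 shown with a code page 850 font is capital U with acute.
wrong_font = ansi_bytes[-1:].decode("cp850")
print(hex(ord(wrong_font)))                                # 0xda
```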

          Newbie

            Dec 22, 2013 #5

            Hmm, I had used UltraEdit for these conversions for a while and thought they always worked until that problem showed up. I went back and tried again on the latest file and it did work.

            It started with someone editing the file. Trying to convert it yielded double characters in place of the special characters, which is when I realized they had used an editor that saves UTF-8. So I reopened it and did the UTF-8 to ANSI conversion, saved it, then did ANSI to OEM, and that's where it wasn't giving me the correct result, even when I tried again. Since then the file has been edited as ANSI and now it seems to work. Not sure what that could have been. Anyway, thanks.
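
            The double characters are what one would expect when UTF-8 bytes are read as ANSI. A minimal Python sketch of that effect, as my own illustration and assuming code page 1252 as the ANSI code page:

```python
original = "caf\u00e9"                       # "cafe" with e acute at the end

utf8_bytes = original.encode("utf-8")        # what the UTF-8 editor saved
misread = utf8_bytes.decode("cp1252")        # the same bytes read as ANSI

print(len(original), len(misread))           # 4 5
print([hex(ord(c)) for c in misread[-2:]])   # ['0xc3', '0xa9']

# Decoding with the encoding that was really used recovers the text.
repaired = misread.encode("cp1252").decode("utf-8")
print(repaired == original)                  # True
```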