Encoding of special characters - ANSI/Unicode to ASCII conversion

    Apr 22, 2010 #1

    UltraEdit Ver 5 provided a viable workaround for a problem I was never able to understand or fix. The workaround, however, no longer works in my new Ver 16. I know the problem is not with UE, but, after a zillion hours of floundering, I think the community of UE users is the group most likely to be able to suggest a solution to the problem, or at least how to get the UE workaround working again.

    So: I compose email in MS Word in Windows, then cut and paste it into the AOL Write text box. Although it all looks right on my PC, many special characters, like the apostrophe in contractions, are garbled when the recipient gets it. AOL spell-checking fixes these characters one word at a time, but the workaround took only a few clicks: paste first into UE, then cut and paste from UE into AOL. The apostrophe, for instance, was changed from byte 0x92 to 0x27 going through UE Ver 5.

    There must be a general concept I am missing here since I have had this problem over the lifetime of many versions of Windows, Word, and AOL.

    Thanks in advance for whatever insight and help you can provide.

    Regards,
    George A. McClain

      Apr 23, 2010 #2

      Text written in Word is encoded by default using the codepage defined in the Windows regional and language options. For Western Europe, the US and Canada this is Windows-1252 (ANSI - Latin I), which is very similar to the standardized codepage ISO 8859-1 (Latin 1). This is an ANSI codepage, which means it uses 1 byte per character, so the codepage contains just 256 different characters.
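To make "very similar" concrete, here is a minimal Python sketch (Python is my choice for illustration, it is not part of UltraEdit or Word). It decodes the same bytes with Python's `cp1252` and `latin-1` codecs. The sample word is made up; the byte 0x92 is the one from the original question.

```python
# The same byte string, interpreted with two "very similar" codepages.
raw = b"don\x92t"  # hypothetical sample; 0x92 is the byte from the question

as_cp1252 = raw.decode("cp1252")   # Windows-1252: 0x92 is a curly apostrophe
as_latin1 = raw.decode("latin-1")  # ISO 8859-1: 0x92 is a C1 control character

# ascii() escapes non-ASCII characters, so the output is safe on any console.
print(ascii(as_cp1252))
print(ascii(as_latin1))
```

In Windows-1252 the byte maps to U+2019 (right single quotation mark), while ISO 8859-1 has no printable character at that position. That is one plausible way an apostrophe can look right on one system and garbled on another.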

      The first 32 characters are control characters. The next 96 characters are the standard characters (letters, digits, punctuation). Together these 128 characters are known as the ASCII characters. They fit in 7 bits and are the same in most codepages. The upper 128 characters of a 256-character codepage differ depending on which codepage is selected.
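The split between the shared lower half and the codepage-specific upper half can be checked with a short Python sketch. The codepages compared here (`cp1252`, `latin-1`, `cp1251`) are just examples I picked; any single-byte codepage based on ASCII should behave the same way for the lower half.

```python
# Bytes 0..127 decode to the same characters under all of these codepages.
low_half = bytes(range(128))
codepages = ["ascii", "cp1252", "latin-1", "cp1251"]
decoded_low = {cp: low_half.decode(cp) for cp in codepages}
low_half_identical = len(set(decoded_low.values())) == 1

# A byte from the upper half means something different per codepage.
upper_byte = bytes([0xE9])
as_western = upper_byte.decode("cp1252")   # U+00E9, Latin small e with acute
as_cyrillic = upper_byte.decode("cp1251")  # U+0439, Cyrillic small short i

print(low_half_identical)
print(ascii(as_western), ascii(as_cyrillic))
```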

      When you edit in Word and insert a character not available in the standard codepage, Word inserts this character as a Unicode character, making the surrounding text block completely Unicode. So a Word file can contain a mixture of ANSI and Unicode encoded text, which makes text searches with Windows Explorer in binary Word files (*.doc) a little bit tricky and unpredictable.

      When you copy text written in Word to the clipboard, Word copies it either as ANSI or as Unicode text, depending on which characters the text includes. If the text contains any Unicode character, the entire text is copied as Unicode text; otherwise ANSI is used.
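As a toy model of the rule just described (not Word's actual logic, which I have not seen), the decision can be sketched in Python as "does every character fit into the ANSI codepage?". The function name `clipboard_format` and the default codepage are assumptions made for illustration.

```python
def clipboard_format(text: str, ansi_codepage: str = "cp1252") -> str:
    """Return "ansi" if every character exists in the codepage, else "unicode".

    Hypothetical helper modelling the rule from the post above.
    """
    try:
        text.encode(ansi_codepage)
    except UnicodeEncodeError:
        return "unicode"
    return "ansi"


print(clipboard_format("don\u2019t"))     # curly apostrophe exists in cp1252
print(clipboard_format("\u03a9 = ohm"))   # Greek capital omega does not
```

Note that under this model a curly apostrophe alone does not force Unicode, because Windows-1252 has it at 0x92, which matches the byte value reported in the question.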

      New files in UltraEdit are either ANSI files or Unicode files depending on the encoding configuration setting for new files. By default this is ANSI, also using the codepage defined in the Windows regional and language settings. So when text is pasted from Word, it is either inserted into the new file 1:1 or converted from Unicode to the current ANSI codepage using the relevant Windows kernel function.
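The Unicode-to-ANSI step can be approximated in Python with `str.encode`. This is only a sketch: Python's `errors="replace"` substitutes `?` for characters missing from the codepage, and the Windows conversion may pick a different substitute. The sample text is made up.

```python
# Unicode text with one character that cp1252 has and one that it lacks.
text = "don\u2019t \u03a9"  # curly apostrophe, then Greek capital omega

# Convert to the ANSI codepage; unmappable characters become "?".
ansi_bytes = text.encode("cp1252", errors="replace")

print(ansi_bytes)
```

The curly apostrophe survives as byte 0x92 because it is part of Windows-1252. A conversion to the ANSI codepage therefore does not turn it into the plain apostrophe 0x27.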

      What you want is a conversion from Unicode or ANSI to 7-bit ASCII, so that only the first 128 characters are used instead of the 256 ANSI characters or the 65,536 characters of 16-bit Unicode. As far as I know this can be done only with a macro or script. If you want such a script or macro, please reply and tell us which one you want (script or macro). It is important for us to know your standard codepage. You can see it by starting UltraEdit, opening a new file and looking at View - Set Code Page.
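To show what such a conversion involves, here is a minimal standalone Python sketch. It is not an UltraEdit script or macro, and the replacement table and the function name `to_ascii` are my own assumptions; extend the table for whatever characters Word inserts in your texts.

```python
import unicodedata

# Hypothetical replacement table for typical Word "smart" characters.
REPLACEMENTS = {
    "\u2018": "'",    # left single quotation mark
    "\u2019": "'",    # right single quotation mark / curly apostrophe
    "\u201c": '"',    # left double quotation mark
    "\u201d": '"',    # right double quotation mark
    "\u2013": "-",    # en dash
    "\u2014": "-",    # em dash
    "\u2026": "...",  # horizontal ellipsis
    "\u00a0": " ",    # no-break space
}


def to_ascii(text: str) -> str:
    """Reduce text to 7-bit ASCII; anything left over becomes '?'."""
    # 1. Apply the explicit replacements.
    text = text.translate(str.maketrans(REPLACEMENTS))
    # 2. Split accented letters into base letter + combining mark,
    #    then drop the combining marks.
    decomposed = unicodedata.normalize("NFKD", text)
    stripped = "".join(c for c in decomposed if not unicodedata.combining(c))
    # 3. Replace whatever is still outside ASCII with '?'.
    return stripped.encode("ascii", errors="replace").decode("ascii")


print(to_ascii("don\u2019t"))
```

The explicit table comes first so that the result for quotes and dashes does not depend on Unicode decomposition rules. Stripping accents is a lossy choice; drop step 2 if you would rather see `?` than an unaccented letter.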

      By the way: why not write the text for AOL in a text editor instead of MS Word? That would avoid inserting ‘ and ’ when you press the key for the character '. Of course, you can also disable the AutoFormat option in MS Word that converts straight single and double quotes into left and right single and double quotes.
      Best regards from an UC/UE/UES for Windows user from Austria

        Apr 23, 2010 #3

        Thank you for the clear and detailed explanation. It is exactly what I was hoping to get in the UE Forum. My standard code page is 1252.

        Turning off Word AutoFormatting seems like the right choice this morning. I will have to see if that means giving up some function I really want, but I have frequently found things like auto-list and auto-bullet more annoying than helpful. I may find I am glad to have it gone.

        I use Word for correspondence simply because that is the way I have always done it since Word replaced Dos Script and later Email replaced hardcopy in an envelope. However, I will look into switching, especially if that would provide good search capability in my old letters.

        Thanks for the offer regarding a special script or macro, but I do not think that will be necessary. And thanks again for the quick and very helpful response.

        Regards,
        George A. McClain

          Apr 23, 2010 #4

          Since newer versions of UE can detect and manage encoded characters, it no longer swaps them out with standard replacements from the standard ASCII character table. As Mofi said, your best option is to write the message in a text editor in the first place instead of using Word.

          Just an FYI... I'm not sure if this is a bug or a "feature"... but when I select the View -> Reset Fonts menu option, the single and double open and close quotation marks display as the "garbled" characters that you say your recipients see. If you really must clean up an encoded text string, you could try this technique to visually identify the characters you need to clean up before you cut and paste into AOL.
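If spotting the characters by eye is too error-prone, the same check can be done programmatically. The following is a minimal Python sketch, independent of UltraEdit; the function name `find_non_ascii` is hypothetical. It lists every character outside 7-bit ASCII with its position and code point.

```python
def find_non_ascii(text: str) -> list:
    """Return (index, character, code point) for each non-ASCII character."""
    return [
        (index, char, f"U+{ord(char):04X}")
        for index, char in enumerate(text)
        if ord(char) > 127
    ]


# Hypothetical sample: a contraction typed in Word with a curly apostrophe.
for index, char, code_point in find_non_ascii("It\u2019s ok"):
    print(index, code_point)
```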