How to convert UTF-8 encoded text in clipboard to Windows-1252 on paste?

fredtheman · Dec 16, 2013 · #1 · 2013-12-16T11:51+00:00

Hello

As the venerable Eudora email client doesn't support UTF-8, I need a solution to easily convert UTF-8-encoded emails to Windows-1252.

However, I couldn't get UltraEdit to convert successfully through the File > Conversions option: Either UTF-8 to ASCII is disabled or nothing happens when selected.

Does anyone know why?

Thank you.

Here's some actual text to play with:

Code:

Content-Type: text/plain; charset=UTF-8

Ã©tÃ© prÃ©venu de l'Ã©chec du paiement. Nous allons Ã  nouveau tenter d'effectuer un paiement le Dec 19, 2013. Pour rÃ©soudre ce problÃ¨me, vous devez dÃ©finir une carte comme source d'approvisionnement. Veuillez modifier la source d'approvisionnement pour cet abonnement. Pour ce faire, cliquez sur ce lien et suivez les Ã©tapes ci-aprÃ¨s : 

d'abonnement, choisissez une carte comme source d'approvisionnement dans l'une des listes dÃ©roulantes. Si aucune autre source d'approvisionnement n'est disponible, ajoutez une carte bancaire. 3.  Cliquez sur le sous-onglet PrÃ©fÃ©rences. 4.  Choisissez le lien Cartes bancaires dans la colonne Informations financiÃ¨res. 5.  Cliquez sur Ajouter. 6.  Suivez les instructions affichÃ©es Ã  l'Ã©cran pour ajouter une nouvelle carte bancaire Ã  votre compte

Mofi · Dec 16, 2013 · #2 · 2013-12-16T15:09+00:00

I have explained a lot about UTF-8 editing in UltraEdit in the topics "UTF-8 not recognized, largish file" and "Using UTF-8 with UltraEdit".

Your screenshot does not contain the status bar at the bottom of the UltraEdit main window. Most likely you see neither UTF-8 nor U8- there, which means UltraEdit has not detected the file as being UTF-8 encoded. However, this is no problem. Use File - Open and manually select the encoding/format UTF-8 before opening the file with the button Open. Now you can convert the file to ASCII/ANSI.

I can only hope that the file is completely encoded in UTF-8 and not just some parts of it. Mailbox storage files often contain email text in various encodings, depending on what the sender of each email has configured.

fredtheman · Dec 16, 2013 · #3 · 2013-12-16T16:12+00:00

Thanks. Indeed, the status bar says "1252 - (ANSI - Latin1)".

However, saving an email into a file before opening it in an editor just to get the accents right is too cumbersome.

Ideally, I was looking at a utility where I could simply paste some UTF-8 and have it converted to Windows-1252 without first going through a file.

Mofi · Dec 16, 2013 · #4 · 2013-12-16T19:46+00:00

I don't really understand your request for a utility which converts Unicode text to ANSI on paste. That is already done by the Windows clipboard.

Well, if you copy a UTF-8 encoded byte stream like the one you posted to Windows clipboard, the clipboard interprets it as plain ANSI text, and not as Unicode text.
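
To illustrate the effect, here is a minimal sketch in plain JavaScript. It is not an UltraEdit function; the helper name utf8BytesAsSingleByteText is made up for this example. It encodes a string as UTF-8 and returns one character per byte, which is roughly what you see when the byte stream is interpreted as single-byte text. It assumes characters from the Basic Multilingual Plane only (no surrogate pairs) and uses the byte value directly as character code, which matches code page 1252 except for the bytes 0x80 to 0x9F.

```javascript
// Hypothetical helper for illustration only: returns the UTF-8 bytes of
// sText as a string with one character per byte (character codes 0..255).
// Assumption: sText contains no surrogate pairs (BMP characters only).
function utf8BytesAsSingleByteText(sText) {
   var sOut = "";
   for (var i = 0; i < sText.length; i++) {
      var nCode = sText.charCodeAt(i);
      if (nCode < 0x80) {
         // ASCII: one byte.
         sOut += String.fromCharCode(nCode);
      } else if (nCode < 0x800) {
         // Two bytes: 110xxxxx 10xxxxxx
         sOut += String.fromCharCode(0xC0 | (nCode >> 6),
                                     0x80 | (nCode & 0x3F));
      } else {
         // Three bytes: 1110xxxx 10xxxxxx 10xxxxxx
         sOut += String.fromCharCode(0xE0 | (nCode >> 12),
                                     0x80 | ((nCode >> 6) & 0x3F),
                                     0x80 | (nCode & 0x3F));
      }
   }
   return sOut;
}

// "été" becomes "Ã©tÃ©", the same pattern as in the sample text above.
var sMisread = utf8BytesAsSingleByteText("\u00E9t\u00E9");
```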

It is possible to code an UltraEdit script which takes the content of the Windows clipboard, treats it as a UTF-8 byte stream, converts it from UTF-8 to code page 1252 and writes the converted text back to the Windows clipboard.

Here is the code for this UltraEdit script:

Code:

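function decode_utf8(sUtfText)
{
   var sPlainText = "";
   var nCharPos   = 0;
   var nCodeByte1 = 0;
   var nCodeByte2 = 0;
   var nCodeByte3 = 0;
   while (nCharPos < sUtfText.length)
   {
      nCodeByte1 = sUtfText.charCodeAt(nCharPos);
      if (nCodeByte1 < 128)
      {
         // Single byte (ASCII) character: copy unchanged.
         sPlainText += String.fromCharCode(nCodeByte1);
      }
      else if ((nCodeByte1 > 191) && (nCodeByte1 < 224))
      {
         // Lead byte of a two-byte sequence (110xxxxx 10xxxxxx).
         nCharPos++;
         if (nCharPos < sUtfText.length)
         {
            nCodeByte2 = sUtfText.charCodeAt(nCharPos);
            sPlainText += String.fromCharCode(((nCodeByte1 & 31) << 6) | (nCodeByte2 & 63));
         }
      }
      else if ((nCodeByte1 > 223) && (nCodeByte1 < 240))
      {
         // Lead byte of a three-byte sequence (1110xxxx 10xxxxxx 10xxxxxx).
         nCharPos += 2;
         if (nCharPos < sUtfText.length)
         {
            nCodeByte2 = sUtfText.charCodeAt(nCharPos-1);
            nCodeByte3 = sUtfText.charCodeAt(nCharPos);
            sPlainText += String.fromCharCode(((nCodeByte1 & 15) << 12) | ((nCodeByte2 & 63) << 6) | (nCodeByte3 & 63));
         }
      }
      else
      {
         // Stray continuation byte or any other value which does not start
         // a one- to three-byte sequence: copy unchanged instead of
         // misreading it as a three-byte sequence.
         sPlainText += String.fromCharCode(nCodeByte1);
      }
      nCharPos++;
   }
   return sPlainText;
}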

UltraEdit.clipboardContent = decode_utf8(UltraEdit.clipboardContent);

// This additional line pastes the just converted text into the active file.
if (UltraEdit.document.length > 0) UltraEdit.activeDocument.paste();

The function decode_utf8 is a modified version of the one from "Javascript UTF-8". It is still not perfect, but it should be enough for your requirement of decoding UTF-8 text in the clipboard to code page 1252.
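
One possible gap, described here only as a hedged sketch: code page 1252 maps the bytes 0x80 to 0x9F to characters such as € (U+20AC) or ƒ (U+0192), whose Unicode values are above 255. If the clipboard string contains those characters instead of the raw byte values (an assumption, I have not verified how UltraEdit.clipboardContent behaves here), a sequence like "Ã€" (the UTF-8 bytes C3 80 of "À") would not decode correctly. The hypothetical helper cp1252TextToBytes below maps such characters back to their byte values first.

```javascript
// Unicode code points for the Windows-1252 bytes 0x80..0x9F.
// The five bytes undefined in code page 1252 (0x81, 0x8D, 0x8F, 0x90, 0x9D)
// are kept as their own values.
var CP1252_HIGH = [
   0x20AC, 0x0081, 0x201A, 0x0192, 0x201E, 0x2026, 0x2020, 0x2021,
   0x02C6, 0x2030, 0x0160, 0x2039, 0x0152, 0x008D, 0x017D, 0x008F,
   0x0090, 0x2018, 0x2019, 0x201C, 0x201D, 0x2022, 0x2013, 0x2014,
   0x02DC, 0x2122, 0x0161, 0x203A, 0x0153, 0x009D, 0x017E, 0x0178
];

// Hypothetical helper (not part of UltraEdit): converts text which was
// decoded with code page 1252 back into a string with one character per
// original byte. Characters not present in code page 1252 stay unchanged.
function cp1252TextToBytes(sText) {
   var sBytes = "";
   for (var i = 0; i < sText.length; i++) {
      var nCode = sText.charCodeAt(i);
      if (nCode > 255) {
         for (var j = 0; j < CP1252_HIGH.length; j++) {
            if (CP1252_HIGH[j] === nCode) {
               nCode = 0x80 + j;
               break;
            }
         }
      }
      sBytes += String.fromCharCode(nCode);
   }
   return sBytes;
}
```

If this assumption applies, the conversion line of the script could be written as UltraEdit.clipboardContent = decode_utf8(cp1252TextToBytes(UltraEdit.clipboardContent));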

Gabarito · Oct 03, 2014 · #5 · 2014-10-03T19:18+00:00

I'm a long-time user of UltraEdit.
And my previous version is very old too.
Despite that, I'm not an advanced user of this wonderful editor, as I realized after doing some searches in this forum and finding so many smart people among the professional programmers.

I can see that Mofi does his best to explain the UTF-8 thingy, but some questions remain.

My previous version, 9.10a, was able to open a file (by double clicking the TXT file, not File\Open) that had come from Google Translate via the clipboard, and it showed the extended characters in a wrong way. This was a flag for me that I had to convert the whole file to ASCII. UltraEdit 9.10a could do that job and I ended up with my converted file as I wished.
The status bar showed only DOS, but nothing about the encoding. Sometimes it showed U8-DOS. The command File\Conversions\UTF-8 to ASCII was always enabled and, just after clicking it, UltraEdit replaced the wrong extended characters with the right ones.
I always set Courier New as the font and Western as the script.

The new version, 21.20, is different. After a lot of trial and error, I found some new features:

File\Open\Format-OpenAs
View\SetCodePage
Advanced\SetCodePage-Locale
Advanced\FileHandling\Conversions
Advanced\FileHandling\Save
Advanced\FileHandling\UnicodeUTF8Detection

I confess, I tried many combinations to make version 21.20 behave like my good old 9.10a, without success...
Many times, File\Conversions\UTF-8 to ASCII is not enabled, or the extended characters are shown correctly, as if they were ASCII, even though they really aren't. I would prefer that they appear wrong, just to tell me that I have more work to do (convert UTF-8 to ASCII).

So, I ask for some help from the UltraEdit gurus.
How can I open a file by double clicking it, see the extended (and wrong) characters, then convert UTF-8 to ASCII and get the right characters displayed, on the fly?

Below, a small sample of the files I handle:

Code:

Quero chamar atenÃ§Ã£o para uma coisa: ele diz, tÃ¡ certo, realmente tem que privatizar mesmo. Ocorre que ele sempre foi contrÃ¡rio a isso e nÃ£o era por convicÃ§Ã£o da de sabotagem. As aÃ§Ãµes sempre foram positivas. Mas nÃ£o Ã©.
EstÃ£o sem saÃda. As privatizaÃ§Ãµes comeÃ§aram com Fernando.

Here, the text after UTF-8 to ASCII conversion:

Code:

Quero chamar atenção para uma coisa: ele diz, tá certo, realmente tem que privatizar mesmo. Ocorre que ele sempre foi contrário a isso e não era por convicção da de sabotagem. As ações sempre foram positivas. Mas não é.
Estão sem saída. As privatizações começaram com Fernando.

It would be better if the solution did not involve the File\Open\Format-OpenAs trick.
Just the file association, by double clicking in Windows Explorer.

My question is similar to the one written above by fredtheman.

Note: None of the files has a BOM.

Mofi · Oct 04, 2014 · #6 · 2014-10-04T17:02+00:00

Let us first clarify whether I have understood you correctly.

You copy Unicode text in another application to the clipboard as a UTF-8 encoded stream.
You paste this UTF-8 encoded stream into a new ASCII/ANSI file in UltraEdit. UE v9.10a indicated just DOS in the status bar. UE v21.20 indicates DOS for the line terminator and 1252 (ANSI - Latin I) for the encoding / code page (or also just DOS if the basic status bar is used according to configuration).
You see that the text is not ANSI text using Windows code page 1252 (or whatever your default system code page is), but UTF-8 encoded text.
Therefore you executed the command UTF-8 to ASCII in UE v9.10a to convert the text from UTF-8 to ASCII/ANSI (just one byte per character) using Windows code page 1252.

Well, UltraEdit v21.20 now really supports all Unicode encodings. For that reason it is no longer possible to run the command UTF-8 to ASCII or any other Unicode to ASCII/ANSI command if the active file is not already loaded as a Unicode file (2 bytes per character in memory instead of just 1 byte per character).

It is very uncommon to copy Unicode text in UTF-8 encoding to Windows clipboard. Usually a Unicode text is copied to Windows clipboard in UTF-16 encoding. In this case a paste from clipboard into an ASCII/ANSI file results in an automatic conversion from Unicode to ANSI using the code page set for active file which is usually the default system code page for non Unicode files as configured in the Windows region and language settings.

You have 2 possibilities for converting a text copied to Windows clipboard and pasted from clipboard as UTF-8 encoding stream to ANSI text using code page 1252.

Convert with UltraEdit (many steps, not recommended when often needed):
- UTF-8 encoding stream is pasted into a new ANSI file.
- The ANSI file is saved and closed.
- The file is re-opened for example via the recent files list in menu File.
- UltraEdit now detects the UTF-8 encoded characters in the file and therefore loads this file as a Unicode file instead of an ANSI file.
- Now the command UTF-8 to ASCII can be used to convert the file from UTF-8 to ASCII/ANSI as preferred.
Convert with script (just 2 steps executed by key once configured):
- My script posted above is copied into a new ASCII file.
- The file is saved for example with name PasteWithUtf8ToAnsi.js into directory %APPDATA%\IDMComp\UltraEdit\MyScripts
- Menu item Scripts in menu Scripting is clicked and the just saved UltraEdit script is added to the list of scripts with assigning a hotkey to this script for easy execution by key. See Customized copy of file name with path for FTP files for the step by step instructions.
- Now whenever text is pasted from Windows clipboard which is obviously a UTF-8 encoding stream, undo the paste with Ctrl+Z (step 1) and instead use the hotkey of the script to paste the text with conversion from UTF-8 to ASCII/ANSI (step 2).
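
The decision whether pasted text is "obviously a UTF-8 encoding stream" could also be made by the script itself. The following is only a sketch under assumptions: the function name looksLikeUtf8Stream is made up, and it expects a string with one character per byte (character codes 0 to 255). It returns true if the text contains at least one two- or three-byte UTF-8 sequence and no byte which breaks the UTF-8 rules.

```javascript
// Hypothetical helper: heuristic check whether a single-byte text is in
// reality a UTF-8 byte stream. Only 1 to 3 byte sequences are accepted,
// which is all that can end up in a code page 1252 text anyway.
function looksLikeUtf8Stream(sText) {
   var bFoundSequence = false;
   var nPos = 0;
   while (nPos < sText.length) {
      var nLead = sText.charCodeAt(nPos);
      var nFollow;
      if (nLead < 0x80) {
         nPos++;                       // Plain ASCII byte.
         continue;
      }
      if (nLead >= 0xC2 && nLead <= 0xDF) {
         nFollow = 1;                  // Lead byte of a two-byte sequence.
      } else if (nLead >= 0xE0 && nLead <= 0xEF) {
         nFollow = 2;                  // Lead byte of a three-byte sequence.
      } else {
         return false;                 // Not a valid lead byte.
      }
      for (var k = 1; k <= nFollow; k++) {
         // charCodeAt returns NaN past the end, which fails this test too.
         var nNext = sText.charCodeAt(nPos + k);
         if (!(nNext >= 0x80 && nNext <= 0xBF)) return false;
      }
      bFoundSequence = true;
      nPos += nFollow + 1;
   }
   return bFoundSequence;
}
```

With such a check, the script could convert the clipboard content only when looksLikeUtf8Stream(UltraEdit.clipboardContent) is true and paste it unchanged otherwise, so the hotkey could replace the normal paste. Text which is really ANSI but by chance forms valid UTF-8 sequences would be misdetected, so this remains a heuristic.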

Both methods can be used to convert a double UTF-8 encoded text back to UTF-8.

What is a double UTF-8 encoded text?

Well, that can be most easily explained with an example.

In UltraEdit use File - Open and select ASCII on option Open as in the dialog before selecting a file which is already a UTF-8 encoded file containing UTF-8 encoded characters like many HTML and XML files are nowadays.
File - Conversions - ASCII to UTF-8 is used in UltraEdit to encode the already UTF-8 encoded text once more using UTF-8 encoding.
The file is saved now with a new file name using File - Save As.
The just saved file is closed and opened once again, but this time with automatic detection of encoding.
UltraEdit loads the file as Unicode file because it detects the UTF-8 encoding.
But the user can see nevertheless strange character sequences as the file was UTF-8 encoded twice.

That happens quite often if users do not know what UTF-8 encoding really is and have not read the documentation of the application used. Many people have wrongly imported already UTF-8 encoded text into a database as ANSI text, later exported the text as Unicode text using UTF-8 encoding, and wondered why the double UTF-8 encoded text does not look right.
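
The double encoding can be reproduced with a few lines of plain JavaScript. This is only a sketch: the two helper names are made up, and it relies on the legacy functions escape and unescape, which exist in common JavaScript engines but are deprecated (whether UltraEdit's script engine offers them is an assumption I have not checked). The strings hold one character per byte.

```javascript
// Hypothetical helper: encode text as UTF-8, one character per byte.
function toUtf8ByteString(sText) {
   return unescape(encodeURIComponent(sText));
}

// Hypothetical helper: decode a one-character-per-byte UTF-8 stream.
// All character codes must be below 256 and form valid UTF-8,
// otherwise decodeURIComponent throws a URIError.
function fromUtf8ByteString(sBytes) {
   return decodeURIComponent(escape(sBytes));
}

var sOriginal = "\u00E9";                       // é
var sOnce     = toUtf8ByteString(sOriginal);    // bytes C3 A9, shown as Ã©
var sTwice    = toUtf8ByteString(sOnce);        // bytes C3 83 C2 A9

// Decoding once repairs the double encoding and gives valid UTF-8 again.
var sRepaired = fromUtf8ByteString(sTwice);     // bytes C3 A9 again
// Decoding a second time gives the original character.
var sDecoded  = fromUtf8ByteString(sRepaired);  // é
```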

BTW: Most XML files have a UTF-8 encoding declaration in the first line just because the XML creator does not know what this means. Many XML files never contain characters with a code value greater than 127, but the encoding is nevertheless declared with encoding="UTF-8". That results in slower parsing of the XML file in comparison to using encoding="ASCII" if the application reading the XML file really supports Unicode encodings.

Gabarito · Oct 04, 2014 · #7 · 2014-10-04T19:45+00:00

Mofi, you gave me, and everyone who reads this topic, the best explanations I could get.

Now many concepts are very clear and I can better handle my work files and texts from the clipboard.

It is good that you brought two workarounds, because they made the concepts easier to understand.
Of course, the second trick is better, using your DecodeFromClipboard script, and I'll use it.

In the past, the wrong display of the characters was a warning to me that something needed more work. And I used the UTF8-ASCII conversion. But I was doing it blindly, without understanding what was going on.

Now, I know what is happening.
I'll study a little more about code page, Unicode, ASCII, ANSI, OEM, ...

Thank you very much.
Excellent!