How to convert a file from ANSI (1250 Central Europe) to UTF-8?

Basic User

    Jun 16, 2020 #1

    I'd like to convert files from ANSI (1250 Central Europe) to UTF-8 and vice-versa. When I try using Menu > Advanced > Conversions > ASCII to UTF-8 and use Ctrl+H to switch to hex editing, the resulting file is displayed with ANSI characters as though they were UTF-8. They look like this: ¡¿áéíñóú??. So confusing!

    Every editor I've tried seems to have various troubles converting back and forth between ANSI and UTF-8, including UltraEdit. Why can't the user interface be clearer, or at least well explained somewhere?

    So here is my test case:

    Code:

    ¡¿áéíñóú??
    ÁàáÀÄäÉéÈèÍíñóÒòÖöÚúÜüß
    ™®©©° – —
    ?§¶·¨«»

    I must be misunderstanding something, but I'm not sure what. Please correct my errors.

    Grand Master

      Jun 17, 2020 #2

      The hex edit mode shows the binary bytes of a file, not characters. The ASCII representation of those bytes uses the code page defined as default in UltraEdit, which is by default the ANSI code page that Windows selects according to the region (country) configured for the user account. The bytes are never displayed in hex edit mode interpreted as UTF-7, UTF-8 or UTF-16. Hex edit mode displays the bytes of any type of file, not the characters of a text file. Please read the introductory chapters of the power tip page Unicode text and Unicode files in UltraEdit to get a better understanding of character encoding.

      Viewing a file with any character encoding is very easy with UltraEdit. The status bar at the bottom of the main application window contains an encoding selector since UltraEdit for Windows v19.00. Since UltraEdit for Windows v24.10 the bytes of the currently displayed file can be interpreted using an encoding different from the one automatically selected on opening the file, in case the automatic encoding detection was not correct for that file. In older versions selecting a different encoding on the status bar could result in converting the file to the selected encoding instead of just displaying the bytes of the current file according to the selected encoding. I know from IDM support that this change of the encoding selector behavior was made after a good deal of internal discussion, prompted by messages from users who did not actually want to convert their files, but simply wanted to change which encoding was used to display the file in certain cases.

      The currently used font must support the selected encoding as well, i.e. it must have glyphs defined for the characters of the selected encoding. Most fonts support only the characters of a few code pages. Since v24.00 UltraEdit for Windows automatically chooses a different font for a character not supported by the configured font if the text file is Unicode encoded. That can result in a caret positioning issue because the character widths are always calculated according to the configured font. So if a different font is used for just some characters in a line and that alternate font defines a different width for those characters, the caret positioning can be wrong. Internet browsers do the same when displaying Unicode text in which some characters are not supported by the font defined by the web page creator or the user. But Internet browsers don't show a caret at all, so most users don't notice that some characters are displayed using a different font.

      UltraEdit informs the user if the configured font must be changed to support the different code page or encoding. For example, UltraEdit shows that warning on changing the interpretation of the bytes of a text file from Windows-1252, displayed with a font using script Western, to Windows-1250, for which the font must be changed to script Central Europe, provided the font supports that code page at all.

      The appropriate conversion command can be used after selecting the correct encoding for the currently displayed text file. Alternatively, the command Save as can be used, which has an encoding option to convert the file on saving to UTF-8 with or without byte order mark (BOM), to UTF-16 Little Endian or Big Endian with or without BOM, or to ANSI according to the ANSI code page defined in UltraEdit for ANSI encoded text files.
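
      Just to make those Save as encoding options more concrete, here is a small byte level sketch in Python (not something UltraEdit runs; the file names and the example code page are only illustrations) of what the different output formats amount to:

      Code:

      import codecs

      text = "ÁÄäÉéß"                                    # text already decoded to Unicode characters

      # UTF-8 without byte order mark
      with open("example_utf8.txt", "wb") as f:          # example file name
          f.write(text.encode("utf-8"))

      # UTF-8 with byte order mark (the three bytes EF BB BF at the start)
      with open("example_utf8_bom.txt", "wb") as f:
          f.write(codecs.BOM_UTF8 + text.encode("utf-8"))

      # UTF-16 Little Endian with byte order mark (FF FE at the start)
      with open("example_utf16le_bom.txt", "wb") as f:
          f.write(codecs.BOM_UTF16_LE + text.encode("utf-16-le"))

      # ANSI according to a fixed code page, here Windows-1252 as an example
      with open("example_ansi.txt", "wb") as f:
          f.write(text.encode("cp1252"))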

      Your example text block is definitely not encoded using code page Windows-1250. Please look at the Wikipedia article about code page Windows-1250. The character ¡ is not available in code page Windows-1250. The inverted exclamation mark is available in code page Windows-1252 with the hexadecimal code value A1 and has the Unicode code value 00A1. The character ¿ is also not available in code page Windows-1250, while the inverted question mark is available in code page Windows-1252 with the hexadecimal code value BF and has the Unicode code value 00BF.

      So you made a mistake. You thought the text is ANSI encoded with code page Windows-1250, but in reality it is encoded with code page Windows-1252. Therefore the characters are displayed wrong when the bytes of the file, interpreted according to Windows-1250, are converted to Unicode with UTF-8 encoding. The byte with code value A1 is not converted to the Unicode character with code value 00A1 as you expected, but to the Unicode character with code value 02C7 (caron) according to code page Windows-1250.
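
      A short Python sketch (just an illustration, independent of UltraEdit) shows that effect with exactly the two bytes mentioned above:

      Code:

      # The same two raw bytes interpreted with two different code pages.
      sample = bytes([0xA1, 0xBF])

      as_cp1252 = sample.decode("cp1252")    # '¡¿'  -> U+00A1, U+00BF
      as_cp1250 = sample.decode("cp1250")    # 'ˇż'  -> U+02C7 (caron), U+017C

      # The UTF-8 result therefore depends on which interpretation was chosen first.
      print(as_cp1252.encode("utf-8"))       # b'\xc2\xa1\xc2\xbf'
      print(as_cp1250.encode("utf-8"))       # b'\xcb\x87\xc5\xbc'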

      If you see the characters ¡ and ¿ displayed in the document window of UltraEdit although code page Windows-1250 is selected, you have most likely ignored the warning of UltraEdit that the font must be modified, i.e. the script must be changed from Western to Central Europe, and so the configured font still displays the bytes as if they were Windows-1252 encoded and not Windows-1250 encoded.
      Best regards from an UC/UE/UES for Windows user from Austria

      Basic User

        Jun 17, 2020 #3

        Thanks for your detailed answer. Unfortunately, "Windows 1250" was a typo. I meant "Windows 1252", which is the same as ISO-8859-1 except in the range 80-9F (hex). Also, please forget about hex editing mode; I was hoping that it could render multibyte UTF-8 sequences as single glyphs. I've been searching, and strangely no hex editor seems to be capable of doing this, even though it would not be difficult to program.
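
        That 80-9F range is easy to check outside any editor; a quick Python illustration, purely for the record:

        Code:

        # Windows-1252 and ISO-8859-1 agree everywhere except bytes 0x80-0x9F:
        # ISO-8859-1 maps them to C1 control characters, Windows-1252 to printable
        # characters such as the euro sign, curly quotes and the trademark sign.
        for b in (0x80, 0x93, 0x99):
            raw = bytes([b])
            print(hex(b),
                  repr(raw.decode("cp1252")),       # '€', '“', '™'
                  repr(raw.decode("iso-8859-1")))   # '\x80', '\x93', '\x99'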

        So let me try to ask my remaining question.

        You say, "So the bytes of the currently displayed file can be interpreted using a different encoding..." but this is not what I'm asking for. I am not asking to render the bytes differently, but to convert the bytes from one encoding to another.

        Let's say I have some text in ISO-8859-1 encoding. It is a string of bytes in a file (one byte is one character in this encoding). I want to convert this string of bytes (the entire file or a selection, I don't care which) to UTF-8 encoding, so I can save the result as a file. I haven't been able to accomplish this in UE (I can create a file from scratch in UTF-8, of course).

        Also, I want to do the opposite: given a file already in UTF-8 encoding, I want to convert it from UTF-8 to ISO-8859-1 and save it.

        When opened, the converted files should appear the same as before conversion (that is, rendered as the same glyphs), assuming all characters are within the first 256 code points, whether in ISO-8859-1 or Unicode. But the bytes in the converted files may differ after conversion, since UTF-8 encodes every character from 80 (hex) upward with a completely different byte sequence.
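
        In case it helps to pin the question down, this is the whole round trip expressed as a small Python sketch (the file names are placeholders); this is what I would like the editor to do for me:

        Code:

        # ISO-8859-1 -> UTF-8: the characters stay the same, only the bytes change.
        with open("input_latin1.txt", "r", encoding="iso-8859-1") as src:   # placeholder names
            text = src.read()                        # bytes decoded into Unicode characters
        with open("output_utf8.txt", "w", encoding="utf-8") as dst:
            dst.write(text)                          # same characters, written as UTF-8 bytes

        # UTF-8 -> ISO-8859-1: the reverse direction, possible as long as every
        # character actually fits into ISO-8859-1.
        with open("output_utf8.txt", "r", encoding="utf-8") as src:
            text = src.read()
        with open("back_latin1.txt", "w", encoding="iso-8859-1") as dst:
            dst.write(text)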

        I hope I've asked my question exactly and correctly this time.

        Grand Master

          Jun 17, 2020 #4

          Which version of UltraEdit do you use on which operating system?

          This information is important for me as there are various differences between the versions of UltraEdit for Windows, and I want to write the steps that will definitely work for the version of UltraEdit you use.

          In general the solution is to convert the file from one single byte per character format using code page X to Unicode (UTF-16) format, and then from Unicode to the other single byte per character format using code page Y.
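
          Expressed at the byte level, that pivot through Unicode looks roughly like this (a Python sketch; cp1252 and cp850 are used only as example code pages X and Y):

          Code:

          # Step 1: interpret the raw bytes using code page X -> Unicode characters.
          # Step 2: re-encode the Unicode characters using code page Y.
          raw_x = "Ääéß".encode("cp1252")          # pretend these bytes were read from a file

          unicode_text = raw_x.decode("cp1252")    # code page X -> Unicode
          raw_y = unicode_text.encode("cp850")     # Unicode -> code page Y

          # Same characters, different byte values in the two code pages.
          print(raw_x.hex(" "))   # c4 e4 e9 df
          print(raw_y.hex(" "))   # 8e 84 82 e1
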
          Best regards from an UC/UE/UES for Windows user from Austria

          Basic User

            Jun 17, 2020 #5

            Oh, sorry, I'm getting forgetful. Windows 10 and UE version 26.20.0.46.

            Oh, that sounds very complicated and error-prone. I was hoping that UE would be the one editor that would just do something like this easily and reliably.

            Never mind, thank you for all your help. Maybe I'll have to write my own conversion utility program. I've already written the conversion code to go to UTF-8 in PHP and it works great. It's just that I have so many other projects already. Writing a hex editor that shows UTF-8 will have to compete against at least two other projects that are in progress now.

            Grand Master

              Jun 20, 2020 #6

              Well, with UltraEdit for Windows v26.20.0.46 it is very easy to convert a text file that is not already Unicode encoded to another non Unicode character encoding.

              The necessary steps to execute
              • on ribbon tab Advanced in pop-up menu Conversions or
              • in submenu Conversions of contemporary menu Advanced or
              • in submenu Conversions of traditional menu File or
              • in submenu Conversions of file tab context menu
              are just
              1. clicking on item ASCII to Unicode and
              2. clicking on item Unicode to ASCII and
              3. selecting the wanted destination code page and
              4. saving the file with Ctrl+S.
              That's it.

              The first step must be omitted if the file is Unicode encoded with encoding UTF-16 LE or UTF-16 BE.

              The first step must be omitted and the item to click in the second step is UTF-8 to ASCII if the file is Unicode encoded with UTF-8.
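
              For comparison, the UTF-8 to ASCII direction expressed outside the editor as a minimal Python sketch (the file names and the fallback policy are only examples); the one thing to watch for is a character that has no code value in the destination code page:

              Code:

              # UTF-8 -> single byte code page. A character without a code value in the
              # destination code page raises an error unless a replacement policy is used.
              with open("page_utf8.txt", "r", encoding="utf-8") as src:       # example file name
                  text = src.read()

              try:
                  data = text.encode("cp1252")                      # strict conversion
              except UnicodeEncodeError:
                  data = text.encode("cp1252", errors="replace")    # substitute '?' for the rest

              with open("page_cp1252.txt", "wb") as dst:
                  dst.write(data)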

              There is no command to directly convert a non Unicode encoded file with code page X to a non Unicode encoded file with a different code page Y, with two exceptions:
              1. A one byte per character encoded file using the ANSI code page of the country configured for the user account can be converted with the command ANSI to OEM to the OEM code page of that country.
              2. A one byte per character encoded file using the OEM code page of the country configured for the user account can be converted with the command OEM to ANSI to the ANSI code page of that country.

              Why does UltraEdit have no command to directly convert a non Unicode encoded file with code page X to a non Unicode encoded file with code page Y?

              I don't know. I can only suppose that nobody has ever requested it, because there are not many code pages which encode the same characters with different code values.

              I wrote an UltraEdit script to convert all files to UTF-8, which is used more often, mainly by web page authors, to convert an existing set of web files with a non Unicode character encoding to UTF-8; see How to convert all files in a folder to UTF-8? HTML writers should also read Script to convert special characters to HTML code.
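
              The linked script runs inside UltraEdit; as a rough standalone sketch of the same idea in Python (the folder name, file pattern and source code page are assumptions, not taken from that script):

              Code:

              from pathlib import Path

              SOURCE_ENCODING = "cp1252"        # assumption: all files share this known code page
              folder = Path("website")          # assumption: folder containing the files to convert

              for path in folder.glob("*.htm*"):
                  text = path.read_text(encoding=SOURCE_ENCODING)   # code page -> Unicode
                  path.write_text(text, encoding="utf-8")           # Unicode -> UTF-8, overwrite in place
                  print("converted", path)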

              By the way: there are freeware applications available on the world wide web which are designed just for converting files from one encoding to another. Such a tool should be used if lots of files with a well-known character encoding have to be converted to a different character encoding. UltraEdit, like all other text editors, is mainly designed for text editing and not for text file conversions.
              Best regards from an UC/UE/UES for Windows user from Austria