I have a CSV file encoded as Windows-1252 (CP1252) because of a single character (out of roughly 300K) that falls in the range 0x80-0x9F. When I open it in UltraEdit, it is detected as ISO-8859-1, presumably because only a small portion at the beginning of the file is examined (the "offending" character "ž", which has code point 0x9E in Windows-1252, is somewhere in the middle of the file). If I scroll down even one page after opening the file, the encoding is correctly detected as Windows-1252 and the status bar indicator in UltraEdit reflects the change.
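To make the ambiguity concrete, here is a minimal Python sketch (not anything UltraEdit does internally) showing how the same byte decodes under the two encodings:

```python
# The single byte 0x9E means different things in the two encodings.
raw = b'\x9e'

# Windows-1252 defines 0x9E as "z with caron" (U+017E).
assert raw.decode('cp1252') == '\u017e'   # 'ž'

# ISO-8859-1 leaves 0x80-0x9F as C1 control characters, so the same
# byte decodes to an unprintable control character, not a letter.
assert raw.decode('latin-1') == '\x9e'
```

A detector that only inspects the start of the file sees nothing but bytes that are identical in both encodings, which would explain the initial ISO-8859-1 guess.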
Now, if I use the file conversion functions to convert explicitly from Windows-1252 to ISO-8859-1, this character (0x9E) is converted to 0x1A, the ASCII SUB control character, which carries no useful meaning here (and 0x9E itself is undefined in ISO-8859-1). Given that most people are unaware of the difference between these encodings (and probably do not care as long as everything works!), wouldn't it make more sense to leave the byte unchanged rather than convert it at all? The HTML5 specification requires browsers to render webpages labeled as ISO-8859-1 as if they were encoded in Windows-1252. According to the Wikipedia article on Windows-1252, this is to deal with the very common mislabeling of such pages as ISO-8859-1 when they actually contain these characters.
If I convert the file from WIN-1252 to UTF-8, the character is converted correctly to its Unicode equivalent. Converting from UTF-8 back to ISO-8859-1 gives me 0x1A instead of 0x9E. Of course, I expect to lose information when I convert from UTF-8 to a single-byte encoding. But wouldn't it make more sense to convert Unicode characters to their Windows 1252 equivalent if possible, instead of converting to some control character that is of no use to anyone?
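The round trip can be sketched in Python as well. Note this shows Python's behavior, which refuses the lossy step by default rather than emitting 0x1A; the point is that a converter must either fail, substitute, or (as suggested above) target Windows-1252, where the character survives:

```python
# CP1252 -> UTF-8 is lossless: 'ž' maps cleanly to U+017E.
text = b'\x9e'.decode('cp1252')          # 'ž'
utf8 = text.encode('utf-8')
assert utf8 == b'\xc5\xbe'

# Strict ISO-8859-1 has no code point for 'ž', so a faithful
# converter cannot represent it:
try:
    text.encode('latin-1')
except UnicodeEncodeError:
    pass  # strict conversion fails rather than guessing

# Targeting CP1252 instead preserves the character as 0x9E:
assert text.encode('cp1252') == b'\x9e'
```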
I was confused about this for a long time because when I import this CSV file into MySQL with the table defined as "CHARSET=utf8" using LOAD DATA INFILE with CHARACTER SET 'latin1' specified, all the data is imported correctly including the offending character. This is because MySQL actually parses the data as Windows-1252 instead of ISO-8859-1 encoding.
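This MySQL behavior can be mimicked in Python: MySQL's "latin1" charset is in practice the cp1252 superset, so decoding the raw CSV bytes the way MySQL does explains why the import succeeds (the field value below is hypothetical, just a string containing the 0x9E byte):

```python
# MySQL's 'latin1' maps 0x80-0x9F to printable characters,
# unlike strict ISO-8859-1 -- effectively it is cp1252.
mysql_latin1 = 'cp1252'          # the mapping MySQL actually applies

field = b'Ku\x9eelj'             # hypothetical CSV field with the 0x9E byte
decoded = field.decode(mysql_latin1)
assert decoded == 'Ku\u017eelj'  # 'Kuželj' -- stored correctly in the
                                 # utf8 table instead of a SUB character
```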
What do you think UltraEdit should do here, especially when converting from Unicode to ISO-8859-1 encoding? I assume that conversions from Unicode to a Cyrillic or Eastern European character set would behave as expected.