I have a CSV file encoded as Windows-1252 (CP1252) because of a single character (out of roughly 300K) that falls in the range 0x80-0x9F. When I open it in UltraEdit, it is detected as ISO-8859-1, presumably because only a small portion at the beginning of the file is examined (the "offending" character "ž", which has code point 0x9E in Windows-1252, is somewhere in the middle of the file). If I scroll down even one page after opening the file, the encoding is correctly detected as Windows-1252 and the status bar indicator in UltraEdit reflects the change.
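To make the ambiguity concrete, here is a minimal Python sketch (not anything UltraEdit does internally) showing how the same byte decodes under the two encodings:

```python
# The single byte 0x9E means different things in the two encodings.
raw = b'\x9e'

# Windows-1252 defines 0x9E as "z with caron" (U+017E).
assert raw.decode('cp1252') == '\u017e'   # 'ž'

# ISO-8859-1 leaves 0x80-0x9F as C1 control characters, so the same
# byte decodes to an unprintable control character, not a letter.
assert raw.decode('latin-1') == '\x9e'
```

A detector that only inspects the start of the file sees nothing but bytes that are identical in both encodings, which would explain the initial ISO-8859-1 guess.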
Now, if I use the file conversion functions to convert explicitly from Windows-1252 to ISO-8859-1, this character (0x9E) is converted to 0x1A, the ASCII SUB control character, which carries no useful meaning here (and 0x9E itself is undefined in ISO-8859-1). Given that most people are unaware of the difference between these encodings (and probably do not care as long as everything works!), wouldn't it make more sense to leave the byte unchanged rather than convert it at all? The HTML5 specification requires browsers to render webpages labeled as ISO-8859-1 as if they were encoded in Windows-1252. According to the Wikipedia article on Windows-1252, this is to deal with the very common mislabeling of such pages as ISO-8859-1 when they actually contain these characters.
If I convert the file from WIN-1252 to UTF-8, the character is converted correctly to its Unicode equivalent. Converting from UTF-8 back to ISO-8859-1 gives me 0x1A instead of 0x9E. Of course, I expect to lose information when I convert from UTF-8 to a single-byte encoding. But wouldn't it make more sense to convert Unicode characters to their Windows 1252 equivalent if possible, instead of converting to some control character that is of no use to anyone?
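The round trip can be sketched in Python as well. Note this shows Python's behavior, which refuses the lossy step by default rather than emitting 0x1A; the point is that a converter must either fail, substitute, or (as suggested above) target Windows-1252, where the character survives:

```python
# CP1252 -> UTF-8 is lossless: 'ž' maps cleanly to U+017E.
text = b'\x9e'.decode('cp1252')          # 'ž'
utf8 = text.encode('utf-8')
assert utf8 == b'\xc5\xbe'

# Strict ISO-8859-1 has no code point for 'ž', so a faithful
# converter cannot represent it:
try:
    text.encode('latin-1')
except UnicodeEncodeError:
    pass  # strict conversion fails rather than guessing

# Targeting CP1252 instead preserves the character as 0x9E:
assert text.encode('cp1252') == b'\x9e'
```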
I was confused about this for a long time because when I import this CSV file into MySQL with the table defined as "CHARSET=utf8" using LOAD DATA INFILE with CHARACTER SET 'latin1' specified, all the data is imported correctly including the offending character. This is because MySQL actually parses the data as Windows-1252 instead of ISO-8859-1 encoding.
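This MySQL behavior can be mimicked in Python: MySQL's "latin1" charset is in practice the cp1252 superset, so decoding the raw CSV bytes the way MySQL does explains why the import succeeds (the field value below is hypothetical, just a string containing the 0x9E byte):

```python
# MySQL's 'latin1' maps 0x80-0x9F to printable characters,
# unlike strict ISO-8859-1 -- effectively it is cp1252.
mysql_latin1 = 'cp1252'          # the mapping MySQL actually applies

field = b'Ku\x9eelj'             # hypothetical CSV field with the 0x9E byte
decoded = field.decode(mysql_latin1)
assert decoded == 'Ku\u017eelj'  # 'Kuželj' -- stored correctly in the
                                 # utf8 table instead of a SUB character
```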
What do you think UltraEdit should do here, especially when converting from Unicode to ISO-8859-1 encoding? I assume that conversions from Unicode to a Cyrillic or Eastern European character set would behave as expected.