How does automatic UTF-8 encoding detection work in UltraEdit and UEStudio?

Basic User

    Mar 08, 2019 #1

    My text files only use characters from the code page Windows-1252.
    Yet, upon opening, UltraEdit sometimes indicates that a text file is UTF-8 encoded.
    Why might UltraEdit make this incorrect assumption?

    Grand Master

      Mar 08, 2019 #2

      Your question can't really be answered without an example file on which the encoding detection goes wrong.

      I recommend first reading the two introductory chapters on Unicode and on the difference between UTF-8 and UTF-16 in the power tip Working with Unicode in UltraEdit/UEStudio. Every text author should know that character encoding is based on rules; if the author does not know and follow these rules, the applications interpreting the data in the text file will not interpret the text as the author intended.

      The UTF-8 detection works as follows (a rough code sketch of such a heuristic follows the list):
      1. A UTF-8 encoded byte order mark (BOM) at the top of the file.
        According to the Unicode standard, the BOM is not displayed in text edit mode. It can be seen in UltraEdit in hex edit mode by looking at the first three bytes (EF BB BF).
      2. The opened file is an HTML, XHTML, or XML file with a UTF-8 character set or encoding declaration.
        See the topic Short UTF-8 charset declaration in HTML5 header (solved) for details.
        Note: An XML file without an encoding declaration at the top must be UTF-8 encoded, otherwise the XML file is invalidly encoded. See the non-normative chapter Autodetection of Character Encodings of the Extensible Markup Language (XML) 1.0 specification. Non-normative also means that a text editor like UltraEdit may scan the entire XML file for byte sequences which can be interpreted as valid UTF-8 encoded characters and, if no such byte sequence is found, interpret the bytes in the XML file as ANSI encoded. A text editor also has to take into account that the user may first paste or write the XML data and only insert the correct encoding declaration at the top of the XML file at the end of the editing session. Of course, it is the user, not the text editor, who is responsible for converting the XML file from ANSI to UTF-8 when inserting a UTF-8 encoding declaration.
        The user is also responsible for declaring the character set that matches the character encoding actually used in HTML and XHTML files. See this post for details about character encoding in HTML/XHTML files.
      3. The file contains byte sequences which are valid UTF-8 character encoding sequences.
        This automatic detection of UTF-8 encoded characters can result in a wrong encoding detection, because for a text file with no BOM and no UTF-8 character set or encoding declaration it is impossible for any program to know for certain whether, for example, the byte sequence C3 BC should be interpreted as the two characters Ã¼ (Windows-1252) or as the single character ü (UTF-8).
      4. The user has at some point in the past selected UTF-8 encoding for the file, which UltraEdit remembers in its INI file and applies again the next time the file is opened.
        Closing all files, opening Advanced - Settings or Configuration - Toolbars / menus - Miscellaneous, clicking the Clear history button, and closing the configuration dialog with the Cancel button clears this file encoding history along with all other histories stored in UltraEdit's INI file.
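
      To make the ambiguity described in step 3 concrete, here is a minimal Python sketch of a detection heuristic along the lines of steps 1 to 3. It is not UltraEdit's actual implementation; the function name, the 1 KiB window scanned for declarations, and the simple regular expressions are assumptions made for this illustration only.

      ```python
      import re

      UTF8_BOM = b"\xEF\xBB\xBF"

      def looks_like_utf8(data: bytes) -> bool:
          # Step 1: a UTF-8 byte order mark at the very top of the file.
          if data.startswith(UTF8_BOM):
              return True

          # Step 2: an HTML/XHTML charset or XML encoding declaration near the
          # top (only the first 1 KiB is scanned, an assumption of this sketch).
          head = data[:1024].decode("latin-1").lower()
          if re.search(r'charset\s*=\s*["\']?utf-8', head):
              return True
          if re.search(r'encoding\s*=\s*["\']utf-8["\']', head):
              return True

          # Step 3: check whether the bytes decode as valid UTF-8 at all.
          # C3 BC is "Ã¼" in Windows-1252 but "ü" in UTF-8, so this step can
          # guess wrong on an ANSI file that happens to contain such sequences.
          try:
              data.decode("utf-8")
          except UnicodeDecodeError:
              return False
          # Pure ASCII is identical in both encodings, so only report UTF-8
          # when at least one byte >= 0x80 (a multi-byte sequence) is present.
          return any(b >= 0x80 for b in data)

      print(looks_like_utf8(b"\xC3\xBCber"))  # True  -> valid UTF-8 for "über"
      print(looks_like_utf8(b"\xFCber"))      # False -> only Windows-1252 "über"
      ```

      The last two calls show exactly where step 3 can misfire: a Windows-1252 file that happens to contain byte pairs such as C3 BC is indistinguishable from UTF-8 when there is no BOM and no declaration.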
      Best regards from a UC/UE/UES for Windows user from Austria