Okay, I will try to answer your questions although they have already been answered in other UTF-8 related topics. But before reading further, read carefully
Unicode text and Unicode files in UltraEdit/UEStudio to get a basic understanding of encodings, which it looks like you don't have yet.
Why can't a few multi-byte characters be enough for UE to detect a file as UTF-8?
UltraEdit searches for byte sequences which could be interpreted as UTF-8 character codes only in the first 9 KB (UE v11.20a) or 64 KB (UE v14.00b up to UltraEdit for Windows < v24.10 and UEStudio < v17.10). Why not in the complete file? Because that would make UltraEdit extremely slow on opening any file when the setting
Auto detect UTF-8 files is enabled. Always scanning the complete file for a byte sequence which could be interpreted as a UTF-8 character code would mean reading all bytes of a file before displaying it. Not a very good idea for files of several MB, and of course a very bad idea for files of hundreds of MB or even GB.
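To illustrate why the scan is limited to a fixed prefix, here is a minimal Python sketch of such a heuristic. This is my own simplified illustration, not UltraEdit's actual detection code; the 64 KB limit is just the value mentioned above:

```python
# Minimal sketch of prefix-limited UTF-8 auto-detection (an illustration,
# NOT UltraEdit's real algorithm): validating the whole file would mean
# reading every byte before the file can be displayed.

def looks_like_utf8(data: bytes, scan_limit: int = 64 * 1024) -> bool:
    """Return True if the first scan_limit bytes contain at least one
    valid multi-byte UTF-8 sequence and no invalid ones."""
    prefix = data[:scan_limit]
    i = 0
    found_multibyte = False
    while i < len(prefix):
        b = prefix[i]
        if b < 0x80:                      # plain ASCII byte, always valid
            i += 1
            continue
        # Determine the expected sequence length from the lead byte.
        if 0xC2 <= b <= 0xDF:
            length = 2
        elif 0xE0 <= b <= 0xEF:
            length = 3
        elif 0xF0 <= b <= 0xF4:
            length = 4
        else:
            return False                  # invalid UTF-8 lead byte
        seq = prefix[i:i + length]
        if len(seq) < length:             # sequence cut off by scan limit
            break
        if any(not (0x80 <= c <= 0xBF) for c in seq[1:]):
            return False                  # malformed continuation byte
        found_multibyte = True
        i += length
    return found_multibyte

# The bytes E2 80 9C (UTF-8 for the left double quotation mark) pass:
print(looks_like_utf8(b'some text \xe2\x80\x9c quoted \xe2\x80\x9d'))  # True
```

A pure ASCII file returns False here, which matches the behavior discussed in this topic: without at least one multi-byte sequence in the scanned prefix, there is nothing that identifies the file as UTF-8.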
Also, how can UltraEdit be sure that the byte sequence
E2 80 9C (hex codes) should really be interpreted as the UTF-8 character code for the character
“ and not interpreted as the string
â€œ using code page 1252? Can you answer that question if I give you a file with just these 3 bytes? How can you know what I meant with these 3 bytes? Maybe I'm Russian and the same 3 bytes mean
вЂњ, or I'm Greek and the same 3 bytes mean
β€ followed by a byte to which code page 1253 does not even assign a character. Do you understand the problem? There must be a convention for a program which reads the bytes
E2 80 9C on how to interpret them.
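You can reproduce this ambiguity yourself. The following Python snippet decodes the very same 3 bytes with the codecs for UTF-8 and the Windows code pages 1252, 1251 and 1253:

```python
# The same three bytes, decoded under different conventions.
data = bytes([0xE2, 0x80, 0x9C])

print(data.decode('utf-8'))    # “   (left double quotation mark)
print(data.decode('cp1252'))   # â€œ (Western European ANSI)
print(data.decode('cp1251'))   # вЂњ (Cyrillic ANSI)

try:
    print(data.decode('cp1253'))   # Greek ANSI
except UnicodeDecodeError:
    # 0x9C has no character assigned in code page 1253 at all.
    print('0x9C is not assigned in code page 1253')
```

Four conventions, four different meanings for identical bytes; that is exactly why a file needs to declare its encoding somewhere.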
That's the reason why organizations like the
International Organization for Standardization (ISO) or the
Unicode Consortium exist. They define standards. Without standards, our high-tech world could not exist. Unicode is a standard - see
About the Unicode Standard.
So what is the real problem? The real problem is that the program which created the 17 MB file you opened encodes characters as UTF-8 byte sequences, but has not declared the file as a UTF-8 file with a UTF-8 BOM. If your file is an HTML, XHTML or XML file, then it does not need a BOM, but then it must have a declaration of the UTF-8 encoding at the top of the file. That your file has neither a BOM nor a standardized character encoding declaration means your program ignores all the standards.
UTF-8 is really a special encoding standard. It was defined because many programs can only handle single-byte encoded text files and don't support the Unicode standard. With UTF-8 it is possible to encode non-ASCII characters in ASCII/ANSI files and therefore keep files containing non-ASCII characters readable for programs not supporting the Unicode standard. Many interpreters, for example PHP and Perl, are (or were) not capable of correctly interpreting UTF-16 files. They can interpret only ASCII files and ASCII strings, and they don't know about the special meaning of
00 00 FE FF (UTF-32, big endian),
FF FE 00 00 (UTF-32, little endian),
FE FF (UTF-16, big endian),
FF FE (UTF-16, little endian) and
EF BB BF (UTF-8) at the top of a text file and therefore often break with an error if a BOM exists. That is one reason why for HTML, XHTML and XML a special declaration of the encoding using only ASCII characters was standardized: document writers can use non-ASCII characters, interpreters not compatible with the Unicode standard can still interpret the files, and browsers supporting the standards know which encoding is used for the file and can interpret and display the byte stream correctly.
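For illustration, checking these BOM signatures is straightforward; the only subtlety is testing the longer UTF-32 BOMs before the UTF-16 ones, because FF FE is a prefix of FF FE 00 00. A small Python sketch of my own, not taken from any particular program:

```python
# BOM signatures from the list above, ordered so that the 4-byte UTF-32 LE
# BOM (FF FE 00 00) is matched before the 2-byte UTF-16 LE BOM (FF FE).
BOMS = [
    (b'\x00\x00\xfe\xff', 'UTF-32 BE'),
    (b'\xff\xfe\x00\x00', 'UTF-32 LE'),
    (b'\xef\xbb\xbf',     'UTF-8'),
    (b'\xfe\xff',         'UTF-16 BE'),
    (b'\xff\xfe',         'UTF-16 LE'),
]

def detect_bom(data: bytes) -> str:
    """Return the encoding name announced by a BOM, or 'no BOM'."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name
    return 'no BOM'

print(detect_bom(b'\xef\xbb\xbfHello'))       # UTF-8
print(detect_bom(b'\xff\xfe\x00\x00abcd'))    # UTF-32 LE, not UTF-16 LE
print(detect_bom(b'plain text'))              # no BOM
```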
Okay, back to your problem. UltraEdit for Windows < v24.10 and UEStudio < v17.10 do not scan the whole file for UTF-8 byte sequences, for the reasons described above. So your 17 MB file is opened in ASCII mode. If the file is now saved as UTF-8, the bytes of the existing UTF-8 byte sequences are themselves encoded with UTF-8 once more. So the character
œ, already present in the file as the 2 bytes C5 93 and interpreted with your code page as
Ĺ“, is saved as the 5 bytes C4 B9 E2 80 9C, and now you have garbage. The solution is to use the special file option
Open as in the
File - Open dialog, or to insert the 3 bytes of the UTF-8 BOM, save the file as ASCII as loaded, close it and open it again.
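The double encoding described above can be reproduced with a few lines of Python (code page 1250 is used here as the example ANSI code page matching the byte values in the paragraph):

```python
# What goes wrong: the file already contains œ as UTF-8 (bytes C5 93), the
# editor misreads it with an ANSI code page (cp1250 here, giving Ĺ“), and
# then saves that misreading as UTF-8 again, producing 5 bytes of garbage.
original = '\u0153'.encode('utf-8')   # œ  as UTF-8 -> C5 93
misread  = original.decode('cp1250')  # 'Ĺ“' - two wrong characters
resaved  = misread.encode('utf-8')    # the wrong characters re-encoded

print(original.hex(' '))   # c5 93
print(misread)             # Ĺ“
print(resaved.hex(' '))    # c4 b9 e2 80 9c
```

Each round trip through the wrong code page grows the byte sequence and moves it further from the intended character, which is why such damage is hard to undo after saving.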
I think I also have to explain why UltraEdit for Windows < v25.10 and UEStudio < v18.10 convert a whole file detected as UTF-8 into UTF-16 LE, which takes time on larger files. Most characters in a UTF-8 file are encoded with a single byte, others with 2 bytes, some with 3 or even 4 bytes. That is not very good for a program which does not only display the content, but also allows modifying it with dozens of functions. Converting the UTF-8 file to UTF-16 LE results in a fixed number of bytes per character for all characters in the
basic multilingual plane. That makes it efficient to handle the bytes of the characters in memory and in the file. Also, in all programming languages I know there is only the choice of using single-byte character arrays or double-byte Unicode arrays for strings. UTF-8 is really something special, as already written above.
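A quick Python demonstration of the variable byte counts in UTF-8 versus the fixed 2 bytes per BMP character in UTF-16 LE:

```python
# Byte counts per character: UTF-8 varies from 1 to 4 bytes, while UTF-16 LE
# uses exactly 2 bytes for every character in the basic multilingual plane
# (only characters beyond the BMP need a 4-byte surrogate pair).
for ch in ('A', '\u00e4', '\u20ac', '\U0001F600'):   # A, ä, €, 😀
    u8  = ch.encode('utf-8')
    u16 = ch.encode('utf-16-le')
    print(f'U+{ord(ch):04X}: {len(u8)} UTF-8 byte(s), {len(u16)} UTF-16 LE byte(s)')
# U+0041: 1 UTF-8 byte(s), 2 UTF-16 LE byte(s)
# U+00E4: 2 UTF-8 byte(s), 2 UTF-16 LE byte(s)
# U+20AC: 3 UTF-8 byte(s), 2 UTF-16 LE byte(s)
# U+1F600: 4 UTF-8 byte(s), 4 UTF-16 LE byte(s)
```

With the fixed width, the editor can jump to the n-th BMP character by simple arithmetic instead of walking the byte stream from the start.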
Why can't UE have a configuration option to assume all opened files are UTF-8 (or any other encoding)?
That's a suggestion for an enhancement that can be sent by email to IDM. But the real problem is the program which created the 17 MB file using UTF-8 encoding without marking it as a UTF-8 encoded file. If all programs creating UTF-8 files were compatible with the Unicode standard and wrote the encoding information into the file as required by the standards, then all other programs which really are compatible with the Unicode standard would have no problems reading those files.
Added on 2009-11-09: I have found an undocumented setting in uedit32.exe of v11.10c and later versions.
By manually adding the line
Force UTF-8=1
to the already existing section
[Settings] (use the Find command) in
uedit32.ini, usually located in
%APPDATA%\IDMComp\UltraEdit, using Windows Notepad while UltraEdit is not running, UltraEdit can be forced to read all non UTF-16 files as UTF-8 encoded files. But new files are nevertheless created and saved either as Unicode (UTF-16 LE) or ASCII/ANSI files, except with UE v16.00 and later when the default
Encoding Type is set to
Create new files as UTF-8.
So this special setting applies only to files which already have a name. However, creating a new file in ASCII/ANSI with UE < v16.00, saving it with a name, closing it and re-opening it results in a file now loaded as UTF-8. Be careful with this setting: even real ANSI files are loaded as UTF-8 encoded files, causing all non-ASCII characters with a code value greater than decimal 127 to be interpreted wrongly.
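For clarity, the resulting fragment of uedit32.ini would look like this (the comment line is only a placeholder; the other settings in the section vary per installation):

```ini
[Settings]
; ... existing settings remain unchanged ...
Force UTF-8=1
```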