It looks like you are not familiar with text encoding. I therefore suggest first reading the introduction chapter of the power tip Working with Unicode in UltraEdit/UEStudio, or even better the entire power tip.
Next, read my first post in What's the best default new file format? I wrote a lot about text encoding there and added some links to useful pages.
UTF is the abbreviation for Unicode Transformation Format. Every application supporting text can manage it in memory either with 1 byte per character (using a char array in C/C++) for OEM/ASCII/ANSI encoded files or with 2 bytes per character (using a wchar_t array in C/C++ on Windows) for Unicode encoded files.
UTF-8 is a special format which stores a character with 1, 2, 3, 4, 5 or even 6 bytes. In November 2003, RFC 3629 restricted UTF-8 to the code range up to U+10FFFF (at most 4 bytes) so that every UTF-8 text can be converted to UTF-16.
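To make the variable length concrete, here is a small sketch that prints how many bytes a few sample characters occupy in UTF-8 and in UTF-16 LE. The sample characters and the helper name encoded_lengths are my own choice for illustration. Note that a character above U+FFFF, like the last sample, needs 4 bytes in UTF-16 as well.

```python
# Sample characters: U+0041 (A), U+00E4 (a with umlaut), U+20AC (euro sign), U+1F600 (emoji).
samples = ["A", "\u00E4", "\u20AC", "\U0001F600"]

def encoded_lengths(ch):
    """Return (UTF-8 byte count, UTF-16 LE byte count) for one character."""
    return len(ch.encode("utf-8")), len(ch.encode("utf-16-le"))

for ch in samples:
    utf8_len, utf16_len = encoded_lengths(ch)
    print(f"U+{ord(ch):04X}: UTF-8 {utf8_len} byte(s), UTF-16 LE {utf16_len} byte(s)")
```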
So the number of bytes needed in memory would vary from character to character if UTF-8 encoded text were loaded directly into memory. Such a variable-width encoding is awkward to handle with plain character arrays, so many applications supporting UTF-8 encoded files convert them to UTF-16. UltraEdit also converts a UTF-8 encoded file to UTF-16 LE, which uses 2 bytes per character for all characters in the Basic Multilingual Plane, making it possible to load the text into memory using a wchar_t array.
This explains why the temporary file using UTF-16 Little Endian encoding is about "double" the size of the UTF-8 encoded file. "Double" is not exactly correct as soon as the UTF-8 encoded file contains at least 1 character encoded with more than 1 byte. The temporary file also starts with the UTF-16 LE byte order mark, which means there are 2 additional bytes at the top of the file. In accordance with the Unicode standard, UltraEdit displays a byte order mark only in hex editing mode, not in text mode.
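The size relation can be checked with a few lines of Python. This is only a sketch of the arithmetic, not of what UltraEdit does internally; the helper name temp_file_size and the two sample lines are assumptions made for illustration.

```python
import codecs

def temp_file_size(utf8_bytes):
    """Size in bytes of the same text stored as UTF-16 LE with a byte order mark."""
    text = utf8_bytes.decode("utf-8")
    return len(codecs.BOM_UTF16_LE) + len(text.encode("utf-16-le"))

ascii_only = "trees\r\n".encode("utf-8")         # 7 characters, 7 bytes in UTF-8
with_umlaut = "B\u00E4ume\r\n".encode("utf-8")   # 7 characters, 8 bytes in UTF-8

print(len(ascii_only), temp_file_size(ascii_only))     # 7 16
print(len(with_umlaut), temp_file_size(with_umlaut))   # 8 16
```

The pure ASCII line grows to exactly double plus the 2 bytes of the BOM, while the line with the umlaut grows by less than double.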
UltraEdit does not modify a file just because it is opened, or closed without a modification. But a UTF-8 encoded file must be transformed to UTF-16 LE to be able to load the Unicode text file into memory, in parts or entirely depending on file size.
I'm not sure what you really did. Your description is not precise enough. It reads as if you used File - Open and manually selected UTF-8 in the dialog. This should be done only if you are 100% sure that the entire file is encoded in UTF-8, but UltraEdit does not detect it automatically because there is no BOM, no UTF-8 character set (HTML, XHTML) or UTF-8 encoding declaration (XML) in the first few KB, and also no UTF-8 encoded character in the first 64 KB. See UTF-8 not recognized, largish file for details on automatic UTF-8 detection and when it fails.
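The detection rules listed above can be sketched roughly as follows. This is not UltraEdit's code. The 64 KB limit comes from the description above, while the 4 KB value for "first few KB", the two regular expressions and the function name looks_like_utf8 are assumptions made only for illustration.

```python
import codecs
import re

SCAN_LIMIT = 64 * 1024   # from the description: only the first 64 KB are examined
DECL_LIMIT = 4 * 1024    # assumption: "first few KB" taken as 4 KB

# Assumption: a loose pattern covering charset=utf-8 and encoding="utf-8".
DECLARATION = re.compile(rb"(?:charset|encoding)\s*=\s*[\"']?utf-8", re.IGNORECASE)
# A lead byte followed by the matching number of continuation bytes
# (overlong forms are not excluded in this simple sketch).
MULTIBYTE = re.compile(
    rb"[\xC2-\xDF][\x80-\xBF]|[\xE0-\xEF][\x80-\xBF]{2}|[\xF0-\xF4][\x80-\xBF]{3}"
)

def looks_like_utf8(data):
    """Heuristic check on the beginning of a file, following the three rules."""
    head = data[:SCAN_LIMIT]
    if head.startswith(codecs.BOM_UTF8):
        return True
    if DECLARATION.search(head[:DECL_LIMIT]):
        return True
    return MULTIBYTE.search(head) is not None

# A large file whose first UTF-8 encoded character comes after 64 KB is missed.
late_umlaut = b"a" * SCAN_LIMIT + "\u00E4".encode("utf-8")
print(looks_like_utf8(late_umlaut))   # False
```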
But I think I found out what you did that resulted in a modified file, although the file was just opened and closed without making an obvious change.
- I selected configuration option Use temporary file for editing (normal operation) and set 0 for Threshold for above at Advanced - Configuration - File Handling - Temporary Files which are the default settings, but not my preferred temporary file settings.
- Next I created a new ANSI file with DOS line terminators using code page Windows-1252 and copied the content of file changes.txt in the UltraEdit program files folder into it, not just once or twice, but many times with Ctrl+V, Ctrl+V, Ctrl+V, Ctrl+A, Ctrl+C, Ctrl+V, Ctrl+V, Ctrl+V, Ctrl+A, Ctrl+C, Ctrl+V, Ctrl+V, ... Ctrl+V. With this method the file size of the new file increased very quickly to about 223 MB.
- As this text file contained only ASCII characters, I added at the bottom Bäume (German word meaning trees) containing the umlaut ä with hexadecimal code value E4. The entire word Bäume is encoded in Windows-1252 with the bytes 42 E4 75 6D 65. Then I saved the file and closed it.
- I used File - Open, selected the file and left the option Auto Detect ASCII/Unicode unchanged. As the file is greater than 50 MB, which is the internal threshold value when 0 is used for the threshold, the dialog for opening large files was shown and I left the first option selected.
- UltraEdit opened the file without using a temporary file as an ASCII/ANSI file using code page 1252, as also indicated on the status bar. Nothing changed on disk after opening. I closed the file and nothing changed on disk after closing.
- I used again File - Open, but this time I selected UTF-8, although the file is not encoded in UTF-8.
THAT IS MOST LIKELY THE MISTAKE YOU MADE, TOO.
You have manually selected an encoding which was wrong for the file as the file was not encoded in UTF-8.
- Now it took much longer to open the file as UltraEdit needs to transform the file content from UTF-8 to UTF-16 before finally loading parts of it into memory for viewing and editing. As I had already selected Disable temporary files when opening large files (greater than 50 MB) for this edit session only (Recommended), UltraEdit did not ask me again whether I want to open the large file with or without a temporary file. UltraEdit would have asked again if I had exited UE and restarted it. As no temporary file could be used, UltraEdit now transformed the original file from UTF-8 to UTF-16.
- I knew that this results in ä being interpreted wrongly, as this character would be stored in a UTF-8 encoded file with the 2 bytes C3 A4 (hexadecimal). So I was not astonished to now see B㴭e instead of Bäume at the bottom of the file. The UTF-8 to UTF-16 transformation of ANSI encoded Bäume with the bytes 42 E4 75 6D 65 resulted in the bytes 42 00 2D 3D 65 00 in memory.
- I now closed the file. UltraEdit converted the file back from UTF-16 LE to UTF-8 without adding a UTF-8 BOM at the top of the file because I have unchecked Write UTF-8 BOM header to all UTF-8 files when saved at Advanced - Configuration - File Handling - Save.
- 42 00 2D 3D 65 00 (UTF-16 LE in memory) was stored on disk now as 42 E3 B4 AD 65 which of course is not 42 E4 75 6D 65.
Why this difference?
E4 75 6D is an invalid UTF-8 byte stream: E4 announces a 3-byte sequence, but 75 and 6D are not valid continuation bytes. Therefore the library function which converted this byte stream from UTF-8 to UTF-16 LE cannot produce a correct result. This always happens when an ANSI encoded text file is interpreted as a UTF-8 encoded byte stream because of a wrong encoding selection made by the user. In general it is not possible to restore the original byte stream in such cases.
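The byte values reported above can be reproduced with a small non-validating decoder. I do not know which conversion routine UltraEdit really uses. This sketch assumes a decoder that simply shifts and adds all bytes of a sequence and then subtracts a fixed offset, a technique known from older reference converters, without checking the continuation bytes. The names OFFSETS, trail_count, lenient_utf8_decode and hex_bytes are hypothetical.

```python
# Assumption: offsets as used by shift-and-subtract converters (index = number of trail bytes).
OFFSETS = (0x00000000, 0x00003080, 0x000E2080, 0x03C82080)

def trail_count(lead):
    """Number of continuation bytes announced by a UTF-8 lead byte."""
    if lead < 0xC0:
        return 0   # ASCII, or a stray continuation byte taken as-is
    if lead < 0xE0:
        return 1
    if lead < 0xF0:
        return 2
    return 3

def lenient_utf8_decode(data):
    """Decode UTF-8 without checking that trail bytes are in the range 80..BF."""
    out = []
    i = 0
    while i < len(data):
        chunk = data[i:i + trail_count(data[i]) + 1]
        value = 0
        for byte in chunk:
            value = (value << 6) + byte          # trail bytes are NOT validated
        value -= OFFSETS[len(chunk) - 1]
        if not 0 <= value <= 0x10FFFF or 0xD800 <= value <= 0xDFFF:
            value = 0xFFFD                       # keep the sketch from crashing
        out.append(chr(value))
        i += len(chunk)
    return "".join(out)

def hex_bytes(data):
    """Format bytes as space separated uppercase hex values."""
    return " ".join(f"{b:02X}" for b in data)

ansi = "B\u00E4ume".encode("cp1252")
wrong = lenient_utf8_decode(ansi)
print(hex_bytes(ansi))                        # 42 E4 75 6D 65
print(hex_bytes(wrong.encode("utf-16-le")))   # 42 00 2D 3D 65 00
print(hex_bytes(wrong.encode("utf-8")))       # 42 E3 B4 AD 65
```

A strict decoder such as Python's own ansi.decode("utf-8") rejects these bytes with a UnicodeDecodeError instead of producing a wrong character.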
Does this wrong conversion also occur when selecting UTF-8 for an ASCII/ANSI/OEM encoded text file while a temporary file is used?
Yes, of course it does. But when a temporary file is used, the original file does not need to be modified by UltraEdit. UltraEdit can simply delete the temporary file on closing the file without modifying the original file at any time.
Conclusion:
- A temporary file should always be used when a file is a UTF-8 or ASCII Escaped Unicode file, regardless of file size, as the file must in any case be converted at least temporarily to UTF-16 LE on disk (storage media).
- ANSI or UTF-16 should be used in all applications which create large text files of more than 20 MB. Using ANSI or UTF-16 noticeably speeds up writing to the text file and makes it easier and faster for all other applications to read in and process this file.
There are two exceptions to the second recommendation:
- The input data is already encoded in UTF-8 and the application creating the text file does not really support Unicode, which means it interprets the UTF-8 encoded text as an array of bytes and therefore simply outputs an array of bytes as well, without knowing how these bytes should be interpreted at all. This is one advantage of UTF-8 in comparison to UTF-16. Applications which are not Unicode aware, like the PHP interpreter, can load and output Unicode text as long as the text does not need to be modified by the application, just read and output.
- The text file contains to a large extent (> 97%) just ASCII characters, space on storage media must be saved, and file is not often further processed. Daily created log files which can contain sometimes any character from entire Unicode table are a typical example for this exception.
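The first exception, passing UTF-8 through as plain bytes, can be illustrated with a short sketch. The function name copy_bytes and the sample text are my own. The tiny chunk size is chosen on purpose so that multi-byte characters are split across reads, which does no harm because nothing is decoded.

```python
import io

def copy_bytes(source, target, chunk_size=4096):
    """Copy a stream as raw bytes without decoding or interpreting it."""
    while True:
        chunk = source.read(chunk_size)
        if not chunk:
            break
        target.write(chunk)

original = "B\u00E4ume cost 5 \u20AC\r\n".encode("utf-8")
target = io.BytesIO()
copy_bytes(io.BytesIO(original), target, chunk_size=3)

# The output is byte for byte identical, so it is still valid UTF-8.
print(target.getvalue() == original)   # True
```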