UltraEdit has started changing my large UTF-8 encoded files on open/close (solved)

Newbie

    Jul 25, 2015#1

    I opened a 163 MB file of tweets, in UTF-8. As I did so, I watched UltraEdit rewrite the file to double its length -- without any prompt or warning. I then immediately closed that tab and watched UltraEdit again rewrite the file -- without any prompt or warning -- back to its original length. (But altered from its original content.) No BAK file is made.

    What is going on here?

    Before today, when I opened a huge file, UltraEdit would
    • never touch it unless I asked it to by explicitly saving, and
    • ask me if I wanted a temp file.
    I can't find anything in the configuration that helps suggest what might have changed.

    Because UltraEdit is actually changing the content of the file (and not just wasting my time by writing the file twice when I asked it to do so zero times), this means I cannot use UltraEdit on UTF-8 files until I figure out what is going on. No backup is kept so my original file is irrecoverably lost.

    Version 22.10.0.12
    Windows 7 64 bit

    "Temporary files" in config is set to "Use temporary file for editing (normal operation)" with no threshold set. I definitely did not change this today, and it was working yesterday.

    I just closed and opened UltraEdit and now it is again prompting me if I want a temp file ... but if I say no temp file like normal, it still changes my original file without any prompt on reading (doubling its length) and rewrites it again on closing (back to the original length).

    UltraEdit should never change the contents of a file just because I opened it, nor when I close the file without saving it. What is going on here?

    Grand Master

      Jul 25, 2015#2

      It looks like you don't know anything about text encoding. I therefore suggest first reading the introduction chapter of the power tip Working with Unicode in UltraEdit/UEStudio or, even better, the entire power tip.

      Next, read my first post in What's the best default new file format?, where I wrote a lot about text encoding and added some links to useful pages.

      UTF is the abbreviation for Unicode Transformation Format. An application that supports text can hold it in memory either with 1 byte per character (a char array in C/C++) for OEM/ASCII/ANSI encoded files or with 2 (or 4) bytes per character (a wchar_t array in C/C++) for Unicode encoded files.

      UTF-8 is a variable-width format which originally stored a character with 1, 2, 3, 4, 5 or even 6 bytes. In November 2003, RFC 3629 restricted UTF-8 to the code range up to U+10FFFF (at most 4 bytes) so that every UTF-8 text can be converted to UTF-16.
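
To make the variable width concrete, here is a minimal Python sketch that prints how many bytes a few sample characters occupy in UTF-8. The sample characters and the helper name utf8_length are my own choices for illustration.

```python
# Sample characters with 1, 2, 3 and 4 bytes in UTF-8:
# U+0041 (A), U+00E4 (a-umlaut), U+20AC (euro sign), U+1F600 (an emoji).
samples = ["\u0041", "\u00e4", "\u20ac", "\U0001F600"]


def utf8_length(ch: str) -> int:
    """Return how many bytes one character occupies when encoded in UTF-8."""
    return len(ch.encode("utf-8"))


for ch in samples:
    encoded = ch.encode("utf-8")
    print(f"U+{ord(ch):04X}: {utf8_length(ch)} byte(s) -> {encoded.hex(' ')}")
```

Since RFC 3629, no valid character needs more than the 4 bytes of the last sample.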

      So the number of bytes needed per character would vary from character to character if UTF-8 encoded text were loaded into memory as is. Such a variable-width encoding cannot be indexed like a plain character array, which is why many applications supporting UTF-8 encoded files, especially on Windows, convert them to UTF-16 internally. UltraEdit also converts a UTF-8 encoded file to UTF-16 LE, which uses 2 bytes for every character in the Basic Multilingual Plane (characters beyond it take a surrogate pair of 4 bytes), making it possible to load the text into memory using wchar_t arrays.

      This explains why the temporary file in UTF-16 Little Endian encoding has about "double" the size of the UTF-8 encoded file. "Double" is not exactly correct as soon as the UTF-8 encoded file contains at least 1 character encoded with more than 1 byte. The temporary file also starts with the UTF-16 LE byte order mark, which adds 2 bytes at the top of the file. In accordance with the Unicode standard, UltraEdit displays a byte order mark only in hex editing mode, never in text mode.

      UltraEdit does not modify a file just because it is opened, or when it is closed without a modification. But a UTF-8 encoded file must be transformed to UTF-16 LE before the Unicode text can be loaded into memory - in parts or entirely depending on file size.
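
The size relation can be checked with a few lines of Python. This is only a sketch on small in-memory strings (the sample texts and the function name encoded_sizes are made up), not a measurement on a real temporary file.

```python
import codecs


def encoded_sizes(text: str):
    """Return (UTF-8 size, UTF-16 LE size including the 2-byte BOM) in bytes."""
    utf8_size = len(text.encode("utf-8"))
    utf16_size = len(codecs.BOM_UTF16_LE) + len(text.encode("utf-16-le"))
    return utf8_size, utf16_size


ascii_only = "hello world\r\n" * 1000      # 13,000 ASCII characters
with_umlaut = ascii_only + "B\u00e4ume"    # plus one 2-byte UTF-8 character

for label, text in (("ASCII only", ascii_only), ("with umlaut", with_umlaut)):
    utf8_size, utf16_size = encoded_sizes(text)
    print(f"{label}: UTF-8 {utf8_size} bytes, UTF-16 LE with BOM {utf16_size} bytes")
```

For pure ASCII text the UTF-16 LE size is exactly double plus the 2 bytes of the BOM. Every character that takes more than 1 byte in UTF-8 pushes the ratio below double.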

      I'm not sure what you have really done; your description is not precise enough. It reads as if you used File - Open and manually selected UTF-8 in the dialog. This should be done only if you are 100% sure that the entire file is encoded in UTF-8 but UltraEdit does not detect it automatically. Automatic detection fails when the file has no BOM, no UTF-8 character set declaration (HTML, XHTML) or UTF-8 encoding declaration (XML) in the first few KB, and, with UltraEdit for Windows < v24.10 or UEStudio < v17.10, also no UTF-8 encoded character in the first 64 KB. See UTF-8 not recognized, largish file for details on automatic UTF-8 detection and when it fails.
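
A rough sketch of such a detection heuristic in Python follows. It is not UltraEdit's actual algorithm: the function name detect_utf8 and its return values are invented for illustration, and the check for HTML/XML declarations is left out to keep it short. Only the BOM test and the 64 KB window are taken from the description above.

```python
import codecs


def detect_utf8(data: bytes, window=64 * 1024) -> str:
    """Classify a byte buffer; window=None scans the whole buffer."""
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8-bom"
    head = data if window is None else data[:window]
    if all(byte < 0x80 for byte in head):
        # Only ASCII in the scanned part: nothing proves UTF-8.
        return "undetermined"
    decoder = codecs.getincrementaldecoder("utf-8")()
    try:
        # final=False tolerates a multi-byte character cut off at the window end.
        decoder.decode(head, final=False)
    except UnicodeDecodeError:
        return "not-utf-8"
    return "utf-8"


ansi_word = bytes.fromhex("42 E4 75 6D 65")              # "Bäume" in Windows-1252
late_char = b"a" * 70000 + "\u00e4".encode("utf-8")      # non-ASCII after 64 KB

print(detect_utf8(ansi_word))               # not valid UTF-8
print(detect_utf8(late_char))               # character lies outside the window
print(detect_utf8(late_char, window=None))  # found when the whole buffer is scanned
```

The second call shows the failure case described above: a file whose first UTF-8 encoded character sits beyond the scanned window looks like plain ASCII.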

      But I think I have found out what you did that resulted in a modified file, although the file was just opened and closed without any obvious change. Here is what I did:
      • I selected the configuration option Use temporary file for editing (normal operation) and set Threshold for above to 0 at Advanced - Configuration - File Handling - Temporary Files. These are the default settings, but not my preferred temporary file settings.
      • Next I created a new ANSI file with DOS line terminators using code page Windows-1252 and pasted the content of the file changes.txt from the UltraEdit program files folder into it, not once or twice but many times: Ctrl+V, Ctrl+V, Ctrl+V, Ctrl+A, Ctrl+C, Ctrl+V, Ctrl+V, Ctrl+V, Ctrl+A, Ctrl+C, Ctrl+V, Ctrl+V, ... Ctrl+V. With this method the file very quickly grew to a size of about 223 MB.
      • As this text file contained only ASCII characters, I added the German word Bäume (meaning trees) at the bottom. It contains the umlaut ä, which has the hexadecimal code value E4 in Windows-1252, so the entire word Bäume is encoded with the bytes 42 E4 75 6D 65.
        Then I saved the file and closed it.
      • I used File - Open, selected the file and left the option Auto Detect ASCII/Unicode unchanged. As the file is larger than 50 MB, which is the internal threshold value when 0 is set for the threshold, the following dialog was opened and I left the first option selected.
        temporary_file_handling.png (4KiB)
        Dialog displayed on opening a file greater than 50 MB with 0 set for threshold.
      • UltraEdit opened the file without a temporary file as an ASCII/ANSI file using code page 1252, as also indicated in the status bar. Nothing changed on disk after opening. I closed the file and nothing changed on disk after closing.
      • I used File - Open again, but this time I selected UTF-8, although the file is not encoded in UTF-8.

        THAT IS MOST LIKELY THE MISTAKE YOU MADE, TOO.

        You manually selected an encoding which was wrong for the file, as the file was not encoded in UTF-8.
      • Now it took much longer to open the file, as UltraEdit had to transform the file content from UTF-8 to UTF-16 before loading parts of it into memory for viewing and editing. As I had already selected Disable temporary files when opening large files (greater than 50 MB) for this edit session only (Recommended), UltraEdit did not ask me again whether I wanted to open the large file with or without a temporary file. It would have asked if I had exited UE and restarted it.
        As no temporary file could be used, UltraEdit now transformed the original file itself from UTF-8 to UTF-16.
      • I knew that this results in a wrong interpretation of ä, as this character would be stored in a UTF-8 encoded file with the 2 bytes C3 A4 (hexadecimal). So I was not astonished to see B㴭e instead of Bäume at the bottom of the file.
        The UTF-8 to UTF-16 transformation of the ANSI encoded Bäume with the bytes 42 E4 75 6D 65 resulted in the bytes 42 00 2D 3D 65 00 in memory.
      • I then closed the file. UltraEdit converted the file back from UTF-16 LE to UTF-8 without adding a UTF-8 BOM at the top of the file, because I have unchecked Write UTF-8 BOM header to all UTF-8 files when saved at Advanced - Configuration - File Handling - Save.
      • 42 00 2D 3D 65 00 (UTF-16 LE in memory) was now stored on disk as 42 E3 B4 AD 65, which of course is not 42 E4 75 6D 65.

        Why this difference?

        E4 75 6D is an invalid UTF-8 byte stream: the lead byte E4 announces a 3-byte sequence, but 75 and 6D are not continuation bytes, which must lie in the range 80 to BF. Therefore the library function converting this byte stream from UTF-8 to UTF-16 LE cannot do it right. This always happens when an ANSI encoded text file is interpreted as a UTF-8 encoded byte stream because of a wrong encoding selection made by the user. It is not possible to restore the original byte stream in such cases.
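
The byte values above can be reproduced with a small Python sketch. It rests on an assumption: I do not know which conversion routine UltraEdit called, but a table-driven UTF-8 decoder that skips validation of continuation bytes (shift each byte in, then subtract a per-length offset) yields exactly the bytes reported. The names OFFSETS, sequence_length and lenient_decode are mine.

```python
# Per-length offsets of a table-driven decoder: the lead-byte marker and the
# 0x80 markers of the continuation bytes, accumulated the same way as the data.
OFFSETS = {1: 0x0, 2: 0x3080, 3: 0xE2080, 4: 0x3C82080}


def sequence_length(lead: int) -> int:
    """Sequence length announced by a lead byte (no validation at all)."""
    if lead < 0x80:
        return 1
    if lead < 0xE0:
        return 2
    if lead < 0xF0:
        return 3
    return 4


def lenient_decode(data: bytes) -> str:
    """Decode UTF-8 WITHOUT checking continuation bytes (deliberately unsafe)."""
    chars = []
    i = 0
    while i < len(data):
        chunk = data[i:i + sequence_length(data[i])]
        value = 0
        for byte in chunk:
            value = (value << 6) + byte
        value -= OFFSETS[len(chunk)]
        if not 0 <= value <= 0x10FFFF:
            value = 0xFFFD  # replacement character for impossible results
        chars.append(chr(value))
        i += len(chunk)
    return "".join(chars)


ansi_bytes = bytes.fromhex("42 E4 75 6D 65")     # "Bäume" in Windows-1252
garbled = lenient_decode(ansi_bytes)

print(garbled.encode("utf-16-le").hex(" "))      # bytes held in memory
print(garbled.encode("utf-8").hex(" "))          # bytes written back to disk

try:
    ansi_bytes.decode("utf-8")                   # a strict decoder refuses
except UnicodeDecodeError as error:
    print("strict decoder:", error.reason)
```

E4 75 6D collapses into the single code point U+3D2D, so three original bytes become one character, and writing that character back as valid UTF-8 gives E3 B4 AD. Nothing in the result says which bytes it came from, which is why the original cannot be restored.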
      Does this wrong conversion also occur when UTF-8 is selected for an ASCII/ANSI/OEM encoded text file while a temporary file is used?

      Yes, of course it does. But when a temporary file is used, UltraEdit does not need to modify the original file. It can simply delete the temporary file on closing the file without ever having modified the original file.
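
The difference can be sketched in Python as follows. This is an illustration of the principle only, not UltraEdit's implementation; the function names and the chunk size are assumptions.

```python
import os
import tempfile

CHUNK_CHARS = 1 << 20  # characters converted per step (arbitrary choice)


def open_with_temp_file(path: str, assumed_encoding: str = "utf-8") -> str:
    """Convert `path` into a UTF-16 LE temporary file; the original is only read."""
    fd, temp_path = tempfile.mkstemp(suffix=".tmp")
    with open(path, "r", encoding=assumed_encoding, errors="replace",
              newline="") as src, \
            os.fdopen(fd, "w", encoding="utf-16-le", newline="") as dst:
        dst.write("\ufeff")  # UTF-16 LE byte order mark (bytes FF FE)
        while True:
            chunk = src.read(CHUNK_CHARS)
            if not chunk:
                break
            dst.write(chunk)
    return temp_path


def close_without_saving(temp_path: str) -> None:
    """Discard the converted copy; the original was never written to."""
    os.remove(temp_path)
```

Even if the assumed encoding is wrong and the converted copy is garbled, closing without saving just throws the copy away.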

      Conclusion:
      • A temporary file should always be used for a UTF-8 or ASCII Escaped Unicode file, independent of file size, as the file must in any case be converted at least temporarily to UTF-16 LE on disk (storage media).
      • ANSI or UTF-16 should be used in all applications which create large text files of more than 20 MB. Using ANSI or UTF-16 noticeably speeds up writing to the text file and makes it easier and faster for all other applications to read in and process this file.
      There are two exceptions to the second recommendation:
      • The input data is already encoded in UTF-8 and the application creating the text file does not really support Unicode, which means it interprets the UTF-8 encoded text as an array of bytes and therefore simply outputs an array of bytes without knowing how these bytes should be interpreted. This is one advantage of UTF-8 over UTF-16: applications that are not Unicode aware, like the PHP interpreter, can load and output Unicode text as long as the application only reads and outputs the text without modifying it.
      • The text file contains mostly (> 97%) ASCII characters, space on storage media must be saved, and the file is not often processed further. Daily created log files which can occasionally contain any character from the entire Unicode table are a typical example of this exception.
      Best regards from an UC/UE/UES for Windows user from Austria

      Newbie

        Jul 26, 2015#3

        I really appreciate your extensive reply.

        I understand Unicode just fine.  What I didn't expect is that my file would be written to when I hadn't changed anything. Further experimentation indicates this only happens for files large enough to trigger the special "large file" handling. I didn't see anything in the "Working with Unicode in UltraEdit/UEStudio" tip that indicates opening a file without use of a temp file would cause your file to be transformed (written to). For example, if I open the same file with Notepad++, my file is not touched. The "last modified" date does not change, the file contents do not change. There's nothing about Unicode that requires that UltraEdit transform my file in place, on disk.  I expected that transformation to occur in memory only.

        The huge text file I'm opening is stored in UTF-8 with no BOM, and has at least a dozen different scripts in it, although it is primarily Latin and probably 95% ASCII compatible. UltraEdit displays its contents perfectly, and I can correctly see all of the scripts inside it. For what it's worth, Notepad++ also displays its contents perfectly, and it doesn't have to rewrite my file to do it.

        I opened the file by right clicking on it in Explorer and selecting "UltraEdit" from the context menu. I must have previously changed some configuration option that is triggering this transform to occur, as it wasn't occurring before today (well, now, yesterday).

        What I've learned is that when the temp file is disabled (due to the file being very large), that the file isn't converted in memory to UTF-16, but instead in my file. And then when I close my file without making any changes, it is transformed back to UTF-8. This was a great surprise to me. I didn't expect the file to be written to, period, no matter what, full stop, unless I clicked "save".  I expected any transformation to occur in memory, not on my file! Honestly, I consider this a bug or a mis-feature. If my file will be written to at all, even if it's an information preserving transform, I should be warned before this occurs. The "Temporary File Handling" pop-up warning (that you show in your post) doesn't warn you that the file is about to be rewritten. If it had warned me, I would have killed UltraEdit so my file wouldn't be touched. The file was being processed (read) by another program, so UltraEdit's rewrite caused problems. There was nothing anywhere that I am aware of to suggest that UltraEdit would write to my file. If I've missed something, please let me know.

        Again, I appreciate your post. And while I understand Unicode, the next person who runs into this may not, so your post is a very helpful one. Thank you.

        Grand Master

          Jul 26, 2015#4

          Well, I agree on some of your points.

          But I'm not sure whether IDM is aware of what happens on opening a large UTF-8 encoded file without a temporary file. Although I have been using UltraEdit for 15 years, I was not aware of this file management behavior until yesterday, when I tried to reproduce what you wrote.

          I'm quite sure that a large UTF-8 encoded file is opened without a temporary file very, very rarely, taking all text file opens by all UltraEdit users into account. In all probability this happens in perhaps 1 of 10,000,000,000 file opens.

          That the conversion is not done in memory is easily explained by the fact that UltraEdit is a disk based editor which can open a file of any size on machines with any amount of installed and free RAM (at least the minimum required by Windows XP). With UltraEdit it is possible to view and edit an 8 GB file on a computer with just 2 GB of RAM installed. As far as I know, Notepad++ is a memory based text editor: it loads the entire file into RAM, if that is possible at all. So Notepad++ can't be used for viewing and editing huge files. There is at least the 2 GB limit, as Notepad++ is an x86 application like UltraEdit.

          Disk based text editing is practical only with a fixed number of bytes per character, as otherwise it would be really, really difficult to keep the position in the file on disk synchronized with the portion of the file loaded in memory.

          So what could the developers at IDM do to avoid what happened here for the first time since I started moderating the user-to-user forums more than 10 years ago?
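
Here is a small Python sketch of why a fixed width helps. It assumes text within the Basic Multilingual Plane, so that each character is exactly 2 bytes in UTF-16 LE; the function names are invented for illustration.

```python
import io

BOM_LEN = 2  # UTF-16 LE byte order mark FF FE at the top of the file


def utf16le_offset(char_index: int) -> int:
    """Byte offset of a character in a UTF-16 LE file: plain arithmetic."""
    return BOM_LEN + 2 * char_index


def utf8_offset(data: bytes, char_index: int) -> int:
    """Byte offset of a character in UTF-8 data: needs a scan from the start."""
    seen = -1
    for offset, byte in enumerate(data):
        if byte & 0xC0 != 0x80:  # not a continuation byte: a character starts here
            seen += 1
            if seen == char_index:
                return offset
    raise IndexError(char_index)


text = "B\u00e4ume"
utf8_bytes = text.encode("utf-8")
utf16_file = io.BytesIO(b"\xff\xfe" + text.encode("utf-16-le"))

# Jump straight to the third character (index 2) without reading what precedes it.
utf16_file.seek(utf16le_offset(2))
third_char = utf16_file.read(2).decode("utf-16-le")

print(third_char, utf16le_offset(2), utf8_offset(utf8_bytes, 2))
```

With UTF-16 LE the editor can seek directly to any character position. With UTF-8 it would have to scan or keep an index, which is the bookkeeping difficulty described above.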

          My suggestion would be as follows:
          • If a file is opened without a temporary file, either because of the configuration or because of the decision made by the user when asked,
            AND
          • the file is detected automatically, or explicitly chosen by the user, as being encoded in UTF-8 or ASCII Escaped Unicode,
            THEN
          • it should be opened with a temporary file anyway, ignoring the user's configuration or decision, because the file must be converted to UTF-16 LE,
            AND
          • the user should be informed with a message box that the file is opened with a temporary file because of its UTF-8 or ASCII Escaped Unicode format.
          I agree that modifying the original file on opening is not good here, especially taking into account what could happen on a power failure while the transformation is in progress.
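
The suggested rule could be sketched like this in Python. It is purely hypothetical: the function name, the encoding labels and the message text are made up, and nothing here reflects how IDM would actually implement it.

```python
# Encodings that must be converted to UTF-16 LE before editing (hypothetical labels).
NEEDS_CONVERSION = {"utf-8", "ascii-escaped-unicode"}


def decide_open_mode(encoding: str, temp_file_requested: bool):
    """Return (use_temp_file, message_for_user) following the suggestion above."""
    if temp_file_requested:
        return True, None
    if encoding.lower() in NEEDS_CONVERSION:
        message = (
            "The file is opened with a temporary file because its "
            f"{encoding} format requires a conversion to UTF-16 LE."
        )
        return True, message
    return False, None


print(decide_open_mode("UTF-8", temp_file_requested=False))
print(decide_open_mode("ANSI", temp_file_requested=False))
```

Only the combination of "no temporary file" and an encoding that needs conversion overrides the user's choice, and in that case the user is told why.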

          I recommend that you report to IDM by email what happens on opening a large UTF-8 encoded file without a temporary file (you can add the link to this topic) and suggest what you think would be better file data management for this very rare use case.

          I don't want to report this by email to IDM support myself, as I'm quite sure that I will never run into this issue again in the next 15 years. So this issue does not bother me at all. I'm being egoistic here, by way of exception. I have already spent several hours on this issue although it does not affect me.

            Jun 24, 2018#5

            There have been some enhancements regarding this issue since July 2015.

            UltraEdit for Windows and UEStudio were rewritten to become fully Unicode aware applications with UltraEdit v24.00 and UEStudio v17.00.

            The limitation of searching only the first 64 KB for a UTF-8 encoded character was removed with UltraEdit for Windows v24.10 and UEStudio v17.10. That means, for example, that a 4 GB file without BOM, HTML/XHTML character set declaration, or XML encoding declaration is also detected as UTF-8 encoded even if it contains only ASCII characters except for the last character in the file, which is stored UTF-8 encoded.

            And a large or huge UTF-8 encoded file opened without a temporary file is no longer converted temporarily to UTF-16 when using UltraEdit for Windows v25.10, UEStudio v18.10, or any later version. UE/UES now handle multi-byte character encodings smarter internally to avoid the former limitations, at the cost of reduced efficiency when quickly making massive changes to such a file, for example with a macro or script running multiple regular expression Replace All operations on the entire file.