A quibble regarding the term "Unicode"

    Jul 29, 2013 #1

    It seems to me that the UE documentation and UI do not use terminology correctly or consistently. On the File/Conversions menu, you have choices for Unicode and for UTF-8. This is actually a choice between UTF-16 encoding and UTF-8 encoding. UTF-8 IS Unicode, just as much as UTF-16; they are just different encodings. Suggesting that Unicode and UTF-8 are different things unnecessarily confuses the issue.

    I have read the help pages "Unicode and UTF-8 support" and "Unicode text and Unicode files in UltraEdit/UEStudio", and they seem quite accurate, but they have to resort to complicated verbiage because they try to describe both encodings as Unicode while referring to only one of them as "Unicode".

    Am I right about this?

      Jul 30, 2013 #2

      You obviously have enough knowledge about Unicode and the various encodings for storing Unicode characters in files. But in my experience answering questions in these forums over the last several years, that knowledge puts you in a minority of UltraEdit users. Most users writing text files do not know how characters are stored, or that there are different methods for storing them. At best they have heard or read something about "Unicode" and "UTF-8", but do not really know what those two terms mean. I suppose the overwhelming majority's poor knowledge of Unicode and its various storage encodings is one reason why IDM named the menu items as they are right now.

      In memory, any application really has only 2 types of character encodings:
      • Single byte character
        which in C/C++ and other programming languages is the type char. In many manuals and help pages this in-memory character storage type is called ASCII, although this is wrong: for bytes with values 128 to 255 (char unsigned) or -128 to -1 (char signed), the associated characters are not ASCII but depend on the code page respectively the ANSI/ISO standard used.
      • Wide character
        which in C/C++ and other programming languages is the type wchar_t or similar. In many manuals and on many help pages this in-memory character storage type is called Unicode, simply because (most) characters defined by the Unicode standard can be handled with a wide character representation. It is not standardized whether a wide character uses 2 bytes (16-bit, unsigned short) or 4 bytes (32-bit, unsigned int or unsigned long) in memory; this depends on the library used for Unicode characters/strings.
      So, independent of the encoding used to store the characters in a file, after opening a file in UltraEdit there are only two types of characters in memory: ASCII or Unicode. Only a conversion from ASCII to Unicode, or from Unicode to ASCII, is a real conversion. For nearly all other conversions in the lower half of the Conversions menu, no real conversion takes place: UTF-16 LE with/without BOM, UTF-16 BE with/without BOM, UTF-8 with/without BOM and ASCII Escaped Unicode matter only for the next file save, not for the characters already loaded in memory. So, for example, executing "UTF-8 to Unicode" on a loaded UTF-8 file does nothing but show UTF-16 encoding information in the status bar, because the only change is the encoding type used for the next save. (In practice it is not quite that simple for a really large file of several hundred MB or even some GB, which is most likely opened without a temporary file and with only a small portion loaded into memory.)
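      The point above can be sketched in a few lines of Python (a neutral illustration, not UltraEdit's actual implementation): the same in-memory characters can be serialized with different Unicode encodings, and switching the target encoding does not change the characters themselves, only the bytes written at the next save.

```python
# Illustration only: the same in-memory text, serialized with two
# different Unicode encodings. Only the on-disk byte sequences differ.
text = "Düsseldorf €"

utf8_bytes = text.encode("utf-8")
utf16_bytes = text.encode("utf-16-le")

# The byte sequences differ...
assert utf8_bytes != utf16_bytes
# ...but decoding either one yields the identical characters again.
assert utf8_bytes.decode("utf-8") == utf16_bytes.decode("utf-16-le") == text
```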

      I think you can imagine how many items the Conversions menu would have if every possible "conversion" from one encoding to another were offered as a menu item:
      • ASCII/ANSI to UTF-8 with BOM
      • ASCII/ANSI to UTF-8 without BOM
      • ASCII/ANSI to UTF-16 LE with BOM
      • ASCII/ANSI to UTF-16 LE without BOM
      • ASCII/ANSI to UTF-16 BE with BOM
      • ASCII/ANSI to UTF-16 BE without BOM
      • UTF-8 with BOM to UTF-8 without BOM
      • UTF-8 without BOM to UTF-8 with BOM
      • UTF-16 LE with BOM to UTF-8 without BOM
      • UTF-16 LE without BOM to UTF-8 without BOM
      • and so on
      I think that would be really confusing for all users of UltraEdit, even for those who know the differences between all those encoding types. And I'm quite sure that such a menu would not fit on screen. Therefore my opinion is that IDM did quite a good job of condensing all those possible conversions into a quite small list of commands, using the terms which are best known and therefore most often used, even without knowledge of what Unicode, UTF-8 and ASCII really mean.

        Sep 16, 2013 #3

        Hi Mofi,
        Mofi wrote:Therefore my opinion is that IDM made a quite good job on summing up all those possible conversions to a quite small list of commands
        many thanks for your detailed explanations and your efforts, but I fully agree with leotohill.
        Let me explain:
        1. Broadly speaking (not specific to UE, and simplified as well), when a text file is opened and loaded into memory, it is nothing but a bunch of bytes.
        2. In a second step, the loading program "somehow" has to determine which decoding is to be applied to the previously loaded byte buffer. The decoding determines how bytes are converted into characters, and can be obtained heuristically, by asking the user, by remembering a previous choice, etc.
        3. The decoding is then used to obtain a series of characters that the editor can show the user for reading or editing.
        It's wonderful that in recent versions of UE we can make "View As" changes in the status bar, not only for the syntax highlighting but also for the decoding chosen in step 2. (Not a single byte of the original input file is affected by such changes.)
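        The three steps above can be sketched in Python (a generic illustration, not UE's code). Note how choosing a different decoding only changes the characters shown, never the bytes read from the file:

```python
# Step 1: a file delivers nothing but a bunch of bytes (simulated here).
raw = "naïve café".encode("utf-8")

# Steps 2 and 3: a chosen decoding turns those bytes into characters.
text = raw.decode("utf-8")
assert text == "naïve café"

# A wrong "View As" choice merely mis-displays the characters;
# re-encoding with the same wrong choice shows the bytes are untouched.
wrong = raw.decode("latin-1")
assert wrong != text
assert wrong.encode("latin-1") == raw
```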

        Now, the menu for conversions that actually change the bytes written to the output file should, in my opinion, have three submenus:
        1. End-of-line: to DOS, to UNIX, to MAC (it really doesn't matter what "from" is)
        2. Encoding: Hierarchical menu as in the status bar for "view as" purpose (the big proposed change! :wink: )
        3. BOM: add BOM, remove BOM (greyed out when not applicable, and always greyed out if the encoding is not UTF-8, UTF-16 LE or UTF-16 BE), or simply "toggle BOM"
        To be honest, for conversions it's this simplicity and clarity that I miss in UE the most! :D

          Sep 21, 2013 #4

          Carsten wrote:Broadly speaking (not specific to UE, and simplified as well), when a new text file is opened and loaded into memory, it is nothing but a bunch of bytes.
          Loading all bytes of the file into memory first would have some disadvantages:
          • It would work only for small files, not for files of several hundred MB or even several GB. It would also take too long for larger files.
          • It would waste memory, as the file content would be loaded twice, at least for a certain time. For example, a UTF-16 encoded file of 40 MB would first require 40 MB of RAM for the byte load and an additional 80 MB of RAM for the wide character load. (Visual Studio uses 32-bit unsigned int for wide characters (Unicode characters), as this results in perfect alignment for 32-bit and 64-bit processors and therefore much faster access to the characters of a Unicode string than the 16-bit wide characters that other libraries use to hold Unicode strings in memory with as few bytes as needed.)
          Carsten wrote:In a second step, the loading program "somehow" has to determine which decoding is to be applied to the previously loaded byte-buffer.
          Finding out the encoding is exactly the problem when the users or applications that create the text files do not follow the standards.

          The people who defined the various standards (the Unicode standard, the HTML, XHTML and XML standards, etc.) thought about how an application can quickly detect on load how the bytes in a file must be interpreted. Therefore all those standards prescribe that the character encoding used in the file must be declared within the first 1024 bytes, by various methods: a byte order mark in the Unicode standard, a charset declaration in the HTML and XHTML standards, an encoding declaration in the XML standard, ...
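          As a rough sketch of what such byte-order-mark detection looks like (an illustration of the principle, not UltraEdit's actual routine):

```python
import codecs

def sniff_bom(first_bytes):
    """Return a codec name if the leading bytes carry a Unicode BOM, else None."""
    for bom, name in ((codecs.BOM_UTF8, "utf-8-sig"),
                      (codecs.BOM_UTF16_LE, "utf-16-le"),
                      (codecs.BOM_UTF16_BE, "utf-16-be")):
        if first_bytes.startswith(bom):
            return name
    return None  # no BOM: fall back to heuristics or a user choice

assert sniff_bom(codecs.BOM_UTF8 + b"hello") == "utf-8-sig"
assert sniff_bom(b"\xff\xfehello") == "utf-16-le"
assert sniff_bom(b"plain ASCII text") is None
```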

          It is unfair to complain about the character encoding detection and handling of an application like UltraEdit, which follows the rules defined in the standards, when the user or the application that created the file did not, simply from not reading or knowing the standards. The goal of a standard is to get things working together by defining rules; anything that ignores those rules must result in something not working.

          According to the various standards, an application only needs to read the first 1024 bytes, as you described with a simple byte load, and evaluate them; then the application should know which encoding is used for the characters in the text file.

          UltraEdit already extends that by loading and evaluating the first 64 KB of a file when using UltraEdit for Windows < v24.10 or UEStudio < v17.10 (processors are fast nowadays, and file content is loaded by default in blocks of 64 KB), and it includes a special routine that detects byte sequences typical for UTF-8 encoded characters and for ASCII escaped Unicode characters.

          Additionally, UltraEdit offers the user the possibility to define the encoding when opening the file: the Windows File - Open dialog is extended by UltraEdit with the options Format and Open As (Windows 2000/XP) respectively Encoding (Windows Vista and later Windows versions).

          And for single byte encoded text files there is also an automatic code page detection feature.

          And UltraEdit also remembers in uedit32.ini a code page/encoding set by the user that differs from the auto-detected one, and applies it the next time the file is opened.

          So what more should UltraEdit do to support text files that use an encoding which is not easy to detect and that do not follow the rules of the various standards?
          Carsten wrote:End-of-line: to DOS, to UNIX, to MAC  (it really doesn't matter what "from" is)
          Well, right. I agree that it does not matter from which line terminator type the conversion is done, so it would be enough to have the items "To DOS", "To UNIX" and "To MAC". The Save As dialog has exactly these 3 options for determining the line terminator type on first save of a file, in addition to Default, which simply means: keep the line terminators as they currently are.
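          The three conversions can be sketched in Python (a generic illustration of why the "from" type does not matter: normalizing to one form first makes any input, even mixed line endings, acceptable):

```python
def to_unix(text):
    # Normalize first: any of CRLF, CR or LF becomes a single LF.
    return text.replace("\r\n", "\n").replace("\r", "\n")

def to_dos(text):
    return to_unix(text).replace("\n", "\r\n")

def to_mac(text):
    return to_unix(text).replace("\n", "\r")

mixed = "one\r\ntwo\nthree\r"          # mixed DOS/UNIX/MAC line endings
assert to_unix(mixed) == "one\ntwo\nthree\n"
assert to_dos(mixed) == "one\r\ntwo\r\nthree\r\n"
assert to_mac(mixed) == "one\rtwo\rthree\r"
```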

          On the other hand, my experience after nearly 10 years of forum care is that
          • most users working with text files do not know what DOS/UNIX/MAC means at all;
          • therefore most users have never converted line endings;
          • and most users have never looked at the bar at the bottom of an application window, which only a minority of computer users know as the "status bar", or at what is displayed there. Yes, really! The overwhelming majority of computer users have never noticed what this strange area at the bottom of an application window is for, even after working 5, 10 or 20 years with a computer.
          So IDM could rename the 3 menu items at the top of File - Conversions, as they are always enabled independent of the current line terminator type of the original or temporary file, and the conversion is always done correctly.

          But IDM does not like changing menus, as this results in lots of work for the programmers, the testers, the documentation teams (offline and online help, taking into account that the online help should be helpful for all users independent of the UltraEdit version used), the person who keeps the pages on the UltraEdit website up to date (ideally for all versions of UE), the localization teams, ...

          During the beta test period of UE v20.00 I suggested that IDM revise the Edit and View menus by moving some menu items to submenus or other menus, as those two menus already contain so many items that they do not fit completely on my screen at a resolution of 1200x800 pixels, even with UE maximized. In my email I also described the changes that would be needed not only in the *.mb1 files, but also on several pages of the help files in all supported languages, and on some webpages, to be correct after the menu restructuring. Maybe this additional information was the reason IDM did not follow my suggestions, as IDM could see: wow, that's a lot of work caused by moving some menu commands, which the user can do quickly within 5 minutes.
          Carsten wrote:Encoding: Hierarchical menu as in the status bar for "view as" purpose (the big proposed change!   :wink: )
          A submenu with dozens of menu commands, or a set of submenus? Well, yes, that is possible, as the encoding type field in the "new" status bar demonstrates (which I don't use, as I prefer the basic status bar). I was quite happy with View - Set Code Page, which does the same as the encoding type field in the "new" status bar, taking into account that I only need that feature for answering code page related questions in the forum. I'm not a translator and therefore, like the majority of UltraEdit users, never switch the code page or encoding.

          Perhaps you and many other users have never noticed that the Save As dialog offers all the options needed to quickly convert the current line terminator type and character encoding to any other line terminator type and character encoding. And of course the Save As dialog can be used to just convert the active file, as it is possible to keep its name.

          So why not simply press F12 to open the Save As dialog, select the wanted line terminator and format/encoding, and hit RETURN twice to convert the active file to whatever is wanted?

            Sep 21, 2013 #5

            Hi Mofi,

            thanks for your detailed reply, it's very much appreciated!
            Mofi wrote:It is unfair to complain about character encoding detection
            Please note that I didn't complain at all about UE's character encoding detection. In fact, I'm happy if UE follows the standards. The purpose of my first list ("1. Broadly speaking [...], 2. In a second step [...], 3. The decoding determines [...]") was just to prepare a common basis for the improvements I suggested in the second list.
            Mofi wrote:So what should UltraEdit do next to support text files with using a not easily to detect encoding, but not following the rules of the various standards?
            What I assume it probably does already: use a default. Note that it is not really so bad if the default is wrong: this is the beauty and importance of being able to
            • easily see which encoding is currently used to generate the displayed characters, and
            • easily change it later.
            UE solves this very well already. My favourite is the status bar, because its hierarchy is easily navigated, but the equivalent "Set Code Page" menu is of course fine as well.
            Mofi wrote:I suggested IDM during beta test period of UE v20.00 to revise the Edit and the View menu by moving some menu items to submenus or other menus as those two menus contain already so many menu items that the menus do not fit completely on my screen with a resolution of 1200x800 pixels even with UE maximized. I wrote in my email also what changes need to be made not only in the *.mb1 files, but also in all help files on several help pages in all supported languages, and on some webpages to be correct after the revise of the menu structure. Maybe this additional information was the reason why IDM did not follow my suggestions as IDM could see: wow, that's a lot of work caused by moving some menu commands which can be done by the user itself quickly within 5 minutes.
            Well, all I can say is that this is very disappointing of IDM. I fell in love with UE when I first installed it -- I cannot remember how many years ago; IDM can probably look it up.
            But recent versions seem to value feature monstrosity over user experience -- your description of their menu revision policy is the perfect example, as is the multi-purpose Save As dialog, see below.
            Mofi wrote:A submenu with dozens of menu commands or a set of submenus?
            Yes, exactly, this is how I think the File - Conversions menu should look: well structured, and intuitive to understand what each command does.
            Mofi wrote:Perhaps you and many other users have never recognized that the Save As dialog offers all options to make quickly a conversion of current line terminator type and character encoding to any other line terminator type and character encoding. And of course the Save As dialog can be used to just convert the current file as it is possible to keep the name of the active file.
            Sorry, but overloading the (Open and) Save dialogs for changing the encoding was probably the most ridiculous feature IDM has ever invented.

            Why in the world are two entirely separate concepts, namely (1) saving a file under a different name or path, and (2) converting the file to a different encoding, combined in the "Save As" feature (while the File - Conversions menu is left in the quasi-incomprehensible state that it is in)?
            Mofi wrote:So why not simply pressing F12 to open Save As dialog, selecting the wanted line terminator and format/encoding, and hit twice key RETURN to convert the active file to whatever wanted.
            Because when I want to convert, who says that I want to save?
            Normally, the reason I open a file with UE is that I want to edit it. After opening the file, I sometimes see in the status bar that it doesn't have the EOL characters or the character encoding that I think it should have (though this is usually not why I opened the file in the first place). So I use the File - Conversions menu as required, and then make the edits I want to make. I save when I'm done, or at intervals, when I feel that after long work the changes should be written to the file on disk.

            If it had not been documented that Save As can be used for conversion, I'd never have found this. It still feels awkward and unnecessary to me: it duplicates the File - Conversions feature, but not exactly; confusingly, it is slightly different...

            (Btw., I've already asked the nice folks at IDM support to have a look at this thread. :wink: )