Carsten wrote:Broadly speaking (not specific to UE, and simplified as well), when a new text file is opened and loaded into memory, it is nothing but a bunch of bytes.
Loading all bytes of the file into memory first would have some disadvantages:
- It would work only for small files, not for files of several hundred MB or even several GB. It would also take too long for larger files.
- It would waste memory, as the content of the file would be held in memory twice, at least for a certain time. For example, a 40 MB UTF-16 encoded file would first require 40 MB of RAM for the byte load and an additional 160 MB of RAM for the wide character load. (Visual Studio uses a 32-bit unsigned int for wide characters (Unicode characters), as this results in perfect alignment on 32-bit and 64-bit processors and therefore in much faster access to the characters of a Unicode string than with the 16-bit wide characters other libraries use to hold Unicode strings in memory with as few bytes as possible.)
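To illustrate the alternative to a full byte load, here is a minimal Python sketch that decodes a file block by block, so the raw bytes and the decoded text are never both held completely in memory. The function name `iter_decoded` and the 64 KB default block size are my own choices for illustration; this is not how UltraEdit is implemented.

```python
import codecs

def iter_decoded(path, encoding, block_size=64 * 1024):
    """Yield decoded text pieces of a file without loading it completely.

    Hypothetical helper: the incremental decoder keeps incomplete byte
    sequences at a block boundary and finishes them with the next block.
    """
    decoder = codecs.getincrementaldecoder(encoding)()
    with open(path, "rb") as f:
        while True:
            block = f.read(block_size)
            if not block:
                break
            yield decoder.decode(block)
    # Flush the decoder; raises UnicodeDecodeError if the file ends
    # in the middle of a character.
    tail = decoder.decode(b"", final=True)
    if tail:
        yield tail
```

Only one block of raw bytes exists at any time, and the caller decides how much of the decoded text to keep.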
Carsten wrote:In a second step, the loading program "somehow" has to determine which decoding is to be applied to the previously loaded byte-buffer.
Finding out the encoding is exactly the problem if users or applications creating the text files do not follow the standards.
The people who defined the various standards like the Unicode standard, the HTML, XHTML and XML standards, etc. thought about the problem of quickly detecting, on load of the file, how the bytes in the file must be interpreted by the applications. Therefore all those standards prescribe that the character encoding used in the file must be declared in the first 1024 bytes of the file by various methods: byte order mark in the Unicode standard, charset declaration in the HTML and XHTML standards, encoding declaration in the XML standard, ...
It is unfair to complain about the character encoding detection and handling of an application like UltraEdit, which follows the rules as defined in the standards, when the users or the application that created the file did not, simply because they have not read and do not know the standards. The goal of a standard is to get things working together by defining the rules. Everything ignoring those rules must result in something not working.
According to the various standards, applications must read only the first 1024 bytes, as you have described, with a simple byte load, evaluate them, and then the application should know which encoding is used for the characters in the text file.
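A minimal Python sketch of such an evaluation of the first 1024 bytes could look like this. It checks only for a byte order mark, an XML encoding declaration and an HTML/XHTML charset declaration; the function name and the regular expressions are assumptions for illustration and far simpler than what the standards or a real editor require.

```python
import codecs
import re

# Byte order marks, UTF-32 first because the UTF-32 LE mark starts
# with the same two bytes as the UTF-16 LE mark.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]

# Simplified patterns (assumption): real declarations allow more variants.
_XML_DECL = re.compile(
    rb'<\?xml[^>]*?encoding\s*=\s*["\']([A-Za-z0-9._-]+)["\']')
_HTML_META = re.compile(
    rb'<meta[^>]*?charset\s*=\s*["\']?([A-Za-z0-9._-]+)', re.IGNORECASE)

def sniff_declared_encoding(head: bytes):
    """Return the declared encoding name in lowercase, or None."""
    head = head[:1024]
    for bom, name in _BOMS:
        if head.startswith(bom):
            return name
    for pattern in (_XML_DECL, _HTML_META):
        match = pattern.search(head)
        if match:
            return match.group(1).decode("ascii").lower()
    return None

def sniff_file(path):
    """Read only the first 1024 bytes of the file and evaluate them."""
    with open(path, "rb") as f:
        return sniff_declared_encoding(f.read(1024))
```

A file without any of these declarations yields `None`, which is exactly the case where an application has to fall back to guessing.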
UltraEdit already goes beyond that by loading and evaluating the first 64 KB of a file when using UltraEdit for Windows < v24.10 or UEStudio < v17.10 (as processors are fast nowadays and file content is loaded by default in blocks of 64 KB), and it includes a special routine detecting byte sequences which are typical for UTF-8 encoded characters and for ASCII escaped Unicode characters.
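How such a routine can recognize byte sequences typical for UTF-8 can be sketched in Python as follows. This is only an illustration under my own assumptions, not UltraEdit's actual routine: the name `looks_like_utf8` is made up, overlong forms and surrogates are not rejected, and ASCII escaped Unicode characters are not covered.

```python
def looks_like_utf8(block: bytes) -> bool:
    """Heuristic: True if the block contains at least one well-formed
    multi-byte UTF-8 sequence and no malformed one.

    Pure ASCII returns False because it gives no hint for UTF-8.
    """
    i, n, multibyte = 0, len(block), 0
    while i < n:
        lead = block[i]
        if lead < 0x80:            # ASCII byte
            i += 1
            continue
        if 0xC2 <= lead <= 0xDF:   # lead byte of a 2-byte sequence
            need = 1
        elif 0xE0 <= lead <= 0xEF: # lead byte of a 3-byte sequence
            need = 2
        elif 0xF0 <= lead <= 0xF4: # lead byte of a 4-byte sequence
            need = 3
        else:                      # stray continuation or invalid byte
            return False
        if i + need >= n:          # sequence cut off at the block end
            break
        if any(not 0x80 <= c <= 0xBF for c in block[i + 1:i + 1 + need]):
            return False
        multibyte += 1
        i += need + 1
    return multibyte > 0
```

The tolerance for a sequence cut off at the end matters when only the first block of a file (for example 64 KB) is evaluated.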
UltraEdit additionally offers the user the possibility to define the encoding on opening the file. The File - Open dialog of Windows is extended by UltraEdit with the option Format and the option Open As (Windows 2000/XP) or Encoding (Windows Vista and later Windows versions).
For single-byte encoded text files, an automatic code page detection feature is also included.
UltraEdit also remembers in uedit32.ini a code page/encoding set by the user that differs from the auto-detected code page/encoding and applies it to the file on the next opening.
So what else should UltraEdit do to support text files which use an encoding that is not easy to detect and which do not follow the rules of the various standards?
Carsten wrote:End-of-line: to DOS, to UNIX, to MAC (it really doesn't matter what "from" is)
Well, right. I agree that it does not matter from which line terminator type the conversion is done. So it would be enough to have the items "To DOS", "To UNIX" and "To MAC". The Save As dialog therefore has exactly these 3 options for determining the line terminator type on first save of a file, in addition to Default, which simply means keeping the line terminators as they are at the moment.
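Why the source type does not matter can be shown with a small Python sketch: every existing terminator, whatever its type, is replaced by the target one. The names `TERMINATORS` and `convert_eol` are hypothetical, and the sketch works on raw bytes, so it is suitable for single-byte encodings and UTF-8 but not for UTF-16 or UTF-32 data.

```python
import re

# Target line terminators for the three conversions.
TERMINATORS = {"DOS": b"\r\n", "UNIX": b"\n", "MAC": b"\r"}

# CR LF must be tried first so that it counts as one terminator.
_ANY_EOL = re.compile(rb"\r\n|\r|\n")

def convert_eol(data: bytes, target: str) -> bytes:
    """Replace every line terminator in data by the target type."""
    return _ANY_EOL.sub(TERMINATORS[target], data)
```

The same call also normalizes a file with mixed line terminators, and converting a file to the type it already has leaves it unchanged.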
On the other hand, my experience after nearly 10 years of forum care is that
- most users working with text files do not know what DOS/UNIX/MAC means at all.
- Therefore most users have never done a conversion of the line endings.
- And most users have never looked, in any application, at the bar at the bottom of the application window, which a minority of computer users know as the "status bar", or at what is displayed there. Yes, really! The overwhelming majority of computer users have never really recognized what this strange area at the bottom of an application window is for. They work 5, 10, 20 years with a computer and have nevertheless never looked at what is displayed at the bottom of the application window.
So IDM could rename the 3 menu items at the top of File - Conversions, as they are always enabled independent of the current line terminator type of the original file or the temporary file and always make the conversion correctly.
But IDM does not like changing menus, as this results in lots of work for the programmers, the testers, the documentation teams (offline and online help, taking into account that the online help should be helpful for all users independent of the version of UltraEdit used), the person who keeps the pages on the UltraEdit website up-to-date (best for all versions of UE), the localization teams, ...
I suggested to IDM during the beta test period of UE v20.00 to revise the Edit and the View menu by moving some menu items to submenus or other menus, as those two menus already contain so many menu items that they do not fit completely on my screen with a resolution of 1200x800 pixels even with UE maximized. I also wrote in my email what changes would need to be made, not only in the *.mb1 files, but also on several help pages of all help files in all supported languages, and on some webpages, to be correct after the revision of the menu structure. Maybe this additional information was the reason why IDM did not follow my suggestions, as IDM could see: wow, that's a lot of work caused by moving some menu commands, which users can do themselves quickly within 5 minutes.
Carsten wrote:Encoding: Hierarchical menu as in the status bar for "view as" purpose (the big proposed change!)
A submenu with dozens of menu commands or a set of submenus? Well, yes, that is possible, as the encoding type field in the "new" status bar (which I don't use as I prefer the basic status bar) demonstrates. I was quite happy with View - Set Code Page, which does the same as the encoding type field in the "new" status bar, taking into account that I need that feature only for answering code page related questions in the forum. Like the majority of UltraEdit users, I'm not a translator and therefore never switch the code page or encoding.
Perhaps you and many other users have never recognized that the Save As dialog offers all options to quickly convert the current line terminator type and character encoding to any other line terminator type and character encoding. And of course the Save As dialog can be used to just convert the current file, as it is possible to keep the name of the active file.
So why not simply press F12 to open the Save As dialog, select the wanted line terminator and format/encoding, and hit the RETURN key twice to convert the active file to whatever is wanted?
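What such a combined conversion amounts to can be sketched in a few lines of Python: decode with the current encoding, optionally replace the line terminators, and encode with the wanted encoding. The function `convert_copy` and its parameters are hypothetical names for illustration only and have nothing to do with UltraEdit's internals; passing `eol=None` corresponds to the Default option, which keeps the line terminators as they are.

```python
from typing import Optional

def convert_copy(data: bytes, source_encoding: str, target_encoding: str,
                 eol: Optional[str] = None) -> bytes:
    """Convert encoding and, if eol is given, line terminators in one step.

    eol is the wanted terminator: "\r\n" (DOS), "\n" (UNIX) or "\r" (MAC).
    """
    text = data.decode(source_encoding)
    if eol is not None:
        # Normalize every terminator type to LF, then apply the target.
        text = text.replace("\r\n", "\n").replace("\r", "\n")
        if eol != "\n":
            text = text.replace("\n", eol)
    return text.encode(target_encoding)
```

For example, `convert_copy(data, "latin-1", "utf-8", "\n")` turns a Latin-1 file with any line terminators into a UTF-8 file with UNIX line terminators.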