UTF-8 not recognized, largish file

Newbie

    Jan 20, 2009#1

    Hello all,

I have mysqldump files which are typically about 60 MB. The source database is fully UTF-8, and the dump file on the server does contain UTF-8 characters like beta, gamma, etc., verified by viewing and editing the files there. The UNIX 'file' command describes the file as 'UTF-8 English Unicode text, with very long lines'.

    A shell script adds these lines to the very top of the dump file:
    charset=utf-8
    encoding=utf-8

I did this after reading a post on here about the 10 KB limit on UTF-8 recognition.

I copy the file down to my Windows machine and edit it in UE; but UE interprets the file type as an ASCII/ANSI file with UNIX line terminators and doesn't display the betas, etc. properly.

    It's fine with smaller UTF-8 mysqldump files originating from the same database.

    My UE configuration has:

UTF-8 detection turned on (and UTF-16 detection too, for now)
    UNIX/MAC detection set to auto convert to DOS

    I'm now stuck for ideas ... any help appreciated, cheers all.

Grand Master

      Jan 20, 2009#2

I think the reason is that in UE v14.20.1.1006 the search for the UTF-8 charset declaration is no longer a simple search for charset=utf-8 as in previous versions of UE, as I have found out with some tests. If I create a file containing only that string, UltraEdit no longer interprets it as a UTF-8 encoded file. But if I create a new file with the following line:

      <meta http-equiv="content-type" content="text/html; charset=utf-8">

UltraEdit v14.20.1.1006 loads that file as a UTF-8 file. Further tests lead me to think that UltraEdit now uses a regular expression search.

<meta charset=utf-8> was also recognized as a valid UTF-8 character set declaration. So the regular expression in UltraEdit syntax is maybe something like <*meta*charset=utf-8*> for HTML/XHTML, and <*?xml*encoding="utf-8*> for XML.

You may know that you can specify a particular encoding in the File - Open dialog using the option Open As, available since UE v14.10.0.

And lastly, I suggest that your script should not add the charset declarations for HTML or XML at the top of the file. Instead, add the BOM (Byte Order Mark) for UTF-8 at the top of the file. That declares the file as a UTF-8 file without a doubt. You have to insert the characters ï»¿ (hex: EF BB BF) at the top of the file.
      Best regards from an UC/UE/UES for Windows user from Austria

Newbie

        Jan 20, 2009#3

        Thanks Mofi, useful stuff there. I will have a go and post back, in case it helps anyone else out in the future.
        Cheers.

        edit:
        OK, quick tests - the 'File Open as utf8' works fine - never even thought to look there, after years of "right-click > open with UE" or drag & drop ... ((slaps head))
        And the BOM inserting works like a charm too. Excellent.

        So thanks a lot Mofi! Sorted.

Basic User

          Feb 19, 2009#4

          Hi, I use UE 14.20.1.1008 in MS Windows XP SP3.

I have a 17 MB UTF-8 file without BOM, with just a dozen or so non-ASCII characters somewhere near the end of the file. I have enabled automatic Unicode recognition in the advanced configuration.

When I drag and drop the file onto UE, it gets recognized as "UNIX" (in the status line), i.e. a plain ASCII file with UNIX line ends. When I then search for a character like "œ" I get a match at the two bytes "Ĺ“". Well, yes, the match is where the "œ" is supposed to be. That leads me to the idea that the file is still not messed up, and I save it as another file and explicitly say it shall be UTF-8 (I tried both BOM and NOBOM here). UE certainly does some hard work with Unicode, because the file being saved temporarily has about double the size before it shrinks again to some 17 MB.

But alas, the newly saved file IS messed up (even if I don't drag and drop it but open it with the open dialogue and select encoding 65001). In there I cannot find any matches for "œ", but only for the two (now multi-byte) characters "Ĺ" and "“".

The only thing I can do to avoid this is not to open the file by drag and drop but via the open dialog. But that is really not user friendly. I mean, it still takes a lot of time to open a 17 MB text file; I think UE reads it from the first character to the last before it displays it. Why can't a few multi-byte characters be enough for UE to detect it as UTF-8?

Why can't UE have a configuration option to assume all opened files are UTF-8 (or any other encoding)? Then I could even disable the auto-detect feature, which is useless for a file like this one. :x

I wish UE would soon make UTF-8 the default, or even better: add an option in the configuration to select an encoding which shall be assumed when opening files.

Grand Master

            Feb 20, 2009#5

Okay, I will try to answer your questions, although they are already answered in other UTF-8 related topics. But before reading further, read carefully Unicode text and Unicode files in UltraEdit/UEStudio to get the basic understanding of encodings which it looks like you don't yet have.

Why can't a few multi-byte characters be enough for UE to detect it as UTF-8?

UltraEdit searches for byte sequences which could be interpreted as UTF-8 character codes only in the first 9 KB (UE v11.20a) or 64 KB (UE v14.00b, and UltraEdit for Windows < v24.10 or UEStudio < v17.10). Why not in the complete file? Because that would make UltraEdit extremely slow at opening any file when the setting Auto detect UTF-8 files is enabled. Always scanning the complete file for a byte sequence which could be interpreted as a UTF-8 character code would mean reading all bytes of a file before displaying it. Not a very good idea for files of several MB, and of course a very bad idea for files of hundreds of MB or even GB.

Also, how can UltraEdit be sure that the byte sequence E2 80 9C (hex codes) should really be interpreted as the UTF-8 character code for the character “ and not as the string â€œ using code page 1252? Can you answer that question if I give you a file with these 3 bytes? How can you know what I meant with these 3 bytes? Maybe I'm Russian and the same 3 bytes mean вЂњ, or I'm Greek and they mean something else again. Do you understand the problem? There must be a convention for a program which reads the bytes E2 80 9C on how to interpret them.

            That's the reason why organizations like the International Organization for Standardization (ISO) or the Unicode Consortium exist. They define standards. Without standards our high tech world can't exist. Unicode is a standard - see About the Unicode Standard.

So what is the real problem? The real problem is that the program which created the 17 MB file you open encodes characters as UTF-8 byte sequences, but has not declared the file as a UTF-8 file with a UTF-8 BOM. If your file is an HTML, XHTML or XML file then it does not need a BOM, but it must then have a declaration of the UTF-8 encoding at the top of the file. That your file has neither a BOM nor a standardized character encoding declaration means the program that wrote it ignores all the standards.

UTF-8 is really a special encoding standard. It was defined because many programs can only handle single-byte encoded text files and don't support the Unicode standard. With UTF-8 it is possible to encode non-ASCII characters in ASCII/ANSI files and therefore keep files with non-ASCII characters readable for programs not supporting the Unicode standard. Many interpreters like PHP and Perl are (or were), for example, not capable of correctly interpreting UTF-16 files. They can interpret only ASCII files and ASCII strings; they don't know about the special meaning of 00 00 FE FF (UTF-32, big endian), FF FE 00 00 (UTF-32, little endian), FE FF (UTF-16, big endian), FF FE (UTF-16, little endian) and EF BB BF (UTF-8) at the top of a text file, and therefore often break with an error if a BOM exists. That is one reason why a special declaration of the encoding using only ASCII characters was standardized for HTML, XHTML and XML: document writers can use non-ASCII characters, interpreters not compatible with the Unicode standard can still interpret the files, and browsers supporting the standards know which encoding is used for the file and can interpret and display the byte stream correctly.


Okay, back to your problem. UltraEdit for Windows < v24.10 and UEStudio < v17.10 do not scan the whole file for UTF-8 byte sequences, for the reasons described above. So your 17 MB file is opened in ASCII mode. If the file is now saved as UTF-8, the bytes of the existing UTF-8 sequences are themselves encoded with UTF-8 again. So the character œ, already present in the file as the 2 bytes C5 93 and interpreted with your code page as Ĺ“, is saved as the 5 bytes C4 B9 E2 80 9C, and now you have garbage. The solution is to use the Open as option in the File - Open dialog, or to insert the 3 bytes of the UTF-8 BOM, save the file as ASCII as loaded, close it and open it again.

I think I also have to explain why UltraEdit for Windows < v25.10 and UEStudio < v18.10 convert a whole file detected as UTF-8 into UTF-16 LE, which takes time on larger files. Most characters in a UTF-8 file are encoded with a single byte, others with 2 bytes, some with 3 or even 4 bytes. That is not very good for a program which does not only display the content but also allows modifying it with dozens of functions. Converting the UTF-8 file to UTF-16 LE results in a fixed number of bytes per character for the characters in the basic multilingual plane. That makes it efficient to handle the bytes of the characters in memory and in the file. Also, in all programming languages I know there is only the choice between single-byte character arrays and double-byte Unicode arrays for strings. UTF-8 is really something special, as already written above.


Why can't UE have a configuration option to assume all opened files are UTF-8 (or any other encoding)?

That's a suggestion for an enhancement that can be sent by email to IDM. But the real problem is the program which created the 17 MB file using UTF-8 encoding without marking it as a UTF-8 encoded file. If all programs creating UTF-8 files were compatible with the Unicode standard and wrote the encoding information into the file as required by the standards, then all other programs which really are compatible with the Unicode standard would have no problems reading those files.

            Added on 2009-11-09: I have found an undocumented setting in uedit32.exe of v11.10c and later versions.

By using Windows Notepad, while UltraEdit is not running, to manually add to uedit32.ini, usually located in %APPDATA%\IDMComp\UltraEdit, in the already existing section [Settings] (use the Find command), the line

            Force UTF-8=1

UltraEdit can be forced to read all non-UTF-16 files as UTF-8 encoded files. But new files are nevertheless created and saved either as Unicode (UTF-16 LE) or ASCII/ANSI files, unless with UE v16.00 and later the default Encoding Type is set to Create new files as UTF-8.

So this special setting only affects files that already have a name. However, creating a new file as ASCII/ANSI with UE < v16.00, saving it with a name, closing it and re-opening it results in a file now read as UTF-8. Be careful with this setting: even real ANSI files are loaded as UTF-8 encoded files, causing all non-ASCII characters with a code value greater than decimal 127 to be interpreted wrongly.

Basic User

              Mar 26, 2009#6

              Hello Mofi,

              thank you for your thorough answer.
Mofi wrote: How can you know what I meant with these 3 bytes? Maybe I'm Russian and the same 3 bytes mean вЂњ, or I'm Greek and they mean something else again. Do you understand the problem? There must be a convention for a program which reads the bytes E2 80 9C on how to interpret them.
I completely understand that UE can't tell what it is. It knows many more encodings than 99.99% of its users, and seen as bytes it really is ambiguous. But on the other hand, we don't live in the early 90s anymore. How many of today's text documents are written in traditional encodings*? (I know that we need the ISO standards for editing legacy texts which were saved before the Unicode era, there's no question about it.) It's just a matter of probability. How big is the probability that a Greek or a Russian will want to open a text in the respective ISO encoding today? I bet that for all texts they write they use some kind of UTF (unless they write in Notepad, which unfortunately still features a traditional encoding as the default).

So what I was complaining about was that none of the UE developers consider the falling frequency of ISO encoded texts, and that the user still has to work pretty hard to convince UE that he really would like to do things in Unicode.
              If all programs creating UTF-8 files would be compatible with the Unicode standard and would write the encoding information into the file as required by the standards, then all other programs which are already really compatible to the Unicode standard would have no problems reading those files.
What standard are you talking about? The UTF-8 BOM is deprecated by the Unicode Consortium itself** and is rather seen as a quirky thing. The BOM of UTF-8 is superfluous (and is no real byte order mark anyway) because UTF-8 has a strictly defined byte order. However, the UTF-8 BOM is predominantly used on the Windows platform as an explicit indicator of UTF-8, because many programs, including UE, are reluctant to embrace UTF-8 (NOBOM) as the new encoding standard. (I know that UTF-16 or UTF-32 (whatever endianness) are even better from the programmer's point of view, but most users complain about them being uneconomical for Latin-based scripts.) On Linux and Mac, UTF-8 (NOBOM) is no big deal and it's usually the default for saving text files. There, if you encounter a UTF-8 BOM file in some kind of workflow, tools and utilities freak out because they don't expect such a thing as a BOM in a UTF-8 file (that's their ignorance and flaw, I know). They assume that if it has nothing, then it's UTF-8 (NOBOM). They recognize the BOMs of UTF-16 and -32. They assume that if it's some kind of legacy encoding, users will tell them explicitly.

              So I just mean UE could behave in our modern unicoded times the same way: Be (by default) prepared to open (and save) some kind of UTF. If not, the user will tell you.

And, as you will probably argue, there might be some users or work periods where you have to deal predominantly with legacy encodings. For this, UE should have an option in its configuration where you could set an encoding which UE would assume when opening files, and another encoding for saving files. (These two independent encoding defaults could be very handy if you want to convert dozens of files from one encoding to another.)

The post you reacted to wasn't meant primarily as a help request for a problem; it was more of a sigh, wondering why UE is still so ISO-oriented. I know you did not cause these problems and cannot solve them. You are just somebody who knows where these problems originate, and you can explain the inner workings of UE perfectly to rather ignorant users (like me). I will suggest the default encoding options to IDM.

              ----------
* assuming you count ASCII as UTF-8 NOBOM
              ** "Use of a BOM is neither required nor recommended for UTF-8, but may be encountered in contexts where UTF-8 data is converted from other encoding forms that use a BOM or where the BOM is used as a UTF-8 signature." (cited from Section 2.13, Special Characters and Noncharacters, (Unicode 5.1))

Master

                Mar 26, 2009#7

                I'd like to hope that UTF-8 is the defacto standard today, but I kind of doubt it. If I save a CSV file in Excel or a TXT or HTML file in MS Word, they will be written in the local encoding, not in UTF-8. Python, my favorite programming language, has just made the jump to Unicode with Python 3, but the file handling routines still expect the local standard encoding unless specified otherwise. I guess it'll take a few years until the old encodings are dropped...

Basic User

                  Mar 26, 2009#8

                  pietzcker wrote:If I save a CSV file in Excel or a TXT or HTML file in MS Word, they will be written in the local encoding, not in UTF-8.
That's a shame. It's more or less dependent on the platform's default. I think most users never bother with any encoding. They just want to save a file, open it, and expect things to be alright. They never imagine there is more than one way to encode the text. In these cases much depends on the default setting, and the traditional encodings are not the best default choice. That is something program developers have to bear in mind. As long as there are programs around for which UTF-8 (or 16, or 32) is the unexpected setting, users like me will get upset.
                  pietzcker wrote:Python, my favorite programming language, has just made the jump to Unicode with Python 3, but the file handling routines still expect the local standard encoding unless specified otherwise.
I know. I love Python 3. It's the only language where I can give my functions Czech names. (-: It handles Unicode files neatly (although I have to tell it to).
                  pietzcker wrote:I guess it'll take a few years until the old encodings are dropped...
I hope not. It's one of the reasons I am thinking of switching and not using Windows anymore (or at least using it as little as possible).
                  Sorry for sliding off topic.

Grand Master

                    Mar 27, 2009#9

Well, Zababa, you are absolutely right that the UTF-8 BOM has in the meantime been declared deprecated.

But for Windows platforms it will surely take more than 5 years until most text files are no longer written using a code page, but using a Unicode encoding. There are more than 30 years of computer history with single-byte coded text files; you can't get rid of that in 2 years, not on a platform as widely used as Windows.

For example, I'm mainly a programmer. I have never, really never, seen a C or C++ source file encoded in UTF-8. UltraEdit is heavily used by programmers. It would be very dangerous for many program sources if every non-ASCII character were suddenly encoded as UTF-8 by default. Any non-ASCII character in a NUL-terminated C string could suddenly produce unexpected results or buffer overflows.

But I agree, from the text writer's view, that it would be really helpful to be able to specify the default encoding for all files or for files with a defined extension.

                    So from the text writers point of view it would be good if the current option Create new file as Unicode at Configuration - Editor - New File Creation would be converted into a radio button option like:

                    Create new file as:
                    • ASCII/ANSI file
                    • Unicode UTF-8 file
                    • Unicode UTF-16 LE file
And at Configuration - File Handling - Unicode/UTF-8 Detection an additional option, for example named "Load files with following extensions as UTF-8" with an edit field, could be offered to specify extensions of files which are loaded as UTF-8 if none of the enabled Unicode detections has already detected a Unicode encoding. The file extensions could be separated with spaces, and * as a wildcard for all files should be possible too, like the File Extensions = list in the wordfile for a syntax highlighting language definition.

Of course, the script and macro environment must then also be enhanced to be able to detect from within a script or macro which encoding a new file has, and to convert the encoding from/to UTF-8. Currently there is no script or macro command to make UTF-8 conversions in either direction or to detect that a file is encoded in UTF-8. Otherwise public scripts/macros working fine for user A could produce garbage for user B.

But don't expect me to suggest such enhancements; you have to do it. Although you maybe can't believe it, I don't use UTF-8 or any other Unicode encoding in my daily work, although I edit many text files daily. So I'm not really interested in such new options.


Added on 2009-11-09: see the note about the undocumented Force UTF-8=1 setting appended to my post #5 above.

Newbie

                      Jan 12, 2010#10

Hi, I am comparing two files which differ only in the "EF BB BF" (3 bytes) at the top of the file. One has it, the other does not.
But "Hex Edit" automatically adds "EF BB BF" at the top of the file, so I can never see the difference between them.
How can I make UE not change any part of the file when using the "Hex Edit" view?

I have tried every option value under "Unicode/UTF-8 Detection", but it did not work.

                      Thanks.

Grand Master

                        Jan 12, 2010#11

                        sfqfirst, which version of UltraEdit do you use?

                        If I open a UTF-8 file without BOM with UE v15.20.0.1022 and switch to hex edit mode UltraEdit does not add the 3 BOM bytes. It shows the content as really saved on hard disk. If I open a UTF-8 file with BOM with UE v15.20.0.1022 and switch to hex edit mode the BOM bytes are displayed at top of the file.

                        You can use in the File - Open dialog the option Open as binary to open any file directly in hex editing mode. This option exists since version 14.10 of UltraEdit.

You can also try File - Revert to Saved after switching to hex edit mode. I have not tested this (because it is not needed with v15.20.0.1022), but you should then see the bytes of the file as stored on disk.
                        Best regards from an UC/UE/UES for Windows user from Austria

Newbie

                          Jan 12, 2010#12

Thank you. I got it working as you said.
I am very glad to see your answer came so fast, although we are in different countries.
I was using UltraEdit 15.10.

Newbie

                            Files not opening as UTF-8

                            Aug 11, 2015#13

                            I just installed a trial version of UltraEdit 22.10.0.12 on a Windows 10 system.

                            I have files that need to be in UTF-8 format. I know they are in UTF-8 format because I specifically saved them that way.

If I already have my file open, then close UltraEdit and reopen it, the previous file gets opened again, but it opens in ANSI 1252 format and completely messes up some of the special characters in the file. There is no way to hit the "Convert to UTF-8" command and get those characters back.

The only way I can open the file as UTF-8 is to set the Encoding to UTF-8 before clicking the Open button; but that is really inconvenient if it needs to be done every time I open UltraEdit: close all previous tabs and reopen each file as UTF-8.

                            I have the following configurations set:

                            * File Handling > Code Page Detection: everything unchecked... if the "Auto code page detection" is set, my files open as ANSI 1252.
                            * File Handling > DOS/Unix/Mac Handling: set to DOS & Automatically convert to DOS format
                            * File Handling > Save: only "Trim trailing spaces..." and "Do not auto-save FTP files" are checked.
                            * File Handling > Unicode/UTF-8 Detection: "Auto detect UTF-8 files" and "Detect Unicode (UTF-16 files without BOM)" are checked.

                            Am I missing something?

                            How can I make my default load & save format as UTF-8 (no BOM)?

                            Thanks!

Grand Master

                              Re: Files not opening as UTF-8

                              Aug 11, 2015#14

All you need to know is already explained in the posts above, but I summarize it here again for UE v22.10, as lots of things have changed in the meantime.
                              • Creating a new file by default as UTF-8 encoded file

                                Open Advanced - Configuration - Editor - New File Creation and select Create new files as UTF-8 for Encoding Type.

Open Advanced - Configuration - File Handling - Save and uncheck Write UTF-8 BOM header to all UTF-8 files when saved, and also uncheck Write UTF-8 BOM on new files created within this program (if above is not set), unless you use UltraEdit to create new UTF-8 files which are later interpreted by an application that supports UTF-8 encoded files with BOM. PHP and JavaScript interpreters don't support UTF-8 files with a BOM.
                              • UTF-8 detection on opening a file

                                On opening a file UTF-8 encoding is detected automatically by UltraEdit if
                                • file contains UTF-8 byte order mark (BOM), or
• the file contains in the first few KB (never evaluated how many) an HTML/XHTML character set or XML encoding declaration matched by the UltraEdit regular expression <*meta*charset=utf-8*> or <*?xml*encoding="utf-8*>, respectively.
  This means the short HTML5 charset declaration is currently not supported by UE v22.10.0.18.
• Or the first 64 KB of the file contain at least one character encoded in UTF-8 with more than one byte when using UltraEdit for Windows < v24.10 or UEStudio < v17.10, i.e. a character with a code value greater than U+007F (decimal 127): a non-ASCII character. (A sketch combining these rules follows after this list.)
                              • Force opening files as UTF-8 encoded

When working only with UTF-8 encoded files and never or only rarely with ASCII/ANSI files (1 byte per character using a code page), but the UTF-8 encoded files have neither a BOM nor a detectable charset or encoding declaration, and often do not contain non-ASCII characters in the first 64 KB when using UltraEdit for Windows < v24.10 or UEStudio < v17.10, you should
                                • exit UltraEdit after making the configuration settings as written above,
                                • open with Notepad file %APPDATA%\IDMComp\UltraEdit\uedit32.ini,
                                • search in this file for [Settings],
insert below it a line with Force UTF-8=1, save the INI file and exit Notepad.
                                Now all files not being detected as UTF-16 encoded files are opened as UTF-8 encoded files.

                                Attention:
ANSI encoded files containing a character with a code value greater than decimal 127 are now also opened by default as UTF-8 encoded files, resulting in a corrupt text encoding on file save. With this configuration setting, ASCII/ANSI files must be opened via File - Open with ASCII explicitly selected in Open as.
                              See also the post Short UTF-8 charset declaration in HTML5 header for enhancements made on UTF-8 detection in UltraEdit for Windows v24.00 and UEStudio v17.00.
                              Best regards from an UC/UE/UES for Windows user from Austria

Newbie

                                Re: Files not opening as UTF-8

                                Aug 11, 2015#15

                                Thanks! Option 3 worked great for me.

Option 2 wouldn't work out so well since most of the files I work with are JavaScript; I guess adding a comment at the beginning with utf-8 in it wouldn't be out of the question.