UTF-8 auto-detection problem with first multi-byte after 10k

negg · Sep 30, 2006 · #1 · 2006-09-30T01:26+00:00

UTF-8 problems again *sigh*

I nearly went mad trying to convert a messed-up file from ASCII back to UTF-8: I converted it to UTF-8, repaired the messed-up umlaut characters, and saved it. Just to make sure, I reopened the file and... it's back to ASCII, and the characters I just entered are messed up again!

I repeated this again and again, always with the same result. Then I tried the same with a small test file, and everything worked fine! So what is going on, I thought?

Then I noticed that the first multi-byte character in the file is very far toward the end of the file. So I ran a test; try it yourself:

- create a new file, convert it to U8, enter an umlaut (e.g. ä), save and close
- reopen the file: it's still U8. So far so good. Now enter at least 10 KB of ASCII characters before the umlaut (in my test file it's 11116 "-" characters), save and close
- reopen the file: it's ASCII again, and the umlaut is messed up!

Is this a convention/standard? Does this have to do with a setting? I have autodetect U8 on...so this rather looks like a bug to me.

negg

Mofi · Sep 30, 2006 · #2 · 2006-09-30T16:10+00:00

No, it's not a bug. UE always scans only the first 9 KB (UE v11.20a) or 64 KB (UE v14.00b) of a file to detect its format. And that's okay because there are users who edit really large files of several MB or even GB.
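The effect of a limited scan window can be sketched in a few lines of Python. This is only an illustration of the idea, not UltraEdit's actual detection code: the function name `guess_encoding`, the `scan_limit` parameter and the three result labels are assumptions made for this example. Only the 9 KB and 64 KB window sizes and the 11116 "-" characters come from this thread.

```python
import codecs

UTF8_BOM = b"\xef\xbb\xbf"


def guess_encoding(data: bytes, scan_limit: int = 9 * 1024) -> str:
    """Guess 'utf-8', 'ascii' or 'ansi' by inspecting only a prefix of the data.

    Hypothetical sketch: scan_limit mimics an editor that looks only at the
    first few KB of a file instead of reading all of it.
    """
    if data.startswith(UTF8_BOM):
        return "utf-8"  # a BOM at the start is always inside the scan window
    prefix = data[:scan_limit]
    if prefix.isascii():
        return "ascii"  # no byte >= 0x80 was seen inside the window
    try:
        # final=False tolerates a multi-byte sequence cut off by the window edge
        codecs.getincrementaldecoder("utf-8")().decode(prefix, final=False)
    except UnicodeDecodeError:
        return "ansi"  # high bytes that do not form valid UTF-8
    return "utf-8"


small = "ä".encode("utf-8")
big = b"-" * 11116 + "ä".encode("utf-8")

print(guess_encoding(small))                      # utf-8
print(guess_encoding(big))                        # ascii (umlaut lies beyond 9 KB)
print(guess_encoding(big, scan_limit=64 * 1024))  # utf-8
```

With the 9 KB window the umlaut at offset 11116 is never seen, so the file is classified as ASCII. A larger window or a BOM changes the result.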

I rarely use UTF-8 encoding. Most of the files I edit daily are normal ASCII files. If UltraEdit always checked the whole file for binary chars, UTF-8 multi-byte chars, UTF-16 chars, ASCII escaped Unicode chars, etc., the load time of a file would increase dramatically, and none of us wants that.

So I have a question for you: What type of files do you encode in UTF-8? HTML, PHP or XML files?

Yes! Why do you not specify the UTF-8 encoding at the top of the file as required by the HTML standard?

No, other files! Why do you not use the UTF-8 BOM?

For UTF-8 handling see Using UTF-8 with UltraEdit (http://www.ultraedit.com/forums/viewtopic.php?t=3511), especially the chapter "My suggestions for the configuration for UTF-8 webpage writers" in my post and the linked pages there.

By the way: If your files are HTML files why do you not use HTML entities for the German umlauts and the 'ß'? In the HTML toolbar of UltraEdit there is a symbol with the tooltip "HTML Text2HTML" which will convert characters to known HTML entities.

And I personally have the following macro assigned to a hotkey; it does the same as "HTML Text2HTML" with a limited character set.

IfExtIs "html"
Else
IfExtIs "htm"
Else
ExitMacro
EndIf
EndIf
InsertMode
ColumnModeOff
HexOff
IfSel
Find MatchCase "ä"
Replace All SelectText "&auml;"
Find MatchCase "ö"
Replace All SelectText "&ouml;"
Find MatchCase "ü"
Replace All SelectText "&uuml;"
Find MatchCase "Ä"
Replace All SelectText "&Auml;"
Find MatchCase "Ö"
Replace All SelectText "&Ouml;"
Find MatchCase "Ü"
Replace All SelectText "&Uuml;"
Find MatchCase "ß"
Replace All SelectText "&szlig;"
Else
Find MatchCase "ä"
Replace All "&auml;"
Find MatchCase "ö"
Replace All "&ouml;"
Find MatchCase "ü"
Replace All "&uuml;"
Find MatchCase "Ä"
Replace All "&Auml;"
Find MatchCase "Ö"
Replace All "&Ouml;"
Find MatchCase "Ü"
Replace All "&Uuml;"
Find MatchCase "ß"
Replace All "&szlig;"
EndIf
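
For readers without UltraEdit, the same limited character-to-entity replacement can be sketched in Python. This is a rough equivalent for illustration only: the function name `umlauts_to_entities` is made up for this example, and the sketch leaves out the macro's file extension check and its handling of a selection.

```python
# Mapping of the German umlauts and sharp s to their named HTML entities.
ENTITIES = {
    "ä": "&auml;", "ö": "&ouml;", "ü": "&uuml;",
    "Ä": "&Auml;", "Ö": "&Ouml;", "Ü": "&Uuml;",
    "ß": "&szlig;",
}
_TABLE = str.maketrans(ENTITIES)


def umlauts_to_entities(text: str) -> str:
    """Replace each mapped character with its HTML entity (case-sensitive)."""
    return text.translate(_TABLE)


print(umlauts_to_entities("Größe über Äpfel"))
# Gr&ouml;&szlig;e &uuml;ber &Auml;pfel
```

The text then contains only ASCII characters, so the result does not depend on how an editor detects the file encoding.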

Additionally, I have macros like the following one, assigned to the hotkeys ä ö ü Ä Ö Ü ß, which automatically insert the umlauts and 'ß' in the correct form based on the file extension:

IfExtIs "html"
"&szlig;"
ExitMacro
EndIf
IfExtIs "htm"
"&szlig;"
ExitMacro
EndIf
IfExtIs "asm"
"ss"
ExitMacro
EndIf
IfExtIs "c"
"ss"
ExitMacro
EndIf
IfExtIs "h"
"ss"
ExitMacro
EndIf
IfExtIs "inc"
"ss"
ExitMacro
EndIf
"ß"
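
The extension-based choice made by this macro can also be written as a small Python function. This is a sketch under assumptions: the name `sharp_s_for` is hypothetical, and lowercasing the extension before comparing is a choice made here, since the thread does not say whether IfExtIs is case-sensitive.

```python
from pathlib import Path


def sharp_s_for(filename: str) -> str:
    """Return the text to insert for 'ß', chosen by file extension."""
    ext = Path(filename).suffix.lower().lstrip(".")
    if ext in {"html", "htm"}:
        return "&szlig;"  # HTML entity for web pages
    if ext in {"asm", "c", "h", "inc"}:
        return "ss"       # plain ASCII replacement for source code
    return "ß"            # all other files get the literal character


print(sharp_s_for("index.html"))  # &szlig;
print(sharp_s_for("main.c"))      # ss
print(sharp_s_for("notes.txt"))   # ß
```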

negg · Sep 30, 2006 · #3 · 2006-09-30T20:59+00:00

Mofi wrote: No, it's not a bug. UE always scans only the first 9 KB (UE v11.20a) or 64 KB (UE v14.00b) of a file to detect its format. And that's okay because there are users who edit really large files of several MB or even GB.

Thought so. Just wanted to make sure this is done intentionally. Still, I think this is quite dangerous; I don't know how many files I have messed up this way without even noticing...

Mofi wrote: So I have a question for you: What type of files do you encode in UTF-8? HTML, PHP or XML files?

Yes! Why do you not specify the UTF-8 encoding at the top of the file as required by the HTML standard?

In HTML I could; in PHP files there is no such standard (as has been pointed out to you a couple of times already by other people).

Yes, I've read nearly all your posts, as they always provide useful information. I'm just taking up this discussion here to make the point clear for other people reading this.

I know I could write the string "encoding=utf-8" in an initial comment in a PHP file, and I would if I worked on my own. But I work in a bigger team, so I would have to tell everyone to keep that bit in the files and not remove it. Then maybe the next person uses editor "XYZ", which needs the string "this is a UTF8 file" in the second line, another editor needs this and that, etc. If we have to remember and keep all those "hint" lines in the files, we need a database for this task.

So it works, but it's simply not practical.

Mofi wrote: No, other files! Why do you not use the UTF-8 BOM?

I edit big MySQL dumps that have to be UTF-8. With a BOM, MySQL (at least in my tests some time ago) does not import these files.
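
For tools that reject a BOM, one possible workaround is to keep the BOM in the working copy, so the editor detects UTF-8, and strip it just before the import. The Python sketch below shows only that stripping step on bytes. The function name `strip_utf8_bom` is made up for this example, and whether a given MySQL version accepts a dump with a BOM is not verified here.

```python
import codecs


def strip_utf8_bom(data: bytes) -> bytes:
    """Return data without a leading UTF-8 BOM (EF BB BF), if one is present."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data


dump = codecs.BOM_UTF8 + "INSERT INTO t VALUES ('ä');".encode("utf-8")
clean = strip_utf8_bom(dump)
print(clean.decode("utf-8"))  # INSERT INTO t VALUES ('ä');
```

When reading text rather than raw bytes, Python's "utf-8-sig" codec does the same job on decoding.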

My wish would be to have the number of bytes to check configurable, at least via an INI parameter. Maybe I'll submit such a request to IDM.

But thanks anyway for your feedback Mofi!

Martin

Mofi · Oct 01, 2006 · #4 · 2006-10-01T14:43+00:00

Hi Martin,

Maybe you should also request, by email to IDM support, the feature I suggested for users who work only with UTF-8 files: "Create and load ASCII files as UTF-8". Johannes (Ammaletu) has already done so at Using UTF-8 with UltraEdit. The more users request this feature, the higher the chance of seeing it realized in one of the next major releases.

Such a feature (configuration option) would solve your problem too.

I don't write PHP files, but I'm really starting to think that the PHP interpreter should not treat the BOM of a file as characters to be sent to the browser. I searched with Google for a while on the problem of UTF-8 encoded PHP files, and it looks like many PHP programmers have the same problem with other editors (e.g. Dreamweaver) because the PHP interpreter has problems with a BOM. Many PHP programmers think it's a big bug in the PHP interpreter; see for example "php doesn't ignore the utf-8 BOM". It looks like the PHP interpreter developers are now reacting to this issue.

UltraEdit, UltraCompare, UEStudio forums