UTF-8 auto-detection problem with first multi-byte after 10k

    Sep 30, 2006#1

    UTF-8 problems again *sigh*

    I nearly went mad trying to get a messed-up file from ASCII back to UTF-8: I converted it to UTF-8, repaired the broken umlaut characters, and saved it. Just to make sure, I reopened the file and... it's back to ASCII, with the characters I had just entered messed up again!

    So I did the same again and again: always the same result. I tried to do the same with a small test file: all worked fine! So what is this, I thought?

    Then I noticed that the first multi-byte character sits very far toward the end of the file. So I ran a test; try it yourself:

    - create a new file, convert it to UTF-8 and enter an umlaut (e.g. ä), save and close
    - reopen the file: it's still UTF-8. So far so good. Now enter at least 10 KB of ASCII characters before the umlaut (in my test file it's 11116 "-" characters), save and close
    - reopen the file: it's ASCII again, and the umlaut is messed up!
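
The test file from the steps above can also be produced outside the editor. This is a minimal Python sketch, not part of the original post: the file name `utf8_repro.txt` and the helper `make_repro_bytes` are made up for illustration, and the padding length is the 11116 "-" characters mentioned above. It writes UTF-8 without a BOM.

```python
import os
import tempfile

def make_repro_bytes(padding_len=11116):
    """Return padding_len ASCII '-' characters followed by one umlaut, as UTF-8 (no BOM)."""
    # "\u00e4" is the umlaut "ä"; it encodes to the two bytes C3 A4.
    return ("-" * padding_len + "\u00e4").encode("utf-8")

data = make_repro_bytes()

# Offset of the first byte outside the ASCII range, i.e. where the umlaut starts.
first_multibyte_offset = next(i for i, b in enumerate(data) if b >= 0x80)

# Hypothetical file name; written to the temp directory to keep the example harmless.
path = os.path.join(tempfile.gettempdir(), "utf8_repro.txt")
with open(path, "wb") as f:
    f.write(data)

print(len(data), first_multibyte_offset)
```

The file is 11118 bytes long and the umlaut starts at byte offset 11116, so any detector that only inspects the first 9 KB (9216 bytes) never sees a non-ASCII byte.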

    Is this a convention or standard? Does it have to do with a setting? I have UTF-8 auto-detection turned on, so this looks rather like a bug to me.

    negg

      Sep 30, 2006#2

      No, it's not a bug. UE always scans only the first 9 KB (UE v11.20a) or 64 KB (UE v14.00b) of a file to detect its format. And that's okay, because there are users who edit really large files of several MB or even GB.

      I rarely use UTF-8 encoding. Most of the files I edit daily are normal ASCII files. If UltraEdit started to always check the whole file for binary chars, UTF-8 multi-byte chars, UTF-16 chars, ASCII escaped Unicode chars, etc., the load time of a file would increase dramatically, and none of us wants that.
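
The effect of scanning only the start of a file can be sketched in a few lines of Python. This is an illustration of prefix-limited detection in general, not UltraEdit's actual algorithm; the function name `detect_encoding`, its `scan_limit` parameter and the returned labels are assumptions made for this example. The two window sizes are the 9 KB and 64 KB figures quoted above.

```python
import codecs

def detect_encoding(data: bytes, scan_limit: int) -> str:
    """Guess the encoding by looking only at the first scan_limit bytes."""
    window = data[:scan_limit]
    if window.isascii():
        # No byte >= 0x80 inside the window: indistinguishable from plain ASCII.
        return "ASCII"
    decoder = codecs.getincrementaldecoder("utf-8")()
    try:
        # final=False tolerates a multi-byte sequence cut off at the window edge.
        decoder.decode(window, final=False)
    except UnicodeDecodeError:
        return "other 8-bit"
    return "UTF-8"

# Same layout as the test file in the first post: 11116 dashes, then one umlaut.
sample = ("-" * 11116 + "\u00e4").encode("utf-8")

print(detect_encoding(sample, 9 * 1024))   # window ends before the umlaut
print(detect_encoding(sample, 64 * 1024))  # window covers the whole file
```

With a 9 KB window the sample is reported as ASCII, and with a 64 KB window it is reported as UTF-8. Only the window changes the answer: the cost of the scan is bounded by `scan_limit` instead of by the file size.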

      So I have a question for you: what type of files do you encode in UTF-8? HTML, PHP or XML files?

      Yes! Why do you not specify the UTF-8 encoding at top of the file as required by the HTML standard?

      No, other files! Why do you not use the UTF-8 BOM?

      For UTF-8 handling, see [url=http://www.ultraedit.com/forums/viewtopic.php?t=3511]Using UTF-8 with UltraEdit[/url], especially the chapter "My suggestions for the configuration for UTF-8 webpage writers" in my post and the pages linked there.


      By the way: If your files are HTML files why do you not use HTML entities for the German umlauts and the 'ß'? In the HTML toolbar of UltraEdit there is a symbol with the tooltip "HTML Text2HTML" which will convert characters to known HTML entities.

      I personally have the following macro assigned to a hotkey. It does the same as "HTML Text2HTML", but with a limited character set.

      IfExtIs "html"
      Else
      IfExtIs "htm"
      Else
      ExitMacro
      EndIf
      EndIf
      InsertMode
      ColumnModeOff
      HexOff
      IfSel
      Find MatchCase "ä"
      Replace All SelectText "&auml;"
      Find MatchCase "ö"
      Replace All SelectText "&ouml;"
      Find MatchCase "ü"
      Replace All SelectText "&uuml;"
      Find MatchCase "Ä"
      Replace All SelectText "&Auml;"
      Find MatchCase "Ö"
      Replace All SelectText "&Ouml;"
      Find MatchCase "Ü"
      Replace All SelectText "&Uuml;"
      Find MatchCase "ß"
      Replace All SelectText "&szlig;"
      Else
      Find MatchCase "ä"
      Replace All "&auml;"
      Find MatchCase "ö"
      Replace All "&ouml;"
      Find MatchCase "ü"
      Replace All "&uuml;"
      Find MatchCase "Ä"
      Replace All "&Auml;"
      Find MatchCase "Ö"
      Replace All "&Ouml;"
      Find MatchCase "Ü"
      Replace All "&Uuml;"
      Find MatchCase "ß"
      Replace All "&szlig;"
      EndIf


      Additionally, I have macros like the following one assigned to the hotkeys ä ö ü Ä Ö Ü ß. They automatically insert the umlauts and 'ß' in the correct form based on the file extension:

      IfExtIs "html"
      "&szlig;"
      ExitMacro
      EndIf
      IfExtIs "htm"
      "&szlig;"
      ExitMacro
      EndIf
      IfExtIs "asm"
      "ss"
      ExitMacro
      EndIf
      IfExtIs "c"
      "ss"
      ExitMacro
      EndIf
      IfExtIs "h"
      "ss"
      ExitMacro
      EndIf
      IfExtIs "inc"
      "ss"
      ExitMacro
      EndIf
      "ß"

      Best regards from an UC/UE/UES for Windows user from Austria

        Sep 30, 2006#3

        Mofi wrote: No, it's not a bug. UE always scans only the first 9 KB (UE v11.20a) or 64 KB (UE v14.00b) of a file to detect its format. And that's okay, because there are users who edit really large files of several MB or even GB.
        Thought so. I just wanted to make sure this is intentional. Still, I think it is quite dangerous; I don't know how many files I have messed up this way without even noticing...
        Mofi wrote: So I have a question for you: what type of files do you encode in UTF-8? HTML, PHP or XML files?

        Yes! Why do you not specify the UTF-8 encoding at top of the file as required by the HTML standard?
        In HTML I could; in PHP files there is no such standard (as has been pointed out to you a couple of times already by other people ;-) ). Yes, I've read nearly all your posts, as they always provide useful information. I'm just picking up this discussion here to make the point clear for other people reading this ;-)

        I know I could write the string "encoding=utf-8" in an initial comment in a PHP file, and I would if I worked on my own. But I work in a bigger team: I would have to tell everyone to keep that bit in the files and not remove it. Then maybe the next person uses editor "XYZ", which needs the string "this is a UTF8 file" in the second line, another editor needs this and that, and so on. If we had to remember and keep all those "hint" lines in the files, we would need a database for the task ;-) So it works, but it's simply not practical.
        Mofi wrote: No, other files! Why do you not use the UTF-8 BOM?
        I edit big MySQL dumps that have to be UTF-8; with a BOM, MySQL (at least in my tests some time ago) does not import these files.
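
For readers unfamiliar with the BOM: in UTF-8 it is the three bytes EF BB BF at the very start of the file. The Python sketch below shows how such a marker gets there and how it could be stripped again before handing a dump to an importer that rejects it. The helper name `strip_utf8_bom` and the sample SQL string are made up for this example, and whether a particular MySQL version accepts a BOM is not something this snippet checks.

```python
import codecs

def strip_utf8_bom(data: bytes) -> bytes:
    """Return data without a leading UTF-8 BOM (EF BB BF), if one is present."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data

# The 'utf-8-sig' codec prepends the BOM when encoding.
with_bom = "SELECT '\u00e4';".encode("utf-8-sig")
without_bom = strip_utf8_bom(with_bom)

print(with_bom[:3].hex(), without_bom == "SELECT '\u00e4';".encode("utf-8"))
```

Stripping is idempotent: data without a BOM passes through unchanged, so the helper can be applied to every file in a batch.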

        My wish would be to make the number of bytes to check configurable, at least via an INI parameter. Maybe I'll submit that as a feature request to IDM.

        But thanks anyway for your feedback Mofi!

        Martin

          Oct 01, 2006#4

          Hi Martin,

          maybe you should also email IDM support to request the feature I suggested for users who work only with UTF-8 files: "Create and load ASCII files as UTF-8". Johannes (Ammaletu) has already done so at Using UTF-8 with UltraEdit. The more users request this feature, the higher the chance of seeing it realized in one of the next major releases.

          Such a feature (configuration option) would solve your problem too.
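
Why such an option would be safe for pure ASCII files: every 7-bit ASCII byte sequence is also valid UTF-8 with the same meaning, so nothing is lost by always decoding as UTF-8. This is a small Python illustration, not a description of any existing UltraEdit option; `load_as_utf8` is a hypothetical name.

```python
def load_as_utf8(data: bytes) -> str:
    """Hypothetical 'always treat as UTF-8' loader: no detection window involved."""
    return data.decode("utf-8")

# Pure ASCII decodes identically under both encodings.
ascii_only = b"SELECT 1; -- plain 7-bit text"
same_text = ascii_only.decode("ascii") == load_as_utf8(ascii_only)

# The problem file from this thread: the umlaut sits beyond a 9 KB scan window.
late_umlaut = ("-" * 11116 + "\u00e4").encode("utf-8")
text = load_as_utf8(late_umlaut)

print(same_text, text[-1] == "\u00e4", len(text))
```

The trade-off is that a file in an 8-bit code page that happens to contain non-ASCII bytes would raise a decoding error here, which is why this only suits users who work exclusively with UTF-8.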

          I don't write PHP files, but I'm really starting to think that the PHP interpreter should not treat the BOM of a file as characters to be sent to the browser. I searched with Google for a while for problems with UTF-8 encoded PHP files, and it looks like many PHP programmers have the same problem with other editors (e.g. Dreamweaver) because the PHP interpreter has problems with a BOM. Many PHP programmers consider it a big bug in the PHP interpreter; see for example "php doesn't ignore the utf-8 BOM". It looks like the PHP interpreter developers are now reacting to this issue.