
Can't get rid of BOM


    Jun 25, 2006#1

    The BOM is inserted (and saved) in all newly created/converted UTF-8 files, ignoring the following options in INI files:

    Write UTF-8 BOM=0
    Write UTF-8 BOM NF=0

    (UE v12.00a/12.10a)


      Jun 26, 2006#2

      Uncheck both Write UTF-8 BOM header to ALL UTF-8 files when saved and Write UTF-8 BOM on new files created within this program (if above is not set) at Configuration - File Handling - Save, then close UltraEdit and restart it. This restart is important. It should now work, provided you have not selected a different Format in the Save As dialog. Tested with v12.10a on WinXP SP2.
      Best regards from an UC/UE/UES for Windows user from Austria


        Jun 26, 2006#3

        I have the same OS and version, and the BOM problem is still there.

        Another example: I have checked "Always create new files as UNICODE". When I create a new file, the BOM bytes are always visible when I switch to hex mode (Alt+H), or when I save the file to disk and view it with F3 in Total Commander.


          Jun 26, 2006#4

          New Unicode (UTF-16) files always get a BOM, because the BOM is what distinguishes UTF-16 LE from UTF-16 BE. The two BOM options apply only to UTF-8, not to UTF-16 (Unicode).

          If you create a new file in Unicode and convert it with File - Unicode/UTF-8 to UTF-8 (Unicode Editing) BEFORE the first save, it works.

          If you open a Unicode file with BOM, convert it to UTF-8 and save it, the BOM is also removed. But if you open a UTF-8 file with BOM and save it, the BOM remains, as explained in the help for the option "Write UTF-8 BOM header to ALL UTF-8 files when saved".
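To check outside the editor whether a saved file really starts with a BOM, a few lines of Python are enough. This is only a minimal sketch: the function name `detect_bom` is made up for illustration, and UTF-32 BOMs are deliberately left out (a UTF-32 LE file would be reported as UTF-16 LE here).

```python
import codecs

# Known BOM byte sequences. UTF-32 is omitted for brevity (assumption:
# only UTF-8 and UTF-16 files are of interest, as in this thread).
BOMS = [
    (codecs.BOM_UTF8, "UTF-8"),          # EF BB BF
    (codecs.BOM_UTF16_LE, "UTF-16 LE"),  # FF FE
    (codecs.BOM_UTF16_BE, "UTF-16 BE"),  # FE FF
]


def detect_bom(data: bytes):
    """Return the encoding name implied by a leading BOM, or None."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name
    return None
```

Reading the first few bytes of a file in binary mode and passing them to `detect_bom` is sufficient; the rest of the file is not needed.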

          For new files you can use the Save As format "UTF-8 - NO BOM". Once selected, this format stays in effect in the Save As dialog until you select a different one there. Up to UltraEdit version 15.20, the Save As format option is the only way to create new files in UTF-8 by default; more precisely, the new Unicode or ASCII file is converted to UTF-8 on first save.

          Look at the format info in the status bar of the UltraEdit window before saving.
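If a UTF-8 file has already been saved with a BOM, the three bytes can also be removed after the fact with a small script. The following is a sketch under assumptions, not an UltraEdit feature: both function names are hypothetical, and the file is rewritten in place, so try it on a copy first.

```python
import codecs


def strip_utf8_bom(data: bytes) -> bytes:
    """Return data without a leading UTF-8 BOM (EF BB BF), if present."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data


def strip_bom_in_file(path: str) -> bool:
    """Remove a UTF-8 BOM from the file at path. Return True if it changed."""
    with open(path, "rb") as f:
        data = f.read()
    stripped = strip_utf8_bom(data)
    if stripped == data:
        return False  # no BOM, file left untouched
    with open(path, "wb") as f:
        f.write(stripped)
    return True
```

Working on raw bytes keeps line endings and all other content exactly as they were.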

          With UltraEdit v16.00 the configuration setting for creation of a new file with Unicode encoding was replaced by a radio button option with 3 choices: ANSI, UTF-8, UTF-16. Furthermore with UE v16.00 the Format option of the Save As dialog is not remembered anymore. It is automatically set to Default for next usage of Save As after saving a (new) file with a new name.


            Aug 15, 2006#5

            OK, but I want my files to be in UTF-8 (without BOM) while I'm writing in them. When saving, it is too late. It would not be too late if it also worked on "Save", not only in the "Save As" dialog. I found that macros can be assigned to the "file load" action, but I don't know how to make a macro that converts to UTF-8...

            My problem is:
            I open the file. It is recognized as "DOS" format. I want to edit it and also write national characters, but that only works in U8-DOS format, so what I see are some strange characters. Using the conversion from ASCII to UTF-8 fixes it, but that is really inconvenient.

            Can you help me?


              Aug 15, 2006#6

              hydrant wrote: I don't know how to make a macro that converts to UTF-8...
              My first suggestion is to assign a hot key to command FileConvASCIItoUTF-8(UNICODE) in the key mapping configuration dialog. This would enable you to make the conversion with a single key stroke. You could also add the command to a toolbar, but the command does not have an icon. This is my best advice.


              My second suggestion is a macro which you can run on every file load. I had to take the following into consideration while writing the macro.

              1) Convert the file only if it is not already a UTF-8 or UTF-16 file. Problem: There is no macro command which I can use to determine the file format.

              2) If no UTF-8 multi-byte character is added, the file should be still an ASCII file on next file open. The option Write UTF-8 BOM header to ALL UTF-8 files when saved must be unchecked to fulfill this requirement.

              3) There is no ASCIItoUTF8 macro command.


              Point 1) is handled by first searching for an appropriate charset (HTML files) or encoding (XML files) declaration. If one of those strings is found, the macro has nothing to do.

              If there is no encoding information, the macro switches to hex mode (if not already active) and copies the first 2 bytes in hex mode into a new ASCII file. A space must be typed into the new file before the paste so that the new file can be switched to hex mode. If the UTF-16 LE BOM (FF FE) is found in this new 3-byte file, the original file is a UTF-8 or UTF-16 file without encoding information.

              If the BOM is not found, the file is either an ASCII file or a binary file. So the macro next searches for a NUL byte (0x00). If a NUL byte is found, the file is handled as a binary file and the macro exits in hex mode.

              As you can see, Unicode (UTF-x) files without BOM or encoding information are really awful for file reading routines. The macro solution is still not perfect, because it fails on binary files without a NUL byte (although these are rare) and on binary files which start with FF FE. But this detection routine should be enough for your purpose.
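The same three-step heuristic (declaration, then BOM, then NUL byte) can be sketched outside UltraEdit. This is an illustration under assumptions, not a translation of the macro: the function name and the returned labels are made up, the search for the declaration is case-insensitive by choice, and the sketch checks the raw on-disk BOMs instead of looking for FF FE in UltraEdit's hex view.

```python
import re

# Matches charset=utf-8 / charset=utf-16 (HTML) and
# encoding="utf-8" / encoding="utf-16" (XML), ignoring case.
DECLARATION = re.compile(rb'(charset=|encoding=")utf-(8|16)', re.IGNORECASE)

# Raw BOMs as stored on disk: UTF-8, UTF-16 LE, UTF-16 BE.
BOMS = (b"\xef\xbb\xbf", b"\xff\xfe", b"\xfe\xff")


def classify(data: bytes) -> str:
    """Classify file content using the heuristic described above."""
    if DECLARATION.search(data):
        return "declared"        # encoding is stated in the file itself
    if data.startswith(BOMS):
        return "bom"             # Unicode file identified by its BOM
    if b"\x00" in data:
        return "binary"          # NUL byte: treat as binary
    return "ascii-or-ansi"       # candidate for conversion to UTF-8
```

It shares the weaknesses named above: a binary file without a NUL byte is taken for text, and a binary file starting with FF FE is taken for a Unicode file.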

              Points 2) and 3) are handled with an extremely dirty trick. The file is an ASCII file (hopefully). To force UltraEdit to open it as a UTF-8 file without BOM, the macro inserts the HTML specification "charset=utf-8" at the top of the file, saves the file, closes it and reopens it. Because of this string, UltraEdit now handles it as a UTF-8 file without BOM.

              This is no real conversion. The macro just lets UltraEdit think it is a UTF-8 file. If your ASCII file contains characters with a hex code greater than 0x7F, those characters will now be handled wrongly and also displayed wrongly. YOU HAVE BEEN WARNED!

              Well, you can avoid this character destruction if you run Replace All commands that search for each character with a hex code greater than 0x7F and replace it with the corresponding multi-byte sequence before saving and closing the ASCII file with the temporary charset=utf-8 string. But which characters those bytes stand for depends on the code page currently used, so I can't make suggestions here. And the order of the Replace All commands is important, so that a byte of an already inserted multi-byte sequence is not converted again.
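To make the code page dependency concrete, here is a small sketch of what a real conversion does to bytes above 0x7F. The function name is hypothetical, and the default code page `cp1252` is only an assumption; the correct one depends on how the file was originally written.

```python
def ansi_to_utf8(data: bytes, codepage: str = "cp1252") -> bytes:
    """Re-encode single-byte ANSI text as UTF-8 (without BOM).

    The code page must match the one the file was written in; the
    default used here is an assumption for illustration only.
    """
    return data.decode(codepage).encode("utf-8")
```

The same byte becomes a different multi-byte sequence depending on the code page: 0xE9 is "é" in cp1252 and becomes C3 A9, while 0xE8 is "č" in cp1250 and becomes C4 8D. Pure ASCII bytes pass through unchanged, which is why an ASCII-only file is already valid UTF-8.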

              After you edit and save your file, no BOM is added, and if the file still contains no character greater than 0x7F, it is in reality still an ASCII file.

              Top
              Find "charset=utf-8"
              IfFound
              Top
              ExitMacro
              EndIf
              Find "encoding="utf-8""
              IfFound
              Top
              ExitMacro
              EndIf
              Find "charset=utf-16"
              IfFound
              Top
              ExitMacro
              EndIf
              Find "encoding="utf-16""
              IfFound
              Top
              ExitMacro
              EndIf
              HexOn
              Clipboard 7
              StartSelect
              Key RIGHT ARROW
              Key RIGHT ARROW
              Key RIGHT ARROW
              Key RIGHT ARROW
              Key RIGHT ARROW
              Copy
              EndSelect
              Key HOME
              NewFile
              UnicodeToASCII
              " "
              HexOn
              Paste
              ClearClipboard
              Clipboard 0
              Key HOME
              Find "FF FE"
              IfFound
              CloseFile NoSave
              HexOff
              ExitMacro
              EndIf
              CloseFile NoSave
              Find "00"
              IfFound
              ExitMacro
              EndIf
              HexOff

              Top
              InsertMode
              "charset=utf-8"
              Clipboard 7
              CopyFilePath
              CloseFile Save
              Open "^c"
              Find "charset=utf-8"
              Delete
              Save
              ClearClipboard
              Clipboard 0

              Again: it is best to convert the file manually to UTF-8 with a hot key when you need it. The macro is not a really good solution.

              What type of files do you edit which need UTF-8 encoding, but without BOM and without an appropriate charset or encoding specification in the file header before the first multi-byte character?

                Aug 15, 2006#7

                Wow, you are a macro guru!
                Mofi wrote: Again: it is best to convert the file manually to UTF-8 with a hot key when you need it. The macro is not a really good solution.
                Now I see, you are right...
                Mofi wrote: What type of files do you edit which need UTF-8 encoding, but without BOM and without an appropriate charset or encoding specification in the file header before the first multi-byte character?
                I'm a web application programmer, so I mainly edit PHP and HTML files. I also use the Smarty template engine (if you know it), and there the BOM is a fatal problem. If I generate an HTML page with a BOM in it, web browsers can't display it properly. Maybe I could configure Apache/Smarty somehow, but unfortunately I don't know how.

                Thank you for your time. It was really informative.

                --hydrant


                  Aug 16, 2006#8

                  hydrant wrote: I'm a web application programmer, so I mainly edit PHP and HTML files.
                  I already suspected that you are working on HTML files. It's true that browsers don't like the BOM. But they also need to know that the file is a UTF-8 file, so you should specify the charset correctly. You have to insert the line with the charset specification in the <head> section before the first occurrence of a UTF-8 multi-byte character (title, description, keywords, ...).

                  See the page Character encodings from the World Wide Web Consortium.
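Whether the charset declaration really comes before the first multi-byte character can be checked mechanically. The following is a rough sketch only: the function name is hypothetical, the regular expression is a simplification that just looks for `charset=utf-8` anywhere in the bytes, and it does not parse HTML.

```python
import re

# Simplified pattern: charset=utf-8, optionally quoted, any letter case.
CHARSET = re.compile(rb"charset\s*=\s*[\"']?utf-8", re.IGNORECASE)


def charset_declared_in_time(html: bytes) -> bool:
    """True if a utf-8 charset declaration precedes the first non-ASCII byte."""
    match = CHARSET.search(html)
    if match is None:
        return False  # no declaration at all
    # Position of the first byte above 0x7F, or None for pure ASCII content.
    first_non_ascii = next((i for i, b in enumerate(html) if b > 0x7F), None)
    return first_non_ascii is None or match.end() <= first_non_ascii
```

A page whose `<title>` contains an accented character above the `<meta>` line fails this check, which is the ordering problem described above.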

                    Aug 16, 2006#9

                    Is it possible to always create new files as UTF-8 without BOM (not UNICODE with BOM as it seems to be an option)?


                      Aug 17, 2006#10

                      johna wrote: Is it possible to always create new files as UTF-8 without BOM (not UNICODE with BOM as it seems to be an option)?
                      No, not directly. But it is possible with a macro trick, because UTF-8 files are mostly HTML or XML files with an appropriate character encoding specification. So first create a template file for HTML and/or XML. Here is an example:


                      <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
                         "http://www.w3.org/TR/html4/loose.dtd">
                      <html>
                      <head>
                       <meta http-equiv="content-type" content="text/html;charset=utf-8">
                       <title></title>
                      </head>
                      <body>
                      </body>
                      </html>
                      This template file is saved with a fixed name and at a fixed location.

                      Next, create a macro named, for example, "NewHtmlTransUTF8" with the 2 macro properties disabled and with the hot key CTRL+N, which overrides the hot key of the FileNew command, or with a different hot key. This macro must be saved in a macro file which is loaded automatically. The macro should contain the following lines (for the example above):

                      Open "path to your template file\template file name"
                      Loop 5
                      Key DOWN ARROW
                      EndLoop
                      Loop 8
                      Key RIGHT ARROW
                      EndLoop
                      SaveAs ""

                      The macro opens the template file, which UltraEdit already detects as a UTF-8 file because of the charset=utf-8 string, sets the cursor between <title> and </title>, and forces you to immediately save the file under a new name.
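The template idea also works outside the editor. As a sketch under assumptions (the function name and both path parameters are hypothetical), the following copies a template to a new file, drops a UTF-8 BOM if the template has one, and refuses to overwrite an existing file, roughly mirroring the forced Save As.

```python
import codecs


def new_from_template(template_path: str, target_path: str) -> None:
    """Create target_path from the template as UTF-8 without BOM."""
    with open(template_path, "rb") as src:
        data = src.read()
    if data.startswith(codecs.BOM_UTF8):
        data = data[len(codecs.BOM_UTF8):]
    # Sanity check: raises UnicodeDecodeError if the template is not UTF-8.
    data.decode("utf-8")
    # Mode "x" fails with FileExistsError instead of overwriting a file.
    with open(target_path, "xb") as dst:
        dst.write(data)
```

Copying the bytes instead of decoded text leaves the template's line endings unchanged.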

                      Such a template file can be a great time saver when working on an HTML or XML project.

                      Instead of a macro, you can also add your template file(s) to File - Favorite Files for quick access. But then don't forget to use Save As first, before Save.