What's the best default new file format?

Newbie

    Feb 09, 2015 #1

    I often work with Chinese characters, so I tried changing the new file default to UTF-8. But then I found that when I created DOS batch files (which don't need those characters), garbage characters were placed at the beginning of the file, which choked the command processor.

    I don't understand file formats or encoding that well, but is there a default new file format (and encoding) that will handle extended characters and also be used to create DOS batch files?

    Grand Master

      Feb 09, 2015 #2

      Your question can't really be answered in general, as it depends on which file types you work with most.

      First read the IDM power tip Working with Unicode in UltraEdit/UEStudio and the Wikipedia article about character encoding. I added lots of links to my text below. You should follow them and carefully read the referenced pages.

      HTML, XHTML and XML files are very often encoded in UTF-8 without a byte order mark (BOM). Those files contain a charset declaration (HTML and XHTML) or an encoding declaration (XML) at the top. UltraEdit recognizes those declarations and opens UTF-8 encoded files without BOM as Unicode files if the declaration is found within the first 64 KB of the file when using UltraEdit for Windows < v24.10 or UEStudio < v17.10.
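
To illustrate the idea of looking for a declaration in the first 64 KB, here is a minimal Python sketch. It is not UltraEdit's actual detection logic; the function name `declared_encoding` and both regular expressions are my own assumptions, chosen only to show the principle.

```python
import re

SCAN_LIMIT = 64 * 1024  # inspect only the first 64 KB, as described above

# Hypothetical patterns (assumptions, not UltraEdit's real rules):
# an XML encoding declaration and an HTML/XHTML charset declaration.
XML_DECL = re.compile(
    rb'<\?xml[^>]*encoding\s*=\s*["\']([A-Za-z0-9._-]+)["\']', re.IGNORECASE)
HTML_META = re.compile(
    rb'<meta[^>]*charset\s*=\s*["\']?([A-Za-z0-9._-]+)', re.IGNORECASE)


def declared_encoding(data: bytes):
    """Return the declared encoding name in lower case, or None."""
    head = data[:SCAN_LIMIT]
    for pattern in (XML_DECL, HTML_META):
        match = pattern.search(head)
        if match:
            return match.group(1).decode("ascii").lower()
    return None


if __name__ == "__main__":
    print(declared_encoding(b'<?xml version="1.0" encoding="UTF-8"?><root/>'))
    print(declared_encoding(b'<html><head><meta charset="utf-8"></head></html>'))
    print(declared_encoding(b"plain text without any declaration"))
```

A declaration placed after the first 64 KB is not seen by this sketch, which mirrors the limitation described above.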

      To create new files by default with UTF-8 encoding and without the 3 bytes of the byte order mark added on first save, both of the settings
      • Write UTF-8 BOM header to all UTF-8 files when saved
      • Write UTF-8 BOM on new files created within this program (if above is not set)
      must be unchecked at Advanced - Configuration - File Handling - Save. And of course, in the Save As dialog opened on first save, Default or UTF-8 - NO BOM must be selected as encoding/format so that the new file is really saved as a UTF-8 encoded file without BOM.
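
For anyone who wants to see what those 3 bytes are, the following Python sketch (outside of UltraEdit, using only the standard `codecs` module) encodes the same text with and without the BOM. The sample text is just an assumption for illustration.

```python
import codecs

text = "<p>中文</p>\n"  # sample text, an assumption for illustration

with_bom = text.encode("utf-8-sig")  # UTF-8 with the 3-byte BOM prepended
without_bom = text.encode("utf-8")   # UTF-8 without BOM

# The BOM is exactly the three bytes EF BB BF at the start of the file.
print(with_bom[:3].hex(" "))
print(with_bom[:3] == codecs.BOM_UTF8)

# Apart from the BOM, both byte sequences are identical.
print(with_bom[3:] == without_bom)
```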

      A UTF-8 encoded file without an appropriate declaration, without a BOM and without any character with a code value greater than 127 is byte-for-byte identical to the same file encoded with, for example, Windows-1252. A text editor therefore can't determine which encoding the user wants for such a file. So be careful about using UTF-8 without BOM by default for all types of text files.
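
This can be checked quickly with a few lines of Python. The sample strings are assumptions; the point is only that pure ASCII text produces the same bytes in both encodings, while a single extended character does not.

```python
ascii_only = "@echo off\r\necho Hello\r\n"  # only code values 0 to 127
with_umlaut = "echo Grüße\r\n"               # contains characters above 127

# Identical bytes: nothing in the file reveals the intended encoding.
same_for_ascii = ascii_only.encode("utf-8") == ascii_only.encode("cp1252")

# Different bytes as soon as one extended character is present.
same_for_umlaut = with_umlaut.encode("utf-8") == with_umlaut.encode("cp1252")

print(same_for_ascii)
print(same_for_umlaut)
```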

      Batch files require a completely different character encoding than other text files. The reason is the use of the OEM character set in console windows. Open a command prompt window, enter chcp, press RETURN or ENTER, and you will see which code page the console window uses by default according to your Windows language configuration. Typical in Western European and North American countries are code page 850 and code page 437, which encode the characters in the upper half of the table (code values greater than 127) completely differently from the very common code page Windows-1252 usually used for single-byte encoded text files on Windows. So batch files are not Unicode files, but files with just 1 byte per character (= maximum of 256 different characters) using by default a code page that is usually not used anymore for text files.
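
The difference between the OEM and the ANSI code page can be made visible with a short Python sketch. It assumes code page 850 for the console and Windows-1252 for the text file; on your machine chcp may report a different code page.

```python
text = "äöüÄÖÜß"

oem = text.encode("cp850")    # bytes a console using code page 850 expects
ansi = text.encode("cp1252")  # bytes of a typical Windows-1252 text file

print("cp850: ", oem.hex(" "))
print("cp1252:", ansi.hex(" "))

# What the console would display if the batch file had been saved
# as Windows-1252 but is interpreted with code page 850.
misread = ansi.decode("cp850")
print(misread)
```

The same characters get different byte values in the two code pages, which is why a batch file saved as Windows-1252 shows wrong characters in the console window.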

      This is one reason why I have a separate configuration for files with extension bat or cmd. When I want to create a new batch file, I create a new file (ANSI, Windows-1252) and save it with file extension bat before doing anything else. This immediately changes the font, the font size and the encoding used, as the OEM Character Set feature of UltraEdit is automatically enabled for this file. The tab stop and indent values are also automatically set for batch file editing (4 spaces per tab). Now I can write comments, and text output by the batch file to the console window or into a file via redirection, using characters in the upper half of the character table (mainly German characters like äöüÄÖÜß). UltraEdit automatically inserts them into the batch file with the code values of code page 850 instead of Windows-1252, which all other non-Unicode files opened in the same instance of UltraEdit still use.

      Are you interested in a special configuration for batch files?

      If yes, read Different font depending on file extension. It is also advisable to have the command OEM Character Set in a customized toolbar (follow the link at the bottom of the referenced topic). I don't need to click on this command to enable OEM Character Set for batch files, as it is enabled automatically for *.bat and *.cmd because of my configuration. But it is very good to have this command in the toolbar to see whether OEM Character Set is currently enabled for the active file.
      Best regards from an UC/UE/UES for Windows user from Austria

      Newbie

        Feb 11, 2015 #3

        Wow, thanks for all of the information. I have attempted to digest it all and I think I understood everything except for the OEM character set stuff. I will say that your reply and the power tip page were far better written than the Wikipedia page.

        I mostly use UltraEdit to create files that are read only by UltraEdit, though occasionally I create either DOS batch files or other programming files (Java, Perl, etc.), but not much raw HTML.

        What I would like to do is find a file format or process that will handle extended (Chinese) characters, but also a way to easily create DOS batch files (and other files) which don't need them. I tried using the new file default of UTF-8 (with the BOM), but nothing I did with special configurations and OEM Character sets would make changing the file to a BAT extension get rid of the BOM. My guess is that I have misunderstood things or that's just not possible. I understand that I can convert the files, but invariably I forget until I see the command processor choke the first time I run things.

        Can you recommend something that will help me?

        Power User

          Feb 11, 2015 #4

          This might not be the prettiest solution but it should work.

          With the setting for new files set to ASCII, create a new blank file (or add a comment line at the top) and save it as New_ASCII.txt. This becomes your blank file for creating new files without UTF8 encoding.

          Then change the setting for new files to UTF8. New files will be created in this format.

          When you want to create a new non-UTF8 file, open New_ASCII.txt and do a Save As to the new name of your file.

          Grand Master

            Feb 12, 2015 #5

            Mick, that is a good idea for this use case. I suggest adding New_ASCII.txt to the list of favorite files. This makes it possible to open it quickly using File - Favorite Files or Favorites on pane Lists of File View.

            JFord, be careful with using a BOM for UTF-8 encoded files. Many other interpreters and compilers also do not recognize those 3 bytes as encoding information and output an error because of the UTF-8 BOM. For example, the PHP interpreter does not support a UTF-8 BOM, just like the command line interpreter of Windows.
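
If a batch or script file has already been saved with a UTF-8 BOM, the 3 bytes can also be removed outside of UltraEdit. The following Python sketch is only an illustration; the function names `strip_utf8_bom` and `strip_bom_in_file` are my own, and the file name in the usage note is hypothetical.

```python
import codecs
from pathlib import Path


def strip_utf8_bom(data: bytes) -> bytes:
    """Return data without a leading UTF-8 BOM (EF BB BF), if present."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data


def strip_bom_in_file(path: Path) -> bool:
    """Rewrite the file without BOM; return True if a BOM was removed."""
    original = path.read_bytes()
    cleaned = strip_utf8_bom(original)
    if cleaned != original:
        path.write_bytes(cleaned)
        return True
    return False
```

It could be used, for example, as `strip_bom_in_file(Path("build.bat"))` before running the batch file. Note that this removes only the BOM; extended characters in the file remain UTF-8 encoded.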

            There is a special setting to load all files not encoded as UTF-16 LE or UTF-16 BE as UTF-8 encoded files, even if there is no UTF-8 BOM, no UTF-8 character set or encoding declaration, and also no UTF-8 encoded character in the first 64 KB of the file when using UltraEdit for Windows < v24.10 or UEStudio < v17.10. This special setting must be added manually to uedit32.ini. But if you often view or edit batch files which also contain extended characters, it is not advisable to use it.

            Note: There is View - Views/Lists - ASCII Table. As long as only characters from 0 to 127 in this table are used, the binary representation of the file is the same with UTF-8 encoding, an OEM code page or an ANSI/Windows code page.

            Newbie

              Feb 12, 2015 #6

              I will consider that workaround. Given that there isn't a direct solution to my problem I will think about what is the best way to remind me that I need to change formats. Often I am a bear of little brain.

              I would like to post another related question here in this thread, but if you think it needs its own topic then I will do that instead.

              Today I needed to get a list of files from another machine many of which have Chinese characters. I used the DOS dir command and piped the output to a text file. I opened that output file in Notepad on that machine and it looked perfect, but I could not figure out how to open the same file on my machine through UltraEdit so that the characters would appear correctly. The DOS code page on the source machine was 936. In the end I just used MS Word as a conduit, but what options in the UltraEdit Open File dialog should I have used?

              Grand Master

                Feb 17, 2015 #7

                Since UE v19.00, you can select the code page for the active file in the status bar at the bottom, unless the basic status bar is used (a configuration setting, not enabled by default).

                So you have to select ANSI - 936 first. But that does not change the display of the text. That must be done with View - Set Font. Alternatively, after setting code page 936, convert the file to Unicode, as MS Word and MS Notepad do automatically, for example by clicking once more on the encoding/code page selector in the status bar and now selecting Unicode - UTF16LE. As long as the font in use supports those characters in the Unicode table, the characters are then displayed correctly.
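
The same conversion, interpreting the bytes with code page 936 and then storing them as UTF-16 LE, can be sketched in Python for files that should be converted before opening them. This is only an illustration and not what UltraEdit does internally; the function name `cp936_to_utf16le` and the sample file name are assumptions.

```python
import codecs


def cp936_to_utf16le(data: bytes) -> bytes:
    """Decode bytes as code page 936 and re-encode as UTF-16 LE with BOM."""
    text = data.decode("cp936")
    return codecs.BOM_UTF16_LE + text.encode("utf-16-le")


if __name__ == "__main__":
    # Simulates one line of dir output redirected on a machine using code page 936.
    raw = "文件.txt\r\n".encode("cp936")
    converted = cp936_to_utf16le(raw)
    print(converted[:2].hex(" "))     # the UTF-16 LE byte order mark
    print(converted.decode("utf-16"))  # readable text again
```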

                See also my post on topic Message about change of font and/or script settings in font dialog.