User to user discussion and support for UltraEdit, UEStudio, UltraCompare, and other IDM applications.

Help with writing and running scripts
9 posts Page 1 of 1
I want to use a script to strip the UTF-8 BOM header, but how can I check bytes in hex mode?

For example:
[Attached screenshot: Untitled.jpg]

JavaScript code:

Code:
UltraEdit.activeDocument.hexOn();
UltraEdit.activeDocument.gotoPos(0);
UltraEdit.activeDocument.gotoPosSelect(3);
var sel = UltraEdit.activeDocument.selection;
UltraEdit.outputWindow.write(sel); // <---- The type of sel is string, not byte[], and its value is null! How can I check the selected bytes?

Best regards,
Thanks.
Why do you not run a simple Replace in Files searching for ï»¿ (the UTF-8 BOM as displayed in Windows-1252 encoding) and using an empty replace string?

Or, if the Replace in Files should be independent of the encoding, run a Perl regular expression Replace in Files with the search string \A\xEF\xBB\xBF and an empty replace string.

However, if you really want to make it that complicated, look at this script code:

Code:
UltraEdit.activeDocument.hexOn();
UltraEdit.activeDocument.gotoPos(0);
UltraEdit.activeDocument.gotoPosSelect(3);

// The selected bytes are assigned to a JavaScript
// string which is an array of Unicode characters.
var sBytes = UltraEdit.activeDocument.selection;

if (sBytes.length)
{
   var sByteChars = "Selected: ";
   var sByteCodes = " =";

   // Process the string character by character
   // which means in this case byte by byte.
   for (var nCharIndex = 0; nCharIndex < sBytes.length; nCharIndex++)
   {
      // Get the code value of the character which means
      // the integer value of the selected byte in file.
      var nByteCode = sBytes.charCodeAt(nCharIndex);

      // Convert this code value back to a character assigned to a
      // string of length 1 and append this one-character string to
      // sByteChars representing the bytes as characters. The string
      // built here character by character is the same as sBytes.
      sByteChars += String.fromCharCode(nByteCode);

      // Convert the integer value of the byte to a string using the
      // hexadecimal system and convert the resulting string to upper case.
      // If the byte has a value smaller than 16, insert a leading zero so
      // that all bytes are represented with two hexadecimal digits.
      var sHexCharCode = nByteCode.toString(16).toUpperCase();
      if (sHexCharCode.length < 2) sHexCharCode = '0' + sHexCharCode;
      sByteCodes += ' ' + sHexCharCode;
   }

   // Build additional information for output depending on whether
   // the selected bytes represent a UTF-8 byte order mark or not.
   var sInfo = "\n\nThis is ";
   sInfo += (sBytes == String.fromCharCode(239,187,191)) ? "the UTF-8 BOM." : "anything else.";
   // Output the selected bytes as string and in hexadecimal representation.
   UltraEdit.messageBox(sByteChars + sByteCodes + sInfo);
}

A JavaScript string is a sequence of 16-bit unsigned integer values representing the characters of the Unicode encoded string. Therefore a JavaScript string can also be used as storage for bytes, with each character holding one 8-bit unsigned integer value in the range 0 to 255.
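
The same idea can be tried outside of UltraEdit with plain JavaScript. The following is only a minimal sketch: the function names bytesToHex and startsWithUtf8Bom are hypothetical helpers made up for this illustration, and the code assumes that every character of the string holds exactly one byte (code value 0 to 255).

```javascript
// Hypothetical helpers for illustration; no UltraEdit calls are used.

// Convert a string in which each character holds one byte (0-255) into
// a space-separated list of two-digit uppercase hexadecimal values.
function bytesToHex(sBytes) {
   var asHex = [];
   for (var nIndex = 0; nIndex < sBytes.length; nIndex++) {
      var sHex = sBytes.charCodeAt(nIndex).toString(16).toUpperCase();
      if (sHex.length < 2) sHex = "0" + sHex;   // leading zero for values < 16
      asHex.push(sHex);
   }
   return asHex.join(" ");
}

// Check whether the byte string starts with the three bytes EF BB BF.
function startsWithUtf8Bom(sBytes) {
   return sBytes.slice(0, 3) === String.fromCharCode(0xEF, 0xBB, 0xBF);
}
```

For example, bytesToHex(String.fromCharCode(239, 187, 191, 10)) returns "EF BB BF 0A" and startsWithUtf8Bom returns true for the same string.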
Best regards from Austria
I want to manipulate hex bytes directly instead of running a simple Replace in Files searching for ï»¿, because manipulating bytes can be applied to many other scenarios, not just stripping the UTF-8 BOM.

Now I can read bytes in hex mode. Thanks Mofi!

But now there is a new issue:
After I delete 3 bytes and save the file, I find that the BOM is still there.

Steps to reproduce:

  1. Open a UTF-8 file which has a BOM header, e.g. BOM.txt.
  2. Run test.js:
    Code:
    // test.js
    UltraEdit.activeDocument.hexOn();
    UltraEdit.activeDocument.gotoPos(0);
    UltraEdit.activeDocument.gotoPosSelect(3);
    UltraEdit.activeDocument.hexDelete(3);
    UltraEdit.closeFile(UltraEdit.activeDocument.path, 1);
  3. Reopen BOM.txt; the BOM is still there.
Well, this does not surprise me. UltraEdit detects the file as being UTF-8 encoded with BOM on opening it. Therefore it also saves the file as UTF-8 encoded with BOM. That the script switches to hex mode and deletes the first 3 bytes of the byte stream does not matter for UltraEdit. The file is nevertheless handled as UTF-8 encoded with BOM and not as a binary or ASCII encoded file. So UE adds the BOM again on saving the file.

For removing the UTF-8 BOM it is best to run a Replace in Files without opening the file in UltraEdit at all. A Perl regular expression Replace in Files executed on one or more files is the best method to modify the bytes of a file, regardless of whether the files are binary or text files and which encoding the text files have. Please note that Replace in Files can also be executed on a single file which is not opened in UltraEdit.
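
What such a replace does on the byte level can be sketched in plain JavaScript. This is only an illustration and not UltraEdit code: stripUtf8Bom is a hypothetical function name, the input is assumed to be a string with one byte per character, and because JavaScript regular expressions do not support \A, the sketch uses ^ without the multiline flag, which also matches only at the beginning of the string.

```javascript
// Hypothetical helper for illustration; no UltraEdit calls are used.
// Remove the bytes EF BB BF only if they are the first three bytes.
function stripUtf8Bom(sBytes) {
   // Without the "m" flag, ^ matches only at the very start of the string,
   // which corresponds to \A in the Perl regular expression.
   return sBytes.replace(/^\xEF\xBB\xBF/, "");
}
```

A byte sequence EF BB BF anywhere else in the data is left untouched, which is the reason for anchoring the search at the beginning.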
Best regards from Austria
Hi Mofi, regarding what you mention above:
This means UltraEdit always treats the file as text, no matter whether I use hex mode or not.
Even if I open a file with a .bin extension that coincidentally starts with EF BB BF, UltraEdit will still treat it as text. Right?

This is a little disappointing because, for example, I cannot use UltraEdit as an alternative to WinHex.
In the File - Open dialog window there is the option Open as binary. This option can be used to open any file as binary file directly in hex edit mode without running encoding detection procedures for text files.

For UTF-8 encoded files it is also possible to select ASCII/ANSI as encoding in the File - Open dialog window to interpret the UTF-8 byte stream as an ANSI character stream. This also makes it possible to remove the UTF-8 BOM in text or in hex edit mode.
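
Why the BOM shows up as the three characters ï»¿ when the byte stream is interpreted as ANSI can be illustrated with a small plain JavaScript sketch. The function name bytesAsAnsiText is hypothetical, and the sketch assumes that each byte maps to the Unicode character with the same code value. That holds for Windows-1252 bytes 0xA0 to 0xFF (and for ASCII), but not for bytes 0x80 to 0x9F.

```javascript
// Hypothetical helper for illustration; no UltraEdit calls are used.
// Interpret each byte as the character with the same code value.
// Assumption: valid for ASCII and for Windows-1252 bytes 0xA0-0xFF only.
function bytesAsAnsiText(anBytes) {
   var sText = "";
   for (var nIndex = 0; nIndex < anBytes.length; nIndex++) {
      sText += String.fromCharCode(anBytes[nIndex]);
   }
   return sText;
}

// The UTF-8 BOM bytes followed by the letter A.
var sAnsiView = bytesAsAnsiText([0xEF, 0xBB, 0xBF, 0x41]);
// sAnsiView is "\u00EF\u00BB\u00BFA", displayed as ï»¿A
```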

The file extension does not matter for the binary/text detection, nor for the encoding detection that is run if the file is a text file. It only matters what the file contains (at least the first 64 KiB).
Best regards from Austria
Hi Mofi,
Following your instructions, I can now use a script to remove the UTF-8 BOM in hex mode.
Thanks.

But how can I use this method to remove the UTF-8 BOM in multiple files?
My idea is to use GetListOfFiles.js to get the list of file names and open the files by JavaScript, but ...
How can I open a file with the option "Open as binary" or "ASCII/ANSI as encoding" by script?
It seems that these two options are only available in the UI?
Is there any script API which can do this?
As I have written already several times, you make this task too complicated by removing the UTF-8 BOM in hex edit mode of an opened file. Use a Perl regular expression Replace in Files command which removes UTF-8 BOM from files within a second.

Here is the script code if you really want to use a script for the execution of the single command:

Code:
// Run a Perl regular expression Replace in Files recursively on all
// files in directory C:\Temp, ignoring hidden directories, to remove
// the UTF-8 BOM from each file whose first 3 bytes are EF BB BF.

UltraEdit.frInFiles.directoryStart="C:\\Temp\\";

UltraEdit.perlReOn();
UltraEdit.frInFiles.filesToSearch=0;
UltraEdit.frInFiles.searchInFilesTypes="*";
UltraEdit.frInFiles.ignoreHiddenSubs=true;
UltraEdit.frInFiles.searchSubs=true;
UltraEdit.frInFiles.regExp=true;
UltraEdit.frInFiles.matchCase=true;
UltraEdit.frInFiles.matchWord=false;
UltraEdit.frInFiles.preserveCase=false;
UltraEdit.frInFiles.useEncoding=false;
UltraEdit.frInFiles.logChanges=true;
UltraEdit.frInFiles.openMatchingFiles=false;
UltraEdit.frInFiles.replace("\\A\\xEF\\xBB\\xBF", "");

All you need to do is open this script in UltraEdit, modify the directory path in the first code line, save the script, and run it by clicking Play script or Run active script.
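
A note on the doubled backslashes in the search string of the script: inside a JavaScript string literal, \\ produces a single backslash, so the replace command receives the text \A\xEF\xBB\xBF as the search expression. The following short plain JavaScript sketch shows the difference; the variable names are made up for this illustration only.

```javascript
// Illustration only; variable names are hypothetical.

// With doubled backslashes the string contains the regular expression
// text \A\xEF\xBB\xBF consisting of 14 characters.
var sSearchExpression = "\\A\\xEF\\xBB\\xBF";

// With single backslashes JavaScript itself interprets \xEF, \xBB and
// \xBF, and the string contains just three characters.
var sThreeCharacters = "\xEF\xBB\xBF";
```

Here sSearchExpression.length is 14 while sThreeCharacters.length is 3.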
Best regards from Austria
Thank you very much.