Any way to separate data by bytes?

sum1 · Jan 06, 2014, 07:30 UTC · #1

Hi all,

Is there any way to separate data by bytes?
Bytes! Not characters. A non-English character may be more than one byte.
And my source data is byte-based. So I have to do this.

The numbers of bytes of each column:
8,50,30,30,30,50,50,50,50,30,15,8,1,6,25,25,55,55,55,55,12,30,11,3,30,24,55,55,55,55,12,30,11,3,30,55,55,55,55,3,30,30,10,3,1,1,1,30,1,10,4,1,6,3,8,10,6,30,10,6,6,1,1,1,1,1,1,1,1,30,30,30,1,8,8,8,8,8,8,8,8,8,8,8,8,8,1,10,3,30,4,5,7,7,4,3,30,5,19,19,19,19,20,7,8,9,19,19,8,10,30,50,5,35,3,30,3,30,4,30,4,30,25,4,50,3,30,3,30,30,40,30,40,30,40,30,40,30,40,30,40,6,8,50,50,50,5,25,6,6,70,20,1,5,5,3,10,30,25,5,30,8,3,3,8,8,3,5,3,8,8,6,8,2,2,8,12,3,2,8,1,5,5,1

The goal is to turn the data into TSV or CSV (fields separated by tabs or commas).

Or is there any other tool that can do this?

TIA!

Mofi · Jan 06, 2014, 18:39 UTC · #2

For an ASCII/ANSI text file the number of bytes equals the number of characters, as each character is encoded with just a single byte.

For Unicode (UTF-16) files the number of bytes must be divided by 2 to get the number of characters. Every Unicode encoded text file, including UTF-8 and ASCII Escaped Unicode text files, is converted to UTF-16 Little Endian in memory for viewing/editing, which uses 16 bits (= 2 bytes) per character.

A fixed column file in which every line has the same length can be converted to a CSV file using Column - Convert to Character Delimited.

If you first need to insert the line breaks as well, use the command Insert - String at Every Increment.

sum1 · Jan 07, 2014, 19:22 UTC · #3

Thank you for your hints!

I worked out a procedure and it succeeded when I ran it manually.
Could you (or anyone) write a script for me (if the steps can be scripted)?

Step 1:
Ask the user to enter an unused character (use ! by default) to complement the number of characters.
(If Canceled, quit the script.)
(Since the source data is byte-based and I have to handle it by characters, I'm going to complement the number of characters for the non-ASCII characters, in order to make the number of characters equal to the original number of bytes.)

Step 2:
Replace: (Perl regular expressions)
([^\x00-\x7f])
with:
\1!

Step 3:
Replace: (Perl regular expressions)
([\x{0800}-\x{ffff}])
with:
\1!
(The source data encoding is UTF-8. So these characters are 3 bytes. Another ! should be added.)

(Now the number of characters equals the original number of bytes.)

Step 4:
(Column menu) Convert to Character Delimited command.
Best if the script can use that dialog box and go on. Otherwise this step should be scripted as:
Ask the user to enter the Separator Character (use ^t by default);
(If Canceled, quit the script.)
Ask the user to enter the Field Widths (use 8,50,30 by default);
(If Canceled, quit the script.)
Then convert the data.

(Because steps 2 and 3 could take time, the script should ask the questions of step 4 before carrying out steps 2, 3 and 4.)

Step 5:
Ask the user whether to remove the complement character.
(If Confirmed, Replace the complement character with "".)
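
For illustration, steps 2 and 3 can be thought of as a single pass that appends one pad character for every extra UTF-8 byte of a character. The sketch below is plain JavaScript, not UltraEdit scripting API, and the names `utf8Length` and `padToByteCount` are made up for this example. It also assumes, beyond what the steps above cover, that a character outside the Basic Multilingual Plane (4 bytes in UTF-8) is counted as one character and gets three pad characters.

```javascript
// Number of bytes UTF-8 needs for one Unicode code point.
function utf8Length(codePoint) {
  if (codePoint <= 0x7f) return 1;
  if (codePoint <= 0x7ff) return 2;
  if (codePoint <= 0xffff) return 3;
  return 4;
}

// Append (byte length - 1) pad characters after every character, so that the
// number of characters (code points) equals the number of UTF-8 bytes.
function padToByteCount(text, pad) {
  let result = "";
  for (const ch of text) { // for...of iterates by code point, not by UTF-16 unit
    result += ch + pad.repeat(utf8Length(ch.codePointAt(0)) - 1);
  }
  return result;
}

// "a" is 1 byte, U+00E9 is 2 bytes, U+8BF8 is 3 bytes in UTF-8.
const padded = padToByteCount("a\u00e9\u8bf8", "!");
console.log(padded);
console.log([...padded].length); // 6, the UTF-8 byte count of the input
```

Counting with `[...padded].length` rather than `padded.length` matters only for characters above U+FFFF, which JavaScript stores as two UTF-16 units.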

If there are better solutions, please let me know and amend mine.
Thank you!

Mofi · Jan 11, 2014, 13:30 UTC · #4

Command Convert to Character Delimited is not available as scripting command, but that does not really matter as it can be replaced by a sequence of regular expression replaces.
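
As a rough idea of what such a regular expression replace could look like, here is a sketch in plain JavaScript (not UltraEdit's scripting API; `buildFieldRegex` and `toDelimited` are hypothetical helper names). It builds one expression with a capture group per field width and inserts the separator after each listed field. Like the menu command, it counts characters, not bytes, and it leaves lines that are too short untouched.

```javascript
// Build one regular expression with a capture group per field width,
// for example [8, 3] gives /^(.{8})(.{3})/gm.
function buildFieldRegex(widths) {
  return new RegExp("^" + widths.map((w) => "(.{" + w + "})").join(""), "gm");
}

// Insert the separator after each listed field on every line that is long enough.
function toDelimited(text, widths, separator) {
  return text.replace(buildFieldRegex(widths), (...args) => {
    const fields = args.slice(1, widths.length + 1); // captured groups only
    return fields.join(separator) + separator;
  });
}

const sample = "12345678abcXY\r\n12345678defZ";
console.log(JSON.stringify(toDelimited(sample, [8, 3], "\t")));
// "12345678\tabc\tXY\r\n12345678\tdef\tZ"
```

Because a separator is written after every listed width, the width of the last field should be left out of the list to avoid a trailing separator.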

I would write the script for you, but I do not understand step 1, step 2 and step 3.

Characters in a UTF-8 encoded file are stored with 1, 2, 3 or even 4 bytes. But how the characters are stored in the file on storage media does not really matter if UltraEdit automatically detects the file to be encoded in UTF-8, or you open the file using File - Open and select UTF-8 in encoding/format drop down list before opening the file.

The UTF-8 encoded file is loaded by UltraEdit with conversion to UTF-16 Little Endian in memory, and only the characters are displayed. Each character is now kept in memory with 2 bytes (or even 4 bytes), but that is not important for us, the users, as we now work in UltraEdit with characters.
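
To make the difference between the two encodings concrete, the following sketch prints how many bytes a few sample characters take in UTF-8 (on disk) and in UTF-16 (in memory). It is plain JavaScript and assumes an environment that provides the standard `TextEncoder` (Node.js or a browser), not UltraEdit's script engine; `encodedSizes` is a name invented for this example.

```javascript
const encoder = new TextEncoder(); // TextEncoder always produces UTF-8

// Bytes needed for one character in UTF-8 and in UTF-16.
function encodedSizes(ch) {
  return {
    utf8: encoder.encode(ch).length,
    utf16: ch.length * 2, // JavaScript strings consist of 16-bit UTF-16 units
  };
}

for (const ch of ["A", "\u00e2", "\u8bf8", "\u{1F600}"]) {
  const sizes = encodedSizes(ch);
  console.log("U+" + ch.codePointAt(0).toString(16).toUpperCase(), sizes.utf8, sizes.utf16);
}
// U+41 1 2
// U+E2 2 2
// U+8BF8 3 2
// U+1F600 4 4
```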

It might be that you open the UTF-8 encoded file as ASCII/ANSI file and therefore see the UTF-8 byte streams for characters with a Unicode value greater than 127 as ANSI characters. But in this case your regular expression search of step 3 is absolutely useless as there is no byte (character) with a value greater than 0xFF.

Well, it looks like by appending ! to characters with a value in the range 0 to 127 and doing the same for characters with code values 2048 to 65535, you want the file to contain 2 or 4 bytes for each character. But that will not work, as some Unicode characters in the upper range are already encoded with 4 bytes. A DOS/Windows line termination (carriage return + line-feed) would also be broken by an exclamation mark inserted in between.

I really have a problem understanding why you want to work on the bytes of the UTF-8 encoded file rather than on the characters. To me it looks like you are making this more complicated than necessary.

sum1 · Jan 14, 2014, 20:38 UTC · #5

Let me give an example with some pictures:

The bytes are continuous in the file. To make it clear in the picture, I broke the stream at 0D0A.

The data fields are fixed length in bytes. The solid red lines are where I want to insert the delimiter bytes.

It would be best if there were a way to insert the delimiter bytes into the byte stream directly, which is exactly what I want. But I don't know of such a way. So currently I have to open the files with a text editor and handle the text by characters.

The data fields are fixed length in bytes. But the number of characters could be different. (I highlighted some corresponding bytes and characters with different colors.)

UltraEdit's Convert to Character Delimited command can only handle the text by characters. (And I don't know if regular expressions can handle by bytes.) So I insert ! next to the multi-byte characters according to the numbers of the bytes.
Now the number of characters is equal to the number of the original bytes, which lets UltraEdit's Convert to Character Delimited command insert the delimiters to the right positions.

Any better solutions/tools are welcome.

The example text and its Hex(UTF-8):
(Column Width: 8,30,15,10,13 bytes)

Code:

Field 1 Field 2 ăĕĭŏŭ âêîôû Field3(15bytes)Field 4   Field 5, etc.
123     Any unicode string without tab诸如此类             Field 5, etc.
12345678[---This field is 30 bytes---]エトセトラ[10 bytes]Field 5, etc.

4669656C642031204669656C64203220C483C495C4ADC58FC5AD20C3A2C3AAC3AEC3B4C3BB204669656C64332831356279746573294669656C6420342020204669656C6420352C206574632E0D0A3132332020202020416E7920756E69636F646520737472696E6720776974686F757420746162E8AFB8E5A682E6ADA4E7B1BB202020202020202020202020204669656C6420352C206574632E0D0A31323334353637385B2D2D2D54686973206669656C642069732033302062797465732D2D2D5DE382A8E38388E382BBE38388E383A95B31302062797465735D4669656C6420352C206574632E0D0A
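
The field boundaries in this example can be checked with a short sketch. It takes the first line of the hex dump above (without the trailing 0D0A), cuts it at the byte widths 8,30,15,10,13 and decodes every piece as UTF-8. This is plain JavaScript that assumes the standard `TextDecoder` is available (Node.js or a browser); `hexToBytes` and `splitByByteWidths` are hypothetical helper names.

```javascript
// First line of the posted hex dump, written field by field.
const hexLine =
  "4669656C64203120" +                                             // 8 bytes
  "4669656C64203220C483C495C4ADC58FC5AD20C3A2C3AAC3AEC3B4C3BB20" + // 30 bytes
  "4669656C6433283135627974657329" +                               // 15 bytes
  "4669656C642034202020" +                                         // 10 bytes
  "4669656C6420352C206574632E";                                    // 13 bytes

// Convert a hex string into a byte array.
function hexToBytes(hex) {
  const bytes = new Uint8Array(hex.length / 2);
  for (let i = 0; i < bytes.length; i++) {
    bytes[i] = parseInt(hex.slice(i * 2, i * 2 + 2), 16);
  }
  return bytes;
}

// Cut the byte array at the given byte widths and decode each piece as UTF-8.
function splitByByteWidths(bytes, widths) {
  const decoder = new TextDecoder("utf-8");
  const fields = [];
  let pos = 0;
  for (const width of widths) {
    fields.push(decoder.decode(bytes.subarray(pos, pos + width)));
    pos += width;
  }
  return fields;
}

const fields = splitByByteWidths(hexToBytes(hexLine), [8, 30, 15, 10, 13]);
for (const field of fields) {
  console.log(JSON.stringify(field), [...field].length + " characters");
}
```

The second field is 30 bytes long but holds only 20 characters, which is why splitting by character count puts the delimiters in the wrong places.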

By the way, about the picture above, in case anyone is interested:
All the colored highlighting and the lines in the text were done within EmEditor (another text editor), not with an image editor.

Mofi · Jan 16, 2014, 07:14 UTC · #6

Okay, I will code the script for you at the weekend. It would be good to have an input file (I could create this from your posted data) and an output file with which I can develop and test the script and check whether the output of the script is really what you expect for the input data.

Would you create a small input file and a small output file, pack both with ZIP or RAR and attach the zip/rar file to your next post? That would be best for me.

I'm now thinking that it might be better to switch to hex edit mode for the file, select all, copy the byte stream to user clipboard 9, build the new byte stream in memory in a string variable by inserting the separator character after x, y, z, ... bytes as often as needed, and output the created byte stream into a new file, also in hex edit mode. The new file can then be saved and should be what you want.

Let's see which method is easier to implement as script.
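
The byte stream idea can be sketched outside UltraEdit as follows. This is plain JavaScript working on a `Uint8Array` and is not the UltraEdit script itself; `insertSeparators` is a name chosen for this example, and the separator is assumed to be a single byte such as tab (0x09). The check that stops splitting a line which is shorter than the widths require is an addition of this sketch. Searching for the byte 0x0A is safe in UTF-8 data, because all bytes of a multi-byte character are 0x80 or higher.

```javascript
// Insert separatorByte after each listed field width on every line.
function insertSeparators(bytes, widths, separatorByte) {
  const out = [];
  let pos = 0;
  while (pos < bytes.length) {
    // lineEnd is the index just past the line-feed (or the end of the data).
    const lf = bytes.indexOf(0x0a, pos);
    const lineEnd = lf < 0 ? bytes.length : lf + 1;
    // contentEnd excludes the line terminator (LF or CR+LF).
    let contentEnd = lf < 0 ? bytes.length : lf;
    if (contentEnd > pos && bytes[contentEnd - 1] === 0x0d) contentEnd--;
    for (const width of widths) {
      if (pos + width > contentEnd) break; // line too short: stop splitting it
      for (let i = 0; i < width; i++) out.push(bytes[pos + i]);
      out.push(separatorByte);
      pos += width;
    }
    // Copy the rest of the line including its line terminator.
    while (pos < lineEnd) out.push(bytes[pos++]);
  }
  return Uint8Array.from(out);
}

const input = new TextEncoder().encode("12345678\u8bf8abc\r\n87654321defXYZ\r\n");
const output = insertSeparators(input, [8, 3], 0x09);
console.log(JSON.stringify(new TextDecoder().decode(output)));
```

In the first sample line the 3-byte character U+8BF8 fills the second field on its own, while in the second line the same field holds the three ASCII characters "def".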

sum1 · Jan 16, 2014, 19:29 UTC · #7

Thank you, Mofi.

After discussing this with others over the past few days, I think things might be easier. Maybe inserting the delimiters into the ASCII codes is the most direct and fastest way.

Before you start coding, please take a moment to read this thread and see if there is any useful information in it:

http://www.emeditor.com/forums/topic/an ... -by-bytes/

Mofi · Jan 19, 2014, 13:52 UTC · #8

Seeing that you asked for help in another forum for another editor, and that somebody has already posted a script solution there, dropped my motivation for helping further on this topic to zero.

sum1 · Jan 20, 2014, 21:13 UTC · #9

Sorry to see that, Mofi. But you might have misunderstood what I meant.

I linked to the EmEditor forum above because I just wanted to share what I learned there. The steps of inserting ! would not be needed if we could handle the ASCII codes that way in UltraEdit.
I'm not sure. I see that UltraEdit can do regular expression replacement on ASCII. But UltraEdit displays ASCII continuously, not by lines. So I passed it on to you and thought that maybe you could adapt the script approach to carry out in UltraEdit what that EmEditor script (macro) does.

No, I didn't mean that I had found the ultimate solution. Your help is still needed. And I hope you can understand that I'm comparing methods/tools to choose the better one.

EmEditor does not seem stable for this task. Running that script (macro) on a 5000-line file caused it to crash. (Using regular expression replacement that way might exhaust it. I'll make time to give the details in the EmEditor forum later.)

By comparison, UltraEdit can carry out my manual procedure on more than 5000 lines stably and in an acceptable period of time.

And even if there are other tools suitable for the task, I still want to make a comparison.

So please go ahead, since you have already prepared for it, if you are still willing to help me.
If the example data is not enough for testing, I can make a large simulation file later. (The real data cannot be revealed. So I think it would be better if I test the script myself and tell you the result afterwards.)

But thanks anyway!

Mofi · Jan 25, 2014, 14:28 UTC · #10

Okay, I wrote a small UltraEdit script that works on the byte stream to insert the separator characters. As you have provided neither an input file on which I could run the script nor an output file against which I could verify its output, I can only hope that the script produces the output you want for your files.

Code:

if (UltraEdit.document.length > 0)  // Is any file opened?
{
   do
   {  // Ask the script user for the separator character.
      var sSeparator = UltraEdit.getString("Please enter separator character.\nFor tab character enter ^t or \\t.\nTab is used if nothing entered.",1);

      // Use horizontal tab character if script user enters nothing.
      if (sSeparator.length == 0) sSeparator = "\t";

      // Replace the strings "^t" and "\t" by horizontal tab character.
      else if ((sSeparator == "^t") || (sSeparator == "\\t")) sSeparator = "\t";

      // In case of user entered a longer string, inform the user about the
      // mistake and request entering the separator character once again.
      if (sSeparator.length > 1)
      {
         UltraEdit.messageBox("The string \""+sSeparator+"\" is an invalid separator definition.\n\nPlease enter just 1 character or ^t or \\t.");
      }
   }
   while (sSeparator.length > 1);

   do
   {  // Ask the script user for the field widths separated by commas.
      var sFieldWidths = UltraEdit.getString("Please enter the fields widths.\n\nDefault is 8,50,30 if nothing entered.",1);

      // If script user has nothing entered, use default 8,50,30.
      if (sFieldWidths.length == 0) sFieldWidths = "8,50,30";

      // Validate if user has not entered an invalid string
      // containing a character which is not a digit or a comma.
      var bInvalidString = (sFieldWidths.search(/[^\d,]/) >= 0) ? true : false;

      if (!bInvalidString)  // Does the string contain only digits and commas?
      {
         // A string consisting only of commas and/or zeros is also not allowed.
         if (sFieldWidths.search(/[1-9]/) < 0) bInvalidString = true;
      }

      // In case the user entered an invalid character or something not useful
      // like only commas, zeros or a mixture of commas and zeros, inform the
      // user about the mistake and request entering the field widths again.
      if (bInvalidString)
      {
         UltraEdit.messageBox("The string \""+sFieldWidths+"\" is not a valid field width definition.\n\nPlease enter just numbers greater 0 separated by commas.");
      }
   }
   while (bInvalidString);

   // Convert the field width string into an array of numbers.
   var asFieldWidths = sFieldWidths.split(",");
   var anFieldWidths = new Array();
   for (var nNumber = 0; nNumber < asFieldWidths.length; nNumber++)
   {
      // Ignore empty strings caused by 2 or more commas in series.
      if (!asFieldWidths[nNumber].length) continue;
      // Convert decimal number as string into a real number.
      var nNextFieldWidth = parseInt(asFieldWidths[nNumber],10);
      // Ignore numbers with value 0.
      if (nNextFieldWidth == 0) continue;
      // Append the next field width number to the number array.
      anFieldWidths.push(nNextFieldWidth);
   }

   // Define environment for this script.
   // The script works on the file in hex edit mode.
   UltraEdit.insertMode();
   if (typeof(UltraEdit.columnModeOff) == "function") UltraEdit.columnModeOff();
   else if (typeof(UltraEdit.activeDocument.columnModeOff) == "function") UltraEdit.activeDocument.columnModeOff();
   if(!UltraEdit.activeDocument.hexMode) UltraEdit.activeDocument.hexOn();

   // Hex edit mode cannot be enabled on an empty file.
   if (UltraEdit.activeDocument.hexMode)
   {
      // Get content of file with user clipboard 9 as byte stream into variable.
      UltraEdit.activeDocument.selectAll();
      UltraEdit.selectClipboard(9);
      UltraEdit.activeDocument.copy();
      var sByteStream = UltraEdit.clipboardContent;
      UltraEdit.clearClipboard();
      // Move caret to top of the active file discarding the selection.
      UltraEdit.activeDocument.top();

      // Copy the byte stream back to user clipboard 9, but with inserting
      // the separator character in each line according to the field widths.
      var nBytePos = 0;
      do
      {
         // The bytes of current line are copied with inserting the separator
         // character. There is no check if a line-feed is within the bytes
         // because one line is too short according to the field widths.
         for(nNumber = 0; nNumber < anFieldWidths.length; nNumber++)
         {
            // Copy from current position in byte stream the number
            // of bytes according to next field width number.
            UltraEdit.clipboardContent += sByteStream.substr(nBytePos,anFieldWidths[nNumber]);
            nBytePos += anFieldWidths[nNumber];
            // Break this loop if end of byte stream reached
            // which would be unexpected here in this loop.
            if (nBytePos >= sByteStream.length) break;
            // Append the separator character.
            UltraEdit.clipboardContent += sSeparator;
         }

         // After the last field width copy the rest of the line up to the next
         // line-feed character. So the width of the last field need not be
         // entered at all. Just carriage return (if present) and line-feed
         // are copied if the width of the last field was entered as well.
         if (nBytePos < sByteStream.length)
         {
            // Find the next line-feed character terminating current line.
            var nNextLineFeed = sByteStream.indexOf("\n",nBytePos);
            // If found in byte stream, increase returned position of line-feed.
            if (nNextLineFeed >= 0) nNextLineFeed++;
            // Else the file ends with no line termination and rest of file must be copied.
            else nNextLineFeed = sByteStream.length;
            // Append the rest of the line to current byte stream in clipboard.
            UltraEdit.clipboardContent += sByteStream.substring(nBytePos,nNextLineFeed);
            nBytePos = nNextLineFeed;
         }
      }
      while (nBytePos < sByteStream.length);

      // Create a new file and make sure it is an ASCII file with DOS line terminators.
      UltraEdit.newFile();
      UltraEdit.activeDocument.unixMacToDos();
      UltraEdit.activeDocument.unicodeToASCII();

      // Insert 1 space and switch to hex edit mode.
      UltraEdit.activeDocument.write(" ");
      UltraEdit.activeDocument.hexOn();

      // Select the space character and replace it with the new byte stream.
      UltraEdit.activeDocument.selectAll();
      UltraEdit.activeDocument.paste();
      UltraEdit.activeDocument.top();

      // Clear the clipboard with new byte stream and reselect Windows clipboard.
      UltraEdit.clearClipboard();
      UltraEdit.selectClipboard(0);
   }
}