How can I match the nth occurrence of something?

layperson · Feb 03, 2019#12019-02-03T15:42+00:00

Hi.

I have about 1,000 long lines of HTML look like the one below.

5351 "aCombinationOption" : [{"id":87116473,"price":0,"stockQuantity":509,"regOrder":0,"optionName1":"1. ABC","optionName2":null,"optionName3":null,"optionName4":null,"optionName5":null,"todayDispatch":false},{"id":87116474,"price":50,"stockQuantity":313,"regOrder":1,"optionName1":"DEF","optionName2":null,"optionName3":null,"optionName4":null,"optionName5":null,"todayDispatch":false},{"id":87116475,"price":0,"stockQuantity":1509,"regOrder":2,"optionName1":"3. GHI","optionName2":null,"optionName3":null,"optionName4":null,"optionName5":null,"todayDispatch":false},{"id":87116476,"price":12000,"stockQuantity":9530,"regOrder":3,"optionName1":"JKL","optionName2":null,"optionName3":null,"optionName4":null,"optionName5":null,"todayDispatch":false}],

I want to select the first id numbers in every line using regular expression one by one in UEStudio v18.20.0.18.

1. How can I select the first one 87116473?

a) (?<=\"id\":)\d+.*?
b) \"id\":\d+.*?
c) (?:\"id\"(\d+)){1}

The a) and b) above select four things highlighted in orange color and c) doesn't work.

I know I can select the first one by including the square bracket in the regular expression like (?<=\[{\"id\":)\d+ but I'd like to select them without it.

2. Is it possible to select the second, third and fourth each by using regular expression in ever line of HTML?

** ABC, DEF, GHI, and JKL are product names in the drop down list box.

3. I told you I have about 1,000 lines look like the one above.

I want to count the number of IDs each line.

There are four ids in the first line of HTML and I want UE to type in four in the first line, the next number of ID numbers in the second line, and so on.

I know the #3 question is not good because I am just asking the answer without studying any computer programming language. lol

I am sure I will do study in the middle of Feb this year!

Any help will be greatly appreciated.

Mofi · Feb 04, 2019#22019-02-04T07:18+00:00

The data are obviously JSON data in a single line. Do you know the commands Reformat JSON and Compress JSON as well as JSON manager introduced with UltraEdit for Windows v24.10? I recommend using Reformat JSON to make the JSON data more readable and finally after editing Compress JSON can be used to compact the data.

Let us look on: (?<=\"id\":)\d+.*?

This Perl regular expression is searching for one or more digits after the string "id": by using a positive lookbehind which does not match (select) characters. The double quote has no special meaning in Perl regular expressions. So it is not necessary to escape each of them with a backslash. The expression .*? is useless here because it matches non-greedy 0 or more characters except new line characters. For a positive search after matching already one or more digits it is always enough that .*? matches nothing. So this expression part is really useless here. The search expressions can be also just (?<="id":)\d+ for matching the integer value of element id.

Let us look next on: \"id\":\d+.*?

Well, it is nearly the same search expression than first one with the difference that the string "id": is also matched/selected because of not using a positive lookbehind. Again using just "id":\d+ is absolutely enough to find the same strings.

Let us look last on: (?:\"id\"(\d+)){1}

This search expression does not find anything because of : is missing in search string which is always between "id" and the integer number. The non-capturing group with the multiplier {1} for matching the expression inside the group exactly one times is completely useless here. {1} is the default for an expression. So the usage of {1} makes most often not sense, especially not here for the JSON data. {0,1} makes often sense which means zero or one times like ? after an expression group or a single, literally interpreted character. {1,} makes also often sense which means one or more times like multiplier + and of course something like {1,5} which means one to five times.

I recommend to read power tip Perl regular expression using backreferences how 1 to 9 (soon more after release of UltraEdit for Windows v26) capturing groups can be used to find a string and reformat the found strings.

It is unclear what you want to do with the found identifier numbers and the product names you also want to find.

Do you want to extract those data into a new file formatted and saved as CSV file?

That would be a classic JSON data to CSV data task.

A Perl regular expression replace cannot do mathematical operations and string to integer and integer to string conversions like finding a number string, converting the string into an integer value using decimal system, modify the integer value with a mathematical operation, convert the resulting integer value to a string and replace the found string by the new string. It is also not possible that a Perl regular expression replace inserts a string representing an integer value which the replace increments internally by one. The replace function is for replacing a found string by another string. That's it. For a task like searching for an integer number string being replaced by a number string representing an integer value incremented by one on each find occurrence it is necessary to use an UltraEdit script and the script code contains the additional instructions to initialize an integer variable with a specific value, searches for a string in file, increments the integer value on successful find, converts the integer value into a string and replaces the found string by the new string created by the script code.

Please post how the example data should look like after modification respectively what should be written into a new file and how this new file should look like after finishing data processing. Then we can suggest something on how to process the input data to wanted output data.

layperson · Feb 04, 2019#32019-02-04T09:02+00:00

Hi,

Thank you very much for your concern and help, Mofi.

I didn't even know that the escape letter is not necessary for double quote(") in Perl.

5351 "aCombinationOption" : [{"id":87116473,"price":0,"stockQuantity":509,"regOrder":0,"optionName1":"1. ABC","optionName2":null,"optionName3":null,"optionName4":null,"optionName5":null,"todayDispatch":false},{"id":87116474,"price":50,"stockQuantity":313,"regOrder":1,"optionName1":"DEF","optionName2":null,"optionName3":null,"optionName4":null,"optionName5":null,"todayDispatch":false},{"id":87116475 ,"price":0,"stockQuantity":1509,"regOrder":2,"optionName1":"3. GHI","optionName2":null,"optionName3":null,"optionName4":null,"optionName5":null,"todayDispatch":false},{"id":87116476,"price":12000,"stockQuantity":9530,"regOrder":3,"optionName1":"JKL","optionName2":null,"optionName3":null,"optionName4":null,"optionName5":null,"todayDispatch":false}],

4419 "aCombinationOption" : [{"id":896834,"price":0,"stockQuantity":99089,"regOrder":0,"optionName1":"KKK","optionName2":null,"optionName3":null,"optionName4":null,"optionName5":null,"todayDispatch":true},{"id":814835,"price":10400,"stockQuantity":9470,"regOrder":1,"optionName1":"LLL","optionName2":null,"optionName3":null,"optionName4":null,"optionName5":null,"todayDispatch":true},{"id":514836,"price":31300,"stockQuantity":99844,"regOrder":2,"optionName1":"MMM","optionName2":null,"optionName3":null,"optionName4":null,"optionName5":null,"todayDispatch":true},{"id":464837,"price":13000,"stockQuantity":99659,"regOrder":3,"optionName1":"NNN","optionName2":null,"optionName3":null,"optionName4":null,"optionName5":null,"todayDispatch":true},{"id":894838,"price":37400,"stockQuantity":99963,"regOrder":4,"optionName1":"OOO","optionName2":null,"optionName3":null,"optionName4":null,"optionName5":null,"todayDispatch":true}],

These are only two lines in the file with the similar pattern of about 1,000 lines.

I want to extract id numbers highlighted in color and the value after optionName1 highlighted in green.

For numbers:

The first occurrence of the id numbers is:

87116473 <-- first one in the first line
896834 <-- first one in the second line

The second occurrence of id numbers:

87116474 <-- second one in the first line
814835 <-- second one in the second line

and so on...

As you see, the second line has more one more id number than the first.

So, the fifth occurrence of id number:

null(or no value) <--- the fifth one in the first line
894838 <--- the fifth one in the second line

For the value after optionName1:

The first occurrence of optionName1:

ABC <--- the first value of optionName1 in the first line
KKK <--- the first value of optionName1 in the second line

The second occurrence of optionName1:

DEF <---- you know...
LLL <---- you know...

The second line has one more optionName1 like the id number above.

After extracting each value, I want to put them into Microsoft Excel as shown in attached image.

How can I extract each one?

fleggy · Feb 04, 2019#42019-02-04T11:02+00:00

Hi,

you can parse the text if you know the maximum count of "id:" per one line. For example lets say that there are up to 6 id's per line.

Find:

Code: Select all

^(?:(?:(?!{"id":).)*+){"id":(\d+)(?:(?:(?!"optionName1":).)*+)"optionName1":"([^"]++)"(?:(?:(?!{"id":).)*+)(?:(?:(?:(?!{"id":).)*+){"id":(\d+)(?:(?:(?!"optionName1":).)*+)"optionName1":"([^"]++)"(?:(?:(?!{"id":).)*+))?(?:(?:(?:(?!{"id":).)*+){"id":(\d+)(?:(?:(?!"optionName1":).)*+)"optionName1":"([^"]++)"(?:(?:(?!{"id":).)*+))?(?:(?:(?:(?!{"id":).)*+){"id":(\d+)(?:(?:(?!"optionName1":).)*+)"optionName1":"([^"]++)"(?:(?:(?!{"id":).)*+))?(?:(?:(?:(?!{"id":).)*+){"id":(\d+)(?:(?:(?!"optionName1":).)*+)"optionName1":"([^"]++)"(?:(?:(?!{"id":).)*+))?(?:(?:(?:(?!{"id":).)*+){"id":(\d+)(?:(?:(?!"optionName1":).)*+)"optionName1":"([^"]++)"(?:(?:(?!{"id":).)*+))?.*$

Replace:

Code: Select all

$1, $3, $5, $7, $9, $11, "$2", "$4", "$6", "$8", "$10", "$12"

Such replace creates from your "two line sample" this:

87116473, 87116474, 87116475, 87116476, , , "1. ABC", "DEF", "3. GHI", "JKL", "", ""

896834, 814835, 514836, 464837, 894838, , "KKK", "LLL", "MMM", "NNN", "OOO", ""

BR, Fleggy

layperson · Feb 05, 2019#52019-02-05T05:29+00:00

Hi, Fleggy.

Thanks for your help but I am afraid your regular expression doesn't work.

It just gives an error "The complexity of matching the expression has exceeded available resources."

Looks like it's way too long for UES to run.

I am still googling.

Thanks anyway. : )

fleggy · Feb 05, 2019#62019-02-05T10:11+00:00

Hmmm, "The complexity of..." error is a very annoying bug in last few UE(S) versions. Unfortunately even the approaching version 26 suffers from this bug.
I tested this regex in Notepad++ (no access to UE at work) which also uses Boost regex library and 1024 lines processing was very fast.

And what about this? Does it work in UES?

Find:

Code: Select all

^.+?{"id":(\d+).+?"optionName1":"([^"]++)"(?:.+?{"id":(\d+).+?"optionName1":"([^"]++)")?(?:.+?{"id":(\d+).+?"optionName1":"([^"]++)")?(?:.+?{"id":(\d+).+?"optionName1":"([^"]++)")?(?:.+?{"id":(\d+).+?"optionName1":"([^"]++)")?(?:.+?{"id":(\d+).+?"optionName1":"([^"]++)")?.*$

Replace:

Code: Select all

$1, $3, $5, $7, $9, $11, "$2", "$4", "$6", "$8", "$10", "$12"

layperson · Feb 05, 2019#72019-02-05T16:57+00:00

Thank you very much, fleggy

The second regular expression works in UE. (I use UES, too.) The first one works well in Notepad++. I sometimes use Notepad++, but mostly UES.

I understand how your regular expression works and I think it's very smart way of solving the issue. However, the problem is there are some lines with more than 20 optionNames.

I don't think UE or Notepad++ supports 20 or more backreferences.

I think, I need to use JavaScript.

fleggy · Feb 05, 2019#82019-02-05T18:46+00:00

layperson wrote: However, the problem is there are some lines with more than 20 optionNames.

I don't think UE or Notepad++ supports 20 or more backreferences.

Hi layperson,

I am really glad that the second pattern works for you. You can expand the pattern with more blocks (?:.+?{"id":(\d+).+?"optionName1":"([^"]++)")? until the maximum length of the pattern is not crossed (30000 characters).
I don't know how many groups can be backreferenced but I successfully used $256 in my test in UES18.20.0.18/UE25.20.0.166 :)

BR, Fleggy

Mofi · Feb 05, 2019#92019-02-05T20:37+00:00

Here is a script solution for this data extraction task.

Code: Select all

var sSeparator = ",";   // Separator to use in CSV file created by the script.

if (UltraEdit.document.length > 0)  // Is any file opened?
{
   // Define environment for this script.
   UltraEdit.insertMode();
   if (typeof(UltraEdit.columnModeOff) == "function") UltraEdit.columnModeOff();
   else if (typeof(UltraEdit.activeDocument.columnModeOff) == "function") UltraEdit.activeDocument.columnModeOff();
   UltraEdit.activeDocument.selectAll();
   if (UltraEdit.activeDocument.isSel())
   {
      // Get the data of interest with unwanted data between
      // and all line-feeds into an array of strings.
      var asData = UltraEdit.activeDocument.selection.match(/(?:"id":\d+[\t ]*?(?:,".+?){3},".+?:".+?"|\n)/g);
      UltraEdit.activeDocument.top();  // Cancel the selection.
      if (asData)                      // Was anything found at all?
      {
         var asDataValues  = [];   // array of arrays each with two arrays with variable number of strings
         var asIdentifiers = [];   // array of identifier strings in current line/row
         var asOptionNames = [];   // array of option name strings in current line/row
         var nTotalRowCount = 0;   // total number of data rows
         var nTotalColCount = 0;   // total number of data columns or more precise total
                                   // number of identifiers and option names per data row

         // Run the loop on each found string in data array.
         for (var nDataIndex = 0; nDataIndex < asData.length; nDataIndex++)
         {
            // Is the current string not a line-feed?
            if (asData[nDataIndex] != "\n")
            {
               // Use a regular expression to get just the wanted data
               // (identifier, option name) separated by a horizontal tab.
               asData[nDataIndex] = asData[nDataIndex].replace(/"id":(\d+)[\t \r\n]*?(?:,".+?){3},".+?:"(?:\d+\. )?(.*?)"/,"$1\t$2");
               // Add to array of identifiers of current line the identfier.
               asIdentifiers.push(asData[nDataIndex].replace(/(\d+).+$/,"$1"));
               // Add to array of option names of current line the option name.
               asOptionNames.push(asData[nDataIndex].replace(/\d+\t(.*)$/,"$1"));
            }
            else  // Line-feed marks end of a line/data row in JSON data file.
            {     // Was the previous processed string a data string?
               if (asIdentifiers.length)
               {
                  nTotalRowCount++;
                  // Has this line more identifiers then all lines before?
                  if (asIdentifiers.length > nTotalColCount)
                  {
                     nTotalColCount = asIdentifiers.length;
                  }
                  // Create a new array for this data row in CSV file
                  // and assign the identifier strings array and the
                  // option names array to this data row array.
                  var asRowData = new Array(2);
                  asRowData[0] = asIdentifiers;
                  asRowData[1] = asOptionNames;
                  // Append this array to the array containing structured
                  // all data values to write into CSV file.
                  asDataValues.push(asRowData);
                  // Create new arrays for identifiers and option names
                  // for the next data line read from JSON data file.
                  asIdentifiers = [];
                  asOptionNames = [];
               }
            }
         }

         // Delete all strings read from JSON file to free memory.
         asData.splice(0,asData.length);

         if (nTotalRowCount)
         {
            // Get type of line termination and code page of active file.
            var nLineTerm = UltraEdit.activeDocument.lineTerminator;
            var nEncoding = UltraEdit.activeDocument.codePage;
            var sFileName = UltraEdit.activeDocument.path;

            // Create a new file and make sure it has the same encoding
            // and the same line termination type as the JSON data file.
            UltraEdit.newFile();

            if (nEncoding != UltraEdit.activeDocument.codePage)
            {
               if (UltraEdit.activeDocument.codePage == 65001)
               {
                  UltraEdit.activeDocument.UTF8ToASCII();
               }
               else if ((UltraEdit.activeDocument.codePage == 1200) ||
                        (UltraEdit.activeDocument.codePage == 1201))
               {
                  UltraEdit.activeDocument.unicodeToASCII();
               }

               if (nEncoding == 65001)
               {
                  UltraEdit.activeDocument.ASCIIToUTF8();
               }
               else if ((nEncoding == 1200) || (nEncoding == 1201))
               {
                  UltraEdit.activeDocument.ASCIIToUnicode();
               }
            }

            // Create a string consisting only of separators according
            // to total number of identifiers / option names.
            var sSeparators = "";
            for (nDataIndex = 0; nDataIndex < nTotalColCount; nDataIndex++)
            {
               sSeparators += sSeparator;
            }

            // The search expression at top searched just for line-feeds.
            // So it is not possible that the JSON file is a MAC file.
            UltraEdit.activeDocument.unixMacToDos();
            var sLineTerm;
            if (nLineTerm < 1)
            {
               sLineTerm = "\r\n";
            }
            else
            {
               sLineTerm = "\n";
               UltraEdit.activeDocument.dosToUnix();
            }

            // Run the following loop on each data row array.
            for (var nRowIndex = 0; nRowIndex < nTotalRowCount; nRowIndex++)
            {
               // Join first all identifiers of current row to one string.
               var sDataRow = asDataValues[nRowIndex][0].join(",");
               // Has this data row less identifiers than at least one other
               // data row? Yes, append appropriate number of separators.
               if (asDataValues[nRowIndex][0].length < nTotalColCount)
               {
                  sDataRow += sSeparators.substr(asDataValues[nRowIndex][0].length);
               }

               // Append to data row string the option names enclosed
               // in double quotes.
               var nColCount = asDataValues[nRowIndex][1].length;
               for (var nColIndex = 0; nColIndex < nColCount; nColIndex++)
               {
                  sDataRow += ',"';
                  sDataRow += asDataValues[nRowIndex][1][nColIndex];
                  sDataRow += '"';
               }
               // Has this data row less option names than at least one other
               // data row? Yes, append appropriate number of separators. It
               // is not necessary to double quote the empty values.
               sDataRow += sSeparators.substr(nColCount);

               // Append the data row strings to the array of
               // data to write finally into the new file.
               asData.push(sDataRow);
            }

            // Append and empty string to data row strings array to
            // have also the last line finally terminated in CSV file.
            asData.push("");

            // Join the data row strings to a large block with determined
            // line terminator and write this block into the new file.
            UltraEdit.activeDocument.write(asData.join(sLineTerm));

            // Has the JSON data file a file name detected
            // on containing a backslash or a forward slash?
            if (sFileName.search(/[\\\/]/) >= 0)
            {
               // Replace the file extension of JSON file by ".csv".
               sCsvFileName = sFileName.replace(/\.[^.]+$/,".csv");
               // Append the file extension ".csv" if the
               // JSON file has no file extension at all.
               if (sCsvFileName == sFileName) sCsvFileName += ".csv";
               // Save the new file as CSV file with name of JSON file.
               UltraEdit.saveAs(sCsvFileName);
            }
            // Move caret to top in new file although not really needed.
            UltraEdit.activeDocument.top();
         }
      }
   }
}

I tested it on a file containing your two lines 600 times and it finished creating the CSV file with 1200 lines and saving it with name of JSON file with file extension changed to .csv in less than one second.

The resulting CSV file for the two posted lines is:

Code: Select all

87116473,87116474,87116475,87116476,,"ABC","DEF","GHI","JKL",
896834,814835,514836,464837,894838,"KKK","LLL","MMM","NNN","OOO"

fleggy · Feb 06, 2019#102019-02-06T08:19+00:00

Meanwhile you can use the regex method ;)
E.G. for up to 40 id's per line (short version):

F: ^.+?{"id":(\d+).+?"optionName1":"([^"]++)" plus 39 times (?:.+?{"id":(\d+).+?"optionName1":"([^"]++)")? plus the rest of line .*$
R: $1, $3, ..., $77, $79, "$2", "$4", "$6", ..., "$78", "$80"