Manipulating string with REGEX

mightymax · Sep 27, 2021 · #1 · 2021-09-27T21:41+00:00

This should be a simple task: get the string, make changes to the string and write it to the document. I've tried to make my regular expressions as general as possible so they can capture all instances found in the document.

Code: Select all

function DeleteEmptyLines ()
{
  UltraEdit.perlReOn();
  UltraEdit.activeDocument.findReplace.mode=0;
  UltraEdit.activeDocument.findReplace.matchCase=false;
  UltraEdit.activeDocument.findReplace.matchWord=false;
  UltraEdit.activeDocument.findReplace.regExp=true;
  UltraEdit.activeDocument.findReplace.searchDown=true;
  UltraEdit.activeDocument.findReplace.searchInColumn=false;
  UltraEdit.activeDocument.findReplace.preserveCase=false;
  UltraEdit.activeDocument.findReplace.replaceAll=true;
  UltraEdit.activeDocument.findReplace.replaceInAllOpen=false;
  UltraEdit.activeDocument.top();
  UltraEdit.activeDocument.findReplace.replace("^(?:[\t ]*(?:\r?\n|\r))+", "");
}


function convertInputTables()
{
//Declare variables
var sInputConditionString = "";
var sStartTabMat= "";
var sReqCond = "";
var Regex = "";
var sMark = "****************************************************************************************************";
UltraEdit.perlReOn();
UltraEdit.activeDocument.findReplace.mode=0;
UltraEdit.activeDocument.findReplace.matchCase=false;
UltraEdit.activeDocument.findReplace.matchWord=false;
UltraEdit.activeDocument.findReplace.regExp=true;
UltraEdit.activeDocument.findReplace.searchDown=true;
UltraEdit.activeDocument.findReplace.searchInColumn=false;
UltraEdit.activeDocument.findReplace.preserveCase=false;
UltraEdit.activeDocument.findReplace.replaceInAllOpen=false;
UltraEdit.activeDocument.top();
UltraEdit.activeDocument.findReplace.replaceAll=true;
UltraEdit.activeDocument.findReplace.replace('(<tabmat frame=".*?".*?>)', '\r\n$1');
DeleteEmptyLines ();

UltraEdit.activeDocument.top();


while (UltraEdit.activeDocument.findReplace.find('<tabmat frame=".*?".*?>[\s\S]+?Input Conditions[\s\S]+?<emphasis.*?>[\s\S]+?Additional Data:[\s\S]+?<\/tabmat>'))
{
//UltraEdit.activeDocument.cut();
sInputConditionString = UltraEdit.activeDocument.selection;
UltraEdit.activeDocument.cut();

var Regex = /<\/tbody><\/tgroup><\/tabmat>/g;
sRvmEnds = sInputConditionString.replace(Regex, "");

//REplace first match in string
var Regex = /<tabmat(.*?)>([\\s\\S]+?<tbody>)/m;
sInputResult = sRvmEnds.replace(Regex, "<table$1>$2");

// remove remaining <tabmat code from string
Regex = /<tabmat.*?>[\\s\\S]+?<tbody>/g;
sInputResult2 = sInputResult.replace(Regex, "");

// append ending table tags to the end of string
var sAddEnding = sInputResult2 + "<\/tbody><\/tgroup><\/table>";

UltraEdit.activeDocument.write(sAddEnding);

}
}
convertInputTables();

XML Sample:

Code: Select all

<para>
<tabmat frame="none" colsep="0" pgwide="0">
<tgroup cols="2" align="left">
<colspec colname="col1" align="left" colwidth="0.99in">
<colspec colname="col2" colwidth="0.78*">
<spanspec namest="col1" nameend="col2" spanname="span1">
<tbody>
<row>
<entry spanname="span1" colsep="0" align="left">
<emphasis type="u">
<?Pub _font TypeSize="10pt" FamName="Arial" Underline="heavy" ScoreSpace="yes"> Input Conditions<?Pub /_font></emphasis>.</entry></row></tbody></tgroup></tabmat>
<tabmat frame="none" colsep="0" pgwide="0">
<tgroup cols="2" align="left">
<colspec colname="col1" align="left" colwidth="1.0in">
<colspec colname="col2" colwidth="0.78*">
<spanspec namest="col1" nameend="col2" spanname="span1">
<tbody>
<row>
<entry spanname="span1" colsep="1" align="left">
<emphasis type="b">Applicability: </emphasis>
<?Pub _hardspace>All</entry></row></tbody></tgroup></tabmat>
<tabmat frame="none" colsep="0" pgwide="0">
<tgroup cols="1">
<colspec colname="col1" colwidth="1.00*">
<tbody>
<row>
<entry>
<emphasis type="b">Required Conditions:</emphasis></entry></row>
<row valign="top">
<entry>
<randlist>
<item>
<emphasis type="u" color="blue">
<xref xrefid="para5.10.1"></emphasis></item></randlist></entry></row></tbody></tgroup></tabmat>
<tabmat frame="none">
<tgroup cols="1">
<colspec colname="col1" colwidth="1.00*">
<tbody>
<row>
<entry>
<emphasis type="b">Personnel Recommended:</emphasis>
<?Pub _hardspace>2 </entry></row>
<row>
<entry>
<randlist>
<item>Technician A: </item>
<item>Technician B: </item></randlist></entry></row></tbody></tgroup></tabmat>
<tabmat frame="none" colsep="0" pgwide="0">
<tgroup cols="2">
<colspec colname="colspec0" align="left">
<colspec colname="colspec1">
<spanspec namest="colspec0" nameend="colspec1" spanname="span1">
<tbody>
<row>
<entry spanname="span1" colsep="1" align="left">
<emphasis type="b">Support Equipment:</emphasis>
<?Pub _hardspace>None</entry></row></tbody></tgroup></tabmat>
<tabmat frame="none" colsep="0" pgwide="0">
<tgroup cols="2">
<colspec colname="colspec0" align="left">
<colspec colname="colspec1">
<spanspec namest="colspec0" nameend="colspec1" spanname="span1">
<tbody>
<row>
<entry spanname="span1" colsep="1" align="left">
<emphasis type="b">Consumables: </emphasis>
<?Pub _hardspace>None</entry></row></tbody></tgroup></tabmat>
<tabmat frame="none" pgwide="0">
<tgroup cols="2">
<colspec colname="col1" colwidth="0.20*">
<colspec colname="col2" colwidth="1.93*">
<spanspec namest="col1" nameend="col2" spanname="span1">
<tbody>
<row>
<entry spanname="span1" colsep="0" align="left">
<emphasis type="b">Safety Conditions:</emphasis>
<?Pub _hardspace>None</entry></row></tbody></tgroup></tabmat>
<tabmat frame="none" colsep="0" pgwide="0">
<tgroup cols="2">
<colspec colname="colspec0" align="left">
<colspec colname="colspec1">
<spanspec namest="colspec0" nameend="colspec1" spanname="span1">
<tbody>
<row>
<entry spanname="span1" colsep="1" align="left">
<emphasis type="b">Additional Data:</emphasis>
<?Pub _hardspace>None</entry></row></tbody></tgroup></tabmat></para>
<subpara2 verstatus="ver">
<title>Removal</title>
<para></para>

Mofi · Sep 28, 2021 · #2 · 2021-09-28T06:13+00:00

In a JavaScript string, each backslash of a Perl regular expression search or replace string must be escaped with one more backslash when the expression is executed by the Perl regular expression engine of UltraEdit and not by the JavaScript interpreter itself. So the last line in the function DeleteEmptyLines must be:

Code: Select all

UltraEdit.activeDocument.findReplace.replace("^(?:[\\t ]*(?:\\r?\\n|\\r))+", "");

On execution, the JavaScript string "^(?:[\\t ]*(?:\\r?\\n|\\r))+" becomes the Perl regular expression search string ^(?:[\t ]*(?:\r?\n|\r))+ which is passed to the Perl regular expression engine of UltraEdit.
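
The effect of the missing escaping can be checked in any JavaScript interpreter. The following is only a small sketch with hypothetical variable names. It compares the unescaped and the escaped string and shows what is left of [\s\S] when the backslashes are not doubled.

```javascript
// Unescaped: JavaScript itself interprets \t, \r and \n, so the string
// contains a real tab, carriage return and line-feed character.
var unescaped = "^(?:[\t ]*(?:\r?\n|\r))+";

// Escaped: each \\ becomes one backslash, so the regular expression
// engine receives the two-character sequences \t, \r and \n.
var escaped = "^(?:[\\t ]*(?:\\r?\\n|\\r))+";

// The escaped string is four characters longer (one backslash for each
// of the four escape sequences).
var lengthDifference = escaped.length - unescaped.length;

// \s is no known escape sequence in a JavaScript string, so the
// backslash is dropped and the engine would only see [sS].
var brokenClass = '[\s\S]';
var workingClass = '[\\s\\S]';
```

With the unescaped variant, a character class like [\s\S] reaches the search engine as [sS] and matches only the letters s and S. That would explain why the find in the while loop never matched the expected block.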

The next Perl regular expression search or replace string in which each backslash must be escaped with one more backslash is in the line:

Code: Select all

UltraEdit.activeDocument.findReplace.replace('(<tabmat frame=".*?".*?>)', '\\r\\n$1');

The Perl regular expression in the while loop must also be adapted to:

Code: Select all

while (UltraEdit.activeDocument.findReplace.find('<tabmat frame=".*?".*?>[\\s\\S]+?Input Conditions[\\s\\S]+?<emphasis.*?>[\\s\\S]+?Additional Data:[\\s\\S]+?</tabmat>'))

A forward slash as in </tabmat> does not need to be escaped with a backslash in a Perl regular expression search or replace string. That is only necessary in a JavaScript regular expression literal, in which / marks the beginning and end of the regular expression.
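
The difference between the two notations can be sketched in plain JavaScript. The variable names are hypothetical, and new RegExp() built from a string only stands in here for a search string passed to UltraEdit. It is the JavaScript engine, not the Perl regular expression engine of UltraEdit.

```javascript
// Regular expression literal: / delimits the expression, so a literal
// forward slash must be written as \/ and \s needs only one backslash.
var endTagLiteral = /<\/tabmat>/;
var anyCharLiteral = /[\s\S]/;

// Expression passed as string: no delimiter, so / needs no backslash,
// but every backslash of the expression must be doubled.
var blockFromString = new RegExp("<tabmat[\\s\\S]+?</tabmat>");

// Doubled backslashes inside a literal mean a literal backslash.
// This class matches only the characters \, s and S.
var wrongClassLiteral = /[\\s\\S]/;
```

This is the reason why [\\s\\S] is right in the string passed to findReplace.find(), while the same text inside a regular expression object like /<tabmat.*?>[\\s\\S]+?<tbody>/g does not match a line break.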

So here is your script with all corrections and some other small improvements. The corrections are in the Perl regular expression search and replace strings executed by the Perl regular expression engine of UltraEdit, and in the three regular expression objects executed directly by the JavaScript core interpreter:

Code: Select all

function DeleteEmptyLines ()
{
   UltraEdit.perlReOn();
   UltraEdit.activeDocument.findReplace.mode=0;
   UltraEdit.activeDocument.findReplace.matchCase=true;
   UltraEdit.activeDocument.findReplace.matchWord=false;
   UltraEdit.activeDocument.findReplace.regExp=true;
   UltraEdit.activeDocument.findReplace.searchDown=true;
   UltraEdit.activeDocument.findReplace.searchInColumn=false;
   UltraEdit.activeDocument.findReplace.preserveCase=false;
   UltraEdit.activeDocument.findReplace.replaceAll=true;
   UltraEdit.activeDocument.findReplace.replaceInAllOpen=false;
   UltraEdit.activeDocument.top();
   UltraEdit.activeDocument.findReplace.replace("^(?:[\\t ]*(?:\\r?\\n|\\r))+", "");
}

function convertInputTables()
{
   UltraEdit.perlReOn();
   UltraEdit.activeDocument.findReplace.mode=0;
   UltraEdit.activeDocument.findReplace.matchCase=true;
   UltraEdit.activeDocument.findReplace.matchWord=false;
   UltraEdit.activeDocument.findReplace.regExp=true;
   UltraEdit.activeDocument.findReplace.searchDown=true;
   UltraEdit.activeDocument.findReplace.searchInColumn=false;
   UltraEdit.activeDocument.findReplace.preserveCase=false;
   UltraEdit.activeDocument.findReplace.replaceInAllOpen=false;
   UltraEdit.activeDocument.top();
   UltraEdit.activeDocument.findReplace.replaceAll=true;
   UltraEdit.activeDocument.findReplace.replace('(<tabmat frame=".*?".*?>)', '\\r\\n$1');
   DeleteEmptyLines ();

   UltraEdit.activeDocument.top();

   //Declare variables
   var sAddEnding;
   var sInputConditionString;
   var sInputResult;
   var sInputResult2;
   var sRvmEnds;
   var Regex1 = /<\/tbody><\/tgroup><\/tabmat>/g;
   var Regex2 = /<tabmat(.*?)>([\s\S]+?<tbody>)/m;
   var Regex3 = /<tabmat.*?>[\s\S]+?<tbody>/g;

   while (UltraEdit.activeDocument.findReplace.find('<tabmat frame=".*?".*?>[\\s\\S]+?Input Conditions[\\s\\S]+?<emphasis.*?>[\\s\\S]+?Additional Data:[\\s\\S]+?</tabmat>'))
   {
      sInputConditionString = UltraEdit.activeDocument.selection;
      sRvmEnds = sInputConditionString.replace(Regex1, "");
      // Replace first match in string
      sInputResult = sRvmEnds.replace(Regex2, "<table$1>$2");
      // remove remaining <tabmat code from string
      sInputResult2 = sInputResult.replace(Regex3, "");
      // append ending table tags to the end of string
      sAddEnding = sInputResult2 + "</tbody></tgroup></table>";
      UltraEdit.activeDocument.write(sAddEnding);
   }
}

convertInputTables();

The output of the script above for the example XML file contents is:

Code: Select all

<para>
<table frame="none" colsep="0" pgwide="0">
<tgroup cols="2" align="left">
<colspec colname="col1" align="left" colwidth="0.99in">
<colspec colname="col2" colwidth="0.78*">
<spanspec namest="col1" nameend="col2" spanname="span1">
<tbody>
<row>
<entry spanname="span1" colsep="0" align="left">
<emphasis type="u">
<?Pub _font TypeSize="10pt" FamName="Arial" Underline="heavy" ScoreSpace="yes"> Input Conditions<?Pub /_font></emphasis>.</entry></row>

<row>
<entry spanname="span1" colsep="1" align="left">
<emphasis type="b">Applicability: </emphasis>
<?Pub _hardspace>All</entry></row>

<row>
<entry>
<emphasis type="b">Required Conditions:</emphasis></entry></row>
<row valign="top">
<entry>
<randlist>
<item>
<emphasis type="u" color="blue">
<xref xrefid="para5.10.1"></emphasis></item></randlist></entry></row>

<row>
<entry>
<emphasis type="b">Personnel Recommended:</emphasis>
<?Pub _hardspace>2 </entry></row>
<row>
<entry>
<randlist>
<item>Technician A: </item>
<item>Technician B: </item></randlist></entry></row>

<row>
<entry spanname="span1" colsep="1" align="left">
<emphasis type="b">Support Equipment:</emphasis>
<?Pub _hardspace>None</entry></row>

<row>
<entry spanname="span1" colsep="1" align="left">
<emphasis type="b">Consumables: </emphasis>
<?Pub _hardspace>None</entry></row>

<row>
<entry spanname="span1" colsep="0" align="left">
<emphasis type="b">Safety Conditions:</emphasis>
<?Pub _hardspace>None</entry></row>

<row>
<entry spanname="span1" colsep="1" align="left">
<emphasis type="b">Additional Data:</emphasis>
<?Pub _hardspace>None</entry></row></tbody></tgroup></table></para>
<subpara2 verstatus="ver">
<title>Removal</title>
<para></para>

mightymax · Sep 29, 2021 · #3 · 2021-09-29T11:16+00:00

Mofi, thank you for taking the time to work with me on this. I'm still getting inconsistent results ... sometimes there is a table and sometimes there are only rows.

I'll be back soon.

Sep 30, 2021 · #4 · 2021-09-30T17:29+00:00

I ran the script on a large file, but some tables were missing their table headers and only the 'row' tags were left. The first tabmat tags were missing from the results. Some examples are listed below. I've attached a source file and a scripted file for your review. At first I thought the first REGEX was selecting more than one tabmat table, so that the second table would lose its table header. But I've tested it and that doesn't seem to be the issue. I tried to see what could be causing a tabmat->tbody to go missing but haven't had much luck. I can't select the whole file for the script because there are many other tabmat tables that should not be converted.

Missing Table header
Line
2284
3252
3720
4138
4499

The resulting SGM tabmat table should look like this:

Code: Select all

<table frame="none" colsep="0" pgwide="0">
<tgroup cols="2" align="left">
<colspec colname="col1" align="left" colwidth="0.99in">
<colspec colname="col2" colwidth="0.78*">
<spanspec namest="col1" nameend="col2" spanname="span1">
<tbody>
<row>
<entry spanname="span1" colsep="0" align="left">
<emphasis type="u">
<?Pub _font TypeSize="10pt" FamName="Arial" Underline="heavy" ScoreSpace="yes"> Input Conditions<?Pub /_font></emphasis>.</entry></row>
<row>
<entry spanname="span1" colsep="1" align="left">
<emphasis type="b">Applicability: </emphasis>
<?Pub _hardspace>All</entry></row>
<row>
<entry>
<emphasis type="b">Required Conditions:</emphasis></entry></row>
<row valign="top">
<entry>
<randlist>
<item>
<emphasis type="u" color="blue">
<xref xrefid="para4.3.2"></emphasis></item></randlist></entry></row>
<row>
<entry>
<emphasis type="b">Personnel Recommended:</emphasis>
<?Pub _hardspace>1 </entry></row>
<row>
<entry>
<randlist>
<item>Technician A: performs groundcrew function</item></randlist></entry></row>
<row>
<entry spanname="span1" colsep="1" align="left">
<emphasis type="b">Support Equipment:</emphasis>
<?Pub _hardspace>None</entry></row>
<row>
<entry spanname="span1" colsep="1" align="left">
<emphasis type="b">Consumables: </emphasis>
<?Pub _hardspace>None</entry></row>
<row>
<entry spanname="span1" colsep="0" align="left">
<emphasis type="b">Safety Conditions:</emphasis>
<?Pub _hardspace>None</entry></row>
<row>
<entry spanname="span1" colsep="1" align="left">
<emphasis type="b">Additional Data:</emphasis>
<?Pub _hardspace>None</entry></row></tbody></tgroup></table>

Edit: The two attached *.sgm files were removed after being no longer needed.

Mofi · Oct 02, 2021 · #5 · 2021-10-02T15:00+00:00

Problem 1: Invalid line endings

The file obfuscated_source.sgm has lines ending with carriage return + carriage return + line-feed (0D 0D 0A) and carriage return + line-feed + carriage return (0D 0A 0D). Those are definitely not valid line endings. Either a program created this file incorrectly, or you downloaded it incorrectly from a server, or you prepared it incorrectly before uploading it to the forum, or the forum processed the file data incorrectly on upload. That is one reason why I prefer archive files such as 7-Zip, RAR or ZIP archives as attachments: after downloading and extracting the archive, I can be sure to have exactly the same files as the person who compressed and uploaded them. Other reasons are that compressed files need less storage space and that less data is transferred on upload and download.

The file obfuscated_scripted.sgm has lines ending with just carriage return (0D) and lines ending with carriage return + line-feed (0D 0A) after I downloaded it from the forum.

I first created copies of the two downloaded files and converted the copies in UltraEdit to text files with carriage return + line-feed (0D 0A) as the line ending on all lines, removing the empty lines caused by 0D 0A 0D.

I strongly recommend that you find out what causes the invalid line endings 0D 0D 0A and 0D 0A 0D and fix that if possible. Otherwise the script should correct the line endings immediately on start, as done by the script below.
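
For illustration only, the same repair can be sketched on a plain JavaScript string. The script below does it differently, with the UltraEdit commands dosToUnix and unixMacToDos plus the removal of empty lines. The function name normalizeLineEndings is hypothetical, and the sketch assumes that only the two invalid byte sequences named above occur next to valid 0D 0A line endings.

```javascript
// Repair the two invalid line ending sequences in a string.
function normalizeLineEndings(text) {
  return text
    // 0D 0D 0A: collapse several carriage returns before a line-feed.
    .replace(/\r+\n/g, "\r\n")
    // 0D 0A 0D: drop a stray carriage return after a valid line ending
    // if it is not itself the start of another 0D 0A pair.
    .replace(/\r\n\r(?!\n)/g, "\r\n");
}

var repaired = normalizeLineEndings("a\r\r\nb\r\n\rc\r\nd");
```

Lines ending with just a carriage return, as found in obfuscated_scripted.sgm, are not handled by this sketch.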

Problem 2: Wrong expression to select the block to reformat

I think the search should first find and match a tabmat element which contains a ?Pub _font element with the value Input Conditions, using the search expression <tabmat frame=".*?".*?>[\s\S]+?Input Conditions. But this expression matches much more: it matches everything from the next tabmat start tag up to a line containing Input Conditions, no matter how many tabmat elements lie between that start tag and the tabmat element that really contains the ?Pub _font element with the value Input Conditions. For that reason the expression already matches the first <tabmat frame="none" verstatus="ver"> on line 75 and everything up to the first occurrence of Input Conditions on line 2373 in obfuscated_source.sgm. That is definitely incorrect.

A solution is the expression <tabmat frame=".*?".*?>(?:[\s\S](?!</tabmat))+?Input Conditions, which does not match characters outside the current tabmat element. This expression ignores the tabmat beginning on line 75 in obfuscated_source.sgm, as that element does not contain the string Input Conditions. The negative lookahead stops the matching of any character on reaching the end tag of the current tabmat element.
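
The difference between the two expressions can be reproduced with the JavaScript engine on a small sample. This is only a sketch: the sample text and variable names are made up, and the JavaScript engine is used in place of the Perl regular expression engine of UltraEdit, which should behave the same for these two expressions.

```javascript
// Two tabmat elements. Only the second one contains Input Conditions.
var text =
  '<tabmat frame="none" verstatus="ver">\n' +
  '<row>Other table</row></tabmat>\n' +
  '<tabmat frame="none" colsep="0">\n' +
  '<row>Input Conditions</row></tabmat>\n';

// Original expression: [\s\S]+? runs across the end tag of the first
// tabmat element until Input Conditions is found somewhere later.
var naive = /<tabmat frame=".*?".*?>[\s\S]+?Input Conditions/;

// Improved expression: no matched character may be followed by </tabmat,
// so the match cannot leave the tabmat element in which it started.
var tempered = /<tabmat frame=".*?".*?>(?:[\s\S](?!<\/tabmat))+?Input Conditions/;

var naiveStart = naive.exec(text).index;       // start of the first tabmat
var temperedStart = tempered.exec(text).index; // start of the second tabmat
```

The sample has the tabmat start tags on lines of their own, as in the file after the line endings were inserted. With several tabmat elements on one line, .*? could still run past an end tag before the lookahead part of the expression is reached.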

Additional Data: is between <emphasis type="b"> and </emphasis>. For that reason the search expression should not search for just any start tag <emphasis, which exists multiple times before the emphasis element with the value Additional Data:, as done with [\s\S]+?<emphasis.*?>[\s\S]+?Additional Data:. It should search explicitly for the emphasis element with the value Additional Data:, as done with [\s\S]+?<emphasis type=".+">Additional Data:.

The last part [\s\S]+?</tabmat> of the search expression is okay.

So the full expression to use for finding the block to reformat is:

<tabmat frame=".*?".*?>(?:[\s\S](?!</tabmat))+?Input Conditions[\s\S]+?<emphasis type=".+">Additional Data:[\s\S]+?</tabmat>

Each backslash in this Perl regular expression search string must be escaped with one more backslash in the script to pass the search string correctly from the JavaScript core engine to the Perl regular expression engine of UltraEdit.

Problem 3: Insert a line ending before starting tag of tabmat only where necessary

Your script uses the following Perl regular expression replace all:

Code: Select all

UltraEdit.activeDocument.findReplace.replace('(<tabmat frame=".*?".*?>)', '\\r\\n$1');

That inserts a carriage return and line-feed before every start tag of a tabmat element, even if the tag is already at the beginning of a line, which results in empty lines in the final file.

It is better to use the following Perl regular expression replace all in the script. It searches for a character that is not a carriage return or line-feed, without selecting this character (\K), and checks with a positive lookahead that the start tag <tabmat follows. Only in this case is a carriage return and line-feed pair inserted into the file.

Code: Select all

UltraEdit.activeDocument.findReplace.replace('[^\\r\\n]\\K(?=<tabmat)', '\\r\\n');
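
The behavior can be tried on a plain string as well. JavaScript regular expressions do not support \K, so this sketch captures the preceding character and writes it back with $1, which should give the same result. The function name breakBeforeTabmat is hypothetical.

```javascript
// Insert CR+LF before <tabmat only if the start tag is not already at
// the beginning of a line. The character before the tag is captured
// and written back because JavaScript has no \K.
function breakBeforeTabmat(text) {
  return text.replace(/([^\r\n])(?=<tabmat)/g, "$1\r\n");
}

var sample = '</tabmat><tabmat frame="none">\r\n<tabmat frame="none">';
var withBreaks = breakBeforeTabmat(sample);
```

Only the first start tag in the sample gets a line ending inserted before it. The second one already follows a line-feed and is left alone.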

Problem 4: Third regular expression applied on selected block does not remove carriage return and line-feed

The third regular expression replace executed by JavaScript core engine on the selected block is:

Code: Select all

var Regex3 = /<tabmat.*?>[\s\S]+?<tbody>/g;

That expression does not match the carriage return and line-feed after the tag <tbody>, as the slightly extended expression below does.

Code: Select all

var Regex3 = /<tabmat.*?>[\s\S]+?<tbody>[\r\n]*/g;
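
A short sketch with a made-up fragment shows the effect of the extension. The fragment stands for the text between two rows after the end tags of the first tabmat element were already removed. The variable names are hypothetical.

```javascript
// End of one row, then the header of the next tabmat element.
var fragment =
  '</row>\r\n<tabmat frame="none">\r\n<tgroup cols="1">\r\n<tbody>\r\n<row>';

// Without [\r\n]* the line ending after <tbody> stays in the text and
// an empty line is left between the two rows.
var withEmptyLine = fragment.replace(/<tabmat.*?>[\s\S]+?<tbody>/g, "");

// With [\r\n]* the line ending after <tbody> is removed as well.
var withoutEmptyLine = fragment.replace(/<tabmat.*?>[\s\S]+?<tbody>[\r\n]*/g, "");
```

The empty lines between the rows in the output of the earlier script version come from exactly this remaining line ending.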

Here is the script with the explained improvements which I think produces the correct result now.

Code: Select all

function DeleteEmptyLines ()
{
   UltraEdit.activeDocument.top();
   UltraEdit.perlReOn();
   UltraEdit.activeDocument.findReplace.mode=0;
   UltraEdit.activeDocument.findReplace.matchCase=true;
   UltraEdit.activeDocument.findReplace.matchWord=false;
   UltraEdit.activeDocument.findReplace.regExp=true;
   UltraEdit.activeDocument.findReplace.searchDown=true;
   UltraEdit.activeDocument.findReplace.searchInColumn=false;
   UltraEdit.activeDocument.findReplace.preserveCase=false;
   UltraEdit.activeDocument.findReplace.replaceAll=true;
   UltraEdit.activeDocument.findReplace.replaceInAllOpen=false;
   UltraEdit.activeDocument.findReplace.replace("^(?:[\\t ]*(?:\\r?\\n|\\r))+", "");
}

function convertInputTables()
{
   UltraEdit.activeDocument.top();
   UltraEdit.perlReOn();
   UltraEdit.activeDocument.findReplace.mode=0;
   UltraEdit.activeDocument.findReplace.matchCase=true;
   UltraEdit.activeDocument.findReplace.matchWord=false;
   UltraEdit.activeDocument.findReplace.regExp=true;
   UltraEdit.activeDocument.findReplace.searchDown=true;
   UltraEdit.activeDocument.findReplace.searchInColumn=false;
   UltraEdit.activeDocument.findReplace.preserveCase=false;
   UltraEdit.activeDocument.findReplace.replaceAll=true;
   UltraEdit.activeDocument.findReplace.replaceInAllOpen=false;
   // Insert carriage return and line-feed before a start tag
   // of element tabmat not found at the beginning of a line.
   UltraEdit.activeDocument.findReplace.replace('[^\\r\\n]\\K(?=<tabmat)', '\\r\\n');

   //Declare variables
   var sAddEnding;
   var sInputResult1;
   var sInputResult2;
   var sRvmEnds;
   var Regex1 = /<\/tbody><\/tgroup><\/tabmat>/g;
   var Regex2 = /<tabmat(.*?)>([\s\S]+?<tbody>)/m;
   var Regex3 = /<tabmat.*?>[\s\S]+?<tbody>[\r\n]*/g;

   UltraEdit.activeDocument.top();
   while (UltraEdit.activeDocument.findReplace.find('<tabmat frame=".*?".*?>(?:[\\s\\S](?!</tabmat))+?Input Conditions[\\s\\S]+?<emphasis type=".+">Additional Data:[\\s\\S]+?</tabmat>'))
   {
      sRvmEnds = UltraEdit.activeDocument.selection.replace(Regex1, "");
      // Replace first match in string
      sInputResult1 = sRvmEnds.replace(Regex2, "<table$1>$2");
      // remove remaining <tabmat code from string
      sInputResult2 = sInputResult1.replace(Regex3, "");
      // append ending table tags to the end of string
      sAddEnding = sInputResult2 + "</tbody></tgroup></table>";
      UltraEdit.activeDocument.write(sAddEnding);
   }
   UltraEdit.activeDocument.top();
}

// If the file is detected as file with DOS/Windows line endings or was
// temporarily converted from UNIX/MAC to DOS/Windows on load, convert it
// first to UNIX. Then convert the file to DOS/Windows and remove all empty
// lines caused by a carriage return after a carriage return + line-feed
// pair. That makes sure that the invalid line endings 0D 0D 0A and
// 0D 0A 0D are all corrected to 0D 0A and the file is finally saved as
// text file with DOS/Windows line endings, which means with 0D 0A.
if (UltraEdit.activeDocument.lineTerminator < 1) UltraEdit.activeDocument.dosToUnix();
UltraEdit.activeDocument.unixMacToDos();
DeleteEmptyLines();
convertInputTables();
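
The string processing inside the while loop does not depend on UltraEdit and can be tested separately. The following sketch wraps the three replaces of the script above in a hypothetical function mergeTabmatTables and runs it on a made-up block of two tabmat elements.

```javascript
// Merge all tabmat elements of a selected block into one table element.
// The three expressions are the ones used in the script above.
function mergeTabmatTables(block) {
  var Regex1 = /<\/tbody><\/tgroup><\/tabmat>/g;
  var Regex2 = /<tabmat(.*?)>([\s\S]+?<tbody>)/m;
  var Regex3 = /<tabmat.*?>[\s\S]+?<tbody>[\r\n]*/g;

  // Remove the end tags of all tabmat elements.
  var sRvmEnds = block.replace(Regex1, "");
  // Rename the first tabmat start tag to table and keep its header.
  var sInputResult1 = sRvmEnds.replace(Regex2, "<table$1>$2");
  // Remove the headers of all remaining tabmat elements.
  var sInputResult2 = sInputResult1.replace(Regex3, "");
  // Append the end tags of the single table element.
  return sInputResult2 + "</tbody></tgroup></table>";
}

var block =
  '<tabmat frame="none">\r\n<tgroup cols="1">\r\n<tbody>\r\n' +
  '<row><entry>Input Conditions</entry></row></tbody></tgroup></tabmat>\r\n' +
  '<tabmat frame="none">\r\n<tgroup cols="1">\r\n<tbody>\r\n' +
  '<row><entry>Additional Data:</entry></row></tbody></tgroup></tabmat>';

var merged = mergeTabmatTables(block);
```

The result keeps the tgroup header of the first tabmat element only. That is the same simplification the script makes for the real files.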

mightymax · Oct 04, 2021 · #6 · 2021-10-04T14:10+00:00

Mofi, thank you for taking the time to explain all the errors. Your guidance is really appreciated.
Max

Oct 08, 2021 · #7 · 2021-10-08T17:02+00:00

Mofi, I need the script to extend the match to the end para tag after the end tabmat table. There can be a <line>, <brk> or other tag before the end tag </para>. So the script needs to handle this type of ending: '</tabmat>Text test test <line>test line test</line></para>' or '</tabmat></para>'. The tabmat tables and the content between the last </tabmat> and </para> must be captured in two groups so I can put them into separate variables (allTabTables & endText) to reconstruct the tags to be written.

The code below never matches the tag groups which should separate '<tabmat ... </tabmat>' from 'text ... </para>'. I need help with this regular expression.

Code: Select all

var sInputCond = UltraEdit.activeDocument.selection.replace(/(.*?)/,"$1");

var SelectTabMatREGEX = /(<tabmat frame=".*?".*?>(?:[\\s\\S](?!<\/tabmat))+?Input Conditions[\\s\\S]+?<\/tabmat>([\\s\\S]+?))?((<line.*?>)?<\/para>)/m;
var match = SelectTabMatREGEX.exec(sInputCond);
if (match != null) {
var allTabTables = match[1];
  var endText = match[3];
} else {
// Match attempt failed
}

Mofi · Oct 10, 2021 · #8 · 2021-10-10T09:38+00:00

Please see the revised script file ConvertTabMat387849a.js in the attached ZIP file. Its comments explain the solution I used, which is easier and executes faster. The ZIP file also contains the result file result.sgm produced by running this script on the example file test.sgm.

I made some other small improvements to the script, as you can see by comparing your version with this version using UltraCompare. The other improvements to some regular expressions have no effect on the result file. They just simplify some search expressions a little.
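
The attached script is not reproduced here. Purely as an illustration, and not as the solution used in the attached file, one possible way to split a selected block into the two parts asked for is sketched below. The function name splitAtLastTabmat is hypothetical, allTabTables and endText are the variable names from the question, and the sketch assumes that the block ends with the closing </para> tag.

```javascript
// Split a block into everything up to the last </tabmat> and the text
// between that end tag and the closing </para> at the end of the block.
function splitAtLastTabmat(block) {
  // In a regular expression literal \s needs only one backslash.
  // The greedy first group runs up to the last </tabmat> in the block.
  var match = /^([\s\S]*<\/tabmat>)([\s\S]*?)<\/para>\s*$/.exec(block);
  if (match === null) {
    return null; // block does not have the expected structure
  }
  return { allTabTables: match[1], endText: match[2] };
}

var parts = splitAtLastTabmat(
  '<tabmat frame="none">x</tabmat>\r\n' +
  '<tabmat frame="none">y</tabmat>Text <line>test</line></para>'
);
```

For a block ending with </tabmat></para> the property endText is an empty string.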

mightymax · Oct 11, 2021 · #9 · 2021-10-11T16:08+00:00

Mofi this is great! Thank you for taking the time to comment the changes.

Oct 11, 2021 · #10 · 2021-10-11T19:04+00:00

Mofi, how did you get UES to show the different line endings?

lines ending with carriage return + carriage return + line-feed (0D 0D 0A) and carriage return + line-feed + carriage return (0D 0A 0D)

Mofi · Oct 12, 2021 · #11 · 2021-10-12T06:10+00:00

After opening the *.sgm file, I clicked Hex mode, the last but one item on the ribbon tab Edit, and then used File - Revert to saved to get the file reloaded in hex edit mode as stored on the hard disk, without interpretation of the newline characters. I also have a license for Total Commander, which can display the bytes of a file in hex mode too and shows the same as UltraEdit in hex edit mode.

mightymax · Nov 10, 2021 · #12 · 2021-11-10T22:53+00:00

The script goes through a file and removes misc. tags. The output is a clean SGM file (attachment BrokenTablesDesiredResults.xml).
The issue is that the script stops at the end of the warning tag because it has an end </para> tag before the last tabmat table. I can't seem to write the REGEX to look beyond that end </para> tag until it's the final end tag in the captured data 'var sInputCond = UltraEdit.activeDocument.selection.replace(/(.*?)/,"$1");'
I'm doing a match on the block found and splitting it with 'allTabTables0 = match[1];', then I search that variable and remove any end tags with 'allTabTables = allTabTables0.replace(/<\/tbody><\/tgroup><\/tabmat>/img, "");'. But I'm still off base. I wouldn't need to check the variable for end table tags if my original regex matched beyond the warning <para> and into the next tags, until it truly is the last </para> tag.

Your help is appreciated.
Max

Mofi · Nov 14, 2021 · #13 · 2021-11-14T12:59+00:00

A regular expression is in general the wrong approach to find the matching end tag in an XML file in which an element can contain itself nested multiple times.
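
As an illustration of an approach without one big regular expression, the matching end tag can be found by counting start and end tags. This is only a sketch and not the code of the attached script. The function name findMatchingEndTag is hypothetical, the tag name must not contain characters with a special meaning in a regular expression, and comments or other places where tag-like text is no real tag are not considered.

```javascript
// Return the index just after the end tag that closes the element whose
// start tag begins at startIndex, or -1 if no matching end tag exists.
function findMatchingEndTag(text, tagName, startIndex) {
  // Matches start tags and end tags of the element, but no other
  // element whose name only begins with tagName.
  var tagPattern = new RegExp("<(/?)" + tagName + "\\b[^>]*>", "g");
  tagPattern.lastIndex = startIndex;
  var depth = 0;
  var match;
  while ((match = tagPattern.exec(text)) !== null) {
    if (match[1] === "/") {
      depth--; // an end tag closes the innermost open element
      if (depth === 0) {
        return match.index + match[0].length;
      }
    } else {
      depth++; // a start tag opens a nested element
    }
  }
  return -1;
}

var nested =
  '<para>a<warning><para>w</para></warning>' +
  '<tabmat>t</tabmat></para><para>next</para>';
var endOfOuterPara = findMatchingEndTag(nested, "para", 0);
```

In the sample the </para> inside the warning element only reduces the depth from 2 to 1, so the search continues to the </para> after the last tabmat element.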

The attached ZIP file contains the example input file BrokenTables.xml, the expected output file BrokenTablesDesiredResults.xml, and the updated script file 38784 v7 to v9a-1225.js. The expected output file has small modifications in comparison to BrokenTablesDesiredResults.xml in your ZIP file, which you should verify first by running a text file comparison. I suggest also running a text file comparison against the script version in your ZIP file, ignoring whitespace differences, to see all the modifications I made to the script to finally get the expected output.

You might also want to look at Fleggy's post Matching tag pairs. It would perhaps be better to make use of the first expression from that post, adapted to your needs for SGM reformatting.

The input and example files contain only the table on which the reformat failed. So I could not verify if the modified code works for other tables in an SGM file.

mightymax · Nov 15, 2021 · #14 · 2021-11-15T15:34+00:00

Thank you Mofi! I think I'm making this much more difficult than it needs to be.

UltraEdit, UltraCompare, UEStudio forums