Perl regular expression in script not working as intended - UltraEdit, UltraCompare, UEStudio forums


Perl regular expression in script not working as intended

20 posts

martix 5 Newbie martix 5

Aug 20, 2009 #1 (2009-08-20T14:38+00:00)

This is a Perl regular expression.
I have a regex that is not working as it is supposed to, and I do not know why:

Code: Select all

.*?\(.*?[0-9]\).*?$\r\n^((?!\(.*?[0-9]\)).)*$|^.*?\(.*?[0-9]\).*?$

It is supposed to match the two-liner option if possible and the one-liner otherwise. However it only ever matches one line!

Mofi 6,685587 Grand Master Mofi 6,685587

Aug 20, 2009 #2 (2009-08-20T15:17+00:00)

I'm not a Perl regular expression expert. It would be really helpful to have some lines which show what should be found by this expression and what should be ignored. Without deeply analyzing it or fully understanding what this expression should do, what about:

^.*?\(.*?[0-9]\).*?(\r\n((?!\(.*?[0-9]\)).)*$|$)

Best regards from an UC/UE/UES for Windows user from Austria

martix 5 Newbie martix 5

Aug 20, 2009 #3 (2009-08-20T15:54+00:00)

Hmm... thanks

This seems to work quite nicely and is not so bulky. Thanks

Edit: Actually... it works on manual find, however when executing through a script it always matches the first 2 lines in which the first contains some number.

Aug 21, 2009 #4 (2009-08-21T02:59+00:00)

I have this regex that in UltraEdit in 3 different usage instances evaluates in 3 different ways!

Code: Select all

^.*?\(.*?[0-9]\).*?(\r\n((?!\(.*?[0-9]\)).)*$|$)

RE explanation: Match one or two lines, the first of which contains a parenthesized word and digit (also making sure that a one-line match does not take two lines anyway).

Now, as far as I can tell, Search->Find is the only place where this evaluates correctly (i.e. the same as logic and a few other places say).

However when I try to integrate this into a script it fails miserably...
Option one(the UE way):

Code: Select all

UltraEdit.perlReOn();
UltraEdit.newFile();
UltraEdit.document[1].setActive();
UltraEdit.document[0].top();
UltraEdit.document[0].findReplace.mode = 0;
UltraEdit.document[0].findReplace.matchCase = false;
UltraEdit.document[0].findReplace.regExp = true;

var str = UltraEdit.document[0].findReplace.find("^.*?\(.*?[0-9]\).*?(\r\n((?!\(.*?[0-9]\)).)*$|$)");
UltraEdit.document[2].write("\r\n");
while (str){
UltraEdit.document[0].copy();
UltraEdit.document[2].bottom();
UltraEdit.document[2].paste();
str = UltraEdit.document[0].findReplace.find("^.*?\(.*?[0-9]\).*?(\r\n((?!\(.*?[0-9]\)).)*$|$)");
UltraEdit.document[2].write("\r\n");
}

- Result: It ignores parts of the RE it doesn't seem to like.
Right now it finds any 2 lines the first of which contains a number, completely ignoring the parenthesis requirement!

Option two(the JS way):

Code: Select all

UltraEdit.document[0].selectAll();
var txt = UltraEdit.document[0].selection;
UltraEdit.newFile();

var RE = /^.*?\(.*?[0-9]\).*?(\r\n((?!\(.*?[0-9]\)).)*$|$)/gim;
var Matches = RE.exec(txt);
var i = Matches.length;
UltraEdit.outputWindow.showWindow(true);
UltraEdit.outputWindow.write("Matches found:\r\n" + i);
UltraEdit.newFile();
for (var x = 0; x<i; x++){
	UltraEdit.document[1].write(Matches[x]);
}

- Result: A couple completely random matches out of ~900 possible.

Match count comparison by RE interpretation:
Proper interpret(Search menu): 858
findReplace method: 2337
JavaScript exec method: 3

Mofi 6,685587 Grand Master Mofi 6,685587

Aug 21, 2009 #5 (2009-08-21T05:39+00:00)

You know that in JavaScript strings the backslash is an escape character and therefore you have to use

"^.*?\\(.*?[0-9]\\).*?(\\r\\n((?!\\(.*?[0-9]\\)).)*$|$)"

to pass the regular expression string correctly from the JavaScript engine to the Perl regular expression engine?

If you just use
"^.*?\(.*?[0-9]\).*?(\r\n((?!\(.*?[0-9]\)).)*$|$)"
in a JavaScript the Perl engine will get
"^.*?(.*?[0-9]).*?(rn((?!(.*?[0-9])).)*$|$)"

And following script is surely much faster:

Code: Select all

UltraEdit.perlReOn();
UltraEdit.newFile();
UltraEdit.document[1].setActive();

UltraEdit.document[0].bottom();
if (UltraEdit.document[0].isColNumGt(1)) {
   UltraEdit.document[0].insertLine();
   if (UltraEdit.document[0].isColNumGt(1)) UltraEdit.document[0].deleteToStartOfLine();
}
UltraEdit.document[0].top();
UltraEdit.document[0].findReplace.mode=0;
UltraEdit.document[0].findReplace.matchCase=false;
UltraEdit.document[0].findReplace.matchWord=false;
UltraEdit.document[0].findReplace.regExp=true;
UltraEdit.document[0].findReplace.searchAscii=false;
UltraEdit.document[0].findReplace.searchDown=true;
UltraEdit.document[0].findReplace.searchInColumn=false;

UltraEdit.selectClipboard(9);
UltraEdit.clearClipboard();

while (UltraEdit.document[0].findReplace.find("^.*?\\(.*?[0-9]\\).*?\\r(\\n((?!\\(.*?[0-9]\\)).)*\\r\\n|\\n)")) {
  UltraEdit.document[0].copyAppend();
}
UltraEdit.document[2].write("\r\n");
UltraEdit.document[2].paste();
UltraEdit.clearClipboard();
UltraEdit.selectClipboard(0);

For comments see similar script I have posted here. There is also a user-submitted script under Downloads - Extras - Macros & Scripts which maybe works for you, too.

martix 5 Newbie martix 5

Aug 21, 2009 #6 (2009-08-21T17:52+00:00)

With the script you presented - it is indeed faster and works quite consistently and nicely. Nifty little trick using the CR/LF.

As of right now the script did the job it was supposed to do, and for that I am enormously grateful. Thank you very much for your time.

The next part is optional reading, but I believe if I'm not mistaken that it shows a bug:
I did earlier experiment with JS a bit and found there to be something wrong with the JS's regex interpreter(as far as I can tell the following code isn't faulty).

The JS method does not work no matter what, and from what I read about JS, the /.../ literal format does not need the doubled backslashes; only the new RegExp("...") one does (according to your definition as well).

Code: Select all

var txt = "Sample (Entry 1)- 1st line lalala\n2nd line from entry 1 to match\nSecond line to match (Entry 2)- one liner test";

//var RE = new RegExp("^.*?\\(.*?[0-9]\\).*?(\\n((?!\\(.*?[0-9]\\)).)*$|$)", "gm");
var RE = /^.*?\(.*?[0-9]\).*?(\n((?!\(.*?[0-9]\)).)*$|$)/gm;
var Mucho = RE.exec(txt);
UltraEdit.outputWindow.write(Mucho[0]);
var i = Mucho.length;
UltraEdit.outputWindow.showWindow(true);
UltraEdit.outputWindow.write("Matches found:\r\n" + i);
UltraEdit.newFile();
for (var x = 0; x<i; x++){
 UltraEdit.activeDocument.write("{"+Mucho[x]+"}");
 UltraEdit.activeDocument.write("[Match "+x+"]")
}

Code: Select all

var txt = "#Sample (Entry 1)- 1st line lalala@2nd line to match from entry 1@#Second line to match (Entry 2)- one liner test@";
var RE = /#[^@]*?\([^@]*?[0-9]\)[^@]*?(@((?!\([^@]*?[0-9]\))[^@])*@|@)/gm;

In the second snippet I replaced start of line with "#" and end of line with "@". And they yield the same results(with respect to the replacements).
And so do both regex definition methods(/.../ and new RegExp("..."))

Mofi 6,685587 Grand Master Mofi 6,685587

Aug 22, 2009 #7 (2009-08-22T07:57+00:00)

I'm not a JavaScript expert and also not an expert for the regular expression object, but I think when applying a regular expression on a string $ means end of string and not end of line and ^ means start of string and not start of line. I don't believe that there is really a bug in JavaScript core engine v1.7 used by UltraEdit. I think you make a mistake here and because I'm not an expert I can't tell you exactly which one.

Best regards from an UC/UE/UES for Windows user from Austria

ridgerunner 16 Basic User ridgerunner 16

Answer 1 of 4

Aug 24, 2009 #8 (2009-08-24T01:44+00:00)

Let me address this thread by responding to the posts in reverse order starting with mofi's most recent...

Mofi wrote:I'm not a JavaScript expert and also not an expert for the regular expression object, but I think when applying a regular expression on a string $ means end of string and not end of line and ^ means start of string and not start of line. I don't believe that there is really a bug in JavaScript core engine v1.7 used by UltraEdit. I think you make a mistake here and because I'm not an expert I can't tell you exactly which one.

Just like Perl, the Javascript 'm' multi-line modifier determines how the '^' and '$' metachars behave. When the 'm' modifier is applied, '^' and '$' match the beginning and end of each and every line in the string. When the 'm' modifier is NOT applied (the default), then '^' and '$' match only at the beginning and end of the entire string.
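A minimal sketch of that difference, using a made-up two-line string (the sample text and variable names are illustrative and not taken from the thread):

```javascript
// Hypothetical sample text: two lines separated by a linefeed.
var text = "first line\nsecond line";

// Without 'm', '^' matches only at the very start of the string.
var withoutM = text.match(/^\w+/g);   // ["first"]

// With 'm', '^' also matches right after each line terminator.
var withM = text.match(/^\w+/gm);     // ["first", "second"]

// '$' behaves the same way at the end of each line.
var lineEnds = text.match(/\w+$/gm);  // ["line", "line"]
```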

Answer 2 of 4

Aug 24, 2009 #9 (2009-08-24T01:49+00:00)

Next up...

martix wrote:...
The next part is optional reading, but I believe if I'm not mistaken that it shows a bug: ...

Code: Select all

var txt = "Sample (Entry 1)- 1st line lalala\n2nd line from entry 1 to match\nSecond line to match (Entry 2)- one liner test";

//var RE = new RegExp("^.*?\\(.*?[0-9]\\).*?(\\n((?!\\(.*?[0-9]\\)).)*$|$)", "gm");
var RE = /^.*?\(.*?[0-9]\).*?(\n((?!\(.*?[0-9]\)).)*$|$)/gm;
var Mucho = RE.exec(txt);
UltraEdit.outputWindow.write(Mucho[0]);
var i = Mucho.length;
UltraEdit.outputWindow.showWindow(true);
UltraEdit.outputWindow.write("Matches found:\r\n" + i);
UltraEdit.newFile();
for (var x = 0; x<i; x++){
	UltraEdit.activeDocument.write("{"+Mucho[x]+"}");
	UltraEdit.activeDocument.write("[Match "+x+"]")
}

...

No bug. Your script runs just fine and its output is correct (although at first glance, it may appear to be giving erroneous results). And both forms of the regex ("string" and /RegExp/) work equally well. Running this script from UE v14 produces the following in the output window:

Code: Select all

Running script: martix_20090823.js
========================================================================================================
Sample (Entry 1)- 1st line lalala
2nd line from entry 1 to match
Matches found:
3
Script succeeded.

And this is what appears in the newly created "UE.activeDocument":

Code: Select all

{Sample (Entry 1)- 1st line lalala
2nd line from entry 1 to match}[Match 0]{
2nd line from entry 1 to match}[Match 1]{h}[Match 2]

Now you are probably thinking: "3 matches? WTF? the regex should have found only two matches!" And... "just what the heck is that third "h" match[2] anyway?!" Well, the problem here is that you are using the RE.exec() method, which always returns an array containing all the details of just one match. That is, when RE.exec() is run, it finds the first match (which is the first two whole lines of the subject string), and returns an array with the details of that match having the following members:

Code: Select all

Mucho[0] = the overall match = (the first two lines from the subject).
Mucho[1] = the contents of capturing group 1 = "\n2nd line from entry 1 to match"
Mucho[2] = the contents of capturing group 2 = "h"

These are indeed the correct details of the first match of the given regex and subject string. Your regex has two capturing groups; one that matches the optional second line and another which repeatedly matches one char of the second line - thus there are 3 "Matches". (If your regex had 4 capturing groups, the array returned by RE.exec() would have 5 members.) Now, if you run the RE.exec() a second time, it will find the second overall match and return the details of that match in another array. This is because the RE object has a "lastIndex" property, which keeps track of the position of the last match and tells it where to begin the next search on a target string. (If you inspect RE.lastIndex with your given script, you will see that it is set to 64 after the first and only run of RE.exec().)
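The "run it a second time" behavior can be sketched as a loop that keeps calling exec() until it returns null. The subject string and regex are the ones from the script above; the `found` array and the loop itself are illustrative additions, not part of the original script:

```javascript
var txt = "Sample (Entry 1)- 1st line lalala\n2nd line from entry 1 to match\nSecond line to match (Entry 2)- one liner test";
var RE = /^.*?\(.*?[0-9]\).*?(\n((?!\(.*?[0-9]\)).)*$|$)/gm;

var found = [];  // illustrative accumulator for the overall matches
var m;

// With the 'g' flag, each exec() call resumes at RE.lastIndex.
// exec() returns null (and resets lastIndex to 0) when no match is left.
while ((m = RE.exec(txt)) !== null) {
  found.push({ text: m[0], lastIndex: RE.lastIndex });
}

// found.length        -> 2 (the two-line entry and the one-line entry)
// found[0].lastIndex  -> 64, as noted above
```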

To get an array containing all the matches in your subject text, use the string "match()" method (instead of the RegExp "exec()" method). In other words, change this:

Code: Select all

var Mucho = RE.exec(txt);

To this:

Code: Select all

var Mucho = txt.match(RE);

I think that you will find that this change will provide the results you expected... Note for more detailed information about all the regular expression methods available in Javascript (i.e. there are five of them: str.search(), str.replace(), str.match(), re.exec() and re.test()), I highly recommend reading: "Javascript: The Definitive Guide - 5th Edition" by David Flanagan. In fact, you can read most of chapter 11 (which covers regex) for free at this link (start on page 199).
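For a quick side-by-side of the five methods named above, here is a small sketch on a made-up string (the sample data and variable names are assumptions for illustration only):

```javascript
var s = "a1 b22 c333";  // hypothetical sample string

// str.search(): index of the first match, or -1.
var index = s.search(/[0-9]+/);            // 1

// str.replace(): with 'g', every match is replaced.
var replaced = s.replace(/[0-9]+/g, "#");  // "a# b# c#"

// str.match(): with 'g', an array of all overall matches.
var all = s.match(/[0-9]+/g);              // ["1", "22", "333"]

// re.exec(): details of ONE match (overall match plus capturing groups).
var first = /([a-z])([0-9]+)/.exec(s);     // ["a1", "a", "1"]

// re.test(): true or false only.
var hasDigit = /[0-9]/.test(s);            // true
```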

Answer 3 of 4

Aug 24, 2009 #10 (2009-08-24T01:52+00:00)

and...

Mofi wrote:... If you just use
"^.*?\(.*?[0-9]\).*?(\r\n((?!\(.*?[0-9]\)).)*$|$)"
in a JavaScript the Perl engine will get
"^.*?(.*?[0-9]).*?(rn((?!(.*?[0-9])).)*$|$)" ...

Well, not exactly true. The '\r' and '\n' sequences are valid escape sequences in a JavaScript string and the backslash does NOT need to be escaped for them to work. (They will merely be encoded as a literal carriage return and linefeed in the resulting string, which the regex engine then handles A-Ok just like any other literal character.) However, you are absolutely correct about the need for double backslashes to properly escape the literal parentheses.
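A small sketch of what the JavaScript string parser actually hands to the regex engine in each case. The strings and variable names below are made up for illustration, and `new RegExp()` stands in for whichever engine receives the string:

```javascript
// Single backslashes: "\(" and "\)" lose their backslash,
// while "\r" and "\n" become real CR and LF characters.
var single = "\(1\)\r\n";       // 5 chars: ( 1 ) CR LF

// Doubled backslashes: every backslash survives into the pattern.
var doubled = "\\(1\\)\\r\\n";  // 9 chars: \( 1 \) \r \n

// With doubled backslashes the parentheses are literal.
var literalOk = new RegExp(doubled).test("(1)\r\n");       // true
var literalStrict = new RegExp(doubled).test("1\r\n");     // false

// With single backslashes the parentheses become a capturing group,
// so text WITHOUT parentheses matches too. The CR/LF still work.
var groupInstead = new RegExp(single).test("1\r\n");       // true
```

This is consistent with the symptom reported earlier in the thread, where the search appeared to ignore the parenthesis requirement.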

and...

Mofi wrote:... And following script is surely much faster:

Code: Select all

UltraEdit.perlReOn();
UltraEdit.newFile();
UltraEdit.document[1].setActive();

UltraEdit.document[0].bottom();
if (UltraEdit.document[0].isColNumGt(1)) {
   UltraEdit.document[0].insertLine();
   if (UltraEdit.document[0].isColNumGt(1)) UltraEdit.document[0].deleteToStartOfLine();
}
UltraEdit.document[0].top();
UltraEdit.document[0].findReplace.mode=0;
UltraEdit.document[0].findReplace.matchCase=false;
UltraEdit.document[0].findReplace.matchWord=false;
UltraEdit.document[0].findReplace.regExp=true;
UltraEdit.document[0].findReplace.searchAscii=false;
UltraEdit.document[0].findReplace.searchDown=true;
UltraEdit.document[0].findReplace.searchInColumn=false;

UltraEdit.selectClipboard(9);
UltraEdit.clearClipboard();

while (UltraEdit.document[0].findReplace.find("^.*?\\(.*?[0-9]\\).*?\\r(\\n((?!\\(.*?[0-9]\)).)*\\r\\n|\\n)")) {
  UltraEdit.document[0].copyappend();
}
UltraEdit.document[2].write("\r\n");
UltraEdit.document[2].paste();
UltraEdit.clearClipboard();
UltraEdit.selectClipboard(0);

Actually this script has several serious problems:

Line 22: As written, this script fails immediately (UE v14.20) with a messagebox stating: "You have entered an invalid regular expression!" This is because the last closing literal parenthesis in the regex is not properly escaped with a double backslash, resulting in an overall imbalance of nested parentheses.
Line 23: Another script error. "copyappend()" should be "copyAppend()".
Lines 25,26: Document[2]? This generates a script error if there is no Document 2, and if this document does exist, the script blindly writes some data to it. This script makes assumptions about which document you want to search (must be Document[0]), and the number of files initially open (2). Not good form!
Your regex requires a trailing '\r\n' line termination and will fail to correctly match a line at the end of the string that has no '\r\n'. (I guess this is why the script modifies the file, adding a newline at the end if it is not already there.)
This script modifies the file you are searching by putting a blank line at the end if it does not already exist.

Answer 4 of 4

Aug 24, 2009 #11 (2009-08-24T01:54+00:00)

and Finally...

Regarding the regex, it really should be able to match the last line in the file even if there is no line terminator there. It would also be nice if it matches for files having either DOS ('\r\n') or Unix ('\n') line termination styles. And there is no need for any capturing groups, so these can be changed to non-capturing, and one of the lazy star quantifiers can also be changed to greedy. These changes will improve efficiency just a wee bit. Here is a Javascript regex which does the trick (specified in both "string" and /RegExp/ literal syntax):

Code: Select all

var RE = new RegExp("^.*?\\(.*?[0-9]\\).*(?:\r?\n|$)(?:(?:(?!\\(.*?[0-9]\\)).)*(?:\r?\n|$))?", "gm");
var RE = /^.*?\(.*?[0-9]\).*(?:\r?\n|$)(?:(?:(?!\(.*?[0-9]\)).)*(?:\r?\n|$))?/gm;
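As a rough check of the claims above, the literal form can be run against two small made-up samples, one with Unix and one with DOS line endings. The sample strings and variable names are assumptions for illustration; only the regex comes from the post:

```javascript
var RE = /^.*?\(.*?[0-9]\).*(?:\r?\n|$)(?:(?:(?!\(.*?[0-9]\)).)*(?:\r?\n|$))?/gm;

// Unix line endings: a one-line entry followed by a two-line entry
// whose last line has no terminator.
var unixText = "A (x 1)\nB (y 2)\nplain";

// DOS line endings: a two-line entry followed by a one-line entry
// at the very end of the string.
var dosText = "A (x 1) foo\r\nbar\r\nB (y 2)";

// str.match() with the 'g' flag returns every overall match,
// line terminators included.
var unixMatches = unixText.match(RE);  // ["A (x 1)\n", "B (y 2)\nplain"]
var dosMatches = dosText.match(RE);    // ["A (x 1) foo\r\nbar\r\n", "B (y 2)"]
```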

And here is a UE32 script which utilizes this improved regex. It performs the search on the currently active file and presents the search results in the Output window. It does not modify any files.

Code: Select all

// File: Regex_not_working_as_intended_solution_20090823.js
// Purpose: Answer "Regex not working as intended" UE32 regex thread question.
// Directions: Set the file you wish to search as the active file before running this script.
// Notes: The regex in this script works for files having either DOS or Unix line terminations.
// The line termination sequences are captured along with the matched lines of text.
// It also matches text at the very end of the string (with no line termination on last line).
// Script progress and search results are written to the output window.
// Matches from files having Unix line terminations, will look kind of funny with
// the '\n' line terminator displayed as little squares in the output window.
// The contents of the currently selected clipboard are cleared.
// The file pointer is reset to the top of the file.
// No files are modified.

// local variables
var RE = "^.*?\\(.*?[0-9]\\).*(?:\r?\n|$)(?:(?:(?!\\(.*?[0-9]\\)).)*(?:\r?\n|$))?";
var matchCount = 0; // count of matches
var matchData = ''; // string containing match data

// set the UE search/find regex mode
UltraEdit.perlReOn();
UltraEdit.activeDocument.findReplace.mode=0;
UltraEdit.activeDocument.findReplace.matchCase=false;
UltraEdit.activeDocument.findReplace.matchWord=false;
UltraEdit.activeDocument.findReplace.regExp=true;
UltraEdit.activeDocument.findReplace.searchAscii=false;
UltraEdit.activeDocument.findReplace.searchDown=true;
UltraEdit.activeDocument.findReplace.searchInColumn=false;

// initialize output window to display search progress and results
UltraEdit.outputWindow.showWindow(true); // ensure Output window is displayed
UltraEdit.outputWindow.clear(); // reset the output window data
UltraEdit.outputWindow.write("Searching file for all matches...");

// search for matches one at a time. count and accumulate string with match data
UltraEdit.activeDocument.top(); // start search from the top of the file
while (UltraEdit.activeDocument.findReplace.find(RE)) {
  matchCount++;
  UltraEdit.activeDocument.copy(); // copy match data to clipboard
  // Build a string containing each match with count and line numbers
  matchData += '\r\n----------------------------------------------------------';
  matchData += '\r\nMatch[' + matchCount + '] at Line ';
  matchData += UltraEdit.activeDocument.currentLineNum + ':\r\n"';
  matchData += UltraEdit.clipboardContent + '"';
}
UltraEdit.clearClipboard(); // clear last match from clipboard
UltraEdit.outputWindow.write("Done. Number of matches found = " + matchCount + matchData);
UltraEdit.activeDocument.top(); // reset file pointer to the top of the file

I hope this helps.
Regards from Salt Lake City - Cheers!

Mofi 6,685587 Grand Master Mofi 6,685587

Aug 24, 2009 #12 (2009-08-24T05:57+00:00)

ridgerunner, many thanks for these lessons.

I learned a little bit more about using regular expressions in JavaScript (or better JScript). You are right that \r\n in JavaScript strings is already translated to 0x0D 0x0A and is therefore passed correctly to the Perl engine. You are also right about the errors in my script. It was coded quick and dirty, without the ability to test it because of a missing example. So I have not even run it once, which explains the simple syntax errors (corrected now in my post). And of course the script was not written for general usage. I did not invest any thought in how to code this script for safe usage in general.

I did not know till now that $ used in a Perl regular expression search works also on last line of a file without line termination. The legacy UltraEdit/Unix regular expression engines fail in this case. They require a line terminator on last line of the file when using $ and last line should be also found. Of course that is a well-known bug of the legacy regexp engines because end of file should be interpreted also as end of line as the Perl regexp engine does.

However, many thanks again for your lessons. I really appreciate it.

Best regards from an UC/UE/UES for Windows user from Austria

ridgerunner 16 Basic User ridgerunner 16

Aug 24, 2009 #13 (2009-08-24T14:14+00:00)

Mofi wrote:... However, many thanks again for your lessons. I really appreciate it.

You are most certainly welcome! Of all the forums I participate in, (a LOT), you are far and away the most helpful person I have come across. With over 2700 posts, (most of which are in-depth with a very high signal-to-noise ratio), you have indeed done (and are still doing) a great service to the IDM company - and more importantly, you have helped many, many UltraEdit users. So it is thus, a rare pleasure for me to have the opportunity to help you!

Thanks for all your many contributions!
Jeff Roberson
Salt Lake City Utah USA

p.s. I originally meant to make only a quick, brief response to this thread, but my OCD kicked in and I ended up spending all day on it. (D'oh!) However, it did give me an opportunity to spend some time learning more about scripting within UE, which is something that I have not really played around with much. In preparing my response, I ended up reading several of your other posts which were very helpful in putting together a decent solution. Thanks again!

martix 5 Newbie martix 5

Aug 24, 2009 #14 (2009-08-24T15:56+00:00)

Well I thank you as well. This was a one-shot data-extraction task(hence the specific and "dirty" code you find splattered all over the thread).
But then again it was also a great exercise in UE scripting(this was my first encounter with it) and also in JavaScript and regex which I need every now and then and are invaluable tools for quite a varied assortment of necessary tasks!
Now code-wise:
A1: Yup, the "m" takes care of that issue. Most tuts I read though didn't really manage to mention it.

A2: Thanks for clearing up on how .exec() works. No one really explained all the JSregex methods all too well to me however hard I tried to make them.

So that's one confusion lifted.
A3: The mistakes were quite simple ones and fixable in seconds. And it also worked without the double-escape of line-terminator literals.
A4: Well that'd be the graceful, generic solution, but for a onetime task, a working one is sufficient as well.

Besides, capturing groups are easier to write and look at than non-capturing ones.

In the end the thread was exceptionally beneficial, cleared up quite a few misunderstandings I had about the .exec() and a bunch of other things. And I have a feeling people are gonna come back looking for it

wchu01 3 Newbie wchu01 3

Oct 01, 2009 #15 (2009-10-01T14:06+00:00)

Hello, I am having an issue with my regular expression, which I substituted for yours. JavaScript Lint complains of a syntax error, but I cannot see it, as I really did just paste my expression in there.

The regex (/?<==)\w+(?=;|,/) is to find the text between two anchors("=" and ";" or ",") for your reference.

Thank you for any assistance.

Code: Select all

//Environment Setup
 //UltraEdit.insertMode();
 if (typeof(UltraEdit.columnModeOff) == "function") UltraEdit.columnModeOff();
 else if (typeof(UltraEdit.activeDocument.columnModeOff) == "function") UltraEdit.activeDocument.columnModeOff();
 UltraEdit.perlReOn();
 UltraEdit.activeDocument.findReplace.matchCase=true;
 UltraEdit.activeDocument.findReplace.matchWord=false;
 UltraEdit.activeDocument.findReplace.regExp=false;
 UltraEdit.activeDocument.findReplace.mode=0;

 // Move cursor to top of active file and run the initial search.
 UltraEdit.activeDocument.top();

 //To ensure search always start from the first column
 UltraEdit.activeDocument.key("HOME");

 UltraEdit.activeDocument.findReplace.regExp=true;

 var selectedText = UltraEdit.activeDocument.selection; /* get selected text */

 if(selectedText.length>0)
 { /* anything selected ? */

   /* let's search for a simple SQL date pattern */
   var findResult = selectedText.match(/?<==)\w+(?=;|,/); /* JavaScript uses its own Perl regexp engine */

   if(findResult==null)
   { /* nothing found returns a null object */
     UltraEdit.messageBox("Pattern not found");
   }
   else
   {
     UltraEdit.messageBox("Pattern found: "+findResult);
   }
 }

 UltraEdit.outputWindow.write("Active="+findResult);
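The replies to this question are not part of this excerpt, so the following is only a guess at the cause. In the pasted pattern the slash delimiters sit inside the parentheses; the intended form was presumably /(?<==)\w+(?=;|,)/. Even in that form, the lookbehind (?<=...) is not supported by older JavaScript engines (it entered the language with ES2018), so a script engine of the 2009 era would likely still reject it. A sketch that avoids lookbehind by matching the "=" and capturing the word after it (the sample text is made up):

```javascript
// Hypothetical selection text; the real data is not shown in the post.
var selectedText = "start=20090101;end=20091001,";

// Match "=", capture the following word, and require ";" or "," after it
// via lookahead (lookahead IS available in older JavaScript engines).
var findResult = selectedText.match(/=(\w+)(?=[;,])/);

// Group 1 holds the text between the two anchors.
var value = findResult ? findResult[1] : null;  // "20090101"
```

Without the 'g' flag, match() returns the same kind of array as exec(), so index 1 is the first capturing group rather than a second match.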
