Why is selecting an HTML element from start to end tag with a Perl regular expression sometimes not working as expected?

Frenchy62620 · Oct 01, 2020 · #1 · 2020-10-01T11:22+00:00

Hi,

I have just started using UEStudio and I am testing some functionalities of UEStudio version 20.0.0.50.

I want to select all text (all lines) between two HTML tags, including the tags themselves, for example from <nav sometext to </nav>. There are plenty of ways to do that, but Perl is my friend, so I use the expression (?s)^<nav sometext.*</nav> to select the desired text. (?s) means that the dot also matches newline characters, so the search continues on the next line.

The problem appears when the file has many lines (more than 2000, for example). With a small file there is no problem, whether I use a script or the Find dialog to make the selection.

Sample of script:


UltraEdit.activeDocument.top();

UltraEdit.activeDocument.findReplace.matchCase=false;
UltraEdit.activeDocument.findReplace.matchWord=false;
UltraEdit.activeDocument.findReplace.searchAscii=false;
UltraEdit.activeDocument.findReplace.searchDown=true;
UltraEdit.activeDocument.findReplace.searchInColumn=false;

UltraEdit.perlReOn();
UltraEdit.activeDocument.findReplace.mode = 0;
UltraEdit.activeDocument.findReplace.regExp = true;
var pattern = '(?s)^<nav id="course-timeline".*</nav>$';
UltraEdit.activeDocument.findReplace.find(pattern);

UltraEdit.outputWindow.write('before' + UltraEdit.activeDocument.isFound());
UltraEdit.outputWindow.write(UltraEdit.activeDocument.selection);
UltraEdit.outputWindow.write('after');

The find seems to get lost with the attached file, see attached bug1.zip.

Any idea why?
Terry

Mofi · Oct 01, 2020 · #2 · 2020-10-01T12:29+00:00

It is possible to select a large block by running a Find and holding the Shift key while clicking the Next button, i.e. by using the Find Select feature. The selection can in this case be several hundred MB or even some GB. The Find Select feature is also available for UltraEdit macros. But it is not possible to select a large block with a Perl regular expression find. The Perl regular expression function loads the matched byte stream onto the stack, and the stack is limited in size to some MB. How large a block selected with a Perl regular expression can be depends on the current stack usage and is therefore not really predictable. I have never seen a problem with a selection smaller than 4 MiB, but even this value is not guaranteed in any way.

I wrote an UltraEdit script function that selects a block of any size. It is a replacement for the Find Select feature, which is not available as a scripting command; see the topic Replacement for macro command Find Select to select a block between or with two found strings. I recommend using the function FindSelectOuter for your task.
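The general idea of replacing one regular expression with two plain string finds can be sketched in ordinary JavaScript. This is only an illustration on an in-memory string: selectOuter is a hypothetical helper, not the FindSelectOuter function from the linked topic, and it does not use the UltraEdit scripting API. Unlike the script in the question, it is case-sensitive.

```javascript
// Hypothetical helper (NOT Mofi's FindSelectOuter): returns the text from the
// first occurrence of startText up to and including the next occurrence of
// endText, or null when either string cannot be found.
function selectOuter(text, startText, endText) {
  const begin = text.indexOf(startText);
  if (begin === -1) return null;
  // Look for the end string only after the start string.
  const endPos = text.indexOf(endText, begin + startText.length);
  if (endPos === -1) return null;
  return text.slice(begin, endPos + endText.length);
}

const page = [
  '<body>',
  '<nav id="course-timeline">',
  '<a>1</a>',
  '</nav>',
  '<nav id="x"></nav>',
  '</body>',
].join('\n');

const block = selectOuter(page, '<nav id="course-timeline"', '</nav>');
console.log(block);
```

Because neither find involves a regular expression, nothing has to be matched as one long character stream; only the two positions matter.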

Frenchy62620 · Oct 01, 2020 · #3 · 2020-10-01T12:45+00:00

Thanks for help, Mofi.

But the problem is not the size of the selection, because the selection is only 100 to 150 lines. The problem appears when the file has a large number of lines, so maybe there is a collateral problem.

I will go and look at your script.

Mofi · Oct 01, 2020 · #4 · 2020-10-01T14:06+00:00

See also Wrong result with a find using option match whole word in rare cases with longer word across a block boundary (fixed).

UltraEdit and UEStudio do not load an entire file into memory for processing. Large files are processed in blocks. A Perl regular expression is always run on the currently loaded file block. Only if the Perl regular expression function reports to UltraEdit that it needs more characters (bytes) before or after the currently loaded block, in order to decide whether the currently matched character stream is a positive or negative match, does UltraEdit load the previous or next file block and pass it to the Perl regular expression function. So it depends on the Perl regular expression search string whether a file block boundary can result in an unexpected match or selection.
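Why a block boundary can matter at all may be easier to see with a toy model. The sketch below is plain JavaScript and is not how UltraEdit is implemented: findInBlocks is a hypothetical helper that searches each fixed-size block in isolation and never asks for the neighboring block, which is exactly the step described above that a real implementation must get right.

```javascript
// Hypothetical helper: search each fixed-size block independently.
// It is deliberately naive and never looks across a block boundary.
function findInBlocks(text, needle, blockSize) {
  for (let start = 0; start < text.length; start += blockSize) {
    const block = text.slice(start, start + blockSize);
    const pos = block.indexOf(needle);
    if (pos !== -1) return start + pos; // offset in the whole text
  }
  return -1;
}

// 10 filler characters, the closing tag at offsets 10..15, 10 more fillers.
const text = 'x'.repeat(10) + '</nav>' + 'y'.repeat(10);

console.log(text.indexOf('</nav>'));           // 10: whole text searched at once
console.log(findInBlocks(text, '</nav>', 13)); // -1: tag straddles offsets 12/13
console.log(findInBlocks(text, '</nav>', 32)); // 10: tag fits into a single block
```

The same text gives a different result depending only on where the block boundary happens to fall, which is consistent with a find that works in a small file and fails in a large one.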

There are many text editors which always load the entire text file into memory for processing. At least in the past, those text editors could not be used to process large CSV and LOG files. But nowadays PCs have more and more RAM, so ever larger files can be loaded as one block into RAM.

IDM Computer Solutions also increases the block size used by UltraEdit and UEStudio from time to time. While 20 years ago the block size was limited to 64 KB, which most file stream functions still use, the file load block size used by UltraEdit has been increased several times in the last 20 years. I am not a developer of UltraEdit or UEStudio, so I don't know which file block size is currently used by UltraEdit v27.10 and UEStudio v20.00, but I could find it out with Process Monitor if you would like to know.

Most users of UltraEdit or UEStudio never notice that UE/UES manages the file contents in blocks. Perl regular expression finds/replaces that select a larger block, and some special finds, are the rare cases in which users can occasionally observe an issue caused by processing the contents of a larger file in blocks.

I recommend looking at Fleggy's post Matching tag pairs.

An expression like .* is always problematic, especially in combination with (?s), as it means any character 0 or more times, greedy. Matching nothing after <nav sometext is valid, and reaching the end of the currently processed character stream without finding </nav> at all results in a negative match. The expression is greedy, which means that matching any character, including newline characters, does not stop at the next occurrence of </nav> but at the last occurrence of </nav> in the file. So there is no positive match only if the starting tag cannot be found at the beginning of a line, or if there is no </nav> anywhere in the file after an already matched starting tag and the fixed text defined in the search string.

In practice the Perl regular expression first searches for <nav sometext at the beginning of a line, then matches everything up to the end of the character stream, which is (or should be) the end of the file, and then scans backwards from the end to find </nav> before reaching the position of the matched <nav sometext. I know it is hard for a user of Perl regular expressions to understand how the engine works in the background and why a poorly chosen expression applied to specific data results in unexpected find behavior, although it is very clear to a Perl regular expression expert who knows how the engine (function) works.
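The difference between the greedy .* and its non-greedy form .*? can be tried out in plain JavaScript. This is only a sketch under assumptions: JavaScript's RegExp is not the Perl regular expression engine of UltraEdit, the flags m and s stand in for the default line anchoring and the inline (?s), and the sample document is made up. A non-greedy quantifier is a common alternative, but it is not claimed here to fix the block boundary issue.

```javascript
// Made-up sample with two <nav> elements; only the first one is wanted.
const html = [
  '<nav id="course-timeline">',
  '  <a href="#one">One</a>',
  '</nav>',
  '<p>Unrelated content</p>',
  '<nav id="footer">',
  '  <a href="#two">Two</a>',
  '</nav>',
].join('\n');

// m: ^ matches at the start of every line; s: the dot also matches newlines.
const greedy = /^<nav id="course-timeline".*<\/nav>/ms;
const lazy = /^<nav id="course-timeline".*?<\/nav>/ms;

const greedyMatch = html.match(greedy)[0];
const lazyMatch = html.match(lazy)[0];

console.log(greedyMatch.split('\n').length); // 7: runs to the LAST </nav>
console.log(lazyMatch.split('\n').length);   // 3: stops at the FIRST </nav>
```

The greedy version has to consume the rest of the text before it can backtrack to the last </nav>, so the amount of text the engine touches grows with the file, not with the element that is actually wanted.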

PS: Whenever somebody on Stack Overflow asks for help with a regular expression search string to match a block in an HTML, XHTML or XML file, the regular expression experts first reply with a comment advising against doing that with a regular expression search, as it is hard to do on data which is very often nested by design.
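The nesting problem can be shown with a small plain JavaScript sketch. selectBalanced is a hypothetical helper written for this illustration; it assumes a plain tag name and ignores comments, self-closing tags and > characters inside attribute values, so it is a toy and not an HTML parser.

```javascript
// Hypothetical helper: returns the first <tag ...> element together with its
// balanced closing tag by counting nesting depth, or null if none is found.
function selectBalanced(text, tag) {
  // Matches either an opening tag or a closing tag of the given name.
  const token = new RegExp('<' + tag + '\\b[^>]*>|</' + tag + '\\s*>', 'gi');
  let depth = 0;
  let begin = -1;
  let m;
  while ((m = token.exec(text)) !== null) {
    if (!m[0].startsWith('</')) {
      if (depth === 0) begin = m.index; // remember the outermost opening tag
      depth += 1;
    } else if (depth > 0) {
      depth -= 1;
      if (depth === 0) return text.slice(begin, m.index + m[0].length);
    }
  }
  return null;
}

const nested = '<div id="a"><div id="b">x</div>y</div><div id="c"></div>';

// A single non-greedy expression stops at the inner closing tag.
console.log(nested.match(/<div id="a".*?<\/div>/s)[0]);
// → <div id="a"><div id="b">x</div>

// Counting depth finds the closing tag that belongs to the outer element.
console.log(selectBalanced(nested, 'div'));
// → <div id="a"><div id="b">x</div>y</div>
```

With nested elements of the same name, neither the greedy nor the non-greedy form of a single expression selects the right block, which is why a depth count or a real parser is usually suggested instead.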