Search and count strings - only 50% is found

Peter · #1 · Mar 30, 2020, 16:59 UTC

Prologue:
I thought I had already written about this problem, but maybe I am mixing it up with this one:
special-replace-all-problem-with-ultrae ... 18315.html
----------------------------------------------------
Version 26.20.0.74
I have this text:

a.txt
b.txt
c.txt
d.txt

I want to count all occurrences of ".txt" at the end of the line and define a search:

Search: *.txt^p
in current file: yes
Use Reg ex: yes
Command: Count all

My result:
- 2 found
- result list: 2 lines displayed
- Highlight results: 4 lines highlighted
(I tried it also with 2405 lines and got 1203 results)
When I remove "^p", the count is correct for this file, but of course the search would then also find unwanted strings at other positions.

What's going wrong?

fleggy · #2 · Mar 30, 2020, 18:48 UTC

Hi Peter,

It is obviously a bug. The simpler expression .txt^p is OK, and the Perl variants also work flawlessly.

BR, Fleggy

EDIT: version 26.20.0.68 x64

Mofi · #3 · Mar 31, 2020, 05:51 UTC

The issue here is *, which in an UltraEdit regular expression means zero or more characters except newline characters (greedy). I put greedy in parentheses because the legacy UltraEdit regular expression engine does not really support the concept of greedy and non-greedy matching. In my experience with the UltraEdit regular expression engine, * usually produces a greedy match while ?++ usually produces a non-greedy match. Usually does not mean always: whether characters are matched greedily or non-greedily depends on the search string and on the text searched.

A non-regular-expression Count all using only .txt^p returns four, provided the last line also has a DOS/Windows line termination. The UltraEdit regular expression ?++.txt^p also finds four occurrences.
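
As a baseline outside UltraEdit, the expected count can be checked with a few lines of Python. This is only a sketch: Python's re module is a Perl-style engine and not UltraEdit's legacy engine, and the sample string assumes DOS/Windows line endings on all four lines, including the last one.

```python
import re

# Sample file content from the first post, assuming CR+LF after every line.
text = "a.txt\r\nb.txt\r\nc.txt\r\nd.txt\r\n"

# Plain (non-regex) count of ".txt" followed by a line termination.
plain_count = text.count(".txt\r\n")

# Perl-style stand-in for "*.txt^p": any non-newline characters, ".txt", CR+LF.
regex_count = len(re.findall(r"[^\r\n]*\.txt\r\n", text))

print(plain_count, regex_count)
```

Both counts are four for this input, which is the result one would expect from Count all.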

With the UltraEdit regular expression search string *.txt^p, however, Count all returns only half of that. When the entire file content is interpreted as one character stream, the greedy * does not stop matching zero or more characters left of .txt as expected on all four lines, or it matches nothing on every even line. Running the same search string *.txt^p as a normal find matches one line after the other, because the character stream changes with each find: the end of the current selection marks the beginning of the character stream searched for the next occurrence.

In fact, using * without something fixed to its left or to its right often results in undefined behavior of an UltraEdit regular expression search. I have known that for more than 20 years and therefore avoid * unless something left and right of the asterisk defines where matching zero or more characters starts and stops. With * at the beginning of the search string, sometimes the asterisk matches nothing, and sometimes the match begins and/or ends at an unexpected position. That is the problem with undefined behavior: the result is not predictable.

Another solution here is the UltraEdit regular expression search string *.txt$, which also finds four occurrences, because $ does not match the newline characters at the end of each line when the entire character stream of the file is searched.
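
The effect of an end-of-line anchor can be illustrated in Python, again only as an analogy with a Perl-style engine and not as a description of UltraEdit internals. A lookahead stands in for $ here, and the sample text assumes CR+LF line endings. The point is that each match ends before the line termination instead of consuming it.

```python
import re

text = "a.txt\r\nb.txt\r\nc.txt\r\nd.txt\r\n"

# (?=\r\n) asserts that a line termination follows without consuming it.
matches = list(re.finditer(r"[^\r\n]*\.txt(?=\r\n)", text))

for m in matches:
    # Each span ends right before the CR+LF of its line.
    print(m.span(), repr(m.group()))
```

Because no line termination is consumed, the search for the next occurrence still sees the CR+LF in front of the next line.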

The UltraEdit regular expression search string %*.txt^p with Count all also returns the unexpected result two instead of the expected four. Here ^p matches the line termination of every odd line. The beginning of the following even line is then not found, because the previous match has already consumed the line termination in front of it, so all even lines are ignored. The result is different again when %*.txt^p is used for a normal find, which finds and selects each line one after the other.

The difference between Count all and finding one occurrence after the other with %*.txt^p can be explained from a programmer's point of view. Count all interprets all characters of the file as one character stream. A find that matches one occurrence after the other searches a different character stream each time, always from the end of the current selection, which is the beginning of a line, to the end of the file. So on finding one occurrence after the other, % matches the beginning of the current character stream four times. With Count all, the same search expression interprets % just once as the beginning of the character stream (top of file) and once more as the beginning of line three, because the CR+LF at the end of the second line is not matched by the search expression.

On Count all, the UltraEdit regular expression engine never sets the character offset back when % is at the beginning of the search expression and ^p (or ^n or ^r) is at the end. So every second line termination is not interpreted as the beginning of a line, because the offset in the character stream when searching for % is already beyond the carriage return and line-feed matched before.
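
The explanation above can be turned into a small toy model in Python. This is not UltraEdit's actual implementation, which is not public; it is a minimal sketch under the assumption that a beginning of line is recognized only at the start of the stream or directly after a line termination that the scanner itself passes over. The function names count_all and find_one_by_one are made up for this illustration, and the regular expression is a Perl-style stand-in for *.txt^p on a file with CR+LF line endings.

```python
import re

# Python stand-in for "*.txt^p" on one line: non-newline chars, ".txt", CR+LF.
LINE = re.compile(r"[^\r\n]*\.txt\r\n")


def count_all(text):
    """Toy model of Count all: one stream, the offset only moves forward."""
    count, pos = 0, 0
    at_line_start = True  # offset 0 is the top of the file
    while pos < len(text):
        if at_line_start:
            m = LINE.match(text, pos)
            if m:
                count += 1
                pos = m.end()
                # The CR+LF was consumed by the match, so the model sees no
                # line termination in front of the new offset.
                at_line_start = False
                continue
        # Skip ahead to just after the next line termination in the stream.
        nl = text.find("\r\n", pos)
        if nl == -1:
            break
        pos = nl + 2
        at_line_start = True
    return count


def find_one_by_one(text):
    """Toy model of repeated finds: each find starts with a fresh stream."""
    count, pos = 0, 0
    while pos < len(text):
        stream = text[pos:]  # begins at the end of the previous selection
        offset, end = 0, None  # offset 0 of the new stream is a line start
        while offset < len(stream):
            m = LINE.match(stream, offset)
            if m:
                end = m.end()
                break
            nl = stream.find("\r\n", offset)
            if nl == -1:
                break
            offset = nl + 2
        if end is None:
            break
        count += 1
        pos += end
    return count


sample = "a.txt\r\nb.txt\r\nc.txt\r\nd.txt\r\n"
print(count_all(sample), find_one_by_one(sample))
print(count_all("x.txt\r\n" * 2405))
```

For the four-line sample this model gives 2 for count_all and 4 for find_one_by_one, and for 2405 such lines count_all gives 1203, which is in line with the numbers reported in the first post.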

I know this behavior is hard to understand for anyone who has never written code to parse a character stream, as I have done several times. Without knowing what goes on in the background, and which character stream an expression really parses, a user will have trouble understanding the different behaviors.

In my opinion, the unexpected result of two instead of four from Count all with *.txt^p or %*.txt^p as search string is not a bug of the UltraEdit regular expression engine. It is caused by using the wrong expression for this task, and the result two is correct once you consider which character stream is processed in this case and how it is processed.