How to find duplicate data in consecutive paragraphs?

Rajesh · Jul 22, 2016 · #1 · 2016-07-22T13:47+00:00

Dear Mofi,

I want to search for similar text in two consecutive paragraphs, one after the other.
For that I created the pattern below.

This one works nicely:
(.+.)[ .,](.+.)\r\n)(\1)

But when I use this pattern to select the exact text, it does not work:
(?<=(.+.)[ .,](.+.)\r\n)(\1)

Example:
Craig, Robert Fenton. University of California, Los Angeles, Special Collections.
Craig, Robert Fenton. University of Oxford, Special Collections. 2014, pp. 20-13.

Rajesh

Mofi · Jul 23, 2016 · #2 · 2016-07-23T08:05+00:00

Both regular expression search strings in Perl syntax are invalid regular expressions and therefore can't work at all.

The search string

([^.]+?\.).*\r\n\1.*

can be used to find and select two consecutive paragraphs that both start with the same string up to the first full stop in each paragraph.
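As a rough illustration outside UltraEdit, the same search string can be tried with JavaScript's regular expression engine, which should treat this simple backreference the same way for this input. The sketch assumes DOS/Windows line endings; the variable names are only illustrative and are not part of any UltraEdit API.

```javascript
// Sample lines from the question, joined with \r\n (assumed line ending).
var sText =
   "Craig, Robert Fenton. University of California, Los Angeles, Special Collections.\r\n" +
   "Craig, Robert Fenton. University of Oxford, Special Collections. 2014, pp. 20-13.";

// Group 1 captures everything up to and including the first full stop.
// \1 requires the next paragraph to start with exactly the same string.
var rxSameStart = /([^.]+?\.).*\r\n\1.*/;

var aMatch = sText.match(rxSameStart);
var sSharedStart = aMatch ? aMatch[1] : null;
console.log(sSharedStart);  // Craig, Robert Fenton.
```

The whole match spans both paragraphs, which is why the search selects the two lines together rather than only the duplicate part.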

Rajesh · Jul 23, 2016 · #3 · 2016-07-23T11:30+00:00

Dear Mofi,

Thanks for your reply.

Actually, I want to find DUPLICATE text in two paragraphs, as described below:

1. The first and second paragraphs share the duplicate part "Craig, Robert Fenton. University of".
2. I want to find this text "Craig, Robert Fenton. University of" in the second paragraph with a single search.
3. After the search, the caret should be positioned after the text "Craig, Robert Fenton. University of" in the second paragraph.

Example:
Craig, Robert Fenton. University of California, Los Angeles, Special Collections.
Craig, Robert Fenton. University of Oxford, Special Collections. 2014, pp. 20-13.

Rajesh

Mofi · Jul 23, 2016 · #4 · 2016-07-23T16:19+00:00

It is impossible to use a regular expression search to search for any string of any length which exists more than once in two consecutive paragraphs. Regular expression searches are based on clearly defined rules. Those vague requirements make it impossible to define those rules. I was wrong with this statement as fleggy demonstrates below.

You may use the following script. Please read the comments of the script and adjust value of variable nMinimalEqualLength to your requirements.

if (UltraEdit.document.length > 0)  // Is any file opened?
{
   // Define environment for this script. This script is designed to
   // run a search from current position in active file to end of file.
   UltraEdit.insertMode();
   if (typeof(UltraEdit.columnModeOff) == "function") UltraEdit.columnModeOff();
   else if (typeof(UltraEdit.activeDocument.columnModeOff) == "function") UltraEdit.activeDocument.columnModeOff();

   var nMinimalEqualLength = 4;

   var bDuplicateFound = false;

   // Get current caret position in active file.
   var nLine = UltraEdit.activeDocument.currentLineNum;
   var nColumn = UltraEdit.activeDocument.currentColumnNum;

   // Move caret to beginning of current line respectively to
   // first non whitespace character in current line depending
   // on configuration setting Home key always goes to column 1.
   UltraEdit.activeDocument.key("HOME");

   // Define the parameters to find two consecutive reference paragraphs
   // using a case-sensitive UltraEdit regular expression search.
   UltraEdit.ueReOn();
   UltraEdit.activeDocument.findReplace.mode=0;
   UltraEdit.activeDocument.findReplace.matchCase=true;
   UltraEdit.activeDocument.findReplace.matchWord=false;
   UltraEdit.activeDocument.findReplace.regExp=true;
   UltraEdit.activeDocument.findReplace.searchDown=true;
   if (typeof(UltraEdit.activeDocument.findReplace.searchInColumn) == "boolean")
   {
      UltraEdit.activeDocument.findReplace.searchInColumn=false;
   }

   // Run this find for two consecutive reference paragraphs until either
   // nothing found anymore up to end of file or duplicate data found at
   // beginning of two found reference paragraphs.
   while (UltraEdit.activeDocument.findReplace.find('<p class="ref">*</p>^r^n<p class="ref">*</p>'))
   {
      // Get from the two found reference paragraphs just the text without the
      // tags <p class="ref"> and </p> and split the two lines up to two strings.
      var asParagraphsText = UltraEdit.activeDocument.selection.replace(/<p class="ref">|<\/p>/g,"").split("\r\n");

      // Run a case-sensitive character by character comparison in a
      // loop from beginning of each paragraph text until either end of
      // first found paragraph text or end of second found paragraph text
      // is reached or the currently compared characters are not equal.
      var nCharIndex = 0;
      while ((nCharIndex < asParagraphsText[0].length) && (nCharIndex < asParagraphsText[1].length))
      {
         if (asParagraphsText[0][nCharIndex] != asParagraphsText[1][nCharIndex]) break;
         nCharIndex++;
      }

      // Are there at least as many equal characters at beginning of the
      // found reference paragraphs as defined at top of this script?
      if (nCharIndex >= nMinimalEqualLength)
      {
         // Set caret in file to beginning of current line with second paragraph.
         UltraEdit.activeDocument.key("HOME");
         // Search case-sensitive for the equal string to select it and
         // exit script. Note: ^ in search string is not escaped although
         // it would be required to find a duplicate string containing ^.
         UltraEdit.activeDocument.findReplace.regExp=false;
         UltraEdit.activeDocument.findReplace.find(asParagraphsText[0].substring(0,nCharIndex));
         bDuplicateFound = true;
         break;
      }

      // There are not enough equal characters at beginning of the two
      // found reference paragraphs. So move caret in file to beginning
      // of current line with second paragraph and run regular expression
      // find once again.
      UltraEdit.activeDocument.key("HOME");
   }

   // Set caret to initial position if there could not be found any
   // duplicate string with at least nMinimalEqualLength at beginning
   // of two consecutive reference paragraphs.
   if (!bDuplicateFound)
   {
      if (typeof(UltraEdit.activeDocumentIdx) == "undefined") nColumn++;
      UltraEdit.activeDocument.gotoLine(nLine,nColumn);
   }
}

It was tested on following data:

<p class="ref">Craig, Robert Fenton. University of California, Los Angeles, Special Collections.</p>
<p class="ref">Craig, Robert Fenton. University of California, Los Angeles, Special Collections.</p>
<p class="ref">Craig, Robert Fenton. University of Oxford, Special Collections. 2014, pp. 20-13.</p>
<p class="ref">Craig, Robert Fenton. University of California, Los Angeles</p>
<p class="ref">Doe, Jone. University of Oxford, Special Collections. 2014, pp. 20-13.</p>
<p class="ref">Doe, Jane. University of California, Los Angeles, Special Collections.</p>
<p class="ref">Asterix, gaul</p>
<p class="ref">Asterix and Obelix, gauls</p>
<p class="ref">Miraculix, gaul</p>

The 1st script run from the top of the file selects the entire string within the paragraph tags in line 2.
The 2nd script run selects the entire string within the paragraph tags in line 3.
The 3rd script run selects "Craig, Robert Fenton. University of " in line 4.
The 4th script run selects "Doe, J" in line 6.
The 5th script run selects "Asterix" in line 8.
The 6th script run cancels the selection and keeps the caret positioned in line 8, column 23.
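The character-by-character comparison at the heart of the script can also be tried outside UltraEdit. The following is a minimal sketch in plain JavaScript; the helper names commonPrefixLength and duplicateStart are hypothetical and not part of the script or of the UltraEdit scripting API.

```javascript
// Hypothetical helper: number of equal characters at the beginning
// of two strings, compared case-sensitively.
function commonPrefixLength(sFirst, sSecond)
{
   var nCharIndex = 0;
   while ((nCharIndex < sFirst.length) && (nCharIndex < sSecond.length))
   {
      if (sFirst[nCharIndex] != sSecond[nCharIndex]) break;
      nCharIndex++;
   }
   return nCharIndex;
}

// Hypothetical helper: the equal start of two paragraphs, or an empty
// string if it is shorter than nMinimalEqualLength characters.
function duplicateStart(sFirst, sSecond, nMinimalEqualLength)
{
   var nLength = commonPrefixLength(sFirst, sSecond);
   return (nLength >= nMinimalEqualLength) ? sFirst.substring(0, nLength) : "";
}

// Paragraph texts taken from the test data, without the tags.
console.log(duplicateStart(
   "Doe, Jone. University of Oxford, Special Collections. 2014, pp. 20-13.",
   "Doe, Jane. University of California, Los Angeles, Special Collections.", 4));  // Doe, J
console.log(duplicateStart("Asterix, gaul", "Asterix and Obelix, gauls", 4));      // Asterix
```

The last pair of the test data, "Asterix and Obelix, gauls" and "Miraculix, gaul", has no equal start at all, so the helper returns an empty string for it.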

fleggy · Jul 24, 2016 · #5 · 2016-07-24T10:08+00:00

Hi Rajesh,

try this Perl replace:

Find what:

(<p class="ref">)(.*)((.(?!/p>))*</p>)
(?1)\2(?3)

Replace with:

$&
~~~\2~~~

From the text:

<p class="ref">Craig, Robert Fenton. University of California, Los Angeles, Special Collections.</p>
<p class="ref">Craig, Robert Fenton. University of Oxford, Special Collections. 2014, pp. 20-13.</p>

<p class="ref">Craig, Robert Fenton. University of California, Los Angeles, Special Collections.</p>
<p class="ref">Oogwan, Robert Fenton. University of Oxford, Special Collections. 2014, pp. 20-13.</p>

it creates:

<p class="ref">Craig, Robert Fenton. University of California, Los Angeles, Special Collections.</p>
<p class="ref">Craig, Robert Fenton. University of Oxford, Special Collections. 2014, pp. 20-13.</p>
~~~Craig, Robert Fenton. University of ~~~

<p class="ref">Craig, Robert Fenton. University of California, Los Angeles, Special Collections.</p>
<p class="ref">Oogwan, Robert Fenton. University of Oxford, Special Collections. 2014, pp. 20-13.</p>
~~~~~~

which you can easily parse.
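One possible way to parse that output is sketched below in plain JavaScript (not an UltraEdit API). It assumes \r\n line endings and that no ordinary line starts and ends with ~~~; the variable names are hypothetical.

```javascript
// Output of the replace as shown above (line endings assumed to be \r\n).
var sResult = [
   '<p class="ref">Craig, Robert Fenton. University of California, Los Angeles, Special Collections.</p>',
   '<p class="ref">Craig, Robert Fenton. University of Oxford, Special Collections. 2014, pp. 20-13.</p>',
   '~~~Craig, Robert Fenton. University of ~~~',
   '',
   '<p class="ref">Craig, Robert Fenton. University of California, Los Angeles, Special Collections.</p>',
   '<p class="ref">Oogwan, Robert Fenton. University of Oxford, Special Collections. 2014, pp. 20-13.</p>',
   '~~~~~~'
].join("\r\n");

// Collect the text between the ~~~ markers of every marker line.
var asDuplicates = [];
var rxMarker = /^~~~(.*)~~~$/gm;
var aMatch;
while ((aMatch = rxMarker.exec(sResult)) !== null)
{
   asDuplicates.push(aMatch[1]);
}

console.log(asDuplicates.length);  // 2
console.log(asDuplicates[0]);      // Craig, Robert Fenton. University of (followed by a space)
```

An empty entry, as produced for the second pair, means that the two paragraphs do not start with the same text.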

EDIT1: A nicer Find what.

EDIT2: I had forgotten to update the old group number in Replace with. Now it is correct.

Mofi · Jul 24, 2016 · #6 · 2016-07-24T14:13+00:00

fleggy, very impressive. I have never seen a recursive expression before, so I don't know anything about it, including how to use it. But it looks like there are indeed rare use cases where a recursive expression makes sense.

Where is $& documented?

I could not find anything about this expression in the Boost library documentation. I can see that it always references the entire matched string, and I found it documented in the perlre documentation. But as far as I know, the regular expression functions of the Perl interpreter differ from the regular expression implementation in the Boost library, so I don't use the perlre documentation.

fleggy · Jul 24, 2016 · #7 · 2016-07-24T15:20+00:00

Hi Mofi, I didn't even know that I had used recursion.

I just simply wanted to reference the group definition itself and not the particular match.

My favourite sources:
http://www.rexegg.com/regex-disambiguation.html
http://www.regular-expressions.info/ref ... ckref.html

BR, Fleggy

Mofi · Jul 24, 2016 · #8 · 2016-07-24T17:22+00:00

fleggy, many thanks for the links. I studied both webpages and also other pages linked from there.

Well, as I know now after studying the advanced RexEgg tutorial pages, your search expression is not a real Perl recursive expression, because it uses neither (?R) nor a subroutine reference inside the subroutine definition itself. But I think the Perl regular expression function nevertheless uses recursion to evaluate the find expression, as otherwise, from a C/C++ programmer's point of view, I would not understand how this expression produces its results.

(.*) greedily matches any character except a newline character 0 or more times, and ((.(?!/p>))* does the same as long as /p> does not follow the current character, i.e. it stops at the left angle bracket of the paragraph end tag. So how do the regular expression search functions find out how many characters should be matched by the first expression and how many by the second? I'm quite sure a recursion is run to find that out.

I think, the search string

(.{4,}).*\r\n\1.*

would be easier to understand for most users. It works the same as the multi-line search string with subroutines posted by fleggy, with the difference that at least the first 4 characters of the consecutive paragraphs must be equal, which is the minimum length I used in my script. Whether the two strings at the beginning of the paragraphs are compared case-sensitively is controlled by the find option Match case.
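A small sketch in plain JavaScript shows what group 1 of this search string captures. It assumes \r\n line endings and paragraphs without the surrounding tags; the function name equalStart is hypothetical, and JavaScript's engine is used here only as a stand-in for the Boost Perl engine.

```javascript
// Group 1 must be at least 4 characters long and must be repeated
// at the beginning of the next line (backreference \1).
var rxEqualStart = /(.{4,}).*\r\n\1.*/;

// Hypothetical helper: returns the captured equal start or null.
function equalStart(sTwoLines)
{
   var aMatch = rxEqualStart.exec(sTwoLines);
   return aMatch ? aMatch[1] : null;
}

console.log(equalStart(
   "Craig, Robert Fenton. University of California, Los Angeles, Special Collections.\r\n" +
   "Craig, Robert Fenton. University of Oxford, Special Collections. 2014, pp. 20-13."));
// Craig, Robert Fenton. University of (followed by a space)

console.log(equalStart("Abc one\r\nAbd two"));  // null (only 2 equal characters)
```

Because the search string is not anchored with ^, group 1 may in principle also begin in the middle of the first line, as long as the second line starts with the captured text.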

PS: I have added a link to RexEgg website to forum announcement Readme for the Find/Replace/Regular Expressions forum.

fleggy · Jul 24, 2016 · #9 · 2016-07-24T17:53+00:00

Hi Mofi,

Mofi wrote: (.*) greedily matches any character except a newline character 0 or more times, and ((.(?!/p>))* does the same as long as /p> does not follow the current character, i.e. it stops at the left angle bracket of the paragraph end tag. So how do the regular expression search functions find out how many characters should be matched by the first expression and how many by the second? I'm quite sure a recursion is run to find that out.

AFAIK there is no recursion - I think it's normal backtracking. At first (.*) matches as much as it can, and if the match then fails inside the second line, the regexp engine backtracks and tries a match one character shorter until a match is found on both lines (a zero-length match in the worst case). ((.(?!/p>))* matches the rest of the line on both lines. At least I think it works this way. Does it make sense?
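The backtracking described above can be imitated by hand. The following plain JavaScript sketch is a simplification and not how a regular expression engine is actually implemented: it starts with the longest candidate for (.*) and gives back one character per attempt until the second line starts with the candidate. The function name longestSharedStart is hypothetical.

```javascript
// Hypothetical helper imitating the backtracking of a greedy (.*)
// that is followed by a backreference at the start of the next line.
function longestSharedStart(sFirstLine, sSecondLine)
{
   var nAttempts = 0;
   // Longest candidate first, then one character shorter per attempt.
   for (var nLength = sFirstLine.length; nLength >= 0; nLength--)
   {
      nAttempts++;
      var sCandidate = sFirstLine.substring(0, nLength);
      if (sSecondLine.substring(0, nLength) === sCandidate)
      {
         return { text: sCandidate, attempts: nAttempts };
      }
   }
   // Not reached: the empty candidate (nLength == 0) always matches.
}

var oResult = longestSharedStart("Asterix, gaul", "Asterix and Obelix, gauls");
console.log(oResult.text);      // Asterix
console.log(oResult.attempts);  // 7

// Worst case: nothing in common, so only the zero-length candidate matches.
console.log(longestSharedStart("Oogwan", "Craig").text === "");  // true
```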

BR, Fleggy

EDIT1: You are absolutely right that my expression is unnecessarily complex. My first attempts used a lookahead for </p>, and I had to use ((.(?!/p>))* to stop before it.

Mofi · Jul 24, 2016 · #10 · 2016-07-24T18:23+00:00

You are right about backtracking. This is the key here. And I'm right about recursion, as backtracking is done internally using recursive function calls with modified starting points within the data. I think I now understand how this expression works internally.

Arydigital · Aug 02, 2017 · #11 · 2017-08-02T07:53+00:00

Both regular expression search strings in Perl syntax are https://www.arydigital.tv/videos/category/katto/ invalid regular expressions and therefore can't work at all.

Mofi · Aug 02, 2017 · #12 · 2017-08-02T13:13+00:00

Arydigital, there are not two Perl regular expression find and replace strings posted by Fleggy. There is just one multi-line Perl regular expression find and replace string posted.

The single line version suitable for text files with DOS/Windows line endings (carriage return + line-feed):

Find what: (<p class="ref">)(.*)((.(?!/p>))*</p>)\r\n(?1)\2(?3)
Replace with: $&\r\n~~~\2~~~

Both the multi-line and the single-line find/replace variants work with UltraEdit v24.10 using the Boost Perl regular expression library.
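For readers who want to experiment outside UltraEdit, here is a rough JavaScript counterpart of the single-line variant. JavaScript regular expressions have no subroutine calls like (?1) and (?3), so this sketch simply repeats the two subpatterns as text. In a JavaScript replacement string $& also stands for the whole match, while group 2 is written $2 instead of \2. The variable names are hypothetical and \r\n line endings are assumed.

```javascript
// The two subpatterns that (?1) and (?3) would re-use in Boost Perl syntax.
var sRefTag = '<p class="ref">';
var sRestOfLine = '(?:.(?!/p>))*</p>';

// Group 2 is the equal start; \2 must follow the tag on the second line.
var rxPair = new RegExp(
   '(' + sRefTag + ')(.*)(' + sRestOfLine + ')\\r\\n' +
   sRefTag + '\\2' + sRestOfLine, 'g');

var sInput = [
   '<p class="ref">Craig, Robert Fenton. University of California, Los Angeles, Special Collections.</p>',
   '<p class="ref">Craig, Robert Fenton. University of Oxford, Special Collections. 2014, pp. 20-13.</p>',
   '',
   '<p class="ref">Craig, Robert Fenton. University of California, Los Angeles, Special Collections.</p>',
   '<p class="ref">Oogwan, Robert Fenton. University of Oxford, Special Collections. 2014, pp. 20-13.</p>'
].join("\r\n");

// Keep the matched paragraphs and append a marker line with the equal start.
var sOutput = sInput.replace(rxPair, '$&\r\n~~~$2~~~');
console.log(sOutput);
```

For this input the sketch should print the same text as fleggy's example output, including the empty ~~~~~~ marker for the second pair.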

fleggy · Aug 02, 2017 · #13 · 2017-08-02T19:33+00:00

Hi Arydigital,

Believe me, I always test every single Perl regex before I post anything here. Maybe your version of UE does not support all the features used.

BR, Fleggy