Find tags inside XML element and update attribute of element accordingly

don_bradman · Feb 07, 2016, 03:33 UTC · #1

Hi!

I would like a macro that finds the value inside \tag{...} in each <disp-formula id="deqn*">...</disp-formula> element and puts it in place of the * in deqn*, giving deqn5 for sample expression 1 below and deqn1-3a for sample expression 2.

Sample expression 1:

<disp-formula id="deqn*"><text-notation="math">\begin{equation*}
x+y=5 \tag{5}
\end{equation*}</text-notation="math"></disp-formula>

Sample expression 2:

<disp-formula id="deqn*"><text-notation="math">\begin{align*}
y=5 \tag{1}\cr
dx+dy=p \tag{3a}
\end{align*}</text-notation="math"></disp-formula>

Keep in mind that a single <disp-formula id="deqn*">...</disp-formula> element can contain any number of \tag{...}. The macro should then build the id as deqn{first value inside \tag}-{last value inside \tag}, for example:

<disp-formula id="deqn*"><text-notation="math">\begin{equation*}
x=5 \tag{5}
y=3 \tag{6}
x+y=8 \tag{7}
\end{equation*}</text-notation="math"></disp-formula>

should become

<disp-formula id="deqn5-7"><text-notation="math">\begin{equation*}
x=5 \tag{5}
y=3 \tag{6}
x+y=8 \tag{7}
\end{equation*}</text-notation="math"></disp-formula>

And

<disp-formula id="deqn*"><text-notation="math">\begin{equation*}
x+y=5 \tag{5}
\end{equation*}</text-notation="math"></disp-formula>

should become

<disp-formula id="deqn5"><text-notation="math">\begin{equation*}
x+y=5 \tag{5}
\end{equation*}</text-notation="math"></disp-formula>

Mofi · Feb 07, 2016, 14:29 UTC · #2

Here is a macro for this task. It runs two complex Perl regular expression Replace All commands that use back-references.

InsertMode
ColumnModeOff
HexOff
Top
PerlReOn
Find MatchCase RegExp "(?s)(<disp-formula id="deqn)[^"]*?("(?:.(?!/disp-formula))+?.\\tag\{)([^}]+?)(\}(?:.(?!/disp-formula))+.\\tag\{)([^}]+?)\}"
Replace All "\1\3-\5\2\3\4\5}"
Top
Find MatchCase RegExp "(?s)(<disp-formula id="deqn)[^"]*?("(?:.(?!/disp-formula|\\tag))+?.\\tag\{)([^}]+?)(\}(?:.(?!/disp-formula|\\tag))+?</disp-formula>)"
Replace All "\1\3\2\3\4"

Explanation for first search string:

(?s) ... this expression at the beginning of the search string makes . also match newline characters, which it does not do by default.

(<disp-formula id="deqn) ... the beginning of the string to find, enclosed in marking (capturing) parentheses. This avoids writing the same string again in the replace string; the fixed part of the found string is back-referenced with \1 instead.

[^"]*? ... 0 or more characters that are NOT a double quote, non-greedy because of the question mark after the asterisk multiplier. Because the character class excludes the double quote, matching stops at the first double quote found, even if that results in a negative match for the entire expression.

(...) ... the second marking group keeps the string matched by the expression inside the parentheses unmodified; it is back-referenced with \2 in the replace string.

"(?:.(?!/disp-formula))+?.\\tag\{ ... this expression matches

a double quote,
any character (including newline characters) 1 or more times, non-greedy because of the question mark after the plus,
any single character followed by \tag{, where the backslash and the opening brace must both be escaped with a backslash to be interpreted as literal characters.

The special thing here is that a simple .+? is not used to match every character up to the one left of the next \tag{. That would also match beyond the current disp-formula element if it contains no \tag{ at all but there is one anywhere below in the file.

A character is only matched if the string to its right is not the end tag of the disp-formula element. (?!/disp-formula) is a negative lookahead, which does not consume characters. It just checks that the string /disp-formula does NOT follow the current character; otherwise the current character is not matched. The character left unmatched this way is the left angle bracket of the end tag.

A non-marking group (?:...) is used around the expression .(?!/disp-formula) so that the plus multiplier can be applied to it, matching 1 or more characters that are NOT followed by the string /disp-formula.

The expression so far matches everything from the start tag of the disp-formula element up to the first \tag{ found within the same element.
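
The effect of the lookahead guard can be tried outside UltraEdit. The following is only a minimal sketch in Python, whose re module accepts the same syntax for these constructs; it is not UltraEdit's regex engine, and the two-element sample text and the names plain and guarded are made up for illustration.

```python
import re

# Hypothetical sample: the first element has no \tag, the second has one.
text = (
    '<disp-formula id="deqnA">a=b</disp-formula>\n'
    '<disp-formula id="deqnB">c=d \\tag{9}</disp-formula>\n'
)

# Plain .+? runs past </disp-formula> until it reaches any \tag{ further down.
plain = re.compile(r'(?s)<disp-formula id="deqn([^"]*)".+?\\tag\{')

# Guarded version: a character is matched only if /disp-formula does not follow it.
guarded = re.compile(
    r'(?s)<disp-formula id="deqn([^"]*)"(?:.(?!/disp-formula))+?.\\tag\{'
)

plain_ids = [m.group(1) for m in plain.finditer(text)]
guarded_ids = [m.group(1) for m in guarded.finditer(text)]

print(plain_ids)    # the match starts at the element WITHOUT a \tag
print(guarded_ids)  # the match starts at the element that owns the \tag
```

With the plain pattern the match begins in element A and only ends inside element B, so a replace would put B's tag value into A's id. The guarded pattern skips A entirely.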

([^}]+?) ... this expression in the third marking group matches 1 or more characters that are not a closing brace, non-greedy.

(\}(?:.(?!/disp-formula))+.\\tag\{) ... the expression in the fourth capturing group is nearly the same as the expression in the second capturing group. It starts with an escaped closing brace instead of a double quote. But the main difference is the missing question mark after the plus multiplier, so the matching behavior for . is now greedy. Matching does not stop at the next \tag{ found within the same disp-formula element; instead the expression matches as many characters as possible while still getting a positive match for the entire search expression. Therefore this time . matches everything up to the last \tag{ in the same disp-formula element.

([^}]+?)\} ... this expression matches 1 or more characters that are not a closing brace, non-greedy, in the fifth marking group, followed by the closing brace. The closing brace is also written in the replace string so that it is kept unmodified.

The replace string of the first search string keeps most of the found string unmodified by back-referencing (nearly) everything found in the right order. The exception is the deqn identifier string, which is replaced with the values of the first and last \tag{...} separated by a hyphen.

The first search string matches only disp-formula elements with at least two \tag{...} embedded.

The second search string is for disp-formula elements with exactly one \tag{...} embedded.

The second search expression is similar to the first one. The main difference is that the two negative lookaheads now contain not only /disp-formula, but an OR expression with /disp-formula OR \tag.

When the first specially defined .+? expression with a negative lookahead in a non-marking group stops matching, the string to the right of the last matched character must be any single character followed by \tag{.

And when the second specially defined .+? expression with the same negative lookahead in a non-marking group stops matching, the string to the right of the last matched character must be the end tag of the same disp-formula element.

These two restrictions make sure that the second search does not match

more than one disp-formula element,
a disp-formula element with no \tag{...} embedded,
a disp-formula element with more than one \tag{...} embedded.
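
Both replaces can be checked against the samples from the first post. This is a sketch under the assumption that Python's re module behaves like UltraEdit's Perl engine for these expressions; the function name renumber and the constant names are hypothetical.

```python
import re

# The two search strings from the macro, written as Python raw strings.
MULTI = re.compile(
    r'(?s)(<disp-formula id="deqn)[^"]*?("(?:.(?!/disp-formula))+?.\\tag\{)'
    r'([^}]+?)(\}(?:.(?!/disp-formula))+.\\tag\{)([^}]+?)\}'
)
SINGLE = re.compile(
    r'(?s)(<disp-formula id="deqn)[^"]*?("(?:.(?!/disp-formula|\\tag))+?.\\tag\{)'
    r'([^}]+?)(\}(?:.(?!/disp-formula|\\tag))+?</disp-formula>)'
)


def renumber(text: str) -> str:
    """Run the two-or-more-tags replace first, then the single-tag replace."""
    text = MULTI.sub(r'\1\3-\5\2\3\4\5}', text)
    return SINGLE.sub(r'\1\3\2\3\4', text)


# Sample expression 2 followed by sample expression 1.
sample = (
    '<disp-formula id="deqn*"><text-notation="math">\\begin{align*}\n'
    'y=5 \\tag{1}\\cr\n'
    'dx+dy=p \\tag{3a}\n'
    '\\end{align*}</text-notation="math"></disp-formula>\n'
    '<disp-formula id="deqn*"><text-notation="math">\\begin{equation*}\n'
    'x+y=5 \\tag{5}\n'
    '\\end{equation*}</text-notation="math"></disp-formula>\n'
)

result = renumber(sample)
ids = re.findall(r'<disp-formula id="([^"]*)"', result)
print(ids)
```

The second pattern leaves the already renumbered two-tag element alone, because its guarded group cannot step over the second \tag to reach the end tag.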

don_bradman · Feb 08, 2016, 16:36 UTC · #3

First of all, thanks a lot man. Really appreciate your reply.

And one more thing: Is there a way to run a particular macro on all, let's say .xml or .txt files inside a folder? If so, how?

Thank you in advance for all your help

Mofi · Feb 08, 2016, 17:42 UTC · #4

There is Run Macro on all files within folder. But this macro contains just 2 Perl regular expression replaces, so you can simply open Search - Replace in Files and run the 2 Perl regexp replaces on *.xml;*.txt in the folder you select.

As macro code:

PerlReOn
ReplInFiles MatchCase RegExp Log "C:\Temp\Folder Name\" "*.xml;*.txt" "(?s)(<disp-formula id="deqn)[^"]*?("(?:.(?!/disp-formula))+?.\\tag\{)([^}]+?)(\}(?:.(?!/disp-formula))+.\\tag\{)([^}]+?)\}" "\1\3-\5\2\3\4\5}"
ReplInFiles MatchCase RegExp Log "C:\Temp\Folder Name\" "*.xml;*.txt" "(?s)(<disp-formula id="deqn)[^"]*?("(?:.(?!/disp-formula|\\tag))+?.\\tag\{)([^}]+?)(\}(?:.(?!/disp-formula|\\tag))+?</disp-formula>)" "\1\3\2\3\4"
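
For comparison, the same folder-wide replace can be sketched outside UltraEdit. This is not what ReplInFiles does internally; it is a rough Python equivalent under several assumptions: the files are UTF-8, Python's re module matches these patterns like UltraEdit's Perl engine, and the helper name replace_in_folder is hypothetical. The Log option is not reproduced, and the demo runs in a throwaway folder.

```python
import re
import tempfile
from pathlib import Path

# The two rules from the macro: (search pattern, replacement).
RULES = [
    (r'(?s)(<disp-formula id="deqn)[^"]*?("(?:.(?!/disp-formula))+?.\\tag\{)'
     r'([^}]+?)(\}(?:.(?!/disp-formula))+.\\tag\{)([^}]+?)\}',
     r'\1\3-\5\2\3\4\5}'),
    (r'(?s)(<disp-formula id="deqn)[^"]*?("(?:.(?!/disp-formula|\\tag))+?.\\tag\{)'
     r'([^}]+?)(\}(?:.(?!/disp-formula|\\tag))+?</disp-formula>)',
     r'\1\3\2\3\4'),
]


def replace_in_folder(folder, globs=("*.xml", "*.txt"), encoding="utf-8"):
    """Hypothetical helper: run RULES on every matching file, return changed names."""
    changed = []
    for pattern in globs:
        for path in sorted(Path(folder).glob(pattern)):
            # newline="" keeps the line terminators exactly as stored in the file.
            with open(path, encoding=encoding, newline="") as f:
                old = f.read()
            new = old
            for search, repl in RULES:
                new = re.sub(search, repl, new)
            if new != old:
                with open(path, "w", encoding=encoding, newline="") as f:
                    f.write(new)
                changed.append(path.name)
    return changed


# Demo in a temporary folder so no real files are touched.
with tempfile.TemporaryDirectory() as tmp:
    doc = Path(tmp, "a.xml")
    with open(doc, "w", encoding="utf-8", newline="") as f:
        f.write('<disp-formula id="deqn*">x=5 \\tag{5}\r\ny=3 \\tag{6}</disp-formula>\r\n')
    changed = replace_in_folder(tmp)
    with open(doc, encoding="utf-8", newline="") as f:
        after = f.read()

print(changed)
print(after)
```

Files are only rewritten when a rule actually changed something, so untouched files keep their timestamps.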

don_bradman · Feb 14, 2016, 12:48 UTC · #5

How do I search for the string "<disp-formula id="deqn*">...</disp-formula>", which is a multiline string containing various other tags?

I have tried "(?s)<disp-formula id="deqn\*">.*</disp-formula>" to find the required string, but it selects everything from the first "<disp-formula id="deqn*">" to the last "</disp-formula>". I want to capture each of the 18 occurrences present in a file separately, not everything from the first start tag to the last end tag. How do I do that?

Mofi · Feb 14, 2016, 14:41 UTC · #6

Use (?s)<disp-formula id="deqn\*">.*?</disp-formula> to change the matching behavior from greedy (as many characters as possible for a positive find) to non-greedy (as few characters as possible for a positive find). The question mark after the multiplier * or + is needed in most Perl regular expression searches.
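
The difference is easy to reproduce on a small sample. A minimal sketch, again using Python's re module as a stand-in for UltraEdit's Perl engine, with a made-up two-element text:

```python
import re

# Hypothetical file content with two multiline elements and other markup between.
text = (
    '<disp-formula id="deqn*">a\n</disp-formula>\n'
    '<p>other</p>\n'
    '<disp-formula id="deqn*">b\n</disp-formula>\n'
)

greedy = re.findall(r'(?s)<disp-formula id="deqn\*">.*</disp-formula>', text)
lazy = re.findall(r'(?s)<disp-formula id="deqn\*">.*?</disp-formula>', text)

print(len(greedy))  # one match from the first start tag to the last end tag
print(len(lazy))    # one match per element
```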

don_bradman · Feb 14, 2016, 15:03 UTC · #7

First of all, thanks a ton for your help.

Now, using "(?s)<disp-formula id="deqn\*">.*?</disp-formula>" and your macro template, I have created a macro that copies all strings matching the regex to a new file. But the strings "</disp-formula>" and "<disp-formula id="deqn*">" appear as "</disp-formula><disp-formula id="deqn*">" in the new file, whereas I want every "<disp-formula id="deqn*">" to start on a new line. The found strings should be listed in the new file as:

<disp-formula id="deqn*">...</disp-formula>
<disp-formula id="deqn*">...</disp-formula>
<disp-formula id="deqn*">...</disp-formula>

The macro is given below:

InsertMode
ColumnModeOff
HexOff
PerlReOn
Top
Clipboard 9
ClearClipboard
Loop 0
Find RegExp "(?s)<disp-formula id="deqn\*">.*?</disp-formula>"
IfFound
CopyAppend
Else
ExitLoop
EndIf
EndLoop
NewFile
Paste
Top
Find RegExp "(<disp-formula id="deqn\*">.*?</disp-formula>)"
Replace All "\1\r\n"
ClearClipboard
Clipboard 0

Could you please point out why the problem I mentioned above is occurring?

Mofi · Feb 15, 2016, 05:49 UTC · #8

The Find+Replace All executed on new file is:

Find RegExp "(<disp-formula id="deqn\*">.*?</disp-formula>)"
Replace All "\1\r\n"

Unlike the search string used to find the elements with their multiline values, this search string does not start with (?s). Therefore this Find+Replace All can't find any string in the new file and inserts no carriage return + line-feed.
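
The missing (?s) can be demonstrated on two pasted elements. This is a sketch in Python, where . also excludes newline by default; the sample text is made up and UltraEdit's engine is only assumed to behave the same way here.

```python
import re

# Two multiline elements pasted back to back, as in the new file.
pasted = (
    '<disp-formula id="deqn*">x=5\n</disp-formula>'
    '<disp-formula id="deqn*">y=3\n</disp-formula>'
)

# Without (?s) the dot cannot cross the newline inside each element.
without_s = re.sub(r'(<disp-formula id="deqn\*">.*?</disp-formula>)', r'\1\r\n', pasted)

# With (?s) each element is found and a line termination is appended.
with_s = re.sub(r'(?s)(<disp-formula id="deqn\*">.*?</disp-formula>)', r'\1\r\n', pasted)

print(without_s == pasted)   # nothing was replaced
print(with_s.count('\r\n'))  # one inserted line break per element
```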

I suggest using as Find+Replace All:

Find "</disp-formula>"
Replace All "</disp-formula>^p"

This is a simple non-regular-expression Find+Replace All that inserts a DOS line termination after each end tag. Or perhaps better:

Find "</disp-formula>"
Replace All "</disp-formula>^p^p"

Inserting two DOS line terminations after each end tag makes it easier to see where each disp-formula element starts and ends.

don_bradman · Feb 15, 2016, 09:58 UTC · #9

What if I use

Find RegExp "(?s)(<disp-formula id="deqn\*">.*?</disp-formula>)"
Replace All "\1\r\n"

instead of

Find RegExp "(<disp-formula id="deqn\*">.*?</disp-formula>)"
Replace All "\1\r\n"

Will the macro work properly then? Or will there be any modification on the files?
As you might have noticed by now, I have very little knowledge of Perl regex.

Mofi · Feb 15, 2016, 17:36 UTC · #10

That would of course also work, as a simple trial shows. It is just a little slower at inserting a DOS line termination after each end tag, because the search is more complex than the non-regular-expression Replace All I suggested. But the result would be the same.
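
Such a trial could look like the following sketch. It uses Python instead of UltraEdit, so it only illustrates the idea, and the sample text is made up. The two results are equal here because every end tag in the sample closes an element that the regex also matches.

```python
import re

# Hypothetical content of the new file: elements pasted without line breaks between.
pasted = (
    '<disp-formula id="deqn*">x=5\n</disp-formula>'
    '<disp-formula id="deqn*">y=3\n</disp-formula>'
)

# Regular expression replace with (?s), as asked about.
by_regex = re.sub(
    r'(?s)(<disp-formula id="deqn\*">.*?</disp-formula>)', r'\1\r\n', pasted
)

# Plain string replace on the end tag only.
by_plain = pasted.replace('</disp-formula>', '</disp-formula>\r\n')

print(by_regex == by_plain)
```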