Select value between two other values

tigron · Mar 03, 2015 · #1

Hello,

I am a linguist and do not have much experience with regular expressions. I would be glad if anybody could tell me how to select information between two values. I have two big files containing 60,000 and 120,000 lines, and I need to extract information from each line.
Any help is very much appreciated. Thanks a lot.

File1:

I need the text between <orth> and </orth>, as well as between <trans> and </trans>

Code: Select all

<entry id="15" version="1.2" HE="true" xmlns="http://www.wadoku.de/xml/entry"><form><orth>インスリン</orth><pron>いんすりん</pron><pron type="hatsuon">いんすりん</pron></form><gramGrp><pos type="N"/></gramGrp><sense><usg type="dom">Med.</usg><trans><tr><token genus="n" type="N">Insulin</token></tr></trans></sense></entry>

File 2:

I need the text between <keb> and </keb> as well as between <gloss> and </gloss>

Code: Select all

  <entry>
    <ent_seq>1000080</ent_seq>
    <k_ele>
      <keb>漢数字ゼロ</keb>
    </k_ele>
    <r_ele>
      <reb>かんすうじゼロ</reb>
    </r_ele>
    <info>
      <audit>
        <upd_date>2012-09-10</upd_date>
        <upd_detl>Entry created</upd_detl>
      </audit>
      <audit>
        <upd_date>2012-09-11</upd_date>
        <upd_detl>Entry amended</upd_detl>
      </audit>
      <audit>
        <upd_date>2012-09-11</upd_date>
        <upd_detl>Entry amended</upd_detl>
      </audit>
    </info>
    <sense>
      <pos>noun (common) (futsuumeishi)</pos>
      <xref>○・まる・1</xref>
      <xref>漢数字</xref>
      <gloss xml:lang="eng">"kanji" zero</gloss>
    </sense>

Mofi · Mar 03, 2015 · #2

Run a Perl regular expression Find with the advanced option List lines containing string enabled, using the search string <(orth|trans|keb|gloss).*?>.+?</\1>. This Perl regular expression uses a backreference (\1) directly in the search string, so the closing tag must have the same name as the opening tag.

A window opens listing all lines found. Right-click in this window and left-click Copy to Clipboard.

Open a new file and make sure it is a Unicode file (UTF-16 or UTF-8) by using ASCII to Unicode or ASCII to UTF-8 in File - Conversions, or the encoding selector in the status bar at the bottom.

Paste the found lines into the new file with Ctrl+V and move the caret to the top of the file with Ctrl+Home.

Now run a Perl regular expression Replace All with the search string ^.*?>(.+?)</(?:orth|trans|keb|gloss)>.*$ and the replace string \1. This removes everything from the beginning of the line up to and including the opening tag, as well as the closing tag and anything that follows it on the line.

I can explain the regular expressions if you are interested.

By the way: the Find/Replace dialogs have an icon button with .* (in older versions a magnifying glass or a triangle arrow) which opens the regular expression builder list. Most special characters used here are explained in that list for the search and the replace string. ?: after an opening parenthesis turns a capturing group, whose matched string can be back-referenced, into a non-capturing group. The question mark after the multiplier * or + changes the expression from greedy to non-greedy, i.e. instead of matching as much as possible for a positive match (greedy), it matches as little as possible for a positive match (non-greedy).
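For readers who want to try the two expressions outside UltraEdit, here is a minimal Python sketch. It is only an illustration and not part of the original advice: Python's re module is close to, but not identical to, UltraEdit's Perl regular expression engine, and the names FIND, EXTRACT and sample_lines are made up for this example. It applies the Find expression line by line, then the Replace expression, and finally shows the greedy versus non-greedy difference.

```python
import re

# Search expression from the post: opening tag (attributes allowed),
# a non-greedy value, and a closing tag with the same name (\1).
FIND = re.compile(r"<(orth|trans|keb|gloss).*?>.+?</\1>")

# Replace expression from the post: group 1 is the value before the closing tag.
EXTRACT = re.compile(r"^.*?>(.+?)</(?:orth|trans|keb|gloss)>.*$")

# A few lines in the style of the second file, one element per line.
sample_lines = [
    "    <ent_seq>1000080</ent_seq>",
    "      <keb>漢数字ゼロ</keb>",
    "      <reb>かんすうじゼロ</reb>",
    '      <gloss xml:lang="eng">"kanji" zero</gloss>',
]

# Step 1: keep only lines containing one of the four elements.
found = [line for line in sample_lines if FIND.search(line)]

# Step 2: replace each found line by the captured value.
values = [EXTRACT.sub(r"\1", line) for line in found]
print(values)

# Greedy vs. non-greedy on a line holding two elements.
two = "<orth>A</orth><orth>B</orth>"
greedy = re.search(r"<orth>(.+)</orth>", two).group(1)
lazy = re.search(r"<orth>(.+?)</orth>", two).group(1)
print(greedy)  # runs on to the last closing tag
print(lazy)    # stops at the first closing tag
```

Note that reb is not in the alternation, so the reb line is skipped in step 1.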

tigron · Mar 03, 2015 · #3

Hello Mofi,

thanks for your post. It looked so promising, but sadly this occurred:

File 1: since every line contains one of "orth|trans|keb|gloss", the search takes a long time and then matches every line, so I end up copying the whole content.
If I then run the Perl regular expression replace, it deletes one line after another completely. :/

File 2: this looked better. But I rechecked, and it finds only 12 of the 284 lines that should match. I checked whether the other words in the original file are on lines that do not use "orth|trans|keb|gloss", but they do use them :/

I know how to use regular expressions a bit, I will have to read about "Perl regex backreferences" to understand this.

Greetings from Tokyo
I am German, so we could also write in German.

Mofi · Mar 04, 2015 · #4

I prefer English, as this is an English forum and others may also be interested.

For a regular expression search/replace it is incredibly important to know the real contents of the files. Otherwise a regular expression can work perfectly for a sample but not for the real file contents.

I expected each XML element in the XML file to be on its own line. I see that you have now changed the first example in your initial post. If there are lines with multiple XML elements on the same line, use Format - XML Convert to CR/LFs.

The previously posted expression should find only lines containing:

Code: Select all

<orth...>...</orth>
<trans...>...</trans>
<keb...>...</keb>
<gloss...>...</gloss>

If there are other elements also starting with orth, trans, keb or gloss which should not be found, use <(orth|trans|keb|gloss)\b.*?>.+?</\1> to exclude those elements with similar name.

Next, I assumed that the value of each of those 4 elements is on a single line. If the Unicode strings of those 4 elements can also span multiple lines, a better expression for the initial search would be <(orth|trans|keb|gloss)\b.*?>[\s\S]+?</\1>

But the replace in the new file must then be different for Unicode strings spanning multiple lines, especially if you want to know which lines belong to one XML element.

And of course I expect that there are no other elements within the elements orth, trans, keb and gloss.
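The effect of the two refinements above (\b after the element name, and [\s\S]+? instead of .+?) can be checked with a small Python sketch. This is an illustration under assumptions: the element name transcr is hypothetical and only serves as a "similar name", and Python's re module stands in for UltraEdit's Perl engine.

```python
import re

loose = re.compile(r"<(orth|trans|keb|gloss).*?>.+?</\1>")
strict = re.compile(r"<(orth|trans|keb|gloss)\b.*?>.+?</\1>")
multiline = re.compile(r"<(orth|trans|keb|gloss)\b.*?>[\s\S]+?</\1>")

# Hypothetical element "transcr" whose name merely starts with "trans".
similar = "<transcr>x</transcr><trans>y</trans>"

# Without \b the match already starts at <transcr ...
print(loose.search(similar).group(0))
# ... with \b it starts at the real <trans> element.
print(strict.search(similar).group(0))

# A value spread over several lines: "." does not match a newline,
# but the character class [\s\S] matches any character.
spread = "<trans>\n  <tr>Insulin</tr>\n</trans>"
print(strict.search(spread))              # no match
print(multiline.search(spread).group(0))  # the whole element
```

As the last line shows, the multi-line expression also matches when other elements are nested inside, but the value then still contains those inner tags, which is why the replace step would need to be different.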

tigron · Mar 13, 2015 · #5

Hello,

this is the current situation:
I tried what you posted, but, sadly, the search result is the whole content again. Here is how the entries are structured. These are three example entries. There are more than 62,000 lines, that is, more than 62,000 Japanese words with German translations. What you cannot see here is that every line starts with <entry id ...>. I attached a screenshot below.

Code: Select all

<entry id="15" version="1.2" HE="true" xmlns="http://www.wadoku.de/xml/entry"><form><orth>インスリン</orth><pron>いんすりん</pron><pron type="hatsuon">いんすりん</pron></form><gramGrp><pos type="N"/></gramGrp><sense><usg type="dom">Med.</usg><trans><tr><token genus="n" type="N">Insulin</token></tr></trans></sense></entry>
<entry id="73" version="1.2" xmlns="http://www.wadoku.de/xml/entry"><form><orth>バーン･ジョーンズ</orth><pron>ばーんじょーんず</pron><pron type="hatsuon">[Gr]ばーん･[Gr]じょーんず</pron></form><gramGrp><pos type="N"/></gramGrp><sense><usg type="dom">Persönlichk.</usg><trans><tr><text hasFollowingSpace="true">Edward C.</text><famn>Burne-Jones</famn><bracket><def><text>engl. Maler</text></def><birthdeath>1833–1898</birthdeath></bracket></tr></trans></sense></entry>
<entry id="208" version="1.2" xmlns="http://www.wadoku.de/xml/entry"><form><orth midashigo="true">キヴァ△汗国</orth><orth>キヴァハン国</orth><orth irr="true">キヴァ汗国</orth><pron>きう゛ぁはんこく</pron><pron type="hatsuon">[Gr]きう゛ぁ･はんこく</pron></form><gramGrp><pos type="N"/></gramGrp><sense><usg type="dom">Gebietsn.</usg><usg type="dom">Gesch.</usg><trans><tr><token genus="n" type="N">Khanat</token><text hasPrecedingSpace="true" hasFollowingSpace="true">von Chiwa</text><bracket><def><text>Reich im heutigen Usbekistan</text></def></bracket></tr></trans></sense></entry>

The other file is structured differently. Each line holds one element, and several lines together make up a whole dictionary entry. It has 160.1717 lines with about 61,000 word entries in Japanese with English translations.

Code: Select all

<JMdict>
  <entry>
    <ent_seq>1000080</ent_seq>
    <k_ele>
      <keb>漢数字ゼロ</keb>
    </k_ele>
    <r_ele>
      <reb>かんすうじゼロ</reb>
    </r_ele>
    <info>
      <audit>
        <upd_date>2012-09-10</upd_date>
        <upd_detl>Entry created</upd_detl>
      </audit>
      <audit>
        <upd_date>2012-09-11</upd_date>
        <upd_detl>Entry amended</upd_detl>
      </audit>
      <audit>
        <upd_date>2012-09-11</upd_date>
        <upd_detl>Entry amended</upd_detl>
      </audit>
    </info>
    <sense>
      <pos>noun (common) (futsuumeishi)</pos>
      <xref>○・まる・1</xref>
      <xref>漢数字</xref>
      <gloss xml:lang="eng">"kanji" zero</gloss>
    </sense>
  </entry>
  <entry>
    <ent_seq>1000080</ent_seq>
    <k_ele>
      <keb>漢数字ゼロ</keb>
    </k_ele>
    <r_ele>
      <reb>かんすうじゼロ</reb>
    </r_ele>
    <info>
      <audit>
        <upd_date>2012-09-10</upd_date>
        <upd_detl>Entry created</upd_detl>
      </audit>
      <audit>
        <upd_date>2012-09-11</upd_date>
        <upd_detl>Entry amended</upd_detl>
      </audit>
      <audit>
        <upd_date>2012-09-11</upd_date>
        <upd_detl>Entry amended</upd_detl>
      </audit>
    </info>
    <sense>
      <pos>noun (common) (futsuumeishi)</pos>
      <xref>○・まる・1</xref>
      <xref>漢数字</xref>
      <gloss xml:lang="eng">"kanji" zero</gloss>
    </sense>
  </entry>

I hope this helps.

Mofi · Mar 14, 2015 · #6

The block with the 3 lines starting with <entry id= looks like this after using Format - XML Convert to CR/LFs:

Code: Select all

<entry id="15" version="1.2" HE="true" xmlns="http://www.wadoku.de/xml/entry">
   <form>
      <orth>インスリン</orth>
      <pron>いんすりん</pron>
      <pron type="hatsuon">いんすりん</pron>
   </form>
   <gramGrp>
      <pos type="N"/>
   </gramGrp>
   <sense>
      <usg type="dom">Med.</usg>
      <trans>
         <tr>
            <token genus="n" type="N">Insulin</token>
         </tr>
      </trans>
   </sense>
</entry>
<entry id="73" version="1.2" xmlns="http://www.wadoku.de/xml/entry">
   <form>
      <orth>バーン･ジョーンズ</orth>
      <pron>ばーんじょーんず</pron>
      <pron type="hatsuon">[Gr]ばーん･[Gr]じょーんず</pron>
   </form>
   <gramGrp>
      <pos type="N"/>
   </gramGrp>
   <sense>
      <usg type="dom">Persönlichk.</usg>
      <trans>
         <tr>
            <text hasFollowingSpace="true">Edward C.</text>
            <famn>Burne-Jones</famn>
            <bracket>
               <def>
                  <text>engl. Maler</text>
               </def>
               <birthdeath>1833–1898</birthdeath>
            </bracket>
         </tr>
      </trans>
   </sense>
</entry>
<entry id="208" version="1.2" xmlns="http://www.wadoku.de/xml/entry">
   <form>
      <orth midashigo="true">キヴァ△汗国</orth>
      <orth>キヴァハン国</orth>
      <orth irr="true">キヴァ汗国</orth>
      <pron>きう゛ぁはんこく</pron>
      <pron type="hatsuon">[Gr]きう゛ぁ･はんこく</pron>
   </form>
   <gramGrp>
      <pos type="N"/>
   </gramGrp>
   <sense>
      <usg type="dom">Gebietsn.</usg>
      <usg type="dom">Gesch.</usg>
      <trans>
         <tr>
            <token genus="n" type="N">Khanat</token>
            <text hasPrecedingSpace="true" hasFollowingSpace="true">von Chiwa</text>
            <bracket>
               <def>
                  <text>Reich im heutigen Usbekistan</text>
               </def>
            </bracket>
         </tr>
      </trans>
   </sense>
</entry>

What I can now see very easily is that there are <orth>, <orth midashigo="true"> and <orth irr="true"> elements (each on a single line), and that the trans element spans multiple lines and contains other elements.

Please post the text you want to get from the block above in the new file, exactly as you want it, and I will write the UltraEdit macro with the Perl regular expressions required to reformat a copy of the file to the wanted output.

Please also post exactly what the output should be for the example block from your second file.

tigron · Mar 22, 2015 · #7

Hi again,

I checked the files again and deleted the unimportant data at the beginning of each line in the first file.
Here is what I need (the data needed is bold in both examples):

FILE1

The data between these tags:

<orth> </orth> <trans> </trans> <foreign> </foreign>

Example:

Code: Select all

<orth>[b]リア･ウィンドー[/b]</orth><sense><etym><text hasFollowingSpace="true">von engl.</text><foreign><text>[b]rear window[/b]</text></foreign></etym><trans><tr><token genus="n" type="N">[b]Heckfenster[/b]</token></tr></trans></sense><ref id="2364150" type="main" subentrytype="head"/></entry>

FILE2

The data between these tags:

<keb> </keb> <reb> </reb> <gloss xml:lang="eng"> </gloss>

As you can see there is no <keb> in the second example below. To sum up:
- there is always <reb>
- there is sometimes <keb>
- there is <gloss xml:lang="eng">

Example 1

Code: Select all

  <entry>
    <ent_seq>1000080</ent_seq>
    <k_ele>
      <keb>[b]漢数字ゼロ[/b]</keb>
    </k_ele>
    <r_ele>
      <reb>[b]かんすうじゼロ[/b]</reb>
    </r_ele>
    <info>
      <audit>
        <upd_date>2012-09-10</upd_date>
        <upd_detl>Entry created</upd_detl>
      </audit>
      <audit>
        <upd_date>2012-09-11</upd_date>
        <upd_detl>Entry amended</upd_detl>
      </audit>
      <audit>
        <upd_date>2012-09-11</upd_date>
        <upd_detl>Entry amended</upd_detl>
      </audit>
    </info>
    <sense>
      <pos>noun (common) (futsuumeishi)</pos>
      <xref>○・まる・1</xref>
      <xref>漢数字</xref>
      <gloss xml:lang="eng">[b]"kanji" zero[/b]</gloss>
    </sense>
  </entry>

OR

Example 2

Code: Select all

  <entry>
    <ent_seq>1018470</ent_seq>
    <r_ele>
      <reb>[b]アポクロマート[/b]</reb>
    </r_ele>
    <sense>
      <pos>noun (common) (futsuumeishi)</pos>
      <lsource xml:lang="ger"/>
      <gloss xml:lang="eng">[b]Apochromat[/b]</gloss>
    </sense>
  </entry>

I don't think the first file is very problematic, since all the information for one entry is on the same line, but in the second file it is not.

The ideal output would be something like:

Code: Select all

リア･ウィンドー rear window Heckfenster

and

Code: Select all

漢数字ゼロ かんすうじゼロ "kanji" zero

or

Code: Select all

アポクロマート Apochromat

But it would of course also be fine if the tags were kept in the output, like:

Code: Select all

<orth>リア･ウィンドー</orth><foreign><text>rear window</text></foreign><trans><tr><token genus="n" type="N">Heckfenster</token></tr></trans>

and

Code: Select all

<keb>漢数字ゼロ</keb><reb>かんすうじゼロ</reb><gloss xml:lang="eng">"kanji" zero</gloss>

or

Code: Select all

<reb>アポクロマート</reb><gloss xml:lang="eng">Apochromat</gloss>

Now that I write this, it seems to be very difficult. I hope it is not for you.

THANKS A LOT!

Mofi · Mar 22, 2015 · #8

It was no problem for me to write a macro that converts each of your XML examples to the wanted optimal output using several Perl regular expression replaces.

The macro does not copy data to a new file. Instead it modifies the active file so that it finally contains only the wanted data in the wanted format. So run the macro on copies of the original files, or use the command Save As after macro execution to store the reformatted file under a new name.

The topic How to create a macro from a posted macro code? describes how to create a macro from the code below. The macro property Continue if search string not found must be checked.

Code: Select all

InsertMode
ColumnModeOff
HexOff
PerlReOn
Top
XMLConvertToCRLF
Top
Find MatchCase RegExp "(<foreign>)\s+<text>(.+?)</text>\s+(</foreign>)"
Replace All "\1\2\3"
Top
Find MatchCase RegExp "(<trans>)[\s\S]+?<token .*?>(.+?)</token>[\s\S]+?(</trans>)"
Replace All "\1\2\3"
Top
Find MatchCase RegExp "^(?:(?!</entry>|<orth|<trans|<foreign|<keb|<reb|<gloss).)*$\r\n"
Replace All ""
Top
TrimTrailingSpaces
Find MatchCase RegExp "^[ \t]+"
Replace All ""
Top
Find MatchCase RegExp "(?:\r\n(?!</entry>)|</entry>\r\n)"
Replace All ""
Top
Find MatchCase RegExp "\[/?b\]"
Replace All ""
Top
Find MatchCase RegExp "^<.+?>"
Replace All ""
Top
Find MatchCase RegExp "</[^<>]+>$"
Replace All ""
Top
Find MatchCase RegExp "</.+?><.+?>"
Replace All "\t"
Top

The macro is written for a file with DOS/Windows line terminators. If the XML files are UNIX files not converted to DOS on opening, remove the three occurrences of \r in the macro code above.
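The line-deleting step of the macro, and why the line terminator matters for it, can be illustrated with a short Python sketch. This is an approximation and not the macro itself: the names KEEP, DROP_LINE and drop_unwanted_lines are made up here, re.MULTILINE is needed in Python so that ^ matches at every line start, and \r?\n is used in place of $\r\n so the same expression accepts both DOS and UNIX line endings.

```python
import re

# Lines containing any of these strings are kept, all others are deleted.
KEEP = r"</entry>|<orth|<trans|<foreign|<keb|<reb|<gloss"

# (?:(?!KEEP).)* consumes a character only if none of the KEEP strings
# starts at that position, so the match reaches the line end only on
# lines that contain none of them.
DROP_LINE = re.compile(r"^(?:(?!" + KEEP + r").)*\r?\n", re.MULTILINE)

def drop_unwanted_lines(text: str) -> str:
    """Delete every complete line that holds none of the wanted tags."""
    return DROP_LINE.sub("", text)

dos = (
    "<entry>\r\n"
    "    <ent_seq>1018470</ent_seq>\r\n"
    "      <reb>アポクロマート</reb>\r\n"
    "      <pos>noun (common) (futsuumeishi)</pos>\r\n"
    "</entry>\r\n"
)
unix = dos.replace("\r\n", "\n")

print(repr(drop_unwanted_lines(dos)))
print(repr(drop_unwanted_lines(unix)))
```

In both cases only the reb line and the </entry> line remain, with their original line terminators.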

The macro works only for the specified XML elements, and only if the data within those XML elements is not spread over multiple lines.
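Under that single-line assumption, the final joining steps of the macro (merging the lines of one entry, removing the first opening and the last closing tag, and turning each remaining tag pair into a tab) can be sketched in Python as follows. This is a rough equivalent and not the macro: the function name entries_to_tsv is hypothetical, the input is assumed to be already reduced to the wanted element lines plus </entry> lines without leading whitespace, and UNIX line endings are assumed because Python's $ only matches directly before \n.

```python
import re

def entries_to_tsv(text: str) -> str:
    """Turn one-element-per-line entries into tab-separated lines."""
    # Join the lines of one entry: drop line breaks not followed by
    # </entry>, then drop the </entry> lines themselves.
    text = re.sub(r"\n(?!</entry>)|</entry>\n", "", text)
    # Remove the first opening tag on each line.
    text = re.sub(r"^<.+?>", "", text, flags=re.MULTILINE)
    # Remove the last closing tag on each line.
    text = re.sub(r"</[^<>]+>$", "", text, flags=re.MULTILINE)
    # Replace each closing tag plus following opening tag by a tab.
    return re.sub(r"</.+?><.+?>", "\t", text)

filtered = (
    "<keb>漢数字ゼロ</keb>\n"
    "<reb>かんすうじゼロ</reb>\n"
    '<gloss xml:lang="eng">"kanji" zero</gloss>\n'
    "</entry>\n"
    "<reb>アポクロマート</reb>\n"
    '<gloss xml:lang="eng">Apochromat</gloss>\n'
    "</entry>\n"
)

tsv = entries_to_tsv(filtered)
print(tsv)
```

The second entry has no keb element, so its line ends up with two fields instead of three.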

The macro does not check whether the number of tabs is the same on all lines. So the result file is not necessarily a perfect CSV file with tab as separator, which appears to be the output you want from each XML file.
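Such a check is easy to do after the macro has run. A minimal Python sketch, with made-up helper names field_count_report and inconsistent_lines, could count the tab-separated fields per line and list the lines that deviate from an expected count:

```python
from collections import Counter

def field_count_report(text: str) -> Counter:
    """Map each field count to the number of non-empty lines having it."""
    return Counter(line.count("\t") + 1 for line in text.splitlines() if line)

def inconsistent_lines(text: str, expected_fields: int) -> list:
    """Return 1-based numbers of non-empty lines with another field count."""
    return [
        number
        for number, line in enumerate(text.splitlines(), start=1)
        if line and line.count("\t") + 1 != expected_fields
    ]

sample = '漢数字ゼロ\tかんすうじゼロ\t"kanji" zero\nアポクロマート\tApochromat\n'

report = field_count_report(sample)
bad = inconsistent_lines(sample, expected_fields=3)
print(report)
print(bad)
```

For the second file, a line with only two fields most likely belongs to an entry without a keb element, so an empty first field may need to be inserted there before importing the file elsewhere.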

Please let me know if you are interested in what exactly each Perl regular expression Replace All does, provided the macro produces the wanted result on the real data.