Hello,
I am a linguist and do not have much experience with regular expressions. I would be glad if anybody could tell me how to extract the text between two tags. I have two large files containing 60,000 and 120,000 strings/lines, and I need to extract information from each line.
Any help is very much appreciated. Thanks a lot.
File 1:
I need the text between <orth> and </orth>, as well as between <trans> and </trans>
<entry id="15" version="1.2" HE="true" xmlns="http://www.wadoku.de/xml/entry"><form><orth>インスリン</orth><pron>いんすりん</pron><pron type="hatsuon">いんすりん</pron></form><gramGrp><pos type="N"/></gramGrp><sense><usg type="dom">Med.</usg><trans><tr><token genus="n" type="N">Insulin</token></tr></trans></sense></entry>
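To make the goal concrete, here is a minimal sketch in Python of the kind of extraction I mean. It is only an illustration under assumptions, not a tested solution for the full files: the helper name `text_between` is hypothetical, the sample string is abridged from the File 1 line above, and a regular expression like this assumes the tags are not nested inside themselves. The pattern allows attributes on the opening tag and strips any markup nested inside the match.

```python
import re

def text_between(line, tag):
    """Return the text of every <tag>...</tag> span in line, with nested markup removed.

    text_between is a hypothetical helper written for this illustration.
    """
    # \b stops "tr" from matching "<trans>"; [^>]* allows attributes on the
    # opening tag; (.*?) is non-greedy so each match ends at the nearest closing tag.
    pattern = re.compile(
        r"<{0}\b[^>]*>(.*?)</{0}>".format(re.escape(tag)), re.DOTALL
    )
    # Remove any tags nested inside the match, e.g. <tr><token ...> inside <trans>.
    return [re.sub(r"<[^>]+>", "", inner).strip() for inner in pattern.findall(line)]

# Abridged from the File 1 sample line.
sample = (
    '<entry id="15"><form><orth>インスリン</orth><pron>いんすりん</pron></form>'
    '<sense><usg type="dom">Med.</usg><trans><tr><token genus="n" type="N">'
    'Insulin</token></tr></trans></sense></entry>'
)

print(text_between(sample, "orth"))   # ['インスリン']
print(text_between(sample, "trans"))  # ['Insulin']
```

The same call should work with other tag names. For very large or irregular files, an XML parser would probably be more robust than a regular expression.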
File 2:
I need the text between <keb> and </keb>, as well as between <gloss> and </gloss>
<entry>
<ent_seq>1000080</ent_seq>
<k_ele>
<keb>漢数字ゼロ</keb>
</k_ele>
<r_ele>
<reb>かんすうじゼロ</reb>
</r_ele>
<info>
<audit>
<upd_date>2012-09-10</upd_date>
<upd_detl>Entry created</upd_detl>
</audit>
<audit>
<upd_date>2012-09-11</upd_date>
<upd_detl>Entry amended</upd_detl>
</audit>
<audit>
<upd_date>2012-09-11</upd_date>
<upd_detl>Entry amended</upd_detl>
</audit>
</info>
<sense>
<pos>noun (common) (futsuumeishi)</pos>
<xref>○・まる・1</xref>
<xref>漢数字</xref>
<gloss xml:lang="eng">"kanji" zero</gloss>
</sense>