Recurring quotes

SiSL · Jun 07, 2008 · #1 · 2008-06-07T01:18+00:00

Greetings,

I have text like the following example:

<div>
   <b>parent text 1</b>
      <div>
          <i>child text1</i>
              <div>
                 <u>child text 2</u>
              </div>
      </div>
</div>

<i>normal text 1</i>

<div>
   parent text 2
    <div>
          child text 3
     </div>
</div>

normal text 2

I want to replace it so that only <i>normal text 1</i> and normal text 2 remain. How can I do that? Because the nested tags are identical (<div> in this case), I can't work out a Unix- or Perl-style regex for it, even if it takes multiple steps.

pietzcker · Jun 07, 2008 · #2 · 2008-06-07T12:57+00:00

A few questions:

1. Am I understanding you correctly that you want to remove everything between the outermost <div> and </div> tags?
2. Will those tags always be at the start of the line (column 1)?
3. When you say "codes such as <div> in this case", what other cases do you expect?

SiSL · Jun 07, 2008 · #3 · 2008-06-07T13:21+00:00

pietzcker wrote: A few questions:

1. Am I understanding you correctly that you want to remove everything between the outermost <div> and </div> tags?
2. Will those tags always be at the start of the line (column 1)?
3. When you say "codes such as <div> in this case", what other cases do you expect?

1. Yes, anything inside a <div>, including any other divs nested in it.

2. No, they can appear at any column.

3. Well, any tags that recur inside themselves, like <ul><ul>test</ul></ul> or similar.

pietzcker · Jun 07, 2008 · #4 · 2008-06-07T18:25+00:00

OK, this makes things a bit complicated, since nested structures don't lend themselves well to regexes. If the tags will not always be at column 1, will at least the opening and closing tags be at the same column? From your second example, I fear they might not be...

Another question for clarification: The "rule" that you need implemented is: "Delete a tag and everything up to and including its closing tag if and only if it contains another tag of the same name"?

I.e., delete

<div>
    blah
    <div>
        hello
    </div>
    ladida
</div>

but don't delete

<div>
    blah
    <p>
        hello
    </p>
    ladida
</div>

Is that it?

SiSL · Jun 07, 2008 · #5 · 2008-06-07T19:00+00:00

pietzcker wrote: OK, this makes things a bit complicated, since nested structures don't lend themselves well to regexes. If the tags will not always be at column 1, will at least the opening and closing tags be at the same column? From your second example, I fear they might not be...

Another question for clarification: The "rule" that you need implemented is: "Delete a tag and everything up to and including its closing tag if and only if it contains another tag of the same name"?

Is that it?

More like "Delete a tag and everything up to and including its closing tag, even if it contains another tag of the same name".

Honestly, I can run more than one replace if necessary; I just need the regex not to delete up to the wrong closing tag.

I was thinking of a regex approach along these lines:

Match <div>..</div> only if it does not contain another <div>.
Then repeat until all <div>..</div> structures are gone.

pietzcker · Jun 07, 2008 · #6 · 2008-06-07T20:40+00:00

The problem is that there is no way to make a regex count nested tags. You can't find the matching </div> tag if there is an unspecified number of <div>/</div> tags in-between, unless you know some other way to distinguish the tags (like the level of indentation). If the opening and closing tag are indented the same way, then the following regex could work:

(?s)^([ \t]*)<div>.*?^\1</div>

(?s) switches the regex engine to "dot matches all" mode.
^([ \t]*) looks for any whitespace before the first <div> tag it encounters (which must be the first non-whitespace on the line) and remembers it in backreference no. 1.
.*?^\1</div> then matches as little as it can until the next occurrence of a line that starts with a </div> tag at the same indentation level (exact same sequence of spaces and/or tabs!) as before.

(In UE 14.00b, the .*? can be written as .* (which makes the regex faster) because of a "bug" in the regex engine. Since IDM might correct that bug some day, I wouldn't do so unless performance is an issue.)

This regex will malfunction under certain conditions. E.g., it will match the following in its entirety because the first <div>/</div> pair sits on a single line, so its closing tag is not at the start of a line and the match runs on to the last </div>:

<div>remove</div>
don't remove!
<div>
    remove
</div>
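
For anyone who wants to try the indentation-based regex outside UltraEdit, here is a minimal Python sketch. It assumes Python's re module, where (?m) has to be added so that ^ matches at the start of every line, which the regex above relies on. The function name remove_indented_divs and the sample strings are made up for illustration.

```python
import re

# Same pattern as above, plus (?m) so that ^ anchors at each line start in Python.
INDENTED_DIV = re.compile(r"(?ms)^([ \t]*)<div>.*?^\1</div>")

def remove_indented_divs(text):
    """Delete each <div>...</div> block whose tags start a line at equal indentation."""
    return INDENTED_DIV.sub("", text)

# Opening and closing tags line up, so the whole outer block is removed.
well_indented = (
    "<div>\n"
    "    parent\n"
    "    <div>\n"
    "        child\n"
    "    </div>\n"
    "</div>\n"
    "keep me\n"
)
print(repr(remove_indented_divs(well_indented)))

# The problem case: the first </div> is not at the start of a line,
# so the match runs on to the last </div> and swallows the middle line.
problem_case = (
    "<div>remove</div>\n"
    "don't remove!\n"
    "<div>\n"
    "    remove\n"
    "</div>\n"
)
print(repr(remove_indented_divs(problem_case)))
```

The second call shows the failure described above: the line that should survive is deleted along with both blocks.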

If you can't be sure of your indentation levels, you could use UE's reindentation feature before applying this regex.

This solution - if it were applicable - would certainly be the easiest. However, I wouldn't bet my life on it always matching corresponding tags (see above).

A safer solution (like you proposed in your previous post) would be:

(?s)<div>(?:(?!<div>).)*?</div>

This will match any <div>/</div> pair that doesn't contain a <div> within, regardless of whether there are line breaks in-between. Of course, you will have to apply this regex over and over again until UE won't find any more matches.
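
The "apply it over and over" step can be sketched as a loop. This is only an illustration in Python, not something UE does for you: the function name strip_nested_tag, the tag parameter (added so the same idea covers cases like <ul>), and the one-line sample are assumptions of mine. It also assumes bare tags without attributes, as in the examples in this thread.

```python
import re

def strip_nested_tag(text, tag="div"):
    """Repeatedly delete innermost <tag>...</tag> pairs until none remain.

    Returns the cleaned text and the number of replace-all passes that
    actually removed something.
    """
    name = re.escape(tag)
    # Same regex as above, with the tag name filled in.
    innermost = re.compile(rf"(?s)<{name}>(?:(?!<{name}>).)*?</{name}>")
    passes = 0
    while True:
        text, count = innermost.subn("", text)
        if count == 0:  # nothing left to remove: done
            return text, passes
        passes += 1

sample = (
    "<div>a<div>b<div>c</div></div></div>"
    "<i>normal text 1</i>"
    "<div>d<div>e</div></div>"
    "normal text 2"
)
cleaned, passes = strip_nested_tag(sample)
print(cleaned)
print(passes)
```

Each pass is a "replace all", so every pass peels off one nesting level everywhere at once. For this sample the loop stops after three passes, which is the depth of the deepest nesting rather than the total number of <div> tags.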

SiSL · Jun 07, 2008 · #7 · 2008-06-07T23:03+00:00

That's great

Thank you so much...

That will work for me. Assuming properly formatted HTML, every <div> has its own </div>, so counting the <div>s or </div>s tells us not how deeply the divs are nested but how many times we may need to repeat the replace.

Thanks a lot