How to replace tag in an XHTML file with a regular expression replace?

    Apr 07, 2017#1

    Hi everyone

    I have a task

    Input
    <p class="indented">»Er ist ein <span class="bold">Freund von Stone. Es geht um eine <span class="gray font3">Zimmerrenovierung, <span class="bold">kann also <span class="italic">spät werden, weil meine anderen Termine jetzt</span></span> alle</span> nach hinten <span class="italic">rutschen</span>.«</span></p>
    <p class="indented"><span class="italic">Freund von Stone. Es geht um eine Zimmerrenovierung, <span class="bold">kann <span class="gray font4">also <span class="italic">spät werden, weil meine anderen Termine jetzt</span></span> alle</span> nach hinten <span class="italic">rutschen</span>.«</span></p>

    Output
    <p class="indented">»Er ist ein <b>Freund von Stone. Es geht um eine <span class="gray font3">Zimmerrenovierung, <b>kann also <i>spät werden, weil meine anderen Termine jetzt</i></b> alle</span> nach hinten <i>rutschen</i>.«</b></p>
    <p class="indented"><i>Freund von Stone. Es geht um eine Zimmerrenovierung, <b>kann <span class="gray font4">also <i>spät werden, weil meine anderen Termine jetzt</i></span> alle</b> nach hinten <i>rutschen</i>.«</i></p>

    How can I replace the tag <span class="bold"> with <b> and <span class="italic"> with <i>, together with the matching closing tags, as in the example above?

      Apr 08, 2017#2

      Hi Samir,

      try this Perl regex:
      (?s)(<(span) class="(?(3)[^"]++|(?|(b)old|(i)talic))">((?>(?:(?!<\2\b)(?!</\2\b).)++|(?1))*+)</\2>)

      and replace with:
      <\3>\4\</\3>

      Unfortunately, you have to repeat Replace All until nothing more is replaced. Each match consumes a whole outer block, so bold/italic spans nested inside it are only converted on a later pass.

      BR, Fleggy

        Apr 08, 2017#3

        Thanks, Fleggy.
        This pattern worked very well.
        I do not fully understand it, though. Could you please explain the pattern in detail?

          Apr 08, 2017#4

          Hi Samir

          hope this will help you. I am not a very good teacher :)

          Code: Select all

          (?s)                          -- . matches also CR/LF
          (                             -- the 1st group begins. Used in recursion
            <(span) class="             -- match <span class=" and capture span in the 2nd group
            (?(3)                       -- test if the group 3 has been already captured
              [^"]++                    -- YES: we are already inside the recursion, so any class value up to the closing " is matched
              |                         -- NO: we are at the top level, so only bold or italic can be matched (the outermost opening tag must have class bold or italic)
              (?|                       -- the branch reset group begins
                (b)old|(i)talic         -- match bold or italic and capture the first letter in the 3rd group
              )                         -- end of the branch reset group
            )                           -- end of the test
            ">                          -- match the rest of the tag
            (                           -- the 4th group begins. It matches the inner part between <span> and </span>
                                        -- a recursion is used to find the correct closing tag
                                        -- the inner part must consist either of any text other than an opening/closing span tag, or of another <span></span> block, or it can be empty
              (?>                       -- atomic group for better performance
                (?:                     -- non-capturing group for better performance
                  (?!<\2\b)(?!</\2\b).  -- match any character if the current text is not <span or </span
                )++                     -- and possessively repeat
                |                       -- OR
                (?1)                    -- otherwise the text here is <span or </span: try to match a nested <span>...</span> block recursively (this fails at </span, which ends the loop)
              )*+                       -- this part can repeat 0 or more times possessively
            )                           -- end of the 4th group
            </\2>                       -- match the closing tag
          )                             -- end of the 1st group
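
For readers who find the recursion hard to follow: its only job is to pair every </span> with the opening tag it belongs to. The sketch below does the same bookkeeping with an explicit stack in Python. It is not the regex above and not an UltraEdit feature, just an illustration under some assumptions: the markup is well formed, there are no self-closing <span/> tags, and class is the only attribute on the spans to convert. The names convert_spans and CLASS_TO_TAG are made up for this example.

```python
import re

# Assumption: only these two classes are converted, as in the question.
CLASS_TO_TAG = {"bold": "b", "italic": "i"}

# Matches an opening <span ...> tag (group 1 = its attributes) or </span>.
SPAN_TAG = re.compile(r"<span\b([^>]*)>|</span>")


def convert_spans(text):
    """Replace <span class="bold|italic"> pairs with <b>/<i> in one pass."""
    out = []    # pieces of the converted text
    stack = []  # closing tag to emit for each currently open <span>
    pos = 0
    for m in SPAN_TAG.finditer(text):
        out.append(text[pos:m.start()])  # copy the text before this tag
        pos = m.end()
        if m.group(0) == "</span>":
            # Close whatever the matching opening tag was turned into.
            out.append(stack.pop() if stack else "</span>")
            continue
        cls = re.fullmatch(r'\s+class="([^"]*)"\s*', m.group(1))
        tag = CLASS_TO_TAG.get(cls.group(1)) if cls else None
        if tag:
            out.append(f"<{tag}>")
            stack.append(f"</{tag}>")
        else:
            out.append(m.group(0))  # any other span stays untouched
            stack.append("</span>")
    out.append(text[pos:])
    return "".join(out)


sample = ('<span class="bold">First part<span class="italic">nested'
          '<span style="color:blue"> part</span></span>final part</span>')
print(convert_spans(sample))
# -> <b>First part<i>nested<span style="color:blue"> part</span></i>final part</b>
```

Because the stack remembers what each open span was turned into, nested blocks are handled in a single pass, whereas the regex needs Replace All to be repeated.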
          

            Apr 10, 2017#5

            Hi Samir,

            this modified regex is better:

            (?s)(<(span)\b(?(3)[^>]*+|(?: class="(?|(b)old|(i)talic))")>((?>(?:(?!<\2\b)(?!</\2\b).)++|(?1))*+)</\2>)

            because it now accepts <span> tags with any attributes in the text between <span class="bold"> or <span class="italic"> and the matching </span>.
            For example, the previous regex fails on this text:

            <span class="bold">First part<span class="italic">nested<span style="color:blue"> part</span></span>final part</span>

            BR, Fleggy

              Apr 10, 2017#6

              And here is a version which does not need a condition. I think this regex is easier to understand. At the top level it matches only <span class="bold"> or <span class="italic">, while the inner recursion accepts any form of the <span> tag.

              (?s)<(span) class="(?|(b)old|(i)talic)">(((?>(?:(?!<\1\b)(?!</\1\b).)++|<\1\b[^>]*+>(?3))*+)</\1>)

              and replace with:
              <\2>\4\</\2>

              BR, Fleggy


              PeM
              You can simply modify the beginning part

              <(span) class="(?|(b)old|(i)talic)">

              to match any other tag and keep the rest of the regex the same, provided the beginning part still contains two capturing groups. Otherwise you have to renumber the group references accordingly.
              For example:

              <(div) style="padding:(\d+)px">(((?>(?:(?!<\1\b)(?!</\1\b).)++|<\1\b[^>]*+>(?3))*+)</\1>)

              or, without the second capturing group inside the tag (note that the recursion then refers to group 2 instead of group 3):

              <(div) style="padding:\d+px">(((?>(?:(?!<\1\b)(?!</\1\b).)++|<\1\b[^>]*+>(?2))*+)</\1>)
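
The renumbering rule can be written down as a small helper. This is only a sketch in Python that builds the pattern string to paste into the Perl regex search; Python's own re module has no recursion, so it cannot run the result. The function name nested_tag_pattern and its parameters are made up for this example. It assumes that group 1 of the beginning part captures the tag name and that you count the capturing groups yourself (a branch reset group such as (?|(b)old|(i)talic) counts as one).

```python
def nested_tag_pattern(opening, groups_in_opening):
    """Append the recursive body to a custom opening-tag pattern.

    opening           -- regex for the top-level opening tag; group 1 must
                         capture the tag name
    groups_in_opening -- number of capturing groups inside `opening`
    """
    # The group wrapping "inner content + closing tag" is the first group
    # after the opening part; the recursion has to call exactly that group.
    # The inner content itself is the group after it (groups_in_opening + 2),
    # which is the number to use in the replace string.
    recursion_group = groups_in_opening + 1
    body = (
        r"(((?>(?:(?!<\1\b)(?!</\1\b).)++|<\1\b[^>]*+>(?"
        + str(recursion_group)
        + r"))*+)</\1>)"
    )
    return opening + body


# Two capturing groups in the beginning part: recursion stays (?3).
print(nested_tag_pattern(r'<(div) style="padding:(\d+)px">', 2))
# Only one capturing group: recursion becomes (?2).
print(nested_tag_pattern(r'<(div) style="padding:\d+px">', 1))
```

The two calls reproduce the two div patterns shown above.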