How to replace angle brackets in an HTML file which are part of the text and not of tags?


    Apr 19, 2019 #1

    Is it possible to find opening and closing angle brackets that are part of the text between tags and replace them with their hex character references?

    Input
    <p class="para">My <b>name is<Somenath>Chatterjee</b></p>
    <p class="para">My <b>name is > Somenath < Chatterjee</b></p>
    <p class="para">My > <b>name is > Somenath Chatterjee</b></p>
    <p class="para">My < <b>name is > Somenath Chatterjee</b></p>

    Expected Output
    <p class="para">My <b>name is&#x003C;Somenath&#x003E;Chatterjee</b></p>
    <p class="para">My <b>name is &#x003E; Somenath &#x003C; Chatterjee</b></p>
    <p class="para">My &#x003E; <b>name is &#x003E; Somenath Chatterjee</b></p>
    <p class="para">My &#x003C; <b>name is &#x003E; Somenath Chatterjee</b></p>

    Kindly help me out.
    Thanks in advance


      Apr 19, 2019 #2

      No, this is not possible with a simple replace, because a replace cannot tell where < begins and > ends an HTML element like p and b, and where < and > are used in the text. Of course, if you know all the HTML tags you use, you can write a long Perl regular expression search string that uses lookbehinds and lookaheads to work out the context of each angle bracket and decide whether it should be replaced.

      BTW: Someone from your company asked exactly the same question a few months ago. UltraEdit has built-in support for HTML Tidy, which works out where an angle bracket begins or ends an HTML tag and where it is (most likely) part of the text, and suggests replacing the latter with &lt; and &gt; for valid HTML.
      Best regards from an UC/UE/UES for Windows user from Austria


        Apr 19, 2019 #3

        Hi isomenath,

        and what about this?
        - replace < and > with safe placeholders for all paired tags
        - replace the remaining < and > with hex values
        - replace the placeholders back with < and >

        I am not familiar with HTML, so this is probably a naive approach. But you can try ;)


        1. Replace < with `` and > with ~~ for all paired tags
        Repeat Perl Replace All until nothing is replaced:
        F: (?s)(?<MAIN><(?<HEAD>(?<TAG>\w++)\b[^>]*+)>(?<BODY>(?>(?:(?!<\k<TAG>\b)(?!</\k<TAG>\b).)++|(?&MAIN))*+)</\k<TAG>>)
        R: ``$+{HEAD}~~$+{BODY}``/$+{TAG}~~
         or
        R: ``$2~~$4``/$3~~

        2. Replace the remaining < and > with hex values
        F: <(?!(?:br|hr|<more tags delimited by | if needed>)>)
        R: &#x003C;

        F: (?<!<br)(?<!<hr)<more negative lookbehinds if needed>>
        R: &#x003E;

        3. Replace the placeholders back with < and >
        F: ``
        R: <

        F: ~~
        R: >


        BR, Fleggy

        EDIT: added numbered groups to the 1st replace (named groups are supported in "Replace with" since UE26); modified the 2nd replace to handle empty elements (unpaired tags)
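
For readers who want to try the placeholder idea outside UltraEdit, here is a minimal Python sketch of the three steps. It is not Fleggy's pattern: Python's standard `re` module has no recursion like `(?&MAIN)`, so step 1 protects tags from a known list instead of detecting pairs. The tag list `KNOWN_TAGS`, the function name `escape_text_brackets`, and the assumption that the placeholder strings never occur in the input are all illustrative.

```python
import re

# Assumption: the complete list of tag names used in the file (case-sensitive).
KNOWN_TAGS = ("p", "b", "br", "hr")

# An opening or closing tag whose name is in the known list.
TAG_RE = re.compile(r"</?(?:%s)\b[^<>]*>" % "|".join(KNOWN_TAGS))

# Placeholders as in the post; assumed not to occur anywhere in the input.
OPEN_MARK, CLOSE_MARK = "``", "~~"


def escape_text_brackets(html):
    # Step 1: swap the brackets of recognised tags for placeholders.
    text = TAG_RE.sub(
        lambda m: OPEN_MARK + m.group(0)[1:-1] + CLOSE_MARK, html
    )
    # Step 2: every bracket still present belongs to the text.
    text = text.replace("<", "&#x003C;").replace(">", "&#x003E;")
    # Step 3: turn the placeholders back into real brackets.
    return text.replace(OPEN_MARK, "<").replace(CLOSE_MARK, ">")


if __name__ == "__main__":
    print(escape_text_brackets(
        '<p class="para">My <b>name is<Somenath>Chatterjee</b></p>'
    ))
```

Text that happens to look like a known tag, for example a literal `<b>` meant as text, would still be treated as a tag, which is the limitation Mofi describes in post #2.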

          Apr 23, 2019 #4

          Hello,

          I prepared a pattern that matches the correct tag pair even when a single tag is inside such a pair, e.g. <XXX>....<XXX/>.....</XXX>. You can find it here
          So the modified pattern for the first step should look like:

          F: (?s)(?<MAIN><\w++\b[^>]*+(?<=/)>|<(?<HEAD>(?<TAG>\w++)\b[^>]*+)(?<!/)>(?<BODY>(?>(?:(?!<\k<TAG>\b)(?!</\k<TAG>\b).)++|(?&MAIN))*+)</\k<TAG>>)(?<!/>)

          The replace string remains the same.
          I hope that is the final pattern :)

          BR, Fleggy

            Apr 23, 2019 #5

            Well, here is hopefully the complete scenario:

            1. Replace < with `` and > with ~~ for all paired tags
            Repeat Perl Replace All until nothing is replaced:
            F: (?s)(?<MAIN><\w++\b[^>]*+(?<=/)>|<(?<HEAD>(?<TAG>\w++)\b[^>]*+)(?<!/)>(?<BODY>(?>(?:(?!<\k<TAG>\b)(?!</\k<TAG>\b).)++|(?&MAIN))*+)</\k<TAG>>)(?<!/>)
            R: ``$+{HEAD}~~$+{BODY}``/$+{TAG}~~
             or
            R: ``$2~~$4``/$3~~

            2. Replace < with `` and > with ~~ for all self-closing single tags like <XXX />
            F: <(?<ELEMENT>\w++\b[^>]*+)(?<=/)>
            R: ``$+{ELEMENT}~~
             or
            R: ``$1~~

            3. Replace < with `` and > with ~~ for the remaining single tags (fixed list of names)
            F: <(?<ELEMENT>(?:br|hr|img|etc...)\b[^>]*+)>
            R: ``$+{ELEMENT}~~
             or
            R: ``$1~~

            4. Replace the remaining < and > with hex values
            F: <
            R: &#x003C;

            F: >
            R: &#x003E;

            5. Replace the placeholders back with < and >
            F: ``
            R: <

            F: ~~
            R: >


            BR, Fleggy
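
The same classification (paired tags, self-closing tags, a fixed list of single tags, everything else is text) can be sketched in Python without placeholders. This is a different technique from the recursive Perl pattern above: it pairs tags with a stack, because Python's standard `re` module does not support recursion. The names `VOID_TAGS`, `CANDIDATE` and `escape_unpaired` are hypothetical, and the list of single tags is an assumption to be adapted.

```python
import re

# Assumption: single tags that never have a closing counterpart.
VOID_TAGS = {"br", "hr", "img"}

# Anything shaped like a tag: optional slash, a name, the rest up to '>'.
CANDIDATE = re.compile(r"<(/?)(\w+)\b([^<>]*)>")

LT, GT = "&#x003C;", "&#x003E;"


def escape_unpaired(html):
    candidates = list(CANDIDATE.finditer(html))
    accepted = set()  # indexes of candidates that count as real tags
    stack = []        # (tag name, candidate index) of still-open tags
    for i, m in enumerate(candidates):
        closing, name, rest = m.group(1), m.group(2), m.group(3)
        if closing:
            # Look for the nearest open tag with the same name.
            for depth in range(len(stack) - 1, -1, -1):
                if stack[depth][0] == name:
                    accepted.update((stack[depth][1], i))
                    del stack[depth:]  # drop unmatched tags opened after it
                    break
        elif rest.endswith("/") or name in VOID_TAGS:
            accepted.add(i)  # self-closing or listed single tag
        else:
            stack.append((name, i))

    def escape(text):
        return text.replace("<", LT).replace(">", GT)

    out, pos = [], 0
    for i, m in enumerate(candidates):
        if i in accepted:
            out.append(escape(html[pos:m.start()]))  # text before the tag
            out.append(m.group(0))                   # the tag itself
            pos = m.end()
    out.append(escape(html[pos:]))
    return "".join(out)


if __name__ == "__main__":
    print(escape_unpaired(
        '<p class="para">My <b>name is<Somenath>Chatterjee</b></p>'
    ))
```

A tag-shaped piece of text with no matching closing tag, such as `<Somenath>`, stays on the stack until it is discarded and is therefore escaped; the sketch does not try to reproduce every edge case of the Perl pattern.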