Removing spaces from a text string to validate XHTML

Basic User

    Aug 19, 2008#1

    Hi, I'm new to this board and fairly new to Regular Expressions. I have used every combination of words in the Forum search feature, but have not found a solution for this.

    I'm trying to get rid of spaces in anchors, refs and ids in files that I am trying to convert to XHTML from old HTML files that were generated by an ancient MS WYSIWYG application that generates lots of bad code. I have cleaned up 90% of it but I'm stuck with this problem:

    <ul>
          <li><a href="#three word title">three word title </a></li>
          <li><a href="#four word title?">four word title?</a> </li>
          <li><a href="#5 word title (and some words in brackets 0r parentheses)">5 word title (and some words in brackets 0r parentheses)</a> </li>
        </ul>
        <p>A bunch of text here. </p>
        <h2><a name="three word title"></a>three word title
        </h2>
        <p>Another paragraph of text</p>
        <h2><a name="four word title?"></a>four word title?</h2>
        <p>Another paragraph of text</p>
        <h2><a name="5 word title (and some words in brackets 0r parentheses)" id="5 word title (and some words in brackets 0r parentheses)"></a>5 word title (and some words in brackets 0r parentheses)</h2>
        <p>Another paragraph of text and some tables.</p>
    I would like to take out all the spaces (and special characters) from the href, name and id attributes, so that they at least yield valid XHTML 1.0. If there is a way to do this in UltraEdit, I'd love to hear of it. I've looked at all the regex pages that I can find, to no avail.
    As you'll note in the code above, there is sometimes an "id" and sometimes not; my hair is falling out.
    I have thousands and thousands of these to clean up, and it seems I'm chasing my tail right now.

    Thanks.

    I'm using UE 14.10.0.1018 and am trying to use Perl RegEx.

    Grand Master

      Aug 20, 2008#2

      I think a single regular expression can't do all the steps necessary to convert those anchors, references and ids to valid XHTML. So here is a macro which worked on your example. The macro property Continue if search string not found must be checked for this macro.

      The first regex inserts the letter 'n' at the start of every anchor, id or reference to an anchor that starts with a digit, which is not allowed in XHTML if I remember correctly.

      Next, in a loop, all references to an anchor, all anchors and all ids are searched for and selected. The surrounding double quotes are not selected, and neither is the part of a reference which is not an anchor name, because you could also have something like href="../file.htm#5 words". All spaces and tabs inside the selection are then converted to underscores. After that, all characters which are not letters (A-Z, a-z), digits (0-9), '-' or '_' are deleted to get a valid anchor/id. I hope that this never produces an empty anchor/id. If you want, you could insert a check for an empty anchor/id after the loop with an appropriate find.

      Because it is now possible that an anchor/id starts with an underscore, the last regex replaces every underscore at the start of an anchor/id with the letter 's'.

      You can adapt 'n' and 's' to something different if you like.

      InsertMode
      ColumnModeOff
      HexOff
      PerlReOn
      Top
      Find RegExp "(href="[^"#]*#|id="|name=")(\d)"
      Replace All "\1n\2"
      Loop
      Find RegExp "(href="[^"#]*#|id="|name=")"
      IfNotFound
      ExitLoop
      EndIf
      Find RegExp "[^"]+""
      IfFound
      StartSelect
      Key LEFT ARROW
      Find RegExp SelectText "[ \t]"
      Replace All "_"

      Find RegExp SelectText "[^a-z0-9\-_]"
      Replace All ""
      EndSelect
      EndIf
      EndLoop
      Top
      Find RegExp "(href="[^"#]*#|id="|name=")_"
      Replace All "\1s"
      Best regards from an UC/UE/UES for Windows user from Austria
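
For readers who want to try the same cleanup outside UltraEdit, the macro's steps can be sketched in Python. This is only a sketch and not Mofi's macro: the names `ATTR`, `sanitize` and `convert` are made up for this illustration, and matching is case-sensitive here. The steps run in the same order as in the macro: prefix a leading digit with 'n', turn spaces and tabs into underscores, delete everything except letters, digits, '-' and '_', then replace a leading underscore with 's'.

```python
import re

# Same alternation as the macro's search string: the start of an anchor
# reference (href="...#), an id or a name, followed by the value itself.
ATTR = re.compile(r'(href="[^"#]*#|id="|name=")([^"]+)"')


def sanitize(value):
    """Clean one anchor/id value using the macro's steps, in the macro's order."""
    if re.match(r"[0-9]", value):
        value = "n" + value                        # step 1: no leading digit
    value = re.sub(r"[ \t]", "_", value)           # step 2: spaces/tabs -> underscore
    value = re.sub(r"[^A-Za-z0-9_-]", "", value)   # step 3: drop all other characters
    if value.startswith("_"):
        value = "s" + value[1:]                    # step 4: no leading underscore
    return value


def convert(html):
    """Rewrite every matched attribute value and leave the rest of the text alone."""
    return ATTR.sub(lambda m: m.group(1) + sanitize(m.group(2)) + '"', html)


if __name__ == "__main__":
    print(convert('<li><a href="#four word title?">four word title?</a> </li>'))
```

As with the macro, a value made up only of deleted characters would end up empty, so a check for `=""` after the conversion is still a good idea.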

      Basic User

        Aug 20, 2008#3

        Mofi,
        I have tried your macro and it works, thank you very much. I have replaced the "_" with "" so that it just takes the spaces out. Is there a way of sending the change results to the Output Window so that I can rapidly check for anomalies?

        The regular expression I had been working on looked something like this:

        (<a href="#)([\w\s\(\)\".-]*)(">?)([\w\W\s]*)(<a name=")\2(" id=")\2(">)

        and replace was like this
        $1h$2$3$4$5n$2$6i$2$7

        h, n and i are in there just to make sure that the href, name and id start with a letter.

        And I had been trying to somehow incorporate part of the following, to remove the spaces.

        ## Given string with unwanted spaces
        my $string = "squeezing spaces fro m t h e string";
        ## Regex that removes every whitespace character from the string
        $string =~ s/\s//g;
        print "$string \n";

        I found the above at the link below while I was scouring the web for a solution. I obviously didn't figure out how to incorporate it in my RegEx.
        http://icfun.blogspot.com/2008/07/perl- ... ueeze.html

        It seemed to me like I should be able to do it, but I'm not that proficient with regular expressions yet. They are fun when I get them to work without being too greedy, though, and even if I spend an hour or two figuring one out, it's worth it when it saves me tens of hours or more of fixing up bad code.

        I should also mention that I have to use Dreamweaver at work where I'm not allowed to install my own software, so I use UltraEdit (UE3) off my USB thumb drive.

        Grand Master

          Aug 20, 2008#4

          Tim wrote: I have replaced the "_" with "" so that it just takes the spaces out.
          In this case you should remove the 2 lines

          Find RegExp SelectText "[ \t]"
          Replace All "_"

          from the macro and you will get the same result faster.
          Tim wrote: Is there a way of sending the change results to the Output Window so that I can rapidly check for anomalies?
          Not with a macro. A macro cannot write to the output window. You would need to translate the macro into a script. The script could then write the full file name, the current line number and the new anchor/id/reference to the output window whenever the Replace All (in selection only) in the loop has changed something. To report their changes as well, the other 2 Replace All commands would each have to be put into a loop and executed as single replaces from top to bottom of the file.
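
The kind of change report described here can be sketched as follows. This is a standalone Python illustration rather than an UltraEdit script, so nothing is written to the real output window; `convert_with_log`, `clean` and the `file(line): old -> new` report format are assumptions made for this example.

```python
import re

# Same attribute pattern as the macro's search string, plus the value.
ATTR = re.compile(r'(href="[^"#]*#|id="|name=")([^"]+)"')


def clean(value):
    """Spaces/tabs become underscores, other invalid characters are dropped."""
    value = re.sub(r"[ \t]", "_", value)
    return re.sub(r"[^A-Za-z0-9_-]", "", value)


def convert_with_log(text, filename):
    """Return the converted text and one report line per changed value."""
    log = []
    out = []
    for lineno, line in enumerate(text.splitlines(keepends=True), start=1):
        def fix(match, lineno=lineno):
            old = match.group(2)
            new = clean(old)
            if new != old:
                # Hypothetical report format: file name, line number, old and new value.
                log.append(f"{filename}({lineno}): {old} -> {new}")
            return match.group(1) + new + '"'
        out.append(ATTR.sub(fix, line))
    return "".join(out), log


if __name__ == "__main__":
    sample = '<p>x</p>\n<a name="four word title?"></a>\n'
    converted, report = convert_with_log(sample, "page.htm")
    print("\n".join(report))
```

Because the text is processed line by line, every report entry carries the line on which the value was found, which makes it quick to scan for anomalies.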

          Basic User

            Aug 20, 2008#5

            Thanks again Mofi.

            For now I'll make the change that you suggest and stick with what works.

              Nov 14, 2008#6

              Mofi,
              I reintroduced your line to replace spaces with _ because I was running into trouble with large lists where 1.0 would become 10, and later on I'd have an item 10, thus creating duplicate anchors...
              There is one thing that I hadn't noticed until recently. This macro is taking all the . characters out of my meta tags:

              meta name="dc.description" becomes meta name="dcdescription" for example.

              I don't see any obvious fix in the macro so unless you do, I'll just keep on going like this and do a search and replace in files at the end of each day on the 7 meta tags that it changes.
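
The duplicate-anchor problem mentioned in this post (1.0 and 10 both ending up as 10) can be detected before anything is changed. The following Python sketch is an illustration only; `find_collisions` and `strip_invalid` are hypothetical names, and the check looks at every `id` and `name` value, so meta names are included too.

```python
import re
from collections import defaultdict

# Every id="..." or name="..." value in the text.
NAME_OR_ID = re.compile(r'\b(?:id|name)="([^"]+)"')


def strip_invalid(value):
    """Delete everything that is not a letter, digit, '-' or '_'."""
    return re.sub(r"[^A-Za-z0-9_-]", "", value)


def find_collisions(html):
    """Map each cleaned value to the different original values that produce it."""
    originals = defaultdict(set)
    for match in NAME_OR_ID.finditer(html):
        originals[strip_invalid(match.group(1))].add(match.group(1))
    # Only cleaned values that come from more than one original are a problem.
    return {new: sorted(olds) for new, olds in originals.items() if len(olds) > 1}


if __name__ == "__main__":
    sample = '<a name="1.0"></a><a name="10"></a><a name="2" id="2"></a>'
    print(find_collisions(sample))
```

An anchor that carries the same value in both `name` and `id` is not reported, because only different original values that collapse to the same result count as a collision.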

              Grand Master

                Nov 14, 2008#7

                The first idea I have for ignoring meta names is to replace |name=" with |<a[ \t]+name=" in the 3 regular expression search strings.

                With that modification the meta names would no longer be found. But the following anchors would then not be found either:

                <a href="#top" name="anchor 1">
                <a
                name="anchor 2">

                So maybe it's better to not modify the regular expression strings.

                A second solution would be to not run the macro from the top of the file. You could insert a Find below every Top command to position the cursor below the meta tags in the head of the file. For example, the following could work:

                InsertMode
                ColumnModeOff
                HexOff
                PerlReOn
                Top
                Find "</head>"
                Find RegExp "(href="[^"#]*#|id="|name=")(\d)"
                Replace All "\1n\2"
                Loop
                Find RegExp "(href="[^"#]*#|id="|name=")"
                IfNotFound
                ExitLoop
                EndIf
                Find RegExp "[^"]+""
                IfFound
                StartSelect
                Key LEFT ARROW
                Find RegExp SelectText "[ \t]"
                Replace All "_"

                Find RegExp SelectText "[^a-z0-9\-_]"
                Replace All ""
                EndSelect
                EndIf
                EndLoop
                Top
                Find "</head>"
                Find RegExp "(href="[^"#]*#|id="|name=")_"
                Replace All "\1s"
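
The idea of starting below `</head>` carries over to other tools as well. Here is a minimal Python sketch of it, assuming a hypothetical helper `convert_body_only`; like the macro with Continue if search string not found checked, it falls back to processing the whole text when no `</head>` is present.

```python
import re

# Same attribute pattern as the macro's search string, plus the value.
ATTR = re.compile(r'(href="[^"#]*#|id="|name=")([^"]+)"')
HEAD_END = re.compile(r"</head>", re.IGNORECASE)


def clean(value):
    """Spaces/tabs become underscores, other invalid characters are dropped."""
    value = re.sub(r"[ \t]", "_", value)
    return re.sub(r"[^A-Za-z0-9_-]", "", value)


def convert_body_only(html):
    """Leave everything up to and including </head> untouched."""
    found = HEAD_END.search(html)
    start = found.end() if found else 0   # no </head>: process everything
    head, rest = html[:start], html[start:]
    return head + ATTR.sub(lambda m: m.group(1) + clean(m.group(2)) + '"', rest)


if __name__ == "__main__":
    doc = ('<head><meta name="dc.description" content="x"></head>\n'
           '<body><h2><a name="four word title?"></a></h2></body>')
    print(convert_body_only(doc))
```

The meta name keeps its dot because the head is never passed to the substitution.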

                Basic User

                  Nov 14, 2008#8

                  Thanks Mofi, but this sends UltraEdit into a "You have entered an invalid regular expression!" message and an endless loop, and I have to use Ctrl+Alt+Del to close the program.
                  I cut and pasted the code from the screen and removed trailing spaces.

                  Any idea what is happening?

                  I really like the idea of excluding the header.

                  Grand Master

                    Nov 15, 2008#9

                    Tim wrote: I cut and paste the code from the screen, removed trailing spaces..
                    I have done the same now and the macro worked fine on your example code with an additional line containing </head> at top of your example.

                    The regular expression search strings are the same as before. Only the 2 non-regular-expression searches (Find "</head>") were inserted. So I don't have any idea why the Perl regex search strings should now be invalid. Check the macro code again.

                    By the way: normally it is possible to exit a macro by pressing and holding the ESC key for a few seconds. Only if this does not help is it necessary to kill UltraEdit with the task manager.

                    Basic User

                      Nov 27, 2008#10

                      Sorry about the delayed response Mofi, I wasn't ignoring you.
                      Anyway, I have tried quite a few times and still get an error.

                      Grand Master

                        Nov 27, 2008#11

                        If you still use UE 14.10.0.1018, please update to the latest version, 14.20.1.1001. The attached ZIP file contains the macro file which worked on my computer. If you still get the error message about an invalid Perl regular expression, try each of them manually via the Find dialog to detect which one is interpreted as invalid on your computer.
                        Validate_XHTML.zip (357 Bytes)
                        Contains the macro file with the macro code posted before.

                        Basic User

                          Nov 27, 2008#12

                          Thanks Mofi - Your version works great. When I have lots of time I will try to find out exactly what the difference is between the 2 files, besides the name. :?
                          Sorry about the smiley I couldn't resist.
                          I have another, somewhat related, ongoing issue that I am having the hardest time finding any information on. I have converted and validated a few thousand XHTML 1.0 Strict files in French using iso-8859-1. The problem is that many of them still contain the literal é (ALT+233) instead of the entity &eacute;, which is fine in the real world. But I have to encode everything within the body as HTML entities (&eacute;, &Agrave; and so on) for some reason that is beyond my comprehension. I have tried various batch files using HTML Tidy, but it always changes more than just the "extended/special" characters. All I want to do is change all the special characters without reformatting the whole documents. Such a tool existed in Homesite+, and you could save a macro, but I don't have that software anymore and I would like to find a way to write a batch file that accomplishes this.
                          If there was a way to do a Search and Replace in files *.htm...

                          Grand Master

                            Nov 28, 2008#13

                            In the HTML toolbar of UltraEdit is the command Convert Special Characters in Selected Text to HTML Entities (icon shows < -> &lt) which you can use to convert all ANSI characters to HTML entities. You have to select everything (Ctrl+A) and then run this command from the toolbar.

                            Last year I also created a complete tag list of the HTML 4.01 entities. Open View - Views/Lists - Tag List and select HTML - Special Characters, or download and use HTML Tags and Entities, which contains everything HTML writers need.

                            Because you want to make this conversion on many HTML files at once, I have quickly converted the first HTML entity list in html_tags.txt into macro code with 1 regular expression replace and some extra typing. This macro runs Replace in Files on *.htm (and also *.html because of Windows) in the current working directory of UltraEdit to convert ANSI characters into HTML entities. The log in the output window after macro execution comes only from the last Replace in Files and is not a total summary of all replaces.

                            To check what the current working directory of UltraEdit is, and therefore which files this macro will work on, manually run a Find In Files with no (empty) search string, *.htm as the file type specification and .\ as the directory. Depending on your settings, the current working directory is either the "Start In" directory of the shortcut used to start UltraEdit, the directory of the active document, or the directory of the last opened file. It is not possible to set the current working directory directly in UltraEdit.

                            Alternatively, open the file "HTML_Entities.uem" and replace all ".\" with "your directory path\". With a simple Replace All you can additionally add the ReplInFiles parameter Recursive to every ReplInFiles line to run the replaces on *.htm files in all subdirectories as well. Then load the macro file "HTML_Entities.mac", copy the modified macro code to the Windows clipboard, open the macro editor and replace the existing macro code with your modified macro code using Ctrl+V. Close the macro editor and run the macro.

                            The macro is written for ANSI HTML files. Don't run it on Unicode HTML files of any type (UTF-8, UTF-16). The following entities from the HTML entity list in the tag list file are not present in the macro: &emsp; &ensp; &thinsp; &zwj; &zwnj; &shy; &quot; &amp; &lt; &gt;

                            The macro file was created with UE v11.20 and can therefore also be used with older versions of UltraEdit, not only with UE v14.xx. But the macro was not fully tested. I have executed it only once each with UE v11.20 and the currently latest v14.20.1.1000 on an HTML file containing German umlauts.
                            HTML_Entities.zip (1.81 KiB)
                            Contains the macro source code and the macro file for converting ANSI characters to HTML entities.
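
For anyone without UltraEdit at hand, the same kind of batch conversion can be sketched in Python. This is not the attached macro: `to_entities` and `convert_tree` are hypothetical names, the `cp1252` default encoding is an assumption standing in for "ANSI", and the entity names come from Python's standard `html.entities.codepoint2name` table (HTML 4.01). The entities listed above as missing from the macro are skipped here too, so the markup itself stays intact. Like the macro, the sketch converts the whole file, not only the body.

```python
from html.entities import codepoint2name
from pathlib import Path

# Entity names that are left alone, matching the list of entities
# that are not present in the macro.
SKIPPED = {"quot", "amp", "lt", "gt", "emsp", "ensp", "thinsp", "zwj", "zwnj", "shy"}


def to_entities(text):
    """Replace every character that has an HTML 4.01 entity name by that entity."""
    out = []
    for ch in text:
        name = codepoint2name.get(ord(ch))
        if name and name not in SKIPPED:
            out.append(f"&{name};")
        else:
            out.append(ch)
    return "".join(out)


def convert_tree(root, pattern="*.htm", encoding="cp1252"):
    """Convert all matching files below root in place (back up first!)."""
    for path in Path(root).rglob(pattern):
        # Bytes are decoded and encoded by hand so line endings stay as they are.
        original = path.read_bytes().decode(encoding)
        converted = to_entities(original)
        if converted != original:
            path.write_bytes(converted.encode(encoding))


if __name__ == "__main__":
    print(to_entities("<p>À la café</p>"))
```

Do not run this on UTF-8 or UTF-16 files with the default encoding; pass the real encoding of the files instead.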

                            Basic User

                              Nov 29, 2008#14

                              Thanks again Mofi.

                              InsertMode
                              ColumnModeOff
                              HexOff
                              UnixReOff
                              ReplInFiles Recursive MatchCase Log "C:\inetpub\wwwroot\testing\" "*.htm" " "
                              "&nbsp;"
                              ReplInFiles Recursive MatchCase Log "C:\inetpub\wwwroot\testing\" "*.htm" "—"
                              "&mdash;"
                              ......
                              Seems to work well. I just ran the macro on 250+ .htm files and it took about a minute. I've had a look through some of them and there don't seem to be any nasty surprises. Then again, that's why we back up files before doing this type of mass operation.