HTML Indexing Macros

Newbie

    May 14, 2006 #1

    Hi, I'm new to UltraEdit. I have been looking through the macros and search-and-replace forums (to no avail) to see if I could figure out how to do the following:

    I have an OCR-scanned book of 40 chapters, each of which is an individual HTML page. While marking up the text, I bookmarked each of the original pages using a simple format, p001 to p600.

    I have now arrived at the index, which OCR-scanned fine. The entries, alphabetical of course, each have one or more page references, primarily in the form [space]7. [space]27. [space]127.

    What I wish to do is find each number in this form and check which chapter it belongs to (i.e. from a manually created table within the macro) and then change the entry to

    <a href="01_Thetitle.htm#p007" title="7." target="_self">
    <a href="03_anothertitle.htm#p027" title="27." target="_self">

    as appropriate for each found entry.

    The second form used in the index is for page ranges such as 97-102. There are only a handful of these, and they could be done manually as long as the main macro avoids them.
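For readers who want to see the lookup described above outside of UltraEdit, here is a minimal Python sketch of the "manually created table" idea. It is only an illustration: the chapter start pages and the file name 02_placeholder.htm are invented, and CHAPTERS, FIRST_PAGES and page_link are hypothetical names. Only the two link formats come from the post.

```python
from bisect import bisect_right

# Hypothetical table: (first page of the chapter, chapter file name).
# The page boundaries and the middle file name are invented for illustration.
CHAPTERS = [
    (1, "01_Thetitle.htm"),
    (15, "02_placeholder.htm"),
    (25, "03_anothertitle.htm"),
]
FIRST_PAGES = [first for first, _ in CHAPTERS]


def page_link(page: int) -> str:
    """Return the opening anchor tag for a page number from the index."""
    if page < FIRST_PAGES[0]:
        raise ValueError(f"page {page} is before the first chapter")
    # The chapter is the last one whose first page is <= the requested page.
    chapter = bisect_right(FIRST_PAGES, page) - 1
    filename = CHAPTERS[chapter][1]
    return f'<a href="{filename}#p{page:03d}" title="{page}." target="_self">'


print(page_link(7))
print(page_link(27))
```

The table only needs one row per chapter, which keeps it short enough to maintain by hand.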

    Grand Master

      May 15, 2006 #2

      I think I have successfully created the macro you want.

      Before you run the macros on the index file, first check it for bad page references like <p>Yellin. 463</p>, where neither '-' nor '.' follows the page number. You can search for such numbers with the following UltraEdit style regex string: [0-9]++[0-9]++[0-9][~.^-]
      There are a few such bad page references, which must be fixed manually.
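The same check can be sketched in Python for anyone who prefers to run it outside the editor. This is not a translation of the UltraEdit expression. It is an assumption about its intent: report every number that is not followed by another digit, a '.' or a '-'. BAD_REF and find_bad_refs are hypothetical names.

```python
import re

# A run of digits that is NOT followed by another digit, '.' or '-'.
BAD_REF = re.compile(r"\d+(?![\d.-])")


def find_bad_refs(line: str) -> list:
    """Return the numbers in a line that lack the trailing '.' or '-'."""
    return BAD_REF.findall(line)


print(find_bad_refs("<p>Yellin. 463</p>"))
print(find_bad_refs("<p>Yellin. 463. 97-102.</p>"))
```

Numbers that appear elsewhere in the index file, for example in headings, would also be reported, so the hits still need a manual look.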

      The macro needs the macro property Continue if a Find with Replace not found to be checked. Run this macro with only the index file open; no other file should be open when the macro starts. Also, please change the path F:\Temp\chapter\ (marked red in the original post) to the location of your htm files in the FindInFiles command and in the regex Find command.

      I hope you never have more than one <a name="P???"></a> in a single line of your htm files. The macro deletes everything on a line after <a name="P???"></a>, so a second anchor on the same line would be lost and its page number would not be converted to a link.

      I have not used the content file to get the information for the page references. I used a Find in Files command instead, because it is better to work with the original source.

      The loop after the FindInFiles command is a workaround for a bug in UltraEdit: the FindInFiles result window does not always have the focus after the command has executed.

      UE v12.00 produces a Unicode result file, and there are some problems with regex replaces on Unicode files. Because you use v12 of UE, the macro temporarily switches to hex mode to check whether the result file is a Unicode file and, if so, converts it to ASCII. It then continues by reformatting the result with some regular expression replaces into a list of URLs in the format you want. This auto-detection and conversion lets the macro work with v12 of UE and also with all previous versions, which produce an ASCII result file.
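The detection step can be pictured with a small Python sketch. It mirrors the idea of looking for 00 bytes, plus a byte order mark check that the macro does not do. The function name decode_results is hypothetical, and treating a file without a byte order mark as little-endian UTF-16 is an assumption.

```python
def decode_results(raw: bytes) -> str:
    """Decode a search result file that may be UTF-16 or single-byte text."""
    if raw.startswith((b"\xff\xfe", b"\xfe\xff")):
        # A byte order mark is present; the utf-16 codec uses it and strips it.
        return raw.decode("utf-16")
    if b"\x00" in raw:
        # Same idea as the hex mode search for 00 in the macro.
        # Assumption: a file without a byte order mark is little-endian.
        return raw.decode("utf-16-le")
    # No zero bytes: treat the file as single-byte text.
    return raw.decode("latin-1")


print(decode_results("Search complete".encode("utf-16")))
print(decode_results(b"Search complete"))
```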

      After the result file is converted into the list of URLs, the macro switches back to the index file and sets the cursor to the beginning of the index list. This is necessary because the head also contains some "number." entries.

      Then the first macro below searches in a loop for page numbers (a number followed by a '-' or '.') and replaces each with the appropriate link from the result file.

      While writing this explanation, I realized that this is a slow approach and found a faster solution. So this first macro is just for education on UE macro writing.

      InsertMode
      ColumnModeOff
      HexOff
      UnixReOff
      FindInFiles MatchCase RegExp PreserveCase Log "F:\Temp\chapter\" "*.htm" "<a name="P[0-9]+">"
      Loop
      Find RegExp Up "%Search complete, found "
      IfFound
      DeleteLine
      ExitLoop
      Else
      NextWindow
      EndIf
      EndLoop
      Top
      HexOn
      Find "00"
      IfFound
      HexOff
      Top
      UnicodeToASCII
      Else
      HexOff
      EndIf
      Find "----------------------------------------^p"
      Replace All ""
      Find RegExp "%Find '*^p"
      Replace All ""
      Find RegExp "%Found '*^p"
      Replace All ""
      Find RegExp "%F:\Temp\chapter\^(*/[0-9]+: ^)*<a name="P"
      Replace All "^1<a name="P"
      Find RegExp "</a>*$"
      Replace All "</a>"
      Find RegExp "%^(*^)/[0-9]+: <a name="P^(0++^)^([0-9]+^)"></a>"
      Replace All "<a href="^1#P^2^3" title="Page ^3" target="_self">^3</a>"
      NextWindow
      Top
      Find "<h1>INDEX"
      IfNotFound
      ExitMacro
      EndIf
      Clipboard 9
      Loop
      Find RegExp "[0-9]+[^-.]"
      IfNotFound
      ExitLoop
      EndIf
      StartSelect
      Key LEFT ARROW
      Copy
      EndSelect
      PreviousWindow
      Top
      Find "Page ^c"
      IfFound
      Key HOME
      StartSelect
      Key END
      Copy
      EndSelect
      EndIf
      NextWindow
      Paste
      EndLoop
      ClearClipboard
      Clipboard 0
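To make the reformatting step of the macro above easier to follow, here is a rough Python sketch of what the regular expression replaces do to one line of the Find in Files result. The line layout (full path, '/', line number, ': ', line text) is inferred from the macro's search strings, not from UltraEdit documentation, and RESULT_LINE and result_to_link are hypothetical names.

```python
import re

# Assumed result line layout: <full path>/<line number>: <line text>
RESULT_LINE = re.compile(
    r'^(?P<path>.+?)/\d+: .*?<a name="P(?P<zeros>0*)(?P<page>\d+)"></a>'
)


def result_to_link(line: str):
    """Turn one result line into a hyperlink, or None if it has no anchor."""
    match = RESULT_LINE.match(line)
    if match is None:
        return None
    # Keep only the file name, as the macro does by deleting the folder path.
    filename = match.group("path").rsplit("\\", 1)[-1]
    zeros, page = match.group("zeros"), match.group("page")
    return (f'<a href="{filename}#P{zeros}{page}" title="Page {page}" '
            f'target="_self">{page}</a>')


sample = r'F:\Temp\chapter\01_Thetitle.htm/12: <p><a name="P007"></a>Some text'
print(result_to_link(sample))
```

As in the macro, only the first anchor on a line is used, which is why more than one anchor per line would lose page references.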

      So here is the second version, which is much faster. It copies each hyperlink from the result file into the index file and replaces all matching page numbers in the index file with that hyperlink. That was a little bit tricky because you have not used leading zeros for the page numbers, so a page number like 25 also occurs in 125, 225, 325, ... The solution was to use the find option MatchWord. Because a '-' is not a word delimiter, the numbers followed by a '-' have to be converted to something different before running the replaces and converted back to a single '-' at loop exit. This macro is much faster because window switching is reduced to the number of pages instead of the number of page references.

      InsertMode
      ColumnModeOff
      HexOff
      UnixReOff
      FindInFiles MatchCase RegExp PreserveCase Log "F:\Temp\chapter\" "*.htm" "<a name="P[0-9]+">"
      Loop
      Find RegExp Up "%Search complete, found "
      IfFound
      DeleteLine
      ExitLoop
      Else
      NextWindow
      EndIf
      EndLoop
      Top
      HexOn
      Find "00"
      IfFound
      HexOff
      Top
      UnicodeToASCII
      Else
      HexOff
      EndIf
      Find "----------------------------------------^p"
      Replace All ""
      Find RegExp "%Find '*^p"
      Replace All ""
      Find RegExp "%Found '*^p"
      Replace All ""
      Find RegExp "%F:\Temp\chapter\^(*/[0-9]+: ^)*<a name="P"
      Replace All "^1<a name="P"
      Find RegExp "</a>*$"
      Replace All "</a>"
      Find RegExp "%^(*^)/[0-9]+: <a name="P^(0++^)^([0-9]+^)"></a>"
      Replace All "<a href="^1#P^2^3" title="Page ^3" target="_self">^3</a>"
      NextWindow
      Top
      Find "<h1>INDEX"
      IfNotFound
      ExitMacro
      EndIf
      EndSelect
      Key END
      "
      "
      Find RegExp "^([0-9]+-^)"
      Replace All "^1---- "
      Loop
      PreviousWindow
      IfEof
      CloseFile NoSave
      DeleteLine
      Find "----- "
      Replace All "-"
      ExitLoop
      EndIf
      Clipboard 9
      StartSelect
      Key END
      Copy
      EndSelect
      Key HOME
      Find "Page"
      EndSelect
      Key RIGHT ARROW
      SelectWord
      Clipboard 8
      Copy
      EndSelect
      Key HOME
      Key DOWN ARROW
      NextWindow
      Clipboard 9
      Paste
      StartSelect
      Key HOME
      EndSelect
      Clipboard 8
      Find MatchWord "^c."
      Replace All "^s."
      Key RIGHT ARROW
      Key LEFT ARROW

      StartSelect
      Key END
      EndSelect
      Find MatchWord "^c-----"
      Replace All "^s-----"
      Key LEFT ARROW
      Key RIGHT ARROW

      StartSelect
      Key HOME
      Delete
      EndSelect
      EndLoop
      ClearClipboard
      Clipboard 9
      ClearClipboard
      Clipboard 0
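The "25 also occurs in 125" problem that the second macro works around can be shown in a few lines of Python. This sketch swaps in a different technique: a digit lookbehind instead of MatchWord, and a single pass over the text instead of one Replace All per page. link_pages and the links dictionary are hypothetical.

```python
import re


def link_pages(index_text: str, links: dict) -> str:
    """Replace every page number followed by '.' or '-' with its hyperlink."""
    def substitute(match):
        page = int(match.group(0))
        # Numbers without a known link are left unchanged.
        return links.get(page, match.group(0))

    # (?<!\d) stops "25" from matching inside "125";
    # (?=[.-]) requires the trailing '.' or '-' without consuming it.
    return re.sub(r"(?<!\d)\d+(?=[.-])", substitute, index_text)


links = {25: "<LINK25>", 125: "<LINK125>"}
print(link_pages("Smith 25. 125. 225. 25-30.", links))
```

Because the trailing '.' or '-' is only looked at and never replaced, no temporary marker for the ranges is needed in this variant.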


      To add navigation links before the words starting with A, B, C, ... you can use this macro. It also needs Continue if a Find with Replace not found checked, although no replace is used (the property title is misleading). Run this macro after the page number to URL conversion macro.

      InsertMode
      ColumnModeOff
      HexOff
      Top
      Find "<h1>INDEX"
      IfNotFound
      ExitMacro
      EndIf
      Find MatchCase "<p>A"
      IfFound
      EndSelect
      Key HOME
      "
      Insert here your HTML code for the navigation links for letter A
      "
      EndIf
      Find MatchCase "<p>B"
      IfFound
      EndSelect
      Key HOME
      "
      Insert here your HTML code for the navigation links for letter B
      "
      EndIf

      And so on up to

      Find MatchCase "<p>Z"
      IfFound
      EndSelect
      Key HOME
      "
      Insert here your HTML code for the navigation links for letter Z
      "
      EndIf
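Since the 26 blocks differ only in the letter, the macro text itself could be generated instead of typed. The following Python sketch writes out the same commands shown above for A to Z. nav_block and nav_macro are hypothetical names, and the placeholder line still has to be replaced with your own HTML before the macro is used.

```python
import string


def nav_block(letter: str) -> str:
    """Return the macro commands for one letter, as listed above."""
    return "\n".join([
        f'Find MatchCase "<p>{letter}"',
        "IfFound",
        "EndSelect",
        "Key HOME",
        '"',
        f"Insert here your HTML code for the navigation links for letter {letter}",
        '"',
        "EndIf",
    ])


def nav_macro() -> str:
    """Return the complete A to Z macro text."""
    header = [
        "InsertMode", "ColumnModeOff", "HexOff", "Top",
        'Find "<h1>INDEX"', "IfNotFound", "ExitMacro", "EndIf",
    ]
    blocks = [nav_block(letter) for letter in string.ascii_uppercase]
    return "\n".join(header + blocks)


print(nav_macro())
```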


      Add UnixReOn or PerlReOn (v12+ of UE) at the end of the macros if you do not use UltraEdit style regular expressions by default (see the search configuration). The macro command UnixReOff sets the regular expression option to UltraEdit style.
      Best regards from an UC/UE/UES for Windows user from Austria

      Newbie

        May 15, 2006 #3

        Thank you very much. I just 'idly' looked at the forum; the most I expected was an acknowledgement of 'receipt', not the complete macros, which I hope to use this evening! Again, thanks a lot, also for the explanation, which will hopefully allow me to learn to use the UltraEdit macro language.

        I will post after use.

          May 15, 2006 #4

          I have now run the main index macro.
          On the first run I saw several 'unindexed' numbers. I realised straight away that these were 'missing' or wrongly named bookmarks, and I was able to correct most of them easily. Running it a second time, I found a few more mistakes in the index file, where missing spaces caused a few misses.
          However, there seems to be a bug, and I don't think it's the code.

          With the 123-125. type of entry it seems to be adding a partial anchor, but for the number immediately preceding the first:

          330a href="38_TheLittleCountryFa Away.htm#P329" title="Page 329" target="_self">329-338

          At first I thought it was only happening where the entry wasn't the first on a line, but finding one that was disproved that.

          I also saw entries that had been correct after the first run and were now 'wrong', and a third run confirmed this, so there is no rhyme or reason to it. I have edited out the 'additions' using DW's search and replace.

          Thank you again (I haven't used the A-Z macro yet).

          Grand Master

            May 16, 2006 #5

            You are right, there was a bug in the macro. I edited the macro code above to fix it; see the inserted lines (marked blue in the original post). If there is no page reference to replace for a specific URL, the selection is still present after the Replace All. So the macro must always unselect the URL, which can be done with a simple cursor move.

            Newbie

              May 19, 2006 #6

              Thanks again. I didn't have as much success with the A-Z macro. It seems to call out for a loop, but I ended up doing it manually. In BASIC (20 years ago) I would have used the character codes for the loop and converted them to ASCII characters for the comparison. Would this have been possible here?

              Grand Master

                May 19, 2006 #7

                The A-Z macro is not a loop and cannot be written as one. It is just a recording of what you did manually, or rather of what I think you wanted to do manually. The macro language of UE does not support variables, math expressions or conditions such as a comparison. So the A-Z macro cannot be done with a simple loop, as any programmer would do it in a real programming language.
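For comparison, this is roughly what the character code loop from the question looks like in a language that has variables. It is a Python sketch that works on the index text directly instead of driving the editor. The inserted <a name="X"></a> line is only a placeholder for the real navigation HTML, add_letter_anchors is a hypothetical name, and unlike the macro it only looks at the start of each line.

```python
def add_letter_anchors(index_html: str) -> str:
    """Insert a placeholder line before the first entry of each letter."""
    lines = index_html.split("\n")
    # Loop over the character codes of A to Z, as one would in BASIC.
    for code in range(ord("A"), ord("Z") + 1):
        letter = chr(code)
        for position, line in enumerate(lines):
            if line.startswith("<p>" + letter):
                lines.insert(position, f'<a name="{letter}"></a>')
                break  # only the first entry of each letter gets an anchor
    return "\n".join(lines)


print(add_letter_anchors("<p>Adams 1.\n<p>Avery 2.\n<p>Baker 3."))
```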