Cleaning Filtered HTML from Microsoft Word documents to be used to paste into content areas of Web Site CMS systems

    May 20, 2021#1

    Hello!

    After a lot of work posting content to online CMS systems from Microsoft Word files that clients send me, I have come up with this workflow:
    First, save the file from Microsoft Word as Filtered HTML.
    The output is reasonable HTML, but it still contains a lot of inline formatting (class and style attributes, span tags) and other clutter that has to be edited out.
    I wrote this macro to help.
    Setting up ordered and unordered lists still requires manual work, but that is a lot better than deleting all the formatting by hand.

    I have attached the original Word file as a sample, together with the cleaned-up result of the macro.

    Here is the macro:

    InsertMode
    ColumnModeOff
    HexOff
    Top
    InsertMode
    ColumnModeOff
    HexOff
    UltraEditReOn
    Find "<html>"
    Key DEL
    UltraEditReOn
    Find "<head>"
    UltraEditReOn
    StartSelect
    Find Select "</head>"
    EndSelect
    Key DEL
    Top
    UltraEditReOn
    Find "<body"
    UltraEditReOn
    StartSelect
    Find Select ">"
    EndSelect
    Key DEL
    Loop 0
    Top
    UltraEditReOn
    Find "class="
    IfFound
    StartSelect
    Find Select ">"
    Key LEFT ARROW
    EndSelect
    Key DEL
    Else
    ExitLoop
    EndIf
    EndLoop
    Loop 0
    Top
    UltraEditReOn
    Find "style="
    IfFound
    StartSelect
    Find Select ">"
    Key LEFT ARROW
    EndSelect
    Key DEL
    Else
    ExitLoop
    EndIf
    EndLoop
    Top
    UltraEditReOn
    Find "</body>"
    Key DEL
    Top
    UltraEditReOn
    Find "</html>"
    Key DEL
    Top
    Loop 0
    Find "<span"
    IfFound
    StartSelect
    Find Select ">"
    Key DEL
    "<span>"
    Else
    ExitLoop
    EndIf
    EndLoop
    Top
    Loop 0
    Find "<p "
    IfFound
    StartSelect
    Find Select ">"
    Key DEL
    "<p>"
    Else
    ExitLoop
    EndIf
    EndLoop
    Top
    PerlReOn
    Find MatchCase RegExp "^(?:[\t ]*(?:\r?\n|\r))+"
    Replace All ""
    UltraEditReOn
    Top
    Find "<p>"
    Key DEL
    "<h3>"
    Find "</p>"
    Key DEL
    "</h3>"
    
    There may be shorter ways to do this with more complex pattern matching, but this version should be easy for anyone to follow.
    If anyone has any suggestions, please let me know.
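
As one possible illustration of the "shorter way" mentioned above, the class= and style= cleanup can be written as a single pattern. The sketch below is in Python rather than UltraEdit macro language, purely so the pattern can be tried outside the editor. The function name strip_word_attrs and the sample string are made up for illustration, and the sketch assumes that no attribute value contains a ">" character.

```python
import re

# One class=... or style=... attribute; Word emits quoted and unquoted values.
ATTR = re.compile(
    r"""\s+(?:class|style)\s*=\s*(?:"[^"]*"|'[^']*'|[^\s>]+)""",
    re.IGNORECASE,
)
# Any tag, so that the attribute pattern is only applied inside tags.
TAG = re.compile(r"<[^>]+>")


def strip_word_attrs(html: str) -> str:
    """Remove class and style attributes from every tag, leaving text alone."""
    return TAG.sub(lambda m: ATTR.sub("", m.group(0)), html)


sample = (
    "<p class=MsoNormal style='margin-bottom:0in'>"
    '<span style="color:red">Hi</span></p>'
)
print(strip_word_attrs(sample))
```

Unlike the macro, which deletes everything from class= or style= up to the closing ">", this pattern removes only the two named attributes and keeps any others (for example href or lang).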

    Mark
    Marketing_Newsletter.doc (48 KiB)
    Microsoft Word document to save as a filtered HTML file with MS Word
    Cleaned_up_Fixed_Marketing_Newsletter.htm (13.93 KiB)
    Resulting HTML file after running the macro on the HTML file created by Microsoft Word

      May 24, 2021#2

      I suggest the following macro for this task. It mainly uses Perl regular expression replaces, because the Perl regular expression engine makes it easier to handle multiple variants with one expression than the legacy UltraEdit regular expression engine does.

      InsertMode
      ColumnModeOff
      HexOff
      Top
      UltraEditReOn
      StartSelect
      Find MatchCase RegExp Select "<body[~>]++>[^t ]++^p"
      EndSelect
      IfSel
      Delete
      Else
      ExitMacro
      EndIf
      Top
      TrimTrailingSpaces
      PerlReOn
      Find MatchCase RegExp "(?:\r\n)?<div class=WordSection1>(?:\r\n)*"
      Replace All "<div>"
      Find MatchCase RegExp "(?:\r\n)?</div>(?:\r\n)+</body>(?:\r\n)+</html>(?:\r\n)+"
      Replace All "</div>"
      Top
      Find MatchCase RegExp "<span[^>]*?>|</span>"
      Replace All ""
      Top
      Find MatchCase RegExp "[\t ]+(?=<br>)"
      Replace All ""
      Top
      Find MatchCase RegExp "(?:\r\n)?<p[^>]*?>(?:<b>)?&nbsp;(?:</b>)?</p>"
      Replace All "<br>"
      Top
      Find MatchCase RegExp "<p[^>]*?>(?:<b>)?([\s\S]+?)(?:</b>)?</p>"
      Replace "\r\n<h3>\1</h3>"
      Find MatchCase RegExp "(?:\r\n)?<p[^>]*?>"
      Replace All ""
      Top
      Find MatchCase "</p>"
      Replace All "<br>"
      Top
      Find MatchCase RegExp "(?:<br>\r\n)+(?=</div>)"
      Replace "\r\n"
      Top
      Find MatchCase RegExp "(?:<br>\r\n)+[\t ]*\x95[\t ]*"
      Replace All "</li>\r\n<li>"
      Top
      Loop 0
      Find MatchCase "</li>^p<li>"
      Replace "^p^p<ul>^p<li>"
      IfFound
      Find MatchCase RegExp "(?:<br>\r\n){2,}"
      Replace "</li>\r\n</ul>\r\n\r\n"
      Else
      ExitLoop
      EndIf
      EndLoop
      Top
      Find MatchCase RegExp "\x96"
      Replace All "&ndash;"
      Top
      Find MatchCase RegExp "\x97"
      Replace All "&mdash;"
      Top
      Find MatchCase RegExp "\xA0"
      Replace All "&nbsp;"
      
      I tested this macro with UE v28.10.0.26 (currently the latest version) as well as with UE v22.20.0.49 (the latest version for Windows XP).

      I think it produces a better result than your macro.
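
The last three replaces in the macro turn single bytes (\x96, \x97, \xA0) into HTML entities. The Python sketch below shows the same mapping outside UltraEdit, assuming the filtered HTML file is encoded as Windows-1252, where those bytes are the en dash, the em dash, and the non-breaking space. The function name to_entities is hypothetical, and the standard entity names &ndash; and &mdash; are used.

```python
# Characters that the Windows-1252 bytes 0x96, 0x97 and 0xA0 decode to,
# mapped to their standard HTML entity names.
ENTITIES = {
    "\u2013": "&ndash;",  # byte 0x96, en dash
    "\u2014": "&mdash;",  # byte 0x97, em dash
    "\u00a0": "&nbsp;",   # byte 0xA0, non-breaking space
}
TABLE = str.maketrans(ENTITIES)


def to_entities(raw: bytes) -> str:
    """Decode Windows-1252 bytes and replace the three characters by entities."""
    return raw.decode("cp1252").translate(TABLE)


print(to_entities(b"2019 \x96 2021\x97done\xa0now"))
```

If the file were saved as UTF-8 instead, the single-byte searches in the macro would not match, so the encoding assumption matters for both the macro and this sketch.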

      Please let me know if the macro has to handle more variants than those I found while analyzing the HTML file created with MS Word 2010 from the Microsoft Word document attached to your post.
      Please also let me know if I should explain some of the regular expressions, some of the macro code sequences, or even the entire code and all expressions.
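
As a small illustration of how one expression covers several variants, here are two of the patterns from the macro tried in Python. This is only a sketch: it assumes that Python's re module treats these particular constructs the same way as the Perl engine in UltraEdit, and the function name demo and the sample text are invented.

```python
import re

# Alternation: one pattern removes opening span tags (with any attributes)
# as well as closing span tags.
SPAN = re.compile(r"<span[^>]*?>|</span>")

# Optional groups: an "empty" paragraph with or without bold markup,
# with or without a preceding line break, becomes a single <br>.
EMPTY_P = re.compile(r"(?:\r\n)?<p[^>]*?>(?:<b>)?&nbsp;(?:</b>)?</p>")


def demo(html: str) -> str:
    """Apply the two replaces in the same order as the macro does."""
    html = SPAN.sub("", html)
    return EMPTY_P.sub("<br>", html)


sample = (
    "<p class=MsoNormal><span lang=EN-US>Text</span></p>\r\n"
    "<p class=MsoNormal><b>&nbsp;</b></p>"
)
print(demo(sample))
```

The first paragraph is left for the later replaces in the macro because it contains real text, while the second one matches the empty-paragraph pattern in both its bold and its plain form.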
      Best regards from an UC/UE/UES for Windows user from Austria