Cleaning Filtered HTML from Microsoft Word documents to be used to paste into content areas of Web Site CMS systems

markschnegg · May 20, 2021 · #1

Hello!

After a lot of work posting content to online CMS systems from Microsoft Word files that clients send me, I have come up with this:
First, save the file in Microsoft Word as Filtered HTML.
This generates fairly clean HTML, but there is still a lot of inline formatting and other markup that needs to be edited out.
I wrote this macro to help.
It still requires manual work to set up ordered and unordered lists, but that is a lot better than deleting all the formatting by hand.

I have attached sample files from the original Word file and the cleaned-up result of the macro.

Here is the macro:

InsertMode
ColumnModeOff
HexOff
Top
InsertMode
ColumnModeOff
HexOff
UltraEditReOn
Find "<html>"
Key DEL
UltraEditReOn
Find "<head>"
UltraEditReOn
StartSelect
Find Select "</head>"
EndSelect
Key DEL
Top
UltraEditReOn
Find "<body"
UltraEditReOn
StartSelect
Find Select ">"
EndSelect
Key DEL
Loop 0
Top
UltraEditReOn
Find "class="
IfFound
StartSelect
Find Select ">"
Key LEFT ARROW
EndSelect
Key DEL
Else
ExitLoop
EndIf
EndLoop
Loop 0
Top
UltraEditReOn
Find "style="
IfFound
StartSelect
Find Select ">"
Key LEFT ARROW
EndSelect
Key DEL
Else
ExitLoop
EndIf
EndLoop
Top
UltraEditReOn
Find "</body>"
Key DEL
Top
UltraEditReOn
Find "</html>"
Key DEL
Top
Loop 0
Find "<span"
IfFound
StartSelect
Find Select ">"
Key DEL
"<span>"
Else
ExitLoop
EndIf
EndLoop
Top
Loop 0
Find "<p "
IfFound
StartSelect
Find Select ">"
Key DEL
"<p>"
Else
ExitLoop
EndIf
EndLoop
Top
PerlReOn
Find MatchCase RegExp "^(?:[\t ]*(?:\r?\n|\r))+"
Replace All ""
UltraEditReOn
Top
Find "<p>"
Key DEL
"<h3>"
Find "</p>"
Key DEL
"</h3>"

There may be shorter ways using complex pattern matching, but probably anyone can understand this one.
If anyone has any suggestions, please let me know.
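
As a rough illustration of what such pattern matching could look like, here is a minimal Python sketch (not UltraEdit macro code) that removes class= and style= attributes with a single expression. The sample string and the names ATTR and strip_attrs are hypothetical, and the sketch assumes attribute values are double-quoted, single-quoted, or unquoted. Unlike the macro above, it removes only these two attributes and keeps anything else inside the tag.

```python
import re

# Hypothetical sample resembling Word's filtered HTML output; note that an
# attribute value may wrap onto a second line.
sample = (
    "<p class=MsoNormal style='margin-bottom:0in;line-height:\nnormal'>"
    '<span style="font-size:12.0pt">Hello</span></p>'
)

# One pattern covers class= and style= whose values are double-quoted,
# single-quoted, or unquoted (assumption: no quote characters inside values).
ATTR = re.compile(r"""\s+(?:class|style)=(?:"[^"]*"|'[^']*'|[^\s>]+)""", re.IGNORECASE)


def strip_attrs(html: str) -> str:
    """Remove class and style attributes, keeping the rest of each tag."""
    return ATTR.sub("", html)


print(strip_attrs(sample))  # prints: <p><span>Hello</span></p>
```

Because the whitespace in front of the attribute is part of the match, no stray space is left before the closing bracket of the tag.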

Mark

Mofi · May 24, 2021 · #2

I suggest the following macro for this task. It mainly uses Perl regular expression replaces, because the Perl regular expression engine makes it easier to handle multiple variants with one expression than the legacy UltraEdit regular expression engine does.

InsertMode
ColumnModeOff
HexOff
Top
UltraEditReOn
StartSelect
Find MatchCase RegExp Select "<body[~>]++>[^t ]++^p"
EndSelect
IfSel
Delete
Else
ExitMacro
EndIf
Top
TrimTrailingSpaces
PerlReOn
Find MatchCase RegExp "(?:\r\n)?<div class=WordSection1>(?:\r\n)*"
Replace All "<div>"
Find MatchCase RegExp "(?:\r\n)?</div>(?:\r\n)+</body>(?:\r\n)+</html>(?:\r\n)+"
Replace All "</div>"
Top
Find MatchCase RegExp "<span[^>]*?>|</span>"
Replace All ""
Top
Find MatchCase RegExp "[\t ]+(?=<br>)"
Replace All ""
Top
Find MatchCase RegExp "(?:\r\n)?<p[^>]*?>(?:<b>)?&nbsp;(?:</b>)?</p>"
Replace All "<br>"
Top
Find MatchCase RegExp "<p[^>]*?>(?:<b>)?([\s\S]+?)(?:</b>)?</p>"
Replace "\r\n<h3>\1</h3>"
Find MatchCase RegExp "(?:\r\n)?<p[^>]*?>"
Replace All ""
Top
Find MatchCase "</p>"
Replace All "<br>"
Top
Find MatchCase RegExp "(?:<br>\r\n)+(?=</div>)"
Replace "\r\n"
Top
Find MatchCase RegExp "(?:<br>\r\n)+[\t ]*\x95[\t ]*"
Replace All "</li>\r\n<li>"
Top
Loop 0
Find MatchCase "</li>^p<li>"
Replace "^p^p<ul>^p<li>"
IfFound
Find MatchCase RegExp "(?:<br>\r\n){2,}"
Replace "</li>\r\n</ul>\r\n\r\n"
Else
ExitLoop
EndIf
EndLoop
Top
Find MatchCase RegExp "\x96"
Replace All "&ndash;"
Top
Find MatchCase RegExp "\x97"
Replace All "&mdash;"
Top
Find MatchCase RegExp "\xA0"
Replace All "&nbsp;"

I tested this macro with UE v28.10.0.26 (currently the latest version) as well as with UE v22.20.0.49 (the latest version for Windows XP).

I think it produces a better result than your macro.
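
To show how one Perl expression can cover several variants, here is a small Python sketch that applies two of the expressions from the macro to a made-up sample. Python's re module accepts these two patterns unchanged. The sample text and the names SPAN_TAGS and EMPTY_PARAGRAPH are only for illustration and are not part of the macro.

```python
import re

# Made-up sample with DOS line endings (the macro's patterns use \r\n).
sample = (
    "<p class=MsoNormal><span style='font-size:12.0pt'>Intro text</span></p>\r\n"
    "<p class=MsoNormal><b>&nbsp;</b></p>\r\n"
    "<p class=MsoNormal>&nbsp;</p>"
)

# Alternation: one expression removes both opening and closing span tags.
SPAN_TAGS = re.compile(r"<span[^>]*?>|</span>")

# Optional groups: one expression matches an empty paragraph with or without
# a preceding line break and with or without bold tags around &nbsp;.
EMPTY_PARAGRAPH = re.compile(r"(?:\r\n)?<p[^>]*?>(?:<b>)?&nbsp;(?:</b>)?</p>")

no_spans = SPAN_TAGS.sub("", sample)
with_breaks = EMPTY_PARAGRAPH.sub("<br>", no_spans)

print(repr(with_breaks))  # prints: '<p class=MsoNormal>Intro text</p><br><br>'
```

The paragraph with real text is left alone by the second expression because &nbsp; must follow the opening tag directly, apart from an optional bold tag.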

Please let me know if the macro has to handle more variants than those I found when analyzing the HTML file created with MS Word 2010 from the Microsoft Word document attached to your post.
Please also let me know if I should explain some of the regular expressions, some of the macro code sequences, or even the entire code and all expressions.
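
For readers who want to follow the list-building part of the macro (the bullet replace followed by the Loop 0 block), the Python sketch below mimics how it appears to work. It is an approximation under stated assumptions, not a translation checked against UltraEdit. The function name wrap_lists and the sample text are hypothetical, and the Unicode bullet U+2022 stands in for the Windows-1252 byte \x95. The sketch also stops when no list end is found, which the macro does not do.

```python
import re

BULLET = "\u2022"  # assumption: byte \x95 decoded from Windows-1252
ITEM_SEP = "</li>\r\n<li>"
LIST_END = re.compile(r"(?:<br>\r\n){2,}")


def wrap_lists(html: str) -> str:
    # Step 1: every run of <br> line breaks followed by a bullet becomes an
    # item separator, like the Replace All with "</li>\r\n<li>".
    html = re.sub(r"(?:<br>\r\n)+[\t ]*" + BULLET + r"[\t ]*", ITEM_SEP, html)
    pos = 0  # plays the role of the caret position in the editor
    while True:
        # Step 2: the first separator after the caret opens a new list.
        start = html.find(ITEM_SEP, pos)
        if start == -1:
            break
        opening = "\r\n\r\n<ul>\r\n<li>"
        html = html[:start] + opening + html[start + len(ITEM_SEP):]
        pos = start + len(opening)
        # Step 3: the next run of two or more <br> lines closes the list, so
        # the separators in between are skipped on the next pass.
        end = LIST_END.search(html, pos)
        if end is None:
            break  # simplification: stop if no list end is found
        closing = "</li>\r\n</ul>\r\n\r\n"
        html = html[:end.start()] + closing + html[end.end():]
        pos = end.start() + len(closing)
    return html


sample = "Intro<br>\r\n\u2022 one<br>\r\n\u2022 two<br>\r\n<br>\r\nAfter"
print(repr(wrap_lists(sample)))
```

The point of tracking the position is that the separators between items of the same list stay untouched. Only the first separator of each list is turned into the opening ul tag.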