Importing MS Word DOC files

Basic User

    Mar 02, 2014 #1

    The UE "File" -> "Open" menu has, as possible file extension for input, the "DOC" extension. But it's seemingly broken, opening the file in hex mode, and then that's it. Is this a bug?

    At the moment, to input DOC files into UE, I have first to output my MS Word files to a .TXT format, before importing into UE. It means all the mark-up is lost, and all the bold, underlining, italics, etc. have to be reinserted. For book-length manuscripts this can't be done and should really be unnecessary.

    Does UE have a functioning import filter for MS Word somewhere? Or some other filter?

    What is necessary is some way of rescuing the basic mark-up: italics, bold, underline, footnotes. (If one tries it via MS Word's "export to html" function, the problem then becomes all the junk code that then needs to be removed - as everyone knows who's ever worked on such files in Dreamweaver.)

    The two alternatives seem to be either too much code (MS Word's html export function) or too little code (the ".TXT" export function, which loses all the mark-up).

    Is there a filter or conversion software somewhere that lets me choose which MS Word mark-up I can rescue into UE?

    Or perhaps a filter software that will strip from html everything except those few mark-up codes?

    Any suggestions would be appreciated.
    best,
    fvgfvg

    Grand Master

      Mar 02, 2014 #2

      UltraEdit is a text editor and not a word processing application like OpenOffice or MS Word. Therefore UltraEdit can open only pure text files.

      Yes, there is DOC in the list of file types by default. But that does not mean that UltraEdit supports opening Microsoft Word documents of any MS Word version stored in binary format. *.doc is just in the list for text files which have the file extension DOC although usually such files are MS Word documents and not pure text files on computers running Windows. On Linux systems there are lots of *.doc which are pure text files. The list of file types as shown in File - Open and File - Save As dialog can be customized by the user via Advanced - Configuration - File Types.

      Microsoft Word has 2 options to save a Word document as HTML file: Web Page (*.htm;*.html) and Web Page, Filtered (*.htm;*.html). By selecting the Web Page, Filtered option, Word will strip out extraneous tags used by Microsoft Office programs. It depends on the structure of the Word document how many HTML formatting tags exist after the Save As with Web Page, Filtered option.

      For more details about saving as filtered HTML see:

      Alternatively, select in MS Word the text that should be copied to UltraEdit together with its formatting, press Ctrl+C to copy it to the clipboard, and then use the command HTML Source in the UltraEdit submenu Edit - Paste Special.

      However, I'm quite sure nobody has yet written an UltraEdit macro that cleans up the HTML code of MS Word with regular expression replaces and keeps just bold, italic, underline and combinations of those formatting tags plus footnotes, so you will have to code that yourself.

      For example, if there is a Word file with the following text and formatting:

      bold - fett

      bold underlined - fett unterstrichen

      bold italic - fett kursiv

      bold italic underlined - fett kursiv unterstrichen

      italic - kursiv

      italic underlined - kursiv unterstrichen

      underlined - unterstrichen

      If this text is selected in Word, copied to the clipboard with Ctrl+C, and pasted in UltraEdit into a new ASCII file with DOS line terminators using Edit - Paste Special - HTML Source, the result, between a large header block and a small footer block, is:

      Code:

      <!--StartFragment-->
      
      <p class=MsoNormal><b style='mso-bidi-font-weight:normal'>bold</b> - fett</p>
      
      <p class=MsoNormal><b style='mso-bidi-font-weight:normal'><u>bold underlined</u></b>
      - fett unterstrichen</p>
      
      <p class=MsoNormal><b style='mso-bidi-font-weight:normal'><i style='mso-bidi-font-style:
      normal'>bold italic</i></b> - fett kursiv</p>
      
      <p class=MsoNormal><b style='mso-bidi-font-weight:normal'><i style='mso-bidi-font-style:
      normal'><u>bold italic underlined</u></i></b> - fett kursiv unterstrichen </p>
      
      <p class=MsoNormal><i style='mso-bidi-font-style:normal'>italic</i> - kursiv</p>
      
      <p class=MsoNormal><i style='mso-bidi-font-style:normal'><u>italic underlined</u></i>
      - kursiv unterstrichen</p>
      
      <p class=MsoNormal><u>underlined</u> - unterstrichen</p>
      
      <!--EndFragment-->
      
      Next, an UltraEdit macro with the following code is used:

      Code:

      InsertMode
      ColumnModeOff
      HexOff
      UltraEditReOn
      Top
      Find MatchCase Select "<!--StartFragment-->^p^p"
      IfFound
      Delete
      Find MatchCase "^p<!--EndFragment-->^p</body>^p^p</html>^p"
      Replace ""
      Top
      Find MatchCase RegExp "<^(/++[bui]^)[~>]++>"
      Replace All "[^1]"
      Find MatchCase RegExp "<[~>]+>"
      Replace All ""
      ReturnToWrap
      TrimTrailingSpaces
      Bottom
      IfColNumGt 1
      InsertLine
      IfColNumGt 1
      DeleteToStartofLine
      EndIf
      EndIf
      Top
      EndIf

      The macro reformats the pasted HTML text to:

      Code:

      [b]bold[/b] - fett
      
      [b][u]bold underlined[/u][/b] - fett unterstrichen
      
      [b][i]bold italic[/i][/b] - fett kursiv
      
      [b][i][u]bold italic underlined[/u][/i][/b] - fett kursiv unterstrichen
      
      [i]italic[/i] - kursiv
      
      [i][u]italic underlined[/u][/i] - kursiv unterstrichen
      
      [u]underlined[/u] - unterstrichen
      
      As you can see, the formatted text from Microsoft Word was converted into text with BBCode tags, as used here for the MS Word example.

      I wrote this topic with the MS Word example and the macro in about 30 minutes.

      This quickly written macro does not support footnotes and surely does not work for every Microsoft Word file out there in the world. But it can be used as a template for an enhanced version. And of course, instead of a macro, the same can also be coded as an UltraEdit script, using the Perl regular expression engine instead of the UltraEdit regexp engine.
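
      As a rough illustration of doing the same clean-up outside a macro, here is a minimal Python sketch. It is not the macro itself: the function name word_html_to_bbcode, the handling of <p> paragraphs and the decoding of HTML entities are assumptions made for this example, and like the macro it ignores footnotes.

```python
import html
import re

START_MARK = "<!--StartFragment-->"
END_MARK = "<!--EndFragment-->"


def word_html_to_bbcode(source: str) -> str:
    """Reduce HTML copied from Word to text with [b], [i] and [u] tags.

    Hypothetical helper for illustration only; not part of UltraEdit.
    """
    # Keep only the part between the fragment markers, if both exist.
    start = source.find(START_MARK)
    end = source.find(END_MARK)
    if start != -1 and end > start:
        source = source[start + len(START_MARK):end]

    # Treat every <p>...</p> as one paragraph; fall back to the whole text.
    paragraphs = re.findall(r"<p\b[^>]*>(.*?)</p>", source, flags=re.S | re.I)
    if not paragraphs:
        paragraphs = [source]

    result = []
    for para in paragraphs:
        # <b ...>, </i>, <u> and so on become [b], [/i], [u].
        # The \b keeps <br> and <body> from being mistaken for <b>.
        para = re.sub(
            r"<(/?)([bui])\b[^>]*>",
            lambda m: "[" + m.group(1) + m.group(2).lower() + "]",
            para,
            flags=re.I,
        )
        # Every remaining tag is dropped.
        para = re.sub(r"<[^>]+>", "", para)
        # Decode entities such as &amp;, then join wrapped lines and
        # collapse runs of whitespace into single spaces.
        para = " ".join(html.unescape(para).split())
        if para:
            result.append(para)

    # One blank line between paragraphs, as in the macro output.
    return "\n\n".join(result)
```

      Calling word_html_to_bbcode() on the text pasted with Paste Special - HTML Source returns the paragraphs separated by blank lines. Other Word documents will most likely need further rules.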
      Best regards from an UC/UE/UES for Windows user from Austria

      Basic User

        Mar 02, 2014 #3

        Thank you Mofi. One always learns from your careful explanations. I'll experiment a bit with the possible MS Word export filters, and try to figure out just what those "paste special" commands in the "Edit" menu are for ("HTML source" and "raw RTF").

        If I understand you correctly the "File Types" heading in the "Configuration" Menu is there simply to group together a file list - nothing more, and most certainly does not imply the presence of an import filter. (I can't really believe that I'm the only person in the world who's tried to import MS Word files into UE, so the problem must be widely known. Perhaps something to pass on to UE as a request for a future update?)

        best, and thank you.
        fvgfvg

        Grand Master

          Mar 03, 2014 #4

          fvgfvg wrote: I can't really believe that I'm the only person in the world who's tried to import MS Word files into UE, so the problem must be widely known.

          You are not the only person. There are others who do not know the difference between an application designed for writing and editing pure text files and an application designed for writing structured and formatted documents. Once the difference is explained to those people, there is no request left. So why should UltraEdit have a feature to import Microsoft Word files from the binary format or the newer XML format? That would most likely blow up uedit32.exe dramatically with no real benefit for approximately 99.99% of all UltraEdit users. In MS Word it is possible to save a file as a pure text file, or to copy the contents of the document and paste them as pure text into a text file in UltraEdit or any other text editor. So MS Word already has features to save or export the text of a document as pure text, and therefore there is no real need for an import filter.

          There are also people not knowing the difference between spreadsheet applications like Microsoft Excel and text editing applications which of course can also edit CSV files. A CSV file stores field values in pure text format and therefore those files can be edited with text editors. But that does not mean that text editors are spreadsheet applications.

          If you request word processing support in UltraEdit, you could also request source code editing features in MS Word from Microsoft. Yes, it is possible to write C/C++/C#, JavaScript, PHP and HTML code in MS Word, but nobody does it. Why? Simply because a word processing application such as MS Word is not designed for that task the way a text editor is.

          MS Word supports the very powerful scripting language Visual Basic for Applications (VBA). So what you want to do could also be done directly in MS Word by writing a VBA macro which takes the contents of a file, removes all formatting other than what you want to keep, and saves the text with that formatting encoded as tags (HTML, BBCode, RTF, ...) into a pure text file. There is no real need for a text editor for this task.

          Conclusion: To do a task, use the application best designed for that task.

          Basic User

            Mar 04, 2014 #5

            Just so we don't get our wires crossed, Mofi. (You can certainly give a person quite a dressing-down - my goodness..!) I've been experimenting with trying to get simple MS-Word mark-ups into UE: italics, bold.
            (I am, I believe, perfectly aware of the difference between an application designed for writing / editing pure text files and an application designed for writing structured and formatted documents)

            After doing a bit of experimenting - also with that "Paste Special", "HTML source" command, (also trying out your macro) I've decided the way to go is to export the DOC in HTML, and then write my own script. That was really what I was asking: if someone else has already written such a script...
            best,
            fvgfvg
            p.s. I haven't got that "Raw RTF" command to work. It doesn't seem to mean .RTF files, that I can see.

            Grand Master

              Mar 04, 2014 #6

              Your postscript might be a sign to me that you do not really know what the 2 commands in the submenu Paste Special do.

              First, most Windows users think there is only 1 Windows clipboard. But that is not true: there are multiple Windows clipboards, or more precisely, the clipboard can hold the same data in multiple formats at the same time. A list of the available formats can be found in the Microsoft Developer Network article Clipboard Formats and the other pages referenced from this page.

              Whether Paste Special - HTML Source, Paste Special - Raw RTF, both, or neither can be used in UltraEdit depends on the application from which the data was copied to the clipboard.

              Microsoft Word 2007 (and later versions) no longer copies text to the non-standard clipboard for Rich Text Format, at least not always. Pasting from the RTF clipboard is no problem in Microsoft Word 2007+.

              But other applications like Microsoft Internet Explorer 8 support the clipboards for pure text, for HTML coded text, and for RTF coded text simultaneously. For example, I can select formatted text in IE8 and press Ctrl+C, then start UltraEdit and do the following:
              • Press Ctrl+N to open a new file and press Ctrl+V to get the copied text pasted into the file as pure text - text file,
              • press Ctrl+N to open a new file and use Paste Special - HTML Source to get same copied text with HTML tags pasted into the file - HTML file,
              • press Ctrl+N to open a new file and use Paste Special - Raw RTF to get same copied text with RTF tags pasted into the file - RTF file.
              So I get 3 different files with a single Ctrl+C in IE8 and using the 3 paste commands in UltraEdit.

              Another example of an application supporting all 3 clipboards is Microsoft Outlook, as it supports text emails, HTML emails and RTF emails and therefore must also support the text clipboard, the HTML clipboard and the RTF clipboard.

              UltraEdit is not capable of finding out which clipboard contains useful data for the user. Therefore all 3 paste commands are always available. The users themselves must know, or find out by trial and error, into which clipboard the other application has copied the data.

              PS: I have seen many *.doc files which are in reality RTF files and not Microsoft Word documents. MS Word creates RTF files with formatting tags not supported by WordPad, the standard Windows application for RTF files. By using DOC as the file extension for an RTF file, the document writer can force the usage of Microsoft Word for opening this file on double click instead of WordPad. Well, if Microsoft Word is installed, the file extensions RTF and DOC are automatically associated with MS Word instead of WordPad as the default application for opening. So from my point of view it does not make much sense to give an RTF file the extension DOC, except when the author wants to make clear to the user that this file contains documentation for something.
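
              Since a *.doc file can be a binary Word document, an RTF file or pure text, it can help to look at the first bytes before opening it in a text editor. The following is only a sketch in Python: the function names sniff_bytes and sniff_doc_file and the short return labels are made up for this example, and the check is a heuristic, not a full file format detection.

```python
# Signature of an OLE compound file, the container used by binary
# Word documents (and by other old Office formats as well).
OLE_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")


def sniff_bytes(head: bytes) -> str:
    """Guess the real content type from the first bytes of a file."""
    if head.startswith(OLE_MAGIC):
        return "ole"      # binary Office document, not editable as text
    if head.startswith(b"{\\rtf"):
        return "rtf"      # RTF file, whatever the extension says
    if head.startswith(b"PK\x03\x04"):
        return "zip"      # ZIP container, e.g. the newer XML-based format
    if head.startswith((b"\xff\xfe", b"\xfe\xff")):
        return "text"     # UTF-16 text with a byte order mark
    if b"\x00" in head:
        return "binary"   # NUL bytes: most likely some other binary data
    return "text"


def sniff_doc_file(path: str) -> str:
    """Read the start of the file at 'path' and classify it."""
    with open(path, "rb") as handle:
        return sniff_bytes(handle.read(4096))
```

              A result of "rtf" or "text" means the file can be edited as text, while "ole" or "zip" explains why a text editor shows it in hex mode or as unreadable data.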

              Basic User

                Mar 05, 2014 #7

                Thank you. I now know how to copy a block of text from a DOC file into UE via the "Edit->Paste Special->HTML Source" command. That's something new for me - I didn't know how that worked.

                But does that solve my problem? No, I'm afraid it doesn't. Instead of ZERO mark-up from a DOC file, I've now got mark-up buried under a mountain of 'bloat-code'.

                1) What I'd hoped for, in the "File-Open->All files->all files (DOC)" and then in the "Paste Special" command, was a FILTER capable of extracting my mark-up from all of that superfluous html. If you work a lot with DOC files (and who on this planet doesn't?) then getting that mark-up into UE in such a way that it can be used, is important.

                2) Please examine the following files to understand what it is that I am requesting from UE:

                http://fvg.fvg.f-m.fm/EV_ordinary_cut_a ... MARKUP.txt (7k)
                http://fvg.fvg.f-m.fm/EV_code-bloat_via ... source.txt (26k) <-- please note size..
                http://fvg.fvg.f-m.fm/EV_html_filtered_ ... LEANER.txt (8k) <-- please note size..

                The utility that I found last night with a bit of googling, and that I've used here, is this - http://www.wordhtmlcleaner.co.uk/ - it provides EXACTLY the functionality that I am requesting.

                best,
                fvgfvg