Find In Files / .docx

Newbie

    May 19, 2014 #1

    After doing a "Find in Files" search, I got suspicious of the results and found that while old-style Word (.doc) is searched, new-style Word (.docx, .docm) is not. That's really frustrating, given how long the .docx format has been out there. It also means I can't trust that function in UltraEdit. Is there any hope for improvement? I hate to spend more money on UltraFinder for a function that supposedly exists in the product I bought.

    Grand Master

      May 19, 2014 #2

      UltraEdit does not really support searching *.doc files either.

      A *.doc file stores text either as an ANSI encoded text stream or as a UTF-16 encoded text stream. Microsoft Word always uses the encoding best suited to each text block. If all characters of a text block can be encoded with the standard code page, the block is stored in ANSI encoding with 1 byte per character. But if a text block contains even one character outside that code page, the entire block (and not just that single character) is stored in the *.doc file in UTF-16 encoding, i.e. two bytes per character.

      UltraEdit can find the ANSI text streams and, with UTF-16 Unicode encoding selected in the Find in Files window, the UTF-16 text streams. It also finds other text that is not visible in the opened Word document, such as file names, which are stored as ANSI or UTF-16 text streams as well. But this does not mean that UltraEdit really supports *.doc files in Find in Files the way UltraCompare Professional supports them for text comparison or UltraFinder supports them when searching for files containing a specified text. Nowhere is it documented that UltraEdit supports *.doc files created by Microsoft Word. The way MS Word stores data in *.doc files just makes it possible for UltraEdit to find text in these binary files, in the same way it can find text in executables or libraries.
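
      To make the "simple byte stream search" idea concrete, here is a minimal Python sketch. It is not how UltraEdit is implemented; the function name is made up for illustration, and cp1252 is only an assumed stand-in for "the standard code page" (the real one depends on the Windows system settings).

```python
def find_text_in_bytes(data: bytes, needle: str) -> dict:
    """Return the byte offsets of `needle` in `data` for two encodings.

    "ansi" assumes Windows code page 1252 (an assumption, see above);
    "utf16" assumes little-endian UTF-16 without a byte order mark.
    """
    hits = {}
    for label, codec in (("ansi", "cp1252"), ("utf16", "utf-16-le")):
        try:
            pattern = needle.encode(codec)
        except UnicodeEncodeError:
            # The needle cannot be written in this encoding at all.
            hits[label] = []
            continue
        offsets = []
        pos = data.find(pattern)
        while pos != -1:
            offsets.append(pos)
            pos = data.find(pattern, pos + 1)
        hits[label] = offsets
    return hits


# Fake "binary file": some junk bytes, an ANSI run, more junk, a UTF-16 run.
blob = b"\x00\x01" + "hello".encode("cp1252") + b"\xff" + "hello".encode("utf-16-le")
print(find_text_in_bytes(blob, "hello"))
```

      The same word sits in the file twice under two different byte patterns, so a search that tries only one encoding misses the other occurrence. That is why the encoding option in Find in Files matters for *.doc files.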

      A *.docx file stores its text completely differently. An MS Word document in DOCX format is a set of XML files packed with ZIP compression into a single file with the extension DOCX. The compression makes the text unreadable at the byte level for any application that does not unpack the DOCX container and search for the text in the included *.xml files.
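
      As a rough sketch of what "unpack the container and search the XML" means, the Python below opens the ZIP archive, reads word/document.xml (the part that holds the main body text in a DOCX file) and collects the text runs. Both function names are hypothetical, and make_sample_docx only builds a tiny stand-in archive for the demo, not a file Word would open.

```python
import io
import zipfile
from xml.etree import ElementTree
from xml.sax.saxutils import escape

# WordprocessingML namespace used by the <w:t> text elements.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def make_sample_docx(paragraph: str) -> bytes:
    """Build a minimal DOCX-like ZIP in memory (illustration only)."""
    xml = (
        '<?xml version="1.0" encoding="UTF-8"?>'
        f'<w:document xmlns:w="{W_NS}"><w:body><w:p><w:r>'
        f"<w:t>{escape(paragraph)}</w:t>"
        "</w:r></w:p></w:body></w:document>"
    )
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        archive.writestr("word/document.xml", xml)
    return buffer.getvalue()


def docx_contains(docx_bytes: bytes, needle: str) -> bool:
    """Unpack word/document.xml and search the concatenated text runs."""
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as archive:
        xml_bytes = archive.read("word/document.xml")
    root = ElementTree.fromstring(xml_bytes)
    # Word may split one visible sentence across several <w:t> runs,
    # so join them before searching. Paragraph breaks are not preserved.
    text = "".join(node.text or "" for node in root.iter(f"{{{W_NS}}}t"))
    return needle in text


sample = make_sample_docx("Find in Files test")
print(docx_contains(sample, "Files test"))
```

      A plain byte search over the packed file has no access to this text, because the XML part is stored deflate-compressed inside the archive. Headers, footers and footnotes live in other parts of the archive and would need the same treatment.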

      UC and UF convert *.doc and *.docx files to RTF format, stored temporarily in the %TEMP% folder, and extract the pure text from the RTF file for the text comparison or the text search, respectively. That is why UC and UF can find text as it is displayed in Microsoft Word. UltraEdit, in contrast, just runs a simple byte stream search and does not interpret *.doc files.

      Open a *.doc file and a *.docx file in UltraEdit. Both are opened in hex edit mode because both are binary files. Scroll down in the *.doc file and watch the column that displays the bytes as characters of the system code page: at some point you will see the text as in Microsoft Word. But if you scroll down in the *.docx file, you will never see the document text on the right side of the hex edit display.
      Best regards from an UC/UE/UES for Windows user from Austria

      Power User

        May 19, 2014 #3

        Thanks for the good information, Mofi. I knew the new Office formats used XML but I was not aware of the compression. You could put together a forum help entry on why UltraEdit doesn't work with Office files, but then I'm not sure that the people who would benefit the most from such a document would read it.

        I've seen a lot of folks here on the forums (and in the real world) not understand how word processing programs differ from text processing programs. They don't realize all of what has to go on in the background to produce WYSIWYG and how all of that markup data has to be in the file with the text itself. They don't need to know how it works but they need to understand that more than words are in the file.