Wrong result with a find using option match whole word in rare cases with longer word across a block boundary (fixed)

Power User

    Aug 13, 2020#1

    I have ...
    • UE 26.20.0.74
    • a file with 42 MB, 1 million lines, max. 100 chars long
    • UTF-8
    • Lines which contain e.g. <Funktion> and <FunktionHierarchisch> (with brackets)
    When I make a simple search (no regular expression) ...
    • for whole word <Funktion> -> it finds nothing
    • for whole word Funktion -> it also finds FunktionHierarchisch and lists it in the Find String List, but Find next (F3) goes to the correct word.
    The results are okay after reducing the file to 100 lines. A file with 150,000 lines produces more wrong results.

    Counting the word Funktion brings 50,000 results, so maybe ...
    • the result list cannot handle the result (I can see the word "placeholder" for 1/10 second in the list)
    • or the brackets make problems
    • or the CamelCase syntax is the problem?
    Any experience with this?
    Zwischenablage-2.png (25.03KiB)
    Zwischenablage-1.png (49.6KiB)
    UE 26.20.0.74 German / Win 10 x 64 Pro

    Master

      Aug 14, 2020#2

      Hi,

      I think you have a problem with Match whole word ON because other text follows immediately after the closing >, so it is not a whole word.
      I prepared a random file containing 1 million lines with 100000 occurrences of <Funktion> and <FunktionHierarchisch>. Find works fine for me with the option Match whole word OFF.
      I tried:
      <Funktion> MWW OFF
      Funktion  MWW ON
      No problem found. UE 27.00.0.94

      BR, Fleggy

      Grand Master

        Aug 17, 2020#3

        Enabling the option Match whole word does not mean that the search string itself must be a word. The search string can contain non-word characters anywhere: at the beginning, in the middle, or at the end. With Match whole word enabled, a match is accepted only if it starts at the beginning of the file or has a non-word character to its left, and it ends at the end of the file or has a non-word character to its right. So a search for <Funktion> with Match whole word enabled finds this string, for example, at the beginning of a line (beginning of file or a newline character to the left) when it is followed by a space, any other non-word character, or the end of the file.
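The rule described above can be sketched in a few lines of Python. This is only an illustration of the two boundary conditions, not UltraEdit's code. The function names `is_word_char` and `find_whole_word` are made up for this example, and the definition of a word character (letters, digits, underscore) is an assumption.

```python
def is_word_char(ch: str) -> bool:
    # Assumption: word characters are letters, digits and the underscore.
    return ch.isalnum() or ch == "_"


def find_whole_word(text: str, needle: str) -> list[int]:
    """Return the start offsets of all whole-word matches of needle in text."""
    hits = []
    start = text.find(needle)
    while start != -1:
        end = start + len(needle)
        # Left condition: beginning of text or a non-word character on the left.
        left_ok = start == 0 or not is_word_char(text[start - 1])
        # Right condition: end of text or a non-word character on the right.
        right_ok = end == len(text) or not is_word_char(text[end])
        if left_ok and right_ok:
            hits.append(start)
        start = text.find(needle, start + 1)
    return hits
```

With this sketch, searching for `<Funktion>` in `"\t<Funktion>Kontrollschacht</Funktion>"` returns no hit because the character to the right is the word character K, while searching for `Funktion` in the same text returns two hits because both occurrences are enclosed by `<` or `/` and `>`.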

        I am not sure whether beginning of file and end of file are handled 100% correctly for a large file. They should be, but UltraEdit processes a large file in blocks: UE always loads just one block of a large file into memory for processing. So it could be that "beginning of file" is in reality the beginning of the character stream buffer and "end of file" is in reality the end of the character stream in the buffer.

        On a positive match at the beginning of the current block with Match whole word enabled, UltraEdit should merge the end of the previous block with the beginning of the current block to make sure that there really is a non-word character or no character at all to the left in the file. Likewise, on a positive match at the end of the current block, UltraEdit should merge the end of the current block with the beginning of the next block to make sure that there really is a non-word character or no more characters to the right in the file.

        A real problem arises if the entire current block is a positive match for a (regular expression) search, so that more characters before and after the current block are required to validate whether the match is really positive and whether the search string/expression matches more characters than those in the current character stream buffer. Of course, it is possible to define a regular expression find that selects the entire file contents by mistake. That cannot really work for large and huge files, as UltraEdit would need to load the entire file contents into memory, which might not be possible at all. Therefore, selecting a large block with a regular expression search is possible only in theory, not in the real world. However, there is no large block to find in this case.
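The merging of block ends described above could look roughly like the following Python sketch. It is a guess at one possible approach under stated assumptions, not a description of how UltraEdit works internally. The names `find_whole_word_blocks` and `is_word_char` are hypothetical, and the word character definition is again an assumption. A match that touches the end of the loaded text is not decided until the next block has been appended, and one character to the left of the resume position is always kept in the buffer.

```python
from typing import Iterable, Iterator


def is_word_char(ch: str) -> bool:
    # Assumption: word characters are letters, digits and the underscore.
    return ch.isalnum() or ch == "_"


def find_whole_word_blocks(blocks: Iterable[str], needle: str) -> Iterator[int]:
    """Yield absolute offsets of whole-word matches in a stream of text blocks."""
    if not needle:
        raise ValueError("needle must not be empty")
    buf = ""    # loaded text that still matters for upcoming decisions
    base = 0    # absolute offset of buf[0] in the whole stream
    pos = 0     # index in buf at which the search resumes
    it = iter(blocks)
    eof = False
    while not eof:
        block = next(it, None)
        if block is None:
            eof = True
        else:
            buf += block
        while True:
            i = buf.find(needle, pos)
            if i == -1:
                # A match may still begin in the last len(needle) - 1 characters.
                pos = max(pos, len(buf) - len(needle) + 1)
                break
            end = i + len(needle)
            if end == len(buf) and not eof:
                # Right neighbor not loaded yet: decide after the next block.
                pos = i
                break
            left_ok = base + i == 0 or not is_word_char(buf[i - 1])
            right_ok = end == len(buf) or not is_word_char(buf[end])
            if left_ok and right_ok:
                yield base + i
            pos = i + 1
        # Drop processed text, but keep one character left of the resume position.
        cut = max(pos - 1, 0)
        buf = buf[cut:]
        base += cut
        pos -= cut
```

For example, `list(find_whole_word_blocks(["x <Funktion", "Hierarchisch> <Funktion> y"], "Funktion"))` reports only the second occurrence, because the decision about the first one is postponed until the H from the next block is available.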

        I tried to reproduce this issue with a file containing 1 million lines; see the attached RAR archive file test.rar of just 3 290 bytes, which contains the file Test.txt with 39 180 000 bytes. But on a search for Funktion with Match whole word checked, 32-bit UltraEdit for Windows v26.20.0.68 and v27.00.0.94 list only the 990 000 lines containing <Funktion>. I verified this by copying the list of found lines to the clipboard, pasting them into a new file, and searching from the top of the new file for FunktionH with quick find, which could not find this string.

        Peter, is it possible for you to compress your file into a small 7-Zip, RAR or ZIP archive file and attach the archive file to your next post, provided it contains only non-confidential data?

        That would make it easier for me to reproduce the issue. I will report this issue if I can reproduce it with your file with UltraEdit for Windows v26.20.0.68 and v27.00.0.94.

        Edit: The RAR archive file was removed later after Peter attached a file which made it possible to reproduce the issue.
        Best regards from an UC/UE/UES for Windows user from Austria

        Power User

          Aug 17, 2020#4

          Hi Mofi

          I didn't understand all the details of your posting, but I played around with your file, these results:
          • I could not reproduce that it finds FunktionHierarchisch when I search for whole word Funktion. I thought, I saw it the first time, but cannot prove it.
          • It still does not find <Funktion> as whole word.
          • Maybe it is a refresh problem: when I have Highlight all items found and List lines containing string enabled, some things appear very quickly. The word "placeholder" appears, and e.g. the third resulting line of the previous search is replaced by the third result of the current search, while all other lines stay unchanged until the end. I made a screencast of the described behavior: https://knowledge.autodesk.com/communit ... 7fd87a7a80

          Grand Master

            Aug 17, 2020#5

            The word FunktionHierarchisch is definitely never found in my test file when searching for the word Funktion with Match whole word enabled from the top of the file after opening it. I might change the file contents to get perhaps one <FunktionHierarchisch> at an offset in the file at which <Funktion ends up at the end of a block read from the file. I had hoped to get your file to save me the time of finding out at which position in the file <FunktionHierarchisch> must be placed to reproduce the issue.

            With Match whole word enabled, <Funktion> is found in the file only after modifying one of the lines containing this tag by inserting a space or any other non-word character after >. In my file, every line with <Funktion> has a horizontal tab, which is a non-word character, to the left of this string. So condition one, a non-word character to the left of the searched string, is true on all lines containing <Funktion>. But to the right of <Funktion> there is always the word character K, and for that reason the second condition, a non-word character to the right of the searched string, is false on all lines in the file containing the string <Funktion>.

            In other words, the first, second and last lines below would be listed on searching for <Funktion> with Match whole word enabled, while the third and fourth lines would be ignored for the reason written in each line.

            <Funktion><!-- Matched as left is the beginning of the file and right is an angle bracket. -->Kontrollschacht</Funktion>
            <Funktion> <!-- Matched as left is a line-feed and right is a space, both non-word characters. -->Kontrollschacht</Funktion>
             <Funktion>Kontrollschacht <!-- Ignored as left is the non-word character space, but right is the word character K. --></Funktion>
            <!-- Ignored<Funktion> as left is the word character d while right is the non-word character space. --></Funktion>
            <!-- Matched as this last line has an angle bracket left and no line termination at end and so the searched string is at end of file.--><Funktion>

            Power User

              Aug 18, 2020#6

              a) <Funktion> as whole word:
              Thanks for the example, it's clear now.

              b) Found "too much"
              So, I made it. Here is a part of the original file, modified because it was confidential. I reduced it to approx. 69,000 lines and replaced (nearly) all characters with "1". It behaves as described above: Funktion as whole word also shows FunktionHierarchisch in the results list.
              It seems to be a problem of the "block reading / data streaming / file size" and not a problem of the related characters, because it was hard work to change the file and keep the problem.
              When I removed (large) data selections, replaced some strings or removed the trailing spaces, the error disappeared in most cases. I hope you can now reproduce the problem and it does not disappear.
              ue_wrong_search.png (103.46KiB)
              wrong_search.7z (24.4 KiB)

              Grand Master

                Aug 18, 2020#7

                Thank you for the example file, great work!

                The false positive find issue is reproducible with your file with UltraEdit for Windows v26.20.0.68 and with currently latest v27.00.0.94.

                So I reported this issue to IDM support by email and have already received the reply that the issue was reproducible and has been forwarded to a developer for investigation. My report is in the attached RAR archive file.

                The block size is obviously 2 816 000 bytes (2 750 KiB), as that is the number of bytes from the top of the file to the end of <Funktion of the start tag <FunktionHierarchisch> on line 68 376, not counting the three bytes of the UTF-8 BOM. Inserting or removing a leading space on line 68 376 to the left of the start tag <FunktionHierarchisch> avoids the false positive find. Inserting or removing a character to the left of the end tag </FunktionHierarchisch> makes no difference. So the issue is indeed that <Funktion is found at the end of a block of 2 816 000 characters, which results in the false positive because the next character is not read from the file to test whether it is a non-word character.
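The position described above can be recreated with a small Python sketch that builds file content in which `<Funktion` ends exactly at the block boundary. This only illustrates the arithmetic. The function name `build_boundary_case` and the filler lines are made up for this example, the block size is the value measured above and not a documented constant, and whether a file built this way triggers the false positive in a given UltraEdit version has not been verified.

```python
BLOCK_SIZE = 2_816_000  # bytes, as measured above; 2_816_000 / 1024 == 2750 KiB
UTF8_BOM = b"\xef\xbb\xbf"


def build_boundary_case(block_size: int = BLOCK_SIZE) -> bytes:
    """Return UTF-8 file content in which '<Funktion' ends at block_size (BOM not counted)."""
    prefix = b"<Funktion"
    filler_line = b"1" * 99 + b"\n"          # hypothetical filler line of 100 bytes
    pad_len = block_size - len(prefix)       # bytes needed before the start tag
    if pad_len < 0:
        raise ValueError("block_size is smaller than the tag prefix")
    full_lines, rest = divmod(pad_len, len(filler_line))
    padding = filler_line * full_lines
    if rest:
        # Shorter last filler line so that the tag begins at the start of a line.
        padding += b"1" * (rest - 1) + b"\n"
    tag_line = b"<FunktionHierarchisch>1</FunktionHierarchisch>\n"
    return UTF8_BOM + padding + tag_line


if __name__ == "__main__":
    # Hypothetical output file name, chosen only for this example.
    with open("boundary_test.txt", "wb") as handle:
        handle.write(build_boundary_case())
```

Shifting the tag by one byte, for example by calling the function with `BLOCK_SIZE + 1`, corresponds to inserting a leading space on the line with the start tag as described above.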

                The issue is not reproducible with UE v22.20.0.49 using your file. This just means that this older version of UltraEdit uses a smaller block size for processing a large file. I'm pretty sure the issue could also be reproduced with UE v22.20.0.49 after finding the right position for <FunktionHierarchisch> in the file.

                I also viewed the screencast multiple times. The refresh issue is explainable for me; it is a feature and not a bug. UltraEdit starts the find after your click on the button Next and starts refreshing the view with the found lines. Such a view is coded by Microsoft to refresh first just the active line in the view, which is the reason why the third line gets updated first. But the find task takes more than a second to finish. For that reason, UltraEdit stops printing the results to the view and shows a progress bar in the status bar at the bottom left (prompt area). A longer-running find finishes faster if window refreshes are avoided as much as possible, as those window refreshes take more time than searching for a string in hundreds of thousands of characters. The user can cancel the find by pressing the key ESC or, as you did, by clicking on a button that starts a find, in this case the button Previous. The results of the canceled find operation are then displayed in the view, which is refreshed just once. The progress bar in the status bar is removed and the prompt area shows the suitable prompt text again. This feature was introduced with UE v26.20, so you can read about it in the file changes.txt in the program files folder of UltraEdit.
                wrong_find_result_report.rar (27.79 KiB)
                Report of false positive find with match whole word

                  Oct 12, 2020#8

                  This issue is fixed with UltraEdit for Windows v27.10 and UEStudio v20.10.