How to find files containing (two) words in any order?

1
NewbieNewbie
1

    Jul 25, 2018#1

    Hello,

    I am using the Find in Files feature to search for a few keywords in thousands of files. I am able to use a Perl regular expression to do an OR search. Say the keywords I want are cat and dog: I can search for cat|dog and it will pull up files that have either cat or dog or both.

    I would like to be able to search only for files that have cat AND dog in them, regardless of their order or the line they are on. The files are text files. I have spent hours researching this to no avail. Any information is appreciated. My ultimate goal is to write a quick macro or something similar that immediately searches certain directories after the user enters the appropriate keywords.

    Thank you

    6,680583
    Grand MasterGrand Master
    6,680583

      Jul 26, 2018#2

      For just two keywords and small files of less than 100 KiB, the fastest way is the Perl regular expression search string (?s)\bcat\b.+\bdog\b|\bdog\b.+\bcat\b, which finds files containing cat followed by dog or dog followed by cat.
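
To try the expression above outside the editor, here is a minimal sketch in Python. Python's re module is only a stand-in for the Perl regex engine of the editor, and the function name contains_cat_and_dog is made up for illustration. Python matches case-sensitively by default, which may differ from the Match case setting in Find in Files.

```python
import re

# Same expression as in the post. (?s) lets "." also match line breaks,
# so the two words may be on different lines of the file.
BOTH_WORDS = re.compile(r"(?s)\bcat\b.+\bdog\b|\bdog\b.+\bcat\b")


def contains_cat_and_dog(text):
    """Return True if text contains the whole words cat and dog in any order."""
    return BOTH_WORDS.search(text) is not None
```

Calling contains_cat_and_dog() on the content of a file tells whether that file would be listed. The whole file content is scanned by a single expression, which is why this only suits small files.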

      But this approach is not practical for finding files which must contain several words. For such cases an UltraEdit script is required. I can imagine two different approaches:
      1. The script uses Find in Files to search all files for the first word, with the results written to the output window. The script takes the names of the files containing the first word from the output window and stores them in a string array (first file names list). Then the script runs Find in Files again with the second word and stores the names of the files containing it in an additional string array (second file names list). This procedure is repeated for each keyword. Finally, the script takes one file name after the other from the first file names list and searches for it in all other file names lists. Each file name found in all lists is written to the output window, which finally yields the list of files containing all words.
      2. The script uses Find in Files to search all files for the first word, with the results written to the output window. The script takes the names of the files containing the first word from the output window and stores them in a file names list. Next the script searches only the files in this list for the second keyword and removes a file name from the list if the file does not contain it. This Find in Files run on each file in the list is repeated for each keyword until the list becomes empty (no file contains all keywords) or all keywords have been searched. The final list then contains only the names of the files containing all keywords and is written to the output window.
      The second approach is most likely faster, especially if the first search already reduces the thousands of files to just a few.
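
The logic of the second approach can be sketched as follows. This is Python, not an UltraEdit script: it reads the files directly instead of parsing the Find in Files output, uses a plain substring test instead of a regular expression, and the function names are invented for this illustration.

```python
from pathlib import Path


def files_containing(paths, word):
    """Return the paths whose text contains word (plain substring test)."""
    hits = []
    for path in paths:
        # Assumption: the files are text files; undecodable bytes are replaced.
        text = Path(path).read_text(encoding="utf-8", errors="replace")
        if word in text:
            hits.append(path)
    return hits


def files_with_all_words(paths, words):
    """Shrink the list of candidate files one keyword at a time."""
    candidates = list(paths)
    for word in words:
        candidates = files_containing(candidates, word)
        if not candidates:  # no file contains all keywords
            break
    return candidates
```

Each pass reads only the files that survived the previous pass, so the work drops quickly when the first keyword is rare.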

      But UltraEdit's Find in Files is not really designed for this task. The most efficient method would stop searching a file as soon as the first occurrence of a keyword is found, reset the position to the beginning of the file, and search for the next word. If that word is also found somewhere in the file, the position is reset again and the third word is searched, and so on until either all words have been found or one word is not found. Then the same loop is applied to the next file. That is not possible with Find in Files. It could be done only by opening each file and using the regular Find (with lots of display updates) or, if the files are small, by loading the content of each file into memory as a string and searching for the first occurrence of each keyword with JavaScript's String.search() function (display updates reduced to a minimum, so it finishes faster). But opening thousands of files in UltraEdit to use this method can be slower than the second Find in Files approach when the number of files containing the first searched word is low.
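
The per-file loop described above is short once the file content is in memory as a string. A minimal sketch in Python, where str.find() takes the place of JavaScript's String.search() and the function name contains_all_words is hypothetical:

```python
def contains_all_words(text, words):
    """Search each word from the start of text and stop at the first miss."""
    for word in words:
        # str.find() stops at the first occurrence and returns -1 if none exists.
        if text.find(word) == -1:
            return False  # one word is missing: the remaining words are skipped
    return True
```

For example, contains_all_words("a dog and a cat", ["cat", "dog"]) is true, while adding "bird" to the word list makes it false.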

      The time required for this special search task depends heavily on the number of window updates, which take much more time than searching for a word in a small file. UltraEdit as a text editor is not designed to do something completely in the background with no visual indication for the user. So my last advice is to write a PowerShell script for this task and call it via a user tool with the parent directory for the search as first parameter and the words every file must contain as further arguments. This would definitely be the fastest method for this special search task.
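
To show the shape of such an external script, here is a rough sketch written in Python instead of PowerShell. It takes the parent directory as first argument and the words as further arguments, as suggested above. The script name find_all_words.py, the function names and the UTF-8 assumption are all made up for this example.

```python
import sys
from pathlib import Path


def find_files_with_all_words(parent_dir, words):
    """Yield the files below parent_dir whose text contains every word."""
    for path in sorted(Path(parent_dir).rglob("*")):
        if not path.is_file():
            continue
        try:
            # Assumption: text files; undecodable bytes are replaced.
            text = path.read_text(encoding="utf-8", errors="replace")
        except OSError:
            continue  # unreadable file: skip it
        # all() stops at the first word that is missing from the file.
        if all(word in text for word in words):
            yield path


def main(argv):
    """Print matching file names; argv[1] is the directory, the rest are words."""
    if len(argv) < 3:
        print("usage: find_all_words.py PARENT_DIR WORD [WORD ...]")
        return
    for path in find_files_with_all_words(argv[1], argv[2:]):
        print(path)


if __name__ == "__main__":
    main(sys.argv)
```

A user tool could then run something like python find_all_words.py C:\Temp\Logs cat dog (the path is just an example) and capture the printed file names.
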
      Best regards from an UC/UE/UES for Windows user from Austria

      4
      NewbieNewbie
      4

        Jul 28, 2020#3

        I am looking to find in files, using a regular expression in Perl or UltraEdit format, lines that have two words such as "cat" and "dog" for my work. Can anyone help me? I would like to see an example of a search string because, well, I am dumb as a brick when it comes to absorbing programming code.

        The Help files get very technical and so far in my searching on the internet I am not finding what I am looking for.

        Much thanks in advance.

        19476
        MasterMaster
        19476

          Jul 28, 2020#4

          Hi,

          Just to be sure - two words on the same line? Or two words anywhere in the file even on different lines?

          Thanks, Fleggy

          EDIT:
          Let's say you need to find all lines containing both words CAT and DOG (or words containing these strings - e.g. Doggy is fine as well as dog). Then this Perl regex search pattern can be used (for example):
          (?:cat(?=.*dog)|dog(?=.*cat)).*$

          It means:
          • Try to find CAT followed by DOG on the same line
          • or try to find DOG followed by CAT on the same line
          • and if CAT or DOG was found then match the rest of the line so the next search starts on the next line.
          And if you need to find the exact words (\b means the word boundary):
          (?:\bcat\b(?=.*\bdog\b)|\bdog\b(?=.*\bcat\b)).*$
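
Both patterns can be tried on a few sample lines with a short Python sketch. Python's re module is only a stand-in for the editor's Perl regex engine, re.IGNORECASE assumes a search with Match case turned off, and the helper matching_lines and the sample text are invented for this example.

```python
import re

# The two expressions from the post.
SUBSTRINGS = re.compile(r"(?:cat(?=.*dog)|dog(?=.*cat)).*$", re.IGNORECASE)
WHOLE_WORDS = re.compile(
    r"(?:\bcat\b(?=.*\bdog\b)|\bdog\b(?=.*\bcat\b)).*$", re.IGNORECASE
)


def matching_lines(text, pattern):
    """Return the lines of text on which pattern finds a match."""
    return [line for line in text.splitlines() if pattern.search(line)]


SAMPLE = "The cat chased the dog\nOnly a cat here\nDoggy and Cat\ncat and catalog"
```

With this sample, the first pattern reports the lines "The cat chased the dog" and "Doggy and Cat", while the whole-word pattern reports only the first of them because Doggy is not the exact word dog.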

          BR, Fleggy

          6,680583
          Grand MasterGrand Master
          6,680583

            Jul 28, 2020#5

            Best regards from an UC/UE/UES for Windows user from Austria

            4
            NewbieNewbie
            4

              Jul 28, 2020#6

              Yes, trying to find two words on the same line in multiple files in a folder. The script worked great! Thank you very much!

              19476
              MasterMaster
              19476

                Jul 29, 2020#7

                BTW: Here is another pattern without the alternation. For educational purposes :)

                (?<ANIMAL>\b(?:dog|cat)\b).+(?!\k<ANIMAL>)(?&ANIMAL).*$

                4
                NewbieNewbie
                4

                  Jul 29, 2020#8

                  So how does that one work? It looks like it's looking for Animal as well as Dog and Cat, or is Animal just a title?

                  19476
                  MasterMaster
                  19476

                    Jul 29, 2020#9

                    Yes, ANIMAL is just the name of the group. This group has a definition (match either dog or cat) and, once a match is found, it also has content: the matched word. Later you can refer to both: to the definition with (?&ANIMAL) or to the matched (captured) word with \k<ANIMAL>.

                    (?<ANIMAL>\b(?:dog|cat)\b) tries to find the word dog or cat and captures it in a group named ANIMAL. This is the definition of ANIMAL.
                    .+ skips the following characters (at least one).
                    (?!\k<ANIMAL>) tests that the text at the current position is not the already captured word. If the word is the same, the pattern fails at this position and the search continues from another position. If the text is different, matching continues.
                    (?&ANIMAL) tries to find a word as defined by the group ANIMAL (dog or cat). Because of the previous test, this can match only the animal not matched yet.
                    .*$ finally skips the rest of the line, because we have a match and the rest is not interesting.
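
The same idea can be tried in Python, with one caveat: Python's built-in re module has no (?&ANIMAL) call, so in this sketch the definition is simply written out a second time, and the backreference is spelled (?P=ANIMAL) instead of \k<ANIMAL>. It is therefore an approximation of the pattern above, not the identical expression, and the function name has_both_animals is made up.

```python
import re

# Definition of an "animal": the whole word dog or cat.
ANIMAL = r"\b(?:dog|cat)\b"

# Capture the first animal, skip at least one character, then require an
# animal that is NOT the captured one (negative lookahead on the backreference).
PATTERN = re.compile(
    rf"(?P<ANIMAL>{ANIMAL}).+(?!(?P=ANIMAL)){ANIMAL}.*$", re.IGNORECASE
)


def has_both_animals(line):
    """Return True if line contains both whole words, in either order."""
    return PATTERN.search(line) is not None
```

A line such as "dog dog cat" matches because the engine backtracks until the second animal differs from the captured one, while "cat and cat" does not match.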

                    4
                    NewbieNewbie
                    4

                      Jul 29, 2020#10

                      Thanks for the explanation. I have what I need and the script worked great. Thanks much!