regex question about how to exclude items

60
Advanced User
60

    May 05, 2009 #1

    I want to find all occurrences of ftp, but not sftp,
    and exclude lines beginning with a # or the word echo.

    Code:

    ftp.bat a b c d
    sftp a b c d f
    echo ftp 
    #  ftp
       #  ftp
       echo ftp
    ftp_asc.bat 12 2 3 4
              sftp rkrktkt .
    ftp -iv 
    
    Lines 1, 7 and 9 should be selected.

    236
    Master
    236

      May 05, 2009 #2

      The simplest Perl regex that works on your example would be ^\s*ftp
      That only matches ftp at the start of the line (including optional whitespace).

      Is that sufficient?

      If you need to match a line like

      get ftp.test.com

      then this gets complicated and can't be done with a single Perl regex.
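As a quick check of the anchored pattern above, here is a minimal Python sketch that runs it over the sample lines from the first post. Python's `re` module stands in for the editor's Perl regex engine, and the names `SAMPLE`, `LINE_START_FTP` and `matching_line_numbers` are made up for this illustration.

```python
import re

# Sample lines from the first post, in their original order.
SAMPLE = [
    "ftp.bat a b c d",
    "sftp a b c d f",
    "echo ftp ",
    "#  ftp",
    "   #  ftp",
    "   echo ftp",
    "ftp_asc.bat 12 2 3 4",
    "          sftp rkrktkt .",
    "ftp -iv ",
]

# ftp at the start of the line, after optional leading whitespace.
LINE_START_FTP = re.compile(r"^\s*ftp")


def matching_line_numbers(lines):
    """Return the 1-based numbers of the lines the pattern matches."""
    return [n for n, line in enumerate(lines, start=1)
            if LINE_START_FTP.search(line)]


print(matching_line_numbers(SAMPLE))  # [1, 7, 9]

# A line where ftp is not the first word is not matched by this pattern.
print(LINE_START_FTP.search("get ftp.test.com"))  # None
```

The anchor does all the work here: sftp, # and echo lines fail simply because ftp is not the first non-blank text on the line.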

      60
      Advanced User
      60

        May 05, 2009 #3

        Code:

        ^\s*(ftp[^ ]*) ([^ ]+).([^ ]+).([^ ]+).([^ ]+.)

        I tried this so that I can capture ftp..... up to the first space or tab.
        It seems to capture "ftp " as well as "ftp.bat".

        I also tried to capture all the non-space, non-tab characters after the ftp..., and I found that this seems to work.

        Is there a better way? I want each parameter in a separate capture group.

        Thanks,
        Steven

        236
        Master
        236

          May 06, 2009 #4

          I'd try to be more specific. Your regex carries a risk of heavy backtracking, since . and [^ ] can match the same characters; if a match fails, the regex engine may have to try a large number of permutations before it gives up. So assuming that you expect at least one and not more than four parameters after ftp(.bat), and that parameters are always separated by spaces, not by tabs, I'd choose:

          ^\s*(ftp\S*) (\S+) ?(\S+)? ?(\S+)? ?(\S+)?

          I chose \S because it matches any non-whitespace character, which avoids matching across newlines when a line has fewer than four parameters following ftp(.bat). Since you need every parameter in a separate capture group, there is no better way.

          \2 will contain the first parameter, \3 through \5 the following parameters, if present.
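To see how the groups fill in, here is a small Python sketch of the pattern above. Python's `re` module is used as a stand-in for the editor's regex engine, and `PARAMS` and `split_command` are names invented for this example. In Python, an optional group that did not take part in the match comes back as `None`.

```python
import re

# The pattern from the post above: command in group 1, parameters in groups 2-5.
PARAMS = re.compile(r"^\s*(ftp\S*) (\S+) ?(\S+)? ?(\S+)? ?(\S+)?")


def split_command(line):
    """Return the five capture groups as a tuple, or None if the line does not match."""
    m = PARAMS.search(line)
    return m.groups() if m else None


print(split_command("ftp_asc.bat 12 2 3 4"))  # ('ftp_asc.bat', '12', '2', '3', '4')
print(split_command("ftp -iv "))              # ('ftp', '-iv', None, None, None)
print(split_command("sftp a b c d f"))        # None
```

Because the first parameter group is mandatory, a bare `ftp` with nothing after it is not matched by this pattern.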

          60
          Advanced User
          60

            May 06, 2009 #5

            Thanks again for the great advice!!!!
            Everything works great!!!

              May 07, 2009 #6

              I found one small problem, so I need to go back to excluding lines beginning with
              a '#' or 'echo'; if either precedes ftp anywhere on the line, I do not want a match.

              Code:

              #   ftp
                  sftp
                echo   ftp oiipo poipipiopoi ipoiipoipi 
                  # ftp a  v b 
              ftp_ascii.bat 
              $batch_dir/jobs/ftp.bat diriri
              ./ftp.batch kdkdkdd flflflf
              /a/vb/x/jobs/ftp_aaa.bat kdkdkd ckckckc
                 ftp dkdkdkf fkfkfk
                 ftp_bin.ksh   dndndndn nfmfmf
                 ftp_ascI-ro_bin.bat dfdkdkd d d d d d
              $batch_dir/jobs/sftp.bat diriri   
              define ftp=d.x
              ftp-out
              
              Lines 5-11 should match.
              So for right now, if there is a # or echo in front, do not show it.
              At first I did not realize that I had also filtered out all the lines with paths, and therefore the lines I wanted to see.

              This is what I have so far.

              Code:

              ^\s*?[^#e]*[^s]ftp\S*

              But it is missing lines beginning with ftp.....

              I am now trying this.

              Code:

              ^(\s*?[^#e]*[^s])?ftp\S+[ \t]*(\S+)?[ \t]*(\S+)?[ \t]*(\S+)?[ \t]*(\S+)?

              But this is taking a long time to go through all 10,000 files.
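A short Python sketch may help show why the first attempt above misses lines that begin with ftp. It tests one line at a time, which is an assumption of this sketch rather than how the editor searches a file; `FIRST_ATTEMPT` and `is_match` are illustrative names, and the last test line is hypothetical.

```python
import re

# First attempt quoted in the post above.
FIRST_ATTEMPT = re.compile(r"^\s*?[^#e]*[^s]ftp\S*")


def is_match(line):
    """True when the first-attempt pattern matches the line."""
    return FIRST_ATTEMPT.search(line) is not None


for line in (
    "ftp_ascii.bat ",               # missed: nothing before ftp for [^s] to consume
    "./ftp.batch kdkdkdd flflflf",  # matched: "/" satisfies [^s]
    "   # ftp a  v b ",             # rejected, as intended
    "/home/jobs/ftp.bat diriri",    # hypothetical line: the "e" in "home" blocks [^#e]*
):
    print(repr(line), is_match(line))
```

The class `[^s]` must consume a real character, so it cannot stand for "not preceded by s" at the very start of a line, and `[^#e]*` rejects any e or # before ftp, not just a leading `echo` or comment marker.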

              236
              Master
              236

                May 07, 2009 #7

                This is a problem :)

                Perl regexes do not support variable-length lookbehind. Therefore there is no one-stop solution.

                Two possibilities: On a copy of the files, first remove all the lines that contain # ftp or echo ftp. Then search for the original regex (omitting the leading \s*).

                Or (depending on what you want to do with the matches) create a macro that first looks for a line that doesn't contain # ftp or echo ftp, then checks that line for what you're looking for, and then does with it whatever you need.
                Which of these sounds more workable for you?

                60
                Advanced User
                60

                  May 07, 2009 #8

                  Since I am going through lots of files and subdirectories, would a macro work or is this a script job?

                  And thanks, that is what I get for using another tool :D to create the regex.

                  236
                  Master
                  236

                    May 08, 2009 #9

                    What do you want to do with the matches - do you want to change something or delete them, or do you need a list or...?

                    60
                    Advanced User
                    60

                      May 08, 2009 #10

                      I need to create a list of the script names containing the following.

                      ftp something
                      ftp_ascii. something
                      ftp_binary. something
                      ftp. something
                      ftp*exe something
                      but not sftp anything.



                      The list I first created contained lots of scripts. I noticed that some had ftp, or whatever I needed, only on comment lines and the like:

                      # ftp .......
                      echo ftp
                      cat ftp.log
                      define ftp_log=rororo
                      ...
                      I do not need these scripts.

                      The same goes for a few other items that can start the line.
                      Since I only want to generate the list, I copied everything to another subdir --> fixthem,
                      where I can do what I want to the scripts.
                      I took your advice and removed all lines from the scripts that started with [...]# (the ... are spaces or tabs).
                      Next I removed all lines that started with [....]echo,
                      and so on down the list. Even with over 10,000 files, UE removed these lines very quickly.
                      Next I searched for \s*ftp\S* and found a much cleaner list.
                      I was getting greedy, or lazy (regex pun :D), trying to do it all in one regex pass.
                      I am still trying to figure out what lookahead and lookbehind do.
                      UE worked on these files very fast with the simple regexes. I was finished in under 30 minutes, after several removals and tests.
                      Usually it is the simple process that makes the most sense. I was so engrossed in making this regex work that I did not think about a simpler solution.
                      I usually have a hard time with macros in UE; maybe all of this could be done in a macro.
                      Even this manual process was much easier than going through the first set of matches, with over 1,000 scripts reporting ftp something. The list is now down to about 300 real "ftp" scripts.
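The manual two-step process described above could be sketched in Python roughly as follows. This is only an illustration on in-memory lines, not the UE workflow itself: `SKIP_PREFIX`, `FTP_WORD`, `drop_skipped` and `uses_ftp` are invented names, the prefix list is limited to the four examples named in the post, and the `(?<!s)` check that keeps sftp out is an addition of this sketch.

```python
import re

# Step 1: lines whose first non-blank text is one of these prefixes are removed.
SKIP_PREFIX = re.compile(r"^[ \t]*(?:#|echo|cat|define)")

# Step 2: ftp plus any trailing non-whitespace, but not as part of "sftp".
# The (?<!s) lookbehind is an assumption added for this sketch.
FTP_WORD = re.compile(r"(?<!s)ftp\S*")


def drop_skipped(lines):
    """Return only the lines that do not start with a skipped prefix."""
    return [line for line in lines if not SKIP_PREFIX.search(line)]


def uses_ftp(lines):
    """True when any remaining line mentions an ftp command."""
    return any(FTP_WORD.search(line) for line in drop_skipped(lines))


comment_only_script = ["# ftp .......", "echo ftp", "cat ftp.log", "define ftp_log=rororo"]
real_ftp_script = ["# transfer step", "  ftp_ascii.bat something"]
sftp_script = ["sftp anything"]

print(uses_ftp(comment_only_script))  # False
print(uses_ftp(real_ftp_script))      # True
print(uses_ftp(sftp_script))          # False
```

Keeping the two steps as separate simple patterns mirrors why the manual approach was fast: neither pattern needs to backtrack over the part of the line the other one handles.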

                      236
                      Master
                      236

                        May 08, 2009 #11

                        Great. I guess I would have done it the same way. No need to devise a fiendishly clever solution that takes hours to develop when you can spend 30 minutes once and be done with it. Even if it is a terribly boring task... But the inner geek will always compel us to SOLVE THE PROBLEM!

                        Lookaround isn't all that difficult. A so-called lookaround assertion checks whether it is possible (positive lookaround) or impossible (negative lookaround) to match a regular expression, either starting from the current position (lookahead) or ending there (lookbehind). However, it doesn't consume any characters of the string it's looking at; the regex engine stays at the same position in the string. So foo(?=bar) will match foo (and nothing more) in the string foobar, but not in the string foobaz.
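The four kinds of lookaround can be tried out with a few lines of Python. This is a minimal sketch using Python's `re` module as a stand-in for the editor's regex engine; `found` is a helper name invented here.

```python
import re


def found(pattern, text):
    """Return the matched text, or None when the pattern does not match."""
    m = re.search(pattern, text)
    return m.group(0) if m else None


# Positive lookahead: foo only when bar follows; bar is not part of the match.
print(found(r"foo(?=bar)", "foobar"))   # foo
print(found(r"foo(?=bar)", "foobaz"))   # None

# Negative lookahead: foo only when bar does NOT follow.
print(found(r"foo(?!bar)", "foobaz"))   # foo
print(found(r"foo(?!bar)", "foobar"))   # None

# Positive lookbehind: bar only when foo comes right before it.
print(found(r"(?<=foo)bar", "foobar"))  # bar

# Negative lookbehind: ftp only when it is not preceded by s.
print(found(r"(?<!s)ftp", "ftp.bat"))   # ftp
print(found(r"(?<!s)ftp", "sftp.bat"))  # None
```

In every case the matched text contains only what stands outside the lookaround parentheses, which is what "doesn't consume any characters" means in practice.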

                        16
                        Basic User
                        16

                          May 08, 2009 #12

                          Interesting thread (and challenging problem). This is an excellent example of the type of complex regex problem that is readily solved in one (multi-step) whack with PowerGrep, which can first section the file (eliminating the various kinds of comment lines), then process the remaining lines and generate an output file with just the desired ftp list data. If you are into regexes, PowerGrep is an amazingly powerful (albeit spendy) tool.

                          That said, here is a regex that should do the trick in one whack.

                          Code:

                          ^(?>\s*)(?!#|echo|cat|define).*?(?<!s)ftp\S*(?:[ \t]+(\S+))?(?:[ \t]+(\S+))?(?:[ \t]+(\S+))?(?:[ \t]+(\S+))?

                          Here it is again with comments in free-spacing mode:

                          Code:

                          ^(?>\s*)                     # atomically consume any/all leading whitespace
                          (?!\#|echo|cat|define)       # ensure first non-space char is not in blacklist
                          .*?                          # lazily grab everything up to 'ftp' (if any)
                          (?<!s)ftp\S*                 # match 'ftp*' but only if it is not preceded by 's'
                          (?:[ \t]+(\S+))?             # capture first parameter into $1
                          (?:[ \t]+(\S+))?             # capture second parameter into $2
                          (?:[ \t]+(\S+))?             # capture third parameter into $3
                          (?:[ \t]+(\S+))?             # capture fourth parameter into $4
                          This regex avoids a lot of backtracking and should be pretty fast. You can easily add more "blacklisted first words" and more captured parameters if necessary.
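For anyone who wants to try the one-pass idea outside the editor, here is a hedged Python sketch run against the 14 sample lines from post #6. One substitution is mine: Python's `re` module only accepts the atomic group `(?>...)` from version 3.11 on, so the whitespace skip is folded into the negative lookahead instead. The names `ONE_PASS`, `SAMPLE` and `matched_numbers` are made up for the example.

```python
import re

ONE_PASS = re.compile(
    r"""
    ^(?![ \t]*(?:\#|echo|cat|define))  # first non-blank text is not blacklisted
    .*?                                # lazily skip ahead to a usable ftp
    (?<!s)ftp\S*                       # ftp..., but not as part of sftp
    (?:[ \t]+(\S+))?                   # first parameter  -> group 1
    (?:[ \t]+(\S+))?                   # second parameter -> group 2
    (?:[ \t]+(\S+))?                   # third parameter  -> group 3
    (?:[ \t]+(\S+))?                   # fourth parameter -> group 4
    """,
    re.VERBOSE,
)

# The sample lines from post #6, in their original order.
SAMPLE = [
    "#   ftp",
    "    sftp",
    "  echo   ftp oiipo poipipiopoi ipoiipoipi ",
    "    # ftp a  v b ",
    "ftp_ascii.bat ",
    "$batch_dir/jobs/ftp.bat diriri",
    "./ftp.batch kdkdkdd flflflf",
    "/a/vb/x/jobs/ftp_aaa.bat kdkdkd ckckckc",
    "   ftp dkdkdkf fkfkfk",
    "   ftp_bin.ksh   dndndndn nfmfmf",
    "   ftp_ascI-ro_bin.bat dfdkdkd d d d d d",
    "$batch_dir/jobs/sftp.bat diriri   ",
    "define ftp=d.x",
    "ftp-out",
]

matched_numbers = [n for n, line in enumerate(SAMPLE, start=1)
                   if ONE_PASS.search(line)]
print(matched_numbers)  # [5, 6, 7, 8, 9, 10, 11, 14]

print(ONE_PASS.search(SAMPLE[9]).groups())  # ('dndndndn', 'nfmfmf', None, None)
```

Lines 5-11 match as requested. Line 14 (`ftp-out`) matches as well, since nothing in the blacklist or the lookbehind rules it out; if that is unwanted, it would need its own exclusion.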

                          Cheers!

                          p.s. Reading "Mastering Regular Expressions (3rd Edition)" is highly recommended. This is hands-down the most useful book I've ever read in my 28 years as a Programmer and Engineer. Pays for itself very quickly!

                          60
                          Advanced User
                          60

                            May 10, 2009 #13

                            It also runs through RegexBuddy just fine. Pretty fast, too.
                            It was very easy to add more specific exclusions.


                            Thanks,
                            :mrgreen: