regex question about how to exclude items

sklad2 · May 05, 2009 · #1 · 2009-05-05T17:26+00:00

I want to find all occurrences of ftp but not sftp,
excluding lines that begin with a # or the word echo.

ftp.bat a b c d
sftp a b c d f
echo ftp 
#  ftp
   #  ftp
   echo ftp
ftp_asc.bat 12 2 3 4
          sftp rkrktkt .
ftp -iv

Lines 1, 7 and 9 should be selected.

pietzcker · May 05, 2009 · #2 · 2009-05-05T21:15+00:00

The simplest Perl regex that works on your example would be ^\s*ftp
That only matches ftp at the start of the line (including optional whitespace).

Is that sufficient?

If you need to match a line like

get ftp.test.com

then this gets complicated and can't be done with a single Perl regex.
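As a quick check of the simple pattern, here is a minimal sketch in Python. Python's re module is an assumption made for illustration (the thread itself uses UltraEdit's Perl-style engine); the sample lines are the nine from the first post.

```python
import re

# The nine sample lines from the first post, in order.
SAMPLE = [
    "ftp.bat a b c d",
    "sftp a b c d f",
    "echo ftp ",
    "#  ftp",
    "   #  ftp",
    "   echo ftp",
    "ftp_asc.bat 12 2 3 4",
    "          sftp rkrktkt .",
    "ftp -iv",
]

# ftp at the start of the line, allowing optional leading whitespace.
SIMPLE = re.compile(r"^\s*ftp")

# 1-based numbers of the lines the pattern selects.
selected = [n for n, line in enumerate(SAMPLE, start=1) if SIMPLE.match(line)]
print(selected)  # [1, 7, 9]
```

A line such as get ftp.test.com is not selected by this pattern, because ftp is not the first word on the line.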

sklad2 · May 05, 2009 · #3 · 2009-05-05T21:54+00:00

^\s*(ftp[^ ]*) ([^ ]+).([^ ]+).([^ ]+).([^ ]+.)

I tried this so I can capture ftp..... until the first space or tab.
It seems to capture "ftp " as well as "ftp.bat"

I also tried to capture all non-space or tabs after the ftp... and I found that this seems to work.

Is there a better way? I want each parameter in a separate capture group.

Thanks,
Steven

pietzcker · May 06, 2009 · #4 · 2009-05-06T07:06+00:00

I'd try to be more specific - in your regex there is the potential risk for a lot of backtracking since . and [^ ] can match the same characters, and if a match fails, then the possible permutations the regex engine has to try before it gives up can be quite a lot. So assuming that you expect at least one and not more than four parameters after ftp(.bat), and that parameters are always separated by spaces, not by tabs, then I'd choose:

^\s*(ftp\S*) (\S+) ?(\S+)? ?(\S+)? ?(\S+)?

I chose \S because it matches any non-whitespace character, so it cannot run across a newline when a line has fewer than four parameters following ftp(.bat). Since you need every parameter in a separate capture group, there is no better way.

\2 will contain the first parameter, \3 through \5 the following parameters, if present.
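To see which group receives which parameter, here is a small sketch that runs the pattern above in Python (an assumption for illustration; the thread targets UltraEdit's Perl-style engine). Python reports an optional group that did not take part in the match as None, where the editor would give an empty backreference.

```python
import re

# Pattern from the post above: the ftp word goes into group 1,
# up to four space-separated parameters into groups 2 to 5.
PARAMS = re.compile(r"^\s*(ftp\S*) (\S+) ?(\S+)? ?(\S+)? ?(\S+)?")

full = PARAMS.match("ftp_asc.bat 12 2 3 4")
print(full.groups())   # ('ftp_asc.bat', '12', '2', '3', '4')

short = PARAMS.match("ftp -iv")
print(short.groups())  # ('ftp', '-iv', None, None, None)
```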

sklad2 · May 06, 2009 · #5 · 2009-05-06T13:06+00:00

Thanks again for the great advice!!!!
Everything works great!!!

May 07, 2009 · #6 · 2009-05-07T16:17+00:00

One small problem I found: I need to go back to excluding lines that begin with
a '#' or 'echo'. If either precedes ftp anywhere on the line, I do not want a match.

#   ftp
    sftp
  echo   ftp oiipo poipipiopoi ipoiipoipi 
    # ftp a  v b 
ftp_ascii.bat 
$batch_dir/jobs/ftp.bat diriri
./ftp.batch kdkdkdd flflflf
/a/vb/x/jobs/ftp_aaa.bat kdkdkd ckckckc
   ftp dkdkdkf fkfkfk
   ftp_bin.ksh   dndndndn nfmfmf
   ftp_ascI-ro_bin.bat dfdkdkd d d d d d
$batch_dir/jobs/sftp.bat diriri   
define ftp=d.x
ftp-out

Lines 5-11 should match.
So for right now, if there is a # or echo in front, do not show it.
I did not realize at first that I had also filtered out all the path stuff, and therefore the lines I wanted to see.

This is what I have so far.

^\s*?[^#e]*[^s]ftp\S*

but it is missing lines beginning with ftp.....

I am now trying this.

^(\s*?[^#e]*[^s])?ftp\S+[ \t]*(\S+)?[ \t]*(\S+)?[ \t]*(\S+)?[ \t]*(\S+)?

but this is taking a long time to go through all 10,000 files.

pietzcker · May 07, 2009 · #7 · 2009-05-07T21:17+00:00

This is a problem

Perl regexes do not support variable-length lookbehind. Therefore there is no one-stop solution.

Two possibilities: On a copy of the files, first remove all the lines that contain # ftp or echo ftp. Then search for the original regex (omitting the leading \s*).

Or (depending what you want to do with the matches) create a macro that will first look for a line that doesn't contain # ftp or echo ftp and then check that line for what you're looking for. And do with it whatever you need.
What sounds more like it should work for you?
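The first possibility (drop the unwanted lines, then search what is left) can be sketched in Python as two passes. This is only an illustration under assumptions: the skip rule treats # or echo as the first word on the line, and the fixed-length lookbehind (?<!s) is one way to keep sftp out; neither is spelled out in the post above.

```python
import re

# Assumption: a line is dropped when its first non-space token is # or echo.
SKIP_LINE = re.compile(r"^\s*(?:#|echo\b)")
# ftp plus any trailing non-space characters, but not when preceded by s.
FTP_WORD = re.compile(r"(?<!s)ftp\S*")

def real_ftp_lines(lines):
    """Pass 1 removes comment/echo lines, pass 2 keeps lines that mention ftp."""
    kept = [ln for ln in lines if not SKIP_LINE.match(ln)]
    return [ln for ln in kept if FTP_WORD.search(ln)]

# A few of the sample lines from the post above.
sample = [
    "#   ftp",
    "    sftp",
    "  echo   ftp oiipo",
    "ftp_ascii.bat ",
    "$batch_dir/jobs/ftp.bat diriri",
    "   ftp dkdkdkf fkfkfk",
    "$batch_dir/jobs/sftp.bat diriri   ",
]

result = real_ftp_lines(sample)
print(result)
# ['ftp_ascii.bat ', '$batch_dir/jobs/ftp.bat diriri', '   ftp dkdkdkf fkfkfk']
```

Because each pass uses a short, anchored or literal pattern, neither has much room to backtrack.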

sklad2 · May 07, 2009 · #8 · 2009-05-07T21:55+00:00

Since I am going through lots of files and subdirectories, would a macro work or is this a script job?

And thanks, that is what I get for using another tool to create the regex.

pietzcker · May 08, 2009 · #9 · 2009-05-08T06:30+00:00

What do you want to do with the matches - do you want to change something or delete them, or do you need a list or...?

sklad2 · May 08, 2009 · #10 · 2009-05-08T13:11+00:00

I need to create a list of the script names containing the following:

ftp something
ftp_ascii. something
ftp_binary. something
ftp. something
ftp*exe something
but not sftp anything.

The list I first created contained lots of scripts. I noticed that some had ftp, or whatever I needed, only on comment lines and the like:

# ftp .......
echo ftp
cat ftp.log
define ftp_log=rororo
...
I do not need these scripts.

The same goes for a few other items that can start the line.
Since I only want to generate the list, I copied everything to another subdir --> fixthem,
where I can do what I want to the scripts.
I took your advice and removed all lines that started with [...]# (the ... are spaces or tabs).
Next I removed all lines that started with [....]echo,
and so on down the list. Even with over 10,000 files, UE removed these lines very quickly.
Next I searched for \s*ftp\S* and found a much cleaner list.
I was getting greedy, or lazy (regex pun), trying to do it all in one regex pass.
I am still trying to figure out what lookahead and lookbehind do.
UE worked on these files very fast with the simple regexes. I was finished in under 30 minutes after several removals and tests.
Usually it is the simple process that makes the most sense. I was so engrossed in making this regex work that I did not think about a simpler solution.
I usually have a hard time with macros in UE; maybe all of this could be done in a macro.
Even this manual process was much easier than going through the first set of matches, with over 1,000 scripts reporting ftp something. The list is now down to about 300 real "ftp" scripts.
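For anyone who prefers a script to the manual passes, the same list of script names could be produced roughly like this. It is a hypothetical Python sketch, not what was actually run: the function name scripts_using_ftp is made up, and the skip list (#, echo, cat, define) is assumed from the examples in this post.

```python
import re
from pathlib import Path

# Assumed blacklist of line starts, taken from the examples in this post.
SKIP_LINE = re.compile(r"^\s*(?:#|echo\b|cat\b|define\b)")
# ftp plus any trailing non-space characters, but not as part of sftp.
FTP_WORD = re.compile(r"(?<!s)ftp\S*")

def scripts_using_ftp(root):
    """Return sorted paths under root that mention ftp on a non-skipped line."""
    hits = []
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file():
            continue
        # Undecodable bytes are ignored so odd files do not stop the walk.
        lines = path.read_text(errors="ignore").splitlines()
        if any(FTP_WORD.search(ln) for ln in lines if not SKIP_LINE.match(ln)):
            hits.append(path)
    return hits
```

Calling scripts_using_ftp("fixthem") and printing each returned path would give the list directly, without editing the copied files.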

pietzcker · May 08, 2009 · #11 · 2009-05-08T15:56+00:00

Great. I guess I would have done it the same way. No need to devise a fiendishly clever solution that takes hours to develop when you can spend 30 minutes once and be done with it. Even if it is a terribly boring task... But the inner geek will always compel us to SOLVE THE PROBLEM!

Lookaround isn't all that difficult. A so-called lookaround assertion checks whether it is possible (positive lookaround) or impossible (negative lookaround) to match a regular expression, either starting from the current position (lookahead) or ending there (lookbehind). However, it doesn't consume any characters of the string it's looking at; the regex engine stays at the same position in the string. So foo(?=bar) will match foo (and nothing more) in the string foobar, but not in the string foobaz.
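The same behaviour can be tried out in a few lines of Python (chosen here only for illustration; the lookaround syntax is the same as in Perl-style engines). The ftp strings are made-up test inputs.

```python
import re

# Positive lookahead: foo matches only when bar follows, and bar is not consumed.
ahead = re.search(r"foo(?=bar)", "foobar")
print(ahead.group(), ahead.span())   # foo (0, 3)
ahead_miss = re.search(r"foo(?=bar)", "foobaz")
print(ahead_miss)                    # None

# Negative lookahead: foo matches only when bar does not follow.
not_ahead = re.search(r"foo(?!bar)", "foobaz")
print(not_ahead.group())             # foo

# Negative lookbehind: ftp matches only when the character before it is not s.
behind_miss = re.search(r"(?<!s)ftp", "sftp host")
print(behind_miss)                   # None
behind = re.search(r"(?<!s)ftp", "run ftp.bat")
print(behind.span())                 # (4, 7)
```

The span (0, 3) shows that the match ends right after foo: the lookahead inspected bar without making it part of the match.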

ridgerunner · May 08, 2009 · #12 · 2009-05-08T21:13+00:00

Interesting thread (and challenging problem). This is an excellent example of the type of complex regex problem that is readily solved in one (multi-step) whack with PowerGrep, which can first section the file (eliminating the various kinds of comment lines), then process the remaining lines and generate an output file with just the desired ftp list data. If you are into regexes, PowerGrep is an amazingly powerful (albeit spendy) tool.

That said, here is a regex that should do the trick in one whack.

^(?>\s*)(?!#|echo|cat|define).*?(?<!s)ftp\S*(?:[ \t]+(\S+))?(?:[ \t]+(\S+))?(?:[ \t]+(\S+))?(?:[ \t]+(\S+))?

Here it is again with comments in free-spacing mode:

^(?>\s*)                     # atomically consume any/all leading whitespace
(?!\#|echo|cat|define)       # ensure first non-space char is not in blacklist
.*?                          # lazily grab everything up to 'ftp' (if any)
(?<!s)ftp\S*                 # match 'ftp*' but only if it is not preceded by 's'
(?:[ \t]+(\S+))?             # capture first parameter into $1
(?:[ \t]+(\S+))?             # capture second parameter into $2
(?:[ \t]+(\S+))?             # capture third parameter into $3
(?:[ \t]+(\S+))?             # capture fourth parameter into $4

This regex avoids a lot of backtracking and should be pretty fast. You can easily add more "blacklisted first words" and more captured parameters if necessary.
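For a check outside the editor, here is a sketch of the same idea in Python against the 14 sample lines posted earlier in the thread. One deliberate swap: Python's re module accepts the atomic group (?>...) only from version 3.11 on, so the leading whitespace is folded into the negative lookahead instead, which rejects the same lines. ONE_PASS and SAMPLE are names chosen for this example.

```python
import re

ONE_PASS = re.compile(r"""
    ^(?!\s*(?:\#|echo|cat|define))  # first word must not be blacklisted
    .*?                             # lazily skip ahead to ftp (spaces, path prefix)
    (?<!s)ftp\S*                    # ftp word, but not as part of sftp
    (?:[ \t]+(\S+))?                # first parameter, group 1
    (?:[ \t]+(\S+))?                # second parameter, group 2
    (?:[ \t]+(\S+))?                # third parameter, group 3
    (?:[ \t]+(\S+))?                # fourth parameter, group 4
""", re.VERBOSE)

# The 14 sample lines from the earlier post, in order.
SAMPLE = [
    "#   ftp",
    "    sftp",
    "  echo   ftp oiipo poipipiopoi ipoiipoipi ",
    "    # ftp a  v b ",
    "ftp_ascii.bat ",
    "$batch_dir/jobs/ftp.bat diriri",
    "./ftp.batch kdkdkdd flflflf",
    "/a/vb/x/jobs/ftp_aaa.bat kdkdkd ckckckc",
    "   ftp dkdkdkf fkfkfk",
    "   ftp_bin.ksh   dndndndn nfmfmf",
    "   ftp_ascI-ro_bin.bat dfdkdkd d d d d d",
    "$batch_dir/jobs/sftp.bat diriri   ",
    "define ftp=d.x",
    "ftp-out",
]

matched = [n for n, line in enumerate(SAMPLE, start=1) if ONE_PASS.match(line)]
print(matched)  # [5, 6, 7, 8, 9, 10, 11, 14]

print(ONE_PASS.match(SAMPLE[5]).groups())   # ('diriri', None, None, None)
print(ONE_PASS.match(SAMPLE[10]).groups())  # ('dfdkdkd', 'd', 'd', 'd')
```

Line 14 (ftp-out) matches as well, since nothing in the pattern rules it out; if it should be excluded, the ftp part of the pattern would need a further restriction.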

Cheers!

p.s. Reading "Mastering Regular Expressions (3rd Edition)" is highly recommended. This is hands-down the most useful book I've ever read in my 28 years as a Programmer and Engineer. Pays for itself very quickly!

sklad2 · May 10, 2009 · #13 · 2009-05-10T09:55+00:00

It runs through RegexBuddy just fine also. Pretty fast too.
It was very easy to add more specific exclusions.

Thanks,