How to find a regex string which is not inside certain tags?

don_bradman · Jul 13, 2017 · #1 · 2017-07-13T09:59+00:00

I have some files that contain lots of URLs, which I need to put inside a tag, say <uri>...</uri>.
Some URLs already have the tag, so I want to find those that are not inside the tags <uri>...</uri>
or <email>...</email>, since emails also contain the keywords .com, .gov, .in, .org, .ftp, .net.

sample text:

<p>IEEE Aerospace and Electronic Systems Magazine is a monthly magazine that publishes articles concerned with the various aspects of systems for space, air, ocean, or ground environments <email>[email protected]</email> as well as news and information of interest to IEEE Aerospace and <email>[email protected]</email> Electronic Systems Society members (ieee.org).</p>
<p>The boundaries of acceptable subject matter has been intentionally left flexible so that the Magazine amiac.lio.in/se can follow the research activities, technology applications and future trends (http://gogl.net/oli?nom=14) to better meet the needs of the members of the IEEE ieee.com.op Aerospace and Electronic Systems Society. IEEE <uri>ieeexplore.ieee.org/themes</uri> Aerospace and Electronic Systems Magazine articles apprise readers of new developments, new applications of cornerstone technology, and news of society members, meetings, and related items.</p>
<p>A description for this result is not available because of this site's robots.txt</p>

The regex I'm currently using is: (\.com|\.gov|\.in|\.org|\.ftp|\.net)(?!(</email>|</uri>))

It is not perfect, as it only excludes a keyword when .com, .gov, .in, .org, .ftp or .net is immediately followed by
</email> or </uri>. A tagged URL with a path, such as <uri>ieeexplore.ieee.org/themes</uri>, is still found.
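
The limitation can be reproduced outside the editor. This is a minimal Python sketch, assuming Python's re module treats this simple pattern the same way as the editor's regex engine; the second test string is a made-up variant of the sample.

```python
import re

# The regex from the post: a keyword not immediately followed by a closing tag.
current = re.compile(r"(\.com|\.gov|\.in|\.org|\.ftp|\.net)(?!(</email>|</uri>))")

tagged_with_path = "<uri>ieeexplore.ieee.org/themes</uri>"  # from the sample text
tagged_at_end = "<uri>ieee.org</uri>"                       # made-up variant

# The keyword is followed by "/themes", so the lookahead does not reject it.
hits_with_path = [m.group(1) for m in current.finditer(tagged_with_path)]
# The keyword is directly followed by "</uri>", so it is rejected as intended.
hits_at_end = [m.group(1) for m in current.finditer(tagged_at_end)]

print(hits_with_path, hits_at_end)
```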

Can anyone help?

fleggy · Jul 13, 2017 · #2 · 2017-07-13T13:20+00:00

OK, let's suppose that all emails are tagged.
You can find a lot of regexes for checking URIs on the internet. I chose one and made some modifications.

1) Tag all URIs that are not followed by a closing </uri> or </email> tag
F:

(?x)
(?>
(?:(https?|ftp)://)?                                      # protocol
(?:
 (?:[a-z0-9$_\.\+!\*\'\(\),;\?&=-]|%[0-9a-f]{2})+           # username
 (?::(?:[a-z0-9$_\.\+!\*\'\(\),;\?&=-]|%[0-9a-f]{2})+)?     # password
 @                                                          # auth requires @
)?                                                        # usr:pwd is optional
(?:
 (?:[a-z0-9]\.|[a-z0-9][a-z0-9-]*[a-z0-9]\.)+             # domain segments
 (?!txt\b|doc\b|zip\b|pdf\b)                              # ignore common file extensions
 [a-z][a-z0-9-]*[a-z0-9]                                  # top level domain
 |                                                        # OR
 (?:(?:\d|[1-9]\d|1\d{2}|2[0-4][0-9]|25[0-5])\.){3}       #
 (?:\d|[1-9]\d|1\d{2}|2[0-4][0-9]|25[0-5])                # IP address
)
(?::\d+)?                                                 # optionally port
(?:
 (?:/+(?:[a-z0-9$_\.\+!\*\'\(\),;:@&=-]|%[0-9a-f]{2})*)*    # path
 (?:\?(?:[a-z0-9$_\.\+!\*\'\(\),;:@&=-]|%[0-9a-f]{2})*)?    # query string
)?                                                        # path and query are optional
(?:
 \#(?:[a-z0-9$_\.\+!\*\'\(\),;:@&=-]|%[0-9a-f]{2})*       # fragment ("#" is escaped: in (?x) mode a bare "#" starts a comment)
)?                                                        # optionally fragment
)                                                         # whole URI as an atomic group to avoid backtracking when negative lookahead fails
(?!</email>|</uri>)

R:

<uri>$&</uri>

This regex also matches file names, so I added a negative lookahead for some file extensions. You will probably have to add other extensions (xls, docx, etc.).
Moreover, the character ")" can be part of a URI. To keep the solution as simple as possible, the regex does not test whether a trailing ")" belongs to the URI. It is much easier to correct that in the 2nd step.
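
The comment about the atomic group in the step 1 regex can be illustrated in isolation. This is a rough Python sketch with a deliberately simplified URL shape (not the full regex above). Since older versions of Python's re module have no (?>...) syntax, the sketch assumes the common emulation of an atomic group: a capturing lookahead followed by a backreference.

```python
import re

URL = r"[a-z0-9.]+\.[a-z]+(?:/[a-z0-9]*)*"   # simplified URL shape, for illustration only
NO_CLOSING_TAG = r"(?!</uri>)"

# Plain version: when the lookahead fails, the engine gives back one character
# of the path and the lookahead then succeeds on a truncated URL.
backtracking = re.compile(URL + NO_CLOSING_TAG)

# Atomic-like version: the URL is captured inside a lookahead and then consumed
# through a backreference, so it cannot be shortened afterwards.
atomic_like = re.compile(r"(?=(" + URL + r"))\1" + NO_CLOSING_TAG)

tagged = "ieeexplore.ieee.org/themes</uri>"
untagged = "amiac.lio.in/se can follow"

leaked = backtracking.match(tagged).group()       # truncated false positive
rejected = atomic_like.match(tagged)              # no match at all
accepted = atomic_like.match(untagged).group()    # untagged URL is still found

print(leaked, rejected, accepted)
```

Without the atomic behaviour, the already tagged URL would be reported again with its last character cut off, which is the failure the atomic group is there to prevent.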

2) change (<uri>...)</uri> to (<uri>...</uri>)
F:

\(<uri>(?:.(?!</uri>))++\K\)</uri>(?!\))

R:

</uri>)
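
For anyone doing the same correction in a script, step 2 might look like this in Python. This is only a sketch: Python's re module has no \K, so a capture group stands in for it, and [^<]* replaces the possessive loop on the assumption that the URI contains no "<". The function name move_paren_outside is made up for this example.

```python
import re

# "(<uri>...)</uri>" -> "(<uri>...</uri>)", unless another ")" follows the tag.
STEP2 = re.compile(r"\(<uri>([^<]*)\)</uri>(?!\))")


def move_paren_outside(text: str) -> str:
    """Move a trailing ')' out of a <uri> element that was opened after '('."""
    return STEP2.sub(r"(<uri>\1</uri>)", text)


fixed = move_paren_outside("trends (<uri>http://gogl.net/oli?nom=14)</uri> to better")
untouched = move_paren_outside("IEEE <uri>ieeexplore.ieee.org/themes</uri> Aerospace")
print(fixed)
print(untouched)
```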

BR, Fleggy

don_bradman · Jul 13, 2017 · #3 · 2017-07-13T13:43+00:00

Hi Fleggy,

Really appreciate your help

But I don't need such a complicated regex, as all the URLs are fairly simple and each of them contains one or more of the keywords
.com, .gov, .in, .org, .ftp, .net. I just want to find those keywords in the entire file as long as they are not
inside either of the tags <email>...</email> and <uri>...</uri>.

So I was hoping there could be a simple regex search using lookahead/lookbehind to find those keywords.
Thank you in advance.
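
One lookahead-only pattern of the kind asked for here is sketched below in Python. It is an assumption-laden shortcut, not a general solution: it treats a keyword as "inside a tag" when the next "<" in the text starts </email> or </uri>, so it relies on the two tags never containing other markup. The email address in the sample string is hypothetical.

```python
import re

KEYWORDS = r"\.(?:com|gov|in|org|ftp|net)\b"
# Reject the keyword if only non-"<" characters separate it from a closing tag.
NOT_BEFORE_CLOSING_TAG = r"(?![^<]*</(?:email|uri)>)"
outside_tags = re.compile(KEYWORDS + NOT_BEFORE_CLOSING_TAG)

sample = (
    "<p>Contact <email>someone@example.org</email> or see (ieee.org).</p>\n"
    "<p>The Magazine amiac.lio.in/se follows future trends "
    "(http://gogl.net/oli?nom=14) for the IEEE ieee.com.op Society. "
    "IEEE <uri>ieeexplore.ieee.org/themes</uri> Aerospace.</p>\n"
    "<p>Not available because of this site's robots.txt</p>"
)

hits = outside_tags.findall(sample)
print(hits)
```

The same pattern text should also be usable directly in an editor's regex search, provided the engine supports negative lookahead.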

fleggy · Jul 13, 2017 · #4 · 2017-07-13T15:37+00:00

Hi Don,

Could you confirm this URL list, which I found in your short sample text?
ieee.org
amiac.lio.in/se
http://gogl.net/oli?nom=14
ieee.com.op

Is it OK? And are you sure that .com, .gov, .in, .org, .ftp, .net are enough?

Thanks, Fleggy

don_bradman · Jul 13, 2017 · #5 · 2017-07-13T15:51+00:00

Yes, Fleggy. Some of the URLs are made up,
but those keywords are the only ones that occur in my files.

fleggy · Jul 13, 2017 · #6 · 2017-07-13T15:57+00:00

But ieee.com.op does not fit your rule

Anyway, here is a simplified version. I think anything simpler will not work correctly.

1)
F:

(?x)
(?>
(?:(https?|ftp)://)?                                      # protocol
(?:[a-z0-9]\.|[a-z0-9][a-z0-9-]*[a-z0-9]\.)+              # domain segments
(?=(?:com|gov|in|org|ftp|net)\b)                          # only certain domains are allowed
(?>[a-z][a-z0-9-]*[a-z0-9])                               # top level domain
(?!\.)
(?:
 (?:/+(?:[a-z0-9$_\.\+!\*\'\(\),;:@&=-]|%[0-9a-f]{2})*)*    # path
 (?:\?(?:[a-z0-9$_\.\+!\*\'\(\),;:@&=-]|%[0-9a-f]{2})*)?    # query string
)?                                                        # path and query are optional
)                                                         # whole URI as an atomic group to avoid backtracking when negative lookahead fails
(?!</email>|</uri>)

R:

<uri>$&</uri>

2)
F:

\(<uri>(?:.(?!</uri>))++\K\)</uri>(?!\))

R:

</uri>)

BR, Fleggy

don_bradman · Jul 13, 2017 · #7 · 2017-07-13T16:56+00:00

Hi Fleggy,
I'm sorry, my explanations in English are not the best, as it is not my native language.
I was not looking for a regex to find a complete URL, just those keywords that are not inside the said tags.
Anyway, I will try to explain what I'm trying to do with a different sample below (which does not contain URLs).

Suppose I want to find a string/regex, e.g. \d\.\u, in a file while excluding the contents of the tags <ab>..</ab> and
<x-df>..</x-df>.

Sample:

<p>IEE55.E Aerospace and Electronic Systems 4.Magazine is a monthly magazine that publishes articles concerned with the various aspects of systems for space, air, ocean, or ground environments <email>[email protected]</email> as well as news and information of interest to IEEE Aerospace and <email>[email protected]</email> Electronic Systems Society members (<std>ieee</std>).</p>
<p>The boundaries of acceptable subject matter has been intentionally left flexible so that the Magazine <ab>a.4.Amiac</ab> can follow the research activities, <x-df>1.P.m</x-df> technology applications and future trends A.10.BB to better <x-df>16.L</x-df> meet the needs of the members of the IEEE Aerospace and Electronic Systems Society. IEEE  Aerospace and Electronic Systems Magazine articles apprise readers of new developments, new applications of cornerstone technology, and news of society members, meetings, and related items.</p>
<p>A description for this result is not available because of this site's robots.txt</p>

If I search the above sample with the regex \d\.\u, the search finds the strings below.

5.E
4.M
4.A
1.P
0.B
6.L

My goal is to find 5.E, 4.M and 0.B (colored green in the original post) and to ignore 4.A, 1.P and 6.L (colored red), as they fall inside the two specified tags whose contents I want the search to skip.

Hopefully, I've made it more understandable

Thanks as always
Sorry if I wasted your time
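
The goal described in this post can be sketched in Python with a different technique from lookarounds: let the regex consume every tagged block as one alternative, and keep only the matches of the other alternative. This is an illustration under assumptions, not something proposed in the thread: [A-Z] stands in for \u (which Python's re does not support), the sample is shortened, and the tags are assumed not to be nested.

```python
import re

SKIP_TAGS = ("ab", "x-df")  # tags whose contents should be ignored
tag_alternatives = "|".join(re.escape(tag) for tag in SKIP_TAGS)

# Alternative 1 swallows a whole <tag>...</tag> block; alternative 2 is the
# actual search, captured in group 2.
pattern = re.compile(rf"<({tag_alternatives})>.*?</\1>|(\d\.[A-Z])", re.DOTALL)

sample = (
    "<p>IEE55.E Aerospace and Electronic Systems 4.Magazine is a monthly magazine. "
    "The Magazine <ab>a.4.Amiac</ab> can follow the research activities, "
    "<x-df>1.P.m</x-df> technology applications and future trends A.10.BB "
    "to better <x-df>16.L</x-df> meet the needs of the members.</p>"
)

# Matches of a tagged block leave group 2 empty, so they are filtered out here.
hits = [m.group(2) for m in pattern.finditer(sample) if m.group(2)]
print(hits)
```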

fleggy · Jul 13, 2017 · #8 · 2017-07-13T17:11+00:00

To be absolutely sure: you need to find the main domain part of each URL that is not surrounded by <uri> and </uri> tags?
E.g. in your first sample:

IEEE Aerospace and Electronic Systems Magazine is a monthly magazine that publishes articles concerned with the various aspects of systems for space, air, ocean, or ground environments <email>[email protected]</email> as well as news and information of interest to IEEE Aerospace and <email>[email protected]</email> Electronic Systems Society members (ieee.org).
The boundaries of acceptable subject matter has been intentionally left flexible so that the Magazine amiac.lio.in/se can follow the research activities, technology applications and future trends (http://gogl.net/oli?nom=14) to better meet the needs of the members of the IEEE ieee.com.op Aerospace and Electronic Systems Society. IEEE <uri>ieeexplore.ieee.org/themes</uri> Aerospace and Electronic Systems Magazine articles apprise readers of new developments, new applications of cornerstone technology, and news of society members, meetings, and related items.
A description for this result is not available because of this site's robots.txt

And domain names should be limited to .com, .gov, .in, .org, .ftp, .net. Is that correct?

Thanks, Fleggy

fleggy · Jul 13, 2017 · #9 · 2017-07-13T17:25+00:00

Well, try this. I hope this is what you want

(?x)
(?:(https?|ftp)://)?                                      # protocol
(?:[a-z0-9]\.|[a-z0-9][a-z0-9-]*[a-z0-9]\.)+              # domain segments
(?=(?:com|gov|in|org|ftp|net)\b)                          # only certain domains are allowed
\K(?>[a-z][a-z0-9-]*[a-z0-9])                             # top level domain
(?!\.)                                                    # do not end prematurely inside url like ieee.com.op
(?=                                                       # the rest of url tested in lookahead
 (?>
  (?:
   (?:/+(?:[a-z0-9$_\.\+!\*\'\(\),;:@&=-]|%[0-9a-f]{2})*)*    # path
   (?:\?(?:[a-z0-9$_\.\+!\*\'\(\),;:@&=-]|%[0-9a-f]{2})*)?    # query string
  )?                                                      # path and query are optional
 )                                                        # and atomic to avoid backtracking when the following negative lookahead fails
 (?!</email>|</uri>)                                      # the additional negative lookahead to check the ending tag
)                                                         # end of lookahead

BR, Fleggy
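
For reference, a possible Python port of the regex above is sketched here. It is an approximation under stated assumptions: Python's re has no \K, so the named group tld (a name chosen for this example) carries the part that the original reports as the match, and the atomic groups are emulated with a capturing lookahead plus backreference. The sample string is shortened and its email address is hypothetical.

```python
import re

PATH = r"(?:[a-z0-9$_.+!*'(),;:@&=-]|%[0-9a-f]{2})*"

TLD_OUTSIDE_TAGS = re.compile(
    rf"""
    (?:(?:https?|ftp)://)?                          # protocol
    (?:[a-z0-9]\.|[a-z0-9][a-z0-9-]*[a-z0-9]\.)+    # domain segments
    (?=(?:com|gov|in|org|ftp|net)\b)                # only certain domains are allowed
    (?=(?P<tld>[a-z][a-z0-9-]*[a-z0-9]))(?P=tld)    # top level domain, emulated atomic group
    (?!\.)                                          # do not stop inside ieee.com.op
    (?=(?P<rest>(?:/+{PATH})*(?:\?{PATH})?))        # path and query, emulated atomic group
    (?P=rest)
    (?!</email>|</uri>)                             # must not be followed by a closing tag
    """,
    re.VERBOSE | re.IGNORECASE,
)

sample = (
    "<p>Contact <email>someone@example.org</email> or see (ieee.org).</p>\n"
    "<p>The Magazine amiac.lio.in/se follows future trends "
    "(http://gogl.net/oli?nom=14) for the IEEE ieee.com.op Society. "
    "IEEE <uri>ieeexplore.ieee.org/themes</uri> Aerospace.</p>\n"
    "<p>Not available because of this site's robots.txt</p>"
)

tlds = [m.group("tld") for m in TLD_OUTSIDE_TAGS.finditer(sample)]
print(tlds)
```

Note that ieee.com.op yields no hit in this port, because "com" is followed by a dot and "op" is not in the allowed list.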

don_bradman · Jul 14, 2017 · #10 · 2017-07-14T14:36+00:00

Thanks fleggy.