Regex to identify a full-stop as a sentence delimiter

dictdoc · Jul 29, 2012 · #1 · 2012-07-29T15:47+00:00

Hello,
Splitting text into sentences at the full-stop, question mark or exclamation mark is a common technique. The question mark and the exclamation mark do not pose much of a problem, but the full-stop raises certain issues as a sentence delimiter because of its varied use, as shown in the examples below:

The temperature was 32.8 degrees Celsius. (Temperature)
His B.Sc. degree was deemed insufficient. (Abbreviation)
He owed the bank USD 4000.50 which he had not paid back. (Currency)
On 27.07.2004 a major earthquake occurred. (Date)
It was 17.05 by the clock. (Time)

to name just a few.
A Perl script would do the job, but since I am working on dynamic data where on-the-fly detection is needed, I am looking for a regex that correctly ignores the cases above and identifies only the full-stops that really end a sentence.
Input:
The temperature was 32.8 degrees Celsius. His B.Sc. degree was deemed insufficient. He owed the bank USD 4000.50 which he had not paid back. On 27.07.2004 a major earthquake occurred. It was 17.05 by the clock.
What I need is a regex that identifies only the full-stops that delimit sentences.
The expected output would be:

The temperature was 32.8 degrees Celsius.
His B.Sc. degree was deemed insufficient.
He owed the bank USD 4000.50 which he had not paid back.
On 27.07.2004 a major earthquake occurred.
It was 17.05 by the clock.

and not, for example:

His B.
Sc.
degree was deemed insufficient.

One of the techniques I tried was the following regex:

\.\w[A-Z]

which was meant to say: locate a full-stop followed by a capitalized word (with or without a space in between), or a full-stop at end of file. But it just didn't work.
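
A quick check suggests why the attempt finds nothing. The pattern \.\w[A-Z] asks for a full-stop, then exactly one word character, then an upper-case letter, so it can never match across the space between two sentences. The sketch below uses Python's re module rather than the editor's own regex engine, which is an assumption made only for illustration, and the names SAMPLE, ATTEMPT, matches and stray are hypothetical.

```python
import re

# The sample input from the post, joined into one string.
SAMPLE = (
    "The temperature was 32.8 degrees Celsius. "
    "His B.Sc. degree was deemed insufficient. "
    "He owed the bank USD 4000.50 which he had not paid back. "
    "On 27.07.2004 a major earthquake occurred. "
    "It was 17.05 by the clock."
)

# The attempted pattern: full-stop, ONE word character, ONE capital letter.
ATTEMPT = re.compile(r"\.\w[A-Z]")

# No sentence boundary in the sample has the shape ".<word char><CAPITAL>",
# because every real boundary has a space right after the full-stop.
matches = ATTEMPT.findall(SAMPLE)
print(matches)

# The pattern only fires on text such as "fig.AB", which is not a boundary.
stray = ATTEMPT.findall("see fig.AB for details")
print(stray)
```

In other words, \w consumes a letter, digit or underscore but never a space, so the "with or without space" part of the intent is not expressed by the pattern.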
AWK or Perl would do the job, but since the data is dynamic and has to be processed on-line, the only solution is a regex.
Many thanks in advance.

Sorry, I posted this earlier without a subject and I guess it did not get registered.

rhapdog · Jul 29, 2012 · #2 · 2012-07-29T19:12+00:00

It's Sunday, so I don't do any 'in-depth' thinking. However, although I won't try to come up with a regex for you, I will give some hints on what to look for.

For starters, for a "full-stop", have the regex look for a ". " (a full-stop with a following space). This will eliminate most of the "false positives", so to speak. However, it won't handle the period at the end of an abbreviation that is followed by a space. In most cases it may be enough to also check whether one of the following conditions holds after the full-stop (and the space, where there is one):
1. The next alphabetic character after any spaces or symbols is capitalized, denoting the beginning of a new sentence.
2. There is an end of line marker before any further alpha characters.
3. There is an end of line marker instead of a space.
4. The full-stop is followed by an end of file.

That won't be 100% accurate, because there is a lot more "intelligence" needed, as there will always be exceptions.
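
As a rough illustration of the four conditions, here is a minimal sketch in Python (chosen only for the example; the editor's own regex flavour may differ). The names BOUNDARY and split_sentences are hypothetical, and "capitalized" is narrowed to the ASCII letters A-Z, which is an assumption.

```python
import re

BOUNDARY = re.compile(
    r"(?<=\.)"                          # a full-stop sits just before this point
    r"(?:[ \t]*\n\s*"                   # conditions 2 and 3: end of line follows
    r"|[ \t]+(?=[^A-Za-z\n]*[A-Z]))"    # condition 1: space, then the next letter is a capital
)

def split_sentences(text):
    """Split text at full-stops that satisfy one of the conditions."""
    # Condition 4 (end of file) needs no pattern: the last piece simply
    # runs to the end of the string.
    return [piece for piece in BOUNDARY.split(text.strip()) if piece]

SAMPLE = (
    "The temperature was 32.8 degrees Celsius. "
    "His B.Sc. degree was deemed insufficient. "
    "He owed the bank USD 4000.50 which he had not paid back. "
    "On 27.07.2004 a major earthquake occurred. "
    "It was 17.05 by the clock."
)

for sentence in split_sentences(SAMPLE):
    print(sentence)
```

On the sample input this yields the five expected sentences. An abbreviation followed by a capitalized word still fools it: "He met Dr. Smith." is cut after "Dr.", which is exactly the kind of exception mentioned above.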

Another example that would be problematic would be "www.ultraedit.com", but by requiring the space character after the "." it will filter out most issues. That is,

.{space|eoln|eof}

Also, if it has a space after it, but two more stops before it, then it is an ellipsis (...) and may not be the end of a sentence, as it may indicate that a portion of a quote was left out. You need to be able to check for that as well.
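
The shorthand above and the ellipsis check could be written as a real pattern roughly as follows. This is only a sketch in Python's re syntax, not a tested solution for the editor: the lookahead stands in for space, end of line or end of file, and the negative lookbehind rejects a stop that has two more stops before it. The names CANDIDATE and candidate_stops are hypothetical.

```python
import re

# "\." is a literal full-stop.
# "(?=\s|\Z)" requires whitespace (space or line end) or end of text after it.
# "(?<!\.\.)" rejects a stop that is preceded by two more stops (an ellipsis).
CANDIDATE = re.compile(r"(?<!\.\.)\.(?=\s|\Z)")

def candidate_stops(text):
    """Return the index of every full-stop that passes both checks."""
    return [m.start() for m in CANDIDATE.finditer(text)]

text = "Visit www.ultraedit.com today. He said... nothing more."
stops = candidate_stops(text)
print(stops)
```

Here the dots inside www.ultraedit.com and the ellipsis are skipped, and only the stops after "today" and "more" remain as candidates.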

I give this info out because, before you can write a solution to a problem, you have to understand the problem. Hopefully someone can take this as a starting point and knock something out that will help.

Mofi · Jul 30, 2012 · #3 · 2012-07-30T05:23+00:00

Following rhapdog's rules, the Perl regular expression search string \. +(\u) with the replace string .\r\n\1 replaces the space(s) after a full-stop with a carriage return plus line-feed if the next character is an upper-case letter. That works quite well for the English example given.
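
For anyone who wants to try the same idea outside the editor, the replace can be sketched in Python. This is an approximation, not the editor's engine: Python's re module has no \u class, so [A-Z] is used in its place (an assumption that limits it to ASCII capitals), and the name break_sentences is hypothetical.

```python
import re

SAMPLE = (
    "The temperature was 32.8 degrees Celsius. "
    "His B.Sc. degree was deemed insufficient. "
    "He owed the bank USD 4000.50 which he had not paid back. "
    "On 27.07.2004 a major earthquake occurred. "
    "It was 17.05 by the clock."
)

def break_sentences(text):
    # "\. +" is a full-stop plus one or more spaces; "([A-Z])" stands in for
    # the "(\u)" of the search string and is put back by "\1" after the
    # carriage return and line-feed.
    return re.sub(r"\. +([A-Z])", r".\r\n\1", text)

lines = break_sentences(SAMPLE).split("\r\n")
for line in lines:
    print(line)
```

On the sample input this produces the five lines requested in the first post. "B.Sc. degree" is left alone because the letter after the space is lower case.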

A full-stop at the end of a line or at the end of the file does not need to be found, because no modification is necessary in these cases. Ellipses do not need special treatment either, because ... is either at the end of a sentence, followed by a space and an upper-case letter, or in the middle of a sentence, where the next letter (after the space) is lower case.

Of course, false positive detections are nevertheless possible, especially in other languages like German, where every noun is capitalized. Even the grammar checking routines of Microsoft Word often fail to identify punctuation marks in German text correctly. For such texts a human with good knowledge of the grammar of the language must read the text and make the corrections.
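
A small German example shows the kind of false positive meant here. It is only a sketch in Python, using the same "full-stop, spaces, upper-case letter" rule as a split pattern. The names BOUNDARY, german and parts are hypothetical, and [A-Z] is an assumption that ignores capitals such as Ä, Ö and Ü.

```python
import re

# Same rule as before: split at spaces that follow a full-stop and
# precede an upper-case letter.
BOUNDARY = re.compile(r"(?<=\.) +(?=[A-Z])")

# Two sentences; "3." is an ordinal number and "Mai" is a capitalized noun.
german = "Am 3. Mai kam er an. Dann ging er."

parts = BOUNDARY.split(german)
for part in parts:
    print(part)
```

The text contains two sentences, but the rule produces three pieces, because the ordinal "3." is followed by the capitalized noun "Mai".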