A regex to identify medial consonant combos in an IPA Transcription

gimley · Mar 01, 2017 · #1 · 2017-03-01T11:48+00:00

I am working on identifying consonant combos in an English-to-IPA dictionary which I have compiled.
A short sample is given below.

ability=ə'bɪlɪti:
abject='æbʤekt
abjection=æb'ʤekʃən
abjections=æb'ʤekʃənz
abjectly='æbʤekt-li:
abjuration=,æbʤʊə'reɪʃən
abjurations=,æbʤʊə'reɪʃənz
abjure=əb'ʤʊər
abjured=əb'ʤʊərəd
abjures=əb'ʤʊəz
abjuring=əb'ʤʊərɪŋg
ablative='æb'lətɪv
ablatives='æb'lətɪvz
ablaut='æb'l aʊt
ablauts='æb'l aʊts
ablaze=ə'bleɪz
able='eɪbəl
able-bodied=,eɪbəl-'bɔːdi:d
abler='eɪblər
ablest='eɪblest
ablution=ə'bluːʃən
ablutions=ə'bluːʃənz
ably='eɪb-li:
abnegation=,æbnɪ'geɪʃən

As can be seen, the data structure is:

English word=IPA

What I need is a macro or regex replace which will identify all consonant conjuncts, i.e. two or more consonants coming together, under the following conditions:

Only in the IPA column, i.e. the column to the right of the = sign.
Only when not in initial or final position, i.e. in the middle of the IPA string.
Conjuncts are consonants which come together in the middle of the word and are not separated by a ' or a , [stress markers]. 2 or 3 consonants can come together; 3-consonant combos are rare.
All the consonants which can form combos in medial position are given below.

[bdfghjklmnprstvwyzðŋʃʒʤʧθ]

Explanation:
Thus a combo of the type b'l in the middle of the word such as

'æb'lətɪv will not be considered since the consonants are separated by a '
able-bodied=,eɪbəl-'bɔːdi:d since l and b are separated by a hyphen and also a '

but

'ɔːdliː will be identified since the two consonants immediately come together.

At present I am doing the cleanup by hand. A macro which would identify such combinations and separate the consonant clusters by a + would be most useful as in:

,æbnɪ'geɪʃən

would become

,æb+nɪ'geɪʃən

Three-consonant combos are rare and, if found, could be separated with a + sign.
Many thanks for solving this conundrum.

L.Willms · Mar 01, 2017 · #2 · 2017-03-01T12:01+00:00

What about [^bdfghjklmnprstvwyzðŋʃʒʤʧθ][bdfghjklmnprstvwyzðŋʃʒʤʧθ]{2,3}[^bdfghjklmnprstvwyzðŋʃʒʤʧθ]?

I.e. 2 or 3 medial consonants preceded and followed by a character which is not a medial consonant. Note that this requires an actual character on each side, so it does not match at the very start or end of a line.

Placed first inside the square brackets, the circumflex ^ negates the character class, i.e. the class then matches any character not listed. Outside the brackets, ^ would instead anchor the match to the start of a line.

{2,3} instructs the regex engine to consider only sequences of 2 or 3 characters of the character class given before.
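As a quick check of the two constructs just described, here is a minimal Python sketch. It uses Python's `re` module rather than UltraEdit's engine, and the names `CONS`, `CLUSTER` and `find_clusters` are made up for this illustration.

```python
import re

# Consonant class copied from the original question.
CONS = "bdfghjklmnprstvwyzðŋʃʒʤʧθ"

# [^...] (circumflex as first character inside the brackets) matches one
# character that is NOT in the class; {2,3} allows 2 or 3 repetitions.
CLUSTER = re.compile(f"[^{CONS}]([{CONS}]{{2,3}})[^{CONS}]")


def find_clusters(ipa: str) -> list[str]:
    """Return runs of 2-3 consonants that have a non-consonant on both sides."""
    return [m.group(1) for m in CLUSTER.finditer(ipa)]


print(find_clusters(",æbnɪ'geɪʃən"))  # ['bn']
print(find_clusters("'æb'lətɪv"))     # []
```

One caveat of this form: the non-consonant on each side is consumed by the match, so two clusters separated by a single vowel cannot both be found in one pass. Lookarounds avoid that.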

To insert a plus sign "+" in front of the medial consonant combo, I suggest this

([^bdfghjklmnprstvwyzðŋʃʒʤʧθ])([bdfghjklmnprstvwyzðŋʃʒʤʧθ]{2,3})([^bdfghjklmnprstvwyzðŋʃʒʤʧθ])
replace by
$1+$2$3

I pass on Mofi's recommendation to use the Perl regex engine.

I hope that I have not mixed various dialects of the regex language.

Mar 01, 2017 · #3 · 2017-03-01T12:50+00:00

The above would reproduce only the one directly adjacent non-medial-consonant character instead of the whole string. Correct would be:

([^bdfghjklmnprstvwyzðŋʃʒʤʧθ]*)([bdfghjklmnprstvwyzðŋʃʒʤʧθ]{2,3})([^bdfghjklmnprstvwyzðŋʃʒʤʧθ]*)
replace by
$1+$2$3

Although I am uncertain about the proper nesting of [] and (), and about where to place the negation operator "^" and the repetition operator "*" (arbitrary number of occurrences, including none).

The parentheses "()" include a sub-expression which can be referenced at a later place in the expression by their sequence number.
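To illustrate how numbered groups are referenced from the replacement, here is a small Python sketch. It is an illustration only: Python's `re` module writes backreferences as `\1`, `\2`, ... where the posts above use `$1`, `$2`, ..., and the variable names are invented for the example.

```python
import re

CONS = "bdfghjklmnprstvwyzðŋʃʒʤʧθ"

# Group 1: preceding non-consonant, group 2: the cluster,
# group 3: following non-consonant.
pattern = re.compile(f"([^{CONS}])([{CONS}]{{2,3}})([^{CONS}])")

ipa = ",æbnɪ'geɪʃən"
match = pattern.search(ipa)
print(match.groups())   # ('æ', 'bn', 'ɪ')

# Re-emit the three groups in order, with a "+" in front of the cluster.
marked = pattern.sub(r"\1+\2\3", ipa)
print(marked)           # ,æ+bnɪ'geɪʃən
```

Text outside the match is left untouched by the replace, so the groups only need to cover the characters directly around the cluster.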

fleggy · Mar 01, 2017 · #4 · 2017-03-01T14:50+00:00

Hi Gimley,

here is the first iteration of the Perl pattern which finds only the first combo in the transcription:

^([^=]+=[',]?+..*?)\K([bdfghjklmnprstvwyzðŋʃʒʤʧθ]{2,3})[',:]?(?=.+$)

I am working on another pattern which will find all combos in one transcription. I hope your UE supports all the features used.

BR, Fleggy

Are you sure that y should be among consonants?

EDIT 1:

The 2nd version which finds all combos (I removed y from the list):

(?<![',=])[bdfghjklmnprstvwzðŋʃʒʤʧθ]{2,3}(?=[^',:\r\n])(?!.*=)
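For readers without UltraEdit at hand, the pattern above can be tried in Python, whose `re` module supports the same lookbehind and lookahead syntax. This is only a sketch: the pattern is copied verbatim, while `FIND_ALL` and `medial_clusters` are names invented here, and the behaviour of UltraEdit's own engine is not verified by it.

```python
import re

# Pattern copied verbatim from the post (y removed from the class).
FIND_ALL = re.compile(
    r"(?<![',=])[bdfghjklmnprstvwzðŋʃʒʤʧθ]{2,3}(?=[^',:\r\n])(?!.*=)"
)


def medial_clusters(line: str) -> list[str]:
    """Return every cluster the pattern finds in one 'word=IPA' line."""
    return FIND_ALL.findall(line)


for line in ("abnegation=,æbnɪ'geɪʃən", "ablative='æb'lətɪv", "abject='æbʤekt"):
    print(line, medial_clusters(line))
# abnegation=,æbnɪ'geɪʃən ['bn']
# ablative='æb'lətɪv []
# abject='æbʤekt ['bʤ']
```

The trailing (?!.*=) is what keeps matches out of the English column: any candidate that still has an = somewhere after it on the line is rejected. The final kt in 'æbʤekt is skipped because the positive lookahead needs one more character after the cluster.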

For three consonant combos use this Perl replace:
Find what: (?<![',=])([bdfghjklmnprstvwzðŋʃʒʤʧθ])([bdfghjklmnprstvwzðŋʃʒʤʧθ]{2})(?=[^',:\r\n])(?!.*=)
Replace with: \1+\2

EDIT 2:

I see only a two consonant combo (bn) in ,æbnɪ'geɪʃən and the pattern above finds only bn in this transcription. Did I get something wrong?

gimley · Mar 01, 2017 · #5 · 2017-03-01T23:22+00:00

Dear Fleggy,

Many thanks for all the help on the forum.

The Perl regex works perfectly. However, when I use the search string (?<![',=])[bdfghjklmnpstvwzðŋʃʒʤʧθ]{2,3}(?=[^',:\r\n])(?!.*=) with the replace string \1+\2, the regex engine "eats up" the consonant clusters: bolster='bolstər becomes bolster='bo+ər.

How do I search and replace where + is inserted between 2 consonant strings?

I have an old UltraEdit 15.20. Is that responsible?

Incidentally y in IPA is a semi-consonant and acts as a glue to bind the preceding consonant.

fleggy · Mar 02, 2017 · #6 · 2017-03-02T06:23+00:00

Hi Gimley,

I am not sure what you need to do. If you want to insert + after the first consonant in any consonant cluster then use this Perl search:

(?<![',=])([bdfghjklmnprstvwzðŋʃʒʤʧθ])([bdfghjklmnprstvwzðŋʃʒʤʧθ]{1,})(?=[^',:\r\n])(?!.*=)

and as replace string \1+\2

If you need to "break up" only three consonant combos then use this Perl search:

(?<![',=])([bdfghjklmnprstvwzðŋʃʒʤʧθ])([bdfghjklmnprstvwzðŋʃʒʤʧθ]{2})(?=[^',:\r\n])(?!.*=)

and as replace string (again) \1+\2
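The first of the two searches above (the {1,} variant) can be sketched in Python as follows. The pattern and the replace string are copied from the post; `SPLIT` and `mark_clusters` are hypothetical names, and the pattern is applied one line at a time because (?!.*=) only looks ahead to the end of the current line.

```python
import re

# Two capturing groups: the first consonant, then the rest of the cluster.
SPLIT = re.compile(
    r"(?<![',=])([bdfghjklmnprstvwzðŋʃʒʤʧθ])([bdfghjklmnprstvwzðŋʃʒʤʧθ]{1,})"
    r"(?=[^',:\r\n])(?!.*=)"
)


def mark_clusters(line: str) -> str:
    """Insert '+' after the first consonant of each medial cluster."""
    return SPLIT.sub(r"\1+\2", line)


for line in ("abnegation=,æbnɪ'geɪʃən", "bolster='bolstər", "ablative='æb'lətɪv"):
    print(mark_clusters(line))
# abnegation=,æb+nɪ'geɪʃən
# bolster='bol+stər
# ablative='æb'lətɪv
```

With both groups present in the search, the cluster is kept and only the + is added.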

BR, Fleggy

Mofi · Mar 02, 2017 · #7 · 2017-03-02T06:26+00:00

Gimley,

I have added colors to the replies to make it easier to see where parentheses are used to build a capturing group and which groups exist in the search string.

(?<![',=]) ... is a negative lookbehind. It only checks that the preceding character is none of ' , = and does not itself consume any character.

[bdfghjklmnpstvwzðŋʃʒʤʧθ]{2,3} ... matches 2 or 3 characters, but not in a capturing group.

(?=[^',:\r\n]) ... is a positive lookahead. It checks that the next character is none of ' , : or a line break, again without consuming any character.

(?!.*=) ... is a negative lookahead. It checks that no = follows anywhere on the rest of the line, without consuming any character.

You have used the replace string \1+\2, which references two capturing groups, but this search string does not contain any capturing group at all. Therefore both backreferences in the replace string reference nothing. The other search string posted by Fleggy, for which the replace string was also given, contains two capturing groups, as you can hopefully see now that I have added the color formatting.
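The difference between the two search strings can also be checked programmatically. The following Python sketch counts the capturing groups of each pattern; `CONS`, `no_groups` and `two_groups` are names chosen for this example, and Fleggy's consonant class is used for both patterns. Note one difference in behaviour: Python's `re` module raises an error for a backreference to a group that does not exist, rather than silently inserting nothing as observed in UltraEdit above.

```python
import re

CONS = "bdfghjklmnprstvwzðŋʃʒʤʧθ"

# Search string without parentheses around the consonant classes.
no_groups = re.compile(rf"(?<![',=])[{CONS}]{{2,3}}(?=[^',:\r\n])(?!.*=)")
# Search string with two capturing groups.
two_groups = re.compile(
    rf"(?<![',=])([{CONS}])([{CONS}]{{1,}})(?=[^',:\r\n])(?!.*=)"
)

# Lookarounds such as (?<!...), (?=...) and (?!...) do not count as groups.
print(no_groups.groups)    # 0
print(two_groups.groups)   # 2

try:
    no_groups.sub(r"\1+\2", "bolster='bolstər")
    raised = False
except re.error:
    # Python rejects \1 because the pattern defines no group 1.
    raised = True
print(raised)              # True
```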

UltraEdit, UltraCompare, UEStudio forums