A regex to identify medial consonant combos in an IPA Transcription

    Mar 01, 2017#1

    I am working on identifying consonant combos in an English-to-IPA dictionary which I have compiled.
    A short sample is given below.

    ability=ə'bɪlɪti:
    abject='æbʤekt
    abjection=æb'ʤekʃən
    abjections=æb'ʤekʃənz
    abjectly='æbʤekt-li:
    abjuration=,æbʤʊə'reɪʃən
    abjurations=,æbʤʊə'reɪʃənz
    abjure=əb'ʤʊər
    abjured=əb'ʤʊərəd
    abjures=əb'ʤʊəz
    abjuring=əb'ʤʊərɪŋg
    ablative='æb'lətɪv
    ablatives='æb'lətɪvz
    ablaut='æb'l aʊt
    ablauts='æb'l aʊts
    ablaze=ə'bleɪz
    able='eɪbəl
    able-bodied=,eɪbəl-'bɔːdi:d
    abler='eɪblər
    ablest='eɪblest
    ablution=ə'bluːʃən
    ablutions=ə'bluːʃənz
    ably='eɪb-li:
    abnegation=,æbnɪ'geɪʃən
    
    As can be seen, the data structure is:

    English word=IPA

    What I need is a macro or regex replace which will identify all consonant conjuncts, i.e. two consonants coming together, under the following conditions:
    1. Only in the IPA column i.e. the column to the right.
    2. Only when not in initial or final position i.e. in the middle of the IPA string.
    3. Conjuncts are consonants which come together in the middle of the word and are not separated by a ' or a , [stress markers]. 2 or 3 consonants can come together. 3 consonant combos are rare.
    4. All the consonants which can form combos in medial position are given below.

    [bdfghjklmnprstvwyzðŋʃʒʤʧθ]
    Explanation:
    A combo of the type b'l in the middle of the word will not be considered when the consonants are separated by a stress marker or a hyphen:

    'æb'lətɪv (b and l are separated by a ')
    able-bodied=,eɪbəl-'bɔːdi:d (l and b are separated by a hyphen and also a ')

    but

    'ɔːdliː will be identified, since the two consonants come together immediately.
    At present I am doing the cleanup by hand. A macro which would identify such combinations and separate the consonant clusters by a + would be most useful as in:

    ,æbnɪ'geɪʃən
    would become

    ,æb+nɪ'geɪʃən
    Three consonant combos are rare and if found could be separated with + sign.
    Many thanks for solving this conundrum.
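
The four conditions above can be sketched outside the editor as well. The following is a minimal Python sketch, not an UltraEdit macro: it splits each entry at the first = so that only the IPA column is examined, and it treats leading stress markers as not being part of the string. The names `CONSONANTS`, `CLUSTER` and `medial_clusters` are made up for this illustration.

```python
import re

# Consonant set copied from the question (it includes y).
CONSONANTS = "bdfghjklmnprstvwyzðŋʃʒʤʧθ"
# Runs of 2 or 3 consonants; stress markers and hyphens break a run.
CLUSTER = re.compile(f"[{CONSONANTS}]{{2,3}}")


def medial_clusters(entry):
    """Return the consonant clusters found strictly inside the IPA column."""
    _word, _sep, ipa = entry.partition("=")  # condition 1: IPA column only
    core = ipa.lstrip("',")                  # assumption: ignore leading stress markers
    found = []
    for match in CLUSTER.finditer(core):
        # condition 2: skip clusters in initial or final position
        if match.start() == 0 or match.end() == len(core):
            continue
        found.append(match.group())
    return found


for entry in ["abnegation=,æbnɪ'geɪʃən", "ablative='æb'lətɪv", "abject='æbʤekt"]:
    print(entry, medial_clusters(entry))
```

In this sketch the final kt of abject is skipped because it ends the string, and b'l in ablative is never a run because the stress marker is not in the consonant set.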

      Re: A regex to identify medial consonant combos in an IPA Transcription

      Mar 01, 2017#2

      What about [^bdfghjklmnprstvwyzðŋʃʒʤʧθ][bdfghjklmnprstvwyzðŋʃʒʤʧθ]{2,3}[^bdfghjklmnprstvwyzðŋʃʒʤʧθ]?

      I.e. 2 or 3 medial consonants preceded and succeeded by a character which is not a medial consonant. Because a character is required on both sides, a cluster at the very start or end of the string is not matched.

      The circumflex ^, when it is the first character inside the square brackets, negates the class, i.e. it matches any character that is not listed.

      {2,3} instructs the regex engine to consider only sequences of 2 or 3 characters of the character class given before.

      To insert a plus sign "+" in front of the medial consonant combo, I suggest this

      ([^bdfghjklmnprstvwyzðŋʃʒʤʧθ])([bdfghjklmnprstvwyzðŋʃʒʤʧθ]{2,3})([^bdfghjklmnprstvwyzðŋʃʒʤʧθ])
      replace by
      $1+$2$3

      I pass on Mofi's recommendation to use the Perl regex engine.

      I hope that I have not mixed various dialects of the regex language.
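
Where the circumflex sits matters in most regex dialects: as the first character inside square brackets it negates the class, while outside brackets it anchors the match to the start of the line. The Python sketch below shows the difference on one transcription from the sample; Python's `re` module stands in for the editor's engine here, and the names `C`, `negated` and `anchored` are only illustrative.

```python
import re

C = "bdfghjklmnprstvwyzðŋʃʒʤʧθ"  # consonant set from the question

# Caret inside the brackets: "any character that is NOT a listed consonant".
negated = re.compile(f"([^{C}])([{C}]{{2,3}})([^{C}])")
# Caret outside the brackets: "start of line", followed by 3 or 4 consonants.
anchored = re.compile(f"^[{C}][{C}]{{2,3}}")

ipa = ",æbnɪ'geɪʃən"

print(negated.search(ipa).group(0))       # cluster plus one neighbour on each side
print(anchored.search(ipa))               # None: the string starts with a comma
print(anchored.search("strɪŋ").group(0))  # matches only at the very beginning

# Plus sign in front of the cluster, as proposed in this post.
result = negated.sub(r"\1+\2\3", ipa)
print(result)
```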

        Mar 01, 2017#3

        The above would reproduce only the one directly adjacent non-medial consonant instead of the whole string. Correct would be:

        ([^bdfghjklmnprstvwyzðŋʃʒʤʧθ]*)([bdfghjklmnprstvwyzðŋʃʒʤʧθ]{2,3})([^bdfghjklmnprstvwyzðŋʃʒʤʧθ]*)
        replace by
        $1+$2$3

        Although I am uncertain about the proper nesting of [] and (), and about where to place the negation operator "^" and the repetition operator "*" (arbitrary number of occurrences, including none).

        The parentheses "()" include a sub-expression which can be referenced at a later place in the expression by their sequence number.

          Mar 01, 2017#4

          Hi Gimley,

          here is the first iteration of the Perl pattern which finds only the first combo in the transcription:

          ^([^=]+=[',]?+..*?)\K([bdfghjklmnprstvwyzðŋʃʒʤʧθ]{2,3})[',:]?(?=.+$)

          I am working on another pattern which will find all combos in one transcription. I hope your UE supports all used features :)

          BR, Fleggy

          Are you sure that y should be among consonants?

          EDIT 1:

          The 2nd version which finds all combos (I removed y from the list):

          (?<![',=])[bdfghjklmnprstvwzðŋʃʒʤʧθ]{2,3}(?=[^',:\r\n])(?!.*=)

          For three consonant combos use this Perl replace:
          Find what: (?<![',=])([bdfghjklmnprstvwzðŋʃʒʤʧθ])([bdfghjklmnprstvwzðŋʃʒʤʧθ]{2})(?=[^',:\r\n])(?!.*=)
          Replace with: \1+\2

          EDIT 2:

          I see only a two consonant combo (bn) in ,æbnɪ'geɪʃən and the pattern above finds only bn in this transcription. Did I get something wrong?

            Mar 01, 2017#5

            Dear Fleggy,

            Many thanks for all help on the forum.

            The Perl regex works perfectly. However when I use the search string (?<![',=])[bdfghjklmnpstvwzðŋʃʒʤʧθ]{2,3}(?=[^',:\r\n])(?!.*=) with the replace string \1+\2 the regex engine "eats up" the consonant clusters. bolster='bolstər becomes bolster='bo+ər.

            How do I search and replace where + is inserted between 2 consonant strings?

            I have an old UltraEdit 15.20. Is that responsible?

            Incidentally y in IPA is a semi-consonant and acts as a glue to bind the preceding consonant.

              Mar 02, 2017#6

              Hi Gimley,

              I am not sure what you need to do. If you want to insert + after the first consonant in any consonant cluster then use this Perl search:

              (?<![',=])([bdfghjklmnprstvwzðŋʃʒʤʧθ])([bdfghjklmnprstvwzðŋʃʒʤʧθ]{1,})(?=[^',:\r\n])(?!.*=)

              and as replace string \1+\2

              If you need to "break up" only three consonant combos then use this Perl search:

              (?<![',=])([bdfghjklmnprstvwzðŋʃʒʤʧθ])([bdfghjklmnprstvwzðŋʃʒʤʧθ]{2})(?=[^',:\r\n])(?!.*=)

              and as replace string (again) \1+\2

              BR, Fleggy
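
Python's `re` module supports the lookbehind and lookahead constructs used in the first search above, so it can be tried outside UltraEdit. This is only a sketch of the same idea: `C`, `SPLIT_FIRST` and `split_clusters` are names invented here, and `+` is written instead of `{1,}`, which means the same thing.

```python
import re

C = "bdfghjklmnprstvwzðŋʃʒʤʧθ"  # consonant set without y, as in the post above

# Group 1: first consonant of the cluster. Group 2: the remaining consonants.
# The lookarounds consume nothing; they only restrict where a match may occur.
SPLIT_FIRST = re.compile(rf"(?<![',=])([{C}])([{C}]+)(?=[^',:\r\n])(?!.*=)")


def split_clusters(line):
    """Insert + after the first consonant of each medial cluster in the IPA column."""
    return SPLIT_FIRST.sub(r"\1+\2", line)


for line in ["abnegation=,æbnɪ'geɪʃən", "bolster='bolstər", "ablative='æb'lətɪv"]:
    print(split_clusters(line))
```

Because both halves of the cluster sit in capturing groups, the replacement `\1+\2` puts them back around the plus sign instead of discarding them.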

                Mar 02, 2017#7

                Gimley,

                I have added colors to the replies to make it easier to see where parentheses are used to build a capturing group and which groups exist in search string.

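                (?<![',=]) ... is a negative lookbehind. It only checks that the preceding character is not an apostrophe, a comma or an equals sign, and does not itself consume any character.

                [bdfghjklmnpstvwzðŋʃʒʤʧθ]{2,3} ... matches 2 or 3 characters, but not in a capturing group.

                (?=[^',:\r\n]) ... is a positive lookahead. It checks that a next character exists and is not an apostrophe, a comma, a colon or a line break, without consuming it.

                (?!.*=) ... is a negative lookahead. It checks that no equals sign follows on the rest of the line, without consuming any character.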

                You have used the replace string \1+\2 which references two capturing groups. But this search string does not contain any capturing group at all. Therefore both backreferences in replace string reference nothing. The other search string posted by Fleggy where the replace string was also added contains two capturing groups as you hopefully can see now after I added the color formatting.
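
The number of capturing groups in each search string can also be checked programmatically. The sketch below uses Python's `re` module as a stand-in for the editor's engine. Python itself rejects `\1` in a replacement when the pattern has no groups, so the empty backreferences are emulated here by replacing with a bare `+`. The variable names are illustrative only.

```python
import re

# Search string as used in post #5: no parentheses that capture anything.
NO_GROUPS = r"(?<![',=])[bdfghjklmnpstvwzðŋʃʒʤʧθ]{2,3}(?=[^',:\r\n])(?!.*=)"
# Search string from post #6: two capturing groups.
TWO_GROUPS = (r"(?<![',=])([bdfghjklmnprstvwzðŋʃʒʤʧθ])"
              r"([bdfghjklmnprstvwzðŋʃʒʤʧθ]{1,})(?=[^',:\r\n])(?!.*=)")

print(re.compile(NO_GROUPS).groups)   # lookarounds do not count as capturing groups
print(re.compile(TWO_GROUPS).groups)

line = "bolster='bolstər"
# With nothing captured, \1 and \2 are empty and only the + is written back.
eaten = re.sub(NO_GROUPS, "+", line)
# With two groups, both halves of the cluster are written back around the +.
kept = re.sub(TWO_GROUPS, r"\1+\2", line)
print(eaten)
print(kept)
```
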
                Best regards from an UC/UE/UES for Windows user from Austria