UltraEdit pattern to find/search for joined words

Newbie

    Feb 26, 2015#1

    I am trying to find an UltraEdit pattern to find joined words in a text file. For example, let us consider the following portion of text:

    Code:

    proficiency in or access toEnglish. Reporting mathematicalthinking, even for main languageEnglish-speakers, is not a simpleprocess because of the linguistic and
    I tried with:

    Code:

    [a-z]+[A-Z][a-z]+
    which correctly finds

    Code:

    toEnglish
    and

    Code:

    languageEnglish
    but what if the words

    Code:

    mathematical thinking
    and

    Code:

    simple process
    are joined?

    Any help appreciated.

    P.S.: Question originally posted on Stack Overflow.
    Regards,
    Sandeep
    It is easy to be born, it is difficult to be a human being.:)

    Power User

      Feb 26, 2015#2

      There are some things that a computer program cannot do, or at least cannot do without much more effort than doing them by hand. This is one of them.

      You got away with a partial computer solution because you had a simple case where you could identify that part of the problem and break it down into steps a computer could execute. Asking the computer to split apart a word that starts with a lowercase letter and contains an uppercase letter only appears easy because you can clearly define your requirements and have a set of instructions available that can perform the task. Note that the logic you created would fail to fix DoctorSmith or Doctorsmith or doctorJohnSmith, as these examples do not meet those requirements, yet they are clearly wrong.
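To make the limits of the pattern concrete, here is a minimal sketch that runs the expression from the question in Python's `re` module, used only as a stand-in for UltraEdit's regex engine and assuming case-sensitive matching. The `\b`-anchored variant and the helper name `find_joined` are assumptions added for illustration; they are not part of the original post.

```python
import re

# The pattern from the question: lowercase run, one capital, lowercase run.
CAMEL_JOIN = re.compile(r"[a-z]+[A-Z][a-z]+")

# Assumed whole-word variant; the \b anchors are not in the original post.
CAMEL_JOIN_WHOLE = re.compile(r"\b[a-z]+[A-Z][a-z]+\b")

SAMPLE = (
    "proficiency in or access toEnglish. Reporting mathematicalthinking, "
    "even for main languageEnglish-speakers, is not a simpleprocess "
    "because of the linguistic and"
)


def find_joined(text, pattern=CAMEL_JOIN):
    """Return all non-overlapping matches of the pattern in text."""
    return pattern.findall(text)


if __name__ == "__main__":
    # The sample text: only the two capitalised joins are found.
    print(find_joined(SAMPLE))
    # The three problem cases, unanchored and as whole words.
    for word in ("DoctorSmith", "Doctorsmith", "doctorJohnSmith"):
        print(word, find_joined(word), find_joined(word, CAMEL_JOIN_WHOLE))
```

On the sample text this finds only `toEnglish` and `languageEnglish`; `mathematicalthinking` and `simpleprocess` contain no capital letter, so nothing in the pattern can see them. Unanchored, the pattern picks up fragments of two of the problem cases (`octorSmith`, `doctorJohn`), and with whole-word anchors it matches none of the three.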

      While you or I can look at a value and see that it is actually joined words without the delimiter that should separate them, for UltraEdit or any program to perform this task would require a tremendous amount of logic and extensive dictionaries of the languages that would be expected to be found in the text. You cannot just split every word into possible new words, because it is the nature of language to make new words out of old words. I did not run together any words on purpose, yet here in this small amount of text are several words that are naturally formed from two or more words, or that appear to be so formed: "cannot", "apart", "away", "lowercase", "uppercase", "together", "without". You would have to identify a word that is not found in your dictionary and then break it apart again and again until you end up with words that are in your dictionary. You would have to allow for new words and acronyms.
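The dictionary approach described above could look roughly like the following sketch. The word list `WORDS` is a hypothetical toy set chosen for illustration, and `segmentations` is an invented helper name; a real run would need a full dictionary for every language expected in the text.

```python
from functools import lru_cache

# Hypothetical toy word list; a real run would need a full dictionary file.
WORDS = frozenset({
    "mathematical", "thinking", "simple", "process", "to", "english",
    "language", "can", "not", "cannot", "a", "part", "apart",
    "with", "out", "without",
})


def segmentations(token, words=WORDS):
    """Return every way to split token into dictionary words (lowercased)."""
    token = token.lower()

    @lru_cache(maxsize=None)
    def split_from(start):
        # Reaching the end of the token means one complete split was found.
        if start == len(token):
            return ((),)
        results = []
        for end in range(start + 1, len(token) + 1):
            piece = token[start:end]
            if piece in words:
                for rest in split_from(end):
                    results.append((piece,) + rest)
        return tuple(results)

    return [list(parts) for parts in split_from(0)]


if __name__ == "__main__":
    for token in ("mathematicalthinking", "simpleprocess", "cannot", "without"):
        print(token, segmentations(token))
```

With this toy list, `mathematicalthinking` and `simpleprocess` each have exactly one split, but `cannot` and `without` also come back as `can not` and `with out` next to the intact word. Such a tool can therefore only propose candidates; deciding which ones are really errors still needs a human reader.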

      It would be just too much work.

      Grand Master

        Feb 27, 2015#3

        Mick and the commenters on Stack Overflow are absolutely right. The large word database in your brain, built up over years, can't be replaced by a regular expression. What you can do is run Edit - Spell Check and let the spell checker identify all words not included in the dictionary. That's it.
        Best regards from an UC/UE/UES for Windows user from Austria

        Newbie

          Feb 27, 2015#4

          Thank you Mick and Mofi.
          Regards,
          Sandeep
          It is easy to be born, it is difficult to be a human being.:)