Find splits of compound words within a dictionary file

    May 02, 2012 #1

    Dear all,
    I am working with names and have a large file of names in which some entries consist of several words written together (up to 4 or 5), while the corresponding single forms are also present in the word list.
    An example should make this clear:
    annamarie
    mariechristine
    johnsmith
    johnjoseph smith
    john
    smith
    anna
    marie
    mary
    christine
    The program should split the words in the list based on the single forms that are present in it. Thus:
    annamarie anna-marie
    mariechristine marie christine
    johnsmith john smith
    johnjosephsmith
    In the last case, since joseph is missing, the program could suitably tag the missing element and show the word as
    john !joseph! smith
    The script/macro would prove especially helpful for separating words in languages such as German, which have a large number of compound words.
    I have an awk script which does something similar, but it takes its words from an external dictionary, whereas here I need to bootstrap from the list itself.
    Any help given would be gratefully acknowledged.

      Jun 10, 2012 #2

      I first thought that this would not be possible without a dictionary database and therefore did not think much more about this task.

      But today I looked at it again, and I think I have found a well-working solution with the following script:

      Code: Select all

      if (UltraEdit.document.length > 0) {
      
         // Get all words from the file.
         UltraEdit.columnModeOff();
         UltraEdit.activeDocument.selectAll();
         var asWords = UltraEdit.activeDocument.selection.split("\r\n");
         UltraEdit.activeDocument.top();
      
         // If the last line is an empty line, remove it from the list.
         if (!asWords[asWords.length-1].length) asWords.pop();
      
         // Create a new array for included words and their positions in the word.
         var nWordCount = asWords.length;
         var asMultiWords = new Array(nWordCount);
         var nWordIndex = 0;
         while (nWordIndex < nWordCount) asMultiWords[nWordIndex++] = "";
      
         // Find out which words are included completely in other words.
         for (nWordIndex = 0; nWordIndex < nWordCount; nWordIndex++) {
      
            // Skip words already known to include at least one other word,
            // as such a word is surely not a single form.
            if (asMultiWords[nWordIndex].length) continue;
      
            // Get the current word into a separate string.
            var sActWord = asWords[nWordIndex];
      
            // Record in which other words this word is included.
            for (var nIndex = 0; nIndex < nWordCount; nIndex++) {
               var nPos = asWords[nIndex].indexOf(sActWord);
               if (nPos < 0) continue;
               if (nIndex == nWordIndex) continue;
               if (nPos < 10) asMultiWords[nIndex] += "0";
               asMultiWords[nIndex] += nPos.toString() + "|" + sActWord + " ";
               // Note: "Words" longer than 99 characters are not supported by this script.
            }
         }
      
         // Create the result list.
         var sResult = "";
         for (nWordIndex = 0; nWordIndex < nWordCount; nWordIndex++) {
      
            // Ignore the words not containing any other word.
            if (!asMultiWords[nWordIndex].length) continue;
      
            // Add this word, which includes other words, to the result.
            sResult += asWords[nWordIndex] + " =";
      
            // Put the included words again into an array of strings.
            var asFoundWords = asMultiWords[nWordIndex].split(" ");
            asFoundWords.pop();
      
            // Build the result string as requested. This is somewhat involved
            // because the order of the included words must be determined first.
            // Each entry starts with its two-digit position (00 to 99, longer
            // strings are not supported), so sorting the array as strings puts
            // the included words in the correct order.
            // Parts of the word may not be found on any other line.
            // The "word" can also be a string containing spaces, which must
            // be ignored to build the result correctly.
            asFoundWords.sort();
            var nLastPos = 0;
            for (nIndex = 0; nIndex < asFoundWords.length; nIndex++) {
      
               // Split up the string with position in word and included word.
               var sPos = asFoundWords[nIndex].substr(0,2);
               var sWord = asFoundWords[nIndex].substr(3);
               nPos = parseInt(sPos,10);  // Convert the position back to number.
      
               // Is this word the expected next part of the main word?
               if (nPos != nLastPos) {
                  // There is a part of the word not listed on any line.
                  sActWord = asWords[nWordIndex];
      
                  // Ignore spaces at the beginning of the part not included.
                  while (nLastPos < sActWord.length) {
                     if (sActWord[nLastPos] != ' ') break;
                     nLastPos++;
                  }
      
                  // Ignore spaces at the end of the part not included.
                  var nEndPos = nPos - 1;
                  while (nEndPos > nLastPos) {
                     if (sActWord[nEndPos] != ' ') break;
                     nEndPos--;
                  }
      
                  // Was something other than spaces not included?
                  if (nLastPos != ++nEndPos) {
                     // Yes, include this string part enclosed in exclamation marks.
                     sResult += " !" + sActWord.substring(nLastPos,nEndPos) + "!";
                  }
                  nLastPos = nPos;
               }
               // Append the included word and update the position for the next word.
               sResult += " " + sWord;
               nLastPos += sWord.length;
            }
            // After every word append a DOS line termination.
            sResult += "\r\n";
         }
         if (sResult.length) {  // Was any output built?
            UltraEdit.newFile();
            UltraEdit.activeDocument.write(sResult);
            UltraEdit.activeDocument.top();
         }
         else UltraEdit.messageBox("No word included in any other word.");
      }
      The result of this script on your input example is:

      Code: Select all

      annamarie = anna marie
      mariechristine = marie christine
      johnsmith = john smith
      johnjoseph smith = john !joseph! smith
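
For anyone who wants to try the same idea outside UltraEdit, here is a minimal sketch in plain JavaScript. It is an illustration under stated assumptions and not a replacement for the script above: the function name splitCompounds is hypothetical, positions are kept as numbers instead of two-digit strings (so the 99 character limit does not apply), and two behaviours are additions of this sketch: a part that overlaps an already emitted part is skipped, and an unknown remainder at the end of a word is also tagged with exclamation marks.

```javascript
// Portable sketch of the position-based split; splitCompounds is a
// hypothetical helper name and not part of UltraEdit or any library.
function splitCompounds(words) {
  const lines = [];
  for (const word of words) {
    // Collect every other list entry that occurs inside this word,
    // together with the offset of its first occurrence.
    const hits = [];
    for (const part of words) {
      if (part.length === 0 || part === word) continue;
      const pos = word.indexOf(part);
      if (pos >= 0) hits.push({ pos, part });
    }
    if (hits.length === 0) continue; // a single form, nothing to split

    // Numeric sort by offset; on equal offsets the longer part comes first.
    hits.sort((a, b) => a.pos - b.pos || b.part.length - a.part.length);

    const pieces = [];
    let cursor = 0;
    for (const { pos, part } of hits) {
      if (pos < cursor) continue; // assumption: skip overlapping parts
      // Text between the previous part and this one is not in the list.
      const gap = word.slice(cursor, pos).trim();
      if (gap.length > 0) pieces.push("!" + gap + "!");
      pieces.push(part);
      cursor = pos + part.length;
    }
    // Assumption: an unknown remainder at the end is tagged as well.
    const tail = word.slice(cursor).trim();
    if (tail.length > 0) pieces.push("!" + tail + "!");

    lines.push(word + " = " + pieces.join(" "));
  }
  return lines;
}

// Usage with the word list from the first post.
const sample = [
  "annamarie", "mariechristine", "johnsmith", "johnjoseph smith",
  "john", "smith", "anna", "marie", "mary", "christine"
];
console.log(splitCompounds(sample).join("\n"));
// annamarie = anna marie
// mariechristine = marie christine
// johnsmith = john smith
// johnjoseph smith = john !joseph! smith
```

Like the script above, this records only the first occurrence of each single form inside a word, so a name such as annaanna would show anna only once.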