Sorting a file with frequency count on word length

Basic User

    Mar 22, 2013 #1

    Hello,
    I have a file with the following structure:

    word space frequency

    The file contains around 30,000 headwords, each followed by its frequency. The words have different lengths. What I need is a script that sorts the file by the length of the headword, shortest to longest, and then sorts each group of words of the same length by frequency, highest first.
    At present I do this in Excel using the =Len(text) formula, but this is getting tedious.
    Below is a sample input file:

    about 1903238
    and 14291859
    are 1487971
    but 2994482
    can 1915289
    come 1541623
    for 3296048
    from 2207336
    get 2081392
    have 5930242
    here 1558771
    him 1571291
    just 1756270
    know 2221467
    like 1845600
    not 3091071
    now 1453264
    one 1988291
    out 1812292
    right 1410555
    say 2345958
    she 2123744
    that 7834407
    the 29962169
    there 1957160
    they 2684414
    think 1398723
    this 3814998
    was 1399013
    what 3327049
    when 1465219
    who 1543711
    with 3983564
    would 1346905
    you 12345509
    your 2329896

    The expected output would be:

    the 29962169
    and 14291859
    you 12345509
    for 3296048
    not 3091071
    but 2994482
    say 2345958
    she 2123744
    get 2081392
    one 1988291
    can 1915289
    out 1812292
    him 1571291
    who 1543711
    are 1487971
    now 1453264
    was 1399013
    that 7834407
    have 5930242
    with 3983564
    this 3814998
    what 3327049
    they 2684414
    your 2329896
    know 2221467
    from 2207336
    like 1845600
    just 1756270
    here 1558771
    come 1541623
    when 1465219
    there 1957160
    about 1903238
    right 1410555
    think 1398723
    would 1346905

    As you can see, the file has been sorted by word length and then, within each length, by frequency count in descending order.

    Any help given would avoid the tedium of loading the file each time in Excel. Many thanks in advance.

Grand Master

      Mar 22, 2013 #2

      Here is a script for that task. It is not optimized for speed.

      function sortByWordLengthAndCount (sFirstWord,sSecondWord)
      {
         // Get the length of the 2 words compared for sort. Each word ends
         // at the first space in its line.
         var nWordLength1 = sFirstWord.indexOf(' ');
         var nWordLength2 = sSecondWord.indexOf(' ');
         // Is word 2 shorter than word 1?
         if (nWordLength2 < nWordLength1)
         {
            return 1;   // Word 1 must be sorted after word 2.
         }
         // Is word 2 longer than word 1?
         if (nWordLength2 > nWordLength1)
         {
            return -1;  // Word 1 must be sorted before word 2.
         }
         // Words have identical length, compare the frequency count values.
         var nFrequency1 = parseInt(sFirstWord.substr(nWordLength1+1),10);
         var nFrequency2 = parseInt(sSecondWord.substr(nWordLength2+1),10);
         // Is the frequency of word 2 greater than the frequency of word 1?
         if (nFrequency2 > nFrequency1)
         {
            return 1;   // Word 1 must be sorted after word 2.
         }
         // Is the frequency of word 2 lower than the frequency of word 1?
         if (nFrequency2 < nFrequency1)
         {
            return -1;  // Word 1 must be sorted before word 2.
         }
         // Compare the words (lines with identical frequency values). This is an
         // alphabetical compare for words with same length and same frequency value.
         if (sFirstWord > sSecondWord)
         {
            return 1;
         }
         if (sFirstWord < sSecondWord)
         {
            return -1;
         }
         return 0;      // Both lines are identical.
      }
      
      // =========================================================================
      
      if (UltraEdit.document.length > 0)  // Is any file opened?
      {
         // Define environment for this script.
         UltraEdit.insertMode();
         UltraEdit.columnModeOff();
      
         UltraEdit.activeDocument.selectAll();
         var asLines = UltraEdit.activeDocument.selection.split("\r\n");
         // Remove last line string if it is an empty string.
         var bLastLineHasLineTerm = false;
         if (asLines[asLines.length-1].length == 0)
         {
            asLines.pop();
            bLastLineHasLineTerm = true;
         }
         // Sort the lines using the special sort criteria.
         asLines.sort(sortByWordLengthAndCount);
         // Join the lines to a block in user clipboard 9.
         UltraEdit.selectClipboard(9);
         UltraEdit.clipboardContent = asLines.join("\r\n");
         if (bLastLineHasLineTerm) UltraEdit.clipboardContent += "\r\n";
         // Paste the block over selection of entire content in active file.
         UltraEdit.activeDocument.paste();
         UltraEdit.clearClipboard();
         UltraEdit.selectClipboard(0);
         UltraEdit.activeDocument.top();
      }
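
The sort rule can also be tried outside UltraEdit. The following is only a minimal sketch in plain JavaScript, not a replacement for the script above. The function name sortLinesByLengthAndFrequency is hypothetical, and treating a line without a space or without a number as frequency 0 is an assumption. It parses every line once before sorting, so the comparator does not need to call indexOf and parseInt again on each comparison.

```javascript
// Hypothetical helper, not part of the UltraEdit scripting API.
// Sorts "word frequency" lines by word length (ascending), then by
// frequency (descending), then alphabetically by word.
function sortLinesByLengthAndFrequency(lines) {
  // Parse every line once so the comparator works on ready-made keys.
  var records = lines.map(function (line) {
    var spacePos = line.indexOf(" ");
    // Assumption: a line without a space is a word with frequency 0.
    var word = spacePos < 0 ? line : line.substring(0, spacePos);
    var count = spacePos < 0 ? 0 : parseInt(line.substring(spacePos + 1), 10);
    return { line: line, word: word, count: isNaN(count) ? 0 : count };
  });

  records.sort(function (a, b) {
    // Shorter words first.
    if (a.word.length !== b.word.length) return a.word.length - b.word.length;
    // Same length: higher frequency first.
    if (a.count !== b.count) return b.count - a.count;
    // Same length and frequency: alphabetical order.
    if (a.word < b.word) return -1;
    if (a.word > b.word) return 1;
    return 0;
  });

  // Return the original line strings in their new order.
  return records.map(function (record) { return record.line; });
}

// A few lines taken from the sample input in the question.
var sample = ["about 1903238", "and 14291859", "come 1541623",
              "the 29962169", "that 7834407"];
console.log(sortLinesByLengthAndFrequency(sample).join("\n"));
```

Inside the UltraEdit script, the asLines array could probably be passed to such a function in place of the asLines.sort(...) call, assuming the script engine of the installed UltraEdit version supports Array.prototype.map.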

      By the way: the BBCode code tags are not only for script or programming language code. They should be used for every preformatted text; whether that text is a code sequence or something else does not matter. Whoever named the tags most likely assumed they would be used mainly for real code, which is true in most forums. But in a forum for a text editor, there is often a need to post preformatted text that is not code.