Find Line not in Sort Order

Basic User

    Oct 11, 2014#1

    Dear All,

    I want to find lines that are not in sort order.
    I have converted some PDF files into XML. Now I have to find the lines that were split during conversion. Each file is sorted either by author name or as a numbered list.

    If one sentence was split into two lines, the second line is very likely not in alphabetical order, and I need a way to find it.

    Eg.,

    Code: Select all

    1. author 1 sample text 1920. England.
    2. Arun sample text
    1951. France.
    3. Kumar sample text 1854 America.
    Here I want to find the 3rd line because it was split from the previous line.

    And another example.

    Code: Select all

    Arun sample text 1951. France.
    Author1 sample text. 1920. England.
    Author5 sample text
    1920. England.
    Kumar sample text 1854.
    America.
    Mofi sample text 2014, Austria.
    Here I have to find the 4th and 6th lines. The file has thousands of lines. Is there any way to find them?

    Thanks in advance.
    Arun

    Grand Master

      Oct 11, 2014#2

      For your first example I suggest using the Perl regular expression ^(?:[^\d\r\n].*|\d+[^\d.].*|\d+\.(?:[ \t]+\S+){1,4}[ \t]*)$ as the search string for a Find. It finds and selects a line
      • [^\d\r\n].* ... not starting with a digit, OR
      • \d+[^\d.].* ... starting with a number, but with no dot after the number, OR
      • \d+\.(?:[ \t]+\S+){1,4}[ \t]* ... starting with a number and a dot, but with fewer than 5 strings separated by spaces/tabs after it.
      For your second example I suggest using a Find with the Perl regular expression ^(?:\S+[ \t]*){1,4}$ as the search string.

      It finds entire lines with fewer than 5 strings separated by spaces/tabs.
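To see what the two expressions select, here is a minimal sketch in plain JavaScript that runs them over the sample lines from the first post. The helper name findSuspectLines and the array names are made up for this illustration and are not part of UltraEdit; the sketch also assumes that the JavaScript regular expression engine treats these two patterns the same way as the Perl engine used by Find.

```javascript
// The two patterns from the post above, applied to one line at a time.
var reNumbered = /^(?:[^\d\r\n].*|\d+[^\d.].*|\d+\.(?:[ \t]+\S+){1,4}[ \t]*)$/;
var reShort = /^(?:\S+[ \t]*){1,4}$/;

// Hypothetical helper: returns the 1-based numbers of all matching lines.
function findSuspectLines(asLines, rePattern) {
   var anFound = [];
   for (var nIndex = 0; nIndex < asLines.length; nIndex++) {
      if (rePattern.test(asLines[nIndex])) anFound.push(nIndex + 1);
   }
   return anFound;
}

var asNumbered = [
   "1. author 1 sample text 1920. England.",
   "2. Arun sample text",
   "1951. France.",
   "3. Kumar sample text 1854 America."
];
var asNames = [
   "Arun sample text 1951. France.",
   "Author1 sample text. 1920. England.",
   "Author5 sample text",
   "1920. England.",
   "Kumar sample text 1854.",
   "America.",
   "Mofi sample text 2014, Austria."
];

var anNumberedHits = findSuspectLines(asNumbered, reNumbered); // [2, 3]
var anNameHits = findSuspectLines(asNames, reShort);           // [3, 4, 5, 6]
```

Note that both halves of a split entry are short, so the expressions select the line that was cut off as well as its continuation line.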
      Best regards from an UC/UE/UES for Windows user from Austria

        Oct 17, 2014#3

        Hi,

        Thanks for your reply.

        But is there any way to check the sort order?

        Thanks and Regards
        Arun

          Oct 17, 2014#4

          For small files you can use this UltraEdit script.

          Code: Select all

          if (UltraEdit.document.length > 0)  // Is any file opened?
          {
             // Define environment for this script.
             UltraEdit.insertMode();
             if (typeof(UltraEdit.columnModeOff) == "function") UltraEdit.columnModeOff();
             else if (typeof(UltraEdit.activeDocument.columnModeOff) == "function") UltraEdit.activeDocument.columnModeOff();
          
             // Define line terminator type.
             var sLineTerm = "\r\n";
             if (UltraEdit.activeDocument.lineTerminator == 1) sLineTerm = "\n";
             else if (UltraEdit.activeDocument.lineTerminator == 2) sLineTerm = "\r";
          
             // Select entire file and load all lines to an array of strings.
             UltraEdit.activeDocument.selectAll();
             var asLines = UltraEdit.activeDocument.selection.split(sLineTerm);
             // Remove last string if it is empty because of file ends with a line termination.
             if (!asLines[asLines.length-1].length) asLines.pop();
             // Go to top of file and cancel selection.
             UltraEdit.activeDocument.top();
          
             var nLine = 0;
             while (nLine < asLines.length)
             {
                // Get number at beginning of the line or the entire line if
                // the regular expression object does not match on the line.
                var sLineNumRead = asLines[nLine].replace(/^(\d+)\..*$/,"$1");
                nLine++;   // Index starts with 0, but line counting with 1.
                // Convert line number to a decimal string.
                var sLineNumExpected = nLine.toString(10);
                // Compare the two strings.
                if (sLineNumRead != sLineNumExpected)
                {
                   // This line does not start with the right line number.
                   // Set caret to beginning of this line and break script.
                   UltraEdit.activeDocument.gotoLine(nLine,1);
                   break;
                }
             }
          }
          
          Or use this script, which checks only from the current line to the end of the file. Running it repeatedly then does not recheck the lines that were already checked before.

          Code: Select all

          if (UltraEdit.document.length > 0)  // Is any file opened?
          {
             // Define environment for this script.
             UltraEdit.insertMode();
             if (typeof(UltraEdit.columnModeOff) == "function") UltraEdit.columnModeOff();
             else if (typeof(UltraEdit.activeDocument.columnModeOff) == "function") UltraEdit.activeDocument.columnModeOff();
          
             // Define line terminator type.
             var sLineTerm = "\r\n";
             if (UltraEdit.activeDocument.lineTerminator == 1) sLineTerm = "\n";
             else if (UltraEdit.activeDocument.lineTerminator == 2) sLineTerm = "\r";
          
             // Move caret to beginning of current line and get the line number.
             UltraEdit.activeDocument.gotoLine(0,1);
             var nLineNumber = UltraEdit.activeDocument.currentLineNum;
          
             // Select from current line to end of file.
             UltraEdit.activeDocument.selectToBottom();
          
             if (UltraEdit.activeDocument.isSel())
             {
                var asLines = UltraEdit.activeDocument.selection.split(sLineTerm);
                // Remove last string if it is empty because of file ends with a line termination.
                if (!asLines[asLines.length-1].length) asLines.pop();
          
                // Go to beginning of selection and cancel selection.
                UltraEdit.activeDocument.gotoLine(nLineNumber,1);
          
                for(var nLineIndex = 0; nLineIndex < asLines.length; nLineIndex++)
                {
                   // Get number at beginning of the line or the entire line if
                   // the regular expression object does not match on the line.
                   var sLineNumRead = asLines[nLineIndex].replace(/^(\d+)\..*$/,"$1");
                   // Convert line number to a decimal string.
                   var sLineNumExpected = nLineNumber.toString(10);
                   // Compare the two strings.
                   if (sLineNumRead != sLineNumExpected)
                   {
                      // This line does not start with the right line number.
                      // Set caret to beginning of this line and break script.
                      UltraEdit.activeDocument.gotoLine(nLineNumber,1);
                      break;
                   }
                   nLineNumber++;
                }
             }
          }
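The comparison at the core of both scripts can be tried outside UltraEdit. The following is only a sketch in plain JavaScript: the function name findFirstMisnumbered and its parameters are hypothetical, and the lines are passed in as an array instead of being read from the active document.

```javascript
// Hypothetical helper, not part of the UltraEdit scripting API.
// Returns the 1-based position of the first line that does not start
// with the expected number followed by a dot, or 0 if all lines match.
function findFirstMisnumbered(asLines, nFirstNumber) {
   for (var nIndex = 0; nIndex < asLines.length; nIndex++) {
      // Same expression as in the scripts above: the result is the number
      // at the beginning of the line, or the entire line if nothing matches.
      var sRead = asLines[nIndex].replace(/^(\d+)\..*$/, "$1");
      if (sRead !== String(nFirstNumber + nIndex)) return nIndex + 1;
   }
   return 0;
}

// Sample lines from the first post.
var asSample = [
   "1. author 1 sample text 1920. England.",
   "2. Arun sample text",
   "1951. France.",
   "3. Kumar sample text 1854 America."
];
var nBadLine = findFirstMisnumbered(asSample, 1); // 3 (the line "1951. France.")
```

The third line starts with 1951 instead of the expected 3, so it is reported as the first line out of sequence.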
          Best regards from an UC/UE/UES for Windows user from Austria

            Oct 17, 2014#5

            Dear Mofi,

            Can I get a macro or script to check the sort order of text that is not a numbered list?

            That is, the lines start with names and I want to check whether the names are in alphabetical order.

            Is that possible by any means?

            Thanks in Advance.
            Arun

              Oct 17, 2014#6

              To check the sort order of a list starting with names, I suggest that you
              1. press Ctrl+A, Ctrl+C, Ctrl+N and Ctrl+V to copy the list into a new file,
              2. open File - Sort - Advanced Sort/Options,
              3. select Ascending, uncheck Remove duplicates, uncheck Ignore case (or check it, whichever suits your data better), enter for Key 1 the values 1 and -1 (= entire line) and 0 for all values of all other keys,
              4. run the Sort. The settings are remembered for the next sort.
              Now you have a sorted list and, if you want, you can compare it with the original list using File - Compare.
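The idea behind these steps, sorting a copy and comparing it with the original, can be sketched in plain JavaScript as follows. The variable names are made up for this illustration. The default Array sort compares strings by UTF-16 code units, which is case-sensitive and puts digits before letters; it is an assumption that this roughly corresponds to an ascending sort without Ignore case, and UltraEdit's sort may order some characters differently.

```javascript
// Sample lines from the first post.
var asLines = [
   "Arun sample text 1951. France.",
   "Author1 sample text. 1920. England.",
   "Author5 sample text",
   "1920. England.",
   "Kumar sample text 1854.",
   "America.",
   "Mofi sample text 2014, Austria."
];

// Sort a copy so that the original order is kept for the comparison.
var asSorted = asLines.slice().sort();

// True only if sorting did not move any line.
var bAlreadySorted = asLines.join("\n") === asSorted.join("\n");
```

For this sample bAlreadySorted is false, because the two continuation lines "1920. England." and "America." move to the top of the sorted copy.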
              Best regards from an UC/UE/UES for Windows user from Austria

                Oct 18, 2014#7

                Dear Mofi,

                Thanks for your reply.

                But the files I am working on have many sections.

                They are lists of book references. Each book contains 50 to 200 chapters, and the references have to be sorted separately under each chapter heading.

                E.g.:

                Code: Select all

                <Chapter/>
                Reference 1
                Reference 2
                Reference 3
                Reference 4
                Reference 5
                <Chapter/>
                Reference 1
                Reference 2
                Reference 3
                <Chapter/>
                Reference 1
                Reference 2
                Reference 3
                Reference 4
                Reference 5
                Reference 6
                I have to sort them within each section.

                  Oct 18, 2014#8

                  Is that block a real block, with tabs, spaces, section separators, etc.?

                  Please post a real block as an example. Otherwise I would waste my time developing an UltraEdit script that works perfectly for what you have posted, but not on the real data.
                  Best regards from an UC/UE/UES for Windows user from Austria

                    Oct 18, 2014#9

                    Dear Mofi,

                    Here is some sample data.

                    Please delete the sample after analyzing it. Done!

                    I have blocks like this in my file. Each block contains 50 to 200 lines and a file contains 2,000 to 5,000 lines.

                      Oct 18, 2014#10

                      Here is a script to check alphabetic order of the lines within each section.

                      Code: Select all

                      if (UltraEdit.document.length > 0)  // Is any file opened?
                      {
                         // Define environment for this script.
                         UltraEdit.insertMode();
                         if (typeof(UltraEdit.columnModeOff) == "function") UltraEdit.columnModeOff();
                         else if (typeof(UltraEdit.activeDocument.columnModeOff) == "function") UltraEdit.activeDocument.columnModeOff();
                      
                         // Define line terminator type.
                         var sLineTerm = "\r\n";
                         if (UltraEdit.activeDocument.lineTerminator == 1) sLineTerm = "\n";
                         else if (UltraEdit.activeDocument.lineTerminator == 2) sLineTerm = "\r";
                      
                         // Get the line number of current line and move caret to beginning of line
                         // above. The line above must be also loaded into the array of lines in case
                         // of current line is not in correct order in comparison to the line above.
                         var nCurrentLine = UltraEdit.activeDocument.currentLineNum;
                         var nLineNumber = (nCurrentLine > 1) ? nCurrentLine - 1 : 1;
                         UltraEdit.activeDocument.gotoLine(nLineNumber,1);
                      
                         // Select from line above to end of file.
                         UltraEdit.activeDocument.selectToBottom();
                      
                         if (UltraEdit.activeDocument.isSel())
                         {
                            var asLines = UltraEdit.activeDocument.selection.split(sLineTerm);
                            // Remove last string if it is empty because of file ends with a line termination.
                            if (!asLines[asLines.length-1].length) asLines.pop();
                      
                            // Go to beginning of selection and cancel selection.
                            UltraEdit.activeDocument.gotoLine(nCurrentLine,1);
                      
                            var bNewSection = true;
                            for(var nLineIndex = 0; nLineIndex < asLines.length; nLineIndex++)
                            {
                               // Does this line separate the sections?
                               if (asLines[nLineIndex] == "<References/>")
                               {
                                  bNewSection = true;
                               }
                               else     // Reference line to evaluate.
                               {
                                  // First reference line in a new section?
                                  if (bNewSection)
                                  {
                                     bNewSection = false; // Nothing to do on this line.
                                  }
                                  else  // Compare this line with the line before.
                                  {
                                     if (asLines[nLineIndex-1] > asLines[nLineIndex])
                                     {
                                        // This line or the line before is according to case-sensitive
                                        // alphabetic sort order not on correct line in the list. Set
                                        // caret to beginning of this line and break script.
                                        UltraEdit.activeDocument.gotoLine(nLineNumber,1);
                                        break;
                                     }
                                  }
                               }
                               nLineNumber++;
                            }
                         }
                      }
                      
                      Today I also added to my second post above a second script, which checks the sort order of lines starting with a number only from the current line to the end of the file.

                      The script in this post checks only the lines from the line above the current line to the end of the file. This avoids rechecking lines that were already checked when the script was first run from the top of the file. The line above (if the caret is not at the top of the file) must also be read in by the script, because the caret may be positioned on a line that is not in correct order in comparison to the line above.

                      The caret position does not change if all lines from the current line to the end of the file are sorted alphabetically within each section.

                      If a line is found which is not in correct order, the caret is positioned on this line. Then you can look at this line and the line above and decide which line to move to which position. After the manual fix, run the script again to check the remaining lines below.
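The section-wise comparison can also be tried outside UltraEdit. This is only a sketch in plain JavaScript: the function name findFirstUnsortedLine and its parameters are hypothetical, the lines are passed in as an array, and the separator line is a parameter. The sample uses the <Chapter/> separator from the earlier example, while the script above looks for <References/>.

```javascript
// Hypothetical helper, not part of the UltraEdit scripting API.
// Returns the 1-based number of the first line that sorts before the
// line above it within the same section, or 0 if no such line exists.
function findFirstUnsortedLine(asLines, sSeparator) {
   var bNewSection = true;
   for (var nIndex = 0; nIndex < asLines.length; nIndex++) {
      if (asLines[nIndex] === sSeparator) {
         bNewSection = true;    // The next line starts a new section.
      } else if (bNewSection) {
         bNewSection = false;   // First line of a section: nothing to compare.
      } else if (asLines[nIndex - 1] > asLines[nIndex]) {
         return nIndex + 1;     // Case-sensitive comparison, as in the script.
      }
   }
   return 0;
}

var asSample = [
   "<Chapter/>",
   "Arun sample text 1951. France.",
   "Kumar sample text 1854.",
   "<Chapter/>",
   "Author1 sample text. 1920. England.",
   "Mofi sample text 2014,",
   "Austria."
];
var nUnsorted = findFirstUnsortedLine(asSample, "<Chapter/>"); // 7
```

Line 5 is not reported although "Author1" sorts before "Kumar", because the two lines belong to different sections. Line 7 is reported because "Austria." sorts before the line above it in the same section.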
                      Best regards from an UC/UE/UES for Windows user from Austria

                        Oct 19, 2014#11

                        Dear Mofi,

                        Thanks for your reply. Your script works well.

                        Thanks, Arun