Equivalent to egrep? (Extract lines from big files)

Newbie

    Mar 09, 2006 #1

    Hello, first of all sorry for my poor English, I'm French :D

    Here is the problem:

    I have to work on huge files (4,000,000 lines and more). They are UNIX-type text files. Generally we use egrep to extract all the lines of the file that match a label or a list of labels arranged in a column in another file. But on our UNIX version, egrep is limited in the number of lines allowed in the "list of extractions to do" file.

    I think UltraEdit is able to do that.

    Example of "work" file

    The files look like this (small extract):

    Code: Select all

    .
    .
    .
    2006/02/28-10:36:42.719  PATMEF:Name=HB5020X          size=      80
    2006/02/28-10:36:42.719  PATMEF:Name=HB5020X          size=      80
    2006/02/28-10:36:42.719  PATMEF:Name=HB5020X          size=      80
    2006/02/28-10:36:42.719  PATMMO:Name=PC000WL          PhyVal=40470                Label/Unit=             Limit=  0 Status=GO (GO ) RawVal=40470                PCU Acquisition line 30
    2006/02/28-10:36:42.719  PATMMO:Name=PC0023V          PhyVal=100.2318             Label/Unit=V            Limit=  5 Status=GO (GO ) RawVal=534                  PCU 100V Bus Voltage
    2006/02/28-10:36:42.719  PATMEF:Name=HB5020X          size=      80
    2006/02/28-10:36:42.719  PATMMO:Name=PC000WL          PhyVal=40469                Label/Unit=             Limit=  0 Status=GO (GO ) RawVal=40469                PCU Acquisition line 30
    2006/02/28-10:36:42.719  PATMMO:Name=PC0023V          PhyVal=100.0441             Label/Unit=V            Limit=  5 Status=GO (GO ) RawVal=533                  PCU 100V Bus Voltage
    2006/02/28-10:36:42.719  PATMEF:Name=HB5020X          size=      80
    2006/02/28-10:36:42.719  PATMEF:Name=HB5020X          size=      80
    2006/02/28-10:36:42.792  PATMMO:Name=HM1002L          PhyVal=0                    Label/Unit=             Limit=  0 Status=GO (GO ) RawVal=0                    Data Filed Header Flag
    2006/02/28-10:36:42.792  PATMMO:Name=HM1003L          PhyVal=2047                 Label/Unit=IDLE         Limit=  0 Status=GO (GO ) RawVal=2047                 Application Process Identifier
    2006/02/28-10:36:42.792  PATMMO:Name=HM1007L          PhyVal=0                    Label/Unit=             Limit=  0 Status=GO (GO ) RawVal=0                    Source Sequence Count
    2006/02/28-10:36:42.792  PATMMO:Name=HM1008L          PhyVal=807                  Label/Unit=             Limit=  0 Status=GO (GO ) RawVal=807                  Packet Length  
    2006/02/28-10:36:44.854  PATMMO:Name=HM1003L          PhyVal=**********           Label/Unit=             Limit=  0 Status=UND(GO ) RawVal=1791                 Application Process Identifier
    2006/02/28-10:36:44.854  PATMMS:Name=HM1003L          PhyVal=**********           Label/Unit=             RawVal=1791                 Status=UND(GO ) 
    2006/02/28-10:36:44.854  PATMMO:Name=HM1008L          PhyVal=5                    Label/Unit=             Limit=  0 Status=GO (GO ) RawVal=5                    Packet Length  
    2006/02/28-10:36:44.856  PATMMO:Name=HM1003L          PhyVal=**********           Label/Unit=             Limit=  0 Status=UND(UND) RawVal=1790                 Application Process Identifier
    2006/02/28-10:36:44.856  PATMMO:Name=HM1008L          PhyVal=3                    Label/Unit=             Limit=  0 Status=GO (GO ) RawVal=3                    Packet Length  
    2006/02/28-10:36:44.857  PATMMO:Name=HM1002L          PhyVal=1                    Label/Unit=             Limit=  0 Status=GO (GO ) RawVal=1                    Data Filed Header Flag
    2006/02/28-10:36:44.857  PATMMO:Name=HM1003L          PhyVal=104                  Label/Unit=AOCS_S_D_02  Limit=  0 Status=GO (UND) RawVal=104                  Application Process Identifier
    2006/02/28-10:36:44.919  PATMEF:Name=HB5020X          size=      80
    .
    .
    .
    
    I want to extract all the lines matching HM1008L and all the lines matching HB5020X into a single file.

    The list of strings to be extracted is stored in a file, one string per line, with a maximum of approximately 100 lines.

    The extracted lines must stay in the same order as in the original file (ordered by the date in the first column).

    So for my example I would like to see in the new file:

    Code: Select all

    2006/02/28-10:36:42.719  PATMEF:Name=HB5020X          size=      80
    2006/02/28-10:36:42.719  PATMEF:Name=HB5020X          size=      80
    2006/02/28-10:36:42.719  PATMEF:Name=HB5020X          size=      80
    2006/02/28-10:36:42.719  PATMEF:Name=HB5020X          size=      80
    2006/02/28-10:36:42.719  PATMEF:Name=HB5020X          size=      80
    2006/02/28-10:36:42.719  PATMEF:Name=HB5020X          size=      80
    2006/02/28-10:36:42.792  PATMMO:Name=HM1008L          PhyVal=807                  Label/Unit=             Limit=  0 Status=GO (GO ) RawVal=807                  Packet Length  
    2006/02/28-10:36:44.854  PATMMO:Name=HM1008L          PhyVal=5                    Label/Unit=             Limit=  0 Status=GO (GO ) RawVal=5                    Packet Length  
    2006/02/28-10:36:44.856  PATMMO:Name=HM1008L          PhyVal=3                    Label/Unit=             Limit=  0 Status=GO (GO ) RawVal=3                    Packet Length  
    2006/02/28-10:36:44.919  PATMEF:Name=HB5020X          size=      80

    Thanks for any help I can get here.

    Grand Master

      Mar 09, 2006 #2

      That's no problem. First make sure that option UNIX Style Regular Expressions is NOT checked at Advanced - Configuration - Searching.

      Then open your file and press CTRL+F to open the find dialog.

      Enter ^{HB5020X^}^{HM1008L^} in the Find What field. This regular expression in UltraEdit style means find word HB5020X OR word HM1008L.

      Enable the options List Lines Containing String and Regular Expressions and, if necessary, Match Case.

      Then press the button Find Next. A dialog will be automatically opened showing you the lines where either HB5020X or HM1008L is found.

      Press the button Clipboard to copy these lines into the clipboard and close the dialog with button Close.

      Open a new file and paste the content from the clipboard and you should get what you want.
      Best regards from an UC/UE/UES for Windows user from Austria

      Newbie

        Mar 09, 2006 #3

        OK, I had searched the help for how to give the find function multiple entries, but could not find it.

        What is the maximum number of arguments for the OR expression in the search string?

        Grand Master

          Mar 09, 2006 #4

          The current version 11.20b of UltraEdit supports only 2 arguments: Word1 OR Word2. The next version, 12.00, has true Perl regular expression support, which supports many more.
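
For comparison outside UltraEdit, here is a minimal Python sketch of what an OR expression with many arguments amounts to: the labels are joined into a single alternation pattern and every line is tested against it once. This uses Python's `re` module, not UltraEdit's Perl engine, and the function names `build_pattern` and `extract_lines` are hypothetical, chosen only for this illustration.

```python
import re


def build_pattern(labels):
    """Join the labels into one alternation such as 'HM1008L|HB5020X'.

    re.escape keeps each label literal, so characters like '.' or '('
    in a label are not interpreted as regular expression syntax.
    Returns None when no usable label remains.
    """
    cleaned = [label.strip() for label in labels if label.strip()]
    if not cleaned:
        return None
    return re.compile("|".join(re.escape(label) for label in cleaned))


def extract_lines(lines, labels):
    """Return the lines containing at least one label, in original order."""
    pattern = build_pattern(labels)
    if pattern is None:
        return []
    return [line for line in lines if pattern.search(line)]


# Small demo with shortened lines in the style of the log file above.
sample = [
    "2006/02/28-10:36:42.719  PATMEF:Name=HB5020X          size=      80",
    "2006/02/28-10:36:42.719  PATMMO:Name=PC000WL          PhyVal=40470",
    "2006/02/28-10:36:42.792  PATMMO:Name=HM1008L          PhyVal=807",
]
for line in extract_lines(sample, ["HM1008L", "HB5020X"]):
    print(line)
```

Because the file is scanned once from top to bottom, the matching lines come out in their original order without any extra sorting step.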

          Newbie

            Mar 09, 2006 #5

            I need this function for up to 50 or 100 arguments. That's why I'm asking for a macro that can do that ;)

            Grand Master

              Mar 10, 2006 #6

              Well, then we use a macro, or rather, we must use 2 macros.

              Nested loops are not possible, but we need 2 loops for this task. So we have to put the inner loop into a second macro named CollectLines. This macro needs the macro property Continue if a Find with Replace not found to be enabled.

              Here is the code for macro CollectLines:

              Code: Select all

              Loop 
              Find "^c"
              IfFound
              SelectLine 
              Clipboard 8
              CopyAppend 
              Clipboard 9
              Else
              ExitLoop
              EndIf
              EndLoop

              This simple macro collects in clipboard 8 all lines which contain the string stored in clipboard 9.

              The much bigger macro is the main macro. But before I start with the explanation of the main macro, I have to explain what you have to do for preparation.

              Open your large file and insert your arguments at the top of the file, one argument per line. The arguments can be simple words or phrases (spaces are allowed, so be careful with trailing spaces). After the arguments, insert a line with the character » at the start of the line as the marker for "end of arguments - real content starts". Example according to your example:

              Code: Select all

              HM1008L
              HB5020X
              »
              2006/02/28-10:36:42.719  PATMEF:Name=HB5020X          size=      80
              2006/02/28-10:36:42.719  PATMEF:Name=HB5020X          size=      80
              2006/02/28-10:36:42.719  PATMEF:Name=HB5020X          size=      80
              2006/02/28-10:36:42.719  PATMMO:Name=PC000WL          PhyVal=40470                Label/Unit=             Limit=  0 Status=GO (GO ) RawVal=40470                PCU Acquisition line 30
              2006/02/28-10:36:42.719  PATMMO:Name=PC0023V          PhyVal=100.2318             Label/Unit=V            Limit=  5 Status=GO (GO ) RawVal=534                  PCU 100V Bus Voltage
              2006/02/28-10:36:42.719  PATMEF:Name=HB5020X          size=      80

              The main macro also needs the macro property Continue if a Find with Replace not found. Before copying the macro code into the edit macro dialog, trim trailing spaces, because HTML adds a space at every line end and this space breaks the sequence
              "
              "
              which inserts a line break: there must be no space at the end of the first " line.

              The macro first makes sure that the file ends with a linefeed, which is important here. Next it switches to column mode and selects all real content lines, but without selecting any characters. This is the reason why the file must end with a linefeed. It then inserts the separator string #!##! at the start of every real content line, and inserts an increasing number before this string at the start of every content line. Then it switches back to normal edit mode, moves to the top of the file and clears clipboard 8, which will be the container for the collected lines.

              Now, for every argument, the macro selects it, copies it to clipboard 9 and lets the macro CollectLines find and collect all lines containing this argument. Blank lines in the argument section are ignored (IfSel). When the list of arguments has been processed (the line starting with the character » is reached), the macro first cleans up the main file by removing all line numbers and the string which separates the line number from the real start of every line (the date string).

              The last section of the macro pastes the lines collected in clipboard 8 into a new file and sorts this file. Because of the temporarily inserted line numbers, the correct line order is restored. Finally, it also deletes the line numbers and the separator string in the result file.
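
The same idea can be sketched in Python to make the steps easier to follow. This is only an illustration of the approach described above under simplifying assumptions, not a translation of the macro: the lines are numbered with tuples instead of a text prefix, and the function names `collect_lines` and `extract_in_original_order` are hypothetical.

```python
def collect_lines(numbered, needle):
    """Inner loop (the role of CollectLines): all numbered lines containing needle."""
    return [entry for entry in numbered if needle in entry[1]]


def extract_in_original_order(lines, arguments):
    """Outer loop: collect per argument, then restore the original order."""
    # Give every content line an increasing number, like the temporary
    # number and separator string the macro inserts at each line start.
    numbered = list(enumerate(lines, start=1))

    collected = []  # plays the role of clipboard 8
    for argument in arguments:
        if not argument:  # blank argument lines are ignored
            continue
        collected.extend(collect_lines(numbered, argument))

    # A line matching two arguments was collected twice. Sorting the set of
    # (number, text) pairs removes such duplicates and restores file order.
    unique = sorted(set(collected))

    # Strip the temporary numbers again.
    return [text for _, text in unique]


content = ["a X", "b Y", "c X Y", "d Z", "a X"]
print(extract_in_original_order(content, ["Y", "", "X"]))
```

The line numbers are what make the final sort safe: two content lines with identical text keep different numbers, so only lines collected twice for different arguments are treated as duplicates.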

              Code: Select all

              InsertMode
              ColumnModeOff
              HexOff
              UnixReOff
              Bottom
              IfColNum 1
              Else
              "
              "
              EndIf
              Top
              Find "»"
              Key HOME
              Key DOWN ARROW
              ColumnModeOn
              SelectToBottom
              "#!##!"
              EndSelect
              Top
              Find "»"
              Key HOME
              Key DOWN ARROW
              SelectToBottom
              ColumnInsertNum 1 1 LeadingZero 
              EndSelect
              ColumnModeOff
              Top
              Clipboard 8
              ClearClipboard
              Clipboard 9
              Loop 
              IfCharIs "0"
              ExitLoop
              EndIf
              StartSelect
              Key END
              IfSel
              Copy 
              EndSelect
              Key DOWN ARROW
              PlayMacro 1 "CollectLines"
              Top
              Find "^c"
              Key HOME
              IfColNumGt 1
              Key HOME
              EndIf
              Else
              EndSelect
              EndIf
              Key DOWN ARROW
              EndLoop
              Find RegExp "%[0-9]+#!##!"
              Replace All ""
              ClearClipboard
              NewFile
              Clipboard 8
              Paste 
              ClearClipboard
              Top
              SortAsc RemoveDup 1 -1 0 0 0 0 0 0
              Find RegExp "%[0-9]+#!##!"
              Replace All ""
              Clipboard 0
              UnixReOn

              Remove the last command UnixReOn if you use regular expressions in UltraEdit style by default instead of Unix style.
              For UltraEdit v11.10c and prior see Advanced - Configuration - Find - Unix style Regular Expressions.
              For UltraEdit v11.20 and later see Advanced - Configuration - Searching - Unix style Regular Expressions.
              The macro commands UnixReOn/UnixReOff modify this setting.