Filter lines in large file based on variable criteria

Basic User

    Oct 06, 2008#1

    Hi guys, I'm really new to this and hope somebody can help me.

    There are lists within a document that contain many of the 33 different "Types" of products, but not all of them.
    I need a kind of "filter" to extract only the number located between the strings "LIST:NUMBER=" and ",TYPES="
    for every line that matches a specific "Type" lookup.

    I have the following pattern in a large text file (about 256 MB in size and roughly 2.5 million lines).

    *******************************************************************************************************
    SALE:NUMBER=12345678910:TYPE=XXXXX
    LIST:NUMBER=12345678910,TYPES=Type1-1&Type2-10&Type4-2&Type5-1&...&Type31-1&Type32-0&Type33-0

    SALE:NUMBER=56734520957:TYPE=XXXXX
    LIST:NUMBER=56734520957,TYPES=Type1-1&Type3-1&Type4-2&Type5-1&...&Type31-1&Type32-0&Type33-0

    SALE:NUMBER=77834002759:TYPE=XXXXX
    LIST:NUMBER=77834002759,TYPES=Type1-1&Type2-10&Type4-2&Type5-1&...&Type31-1&Type32-0
    .
    .
    more or less 2 million lines after
    .
    .
    SALE:NUMBER=23111109385:TYPE=XXXXX
    LIST:NUMBER=23111109385,TYPES=Type1-1&Type2-10&Type3-1&Type4-2&Type5-1&...&Type31-1&Type32-0&Type33-0
    *******************************************************************************************************
    What I need, by example:

    Example 1:
    If I want to filter for "Type2-10", the result, written to a new file, would be as follows:

    ************************************
    12345678910
    77834002759
    .
    .
    .
    23111109385
    ************************************
    Example 2:
    If I want to filter for "Type3-1" and "Type4-2", the result, written to a new file, would be as follows:

    ************************************
    56734520957
    .
    .
    .
    23111109385
    ************************************
    I made a macro that filters the lines, but it copies the complete line for every match rather than only the number
    between the two strings described above.

    Questions:
    1) I don't know how to make the macro write only the numbers between the strings explained above to a new file for every match found.

    2) I've also used the following commands to make the lookup term flexible, but something is wrong, because the same data is not always pasted. I think it has something to do with the clipboard, but I don't know how to fix it.

    Code:

    GetString "A filter over which Type?",
    CutAppend
    Find "^c"
    NewFile
    Paste
    The complete macro I have at the moment:

    Code:

    InsertMode
    ColumnModeOn
    HexOff
    UnixReOff
    GotoLine 1 1
    GetString "A filter over which Type?",
    CutAppend
    Find "^c"
    NewFile
    Paste
    SaveAs "C:\Documents and Settings\My documents\Filter\^c List.TXT"
    Thanks in advance.

    Best regards.

    Master

      Oct 06, 2008#2

      Hi,

      a few thoughts from me:

      - This looks more like a job for a grep tool, not a text editor. With a tool like PowerGREP, this would be a 30 second job.
      - This surely can be done anyway with UE. I'm not sure if macros can do the job since I don't think that you can dynamically construct a regex search string using the clipboard. Mofi can surely answer that question. I'd suggest you use UE's scripting engine which surely wouldn't have a problem with that.
      - You could:

      first delete all blank lines
      Search for Perl regex ^[ \t]*\r\n
      Replace with nothing.

      then delete all lines that start with "SALE:"
      Search for ^SALE:.*\r\n
      Replace with nothing.

      then delete all the lines that don't contain your filter term, one by one:
      Search for ^LIST:NUMBER=(?:(?!Type2-10).)*$\r\n
      Replace with nothing, repeating this once for each filter term.

      Finally, clean up, removing everything but the number:
      Search for ^LIST:NUMBER=([^,]+),.*
      Replace with \1

      Then save under a different filename.

      As I said before, PowerGREP would do this in half a minute, including the definition of the search...

      Cheers,
      Tim
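For readers without a grep tool, the steps Tim lists (skip blank lines, skip "SALE:" lines, drop lines without the filter terms, keep only the number) can also be sketched outside UltraEdit. The following is a minimal Python sketch and not something posted in the thread. The function name `filter_numbers` and the shortened sample lines are assumptions made for illustration, and multiple terms are treated as "all must be present".

```python
import re

# Captures the digits between "LIST:NUMBER=" and ",TYPES=" plus the type list.
LIST_LINE = re.compile(r"^LIST:NUMBER=(\d+),TYPES=(.*)$")


def filter_numbers(lines, wanted):
    """Yield the number of every LIST line whose types include all of `wanted`."""
    wanted = set(wanted)
    for line in lines:
        match = LIST_LINE.match(line.strip())
        if match is None:
            continue  # blank lines, SALE lines and anything else are skipped
        number, types = match.groups()
        # Splitting on "&" compares whole entries, so "Type2-10" does not
        # accidentally match an entry such as "Type2-100".
        if wanted <= set(types.split("&")):
            yield number


# Hypothetical, shortened sample in the format shown in the first post.
sample = [
    "SALE:NUMBER=12345678910:TYPE=XXXXX",
    "LIST:NUMBER=12345678910,TYPES=Type1-1&Type2-10&Type4-2&Type5-1",
    "",
    "SALE:NUMBER=56734520957:TYPE=XXXXX",
    "LIST:NUMBER=56734520957,TYPES=Type1-1&Type3-1&Type4-2&Type5-1",
]
print(list(filter_numbers(sample, ["Type2-10"])))
```

Because the function only iterates over `lines`, an open file object can be passed in place of the list, so a 256 MB file is read line by line rather than loaded in one piece.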

      Grand Master

        Oct 06, 2008#3

        Here is a macro that does the job. Enter only the type numbers, without the word "Type". For example, entering 2-10 produces the result of example 1. The string you enter is interpreted as a regular expression in UltraEdit syntax, so you can use an OR expression like ^{3-1^}^{4-2^} to get the list numbers of the lines that contain "Type3-1" or "Type4-2" (= result of example 2). The file name of the saved data will then not look very nice, though. Please also note that the UltraEdit regular expression engine supports only 2 arguments for the OR expression, so something like ^{3-1^}^{4-2^}^{2-10^} is not possible with 1 macro execution.

        The macro property "Continue if a Find with Replace not found" or "Continue if search string not found" must be checked for this macro.

        InsertMode
        ColumnModeOff
        HexOff
        UnixReOff
        Bottom
        IfColNumGt 1
        InsertLine
        IfColNumGt 1
        DeleteToStartofLine
        EndIf
        EndIf
        Top
        NewFile
        Clipboard 9
        GetString "A filter over which Type?"
        SelectToTop
        Cut
        NextWindow
        Clipboard 8
        ClearClipboard
        Loop
        Clipboard 9
        Find RegExp "LIST:NUMBER=[0-9]+,TYPES*Type^c*^p"
        IfNotFound
        ExitLoop
        EndIf
        Clipboard 8
        CopyAppend
        EndLoop
        Top
        PreviousWindow
        Clipboard 8
        Paste
        ClearClipboard
        Top
        Find RegExp "LIST:NUMBER=^([0-9]+^)*$"
        Replace All "^1"
        Clipboard 9
        SaveAs "C:\Documents and Settings\My documents\Filter\Type^c List.TXT"
        ClearClipboard
        Clipboard 0

        The method Tim suggested, deleting everything that is not of interest with regular expression replaces, would be much faster, but we don't know which lines your source files really contain.
        Best regards from an UC/UE/UES for Windows user from Austria
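The two-argument limit mentioned above belongs to the UltraEdit-syntax OR expression. In Perl-style syntax an alternation can list any number of alternatives. The following is a minimal Python sketch of that idea, not part of the macro. The helper name `any_type_pattern` is hypothetical, and, as in the macro, the terms are given without the word "Type".

```python
import re


def any_type_pattern(terms):
    """Build one regex matching LIST lines that contain at least one term."""
    alternatives = "|".join(re.escape(term) for term in terms)
    # (?:&|$) requires the entry to end right after the term, so that
    # "2-10" is not found inside "2-100".
    return re.compile(
        r"^LIST:NUMBER=(\d+),TYPES=.*?Type(?:" + alternatives + r")(?:&|$)"
    )


# Three alternatives in one pass, which the ^{...^}^{...^} form cannot express.
pattern = any_type_pattern(["3-1", "4-2", "2-10"])

match = pattern.match("LIST:NUMBER=111,TYPES=Type1-1&Type2-10")
print(match.group(1) if match else None)
```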

        Basic User

          Oct 06, 2008#4

          Hi Tim and Mofi,

          Many thanks for answering my question.

          I tested your suggestions and they run perfectly, but the loop task executes very slowly (40 min). I made a macro that only extracts the number between the strings, using the expression you gave me,

          Code:

          Find RegExp "LIST:NUMBER=^([0-9]+^)*$"
          and it works very fast, very nice! But that is only the second part of what I need.

          After a lot of trial and error, I found that UltraEdit performs the filter task very quickly, in no more than 30 seconds, with the following steps:


          1) Open Find (Ctrl+F) with the option "List Lines Containing String" selected,
          2) Enter the string I want to find in the lines of the document.

          UltraEdit then opens a new window named "Lines containing find string:"
          with 5 options: "Close, Goto, Bookmark All, Clipboard and Refresh".

          3) In this window I click the Clipboard option
          4) NewFile
          5) Paste
          6) Run a macro with the command:

          Code:

          Find RegExp "LIST:NUMBER=^([0-9]+^)*$"
          7) It's done!

          But steps 1-5 were applied without variables like ^c, and now my problems are:

          1) How to use variables using GetString ""
          2) How to force the Find function (Ctrl+F) to select the option "List Lines Containing String".
          3) How to copy the filtered lines with the Clipboard option,

          because when I record the steps, the macro doesn't show the intermediate steps 2 and 3 and
          looks like this.

          Code:

          Find "Whatever"
          NewFile
          Paste
          Could you please tell me how to fix this?

          Thanks very much again.

          Grand Master

            Oct 07, 2008#5

            My loop is the replacement for a Find with List Lines Containing String followed by copying the results to the clipboard. The List Lines Containing String option requires user interaction and therefore cannot be run automatically from within a script or macro. What makes the macro so slow on your very large file is scrolling and displaying the content. A tool like PowerGREP, as Tim suggested, would do that much faster because it does not have to display the content during execution.

            Here is another UltraEdit macro solution, which should be much faster because there is no scrolling in the source file. It is very important that only your source file is open and no other file, or the macro will not produce the correct result. It uses a Find In Files on all open files to collect the lines of interest, with the results written to an edit window, which unfortunately scrolls. After the data has been collected in the results window, all lines and data of no interest are deleted with regular expression replaces (this assumes an English UE with default settings for the output format of Find In Files).

            If you are 100% sure that the last line of your source file always ends with a line termination, remove the red colored code (the block from Bottom to the second EndIf, which adds a missing line termination at the end of the file) to make the macro faster. You can also remove this code when the last line is certainly never a data line of interest.

            I hope you use the latest version of UltraEdit, because previous versions had several problems with the macro command FindInFiles.

            InsertMode
            ColumnModeOff
            HexOff
            UnixReOff
            Bottom
            IfColNumGt 1
            InsertLine
            IfColNumGt 1
            DeleteToStartofLine
            EndIf
            EndIf

            NewFile
            Clipboard 9
            GetString "A filter over which Type?"
            SelectToTop
            Cut
            CloseFile NoSave
            FindInFiles RegExp OpenFiles "" "" "LIST:NUMBER=[0-9]+,TYPES*Type^c"
            Top
            Find MatchCase "----------------------------------------^p"
            Replace All ""
            Find RegExp "Search complete, found?+^p"
            Replace All ""
            Find RegExp "%F[oui]+nd 'LIST:NUMBER?+^p"
            Replace All ""
            Find RegExp "%*LIST:NUMBER=^([0-9]+^)*$"
            Replace All "^1"
            SaveAs "C:\Documents and Settings\My documents\Filter\Type^c List.TXT"
            ClearClipboard
            Clipboard 0

            Basic User

              Oct 08, 2008#6

              Hi guys, thanks very much, really!

              I've tested both solutions: the first procedure, given by pietzcker (Perl style), and Mofi's last macro (UltraEdit style).

              Both run much faster than the other macros I've used for this task, and both have similar execution times
              of 3 to 4 minutes.

              Thank you very much, because I've learned a lot from your examples.

              One more question:

              For example, you gave me the solution for searching for occurrences of Type1-1 OR Type2-1 using "^{1-1^}^{2-2^}", but

              How can I search for lines where Type1 AND Type2 both occur, in Perl, UltraEdit or Unix style?
              In this case Type1 and Type2 must both be in the string. Which symbol do I have to use to represent the logical operation "AND"?

              Thanks in advance.

              Best regards

                Grand Master

                Oct 08, 2008#7

                In UltraEdit syntax the search string is 1-1?+Type4-2 where ?+ means any character 1 or more times except a new line character.

                In Perl/Unix syntax the search string is 1-1.+Type4-2 where .+ means any character 1 or more times except a new line character.

                But it is now important that you specify the types in the order in which they occur in the lines. For example, 4-2.+Type1-1 will not find any line.
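The order dependence can be checked quickly outside UltraEdit. The following is a small Python sketch, not part of the thread, using a shortened, hypothetical sample line. It also shows a lookahead form, which is an alternative to the `.+` pattern above and does not depend on the order of the terms. Whether UltraEdit's Perl engine accepts that lookahead pattern unchanged is an assumption that was not tested here.

```python
import re

# Hypothetical sample line, shortened from the format in the first post.
line = "LIST:NUMBER=56734520957,TYPES=Type1-1&Type3-1&Type4-2&Type5-1"

# Ordered form from the post: only matches when Type1-1 comes before Type4-2.
in_order = re.search(r"Type1-1.+Type4-2", line)
reversed_order = re.search(r"Type4-2.+Type1-1", line)

# Lookahead form: each (?=...) is checked from the same position, so the
# order of the terms in the pattern does not matter.
both = re.compile(r"^(?=.*Type4-2)(?=.*Type1-1)")

print(bool(in_order), bool(reversed_order), bool(both.search(line)))
# prints: True False True
```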

                Basic User

                  Oct 09, 2008#8

                  Hi Mofi,

                  I'm new to this subject, as I said before, and the only thing I can do is smile :D when I see how this little macro, with the regex you gave me, does exactly what I wanted.

                  This is a very useful tool that I didn't know about.

                  Many thanks for helping me get started with this wonderful subject.

                  PS: (Mofi) The AND expression you gave me passed the test; it obviously works!


                  Best Regards from Honduras.

                    Filter lines in large file based on variable criteria with Regex

                    Oct 11, 2008#9

                    Hi everybody,

                    I asked this question in the macro section, and they helped me a lot; it worked great (see above).

                    But now I want to learn how to do it with a script.

                    How can I extract, using only one regex (into a new file or in the same one), the numbers between "LIST:NUMBER=" and ",TYPES="
                    for every line that matches the search? (The search strings would be the "TypeX-X" entries.)

                    I think the regex could be an if-then expression, but I have no idea how to write it.

                    Many thanks in advance.

                    Best Regards
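One possible shape for such a single regex is a capture group for the number followed by one lookahead per wanted type, so that no if-then construct is needed. The following is a minimal sketch in Python, not UltraEdit's own scripting language. The helper name `build_pattern` and the shortened sample text are assumptions made for illustration.

```python
import re


def build_pattern(terms):
    """Return one regex that matches a qualifying LIST line and captures its number."""
    # One lookahead per term; [^\n]* keeps each check on the current line, and
    # (?:&|$) makes sure "Type2-1" does not match the start of "Type2-10".
    checks = "".join(
        r"(?=[^\n]*" + re.escape(term) + r"(?:&|$))" for term in terms
    )
    return re.compile(r"^LIST:NUMBER=(\d+),TYPES=" + checks, re.MULTILINE)


# Hypothetical, shortened sample in the format shown in the first post.
text = "\n".join([
    "SALE:NUMBER=12345678910:TYPE=XXXXX",
    "LIST:NUMBER=12345678910,TYPES=Type1-1&Type2-10&Type4-2&Type5-1",
    "",
    "SALE:NUMBER=56734520957:TYPE=XXXXX",
    "LIST:NUMBER=56734520957,TYPES=Type1-1&Type3-1&Type4-2&Type5-1",
])

# findall returns only the captured group, that is, the numbers.
print(build_pattern(["Type3-1", "Type4-2"]).findall(text))
```

The same pattern text could in principle be tried with the Perl regular expression engine in an UltraEdit script, but that has not been verified here.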