Giant file, column search help

4
NewbieNewbie
4

    Jan 23, 2009#1

    So I have a file that has 30,000+ columns. I want to count the number of lines that contain a non-blank value in column 25,000.

    Using the search tool, I'm limited to four-digit column numbers when searching by column, so I tried deleting enough columns (20,000) to bring the column I want to search below 9999 (to 5000).

    However, it appears that UE still thinks the columns go out to 30,000+. When I try to count values in any column (up to 9999), all I get is "0 occurrences" for any value (?, *), as if UE has not realized that I deleted those columns (not just the values within them). Shouldn't UE renumber the columns after they are deleted?

    What's the best way to count (via column) when the column number to be searched is > 9999?

    236
    MasterMaster
    236

      Jan 23, 2009#2

      I don't have a file like that to test on, but you could try searching for the Perl regular expression ^.{24999}\S (and use the "count all" button in the search dialog).
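For anyone who wants to check the idea outside the editor, here is a minimal Python sketch of the same count. It assumes Python's `re` module behaves like UE's Perl engine for this simple pattern, and the function name `count_nonblank_in_column` and the sample text are made up for illustration. Note that it counts characters, so a tab counts as one column.

```python
import re

def count_nonblank_in_column(text: str, column: int) -> int:
    """Count lines whose 1-based `column` holds a non-whitespace character."""
    # ^ anchors at every line start (MULTILINE), .{column-1} skips the
    # characters before the column, and \S requires a non-blank character there.
    pattern = re.compile(r"^.{%d}\S" % (column - 1), re.MULTILINE)
    return sum(1 for _ in pattern.finditer(text))

# Hypothetical miniature data: column 3 is non-blank on lines 2 and 3 only.
sample = "ab c\n  x \nabcd\n\nab\n"
print(count_nonblank_in_column(sample, 3))
```

For the file in question the call would use `column=25000`, which produces the pattern `^.{24999}\S`. Lines shorter than the column simply do not match, because neither `.` nor `\S` matches the newline.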

      4
      NewbieNewbie
      4

        Jan 23, 2009#3

        Thanks! It looks like it's working so far...

          Jan 24, 2009#4

          One more thing, is this command string-able? Like can I put 2 or more of these together to count multiple columns?

          ^.{24999}\S + ^.{25999}\S + ^.{26999}\S

          ??

          6,604548
          Grand MasterGrand Master
          6,604548

            Jan 24, 2009#5

            ^.{24999}\S

            means: find a string that begins at the start of a line with 24,999 characters of any type except newline characters, followed by a 25,000th character that is not a whitespace character (space, tab, newline).

            You should be able to combine such searches with an OR expression:

            ^(.{24999}\S|.{25999}\S|.{26999}\S)

            But I have not tested if this is really possible.
            Best regards from an UC/UE/UES for Windows user from Austria

            236
            MasterMaster
            236

              Jan 24, 2009#6

              Mofi's expression will work if you want to match a line that has at least one non-blank character in column 25000, 26000 and/or 27000. It could be optimized a little by writing it as

              ^.{24999}(?:.{1000}){0,2}?\S

              ^.{24999} will match the first 24999 characters of the line.

              (?:.{1000}){0,2}? will match zero, one or two occurrences of 1000 consecutive characters, preferring as few as possible (that's the reason for the final ?, which makes the preceding quantifier "lazy").

              \S will then match a non-blank character.

              If you want an "and" evaluation, i.e. a match only if there is a non-blank character in all three positions, then you could use

              ^.{24999}\S(?:.{999}\S){2}
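To see how the "or" and "and" forms differ, here is a small Python sketch with the columns scaled down to 3, 6 and 9 (step 3) so the data fits on screen. The helper names `build_any` and `build_all` and the sample lines are invented for this illustration, and it assumes Python's `re` treats these patterns the same way UE's Perl engine does.

```python
import re

def build_any(first: int, step: int, n: int) -> "re.Pattern[str]":
    # Lazy form: skip 0..n-1 blocks of `step` characters, then require \S.
    return re.compile(r"^.{%d}(?:.{%d}){0,%d}?\S" % (first - 1, step, n - 1),
                      re.MULTILINE)

def build_all(first: int, step: int, n: int) -> "re.Pattern[str]":
    # "And" form: \S at the first column, then n-1 times (step-1 chars + \S).
    return re.compile(r"^.{%d}\S(?:.{%d}\S){%d}" % (first - 1, step - 1, n - 1),
                      re.MULTILINE)

def count(pattern: "re.Pattern[str]", text: str) -> int:
    return sum(1 for _ in pattern.finditer(text))

# Hypothetical 9-character lines; columns 3, 6 and 9 are the ones of interest.
sample = "\n".join([
    "  a  b  c",  # all three columns filled
    "  a      ",  # only column 3 filled
    "     b  c",  # columns 6 and 9 filled
    "         ",  # nothing filled
]) + "\n"

print(count(build_any(3, 3, 3), sample))  # lines with at least one filled column
print(count(build_all(3, 3, 3), sample))  # lines with all three columns filled
```

With `first=25000, step=1000, n=3` the two helpers produce exactly the two expressions above. Both forms count lines, not filled cells: a line with two or three filled columns still contributes a single match.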

              4
              NewbieNewbie
              4

                Jan 26, 2009#7

                Thanks fellas, I'm making progress. And thank you for the color coding, it helps tremendously!

                Now I think the last step is still to combine the counts. The above examples come close, but they don't give exactly what I'm looking for.

                To help explain, here's an example of what I'm doing, and what I'm looking for. As you can see, my data is arranged in columns, with space values where appropriate. Using pietzcker's example, I can search a column and count the populated fields. In this case "4" is correct:


                So I then go through and do the rest of the columns as well, and add them up (in this case, 4 + 4 + 2 + 1 = 11).

                However, as I have 200 columns to search in 60 files, I would like to have a string that sums (a file at a time) this way:


                When I try Mofi's example, it seems to stop at the first column count. Maybe that's because I have space values between the columns I want to count?

                Thanks again!!!

                236
                MasterMaster
                236

                  Jan 26, 2009#8

                  I see. Well, Mofi's and my first regex will report "a match" if at least one of the three positions contains a non-blank character. Therefore you get one match even if the current line could match three times (strictly speaking: the regex aborts after the first successful match and doesn't even try the other options if the first one is already met).

                  The only thing you can do is use one regex for each position (^.{24999}\S, ^.{25999}\S etc.), apply them sequentially and add up the results, just like you did. Since that appears to be too much work to do manually, you'd need a script. I don't know UE's JavaScript well enough; perhaps Mofi or jorrasdk can help.
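As a rough illustration of the "one regex per position, then add the results" approach, here is a sketch in plain Python rather than UE's JavaScript (which I can't vouch for). The function name `count_per_column`, the column list and the sample data are all assumptions made up for the example; the sample is built so the per-column counts come out as 4, 4, 2 and 1.

```python
import re

def count_per_column(text: str, columns: list[int]) -> dict[int, int]:
    """Apply one ^.{n-1}\\S regex per 1-based column and record each count."""
    counts = {}
    for column in columns:
        pattern = re.compile(r"^.{%d}\S" % (column - 1), re.MULTILINE)
        counts[column] = sum(1 for _ in pattern.finditer(text))
    return counts

# Hypothetical miniature file: values sit in columns 2, 4, 6 and 8.
sample = "\n".join([
    " a b c d",
    " a b c  ",
    " a b    ",
    " a b    ",
]) + "\n"

counts = count_per_column(sample, [2, 4, 6, 8])
total = sum(counts.values())
print(counts)
print(total)
```

For the real data, the column list would hold the 200 column numbers, and the function could be called once per file on that file's contents (for example read with `pathlib.Path(...).read_text()`), which gives one total per file.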