How to find lines in CSV file with less or more than X tabs within a line?

How to find lines in CSV file with less or more than X tabs within a line?

2

    Jun 21, 2012#1

    Hi

    I am new to UltraEdit macros and I'm getting familiar with the syntax, but struggling to find a count like function?

    I have a file containing 1.8 million records with 329 columns. The records in the file are all tab delimited.

    Each record should have 328 tab delimiters, but I know some records contain more than this number.

    I want a macro that will count the number of tab delimiters and locate me all the records that have more than 328 tab delimiters inside the record.

    Any help would be much appreciated.

    Thank you.

    6,610548
    Grand MasterGrand Master
    6,610548

      Jun 21, 2012#2

      There is no need for a macro. Just press Ctrl+F to open the Find dialog. Enter or copy/paste as string to find ^(?:.*?\t){329} and enable the option Regular Expressions: Perl.

      Press button Advanced to open the advanced find options pane and make sure the Perl regular expression engine is selected as required. Check additionally the option List Lines Containing String. Run the Find with click on button Next.

      A window opens listing all lines with at least 329 tab characters. You can copy this list to clipboard and paste it into a new file or double click on the listed lines to jump to this line for making corrections.

      I tested the expression before posting with UE v18.10.0.1014 on a quickly created file with 4 lines with two lines having 5 tabs, one having 6 tabs and one having 7 tabs and using ^(?:.*?\t){6} The list contained the 2 lines with more than 5 tabs.

      Explanation of the expression:

      ^ ... start every search at beginning of a line.

      (?:...) ... is a non-capturing group. Grouping is supported only by the Perl regular expression engine. ?: immediately after opening round bracket tells the Perl regular expression engine not to mark (or tag) the string found by the expression inside the round brackets for being back referenced in search or replace string with $1 respectively \1. Such special modifiers as ?: immediately after ( are documented not in help of UltraEdit about Perl regular expression as the explanation of all those modifiers fills books and are for experts only.

      .*?\t ... find 0 or more characters of any value except line ending characters followed by a tab character. The question mark after the asterisk makes the expression non greedy (= match as less characters as possible to get true for this expression).

      {6} ... the previous expression which is the non-capturing group finding something or nothing tab must match 6 times. If the line contains less tabs, the result of the find is false for this line.

      2

        Jun 22, 2012#3

        Thank you for your reply and explanation. I am beginning to understand regular expression syntax.

        I did follow your instructions and I am using 17.30.0.1014 (trial version at the moment).

        One last question please: I have found that my problem in the file is actually to do with a record that contains less than 328 delimiters. How can find the rows with this problem instead?

        Thanks

        6,610548
        Grand MasterGrand Master
        6,610548

          Jun 22, 2012#4

          Take a look on topic How to find line breaks in fields of a CSV file and remove it? which contains most likely the solution for your problem. I'm quite sure that the CSV file contains line breaks in data fields and the exporting application has not double quoted those fields or the importing application is poor coded and does not support line breaks in field strings.

          The negative search for finding lines with less than x tabs is much more complicated. Based on what pietzcker posted at How to delete all lines NOT containing specific word or string or expression? the expression to use to find and select lines with less than 328 tabs is:

          ^(?:(?!(?:.*?\t){328,}).)*$\r\n

          Please make sure that Find What field does not contain other invisible characters like spaces and line breaks when using this expression. A common mistake on copying an expression from the browser window and pasting into Find What field is that the line termination is copied too.

          1
          NewbieNewbie
          1

            Jul 19, 2012#5

            Are you still looking for a solution to this?

            I have an approach that is within UltraEdit, that will allow you to determine the line number on which the number of tabs is a number that is an exception.  The steps will be:
            • Make a copy of the file to reformat.
            • Change all tabs to, for example, tildes (all 09 becomes 7E). (This is just for visibility.)
            • Delete (find and replace with nothing) all characters that are not tildes or line delimiters; not a burdensome task if all you have are digits and decimal points; more of a bother if you do have letters also, but grit your teeth and you will be through it in about 12 minutes; Perl at _this_ step would speed things up well, e.g.:
              Using Perl for the engine (not Unix, not UltraEdit):
              • In the Find What box write: [A-Z]
                In the Replace With box make sure to have nothing.
                Make sure the option Match case is checked and then click on Replace All.
                That gets rid of all uppercase letters.
              • In the Find What box write: [a-z]
                In the Replace With box make sure to have nothing.
                Make sure the option Match case is checked and then click on Replace All.
                That gets rid of all lowercase letters.
              And so on until you get rid of whatever is in there besides tildes and line delimiters.
            • Then do massive reductions of tildes like 50 tildes plus line delimiter becomes line delimiter. Do this 6 times and you will have 1.8 million instances of ^p and 28 tildes and ^p and other instances of ^p and not 28 tildes and ^p.
            • Count the instances of ^p and 28 tildes and ^p (call this Y) and then, where X is the number of lines in the file, X - Y is the number of exceptional lines you must locate.
            • Then simply do a few searches for instances such as ^p and 29 tildes and ^p or whatever variations of N  for "300+N" you might want to check for, until any and all exceptional lines have been located.

            6,610548
            Grand MasterGrand Master
            6,610548

              Jul 20, 2012#6

              onceayear, some hints for you.

              If option Match Case is not enabled in Find/Replace dialog, [A-Z] is equal [a-z] is equal [A-Za-z].

              And it is possible to search with a negative character set. In Unix/Perl a search for all characters except ~ and line terminators is possible with [^~\r\n]. The character ^ immediately after [ means NOT. So this little expression means: find a character which is NOT a tilde and NOT a carriage return and NOT a line-feed.

              But all that work as you have described is not really necessary. There are just 2 Perl regular expression Finds needed to find all lines with more than 328 tabs or less than 328 tabs.

              Perl regular expression to find lines with more than 328 tabs in a CSV file: ^(?:[^\t\r\n]*\t){329,}.*\r\n

              Perl regular expression to find lines with less than 328 tabs in a CSV file: ^(?!(?:[^\t\r\n]*\t){328}).*\r\n

              If the CSV file uses a different character as separator, just the 4 \t in the 2 regular expression strings above must be replaced by the separator character.

              Note: The regular expression finds work only for a CSV file not containing a tab character (separator) within a field value enclosed in double quotes. Fixing or reformatting a CSV file with separator existing also within a double quoted field value is a nightmare with regular expressions and very often not possible.

              4
              NewbieNewbie
              4

                Jul 03, 2018#7

                This is handy, but how do I change this for the pipe character "|"?

                I did the below and doesn't seem to work

                ^(?:[^\t\r\n]*"|"){15}.*\r\n

                Basically, I have a CSV file that is pipe or vertical bar separated. If it doesn't contain the exact number of columns then my import into SQL will fail. Let me know what you think.

                Thanks.

                18572
                MasterMaster
                18572

                  Jul 03, 2018#8

                  Hi eripey,

                  ^(?:[^\|\t\r\n]*+\|){15}.*\r\n

                  For searching lines with less than 15 delimiters use this:
                  ^(?!(?:[^\|\r\n]*+\|){15})

                  BR, Fleggy

                  4
                  NewbieNewbie
                  4

                    Jul 03, 2018#9

                    The first one is to find an exact match and the second is to find one less than the number?

                    Is there a way to find the lines that don't match the number specified?

                    Also my end of line is a paragraph symbol if that makes any difference.

                    ^(?:[^\|\t\r\n]*+\|){3}.*\r\n

                    Above would find lines with more than 3 pipes, am I correct?

                    18572
                    MasterMaster
                    18572

                      Jul 03, 2018#10

                      The first one matches whole lines containing 15 or more delimiters.
                      The second one finds lines containing 14 or less delimiters.
                      Just try them. ;)

                      EDIT:

                      For searching lines with delimiter count <> 15 use this (for example):

                      ^(?!(?:[^\|\r\n]*+\|){15}[^\|\r\n§]*+§)

                      BR, Fleggy

                      4
                      NewbieNewbie
                      4

                        Jul 03, 2018#11

                        fleggy wrote:The first one matches whole lines containing 15 or more delimiters.
                        The second one finds lines containing 14 or less delimiters.
                        Just try them. ;)
                        Okay, thanks. I just did and it looks like it works!

                        6,610548
                        Grand MasterGrand Master
                        6,610548

                          Jul 03, 2018#12

                          The vertical bar | means OR in a Perl regular expression search string, except this character is escaped with a backslash or is within square brackets which define a character class.

                          It is of course possible to search for lines with 16 or more OR with less than 15 separators/delimiters:

                          ^(?:(?:[^|\r\n]*\|){16,}|(?!(?:[^|\r\n]*\|){15})).*$

                          By the way: A CSV file not containing the right number of separators/delimiters on a line contain usually a multi-line value. If such a multi-line value is correct enclosed in double quotes, every application with support for reading/importing data from CSV files should support them as otherwise it is not compliant according to CSV specification as defined in RFC 4180. See the Wikipedia article about comma-separated values which explains the CSV format very well and easy understandable.
                          Best regards from an UC/UE/UES for Windows user from Austria

                          4
                          NewbieNewbie
                          4

                            Jul 03, 2018#13

                            Thanks. It would be nice if they would enclose values with quotes, but in this case they don't. I have the lines all recorded so I can fix before import. At least with this I can check before and fix the values that contains the pipe by removing and then adding in after import is complete. That's about the only way I could think about doing right now.

                            In your example, if I know the amount of columns is 15. Thanks again, I will do this.