How to delete lines in CSV file if a certain value is found in defined data field?

How to delete lines in CSV file if a certain value is found in defined data field?

4
NewbieNewbie
4

    Mar 20, 2013#1

    I have not worked with scripts much, but feel there is a way to set bookmarks using toggleBookmark after finding particular rows. For instance, I have a column heading of "Entity Type Code" and I would like to bookmark rows where that column contains "2" (a text field) for the purpose of deleting those bookmarked rows. Any help?

    6,686585
    Grand MasterGrand Master
    6,686585

      Mar 20, 2013#2

      There is no need for a script for this task. A Perl regular expression Replace All executed from top of file can do that.

      For example you have a CSV file with DOS line terminations with tab as delimiter. The rows you want to delete are those having value 2 in fifth data column.

      The Perl regular expression search string for this task is: ^(?:[^\t\r\n]*\t){4}2\t.*\r\n

      The replace string is an empty string.

      As you can see hopefully with the colors you can easily adapt:
      • line terminator type - UNIX would be just \n instead of \r\n as needed for DOS
      • delimiter character - semicolon would be simply ; and a comma would be just ,
      • number of data fields left the field with the value of interest
      • value of interest in the specified field
      If the value of interest is in the first field, the search expression would be: ^value\t.*\r\n

      If the value of interest is in the last field, the search expression would be: ^[^\t\r\n]*\tvalue\r\n

      4
      NewbieNewbie
      4

        Mar 20, 2013#3

        Thank you, Mofi, for your response. It is a csv file, but I am in Windows, not Unix. I also do not understand what a Perl regular expression Replace All is. Is this something I can enter somewhere in UltraEdit? If so, where? Also, the field containing a value of "2" (with the quotes around it) is the second column. If you could clarify, it would be appreciated. Thank you.

        6,686585
        Grand MasterGrand Master
        6,686585

          Mar 21, 2013#4

          Click in menu Find on item Replace. Enter the search string and set the options like you can see on attached screenshot. Click on button Replace All.

          Please note that the last line of the CSV file must also have a line termination if this line contains "2" between first and second tab. You can verify that by pressing Ctrl+End. If the caret is blinking now at beginning of a line, the last line in the file has a line terminaton. Otherwise hit key RETURN to insert one. Then press Ctrl+Home to move caret back to top of file and open the Replace dialog with Ctrl+R or via menu.

          perl_regex_replace_all.png (5.89KiB)
          Perl regular expression Replace All to delete all lines with "2" between first and second tab in the line.

          4
          NewbieNewbie
          4

            Mar 21, 2013#5

            I'm getting "Search string '^(?:[^\t\r\n]*\t){1}"2"\t.*\r\n' not found."

            However, I just noticed your text below the image referring to "tabs". It is not tab delimited, but comma delimited. Is that what you were referring to?

            Changed my expression to ^(?:[^,\r\n]*,){1}"2",.\r\n and it worked!

            Thanks for all the help, Mofi!

              Mar 25, 2013#6

              Next I wanted to delete all lines with "" as second value. Unfortunately, the expression ^.*?,"",.*\r\n you suggested (in a now already deleted post) deleted all lines after the header lines. Definitely not what I wanted.

              Here is an example of my file, including the header line, but omitting a lot more columns:

              Code: Select all

              "Column1", "Column2", "Column3", "Column4"
              "123", "1", "", "Smith"
              "124", "1", "X", "Jones"
              "124", "", "", ""
              "125", "1", "XYZ", "Walker"
              In this example, the second to the last line has a "" value for Column2 and I would like to remove that row and any similar in pattern.

              Actually, I would like to know how to remove Column3, too. There are nearly 3,000,000 rows in this file and I am trying to reduce its size to be more managable.

              6,686585
              Grand MasterGrand Master
              6,686585

                Mar 25, 2013#7

                I could see now with some example lines of real data that the expression .*?, was not good in all my previous posts. This is a non-greedy expression, but it is nevertheless too greedy in case of multiple values in a line are empty values.

                Much better is the expression [^,\r\n]*, which matches in any case only one data field. This expression means: Find zero or more characters which are whether a comma, nor a carriage return, nor a line-feed (negative character set) followed by a comma.

                I modified the expressions in all posts above by the better one in case other forum members are reading here too.

                The non-capturing group with the multiplier {1} is not necessary for finding strings between first and second comma in a line as you already know.

                The better search string for your task for deleting all lines with "" (empty field in double quotes) between first and second comma is:

                ^[^,\r\n]*, *"",.*\r\n

                As in your examples lines there is a space character between first comma and the first double quote character of second value, the expression contains now also a space character followed by an asterisk. With this addition zero or more spaces are allowed between first comma and the double quote in second data field. So the lines 4 to 6 in the example below are found with this expression and deleted.

                Code: Select all

                "Column1", "Column2", "Column3", "Column4"
                "Line 2", "1", "", "Smith"
                "Line 3", "1", "X", "Jones"
                "Line 4","", "", ""
                "Line 5", "", "", ""
                "Line 6",  "", "", ""
                "Line 7", "1", "XYZ", "Walker"
                If you want to delete the third value from all lines independent of the value (empty or not empty), you need a tagged regular expression.

                The search string is for this task ^((?:[^,\r\n]*,){2})[^,\r\n]*, and the replace string is just \1

                If you or anybody else want to know more about Perl regular expressions to better understand them, see the IDM power tips:
                More useful links to webpages about Perl regular expressions can be found in the forum announcement Readme for the Find/Replace/Regular Expressions forum