Removing commas within double quoted data fields in a CSV file

Removing commas within double quoted data fields in a CSV file

691
Advanced UserAdvanced User
691

    Dec 04, 2013#1

    I have a set of strings separated by commas in the following form:

    "String1String","String2,String","String3String","String4,String",String5String,String6String

    Some strings are enclosed in double quotes but not all. For example String5 and String6 in the example above are not enclosed in double quotes

    Some strings themselves contain one comma within the string. These commas I wish to get rid of.
    This occurs in String2 and String4.

    How can I use a Find/Replace regular expression to get rid of the commas within the strings?

    My best idea was to construct a regular expression as follows:

    <double quote><0 or more chars><1 or more commas><0 or more chars><double quote>

    The regular expression I came up with is ".*?,.*" or grouped "(.*)(?,)(.*)"

    I cannot get this to work in UE when I test with Ctrl-F. Should this work? Is there a better idea?

    Many thanks for any reply

    6,681583
    Grand MasterGrand Master
    6,681583

      Dec 04, 2013#2

      So you have an absolutely valid CSV file as explained in Wikipedia article about Comma-separated values, but the application you want to import the data in the CSV file is bad coded as it does not ignore the separator character (the comma) within a double quoted data field as it should.

      As it is very easy to write in C/C++ an "import from CSV file" function which ignores the separator character within double quoted data fields and supporting also a double quote escaped with one more double quote in a double quoted data field, a regular expression for finding the separator character within double quoted data fields is much more complicated.

      For your CSV example with 6 data fields the following UltraEdit macro can be used:

      Code: Select all

      InsertMode
      ColumnModeOff
      HexOff
      Top
      PerlReOn
      Find MatchCase RegExp "^(".*?"(?<!"")(?!")|.*?),(".*?"(?<!"")(?!")|.*?),(".*?"(?<!"")(?!")|.*?),(".*?"(?<!"")(?!")|.*?),(".*?"(?<!"")(?!")|.*?),"
      Replace All "\1»\2»\3»\4»\5»"
      Find MatchCase ","
      Replace All " "
      Find MatchCase "»"
      Replace All ","
      You can run the 3 replaces also manually from top of the file.

      The first Perl regular expression Replace All replaces the 5 commas used as separator between the data fields in each line by character » which hopefully does not exist in your entire CSV file. The next normal Replace All replaces all remaining commas by spaces. And the last Replace All replaces all » back to commas.

      Another solution also using a macro works independent on number of data fields:

      Code: Select all

      InsertMode
      ColumnModeOff
      HexOff
      Top
      PerlReOn
      Loop 0
      Find MatchCase RegExp "".*?"(?<!"")(?!")"
      IfFound
      Find MatchCase SelectText ","
      Replace All " "
      EndSelect
      Key RIGHT ARROW
      Else
      ExitLoop
      EndIf
      EndLoop
      Top
      This solution can be much slower as the solution above as it selects in a loop every double quoted data field and replaces within each selection every comma by a space. This solution is slower as lots of display updates are made while the macro runs.
      Best regards from an UC/UE/UES for Windows user from Austria

      691
      Advanced UserAdvanced User
      691

        Dec 04, 2013#3

        Thank you Mofi. It will take me a little time to unpack this detailed reply. Thankyou :)