home.

tagged: regex

Replacing commas in fields in CSV files with regex

Sometimes -- in Comma Separated Value files -- you have commas inside the fields themselves.
 
These means, should you run them through sed, awk or whatever, based on commas you'll have extra fields:

afield,"another field","oh look, a false field",bugger

However, luckily, the field with the comma within is in double quotation marks.
 
This means we can run a regex to replace all such occurrances with the commas's unicode entity, \\u0027
 
The regex works like this:
  1. Look for text that starts with ,"
  2. Keep grabbing text, which is not the end of the qutotation mark, until we get a comma
  3. Keep on grabbing again until we reach a double quotation mark
Then we can output the grabbed text between such and replace , with \\u0027
 
The regex, in vim syntax, looks like this:

%s/\(,"[^\"]*\),\(.*"\)/\1\\\u0027\2/

\( and \( are the grouping, and the /\1\\u0027\2/ defines the replacement with the HTML entity, so they can be ignored for this explanation.

,"[^\"]*,.*"

Leaving us with ," saying start the match with such, then [^\"].* is saying only grab text that's not a double quotation mark.
 
Then, , is saying look for the comma in the quotation marks, and then .*" grabs everything until we get an ending quotation mark.
 
Then, since we're grouping everything except the comma, we can do the replacement: /\1\\u0027\2/

sed unix csv awk regex vim


Sanitizing CSV files with regex

Often, you want to use a CSV file, but commas within fields, double and single quotation marks can work trickily with some other programs.
  1. The first regex will replace all commas in double quotation fields with unicode entity (only if such is not the first field, however)
  2. The second will then remove all the double quotation marks
  3. The third will replace the single quotation marks with their unicode entity
These are all in vim syntax.

%s/\(,"[^\"]*\),\(.*"\)/\1\\u002C\2/
%s/"//g
%s/'/\\u0027/g

unix csv regex vim

Page 1 of 1