home.


Tagged: csv


Using AWK with CSV files with commas inbetween quotation marks

Sometimes you’ll get a CSV like: Here is something, And another thing, "OH LOOK, A COMMA WITHIN QUOTATION MARKS", something else.

This is annoying, since a normal awk separator like -F , will not work. But in modern version of awk, you can use -FPAT to use a regular expression.

Use awk -vFPAT='[^,]*|"[^"]*"'. This says you’re either looking for a field that ends in a comma, or looking for anything that begins and ends with quotation marks.

awk csv

Replacing commas in fields in CSV files with regex

Sometimes – in Comma Separated Value files – you have commas inside the fields themselves.

These means, should you run them through sed, awk or whatever, based on commas you’ll have extra fields:

afield,"another field","oh look, a false field",bugger

However, luckily, the field with the comma within is in double quotation marks.

This means we can run a regex to replace all such occurrances with the commas’s unicode entity, \\u0027

The regex works like this:

  1. Look for text that starts with ,"
  2. Keep grabbing text, which is not the end of the qutotation mark, until we get a comma
  3. Keep on grabbing again until we reach a double quotation mark

Then we can output the grabbed text between such and replace , with \\u0027

The regex, in vim syntax, looks like this:

%s/\(,"[^\"]*\),\(.*"\)/\1\\\u0027\2/

\( and \( are the grouping, and the /\1\\u0027\2/ defines the replacement with the HTML entity, so they can be ignored for this explanation.

,"[^\"]*,.*"

Leaving us with ," saying start the match with such, then [^\"].* is saying only grab text that’s not a double quotation mark.

Then, , is saying look for the comma in the quotation marks, and then .*" grabs everything until we get an ending quotation mark.

Then, since we’re grouping everything except the comma, we can do the replacement: /\1\\u0027\2/

sed unix csv awk regex vim

Sanitizing CSV files with regex

Often, you want to use a CSV file, but commas within fields, double and single quotation marks can work trickily with some other programs.

  1. The first regex will replace all commas in double quotation fields with unicode entity (only if such is not the first field, however)
  2. The second will then remove all the double quotation marks
  3. The third will replace the single quotation marks with their unicode entity

These are all in vim syntax.

%s/\(,"[^\"]*\),\(.*"\)/\1\\u002C\2/
%s/"//g
%s/'/\\u0027/g
unix csv regex vim

Using AWK to work with CSV files

Should you have a CSV file, you may want to convert that into another form.

AWK can help there. Here’s the basic AWK command for CSV files:

awk -v q="'" --field-separator ',' '{print q $1 $2 q}'

We’re saying, be verbose -v, use ' as the variable q (sometimes this is useful) and separate the fields using ,.

Then the work in {} is where is all happens. In this case we’re using print to print.

We’re printing first and the second field with no spaces inbetween (either a blank place in double quotation marks or a comma will give a space). We also use q to add a single quotation mark.

For example, given this CSV data in sample.csv:

david,jones,mastermind
chris,buckly,ethereal spirit
duncan,christmas,postman

This awk command cat sample.csv | awk -v q="'" --field-separator ',' '{print q $3 "=" $1 q}' will output:

'mastermind=david'
'ethereal spirit=chris'
'postman=duncan'

If you use AWK’s print to format a unix command, you can then pipe awk’s output to bash and run that command.

unix awk csv

Page 1 of 1