Tagged: awk


Using AWK with CSV files that have commas between quotation marks

Sometimes you’ll get a CSV line like: Here is something,And another thing,"OH LOOK, A COMMA WITHIN QUOTATION MARKS",something else

This is annoying, since a normal awk field separator like -F , will split the quoted field in two. But gawk (GNU awk, version 4.0 or later) has the FPAT variable, which takes a regular expression describing what a field looks like, rather than what separates the fields.

Use gawk -v FPAT='[^,]*|"[^"]*"'. This says a field is either a run of characters that are not commas, or anything that begins and ends with double quotation marks.
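A minimal sketch of the FPAT approach, assuming gawk is installed (FPAT is a gawk extension, so other awks will not honour it). The sample line is made up for illustration, with no space after the separating commas, since a quoted field has to start right after the comma for the pattern to catch it.

```shell
# Hypothetical sample line; the third field contains a comma inside quotes.
line='Here is something,And another thing,"OH LOOK, A COMMA WITHIN QUOTATION MARKS",something else'

if command -v gawk >/dev/null 2>&1; then
  # Print each field on its own line, prefixed with its field number.
  fields=$(printf '%s\n' "$line" |
    gawk -v FPAT='[^,]*|"[^"]*"' '{ for (i = 1; i <= NF; i++) print i ": " $i }')
  printf '%s\n' "$fields"
else
  echo "gawk not found; FPAT needs GNU awk" >&2
fi
```

The quoted text should come out as a single field, quotation marks included.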

awk csv

Replacing commas in fields in CSV files with regex

Sometimes – in Comma Separated Value files – you have commas inside the fields themselves.

This means that if you split them on commas with sed, awk or whatever, you’ll get extra fields:

afield,"another field","oh look, a false field",bugger

However, luckily, the field with the comma within is in double quotation marks.

This means we can run a regex to replace all such occurrences with the comma’s Unicode escape, \u002C

The regex works like this:

  1. Look for text that starts with ,"
  2. Keep grabbing text that is not a double quotation mark, until we get to a comma
  3. Keep on grabbing again until we reach a double quotation mark

Then we can output the grabbed text on either side and replace , with \u002C

The regex, in vim syntax, looks like this:

%s/\(,"[^\"]*\),\(.*"\)/\1\\u002C\2/

\( and \) are the grouping, and the /\1\\u002C\2/ defines the replacement with the Unicode escape (\\ is how you write one literal backslash in the replacement), so they can be ignored for this explanation.

,"[^\"]*,.*"

Leaving us with ," saying start the match with such, then [^\"]* is saying only grab text that’s not a double quotation mark.

Then, , is saying look for the comma in the quotation marks, and then .*" grabs everything until we get an ending quotation mark.

Then, since we’re grouping everything except the comma, we can do the replacement: /\1\\u002C\2/

Each run replaces one comma per line, so repeat the command if a line has more than one.
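The same substitution can be sketched with sed, one of the tools mentioned above. This assumes the regex carries over unchanged to sed’s basic regular expressions, and it writes \u002C, the code point for a comma; the input is the example line from earlier.

```shell
# The example line from above: the third field has a comma inside the quotes.
line='afield,"another field","oh look, a false field",bugger'

# \( \) group the text on either side of the comma; \\ in the replacement
# writes one literal backslash, giving \u002C in the output.
fixed=$(printf '%s\n' "$line" | sed 's/\(,"[^"]*\),\(.*"\)/\1\\u002C\2/')
printf '%s\n' "$fixed"
```

Only the comma inside the quoted field is rewritten; the commas that separate fields are left alone.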

sed unix csv awk regex vim

Using AWK to work with CSV files

Should you have a CSV file, you may want to convert that into another form.

AWK can help there. Here’s the basic AWK command for CSV files:

awk -v q="'" --field-separator ',' '{print q $1 $2 q}'

We’re saying: use -v to set the variable q to ' (sometimes this is useful, since a literal single quotation mark is awkward inside a single-quoted awk program), and separate the fields using ,. Note that -v assigns a variable, it does not mean verbose, and --field-separator is gawk’s long form of -F.

Then the work in {} is where it all happens. In this case we’re using print to print.

We’re printing the first and the second field with no space in between (putting " " or a comma between them will give a space). We also use q to add a single quotation mark.

For example, given this CSV data in sample.csv:

david,jones,mastermind
chris,buckly,ethereal spirit
duncan,christmas,postman

This awk command cat sample.csv | awk -v q="'" --field-separator ',' '{print q $3 "=" $1 q}' will output:

'mastermind=david'
'ethereal spirit=chris'
'postman=duncan'

If you use AWK’s print to format a unix command, you can then pipe awk’s output to bash and run that command.
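As a sketch of that idea, the pipeline below has awk print one echo command per CSV row and then pipes the commands to a shell. The rows are the sample data from above; echo stands in for whatever command you actually want to run, and sh is used here on the assumption that bash would behave the same way.

```shell
# Recreate the sample data from above.
csv='david,jones,mastermind
chris,buckly,ethereal spirit
duncan,christmas,postman'

# awk builds one command line per row; q wraps the text in single quotes
# so the shell treats each greeting as one argument.
commands=$(printf '%s\n' "$csv" | awk -v q="'" -F, '{ print "echo " q "hello " $1 q }')

# Look at the generated commands first, then pipe them to sh to run them.
printf '%s\n' "$commands"
result=$(printf '%s\n' "$commands" | sh)
printf '%s\n' "$result"
```

Check the generated commands before running them, since any shell metacharacters in the data will be executed too.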

unix awk csv

Unix: Basics of Gawk

The basic syntax of this command is usually:

    gawk '<gawk command>' filename

Within the gawk command, the format is:

    BEGIN { <operations> } /<pattern match>/ { <operations> } /<another pattern>/ { <another op> } END { <operations> }

The BEGIN and END parts are optional. As is the pattern match.
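To make the layout concrete, here is a small sketch with all three parts. The input lines are made up, and plain awk is used on the assumption that gawk behaves the same for this program.

```shell
# Made-up input: three lines, two of which contain "hello".
input='hello world
goodbye world
hello again'

report=$(printf '%s\n' "$input" | awk '
  BEGIN   { print "start" }             # runs once, before any input is read
  /hello/ { count++; print $2 }         # runs for each line matching the pattern
  END     { print count " matches" }    # runs once, after the last line
')
printf '%s\n' "$report"
```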

Within the main operations block, you can use the NF variable to find how many fields are on this line.

This gawk program will only print non-blank lines, since blank lines have an NF of 0:

    gawk '{ if(NF!=0) print $0 }' filename

The ‘print $0’ will print the whole line. The $1 will print the first field and so on. Fields are separated by spaces and/or tabs.

Pattern matching will allow you to only print lines with, in this case, the word ‘hello’ in them:

    gawk '/hello/ { print $0 }' filename

If you place a ‘!’ before the pattern match, it will be inverted. (‘~’ is the match operator, used to test a single field, as in $1 ~ /hello/.)
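A sketch of inverting a pattern by putting ! in front of it, on made-up input; awk is assumed to behave like gawk here.

```shell
# Made-up input: only the middle line lacks the word "hello".
input='hello world
goodbye world
hello again'

# !/hello/ selects the lines that do NOT match the pattern.
others=$(printf '%s\n' "$input" | awk '!/hello/ { print $0 }')
printf '%s\n' "$others"
```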

There are many other things gawk can do, including variables, addition of fields and many more.
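For instance, variables and the addition of fields might look like the sketch below. The input and the variable names sum and total are made up for illustration, and awk is assumed to behave like gawk here.

```shell
# Made-up input: a name and two numbers per line.
input='apples 3 4
pears 10 5'

totals=$(printf '%s\n' "$input" | awk '
  { sum = $2 + $3; total += sum; print $1, sum }   # add two fields, keep a running total
  END { print "total", total }                     # print the grand total at the end
')
printf '%s\n' "$totals"
```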

unix unix-awk unix-gawk

Unix: Killing a process using ps, grep and nawk

 ps ax | grep -v grep | grep PROCESSNAME | nawk '{print $1}'

This will show your processes, remove any lines with the word ‘grep’ in them, grep for your PROCESSNAME, and run nawk to print the first column, which is the pid.

If you pass this to kill -9, you’ll kill the process:

 kill -9 `ps ax | grep -v grep | grep PROCESSNAME | nawk '{print $1}'`
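The two greps can also be folded into the awk program. This is a sketch of an alternative, not the pipeline above; it runs on a made-up stand-in for ps ax output so it is safe to try, and it uses awk where nawk is assumed to behave the same.

```shell
# Made-up stand-in for `ps ax` output (pid, tty, state, time, command).
ps_output='  101 ?        Ss     0:00 /usr/sbin/PROCESSNAME --daemon
  202 pts/0    S+     0:00 grep PROCESSNAME
  303 ?        S      0:01 /usr/bin/other'

# Keep lines that mention PROCESSNAME but not grep, and print the pid column.
pids=$(printf '%s\n' "$ps_output" | awk '/PROCESSNAME/ && !/grep/ { print $1 }')
printf '%s\n' "$pids"
```

Against real ps ax output with no grep in the pipeline, you would exclude the awk process itself (!/awk/) instead.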
unix unix-grep unix-awk
