Tagged: sed

Replacing commas in fields in CSV files with regex

Sometimes – in Comma Separated Value files – you have commas inside the fields themselves.

These means, should you run them through sed, awk or whatever, based on commas you’ll have extra fields:

afield,"another field","oh look, a false field",bugger

However, luckily, the field with the comma within is in double quotation marks.

This means we can run a regex to replace all such occurrances with the commas’s unicode entity, \\u0027

The regex works like this:

  1. Look for text that starts with ,"
  2. Keep grabbing text, which is not the end of the qutotation mark, until we get a comma
  3. Keep on grabbing again until we reach a double quotation mark

Then we can output the grabbed text between such and replace , with \\u0027

The regex, in vim syntax, looks like this:


\( and \( are the grouping, and the /\1\\u0027\2/ defines the replacement with the HTML entity, so they can be ignored for this explanation.


Leaving us with ," saying start the match with such, then [^\"].* is saying only grab text that’s not a double quotation mark.

Then, , is saying look for the comma in the quotation marks, and then .*" grabs everything until we get an ending quotation mark.

Then, since we’re grouping everything except the comma, we can do the replacement: /\1\\u0027\2/

sed unix csv awk regex vim

Unix: Replace newlines with sed

Imagine we have this file

Line one

Line two

And we want to remove the double new line.

This ugly looking command will replace all the newlines with nothing.

sed ':a;N;$!ba;s/\n\n//' YOURFILE

The :a says create a label - we’ll need this in a moment.

The N says append the next line onto the current pattern - we need this since we’re matching two lines. So say we’re trying to match the double newline this will give us ‘\n\n’ in our pattern space.

The $ matches the last line, i.e. in ‘\n\n’ we’ll be right at the end. And the ! inverts that. So here’s we’re matching the first ‘\n’ since this is not the last line.

Then the ba means go back to our label that we just created. So if we’re not on the last line, go back and do the match on the next part. This seems to be so the newline doesn’t halt the match.

unix unix-sed

Page 1 of 1