Sometimes – in Comma Separated Value files – you have commas inside the fields themselves.
This means that if you run them through sed, awk or whatever and split on commas, you’ll end up with extra fields:
afield,"another field","oh look, a false field",bugger
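To make the extra field concrete, here is a minimal Python sketch (Python is not used in the original; it is only here as a neutral way to count fields). It splits the example line naively on commas, then parses it with the standard csv module, which honours the quotation marks. The variable names are illustrative.

```python
import csv

line = 'afield,"another field","oh look, a false field",bugger'

# Naive split: every comma is treated as a separator, including the quoted one.
naive_fields = line.split(",")

# Quote-aware parse: csv.reader takes an iterable of lines and yields lists of fields.
parsed_fields = next(csv.reader([line]))

print(len(naive_fields), naive_fields)    # 5 fields, one of them false
print(len(parsed_fields), parsed_fields)  # 4 fields, as intended
```

The naive split breaks "oh look, a false field" in two, which is exactly the problem a comma-based sed or awk pipeline runs into.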
However, luckily, the field with the comma within is in double quotation marks.
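This means we can run a regex to replace every such comma with its Unicode escape sequence.
The regex works like this: match from the opening quotation mark up to the comma, then from the comma to the closing quotation mark, capturing the text on either side of the comma.
Then we can output the captured text and replace only the comma between the two captures.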
The regex, in vim syntax, looks like this:
s/\(,"[^\"].*\),\(.*"\)/\1\\u0027\2/
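\( and \) mark the capture groups, and
/\1\\u0027\2/ defines the replacement: the two groups put back with the escape sequence between them. Both can be ignored for this explanation. (Strictly, \u0027 is the code point of the apostrophe; the comma itself is \u002C.)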
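That leaves us with:
," says start the match at a comma followed by an opening quotation mark, then
[^\"].* says grab one character that is not a double quotation mark, followed by any run of characters. The .* is greedy, so it takes as much as it can and backtracks to find the comma.
, says look for the comma inside the quotation marks, and then
.*" grabs everything up to the last quotation mark on the line.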
Then, since we’re grouping everything except the comma, the replacement writes both groups back (\1 and \2) with the escape sequence between them in place of the comma.
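The same idea can be sketched in Python as a cross-check. This is not the vim command from the text but a swapped-in approach: it finds each double-quoted field and replaces every comma inside it, rather than relying on greedy matching across the whole line. The function name escape_quoted_commas and the constant COMMA_ESCAPE are hypothetical, the escape used is \u002C (the comma's code point) rather than the \u0027 written above, and it assumes fields contain no embedded newlines.

```python
import re

# Assumption: commas are replaced by the literal six characters \u002C.
COMMA_ESCAPE = r"\u002C"

def escape_quoted_commas(line, escape=COMMA_ESCAPE):
    """Replace commas that sit inside double-quoted fields with `escape`."""
    def fix(match):
        # match.group(0) is one whole quoted field, quotation marks included.
        return match.group(0).replace(",", escape)
    # A quoted field: a quotation mark, any non-quote characters, a quotation mark.
    return re.sub(r'"[^"]*"', fix, line)

line = 'afield,"another field","oh look, a false field",bugger'
escaped = escape_quoted_commas(line)

print(escaped)
print(len(escaped.split(",")))  # 4: the false field is gone
```

Using a function as the replacement avoids having to double the backslash, since re.sub does not process escapes in a callback's return value. Once the downstream tool has split the fields, the escape can be turned back into a comma.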
Imagine we have this file:
Line one

Line two
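And we want to remove the double newline.
This ugly looking command will replace the double newline with nothing. Without a g flag on the s command only the first double newline is removed; use s/\n\n//g if the file has several.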
sed ':a;N;$!ba;s/\n\n//' YOURFILE
The :a says create a label - we’ll need this in a moment.
The N says append the next line of input to the pattern space, separated by a newline. We need this because sed normally works on one line at a time with the trailing newline stripped, so a pattern containing ‘\n’ could never match otherwise.
The $ is an address that matches the last line of input, and the ! inverts it. So $! means ‘on every line except the last’.
Then the ba means branch back to the label a that we just created. So if we’re not on the last line, go back and append the next one. The loop only ends once the whole file is in the pattern space, and only then does s/\n\n// run, against text that really does contain the newlines.
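The loop described above can be imitated step by step in Python. This is only a rough model of what sed does with ':a;N;$!ba;s/\n\n//', not a reimplementation of sed; the function name join_and_substitute is hypothetical, and the input is assumed to be a list of lines with their trailing newlines already stripped, which is how sed sees them.

```python
def join_and_substitute(lines):
    """Model of sed ':a;N;$!ba;s/\\n\\n//' on a list of newline-stripped lines."""
    if not lines:
        return ""
    remaining = list(lines)
    # sed starts a cycle by reading one line into the pattern space.
    pattern_space = remaining.pop(0)
    # $!ba: keep looping for as long as the last line has not been read.
    while remaining:
        # N: append a newline and the next input line to the pattern space.
        pattern_space += "\n" + remaining.pop(0)
    # s/\n\n//: with no g flag, only the first double newline is removed.
    return pattern_space.replace("\n\n", "", 1)

print(join_and_substitute(["Line one", "", "Line two"]))  # Line oneLine two
```

Note that removing both newlines glues the two lines together. To keep a single line break instead, the replacement would be '\n' rather than nothing.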