Sometimes -- in Comma Separated Value files -- you have commas inside the fields themselves.
These means, should you run them through sed, awk or whatever, based on commas you'll have extra fields:
afield,"another field","oh look, a false field",bugger
However, luckily, the field with the comma within is in double quotation marks.
This means we can run a regex to replace all such occurrances with the commas's unicode entity,
The regex works like this:
Then we can output the grabbed text between such and replace
- Look for text that starts with
- Keep grabbing text, which is not the end of the qutotation mark, until we get a comma
- Keep on grabbing again until we reach a double quotation mark
The regex, in vim syntax, looks like this:
\( are the grouping, and the
/\1\\u0027\2/ defines the replacement with the HTML entity, so they can be ignored for this explanation.
Leaving us with
," saying start the match with such, then
[^\"].* is saying only grab text that's not a double quotation mark.
, is saying look for the comma in the quotation marks, and then
.*" grabs everything until we get an ending quotation mark.
Then, since we're grouping everything except the comma, we can do the replacement:
sed unix csv awk regex vim
Edit on Github!