

Tagged: regex


Replacing commas in fields in CSV files with regex

Sometimes – in Comma Separated Value files – you have commas inside the fields themselves.

This means that if you split them on commas with sed, awk or similar, you’ll get extra fields:

afield,"another field","oh look, a false field",bugger

However, luckily, the field with the comma within is in double quotation marks.

This means we can run a regex to replace such occurrences with the comma’s Unicode escape, \u002C (written \\u002C in the replacement, since vim needs the backslash escaped).

The regex works like this:

  1. Look for text that starts with ,"
  2. Keep grabbing text that is not a double quotation mark, until we get a comma
  3. Keep on grabbing again until we reach a double quotation mark

Then we can output the grabbed text on either side of the comma and replace , with \\u002C

The regex, in vim syntax, looks like this:

%s/\(,"[^\"]*\),\(.*"\)/\1\\u002C\2/

\( and \) are the grouping, and the /\1\\u002C\2/ defines the replacement with the Unicode escape, so they can be ignored for this explanation.

,"[^\"]*,.*"

That leaves ," which says to start the match there, then [^\"]* says to only grab text that’s not a double quotation mark.

Then , says to look for the comma inside the quotation marks, and .*" grabs everything up to an ending quotation mark. Because .* is greedy, this is the last double quotation mark on the line, so one run replaces one comma per line.

Then, since we’re grouping everything except the comma, we can do the replacement: /\1\\u002C\2/
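
As a rough cross-check outside vim, the same pattern can be sketched in Go, whose RE2 syntax drops the backslashes on the group parentheses. The names escapeOnce and escapeAll are made up for this sketch, and repeating the pass until nothing changes is an addition of mine to cover fields with several commas, not part of the vim command above.

```go
package main

import (
	"fmt"
	"regexp"
)

// quotedComma is the vim pattern rewritten in Go (RE2) syntax.
// Group 1: `,"` plus the quoted text before the comma.
// Group 2: everything after the comma up to the last double quote.
var quotedComma = regexp.MustCompile(`(,"[^"]*),(.*")`)

// escapeOnce performs one pass. On a single line it changes at most one
// comma, because .* runs to the last double quote on the line.
func escapeOnce(line string) string {
	// In a Go replacement template only $ is special, so \u002C is literal text.
	return quotedComma.ReplaceAllString(line, `${1}\u002C${2}`)
}

// escapeAll (hypothetical helper) repeats the pass until nothing changes,
// which also handles quoted fields that contain several commas.
func escapeAll(line string) string {
	for {
		next := escapeOnce(line)
		if next == line {
			return line
		}
		line = next
	}
}

func main() {
	fmt.Println(escapeOnce(`afield,"another field","oh look, a false field",bugger`))
	fmt.Println(escapeAll(`a,"b, c, d",e`))
}
```

Like the vim version, this only looks at quoted fields that follow a comma, so a quoted first field is left alone.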

sed unix csv awk regex vim

Sanitizing CSV files with regex

Often you want to use a CSV file, but commas within fields and double or single quotation marks can trip up other programs.

  1. The first regex will replace a comma in a double-quoted field with its Unicode escape (only if that field is not the first field, however)
  2. The second will then remove all the double quotation marks
  3. The third will replace the single quotation marks with their Unicode escape

These are all in vim syntax.

%s/\(,"[^\"]*\),\(.*"\)/\1\\u002C\2/
%s/"//g
%s/'/\\u0027/g
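
For comparison, here is a minimal sketch of the same three steps in Go. The function name sanitizeLine and the sample line are invented for illustration; the first step uses the same pattern as the vim command, so it shares its limits (one comma per pass, and never the first field).

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// quotedComma is the first vim pattern in Go (RE2) syntax.
var quotedComma = regexp.MustCompile(`(,"[^"]*),(.*")`)

// sanitizeLine (hypothetical helper) applies the three steps in order.
func sanitizeLine(line string) string {
	// Step 1: a comma inside a double-quoted field becomes the text \u002C.
	line = quotedComma.ReplaceAllString(line, `${1}\u002C${2}`)
	// Step 2: remove every double quotation mark.
	line = strings.ReplaceAll(line, `"`, "")
	// Step 3: single quotation marks become the text \u0027.
	line = strings.ReplaceAll(line, "'", `\u0027`)
	return line
}

func main() {
	fmt.Println(sanitizeLine(`id,"Smith, John",it's fine`))
}
```

The order matters: step 1 relies on the double quotation marks that step 2 removes.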

unix csv regex vim

Golang regex basics: Match and find

You can create a regular expression conforming to RE2 syntax using the Compile or MustCompile functions in the regexp package.

    rp := regexp.MustCompile("[a-z]+")
    rp1, err := regexp.Compile("[a-z]+")

The first panics if the regex is invalid. The second returns an error instead.
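
A small sketch of when each one fits, assuming a made-up helper called isValidPattern: MustCompile suits patterns fixed in the source, while Compile suits patterns that only arrive at run time.

```go
package main

import (
	"fmt"
	"regexp"
)

// lower is compiled once at package level. A typo in a fixed pattern
// should fail loudly at startup, which is what MustCompile does.
var lower = regexp.MustCompile("[a-z]+")

// isValidPattern (hypothetical helper) uses Compile so that a bad pattern,
// for example from user input, is reported instead of causing a panic.
func isValidPattern(pattern string) bool {
	_, err := regexp.Compile(pattern)
	return err == nil
}

func main() {
	fmt.Println(lower.String())           // [a-z]+
	fmt.Println(isValidPattern("[a-z]+")) // true
	fmt.Println(isValidPattern("[a-z"))   // false: missing closing bracket
	fmt.Println(isValidPattern(`(a)\1`))  // false: RE2 has no backreferences
}
```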

You can match with MatchString(str). If you omit the String in the method name, it will expect a byte slice.

    foundBool := rp.MatchString("abc")
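
One detail worth sketching: MatchString reports whether the text contains a match anywhere, so checking the whole string needs ^ and $ anchors. The variable names below are illustrative only.

```go
package main

import (
	"fmt"
	"regexp"
)

var (
	anyLower = regexp.MustCompile("[a-z]+")   // matches anywhere in the input
	allLower = regexp.MustCompile("^[a-z]+$") // anchored to the whole input
)

func main() {
	fmt.Println(anyLower.MatchString("123abc")) // true: a match anywhere counts
	fmt.Println(allLower.MatchString("123abc")) // false: the digits break the anchored match
	fmt.Println(anyLower.Match([]byte("abc")))  // true: same check on a byte slice
}
```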

The Find methods – and others – conform to this general pattern.

  • The FindString method returns the first match
  • If you have the word All (FindAllString) in the name it returns a specified number of matches, or all with -1.

    rp.FindString("abc def") // "abc"
    rp.FindAllString("abc def", -1) // ["abc", "def"]
    
  • If you have the word Submatch (FindAllStringSubmatch) it will also give you any groups in the match. Combined with All, the result is a two-dimensional slice, one row per match.

    rp := regexp.MustCompile("([a-z])([a-z])[a-z]+")
    rp.FindAllStringSubmatch("abc", -1) // [["abc", "a", "b"]]

  • If you have the word Index at the end it will return an int slice with the start and end positions of the match, including groups

    rp := regexp.MustCompile("([a-z])([a-z])[a-z]+")
    rp.FindAllStringSubmatchIndex("abc", -1) // [[0, 3, 0, 1, 1, 2]]


The first two ints are the start and end of the whole match. The next two are the first group’s match, and the two after that the second group’s.
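
To make the index pairs concrete, the sketch below slices the input with them to rebuild the same strings the Submatch methods return. groupsByIndex is a hypothetical helper written for this example, not part of the regexp package.

```go
package main

import (
	"fmt"
	"regexp"
)

var twoPlus = regexp.MustCompile("([a-z])([a-z])[a-z]+")

// groupsByIndex (hypothetical helper) rebuilds the whole match and its
// groups for the first match in s from the index pairs.
func groupsByIndex(re *regexp.Regexp, s string) []string {
	loc := re.FindStringSubmatchIndex(s)
	if loc == nil {
		return nil // no match at all
	}
	out := make([]string, 0, len(loc)/2)
	for i := 0; i+1 < len(loc); i += 2 {
		if loc[i] < 0 {
			out = append(out, "") // this group did not take part in the match
			continue
		}
		out = append(out, s[loc[i]:loc[i+1]])
	}
	return out
}

func main() {
	s := "abc def"
	fmt.Println(twoPlus.FindAllStringSubmatchIndex(s, -1)) // [[0 3 0 1 1 2] [4 7 4 5 5 6]]
	fmt.Printf("%q\n", groupsByIndex(twoPlus, s))          // ["abc" "a" "b"]
}
```

The end position in each pair is exclusive, which is why the pairs can be used directly as slice bounds.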

golang golang-regex

Golang regex: Replace and split

You can use the ReplaceAllString method to replace matches in a string, and rearrange groups if needed.

    rp := regexp.MustCompile("([a-z]+) ([a-z]+)")
    rp.ReplaceAllString("abc def ghi", "$2 $1") // "def abc ghi"

$1 refers to the first group match, and $2 to the second. You could also just enter plain text to replace the entirety of the “abc def” match.

ReplaceAllLiteralString inserts the replacement text as-is, so the dollar sign is treated literally.
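
The difference is easiest to see side by side. This sketch also shows one template pitfall: $1x is read as a group named 1x rather than group 1 followed by x, so braces are needed. The variable name pair is just for illustration.

```go
package main

import (
	"fmt"
	"regexp"
)

var pair = regexp.MustCompile("([a-z]+) ([a-z]+)")

func main() {
	// Template expansion: $2 and $1 are replaced by the groups.
	fmt.Println(pair.ReplaceAllString("abc def ghi", "$2 $1")) // def abc ghi
	// Literal replacement: the text "$2 $1" is inserted unchanged.
	fmt.Println(pair.ReplaceAllLiteralString("abc def ghi", "$2 $1")) // $2 $1 ghi
	// $1x names a group "1x", which does not exist, so it expands to nothing.
	fmt.Printf("%q\n", pair.ReplaceAllString("abc def", "$1x")) // ""
	// ${1}x is group 1 followed by a literal x.
	fmt.Printf("%q\n", pair.ReplaceAllString("abc def", "${1}x")) // "abcx"
}
```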

You can also use a function to do the replacement. You cannot use groups here, however, as of Go 1.1.1 anyway.

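    rp = regexp.MustCompile("[a-z]+") // match single words, so the function sees "abc" and "def" separately
    rp.ReplaceAllStringFunc("abc def", func(s string) string {
            if s == "abc" {
                    return "HA"
            }
            return s
    }) // "HA def"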
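
One possible workaround for the missing groups, sketched here under the assumption that the pattern matches the same way when run on the matched text alone (which may not hold for patterns that depend on surrounding context, such as \b or ^): run FindStringSubmatch again inside the function. The name swapAndShout is invented for the example.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

var pair = regexp.MustCompile("([a-z]+) ([a-z]+)")

// swapAndShout (hypothetical helper) swaps each pair of words and
// upper-cases the one that moves to the front.
func swapAndShout(s string) string {
	return pair.ReplaceAllStringFunc(s, func(m string) string {
		// Re-run the pattern on the matched text to recover the groups.
		g := pair.FindStringSubmatch(m)
		if g == nil {
			return m // leave the text alone if the re-match fails
		}
		return strings.ToUpper(g[2]) + " " + g[1]
	})
}

func main() {
	fmt.Println(swapAndShout("abc def ghi")) // DEF abc ghi
}
```

This costs a second match per replacement, which is usually fine for short inputs.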

You can split a string using the Split method. The second argument, an integer, is the maximum number of substrings to return. -1 means as many as possible.

    rp = regexp.MustCompile("a")
    i := rp.Split("zzzzazzzzz", -1) // ["zzzz", "zzzzz"]
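
A short sketch of how the second argument changes the result, using an illustrative pattern and input of my own: a positive n caps the number of substrings and leaves the remainder unsplit in the last one, and 0 returns nothing.

```go
package main

import (
	"fmt"
	"regexp"
)

var digits = regexp.MustCompile("[0-9]+")

func main() {
	s := "a1b22c"
	fmt.Printf("%q\n", digits.Split(s, -1)) // ["a" "b" "c"]
	fmt.Printf("%q\n", digits.Split(s, 2))  // ["a" "b22c"]
	fmt.Printf("%q\n", digits.Split(s, 0))  // []
}
```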

golang golang-regex

Java: List all the files in a directory based on a regex

If you define a regular expression with Pattern, then use the matcher() and matches() methods on it within a FileFilter passed to a File’s listFiles() method, you will get back an array of the files in that directory whose names match the regex.

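import java.io.File;
import java.io.FileFilter;
import java.util.regex.Pattern;
// Assumes regex (a String) and file (a File pointing at a directory) are already defined.
final Pattern p = Pattern.compile(regex);
// listFiles() returns null if file is not a directory or an I/O error occurs.
File[] pagesTemplates = file.listFiles(new FileFilter() {
    @Override
    public boolean accept(File f) {
        // matches() requires the whole file name to match the regex
        return p.matcher(f.getName()).matches();
    }
});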

java java-io java-regex
