9.11. Change the Delimiter Used in CSV Files

Problem

You want to change all field-delimiting commas in a CSV file to tabs. Commas that occur within double-quoted values should be left alone.

Solution

The following regular expression matches an individual CSV field along with its preceding delimiter, if any. The preceding delimiter is usually a comma, but can also be an empty string (i.e., nothing) when matching the first field of the first record, or a line break when matching the first field of any subsequent record. Every time a match is found, the field itself, including the double quotes that may surround it, is captured to backreference 2, and its preceding delimiter is captured to backreference 1.

Tip

The regular expressions in this recipe are designed to work correctly with valid CSV files only, according to the format rules discussed in Comma-Separated Values (CSV).

(,|
?
|^)([^",
]+|"(?:[^"]|"")*")?
Regex options: None
Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python, Ruby

Here is the same regular expression again in free-spacing mode:

( , | 
?
 | ^ )   # Capture the leading field delimiter to backref 1
(                   # Capture a single field to backref 2:
  [^",
]+        #   Unquoted field
|                   #  Or:
  " (?:[^"]|"")* "  #   Quoted field (may contain escaped double quotes)
)?                  # The group is optional because fields may be empty
Regex options: Free-spacing
Regex flavors: .NET, Java, XRegExp, PCRE, Perl, Python, Ruby

Using this regex and the code in Recipe 3.11, you can iterate over your CSV file and check the value of backreference 1 after each match. The necessary replacement string for each match depends on the value of this backreference. If it’s a comma, replace it with a tab character. If the backreference is empty or contains a line break, leave the value in place (i.e., do nothing, or put it back as part of a replacement string). Since CSV fields are captured to backreference 2 as part of each match, you’ll also have to put that back as part of each replacement string. The only things you’re actually replacing are the commas that are captured to backreference 1.

Example web page with JavaScript

The following code is a complete web page that includes two multiline text input fields, with a button labeled Replace between them. Clicking the button takes whatever string you put into the first text box (labeled Input), converts any comma delimiters to tabs with the help of the regular expression just shown, then puts the new string into the second text box (labeled Output). If you use valid CSV content as your input, it should show up in the second text box with all comma delimiters replaced with tabs. To test it, save this code into a file with the .html extension and open it in your favorite web browser:

<html>
<head>
<title>Change CSV delimiters from commas to tabs</title>
</head>

<body>
<p>Input:</p>
<textarea id="input" rows="5" cols="75"></textarea>

<p><input type="button" value="Replace" onclick="commasToTabs()"></p>

<p>Output:</p>
<textarea id="output" rows="5" cols="75"></textarea>

<script>
function commasToTabs() {
    var input  = document.getElementById("input"),
        output = document.getElementById("output"),
        regex  = /(,|
?
|^)([^",
]+|"(?:[^"]|"")*")?/g,
        result = "",
        match;

    while (match = regex.exec(input.value)) {
        // Check the value of backreference 1
        if (match[1] == ",") {
            // Add a tab (in place of the matched comma) and backreference
            // 2 to the result. If backreference 2 is undefined (because
            // the optional, second capturing group did not participate in
            // the match), use an empty string instead.
            result += "	" + (match[2] || "");
        } else {
            // Add the entire match to the result
            result += match[0];
        }

        // If there is an empty match, prevent some browsers from getting
        // stuck in an infinite loop
        if (match.index == regex.lastIndex) {
            regex.lastIndex++;
        }
    }

    output.value = result;
}
</script>
</body>
</html>

Discussion

The approach prescribed by this recipe allows you to pass over each complete CSV field (including any embedded line breaks, escaped double quotes, and commas) one at a time. Each match then starts just before the next field delimiter.

The first capturing group in the regex, (,| ? |^), matches a comma, line break, or the position at the beginning of the subject string. Since the regex engine will attempt alternatives from left to right, these options are listed in the order in which they will most frequently occur in the average CSV file. This capturing group is the only part of the regex that is required to match. Therefore, it’s possible for the complete regex to match an empty string since the ^ anchor can match once. The value matched by this first capturing group must be checked in the code outside of the regex that replaces commas with your substitute delimiters (in this case, tabs).

We haven’t yet gotten through the entire regex, but the approach described so far is already somewhat convoluted. You might be wondering why the regex is not written to match only the commas that should be replaced with tabs. If you could do that, a simple substitution of all matched text would avoid the need for code outside of the regex to check whether capturing group 1 matched a comma or some other string. After all, it should be possible to use lookahead and lookbehind to determine whether a comma is inside or outside a quoted CSV field, right?

Unfortunately, in order for such an approach to accurately determine which commas are outside of double-quoted fields, you’d need infinite-length lookbehind, which is available in the .NET regex flavor only (see Different levels of lookbehind for a discussion of the varying lookbehind limitations). Even .NET developers should avoid a lookaround-based approach since it would add significant complexity and also make the regex slower.

Getting back to how the regex works, most of the pattern appears within the next set of parentheses: capturing group 2. This second group matches a single CSV field, including any surrounding double quotes. Unlike the previous capturing group, this one is optional in order to allow matching empty fields.

Note that group 2 within the regex contains two alternative patterns separated by the | metacharacter. The first alternative, [^", ]+, is a negated character class followed by a one-or-more quantifier (+) that, together, match an entire unquoted field. For this to match, the field cannot contain any double quotes, commas, or line breaks.

The second alternative within group 2, "(?:[^"]|"")*", matches a field surrounded by double quotes. More precisely, it matches a double quote character, followed by zero or more non-double-quote characters or repeated (escaped) double quotes, followed by a closing double quote. The * quantifier at the end of the noncapturing group continues repeating the two options within the group until it reaches a double quote that is not repeated and therefore ends the field.

Assuming you’re working with a valid CSV file, the first match found by this regex should occur at the beginning of the subject string, and each subsequent match should occur immediately after the end of the last match.

See Also

Recipe 9.12 describes how to reuse the regex in this recipe to extract CSV fields from a specific column.

Techniques used in the regular expressions in this recipe are discussed in Chapter 2. Recipe 2.2 explains how to match nonprinting characters. Recipe 2.3 explains character classes. Recipe 2.5 explains anchors. Recipe 2.8 explains alternation. Recipe 2.9 explains grouping. Recipe 2.12 explains repetition.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset