Skipping CSV rows with odd numbers of quotes using ripgrep
I’m working with several huge CSV files - over 5 million rows total - and I ran into a problem: a few lines in those files had imported incorrectly because their quotes were not properly escaped.
Here’s an example of an invalid line:
```
SAI Exempt,"Patty B"s Hats & Tees,LLC",,26 Broad St
```
The apostrophe in Patty B's Hats & Tees is incorrectly represented here as a double quote, and since it occurs inside a double quoted string it breaks that line of CSV.
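For comparison, this is what the line presumably should have looked like - with a literal apostrophe (or, equivalently, with the embedded quote doubled, as CSV escaping rules require):
```
SAI Exempt,"Patty B's Hats & Tees,LLC",,26 Broad St
```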
I decided to filter out any rows that had an odd number of quotation marks in them - saving those broken lines to try to clean up later.
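One quick way to check that heuristic against the example line (a sketch of my own, not part of the import pipeline) is to strip every character except double quotes with tr and count what’s left:
```bash
# Delete all characters except double quotes, then count them - prints 3,
# an odd number, confirming the line is broken
echo 'SAI Exempt,"Patty B"s Hats & Tees,LLC",,26 Broad St' | tr -cd '"' | wc -c
```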
Finding rows with odd numbers of quotes
StackOverflow offered this regular expression for finding lines with an odd number of quotation marks:
```
[^"]*"     # Match any number of non-quote characters, then a quote
(?:        # Now match an even number of quotes by matching:
  [^"]*"   #   any number of non-quote characters, then a quote
  [^"]*"   #   twice
)*         # and repeat any number of times.
[^"]*      # Finally, match any remaining non-quote characters
```
I translated this into a ripgrep expression, adding ^ to the beginning and $ to the end in order to match whole lines.
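A quick way to sanity-check the full expression is to run it against two made-up lines - one with three quotes, one with two - and confirm that only the first matches:
```bash
# 'a,"b"s",c' has 3 quotes (odd) and should match;
# 'a,"b",c' has 2 quotes (even) and should not
printf '%s\n' 'a,"b"s",c' 'a,"b",c' \
  | rg '^[^"]*"(?:[^"]*"[^"]*")*[^"]*$'
```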
This command counted the number of invalid lines:
```bash
rg '^[^"]*"(?:[^"]*"[^"]*")*[^"]*$' --glob '*.csv' --count
```
```
04.csv:52
03.csv:42
02.csv:24
01.csv:29
```
Adding --invert-match showed me the count of lines that did NOT have an odd number of quotes:
```bash
rg '^[^"]*"(?:[^"]*"[^"]*")*[^"]*$' --glob '*.csv' --count --invert-match
```
```
05.csv:2829
04.csv:812351
03.csv:961311
02.csv:994265
01.csv:995404
```
05.csv appears here but not in the previous output because it had no lines with an odd number of quotes - rg --count omits files with zero matches. This shows that the invalid lines are a tiny subset of the overall files.
Removing --count shows the actual content.
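For example, this variant of the earlier command prints each offending line, prefixed with the filename it came from:
```bash
rg '^[^"]*"(?:[^"]*"[^"]*")*[^"]*$' --glob '*.csv'
```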
Importing into SQLite with sqlite-utils
I used this for loop to import only the valid lines into a SQLite database:
```bash
for file in *.csv; do
  rg --invert-match '^[^"]*"(?:[^"]*"[^"]*")*[^"]*$' "$file" | \
    sqlite-utils insert my.db rows - --csv
done
```
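To double-check the import, here’s a quick sketch using sqlite-utils’ default query command (the my.db database and rows table names come from the loop above):
```bash
# Count the rows that made it into the database - this should be close to
# the --invert-match totals above, minus the header line consumed per file
sqlite-utils my.db 'select count(*) from rows'
```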
Saving the broken lines for later
To save the lines that contained odd numbers of double quotes I used this command:
```bash
rg '^[^"]*"(?:[^"]*"[^"]*")*[^"]*$' \
  --glob '*.csv' \
  --no-line-number \
  --no-filename > saved.txt
```
Since I don’t actually care which file they lived in - all of these CSV files share the same structure - I used --no-filename to omit the filename from the results and --no-line-number to omit the line number. The result is a saved.txt file containing just the raw CSV data that I skipped from the import.
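As a final sanity check - my own addition, not part of the original workflow - the line count of saved.txt should equal the sum of the per-file match counts from earlier:
```bash
wc -l saved.txt
# Expected: 147 lines (52 + 42 + 24 + 29)
```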