Skipping CSV rows with odd numbers of quotes using ripgrep#
I’m working with several huge CSV files - over 5 million rows total - and I ran into a problem: it turned out there were a few lines in those files that imported incorrectly because they were not correctly escaped.
Here’s an example of an invalid line:
1SAI Exempt,"Patty B"s Hats & Tees,LLC",,26 Broad St
The apostrophe in Patty B's Hats & Tees
is incorrectly represented here as a double quote, and since that’s in a double quoted string it breaks that line of CSV.
I decided to filter out any rows that had an odd number of quotation marks in them - saving those broken lines to try and clean up later.
Finding rows with odd numbers of quotes#
StackOverflow offered this regular expression for finding lines with an odd number of quotation marks:
1[^"]*" # Match any number of non-quote characters, then a quote2(?: # Now match an even number of quotes by matching:3 [^"]*" # any number of non-quote characters, then a quote4 [^"]*" # twice5)* # and repeat any number of times.6[^"]* # Finally, match any remaining non-quote characters
I translated this into a ripgrep
expression, adding ^
to the beginning and $
to the end in order to match whole strings.
This command counted the number of invalid lines:
1rg '^[^"]*"(?:[^"]*"[^"]*")*[^"]*$' --glob '*.csv' --count2
304.csv:52403.csv:42502.csv:24601.csv:29
Adding --invert-match
showed me the count of lines that did NOT have an odd number of quotes:
1rg '^[^"]*"(?:[^"]*"[^"]*")*[^"]*$' --glob '*.csv' --count --invert-match2
305.csv:2829404.csv:812351503.csv:961311602.csv:994265701.csv:995404
This shows that the invalid lines are a tiny subset of the overall files.
Removing --count
shows the actual content.
Importing into SQLite with sqlite-utils#
I used this for loop to import only the valid lines into a SQLite database:
1for file in *.csv;2 do rg --invert-match '^[^"]*"(?:[^"]*"[^"]*")*[^"]*$' $file | \3 sqlite-utils insert my.db rows - --csv;4done;
Saving the broken lines for later#
To save the lines that contained odd numbers of double quotes I used this command:
1rg '^[^"]*"(?:[^"]*"[^"]*")*[^"]*$' \2 --glob '*.csv' \3 --no-line-number \4 --no-filename > saved.txt
Since I don’t actually care which file they lived in - all of these CSV files share the same structure - I used --no-filename
to omit the filename from the results and --no-line-number
to omit the line number. The result is a saved.txt
file containing just the raw CSV data that I skipped from the import.