Using DuckDB in Python to access Parquet data
Did a quick experiment with DuckDB today, inspired by the bmschmidt/hathi-binary repo.
That repo includes 3GB of data spread across 68 Parquet files, each around 45MB.
DuckDB can run queries against Parquet data really fast.
I checked out the repo like this:
cd /tmp
git clone https://github.com/bmschmidt/hathi-binary
cd hathi-binary
To install DuckDB:
pip install duckdb
Then in a Python console:
>>> import duckdb
>>> db = duckdb.connect()  # No need to pass a file name, we will use a VIEW
>>> db.execute("CREATE VIEW hamming AS SELECT * FROM parquet_scan('parquet/*.parquet')")
<duckdb.DuckDBPyConnection object at 0x110eab530>
>>> db.execute("select count(*) from hamming").fetchall()
[(17123746,)]
>>> db.execute("select sum(A), sum(B), sum(C) from hamming").fetchall()
[(19478990546114240096822710, 16303362475198894881395004, 43191807707832192976154883)]
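The view is just a convenience so I don't have to repeat the glob in every query. As a minimal sketch (assuming the same parquet/ directory), you can also scan the files directly each time:

# A sketch: scanning the Parquet files directly, without creating the view
import duckdb

db = duckdb.connect()
print(db.execute(
    "select count(*) from parquet_scan('parquet/*.parquet')"
).fetchall())  # -> [(17123746,)]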
There are 17,123,746 records in the 3GB of Parquet data.
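To see how those records split across the 68 files, a sketch using the filename flag on parquet_scan() (an assumption about your DuckDB version; this flag adds a filename column you can group by):

# A sketch, assuming your DuckDB version supports filename=true on parquet_scan()
import duckdb

db = duckdb.connect()
print(db.execute("""
    select filename, count(*) as num_rows
    from parquet_scan('parquet/*.parquet', filename=true)
    group by filename
    order by filename
    limit 5
""").fetchall())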
I switched to IPython so I could time a query. First I ran a query to see what a record looks like, using .df().to_dict() to convert the result into a DataFrame and then a Python dictionary:
In [17]: db.execute("select * from hamming limit 1").df().to_dict()
Out[17]:
{'htid': {0: 'uc1.b3209520'},
 'A': {0: -3968610387004385723},
 'B': {0: 7528965001168362882},
 'C': {0: 5017761927246436345},
 'D': {0: 2866021664979717155},
 'E': {0: -8718458467632335109},
 'F': {0: 3783827906913154091},
 'G': {0: -883843087282811188},
 'H': {0: 4045142741717613284},
 'I': {0: -9144138405661797607},
 'J': {0: 3285280333149952194},
 'K': {0: -3352555231043531556},
 'L': {0: 2071206943103704211},
 'M': {0: -5859914591541496612},
 'N': {0: -4209182319449999971},
 'O': {0: 2040176595216801886},
 'P': {0: 860910514658882647},
 'Q': {0: 3505065119653024843},
 'R': {0: -3110594979418944378},
 'S': {0: -8591743965043807123},
 'T': {0: -3262129165685658773}}
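That nested shape is pandas' default column-oriented to_dict(). A minimal sketch, if you'd rather have one flat dictionary per row, is to pass the standard pandas orient="records" argument:

# One dict per row instead of the column-oriented default
row = db.execute("select * from hamming limit 1").df().to_dict("records")[0]
print(row["htid"])  # 'uc1.b3209520'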
Then I timed an aggregate query using %time:
In [18]: %time db.execute("select sum(A), sum(B), sum(C) from hamming").fetchall()
CPU times: user 1.13 s, sys: 488 ms, total: 1.62 s
Wall time: 206 ms
Out[18]:
[(19478990546114240096822710, 16303362475198894881395004, 43191807707832192976154883)]
206ms to sum three columns across 17 million records is pretty fast!
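If you're in a plain Python shell rather than IPython, a minimal sketch of the same wall-clock measurement using only the standard library:

import time
import duckdb

db = duckdb.connect()
db.execute("CREATE VIEW hamming AS SELECT * FROM parquet_scan('parquet/*.parquet')")

start = time.perf_counter()
result = db.execute("select sum(A), sum(B), sum(C) from hamming").fetchall()
elapsed_ms = (time.perf_counter() - start) * 1000
print(f"{elapsed_ms:.0f} ms", result)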