Identifying column combination patterns in a SQLite table#
Given a large, heterogeneous table I wanted to identify patterns in the rows in terms of which columns were not null.
Imagine a table like this for example:
1id field1 field2 field3 field4
I want to know how many records have values for (id, field1, field2) compared to the number of rows with values for (id, field3, field4).
I worked out the following query pattern for answering this question:
1select2 case when [pk] is not null then 'pk, ' else '' end ||3 case when [created] is not null then 'created, ' else '' end ||4 case when [planet_int] is not null then 'planet_int, ' else '' end ||5 case when [on_earth] is not null then 'on_earth, ' else '' end ||6 case when [state] is not null then 'state, ' else '' end ||7 case when [_city_id] is not null then '_city_id, ' else '' end ||8 case when [_neighborhood] is not null then '_neighborhood, ' else '' end ||9 case when [tags] is not null then 'tags, ' else '' end ||10 case when [complex_array] is not null then 'complex_array, ' else '' end ||11 case when [distinct_some_null] is not null then 'distinct_some_null, ' else '' end12 as columns,13 count(*) as num_rows14from15 [facetable]16group by17 columns18order by19 num_rows desc
This has the desired effect: it gives me back all of the combinations of not-null columns in the table, with a count for each one.
(Running this on a table with 1,000,000+ rows took about 40 seconds, so I had to use datasette data.db --setting sql_time_limit_ms 100000
to bump up the default time limit in Datasette.)
One remaining problem: how to generate the above query for an arbitrary table. I came up with the following SQL query for generating a SQL query like the above:
1select 'select2' || group_concat(' case when [' || name || '] is not null then ' || quote(name || ', ') || ' else '''' end', ' ||3') || '4 as columns,5 count(*) as num_rows6from7 [' || :table || ']8group by9 columns10order by11 num_rows desc' as query from pragma_table_info(:table)
Try that out in a demo that includes the datasette-query-links plugin.
This takes :table
as an input and generates SQL which can be used to generate column-combination counts.