Identifying column combination patterns in a SQLite table
Given a large, heterogeneous table I wanted to identify patterns in the rows in terms of which columns were not null.
Imagine a table like this for example:
id field1 field2 field3 field4I want to know how many records have values for (id, field1, field2) compared to the number of rows with values for (id, field3, field4).
I worked out the following query pattern for answering this question:
select case when [pk] is not null then 'pk, ' else '' end || case when [created] is not null then 'created, ' else '' end || case when [planet_int] is not null then 'planet_int, ' else '' end || case when [on_earth] is not null then 'on_earth, ' else '' end || case when [state] is not null then 'state, ' else '' end || case when [_city_id] is not null then '_city_id, ' else '' end || case when [_neighborhood] is not null then '_neighborhood, ' else '' end || case when [tags] is not null then 'tags, ' else '' end || case when [complex_array] is not null then 'complex_array, ' else '' end || case when [distinct_some_null] is not null then 'distinct_some_null, ' else '' end as columns, count(*) as num_rowsfrom [facetable]group by columnsorder by num_rows descThis has the desired effect: it gives me back all of the combinations of not-null columns in the table, with a count for each one.
(Running this on a table with 1,000,000+ rows took about 40 seconds, so I had to use datasette data.db --setting sql_time_limit_ms 100000 to bump up the default time limit in Datasette.)
One remaining problem: how to generate the above query for an arbitrary table. I came up with the following SQL query for generating a SQL query like the above:
select 'select' || group_concat(' case when [' || name || '] is not null then ' || quote(name || ', ') || ' else '''' end', ' ||') || ' as columns, count(*) as num_rowsfrom [' || :table || ']group by columnsorder by num_rows desc' as query from pragma_table_info(:table)Try that out in a demo that includes the datasette-query-links plugin.
This takes :table as an input and generates SQL which can be used to generate column-combination counts.