706 words
4 minutes
Consecutive groups in SQL using window functions
Anubhav Gain
2024-05-14

Consecutive groups in SQL using window functions#

I have a database table with all of my Swarm checkins since 2011, created using my swarm-to-sqlite tool.

I wanted to run a query to show the date ranges I had spent in different countries, based on the country of the venue of each of those checkins.

I ended up learning how to perform operations against groups of rows identified by a consecutive sequence of values in a column, using window functions.

Here’s a simplified version of the query I ended up with (the final query with joins is here):

with ordered as (
select
created,
country,
lag(country) over (order by created desc)
as previous_country
from
raw
),
grouped as (
select
country,
created,
count(*) filter (
where previous_country is null
or previous_country != country
) over (
order by created desc
rows between unbounded preceding
and current row
) as grp
from
ordered
)
select
country,
date(min(created)) as start,
date(max(created)) as end,
cast(
julianday(date(max(created))) -
julianday(date(min(created))) as integer
) as days
from
grouped
group by
country, grp
order by
start desc;

Here’s the raw data I ran that against:

createdcitycountry
2023-06-08T08:16:36LondonUnited Kingdom
2023-05-11T21:05:35San FranciscoUnited States
2023-05-08T01:56:49Half Moon BayUnited States
2023-05-04T22:35:05San FranciscoUnited States
2023-05-02T20:10:39StanfordUnited States
2023-04-24T20:47:46San FranciscoUnited States
2023-04-20T20:13:53Salt Lake CityUnited States
2023-04-19T00:43:10San FranciscoUnited States
2023-03-02T19:43:55NashvilleUnited States
2019-09-20T18:25:45Palo AltoUnited States
2019-09-20T14:52:47San FranciscoUnited States
2019-09-02T18:08:32San FranciscoUnited States
2019-08-31T10:16:03LavalFrance
2019-08-26T13:01:12France
2019-08-26T12:55:03ChartresFrance
2019-08-25T12:51:16ParisFrance
2019-08-07T08:16:37Madagascar
2019-08-07T06:44:50Madagascar
2019-08-03T15:08:30Madagascar
2019-08-02T17:13:51Madagascar
2019-07-31T15:03:17Madagascar
2019-07-25T06:29:22Le Mesnil-AmelotFrance
2019-06-30T21:57:26San FranciscoUnited States
2019-06-30T18:27:28San FranciscoUnited States
2019-06-02T02:34:59SacramentoUnited States
2019-05-29T15:54:49San FranciscoUnited States
2019-05-04T23:04:18ClevelandUnited States
2018-09-10T09:47:22IlfracombeUnited Kingdom
2018-09-09T15:30:32BidefordUnited Kingdom
2018-08-30T22:27:08LondonUnited Kingdom
2018-08-29T19:45:13LondonUnited Kingdom
2018-08-10T01:44:17San FranciscoUnited States
2018-08-09T00:38:03San FranciscoUnited States
2018-08-05T20:53:56San FranciscoUnited States

And the output of the query:

countrystartenddays
United Kingdom2023-06-082023-06-080
United States2019-09-022023-05-111347
France2019-08-252019-08-316
Madagascar2019-07-312019-08-077
France2019-07-252019-07-250
United States2019-05-042019-06-3057
United Kingdom2018-08-292018-09-1012
United States2018-08-052018-08-105

How that query works#

There are three steps to the query.

  1. Extend the raw data with an extra previous_country column which shows the country for the previous row, using the lag() window function.
  2. Group the data by country, using another window function to assign a group number to each row. That group number increments whenever the country changes.
  3. Group the data by country and group number, and use min() and max() to find the start and end dates for each group and to calculate the number of days in each group.

The first two steps are implemented using CTEs. Let’s break those down:

with ordered as (
select
created,
country,
lag(country) over (order by created desc)
as previous_country
from
raw
)
select * from ordered

Truncated output:

createdcountryprevious_country
2023-06-08T08:16:36United Kingdom
2023-05-11T21:05:35United StatesUnited Kingdom
2023-05-08T01:56:49United StatesUnited States
2023-05-04T22:35:05United StatesUnited States
2019-08-31T10:16:03FranceUnited States
2019-08-26T13:01:12FranceFrance
2019-08-25T12:51:16FranceFrance
2019-08-07T08:16:37MadagascarFrance
2019-08-07T06:44:50MadagascarMadagascar
2019-07-31T15:03:17MadagascarMadagascar
2019-07-25T06:29:22FranceMadagascar

The magic here is this window function:

lag(country) over (order by created desc) as previous_country

That over (order by created desc) part means “consider all rows in this table, ordered by the created column in descending order”. lag(country) means “take the value of the country column from the previous row within that window”.

The next CTE adds a grp column which increments whenever the country changes:

with ordered as (
select
created,
country,
lag(country) over (order by created desc)
as previous_country
from
raw
),
grouped as (
select
country,
created,
count(*) filter (
where previous_country is null
or previous_country != country
) over (
order by created desc
rows between unbounded preceding
and current row
) as grp
from
ordered
)
select * from grouped

Truncated output:

countrycreatedgrp
United Kingdom2023-06-08T08:16:361
United States2023-05-11T21:05:352
United States2023-05-08T01:56:492
United States2023-05-04T22:35:052
France2019-08-31T10:16:033
France2019-08-26T13:01:123
France2019-08-26T12:55:033
Madagascar2019-08-07T08:16:374
Madagascar2019-08-07T06:44:504
France2019-07-25T06:29:225
United States2019-06-30T21:57:266

This window function is more complicated:

count(*) filter (
where previous_country is null
or previous_country != country
) over (
order by created desc
rows between unbounded preceding
and current row
) as grp

The over clause here defines a window that includes all of the rows up to and including the current one, ordered by created in descending order.

The count(*) filter (...) part counts the number of rows in that window where either previous_country is null or previous_country != country.

The final step is to group by country and group number, and to use min() and max() to find the start and end dates for each group and to calculate the number of days in each group:

...
select
country,
date(min(created)) as start,
date(max(created)) as end,
cast(
julianday(date(max(created))) -
julianday(date(min(created))) as integer
) as days
from
grouped
group by
country, grp
order by
start desc;

Grouping by country, grp returns a single row per group. The min(created) and max(created) functions can then find the earliest and latest dates in each group.

I’m using date() to turn those created timestamps into YYYY-MM-DD dates.

Finally, I use julianday() to convert those dates into Julian day numbers, and then subtract the two numbers to get the number of days between them. The result is a floating point number, so I use cast(... as integer) to convert it to an integer.

Consecutive groups in SQL using window functions
https://mranv.pages.dev/posts/consecutive-groups-in-sql-using-window-functions/
Author
Anubhav Gain
Published at
2024-05-14
License
CC BY-NC-SA 4.0