410 words
2 minutes
Combining substr and instr to extract text
Anubhav Gain
2024-08-13

Combining substr and instr to extract text#

Derek Willis has a Datasette instance full of political campaign emails running on Heroku.

Matt Hodges pointed out that a lot of these emails include refcode= codes, which are used by ActBlue campaigns to track clicks.

They look like this:

  • ...c-email?refcode=220210_FR_midmonth1kavin_plain] Jessica Mason...
  • ...hmp-footer?refcode=2021_footer&amount=25&a...

I thought it would be fun to extract just the codes.

The datasette-rure plugin adds regular expression support which can be used for this kind of thing, but in the absence of a plugin like that the only way to do it is with the SQLite instr() and substr() functions.

Here’s the query I figured out:

with snippets as (
select
substr(body, instr(body, 'refcode=') + 8, 128) as snippet
from
emails
where
body LIKE '%refcode%'
),
refcodes as (
select
snippet,
substr(
snippet,
0,
min(
case
when instr(snippet, '&') > 0 then instr(snippet, '&')
else 128
end,
case
when instr(snippet, ']') > 0 then instr(snippet, ']')
else 128
end,
case
when instr(snippet, ' ') > 0 then instr(snippet, ' ')
else 128
end,
case
when instr(snippet, '.') > 0 then instr(snippet, '.')
else 128
end
)
) as refcode
from
snippets
)
select
refcode,
count(*) as n
from
refcodes
group by
refcode
order by
n desc

I started by pulling out just the 128 characters following each refcode= - I picked 128 characters at random just to make the data easier to look at:

substr(body, instr(body, 'refcode=') + 8, 128) as snippet

instr(body, 'refcode=') + 8 gives the character after the = sign, because refcode= is 8 characters long.

Next I needed to find the first character following the refcode that was either a &, a ], a space or a.`. That’s what this bit does:

substr(
snippet,
0,
min(
case
when instr(snippet, '&') > 0 then instr(snippet, '&')
else 128
end,
case
when instr(snippet, ']') > 0 then instr(snippet, ']')
else 128
end,
case
when instr(snippet, ' ') > 0 then instr(snippet, ' ')
else 128
end,
case
when instr(snippet, '.') > 0 then instr(snippet, '.')
else 128
end
)
) as refcode

I’m trying to find the first instance of any of those characters - so I use instr to find them, but ignore any results where that returns 0 for “character not found” - in those cases I use the number 128 picked earlier. I can then grab the minimum of those scores.

Then finally I do a group-by/count to find the most common refcodes:

select
refcode,
count(*) as n
from
refcodes
group by
refcode
order by
n desc

Top results were:

refcoden
email_footer527
em_pt352
em_fr_2020285
pt254
emfooter242
em_fr_2021242
footer-bio198
footer_button192
em_footer173
email-footer168
168
footer164
em2021135
em_jc_fr_footer_link107
em_fr_2019107
em-footer104
em_fr_201884
ABD_EM_FR_202167
Combining substr and instr to extract text
https://mranv.pages.dev/posts/combining-substr-and-instr-to-extract-text/
Author
Anubhav Gain
Published at
2024-08-13
License
CC BY-NC-SA 4.0