
Storing and serving related documents with openai-to-sqlite and embeddings

I decided to upgrade the related articles feature on my TILs site. Previously I calculated these using full-text search, but I wanted to try a new trick: using OpenAI embeddings for document similarity instead.

My openai-to-sqlite CLI tool already provides a mechanism for calculating embeddings against text and storing them in a SQLite database.

I was going to add a command for calculating similarity based on those embeddings… and then I saw that Benoit Delbosc had opened a pull request implementing that feature already!

I took Benoit’s work and expanded it. In particular, I added an option for saving the resulting calculations to a database table.

This meant I could find and then save related articles for my TILs by running the following:

wget https://s3.amazonaws.com/til.simonwillison.net/tils.db

This grabs the latest tils.db used to serve my TIL website.

openai-to-sqlite embeddings tils.db \
--sql 'select path, title, topic, body from til'

This retrieves and stores embeddings from the OpenAI API for every row in my til table - embedding the title, topic and body columns concatenated together, then keying them against the path column (the primary key for that table).
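
Conceptually, the command does something like the following - a rough Python sketch of the same steps, not the tool's actual implementation (the openai>=1.0 client, the per-row API calls and the space separator are my assumptions; the real tool may batch its requests):

import sqlite3
import struct

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
db = sqlite3.connect("tils.db")
db.execute(
    "create table if not exists embeddings (id text primary key, embedding blob)"
)

rows = db.execute("select path, title, topic, body from til").fetchall()
for path, title, topic, body in rows:
    # The first selected column is the key; the rest become the input text
    text = " ".join([title, topic, body])
    response = client.embeddings.create(
        model="text-embedding-ada-002", input=text
    )
    vector = response.data[0].embedding  # 1,536 floating point numbers
    # Pack the floats into a compact BLOB for the embeddings table
    db.execute(
        "insert or replace into embeddings (id, embedding) values (?, ?)",
        (path, struct.pack(f"{len(vector)}f", *vector)),
    )
db.commit()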

The command output this:

Fetching embeddings [####################################] 100%
Total tokens used: 402500

402,500 tokens at [$0.0001 / 1K tokens](https://openai.com/pricing) comes to $0.04 - 4 cents!
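
As a quick check on that arithmetic:

tokens = 402_500
price_per_1k_tokens = 0.0001  # dollars, from the pricing page linked above
print(f"${tokens / 1_000 * price_per_1k_tokens:.2f}")  # $0.04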

Now that I’ve embedded everything, I can search for the most similar articles to a particular article like this:

openai-to-sqlite similar tils.db observable-plot_wider-tooltip-areas.md

Here are the results for that search for articles similar to https://til.simonwillison.net/observable-plot/wider-tooltip-areas:

observable-plot_wider-tooltip-areas.md
0.860 observable-plot_histogram-with-tooltips.md
0.792 svg_dynamic-line-chart.md
0.791 javascript_copy-rich-text-to-clipboard.md
0.780 javascript_dropdown-menu-with-details-summary.md
0.772 vega_bar-chart-ordering.md
0.770 javascript_working-around-nodevalue-size-limit.md
0.769 presenting_stickies-for-workshop-links.md
0.768 observable_jq-in-observable.md
0.766 javascript_copy-button.md
0.765 django_django-admin-horizontal-scroll.md

Or the top five as links, converting each category_slug.md ID to its URL:

https://til.simonwillison.net/observable-plot/histogram-with-tooltips
https://til.simonwillison.net/svg/dynamic-line-chart
https://til.simonwillison.net/javascript/copy-rich-text-to-clipboard
https://til.simonwillison.net/javascript/dropdown-menu-with-details-summary
https://til.simonwillison.net/vega/bar-chart-ordering

These are pretty good matches!

Calculating and storing the similarities

In order to build the related articles feature on my site, I wanted to calculate and store the top ten most similar articles for each one.

The following command can do that:

time openai-to-sqlite similar tils.db --all --save

This runs against --all of the records in the embeddings table, and --save causes the results to be saved to the similarities table in the database.

The time command shows this took 27s! It has to run a LOT of cosine similarity calculations here - 446 * 446 = 198,916 of them - and each one compares two 1,536-dimension vectors.
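
Each of those is a cosine similarity calculation: the dot product of the two vectors divided by the product of their magnitudes. In pure Python that looks something like this:

import math

def cosine_similarity(a, b):
    # Dot product of the two vectors...
    dot_product = sum(x * y for x, y in zip(a, b))
    # ...divided by the product of their magnitudes
    magnitude_a = math.sqrt(sum(x * x for x in a))
    magnitude_b = math.sqrt(sum(x * x for x in b))
    return dot_product / (magnitude_a * magnitude_b)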

Running sqlite-utils schema tils.db shows me the schema of the newly added tables:

CREATE TABLE [embeddings] (
[id] TEXT PRIMARY KEY,
[embedding] BLOB
);
CREATE TABLE [similarities] (
[id] TEXT,
[other_id] TEXT,
[score] FLOAT,
PRIMARY KEY ([id], [other_id])
);
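
The embedding column is a binary BLOB. Assuming each vector is stored as packed 32-bit floats (an assumption on my part - check the tool's source for the exact format), you can decode one back into numbers like this:

import sqlite3
import struct

db = sqlite3.connect("tils.db")
blob = db.execute(
    "select embedding from embeddings where id = ?",
    ("observable-plot_wider-tooltip-areas.md",),
).fetchone()[0]
# Four bytes per float: 6,144 bytes decodes to a 1,536-dimension vector
vector = struct.unpack(f"{len(blob) // 4}f", blob)
print(len(vector))  # 1536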

Here’s what that similarities table looks like:

sqlite-utils rows tils.db similarities --limit 5 -t --fmt github
| id                        | other_id                                   | score    |
|---------------------------|--------------------------------------------|----------|
| svg_dynamic-line-chart.md | observable-plot_wider-tooltip-areas.md     | 0.792374 |
| svg_dynamic-line-chart.md | observable-plot_histogram-with-tooltips.md | 0.771501 |
| svg_dynamic-line-chart.md | overture-maps_overture-maps-parquet.md     | 0.762345 |
| svg_dynamic-line-chart.md | javascript_openseadragon.md                | 0.762247 |
| svg_dynamic-line-chart.md | python_json-floating-point.md              | 0.7589   |

That’s good enough to build the new feature!

Automating this with GitHub Actions

My tils.db database is built by this workflow.

I needed that workflow to embed all of the content, then run the similarity calculations and save them to the database.

The openai-to-sqlite embeddings command is smart enough not to run embeddings against content that has already been calculated, otherwise every time my GitHub Actions workflow runs I would be charged another 4 cents in OpenAI fees.
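
That skip logic amounts to comparing keys - something like this sketch, which only sends rows whose path is not already present in the embeddings table (my approximation of the tool's behavior, not its actual code):

import sqlite3

db = sqlite3.connect("tils.db")
existing_ids = {row[0] for row in db.execute("select id from embeddings")}
rows = db.execute("select path, title, topic, body from til").fetchall()
to_embed = [row for row in rows if row[0] not in existing_ids]
# Only rows in to_embed are sent to the API - unchanged articles cost nothing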

The catch is the openai-to-sqlite similar tils.db --all --save command: it takes 27s now, and will only get slower as my database continues to grow.

I added one more feature to openai-to-sqlite to help address this: the --recalculate-for-matches option.

This lets you do the following:

openai-to-sqlite similar tils.db \
svg_dynamic-line-chart.md \
python_json-floating-point.md \
--save --recalculate-for-matches

Here we are passing two specific IDs. The --recalculate-for-matches option means that the command will recalculate the similarity scores for those IDs, and then for every other row in the database that is a top-ten match for one of those IDs.

This should result in far fewer calculations than running against --all.
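
A hypothetical sketch of how that narrowing could work - gather the passed IDs plus their existing top-ten matches, and only recalculate scores for that set (the real implementation may differ in its details):

import sqlite3

db = sqlite3.connect("tils.db")
ids = {"svg_dynamic-line-chart.md", "python_json-floating-point.md"}
placeholders = ", ".join("?" for _ in ids)
# Every row that currently appears as a top-ten match for one of those IDs
matches = {
    row[0]
    for row in db.execute(
        f"select other_id from similarities where id in ({placeholders})",
        list(ids),
    )
}
to_recalculate = ids | matches
# Recomputing scores for just this set avoids the full all-pairs run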

One more problem: how do I run against just the most recently modified articles in my workflow?

I decided to solve that with a bit of git magic, courtesy of some ChatGPT questions:

git diff --name-only HEAD~10

This outputs the names of the files that have changed in the last 10 commits:

README.md
cosmopolitan/ecosystem.md
github/django-postgresql-codespaces.md
jq/combined-github-release-notes.md
python/pyproject.md

I only care about the ones that are something/something.md - I can filter those using grep:

git diff --name-only HEAD~10 | grep '/.*\.md$'

Finally, my IDs are of the format category_title.md - so I can use sed to convert the filenames into IDs:

git diff --name-only HEAD~10 HEAD | grep '/.*\.md$' | sed 's/\//_/g'

Which outputs:

cosmopolitan_ecosystem.md
github_django-postgresql-codespaces.md
jq_combined-github-release-notes.md
python_pyproject.md
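
The same transformation expressed in Python, for clarity:

filenames = [
    "README.md",
    "cosmopolitan/ecosystem.md",
    "github/django-postgresql-codespaces.md",
]
ids = [
    name.replace("/", "_")
    for name in filenames
    if "/" in name and name.endswith(".md")  # mirrors the grep filter
]
print(ids)  # ['cosmopolitan_ecosystem.md', 'github_django-postgresql-codespaces.md']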

I can pass that to my openai-to-sqlite similar --save command like this:

openai-to-sqlite similar tils.db \
$(git diff --name-only HEAD~10 HEAD | grep '/.*\.md$' | sed 's/\//_/g') \
--save --recalculate-for-matches --print

The --print there causes the output to be shown too, for debugging purposes.

That’s everything I need. Time to add it to the workflow.

The GitHub Actions workflow

I needed to set my OPENAI_API_KEY as a repository secret in simonw/til.

Here’s the code I added to the workflow:

- name: Calculate embeddings and document similarity
  env:
    OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
  run: |-
    # Fetch embeddings for documents that need them
    openai-to-sqlite embeddings main/tils.db \
      --sql 'select path, title, topic, body from til'
    # Now calculate and save similarities
    if sqlite-utils rows main/tils.db similarities --limit 1; then
      # Table exists already, so only calculate new similarities
      openai-to-sqlite similar main/tils.db \
        $(git diff --name-only HEAD~10 HEAD | grep '/.*\.md$' | sed 's/\//_/g') \
        --save --recalculate-for-matches --print
    else
      # Table does not exist, calculate for everything
      openai-to-sqlite similar main/tils.db --all --save
    fi

A neat trick here is that it checks whether the similarities table exists yet by running the sqlite-utils rows main/tils.db similarities --limit 1 command and checking the exit code, which will be a failure if the table does not exist.

This workflow ran… and created the new tables in my database.

Hooking those into the templates

I figured out the SQL query for returning the top related items for a story:

select
  til.topic, til.slug, til.title, til.created
from til
  join similarities on til.path = similarities.other_id
where similarities.id = 'python_pyproject.md'
order by similarities.score desc limit 10

Then I updated the existing async def related_tils(til) Python function in my code to use that:

async def related_tils(til):
    path = til["path"]
    # Top ten related TILs, ranked by their stored similarity score
    sql = """
    select
      til.topic, til.slug, til.title, til.created
    from til
      join similarities on til.path = similarities.other_id
    where similarities.id = :path
    order by similarities.score desc limit 10
    """
    result = await datasette.get_database().execute(
        sql,
        {"path": path},
    )
    return result.rows

… and it worked! All of my TILs now feature related articles powered by OpenAI embeddings.

Here’s my issue for this - though most of the notes are already in this TIL.

Here’s a SQL query I figured out to show me which pairs of articles have the highest relatedness score out of everything on my site:

with top_similarities as (
  select id, other_id, score
  from similarities
  where id < other_id
),
til_details as (
  select path, title, 'https://til.simonwillison.net/' || topic || '/' || slug as url
  from til
)
select
  t1.title, t1.url, t2.title, t2.url, score
from
  til_details t1
  join top_similarities on id = t1.path
  join til_details t2 on other_id = t2.path
order by score desc limit 100

The neatest trick here is the where id < other_id clause - I added that because without it the same pairings were showing up twice with identical scores, once for A to B and once for B to A.

(ChatGPT/GPT-4 suggested that fix to me.)
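
A toy illustration of what that filter does:

# Without the id < other_id filter, both orderings of each pair survive
pairs = [("a.md", "b.md", 0.9), ("b.md", "a.md", 0.9)]
deduped = [(a, b, score) for a, b, score in pairs if a < b]
print(deduped)  # [('a.md', 'b.md', 0.9)] - one canonical row per pairing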

A variant of that query that concatenates the output together as Markdown produced the list of top results.
