632 words
3 minutes
Using sqlite-vec with embeddings in sqlite-utils and Datasette

Using sqlite-vec with embeddings in sqlite-utils and Datasette#

Alex Garcia’s sqlite-vec SQLite extension provides a bunch of useful functions for working with vectors inside SQLite.

My LLM tool has features for storing text embeddings in SQLite.

It turns out sqlite-vec can work directly with the binary format that LLM uses to store embeddings (described here). That same format is also used by my slightly older openai-to-sqlite tool.

Installing sqlite-vec#

A neat thing about sqlite-vec (and many of Alex’s other extensions) is that it’s packaged as both a raw SQLite extension and as packages for various different platforms.

Alex makes Python plugins available for both Datasette and sqlite-utils which bundle that extension and register it with those tools such that the functions become available to SQL queries.

For Datasette:

Terminal window
datasette install datasette-sqlite-vec

For sqlite-utils:

Terminal window
sqlite-utils install sqlite-utils-sqlite-vec

Both of these commands will make the various sqlite-vec functions available within those tools. Test it like this:

Terminal window
sqlite-utils memory 'select vec_version()'

For me that outputs:

[{"vec_version()": "v0.1.1"}]

For Datasette you can run that query using the Datasette web interface, or from the command-line like this:

Terminal window
datasette --get '/_memory.json?sql=select+vec_version()'

Or in Datasette 1.0a14 or higher:

Terminal window
datasette --get '/_memory/-/query.json?sql=select+vec_version()'

Returning:

{"ok": true, "rows": [{"vec_version()": "v0.1.1"}], "truncated": false}

Some example queries#

My TIL website has an embeddings table that stores embeddings for each of the TILs. It has two columns: id is the text ID for the TIL, and embedding is the binary LLM embedding for that text.

Here’s how to use the sqlite-vec vec_distance_cosine() function to find similar documents based on their embeddings:

with document_embedding as (
select embedding as first_embedding from embeddings where id = :id
)
select
id,
vec_distance_cosine(embedding, first_embedding) as distance
from
embeddings, document_embedding
order by distance limit 10

This accepts the id of a TIL and returns the 10 most similar TILs based on their embeddings. Try it out here.

Here’s a more fun query that also explores the vec_to_json() function - which turns that binary format into a readable JSON array of floats - the vec_slice() function for returning a shorter slice of that array and the vec_quantize_binary() function for quantizing a vector to binary - returning a 1 for values >0 and a -1 for <0.

with document_embedding as (
select embedding as first_embedding from embeddings where id = :id
)
select
id,
vec_distance_cosine(embedding, first_embedding) as distance,
vec_to_json(vec_slice(embedding, 0, 3)) as first_3,
vec_to_json(vec_quantize_binary(vec_slice(embedding, 0, 8))) as binary_8
from
embeddings, document_embedding
order by distance limit 5

Run that here. I get back these results:

iddistancefirst_3binary_8
observable-plot_histogram-with-tooltips.md0.0[-0.016882,-0.000301,0.009767][0,0,1,0,0,1,1,0]
observable-plot_wider-tooltip-areas.md0.14028826355934143[-0.000047,-0.005976,-0.007012][0,0,0,0,0,1,1,0]
vega_bar-chart-ordering.md0.22134298086166382[-0.004891,-0.006509,-0.005039][0,0,0,0,0,1,0,0]
svg_dynamic-line-chart.md0.2285003513097763[0.001713,-0.004975,0.010736][1,0,1,0,0,1,0,0]
javascript_copy-rich-text-to-clipboard.md0.2285047024488449[-0.022232,0.008316,-0.000267][0,1,0,0,0,1,0,0]

Creating an index#

sqlite-vec also includes the ability to create an index for a collection of vectors.

Here’s how I created an index for my TILs. First, I created a virtual table using the vec0 mechanism provided by sqlite-vec - I told it to store an embedding column that was an array of 1536 floats (the size of the OpenAI embeddings I’ve been using for my TILs):

create virtual table vec_tils using vec0(
embedding float[1536]
);

Then I populated it like this:

insert into vec_tils(rowid, embedding)
select rowid, embedding from embeddings;

vec0 tables require an integer ID, so I used the rowid of the embeddings table. If I had my own numeric ID on that table I would use that instead.

Now I can run queries against this index like so:

with document_embedding as (
select embedding as first_embedding from embeddings where id = :id
)
select
(select id from embeddings where embeddings.rowid = vec_tils.rowid) as id,
distance
from vec_tils, document_embedding
where embedding match first_embedding
and k = 5
order by distance;

Try it here. I get back:

iddistance
observable-plot_histogram-with-tooltips.md0.0
observable-plot_wider-tooltip-areas.md0.5296944379806519
vega_bar-chart-ordering.md0.6653465032577515
svg_dynamic-line-chart.md0.6760170459747314
javascript_copy-rich-text-to-clipboard.md0.6760244369506836

The where embedding match first_embedding and k = 5 clause hooks into the magic of the underlying virtual table to run an efficient k-nearest-neighbors query against the index.

I’m using that (select id from embeddings where embeddings.rowid = vec_tils.rowid) as id trick to convert the numeric rowid into a human-readable id value by running a subquery against the embeddings table.

The vec_tils table is created for my TIL site by this step in my GitHub Actions workflow that deploys the application.

Using sqlite-vec with embeddings in sqlite-utils and Datasette
https://mranv.pages.dev/posts/using-sqlite-vec-with-embeddings-in-sqlite-utils-and-datasette/
Author
Anubhav Gain
Published at
2024-02-12
License
CC BY-NC-SA 4.0