Crawling Datasette with Datasette
I wanted to add the new tutorials on https://datasette.io/tutorials to the search index that is used by the https://datasette.io/-/beta search engine.
To do this, I needed the content of those tutorials in a SQLite database table. But the tutorials are implemented as static pages in templates/pages/tutorials - so I needed to crawl that content and insert it into a table.
I ended up using a combination of the datasette.client mechanism (documented here), Beautiful Soup and sqlite-utils - all wrapped up in a Python script that’s now called as part of the GitHub Actions build process for the site.
I’m also using configuration directory mode, so the script picks up the site’s custom templates, plugins and metadata automatically.
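Before the full script, here’s a minimal sketch of the datasette.client mechanism on its own. It uses a throwaway in-memory instance and a built-in JSON endpoint rather than the datasette.io configuration, just to show the shape of the API: ds.client speaks HTTP to the ASGI application in-process, with no server or network involved.

```python
import asyncio

from datasette.app import Datasette


async def demo():
    # A bare in-memory instance - the real script passes config_dir=
    # so that the site's templates and plugins are loaded too
    ds = Datasette(memory=True)
    # The request is routed straight through the ASGI app and
    # returns an httpx-style response object
    response = await ds.client.get("/-/versions.json")
    print(response.status_code, response.json())


asyncio.run(demo())
```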
Here’s the annotated script:
```python
import asyncio
from bs4 import BeautifulSoup as Soup
from datasette.app import Datasette
import pathlib
import sqlite_utils


# This is an async def function because it needs to call await ds.client
async def main():
    db = sqlite_utils.Database("content.db")
    # We need to simulate the full https://datasette.io/ site - including all
    # of its custom templates and plugins. On the command-line we would do this
    # by running "datasette ." - using configuration directory mode. This is
    # the equivalent of that when constructing the Datasette object directly:
    ds = Datasette(config_dir=pathlib.Path("."))
    # Equivalent of fetching the HTML from https://datasette.io/tutorials
    index_response = await ds.client.get("/tutorials")
    index_soup = Soup(index_response.text, "html5lib")
    # We want to crawl the links inside <div class="content"><ul>...<a href="">
    tutorial_links = index_soup.select(".content ul a")
    for link in tutorial_links:
        # For each one fetch the HTML, e.g. from /tutorials/learn-sql
        tutorial_response = await ds.client.get(link["href"])
        # The script should fail loudly if it encounters a broken link
        assert tutorial_response.status_code == 200
        # Now we can parse the page and extract the <h1> and <div class="content">
        soup = Soup(tutorial_response.text, "html5lib")
        # Beautiful Soup makes extracting text easy:
        title = soup.select("h1")[0].text
        body = soup.select(".content")[0].text
        # Insert this into the "tutorials" table, creating it if it does not exist
        db["tutorials"].insert(
            {
                "path": link["href"],
                "title": title,
                "body": body.strip(),
            },
            # Treat path, e.g. /tutorials/learn-sql, as the primary key
            pk="path",
            # This will over-write any existing records with the same path
            replace=True,
        )


if __name__ == "__main__":
    # This idiom executes the async function in an event loop:
    asyncio.run(main())
```
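One detail worth noting: combining pk="path" with replace=True makes the insert idempotent, so re-running the crawl overwrites existing rows instead of raising an IntegrityError on the primary key. A tiny standalone demonstration of that behavior (the table name and paths here are just for illustration):

```python
import sqlite_utils

# An in-memory database, separate from the crawl script
db = sqlite_utils.Database(memory=True)
db["tutorials"].insert(
    {"path": "/tutorials/learn-sql", "title": "Old title"},
    pk="path",
    replace=True,
)
# Inserting the same primary key again replaces the row
db["tutorials"].insert(
    {"path": "/tutorials/learn-sql", "title": "New title"},
    pk="path",
    replace=True,
)
print(list(db["tutorials"].rows))
# [{'path': '/tutorials/learn-sql', 'title': 'New title'}]
```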
The tutorials table is then added to the search index by this Dogsheep Beta search configuration fragment:
```yaml
content.db:
  tutorials:
    sql: |-
      select
        path as key,
        title,
        body as search_1,
        1 as is_public
      from tutorials
    display_sql: |-
      select highlight(
        body, :q
      ) as snippet
      from tutorials
      where tutorials.path = :key
    display: |-
      <h3>Tutorial: <a href="{{ key }}">{{ title }}</a></h3>
      <p>{{ display.snippet|safe }}</p>
```
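The sql: block shapes each row into the columns Dogsheep Beta expects - a unique key, searchable text in search_1 and an is_public flag - while display_sql and display control how a hit is rendered. If you want to preview what will end up in the index, the same query can be run directly against content.db with sqlite-utils (a quick sanity check, not part of the site’s build):

```python
import sqlite_utils

db = sqlite_utils.Database("content.db")
# The exact query from the sql: block above
rows = db.query(
    """
    select path as key, title, body as search_1, 1 as is_public
    from tutorials
    """
)
for row in rows:
    print(row["key"], "-", row["title"])
```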
See Building a search engine for datasette.io for more details on exactly how this works.