Scraping Reddit and writing data to the Datasette write API
Today I built a system for monitoring Reddit for new posts that link to various domains that I own.
I modelled it on my existing scrape-hacker-news-by-domain project, which I described in some detail in this blog post.
The goal of the Reddit scraper was to do the following:
- Frequently check Reddit for new links to any page on simonwillison.net or datasette.io, using the unofficial search API - the Reddit search page with .json appended to the URL
- Record the results to a GitHub repository
- Submit any new results to my personal Datasette Cloud instance via the Write API
- Use datasette-atom to provide a feed of new posts I can subscribe to in NetNewsWire
The scraper
Not much to say about the scraper. It looks like this:
```bash
curl -H "$USER_AGENT" \
  'https://www.reddit.com/search/.json?q=site%3Asimonwillison.net&sort=new' \
  | jq > simonwillison-net.json
```
Where USER_AGENT is set to this, because Reddit doesn't like curl hitting it with the default user agent:
```
User-agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/111.0.0.0 Safari/537.36
```
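In the shell that can be set like this - a sketch assuming the same variable name; the quotes matter, since the header value contains spaces:

```bash
# Store the full -H header value in one variable, then pass it quoted
export USER_AGENT='User-agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/111.0.0.0 Safari/537.36'
```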
The HTML version of this page is https://www.reddit.com/search/?q=site%3Asimonwillison.net&sort=new
It then repeats the same thing for datasette.io and commits the results. The code lives in this scheduled GitHub Actions file: https://github.com/simonw/scrape-reddit-by-domain/blob/main/.github/workflows/scrape.yml
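That linked file is the canonical version; here's a simplified sketch of its shape (the step names, output filenames and commit pattern below are illustrative, not copied from the real workflow):

```yaml
name: Scrape Reddit
on:
  schedule:
    - cron: '0 * * * *'  # run once an hour
  workflow_dispatch:

jobs:
  scrape:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - name: Fetch Reddit search results
        env:
          USER_AGENT: 'User-agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/111.0.0.0 Safari/537.36'
        run: |
          curl -H "$USER_AGENT" \
            'https://www.reddit.com/search/.json?q=site%3Asimonwillison.net&sort=new' \
            | jq > simonwillison-net.json
          curl -H "$USER_AGENT" \
            'https://www.reddit.com/search/.json?q=site%3Adatasette.io&sort=new' \
            | jq > datasette-io.json
      - name: Commit and push if anything changed
        run: |
          git config user.name "Automated"
          git config user.email "actions@users.noreply.github.com"
          git add -A
          # Only commit when the scraped JSON actually changed
          git diff --staged --quiet || git commit -m "Latest data"
          git push
```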
Creating the table
I created myself a Datasette signed API token with full permissions that would expire after five minutes, using the /-/create-token interface.
I set that as an environment variable like so:
```bash
export DS_TOKEN='dstok_...'
```
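As a quick sanity check (not part of the original workflow), Datasette's /-/actor.json endpoint reports which actor a token authenticates as:

```bash
# Should return JSON describing the token's actor rather than {"actor": null}
curl -s -H "Authorization: Bearer $DS_TOKEN" \
  https://simon.datasette.cloud/-/actor.json
```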
Then I ran this to create the table and populate it with some initial data:
```bash
cat simonwillison-net.json | jq '{
  table: "reddit_posts",
  rows: [.data.children[].data | {
    id,
    subreddit,
    title,
    url,
    ups,
    num_comments,
    created_utc: (.created_utc | todateiso8601),
    subreddit_subscribers,
    domain,
    permalink: ("https://www.reddit.com" + .permalink)
  }],
  pk: "id"
}' | \
  curl -X POST -H "Content-Type: application/json" \
    -H "Authorization: Bearer $DS_TOKEN" \
    -d @- https://simon.datasette.cloud/data/-/create
```
This uses jq to extract the bits of the JSON I care about and reformat them into a smaller set of columns. It constructs a JSON document in the format expected by the /-/create API, documented here.
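The resulting document looks something like this (the row values here are invented for illustration):

```json
{
  "table": "reddit_posts",
  "rows": [
    {
      "id": "11vyaf3",
      "subreddit": "programming",
      "title": "An example post title",
      "url": "https://simonwillison.net/2023/Mar/19/example/",
      "ups": 25,
      "num_comments": 4,
      "created_utc": "2023-03-19T20:56:19Z",
      "subreddit_subscribers": 5600000,
      "domain": "simonwillison.net",
      "permalink": "https://www.reddit.com/r/programming/comments/11vyaf3/an_example_post_title/"
    }
  ],
  "pk": "id"
}
```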
(I used ChatGPT to figure out that jq recipe, in particular the | todateiso8601 part.)
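That filter turns Reddit's numeric created_utc timestamps into ISO 8601 strings:

```bash
echo '{"created_utc": 1679259379}' | jq '.created_utc | todateiso8601'
# "2023-03-19T20:56:19Z"
```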
It pipes the resulting JSON to curl and makes an authenticated POST request to the /-/create API. This created the reddit_posts table and populated it with the initial data.
Submitting new rows from GitHub Actions
With the table created, I could now create a long-lived API token that would allow me to write to the table.
I used the /-/create-token interface again, but this time I created a token that would never expire and that only had write and alter permissions against the new reddit_posts table I had just created.
I wrote the following script to submit all rows from all JSON files in the current directory:
```bash
#!/bin/bash
for file in *.json
do
  cat "$file" | jq '{
    rows: [.data.children[].data | {
      id,
      subreddit,
      title,
      url,
      ups,
      num_comments,
      created_utc: (.created_utc | todateiso8601),
      subreddit_subscribers,
      domain,
      permalink: ("https://www.reddit.com" + .permalink)
    }],
    replace: true
  }' | \
    curl -X POST -H "Content-Type: application/json" \
      -H "Authorization: Bearer $DS_TOKEN" \
      -d @- https://simon.datasette.cloud/data/reddit_posts/-/insert
done
```
This uses the /-/insert API instead, which is a little different from the /-/create API: it doesn't take a table argument, instead expecting the table name to be part of the URL.
I also send replace: true - this ensures that any existing rows with a primary key matching the incoming data will be replaced by the updated version.
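A minimal call to that endpoint looks like this - a sketch with a made-up row (missing columns are simply left null):

```bash
# Hypothetical single-row insert; "id" is the primary key, so replace: true
# will overwrite any existing row with the same id
curl -X POST https://simon.datasette.cloud/data/reddit_posts/-/insert \
  -H "Authorization: Bearer $DS_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"rows": [{"id": "11vyaf3", "title": "An example post title"}], "replace": true}'
```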
I ran chmod 755 submit-to-datasette-cloud.sh and added it to the GitHub repository.
Running it in GitHub Actions
Having set the DS_TOKEN secret for my repository, I added the following to the scrape.yml file:
```yaml
- name: Submit latest to Datasette Cloud
  env:
    DS_TOKEN: ${{ secrets.DS_TOKEN }}
  run: |-
    ./submit-to-datasette-cloud.sh
```
Now every time the workflow runs (once an hour), any new records will be submitted up to Datasette Cloud.
Configuring the Atom feed
I used the datasette-public plugin to make the reddit_posts table public. You can see that here:
https://simon.datasette.cloud/data/reddit_posts
Then I used the datasette-write plugin to create a SQL view defined like this:
```sql
create view reddit_posts_atom as
select
  'reddit:' || id as atom_id,
  title as atom_title,
  permalink as atom_link,
  created_utc as atom_updated,
  '<a href="' || url || '">' || url || '</a><br><br>' ||
    'Subreddit: ' || subreddit || '<br>' ||
    'Comments: ' || num_comments || '<br>' ||
    'Upvotes: ' || ups || '<br>' ||
    'Subscribers: ' || subreddit_subscribers as atom_content_html
from reddit_posts
order by created_utc desc
limit 100;
```
This follows the conventions required by the datasette-atom plugin: the query must return atom_id, atom_title and atom_updated columns, and can optionally include atom_link and atom_content_html for the link and HTML body of each entry.
I made that view public too, and now you can visit it here:
https://simon.datasette.cloud/data/reddit_posts_atom
Or subscribe to the Atom feed here:
https://simon.datasette.cloud/data/reddit_posts_atom.atom