Scraping Reddit and writing data to the Datasette write API
Today I built a system for monitoring Reddit for new posts that link to various domains that I own.
I modelled it on my existing scrape-hacker-news-by-domain project, which I described in some detail in this blog post.
The goal of the Reddit scraper was to do the following:
- Frequently check Reddit for new links to any page on simonwillison.net or datasette.io, using the unofficial search API - the Reddit search page with .json appended to the URL
- Record the results to a GitHub repository
- Submit any new results to my personal Datasette Cloud instance via the Write API
- Use datasette-atom to provide a feed of new posts I can subscribe to in NetNewsWire
The scraper
Not much to say about the scraper. It looks like this:
```bash
curl -H "$USER_AGENT" \
  'https://www.reddit.com/search/.json?q=site%3Asimonwillison.net&sort=new' \
  | jq > simonwillison-net.json
```
Where USER_AGENT is set to this, because Reddit doesn't like curl hitting it with the default user agent:
```
User-agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/111.0.0.0 Safari/537.36
```
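In the shell that can be set like this - a sketch assuming the same variable name; the quotes matter, since the header value contains spaces:

```bash
# Store the full -H header value in one variable, then pass it quoted
export USER_AGENT='User-agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/111.0.0.0 Safari/537.36'
```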
The HTML version of this page is https://www.reddit.com/search/?q=site%3Asimonwillison.net&sort=new
It then repeats the same thing for datasette.io and commits the results. The code lives in this scheduled GitHub Actions file: https://github.com/simonw/scrape-reddit-by-domain/blob/main/.github/workflows/scrape.yml
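That linked file is the canonical version; here's a simplified sketch of its shape (the step names, output filenames and commit pattern below are illustrative, not copied from the real workflow):

```yaml
name: Scrape Reddit
on:
  schedule:
    - cron: '0 * * * *'  # run once an hour
  workflow_dispatch:

jobs:
  scrape:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - name: Fetch Reddit search results
        env:
          USER_AGENT: 'User-agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/111.0.0.0 Safari/537.36'
        run: |
          curl -H "$USER_AGENT" \
            'https://www.reddit.com/search/.json?q=site%3Asimonwillison.net&sort=new' \
            | jq > simonwillison-net.json
          curl -H "$USER_AGENT" \
            'https://www.reddit.com/search/.json?q=site%3Adatasette.io&sort=new' \
            | jq > datasette-io.json
      - name: Commit and push if anything changed
        run: |
          git config user.name "Automated"
          git config user.email "actions@users.noreply.github.com"
          git add -A
          # Only commit when the scraped JSON actually changed
          git diff --staged --quiet || git commit -m "Latest data"
          git push
```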
Creating the table
I created myself a Datasette signed API token with full permissions that would expire after five minutes, using the /-/create-token interface.
I set that as an environment variable like so:
```bash
export DS_TOKEN='dstok_...'
```
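As a quick sanity check (not part of the original workflow), Datasette's /-/actor.json endpoint reports which actor a token authenticates as:

```bash
# Should return JSON describing the token's actor rather than {"actor": null}
curl -s -H "Authorization: Bearer $DS_TOKEN" \
  https://simon.datasette.cloud/-/actor.json
```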
Then I ran this to create the table and populate it with some initial data:
```bash
cat simonwillison-net.json | jq '{
  table: "reddit_posts",
  rows: [.data.children[].data | {
    id,
    subreddit,
    title,
    url,
    ups,
    num_comments,
    created_utc: (.created_utc | todateiso8601),
    subreddit_subscribers,
    domain,
    permalink: ("https://www.reddit.com" + .permalink)
  }],
  pk: "id"
}' | \
  curl -X POST -H "Content-Type: application/json" \
    -H "Authorization: Bearer $DS_TOKEN" \
    -d @- https://simon.datasette.cloud/data/-/create
```
This uses jq to extract the bits of the JSON I care about and reformat them into a smaller set of columns. It constructs a JSON document in the format expected by the /-/create API, documented here.
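The resulting document looks something like this (the row values here are invented for illustration):

```json
{
  "table": "reddit_posts",
  "rows": [
    {
      "id": "11vyaf3",
      "subreddit": "programming",
      "title": "An example post title",
      "url": "https://simonwillison.net/2023/Mar/19/example/",
      "ups": 25,
      "num_comments": 4,
      "created_utc": "2023-03-19T20:56:19Z",
      "subreddit_subscribers": 5600000,
      "domain": "simonwillison.net",
      "permalink": "https://www.reddit.com/r/programming/comments/11vyaf3/an_example_post_title/"
    }
  ],
  "pk": "id"
}
```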
(I used ChatGPT to figure out that jq recipe, in particular the | todateiso8601 part.)
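That filter turns Reddit's numeric created_utc timestamps into ISO 8601 strings:

```bash
echo '{"created_utc": 1679259379}' | jq '.created_utc | todateiso8601'
# "2023-03-19T20:56:19Z"
```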
It pipes the resulting JSON to curl and makes an authenticated POST request to the /-/create API. This created the reddit_posts table and populated it with the initial data.
Submitting new rows from GitHub Actions
With the table created, I could now create a long-lived API token that would allow me to write to the table.
I used the /-/create-token interface again, but this time I created a token that would never expire and that only had write and alter permissions against the new reddit_posts table I had just created.
I wrote the following script to submit all rows from all JSON files in the current directory:
```bash
#!/bin/bash
for file in *.json
do
  cat "$file" | jq '{
    rows: [.data.children[].data | {
      id,
      subreddit,
      title,
      url,
      ups,
      num_comments,
      created_utc: (.created_utc | todateiso8601),
      subreddit_subscribers,
      domain,
      permalink: ("https://www.reddit.com" + .permalink)
    }],
    replace: true
  }' | \
    curl -X POST -H "Content-Type: application/json" \
      -H "Authorization: Bearer $DS_TOKEN" \
      -d @- https://simon.datasette.cloud/data/reddit_posts/-/insert
done
```
This uses the /-/insert API instead, which is a little different from the /-/create API: it doesn't take a table argument, instead expecting the table name to be part of the URL.
I also send replace: true - this ensures that any existing rows with a primary key matching the incoming data will be replaced by the updated version.
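A minimal call to that endpoint looks like this - a sketch with a made-up row (missing columns are simply left null):

```bash
# Hypothetical single-row insert; "id" is the primary key, so replace: true
# will overwrite any existing row with the same id
curl -X POST https://simon.datasette.cloud/data/reddit_posts/-/insert \
  -H "Authorization: Bearer $DS_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"rows": [{"id": "11vyaf3", "title": "An example post title"}], "replace": true}'
```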
I ran chmod 755 submit-to-datasette-cloud.sh and added it to the GitHub repository.
Running it in GitHub Actions
Having set the DS_TOKEN secret for my repository, I added the following to the scrape.yml file:
```yaml
- name: Submit latest to Datasette Cloud
  env:
    DS_TOKEN: ${{ secrets.DS_TOKEN }}
  run: |-
    ./submit-to-datasette-cloud.sh
```
Now every time the workflow runs (once an hour), any new records will be submitted up to Datasette Cloud.
Configuring the Atom feed
I used the datasette-public plugin to make the reddit_posts table public. You can see that here:
https://simon.datasette.cloud/data/reddit_posts
Then I used the datasette-write plugin to create a SQL view defined like this:
```sql
create view reddit_posts_atom as
select
  'reddit:' || id as atom_id,
  title as atom_title,
  permalink as atom_link,
  created_utc as atom_updated,
  '<a href="' || url || '">' || url || '</a><br><br>' ||
    'Subreddit: ' || subreddit || '<br>' ||
    'Comments: ' || num_comments || '<br>' ||
    'Upvotes: ' || ups || '<br>' ||
    'Subscribers: ' || subreddit_subscribers as atom_content_html
from reddit_posts
order by created_utc desc
limit 100;
```
This follows the conventions required by the datasette-atom plugin: the query must return atom_id, atom_title and atom_updated columns, and can optionally include atom_link and atom_content_html for the link and HTML body of each entry.
I made that view public too, and now you can visit it here:
https://simon.datasette.cloud/data/reddit_posts_atom
Or subscribe to the Atom feed here:
https://simon.datasette.cloud/data/reddit_posts_atom.atom