Reviewing your history of public GitHub repositories using ClickHouse
There’s a story going around at the moment that people have found code from their private GitHub repositories in the AI training data known as The Stack, using this search tool: https://huggingface.co/spaces/bigcode/in-the-stack
I’m very doubtful that private data has been included in that training set. I think it’s far more likely that the repositories in question were public at some point in the time, and were gathered up by the https://www.softwareheritage.org/ project when they archived code from GitHub.
But how can we tell if a private repository was public at some point in the past?
GitHub have a security audit log for logged in users, but sadly it appears to only cover the past six months.
For a longer history, we can look things up in the GitHub Archive project, which has been recording public events from the GitHub API since 2011.
TLDR: I built a tool for this here: https://observablehq.com/@simonw/github-public-repo-history
The ClickHouse team provide a public tool for querying that data using SQL as a demo of their software. We can use that to try and find out if a repository was public at some point in the past.
Access the tool here - no login required: https://play.clickhouse.com/play
Now execute the following SQL, replacing my username with yours in both places where it occurs:
with public_events as ( select created_at as timestamp, 'Private repo made public' as action, repo_name from github_events where lower(actor_login) = 'simonw' and event_type in ('PublicEvent')),most_recent_public_push as ( select max(created_at) as timestamp, 'Most recent public push' as action, repo_name from github_events where event_type = 'PushEvent' and lower(actor_login) = 'simonw' group by repo_name),combined as ( select * from public_events union all select * from most_recent_public_push)select * from combined order by timestamp
The result is a combined timeline showing two things:
PublicEvent
events - which GitHub describes as “When a private repository is made public. Without a doubt: the best GitHub event.”- The most recent
PushEvent
for each repository. Repositories which started life public won’t show up in thePublicEvent
list, so this aims to capture them.
Here’s an extract from the data I get back when I run the query for myself:
A UI for that query using Observable
I put together an Observable Notebook that provides a UI for executing this query: https://observablehq.com/@simonw/github-public-repo-history
It uses just three cells of JavaScript. The first provides a username input, with a submit button to avoid firing off SQL queries while the user is still typing their name:
viewof username = Inputs.text({ placeholder: "Your GitHub username", submit: true})
The second executes the query using the ClickHouse JSON API, described previously:
results = username.trim() && ( await fetch("https://play.clickhouse.com/?user=play", { method: "POST", body: `with public_events as ( select created_at as timestamp, 'Private repo made public' as action, repo_name from github_events where lower(actor_login) = '${username.trim().toLowerCase()}' and event_type in ('PublicEvent')),most_recent_public_push as ( select max(created_at) as timestamp, 'Most recent public push' as action, repo_name from github_events where event_type = 'PushEvent' and lower(actor_login) = '${username.trim()}.toLowerCase()' group by repo_name),combined as ( select * from public_events union all select * from most_recent_public_push)select * from combined order by timestamp FORMAT JSON` }) ).json()
The third conditionally shows a table of results if the data has been fetched:
table = { if (results && results.data) { return Inputs.table(results.data); } else { return null; }}
Here’s what it looks like running on Observable: