627 words
3 minutes
Reviewing your history of public GitHub repositories using ClickHouse

Reviewing your history of public GitHub repositories using ClickHouse#

There’s a story going around at the moment that people have found code from their private GitHub repositories in the AI training data known as The Stack, using this search tool: https://huggingface.co/spaces/bigcode/in-the-stack

I’m very doubtful that private data has been included in that training set. I think it’s far more likely that the repositories in question were public at some point in the time, and were gathered up by the https://www.softwareheritage.org/ project when they archived code from GitHub.

But how can we tell if a private repository was public at some point in the past?

GitHub have a security audit log for logged in users, but sadly it appears to only cover the past six months.

For a longer history, we can look things up in the GitHub Archive project, which has been recording public events from the GitHub API since 2011.

TLDR: I built a tool for this here: https://observablehq.com/@simonw/github-public-repo-history

The ClickHouse team provide a public tool for querying that data using SQL as a demo of their software. We can use that to try and find out if a repository was public at some point in the past.

Access the tool here - no login required: https://play.clickhouse.com/play

Now execute the following SQL, replacing my username with yours in both places where it occurs:

with public_events as (
select
created_at as timestamp,
'Private repo made public' as action,
repo_name
from github_events
where lower(actor_login) = 'simonw'
and event_type in ('PublicEvent')
),
most_recent_public_push as (
select
max(created_at) as timestamp,
'Most recent public push' as action,
repo_name
from github_events
where event_type = 'PushEvent'
and lower(actor_login) = 'simonw'
group by repo_name
),
combined as (
select * from public_events
union all select * from most_recent_public_push
)
select * from combined order by timestamp

The result is a combined timeline showing two things:

  • PublicEvent events - which GitHub describes as “When a private repository is made public. Without a doubt: the best GitHub event.”
  • The most recent PushEvent for each repository. Repositories which started life public won’t show up in the PublicEvent list, so this aims to capture them.

Here’s an extract from the data I get back when I run the query for myself:

2017-09-10: Most recent public push, simonw/github-large-file-test - 2017-09-12: Most recent public push, simonw/Houston-Shelters - 2017-09-26: Private repo made public, simonw/squirrelspotter - 2017-10-01: Private repo made public, simonw/simonwillisonblog - 2017-10-12: Most recent public push, simonw/ratelimitcache - 2017-10-15: Most recent public push, simonw/irma-scraped-data - 2017-10-15: Most recent public push, simonw/fema-history - 2017-11-06: Most recent public push, simonw/factory_worker_python - 2017-11-13: Private repo made public, simonw/datasette

A UI for that query using Observable#

I put together an Observable Notebook that provides a UI for executing this query: https://observablehq.com/@simonw/github-public-repo-history

It uses just three cells of JavaScript. The first provides a username input, with a submit button to avoid firing off SQL queries while the user is still typing their name:

viewof username = Inputs.text({
placeholder: "Your GitHub username",
submit: true
})

The second executes the query using the ClickHouse JSON API, described previously:

results = username.trim() &&
(
await fetch("https://play.clickhouse.com/?user=play", {
method: "POST",
body: `with public_events as (
select
created_at as timestamp,
'Private repo made public' as action,
repo_name
from github_events
where lower(actor_login) = '${username.trim().toLowerCase()}'
and event_type in ('PublicEvent')
),
most_recent_public_push as (
select
max(created_at) as timestamp,
'Most recent public push' as action,
repo_name
from github_events
where event_type = 'PushEvent'
and lower(actor_login) = '${username.trim()}.toLowerCase()'
group by repo_name
),
combined as (
select * from public_events
union all select * from most_recent_public_push
)
select * from combined order by timestamp FORMAT JSON`
})
).json()

The third conditionally shows a table of results if the data has been fetched:

table = {
if (results && results.data) {
return Inputs.table(results.data);
} else {
return null;
}
}

Here’s what it looks like running on Observable:

A notebook executing the code shown here, displaying a table of results for username simonw.

Reviewing your history of public GitHub repositories using ClickHouse
https://mranv.pages.dev/posts/reviewing-your-history-of-public-github-repositories-using-clickhouse/
Author
Anubhav Gain
Published at
2024-07-11
License
CC BY-NC-SA 4.0