Tailing Google Cloud Run request logs and importing them into SQLite
The gcloud CLI tool has the alpha ability to tail log files - but it’s a bit of a pain to set up.
You have to install two extras for it. First, this:
    gcloud alpha logging tail

That installs the functionality, but as the documentation will tell you:
To use gcloud alpha logging tail, you need to have Python 3 and the grpcio Python package installed.
Assuming you have Python 3, the problem you have to solve is which Python the gcloud tool is using to run. After digging around in the source code using cat $(which gcloud), I spotted the following:
    CLOUDSDK_PYTHON=$(order_python python3 python2 python2.7 python)

So it looks like (on macOS at least) it prefers to use the python3 binary if it can find it.
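If you want to confirm which interpreter that resolves to on your machine - and, once grpcio is installed in the next step, that it can actually be imported from there - a quick check along these lines should work (an optional sanity check, not part of the documented setup):

    # Which python3 will gcloud's wrapper script find?
    which python3

    # After installing grpcio (below), confirm it imports - the package's import name is grpc
    python3 -c "import grpc; print(grpc.__version__)"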
So this works to install grpcio somewhere it can see it:
    python3 -m pip install grpcio

Having done that, you can start running commands. gcloud logging logs list shows a list of logs:
    ~ % gcloud logging logs list
    NAME
    projects/datasette-222320/logs/cloudaudit.googleapis.com%2Factivity
    projects/datasette-222320/logs/cloudaudit.googleapis.com%2Fdata_access
    projects/datasette-222320/logs/cloudaudit.googleapis.com%2Fsystem_event
    projects/datasette-222320/logs/cloudbuild
    projects/datasette-222320/logs/clouderrorreporting.googleapis.com%2Finsights
    projects/datasette-222320/logs/cloudtrace.googleapis.com%2FTraceLatencyShiftDetected
    projects/datasette-222320/logs/run.googleapis.com%2Frequests
    projects/datasette-222320/logs/run.googleapis.com%2Fstderr
    projects/datasette-222320/logs/run.googleapis.com%2Fstdout
    projects/datasette-222320/logs/run.googleapis.com%2Fvarlog%2Fsystem

Then you can use gcloud alpha logging tail projects/datasette-222320/logs/run.googleapis.com%2Frequests to start tailing that log. Only you also need a CLOUDSDK_PYTHON_SITEPACKAGES=1 environment variable so that gcloud knows to look for the grpcio dependency.
    CLOUDSDK_PYTHON_SITEPACKAGES=1 \
      gcloud alpha logging tail projects/datasette-222320/logs/run.googleapis.com%2Frequests

The default format is verbose YAML. A log entry looks like this:
    httpRequest:
      latency: 0.123684963s
      remoteIp: 66.249.69.240
      requestMethod: GET
      requestSize: '510'
      requestUrl: https://www.niche-museums.com/browse/museums.json?_facet_size=max&country=United+States&_facet=osm_city&_facet=updated&_facet=osm_suburb&_facet=osm_footway&osm_city=Santa+Cruz
      responseSize: '6403'
      serverIp: 142.250.125.121
      status: 200
      userAgent: Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
    insertId: 611171fe000a38a469d59595
    labels:
      instanceId: 00bf4bf02dab164592dbbb9220b56c3ce64cb0f1c1f37812d1d61e851a931e9964ba539c2ede42886773c82662cc28aa858749d2697f537ff7a61e7b
      service: niche-museums
    logName: projects/datasette-222320/logs/run.googleapis.com%2Frequests
    receiveTimestamp: '2021-08-09T18:20:46.935658405Z'
    resource:
      labels:
        configuration_name: niche-museums
        location: us-central1
        project_id: datasette-222320
        revision_name: niche-museums-00039-sur
        service_name: niche-museums
      type: cloud_run_revision
    severity: INFO
    timestamp: '2021-08-09T18:20:46.669860Z'
    trace: projects/datasette-222320/traces/306a0d6e7e055ba66172003a74c926c2

I decided to import the logs into a SQLite database so I could use Datasette to analyze them (hooray for facets).
Adding --format json switches the output to JSON - but it’s a pretty-printed array of JSON objects, something like this:
[ { "httpRequest": { "latency": "0.112114537s", "remoteIp": "40.77.167.88", "requestMethod": "GET", "requestSize": "534", "requestUrl": "https://datasette.io/content/repos?forks=0&_facet=homepage&_facet=size&_facet=open_issues&open_issues=3&size=564&_sort=readme_html", "responseSize": "72757", "serverIp": "216.239.38.21", "status": 200, "userAgent": "Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)" }, "insertId": "6111722f000b5b4c4d4071e2", "labels": { "instanceId": "00bf4bf02d1d7fe4402c3aff8a34688d9a910e6ee6d2545ceebc1edefb99461481e6d9f9ae8de4e907e3d18b98ea9c7f57b2abb527c8857d9163ed193db766c349a1ee", "service": "datasette-io" }, "logName": "projects/datasette-222320/logs/run.googleapis.com%2Frequests", "receiveTimestamp": "2021-08-09T18:21:36.061693305Z", "resource": { "labels": { "configuration_name": "datasette-io", "location": "us-central1", "project_id": "datasette-222320", "revision_name": "datasette-io-00416-coy", "service_name": "datasette-io" }, "type": "cloud_run_revision" }, "severity": "INFO", "timestamp": "2021-08-09T18:21:35.744268Z", "trace": "projects/datasette-222320/traces/016d640caf845fbf8709486bc8dff9c7" }]I want to stream the logs into sqlite-utils using newline-delimited JSON since that can insert while the data is still being tailed.
I ended up using two new jq recipes:
    cat example.json | jq -cn --stream 'fromstream(1|truncate_stream(inputs))'

This turns an [{"array": "of objects"}, {"like": "this one"}] into a stream of newline-delimited objects. I found the recipe here - I don’t understand it.
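Here’s a tiny illustration of what that recipe does, using a made-up two-object array rather than the real logs:

    echo '[{"id": 1}, {"id": 2}]' | jq -cn --stream 'fromstream(1|truncate_stream(inputs))'
    # Outputs one compact JSON object per line:
    # {"id":1}
    # {"id":2}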
As you can see above, the objects are nested. I want them as flat objects so that sqlite-utils insert will create a separate column for each nested value. I used this recipe for that.
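That flattening recipe appears in the full pipeline below. Applied on its own to a small nested object (an invented example, not real log data), it behaves like this:

    echo '{"resource": {"labels": {"location": "us-central1"}}, "severity": "INFO"}' \
      | jq -c '[leaf_paths as $path | {"key": $path | join("_"), "value": getpath($path)}] | from_entries'
    # {"resource_labels_location":"us-central1","severity":"INFO"}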
The end result was this:
    CLOUDSDK_PYTHON_SITEPACKAGES=1 gcloud alpha logging tail \
      projects/datasette-222320/logs/run.googleapis.com%2Frequests \
      --format json \
      | jq -cn --stream 'fromstream(1|truncate_stream(inputs))' \
      | jq -c '[leaf_paths as $path | {"key": $path | join("_"), "value": getpath($path)}] | from_entries' \
      | sqlite-utils insert /tmp/logs.db logs - --nl --alter --batch-size 1

That last line inserts the data into the /tmp/logs.db database file. --nl means "expect newline-delimited JSON", --alter means "add new columns if they are missing", --batch-size 1 means "commit after every record" (so I can see them in Datasette while they are streaming in).
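To get a feel for what those sqlite-utils flags do without waiting for real traffic, you can try them against a couple of hand-written records (the /tmp/demo.db database and rows table here are throwaway names, made up for illustration):

    printf '{"id": 1}\n{"id": 2, "note": "second record has an extra column"}\n' \
      | sqlite-utils insert /tmp/demo.db rows - --nl --alter --batch-size 1
    # --nl reads one JSON object per line; --alter adds the [note] column
    # when it first appears; --batch-size 1 commits after each record
    sqlite-utils schema /tmp/demo.db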
UPDATE: sqlite-utils 3.15 added a --flatten option which you can use instead of that second jq recipe, so this should work instead:
    CLOUDSDK_PYTHON_SITEPACKAGES=1 gcloud alpha logging tail \
      projects/datasette-222320/logs/run.googleapis.com%2Frequests \
      --format json \
      | jq -cn --stream 'fromstream(1|truncate_stream(inputs))' \
      | sqlite-utils insert /tmp/logs.db logs - --nl --alter --batch-size 1 --flatten

The resulting schema looks like this (via sqlite-utils schema /tmp/logs.db):
    CREATE TABLE [logs] (
       [httpRequest_latency] TEXT,
       [httpRequest_remoteIp] TEXT,
       [httpRequest_requestMethod] TEXT,
       [httpRequest_requestSize] TEXT,
       [httpRequest_requestUrl] TEXT,
       [httpRequest_responseSize] TEXT,
       [httpRequest_serverIp] TEXT,
       [httpRequest_status] INTEGER,
       [httpRequest_userAgent] TEXT,
       [insertId] TEXT,
       [labels_instanceId] TEXT,
       [labels_service] TEXT,
       [logName] TEXT,
       [receiveTimestamp] TEXT,
       [resource_labels_configuration_name] TEXT,
       [resource_labels_location] TEXT,
       [resource_labels_project_id] TEXT,
       [resource_labels_revision_name] TEXT,
       [resource_labels_service_name] TEXT,
       [resource_type] TEXT,
       [severity] TEXT,
       [timestamp] TEXT,
       [trace] TEXT,
       [httpRequest_referer] TEXT
    );

Then I ran datasette /tmp/logs.db to start exploring the logs. Faceting by resource_labels_service_name was particularly useful.
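As an aside on that --flatten option: you can see what it does in isolation by feeding it a single nested record (another throwaway example, with an invented /tmp/flatten-demo.db path):

    echo '[{"httpRequest": {"status": 200}, "severity": "INFO"}]' \
      | sqlite-utils insert /tmp/flatten-demo.db example - --flatten
    sqlite-utils schema /tmp/flatten-demo.db
    # Expect something like: CREATE TABLE [example] ([httpRequest_status] INTEGER, [severity] TEXT);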

The httpRequest_latency column contains text data that looks like 0.012572683s - thankfully if you cast it to a float the trailing s will be ignored. Here’s an example query showing the services with the highest average latency:
    select
      resource_labels_service_name,
      avg(cast(httpRequest_latency as float)) as avg_latency,
      count(*)
    from logs
    group by resource_labels_service_name
    order by avg_latency desc

Using the Logs explorer
Alternatively, you can use the Google Cloud logs explorer! It has pretty decent faceted search built in.
Here’s a query showing results from that log file:
resource.type="cloud_run_revision"log_name="projects/datasette-222320/logs/run.googleapis.com%2Frequests"Run that at https://console.cloud.google.com/logs/query - or here’s a link I can use to execute that directly (for the last 7 days): https://console.cloud.google.com/logs/query;query=resource.type%3D%22cloud_run_revision%22%0Alog_name%3D%22projects%2Fdatasette-222320%2Flogs%2Frun.googleapis.com%252Frequests%22;timeRange=P7D;?project=datasette-222320