Using S3 triggers to maintain a list of files in DynamoDB
This is a three-quarters-baked experiment that I ran this morning. I’m interested in efficiently tracking which new files have been added to an S3 bucket, where that bucket could potentially hold millions of items.
Frustratingly, S3 doesn’t offer an API for listing files that have been updated since a specific date - you have to do a full paginated listing of every file in the bucket every time.
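To see why that hurts, here’s roughly what the status quo looks like: a full paginated scan with list_objects_v2 and a client-side date comparison (the bucket name and cutoff date here are just placeholders):

```python
import boto3
from datetime import datetime, timezone

s3 = boto3.client("s3")
paginator = s3.get_paginator("list_objects_v2")
cutoff = datetime(2025, 1, 1, tzinfo=timezone.utc)

# There is no "modified since" filter on the API, so finding new files means
# walking every page of every key and comparing LastModified yourself.
new_keys = []
for page in paginator.paginate(Bucket="my-big-bucket"):
    for obj in page.get("Contents", []):
        if obj["LastModified"] >= cutoff:
            new_keys.append(obj["Key"])
```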
Could this be solved using S3 triggers, where a lambda function runs for every change to the bucket and writes the results somewhere else?
I was curious to see if this could work, so I decided to build a prototype using Claude, ChatGPT and o3-mini-high.
I hate figuring out AWS things, but I’ve recently found that most of the modern LLMs have a deep knowledge of the key AWS APIs, thanks to those APIs having stayed stable for years.
Which means I can learn AWS by getting LLMs to write custom one-off tools for me!
Here’s the prompt I gave Claude:
Python CLI app using boto3 with commands for creating a new S3 bucket which it also configures to have S3 lambda event triggers which maintain a DynamoDB table containing metadata about all of the files in that bucket. Include these commands
create_bucket - creates a bucket and sets up the associated triggers and dynamo tables
list_files - shows me a list of files based purely on querying dynamo
I’ve never used S3 lambda triggers or DynamoDB, so this was a fun new experiment!
Claude wrote a script that did the following:
Create a DynamoDB table called "{bucket_name}-metadata" with a bucket_key string primary key (“KeySchema” in DynamoDB terminology)
Create a Lambda role with (presumably) the correct policies attached
Create a new Lambda function, implemented in Python, that updates DynamoDB (a sketch of this kind of handler follows this list)
Set that function to trigger on ["s3:ObjectCreated:*", "s3:ObjectRemoved:*"] bucket notification events
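For a sense of what that handler involves, here’s a minimal sketch of that kind of function - not the code Claude actually generated - assuming the "{bucket_name}-metadata" table and bucket_key key described above (the other attribute names are illustrative):

```python
import urllib.parse
import boto3

dynamodb = boto3.resource("dynamodb")

def lambda_handler(event, context):
    # One S3 notification can carry several records.
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        # Object keys arrive URL-encoded in notification events.
        key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])
        table = dynamodb.Table(f"{bucket}-metadata")
        if record["eventName"].startswith("ObjectRemoved"):
            table.delete_item(Key={"bucket_key": key})
        else:
            table.put_item(Item={
                "bucket_key": key,
                "size": record["s3"]["object"].get("size", 0),
                "last_event": record["eventTime"],
            })
```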
On reading the script more closely though, I spotted what looked like a bug: it was sending the raw Python code to Lambda as the Code={'ZipFile': lambda_code} parameter, and I was pretty sure that was meant to be a zip file.
Rather than continuing to wrestle with Claude (which probably would have worked) I decided to switch models and dump the whole thing into ChatGPT’s o3-mini-high model and tell it to identify and fix any bugs:
Identify, explain and then fix any bugs in this code:
code from Claude pasted here
Here’s the transcript. It “reasoned” for 1 minute and 9 seconds and then spat out the following (full code to follow at the end):
Below is one acceptable answer. In our original code there were three main issues:
Lambda Code Not Packaged as a ZIP Archive
The AWS Lambda API (when using the ZipFile parameter) expects a ZIP archive containing your function code (with the file name matching your handler, e.g. “lambda_function.py”). In the original code the raw text was encoded with .encode(), which does not create a ZIP file.
Fix: Use Python’s zipfile (with an in‑memory buffer via io.BytesIO) to package the code as a ZIP archive before uploading.
Missing Permission for S3 to Invoke the Lambda Function
S3 bucket notifications will not work unless the Lambda function’s resource policy allows S3 to invoke it. In the original code no permission was granted to S3.
Fix: Call lambda_client.add_permission with the proper parameters (including a statement ID, the action "lambda:InvokeFunction", principal "s3.amazonaws.com", and the source ARN for your bucket).
Region Mismatch for AWS Clients
The command-line option "--region" is used when creating the S3 bucket but not for other resources (Lambda and DynamoDB are created using the default region). This can lead to resources being created in different regions.
Fix: Instantiate all regional boto3 clients with the same region (by using the region_name parameter) and pass the region value (or create a context) so that all API calls are in the intended region.
ZIP Packaging:
The Lambda source is now written into an in-memory ZIP archive, and that archive is passed in Code={'ZipFile': zip_bytes} so that AWS Lambda receives a proper deployment package.
Lambda Permission:
In the create_bucket command we add:
```python
lambda_client.add_permission(
    FunctionName=function_name,
    StatementId=f"{bucket_name}-s3invoke",
    Action="lambda:InvokeFunction",
    Principal="s3.amazonaws.com",
    SourceArn=f"arn:aws:s3:::{bucket_name}"
)
```
This allows S3 (identified by its ARN) to invoke the Lambda function.
Region Consistency:
The CLI now takes a global option "--region" (with default 'us-east-1'). Every boto3 client that is regional (S3, Lambda, DynamoDB) is created with that region so that all resources are created in the same region.
This version fixes the bugs noted above and is one correct way to resolve the issues in the original code.
It spotted and fixed the ZipFile issue I identified, and it looks like it found a couple of other issues worth fixing too.
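The heart of that ZipFile fix is a few lines of in-memory packaging. Roughly this (a sketch, assuming lambda_code holds the handler source as a string):

```python
import io
import zipfile

# Build an in-memory ZIP whose single entry matches the handler module name,
# e.g. lambda_function.py for a handler of "lambda_function.lambda_handler".
zip_buffer = io.BytesIO()
with zipfile.ZipFile(zip_buffer, "w", zipfile.ZIP_DEFLATED) as zf:
    zf.writestr("lambda_function.py", lambda_code)

# This is what gets passed to create_function as Code={"ZipFile": zip_bytes}
zip_bytes = zip_buffer.getvalue()
```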
I ran the code that it gave me… and it almost worked! I got this error:
```
Created bucket: simon-magic-bucket-2
Created DynamoDB table: simon-magic-bucket-2-metadata
Created/using IAM role: simon-magic-bucket-2-lambda-role
Error: An error occurred (InvalidParameterValueException) when calling the CreateFunction operation: The role defined for the function cannot be assumed by Lambda.
Aborted!
```
I pasted that error message back into o3-mini-high. It “reasoned” for another 8 seconds and suggested that I add time.sleep(10) after creating the role to allow it time to propagate.
(I eventually added another time.sleep(5) elsewhere for a similar reason.)
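A fixed sleep works, but if I were polishing this I’d retry the create_function call until the role becomes assumable. A sketch (reusing the script’s lambda_client, function_name, role_arn and zip_bytes, and assuming a Python 3.12 runtime):

```python
import time

# Retry while the freshly created IAM role propagates, instead of sleeping blindly.
for attempt in range(10):
    try:
        lambda_client.create_function(
            FunctionName=function_name,
            Runtime="python3.12",
            Role=role_arn,
            Handler="lambda_function.lambda_handler",
            Code={"ZipFile": zip_bytes},
        )
        break
    except lambda_client.exceptions.InvalidParameterValueException:
        # "The role defined for the function cannot be assumed by Lambda" - yet.
        time.sleep(3)
```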
I’ve since saved the file on GitHub, which means anyone with uv installed (and their AWS credentials configured on their machine) can run it directly with uv run.
This is a solid proof of concept: I’ve proven to myself that this approach is feasible using the tools I have at hand.
There’s actually not much left before I would consider putting it in production. Off the top of my head:
A few people have pointed me to Amazon S3 Metadata, a brand new (previewed in December 2024, GA in January 2025) system for queryable object metadata that solves almost exactly the same problem! I may well use that instead.
The trigger right now only tracks "s3:ObjectCreated:*" and "s3:ObjectRemoved:*" - is there an equivalent event for when an object is updated? Claude says that an update is treated as an ObjectCreated event as well, so probably not, but there might be other events I want to track.
I’d really like to get some automated testing in place for something like this. A CI job that creates a new bucket, adds some files, waits a bit, checks that the DynamoDB table was populated and then cleans up at the end would give me more confidence in the system going forward (a rough sketch of such a test follows this list).
If I ever use this properly I’ll turn it into a Python package with a GitHub repo, issues, a changelog etc.
Is DynamoDB the best choice for this? I’d like to spend a little more time thinking through the project rather than assuming that my first guesses were the most appropriate architecture here.
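The test I have in mind would look something like this rough sketch (the bucket name and wait time are placeholders, and it assumes the bucket, trigger and table already exist):

```python
import time
import boto3

def test_trigger_round_trip(bucket_name="my-test-bucket"):
    s3 = boto3.client("s3")
    table = boto3.resource("dynamodb").Table(f"{bucket_name}-metadata")

    # Upload a file, give the trigger a moment to fire, then check DynamoDB.
    s3.put_object(Bucket=bucket_name, Key="ci-probe.txt", Body=b"hello")
    time.sleep(15)
    item = table.get_item(Key={"bucket_key": "ci-probe.txt"}).get("Item")
    assert item is not None, "Lambda trigger did not write the metadata row"

    # Clean up the probe object (a real CI job would tear down everything).
    s3.delete_object(Bucket=bucket_name, Key="ci-probe.txt")
```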
Honestly though I feel like this prototype got me 90% of the way there already.