Using S3 triggers to maintain a list of files in DynamoDB
This is a three-quarters-baked experiment that I ran this morning. I’m interested in efficiently tracking which new files have been added to an S3 bucket, where that bucket could potentially hold millions of items.
Frustratingly, S3 doesn’t offer an API for listing files that have been updated since a specific date - you have to do a full paginated listing of every file in the bucket every time.
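To see why that hurts, here’s roughly what the status quo looks like: a full paginated scan with list_objects_v2 and a client-side date comparison (the bucket name and cutoff date here are just placeholders):

```python
import boto3
from datetime import datetime, timezone

s3 = boto3.client("s3")
paginator = s3.get_paginator("list_objects_v2")
cutoff = datetime(2025, 1, 1, tzinfo=timezone.utc)

# There is no "modified since" filter on the API, so finding new files means
# walking every page of every key and comparing LastModified yourself.
new_keys = []
for page in paginator.paginate(Bucket="my-big-bucket"):
    for obj in page.get("Contents", []):
        if obj["LastModified"] >= cutoff:
            new_keys.append(obj["Key"])
```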
Could this be solved using S3 triggers, where a lambda function runs for every change to the bucket and writes the results somewhere else?
I was curious to see if this could work, so I decided to build a prototype using Claude, ChatGPT and o3-mini-high.
I hate figuring out AWS things, but I’ve recently found that most of the modern LLMs have a deep knowledge of the key AWS APIs, thanks to those APIs having stayed stable for years.
Which means I can learn AWS by getting LLMs to write custom one-off tools for me!
Here’s the prompt I gave Claude:
Python CLI app using boto3 with commands for creating a new S3 bucket which it also configures to have S3 lambda event triggers which maintain a DynamoDB table containing metadata about all of the files in that bucket. Include these commands
create_bucket - creates a bucket and sets up the associated triggers and dynamo tables
list_files - shows me a list of files based purely on querying dynamo
I’ve never used S3 lambda triggers or DynamoDB, so this was a fun new experiment!
Claude wrote a script that did the following:
Create a DynamoDB table called "{bucket_name}-metadata" with a bucket_key string primary key (“KeySchema” in DynamoDB terminology)
Create a Lambda role with (presumably) the correct policies attached
Create a new Lambda function, implemented in Python, that updates DynamoDB (a sketch of this kind of handler follows this list)
Set that function to trigger on ["s3:ObjectCreated:*", "s3:ObjectRemoved:*"] bucket notification events
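For a sense of what that handler involves, here’s a minimal sketch of that kind of function - not the code Claude actually generated - assuming the "{bucket_name}-metadata" table and bucket_key key described above (the other attribute names are illustrative):

```python
import urllib.parse
import boto3

dynamodb = boto3.resource("dynamodb")

def lambda_handler(event, context):
    # One S3 notification can carry several records.
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        # Object keys arrive URL-encoded in notification events.
        key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])
        table = dynamodb.Table(f"{bucket}-metadata")
        if record["eventName"].startswith("ObjectRemoved"):
            table.delete_item(Key={"bucket_key": key})
        else:
            table.put_item(Item={
                "bucket_key": key,
                "size": record["s3"]["object"].get("size", 0),
                "last_event": record["eventTime"],
            })
```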
On reading the script more closely though, I spotted what looked like a bug: it was sending the raw Python code to Lambda as the Code={'ZipFile': lambda_code} parameter, and I was pretty sure that was meant to be a zip file.
Rather than continuing to wrestle with Claude (which probably would have worked) I decided to switch models and dump the whole thing into ChatGPT’s o3-mini-high model and tell it to identify and fix any bugs:
Identify, explain and then fix any bugs in this code:
code from Claude pasted here
Here’s the transcript. It “reasoned” for 1 minute and 9 seconds and then spat out the following (full code to follow at the end):
Below is one acceptable answer. In our original code there were three main issues:
Lambda Code Not Packaged as a ZIP Archive
The AWS Lambda API (when using the ZipFile parameter) expects a ZIP archive containing your function code (with the file name matching your handler, e.g. “lambda_function.py”). In the original code the raw text was encoded with .encode(), which does not create a ZIP file.
Fix: Use Python’s zipfile (with an in‑memory buffer via io.BytesIO) to package the code as a ZIP archive before uploading.
Missing Permission for S3 to Invoke the Lambda Function
S3 bucket notifications will not work unless the Lambda function’s resource policy allows S3 to invoke it. In the original code no permission was granted to S3.
Fix: Call lambda_client.add_permission with the proper parameters (including a statement ID, the action "lambda:InvokeFunction", principal "s3.amazonaws.com", and the source ARN for your bucket).
Region Mismatch for AWS Clients
The command-line option "--region" is used when creating the S3 bucket but not for other resources (Lambda and DynamoDB are created using the default region). This can lead to resources being created in different regions.
Fix: Instantiate all regional boto3 clients with the same region (by using the region_name parameter) and pass the region value (or create a context) so that all API calls are in the intended region.
ZIP Packaging:
The Lambda source is now written into an in-memory ZIP archive, and that archive is passed in Code={'ZipFile': zip_bytes} so that AWS Lambda receives a proper deployment package.
Lambda Permission:
In the create_bucket command we add:
```python
lambda_client.add_permission(
    FunctionName=function_name,
    StatementId=f"{bucket_name}-s3invoke",
    Action="lambda:InvokeFunction",
    Principal="s3.amazonaws.com",
    SourceArn=f"arn:aws:s3:::{bucket_name}"
)
```
This allows S3 (identified by its ARN) to invoke the Lambda function.
Region Consistency:
The CLI now takes a global option "--region" (with default 'us-east-1'). Every boto3 client that is regional (S3, Lambda, DynamoDB) is created with that region so that all resources are created in the same region.
This version fixes the bugs noted above and is one correct way to resolve the issues in the original code.
It spotted and fixed the ZipFile issue I identified, and it looks like it found a couple of other issues worth fixing too.
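The heart of that ZipFile fix is a few lines of in-memory packaging. Roughly this (a sketch, assuming lambda_code holds the handler source as a string):

```python
import io
import zipfile

# Build an in-memory ZIP whose single entry matches the handler module name,
# e.g. lambda_function.py for a handler of "lambda_function.lambda_handler".
zip_buffer = io.BytesIO()
with zipfile.ZipFile(zip_buffer, "w", zipfile.ZIP_DEFLATED) as zf:
    zf.writestr("lambda_function.py", lambda_code)

# This is what gets passed to create_function as Code={"ZipFile": zip_bytes}
zip_bytes = zip_buffer.getvalue()
```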
I ran the code that it gave me… and it almost worked! I got this error:
```
Created bucket: simon-magic-bucket-2
Created DynamoDB table: simon-magic-bucket-2-metadata
Created/using IAM role: simon-magic-bucket-2-lambda-role
Error: An error occurred (InvalidParameterValueException) when calling the CreateFunction operation: The role defined for the function cannot be assumed by Lambda.
Aborted!
```
I pasted that error message back into o3-mini-high. It “reasoned” for another 8 seconds and suggested that I add time.sleep(10) after creating the role to allow it time to propagate.
(I eventually added another time.sleep(5) elsewhere for a similar reason.)
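A fixed sleep works, but if I were polishing this I’d retry the create_function call until the role becomes assumable. A sketch (reusing the script’s lambda_client, function_name, role_arn and zip_bytes, and assuming a Python 3.12 runtime):

```python
import time

# Retry while the freshly created IAM role propagates, instead of sleeping blindly.
for attempt in range(10):
    try:
        lambda_client.create_function(
            FunctionName=function_name,
            Runtime="python3.12",
            Role=role_arn,
            Handler="lambda_function.lambda_handler",
            Code={"ZipFile": zip_bytes},
        )
        break
    except lambda_client.exceptions.InvalidParameterValueException:
        # "The role defined for the function cannot be assumed by Lambda" - yet.
        time.sleep(3)
```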
I’ve since saved the file on GitHub, which means anyone with uv installed (and their AWS credentials configured on their machine) can run it directly with uv run.
This is a solid proof of concept: I’ve proven to myself that this approach is feasible using the tools I have at hand.
There’s actually not much left before I would consider putting it in production. Off the top of my head:
A few people have pointed me to Amazon S3 Metadata, a brand new (previewed in December 2024, GA in January 2025) system for queryable object metadata that solves almost exactly the same problem! I may well use that instead.
The trigger right now only tracks "s3:ObjectCreated:*" and "s3:ObjectRemoved:*" - is there an equivalent event for when an object is updated? Claude says that an update is treated as an ObjectCreated event as well, so probably not, but there might be other events I want to track.
I’d really like to get some automated testing in place for something like this. A CI job that creates a new bucket, adds some files, waits a bit, checks that the DynamoDB table was populated and then cleans up at the end would give me more confidence in the system going forward (a rough sketch of such a test follows this list).
If I ever use this properly I’ll turn it into a Python package with a GitHub repo, issues, a changelog etc.
Is DynamoDB the best choice for this? I’d like to spend a little more time thinking through the project rather than assuming that my first guesses were the most appropriate architecture here.
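The test I have in mind would look something like this rough sketch (the bucket name and wait time are placeholders, and it assumes the bucket, trigger and table already exist):

```python
import time
import boto3

def test_trigger_round_trip(bucket_name="my-test-bucket"):
    s3 = boto3.client("s3")
    table = boto3.resource("dynamodb").Table(f"{bucket_name}-metadata")

    # Upload a file, give the trigger a moment to fire, then check DynamoDB.
    s3.put_object(Bucket=bucket_name, Key="ci-probe.txt", Body=b"hello")
    time.sleep(15)
    item = table.get_item(Key={"bucket_key": "ci-probe.txt"}).get("Item")
    assert item is not None, "Lambda trigger did not write the metadata row"

    # Clean up the probe object (a real CI job would tear down everything).
    s3.delete_object(Bucket=bucket_name, Key="ci-probe.txt")
```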
Honestly though I feel like this prototype got me 90% of the way there already.