The Complete Guide to AWS Glue: Serverless ETL and Data Catalog Management#

AWS Glue is a fully managed extract, transform, and load (ETL) service that makes it easy to prepare and load data for analytics. This comprehensive guide covers all aspects of AWS Glue, from basic data cataloging to advanced ETL workflows and optimization strategies.

Table of Contents#

  1. Introduction to AWS Glue
  2. Core Components
  3. AWS Glue Data Catalog
  4. Crawlers and Schema Discovery
  5. ETL Jobs and Development
  6. AWS Glue Studio
  7. Data Transformations
  8. Integration Patterns
  9. Performance Optimization
  10. Monitoring and Debugging
  11. Security and Governance
  12. Best Practices
  13. Cost Optimization
  14. Troubleshooting

Introduction to AWS Glue {#introduction}#

AWS Glue is a serverless data integration service that simplifies data preparation and loading for analytics and machine learning. It provides both visual and code-based interfaces for building ETL workflows.

Key Benefits:#

  • Serverless: No infrastructure to manage
  • Scalable: Automatically scales based on workload
  • Cost-effective: Pay only for resources used during job execution
  • Integrated: Works seamlessly with other AWS services
  • Code Generation: Automatically generates ETL code

Use Cases:#

  • Data lake and data warehouse preparation
  • Real-time and batch data processing
  • Data cataloging and discovery
  • Schema evolution management
  • Data quality and validation
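
Before digging into each component, the short sketch below shows how little code is needed to start exploring an existing Data Catalog. It is a minimal example assuming default boto3 credentials and region; get_databases and get_tables are the standard catalog read APIs, and the output simply maps each table to its storage location.

import boto3

# Assumes default credentials/region; adjust as needed
glue = boto3.client('glue')

# List every database registered in the Data Catalog
for database in glue.get_databases()['DatabaseList']:
    print(f"Database: {database['Name']}")
    # List the tables catalogued under each database
    for table in glue.get_tables(DatabaseName=database['Name'])['TableList']:
        location = table.get('StorageDescriptor', {}).get('Location', 'n/a')
        print(f"  {table['Name']} -> {location}")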

Core Components {#core-components}#

import boto3
import json
from datetime import datetime

# Initialize AWS Glue clients
glue = boto3.client('glue')
s3 = boto3.client('s3')


def glue_components_overview():
    """
    Overview of AWS Glue components and their purposes
    """
    components = {
        "data_catalog": {
            "description": "Centralized metadata repository",
            "components": [
                "Databases",
                "Tables",
                "Partitions",
                "Connections"
            ],
            "benefits": [
                "Schema discovery and evolution",
                "Data lineage tracking",
                "Cross-service metadata sharing",
                "Query optimization"
            ]
        },
        "crawlers": {
            "description": "Automated schema discovery and cataloging",
            "capabilities": [
                "Schema inference from data",
                "Partition discovery",
                "Schema evolution handling",
                "Scheduled crawling"
            ],
            "supported_sources": [
                "Amazon S3",
                "Amazon RDS",
                "Amazon Redshift",
                "JDBC databases",
                "DynamoDB"
            ]
        },
        "etl_jobs": {
            "description": "Data transformation and loading workflows",
            "types": [
                "Apache Spark jobs",
                "Python Shell jobs",
                "Ray jobs"
            ],
            "execution_modes": [
                "Serverless",
                "Traditional (with DPUs)"
            ]
        },
        "glue_studio": {
            "description": "Visual interface for ETL job creation",
            "features": [
                "Drag-and-drop interface",
                "Pre-built transforms",
                "Code generation",
                "Job monitoring"
            ]
        },
        "data_brew": {
            "description": "Visual data preparation tool",
            "capabilities": [
                "Data profiling",
                "Data cleaning",
                "Recipe-based transformations",
                "Data quality rules"
            ]
        }
    }
    return components


print("AWS Glue Components Overview:")
print(json.dumps(glue_components_overview(), indent=2))

AWS Glue Data Catalog {#data-catalog}#

Managing Databases and Tables#

import boto3
from datetime import datetime


class GlueCatalogManager:
    def __init__(self):
        self.glue = boto3.client('glue')

    def create_database(self, database_name, description=""):
        """
        Create a Glue database
        """
        try:
            response = self.glue.create_database(
                DatabaseInput={
                    'Name': database_name,
                    'Description': description or f'Database for {database_name} data',
                    'Parameters': {
                        'classification': 'database',
                        'owner': 'data-engineering-team',
                        'created_by': 'glue-automation'
                    }
                }
            )
            print(f"Database '{database_name}' created successfully")
            return response
        except self.glue.exceptions.AlreadyExistsException:
            print(f"Database '{database_name}' already exists")
        except Exception as e:
            print(f"Error creating database: {e}")
            return None

    def create_table(self, database_name, table_name, s3_location,
                     input_format, output_format, serde_info, columns, partitions=None):
        """
        Create a table in the Glue Data Catalog
        """
        try:
            storage_descriptor = {
                'Columns': columns,
                'Location': s3_location,
                'InputFormat': input_format,
                'OutputFormat': output_format,
                'SerdeInfo': serde_info,
                'Compressed': False,
                'StoredAsSubDirectories': False
            }
            table_input = {
                'Name': table_name,
                'Description': f'Table {table_name} in {database_name}',
                'StorageDescriptor': storage_descriptor,
                'Parameters': {
                    'classification': self._get_classification(input_format),
                    'compressionType': 'none',
                    'typeOfData': 'file'
                }
            }
            if partitions:
                table_input['PartitionKeys'] = partitions
            response = self.glue.create_table(
                DatabaseName=database_name,
                TableInput=table_input
            )
            print(f"Table '{table_name}' created in database '{database_name}'")
            return response
        except Exception as e:
            print(f"Error creating table: {e}")
            return None

    def create_parquet_table(self, database_name, table_name, s3_location, columns, partitions=None):
        """
        Create a Parquet table with standard configuration
        """
        return self.create_table(
            database_name=database_name,
            table_name=table_name,
            s3_location=s3_location,
            input_format='org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat',
            output_format='org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat',
            serde_info={
                'SerializationLibrary': 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
            },
            columns=columns,
            partitions=partitions
        )

    def create_json_table(self, database_name, table_name, s3_location, columns, partitions=None):
        """
        Create a JSON table with standard configuration
        """
        return self.create_table(
            database_name=database_name,
            table_name=table_name,
            s3_location=s3_location,
            input_format='org.apache.hadoop.mapred.TextInputFormat',
            output_format='org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat',
            serde_info={
                'SerializationLibrary': 'org.openx.data.jsonserde.JsonSerDe'
            },
            columns=columns,
            partitions=partitions
        )

    def create_csv_table(self, database_name, table_name, s3_location, columns,
                         delimiter=',', partitions=None):
        """
        Create a CSV table with standard configuration
        """
        return self.create_table(
            database_name=database_name,
            table_name=table_name,
            s3_location=s3_location,
            input_format='org.apache.hadoop.mapred.TextInputFormat',
            output_format='org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat',
            serde_info={
                'SerializationLibrary': 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe',
                'Parameters': {
                    'field.delim': delimiter,
                    'skip.header.line.count': '1'
                }
            },
            columns=columns,
            partitions=partitions
        )

    def update_table_schema(self, database_name, table_name, new_columns, version_id=None):
        """
        Update table schema with new columns
        """
        try:
            # Get current table definition
            current_table = self.glue.get_table(
                DatabaseName=database_name,
                Name=table_name
            )
            # Update storage descriptor with new columns
            table_input = current_table['Table']
            table_input['StorageDescriptor']['Columns'] = new_columns
            # Remove read-only fields that update_table does not accept
            for field in ['CreatedBy', 'CreateTime', 'UpdateTime', 'DatabaseName',
                          'IsRegisteredWithLakeFormation', 'CatalogId', 'VersionId']:
                table_input.pop(field, None)
            update_kwargs = {
                'DatabaseName': database_name,
                'TableInput': table_input
            }
            if version_id:
                update_kwargs['VersionId'] = version_id
            response = self.glue.update_table(**update_kwargs)
            print(f"Table '{table_name}' schema updated successfully")
            return response
        except Exception as e:
            print(f"Error updating table schema: {e}")
            return None

    def add_partition(self, database_name, table_name, partition_values, storage_location):
        """
        Add a partition to a table
        """
        try:
            # Get table to understand partition structure
            table = self.glue.get_table(
                DatabaseName=database_name,
                Name=table_name
            )
            partition_keys = table['Table'].get('PartitionKeys', [])
            if len(partition_values) != len(partition_keys):
                raise ValueError(f"Expected {len(partition_keys)} partition values, got {len(partition_values)}")
            storage_descriptor = table['Table']['StorageDescriptor'].copy()
            storage_descriptor['Location'] = storage_location
            response = self.glue.create_partition(
                DatabaseName=database_name,
                TableName=table_name,
                PartitionInput={
                    'Values': partition_values,
                    'StorageDescriptor': storage_descriptor,
                    'Parameters': {
                        'last_modified_by': 'glue-automation',
                        'last_modified_time': str(int(datetime.utcnow().timestamp()))
                    }
                }
            )
            print(f"Partition {partition_values} added to table '{table_name}'")
            return response
        except Exception as e:
            print(f"Error adding partition: {e}")
            return None

    def _get_classification(self, input_format):
        """
        Get classification based on input format
        """
        format_mapping = {
            'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat': 'parquet',
            'org.apache.hadoop.mapred.TextInputFormat': 'csv',
            'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat': 'orc'
        }
        return format_mapping.get(input_format, 'unknown')

    def list_tables(self, database_name):
        """
        List all tables in a database
        """
        try:
            response = self.glue.get_tables(DatabaseName=database_name)
            tables_info = []
            for table in response['TableList']:
                tables_info.append({
                    'name': table['Name'],
                    'location': table['StorageDescriptor'].get('Location', 'N/A'),
                    'input_format': table['StorageDescriptor'].get('InputFormat', 'N/A'),
                    'columns': len(table['StorageDescriptor'].get('Columns', [])),
                    'partitions': len(table.get('PartitionKeys', []))
                })
            return tables_info
        except Exception as e:
            print(f"Error listing tables: {e}")
            return []


# Usage examples
catalog_manager = GlueCatalogManager()

# Create database
catalog_manager.create_database('ecommerce_data', 'E-commerce analytics data')

# Define columns for different table types
user_columns = [
    {'Name': 'user_id', 'Type': 'string'},
    {'Name': 'email', 'Type': 'string'},
    {'Name': 'registration_date', 'Type': 'timestamp'},
    {'Name': 'country', 'Type': 'string'},
    {'Name': 'age', 'Type': 'int'},
    {'Name': 'subscription_tier', 'Type': 'string'}
]
order_columns = [
    {'Name': 'order_id', 'Type': 'string'},
    {'Name': 'user_id', 'Type': 'string'},
    {'Name': 'product_id', 'Type': 'string'},
    {'Name': 'quantity', 'Type': 'int'},
    {'Name': 'price', 'Type': 'decimal(10,2)'},
    {'Name': 'order_timestamp', 'Type': 'timestamp'},
    {'Name': 'status', 'Type': 'string'}
]

# Define partition keys
date_partitions = [
    {'Name': 'year', 'Type': 'string'},
    {'Name': 'month', 'Type': 'string'},
    {'Name': 'day', 'Type': 'string'}
]

# Create tables with different formats
catalog_manager.create_parquet_table(
    'ecommerce_data',
    'users',
    's3://my-data-lake/users/',
    user_columns
)
catalog_manager.create_parquet_table(
    'ecommerce_data',
    'orders',
    's3://my-data-lake/orders/',
    order_columns,
    partitions=date_partitions
)
catalog_manager.create_json_table(
    'ecommerce_data',
    'events',
    's3://my-data-lake/events/',
    [
        {'Name': 'event_id', 'Type': 'string'},
        {'Name': 'user_id', 'Type': 'string'},
        {'Name': 'event_type', 'Type': 'string'},
        {'Name': 'properties', 'Type': 'string'},
        {'Name': 'timestamp', 'Type': 'timestamp'}
    ],
    partitions=date_partitions
)

# Add partition to orders table
catalog_manager.add_partition(
    'ecommerce_data',
    'orders',
    ['2024', '01', '15'],
    's3://my-data-lake/orders/year=2024/month=01/day=15/'
)

# List tables
tables = catalog_manager.list_tables('ecommerce_data')
print("Tables in ecommerce_data database:")
for table in tables:
    print(f" {table['name']}: {table['columns']} columns, {table['partitions']} partition keys")

Crawlers and Schema Discovery {#crawlers}#

Creating and Managing Crawlers#

class GlueCrawlerManager:
def __init__(self):
self.glue = boto3.client('glue')
self.iam = boto3.client('iam')
def create_crawler_role(self, role_name):
"""
Create IAM role for Glue crawler
"""
trust_policy = {
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Principal": {
"Service": "glue.amazonaws.com"
},
"Action": "sts:AssumeRole"
}
]
}
try:
response = self.iam.create_role(
RoleName=role_name,
AssumeRolePolicyDocument=json.dumps(trust_policy),
Description='IAM role for AWS Glue crawler'
)
# Attach required policies
policies = [
'arn:aws:iam::aws:policy/service-role/AWSGlueServiceRole',
'arn:aws:iam::aws:policy/AmazonS3ReadOnlyAccess'
]
for policy_arn in policies:
self.iam.attach_role_policy(
RoleName=role_name,
PolicyArn=policy_arn
)
role_arn = response['Role']['Arn']
print(f"Crawler role created: {role_arn}")
return role_arn
except Exception as e:
print(f"Error creating crawler role: {e}")
return None
def create_s3_crawler(self, crawler_name, database_name, s3_path, role_arn,
schedule=None, table_prefix=""):
"""
Create a crawler for S3 data sources
"""
try:
crawler_config = {
'Name': crawler_name,
'Role': role_arn,
'DatabaseName': database_name,
'Description': f'Crawler for {s3_path}',
'Targets': {
'S3Targets': [
{
'Path': s3_path,
'Exclusions': [
'**/_temporary/**',
'**/_SUCCESS',
'**/.DS_Store'
]
}
]
},
'TablePrefix': table_prefix,
'SchemaChangePolicy': {
'UpdateBehavior': 'UPDATE_IN_DATABASE',
'DeleteBehavior': 'LOG'
},
'RecrawlPolicy': {
'RecrawlBehavior': 'CRAWL_EVERYTHING'
},
'LineageConfiguration': {
'CrawlerLineageSettings': 'ENABLE'
},
'Configuration': json.dumps({
"Version": 1.0,
"CrawlerOutput": {
"Partitions": {"AddOrUpdateBehavior": "InheritFromTable"},
"Tables": {"AddOrUpdateBehavior": "MergeNewColumns"}
}
})
}
if schedule:
crawler_config['Schedule'] = schedule
response = self.glue.create_crawler(**crawler_config)
print(f"S3 crawler '{crawler_name}' created successfully")
return response
except Exception as e:
print(f"Error creating S3 crawler: {e}")
return None
def create_jdbc_crawler(self, crawler_name, database_name, connection_name,
jdbc_path, role_arn, schedule=None):
"""
Create a crawler for JDBC data sources
"""
try:
crawler_config = {
'Name': crawler_name,
'Role': role_arn,
'DatabaseName': database_name,
'Description': f'JDBC crawler for {jdbc_path}',
'Targets': {
'JdbcTargets': [
{
'ConnectionName': connection_name,
'Path': jdbc_path
}
]
},
'SchemaChangePolicy': {
'UpdateBehavior': 'UPDATE_IN_DATABASE',
'DeleteBehavior': 'LOG'
}
}
if schedule:
crawler_config['Schedule'] = schedule
response = self.glue.create_crawler(**crawler_config)
print(f"JDBC crawler '{crawler_name}' created successfully")
return response
except Exception as e:
print(f"Error creating JDBC crawler: {e}")
return None
def create_connection(self, connection_name, connection_type, connection_properties):
"""
Create a Glue connection for databases
"""
try:
response = self.glue.create_connection(
ConnectionInput={
'Name': connection_name,
'Description': f'Connection for {connection_type}',
'ConnectionType': connection_type,
'ConnectionProperties': connection_properties,
'PhysicalConnectionRequirements': {
'SubnetId': 'subnet-12345678', # Replace with your subnet
'SecurityGroupIdList': ['sg-12345678'], # Replace with your security group
'AvailabilityZone': 'us-east-1a'
}
}
)
print(f"Connection '{connection_name}' created successfully")
return response
except Exception as e:
print(f"Error creating connection: {e}")
return None
def run_crawler(self, crawler_name):
"""
Start a crawler run
"""
try:
response = self.glue.start_crawler(Name=crawler_name)
print(f"Crawler '{crawler_name}' started successfully")
return response
except Exception as e:
print(f"Error starting crawler: {e}")
return None
def get_crawler_metrics(self, crawler_names):
"""
Get metrics for crawlers
"""
try:
response = self.glue.get_crawler_metrics(CrawlerNameList=crawler_names)
metrics_summary = []
for metric in response['CrawlerMetricsList']:
metrics_summary.append({
'crawler_name': metric['CrawlerName'],
'tables_created': metric.get('TablesCreated', 0),
'tables_updated': metric.get('TablesUpdated', 0),
'tables_deleted': metric.get('TablesDeleted', 0),
'last_runtime_seconds': metric.get('LastRuntimeSeconds', 0),
'median_runtime_seconds': metric.get('MedianRuntimeSeconds', 0),
'still_estimating': metric.get('StillEstimating', False)
})
return metrics_summary
except Exception as e:
print(f"Error getting crawler metrics: {e}")
return []
def setup_incremental_crawling(self, crawler_name, database_name, s3_path, role_arn):
"""
Set up incremental crawling with optimized configuration
"""
try:
# Create crawler with incremental settings
crawler_config = {
'Name': crawler_name,
'Role': role_arn,
'DatabaseName': database_name,
'Description': f'Incremental crawler for {s3_path}',
'Targets': {
'S3Targets': [
{
'Path': s3_path,
'Exclusions': [
'**/_temporary/**',
'**/_SUCCESS',
'**/.DS_Store'
]
}
]
},
'SchemaChangePolicy': {
'UpdateBehavior': 'UPDATE_IN_DATABASE',
'DeleteBehavior': 'DEPRECATE_IN_DATABASE'
},
'RecrawlPolicy': {
'RecrawlBehavior': 'CRAWL_NEW_FOLDERS_ONLY'
},
'Configuration': json.dumps({
"Version": 1.0,
"Grouping": {
"TableGroupingPolicy": "CombineCompatibleSchemas"
},
"CrawlerOutput": {
"Partitions": {"AddOrUpdateBehavior": "InheritFromTable"},
"Tables": {"AddOrUpdateBehavior": "MergeNewColumns"}
}
})
}
response = self.glue.create_crawler(**crawler_config)
print(f"Incremental crawler '{crawler_name}' created successfully")
return response
except Exception as e:
print(f"Error creating incremental crawler: {e}")
return None
# Usage examples
crawler_manager = GlueCrawlerManager()
# Create crawler role
role_arn = crawler_manager.create_crawler_role('GlueCrawlerRole')
if role_arn:
# Create S3 crawler for data lake
crawler_manager.create_s3_crawler(
'ecommerce-data-crawler',
'ecommerce_data',
's3://my-data-lake/raw/',
role_arn,
schedule='cron(0 2 * * ? *)', # Daily at 2 AM
table_prefix='raw_'
)
# Create incremental crawler
crawler_manager.setup_incremental_crawling(
'ecommerce-incremental-crawler',
'ecommerce_data',
's3://my-data-lake/incremental/',
role_arn
)
# Create JDBC connection and crawler
postgres_connection_props = {
'JDBC_CONNECTION_URL': 'jdbc:postgresql://mydb.cluster-xyz.us-east-1.rds.amazonaws.com:5432/production',
'USERNAME': 'glue_user',
'PASSWORD': 'secure_password'
}
crawler_manager.create_connection(
'postgres-production',
'JDBC',
postgres_connection_props
)
crawler_manager.create_jdbc_crawler(
'postgres-production-crawler',
'production_replica',
'postgres-production',
'production/%',
role_arn,
schedule='cron(0 3 * * ? *)' # Daily at 3 AM
)
# Run crawler
crawler_manager.run_crawler('ecommerce-data-crawler')
# Get crawler metrics
metrics = crawler_manager.get_crawler_metrics(['ecommerce-data-crawler'])
print("Crawler Metrics:")
for metric in metrics:
print(f" {metric['crawler_name']}: {metric['tables_created']} tables created, "
f"{metric['last_runtime_seconds']}s runtime")

ETL Jobs and Development {#etl-jobs}#

Creating and Managing ETL Jobs#

class GlueETLManager:
def __init__(self):
self.glue = boto3.client('glue')
self.s3 = boto3.client('s3')
def create_etl_job(self, job_name, script_location, role_arn, job_type='glueetl',
max_capacity=None, worker_type=None, number_of_workers=None,
timeout=2880, max_retries=1):
"""
Create an ETL job with flexible configuration
"""
try:
job_config = {
'Name': job_name,
'Description': f'ETL job: {job_name}',
'Role': role_arn,
'Command': {
'Name': job_type,
'ScriptLocation': script_location,
'PythonVersion': '3'
},
'DefaultArguments': {
'--TempDir': 's3://my-glue-temp-bucket/temp/',
'--job-bookmark-option': 'job-bookmark-enable',
'--enable-metrics': '',
'--enable-continuous-cloudwatch-log': 'true',
'--enable-glue-datacatalog': 'true'
},
'MaxRetries': max_retries,
'Timeout': timeout,
'GlueVersion': '4.0'
}
# Configure capacity based on job type
if job_type == 'glueetl':
if worker_type and number_of_workers:
job_config['WorkerType'] = worker_type
job_config['NumberOfWorkers'] = number_of_workers
elif max_capacity:
job_config['MaxCapacity'] = max_capacity
else:
job_config['WorkerType'] = 'G.1X'
job_config['NumberOfWorkers'] = 2
elif job_type == 'pythonshell':
job_config['MaxCapacity'] = 0.0625 # 1/16 DPU for Python shell
response = self.glue.create_job(**job_config)
print(f"ETL job '{job_name}' created successfully")
return response
except Exception as e:
print(f"Error creating ETL job: {e}")
return None
def create_spark_etl_script(self, source_database, source_table,
target_s3_path, transformations=None):
"""
Generate a Spark ETL script template
"""
script_template = f'''
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.dynamicframe import DynamicFrame
from pyspark.sql import functions as F
from pyspark.sql.types import *
import boto3
# Initialize Glue context
args = getResolvedOptions(sys.argv, [
'JOB_NAME',
'source_database',
'source_table',
'target_s3_path'
])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)
# Create logger
logger = glueContext.get_logger()
logger.info(f"Starting ETL job: {{args['JOB_NAME']}}")
try:
# Read from Data Catalog
source_df = glueContext.create_dynamic_frame.from_catalog(
database=args.get('source_database', '{source_database}'),
table_name=args.get('source_table', '{source_table}'),
transformation_ctx="source_df"
)
logger.info(f"Read {{source_df.count()}} records from source")
# Convert to Spark DataFrame for complex transformations
spark_df = source_df.toDF()
# Data Quality Checks
initial_count = spark_df.count()
logger.info(f"Initial record count: {{initial_count}}")
# Remove null values from critical columns
spark_df = spark_df.filter(F.col("id").isNotNull())
# Data type conversions and validations
spark_df = spark_df.withColumn("processed_timestamp", F.current_timestamp())
# Add data quality metrics
null_count = spark_df.filter(F.col("id").isNull()).count()
duplicate_count = initial_count - spark_df.dropDuplicates(["id"]).count()
logger.info(f"Data quality - Nulls: {{null_count}}, Duplicates: {{duplicate_count}}")
# Custom transformations
{self._generate_transformation_code(transformations)}
# Convert back to DynamicFrame
target_df = DynamicFrame.fromDF(spark_df, glueContext, "target_df")
# Write to S3 in Parquet format with partitioning
glueContext.write_dynamic_frame.from_options(
frame=target_df,
connection_type="s3",
connection_options={{
"path": args.get('target_s3_path', '{target_s3_path}'),
"partitionKeys": ["year", "month", "day"]
}},
format="parquet",
transformation_ctx="target_df"
)
final_count = target_df.count()
logger.info(f"Successfully wrote {{final_count}} records to target")
# Job metrics
job_metrics = {{
"source_records": initial_count,
"target_records": final_count,
"filtered_records": initial_count - final_count,
"null_records": null_count,
"duplicate_records": duplicate_count
}}
logger.info(f"Job metrics: {{job_metrics}}")
except Exception as e:
logger.error(f"Job failed with error: {{str(e)}}")
raise e
finally:
job.commit()
logger.info("Job completed successfully")
'''
return script_template
def _generate_transformation_code(self, transformations):
"""
Generate transformation code based on configuration
"""
if not transformations:
return "# No additional transformations specified"
transformation_code = ""
for transform in transformations:
if transform['type'] == 'filter':
transformation_code += f"""
# Filter: {transform['description']}
spark_df = spark_df.filter({transform['condition']})
"""
elif transform['type'] == 'column_rename':
transformation_code += f"""
# Rename column: {transform['old_name']} -> {transform['new_name']}
spark_df = spark_df.withColumnRenamed('{transform['old_name']}', '{transform['new_name']}')
"""
elif transform['type'] == 'derive_column':
transformation_code += f"""
# Derive column: {transform['column_name']}
spark_df = spark_df.withColumn('{transform['column_name']}', {transform['expression']})
"""
elif transform['type'] == 'aggregate':
transformation_code += f"""
# Aggregate data
spark_df = spark_df.groupBy({transform['group_by']}).agg({transform['aggregations']})
"""
return transformation_code
def create_python_shell_script(self, source_s3_path, target_s3_path,
processing_logic=""):
"""
Generate a Python shell script template
"""
script_template = f'''
import sys
import boto3
import pandas as pd
import json
from datetime import datetime
import logging
# Set up logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)
# Initialize AWS clients
s3 = boto3.client('s3')
glue = boto3.client('glue')
def main():
try:
source_path = "{source_s3_path}"
target_path = "{target_s3_path}"
logger.info(f"Starting Python shell job")
logger.info(f"Source: {{source_path}}")
logger.info(f"Target: {{target_path}}")
# Read data from S3
df = read_s3_data(source_path)
logger.info(f"Read {{len(df)}} records from source")
# Process data
processed_df = process_data(df)
logger.info(f"Processed {{len(processed_df)}} records")
# Write to S3
write_s3_data(processed_df, target_path)
logger.info(f"Successfully wrote data to target")
except Exception as e:
logger.error(f"Job failed: {{str(e)}}")
raise
def read_s3_data(s3_path):
\"\"\"Read data from S3 using pandas\"\"\"
# Parse S3 path
bucket, key = parse_s3_path(s3_path)
# Read CSV file
obj = s3.get_object(Bucket=bucket, Key=key)
df = pd.read_csv(obj['Body'])
return df
def process_data(df):
\"\"\"Process the data\"\"\"
# Add processing timestamp
df['processed_at'] = datetime.now().isoformat()
# Custom processing logic
{processing_logic}
return df
def write_s3_data(df, s3_path):
\"\"\"Write data to S3\"\"\"
bucket, key = parse_s3_path(s3_path)
# Convert to CSV and upload
csv_buffer = df.to_csv(index=False)
s3.put_object(Bucket=bucket, Key=key, Body=csv_buffer)
def parse_s3_path(s3_path):
\"\"\"Parse S3 path into bucket and key\"\"\"
path_parts = s3_path.replace('s3://', '').split('/', 1)
bucket = path_parts[0]
key = path_parts[1] if len(path_parts) > 1 else ''
return bucket, key
if __name__ == '__main__':
main()
'''
return script_template
def start_job_run(self, job_name, arguments=None, timeout=None,
worker_type=None, number_of_workers=None):
"""
Start a job run with custom parameters
"""
try:
job_run_config = {
'JobName': job_name
}
if arguments:
job_run_config['Arguments'] = arguments
if timeout:
job_run_config['Timeout'] = timeout
if worker_type and number_of_workers:
job_run_config['WorkerType'] = worker_type
job_run_config['NumberOfWorkers'] = number_of_workers
response = self.glue.start_job_run(**job_run_config)
job_run_id = response['JobRunId']
print(f"Job run started: {job_run_id}")
return job_run_id
except Exception as e:
print(f"Error starting job run: {e}")
return None
def get_job_run_status(self, job_name, run_id):
"""
Get job run status and metrics
"""
try:
response = self.glue.get_job_run(JobName=job_name, RunId=run_id)
job_run = response['JobRun']
status_info = {
'job_name': job_name,
'run_id': run_id,
'state': job_run['JobRunState'],
'started_on': job_run.get('StartedOn'),
'completed_on': job_run.get('CompletedOn'),
'execution_time': job_run.get('ExecutionTime'),
'max_capacity': job_run.get('MaxCapacity'),
'worker_type': job_run.get('WorkerType'),
'number_of_workers': job_run.get('NumberOfWorkers'),
'error_message': job_run.get('ErrorMessage')
}
return status_info
except Exception as e:
print(f"Error getting job run status: {e}")
return None
# Usage examples
etl_manager = GlueETLManager()
# Define transformations
transformations = [
{
'type': 'filter',
'description': 'Filter out test users',
'condition': 'F.col("email").rlike("^(?!test@).*")'
},
{
'type': 'column_rename',
'old_name': 'reg_date',
'new_name': 'registration_date'
},
{
'type': 'derive_column',
'column_name': 'age_group',
'expression': 'F.when(F.col("age") < 25, "young").when(F.col("age") < 65, "adult").otherwise("senior")'
}
]
# Create Spark ETL script
spark_script = etl_manager.create_spark_etl_script(
'ecommerce_data',
'raw_users',
's3://my-data-lake/processed/users/',
transformations
)
print("Generated Spark ETL Script:")
print(spark_script[:1000] + "...") # Print first 1000 characters
# Upload script to S3
script_key = 'glue-scripts/user_processing_etl.py'
s3 = boto3.client('s3')
s3.put_object(
Bucket='my-glue-scripts-bucket',
Key=script_key,
Body=spark_script
)
script_location = f's3://my-glue-scripts-bucket/{script_key}'
# Create ETL job
job_response = etl_manager.create_etl_job(
'user-processing-etl',
script_location,
'arn:aws:iam::123456789012:role/GlueServiceRole',
job_type='glueetl',
worker_type='G.1X',
number_of_workers=2,
timeout=60 # 1 hour
)
# Start job run with custom arguments
if job_response:
job_run_id = etl_manager.start_job_run(
'user-processing-etl',
arguments={
'--source_database': 'ecommerce_data',
'--source_table': 'raw_users',
'--target_s3_path': 's3://my-data-lake/processed/users/'
}
)
if job_run_id:
# Check job status
import time
time.sleep(10) # Wait a bit for job to start
status = etl_manager.get_job_run_status('user-processing-etl', job_run_id)
if status:
print(f"Job Status: {status['state']}")
print(f"Started: {status['started_on']}")
if status['error_message']:
print(f"Error: {status['error_message']}")
# Create Python shell script for simple data processing
python_processing_logic = '''
# Remove duplicates
df = df.drop_duplicates(subset=['user_id'])
# Clean email addresses
df['email'] = df['email'].str.lower().str.strip()
# Add data quality flags
df['has_valid_email'] = df['email'].str.contains('@', na=False)
df['has_complete_profile'] = ~df[['user_id', 'email', 'registration_date']].isnull().any(axis=1)
'''
python_script = etl_manager.create_python_shell_script(
's3://my-data-lake/raw/user_exports/users.csv',
's3://my-data-lake/processed/clean_users.csv',
python_processing_logic
)
print("Generated Python Shell Script:")
print(python_script[:500] + "...") # Print first 500 characters
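
A single status check shortly after submission is rarely enough in practice; the run normally has to be polled until it reaches a terminal state. The sketch below builds on the get_job_run_status helper above; the job name matches the earlier example, and the terminal-state set follows the documented Glue job run states.

import time

TERMINAL_STATES = {'SUCCEEDED', 'FAILED', 'STOPPED', 'TIMEOUT', 'ERROR'}

def wait_for_job_run(job_name, run_id, poll_seconds=60):
    """Poll a Glue job run until it reaches a terminal state (minimal sketch)."""
    while True:
        status = etl_manager.get_job_run_status(job_name, run_id)
        if status is None:
            raise RuntimeError(f"Could not fetch status for {job_name}/{run_id}")
        print(f"{job_name} ({run_id}): {status['state']}")
        if status['state'] in TERMINAL_STATES:
            return status
        time.sleep(poll_seconds)

# Example usage with the run started above:
# final_status = wait_for_job_run('user-processing-etl', job_run_id)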

Best Practices {#best-practices}#

AWS Glue Optimization and Operational Excellence#

class GlueBestPractices:
def __init__(self):
self.glue = boto3.client('glue')
self.cloudwatch = boto3.client('cloudwatch')
def implement_job_optimization_strategies(self):
"""
Implement job optimization best practices
"""
optimization_strategies = {
'performance_optimization': {
'partition_optimization': {
'description': 'Optimize data partitioning for better query performance',
'strategies': [
'Use date-based partitioning for time-series data',
'Limit partition count to 10,000-15,000 per table',
'Use partition projection for better query performance',
'Avoid small file problems with proper partition sizing'
],
'example_code': '''
# Optimal partitioning example
glueContext.write_dynamic_frame.from_options(
frame=transformed_df,
connection_type="s3",
connection_options={
"path": "s3://my-bucket/data/",
"partitionKeys": ["year", "month", "day"],
"compression": "snappy"
},
format="parquet",
transformation_ctx="write_partitioned_data"
)
'''
},
'worker_optimization': {
'description': 'Choose appropriate worker types and counts',
'guidelines': [
'G.1X: Small to medium datasets (< 100GB)',
'G.2X: Medium to large datasets (100GB - 1TB)',
'G.4X: Very large datasets (> 1TB)',
'G.8X: Extremely large datasets with complex transformations',
'Start with minimum workers and scale based on performance'
],
'auto_scaling_example': '''
# Enable auto scaling for variable workloads
job_config = {
"WorkerType": "G.1X",
"NumberOfWorkers": 2,
"MaxCapacity": 10, # Maximum DPUs for auto scaling
"DefaultArguments": {
"--enable-auto-scaling": "true",
"--job-bookmark-option": "job-bookmark-enable"
}
}
'''
},
'data_format_optimization': {
'description': 'Use optimal data formats for performance',
'recommendations': [
'Parquet: Best for analytics workloads',
'ORC: Good alternative to Parquet',
'Avro: Good for schema evolution',
'JSON: Use only for semi-structured data',
'Enable compression (Snappy, GZIP, LZO)'
]
}
},
'cost_optimization': {
'job_bookmarks': {
'description': 'Use job bookmarks to process only new data',
'implementation': '''
# Enable job bookmarks in job arguments
"--job-bookmark-option": "job-bookmark-enable"
# In ETL script, use transformation_ctx for bookmark tracking
source_df = glueContext.create_dynamic_frame.from_catalog(
database="my_database",
table_name="my_table",
transformation_ctx="source_df" # Required for bookmarks
)
'''
},
'serverless_optimization': {
'description': 'Optimize for serverless execution',
'strategies': [
'Use Glue 4.0 for improved performance',
'Enable auto scaling',
'Use appropriate timeout values',
'Implement efficient data processing patterns'
]
},
'resource_monitoring': {
'description': 'Monitor resource utilization for cost optimization',
'metrics_to_track': [
'DPU hours consumed',
'Job execution time',
'Data processed per hour',
'Failed job retry costs'
]
}
},
'reliability_patterns': {
'error_handling': {
'description': 'Implement comprehensive error handling',
'pattern_example': '''
try:
# Main ETL logic
source_df = glueContext.create_dynamic_frame.from_catalog(...)
transformed_df = apply_transformations(source_df)
write_to_target(transformed_df)
# Log success metrics
logger.info(f"Successfully processed {transformed_df.count()} records")
except Exception as e:
logger.error(f"Job failed with error: {str(e)}")
# Send failure notification
send_failure_notification(str(e))
# Write error records to separate location for analysis
error_df = create_error_record(e, source_data)
write_error_records(error_df)
raise e
'''
},
'data_validation': {
'description': 'Implement data quality checks',
'validation_example': '''
def validate_data_quality(df, validation_rules):
"""Implement comprehensive data validation"""
validation_results = {}
# Check for null values in critical columns
for column in validation_rules.get('required_columns', []):
null_count = df.filter(F.col(column).isNull()).count()
validation_results[f'{column}_null_count'] = null_count
if null_count > 0:
logger.warning(f"Found {null_count} null values in {column}")
# Check data ranges
for column, range_check in validation_rules.get('range_checks', {}).items():
out_of_range = df.filter(
(F.col(column) < range_check['min']) |
(F.col(column) > range_check['max'])
).count()
validation_results[f'{column}_out_of_range'] = out_of_range
# Check for duplicates
total_count = df.count()
unique_count = df.dropDuplicates(validation_rules.get('unique_columns', [])).count()
duplicate_count = total_count - unique_count
validation_results['duplicate_count'] = duplicate_count
return validation_results
'''
},
'retry_mechanisms': {
'description': 'Implement intelligent retry strategies',
'configuration': {
'max_retries': 2,
'retry_delay': 300, # 5 minutes
'exponential_backoff': True
}
}
}
}
return optimization_strategies
def setup_comprehensive_monitoring(self, job_names):
"""
Set up comprehensive monitoring for Glue jobs
"""
monitoring_setup = {
'cloudwatch_metrics': self._setup_cloudwatch_alarms(job_names),
'custom_metrics': self._setup_custom_metrics(),
'logging_configuration': self._setup_logging_best_practices(),
'notification_setup': self._setup_notifications()
}
return monitoring_setup
def _setup_cloudwatch_alarms(self, job_names):
"""
Create CloudWatch alarms for Glue jobs
"""
alarms_created = []
for job_name in job_names:
# Job failure alarm
try:
self.cloudwatch.put_metric_alarm(
AlarmName=f'GlueJob-{job_name}-Failures',
ComparisonOperator='GreaterThanThreshold',
EvaluationPeriods=1,
MetricName='glue.driver.aggregate.numFailedTasks',
Namespace='AWS/Glue',
Period=300,
Statistic='Sum',
Threshold=0.0,
ActionsEnabled=True,
AlarmActions=[
'arn:aws:sns:us-east-1:123456789012:glue-job-failures'
],
AlarmDescription=f'Alert when Glue job {job_name} fails',
Dimensions=[
{
'Name': 'JobName',
'Value': job_name
}
]
)
alarms_created.append(f'GlueJob-{job_name}-Failures')
# Long running job alarm
self.cloudwatch.put_metric_alarm(
AlarmName=f'GlueJob-{job_name}-LongRunning',
ComparisonOperator='GreaterThanThreshold',
EvaluationPeriods=1,
MetricName='glue.driver.ExecutorAllocationManager.executors.numberAllExecutors',
Namespace='AWS/Glue',
Period=3600, # 1 hour
Statistic='Average',
Threshold=0.0,
ActionsEnabled=True,
AlarmActions=[
'arn:aws:sns:us-east-1:123456789012:glue-job-performance'
],
AlarmDescription=f'Alert when Glue job {job_name} runs longer than expected',
Dimensions=[
{
'Name': 'JobName',
'Value': job_name
}
]
)
alarms_created.append(f'GlueJob-{job_name}-LongRunning')
except Exception as e:
print(f"Error creating alarm for {job_name}: {e}")
return alarms_created
def _setup_custom_metrics(self):
"""
Set up custom metrics for job monitoring
"""
custom_metrics_code = '''
import boto3
from datetime import datetime
def publish_custom_metrics(job_name, metrics_data):
"""Publish custom metrics to CloudWatch"""
cloudwatch = boto3.client('cloudwatch')
metric_data = []
for metric_name, value in metrics_data.items():
metric_data.append({
'MetricName': metric_name,
'Value': value,
'Unit': 'Count',
'Dimensions': [
{
'Name': 'JobName',
'Value': job_name
}
],
'Timestamp': datetime.utcnow()
})
try:
cloudwatch.put_metric_data(
Namespace='Glue/CustomMetrics',
MetricData=metric_data
)
except Exception as e:
logger.error(f"Failed to publish custom metrics: {e}")
# Example usage in ETL job
def main_etl_logic():
# ... ETL processing ...
# Collect custom metrics
job_metrics = {
'RecordsProcessed': processed_count,
'RecordsFiltered': filtered_count,
'DataQualityScore': calculate_quality_score(),
'ProcessingRate': processed_count / execution_time
}
publish_custom_metrics('my-etl-job', job_metrics)
'''
return custom_metrics_code
def _setup_logging_best_practices(self):
"""
Logging configuration best practices
"""
logging_config = {
'job_arguments': {
'--enable-continuous-cloudwatch-log': 'true',
'--enable-metrics': '',
'--additional-python-modules': 'requests,pandas==1.5.3'
},
'logging_code_example': '''
import logging
from awsglue.context import GlueContext
# Set up structured logging
logger = logging.getLogger(__name__)
logger.setLevel(logging.INFO)
# Create formatter for structured logs
formatter = logging.Formatter(
'%(asctime)s - %(name)s - %(levelname)s - %(message)s'
)
def log_job_progress(stage, records_processed=0, additional_info=None):
"""Log job progress with structured information"""
log_data = {
'stage': stage,
'timestamp': datetime.utcnow().isoformat(),
'records_processed': records_processed
}
if additional_info:
log_data.update(additional_info)
logger.info(json.dumps(log_data))
# Usage examples
log_job_progress('data_extraction', records_processed=1000)
log_job_progress('data_transformation',
records_processed=950,
additional_info={'filtered_records': 50})
log_job_progress('data_loading', records_processed=950)
''',
'log_retention': {
'description': 'Set appropriate log retention periods',
'recommendation': '30 days for development, 90 days for production'
}
}
return logging_config
def _setup_notifications(self):
"""
Set up notification systems for job events
"""
notification_setup = {
'sns_topics': [
{
'name': 'glue-job-failures',
'description': 'Critical job failures requiring immediate attention'
},
{
'name': 'glue-job-performance',
'description': 'Performance issues and long-running jobs'
},
{
'name': 'glue-data-quality',
'description': 'Data quality issues and validation failures'
}
],
'slack_integration': '''
import json
import urllib3
def send_slack_notification(webhook_url, message, channel='#data-engineering'):
"""Send notification to Slack"""
http = urllib3.PoolManager()
slack_message = {
'channel': channel,
'username': 'AWS Glue',
'text': message,
'icon_emoji': ':warning:'
}
try:
response = http.request(
'POST',
webhook_url,
body=json.dumps(slack_message),
headers={'Content-Type': 'application/json'}
)
return response.status == 200
except Exception as e:
logger.error(f"Failed to send Slack notification: {e}")
return False
''',
'email_templates': {
'job_failure': {
'subject': 'AWS Glue Job Failure: {job_name}',
'body': '''
Job Name: {job_name}
Run ID: {run_id}
Error: {error_message}
Started: {start_time}
Failed: {end_time}
Duration: {duration}
Please investigate and resolve the issue.
'''
}
}
}
return notification_setup
def implement_security_best_practices(self):
"""
Implement security best practices for Glue jobs
"""
security_practices = {
'iam_policies': {
'principle_of_least_privilege': '''
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Action": [
"s3:GetObject",
"s3:PutObject"
],
"Resource": [
"arn:aws:s3:::my-data-bucket/input/*",
"arn:aws:s3:::my-data-bucket/output/*"
]
},
{
"Effect": "Allow",
"Action": [
"glue:GetTable",
"glue:GetPartitions"
],
"Resource": [
"arn:aws:glue:*:*:catalog",
"arn:aws:glue:*:*:database/my_database",
"arn:aws:glue:*:*:table/my_database/*"
]
}
]
}
''',
'cross_account_access': '''
# For cross-account access, use AssumeRole
{
"Effect": "Allow",
"Action": "sts:AssumeRole",
"Resource": "arn:aws:iam::TARGET-ACCOUNT:role/CrossAccountGlueRole",
"Condition": {
"StringEquals": {
"sts:ExternalId": "unique-external-id"
}
}
}
'''
},
'data_encryption': {
'at_rest': {
'description': 'Encrypt data at rest using KMS',
'configuration': {
's3_encryption': 'AES256 or aws:kms',
'glue_catalog_encryption': 'Enabled with KMS key',
'job_bookmark_encryption': 'Enabled'
}
},
'in_transit': {
'description': 'Enable SSL/TLS for all connections',
'jdbc_connections': 'Use SSL connection strings',
'api_calls': 'All AWS API calls use TLS 1.2'
}
},
'network_security': {
'vpc_configuration': {
'description': 'Run Glue jobs in VPC for network isolation',
'requirements': [
'Private subnets for Glue jobs',
'NAT Gateway for internet access',
'VPC endpoints for AWS services',
'Security groups with minimal permissions'
]
},
'endpoint_security': {
'vpc_endpoints': [
'com.amazonaws.region.s3',
'com.amazonaws.region.glue',
'com.amazonaws.region.logs'
]
}
},
'secrets_management': {
'database_credentials': '''
import boto3
from botocore.exceptions import ClientError
def get_secret(secret_name, region_name="us-east-1"):
"""Retrieve database credentials from AWS Secrets Manager"""
session = boto3.session.Session()
client = session.client('secretsmanager', region_name=region_name)
try:
response = client.get_secret_value(SecretId=secret_name)
return json.loads(response['SecretString'])
except ClientError as e:
logger.error(f"Failed to retrieve secret {secret_name}: {e}")
raise
# Usage in Glue job
db_credentials = get_secret('prod/database/credentials')
connection_options = {
"url": f"jdbc:postgresql://host:5432/db",
"user": db_credentials['username'],
"password": db_credentials['password']
}
'''
}
}
return security_practices
# Best practices implementation
best_practices = GlueBestPractices()
# Get optimization strategies
optimization_strategies = best_practices.implement_job_optimization_strategies()
print("Glue Job Optimization Strategies:")
print(json.dumps(optimization_strategies, indent=2, default=str))
# Set up monitoring for jobs
monitoring_setup = best_practices.setup_comprehensive_monitoring(['user-processing-etl', 'order-aggregation-etl'])
print(f"\nMonitoring setup completed. Alarms created: {len(monitoring_setup['cloudwatch_metrics'])}")
# Get security best practices
security_practices = best_practices.implement_security_best_practices()
print("\nSecurity Best Practices:")
print(json.dumps(security_practices, indent=2))
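
CloudWatch alarms cover metric thresholds, but Glue also publishes job state-change events to Amazon EventBridge, which is often a simpler way to catch failures as they happen. The sketch below routes failed, timed-out, and stopped runs to the SNS topic referenced in the alarms above; the rule name is a placeholder and the topic is assumed to already exist.

import json
import boto3

events = boto3.client('events')

# Route Glue job failure events to an existing SNS topic
events.put_rule(
    Name='glue-job-failure-events',
    EventPattern=json.dumps({
        "source": ["aws.glue"],
        "detail-type": ["Glue Job State Change"],
        "detail": {"state": ["FAILED", "TIMEOUT", "STOPPED"]}
    }),
    State='ENABLED',
    Description='Notify on failed, timed-out or stopped Glue job runs'
)
events.put_targets(
    Rule='glue-job-failure-events',
    Targets=[{
        'Id': 'glue-failure-sns',
        'Arn': 'arn:aws:sns:us-east-1:123456789012:glue-job-failures'
    }]
)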

Cost Optimization {#cost-optimization}#

Glue Cost Management#

class GlueCostOptimizer:
def __init__(self):
self.glue = boto3.client('glue')
self.ce = boto3.client('ce') # Cost Explorer
self.cloudwatch = boto3.client('cloudwatch')
def analyze_glue_costs(self, start_date, end_date):
"""
Analyze AWS Glue costs and usage patterns
"""
try:
response = self.ce.get_cost_and_usage(
TimePeriod={
'Start': start_date.strftime('%Y-%m-%d'),
'End': end_date.strftime('%Y-%m-%d')
},
Granularity='MONTHLY',
Metrics=['BlendedCost', 'UsageQuantity'],
GroupBy=[
{
'Type': 'DIMENSION',
'Key': 'USAGE_TYPE'
}
],
Filter={
'Dimensions': {
'Key': 'SERVICE',
'Values': ['AWS Glue']
}
}
)
cost_breakdown = {}
for result in response['ResultsByTime']:
for group in result['Groups']:
usage_type = group['Keys'][0]
cost = float(group['Metrics']['BlendedCost']['Amount'])
usage = float(group['Metrics']['UsageQuantity']['Amount'])
if usage_type not in cost_breakdown:
cost_breakdown[usage_type] = {'cost': 0, 'usage': 0}
cost_breakdown[usage_type]['cost'] += cost
cost_breakdown[usage_type]['usage'] += usage
return cost_breakdown
except Exception as e:
print(f"Error analyzing Glue costs: {e}")
return {}
def optimize_job_configurations(self):
"""
Analyze job configurations and provide optimization recommendations
"""
try:
jobs = self.glue.get_jobs()
optimization_recommendations = []
for job in jobs['Jobs']:
job_name = job['Name']
recommendations = []
current_cost_estimate = 0
# Analyze worker configuration
worker_type = job.get('WorkerType', 'Standard')
number_of_workers = job.get('NumberOfWorkers', 2)
max_capacity = job.get('MaxCapacity', 2)
# Calculate approximate hourly cost
worker_costs = {
'Standard': 0.44, # $0.44 per DPU hour
'G.1X': 0.44, # $0.44 per DPU hour
'G.2X': 0.88, # $0.88 per DPU hour
'G.4X': 1.76, # $1.76 per DPU hour
'G.8X': 3.52 # $3.52 per DPU hour
}
if worker_type in worker_costs:
hourly_cost = worker_costs[worker_type] * number_of_workers
else:
hourly_cost = 0.44 * max_capacity # Fallback to DPU pricing
current_cost_estimate = hourly_cost
# Check for over-provisioning
if worker_type == 'G.8X' and number_of_workers > 2:
recommendations.append({
'type': 'worker_optimization',
'description': 'Consider using smaller worker types with more workers',
'potential_savings': hourly_cost * 0.3,
'action': 'Try G.2X with more workers for better cost efficiency'
})
# Check timeout settings
timeout = job.get('Timeout', 2880) # Default 48 hours
if timeout > 720: # More than 12 hours
recommendations.append({
'type': 'timeout_optimization',
'description': f'Long timeout setting: {timeout} minutes',
'potential_savings': 'Prevent runaway job costs',
'action': 'Review and optimize job logic, reduce timeout'
})
# Check retry configuration
max_retries = job.get('MaxRetries', 0)
if max_retries > 2:
recommendations.append({
'type': 'retry_optimization',
'description': f'High retry count: {max_retries}',
'potential_savings': hourly_cost * max_retries * 0.5,
'action': 'Implement better error handling, reduce retries'
})
# Check for job bookmark usage
default_args = job.get('DefaultArguments', {})
bookmark_option = default_args.get('--job-bookmark-option', 'job-bookmark-disable')
if bookmark_option == 'job-bookmark-disable':
recommendations.append({
'type': 'bookmark_optimization',
'description': 'Job bookmarks not enabled',
'potential_savings': hourly_cost * 0.7, # Significant savings
'action': 'Enable job bookmarks to process only new data'
})
if recommendations:
total_potential_savings = sum(
r.get('potential_savings', 0) for r in recommendations
if isinstance(r.get('potential_savings'), (int, float))
)
optimization_recommendations.append({
'job_name': job_name,
'current_hourly_cost': hourly_cost,
'worker_type': worker_type,
'number_of_workers': number_of_workers,
'recommendations': recommendations,
'total_potential_hourly_savings': total_potential_savings
})
return optimization_recommendations
except Exception as e:
print(f"Error optimizing job configurations: {e}")
return []
def analyze_crawler_costs(self):
"""
Analyze crawler costs and usage patterns
"""
try:
crawlers = self.glue.get_crawlers()
crawler_analysis = []
for crawler in crawlers['Crawlers']:
crawler_name = crawler['Name']
# Get crawler metrics
try:
metrics = self.glue.get_crawler_metrics(CrawlerNameList=[crawler_name])
crawler_metrics = metrics['CrawlerMetricsList'][0] if metrics['CrawlerMetricsList'] else {}
# Calculate approximate costs
last_runtime_seconds = crawler_metrics.get('LastRuntimeSeconds', 0)
runtime_hours = last_runtime_seconds / 3600
# Crawler pricing: $0.44 per DPU hour
estimated_cost_per_run = runtime_hours * 0.44
# Check schedule
schedule = crawler.get('Schedule', {}).get('ScheduleExpression', 'On demand')
recommendations = []
# Check for frequent scheduling
if 'cron' in schedule.lower() and ('hour' in schedule.lower() or 'minute' in schedule.lower()):
recommendations.append({
'type': 'schedule_optimization',
'description': 'Frequent crawler schedule detected',
'action': 'Consider less frequent scheduling or event-driven crawling'
})
# Check for long runtime
if last_runtime_seconds > 3600: # More than 1 hour
recommendations.append({
'type': 'runtime_optimization',
'description': f'Long crawler runtime: {runtime_hours:.2f} hours',
'action': 'Optimize crawler targets and exclusion patterns'
})
crawler_analysis.append({
'crawler_name': crawler_name,
'last_runtime_hours': runtime_hours,
'estimated_cost_per_run': estimated_cost_per_run,
'schedule': schedule,
'tables_created': crawler_metrics.get('TablesCreated', 0),
'tables_updated': crawler_metrics.get('TablesUpdated', 0),
'recommendations': recommendations
})
except Exception as e:
print(f"Error getting metrics for crawler {crawler_name}: {e}")
return crawler_analysis
except Exception as e:
print(f"Error analyzing crawler costs: {e}")
return []
def calculate_cost_projections(self, job_usage_patterns):
"""
Calculate cost projections for different usage patterns
"""
pricing = {
'glue_etl': {
'Standard': 0.44, # $ per DPU hour
'G.1X': 0.44, # $ per DPU hour
'G.2X': 0.88, # $ per DPU hour
'G.4X': 1.76, # $ per DPU hour
'G.8X': 3.52, # $ per DPU hour
'G.025X': 0.44 # $ per DPU hour (Python shell)
},
'glue_crawler': 0.44, # $ per DPU hour
'glue_catalog': {
'first_million_requests': 0.0, # Free
'additional_requests': 1.0, # $ per million requests
'storage': 0.0 # Free
},
'glue_studio': 0.0, # No additional cost
'glue_databrew': {
'node_hour': 0.48, # $ per node hour
'first_30_datasets': 0.0, # Free
'additional_datasets': 1.0 # $ per dataset per month
}
}
projections = {}
for job_name, pattern in job_usage_patterns.items():
worker_type = pattern.get('worker_type', 'G.1X')
number_of_workers = pattern.get('number_of_workers', 2)
hours_per_month = pattern.get('hours_per_month', 0)
# Calculate monthly cost
hourly_cost = pricing['glue_etl'][worker_type] * number_of_workers
monthly_cost = hourly_cost * hours_per_month
# Calculate cost with different optimizations
optimized_scenarios = {}
# Scenario 1: Enable job bookmarks (70% reduction in processing time)
bookmark_hours = hours_per_month * 0.3
bookmark_cost = hourly_cost * bookmark_hours
optimized_scenarios['with_bookmarks'] = {
'monthly_cost': bookmark_cost,
'savings': monthly_cost - bookmark_cost,
'description': 'Enable job bookmarks to process only new data'
}
# Scenario 2: Right-size workers (switch to smaller/larger workers)
if worker_type == 'G.2X' and number_of_workers >= 4:
alt_cost = pricing['glue_etl']['G.1X'] * (number_of_workers * 2) * hours_per_month
optimized_scenarios['right_sized_workers'] = {
'monthly_cost': alt_cost,
'savings': monthly_cost - alt_cost if alt_cost < monthly_cost else 0,
'description': f'Use G.1X workers instead of {worker_type}'
}
# Scenario 3: Optimize runtime (assume 20% improvement)
optimized_hours = hours_per_month * 0.8
runtime_optimized_cost = hourly_cost * optimized_hours
optimized_scenarios['runtime_optimization'] = {
'monthly_cost': runtime_optimized_cost,
'savings': monthly_cost - runtime_optimized_cost,
'description': 'Optimize job logic and data processing'
}
projections[job_name] = {
'current_monthly_cost': monthly_cost,
'current_hourly_cost': hourly_cost,
'hours_per_month': hours_per_month,
'worker_configuration': f"{worker_type} x {number_of_workers}",
'optimization_scenarios': optimized_scenarios,
'best_optimization': max(
optimized_scenarios.items(),
key=lambda x: x[1]['savings']
)[0] if optimized_scenarios else None
}
return projections
def generate_cost_optimization_report(self):
"""
Generate comprehensive cost optimization report
"""
from datetime import datetime, timedelta
end_date = datetime.utcnow()
start_date = end_date - timedelta(days=90) # Last 3 months
report = {
'report_date': datetime.utcnow().isoformat(),
'analysis_period': f"{start_date.strftime('%Y-%m-%d')} to {end_date.strftime('%Y-%m-%d')}",
'current_costs': self.analyze_glue_costs(start_date, end_date),
'job_optimizations': self.optimize_job_configurations(),
'crawler_analysis': self.analyze_crawler_costs(),
'recommendations_summary': {
'immediate_actions': [
'Enable job bookmarks for incremental processing',
'Right-size worker types based on data volume',
'Optimize crawler schedules and reduce frequency',
'Implement proper timeout settings'
],
'cost_reduction_strategies': [
'Use partition projection for better query performance',
'Implement data lifecycle policies for staging data',
'Optimize data formats (use Parquet with compression)',
'Consolidate small files to reduce processing overhead'
]
}
}
# Calculate total potential savings
total_job_savings = 0
for job_opt in report['job_optimizations']:
total_job_savings += job_opt.get('total_potential_hourly_savings', 0)
# Estimate monthly savings (assuming 20 hours average monthly execution)
estimated_monthly_savings = total_job_savings * 20
report['cost_summary'] = {
'total_potential_hourly_savings': total_job_savings,
'estimated_monthly_savings': estimated_monthly_savings,
'estimated_annual_savings': estimated_monthly_savings * 12
}
return report
# Cost optimization examples
cost_optimizer = GlueCostOptimizer()
# Example job usage patterns for cost projection
job_usage_patterns = {
'daily-user-etl': {
'worker_type': 'G.1X',
'number_of_workers': 2,
'hours_per_month': 30 # 1 hour daily
},
'weekly-sales-aggregation': {
'worker_type': 'G.2X',
'number_of_workers': 4,
'hours_per_month': 16 # 4 hours weekly
},
'real-time-processing': {
'worker_type': 'G.1X',
'number_of_workers': 1,
'hours_per_month': 720 # Always running
}
}
# Calculate cost projections
projections = cost_optimizer.calculate_cost_projections(job_usage_patterns)
print("Glue Cost Projections:")
for job_name, projection in projections.items():
print(f"\n{job_name}:")
print(f" Current monthly cost: ${projection['current_monthly_cost']:.2f}")
print(f" Worker config: {projection['worker_configuration']}")
if projection['best_optimization']:
best_opt = projection['optimization_scenarios'][projection['best_optimization']]
print(f" Best optimization: {projection['best_optimization']}")
print(f" Potential monthly savings: ${best_opt['savings']:.2f}")
# Generate comprehensive cost optimization report
report = cost_optimizer.generate_cost_optimization_report()
print(f"\nGlue Cost Optimization Report:")
print(f"Estimated Monthly Savings: ${report['cost_summary']['estimated_monthly_savings']:.2f}")
print(f"Estimated Annual Savings: ${report['cost_summary']['estimated_annual_savings']:.2f}")
print(f"\nTop Recommendations:")
for rec in report['recommendations_summary']['immediate_actions']:
print(f" - {rec}")

Conclusion#

AWS Glue provides a comprehensive serverless platform for data integration, cataloging, and ETL processing. Key takeaways:

Essential Components:#

  • Data Catalog: Centralized metadata repository for data discovery and governance
  • Crawlers: Automated schema discovery and cataloging from various data sources
  • ETL Jobs: Flexible data transformation with Apache Spark and Python
  • Glue Studio: Visual interface for building and monitoring ETL workflows

Advanced Capabilities:#

  • Multiple job types: Spark ETL, Python Shell, and Ray for different use cases
  • Serverless execution: Auto-scaling with pay-per-use pricing model
  • Schema evolution: Automatic handling of schema changes and versioning
  • Integration ecosystem: Seamless integration with AWS analytics services
  • Data quality: Built-in data validation and quality checking capabilities

Best Practices:#

  • Implement effective partitioning strategies for optimal performance
  • Use job bookmarks for incremental data processing
  • Set up comprehensive monitoring and alerting
  • Implement proper error handling and retry mechanisms
  • Follow security best practices with IAM, encryption, and VPC configuration
  • Optimize worker types and counts based on data volume and complexity

Cost Optimization Strategies:#

  • Enable job bookmarks to process only new/changed data (up to 70% cost reduction)
  • Right-size worker types based on actual processing requirements
  • Optimize crawler schedules and use incremental crawling patterns
  • Use appropriate data formats (Parquet with compression) for better performance
  • Implement proper timeout settings to prevent runaway costs
  • Monitor and optimize job execution times regularly

Operational Excellence:#

  • Use infrastructure as code for job and crawler deployment
  • Implement comprehensive logging and monitoring strategies
  • Set up automated data quality validation
  • Maintain proper documentation and data lineage
  • Regular cost reviews and optimization cycles
  • Disaster recovery and backup strategies for metadata

AWS Glue enables organizations to build scalable, cost-effective data processing pipelines while reducing the operational overhead of managing infrastructure, making it ideal for modern data lakes and analytics workloads.
