AWS Glue
Serverless, cost-effective Extract, Transform and Load (ETL) system that uses EMR clusters under the hood. It’s used to transform data and move it from sources to targets.
Glue can also crawl data sources to generate an AWS Glue Data Catalog.
Available sources:
- Stores:
  - S3
  - JDBC-compatible data sources (like RDS)
  - DynamoDB
- Streams:
  - Kinesis Data Streams
  - Apache Kafka
Targets:
- S3
- RDS
- JDBC-compatible endpoints
Data Catalog
A catalog of metadata plus a collection of data management and search tools.
The metadata describes the data sources in the region. The AWS Glue Data Catalog provides one unique catalog per region per account. Having a catalog helps avoid data silos (invisible data managed by a single team).
The Data Catalog is used by several other services:
- Athena
- Redshift Spectrum
- EMR
- AWS Lake Formation
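To make the "catalog of metadata" idea concrete, here is a minimal sketch that reads table and column metadata out of one catalog database. It assumes `glue` is a boto3 Glue client (for example `boto3.client("glue")`) and that the database name is supplied by the caller; the function name `list_catalog_tables` is hypothetical.

```python
def list_catalog_tables(glue, database_name):
    """Return {table_name: [(column_name, column_type), ...]} for one database.

    `glue` is assumed to be a boto3 Glue client; only its `get_tables`
    call is used, following `NextToken` until every page has been read.
    """
    tables = {}
    kwargs = {"DatabaseName": database_name}
    while True:
        page = glue.get_tables(**kwargs)
        for table in page["TableList"]:
            # Column definitions live under the table's storage descriptor.
            columns = table.get("StorageDescriptor", {}).get("Columns", [])
            tables[table["Name"]] = [(c["Name"], c.get("Type")) for c in columns]
        token = page.get("NextToken")
        if not token:
            return tables
        kwargs["NextToken"] = token
```

Because the catalog is per region per account, the region configured on the client decides which catalog is read.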
Steps to populate the catalog:
- Create credentials for crawlers
- Create crawlers
- Run crawlers on sources
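The steps above can be sketched with boto3. This is a minimal illustration, not a complete setup: it assumes the "credentials" are an IAM role that already exists and that Glue is allowed to assume, that `glue` is a boto3 Glue client, and that the source is an S3 path. The function name and all argument values are hypothetical.

```python
def crawl_s3_path(glue, crawler_name, role_arn, database_name, s3_path):
    """Create a crawler for one S3 path and start it.

    `glue` is assumed to be a boto3 Glue client. `role_arn` is the
    pre-existing IAM role the crawler runs as (the "credentials" step).
    """
    # Step 2: create the crawler and point it at the source.
    glue.create_crawler(
        Name=crawler_name,
        Role=role_arn,
        DatabaseName=database_name,  # catalog database that receives the tables
        Targets={"S3Targets": [{"Path": s3_path}]},
    )
    # Step 3: run the crawler on the source.
    glue.start_crawler(Name=crawler_name)
```

`start_crawler` returns as soon as the run has been requested, so the discovered tables show up in the catalog only after the crawl finishes.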
Glue Jobs
ETL jobs.
Data is extracted from sources, transformed using user-defined scripts and loaded into targets.
AWS maintains a warm pool of managed resources, but you’re only billed for the resources you consume.
Jobs can be started manually or in response to events from, for example, EventBridge.
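Starting a job manually can look roughly like the sketch below, which starts a run and polls until it reaches a final state. It assumes `glue` is a boto3 Glue client and that the job already exists; the function name, the polling interval, and the injectable `sleep` parameter are illustrative choices, not part of Glue.

```python
import time

# Job run states after which the run will not change any more.
FINAL_STATES = {"SUCCEEDED", "FAILED", "STOPPED", "TIMEOUT", "ERROR", "EXPIRED"}


def run_job_and_wait(glue, job_name, arguments=None, poll_seconds=30, sleep=time.sleep):
    """Start a Glue job run and block until it finishes.

    Returns (run_id, final_state). `glue` is assumed to be a boto3 Glue
    client; `sleep` is injectable so the loop can be exercised without waiting.
    """
    response = glue.start_job_run(JobName=job_name, Arguments=arguments or {})
    run_id = response["JobRunId"]
    while True:
        run = glue.get_job_run(JobName=job_name, RunId=run_id)["JobRun"]
        state = run["JobRunState"]
        if state in FINAL_STATES:
            return run_id, state
        sleep(poll_seconds)
```

For event-driven starts the same `start_job_run` call is typically made by whatever reacts to the event, so no polling is needed there.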
Architectures
Convert table data to columnar
You can use Glue to import a CSV file from an S3 bucket, transform it into Apache Parquet format, and store it in an output S3 bucket for Athena to query.
Using S3 Events you can trigger a Lambda function (or you can use EventBridge) which will start a Glue job.
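A minimal sketch of the Lambda side of this architecture follows. It assumes the function is subscribed to S3 "object created" notifications, that the Glue job name arrives in a hypothetical `GLUE_JOB_NAME` environment variable, and that the job script reads a hypothetical `--input_path` argument. None of these names come from Glue itself.

```python
import os
from urllib.parse import unquote_plus


def start_jobs_for_s3_event(event, glue, job_name):
    """Start one Glue job run per object listed in an S3 event notification.

    `glue` is assumed to be a boto3 Glue client. Returns the list of run IDs.
    """
    run_ids = []
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        # Object keys arrive URL-encoded in S3 notifications.
        key = unquote_plus(record["s3"]["object"]["key"])
        response = glue.start_job_run(
            JobName=job_name,
            # "--input_path" is a hypothetical argument the job script would read.
            Arguments={"--input_path": f"s3://{bucket}/{key}"},
        )
        run_ids.append(response["JobRunId"])
    return run_ids


def lambda_handler(event, context):
    # boto3 is imported here so the helper above stays testable without it.
    import boto3

    glue = boto3.client("glue")
    job_name = os.environ["GLUE_JOB_NAME"]  # hypothetical configuration
    return {"job_run_ids": start_jobs_for_s3_event(event, glue, job_name)}
```

Keeping the event parsing in a separate function that receives the client makes it easy to check the behaviour with a stub before deploying.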