AWS Glue

Serverless, cost-effective extract, transform, and load (ETL) service that runs jobs on AWS-managed Apache Spark infrastructure. It’s used to transform data as it moves from sources to targets.

Glue can also crawl data sources to populate the AWS Glue Data Catalog.

Available sources and targets:

  • Stores:

    • S3

    • JDBC compatible data sources (like RDS)

    • DynamoDB

  • Streams:

    • Kinesis Data Streams

    • Apache Kafka

  • Targets:

    • S3

    • RDS

    • JDBC compatible endpoints

Data Catalog

A catalog of metadata plus a collection of data management and search tools.

The metadata describes the data sources in the region. The AWS Glue Data Catalog provides one catalog per region per account. Having a catalog helps avoid data silos (data that is managed by a single team and invisible to everyone else).

The Data Catalog is used by several other services:

  • Athena

  • Redshift Spectrum

  • EMR

  • AWS Lake Formation

Steps to populate the catalog:

  • Create credentials for crawlers (an IAM role they can assume)

  • Create crawlers

  • Run crawlers on sources
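The steps above can be sketched with boto3. This is a minimal illustration, not a complete setup: it assumes the IAM role from the first step already exists, and every name passed in (crawler name, role ARN, database, S3 path) is hypothetical. The `glue` argument is expected to be a `boto3.client("glue")`.

```python
def crawler_request(name, role_arn, database, s3_path):
    """Build the keyword arguments for the Glue CreateCrawler call."""
    return {
        "Name": name,
        "Role": role_arn,          # IAM role the crawler assumes (first step)
        "DatabaseName": database,  # Data Catalog database that receives the tables
        "Targets": {"S3Targets": [{"Path": s3_path}]},
    }


def create_and_run_crawler(glue, request):
    """Create the crawler (second step), then run it on its source (third step)."""
    glue.create_crawler(**request)
    glue.start_crawler(Name=request["Name"])
```

A call could look like `create_and_run_crawler(boto3.client("glue"), crawler_request("sales-crawler", role_arn, "sales_db", "s3://example-bucket/sales/"))`, where all four values are placeholders. Once the crawler finishes, the tables it discovered show up in the Data Catalog database.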

Glue Jobs

ETL jobs.

Data is extracted from sources, transformed using user-defined scripts and loaded into targets.

AWS maintains a warm pool of managed resources, but you’re only billed for the resources you consume.

Jobs can be started manually or in response to events from, for example, EventBridge.

Bookmarks

Job bookmarks keep track of the data that previous runs have already processed, so a Job doesn’t have to reprocess all the data every time it starts.
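As a rough sketch, a bookmark can be switched on for a manually started run through the `--job-bookmark-option` job argument. The job name is hypothetical, and `glue` is assumed to be a `boto3.client("glue")`. The job script itself also has to cooperate (it calls `job.init()` and `job.commit()`), which is not shown here.

```python
def job_run_request(job_name, use_bookmark=True):
    """Build the keyword arguments for the Glue StartJobRun call."""
    option = "job-bookmark-enable" if use_bookmark else "job-bookmark-disable"
    return {
        "JobName": job_name,
        "Arguments": {"--job-bookmark-option": option},
    }


def start_job(glue, request):
    """Start the job run and return its run id."""
    response = glue.start_job_run(**request)
    return response["JobRunId"]
```

With the bookmark enabled, a second run over the same source should only pick up data that arrived after the first run.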

DataBrew

It’s a visual data preparation tool that lets you clean and normalize data without having to write custom code, using pre-built transformations.

Glue Studio

It’s a GUI to create, run and monitor Glue’s ETL jobs.

Streaming ETL

It lets you run continuous streaming jobs instead of processing data in batches. It’s built on top of Apache Spark Structured Streaming. It’s compatible with:

  • Kinesis Data Streams

  • Kafka

  • Amazon Managed Streaming for Apache Kafka (MSK)

Architectures

Convert table data to columnar

You can use Glue to read a CSV file from an S3 bucket, convert it to Apache Parquet format, and store the result in an output S3 bucket so that Athena can query it.
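A Glue job script for this conversion might look like the sketch below. The job parameter names `input_path` and `output_path` are assumptions made for the example, and the `awsglue` modules are only available inside the Glue job runtime, which is why they are imported inside the function.

```python
# Job parameters this sketch expects (hypothetical names), supplied as
# --input_path and --output_path when the job run is started.
JOB_PARAMETERS = ["JOB_NAME", "input_path", "output_path"]


def main(argv):
    # Only importable inside the Glue job runtime.
    from awsglue.context import GlueContext
    from awsglue.job import Job
    from awsglue.utils import getResolvedOptions
    from pyspark.context import SparkContext

    args = getResolvedOptions(argv, JOB_PARAMETERS)
    glue_context = GlueContext(SparkContext.getOrCreate())
    job = Job(glue_context)
    job.init(args["JOB_NAME"], args)

    # Extract: read the CSV data, treating the first row as column names.
    frame = glue_context.create_dynamic_frame.from_options(
        connection_type="s3",
        connection_options={"paths": [args["input_path"]]},
        format="csv",
        format_options={"withHeader": True},
    )

    # Load: write the same rows to the output bucket as Parquet.
    glue_context.write_dynamic_frame.from_options(
        frame=frame,
        connection_type="s3",
        connection_options={"path": args["output_path"]},
        format="parquet",
    )
    job.commit()
```

When deployed as a Glue script, the file would end with a call to `main(sys.argv)` (after `import sys`). Parquet is columnar, so Athena can read only the columns a query touches instead of scanning whole CSV rows.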

Using S3 Events, you can trigger a Lambda function (or you can use EventBridge) that starts a Glue Job.
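The Lambda side of that trigger could be sketched as follows. The environment variable `GLUE_JOB_NAME`, the default job name, and the `--input_path` job argument are all hypothetical. The optional `glue` parameter exists only so the handler can be exercised without AWS access; Lambda itself calls `handler(event, context)`.

```python
import os
import urllib.parse


def s3_objects(event):
    """Yield (bucket, key) for each record in an S3 event notification."""
    for record in event.get("Records", []):
        s3 = record["s3"]
        # Object keys arrive URL-encoded in S3 event notifications.
        key = urllib.parse.unquote_plus(s3["object"]["key"])
        yield s3["bucket"]["name"], key


def handler(event, context, glue=None):
    """Start one Glue job run per uploaded object."""
    if glue is None:
        import boto3  # available by default in the Lambda Python runtime
        glue = boto3.client("glue")

    job_name = os.environ.get("GLUE_JOB_NAME", "csv-to-parquet")
    run_ids = []
    for bucket, key in s3_objects(event):
        response = glue.start_job_run(
            JobName=job_name,
            Arguments={"--input_path": f"s3://{bucket}/{key}"},
        )
        run_ids.append(response["JobRunId"])
    return {"started": run_ids}
```

The function’s execution role would need permission to call `glue:StartJobRun` on the job.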