Amazon Athena

Athena is a serverless, interactive query service.

You can run ad-hoc queries on data in S3, paying only for the data scanned.

Athena works on structured, semi-structured and unstructured data. It uses schema-on-read: a schema is projected onto the data at query time, so raw files can be queried like relational tables. This also means no ETL process is required before running queries.
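Schema-on-read can be illustrated with a toy Python sketch (this is not Athena itself, just the concept): the raw file is left untouched, and the schema is applied only when the data is read.

```python
import csv
import io

# Raw data "in S3": an untouched CSV blob with no schema attached to it.
raw = "2024-01-01,eu-west-1,200\n2024-01-02,us-east-1,500\n"

# Schema-on-read: the schema travels with the query, not with the data.
schema = [("day", str), ("region", str), ("bytes", int)]

def read_with_schema(blob, schema):
    """Project the schema onto the raw rows at read time."""
    for row in csv.reader(io.StringIO(blob)):
        yield {name: cast(value) for (name, cast), value in zip(schema, row)}

rows = list(read_with_schema(raw, schema))
total = sum(r["bytes"] for r in rows)  # relational-style aggregation
```

The source blob is never modified; dropping or changing the schema costs nothing, which is why no upfront ETL is needed.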

Source data is NEVER changed, and query output can be sent to other AWS services.

If your data is not in an optimal format you can use an AWS Glue ETL Job to transform it into Apache Parquet or Apache ORC, which are the preferred columnar formats for Athena.

Prefer larger files over many small files: each file carries per-request overhead that slows queries down.

Athena provides two engines: SQL and Apache Spark.

Federated Queries and Connectors

Athena uses data source connectors that run on AWS Lambda to run federated queries against data sources other than S3. A connector translates between the target data source and Athena; you can think of it as an extension of Athena’s query engine.

Compression

Supported compression formats:

  • BZIP2

  • DEFLATE

  • GZIP

  • LZ4:

    • Unframed (Raw)

    • Framed

    • Hadoop-Compatible

  • LZO:

    • Standard

    • Hadoop-compatible

  • SNAPPY

  • ZLIB

  • ZSTD

Formats

You should prefer a columnar format: Athena can then read only the columns a query references instead of the whole file, reducing both cost and query time.
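The difference can be sketched in a few lines of Python. This toy model only counts bytes touched; real Parquet/ORC files add metadata, encodings, and compression on top, but the scanning asymmetry is the same.

```python
# Toy comparison of bytes scanned for a one-column query.
# Row-oriented storage interleaves all columns, so a reader touches every
# byte; columnar storage lays each column out contiguously, so a reader
# can fetch just the column it needs.

rows = [{"a": "x" * 10, "b": "y" * 10, "c": "z" * 10, "d": "w" * 10}
        for _ in range(1000)]

def bytes_scanned_row_store(rows, column):
    # Every row is read in full, regardless of the column requested.
    return sum(len(v) for row in rows for v in row.values())

def bytes_scanned_column_store(rows, column):
    # Only the requested column's values are read.
    return sum(len(row[column]) for row in rows)

row_cost = bytes_scanned_row_store(rows, "a")    # all 4 columns scanned
col_cost = bytes_scanned_column_store(rows, "a") # 1 of 4 columns scanned
```

With four equally sized columns, the columnar layout scans a quarter of the bytes, which is exactly the effect priced out in the worked example later in this document.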

  • Read from S3 using a SerDe:

    • text

    • XML

    • CSV/TSV

    • JSON

    • Avro

    • Apache Parquet

    • Apache ORC

    • Apache web server logs

    • Microsoft IIS logs

    • CloudTrail logs

    • VPC Flow Logs

    • ALB and NLB logs

    • CloudFront logs

    • Logstash logs

    • EMR logs

    • Global Accelerator

    • Amazon GuardDuty

    • Network Firewall

    • Simple Email Service (SES)

    • Route53

    • AWS WAF

    • AWS Glue Data Catalog

    • Custom-delimited

  • Federated Queries using Connectors:

    • CloudWatch (metrics + logs)

    • Amazon Timestream

    • Redis

    • Database engines:

      • MySQL

      • PostgreSQL

      • Amazon DynamoDB

      • DocumentDB

      • Oracle

      • IBM DB2

    • Kafka and MSK

    • OpenSearch

Use Cases

  • Scenarios where loading/transformation isn’t desired.

  • Ad-hoc/occasional queries on S3 data

  • Serverless querying

Pricing

With per query billing, you can get started quickly and pay only for the data scanned by queries you run. You are charged for the number of bytes scanned per query, rounded up to the nearest megabyte, with a 10 MB minimum per query.
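The billing rule can be written as a small Python function. The $5/TB rate is taken from the worked example below; actual rates vary by region, so treat the constant as an assumption.

```python
import math

PRICE_PER_TB = 5.00  # assumed rate, matching the worked example below
MB = 1024 ** 2
TB = 1024 ** 4

def query_cost(bytes_scanned):
    """Charge = bytes scanned, rounded up to the nearest MB, 10 MB minimum."""
    billable_mb = max(math.ceil(bytes_scanned / MB), 10)
    return billable_mb * MB / TB * PRICE_PER_TB
```

Note that because of the 10 MB floor, a 1 KB scan and a 10 MB scan are billed identically, while a full 3 TB scan comes out to the $15 used in the example below.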

You can save up to 90% per query and get better performance by compressing, partitioning, and converting your data into columnar formats. Compressing your data allows Athena to process less data, and partitioning allows Athena to process only the data necessary for your query.

Consider a table with four equally sized columns, stored as an uncompressed text file with a total size of 3 TB on Amazon S3, where only a single column is of interest:
This query would cost $15, since Athena must scan all 3 TB (3 TB × $5/TB = $15).
If you compress the file using GZIP, you might see a 3:1 compression ratio, leaving a 1 TB file (which also lowers S3 storage and transfer charges). Athena still has to scan the entire file, but because it is three times smaller, the same query costs one third as much: $5.
If you instead compress the file and convert it to a columnar format like Apache Parquet, again achieving 3:1 compression, you still end up with 1 TB of data on S3. But because Parquet is columnar, Athena can read only the column the query references. Since the query touches one of four equally sized columns, Athena scans only a quarter of the data: 0.25 TB ⇒ $1.25.
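The three scenarios above reduce to simple arithmetic, which the following lines spell out step by step (rate and ratios as stated in the example):

```python
PRICE_PER_TB = 5.00  # $ per TB scanned, as in the example above

# 1. Uncompressed text, 3 TB, fully scanned:
uncompressed = 3.0 * PRICE_PER_TB              # $15.00

# 2. GZIP at 3:1 -> 1 TB file, still fully scanned:
gzip_only = (3.0 / 3) * PRICE_PER_TB           # $5.00

# 3. Parquet at 3:1 -> 1 TB on S3, but only 1 of 4 equal columns read:
parquet = (3.0 / 3) * (1 / 4) * PRICE_PER_TB   # $1.25
```

Compression alone cuts the bill by the compression ratio; adding a columnar format multiplies in the fraction of columns actually referenced, which is where the "up to 90%" savings come from.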