Athena

What?

Athena is a serverless, interactive query service that runs SQL directly against data stored in S3. Serverless is a cloud computing execution model in which the cloud provider dynamically manages the allocation and provisioning of servers. A serverless application runs in stateless compute containers that are event-triggered, ephemeral (a container may last for only one invocation), and fully managed by the cloud provider. Pricing is based on usage, such as the number of executions, rather than on pre-purchased compute capacity.

How?

Athena uses a distributed SQL engine (Presto) to perform read operations (SELECT, etc.) and Apache Hive to perform write operations (CREATE TABLE, etc.). The available Presto functions are listed in the Presto documentation. Athena lets you project a schema onto your data at the time you execute a query (schema-on-read).

Features:

  • Queries with regular expressions
  • Reading of Parquet, JSON, etc.
  • A created table grows automatically when you add more data to the S3 bucket (“prefix”) it points to
  • Supported functions
  • By default, query results are stored as text files (CSV for SELECT queries) in an S3 bucket of your choice (the default for emr-comscore was s3://emr-comscore/aws-athena-query-results-929035564788-us-west-2) and are billed at standard Amazon S3 rates
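
As an illustration of a regular-expression query and of choosing where results are stored, here is a minimal Python sketch. The table, column, database, and bucket names are hypothetical. The request keys follow the Athena StartQueryExecution API as exposed by boto3; the run_query helper assumes boto3 is installed and AWS credentials are configured, so it is defined but not called here.

```python
def build_query_request(sql: str, database: str, output_location: str) -> dict:
    """Assemble keyword arguments for the Athena StartQueryExecution API."""
    if not output_location.startswith("s3://"):
        raise ValueError("output_location must be an s3:// URI")
    return {
        "QueryString": sql,
        "QueryExecutionContext": {"Database": database},
        # Where Athena writes the result files for this query.
        "ResultConfiguration": {"OutputLocation": output_location},
    }


def run_query(request: dict) -> str:
    """Submit the query and return its execution id (assumes boto3 + credentials)."""
    import boto3  # imported lazily so the rest of the sketch runs without it

    client = boto3.client("athena")
    response = client.start_query_execution(**request)
    return response["QueryExecutionId"]


# Hypothetical table and column names; regexp_like is a Presto function.
SQL = (
    "SELECT request_path, count(*) AS hits "
    "FROM access_logs "
    "WHERE regexp_like(request_path, '^/api/v[0-9]+/') "
    "GROUP BY request_path"
)

request = build_query_request(
    SQL,
    database="example_db",                                # hypothetical database
    output_location="s3://example-bucket/athena-results/",  # hypothetical bucket
)
```

Building the request separately from submitting it keeps the query text and result location easy to inspect before anything is billed.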

Recently supported:

  • CREATE TABLE AS SELECT, which creates a table from the result of a SELECT query statement
  • multiple SQL statements in one query
  • The LOCATION of a CREATE TABLE statement must be a directory (an S3 prefix); Hive will include all files in that directory
  • Athena only allows you to create tables with the EXTERNAL keyword. Dropping a table created with the EXTERNAL keyword does not delete the underlying data.
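
The table-creation rules above can be sketched as two small Python helpers that only build statement text. The helper names, table names, columns, and bucket are hypothetical; the trailing slash added to LOCATION is one way to make sure the path names a directory rather than a single file.

```python
def external_table_ddl(table, columns, location, stored_as="PARQUET"):
    """Build a CREATE EXTERNAL TABLE statement for data already sitting in S3."""
    if not location.startswith("s3://"):
        raise ValueError("location must be an s3:// URI")
    if not location.endswith("/"):
        location += "/"  # LOCATION names a directory (prefix), not one file
    column_lines = ",\n".join(f"  `{name}` {sql_type}" for name, sql_type in columns)
    return (
        f"CREATE EXTERNAL TABLE IF NOT EXISTS {table} (\n"
        f"{column_lines}\n"
        f")\n"
        f"STORED AS {stored_as}\n"
        f"LOCATION '{location}'"
    )


def ctas_statement(new_table, select_sql):
    """Build a CREATE TABLE AS SELECT statement from an existing SELECT."""
    return f"CREATE TABLE {new_table} AS {select_sql}"


# Hypothetical schema projected onto the files at query time (schema-on-read).
ddl = external_table_ddl(
    "example_db.events",
    [("user_id", "string"), ("event_time", "timestamp")],
    "s3://example-bucket/events",
)

ctas = ctas_statement(
    "example_db.daily_counts",
    "SELECT date(event_time) AS day, count(*) AS n "
    "FROM example_db.events GROUP BY date(event_time)",
)
```

Because the table is EXTERNAL, dropping example_db.events would remove only the table definition; the files under the S3 prefix would remain.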

How much?

Athena charges per query, based on the amount of data scanned.

  • $5 per TB of data scanned
  • rounded up to the nearest megabyte
  • 10MB minimum per query
  • no charges for

    • Data Definition Language (DDL) statements like CREATE/ALTER/DROP TABLE
    • statements for managing partitions
    • failed queries
  • cancelled queries are charged based on the amount of data scanned
  • standard S3 rates apply for storage, requests, and data transfer
  • Cost/Performance Efficiency:

    • Columnar data formats (e.g. Parquet) allow Athena to read only the columns a query needs
    • Partitioning your data also allows Athena to restrict the amount of data scanned
    • See the Athena pricing example.
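
The pricing rules listed above can be worked through in a short Python sketch. The function and constant names are illustrative, and the sketch assumes binary units (1 MB = 1024² bytes, 1 TB = 1024² MB); the price, rounding, and minimum are the figures quoted in this section and may have changed since.

```python
PRICE_PER_TB_USD = 5.0        # $5 per TB scanned (figure from this section)
BYTES_PER_MB = 1024 ** 2      # assumption: binary megabytes
MB_PER_TB = 1024 ** 2         # assumption: binary terabytes
MIN_BILLED_MB = 10            # 10 MB minimum per query


def billed_megabytes(bytes_scanned: int) -> int:
    """Round the scanned bytes up to whole megabytes and apply the minimum."""
    if bytes_scanned < 0:
        raise ValueError("bytes_scanned must be non-negative")
    rounded_up = -(-bytes_scanned // BYTES_PER_MB)  # integer ceiling division
    return max(rounded_up, MIN_BILLED_MB)


def query_cost_usd(bytes_scanned: int, is_ddl: bool = False, failed: bool = False) -> float:
    """Estimate the charge for one query; DDL and failed queries are free."""
    if is_ddl or failed:
        return 0.0
    # Cancelled queries are billed like any other: pass the bytes scanned so far.
    return billed_megabytes(bytes_scanned) * PRICE_PER_TB_USD / MB_PER_TB


full_scan = query_cost_usd(1024 ** 4)        # scanning 1 TB of raw data
one_column = query_cost_usd(1024 ** 4 // 4)  # e.g. reading a quarter of it
```

Under these assumptions, a query that scans a full terabyte costs $5.00, while a query that reads only a quarter of the same data (for example, by selecting a few Parquet columns or a single partition) costs $1.25.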