Watch out for Unexpected S3 cost when using Athena.

Francois LAGIER - Apr 29 '19 - - Dev Community

Originally published at cloudforecast.io/blog

Following up on my last blog post (Using Parquet on Athena to Save Money on AWS), I wanted to share another thought about Athena, specifically how the S3 bucket is being used by Athena to store query results.

On the first use of Athena, AWS will automatically create a new bucket to store the query results (bucket name aws-athena-query-results-<ACCOUNTID>-<REGION>). Athena will store a raw result file (QueryId.csv) and a metadata file (QueryId.csv.metadata).

By storing the data using the QueryId, it allows you to access previous query's result without re-running them (saving you money since you don't need to rescan the data).

However, you are the owner of the bucket and therefore responsible for the storage on this bucket and here is a couple of reasons why it could cost you a LOT of money:

#1 All the queries are being stored! ALL OF THEM!

Athena store every query results in the bucket. Query data will just accumulate forever costing more and more $$$.

#2 Your data may be compressed but the results are not

S3 is storing the results in raw CSV. Your data may be compressed (GZIP, Snappy, ...) but the results will be in raw CSV. As an example, I ran an accidental SELECT * FROM flights.parquet_snappy_data on a 84M dataset using Parquet which resulted on a 977MB file on S3.

How to fix this?

It's actually pretty easy. If (and only if) you don't plan to re-use old query results, make sure to setup Lifecycle on your bucket using a Transition or Expiration actions. For example, you could delete query results after 1 or 7 days. At CloudForecast, we actually don't persist QueryIds since it's not useful to us so we expire the S3 files after 1 day.

Feel free to reach out if you have any questions at francois@cloudforecast.io or by Twitter: @francoislagier. Also, follow our journey @cloudforecast

. . . . . . . . . . . . . . . .
Terabox Video Player