Amazon Athena (original) (raw)
What is Amazon Athena?
Amazon Athena is a service that enables data analysts to perform interactive queries in the web-based cloud storage service, Amazon Simple Storage Service (S3). Athena is used with large-scale data sets.
Amazon S3 is designed for online backup and archiving of data and applications on Amazon Web Services (AWS). Amazon S3 was created to make web-scale computing easier for developers, with use cases such as data storage, archiving, website hosting, data backup and recovery, and application hosting for deployment. Amazon Athena enables users to analyze data in Amazon S3 using Structured Query Language (SQL). The tool is designed for quick, ad hoc and complex analysis.
Because Athena is a serverless query service, analysts do not need to manage any underlying compute infrastructure to use it. They also do not need to load S3 data into Amazon Athena or transform it for analysis, making it easier and faster to gain insights. A data analyst accesses Athena through the AWS Management Console, an application programming interface or a Java Database Connectivity driver. The analyst then defines the schema and can start to use the built-in query editor to execute SQL queries on S3 data.
What is Amazon Athena used for?
An Athena user can query encrypted data with keys managed by AWS Key Management Service and encrypt query results. Athena also enables cross-account access to S3 buckets owned by another user. In addition, Athena uses managed data catalogs to store information and schemas related to searches on Amazon S3 data.
Overall, the interactive query service is an analytical tool that helps organizations analyze data stored in Amazon S3. Amazon Athena can process unstructured, semistructured and structured data sets. This is useful for research, log analysis and Online Analytical Processing.
What features does Amazon Athena have?
Touted features of Amazon Athena include:
- No server to manage . Analysts do not have to manage the underlying infrastructure. The software automatically handles configuration and software updates.
- Easy SQL querying. Athena uses the distributed SQL query engine, Presto, which is optimized for low-latency data analysis.
- Integrations. Athena integrates with other Amazon services, including AWS Glue out of the box, which helps integration with other services.
- Federated queries. Amazon Athena Federated Query enables Athena to run SQL queries across relational, nonrelational, object and custom data sources.
- Security. Athena uses AWS Identity and Access Management (IAM) policies, Amazon S3 bucket policies and access control lists.
- Machine learning. Developers can use Amazon SageMaker to create and deploy machine learning models in Amazon Athena.
What are the benefits of Amazon Athena?
Benefits of Amazon Athena include:
- Serverless. There is no need to manage any underlying compute infrastructure to use the tool.
- SQL-based. Users can run SQL queries using Presto. The query engine is open source and optimized for data analysis.
- Organizations only pay for data scanned. Pay only for queries that are run. Query costs are $5 per terabyte scanned.
- Speed. Queries are executed in parallel for large data sets, making complex queries fast.
- An open architecture. Users are not limited to AWS-specific software, avoiding vendor lock-in.
- Flexibility. Users can run multiple queries simultaneously.
What are the limitations of Amazon Athena?
Limitations of Athena include:
- Optimization is limited to queries. For example, data already stored in S3 cannot be optimized.
- No indexing options. Indexing options commonly appear in traditional databases. Without indexing, the operation load on Athena increases, potentially affecting performance.
- Efficient queries require partitioning. In order to enable efficient queries, data must first be partitioned. Partitions must then be managed for what best fits performance needs.
- Stored procedures, parameterized queries and Presto federated connectors are not supported. Amazon Athena Federated Query is needed to connect data sources.
- When querying a table with thousands of partitions, Athena can time out.
- Source files that start with an underscore or a dot are treated as hidden.
- The row and column size cannot exceed 32 megabytes.
- Athena does not support querying data in S3 Glacier and S3 Glacier Deep Archive storage classes.
- Statements such as CREATE TABLE LIKE, DESCRIBE INPUT and DESCRIBE OUTPUT, EXECUTE … USING, MERGE and UPDATE are not supported.
Which data types does Amazon Athena support?
Athena can process numerous structured and unstructured data types, including standard data formats like CSV (comma-separated value), JSON (JavaScript Object Notation), ORC (Optimized Row Columnar), Apache Parquet and Apache Avro. Athena also supports compressed data in Snappy, Zlib, LZO (Lempel-Ziv-Oberhumer) and Gzip (GNU Zip) formats.
Other examples of supported data types include:
- Boolean
- TinyIT
- SMALLINT
- Column
- VARCHAR
- CHAR
- BigInt
- WorkGroupConfigurationUpdates
- UnprocessedNamedQueryId
What integration options does Amazon Athena have?
Athena integrates with a variety of other services in the AWS portfolio. For example, AWS Glue integrates with Athena to enable more sophisticated data catalog features, such as a metadata repository, automated schema and partition recognition, and data pipelines based on Python. Glue Data Catalog stores and retrieves table metadata for S3 data.
Athena uses Amazon S3 as an underlying data store, which provides data redundancy.
Other integrations include:
- AWS CloudFormation
- Elastic Load Balancing
- Amazon QuickSight
- AWS Step Functions
- AWS Systems Manager Inventory
- Amazon CloudFront
- Amazon S3 Inventory
- Amazon Virtual Private Cloud
- AWS IAM
- AWS CloudTrail
How does Amazon Athena compare to other services?
Amazon Athena vs. Amazon Redshift. Amazon RedShift, AWS' data warehouse service, can analyze data by using standard SQL-based clients and business intelligence (BI) tools. Redshift addresses different needs than Athena. Redshift handles more complex, multipart SQL queries and is a better fit for organizations that need to combine data from disparate sources into a common format.
Amazon Athena vs. Amazon Elastic MapReduce (EMR). EMR enables teams to run distributed data processing frameworks, like Apache Hadoop, Apache Spark and the Presto SQL query engine. EMR is better suited for projects that require custom code, specific cluster configurations or extremely large data sets. However, Athena can query data processed by EMR without affecting ongoing EMR jobs. As an example, EMR is used for machine learning, data warehousing and financial analysis.
Amazon Athena vs. Microsoft SQL Server. SQL Server is a relational database management system that supports various transaction processing, BI and analytics applications. It is used for database management and analysis in fields such as e-commerce and data warehousing. Both Athena and SQL Server are in the same category of tools. SQL Server integrates well with Windows-based applications, but other options may work better outside of Windows environments.
Continue reading to learn more about Amazon Athena and other real-world serverless examples to see how to process and analyze data as part of an IT strategy.
This was last updated in June 2021
Continue Reading About Amazon Athena
- How to use Athena to troubleshoot AWS operations issues
- AWS data lake and data warehouse options for the cloud
- Implement serverless architecture on AWS without Lambda
- Presto users detail what's next for open source SQL engine
- S3 Integration with Athena for user access log analysis