Amazon Data Firehose FAQs

General and Streaming ETL Concepts

Streaming ETL is the processing and movement of real-time data from one place to another. ETL is short for the database functions extract, transform, and load. Extract refers to collecting data from a source. Transform refers to any processing performed on that data. Load refers to sending the processed data to a destination, such as a warehouse, a data lake, or an analytical tool.

Data Firehose is a streaming ETL solution. It is the easiest way to load streaming data into data stores and analytics tools. It can capture, transform, and load streaming data into Amazon S3, Amazon Redshift, Amazon OpenSearch Service, Snowflake, Apache Iceberg tables, and Splunk, enabling near real-time analytics with the business intelligence tools and dashboards you already use. It is a fully managed service that automatically scales to match the throughput of your data and requires no ongoing administration. It can also batch, compress, and encrypt the data before loading it, minimizing the amount of storage used at the destination and increasing security.

A source is where your streaming data is continuously generated and captured. For example, a source can be a logging server running on Amazon EC2 instances, an application running on mobile devices, or a sensor on an IoT device. You can connect your sources to Firehose using: 1) The Amazon Data Firehose API, which uses the AWS SDK for Java, .NET, Node.js, Python, or Ruby. 2) Kinesis Data Streams, where Firehose reads data from an existing Kinesis data stream and loads it into Firehose destinations. 3) Amazon MSK, where Firehose reads data from an existing Amazon MSK cluster and loads it into Amazon S3 buckets. 4) Natively supported AWS services such as Amazon CloudWatch, Amazon EventBridge, AWS IoT, or Amazon Pinpoint. For the complete list, see the Amazon Data Firehose developer guide. 5) Kinesis Agent, a stand-alone Java software application that continuously monitors a set of files and sends new data to your stream. 6) Fluent Bit, an open source log processor and forwarder. 7) AWS Lambda, a serverless compute service that lets you run code without provisioning or managing servers. You can write a Lambda function that sends traffic from S3 or DynamoDB to Firehose based on a triggered event.

A destination is the data store where your data will be delivered. Firehose currently supports Amazon S3, Amazon Redshift, Amazon OpenSearch Service, Snowflake, Apache Iceberg tables, Splunk, Datadog, New Relic, Dynatrace, Sumo Logic, LogicMonitor, MongoDB, and HTTP endpoints as destinations.

Data Firehose manages all underlying infrastructure, storage, networking, and configuration needed to capture and load your data into Amazon S3, Amazon Redshift, Amazon OpenSearch Service, Snowflake, Apache Iceberg tables, or Splunk. You do not have to provision, deploy, or maintain hardware or software, or write any other application to manage this process. Data Firehose also scales elastically without requiring any intervention or associated developer overhead. Moreover, Data Firehose synchronously replicates data across three facilities in an AWS Region, providing high availability and durability for the data as it is transported to the destinations.

After you sign up for Amazon Web Services, you can start using Firehose with the following steps:

Create a Firehose stream through the Firehose Console or the CreateDeliveryStream operation. You can optionally configure an AWS Lambda function in your Firehose stream to prepare and transform the raw data before loading the data.
Configure your data producers to continuously send data to your Firehose stream using the Amazon Kinesis Agent or the Firehose API.
Firehose automatically and continuously loads your data to the destinations you specify.
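The steps above can be sketched with the AWS SDK for Python (boto3). This is a minimal illustration, not a complete setup: the stream name, bucket ARN, and role ARN are hypothetical placeholders, and the helper names `build_create_request` and `send_event` are introduced here for illustration only. It assumes an S3 destination and a Direct PUT source.

```python
import json


def build_create_request(stream_name, bucket_arn, role_arn):
    """Build keyword arguments for CreateDeliveryStream with an S3 destination."""
    return {
        "DeliveryStreamName": stream_name,
        "DeliveryStreamType": "DirectPut",  # producers call the Firehose API directly
        "ExtendedS3DestinationConfiguration": {
            "RoleARN": role_arn,
            "BucketARN": bucket_arn,
        },
    }


def send_event(client, stream_name, event):
    """Send one JSON event to the stream as a newline-terminated record."""
    data = (json.dumps(event) + "\n").encode("utf-8")
    return client.put_record(
        DeliveryStreamName=stream_name,
        Record={"Data": data},
    )


def main():
    # Imported lazily so the helpers above can be used without the SDK installed.
    import boto3

    firehose = boto3.client("firehose")
    # All names and ARNs below are placeholders (assumptions), not real resources.
    firehose.create_delivery_stream(**build_create_request(
        "example-stream",
        "arn:aws:s3:::example-bucket",
        "arn:aws:iam::123456789012:role/example-firehose-role",
    ))
    # A new stream needs some time to become ACTIVE; send records only after that.
    send_event(firehose, "example-stream", {"level": "INFO", "msg": "hello"})
```

The trailing newline in each record is a design choice of this sketch: it keeps records separable when Firehose concatenates them into a single S3 object.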

A Firehose stream is the underlying entity of Firehose. You use Firehose by creating a Firehose stream and then sending data to it. You can create a Firehose stream through the Firehose Console or the CreateDeliveryStream operation. For more information, see Creating a Firehose stream.

A record is the data of interest your data producer sends to a Firehose stream. The maximum size of a record (before Base64-encoding) is 1024 KB if your data source is Direct PUT or Kinesis Data Streams. The maximum size of a record (before Base64-encoding) is 10 MB if your data source is Amazon MSK.
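A producer can check a payload against these limits before sending it. The sketch below assumes 1 KB = 1,024 bytes and 1 MB = 1,024 KB; the source labels used as dictionary keys and the function name are hypothetical and are not Firehose API values.

```python
# Maximum record size in bytes (before Base64 encoding), keyed by source type.
# The keys are illustrative labels, not API constants.
MAX_RECORD_BYTES = {
    "DirectPut": 1024 * 1024,           # 1024 KB
    "KinesisDataStreams": 1024 * 1024,  # 1024 KB
    "MSK": 10 * 1024 * 1024,            # 10 MB
}


def fits_record_limit(payload: bytes, source: str = "DirectPut") -> bool:
    """Return True if the raw payload is within the limit for the given source."""
    return len(payload) <= MAX_RECORD_BYTES[source]
```

For example, `fits_record_limit(b"x" * 2_000_000, "MSK")` returns `True`, while the same payload with the default `"DirectPut"` source returns `False`.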

For information about limits, see Amazon Data Firehose Limits in the developer guide.

Yes, Firehose can back up all untransformed records to your S3 bucket while concurrently delivering transformed records to the destination. Source record backup can be enabled when you create or update your Firehose stream.

The frequency of data delivery to Amazon S3 is determined by the S3 buffer size and buffer interval values you configured for your Firehose stream. Firehose buffers incoming data before delivering it to Amazon S3. You can configure the values for S3 buffer size (1 MB to 128 MB) or buffer interval (0 to 900 seconds), and the condition satisfied first triggers data delivery to Amazon S3. If you have Apache Parquet conversion or dynamic partitioning enabled, the buffer size ranges from 64 MB to 128 MB for Amazon S3 destinations, with 128 MB being the default value. Note that when data delivery to the destination falls behind data ingestion into the Firehose stream, Firehose raises the buffer size automatically to catch up and make sure that all data is delivered to the destination.
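The "whichever condition is satisfied first" rule can be illustrated with a small sketch. The range checks mirror the general S3 limits stated above; `SizeInMBs` and `IntervalInSeconds` are the field names of the `BufferingHints` structure in the Firehose API, while the function names and the steady-ingest-rate model are assumptions made for illustration.

```python
def s3_buffering_hints(size_mb: int, interval_s: int) -> dict:
    """Validate buffer settings against the general S3 ranges and build BufferingHints."""
    if not 1 <= size_mb <= 128:
        raise ValueError("S3 buffer size must be between 1 and 128 MB")
    if not 0 <= interval_s <= 900:
        raise ValueError("buffer interval must be between 0 and 900 seconds")
    return {"SizeInMBs": size_mb, "IntervalInSeconds": interval_s}


def first_trigger(rate_mb_per_s: float, hints: dict) -> str:
    """Return which condition fires first at a steady, positive ingest rate."""
    seconds_to_fill = hints["SizeInMBs"] / rate_mb_per_s
    return "size" if seconds_to_fill < hints["IntervalInSeconds"] else "interval"
```

With a 64 MB buffer and a 60 second interval, a stream ingesting 2 MB per second fills the buffer in 32 seconds, so size triggers delivery. At 0.5 MB per second the buffer would need 128 seconds to fill, so the interval triggers delivery first.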

Buffer size is applied before compression. As a result, if you choose to compress your data, the size of the objects within your Amazon S3 bucket can be smaller than the buffer size you specify.

The Redshift user needs to have Redshift INSERT privilege for copying data from your Amazon S3 bucket to your Redshift instance.

If your Redshift instance is within a VPC, you need to grant Amazon Data Firehose access to your Redshift instance by unblocking Firehose IP addresses from your VPC. For information about how to unblock IPs to your VPC, see Grant Firehose Access to an Amazon Redshift Destination in the Amazon Data Firehose developer guide.

For Redshift destinations, Amazon Data Firehose delivers data to your Amazon S3 bucket first and then issues the Redshift COPY command to load data from your S3 bucket to your Redshift instance.
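The two-stage delivery (S3 first, then COPY) shows up in the shape of the destination configuration: a Redshift destination carries both a COPY target and an intermediate S3 location. The sketch below builds a dictionary that could be passed as `RedshiftDestinationConfiguration` to `create_delivery_stream` in boto3. The function name and all argument values are hypothetical, and reusing one IAM role for both stages is an assumption of this example.

```python
def redshift_destination(role_arn, jdbc_url, table, username, password,
                         bucket_arn, copy_options=""):
    """Build a RedshiftDestinationConfiguration dictionary."""
    config = {
        "RoleARN": role_arn,
        "ClusterJDBCURL": jdbc_url,
        # Target of the COPY command that Firehose issues after the S3 delivery.
        "CopyCommand": {"DataTableName": table},
        "Username": username,
        "Password": password,
        # Intermediate bucket that Firehose writes to before issuing COPY.
        "S3Configuration": {"RoleARN": role_arn, "BucketARN": bucket_arn},
    }
    if copy_options:
        config["CopyCommand"]["CopyOptions"] = copy_options
    return config
```

In real use, avoid hardcoding the password; load it from a secrets store at runtime.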

Currently, a single Firehose stream can only deliver data to one Snowflake table. To deliver data to multiple Snowflake tables, you need to create multiple Firehose streams.

Firehose uses exactly-once delivery semantics for Snowflake. This means that each record is delivered to Snowflake exactly once, even if there are errors or retries. However, exactly-once delivery does not guarantee that there will be no duplicates in the data end to end, as data may be duplicated by the producer or by other parts of the ETL pipeline.

We expect most data streams to be delivered within 5 seconds.

Amazon OpenSearch Service makes it easy for you to perform interactive log analytics, real-time application monitoring, website search, and more. OpenSearch is an open source, distributed search and analytics suite derived from Elasticsearch. Amazon OpenSearch Service offers the latest versions of OpenSearch, support for 19 versions of Elasticsearch (1.5 to 7.10 versions), and visualization capabilities powered by OpenSearch Dashboards and Kibana (1.5 to 7.10 versions). For more information, see the Amazon OpenSearch Service documentation.

Firehose can rotate your Amazon OpenSearch Service index based on a time duration. You can configure this time duration while creating your Firehose stream. For more information, see Index Rotation for the Amazon OpenSearch Destination in the Amazon Data Firehose developer guide.
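As a rough illustration of time-based rotation, the sketch below appends a UTC timestamp suffix to a base index name. The period names follow the `IndexRotationPeriod` values of the Firehose API, but the suffix formats are written from memory of the developer guide and should be verified there. `OneWeek` is omitted because its week-numbering convention is not covered here. The function name is hypothetical.

```python
from datetime import datetime, timezone

# Assumed suffix formats per rotation period; verify against the developer guide.
_SUFFIX_FORMATS = {
    "NoRotation": None,
    "OneHour": "%Y-%m-%d-%H",
    "OneDay": "%Y-%m-%d",
    "OneMonth": "%Y-%m",
}


def rotated_index_name(index: str, period: str, arrival: datetime) -> str:
    """Return the index name for a record arriving at `arrival` (timezone-aware)."""
    fmt = _SUFFIX_FORMATS[period]
    if fmt is None:
        return index
    return f"{index}-{arrival.astimezone(timezone.utc).strftime(fmt)}"
```

For a record arriving at 13:00 UTC on 25 February 2016, `rotated_index_name("myindex", "OneDay", ...)` yields `myindex-2016-02-25`.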

When loading data into Amazon OpenSearch Service, Firehose can back up all of the data or only the data that failed to deliver. To take advantage of this feature and prevent any data loss, you need to provide a backup Amazon S3 bucket.

You can change the configuration of your Firehose stream at any time after it’s created. You can do so by using the Firehose Console or the UpdateDestination operation. Your Firehose stream remains in ACTIVE state while your configurations are updated and you can continue to send data to your Firehose stream. The updated configurations normally take effect within a few minutes.
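A configuration change through the UpdateDestination operation might look like the following boto3 sketch. It assumes a stream with a single S3 destination and changes only the buffering hints. The function name is hypothetical, and `client` is expected to be a boto3 Firehose client, for example `boto3.client("firehose")`.

```python
def update_s3_buffering(client, stream_name, size_mb, interval_s):
    """Change the S3 buffering hints of an existing Firehose stream."""
    description = client.describe_delivery_stream(
        DeliveryStreamName=stream_name
    )["DeliveryStreamDescription"]
    return client.update_destination(
        DeliveryStreamName=stream_name,
        # The current version ID guards against concurrent configuration changes.
        CurrentDeliveryStreamVersionId=description["VersionId"],
        # Assumes the stream has exactly one destination.
        DestinationId=description["Destinations"][0]["DestinationId"],
        ExtendedS3DestinationUpdate={
            "BufferingHints": {
                "SizeInMBs": size_mb,
                "IntervalInSeconds": interval_s,
            },
        },
    )
```

The describe call comes first because UpdateDestination requires the stream's current version ID and destination ID.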

When delivering to a VPC destination, you can change the destination endpoint URL, as long as the new destination is accessible within the same VPC, subnets, and security groups. To change the VPC, subnets, or security groups, you need to re-create the Firehose stream.

Firehose can deliver to an Amazon OpenSearch Service domain in a different account only when Firehose and Amazon OpenSearch Service are connected through a public endpoint.

If Firehose and Amazon OpenSearch Service are connected through a private VPC, the Firehose stream and the destination Amazon OpenSearch Service domain VPC need to be in the same account.

No, your Firehose stream and destination Amazon OpenSearch Service domain need to be in the same region.

The frequency of data delivery to Amazon OpenSearch Service is determined by the OpenSearch buffer size and buffer interval values that you configured for your Firehose stream. Firehose buffers incoming data before delivering it to Amazon OpenSearch Service. You can configure the values for OpenSearch buffer size (1 MB to 100 MB) or buffer interval (0 to 900 seconds), and the condition satisfied first triggers data delivery to Amazon OpenSearch Service. Note that in circumstances where data delivery to the destination is falling behind data ingestion into the Firehose stream, Amazon Data Firehose raises the buffer size automatically to catch up and make sure that all data is delivered to the destination.

For Redshift destinations, Amazon Data Firehose generates manifest files to load Amazon S3 objects to Redshift instances in batch. The manifests folder stores the manifest files generated by Firehose.

If “all documents” mode is used, Amazon Data Firehose concatenates multiple incoming records based on buffering configuration of your Firehose stream, and then delivers them to your S3 bucket as an S3 object. Regardless of which backup mode is configured, the failed documents are delivered to your S3 bucket using a certain JSON format that provides additional information such as error code and time of delivery attempt. For more information, see Amazon S3 Backup for the Amazon OpenSearch Destination in the Amazon Data Firehose developer guide.

A single Firehose stream can currently only deliver data to one Amazon S3 bucket. If you want to have data delivered to multiple S3 buckets, you can create multiple Firehose streams.

A single Firehose stream can currently only deliver data to one Redshift instance and one table. If you want to have data delivered to multiple Redshift instances or tables, you can create multiple Firehose streams.

A single Firehose stream can currently only deliver data to one Amazon OpenSearch Service domain and one index. If you want to have data delivered to multiple Amazon OpenSearch Service domains or indexes, you can create multiple Firehose streams.

When you enable Firehose to deliver data to an Amazon OpenSearch Service destination in a VPC, Amazon Data Firehose creates one or more cross-account elastic network interfaces (ENIs) in your VPC for each subnet that you choose. Amazon Data Firehose uses these ENIs to deliver the data into your VPC. The number of ENIs scales automatically to meet the service requirements.

Yes, one Firehose stream can deliver data to multiple Apache Iceberg tables.

Yes, Firehose supports connecting to the AWS Glue Data Catalog in a different account, or in a different AWS Region.

Yes, you can use Data Transformation using Lambda when delivering to Apache Iceberg tables.
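A data transformation Lambda function receives a batch of Base64-encoded records and returns each one with its original `recordId`, a `result` status, and the re-encoded `data`. The handler below is a minimal sketch of that contract. The added `processed` field is purely illustrative, and it assumes that incoming records are JSON objects.

```python
import base64
import json


def lambda_handler(event, context):
    """Mark each JSON record as processed; flag unparseable records as failed."""
    output = []
    for record in event["records"]:
        try:
            payload = json.loads(base64.b64decode(record["data"]))
            payload["processed"] = True  # illustrative transformation only
            data = base64.b64encode(
                (json.dumps(payload) + "\n").encode("utf-8")
            ).decode("ascii")
            output.append(
                {"recordId": record["recordId"], "result": "Ok", "data": data}
            )
        except (ValueError, TypeError):
            # Invalid Base64, invalid JSON, or a payload that is not a JSON object.
            output.append({
                "recordId": record["recordId"],
                "result": "ProcessingFailed",
                "data": record["data"],
            })
    return {"records": output}
```

Every input record must appear in the output with the same `recordId`. Returning failed records with their original data, rather than dropping them, lets Firehose treat them as unsuccessfully processed.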