Amazon Redshift data ingestion options

Amazon Redshift, a fully managed cloud data warehouse service, offers a variety of options for ingesting data from diverse sources into its high-performance, scalable environment. Whether your data resides in operational databases, data lakes, on-premises systems, Amazon Elastic Compute Cloud (Amazon EC2), or other AWS services, Amazon Redshift provides multiple ingestion methods to meet your specific needs. The currently available choices include:

  • The Amazon Redshift COPY command
  • Amazon Redshift federated queries
  • Amazon Redshift zero-ETL integrations
  • Amazon Redshift integration for Apache Spark
  • Amazon Redshift streaming ingestion

This post explores each option (as illustrated in the following figure), determines which are suitable for different use cases, and discusses how and why to select a specific Amazon Redshift tool or feature for data ingestion.

Architecture diagram: Amazon Redshift sits at the center, with arrows pointing to it from Amazon RDS for MySQL and PostgreSQL and Amazon Aurora MySQL-Compatible and PostgreSQL-Compatible (labeled zero-ETL and federated queries), AWS Glue and Amazon EMR (labeled Spark connector), an Amazon S3 bucket (labeled COPY command), and Amazon Managed Streaming for Apache Kafka and Amazon Kinesis (labeled Redshift streaming). Amazon Data Firehose has an arrow pointing to the Amazon S3 bucket, indicating the direction of data flow.

Amazon Redshift COPY command

The Redshift COPY command, a simple low-code data ingestion tool, loads data into Amazon Redshift from Amazon S3, DynamoDB, Amazon EMR, and remote hosts over SSH. It's a fast and efficient way to load large datasets into Amazon Redshift because it uses the Amazon Redshift massively parallel processing (MPP) architecture to read and load data in parallel from files in supported data sources. You get the most from this parallelism by splitting your data into multiple files, especially when the files are compressed.

Recommended use cases for the COPY command include loading large datasets and data from supported data sources. COPY automatically splits large uncompressed delimited text files into smaller scan ranges to utilize the parallelism of Amazon Redshift provisioned clusters and serverless workgroups. Auto-copy extends the COPY command by creating copy jobs that automatically ingest new files as they arrive in an Amazon S3 path, so you don't have to rerun COPY manually.
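
As a minimal sketch of an auto-copy job (the table, bucket, and job names below are hypothetical, and the exact COPY options depend on your data):

-- Hypothetical auto-copy job: new files that land under the S3 prefix are
-- loaded automatically, without rerunning COPY manually.
COPY myschema.app_logs
FROM 's3://my-bucket/logs/'
IAM_ROLE 'arn:aws:iam::1234567891011:role/MyRedshiftRole'
FORMAT AS CSV
JOB CREATE app_logs_copy_job
AUTO ON;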

COPY command advantages:

  • Performance – Efficiently loads large datasets from Amazon S3 or other sources in parallel with optimized throughput
  • Simplicity – Straightforward and user-friendly, requiring minimal setup
  • Cost-optimized – Uses Amazon Redshift MPP at a lower cost by reducing data transfer time
  • Flexibility – Supports file formats such as CSV, JSON, Parquet, ORC, and AVRO

Amazon Redshift federated queries

Amazon Redshift federated queries allow you to incorporate live data from Amazon RDS or Aurora operational databases as part of business intelligence (BI) and reporting applications.

Federated queries are useful when organizations want to combine data from their operational systems with data stored in Amazon Redshift. They let you query data across Amazon RDS for MySQL and PostgreSQL data sources without the need for extract, transform, and load (ETL) pipelines. If you also need to persist operational data in the data warehouse, you can load the results of federated queries into Redshift tables to keep them in sync with the operational data stores. In scenarios where data transformation is required, you can use Redshift stored procedures to modify the data in Redshift tables.
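
As a minimal sketch of that pattern (the schema, table, and procedure names here are hypothetical), a stored procedure can read from a federated table, apply a transformation, and write the result into a local Redshift table:

-- Hypothetical example: copy the previous day's orders from the live
-- federated table into a local Redshift table, applying a simple transformation.
CREATE OR REPLACE PROCEDURE load_recent_orders()
AS $$
BEGIN
    INSERT INTO local_schema.orders_enriched
    SELECT order_id, order_date, customer_id, UPPER(status) AS status
    FROM postgres_schema.orders
    WHERE order_date >= CURRENT_DATE - 1;
END;
$$ LANGUAGE plpgsql;

CALL load_recent_orders();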

Federated queries key features:

  • Real-time access – Enables querying of live data across discrete sources, such as Amazon RDS and Aurora, without the need to move the data
  • Unified data view – Provides a single view of data across multiple databases, simplifying data analysis and reporting
  • Cost savings – Eliminates the need for ETL processes to move data into Amazon Redshift, saving on storage and compute costs
  • Flexibility – Supports Amazon RDS and Aurora data sources, offering flexibility in accessing and analyzing distributed data

Amazon Redshift Zero-ETL integration

Aurora zero-ETL integration with Amazon Redshift gives you access in Amazon Redshift, in near real time and without the need for ETL, to operational data from Amazon Aurora MySQL-Compatible Edition (as well as Amazon Aurora PostgreSQL-Compatible Edition and Amazon RDS for MySQL in preview) and DynamoDB. You can use zero-ETL to simplify ingestion pipelines for performing change data capture (CDC) from an Aurora database to Amazon Redshift. Built on the integration of the Amazon Redshift and Aurora storage layers, zero-ETL offers simple setup, data filtering, automated observability, auto-recovery, and integration with either Amazon Redshift provisioned clusters or Amazon Redshift Serverless workgroups.
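
As a brief sketch (the database name and integration ID below are placeholders), after an integration is created you create a destination database from it in Amazon Redshift, and the replicated tables can then be queried directly:

-- Create a destination database from an existing zero-ETL integration.
-- Replace the placeholder with the integration ID shown on the console or
-- in the SVV_INTEGRATION system view.
CREATE DATABASE aurora_zeroetl
FROM INTEGRATION '<integration-id>';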

Zero-ETL integration benefits:

  • Seamless integration – Automatically integrates and synchronizes data between operational databases and Amazon Redshift without the need for custom ETL processes
  • Near real-time insights – Provides near real-time data updates, so the most current data is available for analysis
  • Ease of use – Simplifies data architecture by eliminating the need for separate ETL tools and processes
  • Efficiency – Minimizes data latency and provides data consistency across systems, enhancing overall data accuracy and reliability

Amazon Redshift integration for Apache Spark

The Amazon Redshift integration for Apache Spark, automatically included through Amazon EMR or AWS Glue, provides performance and security optimizations when compared to the community-provided connector. The integration enhances and simplifies security with AWS Identity and Access Management (IAM) authentication support. AWS Glue 4.0 provides a visual ETL tool for authoring jobs to read from and write to Amazon Redshift, using the Redshift Spark connector for connectivity. This simplifies the process of building ETL pipelines to Amazon Redshift. The Spark connector lets Spark applications process and transform data before loading it into Amazon Redshift. The integration minimizes the manual process of setting up a Spark connector and shortens the time needed to prepare for analytics and machine learning (ML) tasks. You specify the connection to a data warehouse and can start working with Amazon Redshift data from your Apache Spark-based applications within minutes.

The integration provides pushdown capabilities for sort, aggregate, limit, join, and scalar function operations to optimize performance by moving only the relevant data from Amazon Redshift to the consuming Apache Spark application. Spark jobs are suitable for data processing pipelines and when you need to use Spark’s advanced data transformation capabilities.

With the Amazon Redshift integration for Apache Spark, you can simplify the building of ETL pipelines with data transformation requirements. It offers the following benefits:

  • High performance – Uses the distributed computing power of Apache Spark for large-scale data processing and analysis
  • Scalability – Effortlessly scales to handle massive datasets by distributing computation across multiple nodes
  • Flexibility – Supports a wide range of data sources and formats, providing versatility in data processing tasks
  • Interoperability – Seamlessly integrates with Amazon Redshift for efficient data transfer and queries

Amazon Redshift streaming ingestion

The key benefit of Amazon Redshift streaming ingestion is the ability to ingest hundreds of megabytes of data per second directly from streaming sources into Amazon Redshift with very low latency, supporting real-time analytics and insights. Supporting streams from Kinesis Data Streams, Amazon MSK, and Data Firehose, streaming ingestion requires no data staging, supports flexible schemas, and is configured with SQL. Streaming ingestion powers real-time dashboards and operational analytics by directly ingesting data into Amazon Redshift materialized views.
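
As a minimal sketch (the stream, schema, and role names are hypothetical), streaming ingestion from Kinesis Data Streams takes only an external schema and an auto-refreshing materialized view; the Amazon MSK walkthrough later in this post follows the same pattern:

-- Map a Kinesis data stream into Amazon Redshift through an external schema.
CREATE EXTERNAL SCHEMA kinesis_schema
FROM KINESIS
IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftRoleForKinesis';

-- Materialized view over the stream; JSON payloads are parsed into SUPER.
CREATE MATERIALIZED VIEW clickstream_view AUTO REFRESH YES AS
SELECT
    approximate_arrival_timestamp,
    partition_key,
    JSON_PARSE(kinesis_data) AS payload
FROM kinesis_schema."my-click-stream"
WHERE CAN_JSON_PARSE(kinesis_data);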

Amazon Redshift streaming ingestion unlocks near real-time streaming analytics with:

  • Low latency – Ingests streaming data in near real time, making streaming ingestion ideal for time-sensitive applications such as Internet of Things (IoT), financial transactions, and clickstream analysis
  • Scalability – Manages high throughput and large volumes of streaming data from sources such as Kinesis Data Streams, Amazon MSK, and Data Firehose
  • Integration – Integrates with other AWS services to build end-to-end streaming data pipelines
  • Continuous updates – Keeps data in Amazon Redshift continuously updated with the latest information from the data streams

Amazon Redshift ingestion use cases and examples

In this section, we discuss the details of different Amazon Redshift ingestion use cases and provide examples.

Redshift COPY use case: Application log data ingestion and analysis

Ingesting application log data stored in Amazon S3 is a common use case for the Redshift COPY command. Data engineers in an organization need to analyze application log data to gain insights into user behavior, identify potential issues, and optimize a platform's performance. To achieve this, data engineers ingest log data in parallel from multiple files stored in S3 buckets into Redshift tables. This parallelization uses the Amazon Redshift MPP architecture, allowing for faster data ingestion than loading the same data serially from a single large file.

The following code is an example of the COPY command loading data from a set of CSV files in an S3 bucket into a Redshift table:

COPY myschema.mytable
FROM 's3://my-bucket/data/files/'
IAM_ROLE 'arn:aws:iam::1234567891011:role/MyRedshiftRole'
FORMAT AS CSV;

This code uses the following parameters:

  • myschema.mytable is the target Redshift table for the data load
  • 's3://my-bucket/data/files/' is the S3 path where the CSV files are located
  • IAM_ROLE specifies the IAM role required to access the S3 bucket
  • FORMAT AS CSV specifies that the data files are in CSV format

In addition to Amazon S3, the COPY command loads data from other sources, such as DynamoDB, Amazon EMR, remote hosts through SSH, or other Redshift databases. The COPY command provides options to specify data formats, delimiters, compression, and other parameters to handle different data sources and formats.
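
For instance, the following variations reuse the example bucket and role from the previous COPY command to load Parquet files and pipe-delimited, gzip-compressed files with a header row (the prefixes are hypothetical, and the exact options depend on your files):

-- Load Parquet files from a different prefix.
COPY myschema.mytable
FROM 's3://my-bucket/data/parquet/'
IAM_ROLE 'arn:aws:iam::1234567891011:role/MyRedshiftRole'
FORMAT AS PARQUET;

-- Load pipe-delimited, gzip-compressed text files, skipping a header row.
COPY myschema.mytable
FROM 's3://my-bucket/data/delimited/'
IAM_ROLE 'arn:aws:iam::1234567891011:role/MyRedshiftRole'
DELIMITER '|' GZIP IGNOREHEADER 1;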

To get started with the COPY command, see Using the COPY command to load from Amazon S3.

Federated queries use case: Integrated reporting and analytics for a retail company

For this use case, a retail company has an operational database running on Amazon RDS for PostgreSQL, which stores real-time sales transactions, inventory levels, and customer information. Additionally, a data warehouse on Amazon Redshift stores historical data for reporting and analytics purposes. To create an integrated reporting solution that combines real-time operational data with historical data in the data warehouse, without the need for multi-step ETL processes, complete the following steps:

  1. Set up network connectivity. Make sure your Redshift cluster and RDS for PostgreSQL instance are in the same virtual private cloud (VPC) or have network connectivity established through VPC peering, AWS PrivateLink, or AWS Transit Gateway.
  2. Create a secret and IAM role for federated queries:
    1. In AWS Secrets Manager, create a new secret to store the credentials (user name and password) for your Amazon RDS for PostgreSQL instance.
    2. Create an IAM role with permissions to access the Secrets Manager secret and the Amazon RDS for PostgreSQL instance.
    3. Associate the IAM role with your Amazon Redshift cluster.
  3. Create an external schema in Amazon Redshift:
    1. Connect to your Redshift cluster using a SQL client or the query editor v2 on the Amazon Redshift console.
    2. Create an external schema that references your Amazon RDS for PostgreSQL instance:
CREATE EXTERNAL SCHEMA postgres_schema
FROM POSTGRES
DATABASE 'mydatabase'
SCHEMA 'public'
URI 'endpoint-for-your-rds-instance.aws-region.rds.amazonaws.com' PORT 5432
IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftRoleForRDS'
SECRET_ARN 'arn:aws:secretsmanager:aws-region:123456789012:secret:my-rds-secret-abc123';
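
Optionally, you can confirm that the external schema is registered by querying the SVV_EXTERNAL_SCHEMAS system view:

-- The federated schema created above should appear in the result.
SELECT schemaname, databasename
FROM svv_external_schemas
WHERE schemaname = 'postgres_schema';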

  4. Query tables in your Amazon RDS for PostgreSQL instance directly from Amazon Redshift using federated queries:
SELECT
    r.order_id,
    r.order_date,
    r.customer_name,
    r.total_amount,
    h.product_name,
    h.category
FROM
    postgres_schema.orders r
    JOIN redshift_schema.product_history h ON r.product_id = h.product_id
WHERE
    r.order_date >= '2024-01-01';

  5. Create views or materialized views in Amazon Redshift that combine the operational data from federated queries with the historical data in Amazon Redshift for reporting purposes:
CREATE MATERIALIZED VIEW sales_report AS
SELECT
    r.order_id,
    r.order_date,
    r.customer_name,
    r.total_amount,
    h.product_name,
    h.category,
    h.historical_sales
FROM
    (
        SELECT
            order_id,
            order_date,
            customer_name,
            total_amount,
            product_id
        FROM
            postgres_schema.orders
    ) r
    JOIN redshift_schema.product_history h ON r.product_id = h.product_id;

With this implementation, federated queries in Amazon Redshift integrate real-time operational data from Amazon RDS for PostgreSQL instances with historical data in a Redshift data warehouse. This approach eliminates the need for multi-step ETL processes and enables you to create comprehensive reports and analytics that combine data from multiple sources.

To get started with Amazon Redshift federated query ingestion, see Querying data with federated queries in Amazon Redshift.

Zero-ETL integration use case: Near real-time analytics for an ecommerce application

Suppose an ecommerce application built on Aurora MySQL-Compatible manages online orders, customer data, and product catalogs. To perform near real-time analytics with data filtering on transactional data to gain insights into customer behavior, sales trends, and inventory management without the overhead of building and maintaining multi-step ETL pipelines, you can use zero-ETL integrations for Amazon Redshift. Complete the following steps:

  1. Set up an Aurora MySQL cluster (it must run Aurora MySQL version 3.05, which is compatible with MySQL 8.0.32, or higher):
    1. Create an Aurora MySQL cluster in your desired AWS Region.
    2. Configure the cluster settings, such as the instance type, storage, and backup options.
  2. Create a zero-ETL integration with Amazon Redshift:
    1. On the Amazon RDS console, navigate to the Zero-ETL integrations page.
    2. Choose Create integration and select your Aurora MySQL cluster as the source.
    3. Choose an existing Redshift cluster or create a new cluster as the target.
    4. Provide a name for the integration and review the settings.
    5. Choose Create integration to initiate the zero-ETL integration process.
  3. Verify the integration status:
    1. After the integration is created, monitor the status on the Amazon RDS console or by querying the SVV_INTEGRATION and SYS_INTEGRATION_ACTIVITY system views in Amazon Redshift (see the status-check query following these steps).
    2. Wait for the integration to reach the Active state, indicating that data is being replicated from Aurora to Amazon Redshift.
  4. Create analytics views:
    1. Connect to your Redshift cluster using a SQL client or the query editor v2 on the Amazon Redshift console.
    2. Create views or materialized views that combine and transform the replicated data from Aurora for your analytics use cases:
CREATE MATERIALIZED VIEW orders_summary AS
SELECT
    o.order_id,
    o.customer_id,
    SUM(oi.quantity * oi.price) AS total_revenue,
    MAX(o.order_date) AS latest_order_date
FROM
    aurora_schema.orders o
    JOIN aurora_schema.order_items oi ON o.order_id = oi.order_id
GROUP BY
    o.order_id,
    o.customer_id;

  5. Query the views or materialized views in Amazon Redshift to perform near real-time analytics on the transactional data from your Aurora MySQL cluster:
SELECT
	customer_id,
	SUM(total_revenue) AS total_customer_revenue,
	MAX(latest_order_date) AS most_recent_order
FROM
	orders_summary
GROUP BY
	customer_id
ORDER BY
	total_customer_revenue DESC;
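
As referenced in step 3, a simple status check from the Amazon Redshift side is to query the system view directly (the exact columns returned may vary by release):

-- Lists zero-ETL integrations visible to this data warehouse and their status.
SELECT * FROM SVV_INTEGRATION;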

This implementation achieves near real-time analytics for an ecommerce application’s transactional data using the zero-ETL integration between Aurora MySQL-Compatible and Amazon Redshift. The data automatically replicates from Aurora to Amazon Redshift, eliminating the need for multi-step ETL pipelines and supporting insights from the latest data quickly.

To get started with Amazon Redshift zero-ETL integrations, see Working with zero-ETL integrations. To learn more about Aurora zero-ETL integrations with Amazon Redshift, see Amazon Aurora zero-ETL integrations with Amazon Redshift.

Integration for Apache Spark use case: Gaming player events written to Amazon S3

Consider a large volume of gaming player events stored in Amazon S3. The events require data transformation, cleansing, and preprocessing to extract insights, generate reports, or build ML models. In this case, you can use the scalability and processing power of Amazon EMR to perform the required data changes using Apache Spark. After it’s processed, the transformed data must be loaded into Amazon Redshift for further analysis, reporting, and integration with BI tools.

In this scenario, you can use the Amazon Redshift integration for Apache Spark to perform the necessary data transformations and load the processed data into Amazon Redshift. The following implementation example assumes gaming player events in Parquet format are stored in Amazon S3 (s3://<bucket_name>/player_events/).

  1. Launch an Amazon EMR cluster (emr-6.9.0) with Apache Spark (Spark 3.3.0) that includes support for the Amazon Redshift integration for Apache Spark.
  2. Configure the necessary IAM role for accessing Amazon S3 and Amazon Redshift.
  3. Add security group rules that allow the Amazon EMR cluster to connect to the Amazon Redshift provisioned cluster or serverless workgroup.
  4. Create a Spark job that sets up a connection to Amazon Redshift, reads data from Amazon S3, performs transformations, and writes resulting data to Amazon Redshift. See the following code:
from pyspark.sql import SparkSession
from pyspark.sql.functions import lit

def main():

	# Create a SparkSession
	spark = SparkSession.builder \
    		.appName("RedshiftSparkJob") \
    		.getOrCreate()

	# Set Amazon Redshift connection properties
	redshift_jdbc_url = "jdbc:redshift://<redshift-endpoint>:<port>/<database>"
	redshift_table = "<schema>.<table_name>"
	temp_s3_bucket = "s3://<bucket_name>/temp/"
	iam_role_arn = "<iam_role_arn>"

	# Read data from Amazon S3
	s3_data = spark.read.format("parquet") \
    		.load("s3://<bucket_name>/player_events/")

	# Perform transformations
	transformed_data = s3_data.withColumn("transformed_column", lit("transformed_value"))

	# Write the transformed data to Amazon Redshift
	transformed_data.write \
    		.format("io.github.spark_redshift_community.spark.redshift") \
    		.option("url", redshift_jdbc_url) \
    		.option("dbtable", redshift_table) \
    		.option("tempdir", temp_s3_bucket) \
    		.option("aws_iam_role", iam_role_arn) \
    		.mode("overwrite") \
    		.save()

if __name__ == "__main__":
    main()

In this example, you first import the necessary modules and create a SparkSession. Set the connection properties for Amazon Redshift, including the endpoint, port, database, schema, table name, temporary S3 bucket path, and the IAM role ARN for authentication. Read data from Amazon S3 in Parquet format using the spark.read.format("parquet").load() method. Perform a transformation on the Amazon S3 data by adding a new column transformed_column with a constant value using the withColumn method and the lit function. Write the transformed data to Amazon Redshift using the write method and the io.github.spark_redshift_community.spark.redshift format. Set the necessary options for the Redshift connection URL, table name, temporary S3 bucket path, and IAM role ARN. Use the mode("overwrite") option to overwrite the existing data in the Amazon Redshift table with the transformed data.
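
After the job completes, a quick check from a SQL client can confirm that the transformed data landed in Amazon Redshift; substitute the schema and table used in the Spark job:

-- Row count and a breakdown by the column added during the Spark transformation.
SELECT COUNT(*) AS loaded_rows FROM <schema>.<table_name>;

SELECT transformed_column, COUNT(*) AS events
FROM <schema>.<table_name>
GROUP BY transformed_column;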

To get started with Amazon Redshift integration for Apache Spark, see Amazon Redshift integration for Apache Spark. For more examples of using the Amazon Redshift integration for Apache Spark, see New – Amazon Redshift Integration with Apache Spark.

Streaming ingestion use case: IoT telemetry near real-time analysis

Imagine a fleet of IoT devices (sensors and industrial equipment) that generate a continuous stream of telemetry data such as temperature readings, pressure measurements, or operational metrics. To monitor the devices, detect anomalies, and make data-driven decisions, this data must be ingested and analyzed in near real time, which requires a streaming solution integrated with the Redshift data warehouse.

In this example, we use Amazon MSK as the streaming source for IoT telemetry data.

  1. Create an external schema in Amazon Redshift:
    1. Connect to an Amazon Redshift cluster using a SQL client or the query editor v2 on the Amazon Redshift console.
    2. Create an external schema that references the MSK cluster:
CREATE EXTERNAL SCHEMA kafka_schema
FROM MSK
IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftRoleForMSK'
AUTHENTICATION iam
CLUSTER_ARN 'arn:aws:kafka:us-east-1:123456789012:cluster/<cluster-name>/<cluster-uuid>';

  2. Create a materialized view in Amazon Redshift:
    1. Define a materialized view that maps the Kafka topic data to Amazon Redshift table columns.
    2. CAST the streaming message payload data type to the Amazon Redshift SUPER type.
    3. Set the materialized view to auto refresh.
CREATE MATERIALIZED VIEW iot_telemetry_view
AUTO REFRESH YES
AS SELECT
    kafka_partition,
    kafka_offset,
    kafka_timestamp_type,
    kafka_timestamp,
    CAST(kafka_value AS SUPER) payload
FROM kafka_schema."iot-telemetry-topic";

  3. Query the iot_telemetry_view materialized view to access the real-time IoT telemetry data ingested from the Kafka topic. The materialized view will automatically refresh as new data arrives in the Kafka topic.
SELECT
    kafka_timestamp,
    payload.device_id,
    payload.temperature,
    payload.pressure
FROM iot_telemetry_view;
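
Building on that view, the following hedged example (the device_id and temperature fields are assumed to exist in the payload) flags devices whose average temperature over the trailing hour exceeds a threshold:

-- Simple anomaly check over the most recent hour of telemetry.
SELECT
    payload.device_id::VARCHAR AS device_id,
    AVG(payload.temperature::FLOAT) AS avg_temperature
FROM iot_telemetry_view
WHERE kafka_timestamp > DATEADD(hour, -1, GETDATE())
GROUP BY 1
HAVING AVG(payload.temperature::FLOAT) > 80;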

With this implementation, you can achieve near real-time analytics on IoT device telemetry data using Amazon Redshift streaming ingestion. As telemetry data is received by an MSK topic, Amazon Redshift automatically ingests and reflects the data in a materialized view, supporting query and analysis of the data in near real time.

To get started with Amazon Redshift streaming ingestion, see Streaming ingestion to a materialized view. To learn more about streaming and customer use cases, see Amazon Redshift Streaming Ingestion.

Conclusion

This post detailed the options available for Amazon Redshift data ingestion. The choice of data ingestion method depends on factors such as the size and structure of the data, the need for real-time access or transformations, the data sources, existing infrastructure, ease of use, and user skill sets. Zero-ETL integrations and federated queries are suitable for simple data ingestion tasks or for joining data between operational databases and Amazon Redshift analytics data. Large-scale data ingestion with transformation and orchestration benefits from the Amazon Redshift integration for Apache Spark with Amazon EMR and AWS Glue. Bulk loading of data into Amazon Redshift, regardless of dataset size, fits perfectly with the capabilities of the Redshift COPY command. Streaming sources such as Kinesis Data Streams, Amazon MSK, and Data Firehose are ideal candidates for Amazon Redshift streaming ingestion.

Evaluate the features and guidance provided for your data ingestion workloads and let us know your feedback in the comments.


About the Authors

Steve Phillips is a senior technical account manager at AWS in the North America region. Steve has worked with games customers for eight years and currently focuses on data warehouse architectural design, data lakes, data ingestion pipelines, and cloud distributed architectures.

Sudipta Bagchi is a Sr. Specialist Solutions Architect at Amazon Web Services. He has over 14 years of experience in data and analytics, and helps customers design and build scalable and high-performant analytics solutions. Outside of work, he loves running, traveling, and playing cricket.
