Extract insights in a 30TB time series workload with Amazon OpenSearch Serverless

In today’s data-driven landscape, managing and analyzing vast amounts of data, especially logs, is crucial for organizations to derive insights and make informed decisions. However, handling large volumes of data while extracting insights is a significant challenge, prompting organizations to seek scalable solutions without the complexity of infrastructure management.

Amazon OpenSearch Serverless reduces the burden of manual infrastructure provisioning and scaling while still empowering you to ingest, analyze, and visualize your time-series data, simplifying data management and enabling you to derive actionable insights from data.

We recently announced a new capacity level of 30TB for time series data per account per AWS Region. The OpenSearch Serverless compute capacity for data ingestion and search/query is measured in OpenSearch Compute Units (OCUs), which are shared among collections that use the same AWS Key Management Service (AWS KMS) key. To accommodate larger datasets, OpenSearch Serverless now supports up to 500 OCUs per account per Region for indexing and 500 for search, more than double the previous limit of 200. You can configure the maximum OCU limits for search and indexing independently, which helps you manage costs effectively. You can also monitor real-time OCU usage with Amazon CloudWatch metrics to gain a better perspective on your workload’s resource consumption. With support for 30TB datasets, you can unlock valuable operational insights and make data-driven decisions to troubleshoot application downtime, improve system performance, or identify fraudulent activities.
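As a minimal sketch of how you might set these limits and check OCU usage with boto3 (the account settings parameters, metric names, and ClientId dimension are our understanding of the current APIs; verify them against the OpenSearch Serverless documentation):

```python
from datetime import datetime, timedelta, timezone

import boto3

# Raise the account-level OCU caps; search and indexing are set independently.
aoss = boto3.client("opensearchserverless")
aoss.update_account_settings(
    capacityLimits={
        "maxIndexingCapacityInOCU": 500,
        "maxSearchCapacityInOCU": 500,
    }
)

# Check recent OCU consumption. OpenSearch Serverless publishes account-level
# IndexingOCU and SearchOCU metrics in the AWS/AOSS namespace, keyed by account ID.
account_id = boto3.client("sts").get_caller_identity()["Account"]
cloudwatch = boto3.client("cloudwatch")
now = datetime.now(timezone.utc)
for metric in ("IndexingOCU", "SearchOCU"):
    stats = cloudwatch.get_metric_statistics(
        Namespace="AWS/AOSS",
        MetricName=metric,
        Dimensions=[{"Name": "ClientId", "Value": account_id}],
        StartTime=now - timedelta(hours=1),
        EndTime=now,
        Period=300,
        Statistics=["Maximum"],
    )
    print(metric, sorted(p["Maximum"] for p in stats["Datapoints"]))
```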

This post discusses how you can analyze 30TB time series datasets with OpenSearch Serverless.

Innovations and optimizations to support larger data sizes and faster responses

Sufficient disk, memory, and CPU resources are crucial for handling extensive data effectively and conducting thorough analysis. In time series collections, the OCU disk typically contains older shards that are not frequently accessed, referred to as warm shards. We have introduced a new feature called warm shard recovery prefetch, which actively monitors recently queried data blocks for a shard and prioritizes them during shard movements, such as shard balancing, vertical scaling, and deployment activities. More importantly, it accelerates auto scaling and provides faster readiness for varying search workloads, significantly improving the system’s performance. The results shared later in this post detail the improvements.

A few select customers worked with us on early adoption prior to general availability. In these trials, we observed up to a 66% improvement in warm query performance for some customer workloads. Additionally, we have enhanced the concurrency between coordinator and worker nodes, allowing more requests to be processed as the number of OCUs increases through auto scaling. This enhancement has resulted in up to a 10% improvement in query performance for both hot and warm queries.

We have also enhanced the system’s stability to handle time series collections of up to 30 TB effectively, and we continue to improve the auto-scaling system. These improvements comprise enhanced shard distribution for optimal placement after rollover, auto-scaling policies based on queue length, and a dynamic sharding strategy that adjusts shard count based on ingestion rate.

In the following sections, we share an example test setup of a 30TB workload that we used internally, detailing the data being used and generated, along with our observations and results. Performance may vary depending on your specific workload.

Ingest the data

You can use the load generation scripts shared in the following workshop, or you can use your own application or data generator to create load. You can run multiple instances of these scripts to generate a burst of indexing requests. As shown in the following screenshot, we tested with a single index, sending approximately 30 TB of data over a period of 15 days. We used our load generator script to send the traffic to that index, retaining data for 15 days using a data lifecycle policy.
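The 15-day retention can be expressed as an OpenSearch Serverless data lifecycle policy. The following is a minimal sketch using boto3; the collection name and index pattern are placeholders, and the policy syntax should be verified against the current documentation:

```python
import json

import boto3

aoss = boto3.client("opensearchserverless")

# Retain matching indexes for 15 days; OpenSearch Serverless deletes older data.
retention_rules = {
    "Rules": [
        {
            "ResourceType": "index",
            "Resource": ["index/my-timeseries-collection/logs-*"],  # placeholder names
            "MinIndexRetention": "15d",
        }
    ]
}

aoss.create_lifecycle_policy(
    name="logs-15d-retention",
    type="retention",
    policy=json.dumps(retention_rules),
)
```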

Test methodology

We set the deployment type to Enable redundancy, which replicates data across Availability Zones. With this configuration, roughly 12-24 hours of data stays in hot storage (OCU disk) and the rest in Amazon Simple Storage Service (Amazon S3). Based on our search performance targets and the preceding ingestion expectations, we set the maximum OCUs to 500 each for indexing and search.
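As a minimal sketch of that setup with boto3 (the collection name is a placeholder, and an encryption policy covering it must already exist), Enable redundancy corresponds to enabling standby replicas on the collection:

```python
import boto3

aoss = boto3.client("opensearchserverless")

# Create a time series collection with standby replicas so that data is
# replicated across Availability Zones (the console's Enable redundancy option).
aoss.create_collection(
    name="timeseries-30tb-test",   # placeholder collection name
    type="TIMESERIES",
    standbyReplicas="ENABLED",
)
```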

As part of the testing, we observed the auto-scaling behavior and graphed it. Indexing took around 8 hours to stabilize at 80 OCUs.

On the search side, it took around 2 days to stabilize at 80 OCUs.

Observations:

Ingestion

Ingestion performance was consistently over 2 TB per day.

Search

Queries were of two types, with time windows ranging from 15 minutes to 15 days. The first was an aggregation query:

{"aggs":{"1":{"cardinality":{"field":"carrier.keyword"}}},"size":0,"query":{"bool":{"filter":[{"range":{"@timestamp":{"gte":"now-15m","lte":"now"}}}]}}}

For example, with a 1-day time range:

{"aggs":{"1":{"cardinality":{"field":"carrier.keyword"}}},"size":0,"query":{"bool":{"filter":[{"range":{"@timestamp":{"gte":"now-1d","lte":"now"}}}]}}}

The following chart shows the percentile performance for the aggregation query.

The second query was a match query:

{"query":{"bool":{"filter":[{"range":{"@timestamp":{"gte":"now-15m","lte":"now"}}}],"should":[{"match":{"originState":"State"}}]}}}

For example, matching on a specific state:

{"query":{"bool":{"filter":[{"range":{"@timestamp":{"gte":"now-15m","lte":"now"}}}],"should":[{"match":{"originState":"California"}}]}}}

The following chart shows the percentile performance for the match query.

The following table summarizes the latency percentiles for both queries across the different time ranges.

| Time range | Query | P50 (ms) | P90 (ms) | P95 (ms) | P99 (ms) |
|---|---|---|---|---|---|
| 15 minutes | Aggregation (cardinality of carrier.keyword) | 325 | 403.867 | 441.917 | 514.75 |
| 1 hour | Aggregation (cardinality of carrier.keyword) | 1,061.66 | 1,397.27 | 1,482.75 | 1,719.53 |
| 4 hours | Aggregation (cardinality of carrier.keyword) | 3,870.79 | 5,233.73 | 5,609.9 | 6,506.22 |
| 1 day | Aggregation (cardinality of carrier.keyword) | 7,693.06 | 12,294 | 13,411.19 | 17,481.4 |
| 7 days | Aggregation (cardinality of carrier.keyword) | 5,395.68 | 17,538.12 | 19,159.18 | 22,462.32 |
| 1 year | Aggregation (cardinality of carrier.keyword) | 2,758.66 | 10,758 | 12,028 | 22,871.4 |
| 15 minutes | Match (originState: California) | 139 | 190 | 234.55 | 6,071.96 |
| 1 hour | Match (originState: Washington) | 259.167 | 305.8 | 343.3 | 1,125.66 |
| 4 hours | Match (originState: Washington) | 462.933 | 653.6 | 725.3 | 1,583.37 |
| 1 day | Match (originState: California) | 678.917 | 1,366.63 | 2,423 | 7,893.56 |
| 7 days | Match (originState: Washington) | 1,353 | 2,745.1 | 4,338.8 | 9,496.36 |
| 1 year | Match (originState: Washington) | 2,166.33 | 2,469.7 | 4,804.9 | 9,440.11 |

Conclusion

OpenSearch Serverless not only supports a larger data size than prior releases but also introduces performance improvements such as warm shard prefetch and concurrency optimization for better query response. These features reduce the latency of warm queries and improve auto scaling to handle varied workloads. We encourage you to put the 30TB support to the test: migrate your data, explore the improved throughput, and take advantage of the enhanced scaling capabilities.

To get started, refer to Log analytics the easy way with Amazon OpenSearch Serverless. To get hands-on experience with OpenSearch Serverless, follow the Getting started with Amazon OpenSearch Serverless workshop, which has a step-by-step guide for configuring and setting up an OpenSearch Serverless collection.

If you have feedback about this post, share it in the comments section. If you have questions about this post, start a new thread on the Amazon OpenSearch Service forum or contact AWS Support.


About the authors

Satish Nandi is a Senior Product Manager with Amazon OpenSearch Service. He is focused on OpenSearch Serverless and has years of experience in networking, security and AI/ML. He holds a Bachelor’s degree in Computer Science and an MBA in Entrepreneurship. In his free time, he likes to fly airplanes and hang gliders and ride his motorcycle.

Milav Shah is an Engineering Leader with Amazon OpenSearch Service. He focuses on the search experience for OpenSearch customers. He has extensive experience building highly scalable solutions in databases, real-time streaming, and distributed computing. He also possesses functional domain expertise in verticals like Internet of Things, fraud protection, gaming, and AI/ML. In his free time, he likes to cycle, hike, and play chess.

Qiaoxuan Xue is a Senior Software Engineer at AWS leading the search and benchmarking areas of the Amazon OpenSearch Serverless Project. His passion lies in finding solutions for intricate challenges within large-scale distributed systems. Outside of work, he enjoys woodworking, biking, playing basketball, and spending time with his family and dog.

Prashant Agrawal is a Sr. Search Specialist Solutions Architect with Amazon OpenSearch Service. He works closely with customers to help them migrate their workloads to the cloud and helps existing customers fine-tune their clusters to achieve better performance and save on cost. Before joining AWS, he helped various customers use OpenSearch and Elasticsearch for their search and log analytics use cases. When not working, you can find him traveling and exploring new places. In short, he likes doing Eat → Travel → Repeat.
