Build billion-scale vector databases in under an hour with GPU acceleration on Amazon OpenSearch Service


AWS recently announced the general availability of GPU-accelerated vector (k-NN) indexing on Amazon OpenSearch Service. You can now build billion-scale vector databases in under an hour and index vectors up to 10 times faster at a quarter of the cost. This feature dynamically attaches serverless GPUs to boost domains and collections running CPU-based instances. With this feature, you can scale AI apps rapidly, innovate faster, and run vector workloads leaner.

In this post, we discuss the benefits of GPU-accelerated vector indexing, explore key use cases, and share performance benchmarks.

Overview of vector search and vector indexes

Vector search is a technique that improves search relevance, and is a cornerstone of generative AI applications. It involves using an embedding model to convert content into numerical encodings (vectors), enabling content matching by semantic similarity instead of just keywords. You can build vector databases by ingesting vectors into OpenSearch Service to build indexes that enable searches across billions of vectors in milliseconds.
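As a minimal illustration of semantic matching, the following sketch scores a query against two documents with cosine similarity. The three-dimensional vectors are hypothetical stand-ins; a real embedding model would produce high-dimensional vectors (for example, 1,024 dimensions).

```python
import math

def cosine_similarity(a, b):
    # Cosine similarity: dot(a, b) / (|a| * |b|)
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Hypothetical embeddings for a query and two documents. A semantically
# related document scores higher even without shared keywords.
query = [0.9, 0.1, 0.2]        # "How do I return my order?"
doc_refunds = [0.8, 0.2, 0.1]  # "Refund and return policy"
doc_hours = [0.1, 0.9, 0.7]    # "Store opening hours"

print(cosine_similarity(query, doc_refunds) > cosine_similarity(query, doc_hours))  # True
```

A vector database builds an index (such as an HNSW graph) over millions or billions of such vectors so that this comparison does not have to be computed exhaustively at query time.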

Challenges with scaling vector databases

Customers are increasingly scaling vector databases to multi-billion scale on OpenSearch Service to power generative AI applications, product catalogs, knowledge bases, and more. Applications are becoming increasingly agentic, integrating AI agents that rely on vector databases for high-quality search results across enterprise data sources to enable chat-based interactions and automation.

However, there are challenges on the way to billion scale. First, multi-million to billion-scale vector indexes take hours to days to build. These indexes use algorithms like Hierarchical Navigable Small Worlds (HNSW) to enable high-quality, millisecond searches at scale. However, they require more compute power than traditional indexes to build. Furthermore, you must rebuild your indexes whenever your model changes, such as when switching between vendors or versions, or after fine-tuning. Some use cases, such as personalized search, require models to be fine-tuned daily to adapt to evolving user behaviors. All vectors must be regenerated when the model changes, so the index must be rebuilt. HNSW can also degrade following significant updates and deletes, so indexes must be rebuilt to regain accuracy.

Finally, as your agentic applications become more dynamic, your vector database must scale for heavy streaming ingestion, updates, and deletes while maintaining low search latency. If search and indexing use the same infrastructure, these intensive processes will compete for limited compute and RAM, so search latency can degrade.

Solution overview

You can overcome these challenges by enabling GPU-accelerated indexing on OpenSearch Service 3.1+ domains or collections. GPU acceleration activates dynamically, for example, in response to a reindex command on a million-plus-size index. During activation, indexing tasks are offloaded to GPU servers that run NVIDIA cuVS to build HNSW graphs. Superior speed and efficiency are achieved through parallelization of vector operations. Inverted indexes continue using your cluster's CPUs for indexing and search on non-vector data. These indexes operate alongside HNSW to support keyword, hybrid, and filtered vector search. The resources required to build inverted indexes are low compared to HNSW.

GPU acceleration is enabled as a cluster-level configuration, but it can be disabled on individual indexes. This feature is serverless, so you don't need to manage GPU instances. You simply pay per use through OpenSearch Compute Units (OCUs).

The following diagram illustrates how this feature works.

The workflow consists of the following steps:

  1. You write vectors into your domain or collection using the existing APIs: bulk, reindex, index, update, delete, and force merge.
  2. GPU acceleration is activated when the indexed vector data surpasses a configured threshold within a refresh interval.
  3. This leads to a secure, single-tenant assignment of GPU servers to your cluster from a multi-tenant warm pool of GPUs managed by OpenSearch Service.
  4. Within milliseconds, OpenSearch Service initiates and offloads HNSW operations.
  5. When the write volume falls below the threshold, GPU servers are scaled down and returned to the warm pool.

This automation is fully managed. You only pay for acceleration time, which you can monitor from Amazon CloudWatch.
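Step 1 above uses the standard OpenSearch APIs. As a minimal sketch, the following pure-Python helper builds a `knn_vector` index body and a newline-delimited `_bulk` payload; the index name, field names, and tiny 4-dimension vectors are illustrative only. With the opensearch-py client, you would send these via `client.indices.create(...)` and `client.bulk(body=payload)`.

```python
import json

# Illustrative index body for an HNSW-backed k-NN index. Real embedding
# models produce much higher dimensions (e.g., 1,024).
index_body = {
    "settings": {"index.knn": True},
    "mappings": {
        "properties": {
            "embedding": {"type": "knn_vector", "dimension": 4},
            "title": {"type": "text"},
        }
    },
}

def to_bulk_ndjson(index_name, docs):
    """Serialize (id, document) pairs into the newline-delimited _bulk format."""
    lines = []
    for doc_id, doc in docs:
        lines.append(json.dumps({"index": {"_index": index_name, "_id": doc_id}}))
        lines.append(json.dumps(doc))
    return "\n".join(lines) + "\n"

docs = [
    ("1", {"title": "doc one", "embedding": [0.1, 0.2, 0.3, 0.4]}),
    ("2", {"title": "doc two", "embedding": [0.4, 0.3, 0.2, 0.1]}),
]
payload = to_bulk_ndjson("my-vectors", docs)
print(payload)
```

Once the stream of bulk writes crosses the activation threshold, the HNSW graph construction for these vectors is offloaded to the attached GPU servers; the client code does not change.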

This feature isn't just designed for ease of use. It enables GPU acceleration benefits without economic challenges. For example, a domain sized to host 1 billion (1,024-dimension) vectors compressed 32 times (using binary quantization) takes three r8g.12xlarge.search instances to provide the required 1.15 TB of RAM. A design that requires running a domain on GPU instances would need six g6.12xlarge instances to do the same, resulting in 2.4 times higher cost and excessive GPUs. This solution delivers efficiency by providing the right amount of GPUs only when you need them, so you gain speed with cost savings.
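The sizing arithmetic behind that comparison can be sketched as follows. The instance RAM figures (384 GiB for r8g.12xlarge, 192 GiB for g6.12xlarge) are from published Amazon EC2 specs; the 2.4x cost figure comes from the pricing comparison above, not from this sketch.

```python
# RAM per instance, from published EC2 specs (GiB).
R8G_12XL_RAM_GIB = 384  # r8g.12xlarge.search (CPU, Graviton4)
G6_12XL_RAM_GIB = 192   # g6.12xlarge (GPU, 4x NVIDIA L4)

cpu_fleet_ram = 3 * R8G_12XL_RAM_GIB  # CPU-based domain in the example
gpu_fleet_ram = 6 * G6_12XL_RAM_GIB   # GPU-instance alternative

# Both fleets provide the same ~1.15 TB of RAM, but the GPU fleet needs
# twice as many instances (and their GPUs sit mostly idle between builds).
print(cpu_fleet_ram, gpu_fleet_ram)    # 1152 1152
print(round(cpu_fleet_ram / 1000, 2))  # 1.15
```

This is why attaching GPUs on demand beats provisioning a permanently GPU-based domain: the RAM requirement, not GPU count, drives the fleet size.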

Use cases and benefits

This feature has three primary uses and benefits:

  • Build large-scale indexes faster, increasing productivity and innovation speed
  • Reduce cost by lowering Amazon OpenSearch Serverless indexing OCU usage, or by downsizing domains with write-heavy vector workloads
  • Accelerate writes, lower search latency, and improve user experience for your dynamic AI applications

In the following sections, we discuss these use cases in more detail.

Build large-scale indexes faster

We benchmarked index builds for 1M, 10M, 113M, and 1B vector test cases to demonstrate speed gains on both domains and collections. Speed gains ranged from 6.4 to 13.8 times faster. These tests were performed with production configurations (Multi-AZ with replication) and default GPU service limits. All tests were run on right-sized search clusters, and the CPU-only tests had CPU utilization maxed solely for indexing. The following chart illustrates the relative speed gains from GPU acceleration on managed domains.

The total index build time on domains includes a force merge to optimize the underlying storage engine for search performance. During normal operation, merges are automatic. However, when benchmarking domains, we perform a manual merge after indexing to ensure the merging impact is consistent across tests. The following table summarizes the index build benchmarks and dataset references for domains.

We ran the same performance tests on collections. The performance is different on OpenSearch Serverless because its serverless architecture involves performance trade-offs such as automatic scaling, which introduces a ramp-up to reach peak performance. The following table summarizes these results.

OpenSearch Serverless doesn't support force merge, so the full benefit from GPU acceleration might be delayed until the automatic background merges complete. The default minimum OCUs had to be increased for tests beyond 1 million vectors to handle higher indexing throughput.

Reduce cost

Our serverless GPU design uniquely delivers speed gains and cost savings. With OpenSearch Serverless, your net indexing costs will be reduced if you have indexing workloads that are significant enough to activate GPU acceleration. The following table presents the OCU usage and cost consumption from the previous index build tests.

The vector acceleration OCUs offload and reduce indexing OCUs. The total OCU usage is lower with GPU because the index is built more efficiently, resulting in cost savings.

With managed domains, cost savings are situational because search and indexing infrastructure isn't decoupled like on OpenSearch Serverless. However, if you have a write-heavy, compute-bound vector search application (that is, your domain is sized for vCPUs to sustain write throughput), you could downsize your domain.

The following benchmarks demonstrate the efficiency gains from GPU acceleration. We measure the infrastructure costs during the indexing tasks. GPU acceleration has the additional cost of GPUs at $0.24 per OCU-hour. However, because indexes are built faster and more efficiently, it's more economical to use GPU acceleration to reduce CPU utilization on your domain and downsize it.
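To make the pay-per-use pricing concrete, here is a back-of-the-envelope sketch. Only the $0.24 per OCU-hour GPU rate comes from the text above; the OCU count and build duration are hypothetical values chosen for illustration.

```python
# Illustrative acceleration cost. Only the rate is from the post; the
# OCU count and duration below are hypothetical.
GPU_RATE_PER_OCU_HOUR = 0.24

ocus = 8            # hypothetical GPU OCUs attached during the build
build_hours = 0.75  # hypothetical accelerated build time

gpu_cost = ocus * build_hours * GPU_RATE_PER_OCU_HOUR
print(f"${gpu_cost:.2f}")  # $1.44
```

Because GPUs detach when the write volume drops below the activation threshold, this charge accrues only while builds are actually running, which is what makes downsizing the always-on CPU fleet the dominant savings lever.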

*Domains are running a high-availability configuration without any cost optimizations

Accelerate writes, lower search latency

In experienced hands, domains offer operational control and the ability to achieve great scalability, performance, and cost optimizations. However, operational responsibilities include managing indexing and search workloads on shared infrastructure. If your vector deployment involves heavy, sustained streaming ingestion, updates, and deletes, you might observe higher search times on your domain. As illustrated in the following chart, as you increase vector writes, CPU utilization increases to support HNSW graph building. Concurrent search latency also increases because of competition for compute and RAM resources.

You could solve the problem by adding data nodes to increase your domain's compute capacity. However, enabling GPU acceleration is simpler and cheaper. As illustrated in the chart, GPU acceleration frees up CPU and RAM on your domain, helping you sustain low and stable search latency under high write throughput.

Get started

Ready to get started? If you already have an OpenSearch Service vector deployment, use the AWS Management Console, AWS Command Line Interface (AWS CLI), or API to enable GPU acceleration on your OpenSearch 3.1+ domain or vector collection. Test it with your existing indexing workloads. If you're planning to build a new vector database, try out our new vector ingestion feature, which simplifies vector ingestion and indexing and automates optimizations. Check out this demonstration on YouTube.


Acknowledgments

The authors would like to thank Manas Singh, Nathan Stephens, Jiahong Liu, Ben Gardner, and Zack Meeks from NVIDIA, and Yigit Kiran and Jay Deng from AWS for their contributions to this post.

About the authors


Dylan Tong

Dylan is a Senior Product Manager at Amazon Web Services. He leads the product initiatives for AI and machine learning (ML) on OpenSearch, including OpenSearch's vector database capabilities. Dylan has many years of experience working directly with customers and creating products and solutions in the database, analytics, and AI/ML space. Dylan holds a BSc and MEng degree in Computer Science from Cornell University.

Vamshi Vijay Nakkirtha

Vamshi is a software engineering manager working on the OpenSearch Project and Amazon OpenSearch Service. His primary interests include distributed systems.

Navneet Verma

Navneet is a Principal Software Engineer at AWS working on core vector search in OpenSearch.

Aruna Govindaraju

Aruna is an Amazon OpenSearch Specialist Solutions Architect and has worked with many commercial and open-source search engines. She is passionate about search, relevancy, and user experience. Her expertise with correlating end-user signals with search engine behavior has helped many customers improve their search experience.

Corey Nolet

Corey is a principal architect for vector search, data mining, and classical ML libraries at NVIDIA, where he focuses on building and scaling algorithms to support extreme data loads at light speed. Prior to joining NVIDIA in 2018, Corey spent many years building massive-scale exploratory data science and real-time analytics platforms for big data and HPC environments in the defense industry. Corey holds BS and MS degrees in Computer Science. He is also completing his PhD in the same discipline, focusing on accelerating algorithms at the intersection of graph and machine learning. Corey has a passion for using data to make better sense of the world.

Kshitiz Gupta

Kshitiz is a Solutions Architect at NVIDIA. He enjoys educating cloud customers about the GPU AI technologies NVIDIA has to offer and assisting them with accelerating their machine learning and deep learning applications. Outside of work, he enjoys running, hiking, and wildlife watching.
