The rise of generative AI purposes has heightened the need to implement semantic search and pure language search. These superior search options assist discover and retrieve conceptually related paperwork from enterprise content material repositories to function prompts for generative AI fashions. Uncooked information inside numerous supply repositories within the type of textual content, photographs, audio, video, and so forth are transformed, with the assistance of embedding fashions, to a normal numerical illustration referred to as vectors that powers the semantic and pure language search. As organizations harness extra subtle massive language and foundational fashions to energy their generative AI purposes, supplemental embedding fashions are additionally evolving to deal with massive, high-dimension vector embedding. Because the vector quantity expands, there’s a proportional enhance in reminiscence utilization and computational necessities, leading to greater operational prices. To mitigate this situation, numerous compression strategies can be utilized to optimize reminiscence utilization and computational effectivity.
Quantization is a lossy information compression approach aimed to decrease computation and reminiscence utilization resulting in decrease prices, particularly for high-volume information workloads. There are numerous strategies to compress information relying on the sort and quantity of the information. The standard approach is to map infinite values (or a comparatively massive record of finites) to smaller extra discrete values. Vector compression will be achieved via two major strategies: product quantization and scalar quantization. Within the product quantization approach, the unique vector dimension array is damaged into a number of sub-vectors and every sub-vector is encoded into a set variety of bits that characterize the unique vector. This technique requires that you simply solely retailer and search throughout the encoded sub-vector as an alternative of the unique vector. In scalar quantization, every dimension of the enter vector is mapped from a 32-bit floating-point illustration to a smaller information kind.
Amazon OpenSearch Service, as a vector database, helps scalar and product quantization strategies to optimize reminiscence utilization and scale back operational prices.
OpenSearch as a vector database
OpenSearch is a distributed search and analytics service. The OpenSearch k-nearest neighbor (k-NN) plugin means that you can index, retailer, and search vectors. Vectors are saved in OpenSearch as a 32-bit float array of kind knn_vector and that helps as much as 16,000 dimensions per vector.
OpenSearch makes use of approximate nearest neighbor search to supply scalable vector search. The approximate k-NN algorithm retrieves outcomes based mostly on an estimation of the closest vectors to a given question vector. Two major strategies for performing approximate k-NN are the graph-based Hierarchical Navigable Small-World (HNSW) and the cluster-based Inverted File (IVF). These information constructions are constructed and loaded into reminiscence throughout the preliminary vector search operation. As vector quantity grows, each the information constructions and related reminiscence necessities for search operations scale proportionally.
For instance, every HNSW graph with 32-bit float information takes roughly 1.1 * (4 * d + 8 * m) * num_vectors bytes of reminiscence. Right here, num_vectors represents the entire amount of vectors to be listed, d is the variety of dimensions decided by the embedding mannequin you utilize to generate the vectors and m is the variety of edges within the HSNW graphs, an index parameter that may be managed to tune efficiency. Utilizing this system, reminiscence necessities for vector storage for a configuration of 384 dimensions and an m worth of 16 could be:
- 1 million vectors: 1.830 GB (1.1 * (4 * 384 + 8 * 16) * 1000,000 bytes)
- 1 billion vectors: 1830 GB (1.1 * (4 * 384 + 8 * 16) * 1,000,000,000 bytes)
Though approximate nearest neighbor search will be optimized to deal with huge datasets with billions of vectors effectively, the reminiscence necessities for loading 32-bit full-precision vectors to reminiscence throughout the search course of can change into prohibitively pricey. To mitigate this, OpenSearch service helps the next 4 quantization strategies.
- Binary quantization
- Byte quantization
- FP16 quantization
- Product quantization
These strategies fall throughout the broader class of scalar and product quantization that we mentioned earlier. On this publish, you’ll be taught quantization strategies for optimizing vector workloads on OpenSearch Service, specializing in reminiscence discount and cost-efficiency. It introduces the brand new disk-based vector search method that allows environment friendly querying of vectors saved on disk with out loading them into reminiscence. The strategy integrates seamlessly with quantization strategies, that includes key configurations such because the on_disk mode and compression_level parameter. These settings facilitate built-in, out-of-the-box scalar quantization on the time of indexing.
Binary quantization (as much as 32x compression)
Binary quantization (BQ) is a sort of scalar quantization. OpenSearch leverages FAISS engine’s binary quantization, enabling as much as 32x compression throughout indexing. This method reduces the vector dimension from the default 32-bit float to a 1-bit binary by compressing the vectors right into a 0s and 1s. OpenSearch helps indexing, storing and looking binary vectors. You may as well select to encode every vector dimension utilizing 1, 2, or 4 bits, relying upon the specified compression issue as proven within the instance under. The compression issue will be adjusted utilizing bits settings. A worth of two yields 16x compression, whereas 4 leads to 8x compression. The default setting is 1. In binary quantization, the coaching is dealt with natively on the time of indexing, permitting you to keep away from an extra preprocessing step.
To implement binary quantization, outline the vector kind as knn_vector and specify the encoder identify as binary with the specified variety of encoding bits. Observe, the encoder parameter refers to a way used to compress vector information earlier than storing it within the index. Optimize efficiency by utilizing space_type, m, and ef_construction parameters. See the OpenSearch documentation for details about the underlying configuration of the approximate k-NN.
Reminiscence necessities for implementing binary quantization with FAISS-HNSW:
1.1 * (bits * (d/8)+ 8 * m) * num_vectors bytes.
| Compression | Encoding bits |
Reminiscence required for 1 billion vector with d=384 and m=16 (in GB) |
| 32x | 1 | 193.6 |
| 16x | 2 | 246.4 |
| 8x | 4 | 352.0 |
For detailed implementation steps on binary quantization, see the OpenSearch documentation.
Byte-quantization (4x compression)
Byte quantization compresses 32-bit floating-point dimensions to 8-bit integers, starting from –128 to +127, lowering reminiscence utilization by 75%. OpenSearch helps indexing, storing, and looking byte vectors, which should be transformed to 8-bit format previous to ingestion. To implement byte vectors, specify the k-NN vector subject data_type as byte within the index mapping. This function is suitable with each Lucene and FAISS engines. An instance of making an index for byte-quantized vectors follows.
This technique requires ingesting a byte-quantized vector into OpenSearch for direct storage within the k-NN vector subject (of byte kind). Nonetheless, the lately launched disk-based vector search function eliminates the necessity for exterior vector quantization. This function shall be mentioned intimately later on this weblog.
Reminiscence necessities for implementing byte quantization with FAISS-HNSW:
1.1 * (1 * d + 8 * m) * num_vectors bytes.
For detailed implementation steps, see to the OpenSearch documentation. For efficiency metrics concerning accuracy, throughput, and latency, see Byte-quantized vectors in OpenSearch.
FAISS FP16 quantization (2x compression)
FP16 quantization is a way that makes use of 16-bit floating-point scalar illustration, lowering the reminiscence utilization by 50%. Every vector dimension is transformed from 32-bit to 16-bit floating-point, successfully halving the reminiscence necessities. The compressed vector dimensions should be within the vary [–65504.0, 65504.0]. To implement FP16 quantization, create the index with the k-NN vector subject and configure the next:
- Set k-NN vector subject technique and engine to HNSW and FAISS, respectively.
- Outline encoder parameter and set
identifytosqandkindtofp16.
Upon importing 32-bit floating-point vectors to OpenSearch, the scalar quantization FP16 (SQfp16) robotically quantizes them to 16-bit floating-point vectors throughout ingestion and shops them within the vector subject. The next instance demonstrates the creation of the index for quantizing and storing FP16-quantized vectors.
Reminiscence necessities for implementing FP16 quantization with FAISS-HNSW:
(1.1 * (2 * d + 8 * m) * num_vectors) bytes.
The previous FP16 instance introduces an elective Boolean parameter referred to as clip, which defaults to false. When false, vectors with out-of-range values (values not between –65504.0 and +65504.0) are rejected. Setting clip to true allows rounding of out-of-range vector values to suit throughout the supported vary. For detailed implementation steps, see the OpenSearch documentation. For efficiency metrics concerning accuracy, throughput, and latency, see Optimizing OpenSearch with Faiss FP16 scalar quantization: Enhancing reminiscence effectivity and cost-effectiveness.
Product quantization
Product quantization (PQ) is a complicated dimension-reduction approach that provides considerably greater ranges of compression. Whereas standard scalar quantization strategies sometimes obtain as much as 32x compression, PQ can present compression ranges of as much as 64x, making it a extra environment friendly answer for optimizing storage and value. OpenSearch helps PQ with each IVF and HNSW technique from FAISS engine. Product quantization partitions vectors into m sub-vectors, every encoded with a bit rely decided by the code dimension. The ensuing vector’s reminiscence footprint is m * code_size bits.
FAISS product quantization includes three key steps:
- Create and populate a coaching index to construct the PQ mannequin, optimizing for accuracy.
- Execute the
_trainAPI on the coaching index to generate the quantizer mannequin. - Assemble the vector index, configuring the kNN subject to make use of the ready quantizer mannequin.
The next instance demonstrates the three steps to establishing product quantization.
Step1: Create the coaching index. Populate the coaching index with an acceptable dataset, ensuring of dimensional alignment with train-index specs. Observe that the coaching index requires a minimal of 256 paperwork.
Step2: Create a quantizer mannequin referred to as my-model by operating the _train API on the coaching index you simply created. Observe that the encoder with identify outlined as pq facilitates native vector quantization. Different parameters for encoder embody code_size and m. FAISS-HNSW requires a code_size of 8 and a coaching dataset of at the least 256 (2^code_size) paperwork. For detailed parameter specs, see the PQ parameter reference.
Step3: Map the quantizer mannequin to your vector index.
Ingest the whole dataset into the newly created index, my-vector-index. The encoder will robotically course of the incoming vectors, making use of encoding and quantization based mostly on the compression parameters (code_size and m) specified within the quantizer mannequin configuration.
Reminiscence necessities for implementing product quantization with FAISS-HNSW:
1.1*(((code_size / 8) * m + 24 + 8 * m) * num_vectors bytes. Right here the code_size and m are parameters throughout the encoder parameter, num_vectors are the entire variety of vectors.
Throughout quantization, every of the coaching vectors is damaged right down to a number of sub-vectors or sub-spaces, outlined by a configurable worth m. The variety of bits to encode every of the sub-vector is managed by parameter code_size. Every of the sub-vectors is then compressed or quantized individually by operating the k-means clustering with the worth okay outlined as 2^code_size. On this approach, the vector is compressed roughly by m * code_size bits.
For detailed implementation tips and understanding of the configurable parameters throughout product quantization, see the OpenSearch documentation. For efficiency metrics concerning accuracy, throughput and latency utilizing FAISS IVF for PQ, see Select the k-NN algorithm to your billion-scale use case with OpenSearch.
Disk-based vector search
Disk-based vector search optimizes question effectivity by utilizing compressed vectors in reminiscence whereas sustaining full-precision vectors on disk. This method allows OpenSearch to carry out searches throughout massive vector datasets with out the necessity to load whole vectors into reminiscence, thus enhancing scalability and useful resource utilization. Implementation is achieved via two new configurations at index creation: mode and compression degree. As of OpenSearch 2.17, the mode parameter will be set to both in_memory or on_disk throughout indexing. The beforehand mentioned strategies default to an in-memory mode. On this configuration, the vector index is constructed utilizing both a graph (HNSW) or bucket (IVF) construction, which is then loaded into native reminiscence throughout search operations. Whereas providing wonderful recall, this method may impression reminiscence utilization, and scalability for prime quantity vector workload.
The on_disk mode optimizes vector search effectivity by storing full-precision vectors on disk whereas utilizing real-time, native quantization throughout indexing. Coupled with adjustable compression ranges, this method permits solely compressed vectors to be loaded into reminiscence, thereby enhancing reminiscence and useful resource utilization and search efficiency. The next compression ranges correspond to numerous scalar quantization strategies mentioned earlier.
- 32x: Binary quantization (1-bit dimensions)
- 4x: Byte and integer quantization (8-bit dimensions)
- 2x: FP16 quantization (16-bit dimensions)
This technique additionally helps different compression ranges resembling 16x and 8x that aren’t out there with the in-memory mode. To allow disk-based vector search, create the index with mode set to on_disk as proven within the following instance.
Configuring simply the mode as on_disk employs the default configuration, which makes use of the FAISS engine and HNSW technique with a 32x compression degree (1-bit, binary quantization). The ef_construction to optimize index time latency defaults to 100. For extra granular fine-tuning, you may override these k-NN parameters as proven within the instance that follows.
As a result of quantization is a lossy compression approach, greater compression ranges sometimes end in decrease recall. To enhance recall throughout quantization, you may configure the disk-based vector search to run in two phases utilizing the search time configuration parameter ef_search and the oversample_factor as proven within the following instance.
Within the first part, oversample_factor * okay outcomes are retrieved from the quantized vectors in reminiscence and the scores are approximated. Within the second part, the full-precision vectors of these oversample_factor * okay outcomes are loaded into reminiscence from disk, and scores are recomputed in opposition to the full-precision question vector. The outcomes are then lowered to the highest okay.
The oversample_factor for rescoring is set by the configured dimension and compression degree at indexing. For dimensions under 1,000, the issue is mounted at 5. For dimensions exceeding 1,000, the default issue varies based mostly on the compression degree, as proven within the following desk.
| Compression degree | Default oversample_factor for rescoring |
| 32x (default) | 3 |
| 16x | 2 |
| 8x | 2 |
| 4x | No default rescoring |
| 2x | No default rescoring |
As beforehand mentioned, the oversample_factor will be dynamically adjusted at search time. This worth presents a essential trade-off between accuracy and search effectivity. Whereas a better issue improves accuracy, it proportionally will increase reminiscence utilization and reduces search throughput. See the OpenSearch documentation to be taught extra about disk-based vector search and perceive the proper utilization for oversample_factor.
Efficiency evaluation of quantization strategies: Reviewing reminiscence, recall, and question latency.
The OpenSearch documentation on approximate k-NN search offers a place to begin for implementing vector similarity search. Moreover, Select the k-NN algorithm to your billion-scale use case with OpenSearch provides useful insights into designing environment friendly vector workloads for dealing with billions of vectors in manufacturing environments. It introduces product quantization strategies as a possible answer to scale back reminiscence necessities and related prices by cutting down the reminiscence footprint.
The next desk illustrates the reminiscence necessities for storing and looking via 1 billion vectors utilizing numerous quantization strategies. The desk compares the default reminiscence consumption of full-precision vector utilizing the HNSW technique in opposition to reminiscence consumed by quantized vectors. The mannequin employed on this evaluation is the sentence-transformers/all-MiniLM-L12-v2, which operates with 384 dimensions. The uncooked metadata is assumed to be no more than 100Gb.
| With out quantization (in GB) |
Product quantization (in GB) |
Scalar quantization (in GB) |
|||
| FP16 vectors | Byte vectors | Binary vectors | |||
| m worth | 16 | 16 | 16 | 16 | 16 |
| pq_m, code_size | – | 16, 8 | – | – | – |
| Native reminiscence consumption (GB) | 1830.4 | 184.8 | 985.6 | 563.2 | 193.6 |
| Complete storage = 100 GB+vector |
1930.4 | 284.8 | 1085.6 | 663.2 | 293.6 |
Reviewing the previous desk reveals that for a dataset comprising 1 billion vectors, the HNSW graph with 32-bit full-precision vector requires roughly 1830 GB of reminiscence. Compression strategies resembling product quantization can scale back this to 184.8 GB, whereas scalar quantization provides various ranges of compression. The next desk summarizes the correlation between compression strategies and their impression on key efficiency indicators together with price financial savings, recall price, and question latency. This evaluation builds upon our earlier evaluation of reminiscence utilization to help in choosing compression approach that meets your requirement.
The desk presents two key search metrics: search latency on the ninetieth percentile (p90) and recall at 100.
- Search latency @p90 signifies that 90% of search queries shall be accomplished inside that particular latency time.
- recall@100 – The fraction of the highest 100 floor reality neighbors discovered within the 100 outcomes returned.
| With out quantization (in GB) |
Product quantization (in GB) |
Scalar quantization (in GB) |
|||
| FP16 quantization [mode=in_memory] |
Byte quantization [mode=in_memory] |
Binary quantization [mode=on_disk] |
|||
| Preconditions/Datasets | Relevant to all datasets | Recall is determined by the character of the coaching information | Works for dimension worth in vary [-65536 to 65535] |
Works for dimension worth in vary [-128 to 127] |
Works nicely for bigger dimensions >=768 |
| Preprocessing required? | No | Sure, preprocessing/coaching is required |
No | No | No |
| Rescoring | No | No | No | No | Sure |
| Recall @100 | >= 0.99 | >0.7 | >=0.95 | >=0.95 | >=0.90 |
| p90 question latency (ms) | |||||
| Value (baseline $X) |
$X | $0.1*X (as much as 90% financial savings) |
$0.5*X (as much as 50% financial savings) |
$0.25*X (as much as 75%) |
$0.15*X (as much as 85% financial savings) |
| Pattern price for a billion vector | $20,923.14 | $2,092.31 | $10,461.57 | $5,230.79 | $3,138.47 |
The pattern price estimate for billion vector relies on a configuration optimized for price. Please notice that precise financial savings might fluctuate based mostly in your particular workload necessities and chosen configuration parameters. Notably within the desk, product quantization provides as much as 90% price discount in comparison with the baseline HNSW graph-based vector search price ($X). Scalar quantization equally yields proportional price financial savings, starting from 50% to 85% relative to the compressed reminiscence footprint. The selection of compression approach includes balancing cost-effectiveness, accuracy, and efficiency, because it impacts precision and latency.
Conclusion
By leveraging OpenSearch’s quantization strategies, organizations could make knowledgeable tradeoffs between price effectivity, efficiency, and recall, empowering them to fine-tune their vector database operations for optimum outcomes. These quantization strategies considerably scale back reminiscence necessities, enhance question effectivity and provide built-in encoders for seamless compression. Whether or not you’re coping with large-scale textual content embeddings, picture options, or another high-dimensional information, OpenSearch’s quantization strategies provide environment friendly options for vector search necessities, enabling the event of cost-effective, scalable, and high-performance methods.
As you progress ahead along with your vector database tasks, we encourage you to:
- Discover OpenSearch’s compression strategies in-depth
- Consider applicability of the proper approach to your particular use case
- Decide the suitable compression ranges based mostly in your necessities for recall and search latency
- Measure and examine price financial savings based mostly on accuracy, throughput, and latency
Keep knowledgeable concerning the newest developments on this quickly evolving subject, and don’t hesitate to experiment with completely different quantization strategies to seek out the optimum stability between price, efficiency, and accuracy to your purposes.
Concerning the Authors
Aruna Govindaraju is an Amazon OpenSearch Specialist Options Architect and has labored with many business and open-source engines like google. She is obsessed with search, relevancy, and consumer expertise. Her experience with correlating end-user indicators with search engine conduct has helped many shoppers enhance their search expertise.
Vamshi Vijay Nakkirtha is a software program engineering supervisor engaged on the OpenSearch Undertaking and Amazon OpenSearch Service. His major pursuits embody distributed methods. He’s an lively contributor to numerous OpenSearch tasks resembling k-NN, Geospatial, and dashboard-maps.
