Vector databases
A vector database stores and manages data in the form of vectors, which are numerical arrays of data points.
The use of vectors allows for complex queries and analyses, because vectors can be compared and analyzed using advanced techniques such as vector similarity search, quantization and clustering. Traditional databases aren't well-suited for handling the high-dimensional data that is becoming increasingly common in data analytics. However, vector databases are designed to handle high-dimensional data, such as text, images, and audio, by representing them as vectors. Vector databases are useful for tasks such as machine learning, natural language processing, and image recognition, where the goal is to identify patterns or similarities in large datasets.
This article gives some background about vector databases and explains conceptually how you can use an Eventhouse as a vector database in Real-Time Intelligence in Microsoft Fabric. For a practical example, see Tutorial: Use an Eventhouse as a vector database.
Key concepts
The following key concepts are used in vector databases:
Vector similarity
Vector similarity is a measure of how different (or similar) two or more vectors are. Vector similarity search is a technique used to find similar vectors in a dataset. Vectors are compared using a distance metric, such as Euclidean distance or cosine similarity. The closer two vectors are, the more similar they are.
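As a minimal sketch of the two metrics named above, the following Python snippet computes cosine similarity and Euclidean distance for small hand-written vectors. The vectors and function names are illustrative assumptions, not part of any Eventhouse API. Note that the two metrics read in opposite directions: a higher cosine similarity means more similar, while a lower Euclidean distance means more similar.

```python
import math

def dot(a, b):
    """Sum of element-wise products of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def cosine_similarity(a, b):
    """Cosine of the angle between a and b; 1.0 means same direction."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

def euclidean_distance(a, b):
    """Straight-line distance between a and b; 0.0 means identical."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Illustrative vectors (assumptions made for this sketch)
query = [1.0, 0.0, 1.0]
near = [2.0, 0.0, 2.0]   # same direction as query, different length
far = [0.0, 1.0, 0.0]    # orthogonal to query

print(cosine_similarity(query, near))   # approximately 1.0
print(cosine_similarity(query, far))    # 0.0
print(euclidean_distance(query, near))  # nonzero, although the direction matches
```

Cosine similarity ignores vector length, which is why `query` and `near` score as nearly identical even though their Euclidean distance is not zero.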
Embeddings
Embeddings are a common way of representing data in a vector format for use in vector databases. An embedding is a mathematical representation of a piece of data, such as a word, text document, or an image, that is designed to capture its semantic meaning. Embeddings are created using algorithms that analyze the data and generate a set of numerical values that represent its key features. For example, an embedding for a word might represent its meaning, its context, and its relationship to other words. The process of creating embeddings is straightforward. While they can be created using standard Python packages (for example, spaCy, sent2vec, Gensim), large language models (LLMs) generate the highest-quality embeddings for semantic text search. For example, you can send text to an embedding model in Azure OpenAI, and it generates a vector representation that can be stored for analysis. For more information, see Understand embeddings in Azure OpenAI Service.
General workflow
The general workflow for using a vector database is as follows:
- Embed data: Convert data into vector format using an embedding model. For example, you can embed text data using an OpenAI model.
- Store vectors: Store the embedded vectors in a vector database. You can send the embedded data to an Eventhouse to store and manage the vectors.
- Embed query: Convert the query data into vector format using the same embedding model used to embed the stored data.
- Query vectors: Use vector similarity search to find entries in the database that are similar to the query.
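The four steps above can be sketched end to end in plain Python. This is a toy illustration under stated assumptions: `embed` is a hypothetical stand-in that counts words from a small made-up vocabulary (a real system would call an embedding model), and a Python list stands in for the vector database table.

```python
import math
from collections import Counter

# Hypothetical toy vocabulary; a real embedding model learns its own features.
VOCAB = ["database", "vector", "image", "audio", "query", "cat"]

def embed(text):
    """Toy stand-in for an embedding model: counts of vocabulary words."""
    counts = Counter(text.lower().split())
    return [float(counts[word]) for word in VOCAB]

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

# Steps 1 and 2: embed the data and store the vectors alongside the originals.
documents = ["vector database query", "cat image", "audio query"]
store = [(doc, embed(doc)) for doc in documents]

def search(query, k=2):
    # Step 3: embed the query with the same function used for the stored data.
    query_vector = embed(query)
    # Step 4: rank stored vectors by similarity to the query vector.
    ranked = sorted(
        store,
        key=lambda row: cosine_similarity(query_vector, row[1]),
        reverse=True,
    )
    return [doc for doc, _ in ranked[:k]]

print(search("image of a cat", k=1))
```

Using the same `embed` function for both the stored data and the query matters: vectors produced by different models are not comparable.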
Eventhouse as a Vector Database
At the core of vector similarity search is the ability to store, index, and query vector data. Eventhouses provide a solution for handling and analyzing large volumes of data, particularly in scenarios requiring real-time analytics and exploration, making them an excellent choice for storing and searching vectors.
The following components enable the use of an Eventhouse as a vector database:
- The dynamic data type, which can store unstructured data such as arrays and property bags. This data type is recommended for storing vector values. You can further augment the vector value by storing metadata related to the original object as separate columns in your table.
- The encoding type Vector16, designed for storing vectors of floating-point numbers in 16-bit precision, which uses Bfloat16 instead of the default 64 bits. This encoding is recommended for storing ML vector embeddings because it reduces storage requirements by a factor of four and accelerates vector processing functions such as series_dot_product() and series_cosine_similarity() by orders of magnitude.
- The series_cosine_similarity function, which can perform vector similarity searches on top of the vectors stored in Eventhouse.
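To give a feel for the trade-off behind a 16-bit encoding, the following Python sketch emulates Bfloat16 by keeping only the top 16 bits of a value's 32-bit float representation, and compares storage for 64-bit and 16-bit coefficients. It is an illustration only: the truncation used here is a simplifying assumption (how Vector16 rounds values is not described in this article), and the 1536-dimension figure is a hypothetical example size.

```python
import struct

def to_bfloat16(value):
    """Emulate Bfloat16 by keeping the top 16 bits of the float32 form.

    Assumption: simple truncation; real encoders may round instead.
    """
    bits32 = struct.unpack(">I", struct.pack(">f", value))[0]
    bits16 = bits32 >> 16  # 1 sign bit + 8 exponent bits + 7 mantissa bits
    return struct.unpack(">f", struct.pack(">I", bits16 << 16))[0]

def storage_bytes(dimensions, bits_per_value):
    """Raw bytes needed for one vector, ignoring any storage overhead."""
    return dimensions * bits_per_value // 8

print(to_bfloat16(3.14159))      # 3.140625: only a few significant digits survive
print(storage_bytes(1536, 64))   # 12288 bytes per vector at 64 bits
print(storage_bytes(1536, 16))   # 3072 bytes per vector at 16 bits
```

The reduced precision is usually acceptable for similarity search, where the ranking of results matters more than the exact value of each coefficient.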
Optimize for scale
For more information on optimizing vector similarity search, read the blog.
To maximize performance and minimize the resulting search times, take the following steps:
- Set the encoding of the embeddings column to Vector16, the 16-bit encoding of the vector coefficients (instead of the default 64-bit).
- Store the embedding vectors table on all cluster nodes with at least one shard per processor, which is done by the following steps:
  - Limit the number of embedding vectors per shard by altering the ShardEngineMaxRowCount of the sharding policy. The sharding policy balances data on all nodes with multiple extents per node so the search can use all available processors.
  - Change the RowCountUpperBoundForMerge of the merging policy. The merge policy is needed to suppress merging extents after ingestion.
Example optimization steps
In the following example, a static vector table is defined for storing 1M vectors. The encoding policy is defined as Vector16, and the sharding and merging policies are set to optimize the table for vector similarity search. For this example, assume the cluster has 20 nodes, each with 16 processors. To get at least one shard per processor, the table's shards should contain at most 1000000/(20*16)=3125 rows.
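The same sizing arithmetic can be written as a small helper for other cluster shapes. This is a sketch with a hypothetical function name; it uses integer division so that the number of shards is never lower than the number of processors.

```python
def max_rows_per_shard(total_vectors, nodes, processors_per_node):
    """Upper bound on rows per shard so every processor gets at least one shard."""
    total_processors = nodes * processors_per_node
    # Round down so the shard count does not fall below the processor count.
    return max(1, total_vectors // total_processors)

# The example cluster: 1M vectors, 20 nodes, 16 processors per node.
print(max_rows_per_shard(1_000_000, 20, 16))  # 3125
```

The result is the value to use for both ShardEngineMaxRowCount and RowCountUpperBoundForMerge in this example.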
The following KQL commands are run one by one to create the empty table and set the required policies and encoding:
// This is a sample selection of columns, you can add more columns
.create table embedding_vectors(vector_id:long, vector:dynamic)

// Store the coefficients in 16 bits instead of 64 bits, accelerating calculation of dot product, suppress redundant indexing
.alter column embedding_vectors.vector policy encoding type = 'Vector16'

// Balance data on all nodes, with multiple extents per node so the search can use all processors
.alter-merge table embedding_vectors policy sharding '{ "ShardEngineMaxRowCount" : 3125 }'

// Suppress merging extents after ingestion
.alter-merge table embedding_vectors policy merge '{ "RowCountUpperBoundForMerge" : 3125 }'
Ingest the data to the table created and defined in the previous step.