Tutorial: Use an Eventhouse as a vector database
In this tutorial, you learn how to use an Eventhouse as a vector database to store and query vector data in Real-Time Intelligence. For general information about vector databases, see Vector databases.
The scenario uses semantic search on Wikipedia pages to find pages with common themes. You use a sample dataset that includes vectors for tens of thousands of Wikipedia pages, each embedded with an OpenAI model to produce a vector per page. The vectors, along with pertinent metadata for each page, are stored in an Eventhouse. You can use this dataset to find pages that are similar to each other, or to find pages that match a theme you're interested in. For example, say you want to look up "famous female scientists of the 19th century." You encode this phrase using the same OpenAI model, and then run a vector similarity search over the stored Wikipedia page data to find the pages with the highest semantic similarity.
Specifically, in this tutorial you will:
- Prepare a table in the Eventhouse with Vector16 encoding for the vector columns.
- Store vector data from a pre-embedded dataset to an Eventhouse.
- Embed a natural language query using the Azure OpenAI model.
- Use the series_cosine_similarity KQL function to calculate the similarities between the query embedding vector and those of the wiki pages.
- View rows of the highest similarity to get the wiki pages that are most relevant to your search query.
This flow can be visualized as follows:
Prerequisites
- A workspace with a Microsoft Fabric-enabled capacity
- An eventhouse in your workspace
- An Azure OpenAI resource with the text-embedding-ada-002 (Version 2) model deployed. This model is currently only available in certain regions. For more information, see Create a resource.
- Make sure that local authentication is enabled on your Azure OpenAI resource.
- Download the sample notebook from the GitHub repository
Note
While this tutorial uses Azure OpenAI, you can use any embedding model provider to generate vectors for text data.
Prepare your Eventhouse environment
In this setup step, you create a table in an Eventhouse with the necessary columns and encoding policies to store the vector data.
Browse to your workspace homepage in Real-Time Intelligence.
Select the Eventhouse you created in the prerequisites.
Select the target database where you want to store the vector data. If you don't have a database, you can create one by selecting Add database.
Select Explore my data. Copy/paste the following KQL query to create a table with the necessary columns:
.create table Wiki (id:string,url:string,['title']:string,text:string,title_vector:dynamic,content_vector:dynamic,vector_id:long)
Copy/paste the following commands to set the encoding policy of the vector columns. Run these commands sequentially.
.alter column Wiki.title_vector policy encoding type='Vector16'

.alter column Wiki.content_vector policy encoding type='Vector16'
Write vector data to an Eventhouse
Use the following steps to import the embedded Wikipedia data and write it to an Eventhouse:
Import notebook
- Download the sample notebook from the GitHub repository.
- Browse to your Fabric environment. In the experience switcher, choose Develop and then your workspace.
- Select Import > Notebook > From this computer > Upload, and then choose the notebook you downloaded in a previous step.
- Once the import is complete, open the imported notebook from your workspace.
Write data to the Eventhouse
Run the cells to set up your environment.
%%configure -f {"conf": { "spark.rpc.message.maxSize": "1024" } }
%pip install wget
%pip install openai
Run the cells to download the precomputed embeddings.
import wget

embeddings_url = "https://cdn.openai.com/API/examples/data/vector_database_wikipedia_articles_embedded.zip"

# The file is ~700 MB so it might take some time
wget.download(embeddings_url)
import zipfile

with zipfile.ZipFile("vector_database_wikipedia_articles_embedded.zip","r") as zip_ref:
    zip_ref.extractall("/lakehouse/default/Files/data")
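Optionally, before loading the file, you can confirm the download and extraction succeeded. The following sketch isn't part of the original notebook; it simply checks that the extracted CSV exists in the lakehouse Files folder and prints its size:

import os

# Check that the extracted CSV is present and non-empty (path matches the extractall target above)
csv_path = "/lakehouse/default/Files/data/vector_database_wikipedia_articles_embedded.csv"
print(os.path.exists(csv_path))
print(os.path.getsize(csv_path) if os.path.exists(csv_path) else "file not found")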
import pandas as pd
from ast import literal_eval

article_df = pd.read_csv('/lakehouse/default/Files/data/vector_database_wikipedia_articles_embedded.csv')
# Read vectors from strings back into a list
article_df["title_vector"] = article_df.title_vector.apply(literal_eval)
article_df["content_vector"] = article_df.content_vector.apply(literal_eval)
article_df.head()
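As a quick sanity check (an addition, not part of the original notebook), you can verify the shape of the loaded data. The text-embedding-ada-002 model produces 1536-dimensional embeddings, so each vector list should contain 1536 values:

# Each embedding produced by text-embedding-ada-002 has 1536 dimensions
print(len(article_df["content_vector"].iloc[0]))  # expected: 1536
print(article_df.shape)  # (number of pages, number of columns)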
To write to the eventhouse, enter your Eventhouse Cluster URI, which can be found on the system overview page, and the name of the database. The table name matches the Wiki table you created earlier and is referenced later in the query.
# replace with your Eventhouse Cluster URI, Database name, and Table name
KUSTO_CLUSTER = "Eventhouse Cluster URI"
KUSTO_DATABASE = "Database name"
KUSTO_TABLE = "Wiki"
Run the remaining cells to write the data to the Eventhouse. This operation can take some time to execute.
kustoOptions = {"kustoCluster": KUSTO_CLUSTER, "kustoDatabase": KUSTO_DATABASE, "kustoTable": KUSTO_TABLE}
access_token = mssparkutils.credentials.getToken(kustoOptions["kustoCluster"])
# Pandas data frame to spark dataframe
sparkDF = spark.createDataFrame(article_df)
# Write data to a table in Eventhouse
sparkDF.write. \
    format("com.microsoft.kusto.spark.synapse.datasource"). \
    option("kustoCluster", kustoOptions["kustoCluster"]). \
    option("kustoDatabase", kustoOptions["kustoDatabase"]). \
    option("kustoTable", kustoOptions["kustoTable"]). \
    option("accessToken", access_token). \
    option("tableCreateOptions", "CreateIfNotExist"). \
    mode("Append"). \
    save()
View the data in the Eventhouse
At this point, you can verify the data was written to the eventhouse by browsing to the database details page.
- Browse to your workspace homepage in Real-Time Intelligence.
- Select the database item that was provided in the previous section. You should see a summary of the data that was written to the "Wiki" table.
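You can also verify the row count from the notebook by reading from the Eventhouse with the same Kusto Spark connector used later in this tutorial. This is a minimal sketch that assumes the KUSTO_CLUSTER and KUSTO_DATABASE variables from the earlier cells are still defined:

# Count the rows written to the Wiki table
accessToken = mssparkutils.credentials.getToken(KUSTO_CLUSTER)
countDf = spark.read \
    .format("com.microsoft.kusto.spark.synapse.datasource") \
    .option("accessToken", accessToken) \
    .option("kustoCluster", KUSTO_CLUSTER) \
    .option("kustoDatabase", KUSTO_DATABASE) \
    .option("kustoQuery", "Wiki | count") \
    .load()
countDf.show()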
Generate embedding for the search term
Now that you stored the embedded wiki data in your eventhouse, you can use this data as a reference to find pages related to a particular topic. To make the comparison, you embed the search term, and then compare the search term embedding with the embeddings of the Wikipedia pages.
To successfully make a call against Azure OpenAI, you need an endpoint, key, and deployment ID.
| Variable name | Value |
|---|---|
| endpoint | This value can be found in the Keys & Endpoint section when examining your resource from the Azure portal. Alternatively, you can find the value in Azure OpenAI Studio > Playground > Code View. An example endpoint is: https://docs-test-001.openai.azure.com/. |
| API key | This value can be found in the Keys & Endpoint section when examining your resource from the Azure portal. You can use either KEY1 or KEY2. |
| deployment ID | This value can be found under the Deployments section in Azure OpenAI Studio. |
Use the information in the table when running the Azure OpenAI cells.
Important
Local authentication must be enabled on your Azure OpenAI resource in order to use the API key.
import openai
openai.api_version = '2022-12-01'
openai.api_base = 'endpoint' # Add your endpoint here
openai.api_type = 'azure'
openai.api_key = 'api key' # Add your api key here
def embed(query):
    # Creates embedding vector from user query
    embedded_query = openai.Embedding.create(
        input=query,
        deployment_id="deployment id", # Add your deployment id here
        chunk_size=1
    )["data"][0]["embedding"]
    return embedded_query
searchedEmbedding = embed("most difficult gymnastics moves in the olympics")
#print(searchedEmbedding)
Query the similarity
The query is run directly from the notebook, and uses the returned embedding from the previous step in a comparison against the embedded Wikipedia pages stored in your eventhouse. This query uses the cosine similarity function and returns the top 10 most similar vectors.
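Cosine similarity is the dot product of two vectors divided by the product of their magnitudes; values close to 1 indicate high semantic similarity. The following sketch, which isn't part of the tutorial notebook, illustrates the computation that series_cosine_similarity performs, using numpy and toy three-dimensional vectors:

import numpy as np

def cosine_similarity(a, b):
    # Dot product of the two vectors divided by the product of their L2 norms
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy example; the real embeddings have 1536 dimensions
print(cosine_similarity([1.0, 0.0, 1.0], [0.5, 0.1, 0.9]))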
Run the cells in the notebook to see the results of the query. You can change the search term and rerun the query to see different results. You could also compare an existing entry in the Wiki database to find similar entries.
kustoQuery = "Wiki | extend similarity = series_cosine_similarity(dynamic("+str(searchedEmbedding)+"), content_vector) | top 10 by similarity desc"
accessToken = mssparkutils.credentials.getToken(KUSTO_CLUSTER)
kustoDf = spark.read\
.format("com.microsoft.kusto.spark.synapse.datasource")\
.option("accessToken", accessToken)\
.option("kustoCluster", KUSTO_CLUSTER)\
.option("kustoDatabase", KUSTO_DATABASE)\
.option("kustoQuery", kustoQuery).load()
# Example that uses the result data frame.
kustoDf.show()
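As mentioned above, you can also use an existing entry in the Wiki table as the reference instead of a new search term. The following sketch isn't part of the original notebook: it looks up a stored page's content_vector with a KQL let statement (the title 'April' is only an example; use any title present in your table) and ranks the other pages against it:

# Rank pages by similarity to an existing page's stored content_vector
existingPageQuery = (
    "let ref_vector = toscalar(Wiki | where title == 'April' | take 1 | project content_vector); "
    "Wiki | extend similarity = series_cosine_similarity(ref_vector, content_vector) "
    "| top 10 by similarity desc"
)
similarDf = spark.read \
    .format("com.microsoft.kusto.spark.synapse.datasource") \
    .option("accessToken", accessToken) \
    .option("kustoCluster", KUSTO_CLUSTER) \
    .option("kustoDatabase", KUSTO_DATABASE) \
    .option("kustoQuery", existingPageQuery) \
    .load()
similarDf.show()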
Clean up resources
When you finish the tutorial, you can delete the resources you created to avoid incurring other costs. To delete the resources, follow these steps:
- Browse to your workspace homepage.
- Delete the notebook created in this tutorial.
- Delete the Eventhouse or database used in this tutorial.