Once you generate the embeddings for your chunks, the next step is to generate the index in the vector database and experiment to determine the optimal searches to perform. When you're experimenting with information retrieval, there are several areas to consider, including configuration options for the search index, the types of searches you should perform, and your reranking strategy. This article covers these three topics.
This article is part of a series. Read the introduction.
Search index
Note
Azure AI Search is a first party Azure search service. This section will mention some specifics for AI Search. If you're using a different store, consult the documentation to find the key configuration for that service.
The search index in your store has a column for every field in your data. Search stores generally have support for nonvector data types such as string, Boolean, integer, single, double, datetime, and collections like Collection(single) and vector data types such as Collection(single). For each column, you must configure information such as the data type, whether the field is filterable, retrievable, and/or searchable.
The following are some key decisions you must make for the vector search configuration that are applied to vector fields:
- Vector search algorithm - The algorithm used to search for relative matches. Azure AI Search has a brute-force algorithm option that scans the entire vector space called exhaustive KNN, and a more performant algorithm option that performs an approximate nearest neighbor (ANN) search called Hierarchical Navigable Small World (HNSW).
- metric - This configuration is the similarity metric used to calculate nearness by the algorithm. The options in Azure AI Search are cosine, dotProduct, and Euclidean. If you're using Azure OpenAI embedding models, choose
cosine
. - efConstruction - Parameter used during Hierarchical Navigable Small Worlds (HNSW) index construction that sets the number of nearest neighbors that are connected to a vector during indexing. A larger efConstruction value results in a better-quality index than a smaller number. The tradeoff is that a larger value requires more time, storage, and compute. efConstruction should be higher for a large number of chunks and lower for a low number of chunks. Determining the optimal value requires experimentation with your data and expected queries.
- efSearch - Parameter that is used at query time to set the number of nearest neighbors (that is, similar chunks) used during search.
- m - The bidirectional link count. The range is 4 to 10, with lower numbers returning less noise in the results.
In Azure AI Search, the vector configurations are encapsulated in a vectorSearch
configuration. When configuring your vector columns, you reference the appropriate configuration for that vector column and set the number of dimensions. The vector column's dimensions attribute represents the number of dimensions generated by the embedding model you chose. For example, the storage-optimized text-embedding-3-small
model generates 1,536 dimensions.
Searches
When executing queries from your prompt orchestrator against your search store, you have many options to consider. You need to determine:
- What type of search you're going to perform: vector or keyword or hybrid
- Whether you're going to query against one or more columns
- Whether you're going to manually run multiple queries, such as a keyword query and a vector search
- Whether the query needs to be broken down into subqueries
- Whether filtering should be used in your queries
Your prompt orchestrator might take a static approach or a dynamic approach mixing the approaches based on context clues from the prompt. The following sections address these options to help you experiment to find the right approach for your workload.
Search types
Search platforms generally support full text and vector searches. Some platforms, like Azure AI Search support hybrid searches. To see capabilities of various vector search offerings, review Choose an Azure service for vector search.
Vector search
Vector searches match on similarity between the vectorized query (prompt) and vector fields.
Important
You should perform the same cleaning operations you performed on chunks before embedding the query. For example, if you lowercased every word in your embedded chunk, you should lowercase every word in the query before embedding.
Note
You can perform a vector search against multiple vector fields in the same query. In Azure AI Search, that is technically a hybrid search. For more information, see that section.
embedding = embedding_model.generate_embedding(
chunk=str(pre_process.preprocess(query))
)
vector = RawVectorQuery(
k=retrieve_num_of_documents,
fields="contentVector",
vector=embedding,
)
results = client.search(
search_text=None,
vector_queries=[vector],
top=retrieve_num_of_documents,
select=["title", "content", "summary"],
)
The sample code performs a vector search against the contentVector
field. Note that the code that embeds the query preprocesses the query first. That preprocess should be the same code that preprocesses the chunks prior to embedding. The embedding model must be the same embedding model that embedded the chunks.
Full text search
Full text searches match plain text stored in an index. It's a common practice to extract keywords from a query and use those extracted keywords in a full text search against one or more indexed columns. Full text searches can be configured to return matches where any terms or all terms match.
You have to experiment to determine which fields are effective to run full text searches against. As discussed in the Enrichment phase, keyword, and entity metadata fields are good candidates to consider for full text search in scenarios where content has similar semantic meaning, but entities or keywords differ. Other common fields to consider for full text search are title, summary, and the chunk text.
formatted_search_results = []
results = client.search(
search_text=query,
top=retrieve_num_of_documents,
select=["title", "content", "summary"],
)
formatted_search_results = format_results(results)
The sample code performs a full text search against the title, content, and summary fields.
Hybrid search
Azure AI Search supports Hybrid queries where your query can contain one or more text searches and one or more vector searches. The platform performs each query, get the intermediate results, rerank the results using Reciprocal Rank Fusion (RRF), and return the top N results.
embedding = embedding_model.generate_embedding(
chunk=str(pre_process.preprocess(query))
)
vector1 = RawVectorQuery(
k=retrieve_num_of_documents,
fields="contentVector",
vector=embedding,
)
vector2 = RawVectorQuery(
k=retrieve_num_of_documents,
fields="questionVector",
vector=embedding,
)
results = client.search(
search_text=query,
vector_queries=[vector1, vector2],
top=retrieve_num_of_documents,
select=["title", "content", "summary"],
)
The sample code performs a full text search against the title, content, and summary fields and vector searches against the contentVector and questionVector fields. The Azure AI Search platform runs all the queries in parallel, reranks the results, and returns the top retrieve_num_of_documents documents.
Manual multiple
You can, of course, run multiple queries, such as a vector search and a keyword full text search, manually. You aggregate the results and rerank the results manually and return the top results. The following are use cases for manual multiple:
- You're using a search platform that doesn't support hybrid searches. You would follow this option to perform your own hybrid search.
- You want to run full text searches against different queries. For example, you might extract keywords from the query and run a full text search against your keywords metadata field. You might then extract entities and run a query against the entities metadata field.
- You want to control the reranking process yourself.
- The query requires decomposed subqueries to be run to retrieve grounding data from multiple sources.
Query translation
Query translation is an optional step in the information retrieval phase of a RAG solution. In the translation step, a query is transformed or translated into a form that is optimized to retrieve better results. There are many forms of query translation including the four addressed in this article: augmentation, decomposition, rewriting, and Hypothetical Document Embeddings (HyDE).
Query augmentation
Query augmentation is a translation step where the goal is to make the query simpler, more usable, and to enhance the context. You should consider augmentation if your query is too small or vague. For example, consider the query: "Compare the earnings of Microsoft." That query is vague. You did not mention time frames, or time units to compare and only specified earnings. Now consider an augmented version of the query: "Compare the earnings and revenue of Microsoft current year vs last year by quarter." The new query is clear and specific.
When you're augmenting a query, you maintain the original query, but add more context. There's no harm in augmenting a query as long as you don't remove or alter the original query and you don't change the nature of the query.
You can use a language model to augment your query. Not all queries can be augmented, however. If you have a context, you can pass it along to your language model to augment the query. If you don't have a context, you have to determine if there's information in your language model that can be useful in augmenting the query. For example, if you're using a large language model like one of the GPT models, you can determine if there's information readily available on the internet about the query. If so, you can use the model to augment the query. Otherwise, you shouldn't augment the query.
The following prompt is taken from the RAG Experiment Accelerator GitHub repository that shows a language model being used to augment a query.
Input Processing:
Analyze the input query to identify the core concept or topic.
Check if any context is provided with the query.
If context is provided, use it as the primary basis for augmentation and explanation.
If no context is provided, determine the likely domain or field (e.g., science, technology, history, arts, etc.) based on the query.
Query Augmentation:
If context is provided:
Use the given context to frame the query more specifically.
Identify any additional aspects of the topic not covered in the provided context that would enrich the explanation.
If no context is provided, expand the original query by adding the following elements, as applicable:
Include definitions about every word (adjective, noun etc) and meaning of each keyword, concept and phrase including synonyms and antonyms.
Include historical context or background information if relevant.
Identify key components or sub-topics within the main concept.
Request information on practical applications or real-world relevance.
Ask for comparisons with related concepts or alternatives, if applicable.
Inquire about current developments or future prospects in the field.
Additional Guidelines:
Prioritize information from provided context when available.
Adapt your language to suit the complexity of the topic, but aim for clarity.
Define technical terms or jargon when first introduced.
Use examples to illustrate complex ideas when appropriate.
If the topic is evolving, mention that your information might not reflect the very latest developments.
For scientific or technical topics, briefly mention the level of scientific consensus if relevant.
Use markdown formatting for better readability when appropriate.
Example Input-Output:
Example 1 (With provided context):
Input: "Explain the impact of the Gutenberg Press"
Context provided: "The query is part of a discussion about revolutionary inventions in medieval Europe and their long-term effects on society and culture."
Augmented Query: "Explain the impact of the Gutenberg Press in the context of revolutionary inventions in medieval Europe. Cover its role in the spread of information, its effects on literacy and education, its influence on the Reformation, and its long-term impact on European society and culture. Compare it to other medieval inventions in terms of societal influence."
Example 2 (Without provided context):
Input: "Explain CRISPR technology"
Augmented Query: "Explain CRISPR technology in the context of genetic engineering and its potential applications in medicine and biotechnology. Cover its discovery, how it works at a molecular level, its current uses in research and therapy, ethical considerations surrounding its use, and potential future developments in the field."
Now, provide a comprehensive explanation based on the appropriate augmented query.
Context: {context}
Query: {query}
Augmented Query:
Notice there are examples for when context is and isn't present.
Decomposition
Some queries are complex and require more than one collection of data to ground the model. For example, the query "How do electric cars work and how do they compare to ICE vehicles?" likely requires grounding data from multiple sources. One source might describe how electric cars work, where another compares them to internal combustion engine (ICE) vehicles.
Decomposition is the process of breaking down a complex query into multiple smaller and simpler subqueries. You run each of the decomposed queries independently and aggregate the top results of all the decomposed queries as an accumulated context. You then run the original query, passing the accumulated context to the language model.
It's good practice to determine whether the query requires multiple searches before running any searches. If you deem multiple subqueries are required, you can run manual multiple queries for all the queries. Use a language model to determine whether multiple subqueries are recommended. The following prompt is taken from the RAG Experiment Accelerator GitHub repository that is used to categorize a query as simple or complex, with complex requiring multiple queries:
Consider the given question to analyze and determine if it falls into one of these categories:
1. Simple, factual question
a. The question is asking for a straightforward fact or piece of information
b. The answer could likely be found stated directly in a single passage of a relevant document
c. Breaking the question down further is unlikely to be beneficial
Examples: "What year did World War 2 end?", "What is the capital of France?, "What is the features of productX?"
2. Complex, multi-part question
a. The question has multiple distinct components or is asking for information about several related topics
b. Different parts of the question would likely need to be answered by separate passages or documents
c. Breaking the question down into sub-questions for each component would allow for better results
d. The question is open-ended and likely to have a complex or nuanced answer
e. Answering it may require synthesizing information from multiple sources
f. The question may not have a single definitive answer and could warrant analysis from multiple angles
Examples: "What were the key causes, major battles, and outcomes of the American Revolutionary War?", "How do electric cars work and how do they compare to gas-powered vehicles?"
Based on this rubric, does the given question fall under category 1 (simple) or category 2 (complex)? The output should be in strict JSON format. Ensure that the generated JSON is 100 percent structurally correct, with proper nesting, comma placement, and quotation marks. There should not be any comma after last element in the JSON.
Example output:
{
"category": "simple"
}
A language model can also be used to decompose a complex query. The following prompt is taken from the RAG Experiment Accelerator GitHub repository that decomposes a complex query.
Analyze the following query:
For each query, follow these specific instructions:
- Expand the query to be clear, complete, fully qualified and concise.
- Identify the main elements of the sentence: typically a subject, an action or relationship, and an object or complement. Determine which element is being asked about or emphasized (usually the unknown or focus of the question). Invert the sentence structure: Make the original object or complement the new subject. Transform the original subject into a descriptor or qualifier. Adjust the verb or relationship to fit the new structure.
- Break the query down into a set sub-queries with clear, complete, fully qualified, concise and self-contained propositions.
- Include an addition sub-query using one more rule: Identify the main subject and object. Swap their positions in the sentence. Adjust the wording to make the new sentence grammatically correct and meaningful. Ensure the new sentence asks about the original subject.
- Express each idea or fact as a standalone statement that can be understood with help of given context.
- Break down the query into ordered sub-questions, from least to most dependent.
- The most independent sub-question does not requires or dependend on the answer to any other sub-question or prior knowledge.
- Try having a complete sub-question having all information only from base query. There is no additional context or information available.
- Separate complex ideas into multiple simpler propositions when appropriate.
- Decontextualize each proposition by adding necessary modifiers to nouns or entire sentences. Replace pronouns (e.g., "it", "he", "she", "they", "this", "that") with the full name of the entities they refer to.
- if you still need more question, the sub-question is not relevant and should be removed.
Provide your analysis in the following YAML format strictly adhering to the structure below and do not output anything extra including the language itself:
type: interdependent
queries:
- [First query or sub-query]
- [Second query or sub-query, if applicable]
- [Third query or sub-query, if applicable]
- ...
Examples:
1. Query: "What is the capital of France?"
type: interdependent
queries:
- What is the capital of France?
2. Query: "Who is the current CEO of the company that created the iPhone?"
type: interdependent
queries:
- Which company created the iPhone?
- Who is the current CEO of the Apple(identified in the previous question) company ?
3. Query: "What is the population of New York City and what is the tallest building in Tokyo?"
type: multiple_independent
queries:
- What is the population of New York City?
- What is the tallest building in Tokyo?
Now, analyze the following query:
{query}
Rewriting
An input query may not be in the optimal form to retrieve grounding data. Using a language model to rewrite the query is a common practice to achieve better results. The following are some challenges that you can address by rewriting a query:
- Vagueness
- Missing keywords
- Unneeded words
- Unclear semantically
The following prompt, taken from the RAG Experiment Accelerator GitHub repository, addresses the listed challenges by using a language model to rewrite the query.
Rewrite the given query to optimize it for both keyword-based and semantic similarity search methods. Follow these guidelines:
Identify the core concepts and intent of the original query.
Expand the query by including relevant synonyms, related terms, and alternate phrasings.
Maintain the original meaning and intent of the query.
Include specific keywords that are likely to appear in relevant documents.
Incorporate natural language phrasing to capture semantic meaning.
Include domain-specific terminology if applicable to the query's context.
Ensure the rewritten query covers both broad and specific aspects of the topic.
Remove any ambiguous or unnecessary words that might confuse the search.
Combine all elements into a single, coherent paragraph that flows naturally.
Aim for a balance between keyword richness and semantic clarity.
Provide the rewritten query as a single paragraph that incorporates various search aspects (e.g. keyword-focused, semantically-focused, domain-specific).
query: {original_query}
HyDE
Hypothetical Document Embeddings (HyDE) is an alternate information retrieval technique used in RAG solutions. Rather than converting a query into embeddings and using those embeddings to find the closest matches in a vector database, HyDE uses a language model to generate answers from the query. These answers are then converted into embeddings, which are used to find the closest matches. This process enables HyDE to perform answer-to-answer embedding similarity searches.
Combining query translations into a pipeline
You're not limited to choosing one query translation. In practice, multiple, or even all of these translations can be used in conjunction. The following diagram illustrates an example of how these translations can be combined into a pipeline.
Figure 1: RAG pipeline with query transformers
The diagram is numbered for notable steps in this pipeline:
The original query is passed to the optional query augmenter step. This step outputs the original query and the augmented query.
The augmented query is passed to the optional query decomposer step. This step outputs the original query, the augmented query, and the decomposed queries.
Each decomposed query is passed through three substeps. Once all of the decomposed queries are run through the substeps, the step outputs the original query, the augmented query, the decomposed queries, and an accumulated context. The accumulated context is the aggregation of the top N results from all of the decomposed queries being processed through the substeps.
- The optional query rewriter rewrites the decomposed query.
- The rewritten query, if it was rewritten, or the original query is executed against the search index. The query can be executed using any of the search types: vector, full text, hybrid, or manual multiple. It can also be executed using advanced query capabilities such as HyDE.
- The results are reranked. The top N reranked results are added to the accumulated context.
The original query, along with the accumulated context, is run through the same three substeps as each decomposed query was. The only difference is that there's only one query run and the top N results are returned to the caller.
Passing images in queries
Some multimodal models such as GPT-4V and GPT-4o can interpret images. If you're using these models, you can choose whether you want to avoid chunking your images and pass the image as part of the prompt to the multimodal model. You should experiment to determine how this approach performs compared to chunking the images with and without passing additional context. You should also compare the difference in cost between the approaches and do a cost-benefit analysis.
Filtering
Fields in the search store that are configured as filterable can be used to filter queries. Consider filtering on keywords and entities for queries that use those fields to help narrow down the result. Filtering allows you to retrieve only the data that satisfies certain conditions from an index by eliminating irrelevant data. Filtering improves the overall performance of the query with more relevant results. As with every decision, it's important to experiment and test. Queries might not have keywords or wrong keywords, abbreviations, or acronyms. You need to take these cases into consideration.
Weighting
Some search platforms, such as Azure AI Search, support the ability to influence the ranking of results based on criteria, including weighting fields.
Note
This section discusses the weighting capabilities in Azure AI Search. If you're using a different data platform, research the weighting capabilities of that platform.
Azure AI Search supports scoring profiles that contain parameters for weighted fields and functions for numeric data. Currently, scoring profiles only apply to nonvector fields, while support for vector and hybrid search is in preview. You can create multiple scoring profiles on an index and optionally choose to use one on a per-query basis.
Choosing which fields to weight depends upon the type of query and the use case. For example, if you determine that the query is keyword-centric, such as "Where is Microsoft headquartered?", you would want a scoring profile that weights entity or keyword fields higher. You may also use different profiles for different users, allow users to choose their focus, or choose profiles based on the application.
We recommend that in production systems, you only maintain profiles that you actively use in production.
Reranking
Reranking allows you to run one or more queries, aggregate the results, and rank those results. Consider the following reasons to rerank your search results:
- You performed manual multiple searches and you want to aggregate the results and rank them.
- Vector and keyword searches aren't always accurate. You can increase the count of documents returned from your search, potentially including some valid results that would otherwise be ignored, and use reranking to evaluate the results.
You can use a language model or cross-encoder to perform reranking. Some platforms, like Azure AI Search have proprietary methods to rerank results. You can evaluate these options for your data to determine what works best for your scenario. The following sections provide details on these methods.
Language model reranking
The following is a sample language model prompt from the RAG experiment accelerator that reranks results.
A list of documents is shown below. Each document has a number next to it along with a summary of the document. A question is also provided.
Respond with the numbers of the documents you should consult to answer the question, in order of relevance, as well as the relevance score as json string based on json format as shown in the schema section. The relevance score is a number from 1–10 based on how relevant you think the document is to the question. The relevance score can be repetitive. Don't output any additional text or explanation or metadata apart from json string. Just output the json string and strip rest every other text. Strictly remove any last comma from the nested json elements if it's present.
Don't include any documents that aren't relevant to the question. There should exactly be one documents element.
Example format:
Document 1:
content of document 1
Document 2:
content of document 2
Document 3:
content of document 3
Document 4:
content of document 4
Document 5:
content of document 5
Document 6:
content of document 6
Question: user defined question
schema:
{
"documents": {
"document_1": "Relevance",
"document_2": "Relevance"
}
}
Cross-encoder reranking
The following example from the RAG Experiment Accelerator GitHub repository uses the CrossEncoder provided by Hugging Face to load the Roberta model. It then iterates over each chunk and uses the model to calculate similarity, giving them a value. We sort results and return the top N.
from sentence_transformers import CrossEncoder
...
model_name = 'cross-encoder/stsb-roberta-base'
model = CrossEncoder(model_name)
cross_scores_ques = model.predict(
[[user_prompt, item] for item in documents],
apply_softmax=True,
convert_to_numpy=True,
)
top_indices_ques = cross_scores_ques.argsort()[-k:][::-1]
sub_context = []
for idx in list(top_indices_ques):
sub_context.append(documents[idx])
Semantic ranking
Azure AI Search has a proprietary feature called semantic ranking. This feature uses deep learning models that were adapted from Microsoft Bing that promote the most semantically relevant results. Read the following to see How semantic ranker works.
Search guidance
Consider the following general guidance when implementing your search solution:
- Title, summary, source, and the raw content (not cleaned) are good fields to return from a search.
- Determine up front whether a query needs to be broken down into subqueries.
- In general, it's a good practice to run queries on multiple fields, both vector and text queries. When you receive a query, you don't know whether vector search or text search are better. You further don't know what fields the vector search or keyword search are best to search. You can search on multiple fields, potentially with multiple queries, rerank the results and return the results with the highest scores.
- Keywords and entities fields are good candidates to consider filtering on.
- It's a good practice to use keywords along with vector searches. The keywords filter the results to a smaller subset. The vector store works against that subset to find the best matches.
Search evaluation
In the preparation phase, you should have gathered test queries along with test document information. You can use the following information you gathered in that phase to evaluate your search results:
- The query - The sample query
- The context - The collection of all the text in the test documents that address the sample query
The following are three well established retrieval evaluation methods you can use to evaluate your search solution:
- Precision at K - The percentage of correctly identified relevant items out of the total search results. This metric focuses on the accuracy of your search results.
- Recall at K - Recall at K measures the percentage of relevant items in the top K out of the total possible relative items. This metric focuses on search results coverage.
- Mean Reciprocal Rank (MRR) - MRR measures the average of the reciprocal ranks of the first relevant answer in your ranked search results. This metric focuses on where the first relevant result occurs in the search results.
You should test both positive and negative examples. For the positive examples, you want the metrics to be as close to 1 as possible. For the negative examples, where your data shouldn't be able to address the queries, you want the metrics to be as close to 0 as possible. You should test all your test queries and average the positive query results and the negative query results to understand how your search results are performing in aggregate.