Technical limitations, operational factors, and ranges
All vectors uploaded to Azure AI Search must be generated outside the service by using a model of your choice. It is your responsibility to consider the technical limitations and operational factors of each model, and whether the embeddings it creates are well optimized, or even appropriate, for your use case. This includes both the inferences of meaning extracted from content and the dimensionality of the vector embedding space.
The vectorization model creates an embedding space that defines the end-user search experience of an application. If a model does not align well with the intended use case, or if the embeddings it generates are poorly optimized, both the functionality and the performance of the resulting search experience can suffer.
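To make the external embedding workflow described above concrete, the following is a minimal sketch that generates an embedding with an Azure OpenAI deployment and uploads it to a vector field in an Azure AI Search index. It assumes the openai and azure-search-documents packages; the endpoint, key, deployment, index, and field names (such as contentVector) are placeholders, not values from this article.

```python
# Minimal sketch: generate embeddings outside Azure AI Search, then upload them.
# Assumes the `openai` and `azure-search-documents` packages; all endpoints, keys,
# deployment names, index names, and field names below are placeholders.
from openai import AzureOpenAI
from azure.core.credentials import AzureKeyCredential
from azure.search.documents import SearchClient

aoai = AzureOpenAI(
    azure_endpoint="https://<your-aoai-resource>.openai.azure.com",
    api_key="<your-aoai-key>",
    api_version="2024-02-01",
)

# 1. Create the embedding externally with a model of your choice.
text = "Azure AI Search supports vector and hybrid retrieval."
embedding = aoai.embeddings.create(
    model="<your-embedding-deployment>",  # for example, a text-embedding deployment
    input=text,
).data[0].embedding  # dimensionality is fixed by the model you choose

# 2. Upload the document, including the precomputed vector, to the index.
search_client = SearchClient(
    endpoint="https://<your-search-service>.search.windows.net",
    index_name="<your-index>",
    credential=AzureKeyCredential("<your-search-admin-key>"),
)
search_client.upload_documents(documents=[{
    "id": "1",
    "content": text,
    "contentVector": embedding,  # hypothetical vector field defined in your index
}])
```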
While many limitations of vector search stem from the model used to generate embeddings, there are additional options you should consider at query time. You can choose from two algorithms to determine relevance for vector search results: exhaustive k-nearest neighbors (KNN) or Hierarchical Navigable Small World (HNSW). Exhaustive KNN performs a brute-force search of the entire vector space, calculating the distances between the query and all data points to find the exact k nearest neighbors of the query point. While more precise, this algorithm can be slow. If low latency is the primary goal, consider HNSW, which performs an efficient approximate nearest neighbor (ANN) search in high-dimensional embedding spaces. See the vector search documentation for more information about these options.
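A minimal sketch of how both algorithms might be declared in an index definition follows, assuming the azure-search-documents Python SDK (version 11.4 or later); the configuration names, profile names, and HNSW parameter values are illustrative placeholders rather than recommendations.

```python
# Minimal sketch: declare both HNSW and exhaustive KNN algorithm configurations
# in an index, assuming azure-search-documents 11.4+; all names are placeholders.
from azure.search.documents.indexes.models import (
    VectorSearch,
    VectorSearchProfile,
    HnswAlgorithmConfiguration,
    HnswParameters,
    ExhaustiveKnnAlgorithmConfiguration,
)

vector_search = VectorSearch(
    algorithms=[
        # Approximate nearest neighbor search: fast, with a tunable recall/latency trade-off.
        HnswAlgorithmConfiguration(
            name="my-hnsw",
            parameters=HnswParameters(m=4, ef_construction=400, ef_search=500),
        ),
        # Brute-force search of the entire vector space: exact, but slower at scale.
        ExhaustiveKnnAlgorithmConfiguration(name="my-eknn"),
    ],
    profiles=[
        # Each vector field in the index references a profile, which selects an algorithm.
        VectorSearchProfile(name="hnsw-profile", algorithm_configuration_name="my-hnsw"),
        VectorSearchProfile(name="eknn-profile", algorithm_configuration_name="my-eknn"),
    ],
)
```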
- Do spend time A/B testing your application with the different content and query types you expect your application to support. Figure out which query experience is best for your needs.
- Do spend time testing your models with a full range of input content to understand how it behaves in many situations. This content could include potentially sensitive input to understand whether there is any bias inherent in the model. The Azure OpenAI Responsible AI overview provides guidance for how to responsibly use AI.
- Do consider adding Azure AI Content Safety to your application architecture. It includes an API to detect harmful user-generated and AI-generated text or images in applications and services; a brief sketch of how it can be called follows this list.
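As a rough illustration, the following sketch screens a piece of text with the Azure AI Content Safety text analysis API. It assumes the azure-ai-contentsafety package (attribute names reflect the current GA version); the endpoint and key values are placeholders, and the severity thresholds you apply are your own design decision.

```python
# Minimal sketch: screen text with Azure AI Content Safety before indexing or display.
# Assumes the `azure-ai-contentsafety` package; endpoint and key are placeholders.
from azure.core.credentials import AzureKeyCredential
from azure.ai.contentsafety import ContentSafetyClient
from azure.ai.contentsafety.models import AnalyzeTextOptions

client = ContentSafetyClient(
    endpoint="https://<your-content-safety-resource>.cognitiveservices.azure.com",
    credential=AzureKeyCredential("<your-content-safety-key>"),
)

response = client.analyze_text(
    AnalyzeTextOptions(text="<user-generated or AI-generated text>")
)

# Each analyzed category (hate, self-harm, sexual, violence) gets a severity score;
# apply your own thresholds to decide whether to block, review, or allow the content.
for item in response.categories_analysis:
    print(item.category, item.severity)
```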
Evaluating and integrating vector search for your use
To ensure optimal performance, conduct your own evaluations of the solutions you plan to implement by using vector search. Follow an evaluation process that: (1) uses some internal stakeholders to evaluate results, (2) uses A/B experimentation to roll out vector search to users, (3) incorporates key performance indicators (KPIs) and metrics monitoring when the service is deployed in experiences for the first time, and (4) tests and tweaks the vector search configuration and/or index definition, including the surrounding experiences like user interface placement or business processes.
Microsoft has rigorously evaluated vector search in terms of latency, recall, and relevance by using diverse datasets to measure the speed, scalability, and accuracy of the results returned. The primary focus of your own evaluation efforts should be on selecting the appropriate model for your specific use case, understanding the limitations and biases of that model, and rigorously testing the end-to-end vector search experience.
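One common way to quantify the precision/latency trade-off in your own evaluation is to compare approximate (HNSW) results against an exhaustive search over the same query vector and compute recall@k. The following is a minimal sketch under the assumption that the azure-search-documents SDK (11.4 or later) and the VectorizedQuery exhaustive override are available in your API version; the index and field names are placeholders.

```python
# Minimal sketch: estimate ANN recall@k by comparing HNSW results with an
# exhaustive search for the same query vector. Assumes azure-search-documents 11.4+,
# the VectorizedQuery `exhaustive` override, and placeholder index/field names.
from azure.core.credentials import AzureKeyCredential
from azure.search.documents import SearchClient
from azure.search.documents.models import VectorizedQuery

search_client = SearchClient(
    endpoint="https://<your-search-service>.search.windows.net",
    index_name="<your-index>",
    credential=AzureKeyCredential("<your-query-key>"),
)

def top_k_ids(query_vector, k, exhaustive):
    """Return the document keys of the top-k vector matches."""
    vq = VectorizedQuery(
        vector=query_vector,
        k_nearest_neighbors=k,
        fields="contentVector",   # hypothetical vector field
        exhaustive=exhaustive,    # True forces brute-force KNN, used here as ground truth
    )
    results = search_client.search(
        search_text=None, vector_queries=[vq], select=["id"], top=k
    )
    return [doc["id"] for doc in results]

def recall_at_k(query_vector, k=10):
    """Fraction of exact top-k neighbors that the approximate search also returned."""
    approx = set(top_k_ids(query_vector, k, exhaustive=False))
    exact = set(top_k_ids(query_vector, k, exhaustive=True))
    return len(approx & exact) / len(exact) if exact else 0.0
```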
Technical limitations, operational factors, and ranges
There may be cases where semantic results, captions, and answers do not appear to be correct. The models used by semantic ranker are trained on various data sources (including open-source datasets and selections from the Microsoft Bing corpus). Semantic ranker supports a broad range of languages and attempts to match user queries to content from your search results. Semantic ranker is also a premium feature at additional cost, which should be considered when projecting the overall cost of your end-to-end solution.
Semantic ranker is most likely to improve relevance over content that is semantically rich, such as articles and descriptions. It looks for context and relatedness among terms, elevating matches that make more sense given the query. Language understanding "finds" summarizations, captions, and answers within your content, but unlike generative models such as the Azure OpenAI Service GPT-3.5 and GPT-4 models, it does not create them. Only verbatim text from source documents is included in the response, which can then be rendered on a search results page for a more productive search experience.
State-of-the-art, pretrained models are used for summarization and ranking. To maintain the fast performance that users expect from search, semantic summarization and ranking are applied only to the top 50 results, as scored by the default scoring algorithm. Inputs are derived from the content in the search result; semantic ranker cannot reach back to the search index to access other fields in the search document that were not returned in the query response. Inputs are also subject to a limit of 8,960 tokens. These limits are necessary to maintain millisecond response times.
The underlying ranking technology comes from Bing and Microsoft Research and is integrated into the Azure AI Search infrastructure as an add-on feature. The models are used internally, are not exposed to the developer, and are nonconfigurable. For more information about the research and AI investments backing semantic ranker, see How AI from Bing is powering Azure AI Search (Microsoft Research Blog).
Semantic ranker also offers answers, captions, and highlighting within the response. For example, if the model classifies a query as a question and is 70% confident in the answer, the model returns a semantic answer. Additionally, semantic captions extract the most relevant content from each result as a brief snippet and highlight the most relevant words or phrases within that snippet.
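The following sketch shows one way such a query might be issued with the azure-search-documents SDK (11.4 or later), requesting extractive answers and captions against a semantic configuration you have already defined; the endpoint, key, index, configuration name, and query text are placeholders.

```python
# Minimal sketch: request semantic ranking with extractive answers and captions.
# Assumes azure-search-documents 11.4+ and an existing semantic configuration;
# all endpoint, key, index, and configuration names are placeholders.
from azure.core.credentials import AzureKeyCredential
from azure.search.documents import SearchClient

search_client = SearchClient(
    endpoint="https://<your-search-service>.search.windows.net",
    index_name="<your-index>",
    credential=AzureKeyCredential("<your-query-key>"),
)

results = search_client.search(
    search_text="what is the refund policy",
    query_type="semantic",
    semantic_configuration_name="my-semantic-config",
    query_answer="extractive",    # return a semantic answer when confidence is high enough
    query_caption="extractive",   # return a caption with highlighted phrases per result
    top=5,
)

# Answers (if any) are returned alongside the result set.
for answer in results.get_answers() or []:
    print("ANSWER:", answer.text)

# Each result can carry a caption extracted verbatim from that document.
for doc in results:
    for caption in doc.get("@search.captions") or []:
        print("CAPTION:", caption.text)
```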
Semantic ranker results are based on the data in the underlying search index, and the models provide relevance ranking, answers, and captions based on the information retrieved from the index. Prior to using semantic ranker in a production environment, it is important to do further testing and to ensure that the dataset is accurate and appropriate for the intended use case. For more information and examples of how to evaluate semantic ranker, please see the content and appendix here.
In many AI systems, performance is often defined in relation to accuracy—that is, how often the AI system offers a correct prediction or output. With large-scale natural language models, two different users may look at the same output and have different opinions of how useful or relevant it is, which means that performance for these systems must be defined more flexibly. Here, we broadly consider performance to mean that the application performs as you and your users expect, including not generating harmful outputs.
Semantic ranker was trained on public content. As a result, the semantic relevance will vary based on the documents in the index and the queries issued against it. It is important to use your own judgment and research when you use this content for decision making.
- Do spend time A/B testing your application with different query types, such as keyword versus hybrid plus semantic ranker. Figure out which query experience is best for your needs.
- Do expend a reasonable effort to set up your semantic configuration in accordance with the feature documentation.
- Do not trust the semantic answers if you do not have confidence in the accuracy of the information within the search index.
- Do not always trust semantic captions, because they are extracted from customer content through a series of models that predict the most relevant content to surface in a brief snippet.
Evaluation of semantic ranker
Evaluation methods
Semantic ranker was evaluated through internal testing, including automated and human judgment on multiple datasets, as well as feedback from internal customers. Testing included scoring documents as relevant or not relevant, as well as ranking documents in priority order of relevance. Likewise, the captions and answers functionality was also evaluated through internal testing.
Evaluation results
We strive to ship all model updates regression-free (that is, an updated model should only improve on the current production model). Each candidate is compared directly to the current production model by using metrics suitable for the feature being evaluated, for example, Normalized Discounted Cumulative Gain (NDCG) for ranking and precision/recall for answers; a generic sketch of NDCG follows the lists below. Semantic ranker models are trained, tuned, and evaluated by using a wide range of training data that is representative of documents that have different properties (language, length, formatting, styles, and tones) to support the broadest array of search scenarios. Our training and test data are drawn from:
Sources of documents:
- Academic and industry benchmarks
- Customer data (testing only, performed with customer permission)
- Synthetic data
Sources of queries:
- Benchmark query sets
- Customer-provided query sets (testing only, performed with customer permission)
- Synthetic query sets
- Human-generated query sets
Sources of labels for scoring query and document pairs:
- Academic and industry benchmark labels
- Customer labels (testing only, performed with customer permission)
- Synthetic data labels
- Human-scored labels
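To illustrate how a ranking metric such as NDCG works in an offline evaluation like the one referenced above, the following is a generic sketch, not Microsoft's internal evaluation code; the relevance labels are made-up values for illustration only.

```python
# Generic sketch of NDCG@k for offline ranking evaluation; not Microsoft's internal
# evaluation code. Relevance labels (higher = more relevant) are illustrative only.
import math

def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the first k ranked results."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))

def ndcg_at_k(relevances, k):
    """DCG normalized by the DCG of an ideally ordered result list (1.0 is perfect)."""
    ideal_dcg = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal_dcg if ideal_dcg > 0 else 0.0

# Labels for the documents in the order a candidate ranker returned them.
candidate_run = [3, 2, 3, 0, 1]
print(ndcg_at_k(candidate_run, k=5))  # closer to 1.0 means closer to the ideal ordering
```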
Evaluating and integrating semantic ranker for your use
The performance of semantic ranker varies depending on the real-world uses and conditions in which people use it. The quality of the relevance provided through the deep learning models that power semantic ranker is directly correlated with the data quality of your search index. For example, the models currently have token limitations and consider only the first 8,960 tokens for semantic answers. Therefore, if the semantic answer to a search query is found toward the end of a long document (beyond the 8,960-token limit), the answer is not provided. The same rule applies to captions. Also, the semantic configuration lists relevant search fields in priority order. You can reorder the fields in this list to help tailor relevance to better suit your needs.
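A semantic configuration with prioritized fields might be declared as in the sketch below, assuming the azure-search-documents SDK (11.4 or later); the configuration and field names are placeholders, and the ordering of the content fields is what you can adjust to tune relevance for your scenario.

```python
# Minimal sketch: a semantic configuration that lists searchable fields in priority order.
# Assumes azure-search-documents 11.4+; configuration and field names are placeholders.
from azure.search.documents.indexes.models import (
    SemanticConfiguration,
    SemanticPrioritizedFields,
    SemanticField,
    SemanticSearch,
)

semantic_search = SemanticSearch(
    configurations=[
        SemanticConfiguration(
            name="my-semantic-config",
            prioritized_fields=SemanticPrioritizedFields(
                title_field=SemanticField(field_name="title"),
                # Content fields are considered in the order listed; reorder them
                # to tailor relevance to your needs.
                content_fields=[
                    SemanticField(field_name="description"),
                    SemanticField(field_name="content"),
                ],
                keywords_fields=[SemanticField(field_name="tags")],
            ),
        )
    ]
)
# Attach `semantic_search` to your SearchIndex definition when creating or updating the index.
```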
To ensure optimal performance in their scenarios, customers should conduct their own evaluations of the solutions they implement by using semantic ranker. Customers should generally follow an evaluation process that: (1) uses some internal stakeholders to evaluate results, (2) uses A/B experimentation to roll out semantic ranker to users, (3) incorporates KPIs and metrics monitoring when the service is deployed in experiences for the first time, and (4) tests and tweaks the semantic ranker configuration and/or index definition, including the surrounding experiences like user interface placement or business processes.
If you are developing an application in a high-stakes domain or industry, such as healthcare, human resources, education, or the legal field, assess how well the application works in your scenario, implement strong human oversight, evaluate how well users understand the limitations of the application, and comply with all relevant laws. Consider other mitigations based on your scenario.