Tutorial: Search your data using a chat model (RAG in Azure AI Search)

Article
11/19/2024

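The defining characteristic of a RAG solution on Azure AI Search is sending queries to a large language model (LLM) for a conversational search experience over your indexed content. It can be surprisingly easy if you implement just the basics.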

In this tutorial, you:

Set up clients
Write instructions for the LLM
Provide a query designed for LLM inputs
Review results and explore next steps

This tutorial builds on the previous tutorials. It assumes you have a search index created by the indexing pipeline.

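Prerequisites

Visual Studio Code with the Python extension and the Jupyter package. For more information, see Python in Visual Studio Code.
Azure AI Search, in a region shared with Azure OpenAI.
Azure OpenAI, with a deployment of gpt-4o. For more information, see Choose models for RAG in Azure AI Search.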

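Download the sample

You can use the same notebook from the previous indexing pipeline tutorial. The scripts for querying the LLM follow the pipeline creation steps. If you don't already have the notebook, download it from GitHub.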

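Configure clients for sending queries

The RAG pattern in Azure AI Search is a synchronized series of connections: first to a search index to obtain the grounding data, and then to an LLM to formulate a response to the user's question. Both clients use the same query string.

You're setting up two clients, so you need endpoints and permissions on both resources. This tutorial assumes you already set up role assignments for authorized connections, but you still need to provide the endpoints in the sample notebook: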

# Set endpoints and API keys for Azure services
AZURE_SEARCH_SERVICE: str = "PUT YOUR SEARCH SERVICE ENDPOINT HERE"
# AZURE_SEARCH_KEY: str = "DELETE IF USING ROLES, OTHERWISE PUT YOUR SEARCH SERVICE ADMIN KEY HERE"
AZURE_OPENAI_ACCOUNT: str = "PUT YOUR AZURE OPENAI ENDPOINT HERE"
# AZURE_OPENAI_KEY: str = "DELETE IF USING ROLES, OTHERWISE PUT YOUR AZURE OPENAI KEY HERE"
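
As an optional alternative to pasting endpoints into the notebook, the following minimal sketch reads them from a mapping such as the process environment. The `read_endpoint` helper and the idea of reusing the variable names above as environment variable names are assumptions for illustration; they aren't part of the sample notebook.

```python
import os
from typing import Mapping, Optional


def read_endpoint(name: str, env: Optional[Mapping[str, str]] = None) -> str:
    """Return the endpoint stored under `name`, or raise if it's missing.

    `name` is a hypothetical environment variable name chosen for this sketch.
    """
    source = os.environ if env is None else env
    value = source.get(name, "").strip()
    if not value.startswith("https://"):
        raise ValueError(f"Set {name} to an https:// endpoint before running the notebook.")
    # Drop a trailing slash so the value has one predictable form.
    return value.rstrip("/")


# Example usage (assumes both variables are set in your shell):
# AZURE_SEARCH_SERVICE = read_endpoint("AZURE_SEARCH_SERVICE")
# AZURE_OPENAI_ACCOUNT = read_endpoint("AZURE_OPENAI_ACCOUNT")
```

Failing early on a missing or placeholder value tends to be easier to diagnose than a connection error raised later by one of the clients.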

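Example script for prompt and query

Here's the Python script that instantiates the clients, defines the prompt, and sets up the query. You can run this script in the notebook to generate a response from your chat model deployment.

For the Azure Government cloud, modify the API endpoint on the token provider to "https://cognitiveservices.azure.us/.default".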

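# Import libraries
from azure.identity import DefaultAzureCredential, get_bearer_token_provider
from azure.search.documents import SearchClient
from azure.search.documents.models import VectorizableTextQuery
from openai import AzureOpenAI

# Assumption: credential and index_name are normally defined earlier in the notebook
# by the indexing pipeline steps. The credential is re-created here for convenience;
# index_name must still hold the name of the index you built.
credential = DefaultAzureCredential()

token_provider = get_bearer_token_provider(credential, "https://cognitiveservices.azure.com/.default")
openai_client = AzureOpenAI(
    api_version="2024-06-01",
    azure_endpoint=AZURE_OPENAI_ACCOUNT,
    azure_ad_token_provider=token_provider
)

deployment_name = "gpt-4o"

search_client = SearchClient(
    endpoint=AZURE_SEARCH_SERVICE,
    index_name=index_name,
    credential=credential
)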

# Provide instructions to the model
GROUNDED_PROMPT="""
You are an AI assistant that helps users learn from the information found in the source material.
Answer the query using only the sources provided below.
Use bullets if the answer has multiple points.
If the answer is longer than 3 sentences, provide a summary.
Answer ONLY with the facts listed in the list of sources below. Cite your source when you answer the question.
If there isn't enough information below, say you don't know.
Do not generate answers that don't use the sources below.
Query: {query}
Sources:\n{sources}
"""

# Provide the search query. 
# It's hybrid: a keyword search on "query", with text-to-vector conversion for "vector_query".
# The vector query finds 50 nearest neighbor matches in the search index
query="What's the NASA earth book about?"
vector_query = VectorizableTextQuery(text=query, k_nearest_neighbors=50, fields="text_vector")

# Set up the search results and the chat thread.
# Retrieve the selected fields from the search index related to the question.
# Search results are limited to the top 5 matches. Limiting top can help you stay under LLM quotas.
search_results = search_client.search(
    search_text=query,
    vector_queries= [vector_query],
    select=["title", "chunk", "locations"],
    top=5,
)

# Newlines could be in the OCR'd content or in PDFs, as is the case for the sample PDFs used for this tutorial.
# Use a unique separator to make the sources distinct. 
# We chose repeated equal signs (=) followed by a newline because it's unlikely the source documents contain this sequence.
sources_formatted = "=================\n".join([f'TITLE: {document["title"]}, CONTENT: {document["chunk"]}, LOCATIONS: {document["locations"]}' for document in search_results])

response = openai_client.chat.completions.create(
    messages=[
        {
            "role": "user",
            "content": GROUNDED_PROMPT.format(query=query, sources=sources_formatted)
        }
    ],
    model=deployment_name
)

print(response.choices[0].message.content)
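
The source-formatting step in the script can be pulled into a small helper, which makes it easier to inspect exactly what the model receives. This is only a sketch: `format_sources` and `SOURCE_SEPARATOR` are names introduced here, and the sample documents are made up to stand in for search results.

```python
from typing import Any, Iterable, Mapping

# Same separator the script uses: repeated equal signs followed by a newline.
SOURCE_SEPARATOR = "=================\n"


def format_sources(documents: Iterable[Mapping[str, Any]]) -> str:
    """Join search results into one string for the {sources} placeholder."""
    return SOURCE_SEPARATOR.join(
        f'TITLE: {doc["title"]}, CONTENT: {doc["chunk"]}, LOCATIONS: {doc["locations"]}'
        for doc in documents
    )


# Made-up documents standing in for search results.
sample_docs = [
    {"title": "page-8.pdf", "chunk": "Earth is a dynamic system.", "locations": ["Earth"]},
    {"title": "page-43.pdf", "chunk": "Holuhraun Lava Field.", "locations": ["Iceland"]},
]
formatted = format_sources(sample_docs)
print(formatted)
```

In the notebook, `format_sources(search_results)` would take the place of the inline join, and printing the result before sending it is a quick way to confirm which chunks the answer is grounded on.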

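Review results

In this response, the answer is based on five inputs (top=5), consisting of the chunks that the search engine determined to be most relevant. Instructions in the prompt tell the LLM to use only the information in sources, that is, the formatted search results.

Results from the first query "What's the NASA earth book about?" should look similar to the following example.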

The NASA Earth book is about the intricate and captivating science of our planet, studied 
through NASA's unique perspective and tools. It presents Earth as a dynamic and complex 
system, observed through various cycles and processes such as the water cycle and ocean 
circulation. The book combines stunning satellite images with detailed scientific insights, 
portraying Earth’s beauty and the continuous interaction of land, wind, water, ice, and 
air seen from above. It aims to inspire and demonstrate that the truth of our planet is 
as compelling as any fiction.

Source: page-8.pdf

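Expect the LLM to return different answers even if the prompt and query are unchanged. Your results might look very different from the example. For more information, see Learn how to use reproducible output.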
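
To reduce run-to-run variation while you experiment, the chat completions API accepts `seed` and `temperature` parameters. The sketch below builds the request arguments; the `build_chat_request` helper and the specific values are assumptions, support for `seed` can depend on your API version and model, and even with a fixed seed the output is best-effort reproducible rather than guaranteed.

```python
from typing import Any, Dict


def build_chat_request(prompt: str, deployment: str, seed: int = 42,
                       temperature: float = 0.0) -> Dict[str, Any]:
    """Build keyword arguments for chat.completions.create (hypothetical helper)."""
    return {
        "model": deployment,
        "messages": [{"role": "user", "content": prompt}],
        # A fixed seed asks the service to sample the same way on each call.
        "seed": seed,
        # A low temperature makes the output less random.
        "temperature": temperature,
    }


request = build_chat_request("placeholder prompt", deployment="gpt-4o")

# In the notebook, you would pass the arguments to the existing client:
# response = openai_client.chat.completions.create(**request)
# print(response.choices[0].message.content)
```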

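Note

In testing this tutorial, we saw a variety of responses, some more relevant than others. A few times, repeating the same request caused the response to deteriorate, most likely because of confusion in the chat history, with the model possibly registering the repeated requests as dissatisfaction with the generated answer. Managing chat history is out of scope for this tutorial, but including it in your application code should mitigate or even eliminate this behavior.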

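Add a filter

Recall that you created a locations field using applied AI, populated with places recognized by the Entity Recognition skill. The field definition for locations includes the filterable attribute. Let's repeat the previous request, but this time add a filter that selects on the term ice in the locations field.

A filter introduces inclusion or exclusion criteria. The search engine still performs a vector search on "What's the NASA earth book about?", but it now excludes matches that don't include ice. For more information about filtering on string collections and on vector queries, see Text filter fundamentals, Understand collection filters, and Add filters to a vector query.

Replace the search_results definition with the following example that includes a filter: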

query="what is the NASA earth book about?"
vector_query = VectorizableTextQuery(text=query, k_nearest_neighbors=50, fields="text_vector")

# Add a filter that selects documents based on whether locations includes the term "ice".
search_results = search_client.search(
    search_text=query,
    vector_queries= [vector_query],
    filter="search.ismatch('ice*', 'locations', 'full', 'any')",
    select=["title", "chunk", "locations"],
    top=5,
)

sources_formatted = "=================\n".join([f'TITLE: {document["title"]}, CONTENT: {document["chunk"]}, LOCATIONS: {document["locations"]}' for document in search_results])
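
If you want to try other terms, the filter string can be built from a variable. The sketch below reproduces the `search.ismatch` expression from the example; `build_ismatch_filter` is a hypothetical helper, not part of the SDK. It doubles single quotes, which is how OData string literals escape them, but it makes no attempt to escape Lucene special characters in the term.

```python
def build_ismatch_filter(term: str, field: str = "locations") -> str:
    """Return an OData filter that prefix-matches `term` in `field`."""
    # OData string literals escape a single quote by doubling it.
    safe_term = term.replace("'", "''")
    # 'full' selects the full Lucene query syntax, which allows the * wildcard;
    # 'any' matches documents that contain any of the search terms.
    return f"search.ismatch('{safe_term}*', '{field}', 'full', 'any')"


ice_filter = build_ismatch_filter("ice")
print(ice_filter)
```

The resulting string can be passed as the `filter` argument of `search_client.search` in place of the hardcoded expression.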

Results from the filtered query should now look similar to the following response. Notice the emphasis on ice cover.

The NASA Earth book showcases various geographic and environmental features of Earth through 
satellite imagery, highlighting remarkable landscapes and natural phenomena. 

- It features extraordinary views like the Holuhraun Lava Field in Iceland, captured by 
Landsat 8 during an eruption in 2014, with false-color images illustrating different elements 
such as ice, steam, sulfur dioxide, and fresh lava ([source](page-43.pdf)).
- Other examples include the North Patagonian Icefield in South America, depicted through 
clear satellite images showing glaciers and their changes over time ([source](page-147.pdf)).
- It documents melt ponds in the Arctic, exploring their effects on ice melting and 
heat absorption ([source](page-153.pdf)).
  
Overall, the book uses satellite imagery to give insights into Earth's dynamic systems 
and natural changes.

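Change the inputs

Increasing or decreasing the number of inputs to the LLM can have a large effect on the response. Try running the same query again after setting top=8. When you increase the inputs, the model returns different results each time, even if the query doesn't change.

Here's one example of what the model returns after increasing the inputs to 8.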

The NASA Earth book features a range of satellite images capturing various natural phenomena 
across the globe. These include:

- The Holuhraun Lava Field in Iceland documented by Landsat 8 during a 2014 volcanic 
eruption (Source: page-43.pdf).
- The North Patagonian Icefield in South America, highlighting glacial landscapes 
captured in a rare cloud-free view in 2017 (Source: page-147.pdf).
- The impact of melt ponds on ice sheets and sea ice in the Arctic, with images from 
an airborne research campaign in Alaska during July 2014 (Source: page-153.pdf).
- Sea ice formations at Shikotan, Japan, and other notable geographic features in various 
locations recorded by different Landsat missions (Source: page-168.pdf).

Summary: The book showcases satellite images of diverse Earth phenomena, such as volcanic 
eruptions, icefields, and sea ice, to provide insights into natural processes and landscapes.

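Because the model is bound to the grounding data, the answer becomes more expansive as you increase the size of the input. You can use relevance tuning to potentially generate more focused answers.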
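
One way to compare input sizes side by side is to loop over several `top` values and collect the answers. The sketch below takes the search and chat steps as callables so it can run without a live service; `compare_top_values`, `search_fn`, and `chat_fn` are hypothetical names, and in the notebook they would wrap `search_client.search` and `openai_client.chat.completions.create`.

```python
from typing import Callable, Dict, Iterable, List


def compare_top_values(
    query: str,
    top_values: Iterable[int],
    search_fn: Callable[[str, int], List[str]],
    chat_fn: Callable[[str], str],
    prompt_template: str,
) -> Dict[int, str]:
    """Run the same query with different numbers of inputs and collect each answer."""
    answers: Dict[int, str] = {}
    for top in top_values:
        # search_fn returns already-formatted source strings, at most `top` of them.
        sources = "=================\n".join(search_fn(query, top))
        answers[top] = chat_fn(prompt_template.format(query=query, sources=sources))
    return answers


# Stand-ins for the real clients so the sketch runs offline.
def fake_search(query: str, top: int) -> List[str]:
    return [f"TITLE: page-{i}.pdf" for i in range(1, top + 1)]


def fake_chat(prompt: str) -> str:
    return f"answer based on {prompt.count('TITLE:')} sources"


results = compare_top_values(
    "What's the NASA earth book about?", [5, 8], fake_search, fake_chat,
    "Query: {query}\nSources:\n{sources}",
)
print(results)
```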

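Change the prompt

You can also change the prompt to control the format and tone of the output, and whether you want the model to supplement the answer with its own training data.

Here's another example of LLM output if we refocus the prompt on identifying locations for scientific study.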

# Provide instructions to the model
GROUNDED_PROMPT="""
You are an AI assistant that helps scientists identify locations for future study.
Answer the query concisely, using bulleted points.
Answer ONLY with the facts listed in the list of sources below.
If there isn't enough information below, say you don't know.
Do not generate answers that don't use the sources below.
Do not exceed 5 bullets.
Query: {query}
Sources:\n{sources}
"""

Output from changing just the prompt, while otherwise retaining all aspects of the previous query, might look like this example.

The NASA Earth book appears to showcase various locations on Earth captured through satellite imagery, 
highlighting natural phenomena and geographic features. For instance, the book includes:

- The Holuhraun Lava Field in Iceland, detailing volcanic activity and its observation via Landsat 8.
- The North Patagonian Icefield in South America, covering its glaciers and changes over time as seen by Landsat 8.
- Melt ponds in the Arctic and their impacts on the heat balance and ice melting.
- Iceberg A-56 in the South Atlantic Ocean and its interaction with cloud formations.

(Source: page-43.pdf, page-147.pdf, page-153.pdf, page-39.pdf)

Tip

If you're continuing with the tutorial, remember to restore the prompt to its previous value (You are an AI assistant that helps users learn from the information found in the source material).

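Changing parameters and prompts affects the response from the LLM. As you explore on your own, keep the following tips in mind:

Raising the top value can exhaust the available quota on the model. If there's no quota left, an error message is returned, or the model might return "I don't know".
Raising the top value doesn't necessarily improve the outcome. In testing with top, we sometimes noticed that the answers weren't any better.
So what might help? Typically, the answer is relevance tuning. Improving the relevance of the search results from Azure AI Search is usually the most effective approach for maximizing the utility of your LLM.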
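
Because a higher top value sends more text to the model, one rough safeguard is to cap the total size of the sources before building the prompt. The following sketch uses a character budget as a crude proxy for tokens; `trim_sources` and the budget value are assumptions for illustration, not settings from Azure OpenAI.

```python
from typing import List


def trim_sources(sources: List[str], max_chars: int) -> List[str]:
    """Keep sources in ranked order until adding another would exceed max_chars."""
    kept: List[str] = []
    used = 0
    for source in sources:
        if used + len(source) > max_chars:
            break  # Stop at the first source that doesn't fit, preserving rank order.
        kept.append(source)
        used += len(source)
    return kept


# Three made-up sources of 400 characters each, already ranked by relevance.
ranked = ["a" * 400, "b" * 400, "c" * 400]
print(len(trim_sources(ranked, max_chars=1000)))
```

Stopping at the first source that doesn't fit keeps the highest-ranked chunks, which matches the idea that relevance matters more than volume.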

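In the next series of tutorials, the focus shifts to maximizing relevance and optimizing query performance for speed and accuracy. We revisit the schema definition and query logic to implement relevance features, but the rest of the pipeline and models remain intact.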

Next steps

Maximize relevance
