Tutorial: Build an indexing pipeline for RAG on Azure AI Search

[Article]
11/19/2024

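Learn how to build an automated indexing pipeline for a RAG solution on Azure AI Search. Automation runs through an indexer, which drives both indexing and skillset execution. The indexer provides integrated data chunking and vectorization, either as a one-time run or on a recurring basis for incremental updates.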

In this tutorial, you:

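Provide the index schema from the previous tutorial
Create a data source connection
Create an indexer
Create a skillset that chunks, vectorizes, and recognizes entities
Run the indexer and check the results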

If you don't have an Azure subscription, create a free account before you begin.

Tip

You can also create the pipeline with the Import and vectorize data wizard. Try the "Image search" and "Vector search" quickstarts.

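Prerequisites

Visual Studio Code with the Python extension and the Jupyter package. For more information, see "Python in Visual Studio Code".
An Azure Storage general-purpose account. This exercise uploads PDF files to Blob Storage for automated indexing.
Azure AI Search, Basic tier or higher (required for managed identity and semantic ranking). Choose a region that is shared with Azure OpenAI and Azure AI services.
Azure OpenAI in the same region as Azure AI Search, with a deployment of text-embedding-3-large. For more information about embedding models used in RAG solutions, see "Choose embedding models for RAG in Azure AI Search".
An Azure AI services multi-service account in the same region as Azure AI Search. This resource is used by the Entity Recognition skill, which detects locations in your content.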

Download the sample

Download the Jupyter notebook from GitHub to send the requests to Azure AI Search. For more information, see "Downloading files from GitHub".

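Provide the index schema

Open or create a Jupyter notebook (.ipynb) in Visual Studio Code to hold the scripts that make up the pipeline. The first steps install packages and collect the variables for the connections. After you complete the setup steps, you're ready to begin working with the components of the indexing pipeline.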
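The code in this tutorial refers to connection variables such as AZURE_SEARCH_SERVICE and AZURE_OPENAI_ACCOUNT without showing where they come from. The following is a minimal sketch of one way to collect them, assuming they are stored in environment variables under the same names. The load_settings helper and the environment-variable approach are illustrative assumptions, not part of the sample notebook.

```python
import os

# Variable names match those used later in this tutorial. Reading them from
# environment variables is an assumption made for this sketch.
REQUIRED_SETTINGS = (
    "AZURE_SEARCH_SERVICE",
    "AZURE_OPENAI_ACCOUNT",
    "AZURE_AI_MULTISERVICE_KEY",
    "AZURE_STORAGE_CONNECTION",
)


def load_settings(env=None):
    """Return the required settings as a dict, failing fast if any are missing."""
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED_SETTINGS if not env.get(name)]
    if missing:
        raise KeyError(f"Missing settings: {', '.join(missing)}")
    return {name: env[name] for name in REQUIRED_SETTINGS}


# Example usage in a notebook cell (uncomment once the variables are set):
# settings = load_settings()
# AZURE_SEARCH_SERVICE = settings["AZURE_SEARCH_SERVICE"]
# AZURE_OPENAI_ACCOUNT = settings["AZURE_OPENAI_ACCOUNT"]
```

Failing fast on a missing value tends to be easier to debug than a connection error raised later by one of the clients.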

Let's start with the index schema from the previous tutorial. It's organized around vectorized and nonvectorized chunks, and it includes a locations field that stores AI-generated content created by the skillset.

from azure.identity import DefaultAzureCredential
from azure.identity import get_bearer_token_provider
from azure.search.documents.indexes import SearchIndexClient
from azure.search.documents.indexes.models import (
    SearchField,
    SearchFieldDataType,
    VectorSearch,
    HnswAlgorithmConfiguration,
    VectorSearchProfile,
    AzureOpenAIVectorizer,
    AzureOpenAIVectorizerParameters,
    SearchIndex
)

credential = DefaultAzureCredential()

# Create a search index  
index_name = "py-rag-tutorial-idx"
index_client = SearchIndexClient(endpoint=AZURE_SEARCH_SERVICE, credential=credential)  
fields = [
    SearchField(name="parent_id", type=SearchFieldDataType.String),  
    SearchField(name="title", type=SearchFieldDataType.String),
    SearchField(name="locations", type=SearchFieldDataType.Collection(SearchFieldDataType.String), filterable=True),
    SearchField(name="chunk_id", type=SearchFieldDataType.String, key=True, sortable=True, filterable=True, facetable=True, analyzer_name="keyword"),  
    SearchField(name="chunk", type=SearchFieldDataType.String, sortable=False, filterable=False, facetable=False),  
    SearchField(name="text_vector", type=SearchFieldDataType.Collection(SearchFieldDataType.Single), vector_search_dimensions=1024, vector_search_profile_name="myHnswProfile")
    ]  
  
# Configure the vector search configuration  
vector_search = VectorSearch(  
    algorithms=[  
        HnswAlgorithmConfiguration(name="myHnsw"),
    ],  
    profiles=[  
        VectorSearchProfile(  
            name="myHnswProfile",  
            algorithm_configuration_name="myHnsw",  
            vectorizer_name="myOpenAI",  
        )
    ],  
    vectorizers=[  
        AzureOpenAIVectorizer(  
            vectorizer_name="myOpenAI",  
            kind="azureOpenAI",  
            parameters=AzureOpenAIVectorizerParameters(  
                resource_url=AZURE_OPENAI_ACCOUNT,  
                deployment_name="text-embedding-3-large",
                model_name="text-embedding-3-large"
            ),
        ),  
    ], 
)  
  
# Create the search index
index = SearchIndex(name=index_name, fields=fields, vector_search=vector_search)  
result = index_client.create_or_update_index(index)  
print(f"{result.name} created")

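Create a data source connection

In this step, you set up the sample data and a connection to Azure Blob Storage. The indexer retrieves PDFs from a container, so you create the container and upload the files here.

The original ebook is large: more than 100 pages and over 35 MB in size. To stay under the indexer document limit of 16 MB per API call and the AI enrichment data limits, we split it into smaller PDFs, one per page of text. For simplicity, this exercise omits image vectorization.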

Sign in to the Azure portal and find your Azure Storage account.
Create a container and upload the PDFs from earth_book_2019_text_pages.
Make sure Azure AI Search has Storage Blob Data Reader permissions on the resource.
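Before uploading, it can be useful to confirm that every PDF is below the 16 MB limit mentioned earlier. The sketch below is one way to check a local folder. The oversized_files helper is hypothetical, and treating 16 MB as 16 * 1024 * 1024 bytes is an assumption.

```python
from pathlib import Path

# Assumption: 16 MB is interpreted as 16 * 1024 * 1024 bytes.
MAX_BYTES = 16 * 1024 * 1024


def oversized_files(folder, limit=MAX_BYTES):
    """Return the sorted names of PDFs in `folder` larger than `limit` bytes."""
    return sorted(
        path.name
        for path in Path(folder).glob("*.pdf")
        if path.stat().st_size > limit
    )


# Example usage, assuming the sample PDFs are in a local folder:
# print(oversized_files("earth_book_2019_text_pages"))
```

An empty list means every PDF in the folder is within the limit.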

Next, in Visual Studio Code, define an indexer data source that provides the connection information during indexing.

from azure.search.documents.indexes import SearchIndexerClient
from azure.search.documents.indexes.models import (
    SearchIndexerDataContainer,
    SearchIndexerDataSourceConnection
)

# Create a data source 
indexer_client = SearchIndexerClient(endpoint=AZURE_SEARCH_SERVICE, credential=credential)
container = SearchIndexerDataContainer(name="nasa-ebooks-pdfs-all")
data_source_connection = SearchIndexerDataSourceConnection(
    name="py-rag-tutorial-ds",
    type="azureblob",
    connection_string=AZURE_STORAGE_CONNECTION,
    container=container
)
data_source = indexer_client.create_or_update_data_source_connection(data_source_connection)

print(f"Data source '{data_source.name}' created or updated")

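If you set up a managed identity for Azure AI Search for the connection, the connection string contains a ResourceId= entry instead of an account key. It should look similar to the following example: "ResourceId=/subscriptions/FAKE-SUBSCRIPTION-ID/resourceGroups/FAKE-RESOURCE-GROUP/providers/Microsoft.Storage/storageAccounts/FAKE-ACCOUNT;"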
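As a small illustration of that format, the following sketch assembles a ResourceId connection string from its parts. The helper name and its parameters are hypothetical. The string layout follows the example above.

```python
def managed_identity_connection_string(subscription_id, resource_group, storage_account):
    """Build a ResourceId-style connection string for a storage account."""
    return (
        f"ResourceId=/subscriptions/{subscription_id}"
        f"/resourceGroups/{resource_group}"
        f"/providers/Microsoft.Storage/storageAccounts/{storage_account};"
    )


# Example usage with placeholder values:
# AZURE_STORAGE_CONNECTION = managed_identity_connection_string(
#     "FAKE-SUBSCRIPTION-ID", "FAKE-RESOURCE-GROUP", "FAKE-ACCOUNT"
# )
```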

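Create a skillset

Skills are the basis for integrated data chunking and vectorization. At a minimum, you need a Text Split skill to chunk your content and an embedding skill to create vector representations of the chunked content.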
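To make the idea of overlapping chunks concrete, here is a simplified sketch that cuts text into fixed-size character windows. It is not the Text Split skill's actual algorithm, which is not reproduced here. The chunk_text function is hypothetical, and its defaults only mirror the maximum_page_length and page_overlap_length values used in the skill definition in this section.

```python
def chunk_text(text, max_length=2000, overlap=500):
    """Split text into windows of at most `max_length` characters.

    Each window after the first starts `overlap` characters before the
    end of the previous one, so neighboring chunks share some context.
    """
    if not 0 <= overlap < max_length:
        raise ValueError("overlap must be non-negative and smaller than max_length")
    step = max_length - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + max_length])
        # Stop once a window has reached the end of the text.
        if start + max_length >= len(text):
            break
    return chunks


# Example: a 10-character string, windows of 4 with an overlap of 2.
print(chunk_text("0123456789", max_length=4, overlap=2))
# ['0123', '2345', '4567', '6789']
```

A larger overlap reduces the chance that a sentence relevant to a query is cut in half, at the cost of more chunks to embed and store.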

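This skillset uses an additional skill to create structured data in the index. The Entity Recognition skill identifies locations, ranging from proper names to generic references such as "ocean" or "mountain". Structured data gives you more options for creating interesting queries and for boosting relevance.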
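One way structured data helps: because the locations field is filterable in the index schema, a query can be restricted to chunks that mention a given place. The sketch below builds an OData filter expression for that purpose. The locations_filter helper is hypothetical, and the match is an exact, case-sensitive comparison.

```python
def locations_filter(place):
    """Return an OData filter matching chunks whose locations contain `place`."""
    # OData string literals escape a single quote by doubling it.
    escaped = place.replace("'", "''")
    return f"locations/any(loc: loc eq '{escaped}')"


print(locations_filter("Patagonia"))
# locations/any(loc: loc eq 'Patagonia')
```

The resulting string could be passed as the filter argument of search_client.search(...) once the index is populated.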

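AZURE_AI_MULTISERVICE_KEY is needed even if you're using role-based access control. Azure AI Search uses the key for billing purposes, so it's required unless your workloads stay under the free limit. If you're using the most recent preview API or beta packages, you can also use a keyless connection. For more information, see the article on attaching an Azure AI multi-service resource to a skillset.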

from azure.search.documents.indexes.models import (
    SplitSkill,
    InputFieldMappingEntry,
    OutputFieldMappingEntry,
    AzureOpenAIEmbeddingSkill,
    EntityRecognitionSkill,
    SearchIndexerIndexProjection,
    SearchIndexerIndexProjectionSelector,
    SearchIndexerIndexProjectionsParameters,
    IndexProjectionMode,
    SearchIndexerSkillset,
    CognitiveServicesAccountKey
)

# Create a skillset  
skillset_name = "py-rag-tutorial-ss"

split_skill = SplitSkill(  
    description="Split skill to chunk documents",  
    text_split_mode="pages",  
    context="/document",  
    maximum_page_length=2000,  
    page_overlap_length=500,  
    inputs=[  
        InputFieldMappingEntry(name="text", source="/document/content"),  
    ],  
    outputs=[  
        OutputFieldMappingEntry(name="textItems", target_name="pages")  
    ],  
)  
  
embedding_skill = AzureOpenAIEmbeddingSkill(  
    description="Skill to generate embeddings via Azure OpenAI",  
    context="/document/pages/*",  
    resource_url=AZURE_OPENAI_ACCOUNT,  
    deployment_name="text-embedding-3-large",  
    model_name="text-embedding-3-large",
    # Must match vector_search_dimensions on the text_vector field in the index
    dimensions=1024,
    inputs=[  
        InputFieldMappingEntry(name="text", source="/document/pages/*"),  
    ],  
    outputs=[  
        OutputFieldMappingEntry(name="embedding", target_name="text_vector")  
    ],  
)

entity_skill = EntityRecognitionSkill(
    description="Skill to recognize entities in text",
    context="/document/pages/*",
    categories=["Location"],
    default_language_code="en",
    inputs=[
        InputFieldMappingEntry(name="text", source="/document/pages/*")
    ],
    outputs=[
        OutputFieldMappingEntry(name="locations", target_name="locations")
    ]
)
  
index_projections = SearchIndexerIndexProjection(  
    selectors=[  
        SearchIndexerIndexProjectionSelector(  
            target_index_name=index_name,  
            parent_key_field_name="parent_id",  
            source_context="/document/pages/*",  
            mappings=[  
                InputFieldMappingEntry(name="chunk", source="/document/pages/*"),  
                InputFieldMappingEntry(name="text_vector", source="/document/pages/*/text_vector"),
                InputFieldMappingEntry(name="locations", source="/document/pages/*/locations"),  
                InputFieldMappingEntry(name="title", source="/document/metadata_storage_name"),  
            ],  
        ),  
    ],  
    parameters=SearchIndexerIndexProjectionsParameters(  
        projection_mode=IndexProjectionMode.SKIP_INDEXING_PARENT_DOCUMENTS  
    ),  
) 

cognitive_services_account = CognitiveServicesAccountKey(key=AZURE_AI_MULTISERVICE_KEY)

skills = [split_skill, embedding_skill, entity_skill]

skillset = SearchIndexerSkillset(  
    name=skillset_name,  
    description="Skillset to chunk documents and generate embeddings",  
    skills=skills,  
    index_projection=index_projections,
    cognitive_services_account=cognitive_services_account
)
  
client = SearchIndexerClient(endpoint=AZURE_SEARCH_SERVICE, credential=credential)  
client.create_or_update_skillset(skillset)  
print(f"{skillset.name} created")

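Create and run the indexer

Indexers are the component that sets all of the processes in motion. You can create an indexer in a disabled state, but by default it runs immediately. In this tutorial, you create and run the indexer to retrieve the data from Blob Storage, execute the skills (including chunking and vectorization), and load the index.

The indexer takes several minutes to run. When it's done, you can move on to the final step: querying your index.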

from azure.search.documents.indexes.models import (
    SearchIndexer,
    FieldMapping
)

# Create an indexer  
indexer_name = "py-rag-tutorial-idxr" 

indexer_parameters = None

indexer = SearchIndexer(  
    name=indexer_name,  
    description="Indexer to index documents and generate embeddings",  
    skillset_name=skillset_name,  
    target_index_name=index_name,  
    data_source_name=data_source.name,
    # Map the metadata_storage_name field to the title field in the index to display the PDF title in the search results  
    field_mappings=[FieldMapping(source_field_name="metadata_storage_name", target_field_name="title")],
    parameters=indexer_parameters
)  

# Create and run the indexer  
indexer_client = SearchIndexerClient(endpoint=AZURE_SEARCH_SERVICE, credential=credential)  
indexer_result = indexer_client.create_or_update_indexer(indexer)  

print(f' {indexer_name} is created and running. Give the indexer a few minutes before running a query.')
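Rather than waiting a fixed amount of time, you could poll the indexer status. The following is a sketch under a few assumptions: it relies on the get_indexer_status method of SearchIndexerClient and on the "inProgress" execution status value, while the wait_for_indexer helper, its timeout, and its polling interval are hypothetical choices.

```python
import time


def wait_for_indexer(client, name, timeout=600, interval=10, sleep=time.sleep):
    """Poll until the indexer's last run is no longer in progress.

    Returns the final status of the last run, or raises TimeoutError
    if the run has not finished within `timeout` seconds.
    """
    waited = 0
    while waited <= timeout:
        last = client.get_indexer_status(name).last_result
        # last_result is None until the indexer has started at least one run.
        if last is not None and last.status != "inProgress":
            return last.status
        sleep(interval)
        waited += interval
    raise TimeoutError(f"Indexer '{name}' did not finish within {timeout} seconds")


# Example usage with the objects created above:
# print(wait_for_indexer(indexer_client, indexer_name))
```

If the indexer already has a completed run from an earlier session, this sketch can return that older result before the new run starts, so treat the returned status as a hint rather than a guarantee.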

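Run a query to check results

Send a query to confirm that your index is operational. This request converts the text string "what's NASA's website?" into a vector for a vector search. Results consist of the fields listed in the select parameter, some of which are printed as output.

There is no chat or generative AI at this point. The results are verbatim content from your search index.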

from azure.search.documents import SearchClient
from azure.search.documents.models import VectorizableTextQuery

# Vector Search using text-to-vector conversion of the querystring
query = "what's NASA's website?"  

search_client = SearchClient(endpoint=AZURE_SEARCH_SERVICE, credential=credential, index_name=index_name)
vector_query = VectorizableTextQuery(text=query, k_nearest_neighbors=50, fields="text_vector")
  
results = search_client.search(  
    search_text=query,  
    vector_queries= [vector_query],
    select=["chunk"],
    top=1
)  
  
for result in results:  
    print(f"Score: {result['@search.score']}")
    print(f"Chunk: {result['chunk']}")

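This query returns a single match (top=1) consisting of the one chunk that the search engine determined to be the most relevant. Results from the query should look similar to the following example: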

Score: 0.01666666753590107
Chunk: national Aeronautics and Space Administration

earth Science

NASA Headquarters 

300 E Street SW 

Washington, DC 20546

www.nasa.gov

np-2018-05-2546-hQ

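Try a few more queries to get a sense of what the search engine returns directly, so that you can compare it with an LLM-enabled response. Rerun the previous script with this query, "patagonia geography", and set top to 3 to return more than one response.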
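To make reruns easier, the query script can be wrapped in a small function. The sketch below keeps the same search parameters as the script above. The run_hybrid_query helper is hypothetical, and the vector query class is passed in as an argument so that the function itself has no direct SDK import.

```python
def run_hybrid_query(search_client, vector_query_factory, query, top=3):
    """Run a hybrid (keyword plus vector) query and return (score, chunk) pairs."""
    vector_query = vector_query_factory(
        text=query, k_nearest_neighbors=50, fields="text_vector"
    )
    results = search_client.search(
        search_text=query,
        vector_queries=[vector_query],
        select=["chunk"],
        top=top,
    )
    return [(result["@search.score"], result["chunk"]) for result in results]


# Example usage with the objects from the previous script:
# for score, chunk in run_hybrid_query(
#     search_client, VectorizableTextQuery, "patagonia geography", top=3
# ):
#     print(f"Score: {score}")
#     print(f"Chunk: {chunk}")
```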

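Results from this second query should look similar to the following, lightly edited for concision. The output is copied from the notebook, which truncates the response as shown in this example. Expand the cell output to review the complete answer.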

Score: 0.03306011110544205
Chunk: 

Swirling Bloom off Patagonia
Argentina

Interesting art often springs out of the convergence of different ideas and influences. 
And so it is with nature. 

Off the coast of Argentina, two strong ocean currents converge and often stir up a colorful 
brew, as shown in this Aqua image from 

December 2010. 

This milky green and blue bloom formed on the continental shelf off of Patagonia, where warmer, 
saltier waters from the subtropics 

meet colder, fresher waters flowing from the south. Where these currents collide, turbulent 
eddies and swirls form, pulling nutrients 

up from the deep ocean. The nearby Rio de la Plata also deposits nitrogen- and iron-laden 
sediment into the sea. Add in some 
...

while others terminate in water. The San Rafael and San Quintín glaciers (shown at the right) 
are the icefield’s largest. Both have 

been receding rapidly in the past 30 years.

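This example makes it easy to see how chunks are returned verbatim, and how keyword and similarity search identify the top matches. This particular chunk does contain information about Patagonia and geography, but it isn't necessarily relevant to the query. Semantic ranker would promote more relevant chunks for a better answer. As a next step, however, let's see how to connect Azure AI Search to an LLM for conversational search.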

Next steps

Chat with your data
