Generate vector embeddings with Azure OpenAI on Azure Database for PostgreSQL - Flexible Server

Article
08/15/2024

Applies to: Azure Database for PostgreSQL - Flexible Server

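Invoke Azure OpenAI embeddings to get a vector representation of the input. The resulting vectors can then be used in vector similarity searches and consumed by machine learning models.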

Prerequisites

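Enable and configure the azure_ai extension.
Create an OpenAI account and request access to Azure OpenAI Service.
Grant access to Azure OpenAI in the desired subscription.
Grant permissions to create Azure OpenAI resources and to deploy models.
Create and deploy an Azure OpenAI service resource and a model, for example the embeddings model text-embedding-ada-002. Copy the deployment name, because it is needed to create embeddings.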
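
As a minimal sketch of the first prerequisite, the statements below create the extensions in the target database. They assume that both azure_ai and vector (the extension name under which pgvector is registered) have already been added to the server's azure.extensions allowlist; adjust to your own setup.

```sql
-- Assumption: azure_ai and vector are already allow-listed in the
-- azure.extensions server parameter of the flexible server.
CREATE EXTENSION IF NOT EXISTS azure_ai;

-- pgvector provides the vector column type used by the examples in this article.
CREATE EXTENSION IF NOT EXISTS vector;

-- List the installed versions to confirm both extensions are present.
SELECT extname, extversion
FROM pg_extension
WHERE extname IN ('azure_ai', 'vector');
```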

Configure OpenAI endpoint and key

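In the Azure OpenAI resource, under [Resource Management] > [Keys and Endpoint], you can find the endpoint and the keys for your Azure OpenAI resource. To invoke the model deployment, configure the azure_ai extension with the endpoint and one of the keys.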

select azure_ai.set_setting('azure_openai.endpoint', 'https://<endpoint>.openai.azure.com'); 
select azure_ai.set_setting('azure_openai.subscription_key', '<API Key>');
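
To check that the endpoint was stored, the value can be read back. This is a small sketch that assumes the azure_ai.get_setting function is available in your version of the extension. Reading back the subscription key the same way would print the secret, so only the endpoint is shown here.

```sql
-- Read back the configured endpoint to confirm it was saved.
SELECT azure_ai.get_setting('azure_openai.endpoint') AS configured_endpoint;
```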

`azure_openai.create_embeddings`

Invokes the Azure OpenAI API to create embeddings for the given input using the provided deployment.

azure_openai.create_embeddings(deployment_name text, input text, timeout_ms integer DEFAULT 3600000, throw_on_error boolean DEFAULT true, max_attempts integer DEFAULT 1, retry_delay_ms integer DEFAULT 1000)
azure_openai.create_embeddings(deployment_name text, input text[], batch_size integer DEFAULT 100, timeout_ms integer DEFAULT 3600000, throw_on_error boolean DEFAULT true, max_attempts integer DEFAULT 1, retry_delay_ms integer DEFAULT 1000)

Arguments

`deployment_name`

text Name of the deployment in Azure OpenAI Studio that contains the model.

`input`

text or text[] A single text or an array of texts, depending on the overload of the function used to create the embeddings.

`dimensions`

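integer DEFAULT NULL The number of dimensions the resulting output embeddings should have. Supported only in text-embedding-3 and later models. Available in version 1.1.0 and later of the azure_ai extension.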

`batch_size`

integer DEFAULT 100 Number of records to process at a time (available only for the overload of the function whose input parameter is of type text[]).

`timeout_ms`

integer DEFAULT 3600000 Timeout in milliseconds after which the operation is stopped.

`throw_on_error`

boolean DEFAULT true Whether the function should throw an exception on error, which results in a rollback of the wrapping transaction.

`max_attempts`

integer DEFAULT 1 Number of times the extension retries Azure OpenAI embedding creation if it fails with any retryable error.

`retry_delay_ms`

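integer DEFAULT 1000 Amount of time (in milliseconds) that the extension waits before calling the Azure OpenAI endpoint again for embedding creation, when it fails with any retryable error.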
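
The optional arguments above can be passed with PostgreSQL named notation (name => value), so only the options you want to change need to be spelled out. The following is an illustrative sketch, not a recommended configuration: the timeout and retry values are arbitrary, and the deployment name text-embedding-3-small in the second statement is a hypothetical deployment of a text-embedding-3 model that you would need to create yourself.

```sql
-- Override only selected options; everything else keeps its default.
SELECT azure_openai.create_embeddings(
         'text-embedding-ada-002',                    -- your deployment name
         'Session to learn about building chatbots',  -- input text
         timeout_ms     => 30000,   -- 30 seconds instead of the 3600000 ms default
         max_attempts   => 3,       -- raised from the default of 1
         retry_delay_ms => 2000     -- wait 2000 ms before calling the endpoint again
       ) AS embedding;

-- dimensions: assumes azure_ai 1.1.0 or later and a text-embedding-3 deployment.
-- 'text-embedding-3-small' is a hypothetical deployment name.
SELECT array_length(
         azure_openai.create_embeddings(
           'text-embedding-3-small',
           'Session to learn about building chatbots',
           dimensions => 256
         ),
         1
       ) AS embedding_length;
```

If the dimensions argument is honored by the deployment, the second query should report the requested length, which is also the size to use for the vector column that stores these embeddings.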

Return type

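real[] or TABLE(embedding real[]) A single element or a single-column table, depending on the overload of the function used, with vector representations of the input text as processed by the selected deployment.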
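
The two return shapes can be inspected with a sketch like the following, which assumes a deployment named text-embedding-ada-002 as in the prerequisites. The input strings are placeholders.

```sql
-- Scalar overload: a single text in, a single real[] value out.
SELECT pg_typeof(s.e)       AS return_type,
       array_length(s.e, 1) AS embedding_length
FROM (
  SELECT azure_openai.create_embeddings('text-embedding-ada-002', 'Hello world') AS e
) AS s;

-- Array overload: text[] in, a single-column table out (column name: embedding).
SELECT array_length(embedding, 1) AS embedding_length
FROM azure_openai.create_embeddings(
       'text-embedding-ada-002',
       ARRAY['first input text', 'second input text'],
       batch_size => 2
     );
```

Because the array overload returns a table, it is called in the FROM clause, whereas the scalar overload can be used anywhere an expression is allowed.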

Use OpenAI to create embeddings and store them in a vector data type

-- Create tables and populate data
DROP TABLE IF EXISTS conference_session_embeddings;
DROP TABLE IF EXISTS conference_sessions;

CREATE TABLE conference_sessions(
  session_id int PRIMARY KEY GENERATED BY DEFAULT AS IDENTITY,
  title text,
  session_abstract text,
  duration_minutes integer,
  publish_date timestamp
);

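-- Create a table to store embeddings with a vector column (requires the pgvector extension, named vector).
-- 1536 is the number of dimensions produced by text-embedding-ada-002.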
CREATE TABLE conference_session_embeddings(
  session_id integer NOT NULL REFERENCES conference_sessions(session_id),
  session_embedding vector(1536)
);

-- Insert a row into the sessions table
INSERT INTO conference_sessions
    (title,session_abstract,duration_minutes,publish_date) 
VALUES
    ('Gen AI with Azure Database for PostgreSQL flexible server'
    ,'Learn about building intelligent applications with azure_ai extension and pg_vector' 
    , 60, current_timestamp)
    ,('Deep Dive: PostgreSQL database storage engine internals'
    ,' We will dig deep into storage internals'
    , 30, current_timestamp)
    ;

-- Get an embedding for the Session Abstract
SELECT
     pg_typeof(azure_openai.create_embeddings('text-embedding-ada-002', c.session_abstract)) as embedding_data_type
    ,azure_openai.create_embeddings('text-embedding-ada-002', c.session_abstract)
  FROM
    conference_sessions c LIMIT 10;

-- Insert embeddings 
INSERT INTO conference_session_embeddings
    (session_id, session_embedding)
SELECT
    c.session_id, (azure_openai.create_embeddings('text-embedding-ada-002', c.session_abstract))
FROM
    conference_sessions as c  
LEFT OUTER JOIN
    conference_session_embeddings e ON e.session_id = c.session_id
WHERE
    e.session_id IS NULL;

-- Create an HNSW index for inner product distance (matches the <#> operator used in the query below)
CREATE INDEX ON conference_session_embeddings USING hnsw (session_embedding vector_ip_ops);


-- Retrieve top similarity match
SELECT
    c.*
FROM
    conference_session_embeddings e
INNER JOIN
    conference_sessions c ON c.session_id = e.session_id
ORDER BY
    e.session_embedding <#> azure_openai.create_embeddings('text-embedding-ada-002', 'Session to learn about building chatbots')::vector
LIMIT 1;
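
To see how close each match is, and not only which row ranks first, the query can also return the score. The sketch below is a variation on the query above and is not part of the original sample: it computes the query embedding once in a CTE and relies on the pgvector convention that <#> returns the negative inner product, so the value is negated to get the similarity. The alias names q, query_embedding and inner_product are illustrative.

```sql
-- Compute the query embedding once, then rank sessions by inner product.
WITH q AS (
  SELECT azure_openai.create_embeddings(
           'text-embedding-ada-002',
           'Session to learn about building chatbots'
         )::vector AS query_embedding
)
SELECT
    c.session_id,
    c.title,
    -- <#> yields the negative inner product; negate it so larger means more similar.
    (e.session_embedding <#> q.query_embedding) * -1 AS inner_product
FROM
    conference_session_embeddings e
INNER JOIN
    conference_sessions c ON c.session_id = e.session_id
CROSS JOIN
    q
ORDER BY
    e.session_embedding <#> q.query_embedding
LIMIT 5;
```

This form is intended for inspecting scores. Whether the planner uses the HNSW index for a given query shape is best checked with EXPLAIN.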

Next steps

Learn more about vector similarity search with pgvector
