Prepare data for Foundation Model Fine-tuning

12/15/2024

Important

This feature is in Public Preview in the following regions: centralus, eastus, eastus2, northcentralus, and westus.

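This article describes the accepted training and evaluation data file formats for Foundation Model Fine-tuning (now part of Mosaic AI Model Training).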

Notebook: Data validation for training runs

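The following notebook shows how to validate your data. It is designed to be run on its own before you start training. It checks that your data is in the correct format for Foundation Model Fine-tuning, and it includes code that tokenizes your raw dataset to help you estimate the cost of the training run.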

Data validation for training runs notebook


Prepare data for chat completion

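For chat completion tasks, chat-formatted data must be in a .jsonl file, where each line is a separate JSON object representing a single chat session. Each chat session is a JSON object with a single key, messages, that maps to an array of message objects. To train on chat data, provide task_type = 'CHAT_COMPLETION' when you create the training run.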

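Messages in chat format are automatically formatted according to the model's chat template, so you do not need to add special chat tokens to mark the beginning or end of a chat turn manually. An example of a model that uses a custom chat template is Meta Llama 3.1 8B Instruct.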

Each message object in the array represents a single message in the conversation and has the following structure:

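role: A string that identifies the author of the message. Possible values are system, user, and assistant. If the role is system, it must be the first message in the messages list. There must be at least one message with the role assistant, and all messages after the (optional) system prompt must alternate between the user and assistant roles. Two adjacent messages cannot have the same role. The last message in the messages array must have the role assistant.
content: A string that contains the text of the message.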

Note

Mistral models do not accept the system role in their data format.

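The following is an example of chat-formatted data. It is pretty-printed here for readability; in the .jsonl file, each chat session occupies a single line: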

{"messages": [
  {"role": "system", "content": "A conversation between a user and a helpful assistant."},
  {"role": "user", "content": "Hi there. What's the capital of the moon?"},
  {"role": "assistant", "content": "This question doesn't make sense as nobody currently lives on the moon, meaning it would have no government or political institutions. Furthermore, international treaties prohibit any nation from asserting sovereignty over the moon and other celestial bodies."}
  ]
}
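
The role rules above can be checked locally before a file is uploaded. The following is a minimal sketch, not the validation that the service or the notebook performs; the function name validate_chat_session and the wording of the error messages are made up for illustration.

```python
import json

ALLOWED_ROLES = {"system", "user", "assistant"}


def validate_chat_session(session: dict) -> list[str]:
    """Return the rule violations found in one chat session (empty if none).

    Hypothetical helper: it only mirrors the rules listed in this article.
    """
    if not isinstance(session, dict) or set(session) != {"messages"}:
        return ["session must be an object with exactly one key: 'messages'"]
    messages = session["messages"]
    if not isinstance(messages, list) or not messages:
        return ["'messages' must be a non-empty array"]

    errors = []
    roles = []
    for i, msg in enumerate(messages):
        role = msg.get("role") if isinstance(msg, dict) else None
        if role not in ALLOWED_ROLES:
            errors.append(f"message {i}: invalid role {role!r}")
        if not isinstance(msg, dict) or not isinstance(msg.get("content"), str):
            errors.append(f"message {i}: 'content' must be a string")
        roles.append(role)

    # A system message is only allowed in the first position.
    if "system" in roles[1:]:
        errors.append("'system' role is only allowed as the first message")
    if "assistant" not in roles:
        errors.append("at least one 'assistant' message is required")
    # With system restricted to position 0, forbidding equal neighbours
    # is equivalent to requiring user/assistant alternation.
    for i in range(1, len(roles)):
        if roles[i] == roles[i - 1]:
            errors.append(f"messages {i - 1} and {i} share role {roles[i]!r}")
    if roles[-1] != "assistant":
        errors.append("last message must have role 'assistant'")
    return errors


session = {"messages": [
    {"role": "user", "content": "Hi there. What's the capital of the moon?"},
    {"role": "assistant", "content": "The moon has no capital."},
]}
assert validate_chat_session(session) == []

# json.dumps without indent emits the session on one line, as .jsonl requires.
jsonl_line = json.dumps(session)
```

For Mistral models, a stricter variant would also reject any message with the system role, as described in the note above.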

Prepare data for continued pre-training

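For continued pre-training tasks, the training data is your unstructured text data. The training data must be in a Unity Catalog volume that contains .txt files. Each .txt file is treated as a single sample. .txt files that sit inside a folder of the Unity Catalog volume are also picked up as training data. Any non-.txt files in the volume are ignored. See Upload files to a Unity Catalog volume.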

The following image shows example .txt files in a Unity Catalog volume. To use this data in your continued pre-training run configuration, set train_data_path = "dbfs:/Volumes/main/finetuning/cpt-data" and set task_type = 'CONTINUED_PRETRAIN'.

UC volume with example continued pre-training dataset files
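
To preview which files would count as samples, the selection rule described above (.txt files only, including those in folders, one sample per file) can be approximated locally. This is a sketch under assumptions: collect_txt_samples is a hypothetical helper, the case-sensitive "*.txt" match and the recursion into arbitrarily nested folders are guesses about the service's behavior, and the directory you pass is whatever local path exposes your volume.

```python
from pathlib import Path


def collect_txt_samples(root: str) -> dict[str, str]:
    """Map each .txt file under root (searched recursively) to its full text.

    Hypothetical helper: every .txt file becomes one sample, and files with
    any other extension are skipped.
    """
    samples = {}
    for path in sorted(Path(root).rglob("*.txt")):
        if path.is_file():
            key = path.relative_to(root).as_posix()
            samples[key] = path.read_text(encoding="utf-8")
    return samples
```

Calling collect_txt_samples on the volume directory returns one entry per sample, which makes it easy to spot empty files or files that were skipped because of their extension.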

Format data yourself

Warning

The guidance in this section is not recommended, but it is available for scenarios that require custom data formatting.

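Databricks strongly recommends using chat-formatted data so that the proper formatting is applied to your data automatically, based on the model you are using.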

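Foundation Model Fine-tuning lets you do the data formatting yourself. Any data formatting must be applied both when training and when serving your model. To train your model on your formatted data, set task_type = 'INSTRUCTION_FINETUNE' when you create the training run.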

Training and evaluation data must follow one of these schemas:

Prompt and response pairs.

{"prompt": "your-custom-prompt", "response": "your-custom-response"}

Prompt and completion pairs.

{"prompt": "your-custom-prompt", "completion": "your-custom-response"}

Important

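Prompt-response and prompt-completion pairs are not templated, so any model-specific templating, such as Mistral's instruct formatting, must be performed as a preprocessing step.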
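
A preprocessing step of this kind might look like the following sketch. The template is an assumption chosen for illustration: it copies the instruction wrapper visible in the mosaicml/dolly_hhrlhf rows shown later in this article and is not Mistral's format. PROMPT_TEMPLATE and to_prompt_response are hypothetical names; substitute the template that your target model actually expects.

```python
import json

# Hypothetical template, modeled on the dolly_hhrlhf rows in this article.
# Replace it with the format your target model was trained on.
PROMPT_TEMPLATE = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request. "
    "### Instruction: {instruction} ### Response: "
)


def to_prompt_response(instruction: str, answer: str) -> dict[str, str]:
    """Wrap a raw instruction in the template and pair it with its answer."""
    return {
        "prompt": PROMPT_TEMPLATE.format(instruction=instruction),
        "response": answer,
    }


record = to_prompt_response("what is Databricks?", "A cloud-based data platform.")
jsonl_line = json.dumps(record)  # one record per line in the .jsonl file
```

Because the formatting is not applied for you, the same template has to wrap prompts at serving time as well, so it helps to keep it in one shared constant.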

Supported data formats

The following data formats are supported:

A Unity Catalog volume with a .jsonl file. The training data must be in JSONL format, where each line is a valid JSON object. The following example shows a prompt and response pair:

{"prompt": "What is Databricks?","response": "Databricks is a cloud-based data engineering platform that provides a fast, easy, and collaborative way to process large-scale data."}

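A Delta table that adheres to one of the accepted schemas described above. For Delta tables, you must provide a data_prep_cluster_id parameter for data processing. See Configure a training run.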

A public Hugging Face dataset.

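If you use a public Hugging Face dataset as your training data, specify the full path including the split, for example, mosaicml/instruct-v3/train and mosaicml/instruct-v3/test. This accounts for datasets that have different split schemas. Nested datasets from Hugging Face are not supported.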
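
To make the path convention concrete, the sketch below separates a full path into its dataset name and split. split_hf_path is a hypothetical helper, and it assumes dataset names of the form <org>/<dataset>; it does not describe how the service itself parses the path.

```python
def split_hf_path(full_path: str) -> tuple[str, str]:
    """Split '<org>/<dataset>/<split>' into ('<org>/<dataset>', '<split>').

    Hypothetical helper: assumes the dataset name contains an organization.
    """
    dataset, sep, split = full_path.rpartition("/")
    if not sep or "/" not in dataset or not split:
        raise ValueError(f"expected '<org>/<dataset>/<split>', got {full_path!r}")
    return dataset, split


dataset, split = split_hf_path("mosaicml/instruct-v3/train")
# dataset == "mosaicml/instruct-v3", split == "train"
```

The two parts correspond to the path and split arguments of load_dataset in the Hugging Face datasets library, which is a convenient way to inspect a split before you reference it.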

For a more extensive example, see the mosaicml/dolly_hhrlhf dataset on Hugging Face.

The following example rows are from the mosaicml/dolly_hhrlhf dataset.

{"prompt": "Below is an instruction that describes a task. Write a response that appropriately completes the request. ### Instruction: what is Databricks? ### Response: ","response": "Databricks is a cloud-based data engineering platform that provides a fast, easy, and collaborative way to process large-scale data."}
{"prompt": "Below is an instruction that describes a task. Write a response that appropriately completes the request. ### Instruction: Van Halen famously banned what color M&Ms in their rider? ### Response: ","response": "Brown."}
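
The JSONL requirements in this section (every line a valid JSON object that matches one of the accepted schemas) can be checked with a few lines of code before training. This is a minimal sketch, not the validation notebook: check_jsonl_lines is a hypothetical helper, and treating blank lines and non-string values as problems is an assumption based on the examples above.

```python
import json

# The two accepted key sets for self-formatted data.
ACCEPTED_SCHEMAS = ({"prompt", "response"}, {"prompt", "completion"})


def check_jsonl_lines(lines) -> list[str]:
    """Return a description of every line that breaks the format rules."""
    problems = []
    for n, line in enumerate(lines, start=1):
        if not line.strip():
            problems.append(f"line {n}: empty line")  # assumption: not allowed
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError as exc:
            problems.append(f"line {n}: invalid JSON ({exc.msg})")
            continue
        if not isinstance(record, dict) or set(record) not in ACCEPTED_SCHEMAS:
            problems.append(
                f"line {n}: expected keys prompt+response or prompt+completion"
            )
        elif not all(isinstance(v, str) for v in record.values()):
            problems.append(f"line {n}: values must be strings")  # assumption
    return problems


rows = [
    '{"prompt": "What is Databricks?", "response": "A data platform."}',
    '{"prompt": "What is Databricks?"}',
]
problems = check_jsonl_lines(rows)
assert len(problems) == 1 and problems[0].startswith("line 2:")
```

An open file object can be passed directly, for example check_jsonl_lines(open(path, encoding="utf-8")), because iterating over a file yields one line at a time.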
