Debugging guide for Model Serving

12/27/2024

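This article walks through debugging steps for common issues that you might encounter when working with model serving endpoints. Common issues include errors when an endpoint fails to initialize or start, build failures related to the container, and problems while operating the endpoint or running the model on it.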

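Access and review logs

Databricks recommends reviewing build logs to debug and troubleshoot errors in your model serving workloads. For information about logs and how to view them, see Monitor model quality and endpoint health.

Check the event logs for the model in the workspace UI and look for a successful container build message. If you do not see a build message after an hour, contact Databricks support for assistance.

If your build succeeds but you encounter other errors, see Debugging after container build succeeds. If your build fails, see Debugging after container build failure.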

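Installed library package versions

In your build logs, you can confirm the package versions that are installed.

For MLflow versions, if you do not specify a version, Model Serving uses the latest version.
For custom GPU serving, Model Serving installs the recommended versions of cuda and cuDNN according to public PyTorch and Tensorflow documentation.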
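
When a build log is long, it can help to extract the installed versions programmatically. The following is a minimal sketch, assuming the build log contains pip's usual "Successfully installed <name>-<version> ..." summary line. The helper name installed_versions and the sample log text are hypothetical and are not part of any Databricks API.

```python
def installed_versions(build_log: str) -> dict:
    """Map package name -> version from pip's 'Successfully installed' lines."""
    marker = "Successfully installed "
    versions = {}
    for line in build_log.splitlines():
        idx = line.find(marker)
        if idx == -1:
            continue
        for token in line[idx + len(marker):].split():
            # pip prints each package as <name>-<version>; split on the last '-'.
            name, sep, version = token.rpartition("-")
            if sep and name:
                versions[name] = version
    return versions


# Hypothetical log excerpt, for illustration only.
sample_log = """\
#12 41.3 Collecting mlflow
#12 55.0 Successfully installed mlflow-2.19.0 scikit-learn-1.5.2 torch-2.4.1
"""
found = installed_versions(sample_log)
print(found)
```

Comparing the resulting dictionary against the versions used during training is one way to spot an unpinned MLflow or framework version.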

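Validation checks before model deployment

Databricks recommends applying the guidance in this section before you serve your model. The following checks can catch issues early, before you wait for the endpoint. See Validate the model input before deployment to validate your model input before deploying the model.

Test predictions before deployment

Before deploying your model to a serving endpoint, test offline predictions in a virtual environment using mlflow.models.predict and input examples. For more detailed guidance, see the MLflow documentation on testing predictions.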

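import mlflow

# Example chat-style input for the model
input_example = {
    "messages": [
        {"content": "How many categories of products do we have? Name them.", "role": "user"}
    ]
}

# logged_chain_info is assumed to be the object returned when the model was logged with MLflow
mlflow.models.predict(
    model_uri=logged_chain_info.model_uri,
    input_data=input_example,
)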

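Validate the model input before deployment

Model serving endpoints expect a specific JSON input format. To confirm before deployment that your model input works on a serving endpoint, use validate_serving_input in MLflow.

If your model is logged with a valid input example, the following is an example of the auto-generated code shown in the run's artifacts tab.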

from mlflow.models import validate_serving_input

model_uri = 'runs:/<run_id>/<artifact_path>'

serving_payload = """{
 "messages": [
   {
     "content": "How many product categories are there?",
     "role": "user"
   }
 ]
}
"""

# Validate the serving payload works on the model
validate_serving_input(model_uri, serving_payload)
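
Before calling validate_serving_input, a quick local check can catch malformed JSON or missing fields. The sketch below covers only the chat-style payload shown above (a "messages" list whose items have "role" and "content"). The helper name check_chat_payload is hypothetical, and it does not replace validate_serving_input, which runs the payload against the model itself.

```python
import json


def check_chat_payload(payload: str) -> list:
    """Return a list of problems found in a chat-style serving payload (empty if none)."""
    try:
        data = json.loads(payload)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc.msg} at line {exc.lineno}"]
    if not isinstance(data, dict):
        return ["top-level JSON value must be an object"]
    messages = data.get("messages")
    if not isinstance(messages, list) or not messages:
        return ['"messages" must be a non-empty list']
    problems = []
    for i, message in enumerate(messages):
        if not isinstance(message, dict):
            problems.append(f"messages[{i}] must be an object")
            continue
        for key in ("role", "content"):
            if not isinstance(message.get(key), str):
                problems.append(f'messages[{i}] needs a string "{key}"')
    return problems


good = '{"messages": [{"content": "How many product categories are there?", "role": "user"}]}'
bad = '{"messages": [{"content": "hi"}]}'
print(check_chat_payload(good))
print(check_chat_payload(bad))
```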

You can also test any input example against the logged model by using the convert_input_example_to_serving_input API to generate a valid JSON serving input.

from mlflow.models import validate_serving_input
from mlflow.models import convert_input_example_to_serving_input

model_uri = 'runs:/<run_id>/<artifact_path>'

# Define INPUT_EXAMPLE with your own input example to the model
# A valid input example is a data instance suitable for pyfunc prediction

serving_payload = convert_input_example_to_serving_input(INPUT_EXAMPLE)

# Validate the serving payload works on the model
validate_serving_input(model_uri, serving_payload)

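Debugging after container build succeeds

Even if the container builds successfully, issues can occur when you run the model or while the endpoint itself is operating. The following subsections detail common issues and how to troubleshoot and debug them.

Missing dependency

You might get an error like `An error occurred while loading the model. No module named <module-name>.` This error might indicate that a dependency is missing from the container. Verify that you have properly declared all of the dependencies that should be included in the container build. Pay special attention to custom libraries, and ensure that the .whl files are included as artifacts.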
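
One way to catch a missing dependency before deployment is to check that every module the model imports can be found in an environment built from the model's own requirements. The following is a minimal sketch using only the Python standard library. The helper name missing_modules and the module names in the example are hypothetical, and import names can differ from pip package names (for example, the scikit-learn package is imported as sklearn).

```python
import importlib.util


def missing_modules(module_names):
    """Return the import names that cannot be found in the current environment."""
    missing = []
    for name in module_names:
        try:
            spec = importlib.util.find_spec(name)
        except (ImportError, ValueError):
            # Raised for dotted names whose parent package is missing, or for malformed names.
            spec = None
        if spec is None:
            missing.append(name)
    return missing


# Hypothetical list of modules that the model code imports.
required = ["json", "my_custom_lib_that_is_not_installed"]
print(missing_modules(required))
```

Run the check inside an environment created from the model's logged requirements rather than your development environment, where the dependency is probably already installed.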

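Service logs looping

If your container build fails, check the service logs to see whether they loop when the endpoint tries to load the model. If you see this behavior, try the following steps:

Open a notebook and attach it to an All-Purpose cluster that uses a Databricks Runtime version, not Databricks Runtime for Machine Learning.
Load the model using MLflow and try debugging from there.

You can also load the model on your local machine and debug from there. Load the model locally using the following: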

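import os
import mlflow

# Authenticate with the Databricks CLI profile named PROFILE
os.environ["MLFLOW_TRACKING_URI"] = "databricks://PROFILE"

# Replace "model_uri" with the URI of your model
ARTIFACT_URI = "model_uri"
# A dot in the URI is taken to indicate a Unity Catalog model name
if '.' in ARTIFACT_URI:
    mlflow.set_registry_uri('databricks-uc')
local_path = mlflow.artifacts.download_artifacts(ARTIFACT_URI)
print(local_path)

# Shell commands: replace local_path/artifact_path with the path printed above
conda env create -f local_path/artifact_path/conda.yaml
conda activate mlflow-env

# Python, run inside the activated mlflow-env environment
import mlflow

mlflow.pyfunc.load_model("local_path/artifact_path")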

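Model fails when requests are sent to the endpoint

You might receive an error like `Encountered an unexpected error while evaluating the model. Verify that the input is compatible with the model for inference.` when predict() is called on your model.

This error means there is a code issue in the predict() function. Databricks recommends that you load the model from MLflow in a notebook and call it. Doing so surfaces the issue in the predict() function, and you can see where inside the function the failure occurs.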
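
The notebook check described above can be sketched as a small wrapper that calls predict() and captures the full traceback on failure. This is an illustration under stated assumptions: debug_predict and BrokenModel are hypothetical names, and BrokenModel stands in for a model that you would normally load with mlflow.pyfunc.load_model.

```python
import traceback


def debug_predict(model, model_input):
    """Call model.predict; on failure, return the formatted traceback instead of raising."""
    try:
        return model.predict(model_input), None
    except Exception:
        return None, traceback.format_exc()


class BrokenModel:
    """Hypothetical stand-in for a model loaded with mlflow.pyfunc.load_model."""

    def predict(self, model_input):
        # Bug for demonstration: the input uses the key "content", not "text".
        return model_input["messages"][0]["text"]


result, error = debug_predict(
    BrokenModel(), {"messages": [{"content": "hi", "role": "user"}]}
)
print(error)
```

The printed traceback points at the failing line inside predict(), which the generic serving error message does not show.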

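Workspace exceeds provisioned concurrency

You might receive a `Workspace exceeded provisioned concurrency quota` error.

You can increase concurrency depending on region availability. Contact your Databricks account team and provide your workspace ID to request a concurrency increase.

Debugging after container build failure

This section details issues that can occur when your build fails.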

`OSError: [Errno 28] No space left on device`

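The `No space left` error can occur when too many large artifacts are logged alongside the model unnecessarily. Check in MLflow that extraneous artifacts are not logged with the model, then try to redeploy the slimmed-down package.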
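
To find which artifacts take up the most space, you can list the largest files in a local copy of the model directory, for example one downloaded with mlflow.artifacts.download_artifacts. The following is a minimal sketch using only the standard library. The helper name largest_files is hypothetical, and the demo builds a throwaway directory with made-up file names rather than reading a real model.

```python
import os
import tempfile


def largest_files(root, top_n=10):
    """Return up to top_n (size_in_bytes, relative_path) pairs under root, largest first."""
    sizes = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for filename in filenames:
            path = os.path.join(dirpath, filename)
            sizes.append((os.path.getsize(path), os.path.relpath(path, root)))
    return sorted(sizes, reverse=True)[:top_n]


# Demo on a temporary directory; replace demo_dir with your downloaded model path.
with tempfile.TemporaryDirectory() as demo_dir:
    with open(os.path.join(demo_dir, "model.pkl"), "wb") as f:
        f.write(b"0" * 2048)
    with open(os.path.join(demo_dir, "MLmodel"), "wb") as f:
        f.write(b"0" * 16)
    report = largest_files(demo_dir)

for size, rel_path in report:
    print(f"{size:>10} bytes  {rel_path}")
```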

Azure Firewall issues with serving models from Unity Catalog

You might see the error: `Build could not start due to an internal error. If you are serving a model from UC and Azure Firewall is enabled, this is not supported by default.`

To help resolve the issue, contact your Databricks account team.

Build failure due to lack of GPU availability

You might see the error: `Build could not start due to an internal error - please contact your Databricks representative.`