PyTorch

Article
12/15/2024

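The PyTorch project is a Python package that provides GPU-accelerated tensor computation and high-level functionality for building deep learning networks. For licensing details, see the PyTorch license doc on GitHub.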

To monitor and debug your PyTorch models, consider using TensorBoard.

PyTorch is included in Databricks Runtime for Machine Learning. If you use the standard Databricks Runtime instead, see the Install PyTorch section for installation instructions.

Note

This is not a comprehensive guide to PyTorch. For more information, see the PyTorch website.

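Single node and distributed training

To test and migrate single-machine workflows, use a Single Node cluster.

For distributed training options for deep learning, see Distributed training.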

Example notebook

PyTorch notebook


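Install PyTorch

Databricks Runtime for ML

Databricks Runtime for Machine Learning includes PyTorch, so you can create a cluster and start using PyTorch right away. To find the PyTorch version installed in the Databricks Runtime ML version you are using, see the release notes.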
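The release notes are the authoritative source, but the installed version can also be checked from a notebook cell. The following is a minimal sketch, assuming Python 3.8 or later (for `importlib.metadata`); the helper name `installed_version` is hypothetical and not part of any Databricks or PyTorch API.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version of `package`, or None if it is not installed."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


# Report the versions of the two packages this article discusses.
for name in ("torch", "torchvision"):
    print(name, installed_version(name) or "not installed")
```

If PyTorch is already imported, reading `torch.__version__` gives the same information.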

Databricks Runtime

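Databricks recommends that you use the PyTorch included in Databricks Runtime for Machine Learning. However, if you must use the standard Databricks Runtime, you can install PyTorch as a Databricks PyPI library. The following example shows how to install PyTorch 1.5.0: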

On GPU clusters, install pytorch and torchvision by specifying the following packages:
- torch==1.5.0
- torchvision==0.6.0

On CPU clusters, install pytorch and torchvision by using the following Python wheel files:

https://download.pytorch.org/whl/cpu/torch-1.5.0%2Bcpu-cp37-cp37m-linux_x86_64.whl

https://download.pytorch.org/whl/cpu/torchvision-0.6.0%2Bcpu-cp37-cp37m-linux_x86_64.whl
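The two wheel URLs above share a pattern: package name, version, a `+cpu` local version label (percent-encoded as `%2Bcpu`), and then the Python, ABI, and platform tags. The sketch below rebuilds them from those parts. The helper `cpu_wheel_url` and its defaults are hypothetical, and the pattern is inferred only from the two URLs shown, so a URL generated for a different version or Python tag is not guaranteed to exist.

```python
from urllib.parse import quote

BASE_URL = "https://download.pytorch.org/whl/cpu"


def cpu_wheel_url(package, version, python_tag="cp37", abi_tag="cp37m",
                  platform_tag="linux_x86_64"):
    """Build a CPU wheel URL that follows the pattern of the two URLs above."""
    filename = f"{package}-{version}+cpu-{python_tag}-{abi_tag}-{platform_tag}.whl"
    # quote() percent-encodes the "+" in "+cpu" as "%2B"; letters, digits,
    # "-", "." and "_" are left unchanged.
    return f"{BASE_URL}/{quote(filename, safe='')}"


print(cpu_wheel_url("torch", "1.5.0"))
print(cpu_wheel_url("torchvision", "0.6.0"))
```

The `cp37` and `cp37m` defaults mean these wheels target CPython 3.7, so they only install on a cluster whose Python version matches.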

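Errors and troubleshooting for distributed PyTorch

The following sections describe common error messages and troubleshooting guidance for the PyTorch DataParallel and PyTorch DistributedDataParallel classes. Most of these errors can likely be resolved with TorchDistributor, which is available on Databricks Runtime ML 13.0 and above. If TorchDistributor is not a viable solution, each section also provides a recommended workaround.

The following is an example of how to use TorchDistributor: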


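from pyspark.ml.torch.distributor import TorchDistributor

def train_fn(learning_rate):
    # Training code goes here.
    ...

# Number of training processes to launch.
num_processes = 2
# local_mode=True launches the training processes on the driver node.
distributor = TorchDistributor(num_processes=num_processes, local_mode=True)

# Arguments after the function are passed through to train_fn.
distributor.run(train_fn, 1e-3)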

Process 0 terminated with exit code 1

The following error can occur when you use notebooks, whether in Databricks or locally:

process 0 terminated with exit code 1

To avoid this error, use torch.multiprocessing.start_processes with start_method="fork" instead of torch.multiprocessing.spawn.

For example:

import torch

def train_fn(rank, learning_rate):
    # start_processes passes the process index (rank) as the first argument.
    # required setup, e.g. setup(rank)
    ...

num_processes = 2
torch.multiprocessing.start_processes(train_fn, args=(1e-3,), nprocs=num_processes, start_method="fork")

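The server socket has failed to bind to port

The following error appears when you restart distributed training after interrupting the cell during training: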

The server socket has failed to bind to [::]:{PORT NUMBER} (errno: 98 - Address already in use).

To fix the issue, restart the cluster. If restarting does not solve the problem, there might be an error in your training function code.
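Before restarting, it can help to confirm that the port is actually still held. The following is a rough diagnostic sketch that uses only the Python standard library; the helpers `port_is_free` and `find_free_port` are hypothetical, and which port your training job tries to bind depends on how the job is configured. The check uses an IPv4 socket, while the error message refers to the IPv6 wildcard address `[::]`, so treat the result as a hint rather than proof.

```python
import socket


def port_is_free(port, host=""):
    """Return True if a TCP socket can currently bind to `port` on `host`."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((host, port))
        except OSError:
            # Typically errno 98 (Address already in use) on Linux.
            return False
        return True


def find_free_port():
    """Ask the operating system for a TCP port that is unused right now."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.bind(("", 0))  # Port 0 lets the OS choose a free port.
        return sock.getsockname()[1]
```

Both helpers are inherently racy: another process can take the port between the check and the moment training starts, so restarting the cluster remains the reliable fix.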

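CUDA-related errors

You might run into additional issues with CUDA, because start_method="fork" is not CUDA-compatible. Using any .cuda command in any cell initializes CUDA in the notebook process and might lead to failures. To avoid these errors, add the following check before you call torch.multiprocessing.start_processes: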

if torch.cuda.is_initialized():
    raise Exception("CUDA was initialized; distributed training will fail.") # or something similar
