Manage inputs and outputs for components and pipelines

09/23/2024

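Azure Machine Learning pipelines support inputs and outputs at both the component and pipeline levels. This article describes pipeline and component inputs and outputs and how to manage them.

At the component level, inputs and outputs define the component's interface. You can use the output of one component as an input for another component in the same parent pipeline, which lets you pass data or models between components. This interconnection represents the data flow within the pipeline.

At the pipeline level, you can use inputs and outputs to submit pipeline jobs with varying data inputs or parameters, such as learning_rate. Inputs and outputs are especially useful when you invoke a pipeline through a REST endpoint. You can assign different values to the pipeline input or access the output of different pipeline jobs. For more information, see "Create jobs and input data for batch endpoints".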

Input and output types

The following types are supported as inputs and outputs of components or pipelines:

- Data types. For more information, see Data types.
  - uri_file
  - uri_folder
  - mltable
- Model types.
  - mlflow_model
  - custom_model

The following primitive types are also supported, for inputs only:

- string
- number
- integer
- boolean

Primitive type outputs aren't supported.

Input and output examples

The following examples are from the NYC Taxi Data Regression pipeline in the Azure Machine Learning examples GitHub repository.

- The train component has a number input named test_split_ratio.
- The prep component has a uri_folder type output. The component source code reads the CSV files from the input folder, processes the files, and writes the processed CSV files to the output folder.
- The train component has a mlflow_model type output. The component source code saves the trained model by using the mlflow.sklearn.save_model method.

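Output serialization

Using data or model outputs serializes the outputs and saves them as files in a storage location. Later steps can access the files during job execution by mounting this storage location or by downloading or uploading the files to the compute file system.

The component source code must serialize the output object, which is usually stored in memory, into files. For example, you could serialize a pandas dataframe into a CSV file. Azure Machine Learning doesn't define any standardized methods for object serialization. You have the flexibility to choose how to serialize objects into files. In the downstream component, you can choose how to deserialize and read these files.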
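The serialization round trip described above can be sketched as follows. This is a minimal illustration that uses only the Python standard library `csv` module as a stand-in for pandas. The function names, the `processed.csv` file name, and the sample rows are hypothetical and aren't taken from the example repository.

```python
import csv
import tempfile
from pathlib import Path


def write_rows(output_dir: str, rows: list[dict]) -> Path:
    """Upstream component: serialize in-memory rows to a CSV file in the output folder."""
    out_path = Path(output_dir) / "processed.csv"  # hypothetical file name
    out_path.parent.mkdir(parents=True, exist_ok=True)
    with out_path.open("w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=list(rows[0].keys()))
        writer.writeheader()
        writer.writerows(rows)
    return out_path


def read_rows(input_dir: str) -> list[dict]:
    """Downstream component: deserialize every CSV file found in the input folder."""
    rows: list[dict] = []
    for csv_file in sorted(Path(input_dir).glob("*.csv")):
        with csv_file.open(newline="") as f:
            rows.extend(csv.DictReader(f))
    return rows


if __name__ == "__main__":
    # A temporary folder stands in for the mounted uri_folder output/input.
    with tempfile.TemporaryDirectory() as folder:
        write_rows(folder, [{"trip_id": 1, "fare": 12.5}, {"trip_id": 2, "fare": 7.0}])
        print(read_rows(folder))
```

Note that CSV doesn't preserve types: the downstream reader gets every value back as a string, so the two components have to agree on how to parse them.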

Data type input and output paths

For data asset inputs and outputs, you must specify a path parameter that points to the data location. The following table shows the data locations that Azure Machine Learning pipelines support for inputs and outputs, with path parameter examples.

Location	Input	Output	Example
A path on your local computer	✓		`./home/<username>/data/my_data`
A path on a public HTTP(S) server	✓		`https://raw.githubusercontent.com/pandas-dev/pandas/main/doc/data/titanic.csv`
A path on Azure Storage	*		`wasbs://<container_name>@<account_name>.blob.core.windows.net/<path>` or `abfss://<file_system>@<account_name>.dfs.core.windows.net/<path>`
A path on an Azure Machine Learning datastore	✓	✓	`azureml://datastores/<data_store_name>/paths/<path>`
A path to a data asset	✓	✓	`azureml:my_data:<version>`

* Using Azure Storage directly isn't recommended for input, because it might need extra identity configuration to read the data. It's better to use Azure Machine Learning datastore paths, which are supported across various pipeline job types.
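As a reading aid for the table above, the following sketch maps a path string to its table row by inspecting the prefix. The function name `classify_data_path` and the returned dictionary shape are hypothetical. This isn't how Azure Machine Learning validates paths. It only restates the table in code.

```python
def classify_data_path(path: str) -> dict:
    """Map a path string to its row in the table above (illustrative only)."""
    if path.startswith("azureml://datastores/"):
        return {"location": "Azure Machine Learning datastore", "input": True, "output": True}
    if path.startswith("azureml://"):
        # Other azureml:// URI forms aren't listed in the table.
        raise ValueError(f"URI form not covered by the table: {path}")
    if path.startswith("azureml:"):
        return {"location": "data asset", "input": True, "output": True}
    if path.startswith(("http://", "https://")):
        return {"location": "public HTTP(S) server", "input": True, "output": False}
    if path.startswith(("wasbs://", "abfss://")):
        # Marked with * in the table: usable as input, but not recommended.
        return {"location": "Azure Storage", "input": True, "output": False}
    # Anything without a recognized scheme is treated as a local path.
    return {"location": "local computer", "input": True, "output": False}


if __name__ == "__main__":
    for example in (
        "./home/<username>/data/my_data",
        "azureml://datastores/<data_store_name>/paths/<path>",
        "azureml:my_data:<version>",
    ):
        print(example, "->", classify_data_path(example))
```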

Data type input and output modes

For data type inputs and outputs, you can choose from several download, upload, and mount modes to define how the compute target accesses data. The following table shows the supported modes for different types of inputs and outputs.

Type	`upload`	`download`	`ro_mount`	`rw_mount`	`direct`	`eval_download`	`eval_mount`
`uri_folder` input		✓	✓		✓
`uri_file` input		✓	✓		✓
`mltable` input		✓	✓		✓	✓	✓
`uri_folder` output	✓			✓
`uri_file` output	✓			✓
`mltable` output	✓			✓	✓

In most cases, the ro_mount or rw_mount mode is recommended. For more information, see Modes.
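The mode table can also be expressed as a lookup that catches an unsupported combination before a job is submitted. The following sketch transcribes the table into a dictionary. The names `SUPPORTED_MODES` and `check_mode` are hypothetical helpers and aren't part of the Azure Machine Learning SDK.

```python
# Transcription of the table above: (data type, direction) -> supported modes.
SUPPORTED_MODES = {
    ("uri_folder", "input"): {"download", "ro_mount", "direct"},
    ("uri_file", "input"): {"download", "ro_mount", "direct"},
    ("mltable", "input"): {"download", "ro_mount", "direct", "eval_download", "eval_mount"},
    ("uri_folder", "output"): {"upload", "rw_mount"},
    ("uri_file", "output"): {"upload", "rw_mount"},
    ("mltable", "output"): {"upload", "rw_mount", "direct"},
}


def check_mode(data_type: str, direction: str, mode: str) -> bool:
    """Return True if the table lists `mode` for this data type and direction."""
    try:
        return mode in SUPPORTED_MODES[(data_type, direction)]
    except KeyError:
        raise ValueError(f"Unknown combination: {data_type} {direction}") from None


if __name__ == "__main__":
    print(check_mode("uri_folder", "input", "ro_mount"))   # supported per the table
    print(check_mode("uri_folder", "output", "ro_mount"))  # not listed for outputs
```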

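Inputs and outputs in pipeline graphs

On the pipeline job page in Azure Machine Learning studio, component inputs and outputs appear as small circles called input/output ports. These ports represent the data flow in the pipeline. Pipeline-level outputs are displayed in purple boxes for easy identification.

The following screenshot of the NYC Taxi Data Regression pipeline graph shows multiple component and pipeline inputs and outputs.

When you hover over an input/output port, its type is displayed.

Screenshot that highlights the port type when hovering over a port.

Primitive type inputs don't appear on the pipeline graph. You can find these inputs on the Settings tab of the pipeline Job overview panel (for pipeline-level inputs) or of the component panel (for component-level inputs). To open the component panel, double-click the component in the graph.

When you edit a pipeline in the studio Designer, pipeline inputs and outputs appear in the Pipeline interface panel, and component inputs and outputs appear in the component panel.

Screenshot that highlights the pipeline interface in Designer.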

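Promote component inputs/outputs to the pipeline level

Promoting a component's input/output to the pipeline level lets you overwrite the component's input/output when you submit a pipeline job. This ability is especially useful for triggering pipelines by using REST endpoints.

The following examples show how to promote component-level inputs/outputs to pipeline-level inputs/outputs.

The following pipeline promotes three inputs and three outputs to the pipeline level. For example, pipeline_job_training_max_epocs is a pipeline-level input because it's declared under the inputs section at the root level.

Under train_job in the jobs section, the input named max_epocs is referenced as ${{parent.inputs.pipeline_job_training_max_epocs}}, which means that the max_epocs input of train_job refers to the pipeline-level input pipeline_job_training_max_epocs. Pipeline outputs are promoted by using the same schema.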

$schema: https://azuremlschemas.azureedge.net/latest/pipelineJob.schema.json
type: pipeline
display_name: 1b_e2e_registered_components
description: E2E dummy train-score-eval pipeline with registered components

inputs:
  pipeline_job_training_max_epocs: 20
  pipeline_job_training_learning_rate: 1.8
  pipeline_job_learning_rate_schedule: 'time-based'

outputs: 
  pipeline_job_trained_model:
    mode: upload
  pipeline_job_scored_data:
    mode: upload
  pipeline_job_evaluation_report:
    mode: upload

settings:
  default_compute: azureml:cpu-cluster

jobs:
  train_job:
    type: command
    component: azureml:my_train@latest
    inputs:
      training_data: 
        type: uri_folder 
        path: ./data      
      max_epocs: ${{parent.inputs.pipeline_job_training_max_epocs}}
      learning_rate: ${{parent.inputs.pipeline_job_training_learning_rate}}
      learning_rate_schedule: ${{parent.inputs.pipeline_job_learning_rate_schedule}}
    outputs:
      model_output: ${{parent.outputs.pipeline_job_trained_model}}
    services:
      my_vscode:
        type: vs_code
      my_jupyter_lab:
        type: jupyter_lab
      my_tensorboard:
        type: tensor_board
        log_dir: "outputs/tblogs"
    #  my_ssh:
    #    type: ssh
    #    ssh_public_keys: <paste the entire pub key content>
    #    nodes: all # Use the `nodes` property to pick which node you want to enable interactive services on. If `nodes` are not selected, by default, interactive applications are only enabled on the head node.

  score_job:
    type: command
    component: azureml:my_score@latest
    inputs:
      model_input: ${{parent.jobs.train_job.outputs.model_output}}
      test_data: 
        type: uri_folder 
        path: ./data
    outputs:
      score_output: ${{parent.outputs.pipeline_job_scored_data}}

  evaluate_job:
    type: command
    component: azureml:my_eval@latest
    inputs:
      scoring_result: ${{parent.jobs.score_job.outputs.score_output}}
    outputs:
      eval_output: ${{parent.outputs.pipeline_job_evaluation_report}}

For the complete example, see the train-score-eval pipeline with registered components in the Azure Machine Learning examples repository.
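One way to read the `${{parent.inputs.<name>}}` bindings in the YAML above is as lookups into the pipeline-level inputs, which is why overriding a pipeline-level value at submission time changes what the node receives. The following sketch mimics that lookup with a regular expression. Azure Machine Learning resolves these expressions itself, so the `resolve_node_inputs` helper is purely hypothetical and illustrative.

```python
import re

# Matches a whole-value binding such as ${{parent.inputs.pipeline_job_training_max_epocs}}
BINDING = re.compile(r"^\$\{\{parent\.inputs\.(\w+)\}\}$")


def resolve_node_inputs(node_inputs: dict, pipeline_inputs: dict) -> dict:
    """Replace each parent.inputs binding with the matching pipeline-level value."""
    resolved = {}
    for name, value in node_inputs.items():
        match = BINDING.match(value) if isinstance(value, str) else None
        resolved[name] = pipeline_inputs[match.group(1)] if match else value
    return resolved


if __name__ == "__main__":
    # Values copied from the inputs section of the pipeline YAML above.
    pipeline_inputs = {
        "pipeline_job_training_max_epocs": 20,
        "pipeline_job_training_learning_rate": 1.8,
        "pipeline_job_learning_rate_schedule": "time-based",
    }
    train_job_inputs = {
        "max_epocs": "${{parent.inputs.pipeline_job_training_max_epocs}}",
        "learning_rate": "${{parent.inputs.pipeline_job_training_learning_rate}}",
        "learning_rate_schedule": "${{parent.inputs.pipeline_job_learning_rate_schedule}}",
    }
    print(resolve_node_inputs(train_job_inputs, pipeline_inputs))
    # Overriding a pipeline-level input changes the value the node sees.
    print(resolve_node_inputs(train_job_inputs, {**pipeline_inputs, "pipeline_job_training_max_epocs": 30}))
```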

The following code example defines the nyc_taxi_data_regression pipeline. The pipeline takes one input, pipeline_job_input, and produces the six outputs defined in the return statement. Pipeline outputs are promoted from child components by using the schema <step_name>.outputs.<output_name>, for example prepare_sample_data.outputs.prep_data.

The end-to-end notebook is available at "NYC taxi data regression" in the Azure Machine Learning examples repository.

# import required libraries
from azure.identity import DefaultAzureCredential

from azure.ai.ml import MLClient, Input
from azure.ai.ml.dsl import pipeline
from azure.ai.ml import load_component

# set subscription, resource group, and workspace name:
subscription_id = "<SUBSCRIPTION_ID>"
resource_group = "<RESOURCE_GROUP>"
workspace = "<AML_WORKSPACE_NAME>"

# connect to the AzureML workspace
ml_client = MLClient(
    DefaultAzureCredential(), subscription_id, resource_group, workspace
)

# define the directory that stores the input data 
parent_dir = ""

# load components
prepare_data = load_component(source=parent_dir + "./prep.yml")
transform_data = load_component(source=parent_dir + "./transform.yml")
train_model = load_component(source=parent_dir + "./train.yml")
predict_result = load_component(source=parent_dir + "./predict.yml")
score_data = load_component(source=parent_dir + "./score.yml")

# construct pipeline
@pipeline()
def nyc_taxi_data_regression(pipeline_job_input):
    """NYC taxi data regression example."""
    prepare_sample_data = prepare_data(raw_data=pipeline_job_input)
    transform_sample_data = transform_data(
        clean_data=prepare_sample_data.outputs.prep_data
    )
    train_with_sample_data = train_model(
        training_data=transform_sample_data.outputs.transformed_data
    )
    predict_with_sample_data = predict_result(
        model_input=train_with_sample_data.outputs.model_output,
        test_data=train_with_sample_data.outputs.test_data,
    )
    score_with_sample_data = score_data(
        predictions=predict_with_sample_data.outputs.predictions,
        model=train_with_sample_data.outputs.model_output,
    )
    return {
        "pipeline_job_prepped_data": prepare_sample_data.outputs.prep_data,
        "pipeline_job_transformed_data": transform_sample_data.outputs.transformed_data,
        "pipeline_job_trained_model": train_with_sample_data.outputs.model_output,
        "pipeline_job_test_data": train_with_sample_data.outputs.test_data,
        "pipeline_job_predictions": predict_with_sample_data.outputs.predictions,
        "pipeline_job_score_report": score_with_sample_data.outputs.score_report,
    }


# define pipeline job
pipeline_job = nyc_taxi_data_regression(
    Input(type="uri_folder", path=parent_dir + "./data/")
)
# demo how to change pipeline output settings
pipeline_job.outputs.pipeline_job_prepped_data.mode = "rw_mount"

# set pipeline level compute
pipeline_job.settings.default_compute = "cpu-cluster"
# set pipeline level datastore
pipeline_job.settings.default_datastore = "workspaceblobstore"

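Define optional inputs

By default, all inputs are required and must either have a default value or be assigned a value each time you submit a pipeline job. However, you can define an optional input.

Note

Optional outputs aren't supported.

Optional inputs can be useful in two scenarios:

- If you define an optional data/model type input and don't assign a value to it when you submit the pipeline job, the pipeline component lacks that data dependency. If the component's input port isn't linked to any component or data/model node, the pipeline invokes the component directly instead of waiting for a preceding dependency.
- If you set continue_on_step_failure = True for the pipeline but node2 uses a required input from node1, node2 doesn't execute if node1 fails. If the input from node1 is optional, node2 executes even if node1 fails.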
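The second scenario can be summarized as a small decision function. This is only a restatement of the rule above, not the scheduler's actual logic. The function name `node2_runs` is hypothetical, and the branch for continue_on_step_failure = False is an assumption: it supposes that a failed node1 prevents node2 from starting.

```python
def node2_runs(node1_succeeded: bool, input_is_optional: bool, continue_on_step_failure: bool) -> bool:
    """Decide whether node2 executes, following the rule described above."""
    if node1_succeeded:
        return True
    if not continue_on_step_failure:
        # Assumption: without continue_on_step_failure, a failed node1 blocks node2.
        return False
    # node1 failed but the pipeline continues: only an optional input lets node2 run.
    return input_is_optional


if __name__ == "__main__":
    for optional in (False, True):
        ran = node2_runs(node1_succeeded=False, input_is_optional=optional, continue_on_step_failure=True)
        print(f"node1 failed, input optional={optional}: node2 runs={ran}")
```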


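The following code example shows how to define optional inputs. When an input is set as optional: true, you must wrap its command-line argument in $[[ ]], as in the max_epocs, learning_rate, and learning_rate_schedule lines of the command in the example.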

$schema: https://azuremlschemas.azureedge.net/latest/commandComponent.schema.json
name: train_data_component_cli
display_name: train_data
description: An example train component
tags:
  author: azureml-sdk-team
type: command
inputs:
  training_data: 
    type: uri_folder
  max_epocs:
    type: integer
    optional: true
  learning_rate: 
    type: number
    default: 0.01
    optional: true
  learning_rate_schedule: 
    type: string
    default: time-based
    optional: true
outputs:
  model_output:
    type: uri_folder
code: ./train_src
environment: azureml://registries/azureml/environments/sklearn-1.5/labels/latest
command: >-
  python train.py 
  --training_data ${{inputs.training_data}} 
  $[[--max_epocs ${{inputs.max_epocs}}]]
  $[[--learning_rate ${{inputs.learning_rate}}]]
  $[[--learning_rate_schedule ${{inputs.learning_rate_schedule}}]]
  --model_output ${{outputs.model_output}}
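Because an argument wrapped in $[[ ]] might be absent from the command line when its optional input has no value, the script behind the component should treat those arguments as optional too. The following is a minimal sketch of how a train.py entry point could parse the command above with argparse. It's an assumption for illustration and not the actual train.py from the examples repository. The fallback defaults mirror the ones declared in the component YAML.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Argument parser matching the component command above (hypothetical train.py)."""
    parser = argparse.ArgumentParser(description="Hypothetical train.py entry point")
    # Required input and output: always present on the command line.
    parser.add_argument("--training_data", required=True)
    parser.add_argument("--model_output", required=True)
    # Optional inputs: these flags can be missing, so they must not be required.
    parser.add_argument("--max_epocs", type=int, default=None)
    parser.add_argument("--learning_rate", type=float, default=0.01)
    parser.add_argument("--learning_rate_schedule", default="time-based")
    return parser


if __name__ == "__main__":
    # Simulate a run where max_epocs was not assigned a value.
    args = build_parser().parse_args(["--training_data", "./data", "--model_output", "./model"])
    print(args.max_epocs, args.learning_rate, args.learning_rate_schedule)
```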

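Customize output paths

By default, component outputs are stored in azureml://datastores/${{default_datastore}}/paths/${{name}}/${{output_name}}, where {default_datastore} is the default datastore that you set for the pipeline. If no default datastore is set, the default is the workspace blob storage.

The job {name} is resolved at job execution time, and {output_name} is the name that you defined in the component YAML. You can customize where to store the output by defining a path for the output.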
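To make the default path template concrete, the following sketch fills in its three placeholders. The helper `default_output_path` is hypothetical, and the sample job name is invented. The datastore name `workspaceblobstore` is used here as the assumed name of the workspace blob datastore.

```python
def default_output_path(job_name: str, output_name: str, default_datastore: str = "workspaceblobstore") -> str:
    """Fill in the default output path template described above (illustrative only)."""
    return f"azureml://datastores/{default_datastore}/paths/{job_name}/{output_name}"


if __name__ == "__main__":
    # "my_job_name" is a made-up job name; real job names are resolved at execution time.
    print(default_output_path("my_job_name", "model_output"))
```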

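The pipeline.yml file for the train-score-eval pipeline with registered components example defines a pipeline that has three pipeline-level outputs. You can use the following command to set a custom output path for the pipeline_job_trained_model output.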

# define the custom output path using datastore uri
# add relative path to your blob container after "azureml://datastores/<datastore_name>/paths"
output_path="azureml://datastores/{datastore_name}/paths/{relative_path_of_container}"  

# create job and define path using --set outputs.<outputname>.path
az ml job create -f ./pipeline.yml --set outputs.pipeline_job_trained_model.path="$output_path"

The following code, which demonstrates how to customize output paths, is from the "Build pipeline with command_component decorated python function" notebook.

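from azure.ai.ml import Input, Output
from azure.ai.ml.dsl import pipeline

# train_model, score_data, and eval_model are the component functions defined earlier in the notebook
cluster_name = "cpu-cluster"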
custom_path = "azureml://datastores/workspaceblobstore/paths/custom_path/${{name}}/"

# define a pipeline with component
@pipeline(default_compute=cluster_name)
def pipeline_with_python_function_components(input_data, test_data, learning_rate):
    """E2E dummy train-score-eval pipeline with components defined via python function components"""

    # Call component obj as function: apply given inputs & parameters to create a node in pipeline
    train_with_sample_data = train_model(
        training_data=input_data, max_epochs=5, learning_rate=learning_rate
    )
    score_with_sample_data = score_data(
        model_input=train_with_sample_data.outputs.model_output,
        test_data=test_data,
        model_file=train_with_sample_data.outputs.output,
    )
    # example how to change path of output on step level,
    # please note if the output is promoted to pipeline level you need to change path in pipeline job level
    score_with_sample_data.outputs.score_output = Output(
        type="uri_folder", mode="rw_mount", path=custom_path
    )
    eval_with_sample_data = eval_model(
        scoring_result=score_with_sample_data.outputs.score_output,
        scoring_file=score_with_sample_data.outputs.output,
    )

    # Return: pipeline outputs
    return {
        "eval_output": eval_with_sample_data.outputs.eval_output,
        "model_output": train_with_sample_data.outputs.model_output,
    }


pipeline_job = pipeline_with_python_function_components(
    input_data=Input(
        path="wasbs://demo@dprepdata.blob.core.windows.net/Titanic.csv", type="uri_file"
    ),
    test_data=Input(
        path="wasbs://demo@dprepdata.blob.core.windows.net/Titanic.csv", type="uri_file"
    ),
    learning_rate=0.1,
)
# example how to change path of output on pipeline level
pipeline_job.outputs.model_output = Output(
    type="uri_folder", mode="rw_mount", path=custom_path
)

Download outputs

You can download outputs at the pipeline or component level.

Download pipeline-level outputs

You can download all the outputs of a job or download a specific output.

# Download all the outputs of the job
az ml job download --all -n <JOB_NAME> -g <RESOURCE_GROUP_NAME> -w <WORKSPACE_NAME> --subscription <SUBSCRIPTION_ID>

# Download a specific output
az ml job download --output-name <OUTPUT_PORT_NAME> -n <JOB_NAME> -g <RESOURCE_GROUP_NAME> -w <WORKSPACE_NAME> --subscription <SUBSCRIPTION_ID>

First, create and initialize ml_client as a handle to the workspace. For more information, see "Create handle to workspace".

from azure.ai.ml import MLClient
from azure.identity import DefaultAzureCredential

# Set your subscription, resource group and workspace name:
subscription_id = "<SUBSCRIPTION_ID>"
resource_group = "<RESOURCE_GROUP>"
workspace = "<AML_WORKSPACE_NAME>"

# connect to the AzureML workspace
ml_client = MLClient(
    DefaultAzureCredential(), subscription_id, resource_group, workspace
)

Download all the outputs of the job or download a specific output.

# `job` is the submitted pipeline job and `tmp_path` is a local download directory
# Download all the outputs of the job
output = ml_client.jobs.download(name=job.name, download_path=tmp_path, all=True)

# Download specific output
output = ml_client.jobs.download(name=job.name, download_path=tmp_path, output_name=output_port_name)

Download component outputs

To download the outputs of a child component, first list all child jobs of the pipeline job, and then use similar code to download the outputs.

# List all child jobs in the job and print job details in table format
az ml job list --parent-job-name <JOB_NAME> -g <RESOURCE_GROUP_NAME> -w <WORKSPACE_NAME> --subscription <SUBSCRIPTION_ID> -o table

# Select the desired child job name to download output
az ml job download --all -n <JOB_NAME> -g <RESOURCE_GROUP_NAME> -w <WORKSPACE_NAME> --subscription <SUBSCRIPTION_ID>

# Set your subscription, resource group and workspace name:
subscription_id = "<SUBSCRIPTION_ID>"
resource_group = "<RESOURCE_GROUP>"
workspace = "<AML_WORKSPACE_NAME>"

# connect to the AzureML workspace
ml_client = MLClient(
    DefaultAzureCredential(), subscription_id, resource_group, workspace
)

# List all child jobs in the job
child_jobs = ml_client.jobs.list(parent_job_name=job.name)

# Traverse and download all the outputs of child job
for child_job in child_jobs:
    ml_client.jobs.download(name=child_job.name, all=True)

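Register output as a named asset

You can register the output of a component or pipeline as a named asset by assigning a name and version to the output. You can list the registered assets in your workspace through the studio UI, CLI, or SDK, and reference them in future jobs in the workspace.

Register pipeline-level outputs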

display_name: register_pipeline_output
type: pipeline
jobs:
  node:
    type: command
    inputs:
      component_in_path:
        type: uri_file
        path: https://dprepdata.blob.core.windows.net/demo/Titanic.csv
    component: ../components/helloworld_component.yml
    outputs:
      component_out_path: ${{parent.outputs.component_out_path}}
outputs:
  component_out_path:
    type: mltable
    name: pipeline_output  # Define name and version to register pipeline output
    version: '1'
settings:
  default_compute: azureml:cpu-cluster

from azure.ai.ml import Input, Output, load_component
from azure.ai.ml.dsl import pipeline

# Load component functions
components_dir = "./components/"
helloworld_component = load_component(source=f"{components_dir}/helloworld_component.yml")

@pipeline()
def register_pipeline_output():
  # Call component obj as function: apply given inputs & parameters to create a node in pipeline
  node = helloworld_component(component_in_path=Input(
    type='uri_file', path='https://dprepdata.blob.core.windows.net/demo/Titanic.csv'))

  return {
      'component_out_path': node.outputs.component_out_path
  }

# Use a distinct variable name so the `pipeline` decorator isn't shadowed
pipeline_job = register_pipeline_output()
pipeline_job.settings.default_compute = "azureml:cpu-cluster"
# Define name and version to register pipeline output
pipeline_job.outputs.component_out_path.name = 'pipeline_output'
pipeline_job.outputs.component_out_path.version = '1'

Register component outputs

display_name: register_node_output
type: pipeline
jobs:
  node:
    type: command
    component: ../components/helloworld_component.yml
    inputs:
      component_in_path:
        type: uri_file
        path: 'https://dprepdata.blob.core.windows.net/demo/Titanic.csv'
    outputs:
      component_out_path:
        type: uri_folder
        name: 'node_output'  # Define name and version to register a child job's output
        version: '1'
settings:
  default_compute: azureml:cpu-cluster

from azure.ai.ml import Input, Output, load_component
from azure.ai.ml.dsl import pipeline

# Load component functions
components_dir = "./components/"
helloworld_component = load_component(source=f"{components_dir}/helloworld_component.yml")

@pipeline()
def register_node_output():
  # Call component obj as function: apply given inputs & parameters to create a node in pipeline
  node = helloworld_component(component_in_path=Input(
    type='uri_file', path='https://dprepdata.blob.core.windows.net/demo/Titanic.csv'))

  # Define name and version to register node output
  node.outputs.component_out_path.name = 'node_output'
  node.outputs.component_out_path.version = '1'

# Use a distinct variable name so the `pipeline` decorator isn't shadowed
pipeline_job = register_node_output()
pipeline_job.settings.default_compute = "azureml:cpu-cluster"
