Hyperparameter tuning (preview)

04/09/2024

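Hyperparameter tuning is the process of finding optimal values for the parameters that a machine learning model doesn't learn during training, but that the user sets before the training process begins. These parameters are commonly called hyperparameters; examples include the learning rate, the number of hidden layers in a neural network, the regularization strength, and the batch size.

The performance of a machine learning model can be highly sensitive to the choice of hyperparameters, and the optimal set of hyperparameters can vary greatly depending on the specific problem and dataset. Hyperparameter tuning is therefore a critical step in the machine learning pipeline, because it can have a significant impact on the model's accuracy and generalization performance.

In Fabric, data scientists can use FLAML, a lightweight Python library for efficient automation of machine learning and AI operations, for their hyperparameter tuning requirements. Within Fabric notebooks, users can call flaml.tune for economical hyperparameter tuning.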

Important

This feature is in preview.

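Tuning workflow

To complete a basic tuning task with flaml.tune, follow three essential steps:

Specify the tuning objective with respect to the hyperparameters.
Specify a search space for the hyperparameters.
Specify tuning constraints, including constraints on the resource budget for the tuning, constraints on the configurations, and constraints on one or more particular metrics.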

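Tuning objective

The first step is to specify your tuning objective. To do this, first specify the evaluation procedure with respect to the hyperparameters in a user-defined function, evaluation_function. This function takes a hyperparameter configuration as input. It can return either a single scalar metric value or a dictionary of metric name and metric value pairs.

The following example defines an evaluation function with respect to two hyperparameters named x and y.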

import time

def evaluate_config(config: dict):
    """evaluate a hyperparameter configuration"""
    score = (config["x"] - 85000) ** 2 - config["x"] / config["y"]


    faked_evaluation_cost = config["x"] / 100000
    time.sleep(faked_evaluation_cost)
    # we can return a single float as a score on the input config:
    # return score
    # or, we can return a dictionary that maps metric name to metric value:
    return {"score": score, "evaluation_cost": faked_evaluation_cost, "constraint_metric": config["x"] * config["y"]}
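Before handing the function to the tuner, it can help to call it once on a hand-picked configuration and inspect the returned dictionary. The following is a minimal sketch; evaluate_config_fast is a hypothetical variant of the function above that omits the simulated sleep so that the check returns immediately.

```python
def evaluate_config_fast(config: dict) -> dict:
    """Hypothetical sleep-free variant of evaluate_config, for a quick sanity check."""
    score = (config["x"] - 85000) ** 2 - config["x"] / config["y"]
    return {
        "score": score,
        "evaluation_cost": config["x"] / 100000,
        "constraint_metric": config["x"] * config["y"],
    }


# A hand-picked configuration; the keys must match the names used in the search space.
sample_result = evaluate_config_fast({"x": 85000, "y": 1})
print(sample_result)
```

The keys of the returned dictionary are the metric names that the tuner can later be asked to optimize, such as "score".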

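Search space

Next, specify the search space of the hyperparameters. In the search space, you specify the valid values for each hyperparameter and how those values are sampled (for example, from a uniform or a log-uniform distribution). The following example provides the search space for the hyperparameters x and y. The valid values for both are integers from 1 (inclusive) to 100,000 (exclusive). x is sampled uniformly in log space over that range, and y is sampled uniformly.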

from flaml import tune

# construct a search space for the hyperparameters x and y.
config_search_space = {
    "x": tune.lograndint(lower=1, upper=100000),
    "y": tune.randint(lower=1, upper=100000)
}

# provide the search space to tune.run
tune.run(..., config=config_search_space, ...)
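To see why the choice between uniform and log-uniform sampling matters over a range as wide as 1 to 100,000, the following sketch draws integers both ways using only the Python standard library. The helper functions are hypothetical and only illustrate the idea; they are not FLAML's implementation of tune.randint or tune.lograndint.

```python
import math
import random
import statistics


def sample_uniform_int(lower: int, upper: int, rng: random.Random) -> int:
    """Draw an integer uniformly from [lower, upper)."""
    return rng.randrange(lower, upper)


def sample_log_uniform_int(lower: int, upper: int, rng: random.Random) -> int:
    """Draw an integer whose logarithm is uniform over [log(lower), log(upper))."""
    value = math.exp(rng.uniform(math.log(lower), math.log(upper)))
    # Clamp to guard against floating-point rounding at the edges of the range.
    return min(upper - 1, max(lower, int(value)))


rng = random.Random(0)
uniform_samples = [sample_uniform_int(1, 100000, rng) for _ in range(10000)]
log_samples = [sample_log_uniform_int(1, 100000, rng) for _ in range(10000)]

# Log-uniform sampling spends about half of its draws on small values,
# whereas uniform sampling spends about half of its draws above the midpoint.
print("median (uniform):    ", statistics.median(uniform_samples))
print("median (log-uniform):", statistics.median(log_samples))
```

Log-uniform sampling is typically a reasonable choice when a hyperparameter spans several orders of magnitude and small values deserve as much attention as large ones.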

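With FLAML, users can customize the domain for a particular hyperparameter. This lets users specify a type and a valid range to sample parameters from. FLAML supports the following hyperparameter types: float, integer, and categorical. The following example shows commonly used domains: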

config = {
    # Sample a float uniformly between -5.0 and -1.0
    "uniform": tune.uniform(-5, -1),

    # Sample a float uniformly between 3.2 and 5.4,
    # rounding to increments of 0.2
    "quniform": tune.quniform(3.2, 5.4, 0.2),

    # Sample a float uniformly between 0.0001 and 0.01, while
    # sampling in log space
    "loguniform": tune.loguniform(1e-4, 1e-2),

    # Sample a float uniformly between 0.0001 and 0.1, while
    # sampling in log space and rounding to increments of 0.00005
    "qloguniform": tune.qloguniform(1e-4, 1e-1, 5e-5),

    # Sample a random float from a normal distribution with
    # mean=10 and sd=2
    "randn": tune.randn(10, 2),

    # Sample a random float from a normal distribution with
    # mean=10 and sd=2, rounding to increments of 0.2
    "qrandn": tune.qrandn(10, 2, 0.2),

    # Sample an integer uniformly between -9 (inclusive) and 15 (exclusive)
    "randint": tune.randint(-9, 15),

    # Sample an integer uniformly between -21 (inclusive) and 12 (inclusive (!))
    # rounding to increments of 3 (includes 12)
    "qrandint": tune.qrandint(-21, 12, 3),

    # Sample an integer uniformly between 1 (inclusive) and 10 (exclusive),
    # while sampling in log space
    "lograndint": tune.lograndint(1, 10),

    # Sample an integer uniformly between 2 (inclusive) and 10 (inclusive (!)),
    # while sampling in log space and rounding to increments of 2
    "qlograndint": tune.qlograndint(2, 10, 2),

    # Sample an option uniformly from the specified choices
    "choice": tune.choice(["a", "b", "c"]),
}
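The quantized domains above (those whose names start with q) round each sampled value to a multiple of the given increment. The sketch below illustrates that rounding with a hypothetical quantize helper, assuming round-to-nearest behavior. It is not FLAML's code, but it suggests why the upper bound of a quantized range can be reached.

```python
def quantize(value: float, q: float) -> float:
    """Round value to the nearest multiple of q (illustrative helper)."""
    return round(value / q) * q


# With an increment of 3, nearby raw values snap onto the grid ..., -21, -18, ..., 9, 12.
low = quantize(-20, 3)   # snaps down to the lower bound of the qrandint example
high = quantize(11, 3)   # snaps up to the upper bound of the qrandint example
mid = quantize(4, 3)
print(low, high, mid)
```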

To learn more about how to customize domains within your search space, see the FLAML documentation on customizing search spaces.

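Tuning constraints

The last step is to specify the constraints of the tuning task. One notable property of flaml.tune is that it can complete the tuning process within a required resource constraint. To do this, specify the resource constraint either in terms of wall-clock time (in seconds) by using the time_budget_s argument, or in terms of the number of trials by using the num_samples argument.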

# Set a resource constraint of 60 seconds wall-clock time for the tuning.
flaml.tune.run(..., time_budget_s=60, ...)

# Set a resource constraint of 100 trials for the tuning.
flaml.tune.run(..., num_samples=100, ...)

# Use at most 60 seconds and at most 100 trials for the tuning.
flaml.tune.run(..., time_budget_s=60, num_samples=100, ...)
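When both budgets are set, tuning stops at whichever limit is reached first. The following standard-library sketch mimics that stopping rule with a hypothetical run_with_budget loop. It only illustrates how the two limits combine and does not reflect how FLAML schedules trials internally.

```python
import time
from typing import Callable


def run_with_budget(trial: Callable[[int], float], time_budget_s: float, num_samples: int) -> list:
    """Run trials until the time budget or the trial budget is exhausted.

    num_samples=-1 is treated as "no limit on the number of trials".
    """
    results = []
    start = time.monotonic()
    while time.monotonic() - start < time_budget_s:
        if num_samples != -1 and len(results) >= num_samples:
            break
        results.append(trial(len(results)))
    return results


# A generous time budget with a small trial budget: the trial count is the binding limit.
capped_by_trials = run_with_budget(lambda i: float(i), time_budget_s=60, num_samples=5)
# No trial limit but a zero time budget: the time budget is the binding limit.
capped_by_time = run_with_budget(lambda i: float(i), time_budget_s=0, num_samples=-1)
print(len(capped_by_trials), len(capped_by_time))
```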

To learn more about additional configuration constraints, see the FLAML documentation for advanced tuning options.

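Putting it together

After you define the tuning criteria, you can run the tuning trial. To track the results of the trial, you can use MLflow autologging to capture the metrics and parameters for each of these runs. The following code captures the entire hyperparameter tuning trial and highlights each of the hyperparameter combinations that FLAML explored.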

import mlflow
mlflow.set_experiment("flaml_tune_experiment")
mlflow.autolog(exclusive=False)

with mlflow.start_run(nested=True, run_name="Child Run: "):
    analysis = tune.run(
        evaluate_config,  # the function to evaluate a config
        config=config_search_space,  # the search space defined
        metric="score",
        mode="min",  # the optimization mode, "min" or "max"
        num_samples=-1,  # the maximal number of configs to try, -1 means infinite
        time_budget_s=10,  # the time budget in seconds
    )

Note

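When MLflow autologging is enabled, metrics, parameters, and models are logged automatically as MLflow runs. However, this varies by framework, and metrics and parameters for specific models might not be logged. For example, no metrics are logged for XGBoost, LightGBM, Spark, and SynapseML models. See the MLflow autologging documentation to learn which metrics and parameters are captured from each framework.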

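Parallel tuning with Apache Spark

The flaml.tune functionality supports tuning both Apache Spark and single-node learners. In addition, when tuning single-node learners (for example, Scikit-Learn learners), you can also parallelize the tuning to speed up the tuning process by setting use_spark = True. For Spark clusters, by default, FLAML launches one trial per executor. You can also customize the number of concurrent trials by using the n_concurrent_trials argument.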


analysis = tune.run(
    evaluate_config,  # the function to evaluate a config
    config=config_search_space,  # the search space defined
    metric="score",
    mode="min",  # the optimization mode, "min" or "max"
    num_samples=-1,  # the maximal number of configs to try, -1 means infinite
    time_budget_s=10,  # the time budget in seconds
    use_spark=True,
)
print(analysis.best_trial.last_result)  # the best trial's result
print(analysis.best_config)  # the best config
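The example above relies on the default of one trial per executor. To cap concurrency explicitly, the n_concurrent_trials argument can be passed alongside use_spark. The sketch below collects the keyword arguments in a dictionary; the value 2 is an arbitrary illustrative choice, and the commented-out call assumes that evaluate_config and config_search_space are defined as in the earlier sections.

```python
# Keyword arguments for a Spark-parallelized tune.run call.
spark_tune_kwargs = {
    "metric": "score",
    "mode": "min",
    "num_samples": -1,
    "time_budget_s": 10,
    "use_spark": True,
    "n_concurrent_trials": 2,  # assumed value; choose one that suits the cluster
}

# In a Fabric notebook with a Spark session available:
# analysis = tune.run(evaluate_config, config=config_search_space, **spark_tune_kwargs)
print(spark_tune_kwargs)
```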

To learn more about how to parallelize your tuning trials, see the FLAML documentation for parallel Spark jobs.

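Visualize results

The flaml.visualization module provides utility functions for plotting the optimization process by using Plotly. With Plotly, users can interactively explore their AutoML experiment results. To use these plotting functions, provide your optimized flaml.AutoML or flaml.tune.tune.ExperimentAnalysis object as an input.

You can use the following functions within your notebook: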

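plot_optimization_history: Plot the optimization history of all trials in the experiment.
plot_feature_importance: Plot the importance of each feature in the dataset.
plot_parallel_coordinate: Plot the high-dimensional parameter relationships in the experiment.
plot_contour: Plot the parameter relationship as a contour plot in the experiment.
plot_edf: Plot the objective value EDF (empirical distribution function) of the experiment.
plot_timeline: Plot the timeline of the experiment.
plot_slice: Plot the parameter relationship as a slice plot in a study.
plot_param_importance: Plot the hyperparameter importance of the experiment.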
