HyperParameterTuning - Fighting Breast Cancer
This tutorial shows how SynapseML can be used to identify the best combination of hyperparameters for your chosen classifiers, ultimately resulting in more accurate and reliable models. In order to demonstrate this, we'll show how to perform distributed randomized grid search hyperparameter tuning to build a model to identify breast cancer.
1 - Set up dependencies
Start by importing pandas and setting up our Spark session.
import pandas as pd
from pyspark.sql import SparkSession
# Bootstrap Spark Session
spark = SparkSession.builder.getOrCreate()
Next, read the data and split it into tuning and test sets.
data = spark.read.parquet(
"wasbs://publicwasb@mmlspark.blob.core.windows.net/BreastCancer.parquet"
).cache()
tune, test = data.randomSplit([0.80, 0.20])
tune.limit(10).toPandas()
Define the models to be used.
from synapse.ml.automl import TuneHyperparameters
from synapse.ml.train import TrainClassifier
from pyspark.ml.classification import (
LogisticRegression,
RandomForestClassifier,
GBTClassifier,
)
logReg = LogisticRegression()
randForest = RandomForestClassifier()
gbt = GBTClassifier()
smlmodels = [logReg, randForest, gbt]
mmlmodels = [TrainClassifier(model=model, labelCol="Label") for model in smlmodels]
2 - Find the best model using AutoML
Import SynapseML's AutoML classes from synapse.ml.automl
.
Specify the hyperparameters using the HyperparamBuilder
. Add either DiscreteHyperParam
or RangeHyperParam
hyperparameters. TuneHyperparameters
will randomly choose values from a uniform distribution:
from synapse.ml.automl import *
paramBuilder = (
HyperparamBuilder()
.addHyperparam(logReg, logReg.regParam, RangeHyperParam(0.1, 0.3))
.addHyperparam(randForest, randForest.numTrees, DiscreteHyperParam([5, 10]))
.addHyperparam(randForest, randForest.maxDepth, DiscreteHyperParam([3, 5]))
.addHyperparam(gbt, gbt.maxBins, RangeHyperParam(8, 16))
.addHyperparam(gbt, gbt.maxDepth, DiscreteHyperParam([3, 5]))
)
searchSpace = paramBuilder.build()
# The search space is a list of params to tuples of estimator and hyperparam
print(searchSpace)
randomSpace = RandomSpace(searchSpace)
Next, run TuneHyperparameters to get the best model.
bestModel = TuneHyperparameters(
evaluationMetric="accuracy",
models=mmlmodels,
numFolds=2,
numRuns=len(mmlmodels) * 2,
parallelism=1,
paramSpace=randomSpace.space(),
seed=0,
).fit(tune)
3 - Evaluate the model
We can view the best model's parameters and retrieve the underlying best model pipeline
print(bestModel.getBestModelInfo())
print(bestModel.getBestModel())
We can score against the test set and view metrics.
from synapse.ml.train import ComputeModelStatistics
prediction = bestModel.transform(test)
metrics = ComputeModelStatistics().transform(prediction)
metrics.limit(10).toPandas()