Multivariate anomaly detection with Isolation Forest

Article
10/15/2024

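This article shows how to use SynapseML on Apache Spark for multivariate anomaly detection. Multivariate anomaly detection finds anomalies across many variables or time series while taking the correlations and dependencies between those variables into account. In this scenario, we use SynapseML to train an Isolation Forest model for multivariate anomaly detection, and then use the trained model to infer multivariate anomalies in a dataset containing synthetic measurements from three IoT sensors.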

To learn more about the Isolation Forest model, see the original paper by Liu et al.
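To make the idea behind the model concrete, here is a minimal pure-Python sketch of the scoring rule from Liu et al.: points that random splits isolate in fewer steps get scores closer to 1. This is an illustration only, not SynapseML's implementation. The function names, the simplified stopping rule, and the synthetic readings are all assumptions made for this example.

```python
import math
import random

EULER_GAMMA = 0.5772156649015329


def average_path_length(n):
    """c(n): expected path length of an unsuccessful search in a binary tree of n points."""
    if n <= 1:
        return 0.0
    if n == 2:
        return 1.0
    harmonic = math.log(n - 1) + EULER_GAMMA  # approximation of H(n - 1)
    return 2.0 * harmonic - 2.0 * (n - 1) / n


def build_tree(points, depth, max_depth, rng):
    """Recursively split the points on a random feature at a random threshold."""
    if depth >= max_depth or len(points) <= 1:
        return {"size": len(points)}
    feature = rng.randrange(len(points[0]))
    low = min(p[feature] for p in points)
    high = max(p[feature] for p in points)
    if low == high:
        # Simplification: stop instead of trying another feature.
        return {"size": len(points)}
    threshold = rng.uniform(low, high)
    left = [p for p in points if p[feature] < threshold]
    right = [p for p in points if p[feature] >= threshold]
    return {
        "feature": feature,
        "threshold": threshold,
        "left": build_tree(left, depth + 1, max_depth, rng),
        "right": build_tree(right, depth + 1, max_depth, rng),
    }


def path_length(tree, point, depth=0):
    """Depth at which the point lands in a leaf, adjusted for unsplit leaf points."""
    if "size" in tree:
        return depth + average_path_length(tree["size"])
    branch = "left" if point[tree["feature"]] < tree["threshold"] else "right"
    return path_length(tree[branch], point, depth + 1)


def anomaly_scores(points, num_trees=100, sample_size=256, seed=1):
    """Score each point with s(x, n) = 2 ** (-E[h(x)] / c(n))."""
    if len(points) < 2:
        raise ValueError("need at least two points to build isolation trees")
    rng = random.Random(seed)
    sample_size = min(sample_size, len(points))
    max_depth = math.ceil(math.log2(sample_size))
    trees = [
        build_tree(rng.sample(points, sample_size), 0, max_depth, rng)
        for _ in range(num_trees)
    ]
    norm = average_path_length(sample_size)
    return [
        2.0 ** (-sum(path_length(tree, p) for tree in trees) / (num_trees * norm))
        for p in points
    ]


# Synthetic three-sensor readings (made up for this sketch) plus one obvious outlier.
data_rng = random.Random(0)
readings = [tuple(data_rng.gauss(0, 1) for _ in range(3)) for _ in range(200)]
readings.append((8.0, -8.0, 8.0))
scores = anomaly_scores(readings)
```

The last entry of `scores` belongs to the injected outlier and should be the largest, because very few random splits are needed to separate it from the rest of the readings.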

Prerequisites

Attach your notebook to a lakehouse. On the left side, select Add to add an existing lakehouse or create a new one.

Library imports

from IPython import get_ipython
from IPython.terminal.interactiveshell import TerminalInteractiveShell
import uuid
import mlflow

from pyspark.sql import functions as F
from pyspark.ml.feature import VectorAssembler
from pyspark.sql.types import *
from pyspark.ml import Pipeline

from synapse.ml.isolationforest import *

from synapse.ml.explainers import *

%matplotlib inline

from pyspark.sql import SparkSession

# Bootstrap Spark Session
spark = SparkSession.builder.getOrCreate()

from synapse.ml.core.platform import *

if running_on_synapse():
    shell = TerminalInteractiveShell.instance()
    shell.define_macro("foo", """a,b=10,20""")

Input data

# Table inputs
timestampColumn = "timestamp"  # str: the name of the timestamp column in the table
inputCols = [
    "sensor_1",
    "sensor_2",
    "sensor_3",
]  # list(str): the names of the input variables

# Training and inference windows, as ISO 8601 UTC strings (both bounds inclusive)
trainingStartTime = "2022-02-24T06:00:00Z"  # str: start of the training window
trainingEndTime = "2022-03-08T23:55:00Z"  # str: end of the training window
inferenceStartTime = "2022-03-09T09:30:00Z"  # str: start of the inference window
inferenceEndTime = "2022-03-20T23:55:00Z"  # str: end of the inference window

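# Isolation Forest parameters
contamination = 0.021  # expected fraction of outliers in the training data
num_estimators = 100  # number of isolation trees in the forest
max_samples = 256  # number of rows sampled to train each tree
max_features = 1.0  # fraction of features used per tree (1.0 = all features)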
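The `contamination` value is what turns continuous outlier scores into 0/1 labels. The sketch below illustrates that step in plain Python, assuming the cutoff is chosen so that roughly the top `contamination` fraction of training scores is flagged. The helper names and the evenly spaced scores are hypothetical. The actual estimator computes its threshold on Spark and may use an approximate quantile, so treat this as a conceptual model rather than its exact behavior.

```python
def contamination_threshold(scores, contamination):
    """Return the score cutoff that flags roughly `contamination` of the scores."""
    if not 0.0 < contamination < 1.0:
        raise ValueError("contamination must be strictly between 0 and 1")
    if not scores:
        raise ValueError("scores must not be empty")
    # Number of points we expect to flag, never fewer than one.
    num_outliers = max(1, round(contamination * len(scores)))
    return sorted(scores)[-num_outliers]


def predict_labels(scores, threshold):
    """Label a point 1 (outlier) when its score reaches the threshold, else 0."""
    return [1 if score >= threshold else 0 for score in scores]


# Hypothetical training scores, evenly spaced so the result is easy to check.
training_scores = [i / 1000 for i in range(1000)]
threshold = contamination_threshold(training_scores, contamination=0.021)
labels = predict_labels(training_scores, threshold)
```

With 1,000 distinct scores and `contamination = 0.021`, this flags the 21 highest-scoring points.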

Read data

df = (
    spark.read.format("csv")
    .option("header", "true")
    .load(
        "wasbs://publicwasb@mmlspark.blob.core.windows.net/generated_sample_mvad_data.csv"
    )
)

Cast columns to appropriate data types

df = (
    df.orderBy(timestampColumn)
    .withColumn("timestamp", F.date_format(timestampColumn, "yyyy-MM-dd'T'HH:mm:ss'Z'"))
    .withColumn("sensor_1", F.col("sensor_1").cast(DoubleType()))
    .withColumn("sensor_2", F.col("sensor_2").cast(DoubleType()))
    .withColumn("sensor_3", F.col("sensor_3").cast(DoubleType()))
    .drop("_c5")
)

display(df)

Training data preparation

# filter to data with timestamps within the training window
df_train = df.filter(
    (F.col(timestampColumn) >= trainingStartTime)
    & (F.col(timestampColumn) <= trainingEndTime)
)
display(df_train)

Test data preparation

# filter to data with timestamps within the inference window
df_test = df.filter(
    (F.col(timestampColumn) >= inferenceStartTime)
    & (F.col(timestampColumn) <= inferenceEndTime)
)
display(df_test)
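Both filters compare the formatted timestamp column against ISO 8601 strings. That works because timestamps written in the same fixed-width `yyyy-MM-dd'T'HH:mm:ss'Z'` layout sort lexicographically in chronological order. The plain-Python sketch below checks this on a few boundary values. The `in_window` and `parse_utc` helpers and the sample timestamps are assumptions for illustration.

```python
from datetime import datetime, timezone

ISO_FORMAT = "%Y-%m-%dT%H:%M:%SZ"

training_start = "2022-02-24T06:00:00Z"
training_end = "2022-03-08T23:55:00Z"


def in_window(timestamp, start, end):
    """Inclusive window check using plain string comparison."""
    return start <= timestamp <= end


def parse_utc(timestamp):
    """Parse a timestamp string into a timezone-aware datetime for cross-checking."""
    return datetime.strptime(timestamp, ISO_FORMAT).replace(tzinfo=timezone.utc)


samples = [
    "2022-02-24T05:59:59Z",  # one second before the window opens
    "2022-02-24T06:00:00Z",  # exactly on the start bound
    "2022-03-08T23:55:00Z",  # exactly on the end bound
    "2022-03-09T00:00:00Z",  # after the window closes
]

by_string = [in_window(ts, training_start, training_end) for ts in samples]
by_datetime = [
    parse_utc(training_start) <= parse_utc(ts) <= parse_utc(training_end)
    for ts in samples
]
```

The two lists agree, which is why the string filters select the intended rows. This shortcut depends on every value using the same zero-padded UTC layout; mixed formats or time zone offsets would need real timestamp parsing.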

Train the Isolation Forest model

isolationForest = (
    IsolationForest()
    .setNumEstimators(num_estimators)
    .setBootstrap(False)
    .setMaxSamples(max_samples)
    .setMaxFeatures(max_features)
    .setFeaturesCol("features")
    .setPredictionCol("predictedLabel")
    .setScoreCol("outlierScore")
    .setContamination(contamination)
    .setContaminationError(0.01 * contamination)
    .setRandomSeed(1)
)

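Next, we create an ML pipeline to train the Isolation Forest model. We also demonstrate how to create an MLflow experiment and register the trained model.

MLflow model registration is strictly required only if you need to access the trained model later. To train the model and run inference in the same notebook, the in-memory model object is sufficient.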

va = VectorAssembler(inputCols=inputCols, outputCol="features")
pipeline = Pipeline(stages=[va, isolationForest])
model = pipeline.fit(df_train)
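If the fitted pipeline needs to be reused in a later session, one possible approach is to log and register it with MLflow's Spark flavor and reload it by URI. The sketch below assumes MLflow is available in the notebook environment. The helper names and the experiment, artifact, and model names in the usage comments are hypothetical and not part of the original walkthrough.

```python
def register_model(model, experiment_name, artifact_path, model_name):
    """Log a fitted Spark ML PipelineModel to MLflow and register it by name."""
    # Imported lazily so these helpers can be defined even where MLflow is absent.
    import mlflow
    import mlflow.spark

    mlflow.set_experiment(experiment_name)
    with mlflow.start_run():
        mlflow.spark.log_model(
            model, artifact_path, registered_model_name=model_name
        )


def build_model_uri(model_name, model_version):
    """Build a registry URI of the form models:/<name>/<version>."""
    return f"models:/{model_name}/{model_version}"


def load_registered_model(model_name, model_version):
    """Load a registered Spark ML model back from the MLflow registry."""
    import mlflow.spark

    return mlflow.spark.load_model(build_model_uri(model_name, model_version))


# Hypothetical usage after `model = pipeline.fit(df_train)`:
# register_model(model, "mvad-experiment", "isolation_forest_model", "mvad-isolation-forest")
# model = load_registered_model("mvad-isolation-forest", 1)
```

For training and inference in a single notebook run, none of this is needed; the `model` variable returned by `pipeline.fit` can be used directly.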

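Run inference

Load the trained Isolation Forest model

Training and inference run in the same notebook here, so the in-memory model object from the training step is used directly.

Run inference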

df_test_pred = model.transform(df_test)
display(df_test_pred)

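Premade anomaly detectors

Azure AI Anomaly Detector

Anomaly status of the latest point: builds a model from the preceding points and determines whether the latest point is anomalous (Scala, Python)
Find anomalies: builds a model from the entire series and finds the anomalies in the series (Scala, Python)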
