Overview of forecasting methods in AutoML

This article describes the methods that AutoML in Azure Machine Learning uses to prepare time series data and build forecasting models. For instructions and examples about training forecasting models in AutoML, see Set up AutoML for time series forecasting.

Forecasting methods in AutoML

AutoML uses several methods to forecast time series values. These methods can be roughly assigned to two categories:

  • Time series models that use historical values of the target quantity to make predictions into the future
  • Regression, or explanatory, models that use predictor variables to forecast values of the target

Suppose you need to forecast daily demand for a particular brand of orange juice from a grocery store. For the expression, let $y_t$ represent the demand for this brand on day $t$. A time series model predicts demand at $t+1$ by using some function of historical demand with the following expression:

$y_{t+1} = f(y_t, y_{t-1}, \ldots, y_{t-s})$

The function $f$ often has parameters that you tune by using observed demand from the past. The amount of history that $f$ uses to make predictions, $s$, can also be considered a parameter of the model.

The time series model in the orange juice demand example might not be accurate enough because it uses only information about past demand. There are many other factors that can influence future demand, such as price, day of the week, and holiday periods. Consider a regression model that uses these predictor variables with the following expression:

$y = g(\text{price}, \text{day of week}, \text{holiday})$

Again, the function $g$ generally has a set of parameters, including values that govern regularization, which AutoML tunes by using past values of the demand and the predictors. You omit $t$ from the expression to emphasize that the regression model uses correlational patterns between contemporaneously defined variables to make predictions. To predict $y_{t+1}$ from $g$, you need to know which day of the week corresponds to $t+1$, whether the day is a holiday, and the orange juice price on day $t+1$. The first two pieces of information are easy to identify by using a calendar. A retail price is commonly set in advance, so the price of orange juice is likely also known one day in advance. However, the price might not be known 10 days into the future. It's important to understand that the utility of this regression is limited by how far into the future you need forecasts, also called the forecast horizon, and to what degree you know the future values of the predictors.

Important

AutoML's forecasting regression models assume that all features provided by the user are known into the future, at least up to the forecast horizon.

AutoML's forecasting regression models can also be augmented to use historical values of the target and predictors. The result is a hybrid model with characteristics of a time series model and a pure regression model. Historical quantities are extra predictor variables in the regression referred to as lagged quantities. The order of the lag refers to how far back the value is known. For example, the current value of an order-two lag of the target for the orange juice demand example is the observed juice demand from two days ago.

Another notable difference between the time series models and the regression models is how they generate forecasts. Recursion relations generally define time series models that produce forecasts one-at-a-time. To forecast many periods into the future, they iterate up-to the forecast horizon, feeding previous forecasts back into the model to generate the next one-period-ahead forecast as needed. In contrast, the regression models are considered direct forecasters that generate all forecasts up to the horizon in a single attempt. Direct forecasters can be preferable to recursive methods because recursive models compound prediction error when they feed previous forecasts back into the model. When lag features are included, AutoML makes some important modifications to the training data so the regression models can function as direct forecasters. For more information, see Lag features for time-series forecasting in AutoML.

Forecasting models in AutoML

AutoML in Machine Learning implements the following forecasting models. For each category, the models are listed roughly in order of the complexity of patterns they can incorporate, also known as the model capacity. A Naive model, which simply forecasts the last observed value, has low capacity while the Temporal Convolutional Network (TCNForecaster), a deep neural network (DNN) with potentially millions of tunable parameters, has high capacity.

Time series models Regression models
Naive, Seasonal Naive, Average, Seasonal Average, ARIMA(X), Exponential Smoothing Linear SGD, LARS LASSO, Elastic Net, Prophet, K Nearest Neighbors, Decision Tree, Random Forest, Extremely Randomized Trees, Gradient Boosted Trees, LightGBM, XGBoost, TCNForecaster

AutoML also includes ensemble models that create weighted combinations of the best performing models to further improve accuracy. For forecasting, you use a soft voting ensemble where composition and weights are found by using the Caruana Ensemble Selection Algorithm.

Note

There are two important caveats for forecast model ensembles:

  • The TCN cannot currently be included in ensembles.
  • By default, AutoML disables the stack ensemble method, which is included with default regression and classification tasks in AutoML. The stack ensemble fits a meta-model on the best model forecasts to find ensemble weights. During internal benchmarking, this strategy has an increased tendency to over fit time series data. This result can result in poor generalization, so the stack ensemble is disabled by default. You can enable the ensemble in the AutoML configuration, as needed.

How AutoML uses your data

AutoML accepts time series data in tabular "wide" format. Each variable must have its own corresponding column. AutoML requires one column to be the time axis for the forecasting problem. This column must be parsable into a datetime type. The simplest time series dataset consists of a time column and a numeric target column. The target is the variable you intend to predict into the future. The following table shows example values for this format:

timestamp quantity
2012-01-01 100
2012-01-02 97
2012-01-03 106
... ...
2013-12-31 347

In more complex cases, the dataset might contain other columns aligned with the time index:

timestamp SKU price advertised quantity
2012-01-01 JUICE1 3.5 0 100
2012-01-01 BREAD3 5.76 0 47
2012-01-02 JUICE1 3.5 0 97
2012-01-02 BREAD3 5.5 1 68
... ... ... ... ...
2013-12-31 JUICE1 3.75 0 347

The second example includes a SKU, a retail price, and a flag to indicate whether an item was advertised in addition to the timestamp and target quantity. The second dataset reveals two series: one for the JUICE1 SKU and one for the BREAD3 SKU. The SKU column is a time series ID column because grouping by these column values produces two groups that each contain a single series. Before the model sweep, AutoML does basic validation of the input configuration and data and adds engineered features.

Data length requirements

To train a forecasting model, you must have a sufficient amount of historical data. This threshold quantity varies with the training configuration. If you provide validation data, the minimum number of training observations required per time series is expressed as follows:

$T_{\text{user validation}} = H + \text{max}(l_{\text{max}}, s_{\text{window}}) + 1$

In this expression, $H$ is the forecast horizon, $l_{\text{max}}$ is the maximum lag order, and $s_{\text{window}}$ is the window size for rolling aggregation features. If you use cross-validation, the minimum number of observations is expressed as follows:

$T_{\text{CV}} = 2H + (n_{\text{CV}} - 1) n_{\text{step}} + \text{max}(l_{\text{max}}, s_{\text{window}}) + 1$

In this version, $n_{\text{CV}}$ is the number of cross-validation folds and $n_{\text{step}}$ is the CV step size, or offset between CV folds. The basic logic behind these formulas is that you should always have at least a horizon of training observations for each time series, including some padding for lags and cross-validation splits. For more information about cross-validation for forecasting, see Model selection in AutoML.

Missing data handling

The time series models in AutoML require regularly spaced observations in time, which includes cases like monthly or yearly observations where the number of days between observations can vary. Before the modeling process initiates, AutoML must ensure there are no missing series values and that the observations are regular. As a result, there are two missing data cases:

  • A value is missing for some cell in the tabular data.
  • A row is missing, which corresponds with an expected observation given the time series frequency.

In the first case, AutoML imputes missing values by using common configurable techniques. The following table shows an example of an expected row that's missing:

timestamp quantity
2012-01-01 100
2012-01-03 106
2012-01-04 103
... ...
2013-12-31 347

This series ostensibly has a daily frequency, but there's no observation for January 2, 2012 (2012-01-02). In this case, AutoML attempts to fill in the data by adding a new row for the missing value. The new value for the quantity column, and any other columns in the data, are then imputed like other missing values. To execute this process, AutoML must recognize the series frequency to be able to fill in observation gaps as demonstrated in this case. AutoML automatically detects this frequency, or, optionally, the user can provide it in the configuration.

The imputation method for supplying missing values can be configured in the input. The following table lists the default methods:

Column type Default imputation method
Target Forward fill (last observation carried forward)
Numeric feature Median value

Missing values for categorical features are handled during numerical encoding by including another category that corresponds to a missing value. Imputation is implicit in this case.

Automated feature engineering

AutoML generally adds new columns to user data to increase modeling accuracy. Engineered features can include default or optional items.

Default engineered features:

  • Calendar features derived from the time index, such as day of the week
  • Categorical features derived from time series IDs
  • Encoding categorical types to numeric type

Optional engineered features:

You can configure featurization from the AutoML SDK by using the ForecastingJob class or from the Azure Machine Learning studio web interface.

Nonstationary time series detection and handling

A time series where mean and variance change over time is called non-stationary. Time series that exhibit stochastic trends are nonstationary by nature.

The following image presents a visualization for this scenario. The chart plots a series that's generally trending upward. If you compute and compare the mean (average) values for the first and second half of the series, you can identify the differences. The mean of the series in the first half of the plot is smaller than the mean in the second half. The fact that the mean of the series depends on the time interval under review is an example of the time-varying moments. In this scenario, the mean of a series is the first moment.

Diagram showing retail sales for a nonstationary time series.

The next image shows a chart that plots the original series in first differences, $\Delta y_{t} = y_t - y_{t-1}$. The mean of the series is roughly constant over the time range while the variance appears to vary. This scenario demonstrates an example of a first-order stationary times series:

Diagram showing retail sales for a weakly stationary time series.

AutoML regression models can't inherently deal with stochastic trends or other well-known problems associated with nonstationary time series. As a result, out-of-sample forecast accuracy can be poor when such trends are present.

AutoML automatically analyzes a time series dataset to determine its level or stationarity. When nonstationary time series are detected, AutoML applies a differencing transform automatically to mitigate the effects of nonstationary behavior.

Model sweeping

After data is prepared with missing data handling and feature engineering, AutoML sweeps over a set of models and hyper-parameters by using a model recommendation service.

The models are ranked based on validation or cross-validation metrics, and then, optionally, the top models can be used in an ensemble model. The best model, or any of the trained models, can be inspected, downloaded, or deployed to produce forecasts as needed. For more information, see Model sweeping and selection for forecasting in AutoML.

Model grouping

When a dataset contains more than one time series, there are multiple ways to model the data. You can group by the data in the time series ID columns and train independent models for each series. A more general approach is to partition the data into groups that can each contain multiple (likely related) series and train a model per group.

By default, AutoML forecasting uses a mixed approach to model grouping. Time series models, plus ARIMAX and Prophet, assign one series to one group and other regression models assign all series to a single group.

Here's how each model type uses groups:

  • Each series in a separate group (1:1): Naive, Seasonal Naive, Average, Seasonal Average, Exponential Smoothing, ARIMA, ARIMAX, Prophet

  • All series in the same group (N:1): Linear SGD, LARS LASSO, Elastic Net, K Nearest Neighbors, Decision Tree, Random Forest, Extremely Randomized Trees, Gradient Boosted Trees, LightGBM, XGBoost, TCNForecaster

More general model groupings are possible by using the many models solution in AutoML. For more information, see Many Models - Automated ML notebook.