microsoftml.rx_ensemble(formula: str,
data: [<class 'revoscalepy.datasource.RxDataSource.RxDataSource'>,
<class 'pandas.core.frame.DataFrame'>, <class 'list'>],
trainers: typing.List[microsoftml.modules.base_learner.BaseLearner],
method: str = None, model_count: int = None,
random_seed: int = None, replace: bool = False,
samp_rate: float = None, combine_method: ['Average', 'Median',
'Vote'] = 'Median', max_calibration: int = 100000,
split_data: bool = False, ml_transforms: list = None,
ml_transform_vars: list = None, row_selection: str = None,
transforms: dict = None, transform_objects: dict = None,
transform_function: str = None,
transform_variables: list = None,
transform_packages: list = None,
transform_environment: dict = None, blocks_per_read: int = None,
report_progress: int = None, verbose: int = 1,
compute_context: revoscalepy.computecontext.RxComputeContext.RxComputeContext = None)
Train an ensemble of models.
rx_ensemble is a function that trains a number of models of various kinds to obtain better predictive performance than could be obtained from a single model.
formula
A symbolic or mathematical formula in valid Python syntax, enclosed in double quotes. A symbolic formula might reference objects in the data source, such as "creditScore ~ yearsEmploy". Interaction terms (creditScore * yearsEmploy) and expressions (creditScore == 1) are not currently supported.
data
A data source object, a character string specifying an .xdf file, or a data frame object. Alternatively, it can be a list of data sources, in which case each model is trained using one of the data sources in the list and the length of the list must be equal to model_count.
trainers
A list of trainers with their arguments. The trainers are created by using FastTrees, FastForest, FastLinear, LogisticRegression, NeuralNetwork, or OneClassSvm.
method
A character string that specifies the type of ensemble: "anomaly" for Anomaly Detection, "binary" for Binary Classification, "multiClass" for Multiclass Classification, or "regression" for Regression.
random_seed
Specifies the random seed. The default value is None.
model_count
Specifies the number of models to train. If this number is greater than the length of the trainers list, the trainers list is duplicated to match model_count.
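The duplication rule for model_count can be pictured with a short sketch. The source only says the trainers list is duplicated; the cyclic order used here and the helper name expand_trainers are assumptions, and plain strings stand in for trainer objects.

```python
from itertools import cycle, islice


def expand_trainers(trainers, model_count=None):
    """Repeat `trainers` until the list has `model_count` entries.

    Hypothetical helper: the cyclic repetition order is an assumption,
    not documented microsoftml behavior.
    """
    if not trainers:
        raise ValueError("trainers must contain at least one trainer")
    if model_count is None or model_count <= len(trainers):
        # The source does not describe this case; the list is returned unchanged.
        return list(trainers)
    # Walk the list cyclically and stop after model_count entries.
    return list(islice(cycle(trainers), model_count))


# Strings are placeholders for real trainer objects.
print(expand_trainers(["FastTrees", "FastLinear"], model_count=5))
```

With two trainers and model_count=5, the sketch yields five entries that alternate between the two trainers.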
replace
A logical value specifying whether observations are sampled with or without replacement. The default value is False.
samp_rate
A positive scalar specifying the fraction of observations to sample for each trainer. The default is 1.0 for sampling with replacement (i.e., replace=True) and 0.632 for sampling without replacement (i.e., replace=False). When split_data is True, the default of samp_rate is 1.0 (no sampling is done before splitting).
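The interaction between replace, samp_rate, and split_data can be sketched in plain Python. This only illustrates the documented defaults; how microsoftml actually draws rows is not described in the source, and the helper names default_samp_rate and sample_rows are hypothetical.

```python
import random


def default_samp_rate(replace=False, split_data=False):
    """Return the documented default for samp_rate."""
    if split_data:
        return 1.0  # no sampling is done before splitting
    return 1.0 if replace else 0.632


def sample_rows(n_rows, replace=False, samp_rate=None, split_data=False, seed=None):
    """Pick row indices for one trainer (illustrative sketch only)."""
    if samp_rate is None:
        samp_rate = default_samp_rate(replace, split_data)
    if samp_rate <= 0:
        raise ValueError("samp_rate must be positive")
    rng = random.Random(seed)
    k = round(n_rows * samp_rate)
    if replace:
        # With replacement: the same row may be drawn more than once.
        return [rng.randrange(n_rows) for _ in range(k)]
    # Without replacement: every drawn row index is distinct.
    return rng.sample(range(n_rows), min(k, n_rows))


without = sample_rows(1000, replace=False, seed=0)
with_repl = sample_rows(1000, replace=True, seed=0)
print(len(without), len(with_repl))
```

For 1000 rows the defaults give 632 distinct rows without replacement and 1000 draws with replacement.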
split_data
A logical value specifying whether or not to train the base models on non-overlapping partitions. The default is False. It is available only for the RxSpark compute context and is ignored for others.
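As a rough picture of what non-overlapping partitions means, the sketch below splits row indices into disjoint parts, one per model. The round-robin assignment and the helper name partition_rows are assumptions; the source does not say how the RxSpark compute context forms its partitions.

```python
def partition_rows(n_rows, model_count):
    """Split row indices 0..n_rows-1 into model_count disjoint parts.

    Hypothetical helper: round-robin assignment is an assumption used
    only to illustrate that no row is shared between base models.
    """
    if model_count < 1:
        raise ValueError("model_count must be at least 1")
    # Part i receives rows i, i + model_count, i + 2 * model_count, ...
    return [list(range(i, n_rows, model_count)) for i in range(model_count)]


parts = partition_rows(10, 3)
print(parts)
```

Each row index appears in exactly one part, so no two base models see the same observation.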
combine_method
Specifies the method used to combine the models:
"Median": computes the median of the individual model outputs.
"Average": computes the average of the individual model outputs.
"Vote": computes (pos - neg) / the total number of models, where pos is the number of positive outputs and neg is the number of negative outputs.
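The three combination rules can be written out in a few lines of plain Python. This is a sketch of the arithmetic described above, not the library's implementation; treating an output greater than zero as positive and less than zero as negative is an assumption, and combine_outputs is a hypothetical name.

```python
from statistics import mean, median


def combine_outputs(outputs, combine_method="Median"):
    """Combine the outputs of individual models for one observation."""
    if not outputs:
        raise ValueError("outputs must not be empty")
    if combine_method == "Median":
        return median(outputs)
    if combine_method == "Average":
        return mean(outputs)
    if combine_method == "Vote":
        # Assumption: the sign of an output decides whether it is positive or negative.
        pos = sum(1 for o in outputs if o > 0)
        neg = sum(1 for o in outputs if o < 0)
        return (pos - neg) / len(outputs)
    raise ValueError("combine_method must be 'Median', 'Average', or 'Vote'")


scores = [1.0, -2.0, 3.0, 10.0]  # made-up outputs from four models
for name in ("Median", "Average", "Vote"):
    print(name, combine_outputs(scores, name))
```

For these four made-up scores the median is 2.0, the average is 3.0, and the vote is (3 - 1) / 4 = 0.5.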
max_calibration
Specifies the maximum number of examples to use for calibration. This argument is ignored for all tasks other than binary classification.
ml_transforms
Specifies a list of MicrosoftML transforms to be performed on the data before training, or None if no transforms are to be performed. Transforms that require an additional pass over the data (such as featurize_text and categorical) are not allowed. These transformations are performed after any specified Python transformations. The default value is None.
ml_transform_vars
Specifies a character vector of variable names to be used in ml_transforms, or None if none are to be used. The default value is None.
row_selection
NOT SUPPORTED. Specifies the rows (observations) from the data set that are to be used by the model, either with the name of a logical variable from the data set (in quotes) or with a logical expression using variables in the data set. For example:
row_selection = "old" uses only observations in which the value of the variable old is True.
row_selection = (age > 20) & (age < 65) & (log(income) > 10) uses only observations in which the value of the age variable is between 20 and 65 and the value of the log of the income variable is greater than 10.
The row selection is performed after processing any data transformations (see the arguments transforms or transform_function). As with all expressions, row_selection can be defined outside of the function call using the expression function.
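Because this argument is marked NOT SUPPORTED, one workaround is to filter the rows before handing the data to the trainer. The sketch below applies the age and income condition from the example to a small list of dictionaries; the rows are made up for illustration and keep_row is a hypothetical helper name.

```python
import math

# Made-up observations, used only to illustrate the filter.
rows = [
    {"age": 30, "income": 50000.0},
    {"age": 70, "income": 80000.0},
    {"age": 45, "income": 5000.0},
    {"age": 18, "income": 30000.0},
]


def keep_row(row):
    """Mirror (age > 20) & (age < 65) & (log(income) > 10)."""
    return 20 < row["age"] < 65 and math.log(row["income"]) > 10


selected = [row for row in rows if keep_row(row)]
print(selected)
```

Only the first row passes: the second and fourth fail the age bounds, and the third has log(income) below 10.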
transforms
NOT SUPPORTED. An expression of the form that represents the first round of variable transformations. As with all expressions, transforms (or row_selection) can be defined outside of the function call using the expression function.
transform_objects
NOT SUPPORTED. A named list that contains objects that can be referenced by transforms, transform_function, and row_selection.
transform_function
The variable transformation function.
transform_variables
A character vector of input data set variables needed for the transformation function.
transform_packages
NOT SUPPORTED. A character vector specifying additional Python packages (outside of those specified in RxOptions.get_option("transform_packages")) to be made available and preloaded for use in variable transformation functions. For example, those explicitly defined in revoscalepy functions via their transforms and transform_function arguments or those defined implicitly via their formula or row_selection arguments. The transform_packages argument may also be None, indicating that no packages outside RxOptions.get_option("transform_packages") are preloaded.
transform_environment
NOT SUPPORTED. A user-defined environment to serve as a parent to all environments developed internally and used for variable data transformation. If transform_environment = None, a new "hash" environment with parent revoscalepy.baseenv is used instead.
blocks_per_read
Specifies the number of blocks to read for each chunk of data read from the data source.
report_progress
An integer value that specifies the level of reporting on the row processing progress:
0: no progress is reported.
1: the number of processed rows is printed and updated.
2: rows processed and timings are reported.
3: rows processed and all timings are reported.
verbose
An integer value that specifies the amount of output wanted. If 0, no verbose output is printed during calculations. Integer values from 1 to 4 provide increasing amounts of information.
compute_context
Sets the context in which computations are executed, specified with a valid revoscalepy.RxComputeContext. Currently local and revoscalepy.RxSpark compute contexts are supported. When revoscalepy.RxSpark is specified, the training of the models is done in a distributed way, and the ensembling is done locally. Note that the compute context cannot be non-waiting.
Returns
An rx_ensemble object with the trained ensemble model.
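Several of the argument rules described on this page can be checked before training starts. The following is a minimal pre-flight sketch in plain Python; check_ensemble_args is a hypothetical helper, not part of microsoftml, and it covers only the constraints stated here (allowed method and combine_method values, and the data list length).

```python
METHODS = {"anomaly", "binary", "multiClass", "regression"}
COMBINE_METHODS = {"Average", "Median", "Vote"}


def check_ensemble_args(trainers, method, model_count=None, data=None,
                        combine_method="Median"):
    """Raise ValueError if an argument contradicts the documented rules.

    Hypothetical helper; it does not call microsoftml.
    """
    if not trainers:
        raise ValueError("trainers must be a non-empty list")
    if method not in METHODS:
        raise ValueError("method must be one of %s" % sorted(METHODS))
    if combine_method not in COMBINE_METHODS:
        raise ValueError("combine_method must be one of %s" % sorted(COMBINE_METHODS))
    # A list of data sources must provide exactly one source per model.
    if isinstance(data, list) and model_count is not None and len(data) != model_count:
        raise ValueError("length of the data list must equal model_count")


# Placeholder strings stand in for trainer objects and data sources.
check_ensemble_args(["t1", "t2"], "regression", model_count=2,
                    data=["source_a", "source_b"], combine_method="Average")
print("arguments look consistent")
```

Running a check like this first surfaces a mismatched data list or a misspelled method string before any model is trained.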