SparkJob Class
A standalone Spark job.
- Inheritance
-
azure.ai.ml.entities._job.job.JobSparkJobazure.ai.ml.entities._job.parameterized_spark.ParameterizedSparkSparkJobazure.ai.ml.entities._job.job_io_mixin.JobIOMixinSparkJobazure.ai.ml.entities._job.spark_job_entry_mixin.SparkJobEntryMixinSparkJob
Constructor
SparkJob(*, driver_cores: int | str | None = None, driver_memory: str | None = None, executor_cores: int | str | None = None, executor_memory: str | None = None, executor_instances: int | str | None = None, dynamic_allocation_enabled: bool | str | None = None, dynamic_allocation_min_executors: int | str | None = None, dynamic_allocation_max_executors: int | str | None = None, inputs: Dict[str, Input | str | bool | int | float] | None = None, outputs: Dict[str, Output] | None = None, compute: str | None = None, identity: Dict[str, str] | ManagedIdentityConfiguration | AmlTokenConfiguration | UserIdentityConfiguration | None = None, resources: Dict | SparkResourceConfiguration | None = None, **kwargs: Any)
Keyword-Only Parameters
Name | Description |
---|---|
driver_cores
|
The number of cores to use for the driver process, only in cluster mode. |
driver_memory
|
The amount of memory to use for the driver process, formatted as strings with a size unit suffix ("k", "m", "g" or "t") (e.g. "512m", "2g"). |
executor_cores
|
The number of cores to use on each executor. |
executor_memory
|
The amount of memory to use per executor process, formatted as strings with a size unit suffix ("k", "m", "g" or "t") (e.g. "512m", "2g"). |
executor_instances
|
The initial number of executors. |
dynamic_allocation_enabled
|
Whether to use dynamic resource allocation, which scales the number of executors registered with this application up and down based on the workload. |
dynamic_allocation_min_executors
|
The lower bound for the number of executors if dynamic allocation is enabled. |
dynamic_allocation_max_executors
|
The upper bound for the number of executors if dynamic allocation is enabled. |
inputs
|
The mapping of input data bindings used in the job. |
outputs
|
The mapping of output data bindings used in the job. |
compute
|
The compute resource the job runs on. |
identity
|
Optional[Union[dict[str, str], <xref:azure.ai.ml.ManagedIdentityConfiguration>, <xref:azure.ai.ml.AmlTokenConfiguration>, <xref:azure.ai.ml.UserIdentityConfiguration>]]
The identity that the Spark job will use while running on compute. |
resources
Required
|
|
Examples
Configuring a SparkJob.
from azure.ai.ml import Input, Output
from azure.ai.ml.entities import SparkJob
spark_job = SparkJob(
code="./sdk/ml/azure-ai-ml/tests/test_configs/dsl_pipeline/spark_job_in_pipeline/basic_src",
entry={"file": "sampleword.py"},
conf={
"spark.driver.cores": 2,
"spark.driver.memory": "1g",
"spark.executor.cores": 1,
"spark.executor.memory": "1g",
"spark.executor.instances": 1,
},
environment="AzureML-sklearn-1.0-ubuntu20.04-py38-cpu:33",
inputs={
"input1": Input(
type="uri_file", path="azureml://datastores/workspaceblobstore/paths/python/data.csv", mode="direct"
)
},
compute="synapsecompute",
outputs={"component_out_path": Output(type="uri_folder")},
args="--input1 ${{inputs.input1}} --output2 ${{outputs.output1}} --my_sample_rate ${{inputs.sample_rate}}",
)
Methods
dump |
Dumps the job content into a file in YAML format. |
filter_conf_fields |
Filters out the fields of the conf attribute that are not among the Spark configuration fields listed in ~azure.ai.ml._schema.job.parameterized_spark.CONF_KEY_MAP and returns them in their own dictionary. |
dump
Dumps the job content into a file in YAML format.
dump(dest: str | PathLike | IO, **kwargs: Any) -> None
Parameters
Name | Description |
---|---|
dest
Required
|
The local path or file stream to write the YAML content to. If dest is a file path, a new file will be created. If dest is an open file, the file will be written to directly. |
Exceptions
Type | Description |
---|---|
Raised if dest is a file path and the file already exists. |
|
Raised if dest is an open file and the file is not writable. |
filter_conf_fields
Filters out the fields of the conf attribute that are not among the Spark configuration fields listed in ~azure.ai.ml._schema.job.parameterized_spark.CONF_KEY_MAP and returns them in their own dictionary.
filter_conf_fields() -> Dict[str, str]
Returns
Type | Description |
---|---|
A dictionary of the conf fields that are not Spark configuration fields. |
Exceptions
Type | Description |
---|---|
Raised if dest is a file path and the file already exists. |
|
Raised if dest is an open file and the file is not writable. |
Attributes
base_path
creation_context
The creation context of the resource.
Returns
Type | Description |
---|---|
The creation metadata for the resource. |
entry
environment
The Azure ML environment to run the Spark component or job in.
Returns
Type | Description |
---|---|
The Azure ML environment to run the Spark component or job in. |
id
The resource ID.
Returns
Type | Description |
---|---|
The global ID of the resource, an Azure Resource Manager (ARM) ID. |
identity
The identity that the Spark job will use while running on compute.
Returns
Type | Description |
---|---|
The identity that the Spark job will use while running on compute. |
inputs
log_files
Job output files.
Returns
Type | Description |
---|---|
The dictionary of log names and URLs. |
outputs
resources
The compute resource configuration for the job.
Returns
Type | Description |
---|---|
The compute resource configuration for the job. |
status
The status of the job.
Common values returned include "Running", "Completed", and "Failed". All possible values are:
NotStarted - This is a temporary state that client-side Run objects are in before cloud submission.
Starting - The Run has started being processed in the cloud. The caller has a run ID at this point.
Provisioning - On-demand compute is being created for a given job submission.
Preparing - The run environment is being prepared and is in one of two stages:
Docker image build
conda environment setup
Queued - The job is queued on the compute target. For example, in BatchAI, the job is in a queued state
while waiting for all the requested nodes to be ready.
Running - The job has started to run on the compute target.
Finalizing - User code execution has completed, and the run is in post-processing stages.
CancelRequested - Cancellation has been requested for the job.
Completed - The run has completed successfully. This includes both the user code execution and run
post-processing stages.
Failed - The run failed. Usually the Error property on a run will provide details as to why.
Canceled - Follows a cancellation request and indicates that the run is now successfully cancelled.
NotResponding - For runs that have Heartbeats enabled, no heartbeat has been recently sent.
Returns
Type | Description |
---|---|
Status of the job. |
studio_url
type
CODE_ID_RE_PATTERN
CODE_ID_RE_PATTERN = re.compile('\\/subscriptions\\/(?P<subscription>[\\w,-]+)\\/resourceGroups\\/(?P<resource_group>[\\w,-]+)\\/providers\\/Microsoft\\.MachineLearningServices\\/workspaces\\/(?P<workspace>[\\w,-]+)\\/codes\\/(?P<co)