OutputFileDatasetConfig Class
Represents how to copy the output of a run and have it promoted to a FileDataset.
The OutputFileDatasetConfig allows you to specify how you want a particular local path on the compute target to be uploaded to the specified destination. If no arguments are passed to the constructor, we will automatically generate a name, a destination, and a local path.
An example of not passing any arguments:
```python
from azureml.core import Experiment, ScriptRunConfig, Workspace
from azureml.data import OutputFileDatasetConfig

workspace = Workspace.from_config()
experiment = Experiment(workspace, 'output_example')
output = OutputFileDatasetConfig()
script_run_config = ScriptRunConfig('.', 'train.py', arguments=[output])
run = experiment.submit(script_run_config)
print(run)
```
An example of creating an output, promoting it to a tabular dataset, and registering it under the name foo:
```python
from azureml.core import Datastore, Experiment, ScriptRunConfig, Workspace
from azureml.data import OutputFileDatasetConfig

workspace = Workspace.from_config()
experiment = Experiment(workspace, 'output_example')
datastore = Datastore(workspace, 'example_adls_gen2_datastore')

# For more information on the parameters and methods, see the corresponding documentation.
output = OutputFileDatasetConfig(destination=(datastore, 'outputs/{run-id}')).read_delimited_files().register_on_complete('foo')
script_run_config = ScriptRunConfig('.', 'train.py', arguments=[output])
run = experiment.submit(script_run_config)
print(run)
```
Constructor

Initialize an OutputFileDatasetConfig.

```python
OutputFileDatasetConfig(name=None, destination=None, source=None, partition_format=None)
```
Parameters

Name | Description |
---|---|
name | The name of the output specific to this run. This is generally used for lineage purposes. If set to None, we will automatically generate a name. The name also becomes an environment variable that contains the local path where you can write the output files and folders that will be uploaded to the destination. |
destination | The destination to copy the output to. If set to None, we copy the output to the workspaceblobstore datastore, under the path /dataset/{run-id}/{output-name}, where run-id is the Run's ID and output-name is the output name from the name parameter above. The destination is a tuple where the first item is the datastore and the second item is the path within the datastore to copy the data to. The path within the datastore can be a template path, that is, a regular path containing placeholders that are resolved at the appropriate time. The placeholder syntax is {placeholder}, for example, /path/with/{placeholder}. Currently only two placeholders are supported: {run-id} and {output-name}. See the sketch after this table. |
source | The path within the compute target to copy the data from. If set to None, we set this to a directory we create inside the compute target's OS temporary directory. |
partition_format | The partition format of the path. Defaults to None. The partition information of each path is extracted into columns based on the specified format. The format part '{column_name}' creates a string column, and '{column_name:yyyy/MM/dd/HH/mm/ss}' creates a datetime column, where 'yyyy', 'MM', 'dd', 'HH', 'mm' and 'ss' extract the year, month, day, hour, minute and second for the datetime type. The format should start at the position of the first partition key and run to the end of the file path. For example, given the path '../Accounts/2019/01/01/data.parquet', where the partition is by department name and time, partition_format='/{Department}/{PartitionDate:yyyy/MM/dd}/data.parquet' creates a string column 'Department' with the value 'Accounts' and a datetime column 'PartitionDate' with the value '2019-01-01'. See the sketch after this table. |
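A minimal sketch of passing an explicit destination and partition_format. The output name, the datastore (workspaceblobstore is the workspace default blob datastore), the template path, and the column names are illustrative assumptions:

```python
from azureml.core import Datastore, Workspace
from azureml.data import OutputFileDatasetConfig

workspace = Workspace.from_config()
datastore = Datastore(workspace, 'workspaceblobstore')

# Copy the output to a template path on the datastore; {run-id} and {output-name}
# are resolved when the run executes.
output = OutputFileDatasetConfig(
    name='predictions',
    destination=(datastore, 'outputs/{run-id}/{output-name}'),
    # Illustrative partition format: extracts a string column 'Department' and a
    # datetime column 'PartitionDate' from paths such as /Accounts/2019/01/01/data.parquet.
    partition_format='/{Department}/{PartitionDate:yyyy/MM/dd}/data.parquet')
```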
Remarks
You can pass the OutputFileDatasetConfig as an argument to your run, and it will be automatically translated into a local path on the compute (see the sketch below). The source argument will be used if one is specified; otherwise we will automatically generate a directory in the OS's temp folder. The files and folders inside the source directory will then be copied to the destination based on the output configuration.
By default, the output is copied to the destination storage in mount mode. For more information about mount mode, see the documentation for as_mount.
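A minimal sketch of what the corresponding train.py might do with the resolved path. The argument position matches the arguments list in the examples above; the file name written is an illustrative assumption:

```python
# train.py -- minimal sketch of consuming the resolved output path.
import os
import sys

# The OutputFileDatasetConfig passed via `arguments=[output]` arrives as a local folder path.
output_dir = sys.argv[1]
os.makedirs(output_dir, exist_ok=True)

# Anything written under this folder is copied to the configured destination
# (mount mode by default, so files are uploaded as they are closed).
with open(os.path.join(output_dir, 'predictions.csv'), 'w') as f:
    f.write('id,prediction\n')
```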
Methods

Name | Description |
---|---|
as_input | Specify how to consume the output as an input in subsequent pipeline steps. |
as_mount | Set the mode of the output to mount. For mount mode, the output directory will be a FUSE mounted directory. Files written to the mounted directory will be uploaded when the file is closed. |
as_upload | Set the mode of the output to upload. For upload mode, files written to the output directory will be uploaded at the end of the job. If the job fails or gets canceled, the output directory will not be uploaded. |
as_input

Specify how to consume the output as an input in subsequent pipeline steps.

```python
as_input(name=None)
```

Parameters

Name | Description |
---|---|
name | The name of the input specific to the run. |

Returns

Type | Description |
---|---|
DatasetConsumptionConfig | A DatasetConsumptionConfig instance describing how to deliver the input data. |
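A minimal sketch of chaining the output of one pipeline step into the next step as an input. The step scripts, compute target name, and input name are illustrative assumptions:

```python
from azureml.core import Workspace
from azureml.data import OutputFileDatasetConfig
from azureml.pipeline.core import Pipeline
from azureml.pipeline.steps import PythonScriptStep

workspace = Workspace.from_config()
output = OutputFileDatasetConfig(name='processed_data')

# The first step writes into the output folder.
prep_step = PythonScriptStep(
    script_name='prep.py',
    arguments=[output],
    compute_target='cpu-cluster',
    source_directory='.')

# The second step consumes the same data as an input named 'prepped'.
train_step = PythonScriptStep(
    script_name='train.py',
    arguments=[output.as_input(name='prepped')],
    compute_target='cpu-cluster',
    source_directory='.')

pipeline = Pipeline(workspace, steps=[prep_step, train_step])
```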
as_mount

Set the mode of the output to mount.

For mount mode, the output directory will be a FUSE mounted directory. Files written to the mounted directory will be uploaded when the file is closed.

```python
as_mount(disable_metadata_cache=False)
```

Parameters

Name | Description |
---|---|
disable_metadata_cache | Whether to cache metadata on the local node. If disabled, a node will not be able to see files generated from other nodes while the job is running. |

Returns

Type | Description |
---|---|
OutputFileDatasetConfig | An OutputFileDatasetConfig instance with mode set to mount. |
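A minimal sketch of setting mount mode explicitly (it is already the default); the output name and script are illustrative assumptions:

```python
from azureml.core import ScriptRunConfig
from azureml.data import OutputFileDatasetConfig

# Files written under the mounted output folder are uploaded as they are closed.
output = OutputFileDatasetConfig(name='mounted_output').as_mount()

script_run_config = ScriptRunConfig('.', 'train.py', arguments=[output])
```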
as_upload

Set the mode of the output to upload.

For upload mode, files written to the output directory will be uploaded at the end of the job. If the job fails or gets canceled, the output directory will not be uploaded.

```python
as_upload(overwrite=False, source_globs=None)
```

Parameters

Name | Description |
---|---|
overwrite | Whether to overwrite files that already exist in the destination. |
source_globs | Glob patterns used to filter files that will be uploaded. |

Returns

Type | Description |
---|---|
OutputFileDatasetConfig | An OutputFileDatasetConfig instance with mode set to upload. |
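A minimal sketch of switching to upload mode and filtering what is uploaded; the output name, script, and glob pattern are illustrative assumptions:

```python
from azureml.core import ScriptRunConfig
from azureml.data import OutputFileDatasetConfig

# Upload only CSV files at the end of the job, overwriting any existing files
# at the destination.
output = OutputFileDatasetConfig(name='csv_output').as_upload(
    overwrite=True, source_globs=['**/*.csv'])

script_run_config = ScriptRunConfig('.', 'train.py', arguments=[output])
```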