OutputFileDatasetConfig Class
Represents how to copy the output of a run and have it promoted to a FileDataset.
The OutputFileDatasetConfig allows you to specify how you want a particular local path on the compute target to be uploaded to the specified destination. If no arguments are passed to the constructor, we will automatically generate a name, a destination, and a local path.
An example of not passing any arguments:
```python
from azureml.core import Experiment, ScriptRunConfig, Workspace
from azureml.data import OutputFileDatasetConfig

workspace = Workspace.from_config()
experiment = Experiment(workspace, 'output_example')
output = OutputFileDatasetConfig()
script_run_config = ScriptRunConfig('.', 'train.py', arguments=[output])
run = experiment.submit(script_run_config)
print(run)
```
An example of creating an output, promoting it to a tabular dataset, and registering it under the name foo:
```python
from azureml.core import Datastore, Experiment, ScriptRunConfig, Workspace
from azureml.data import OutputFileDatasetConfig

workspace = Workspace.from_config()
experiment = Experiment(workspace, 'output_example')
datastore = Datastore(workspace, 'example_adls_gen2_datastore')

# For more information on the parameters and methods, see the corresponding documentation.
output = OutputFileDatasetConfig(destination=(datastore, 'outputs/{run-id}')).read_delimited_files().register_on_complete('foo')
script_run_config = ScriptRunConfig('.', 'train.py', arguments=[output])
run = experiment.submit(script_run_config)
print(run)
```
Constructor

Initialize an OutputFileDatasetConfig.

```python
OutputFileDatasetConfig(name=None, destination=None, source=None, partition_format=None)
```
Parameters

Name | Description |
---|---|
name | The name of the output specific to this run. This is generally used for lineage purposes. If set to None, we will automatically generate a name. The name also becomes an environment variable that contains the local path where you can write the output files and folders that will be uploaded to the destination. |
destination | The destination to copy the output to. If set to None, we copy the output to the workspaceblobstore datastore, under the path /dataset/{run-id}/{output-name}, where run-id is the Run's ID and output-name is the output name from the name parameter above. The destination is a tuple where the first item is the datastore and the second item is the path within the datastore to copy the data to. The path within the datastore can be a template path, that is, a regular path containing placeholders that are resolved at the appropriate time. The placeholder syntax is {placeholder}, for example, /path/with/{placeholder}. Currently only two placeholders are supported: {run-id} and {output-name}. See the sketch after this table. |
source | The path within the compute target to copy the data from. If set to None, we set this to a directory we create inside the compute target's OS temporary directory. |
partition_format | The partition format of the path. Defaults to None. The partition information of each path is extracted into columns based on the specified format. The format part '{column_name}' creates a string column, and '{column_name:yyyy/MM/dd/HH/mm/ss}' creates a datetime column, where 'yyyy', 'MM', 'dd', 'HH', 'mm' and 'ss' extract the year, month, day, hour, minute and second for the datetime type. The format should start at the position of the first partition key and run to the end of the file path. For example, given the path '../Accounts/2019/01/01/data.parquet', where the partition is by department name and time, partition_format='/{Department}/{PartitionDate:yyyy/MM/dd}/data.parquet' creates a string column 'Department' with the value 'Accounts' and a datetime column 'PartitionDate' with the value '2019-01-01'. See the sketch after this table. |
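A minimal sketch of passing an explicit destination and partition_format. The output name, the datastore (workspaceblobstore is the workspace default blob datastore), the template path, and the column names are illustrative assumptions:

```python
from azureml.core import Datastore, Workspace
from azureml.data import OutputFileDatasetConfig

workspace = Workspace.from_config()
datastore = Datastore(workspace, 'workspaceblobstore')

# Copy the output to a template path on the datastore; {run-id} and {output-name}
# are resolved when the run executes.
output = OutputFileDatasetConfig(
    name='predictions',
    destination=(datastore, 'outputs/{run-id}/{output-name}'),
    # Illustrative partition format: extracts a string column 'Department' and a
    # datetime column 'PartitionDate' from paths such as /Accounts/2019/01/01/data.parquet.
    partition_format='/{Department}/{PartitionDate:yyyy/MM/dd}/data.parquet')
```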
Remarks
You can pass the OutputFileDatasetConfig as an argument to your run, and it will be automatically translated into a local path on the compute (see the sketch below). The source argument will be used if one is specified; otherwise we will automatically generate a directory in the OS's temp folder. The files and folders inside the source directory will then be copied to the destination based on the output configuration.
By default, the output is copied to the destination storage in mount mode. For more information about mount mode, see the documentation for as_mount.
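A minimal sketch of what the corresponding train.py might do with the resolved path. The argument position matches the arguments list in the examples above; the file name written is an illustrative assumption:

```python
# train.py -- minimal sketch of consuming the resolved output path.
import os
import sys

# The OutputFileDatasetConfig passed via `arguments=[output]` arrives as a local folder path.
output_dir = sys.argv[1]
os.makedirs(output_dir, exist_ok=True)

# Anything written under this folder is copied to the configured destination
# (mount mode by default, so files are uploaded as they are closed).
with open(os.path.join(output_dir, 'predictions.csv'), 'w') as f:
    f.write('id,prediction\n')
```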
Methods

Name | Description |
---|---|
as_input | Specify how to consume the output as an input in subsequent pipeline steps. |
as_mount | Set the mode of the output to mount. For mount mode, the output directory will be a FUSE mounted directory. Files written to the mounted directory will be uploaded when the file is closed. |
as_upload | Set the mode of the output to upload. For upload mode, files written to the output directory will be uploaded at the end of the job. If the job fails or gets canceled, the output directory will not be uploaded. |
as_input

Specify how to consume the output as an input in subsequent pipeline steps.

```python
as_input(name=None)
```

Parameters

Name | Description |
---|---|
name | The name of the input specific to the run. |

Returns

Type | Description |
---|---|
DatasetConsumptionConfig | A DatasetConsumptionConfig instance describing how to deliver the input data. |
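A minimal sketch of chaining the output of one pipeline step into the next step as an input. The step scripts, compute target name, and input name are illustrative assumptions:

```python
from azureml.core import Workspace
from azureml.data import OutputFileDatasetConfig
from azureml.pipeline.core import Pipeline
from azureml.pipeline.steps import PythonScriptStep

workspace = Workspace.from_config()
output = OutputFileDatasetConfig(name='processed_data')

# The first step writes into the output folder.
prep_step = PythonScriptStep(
    script_name='prep.py',
    arguments=[output],
    compute_target='cpu-cluster',
    source_directory='.')

# The second step consumes the same data as an input named 'prepped'.
train_step = PythonScriptStep(
    script_name='train.py',
    arguments=[output.as_input(name='prepped')],
    compute_target='cpu-cluster',
    source_directory='.')

pipeline = Pipeline(workspace, steps=[prep_step, train_step])
```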
as_mount

Set the mode of the output to mount.

For mount mode, the output directory will be a FUSE mounted directory. Files written to the mounted directory will be uploaded when the file is closed.

```python
as_mount(disable_metadata_cache=False)
```

Parameters

Name | Description |
---|---|
disable_metadata_cache | Whether to cache metadata on the local node. If disabled, a node will not be able to see files generated from other nodes while the job is running. |

Returns

Type | Description |
---|---|
OutputFileDatasetConfig | An OutputFileDatasetConfig instance with mode set to mount. |
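A minimal sketch of setting mount mode explicitly (it is already the default); the output name and script are illustrative assumptions:

```python
from azureml.core import ScriptRunConfig
from azureml.data import OutputFileDatasetConfig

# Files written under the mounted output folder are uploaded as they are closed.
output = OutputFileDatasetConfig(name='mounted_output').as_mount()

script_run_config = ScriptRunConfig('.', 'train.py', arguments=[output])
```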
as_upload

Set the mode of the output to upload.

For upload mode, files written to the output directory will be uploaded at the end of the job. If the job fails or gets canceled, the output directory will not be uploaded.

```python
as_upload(overwrite=False, source_globs=None)
```

Parameters

Name | Description |
---|---|
overwrite | Whether to overwrite files that already exist in the destination. |
source_globs | Glob patterns used to filter files that will be uploaded. |

Returns

Type | Description |
---|---|
OutputFileDatasetConfig | An OutputFileDatasetConfig instance with mode set to upload. |
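A minimal sketch of switching to upload mode and filtering what is uploaded; the output name, script, and glob pattern are illustrative assumptions:

```python
from azureml.core import ScriptRunConfig
from azureml.data import OutputFileDatasetConfig

# Upload only CSV files at the end of the job, overwriting any existing files
# at the destination.
output = OutputFileDatasetConfig(name='csv_output').as_upload(
    overwrite=True, source_globs=['**/*.csv'])

script_run_config = ScriptRunConfig('.', 'train.py', arguments=[output])
```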