IndexingParametersConfiguration Class

A dictionary of indexer-specific configuration properties. Each name is the name of a specific property. Each value must be of a primitive type.
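
The primitive-type constraint can be checked locally before a configuration is built. The following is a minimal sketch, not part of the SDK: the helper name `non_primitive_keys` is hypothetical, and reading "primitive" as str, bool, int, float, or None is an assumption.

```python
from typing import Any, Dict, List

# Assumption: "primitive" is read here as str, bool, int, float, or None.
PRIMITIVE_TYPES = (str, bool, int, float, type(None))


def non_primitive_keys(config: Dict[str, Any]) -> List[str]:
    """Return the property names whose values are not of a primitive type."""
    return [
        name
        for name, value in config.items()
        if not isinstance(value, PRIMITIVE_TYPES)
    ]


# Hypothetical configuration that uses parameter names from this page.
config = {
    "parsing_mode": "delimitedText",
    "first_line_contains_headers": True,
    "query_timeout": "00:10:00",
    "excluded_file_name_extensions": [".png", ".mp4"],  # a list, not a str
}

print(non_primitive_keys(config))
```

The list value is flagged because the extension parameters are documented as comma-delimited strings rather than Python lists.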

Inheritance
azure.search.documents.indexes._generated._serialization.Model
IndexingParametersConfiguration

Constructor

IndexingParametersConfiguration(*, additional_properties: Dict[str, Any] | None = None, parsing_mode: str | _models.BlobIndexerParsingMode = 'default', excluded_file_name_extensions: str = '', indexed_file_name_extensions: str = '', fail_on_unsupported_content_type: bool = False, fail_on_unprocessable_document: bool = False, index_storage_metadata_only_for_oversized_documents: bool = False, delimited_text_headers: str | None = None, delimited_text_delimiter: str | None = None, first_line_contains_headers: bool = True, document_root: str | None = None, data_to_extract: str | _models.BlobIndexerDataToExtract = 'contentAndMetadata', image_action: str | _models.BlobIndexerImageAction = 'none', allow_skillset_to_read_file_data: bool = False, pdf_text_rotation_algorithm: str | _models.BlobIndexerPDFTextRotationAlgorithm = 'none', execution_environment: str | _models.IndexerExecutionEnvironment = 'standard', query_timeout: str = '00:05:00', **kwargs: Any)

Keyword-Only Parameters

Name Description
additional_properties

Unmatched properties from the message are deserialized to this collection.

parsing_mode

Represents the parsing mode for indexing from an Azure blob data source. Known values are: "default", "text", "delimitedText", "json", "jsonArray", and "jsonLines".

Default value: default
excluded_file_name_extensions
str

Comma-delimited list of filename extensions to ignore when processing from Azure blob storage. For example, you could exclude ".png, .mp4" to skip over those files during indexing.

indexed_file_name_extensions
str

Comma-delimited list of filename extensions to select when processing from Azure blob storage. For example, you could focus indexing on specific application files ".docx, .pptx, .msg" to specifically include those file types.

fail_on_unsupported_content_type

For Azure blobs, set to false if you want to continue indexing when an unsupported content type is encountered, and you don't know all the content types (file extensions) in advance.

fail_on_unprocessable_document

For Azure blobs, set to false if you want to continue indexing if a document fails indexing.

index_storage_metadata_only_for_oversized_documents

For Azure blobs, set this property to true to still index storage metadata for blob content that is too large to process. Oversized blobs are treated as errors by default. For limits on blob size, see https://learn.microsoft.com/azure/search/search-limits-quotas-capacity.

delimited_text_headers
str

For CSV blobs, specifies a comma-delimited list of column headers, useful for mapping source fields to destination fields in an index.

delimited_text_delimiter
str

For CSV blobs, specifies the end-of-line single-character delimiter for CSV files where each line starts a new document (for example, "|").

first_line_contains_headers

For CSV blobs, indicates that the first (non-blank) line of each blob contains headers.

Default value: True
document_root
str

For JSON arrays, given a structured or semi-structured document, you can specify a path to the array using this property.

data_to_extract

Specifies the data to extract from Azure blob storage and tells the indexer which data to extract from image content when "imageAction" is set to a value other than "none". This applies to embedded image content in a .PDF or other application, or image files such as .jpg and .png, in Azure blobs. Known values are: "storageMetadata", "allMetadata", and "contentAndMetadata".

Default value: contentAndMetadata
image_action

Determines how to process embedded images and image files in Azure blob storage. Setting the "imageAction" configuration to any value other than "none" requires that a skillset also be attached to that indexer. Known values are: "none", "generateNormalizedImages", and "generateNormalizedImagePerPage".

Default value: none
allow_skillset_to_read_file_data

If true, creates a path /document/file_data that is an object representing the original file data downloaded from your blob data source. This allows you to pass the original file data to a custom skill for processing within the enrichment pipeline, or to the Document Extraction skill.

pdf_text_rotation_algorithm

Determines the algorithm for text extraction from PDF files in Azure blob storage. Known values are: "none" and "detectAngles".

Default value: none
execution_environment

Specifies the environment in which the indexer should execute. Known values are: "standard" and "private".

Default value: standard
query_timeout
str

Increases the timeout beyond the 5-minute default for Azure SQL database data sources, specified in the format "hh:mm:ss".

Default value: 00:05:00
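
The "hh:mm:ss" format used by query_timeout can be produced from a `datetime.timedelta`. This is a sketch only: the helper names `format_query_timeout` and `parse_query_timeout` are hypothetical and are not part of the SDK, and any upper limit the service may enforce is not covered.

```python
from datetime import timedelta


def format_query_timeout(timeout: timedelta) -> str:
    """Format a timedelta as the "hh:mm:ss" string described for query_timeout."""
    total_seconds = int(timeout.total_seconds())
    if total_seconds < 0:
        raise ValueError("timeout must not be negative")
    hours, remainder = divmod(total_seconds, 3600)
    minutes, seconds = divmod(remainder, 60)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d}"


def parse_query_timeout(value: str) -> timedelta:
    """Parse an "hh:mm:ss" string back into a timedelta."""
    hours, minutes, seconds = (int(part) for part in value.split(":"))
    return timedelta(hours=hours, minutes=minutes, seconds=seconds)


print(format_query_timeout(timedelta(minutes=10)))
```

The resulting string could then be passed as the query_timeout keyword argument.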

Variables

Name Description
additional_properties

Unmatched properties from the message are deserialized to this collection.

parsing_mode

Represents the parsing mode for indexing from an Azure blob data source. Known values are: "default", "text", "delimitedText", "json", "jsonArray", and "jsonLines".

excluded_file_name_extensions
str

Comma-delimited list of filename extensions to ignore when processing from Azure blob storage. For example, you could exclude ".png, .mp4" to skip over those files during indexing.

indexed_file_name_extensions
str

Comma-delimited list of filename extensions to select when processing from Azure blob storage. For example, you could focus indexing on specific application files ".docx, .pptx, .msg" to specifically include those file types.

fail_on_unsupported_content_type

For Azure blobs, set to false if you want to continue indexing when an unsupported content type is encountered, and you don't know all the content types (file extensions) in advance.

fail_on_unprocessable_document

For Azure blobs, set to false if you want to continue indexing if a document fails indexing.

index_storage_metadata_only_for_oversized_documents

For Azure blobs, set this property to true to still index storage metadata for blob content that is too large to process. Oversized blobs are treated as errors by default. For limits on blob size, see https://learn.microsoft.com/azure/search/search-limits-quotas-capacity.

delimited_text_headers
str

For CSV blobs, specifies a comma-delimited list of column headers, useful for mapping source fields to destination fields in an index.

delimited_text_delimiter
str

For CSV blobs, specifies the end-of-line single-character delimiter for CSV files where each line starts a new document (for example, "|").

first_line_contains_headers

For CSV blobs, indicates that the first (non-blank) line of each blob contains headers.

document_root
str

For JSON arrays, given a structured or semi-structured document, you can specify a path to the array using this property.

data_to_extract

Specifies the data to extract from Azure blob storage and tells the indexer which data to extract from image content when "imageAction" is set to a value other than "none". This applies to embedded image content in a .PDF or other application, or image files such as .jpg and .png, in Azure blobs. Known values are: "storageMetadata", "allMetadata", and "contentAndMetadata".

image_action

Determines how to process embedded images and image files in Azure blob storage. Setting the "imageAction" configuration to any value other than "none" requires that a skillset also be attached to that indexer. Known values are: "none", "generateNormalizedImages", and "generateNormalizedImagePerPage".

allow_skillset_to_read_file_data

If true, creates a path /document/file_data that is an object representing the original file data downloaded from your blob data source. This allows you to pass the original file data to a custom skill for processing within the enrichment pipeline, or to the Document Extraction skill.

pdf_text_rotation_algorithm

Determines the algorithm for text extraction from PDF files in Azure blob storage. Known values are: "none" and "detectAngles".

execution_environment

Specifies the environment in which the indexer should execute. Known values are: "standard" and "private".

query_timeout
str

Increases the timeout beyond the 5-minute default for Azure SQL database data sources, specified in the format "hh:mm:ss".
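
To illustrate how the three CSV-related attributes (delimited_text_delimiter, delimited_text_headers, and first_line_contains_headers) relate to each other, the sketch below parses delimited text locally with Python's csv module so that each line becomes one document. It is an illustration under stated assumptions, not the service's parser: the function `rows_as_documents` is hypothetical, and preferring the first line over an explicit header list when both are supplied is this sketch's own choice.

```python
import csv
import io
from typing import Dict, List, Optional


def rows_as_documents(
    blob_text: str,
    delimiter: str = ",",
    headers: Optional[str] = None,
    first_line_contains_headers: bool = True,
) -> List[Dict[str, str]]:
    """Turn each non-blank line of delimited text into a field-name -> value dict."""
    reader = csv.reader(io.StringIO(blob_text), delimiter=delimiter)
    rows = [row for row in reader if row]  # csv.reader yields [] for blank lines
    if not rows:
        return []
    if first_line_contains_headers:
        # Column names come from the first non-blank line.
        names, rows = rows[0], rows[1:]
    elif headers is not None:
        # Column names come from a comma-delimited header list.
        names = [name.strip() for name in headers.split(",")]
    else:
        raise ValueError("no column headers available")
    return [dict(zip(names, row)) for row in rows]


blob = "id|title\n1|Alpha\n2|Beta\n"
print(rows_as_documents(blob, delimiter="|"))
```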

Methods

Name Description
as_dict

Return a dict that can be serialized using json.dump.

deserialize

Parse a str using the RestAPI syntax and return a model.

enable_additional_properties_sending

from_dict

Parse a dict using the given key extractor and return a model.

is_xml_model

serialize

Return the JSON that would be sent to the server from this model.

as_dict

Return a dict that can be serialized using json.dump.

Advanced usage can optionally pass a callback as the key_transformer parameter. The callback receives three arguments:

  • key: the attribute name used in Python.

  • attr_desc: a dict of metadata. It currently contains 'type' with the msrest type and 'key' with the RestAPI encoded key.

  • value: the current value in this object.

The string returned by the callback is used to serialize the key. If the return type is a list, the result is treated as a hierarchical dict.

See the three examples in this file:

  • attribute_transformer

  • full_restapi_key_transformer

  • last_restapi_key_transformer

For XML serialization, pass the keyword argument is_xml=True.

as_dict(keep_readonly: bool = True, key_transformer: ~typing.Callable[[str, ~typing.Dict[str, ~typing.Any], ~typing.Any], ~typing.Any] = <function attribute_transformer>, **kwargs: ~typing.Any) -> MutableMapping[str, Any]

Parameters

Name Description
key_transformer
function

A key transformer function.

keep_readonly

Whether to serialize the readonly attributes.

Default value: True

Returns

Type Description

A dict JSON compatible object
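
The callback shape described for key_transformer (attribute name, metadata dict, current value) can be sketched without the SDK. Everything below is illustrative: `ATTRIBUTE_METADATA` is a hypothetical stand-in for the model's per-attribute metadata, the REST key names in it are assumed, and `to_dict` only mimics the idea of applying a transformer to each key. The list-returning (hierarchical) case is not handled.

```python
from typing import Any, Callable, Dict

# Same shape as the key_transformer annotation: (key, attr_desc, value) -> new key.
KeyTransformer = Callable[[str, Dict[str, Any], Any], Any]

# Hypothetical stand-in for the metadata passed as attr_desc.
# The REST key names are assumptions made for this illustration.
ATTRIBUTE_METADATA: Dict[str, Dict[str, Any]] = {
    "parsing_mode": {"key": "parsingMode", "type": "str"},
    "query_timeout": {"key": "queryTimeout", "type": "str"},
}


def python_name_transformer(key: str, attr_desc: Dict[str, Any], value: Any) -> str:
    """Keep the attribute name used in Python."""
    return key


def rest_name_transformer(key: str, attr_desc: Dict[str, Any], value: Any) -> str:
    """Use the RestAPI encoded key found in the metadata."""
    return attr_desc["key"]


def to_dict(
    values: Dict[str, Any],
    key_transformer: KeyTransformer = python_name_transformer,
) -> Dict[str, Any]:
    """Apply the transformer to every attribute name and keep the values unchanged."""
    return {
        key_transformer(name, ATTRIBUTE_METADATA[name], value): value
        for name, value in values.items()
    }


values = {"parsing_mode": "json", "query_timeout": "00:10:00"}
print(to_dict(values))
print(to_dict(values, rest_name_transformer))
```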

deserialize

Parse a str using the RestAPI syntax and return a model.

deserialize(data: Any, content_type: str | None = None) -> ModelType

Parameters

Name Description
data
Required
str

A str using RestAPI structure. JSON by default.

content_type
str

JSON by default, set application/xml if XML.

Default value: None

Returns

Type Description

An instance of this model

Exceptions

Type Description
DeserializationError

Raised if something went wrong.
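
The role of content_type (JSON by default, application/xml for XML) can be pictured as selecting a parser for the incoming string. The sketch below uses only the standard library and is not the library's implementation: `parse_payload`, the flat XML layout, and the sample property name are all assumptions made for illustration.

```python
import json
import xml.etree.ElementTree as ET
from typing import Any, Dict, Optional


def parse_payload(data: str, content_type: Optional[str] = None) -> Dict[str, Any]:
    """Parse a payload string as JSON unless content_type asks for XML."""
    if content_type == "application/xml":
        # Assumption: a flat document whose child elements hold the properties.
        root = ET.fromstring(data)
        return {child.tag: child.text for child in root}
    return json.loads(data)


print(parse_payload('{"parsingMode": "json"}'))
print(
    parse_payload(
        "<configuration><parsingMode>json</parsingMode></configuration>",
        content_type="application/xml",
    )
)
```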

enable_additional_properties_sending

enable_additional_properties_sending() -> None

from_dict

Parse a dict using the given key extractor and return a model.

By default, the following key extractors are considered: rest_key_case_insensitive_extractor, attribute_key_case_insensitive_extractor, and last_rest_key_case_insensitive_extractor.

from_dict(data: Any, key_extractors: Callable[[str, Dict[str, Any], Any], Any] | None = None, content_type: str | None = None) -> ModelType

Parameters

Name Description
data
Required

A dict using RestAPI structure

content_type
str

JSON by default, set application/xml if XML.

Default value: None
key_extractors
Default value: None

Returns

Type Description

An instance of this model

Exceptions

Type Description
DeserializationError

Raised if something went wrong.
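
The idea of case-insensitive key extraction can be sketched as follows. These functions are hypothetical stand-ins that follow the (attribute name, metadata, data) callable shape from the signature above. They are not the SDK's extractors, and trying extractors in order until one returns a value is this sketch's assumption about how several extractors might be combined.

```python
from typing import Any, Callable, Dict, Iterable

KeyExtractor = Callable[[str, Dict[str, Any], Any], Any]


def rest_key_ci(attr: str, attr_desc: Dict[str, Any], data: Dict[str, Any]) -> Any:
    """Look up the RestAPI key from the metadata, ignoring case."""
    wanted = attr_desc["key"].lower()
    for key, value in data.items():
        if key.lower() == wanted:
            return value
    return None


def attribute_key_ci(attr: str, attr_desc: Dict[str, Any], data: Dict[str, Any]) -> Any:
    """Look up the Python attribute name, ignoring case."""
    wanted = attr.lower()
    for key, value in data.items():
        if key.lower() == wanted:
            return value
    return None


def extract(
    attr: str,
    attr_desc: Dict[str, Any],
    data: Dict[str, Any],
    extractors: Iterable[KeyExtractor],
) -> Any:
    """Return the first non-None result among the extractors (sketch's choice)."""
    for extractor in extractors:
        found = extractor(attr, attr_desc, data)
        if found is not None:
            return found
    return None


desc = {"key": "parsingMode", "type": "str"}  # assumed metadata for illustration
print(extract("parsing_mode", desc, {"PARSINGMODE": "json"}, [rest_key_ci, attribute_key_ci]))
```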

is_xml_model

is_xml_model() -> bool

serialize

Return the JSON that would be sent to the server from this model.

This is an alias to as_dict(full_restapi_key_transformer, keep_readonly=False).

For XML serialization, pass the keyword argument is_xml=True.

serialize(keep_readonly: bool = False, **kwargs: Any) -> MutableMapping[str, Any]

Parameters

Name Description
keep_readonly

Whether to serialize the readonly attributes.

Default value: False

Returns

Type Description

A dict JSON compatible object