Index data from external data sources using Azure Data Factory
Adding external data that doesn't reside in Azure is a common need in an organization's search solution. Azure AI Search is flexible here: it offers several ways to create indexes and push data into them.
Push data into a search index using Azure Data Factory (ADF)
A first approach is a zero-code option for pushing data into an index using ADF. ADF comes with connectors to nearly 100 different data stores, and generic connectors such as HTTP and REST let you connect to an effectively unlimited number of others. These data stores are used as a source or a target (called a sink in the copy activity) in pipelines.
The Azure AI Search index connector can be used as a sink in a copy activity.
Create an ADF pipeline to push data into a search index
The steps you need to take to use an ADF pipeline to push data into a search index are:
- Create an Azure AI Search index with all the fields you want to store data in.
- Create a pipeline with a copy data step.
- Create a data source connection to where your data resides.
- Create a sink to connect to your search index.
- Map the fields from your source data to your search index.
- Run the pipeline to push the data into the index.
For example, imagine you have customer data in JSON format that is hosted externally. You want to copy these customers into a search index. The JSON is in this format:
{
    "_id": "5fed1b38309495de1bc4f653",
    "firstName": "Sims",
    "lastName": "Arnold",
    "isAlive": false,
    "age": 35,
    "address": {
        "streetAddress": "Sumner Place",
        "city": "Canoochee",
        "state": "Palau",
        "postalCode": 1558
    },
    "phoneNumbers": [
        {
            "type": "home",
            "number": "+1 (830) 465-2965"
        },
        {
            "type": "home",
            "number": "+1 (889) 439-3632"
        }
    ]
}
Create a search index
Create an Azure AI Search service and an index to store this information in. If you've completed the Create an Azure AI Search solution module, then you've seen how to do this. Follow the steps to create the search service, but stop at the point of importing data, because pushing data into an index doesn't require an indexer or a skillset.
Create an index and add these fields and properties:
At the moment you have to create the index first, as ADF can't create indexes.
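The field list for the index isn't reproduced here, so the following Python is only a rough sketch of what a flat index definition for the customer document might look like, in the general shape of an Azure AI Search index definition (a name plus a list of fields with Edm types). The index name `customers-index`, every field name, and the `build_index_definition` helper are hypothetical placeholders, not the definition used in this module.

```python
import json

# Hypothetical field list: names and types are placeholders for illustration,
# not the field table from this module.
HYPOTHETICAL_FIELDS = [
    ("id", "Edm.String"),
    ("first_name", "Edm.String"),
    ("last_name", "Edm.String"),
    ("is_alive", "Edm.Boolean"),
    ("age", "Edm.Int32"),
    ("city", "Edm.String"),
    ("phone_number", "Edm.String"),
]


def build_index_definition(index_name, fields, key_field):
    """Return a dict shaped like a search index definition with flat fields."""
    field_types = dict(fields)
    # The document key has to be a string field.
    if field_types.get(key_field) != "Edm.String":
        raise ValueError("the key field must exist and be of type Edm.String")
    return {
        "name": index_name,
        "fields": [
            {
                "name": name,
                "type": edm_type,
                "key": name == key_field,
                # Only string fields are marked for full-text search here.
                "searchable": edm_type == "Edm.String",
            }
            for name, edm_type in fields
        ],
    }


index_definition = build_index_definition(
    "customers-index", HYPOTHETICAL_FIELDS, "id"
)
print(json.dumps(index_definition, indent=2))
```

The sketch only builds and prints the definition; it makes no request to a search service. Every field is a simple scalar type because the copy activity sink writes flat values.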
Create a pipeline using the ADF Copy Data tool
Open the Azure Data Factory Studio and select your Azure subscription and data factory name.
Select Ingest.
Select Next.
Note
You can choose to schedule the pipeline if your data is changing and you need to keep your index up-to-date. For this example, you'll import the data once.
Create the source linked service
In Source type, select HTTP.
Next to Connection, select + New connection.
In the New connection pane, in Name enter dataLocation.
In Base URL, enter the location of your JSON file. In this example, enter https://raw.githubusercontent.com/Azure-Samples/azure-sql-db-import-data/main/json/user1.json.
In Authentication type, select Anonymous.
Select Create.
Select Next.
In File format, select JSON.
Select Next.
Create the target linked service
In Destination type, select Azure Search. Then select + New connection.
In the New connection pane, in Name enter search_index.
In Azure subscription, select your Azure subscription.
In Service name, select your Azure AI Search service.
Select Create.
On the Destination data store pane, in Target, select the search index you created.
Map source fields to target fields
Select Next.
If you created an index with field names that match the JSON attributes, ADF automatically maps each JSON attribute to the matching field in your search index.
In the above example, three fields in the JSON document need mapping to fields in the index.
Map your fields, then select Next.
On the Settings pane, in Task name, enter jsonToSearchIndex.
Select Next.
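To make the mapping step more concrete, here is a minimal Python sketch of what a source-to-sink field mapping does to the sample customer document. ADF performs this mapping itself, so this is an illustration of the idea rather than its implementation. The sink field names (`id`, `first_name`, `city`, `phone_number`, and so on), the dotted-path notation, and both helper functions are assumptions made for this example.

```python
def get_path(document, path):
    """Follow a dotted path such as 'address.city' or 'phoneNumbers.0.number'."""
    current = document
    for part in path.split("."):
        if isinstance(current, list):
            current = current[int(part)]  # numeric part selects a list element
        else:
            current = current[part]
    return current


def apply_mapping(document, mapping):
    """Build a flat row by copying each source path to its sink field name."""
    return {sink: get_path(document, source) for source, sink in mapping.items()}


# Abridged version of the sample customer document.
customer = {
    "_id": "5fed1b38309495de1bc4f653",
    "firstName": "Sims",
    "lastName": "Arnold",
    "address": {"city": "Canoochee"},
    "phoneNumbers": [
        {"type": "home", "number": "+1 (830) 465-2965"},
        {"type": "home", "number": "+1 (889) 439-3632"},
    ],
}

# Hypothetical mapping: source path in the JSON -> field name in the index.
FIELD_MAPPING = {
    "_id": "id",
    "firstName": "first_name",
    "lastName": "last_name",
    "address.city": "city",
    "phoneNumbers.0.number": "phone_number",
}

row = apply_mapping(customer, FIELD_MAPPING)
print(row)
```

Nested attributes and the first array element are pulled up into flat fields, which is the kind of row the search index sink can accept.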
Run the pipeline to push the data into the index
At this point the pipeline has been deployed and run, and the JSON document has been added to your search index. To verify, run a search in Search explorer in the Azure portal. You should see the imported JSON data.
By following these steps, you've seen how you can push data into an index. By default, the pipeline you've created merges updates into the index: if you amended the JSON data and reran the pipeline, the search index would be updated. If you want the data to be replaced each time you run your pipeline, change the write behavior to upload.
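The difference between the two write behaviors can be illustrated with a small in-memory Python sketch. It models an index as a dictionary keyed by document ID and only approximates the behavior described above; it is not how ADF or Azure AI Search implement it. The function names and the `id` key field are assumptions for this example.

```python
def merge(index, doc, key="id"):
    """Merge: update the fields present in doc, keep other stored fields."""
    existing = index.get(doc[key], {})
    index[doc[key]] = {**existing, **doc}


def upload(index, doc, key="id"):
    """Upload: replace the stored document with doc in its entirety."""
    index[doc[key]] = dict(doc)


index = {}

# First run: the document doesn't exist yet, so it is simply stored.
merge(index, {"id": "1", "firstName": "Sims", "age": 35})

# Second run with merge: only 'age' changes, 'firstName' is preserved.
merge(index, {"id": "1", "age": 36})
after_merge = dict(index["1"])

# Third run with upload: the stored document is replaced, 'firstName' is gone.
upload(index, {"id": "1", "age": 37})
after_upload = dict(index["1"])

print(after_merge)
print(after_upload)
```

In this model, merge suits incremental updates where each run may carry only some fields, while upload suits runs that always supply the full document.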
Limitations of using the built-in Azure AI Search as a linked service
At the moment, the Azure AI Search linked service as a sink only supports these data types:
Azure AI Search data type |
---|
String |
Int32 |
Int64 |
Double |
Boolean |
DateTimeOffset |
This means complex types and arrays aren't currently supported. For the JSON document above, it isn't possible to map all the phone numbers for the customer. Only the first telephone number has been mapped.
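As a rough illustration of this limitation, the Python sketch below separates the top-level attributes of the sample document into values that fit a simple scalar type and values that would need flattening before they could be mapped. The correspondence between Python types and the data types in the table is a simplification assumed for this example (dates aren't modeled), and the `split_by_support` helper is hypothetical.

```python
def split_by_support(document):
    """Split attributes into scalar values and nested values (objects, arrays)."""
    supported, unsupported = {}, {}
    for name, value in document.items():
        # bool, int, float, and str stand in for Boolean, Int32/Int64,
        # Double, and String in this simplified check.
        if isinstance(value, (bool, int, float, str)):
            supported[name] = value
        else:
            unsupported[name] = value
    return supported, unsupported


customer = {
    "_id": "5fed1b38309495de1bc4f653",
    "firstName": "Sims",
    "lastName": "Arnold",
    "isAlive": False,
    "age": 35,
    "address": {
        "streetAddress": "Sumner Place",
        "city": "Canoochee",
        "state": "Palau",
        "postalCode": 1558,
    },
    "phoneNumbers": [
        {"type": "home", "number": "+1 (830) 465-2965"},
        {"type": "home", "number": "+1 (889) 439-3632"},
    ],
}

supported, unsupported = split_by_support(customer)
print("Maps directly:", sorted(supported))
print("Needs flattening:", sorted(unsupported))
```

In this check, `address` is an object and `phoneNumbers` is an array, so both would have to be flattened into individual scalar fields (for example, a single city or a single phone number) before being written to the index.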