Data Factory High Availability and Disaster Recovery

Bex Starr 81

I am currently looking into the high-availability/DR options for our solution (Azure SQL Database, ADF, Databricks, Blob storage, etc). Currently, we are looking to set up an identical replica of the environment in a paired region. I have done some research on ADF recommendations for high-availability/DR. However, there is no clear documentation with recommended strategies.

For high availability:
Does Microsoft handle this automatically as part of the PaaS service, offering 99.9% uptime in the selected region without any intervention by the customer? Is there anything the customer can configure to increase this availability? Does the MS documentation give any detail on what is in place for HA? For instance, is the metadata replicated three times to different locations in the same data centre (like Blob storage)? Will the IR automatically recover from hardware failures?

For DR:
Currently, both ADF and the IR are located in the UK South region. We have 2 options for configuring the IR so that data remains in the UK (for compliance):

Create the IR in the UK South region and link the source and target linked services to the IR (both are also located in UK South for compliance purposes).
Create ADF in the UK South region and enable Managed Virtual Network with auto-resolve for the Azure IR. The IR in the Data Factory region is used (which will be UK South).

I read somewhere that we have 2 options for DR:

1. We create an ADF instance and IR in UK South (the IR can be specified to be in UK South, or we enable Managed Virtual Network so that it automatically uses the IR in UK South). Failover to the paired region occurs automatically. It can take up to 24 hours to recover and requires no work from the customer.
2. Customer-controlled (as an RTO and RPO of 24 hours is not acceptable): in the event of a regional failure, the customer provisions a new ADF in the paired region (using CI/CD pipelines with the region as a configurable parameter), either uses auto-resolve IR with Managed Virtual Network enabled or updates the IR region to the new region (using the configurable parameter), and restarts the scheduled triggers. What does 'restart triggers' mean?
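
To make "region as a configurable parameter" in option 2 concrete, here is a minimal sketch of generating the deployment parameters for the paired region. It is only an illustration under assumptions: the parameter names (`factoryName`, `location`, `integrationRuntimeLocation`) and the factory name `adf-demo` are hypothetical and would have to match whatever the real template declares, and the UK South / UK West pairing is taken from this thread.

```python
import json

# Assumption: UK South pairs with UK West, as stated in this thread.
PAIRED_REGION = {"uksouth": "ukwest", "ukwest": "uksouth"}


def deployment_parameters(base_name: str, region: str) -> dict:
    """Build an ARM-style parameters document for one region.

    The parameter names used here are hypothetical placeholders.
    """
    return {
        "contentVersion": "1.0.0.0",
        "parameters": {
            "factoryName": {"value": f"{base_name}-{region}"},
            "location": {"value": region},
            "integrationRuntimeLocation": {"value": region},
        },
    }


def dr_parameters(base_name: str, primary_region: str) -> dict:
    """Same document, but targeting the region paired with the primary."""
    return deployment_parameters(base_name, PAIRED_REGION[primary_region])


if __name__ == "__main__":
    # Parameters a CI/CD pipeline could pass when deploying to the paired region.
    print(json.dumps(dr_parameters("adf-demo", "uksouth"), indent=2))
```

The point of the sketch is that the primary and DR deployments differ only in the region value, so the same pipeline can produce either one.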

With option 1:

Does Microsoft automatically replicate the ADF instance to the paired region?
Does failover occur automatically, with no intervention required by the customer?
What about the linked services as they will be pointing to resources in UK South – do we have to update these manually or using CI/CD pipelines?
Will jobs/schedules all start automatically once DR is complete?

For option 2:

Do we need to follow option 2 only if we need to recover in a quicker timeframe?
Do we need a standby ADF instance or can this be done when the regional failure occurs?
If we have already deployed an ADF instance in the paired region (UK West), do we have to make sure that all the linked services are configured to point to the resources in the paired region?
How do we make the ADF instance operational? Is there some sort of Active/passive configuration for the ADF instance?
Once operational, will schedules start automatically or is there further intervention required by the customer?
Will the passive instance incur any costs as long as it remains passive? Costs are based on orchestration (activity runs), executions in the IR, Data Flow cluster configuration, Data Factory CRUD operations (on ADF entities: datasets, linked services, pipelines, IR config, triggers), and monitoring operations, so I assume it would have no associated costs?
If we are provisioning an ADF instance in the paired region, then what prevents Microsoft from also failing over another instance to the paired region (option 1)? How do we configure the primary ADF to use option 1 or option 2?

Only documentation I could find: https://zcusa.951200.xyz/en-us/azure/data-factory/concepts-data-redundancy

Bex Starr 81 Reputation points

2022-08-24T06:15:18.42+00:00

Also, I am thinking of recommending using auto-failover groups for the Azure SQL Server, therefore the endpoint will remain the same. Does this mean the linked services do not need updating or a different configuration for the switchover?
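
A small sketch of why the endpoint would stay the same, assuming the linked service connects through the failover group listener rather than an individual server name. The group name `fog-demo` and database `sales` are made up, and the connection string omits credentials. Whether any other linked service settings need to change on switchover is not something this sketch answers.

```python
def failover_group_listener(group_name: str, read_only: bool = False) -> str:
    """Return the listener host name for an Azure SQL failover group."""
    suffix = ".secondary.database.windows.net" if read_only else ".database.windows.net"
    return group_name + suffix


def connection_string(host: str, database: str) -> str:
    """Build an ADO.NET-style connection string (credentials left out)."""
    return f"Server=tcp:{host},1433;Initial Catalog={database};Encrypt=True;"


if __name__ == "__main__":
    # The string depends only on the group name, not on which server is primary.
    print(connection_string(failover_group_listener("fog-demo"), "sales"))
```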
MartinJaffer-MSFT 26,206 Reputation points

2022-08-24T23:55:36.66+00:00

Hello and welcome to Microsoft Q&A @Bex Starr

This is quite a comprehensive look at Disaster Recovery and High Availability. To my shame, it is more than I gave thought to. I think I see what you are getting at.

The cited document does guarantee that the definitions of your Data Factory assets are safe, but the state of execution is left ambiguous. That is, you are asking "when there is a failover, do the triggers pick up where the other left off?" and "Will they point to the appropriate resources given the other resources may also be failing over?"

Also, if I understand correctly, you are making your entire solution highly available, not just the individual resources.

In my naive vision, the backup solution has a service taking the heartbeat signal of the primary solution's Factory. Upon loss of the heartbeat signal, said service would turn on all the triggers in the backup solution's Factory.
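
The heartbeat idea above could be sketched roughly as follows. This is only an illustration, not a recommended design: `probe_primary` and `start_trigger` are hypothetical callables standing in for whatever health check and trigger-start call would actually be used, and the threshold of three consecutive missed heartbeats is an arbitrary assumption.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class FailoverWatchdog:
    """Counts missed heartbeats and starts the backup triggers exactly once."""

    probe_primary: Callable[[], bool]     # hypothetical: True if the primary Factory responds
    start_trigger: Callable[[str], None]  # hypothetical: starts one trigger in the backup Factory
    backup_triggers: List[str]
    max_missed: int = 3                   # assumed threshold of consecutive misses
    missed: int = 0
    failed_over: bool = False

    def tick(self) -> bool:
        """Run one probe; return True only on the tick that performs the failover."""
        if self.failed_over:
            return False
        try:
            healthy = self.probe_primary()
        except Exception:
            # A probe that raises is treated the same as a missed heartbeat.
            healthy = False
        self.missed = 0 if healthy else self.missed + 1
        if self.missed >= self.max_missed:
            for name in self.backup_triggers:
                self.start_trigger(name)
            self.failed_over = True
            return True
        return False
```

A scheduler would call `tick()` at a fixed interval. Requiring several consecutive misses avoids failing over on a single transient probe error, and the `failed_over` flag keeps the triggers from being started twice.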

I'll forward your question to the product group. This sounds like the kind of thing someone would love to make a video or whitepaper on. Let's see what they have to say.
Bex Starr 81 Reputation points

2022-09-01T06:28:26.567+00:00

Thanks Martin. That would be great. And yes, you are spot on regarding your summary. I essentially need to create a Runbook with this level of detail, so we can be assured that, should we need to fail over, it would all work as planned.

Once we have all the information, I do believe we would simulate the process (do a planned failover). But the first step is to understand exactly what will happen and what we need to configure.
MartinJaffer-MSFT 26,206 Reputation points

2022-09-01T20:07:31.1+00:00

Thank you for confirming you are still interested. I am bumping my internal email now.
Bex Starr 81 Reputation points

2022-09-23T07:00:09.997+00:00

Hi Martin

Do you have any feedback as of yet?
Vipin Sharma 121 Reputation points

2022-10-12T07:35:58.343+00:00

@MartinJaffer-MSFT I have the same question. Can we get your help on this?
Vivek Pandey 1 Reputation point

2022-10-13T16:46:44.723+00:00

Hi Vipin, did MS answer the above questions?
Vipin Sharma 121 Reputation points

2022-10-14T04:48:21.087+00:00

Unfortunately, no.
Senan, Arun 21 Reputation points

2022-11-21T04:08:59.573+00:00

I am also interested in the same question.

We have an entire platform with ADF, ADLS and Synapse, with HA enabled on each of the services, but we want to understand the most efficient way to fail over to another region, assuming an outage of an entire region.
Attaie, H (Hamza) 0 Reputation points

2023-02-15T13:35:11.7966667+00:00

@MartinJaffer-MSFT please reply to Bex Starr's question.
Dhiman Basu 0 Reputation points

2024-12-27T23:14:11.89+00:00

Hello MartinJaffer-MSFT,

I have the same questions as Bex Starr. Have they been answered?

Bex Starr, please let me know the details if you have already implemented the ADF DR solution.
