Data Factory High Availability and Disaster Recovery

Bex Starr 81 Reputation points
2022-08-23T18:54:17.563+00:00

I am currently looking into the high-availability/DR options for our solution (Azure SQL Database, ADF, Databricks, Blob storage, etc.). We are planning to set up an identical replica of the environment in a paired region. I have researched the ADF recommendations for high availability/DR, but I could not find clear documentation with recommended strategies.

For high availability:
Does Microsoft handle this automatically as part of the PaaS service and offer 99.9% uptime in the selected region without any intervention by the customer? Is there anything the customer can configure to increase this availability? Does the Microsoft documentation describe what is in place for HA? For instance, is the metadata replicated three times to different locations in the same data centre (like Blob storage)? Will the IR automatically recover from hardware failures?

For DR:
Currently, both ADF and the IR are located in the UK South region. We have 2 options for configuring the IR so that data remains in the UK (for compliance):

  • Create the IR in the UK South region and link the source and target linked services to it (both are also located in UK South for compliance).
  • Create ADF in the UK South region and enable Managed Virtual Network with auto-resolve for the Azure IR. The IR in the Data Factory region (which will be UK South) is then used.

I read somewhere that we have 2 options for DR:

  1. We create an ADF instance and IR in UK South (the IR can be pinned to UK South, or we enable managed virtual network so that it automatically uses the IR in UK South). Failover to the paired region occurs automatically. Recovery can take up to 24 hours. Requires no work from the customer.
  2. Customer-controlled (as an RTO and RPO of 24 hours are not acceptable): in the event of a regional failure, the customer provisions a new ADF in the paired region (using CI/CD pipelines with region as a configurable parameter), uses the auto-resolve IR with managed virtual network enabled or updates the IR region to the new region (using a configurable parameter), and restarts scheduled triggers. What does 'restart triggers' mean?
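
To make the "region as a configurable parameter" part of option 2 concrete, this is roughly what I have in mind. It is only a sketch under my own assumptions: the naming convention, the `FactoryDeployment` class, and the helper functions are hypothetical and not taken from any Microsoft guidance. It only builds the values a CI/CD pipeline could pass to a deployment; it does not call Azure.

```python
from dataclasses import dataclass

# Hypothetical short codes used in resource names (assumption, not a real deployment).
REGION_SHORT = {"uksouth": "uks", "ukwest": "ukw"}
# UK South and UK West are each other's paired region.
PAIRED_REGION = {"uksouth": "ukwest", "ukwest": "uksouth"}


@dataclass(frozen=True)
class FactoryDeployment:
    """Region-specific values a pipeline would feed into the ADF deployment."""
    region: str
    factory_name: str
    sql_server_fqdn: str        # target of the Azure SQL linked service
    storage_account_url: str    # target of the Blob storage linked service


def build_deployment(region: str, env: str = "prod") -> FactoryDeployment:
    """Derive every region-dependent value from the single region parameter."""
    short = REGION_SHORT[region]
    return FactoryDeployment(
        region=region,
        factory_name=f"adf-{env}-{short}",
        sql_server_fqdn=f"sql-{env}-{short}.database.windows.net",
        storage_account_url=f"https://st{env}{short}.blob.core.windows.net",
    )


def failover_target(primary_region: str, env: str = "prod") -> FactoryDeployment:
    """Return the deployment parameters for the paired region."""
    return build_deployment(PAIRED_REGION[primary_region], env)


primary = build_deployment("uksouth")
standby = failover_target("uksouth")
print(primary.factory_name, "->", standby.factory_name)
```

The idea is that the linked services never hard-code UK South, so the same pipeline run with a different region value would produce a factory whose linked services point at the UK West replicas.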

With option 1:

  • Does Microsoft automatically replicate the ADF instance to the paired region?
  • Does failover occur automatically, with no intervention required by the customer?
  • What about the linked services, which will still point to resources in UK South? Do we have to update these manually or through CI/CD pipelines?
  • Will all jobs/schedules start automatically once DR is complete?

For option 2:

  • Do we need to follow option 2 only if we need to recover in a quicker timeframe?
  • Do we need a standby ADF instance, or can the instance be provisioned when the regional failure occurs?
  • If we have already deployed an ADF instance in the paired region (UK West), do we have to make sure that all the linked services are configured to point to the resources in the paired region?
  • How do we make the ADF instance operational? Is there some sort of Active/passive configuration for the ADF instance?
  • Once operational, will schedules start automatically or is there further intervention required by the customer?
  • Will the passive instance incur any costs as long as it remains passive? Costs are based on orchestration (activity runs), executions in the IR, Data Flow cluster configuration, Data Factory CRUD operations (on ADF entities: datasets, linked services, pipelines, IR config, triggers), and monitoring operations, so I assume that it would have no associated costs?
  • If we are provisioning an ADF instance in the paired region, then what prevents Microsoft from also failing over another instance to the paired region (option 1)? How do we configure the primary ADF to use option 1 or option 2?

The only documentation I could find: https://zcusa.951200.xyz/en-us/azure/data-factory/concepts-data-redundancy

Azure Data Factory
An Azure service for ingesting, preparing, and transforming data at scale.