SharePoint Disaster Recovery vs. Active/Passive Farms
Just a quick clarification on terminology & methodologies for SharePoint “disaster recovery” (DR).
In case you didn’t know already, multiple SharePoint farms can be run sharing the same content data, which is very handy if you need near 100% uptime for your SharePoint sites & apps. If for any reason your primary farm dies, you have another farm waiting for you. Nothing in this post is a new idea; it’s mainly a quick note on terminology & why the differences exist, as there is still some confusion.
This trick is achieved by sharing certain databases between two farms, with said databases being in read/write on the “primary” side. All the users are on whichever farm has read/write access to the databases, with the other side(s) receiving any data changes & running in read-only mode.
If for any reason we need to switch farms, we put the read-only DBs in read/write mode and send everyone to the other farm.
SharePoint fully supports having service-application & content databases in read-only mode and can switch between the two modes without any intervention.
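If it helps to picture the swap, here’s a minimal sketch in Python. It’s only a model of the idea above: the `Farm` class, the farm names and the `switch_active` helper are all made up for illustration and aren’t part of any SharePoint or SQL Server API.

```python
from dataclasses import dataclass


@dataclass
class Farm:
    """Hypothetical model of one farm and its view of the shared databases."""
    name: str
    read_write: bool  # True = the shared databases are read/write on this side


def switch_active(farms: list[Farm], target: str) -> str:
    """Make `target` the only read/write farm; return the farm users should be sent to."""
    if target not in {farm.name for farm in farms}:
        raise ValueError(f"unknown farm: {target}")
    for farm in farms:
        # Exactly one side holds read/write; every other side drops to read-only.
        farm.read_write = (farm.name == target)
    return target


farms = [Farm("farm1", read_write=True), Farm("farm2", read_write=False)]
users_on = switch_active(farms, "farm2")
print(users_on, [(farm.name, farm.read_write) for farm in farms])
```

The one rule the sketch enforces is the important one: only one side is ever read/write, and that’s the side the users are sent to.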
A Quick History – Secondary Farm for DR Only
Back in the day, before AlwaysOn, SQL Server databases could be kept in sync with either mirroring or log-shipping. In the case of SharePoint DR with log-shipping at least (which was the most common DR method), the problem was that switching over the read/write SQL instance was a “one-hit” operation. That’s to say you could easily switch from the primary SQL instance to the secondary instance (should the need arise), but going back again took a lot more work.
In the event of a failure on farm 1, we’d go from farm 1 being the read/write side with all the users on it, to farm 2 taking over read/write and the users with it.
The only problem is that switching back on the SQL side can take a while before everything is online & ready to fail over again.
Anyway, because of this “one-way” failover nature, the secondary farm was strictly used for emergencies only, hence the concept of the farm being “only for disasters” and therefore the name “disaster recovery farm”.
Normally this was fine because it’s rare for critical, farm-wide problems to hit both sides at once. Not impossible, but rare, so this was still a good option to have available even if it only covered critical failures on the primary farm.
Enter: Quick-Switching Failovers with AlwaysOn
These days with SQL Server AlwaysOn, switching SQL instances is much easier than before. For one thing, when we switch primary SQL instances, switching back doesn’t require any extra work.
What does that mean? Well, it means we can now switch between the two SQL instances much more quickly than before.
And what does that mean in turn? Simple really; enter the concept of “active/passive” farms rather than the more dramatic “disaster recovery” farm model. With active/passive farms switching is much quicker; just a simple SQL failover + user redirect each time.
Now we can go from one farm being active to the other being active… and back again. And back again; no restores or DB tricks needed, just a failover.
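As a rough sketch of that round trip, the Python below just lists the steps for each switch. The assumptions are mine, not part of the original setup: that the shared databases sit in a single availability group, that the passive side is synchronized so a planned failover is possible, and that the group name and URLs (all hypothetical) look like the ones shown. `ALTER AVAILABILITY GROUP … FAILOVER` is the T-SQL statement for a planned manual failover; how you redirect users (DNS, load balancer, etc.) depends on your environment, so it’s left as a plain description.

```python
def failover_steps(availability_group: str, target_url: str) -> list[str]:
    """Ordered steps for one planned failover onto the currently passive side."""
    return [
        # T-SQL, run on the secondary replica that is about to become primary.
        f"ALTER AVAILABILITY GROUP [{availability_group}] FAILOVER;",
        # The redirect mechanism is deliberately left open; it is site-specific.
        f"Redirect users to {target_url}",
    ]


# Hypothetical names, for illustration only.
AG_NAME = "SharePointAG"
FARM_URLS = {
    "farm1": "https://farm1.example.com",
    "farm2": "https://farm2.example.com",
}

# There and back again: the steps are the same in both directions.
for target in ("farm2", "farm1"):
    print(f"--- switch to {target} ---")
    for step in failover_steps(AG_NAME, FARM_URLS[target]):
        print(step)
```

The point is that the list of steps is identical whichever direction you’re going, which is exactly what the old log-shipping approach couldn’t offer.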
What’s the Correct Name & DR Model Then?
Good question; a lot depends on your original plans for the 2nd farm. In general, the old-style “only in disasters” model is a bit rubbish: because it’s the “nuclear option”, you need an actual disaster to find out whether everything will even work. I’ve seen cases where everything “worked” on the DR site but, when push came to shove, the hardware just wasn’t ready for the same load; and of course, this only became apparent once the primary went nuclear and everyone was moved over to the DR site.
Partly to avoid unpleasant surprises, my recommendation is we forget about “nuclear option DR farms” and just get used to having two equally active SharePoint farms (as in both switching between active/passive). Now that we have an easy way of failing over there doesn’t seem to be much reason not to, and switching regularly shows that both farms actually work before an emergency forces the question.
Remind me Again – What Could Force a Failover?
Patching & updates mainly; it’s a necessary but risky business, on any platform.
Patching Windows, SQL Server(s), SharePoint installations; all of these carry risk, in that there’s always a chance that any given update will break services. From the Windows platform to .NET to SharePoint; despite our best efforts, servers are complicated beasts so very occasionally something may slip through the cracks.
This isn’t something specifically tied to Microsoft patches either; I challenge anyone to find a vendor that doesn’t have occasional patching fracases. Hint: there aren’t any.
The solution to this is simple: test patches first, and have another production system on stand-by in case something critical dies.
In other words, have no single points of failure; active/passive farms are the ultimate high-availability SharePoint solution because they’re the only architecture that achieves this.
Patching isn’t the only reason; your own code may cause issues, and it’s nice to have quick failback options. I can think of more than one big SharePoint customer that’s benefited greatly by having this capability for handling bad code rollouts.
There are all sorts of reasons why it’s worth doubling-up on farms, but it’s mostly for the reasons you can’t think of that you’ll need it the most ;)
Cheers,
Sam Betts