Condividi tramite


HA/DR Questions on SAN (Async/Sync) Replication

Note: The detailed analysis is based on Aysnc SAN replication. This email has a  lot of information on Aync SAN Replication and also related to Aysnc I/O. I also got a chance to ask few questions to Bob Dorr (on the lines of Aysnc I/O Writes etc to get some further info for you, remember I have discussed internals with him and this may come in very handy for you), I have looked at this question from quite a few angles and I guess consolidating all the information would give you more help in deciding the solution (Though it was a quick discussion the other day and some of the questions would have been left out, but I have tried my best to evaluate some of the questions by myself and have considered a scenario to reach to a definite solution.)

 

Here is what we discussed with Bob Dorr based on few similar issues we have faced questions upon from other PFE’s, I have changed the question a bit to get the internal SQL API calls information for you;

Discussion Thread here:

Background what we do know :

Considering that in Windows the I/O API’s allow for async requests via calls to the API such as WriteFile that will not return control to the calling code until the operation is complete. Async hands the request off to the operating system and associated drivers and returns control to the calling code.  

[[rdorr]] Correct, this is the key to how SQL does the IO so we can hand off the IO and use other resources (CPU) while the IO is being handled by the hardware.

So , when we look at SQL Server which mostly use Async I/O patterns , it exposes the pending (async) I/O requests in the sys.dm_io_pending_io_requests ,the column ‘io_pending’ provides insight into the I/O request and who is responsible for it , if the value is TRUE it indicates that ‘HasOverlappedIOCompleted’ in the Windows the I/O API returned FALSE and the operating system or driver stack has yet to complete the I/O.   

[[rdorr]] Correct and exactly why we exposed the column.   This lets us tell if it is a Non-SQL issue or if the IO has been completed and SQL is not processing it in a timely fashion.

Looking at the io_pending_ms_ticks indicates how long the I/O has been pending ,if the column reports FALSE for io_pending it indicates that the I/O still has been completed at the operating system level and SQL Server is now responsible for handling the remainder of the request.

[[rdorr]] Correct.  We snap the start time when we post the IO (right before the read or write call) and the column is materialized as the result of HasOverlappedIoCompleted.   If you look at HasOverlappedIoCompleted this is really a macro that checks the INTERNAL member of the OVERLAPPED structures for != STATUS_PENDING (0×103, 259).   So in a dump we already look at the internal status on the OVERLAPPED for pending status value.

What we do NOT know and would like detailed feedback on :

When SQL server hands of the Async I/O request to the windows I/O API , A1) What is the order of action to complete a write from the I/O API’s side in a SAN environment ?
A2) Lets consider an environment where we have delayed IO as a result of a substandard SAN or SAN configuration a) In a case where the Windows I/O API could not complete the I/O (write) due to a time out or a missing I/O request below the driver / API stack, would the Windows I/O API try to reissue the I/O (write) OR would the Windows I/O API thru some call back mechanism in the SQLOS layer notify the SQLOS of such a failure and then rely on the SQLOS to reissue the I/O write request ?

Provide as much detail as possible.

[[rdorr]] The environment does not matter at all to SQL Server.  It could be direct attached, SAN or other Io subsystem.   SQL Server calls the write routine with the OVERLAPPED structure.   This is a kernel transition which builds an IRP and puts it in motion, returning control from the API to SQL Server.  Each of our IOs are placed on a pending list.  This list is associated with the local SQLOS scheduler the task is running on.  Each time the a switch takes place on the scheduler (< 4ms) the I/O list used HasOverlappedIoCompleted to check the status.   If the IO is no longer STATUS_PENDING the completion routine registered with SQL Server is fired.   This is not a completion routine setup with Windows it is an internal function pointer associated with the IO request structure maintained by SQL Server.   The callback routine will check the sanity of the IO (error code, bytes transferred, read check page id, etc..) and then release the EX latch so the page can be used by any requestor.

Not exactly sure what you mean by timeout of the IO.  SQL Server does not allow CancelIO so the IO will stay pending until it completes or returns an error.   From a SQL Server standpoint there is no timeout only success of failure.    If you are talking about HBA level timeouts it is driver specific and the hardware vendor implements the details.   For example a dual HBA system can have a timeout value.  When the HBA can’t service the IO and the timeout is exceeded it will transfer the IO to the alternate HBA controller.   We have seen issues in the past with this pattern where the timeout is not setup properly.  It was supposed to be set to 5 seconds but instead was 45 seconds.  This meant the IO was pending for 45 secs on an HBA that had lost communications with the SAN and would not move the IO to the second controller until that timeout was expired.   The query timeouts from the application were set to 30 seconds so when we went to flush a log record for a commit the SQL Server queries were getting cancelled at 30 seconds because of the underlying hardware setup.

I am not aware of any of the Windows APIs that retry the IO and even if they do SQL Server would not be aware of it.  The OVERLAPPED structure is only updated by the IRP completion routine here.   When the IRP completes one of the last things that occurs is the kernel callback routine if fired.  This callback routine does the basic cleanup, sets the values in the OVERLAPPED structure such as Internal (status value) InternalHigh (bytes transferred) and such so that a caller to GetOverlappedResult or HasOverlappedIoCompleted can obtain the proper state.   It then checks to see if the OVERLAPPED structure contains a valid event and if so will signal it and finally if the IO request is registered with a completion port it will queue the IO to the completion port.    For disk IO SQL Server does not use the completion port.   SQL Server posts the IOs with the OVERLAPPED and an event.  Every IO on the same SQLOS scheduler used the same event.  This allows a scheduler in an idle wait state wake on the event and check the IO list right away.

With that said there are a couple of retry locations in SQL Server but not like you might expect.   When you first attempt to post the IO Read/Write file you can get the error (1450 or 1452) returned which is an out of resources error indicating that the IO could not be started (no IRP created).  In these cases the SQL Server will Sleep for 1 second and attempt to post the IO again.   In these cases the IO is not on the IO list and may not show up in the DMV because the IO is not pending.

For Reads Only if the SQL Server completion routine detects damage (failure of some sort, bad page header, checksum, …) we can retry the same IO up to 4 times before we consider it a failure.   We have found (SQL and Exchange) that if a read fails you can retry the read and sometimes it works properly.  If you do fail 4 times in a row for the same page it is usually damaged and we will log the error.   In either case SQL Server logs information in the error log about this condition.

A failed write will leave the buffer in the buffer pool, hashed with the BUF->berrno set to a value other than 0.  This will essentially, poison the page and for data integrity SQL Server will no longer allow access to the page nor will it write the page to disk.   Since a write is often offline from the query (checkpoint or lazy writer) the original query that dirtied the data is usually not aware of the situation.  However, if that write was for the log records (WAL protocol requires log records to be flushed before commit if valid) not only is the query notified of the error but the database is often marked suspect.   SQL Server has some mechanisms to detect when a log write fails and will even try to take the database offline and bring it online to rectify the situation.  If this fails the database is marked suspect.  If this succeeds SQL Server has cleared a database error condition and allowed runtime but the DBA should be looking closely at the system and their backup strategies.

With all this said there are some situations we have found with snapshot databases that result in OS error 1450 or 1452 that can’t be resolved on Windows 2003.   The error code was changed by Windows 2008 when the spare file limitation is reached and the 1450/1452 can’t be resolved so SQL will stop the retry attempts and be given a clear error message instead.

b) Taking into account the answer above, How many times would the component responsible retry/reissue the I/O (write) request before it reports a failure up to the calling components ?

[[rdorr]] Any write request results in a report to the error log.  As I stated the only failure for a write that is generally retried are (1450 and 1452) error conditions.    I went back to the code in SQL Server that handles acquiring the buffer latch (AcquireIoLatch and AcquireLatchActions) on SQL 2008 and it always checks the BUF->berrorno value as I outlined.  There are some special conditions to allow page level restore on enterprise and such but for the most part once a write fails SQL Server won’t retry the write as we can’t trust the data.   Let me give you an example.

If the database is enabled for checkpoint it enables a feature named constant page validation.   When the page is read into memory or was written and is again clean and has a checksum the LW and other actions may validate the checksum the first time the page prepared to become dirty.  If the checksum fails we know the page has been changed while in memory, incorrectly.  Perhaps bad ram a memory scribbler or such activity.   We would never want to flush this page so the BUF->berrno is set to indicate the checksum failure, preventing any further use of the page for any action.

c) and most importantly, can we influence the wait-time before an I/O request gets re-issued ? d) Finally what is the difference in SQL2000 and SQL2008 wrt IO retries, and the has the IO engine been changed significantly between these two versions, specifically wrt retries, and layer where IO gets handled.

[[rdorr]] Read retry is the only major difference.  SQL 2000 did not retry a failed read because of a page header problem for example.    SQL 2008 can build a write request that is 2x the size of SQL 2000 but that is a perf issue.  Read ahead sizes have always been controlled by SKU.

Now getting into the Business and Technical Questions to provide a Strategy to this issue for developing a DR (I personally think SAN Replication (Async) is still the best choice)

Pointers: The most important question to answer when trying to decide HA options is, what is the business requirement? Is there an acceptable amount of downtime? Does the secondary server have to be in a remote location? What is your budget? What are you trying to protect against?

  • From the Discussion we had the other day this is what I believe and have assumed to think of a good DR plan for you.
  • The standby server will be located in a remote office about 1000 miles away. Budget is not defined at the moment and not really a constraint.(That’s why we are even looking at SAN Replication) and I believe that the This DR is to protect some business critical applications (SAP DW etc) in case of total loss of primary data center or catastrophic failure of our primary SAN.
  • I also understand that you are running the primary server on a 2 node clustered server for HA (both nodes in same physical location). However, the SAN has no HA protection and hence we are trying to find a way out if this is implemented how safe are we against Database Corruption.

Pointers: If not SAN replication, you would wish to use either Log-Shipping, Database Mirroring, Replication but then ultimately you have to find some balance. You don’t necessarily need a one size fits all solution (just a thought). You SLA, Uptime for some dbs might need mirroring, others might include log shipping. Though personally I wouldn’t hesitate to mix them up, and it’s not a much more complicated environment to have both. Log shipping is dirt simple in terms of how it works.

  • It would be preferable to have one solution to implement. Mixing different DR solutions will be a huge headache during a live failover event. Am I right people? And hence we are looking for SAN replication?

Pointers: The SAN replication has a 20% – 60% IO performance hit.  (considering this to be the situation), can you afford to take on performance hit while doing the SAN Replication, Log-Shipping on the other hand can be a handy option as it has network overhead but then it gives a better chance to recover in a DR Situation if the SAN is not configured for point in time recovery,

NOTE though, if the primary disk gets corrupted, or a database deleted, what stops that getting replicated to the secondary disks? I would think of this as a counter question on what we are trying to achieve.

  • With SAN replication I assume we get a Zero performance impact but some overhead in terms of manageability from Windows and Storage perspective might pitch in.
  • For corruption problems, with some of the SAN Replication you can roll back to timemarks until you get to the point before the corruption. (check with your SAN vendor they can tweak this setting for you)

 

 
Factors which influence the HADR planning      
Cost: (cheapest to most expensive) LogShipping Database Mirroring SAN Replication
Performance: (slowest to fastest) LogShipping Database Mirroring SAN Replication
Failover/Failback times: (slowest to fastest) LogShipping Database Mirroring SAN Replication
Administrative complexity: (most to least) LogShipping Database Mirroring SAN Replication
Data loss exposure: (most to least) LogShipping SAN Replication

Database Mirroring

(Running in High Safety Mode- Synchronous Replication

SQL Server 2012- Please read my Blog on HADR ON -NA-    

 

Now coming to the actual question, this is what we can derive from the above discussion

Question; Is it possible that in event of DR scenario that the database itself will become corrupt when the DR LUNs are presented to the DR cluster, after ‘disaster’ ?

Answer;

Yes, the chances are there for a database to go corrupt  in an Async (or Sync or any) SAN Replication scenario if and only if the underlying data page in .mdf file has gone corrupt; in other words If any mdf page is corrupt due to disk issues and we want to do recovery for that page, we would fail and that would get caught while running DBCC CHECKDB. Though our t-log, 3 phases of recovery and WAL protocol comes into picture during such scenario and help with data alignment and usage, but then corruption can occur if the data which was written to disk had a torn page, corrupt data.

  • Though, in DR if some data couldn’t make it. The T-log recovery will try and ensure a consistent state. Now if some writes to t log is lost- This is the “some data loss” which is mentioned in the article.
  • Apart from this since during SAN replication data is replicated by taking consistent incremental snapshots. There should not be a case where only a part of transaction is written and replicated to DR SAN (re-confirm this with the actual SAN vendors though we are sure about this), secondly for corruption problems, with SAN Replication you can roll back to timemarks until you get to the point before the corruption. (This is based on the SAN you are using and some SAN do provide high fault tolerance and corrupt data recovery)