Condividi tramite


Scaling Out with Analysis Services

Scaling Out with Analysis Services

 

 

A few months ago, I had the opportunity to work with a customer who had been intermittently encountering a variety of errors when attempting to connect to an Analysis Services server. The production environment consisted of a dedicated processing server and a dedicated query server. Both of the servers were configured with eight (8) logical cores and 64GB of RAM and were running SQL Server 2008 R2 Analysis Server.

Owing to a business requirement that the data be regularly refreshed at 15 minute intervals, the processing server was used to execute a full process of a single database every 15 minutes. Since the query server was not used for processing, the database was being refreshed using the Synchronization feature that is native to Analysis Services. All-in-all, this is a recognized practice for scaling out Analysis Services that was described by Denny Lee and Nicholas Dritsas in 2007 in their whitepaper entitled SQL Server Best Practices Article.

Shortly after the scale-out solution was implemented, the customer began experiencing periodic issues with connectivity. Initially, the connectivity issues were very sporadic and while there was some concern the issues weren’t considered to be critical. Unfortunately, the issues began to occur with increasing frequency over a period of several months and eventually reached a point that upper management considered it critical that the issues be addressed. At this point, a Profiler Trace on the Analysis Services service was captured and the analysis yielded some interesting (if not somewhat confusing) results. The particular trace was captured over a period of 127 minutes so data covering multiple synchronization cycles and a total of 7539 MDX queries was collected. That trace included information on 291 connectivity errors over the span of 127 minutes. The connectivity failures included messages similar to the following:

  • A connection cannot be made. Ensure that the server is running

  • A connection attempt failed because the connected party did not properly respond after a period of time.

  • Either the user, **\********, does not have access to the ****** database, or the database does not exist….

  • Server:  The operation was cancelled because of locking conflicts. The 'Employee' attribute in the 'Security' dimension has a generated dimension security expression that is not valid. Query Text:       SELECT {} on 0,      [Reporting Date].[Year - Quarter - Month - Week - Date].[Date] on 1      from      (      SELECT STRTOSET(@Week) on 0      FROM      (      SELECT STRTOSET(@QuarterMonth) on 0      FROM      (      SELECT STRTOSET (@Year) on 0      FROM [cube]      )      )      )    

  • Profiler traces indicating that users were attempting to connect using the NT AUTHORITY\ANONYMOUS LOGON account

Given the variety of errors, a number of theories rapidly evolved regarding the root cause or causes. Those theories included network issues, the expression used to implement dimension security, locking conflicts, and server load. Since there were a number of competing theories as to the root cause, it was decided to investigate each systematically.

 Setting up the Test Environment:

In order to test out the various theories that had been advanced, it was decided to create a test environment that consisted of three servers:

  • One Windows 2008 R2 server running Analysis Services 2012 to serve as a “Source” server for synchronizations.

  • One Windows 2008 R2 server running Analysis Services 2012 to serve as a “Target” server for synchronizations.

  • One Windows 2012 Server running a custom multi-threaded application capable of executing up to 100 MDX queries in parallel.

The strategy, at this point, was very simply to execute an Analysis Services Synchronization while submitting a total of 5000 MDX queries via the multi-threaded application. To monitor the behavior of the target server, a Profiler trace and Performance Monitor data collector were used.

Network Issue Theory:

There were a number of SQL Server Profiler traces captured, which hinted at the possibility of network issues due to a number of permissions errors (i.e. “Either the user, NA\someuser, does not have access to the <databasename> or…”) , and several “NT AUTHORITY\ANONYMOUS LOGON” entries. Coincidentally, all of these errors were logged in the trace while a Synchronization operation was being executed.  One of the more interesting profiler traces indicated 291 permissions related errors during execution of 7539 queries in the span of 127 minutes while a Synchronization transaction was running. Unfortunately, for one reason or other, the customer never captured a Network Monitor or WireShark network sniff to really make a determination regarding this theory, so it couldn’t be categorically ruled out.

Security Expression Theory:

To test this theory, using an analog database with very similar characteristics, security expressions modeled on the MDX expressions used in the production database, and a library of 5000 MDX queries modeled on those captured in multiple profiler traces, the UCASE VBA function was first removed from the Dimension Security expression and executed a total of 5000 MDX queries from two client applications, with each running 100 queries in parallel. That resulted in errors due to Lock Conflicts, permissions errors (i.e. “Either the user, NA\someuser, does not have access to the <databasename> or…”) , and several “NT AUTHORITY\ANONYMOUS LOGON” entries while the Synchronization operation was being executed. To test further, the UCASE VBA function was removed from the DefaultMember expression and the same test conditions repeated, resulting in errors due to Locking Conflicts, permissions errors, and the appearance of “NT AUTHORITY\ANONYMOUS LOGON” entries in the profiler trace. Finally, the security expression was completely removed from the secured dimension and the test was repeated which still resulted in errors.

(Note: At this point, I’m definitely seeing a pattern here that involves queries executing while the Synchronization operation is executing.)

Server Load Theory:

In one of the profiler traces, it had been noted that there were a total of 530 concurrent connections to the server. Since the server had eight (8) cores, and running Analysis Services 2012, the maximum number of threads in the query thread pool would have been throttled at 16, so queuing in the query thread pool was not at all unexpected and raised questions about load. To test that theory, 5000 MDX queries were submitted to the test query server via 200 concurrent connections (two instances of the multi-threaded application) while a Synchronization operation was executing. With that load, the entire range of connectivity errors being reported could be readily reproduced. It was noted that, even under peak load, there was nothing in the PerfMon counter traces to indicate that there were issues with disk performance, given that all of the Logical and Physical disk counters normally considered for performance (i.e. Avg. Disk sec/Read, Avg. Disk sec/Write, and Avg. Disk sec/Transfer) were all under a value of 0.10. Gradually the number of concurrent connections was reduced to 16, then 14, then 8, then 4, and finally to 1, with synchronization jobs running periodically during the test runs. An analysis of those data yielded evidence that there was a very definite relationship between queuing in both the Query Thread pool and the Short Parsing Thread pool with the connectivity errors. By increasing the number of concurrent connections and queries executed in parallel the connectivity errors could be readily reproduced.

What was actually happening:

Armed with some pretty clean data from a controlled environment and deep knowledge of how Synchronization works in Analysis Services, it was reasonably clear that the connectivity issues were directly related to the synchronization transaction. Essentially, what happens when a synchronization between servers is executed, is that the server compares the timestamps and versions of the files on the disk of the target server with timestamps and version of the files on the source server. Files on the target that are outdated are updated to the newer timestamp and version with the files from the source server, however, the older files remain on disk until the transaction is committed. Analysis Services uses a two phase commit process, so it first acquires a read commit lock in phase 1 of the commit. In phase 2 of the commit, a server level lock is acquired on the Master Version Map (master.vmp), while the ObjectID values for the new files are written to disk and the older files are deleted. If there are queries or other operations executing when the server level lock is acquired, the server will wait for the period of time specified in the ForceCommitTimeout (default is 30 seconds) and then terminate any executing queries or commands, which results in the Lock Conflict error. Until the server level lock is released and the updated Master Version Map is closed, access to objects on the server is denied, resulting in the permissions errors. If Phase 2 of the commit executes for an extended period of time, which would be the case with a reasonably large database on a server under load, the permissions errors would occur for an extended period of time because a read lock could not be obtained to allow execution of other commands and queries.

A different conceptual approach:

Armed with a much better understanding of what was actually happening, it was reasonably clear that eliminating the connectivity issue would require an approach that did not involve use of the Synchronization functionality native to Analysis Services. After reading Scale-Out Querying for Analysis Services with Read-Only Databases, by Denny Lee and Kay Unkroth, the idea of Database Copying seemed the obvious solution. Just one or two little problems with using either a SAN Snapshot or Robocopy. The SAN in the production environment wasn’t configured to support snapshots, so that option wasn’t viable. That left Robocopy, which, like a SAN Snapshot, would have required taking the Query server offline long enough to detach the database, copy files from source to target and then attach the refreshed database. Clearly time to think a little bit outside the box and build on the Scale-out concept.

Having a pair of load balanced query servers serving data from a single shared read-only database would provide a solution with no downtime. That reduced the problem to how can the data be refreshed on the query servers given the limitations of the production environment. Since iSCSI can be used to implement Pooled Storage, it is possible to attach the same LUN to multiple machines. Extending that concept, why not use iSCSI to attach a pair of LUNs to three machines (one dedicated to processing and two dedicated to query execution). On the processing server, keep one of the two LUNs Online in read-write mode and on the two query servers keep one of the two LUNs Online in read-only mode. After a processing cycle,

  • detach the freshly processed database from the processing server

  • mark the LUN with the freshly processed data as Offline

  • stop query server 1

  • allow the existing connections to drain off

  • detach the read-only database

  • mark the LUN containing stale data as Offline

  • mark the LUN with the newly refreshed data as Online and read-only

  • attach the freshly processed database in read-only mode

  • bring query server 1 back online

  • stop query server 2

  • allow the existing connections to drain off

  • detach the read-only database

  • mark the LUN containing stale data as Offline

  • mark the LUN with the newly refreshed data as Online and read-only

  • attach the freshly processed database in read-only mode

  • bring query server 2 back online

  • on the processing server mark the LUN with the stale data as Online and set it to read-write

  • attach the database in read-write mode

Setting up the Proof of Concept Environment:

In order to test out the various theories that had been advanced, it was decided to create a test environment that consisted of a total of five servers:

  • 1 Domain Controller

  • 1 Analysis Services server for dedicated processing

  • 1 Windows 2008 R2 Server running Network Load Balancing (NLB)

  • 2 Analysis Services 2012 servers that were configured as part of the NLB cluster.

iSCSI was used to mount the same two LUNs on the three Analysis Services servers. On the processing server, one LUN was marked as “Online” and “ReadWrite” with the other marked “Offline”. On the two Query servers, the LUN that was “Offline” on the processing server as marked as “Online” and “ReadOnly” while the LUN that was “Online” on the processing server was marked “Offline”.

A new NLB cluster was created and assigned a static IP address. The two query servers were then added to the NLB cluster. After that, an “NLB” DNS entry was created on the DC referencing the IP address assigned to the NLB cluster. Service Principal Names (SPNs) were also created for the NLB cluster as well as the two Analysis Services servers that were serving as query servers.

The next major task involved creation of code to handle detaching the database on the processing server as well as the query servers. To accomplish that, the following C# code was written to provide a simple command line application, which could be used on all three of the servers in the POC.

using System;

using System.Collections.Generic;

using System.Linq;

using System.Text;

using System.Threading.Tasks;

using Microsoft.AnalysisServices;

 

namespace Detach

{

    class Program

    {

        static void Main(string[] args)

        {

            Server asServer = new Server();

            asServer.Connect("Localhost");

            Database asDB = asServer.Databases.FindByName("ChangingDims");

            try

            {

                asDB.Detach();

            }

            catch (Exception goof)

            {

                Console.WriteLine("Database is not online");

            }

            finally

            {

                asServer.Disconnect();

                asServer.Dispose();

            }

            Console.WriteLine("Take Drive S: offline then bring the other LUN online");

            Console.ReadLine();

        }

    }

}

 

One of the goals of the POC was to eliminate connectivity issues and simply taking one of the two query servers offline would have resulted in a number of connectivity incidents (and probably a relatively large number of calls to the DBA team because users were suddenly disconnected). This was primary rationale for implementing a Network Load Balancing (NLB) cluster, since using the DrainStop command would allow one server to be taken offline without interrupting existing connections while the other server in the cluster handled any new connection requests. In a production environment, it would have been highly desirable to automate the process using either PowerShell (see Symon Perriman’s Blog post) or C# code, however, for the POC the Network Load Balancing Manager interface on the NLB cluster server was used. This entailed selecting the host from the cluster then implementing a “DrainStop”.

A second small C# application would be used on the Query Servers to monitor the number of connections that were actively executing queries after a server had been placed in a “DrainStop” status. Once the active connections had been allowed to “bleed off”, the application would then detach the database. In order to accomplish that, it was necessary to periodically check the number of connections and set a low limit threshold for the number of connections since the Monitor Application and Profiler each had one connection to the server while the Stress Application had two open connections to the server. It probably should be noted that Profiler traces were being captured on both of the query servers using local instances of Profiler. That code is shown below:

using System;

using System.Collections.Generic;

using System.Linq;

using System.Text;

using System.Data;

using Microsoft.AnalysisServices;

using Microsoft.AnalysisServices.AdomdClient;

using System.Timers;

 

namespace Query_Detach

{

    class Program

    {

        static void Main(string[] args)

        {

            int tstart = (int)DateTime.Now.TimeOfDay.TotalSeconds;

            int tcurrent;

            AdomdConnection AdomdConn = new AdomdConnection();

            AdomdConn.ConnectionString = "Data Source=localhost";

            AdomdConn.Open();

            DataSet ds = new DataSet();

            while (true)

            {

                tcurrent = (int)DateTime.Now.TimeOfDay.TotalSeconds;

                if ((tcurrent-30)>tstart)

                {

                    Console.WriteLine(string.Format("Kicked at {0}", DateTime.Now.Second.ToString()));

                    Console.WriteLine(tcurrent.ToString() + "     :     " + tstart.ToString());

                    tstart=tcurrent;

                    ds.Clear();

                    ds=AdomdConn.GetSchemaDataSet(AdomdSchemaGuid.Connections, null);

                    Console.WriteLine(ds.Tables[0].Rows.Count.ToString());

                    /* The value 4 was used below to account for the following connections:

                     * FlightRecorder

                     * Profiler Trace

                     * Two connections for the Stress application

                     * The Query Monitor and Database Detach application

                    */

                    if (ds.Tables[0].Rows.Count <= 4)

                    {

                        break;

                    }

                    else

                    {

                        Console.WriteLine(string.Format("{0} connections are currently active",ds.Tables[0].Rows.Count.ToString()));

                    }

                }

             }

            AdomdConn.Close();

            AdomdConn.Dispose();

            Server asServer = new Server();

            asServer.Connect("localhost");

            Database asDB = asServer.Databases.FindByName("ChangingDims");

            try

            {

                asDB.Detach();

            }

            catch (AmoException goof)

            {

                Console.WriteLine("Database is not online");

            }

            finally

            {

                asServer.Disconnect();

                asServer.Dispose();

            }

            Console.WriteLine("Take Drive S: offline then bring the other LUN online");

            Console.ReadLine();

        }

    }

}

 

When the application executed, the screen displayed the following message:

 Take Drive S: offline then bring the other LUN online

If this was going to be a production environment, additional code would have been added to handle swapping out the drives but since this was a POC, changing the status of the two LUNs on the servers was handled using Server Manager as illustrated below. When marking a drive as “Offline”, Server Manager was opened, then Disk Management was selected and then the drive status was changed. Server Manager was also used to bring the other LUN to an “Online” state. Once the LUN containing the freshly processed database had been marked as online, it was necessary to set the LUN to ReadOnly status on the targeted query server. Again, a relatively simple C# application was created to provide a command line utility using the following code:

using System;

using System.Collections.Generic;

using System.Linq;

using System.Text;

using System.Threading.Tasks;

using System.IO;

using Microsoft.AnalysisServices;

 

namespace Attach

{

    class Program

    {

 

        static void Main(string[] args)

        {

            Helper setDrives = new Helper();

            setDrives.ExecuteCommand("diskpart -s c:\\ctrl\\setdrives.txt");

            Server asServer = new Server();

            asServer.Connect("localhost");

            try

            {

                asServer.Attach("S:\\Data\\CHANGINDDIMS.0.db", ReadWriteMode.ReadOnly);

                Console.WriteLine("Database Attached");

            }

            catch (AmoException goof)

            {

                Console.WriteLine(goof.Message.ToString());

            }

            finally

            {

                asServer.Disconnect();

                asServer.Dispose();

            }

            Console.ReadLine();

        }

 

 

    }

 

    class Helper

    {

        public void ExecuteCommand(object Command)

        {

            try

            {

                System.Diagnostics.ProcessStartInfo procStart = new System.Diagnostics.ProcessStartInfo("cmd", "/c" + Command);

                procStart.RedirectStandardOutput = true;

                procStart.UseShellExecute = false;

                procStart.CreateNoWindow = true;

                System.Diagnostics.Process proc = new System.Diagnostics.Process();

                proc.StartInfo = procStart;

                proc.Start();

                string result = proc.StandardOutput.ReadToEnd();

                System.Diagnostics.Trace.WriteLine(result.ToString());

            }

            catch (Exception goof)

            {

                System.Diagnostics.Trace.WriteLine(goof.Message.ToString());

            }

        }

 

    }

}

 

The text of the “setdrives.txt” file was:

Select volume 3

assign letter=S

attrib volume set Readonly

 

Upon successful attachment of the database in Read-Only mode, the following message was displayed on screen:

 Database Attached

At this point, the Network Load Balancing Manager interface on the NLB cluster server was used to bring the server back online by selecting the server and then the Resume option. After the process had been completed with one query server, the same process was applied to the second query server in the cluster.

Once both query servers were back online, it was time to prepare the dedicated processing server for the next processing operation, which meant bringing the LUN containing the “stale” database online in ReadWrite mode and then attaching the database in Read-Write mode. To handle the tasks of , it was decided to write a small C# command line application to assign the drive letter S to the drive containing stale data then clear the ReadOnly attribute to make the drive “ReadWrite.” After the drive had been configured, the application would then attach the stale database so that the server would be ready for the next processing cycle. That code is below.

using System;

using System.Collections.Generic;

using System.Linq;

using System.Text;

using System.Threading.Tasks;

using System.IO;

using Microsoft.AnalysisServices;

 

namespace Proc_Attach

{

    class Program

    {

        static void Main(string[] args)

        {

            Helper setDrives = new Helper();

            setDrives.ExecuteCommand("diskpart -s c:\\ctrl\\setdrives.txt");

            Server asServer = new Server();

            asServer.Connect("localhost");

            try

            {

                asServer.Attach("S:\\Data\\CHANGINDDIMS.0.db", ReadWriteMode.ReadWrite);

                Console.WriteLine("Database Attached");

            }

            catch (AmoException goof)

            {

                Console.WriteLine(goof.Message.ToString());

            }

            finally

            {

                asServer.Disconnect();

                asServer.Dispose();

            }

            Console.ReadLine();

        }

    }

 

    class Helper

    {

        public void ExecuteCommand(object Command)

        {

            try

            {

                System.Diagnostics.ProcessStartInfo procStart = new System.Diagnostics.ProcessStartInfo("cmd", "/c" + Command);

                procStart.RedirectStandardOutput = true;

                procStart.UseShellExecute = false;

                procStart.CreateNoWindow = true;

                System.Diagnostics.Process proc = new System.Diagnostics.Process();

                proc.StartInfo = procStart;

                proc.Start();

                string result = proc.StandardOutput.ReadToEnd();

                System.Diagnostics.Trace.WriteLine(result.ToString());

            }

            catch (Exception goof)

            {

                System.Diagnostics.Trace.WriteLine(goof.Message.ToString());

            }

        }

    }

}

 

The setdrives.txt file referenced in the application code contained the following lines:

Select volume 3

assign letter=S

attrib volume Clear Readonly

 

Testing it out:

After spending a bit of time to get the servers configured and write a bit of code, it was time to see if this approach would address the connectivity issues that the customer had been experiencing. Since one profiler trace included information on 291 connectivity errors during execution of 7539 MDX queries over eight (8) processing cycles, there was a basis for making some comparisons. Not being one who is particularly afraid of pushing boundaries, it was decided to run a larger number of queries using a stress tool over 6 processing cycles. Each processing cycle included the following steps:

  • Process the database

  • Detach the freshly processed database from the processing server

  • Mark the LUN with the freshly processed data as Offline

  • DrainStop query server 1

  • Allow the existing connections to drain off

  • Detach the read-only database

  • Mark the LUN containing stale data as Offline

  • Mark the LUN with the newly refreshed data as Online and read-only

  • Attach the freshly processed database in read-only mode

  • Bring query server 1 back online by Resuming from NLB

  • DrainStop query server 2

  • Allow the existing connections to drain off

  • Detach the read-only database

  • Mark the LUN containing stale data as Offline

  • Mark the LUN with the newly refreshed data as Online and read-only

  • Attach the freshly processed database in read-only mode

  • Bring query server 2 back online by Resuming from NLB

  • Mark the LUN with the stale data on the Processing Server as Online and set it to read-write

  • Attach the database in read-write mode

  • Begin the next processing cycle

     

    Using the approach outlined here, a total of 21540 MDX queries were executed across 6 processing cycles with no errors.

Comments

  • Anonymous
    August 21, 2015
    Hi John, I actually developed almost exactly this solution a few years ago (was quite strange reading above actually), but moved to using snapshots instead. This meant we could incrementally process the source Lun without taking the lun offline.  Just process it, detach, flush, snapshot and then attach again. It works quite well apart from two things.  You need to make sure that snapshots don't exceed the snapshot pool, otherwise the snapshot disappears.  I guess you don't have that problem in this case. The other one, is more of a missing command.  After the cube is processed, I found it is quite important to flush to disk.  Interestingly, I found running the flush and waiting for around 5 seconds was most successfully.  This depended on how busy the San was at the time. Otherwise, it has been in production for around two years and is pretty good otherwise. Chris

  • Anonymous
    August 22, 2015
    Hi Chris Ideally, I really wanted to use SAN Snapshots but the production SAN wasn't configured for it. John

  • Anonymous
    August 22, 2015
    Great read - thanks for sharing! Were there any concerns from the client about adding an additional query server and taking on additional licensing costs?

  • Anonymous
    August 24, 2015
    Hi Bill I know that there were some discussions related to hardware and licensing costs. Ultimately it really came down to a question of the costs of the additional infrastructure vs. the costs of support coupled with the costs of a less than satisfactory user experience coupled.

  • Anonymous
    August 26, 2015
    A couple of minor points that I neglected to mention.

  1. To accommodate stateful applications like Excel and Web apps using stateful connections, NLB was configured for Single Affinity in order to effect Sticky Sessions.
  2. When waiting for connections to bleed off after a DrainStop, a runaway query can be a big problem that makes it necessary to regularly poll the DISCOVER_SESSIONS DMV and forcibly cancel connections associated with any runaway or long running queries.