Monitoring Operations Manager Agent Health

I’m working on clearing out some old items on my “to blog” list. This one happens to be a management pack I wrote back in 2010. Operations Manager uses heartbeats to determine whether an agent is working. There are some situations, however, where an agent continues to heartbeat but its other workflows fail to run. Any Operations Manager administrator who maintains a large environment has probably seen this at some point. Usually another administrator or the helpdesk notices a problem, then comes to you and asks why no alert was generated. You go to the console and, sure enough, everything shows healthy, but you know things aren’t working well. If you look at performance data for that agent, you will likely see that none has been collected recently. The event logs usually look fairly sparse, with little information logged. Restarting the health service on the agent usually resolves the issue.

While the resolution is simple, we need a way to detect that an agent is in a bad state using a method that goes beyond agent heartbeats. The Agent Health Management Pack (attached) does this. It contains two monitors and a rule to detect when this occurs. The rule runs on the agent, but the two monitors run on the Root Management Server (RMS) or the All Management Servers Resource Pool (AMSRP). The monitors cook down as long as the parameters are identical on each (apart from PropertyBagElementIdentifier). They target the health service watcher class and query the data warehouse 4 times per day (every 21600 seconds). If either the 6022 event (if enabled) or performance data doesn’t exist for an agent in the data warehouse within the last 6 hours, then that agent instance, under the health service watcher class, will go red.
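The check described above can be sketched in a few lines. This is only an illustration in Python, not the VBScript the management pack actually runs; the `classify_agents` helper, the agent names, and the timestamps are all made up for the example.

```python
from datetime import datetime, timedelta


def classify_agents(all_agents, last_sample, reference_time, window_seconds=21600):
    """Map each agent name to True (healthy) or False (unhealthy).

    An agent is healthy only if its newest sample in the data warehouse
    is more recent than reference_time minus the look-back window.
    """
    cutoff = reference_time - timedelta(seconds=window_seconds)
    result = {}
    for agent in all_agents:
        last_seen = last_sample.get(agent)  # None if nothing was ever collected
        result[agent] = last_seen is not None and last_seen > cutoff
    return result


# Hypothetical sample data: three agents, one with stale data, one with none.
reference = datetime(2014, 4, 13, 12, 0, 0)
agents = ["SRV01.CONTOSO.COM", "SRV02.CONTOSO.COM", "SRV03.CONTOSO.COM"]
last_perf = {
    "SRV01.CONTOSO.COM": datetime(2014, 4, 13, 11, 45, 0),
    "SRV02.CONTOSO.COM": datetime(2014, 4, 13, 3, 0, 0),
}

health = classify_agents(agents, last_perf, reference)
for name, healthy in health.items():
    print(name, "healthy" if healthy else "unhealthy")
```

Here only SRV01 is reported healthy: SRV02 last reported before the 6-hour cutoff and SRV03 has no data at all, which is exactly the kind of agent that can keep heartbeating while looking fine in the console.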

Rule: Custom.Example.AgentHealth.Rule.CollectEvent6022

  • This rule is used to collect the 6022 events that automatically get created every 15 minutes on the agent. This rule is disabled by default.

Monitor: Custom.Example.AgentHealth.Monitor.EventCollectionCheck

  • This monitor queries the data warehouse looking for agents that haven’t collected the 6022 event recently. This monitor is disabled by default.

Monitor: Custom.Example.AgentHealth.Monitor.PerfCollectionCheck

  • This monitor queries the data warehouse looking for agents that haven’t collected performance data recently. This monitor is enabled by default.

Instructions:

  1. Open the XML of the MP and modify both monitors to use your data warehouse SQL Server and instance, if applicable. The script defaults to the OperationsManagerDW database; you’ll need to modify the script if your data warehouse database name is different.
    • <DataWarehouseConnectionString>mydatawarehouse.contoso.com\dw01</DataWarehouseConnectionString>
  2. In the XML, enable the event collection rule and monitor if you want to use them. The MP already checks the performance data, so this is not necessary, just an extra check. If you use this check, keep in mind that lots of events will be collected from all agents. If your data warehouse isn’t sized correctly for this, you could fill it up.
  3. Import the management pack
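If you would rather not hand-edit both monitors for step 1, something along these lines could do the substitution before you import. Treat it as a rough sketch: it is Python rather than anything shipped with the MP, and `set_dw_connection` and the `sql01.example.com\scomdw` value are hypothetical. It deliberately matches only the literal placeholder, so the `$Config/DataWarehouseConnectionString$` pass-through inside the monitor type is left alone.

```python
from xml.sax.saxutils import escape

# The placeholder value that ships in both unit monitors.
PLACEHOLDER = r"mydatawarehouse.contoso.com\dw01"
TAG = "DataWarehouseConnectionString"


def set_dw_connection(mp_xml, server_instance):
    """Return the MP text with the placeholder server\\instance replaced.

    Only elements holding the literal placeholder are changed; the
    $Config/...$ reference in the monitor type does not match.
    """
    old = "<{0}>{1}</{0}>".format(TAG, PLACEHOLDER)
    new = "<{0}>{1}</{0}>".format(TAG, escape(server_instance))
    return mp_xml.replace(old, new)


# A cut-down stand-in for the MP: one pass-through element and two monitors.
sample = "\n".join([
    "<DataWarehouseConnectionString>$Config/DataWarehouseConnectionString$</DataWarehouseConnectionString>",
    r"<DataWarehouseConnectionString>mydatawarehouse.contoso.com\dw01</DataWarehouseConnectionString>",
    r"<DataWarehouseConnectionString>mydatawarehouse.contoso.com\dw01</DataWarehouseConnectionString>",
])

updated = set_dw_connection(sample, r"sql01.example.com\scomdw")
print(updated)
```

In practice you would read the renamed attachment into a string, pass it through the function, and write the result back out. A plain text replacement is used instead of an XML parser so that the CDATA script block and the rest of the file stay byte-for-byte untouched.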

Health Explorer:

[Screenshot: Health Explorer for a Health Service Watcher instance]

Notice the Performance Collection Check Monitor and the Event Collection Health Check Monitor; those are the new monitors from this management pack. I also want to point out that this functionality was included in SCOM 2012. As you can see, under the Agent returning data aggregate there are an event collection health monitor and a performance data collection health monitor. I have not used these monitors, and they are disabled by default. If anyone has used them, I would be interested in their feedback.

Custom Agent Health Management Pack

<ManagementPack ContentReadable="true" xmlns:xsd="http://www.w3.org/2001/XMLSchema" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <Manifest>
    <Identity>
      <ID>Custom.Example.AgentHealth</ID>
      <Version>1.0.0.0</Version>
    </Identity>
    <Name>Custom.Example.AgentHealth</Name>
    <References>
      <Reference Alias="DW">
        <ID>Microsoft.SystemCenter.DataWarehouse.Library</ID>
        <Version>6.1.7221.0</Version>
        <PublicKeyToken>31bf3856ad364e35</PublicKeyToken>
      </Reference>
      <Reference Alias="SC">
        <ID>Microsoft.SystemCenter.Library</ID>
        <Version>6.1.7221.0</Version>
        <PublicKeyToken>31bf3856ad364e35</PublicKeyToken>
      </Reference>
      <Reference Alias="Windows">
        <ID>Microsoft.Windows.Library</ID>
        <Version>6.1.7221.0</Version>
        <PublicKeyToken>31bf3856ad364e35</PublicKeyToken>
      </Reference>
      <Reference Alias="Health">
        <ID>System.Health.Library</ID>
        <Version>6.1.7221.0</Version>
        <PublicKeyToken>31bf3856ad364e35</PublicKeyToken>
      </Reference>
      <Reference Alias="System">
        <ID>System.Library</ID>
        <Version>6.1.7221.0</Version>
        <PublicKeyToken>31bf3856ad364e35</PublicKeyToken>
      </Reference>
    </References>
  </Manifest>
  <TypeDefinitions>
    <ModuleTypes>
      <DataSourceModuleType ID="Custom.Example.AgentHealth.DataSource.QueryForDeadAgents" Accessibility="Internal" RunAs="DW!Microsoft.SystemCenter.DataWarehouse.ActionAccount" Batching="false">
        <Configuration>
          <xsd:element minOccurs="1" name="IntervalInSeconds" type="xsd:integer" />
          <xsd:element minOccurs="1" name="TimeOutInSeconds" type="xsd:integer" />
          <xsd:element minOccurs="1" name="DataWarehouseConnectionString" type="xsd:string" />
          <xsd:element minOccurs="1" name="TimeToConsiderDeadInSeconds" type="xsd:integer" />
        </Configuration>
        <ModuleImplementation Isolation="Any">
          <Composite>
            <MemberModules>
              <DataSource ID="DS" TypeID="Windows!Microsoft.Windows.TimedScript.PropertyBagProvider">
                <IntervalSeconds>$Config/IntervalInSeconds$</IntervalSeconds>
                <SyncTime />
                <ScriptName>QueryForDeadAgents.vbs</ScriptName>
                <Arguments>$Config/DataWarehouseConnectionString$ $Config/TimeToConsiderDeadInSeconds$</Arguments>
                <ScriptBody><![CDATA[
SetLocale("en-us")

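WScript.Echo "Starting Script, getting arguments"
'Expect two arguments: the SQL server\instance name and the look-back window in seconds
If WScript.Arguments.Count < 2 Then
  WScript.Echo "Expected two arguments: server\instance and look-back seconds"
  WScript.Quit 1
End If
sConnectionString = "Provider=SQLOLEDB.1;Integrated Security=SSPI;Initial Catalog=OperationsManagerDW;Data Source=" & WScript.Arguments(0) & ";Connect Timeout=15;"
iTimeSpanInSeconds = CLng(WScript.Arguments(1))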

WScript.Echo "Connecting to SQL"
Set oSQL = GetSQLConnection(sConnectionString)

WScript.Echo "Getting latest event insert date from SQL"
sSQL = "select top 1 DateTime from Event.vEvent where EventDisplayNumber = '6022' order by DateTime desc"
aLatestDateTime = QuerySQL(oSQL, sSQL)
For each sDate in aLatestDateTime
  sLatestDateTime = sDate
Next

WScript.Echo "Getting Agents from SQL"
sSQL = "select distinct DisplayName from vManagedEntity where FullName like " &_
"'Microsoft.SystemCenter.HealthServiceWatcher:Microsoft.SystemCenter.AgentWatchersGroup;%'" &_
" order by DisplayName"
aAgents = QuerySQL(oSQL, sSQL)

WScript.Echo "Converting Time Interval to DateTime format"
sDateTime = GetDateTime(iTimeSpanInSeconds, sLatestDateTime)

WScript.Echo "Getting Events from SQL"
sSQL = "select distinct me.DisplayName from Event.vEvent ev " & _
  "inner join Event.vEventRule evr on ev.EventOriginId = evr.EventOriginId " & _
  "inner join vManagedEntity me on evr.ManagedEntityRowId = me.ManagedEntityRowId " & _
  "where EventDisplayNumber = '6022' " & _
  "and DateTime > '" & sDateTime & "' order by me.DisplayName"
aEvents = QuerySQL(oSQL, sSQL)

WScript.Echo "Getting Performance Data from SQL"
sSQL = "select distinct me.DisplayName from Perf.vPerfRaw pr " & _
  "inner join vManagedEntity me on pr.ManagedEntityRowId = me.ManagedEntityRowId " & _
  "where FullName like 'Microsoft.SystemCenter.HealthService%' " & _
  "and DateTime > " & "'" & sDateTime & "'" & " order by me.DisplayName"
aPerfCounters = QuerySQL(oSQL, sSQL)

WScript.Echo "Closing SQL Connection"
Call CloseSQLConnection(oSQL)

WScript.Echo "Creating MOM API object"
Set oAPI = CreateObject("MOM.ScriptAPI")

WScript.Echo "Compiling Data..."
For each sAgent in aAgents
  Set oBag = oAPI.CreatePropertyBag()
  call oBag.AddValue("DisplayName", sAgent)
  bEventHealthy = false
  bPerfHealthy = false
  'Messages report sDateTime, the cutoff actually used by the queries above
  '(latest 6022 time minus the look-back window)
  For each sEvent in aEvents
    If sAgent = sEvent Then
      call oBag.AddValue("isEventCollectionHealthy", true)
      call oBag.AddValue("sEventMessage", "Event 6022 has been collected on this agent since " & sDateTime & " GMT")
      bEventHealthy = true
      Exit for
    End If
  Next
  For each sPerfCounter in aPerfCounters
    If sAgent = sPerfCounter Then
      call oBag.AddValue("isPerfCollectionHealthy", true)
      call oBag.AddValue("sPerfMessage", "Performance counters from the Health Service class have been collected on this agent since " & sDateTime & " GMT")
      bPerfHealthy = true
      Exit for
    End If
  Next
  If not bEventHealthy Then
    call oBag.AddValue("isEventCollectionHealthy", false)
    call oBag.AddValue("sEventMessage", "Event 6022 has not been collected on this agent since " & sDateTime & " GMT")
  End If
  If not bPerfHealthy Then
    call oBag.AddValue("isPerfCollectionHealthy", false)
    call oBag.AddValue("sPerfMessage", "Performance counters from the Health Service class have not been collected on this agent since " & sDateTime & " GMT")
  End If
  call oAPI.AddItem(oBag)
Next

WScript.Echo "Returning data to OpsMgr"
call oAPI.ReturnItems()

WScript.Echo "Exiting Script"
WScript.Quit

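Function GetDateTime(i, s)
  'Returns date s minus i seconds. Parameters are ByRef by default in
  'VBScript, so negate a copy instead of overwriting the caller's variable.
  GetDateTime = DateAdd("s", -CLng(i), s)
End Function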

Function GetSQLConnection(s)
  'On Error Resume Next
  Set oConn = CreateObject("ADODB.Connection")

  'Need to add error checking here
  oConn.Open s
  
  Set GetSQLConnection = oConn
End Function

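Function QuerySQL(o, s)
  'Runs query s on the open connection o and returns the first column of
  'each row as an array of upper-cased strings. An empty array is returned
  'when there are no rows, so callers can always For Each over the result.
  'Errors from Open are not trapped and will fail the script.
  Dim oRS, aRecords, i
  aRecords = Array()
  i = 0

  Set oRS = CreateObject("ADODB.Recordset")
  Set oRS.ActiveConnection = o
  oRS.Open s

  While Not oRS.EOF
    ReDim Preserve aRecords(i)
    aRecords(i) = UCase(oRS(0).Value)
    i = i + 1
    oRS.MoveNext
  Wend

  oRS.Close

  QuerySQL = aRecords
End Function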

Sub CloseSQLConnection(o)
  o.Close
End Sub            
                
]]></ScriptBody>
                <TimeoutSeconds>$Config/TimeOutInSeconds$</TimeoutSeconds>
              </DataSource>
              <ConditionDetection ID="CD" TypeID="System!System.ExpressionFilter">
                <Expression>
                  <SimpleExpression>
                    <ValueExpression>
                      <XPathQuery Type="String">Property[@Name='DisplayName']</XPathQuery>
                    </ValueExpression>
                    <Operator>Equal</Operator>
                    <ValueExpression>
                      <Value Type="String">$Target/Property[Type="SC!Microsoft.SystemCenter.HealthServiceWatcher"]/HealthServiceName$</Value>
                    </ValueExpression>
                  </SimpleExpression>
                </Expression>
              </ConditionDetection>
            </MemberModules>
            <Composition>
              <Node ID="CD">
                <Node ID="DS" />
              </Node>
            </Composition>
          </Composite>
        </ModuleImplementation>
        <OutputType>System!System.PropertyBagData</OutputType>
      </DataSourceModuleType>
    </ModuleTypes>
    <MonitorTypes>
      <UnitMonitorType ID="Custom.Example.AgentHealth.MonitorType.QueryForDeadAgents" Accessibility="Internal">
        <MonitorTypeStates>
          <MonitorTypeState ID="Healthy" NoDetection="false" />
          <MonitorTypeState ID="Unhealthy" NoDetection="false" />
        </MonitorTypeStates>
        <Configuration>
          <xsd:element minOccurs="1" name="IntervalInSeconds" type="xsd:integer" />
          <xsd:element minOccurs="1" name="TimeOutInSeconds" type="xsd:integer" />
          <xsd:element minOccurs="1" name="DataWarehouseConnectionString" type="xsd:string" />
          <xsd:element minOccurs="1" name="TimeToConsiderDeadInSeconds" type="xsd:integer" />
          <xsd:element minOccurs="1" name="PropertyBagElementIdentifier" type="xsd:string" />
        </Configuration>
        <MonitorImplementation>
          <MemberModules>
            <DataSource ID="DS" TypeID="Custom.Example.AgentHealth.DataSource.QueryForDeadAgents">
              <IntervalInSeconds>$Config/IntervalInSeconds$</IntervalInSeconds>
              <TimeOutInSeconds>$Config/TimeOutInSeconds$</TimeOutInSeconds>
              <DataWarehouseConnectionString>$Config/DataWarehouseConnectionString$</DataWarehouseConnectionString>
              <TimeToConsiderDeadInSeconds>$Config/TimeToConsiderDeadInSeconds$</TimeToConsiderDeadInSeconds>
            </DataSource>
            <ConditionDetection ID="CDHealthy" TypeID="System!System.ExpressionFilter">
              <Expression>
                <SimpleExpression>
                  <ValueExpression>
                    <XPathQuery Type="Boolean">Property[@Name='$Config/PropertyBagElementIdentifier$']</XPathQuery>
                  </ValueExpression>
                  <Operator>Equal</Operator>
                  <ValueExpression>
                    <Value Type="Boolean">true</Value>
                  </ValueExpression>
                </SimpleExpression>
              </Expression>
            </ConditionDetection>
            <ConditionDetection ID="CDUnHealthy" TypeID="System!System.ExpressionFilter">
              <Expression>
                <SimpleExpression>
                  <ValueExpression>
                    <XPathQuery Type="Boolean">Property[@Name='$Config/PropertyBagElementIdentifier$']</XPathQuery>
                  </ValueExpression>
                  <Operator>Equal</Operator>
                  <ValueExpression>
                    <Value Type="Boolean">false</Value>
                  </ValueExpression>
                </SimpleExpression>
              </Expression>
            </ConditionDetection>
          </MemberModules>
          <RegularDetections>
            <RegularDetection MonitorTypeStateID="Healthy">
              <Node ID="CDHealthy">
                <Node ID="DS" />
              </Node>
            </RegularDetection>
            <RegularDetection MonitorTypeStateID="Unhealthy">
              <Node ID="CDUnHealthy">
                <Node ID="DS" />
              </Node>
            </RegularDetection>
          </RegularDetections>
        </MonitorImplementation>
      </UnitMonitorType>
    </MonitorTypes>
  </TypeDefinitions>
  <Monitoring>
    <Rules>
      <Rule ID="Custom.Example.AgentHealth.Rule.CollectEvent6022" Enabled="false" Target="SC!Microsoft.SystemCenter.Agent" ConfirmDelivery="true" Remotable="true" Priority="Normal" DiscardLevel="100">
        <Category>EventCollection</Category>
        <DataSources>
          <DataSource ID="DS" TypeID="Windows!Microsoft.Windows.EventCollector">
            <ComputerName>$Target/Host/Property[Type="Windows!Microsoft.Windows.Computer"]/NetworkName$</ComputerName>
            <LogName>Operations Manager</LogName>
            <AllowProxying>false</AllowProxying>
            <Expression>
              <SimpleExpression>
                <ValueExpression>
                  <XPathQuery Type="UnsignedInteger">EventDisplayNumber</XPathQuery>
                </ValueExpression>
                <Operator>Equal</Operator>
                <ValueExpression>
                  <Value Type="UnsignedInteger">6022</Value>
                </ValueExpression>
              </SimpleExpression>
            </Expression>
          </DataSource>
        </DataSources>
        <WriteActions>
          <WriteAction ID="WriteToDW" TypeID="DW!Microsoft.SystemCenter.DataWarehouse.PublishEventData" />
        </WriteActions>
      </Rule>
    </Rules>
    <Monitors>
      <UnitMonitor ID="Custom.Example.AgentHealth.Monitor.EventCollectionCheck" Accessibility="Public" Enabled="false" Target="SC!Microsoft.SystemCenter.AgentWatcher" ParentMonitorID="Health!System.Health.AvailabilityState" Remotable="true" Priority="Normal" TypeID="Custom.Example.AgentHealth.MonitorType.QueryForDeadAgents" ConfirmDelivery="true">
        <Category>AvailabilityHealth</Category>
        <OperationalStates>
          <OperationalState ID="Healthy" MonitorTypeStateID="Healthy" HealthState="Success" />
          <OperationalState ID="Unhealthy" MonitorTypeStateID="Unhealthy" HealthState="Error" />
        </OperationalStates>
        <Configuration>
          <IntervalInSeconds>21600</IntervalInSeconds>
          <TimeOutInSeconds>1800</TimeOutInSeconds>
          <DataWarehouseConnectionString>mydatawarehouse.contoso.com\dw01</DataWarehouseConnectionString>
          <TimeToConsiderDeadInSeconds>21600</TimeToConsiderDeadInSeconds>
          <PropertyBagElementIdentifier>isEventCollectionHealthy</PropertyBagElementIdentifier>
        </Configuration>
      </UnitMonitor>
      <UnitMonitor ID="Custom.Example.AgentHealth.Monitor.PerfCollectionCheck" Accessibility="Public" Enabled="true" Target="SC!Microsoft.SystemCenter.AgentWatcher" ParentMonitorID="Health!System.Health.AvailabilityState" Remotable="true" Priority="Normal" TypeID="Custom.Example.AgentHealth.MonitorType.QueryForDeadAgents" ConfirmDelivery="true">
        <Category>AvailabilityHealth</Category>
        <OperationalStates>
          <OperationalState ID="Healthy" MonitorTypeStateID="Healthy" HealthState="Success" />
          <OperationalState ID="Unhealthy" MonitorTypeStateID="Unhealthy" HealthState="Error" />
        </OperationalStates>
        <Configuration>
          <IntervalInSeconds>21600</IntervalInSeconds>
          <TimeOutInSeconds>1800</TimeOutInSeconds>
          <DataWarehouseConnectionString>mydatawarehouse.contoso.com\dw01</DataWarehouseConnectionString>
          <TimeToConsiderDeadInSeconds>21600</TimeToConsiderDeadInSeconds>
          <PropertyBagElementIdentifier>isPerfCollectionHealthy</PropertyBagElementIdentifier>
        </Configuration>
      </UnitMonitor>
    </Monitors>
  </Monitoring>
  <LanguagePacks>
    <LanguagePack ID="ENU" IsDefault="true">
      <DisplayStrings>
        <DisplayString ElementID="Custom.Example.AgentHealth">
          <Name>Agent Health Management Pack</Name>
          <Description>Collects event 6022 from all agents, writes it to the data warehouse, and then queries to ensure events and perf counters are being collected from all agents.</Description>
        </DisplayString>
        <DisplayString ElementID="Custom.Example.AgentHealth.DataSource.QueryForDeadAgents">
          <Name>Query For Dead Agents Data Source</Name>
          <Description>Queries the DW for agents that haven't reported event 6022 or perf counters in x amount of time.</Description>
        </DisplayString>
        <DisplayString ElementID="Custom.Example.AgentHealth.Monitor.EventCollectionCheck">
          <Name>Event Collection Health Check Monitor</Name>
          <Description>Checks to see if an agent is collecting event 6022 successfully.</Description>
        </DisplayString>
        <DisplayString ElementID="Custom.Example.AgentHealth.Monitor.EventCollectionCheck" SubElementID="Healthy">
          <Name>Healthy</Name>
        </DisplayString>
        <DisplayString ElementID="Custom.Example.AgentHealth.Monitor.EventCollectionCheck" SubElementID="Unhealthy">
          <Name>Unhealthy</Name>
        </DisplayString>
        <DisplayString ElementID="Custom.Example.AgentHealth.Monitor.PerfCollectionCheck">
          <Name>Performance Collection Check Monitor</Name>
          <Description>Checks to see if the agent is collecting any performance data.</Description>
        </DisplayString>
        <DisplayString ElementID="Custom.Example.AgentHealth.Monitor.PerfCollectionCheck" SubElementID="Healthy">
          <Name>Healthy</Name>
        </DisplayString>
        <DisplayString ElementID="Custom.Example.AgentHealth.Monitor.PerfCollectionCheck" SubElementID="Unhealthy">
          <Name>Unhealthy</Name>
        </DisplayString>
        <DisplayString ElementID="Custom.Example.AgentHealth.MonitorType.QueryForDeadAgents">
          <Name>Query For Dead Agents Monitor Type</Name>
          <Description>Determines, based on property bag data, whether an agent is healthy or unhealthy.</Description>
        </DisplayString>
        <DisplayString ElementID="Custom.Example.AgentHealth.Rule.CollectEvent6022">
          <Name>Collect Event 6022</Name>
          <Description>Event collection rule for event ID 6022</Description>
        </DisplayString>
      </DisplayStrings>
      <KnowledgeArticles>
        <KnowledgeArticle ElementID="Custom.Example.AgentHealth.Monitor.EventCollectionCheck" Visible="true">
          <MamlContent>
            <maml:section xmlns:maml="http://schemas.microsoft.com/maml/2004/10">
              <maml:title>Summary</maml:title>
              <maml:para />
              <maml:para>This monitor queries the OpsMgr data warehouse on a scheduled basis and reports which agents are failing to collect event 6022 within a period of time.</maml:para>
            </maml:section>
            <maml:section xmlns:maml="http://schemas.microsoft.com/maml/2004/10">
              <maml:title>Configuration</maml:title>
              <maml:para />
              <maml:para>This is one of two workflows that run this query. The query cooks down, so as long as the parameters passed into both monitors are identical (except for the PropertyBagElementIdentifier), the script will only run once during the scheduled time.</maml:para>
              <maml:list>
                <maml:listItem>
                  <maml:para>IntervalInSeconds = 21600: How often the script is configured to run.</maml:para>
                </maml:listItem>
                <maml:listItem>
                  <maml:para>TimeOutInSeconds = 1800: The timeout for the script is set high because in large environments the call to ReturnItems() takes a while.</maml:para>
                </maml:listItem>
                <maml:listItem>
                  <maml:para>DataWarehouseConnectionString = SQLBOX\INSTANCE: This should match what you would use in SQL Management Studio to connect to the SQL instance where the OpsMgr data warehouse resides.  This assumes a database name of OperationsManagerDW, which is hard-coded in the script.</maml:para>
                </maml:listItem>
                <maml:listItem>
                  <maml:para>TimeToConsiderDeadInSeconds = 21600: How far back in time the script should look for an event 6022, which is created every 15 minutes by default.  The script takes its current time from the most recent 6022 event submitted by any agent in the database, to avoid all agents changing state if the management group is down.</maml:para>
                </maml:listItem>
                <maml:listItem>
                  <maml:para>PropertyBagElementIdentifier = isEventCollectionHealthy: This parameter should not be changed from the default.  It determines which part of the returned property bag the monitor checks for health (the event or performance property).</maml:para>
                </maml:listItem>
              </maml:list>
              <maml:para></maml:para>
            </maml:section>
            <maml:section xmlns:maml="http://schemas.microsoft.com/maml/2004/10">
              <maml:title>Causes</maml:title>
              <maml:para />
              <maml:para>Events not being collected on an agent can be caused by various problems including the machine being offline or the agent being in an unhealthy state.</maml:para>
              <maml:para />
            </maml:section>
            <maml:section xmlns:maml="http://schemas.microsoft.com/maml/2004/10">
              <maml:title>Resolutions</maml:title>
              <maml:para />
              <maml:para>The first step in troubleshooting an unhealthy agent should be to look in the Operations Manager event log on the agent.  A good next step is to restart the health service and read the events logged from the point of restart.</maml:para>
              <maml:para />
              <maml:para />
            </maml:section>
            <maml:section xmlns:maml="http://schemas.microsoft.com/maml/2004/10">
              <maml:title>Additional</maml:title>
              <maml:para />
              <maml:para>To view the state of these monitors in the Operations Manager Console go to Monitoring\Operations Manager\Agent\Agent Health State and view the Health Service Watcher column on the left.</maml:para>
              <maml:para />
            </maml:section>
            <maml:section xmlns:maml="http://schemas.microsoft.com/maml/2004/10">
              <maml:title>External</maml:title>
              <maml:para />
              <maml:para>Kevin Holman’s blog on fixing unhealthy agents:</maml:para>
              <maml:para>
                <maml:navigationLink>
                  <maml:linkText>https://blogs.technet.com/b/kevinholman/archive/2009/10/01/fixing-troubled-agents.aspx</maml:linkText>
                  <maml:uri href="https://blogs.technet.com/b/kevinholman/archive/2009/10/01/fixing-troubled-agents.aspx" />
                </maml:navigationLink>
              </maml:para>
              <maml:para />
            </maml:section>
          </MamlContent>
        </KnowledgeArticle>
        <KnowledgeArticle ElementID="Custom.Example.AgentHealth.Monitor.PerfCollectionCheck" Visible="true">
          <MamlContent>
            <maml:section xmlns:maml="http://schemas.microsoft.com/maml/2004/10">
              <maml:title>Summary</maml:title>
              <maml:para />
              <maml:para>This monitor queries the OpsMgr data warehouse on a scheduled basis and reports which agents are failing to collect performance counters targeted at the Health Service within a period of time.</maml:para>
              <maml:para />
            </maml:section>
            <maml:section xmlns:maml="http://schemas.microsoft.com/maml/2004/10">
              <maml:title>Configuration</maml:title>
              <maml:para />
              <maml:para>This is one of two workflows that run this query. The query cooks down, so as long as the parameters passed into both monitors are identical (except for the PropertyBagElementIdentifier), the script will only run once during the scheduled time.</maml:para>
              <maml:list>
                <maml:listItem>
                  <maml:para>IntervalInSeconds = 21600: How often the script is configured to run.</maml:para>
                </maml:listItem>
                <maml:listItem>
                  <maml:para>TimeOutInSeconds = 1800: The timeout for the script is set high because in large environments the call to ReturnItems() takes a while.</maml:para>
                </maml:listItem>
                <maml:listItem>
                  <maml:para>DataWarehouseConnectionString = SQLMACHINENAME\INSTANCE: This should match what you would use in SQL Management Studio to connect to the SQL instance where the OpsMgr data warehouse resides.  This assumes a database name of OperationsManagerDW, which is hard-coded in the script.</maml:para>
                </maml:listItem>
                <maml:listItem>
                  <maml:para>TimeToConsiderDeadInSeconds = 21600: How far back in time the script should look for performance data, which is collected every 15 minutes by default.  The script takes its current time from the most recent 6022 event submitted by any agent in the database, to avoid all agents changing state if the management group is down.</maml:para>
                </maml:listItem>
                <maml:listItem>
                  <maml:para>PropertyBagElementIdentifier = isPerfCollectionHealthy: This parameter should not be changed from the default.  It determines which part of the returned property bag the monitor checks for health (the event or performance property).</maml:para>
                </maml:listItem>
              </maml:list>
              <maml:para />
              <maml:para />
            </maml:section>
            <maml:section xmlns:maml="http://schemas.microsoft.com/maml/2004/10">
              <maml:title>Causes</maml:title>
              <maml:para />
              <maml:para>Performance counters not being collected on an agent can be caused by various problems including the machine being offline or the agent being in an unhealthy state.</maml:para>
            </maml:section>
            <maml:section xmlns:maml="http://schemas.microsoft.com/maml/2004/10">
              <maml:title>Resolutions</maml:title>
              <maml:para />
              <maml:para>The first step in troubleshooting an unhealthy agent should be to look in the Operations Manager event log on the agent.  A good next step is to restart the health service and read the events logged from the point of restart.</maml:para>
            </maml:section>
            <maml:section xmlns:maml="http://schemas.microsoft.com/maml/2004/10">
              <maml:title>Additional</maml:title>
              <maml:para />
              <maml:para>To view the state of these monitors in the Operations Manager Console go to Monitoring\Operations Manager\Agent\Agent Health State and view the Health Service Watcher column on the left.</maml:para>
            </maml:section>
            <maml:section xmlns:maml="http://schemas.microsoft.com/maml/2004/10">
              <maml:title>External</maml:title>
              <maml:para />
              <maml:para>Kevin Holman’s blog on fixing unhealthy agents: <maml:navigationLink><maml:linkText>https://blogs.technet.com/b/kevinholman/archive/2009/10/01/fixing-troubled-agents.aspx</maml:linkText><maml:uri href="https://blogs.technet.com/b/kevinholman/archive/2009/10/01/fixing-troubled-agents.aspx" /></maml:navigationLink></maml:para>
              <maml:para />
              <maml:para />
            </maml:section>
          </MamlContent>
        </KnowledgeArticle>
      </KnowledgeArticles>
    </LanguagePack>
  </LanguagePacks>
</ManagementPack>

Custom.Example.AgentHealth.renametoxml
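One design choice in the script is worth calling out: as the knowledge article in the MP explains, the look-back window is measured from the most recent 6022 event in the data warehouse rather than from the wall clock, so an outage of the management group itself doesn't turn every agent red. The following is a small Python illustration of that idea only, not the MP's VBScript; `stale_agents`, the server names, and the timestamps are invented for the example.

```python
from datetime import datetime, timedelta

WINDOW = timedelta(seconds=21600)  # TimeToConsiderDeadInSeconds in the MP


def stale_agents(last_event, reference_time, window=WINDOW):
    """Return sorted names of agents with no event newer than the cutoff."""
    cutoff = reference_time - window
    return sorted(name for name, seen in last_event.items() if seen <= cutoff)


# Hypothetical scenario: the management group stopped writing to the data
# warehouse around 02:00, and SRV03 had already gone quiet the day before.
last_event = {
    "SRV01": datetime(2014, 4, 13, 2, 0),
    "SRV02": datetime(2014, 4, 13, 1, 45),
    "SRV03": datetime(2014, 4, 12, 14, 0),
}

wall_clock = datetime(2014, 4, 13, 10, 0)  # eight hours into the outage
latest_in_dw = max(last_event.values())    # newest event from any agent

by_wall_clock = stale_agents(last_event, wall_clock)
by_latest = stale_agents(last_event, latest_in_dw)

print("Measured from wall clock:  ", by_wall_clock)
print("Measured from latest event:", by_latest)
```

Measured from the wall clock, all three agents look dead because nothing has reached the data warehouse for eight hours. Measured from the latest event, only SRV03 is flagged, which is the agent that actually stopped reporting.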

Comments

  • Anonymous
    April 13, 2014
    Hi Russ, Thanks for this. I imported the MP and "the agent returning data" monitor now appears in the health explorer. However just like your screenshot the monitor is in blank state. Does that mean something is wrong or it is the way it should be and it will only change when it goes to red? Cheers

  • Anonymous
    April 13, 2014
    Hi Nassim, The agent returning data aggregate monitor isn't part of this MP. See the Performance Collection Check Monitor; that is part of the management pack.

  • Anonymous
    April 14, 2014
    Oh! :) Thanks a lot. I can now see that it has changed state from blank to healthy yesterday a few hours after I enabled it. Cheers

  • Anonymous
    April 27, 2015
    Hi Russ, this MP is very useful. I am going to use it. Only one question. In our SCOM environment the script QueryForDeadAgents.vbs runs under the Management Server Action Account and not under the Data Warehouse Action Account as declared in the MP, and the script failed to connect to OperationsManagerDW. I had to add the Management Server Action Account as a reader for OperationsManagerDW and now it works. I only wonder why the script does not run under the Data Warehouse Action Account as written in the MP. Thanks!