Partilhar via


Confirmatory and Experimental Metrics

As we experiment more using data to understand the quality of our product, the proper use of telemetry becomes more clear. While initially we were enamored with using telemetry to understand whether the product was working as expected, recently it has become clear that there is another, more powerful use for data. Data can tell us not just what is working, but whether we are building the right thing.

There are two major types of metrics.  Both have their place in the data driven quality toolkit.  Confirmatory metrics are used to confirm that a feature or scenario is working correctly.  Experimental metrics are used to determine the effect of a change on desired outcomes.  While most teams will start using primarily the first, over time, they will shift to more of the second.

Confirmatory metrics can also be called Quality of Service (QoS) metrics .  They are monitors.  That is, metrics designed to monitor the health of the system.  Did users complete the scenario?  Did the feature crash?  These metrics can be gathered from real customers using the system or from synthetic workloads. Confirmatory metrics alert the team when something is broken, but say nothing about how it affects behavior.  They provide very similar information to test cases.  As such, the primary action they can induce is to file and fix a bug.

Experimental metrics can also called Quality of Experience (QoE) metrics.  Each scenario being monitored introduces a problem that users have and an outcome if the problem is resolved.  Experimental metrics measure that outcome.  The implementation of the solution should not matter.  What matters is how the change affected behavior.

An example may help.  There may be a scenario to improve the time taken to debug asynchronous call errors.  The problem is that debugging takes too long.  The outcome is that debugging takes less time.  Metrics can be added to measure the median time a debugging session takes (or a host of other measures).  This might be called a KPI (Key Performance Indicator).  Given the KPI, it is possible to run experiments.  The team might develop a feature to store the asynchronous call chain and expose it to developers when the app crashes.  The team can flight this change and measure how debug times are affected.  If the median time goes down, the experiment was a success.  If it stays flat or regresses, the experiment is a failure and the feature needs to be reworked or even scrapped.

Experimental metrics are a proxy for user satisfaction with the product.  The goal is to maximize (or minimize in the case of debug times) the KPI and to experiment until the team finds ways of doing so. This is the real power behind data driven quality.  It connects the team once again with the needs of the customers.

There is a 3rd kind of metric which is not desirable.  Those are called vanity metrics.  Vanity metrics are ones that make us feel good but do not drive actions.  Number of users is one such metric.  It is nice to see a feature or product being used, but what does that mean?  How does that change the team's behavior?  What action did they take to create the change?  If they don't know the answer to these questions, the metric merely makes them feel good.  You can read more about vanity metrics here.