Tuesday, 18 July 2023

How should we use metrics in DevOps?


At DevOpsDays Amsterdam I joined an Open Space about metrics. It was very interesting to hear that some companies make the use of DORA metrics mandatory. For those unfamiliar with the DORA metrics, a short explanation:

  1. Mean Time To Recover (MTTR): the amount of time it takes to restore a downed service
  2. Change Failure Rate (CFR): the percentage of changes to production that cause an outage
  3. Lead Time (LT): the time it takes to bring a change from inception to production
  4. Deployment Frequency (DF): how often you can deploy a change
  5. Availability (A): the uptime of your main process from a user's perspective
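As a rough illustration (all data and field names below are entirely hypothetical), the first four of these could be computed from a team's own deployment and incident records along these lines:

```python
from datetime import datetime

# Hypothetical deployment records: when each deploy landed, whether it
# caused a failure, and when work on the change started.
deploys = [
    {"at": datetime(2023, 7, 3, 10), "failed": False, "started": datetime(2023, 6, 30, 9)},
    {"at": datetime(2023, 7, 4, 15), "failed": True,  "started": datetime(2023, 7, 3, 11)},
    {"at": datetime(2023, 7, 6, 9),  "failed": False, "started": datetime(2023, 7, 5, 14)},
]
# Hypothetical incident records: when the service went down and came back up.
incidents = [
    {"down": datetime(2023, 7, 4, 15), "up": datetime(2023, 7, 4, 16, 30)},
]

# MTTR in hours: average of (up - down) over all incidents.
mttr = sum((i["up"] - i["down"]).total_seconds() for i in incidents) / len(incidents) / 3600
# CFR as a percentage of deploys that caused a failure.
cfr = sum(d["failed"] for d in deploys) / len(deploys) * 100
# Lead time in days: average of (deployed - started).
lead_time = sum((d["at"] - d["started"]).total_seconds() for d in deploys) / len(deploys) / 86400
# Deployment frequency: deploys per day over the observed window.
days_observed = (deploys[-1]["at"] - deploys[0]["at"]).days or 1
deploy_freq = len(deploys) / days_observed

print(f"MTTR: {mttr:.1f} h, CFR: {cfr:.0f}%, lead time: {lead_time:.1f} d, {deploy_freq:.1f} deploys/day")
```

In practice you would pull this data from your CI/CD tooling and incident tracker rather than hand-written records, but the arithmetic stays this simple.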

These metrics are derived from survey data gathered from a great number of people (about 33,000) across many companies. It is important to note that these metrics are just a subset of what can be measured, which means they might not be best suited to your own situation. Teams in organisations should select metrics based on the goals they want to achieve, so metrics can therefore differ from team to team. Each team is responsible for its own metrics, and it is crucial that they regularly reassess the metrics they use.

In the workshops I do with teams about metrics I explain the most important rule: metrics are owned by the team. No one outside the team should decide which metrics to use. You can see why I struggle with making metrics mandatory: metrics can help increase a team's performance, but only if the team is in control of them and uses them as a learning mechanism.

Metrics should also not be used to compare performance between teams, especially not from a management perspective. This will kill the dialogue between the team and the leaders who should facilitate them in growing. Discussing team metrics with a leader is good; they can understand the team and help it remove impediments and grow. Comparison leads to competition, competition leads to selective development of capabilities, and the result will most probably be improvements on the measured items, which are not necessarily the items the team needs to improve on.

As the saying often attributed to Peter Drucker goes: "What gets measured gets improved". In other words, if you know where you want to improve, start measuring; measure what you think will help you improve. There are a couple of things to bear in mind. First of all, think of the possible side effects of a metric and how it can be cheated. For instance: measure velocity in story points to see how much work gets done. This sounds like a great metric, but it can easily be cheated. An increase in story points does not guarantee more delivered work; it might simply be that stories get more points, thus cheating the metric.

If you define a metric, you should also define a goal for where you want to be with that metric after a certain amount of time. Don't go for 100%; that will be demoralising. Set achievable goals.

What kind of metrics might be useful to a team? Here are some metrics you might want to use.

  1. Mean Time to Detect (MTTD): It measures the average time taken to detect issues or failures in the production environment. A lower MTTD indicates effective monitoring and alerting systems.
  2. Mean Time to Resolve (MTTR): It measures the average time taken to resolve issues or failures once they are detected. A lower MTTR indicates efficient incident response and problem-solving capabilities.
  3. Change Failure Rate: It calculates the percentage of changes that result in incidents or require rollbacks. A lower change failure rate indicates a higher level of stability and quality in the software delivery process.
  4. Test Coverage: It measures the percentage of code or system behaviour covered by automated tests. Higher test coverage indicates a reduced risk of introducing bugs and promotes code reliability.
  5. Customer Satisfaction: This metric captures customer feedback and satisfaction levels, indicating how well the delivered software meets customer expectations and needs.
  6. Infrastructure as Code (IaC) Compliance: It measures the percentage of infrastructure managed as code and tracks adherence to infrastructure automation practices.
  7. Team Morale: While not directly tied to technical metrics, monitoring team morale and job satisfaction can provide insights into the health of the DevOps culture and its impact on productivity and collaboration.
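To make the difference between the first two concrete, here is a small sketch (with made-up incident records) of how MTTD and MTTR could be derived from the moments a failure began, was detected, and was resolved:

```python
from datetime import datetime

# Hypothetical incident records: when the failure actually began,
# when monitoring detected it, and when it was resolved.
incidents = [
    {"began": datetime(2023, 7, 1, 9, 0),  "detected": datetime(2023, 7, 1, 9, 12),
     "resolved": datetime(2023, 7, 1, 10, 0)},
    {"began": datetime(2023, 7, 8, 14, 0), "detected": datetime(2023, 7, 8, 14, 4),
     "resolved": datetime(2023, 7, 8, 14, 34)},
]

def mean_minutes(deltas):
    """Average a list of timedeltas, expressed in minutes."""
    return sum(d.total_seconds() for d in deltas) / len(deltas) / 60

# MTTD measures alerting quality: began -> detected.
mttd = mean_minutes([i["detected"] - i["began"] for i in incidents])
# MTTR measures response quality: detected -> resolved.
mttr = mean_minutes([i["resolved"] - i["detected"] for i in incidents])
print(f"MTTD: {mttd:.0f} min, MTTR: {mttr:.0f} min")
```

Splitting the two apart matters: a high MTTD points at monitoring and alerting, while a high MTTR points at incident response, and they call for different improvements.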

My favourite metrics are the following:

  1. Predictability: the percentage of delivered story points versus estimated story points. It tells you how accurate your planning is. Measuring predictability improves refinement, planning, and the discussion about taking too much work into the sprint backlog. It also reduces unplanned work. In other words, it shows how well you understand changes to your product.
  2. Availability: how available your product is to its users. This is measured from outside the organisation, by running the main process of your product to simulate real users. This metric gives direct feedback on how well users are able to use your product.
  3. Technical Debt: how much time would be needed to resolve all technical debt (you can use SonarQube for this). Technical debt accumulates if you don't act upon it; it suffers from compound interest and will eventually block any new features. It is very important to get a grip on technical debt, and as a general rule you should keep it below 40% of your total work.
  4. Age: the age of the PBIs / Stories / Issues on your backlog. Items at the end of the backlog will probably never get done, or will be totally irrelevant by the time they come up. Define a maximum age for items and throw away anything older. The odd thing about things you throw away is that if they come back, they come back better (new insights are added, they are updated to the latest specs).
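As a sketch of how the predictability and age metrics could be calculated (all names, numbers, and thresholds below are made up for illustration):

```python
from datetime import date

# Hypothetical sprint data: story points estimated at planning vs delivered.
sprints = [
    {"planned": 30, "delivered": 24},
    {"planned": 28, "delivered": 27},
]
# Predictability per sprint: delivered as a percentage of planned.
predictability = [s["delivered"] / s["planned"] * 100 for s in sprints]

# Hypothetical backlog with creation dates; drop items older than a year.
MAX_AGE_DAYS = 365
backlog = [
    {"title": "Export to CSV", "created": date(2023, 5, 1)},
    {"title": "IE11 support",  "created": date(2021, 2, 1)},
]
today = date(2023, 7, 18)
keep = [item for item in backlog if (today - item["created"]).days <= MAX_AGE_DAYS]
print(f"predictability: {predictability}, backlog kept: {[i['title'] for i in keep]}")
```

The one-year maximum age here is just an example; the point is that the team picks a threshold and prunes the backlog against it regularly, for instance as part of refinement.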

I have found that these four metrics help increase team performance more than the first four DORA metrics.

The big side note here is that it's up to a team to use them; they can remove metrics from or add metrics to this list. When I start with metrics in my teams, my first goal is to define one to three metrics to start with. Each retrospective we inspect the metrics and adapt. This means we focus on something to improve, which can result in modifying a metric (its description, measurement, or KPI) or adding a new one.

To conclude, metrics can really help you understand your work and your performance. Metrics should be defined by the individual teams, visible to everyone, and modified when needed.

What kind of metrics do you like to use with your team?