A board member asks the question every non-technical leader dreads: "Is our engineering team productive?" The honest answer used to be a shrug dressed up as a metric: lines of code, story points, hours logged, all of which reward looking busy over being effective. The DORA metrics exist because a long research programme found four measures that actually track with how a team performs. They are worth understanding, and worth handling with care.

The quick version

  • DORA gives you four delivery metrics: how often you ship (deployment frequency), how long a change takes to reach users (lead time for changes), how often a release breaks something (change failure rate), and how fast you recover when it does (failed deployment recovery time).
  • They split into speed and stability, and the central, repeated finding is that the best teams score well on both. Fast and safe are not a trade-off.
  • They measure the system, not the person. Use them on the team's pipeline; the moment you rank individuals by them, people game them and the signal dies.
  • They are necessary but not sufficient. A team can hit elite delivery numbers while building the wrong thing or burning out. Pair them with a measure of developer experience.

The idea in depth

The four metrics come out of the DevOps Research and Assessment programme (DORA), and the argument behind them is laid out in Accelerate by Nicole Forsgren, Jez Humble and Gene Kim (IT Revolution, 2018). The book's claim is not "these feel important"; it is that, across years of survey data from tens of thousands of professionals, these measures statistically predicted stronger organisational outcomes such as profitability and productivity. That is a stronger foundation than most management dashboards stand on, and it is why DORA escaped engineering circles and reached boardrooms.

The four metrics, plainly

DORA's own definitions are deliberately concrete. Deployment frequency is "the number of deployments over a given period or the time between deployments." Change lead time is "the amount of time it takes for a change to go from committed to version control to deployed in production." Change fail rate is the share of deployments that "require immediate intervention": a fix, a rollback, a patch. Failed deployment recovery time is how long it takes to recover when a deployment fails (definitions from the team's primary source, dora.dev). The first two describe throughput: how quickly value moves. The last two describe stability: how much breaks and how fast you set it right.
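To make those definitions tangible, here is a minimal Python sketch of how the four numbers could be derived from a team's deployment log. The `Deployment` record, its field names and the `dora_summary` function are hypothetical, invented for illustration; they are not part of any DORA tooling, and the choice of the median as the summary statistic is an assumption rather than something the definitions prescribe.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import median
from typing import Optional


@dataclass
class Deployment:
    """One production deployment (hypothetical record shape)."""
    committed_at: datetime                  # change committed to version control
    deployed_at: datetime                   # change running in production
    failed: bool = False                    # needed immediate intervention
    restored_at: Optional[datetime] = None  # when service was healthy again


def dora_summary(deployments: list[Deployment], period_days: int) -> dict:
    """Summarise the four metrics for one team over one period."""
    lead_times = [d.deployed_at - d.committed_at for d in deployments]
    failures = [d for d in deployments if d.failed]
    recoveries = [d.restored_at - d.deployed_at for d in failures if d.restored_at]
    return {
        # Throughput
        "deploys_per_week": len(deployments) / (period_days / 7),
        "median_lead_time": median(lead_times) if lead_times else None,
        # Stability
        "change_fail_rate": len(failures) / len(deployments) if deployments else None,
        "median_recovery_time": median(recoveries) if recoveries else None,
    }


# Illustrative data: four deployments in a two-week window, one of which failed.
start = datetime(2024, 1, 1, 9, 0)
sample = [
    Deployment(start, start + timedelta(days=1)),
    Deployment(start, start + timedelta(days=2)),
    Deployment(start, start + timedelta(days=3), failed=True,
               restored_at=start + timedelta(days=3, hours=2)),
    Deployment(start, start + timedelta(days=4)),
]
summary = dora_summary(sample, period_days=14)
for name, value in summary.items():
    print(f"{name}: {value}")
```

Note that the input is a list of deployments, not a list of people. The sketch has no field for who shipped what, which is deliberate: the numbers describe the pipeline.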

The point that makes the framework more than a dashboard is the relationship between those two pairs. The intuitive belief is that going faster makes you less safe, that you buy speed with bugs. DORA's research consistently found the opposite: high-performing teams ship more often and break things less often, because the same practices that let you deploy in minutes (small changes, automation, fast feedback) are the ones that keep failures small and recoverable. So drop the "fast versus careful" question entirely. The better one is: what is stopping the team from making each change small? Small, frequent changes are the mechanism underneath both columns.

flowchart LR
  A(["Small, frequent<br/>changes"]) --> B(["Throughput<br/>· deploy often<br/>· short lead time"])
  A --> C(["Stability<br/>· low failure rate<br/>· fast recovery"])
  B --> D(["High-performing<br/>delivery"])
  C --> D
Speed and stability share a root cause, small batches, rather than trading off. Leaders Loop

Performance tiers: useful, but not a league table

DORA groups teams into clusters, commonly reported as Elite, High, Medium and Low, based on where they land across the metrics. Elite teams deploy on demand and recover from failures in well under an hour; lower-performing teams might deploy monthly and take days to recover. These bands are genuinely useful as a mirror: they tell a team roughly where it sits and what "good" can look like. Treat the tier as a starting diagnosis and a direction of travel, never as a target to hit by Friday. You want to move up a band over quarters by removing friction, not to manufacture the number.

An honest limitation: the exact cut-offs between tiers shift from report to report, and the comparison is across wildly different contexts. A payments platform and a marketing site live in different worlds. Treat the bands as orientation, not as a precise grade. Your trajectory against your own past is a more trustworthy signal than your rank against a global average.
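Reading a team against its own past can be as simple as comparing two snapshots and noting which way each number moved. The sketch below assumes hypothetical metric names and made-up quarterly figures; the only real logic is that a higher deployment frequency is better while the other three improve as they fall.

```python
# Direction in which each metric improves: +1 means higher is better, -1 means lower is better.
BETTER = {
    "deploys_per_week": +1,
    "lead_time_days": -1,
    "change_fail_rate": -1,
    "recovery_hours": -1,
}


def trajectory(previous: dict, current: dict) -> dict:
    """Label each metric as improved, worse or flat versus the team's own past."""
    result = {}
    for name, direction in BETTER.items():
        delta = (current[name] - previous[name]) * direction
        result[name] = "improved" if delta > 0 else "worse" if delta < 0 else "flat"
    return result


# Hypothetical snapshots for one team in two consecutive quarters.
q1 = {"deploys_per_week": 0.5, "lead_time_days": 21, "change_fail_rate": 0.30, "recovery_hours": 8}
q2 = {"deploys_per_week": 1.0, "lead_time_days": 12, "change_fail_rate": 0.30, "recovery_hours": 10}

report = trajectory(q1, q2)
for name, verdict in report.items():
    print(f"{name}: {verdict}")
```

In this made-up case the team is shipping faster but recovering more slowly, which is a more useful conversation starter than a single tier label.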

What DORA quietly does not measure

This is the part most dashboards skip. DORA measures the delivery pipeline. It says nothing about whether the team is building the right thing, whether the code will be maintainable in two years, or whether the people writing it are coping. The same researchers behind DORA went on to develop complementary frameworks precisely because of this gap: the SPACE framework (Forsgren and colleagues from GitHub, Microsoft and the University of Victoria, published in ACM Queue, 2021) widens the lens to satisfaction, performance, activity, communication and efficiency, and the later DevEx work focuses on the day-to-day developer experience. The honest reading is that DORA is necessary but not sufficient. Don't let the four delivery numbers stand alone. Pair them with one human signal (a recurring, anonymous read on developer experience) so that a team hitting its delivery targets while quietly burning out cannot hide inside green dashboards.

There is a live example of why this matters. The 2024 DORA report found that as teams adopted AI coding tools, developers reported productivity and satisfaction gains, yet the data associated rising AI use with worse delivery stability and throughput. If you watched only the human signal you would cheer; if you watched only delivery you would panic. You need both lenses pointed at the same team to see what is actually happening.

Measure the pipeline, not the person. The four numbers describe a system; the moment they describe an individual, they stop describing the truth.

A worked example

The figures below are illustrative, chosen to show how the metrics interact, not data from a real company.

Imagine a 12-person team at a mid-sized software firm. Their dashboard reads: deploys once a fortnight, lead time of three weeks from commit to production, a change failure rate around 30%, and recovery that takes the best part of a day when a release goes wrong. A new engineering lead is tempted to demand "ship twice as often." That would make everything worse, pushing more big, risky releases through the same broken pipe.

Reading the metrics as a system tells a different story. The three-week lead time and the 30% failure rate are linked: changes sit so long that they grow large, and large changes are the ones that break. The fix is not "deploy more"; it is "make each change smaller and the path to production faster." Over two quarters the team invests in automated tests and a one-click release, and breaks work into smaller pieces. Lead time falls to a few days, deployments rise to several a week as a by-product, the failure rate drops toward 15% because each release carries less risk, and recovery shrinks to under an hour because a small change is easy to roll back. No one was told to "be more productive." The system got better, and all four numbers followed.

flowchart TD
  P(["Problem: 3-week lead time,<br/>30% failure rate"]) --> R(["Root cause:<br/>changes too large"])
  R --> M1(["Move: smaller changes"])
  R --> M2(["Move: automated tests"])
  R --> M3(["Move: one-click release"])
  M1 --> O(["Outcome: shorter lead time,<br/>more frequent deploys,<br/>fewer failures, faster recovery"])
  M2 --> O
  M3 --> O
Improving the system, rather than demanding more individual effort, moves every metric at once (illustrative).
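Why would smaller releases fail less often? One way to build intuition for the worked example is a toy model, sketched here in Python. It is not DORA's method and not derived from their data. It assumes that every individual change has the same small, independent chance of breaking production (the 3% figure and the batch sizes of 12 and 5 are invented), and it asks how likely a release is to contain at least one breaking change.

```python
def release_failure_probability(changes_per_release: int, p_change: float) -> float:
    """Chance that a release contains at least one failing change.

    Toy assumption: each change fails independently with probability p_change.
    """
    return 1 - (1 - p_change) ** changes_per_release


P_CHANGE = 0.03  # assumed per-change failure probability, purely illustrative

large_batch = release_failure_probability(12, P_CHANGE)  # big, infrequent releases
small_batch = release_failure_probability(5, P_CHANGE)   # smaller, frequent releases

print(f"12 changes per release: {large_batch:.0%} of releases fail")
print(f" 5 changes per release: {small_batch:.0%} of releases fail")
```

Under these assumptions the per-release failure rate falls from roughly 31% to roughly 14% without anyone writing better code; only the batch size changed. The model says nothing about recovery time, and real changes are not independent, so treat it as intuition rather than prediction.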

Frequently asked questions

Are DORA metrics a way to rank individual engineers?

No, and using them that way is the fastest route to ruining them. The metrics describe a team's delivery system. Attribute them to individuals and you invite gaming: people split commits to inflate deployment counts, or avoid risky-but-valuable work to protect a failure rate. Measure the pipeline; talk to the people.

Doesn't deploying more often just mean more risk?

That is the assumption DORA's research overturned. Frequent deployment correlates with lower failure rates, because frequency is achieved through small batches, automation and fast feedback, the same practices that make failures rare and easy to recover from. Speed and stability rise together when you fix the underlying flow.

What is the fifth metric I keep hearing about?

DORA has refined the set over time. The 2024 work added a measure of rework (unplanned deployments made in response to a production incident), and the team also tracks reliability, or operational performance, as a separate dimension (dora.dev). For a non-engineer, the original four are the right place to start; the additions sharpen the stability picture rather than replacing it.
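As a rough illustration of the rework measure, the Python sketch below counts the share of deployments that were both unplanned and triggered by a production incident. The `Deploy` record and its two flags are hypothetical; how a team actually labels a deployment as planned or incident-driven is an assumption that will vary by organisation.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Deploy:
    """Hypothetical deployment record with two hand-labelled flags."""
    planned: bool
    incident_driven: bool = False


def rework_rate(deploys: list[Deploy]) -> Optional[float]:
    """Share of deployments that were unplanned and caused by a production incident."""
    if not deploys:
        return None
    rework = sum(1 for d in deploys if not d.planned and d.incident_driven)
    return rework / len(deploys)


# Illustrative month: eight planned releases and two incident-driven fixes.
month = [Deploy(planned=True)] * 8 + [Deploy(planned=False, incident_driven=True)] * 2
rate = rework_rate(month)
print(f"rework rate: {rate:.0%}")
```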

Our context is unusual. Do these still apply?

The metrics themselves are broad enough to travel, but the tier cut-offs are not gospel. Regulated, safety-critical or hardware-coupled systems will legitimately deploy less often. Compare the team to its own past trajectory first, and to the global bands second.

If DORA is so good, why do I need anything else?

Because it only watches delivery. It cannot see whether you are building the right product or whether the team is sustainable. Pair it with a developer-experience signal (the SPACE/DevEx lineage) so the human side of "productivity" is not invisible.
