Shipyard | How to Improve your DORA Metrics Faster

You’ve heard the term “DORA Metrics”; in fact, you’re probably using them in some form at your organization. DORA (DevOps Research and Assessment) Metrics were developed by the team behind the annual State of DevOps Report to measure the factors that contribute to an organization’s software delivery performance. In 2018, Google Cloud acquired DORA, and the metrics have continued to serve as a benchmark for software teams for over a decade now.

Here’s a quick history of DORA Metrics, from inception to Google Cloud acquisition.

There are four metrics:

Deployment frequency: How frequently you deploy to production
Lead time for changes: How much time it takes between commit and production
Change failure rate: What percent of deployments cause a failure in production
Time to restore service: How long it takes an org to fix a failure in production
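To make the four definitions concrete, here is a minimal sketch of how they could be computed from a list of deployment records. The `Deployment` record and the `dora_metrics` function are hypothetical names invented for this illustration, and using the median for the two time-based metrics is an assumption; your own tooling may define and aggregate these differently.

```python
from dataclasses import dataclass
from datetime import datetime
from statistics import median
from typing import Optional


@dataclass
class Deployment:
    """Hypothetical record of one production deployment."""
    committed_at: datetime                  # when the change was committed
    deployed_at: datetime                   # when it reached production
    failed: bool = False                    # did it cause a failure in production?
    restored_at: Optional[datetime] = None  # when service was restored, if it failed


def dora_metrics(deployments: list[Deployment], window_days: int) -> dict:
    """Summarize the four metrics for deployments made in a window of window_days."""
    if not deployments:
        raise ValueError("need at least one deployment")
    failures = [d for d in deployments if d.failed]
    # Only failures that have already been fixed contribute a restore time.
    restore_times = [d.restored_at - d.deployed_at for d in failures if d.restored_at]
    return {
        "deployments_per_day": len(deployments) / window_days,
        "median_lead_time": median(d.deployed_at - d.committed_at for d in deployments),
        "change_failure_rate": len(failures) / len(deployments),
        "median_time_to_restore": median(restore_times) if restore_times else None,
    }
```

Feeding this a month of deployment records gives you a baseline to compare against after any process change.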

The DORA team even created a benchmarking tool, so you can see how your engineering org stacks up.

These metrics are useful, but the real question is: How do you improve your deployment frequency and lead time, while still ensuring you don’t introduce additional failures?

There are a million ways to waste a lot of time in pursuit of better DORA Metrics. (“Replatforming!” “Implementing a new internal developer platform!”)

We semi-humbly recommend implementing ephemeral environments as the lowest-lift, highest-value way to improve your DORA metrics fast and reliably, without a huge amount of engineering work or process change.

In this article, we lay out the case:

What are ephemeral environments?
How do they fit into an existing development process?
How can they impact your DORA metrics?

What are ephemeral environments?

Ephemeral environments are relatively new. They’re also known as PR environments, QA environments, or post-commit environments.

In short, ephemeral environments are short-lived, full-stack environments that are created for every pull request or merge request. Instead of sharing staging or testing environments across developers, each developer gets a personal near-clone of production that reflects their code changes.
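The lifecycle described above can be sketched as a small state machine: an environment is created when a PR opens, rebuilt when new commits arrive, and destroyed when the PR closes. This is a hypothetical model for illustration only, not Shipyard's implementation. The `EnvironmentManager` class is invented here, and the action names are assumed to mirror the `pull_request` webhook actions that GitHub sends.

```python
from dataclasses import dataclass, field


@dataclass
class EnvironmentManager:
    """Tracks one environment per open pull request (hypothetical model)."""
    # Maps PR number -> commit SHA the environment was last built from.
    environments: dict[int, str] = field(default_factory=dict)

    def handle(self, pr_number: int, action: str, commit_sha: str = "") -> None:
        if action in ("opened", "reopened", "synchronize"):
            # Create the environment, or rebuild it at the newest commit.
            self.environments[pr_number] = commit_sha
        elif action == "closed":
            # Merged or abandoned: the environment is no longer needed.
            self.environments.pop(pr_number, None)
        # Any other action (labels, comments, and so on) leaves environments alone.
```

The point of the model is that environments are keyed by PR, so two developers never contend for the same one.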

How do ephemeral environments fit into your development process?

Today, most companies’ development processes feature shared testing or staging environments. After a developer submits a PR / merge request, their code changes are incorporated into a shared environment for all sorts of testing.

When you have shared testing and staging environments that all your developers use, you run into all sorts of problems and delays:

Developers usually have to wait for access to the staging environment. Often, staging is currently in use by QA, product, or other developers.
Then, once developers get access to the shared staging environment, they have to wait for testing to be complete across all of the different PRs in staging. This takes much longer, and is far more failure-prone, than testing individual PRs in isolation.
Often, staging breaks or there are significant bugs in other PRs. When this happens, everybody’s PR is stuck in staging until the bug is fixed and staging works again.
Ultimately, with all this waiting, bugfixes take much longer because developers have lost context. The process is also more failure-prone in production: under pressure to clear staging, changes get pushed to production with less testing than is ideal.

Clearly, this approach is disastrous for DORA metrics. We’ll get into exactly how this works after explaining how ephemeral environments can fit into your development process.

We’ve just walked through the current development process with a shared staging environment. What has to change when you implement ephemeral environments?

Quite simply, you don’t have to change anything else about your development tooling. You can plug Shipyard into your Git provider and CI/CD, and every developer gets an ephemeral environment on every PR.

The development process moves from one with a lot of waiting for shared staging environments to one where no developer blocks another developer between commit and production.

How ephemeral environments impact your DORA metrics

Remember the four DORA metrics:

Deployment frequency: How frequently you deploy to production
Lead time for changes: How much time it takes between commit and production
Change failure rate: What percent of deployments cause a failure in production
Time to restore service: How long it takes an org to fix a failure in production

Here’s how ephemeral environments impact each metric:

Deployment frequency: How frequently you deploy to production

When you’re waiting on a shared staging environment, you might be stuck with a maximum deployment frequency of weekly, biweekly, or even monthly. This is just how shared staging environments work: there’s a lot of coordination to bring every relevant PR into staging, test everything together, fix stuff, and deploy.

Ephemeral environments make deploying less of a big deal. Each PR can be tested and deployed independently. If you want every developer to deploy multiple times per day… with a shared staging environment, that’s impossible. With ephemeral environments, doing this is trivial.

Verdict: Ephemeral environments are the best solution to quickly get your engineering team to “elite” deployment frequency.

Lead time for changes: How much time it takes between commit and production

Again, shared staging environments ruin your DORA metrics here. As we explained earlier in this post, shared staging environments cause constant delays between commit and production:

Developers usually have to wait for access to the staging environment. Often, staging is currently in use by QA, product, or other developers.
Then, once developers can deploy to the shared staging environment, the entire test suite will run against all the PRs in staging. This is far more failure-prone than testing individual PRs in isolation.
Often, staging breaks or there are significant bugs in other PRs. When this happens, everybody’s PR is stuck in staging until the bug is fixed and staging works again.
Ultimately, with all this waiting, bugfixes take much longer because developers have lost context. The process is also more failure-prone in production: under pressure to clear staging, changes get pushed to production with less testing than is ideal.

With ephemeral environments, these delays shrink to near zero:

You can run your integration tests on each PR, automatically: testing wait times go to near zero
Developers can send links to their ephemeral environment to product and business stakeholders instantly: reviews can start right away

With ephemeral environments, you can continuously test PRs throughout the day and week, rather than testing big bundles of PRs on one stressful day per release cycle.
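As a rough illustration of the difference, here is a toy model of lead time under the two processes. It assumes that in the batched process every change waits for the next release day, and that in the per-PR process every change ships as soon as its own tests pass. The function names, the 14-day cycle, and the half-day test time are all made-up inputs, not measurements.

```python
from statistics import mean


def batched_lead_times(commit_days: list[float], cycle_days: float) -> list[float]:
    """Lead time when every change waits for the next shared release day.

    Release days fall on cycle_days, 2 * cycle_days, and so on. A change
    ships on the first release day strictly after its commit day.
    """
    return [(day // cycle_days + 1) * cycle_days - day for day in commit_days]


def per_pr_lead_times(commit_days: list[float], test_days: float) -> list[float]:
    """Lead time when each change is tested alone and ships when its tests pass."""
    return [test_days for _ in commit_days]


# Four changes committed on days 0, 3, 7, and 13 of a 14-day release cycle.
commits = [0, 3, 7, 13]
print(mean(batched_lead_times(commits, cycle_days=14)))  # 8.25 days on average
print(mean(per_pr_lead_times(commits, test_days=0.5)))   # 0.5 days on average
```

The model ignores review time and broken-staging delays, so treat it as a lower bound on the gap rather than a prediction.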

Verdict: Ephemeral environments are the best solution to get your lead time for changes to “elite” levels.

Change failure rate: What percent of deployments cause a failure in production

QA is often overloaded. This is because they are required to test a bunch of PRs bundled together in a staging environment with a quick turnaround time. (Let’s not even mention how understaffed most QA teams are relative to what’s asked of them.)

This isn’t just failure-prone, it’s failure-guaranteeing. If you’re running a scrum process and have 10 developers merging code, and QA has a tight timeline to test everything in a shared staging environment, it would be a miracle if bugs didn’t get to production.
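A quick back-of-the-envelope calculation shows why bundling hurts. Assume, purely for illustration, that each change independently has a 5% chance of carrying a production-breaking defect past QA. The chance that a deployment contains at least one such defect is then 1 - (1 - p)^n for a bundle of n changes. The function name and the 5% figure are assumptions, and the model says nothing about QA workload, which would likely make the batched case worse.

```python
def batch_failure_probability(per_change_defect_rate: float, changes_in_batch: int) -> float:
    """Chance that a deployment of n independent changes contains at least one defect."""
    if not 0.0 <= per_change_defect_rate <= 1.0:
        raise ValueError("per_change_defect_rate must be between 0 and 1")
    if changes_in_batch < 1:
        raise ValueError("changes_in_batch must be at least 1")
    # The deployment is clean only if every single change in it is clean.
    return 1 - (1 - per_change_defect_rate) ** changes_in_batch


print(round(batch_failure_probability(0.05, 10), 3))  # 0.401 for ten bundled changes
print(round(batch_failure_probability(0.05, 1), 3))   # 0.05 for one change on its own
```

Under these assumptions, a ten-change bundle fails about eight times as often as a single-change deployment, and every change in the bundle shares the blame until the culprit is found.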

Instead, with ephemeral environments, QA can test PRs individually throughout a continuous deployment process, rather than bundled up all at once. This means better testing that doesn’t drive your QA team or developers crazy.

Verdict: Ephemeral environments are a crazy-effective solution to get your change failure rate to “elite” levels. Your QA team will thank you.

Time to restore service: How long it takes an org to fix a failure in production

When bugs cause failures in production, is it easier to find and reverse them when:

It could have been one of 25 PRs merged together in the most recent deployment
Or, it was clearly just one PR that was tested in its own ephemeral environment and deployed by itself

Clearly, the latter.
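To put a number on it, here is a sketch of the search you would face in each case. It assumes the changes in a deployment can be ordered and rebuilt one prefix at a time, and it uses the same halving idea as git bisect. The `find_first_bad` helper and the `is_bad` callback are hypothetical names for this illustration.

```python
from typing import Callable


def find_first_bad(n_changes: int, is_bad: Callable[[int], bool]) -> tuple[int, int]:
    """Binary-search changes 0..n_changes-1 for the one that broke production.

    is_bad(i) reports whether a build containing changes 0..i fails.
    Assumes exactly one culprit, so builds are good before it and bad from it on.
    Returns (index_of_culprit, number_of_builds_tested).
    """
    if n_changes < 1:
        raise ValueError("need at least one change")
    lo, hi, builds_tested = 0, n_changes - 1, 0
    while lo < hi:
        mid = (lo + hi) // 2
        builds_tested += 1
        if is_bad(mid):
            hi = mid        # culprit is at mid or earlier
        else:
            lo = mid + 1    # culprit is after mid
    return lo, builds_tested


# 25 bundled PRs where the 18th (index 17) is the culprit: five builds to find it.
print(find_first_bad(25, lambda i: i >= 17))  # (17, 5)
# One PR deployed on its own: nothing to search.
print(find_first_bad(1, lambda i: True))      # (0, 0)
```

Each of those builds is time spent while production is down, which is exactly what time to restore service measures.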

Verdict: You know the answer.

Summary: The fast path to elite DORA metrics

If you want to make real improvements to your DORA metrics without quarters-long debates, complex implementations, excessive costs, or ripping out your existing stack… consider plugging ephemeral environments into your existing development stack and process. You’ll never go back.

Shipyard is here with a 30-day free trial. We can also meet 1:1 to help you figure out how much work it will be to fit ephemeral environments into your deployment process and stack. Reach out! Or don’t. They’re your DORA metrics, after all.