Shipyard | Recap of 'E2E Testing Before Merge'

When deploying any code changes, running automated tests against your working branch is an absolute given. However, hosting an overwhelming number of tests on the same staging environment can be a challenge to orchestrate. Testing later in the pipeline can reveal conflicts which often lead to blocks, slowing overall development velocity. If only we could catch these bugs earlier…

During our second Shipyard Session, we spoke with Amrisha from MaestroQA about the importance of frequent and thorough end-to-end testing before merging PRs. Amrisha shared the challenges inherent to e2e testing on a single staging environment, and how Shipyard’s ephemeral environments have streamlined the process by integrating with her team’s testing suite.

Summary Of Things We Covered:

What is an ephemeral environment?
What are end-to-end tests?
Automating tests in your CI pipeline
Left-shifting e2e tests in your pipeline
Managing testing environments
Challenges of building an e2e suite
Running multiple testing environments with Shipyard

E2E Testing Before Merge

0:03 Benjie: I’m Benjie, I’m the Co-founder of Shipyard, and at Shipyard, we are pretty focused on ephemeral environments.

Ephemeral environments are short-lived, isolated unique copies of your application that ideally are created and taken down within your software development life cycle.

Make a code change, and all of a sudden there is a fully isolated environment for everybody in your team to use.

That includes developers for code reviews, product people for feedback, and obviously QA engineers, or in this particular instance, end-to-end tests — that’s what we’re going to dive into. Using end-to-end tests with ephemeral environments.

Amrisha will be joining us a little bit later and she’ll introduce herself then.

We also have Ashley, who is moderating this for us, so very much appreciated, and she will be taking questions and then moderating a little bit of Q&A at the end there.

So please feel free to send in your questions, and she will make sure that we all get them answered.

1:20 Benjie: We’re going to dive in, first going to talk a bit about end-to-end, about how ephemeral environments fit into those, and then obviously the bulk of this will be about MaestroQA using those end-to-end tests to improve their processes.

So first off, end-to-end tests. You have a few different types of tests in software – the overarching view is that when you’re about to release some software, you should have some confidence that things aren’t going to be broken, and the easiest way to do that (or the most basic way), is using unit tests.

That’s taking small subsets of your code and giving it a value, and then making sure that the expected value comes out of it.
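As a minimal sketch of that idea (the `normalize_email` helper is hypothetical and only here for illustration), a unit test hands a small piece of code a value and checks that the expected value comes back:

```python
def normalize_email(raw: str) -> str:
    """Hypothetical helper: trim whitespace and lowercase an email address."""
    return raw.strip().lower()


def test_normalize_email() -> None:
    # Give the function a value, then check that the expected value comes out.
    assert normalize_email("  Dev@Example.COM ") == "dev@example.com"


# Run the check directly; a test runner such as pytest would discover it by name.
test_normalize_email()
```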

That’s pretty easy to do if you just have code, gets a little more complicated when you start doing integration tests, and then it gets a lot more complicated when you do end-to-end tests.

Some people refer to integration tests and end-to-end tests interchangeably, but for this exercise we’ll kind of separate them a bit.

2:21 Benjie: Ultimately, what end-to-end tests mean, is that you are testing your entire application workflow from start to finish.

So you’re finding potential timing bugs that you wouldn’t have seen if it was just testing a subset or other types of conflicts that you’re not necessarily finding.

The thing that’s really important about end-to-end tests is that you want to be as realistic as possible. You don’t really want to run them on production, some people do, but you really want to be as close to production as possible – which includes good data, real integrations, and just your full-stack environment.

Obviously there’s a reason that Shipyard likes people doing end-to-end tests, I will dive into that a little bit more later. How do teams do end-to-end testing?

This is to each their own, but the pattern that we’ve seen a lot of, that is a good pattern to emulate – is you have your QA teams (or maybe you have developers) and they kind of have their own internal QA checklist that they start creating.

They say, ”Hey, can I log in?”, “Hey, can I log out?”, “Hey, can I change the setting?”, “Can I do that?”, “Can I do this?”, and we all have these internal ones as a developer, but then also product people and QA engineers as well.

3:34 Benjie: So first, you want to formalize that, and then when you start to get into the end-to-end automated side of things using your CI pipeline, you get to use frameworks like Cypress, or Selenium or Playwright, and what they do is they fake a browser, and the browser literally does the clicking for you.

We highly suggest you use a framework. Cypress is great, Selenium is good. Selenium has kind of been the grandfather of this whole thing. Playwright is another great option, it’s kind of the newest kid on the block.

4:14 Benjie: So how do teams typically do their end-to-end tests if you’re not using ephemeral environments?

Well they develop, then they have a pull request typically. They merge into main (or whatever their deploy branch is), and then you deploy that to staging (or some version of staging), and then you start to test on staging.

There’s a lot of reasons (that I’m actually going to let Amrisha with her real-world experience speak to) of why doing it so late in the process of your software development lifecycle can cause problems.

It’s not ideal, and the main reason is the feedback loop: there could possibly be weeks between a deployment to staging and when the actual developer made the pull request.

That’s probably a little extreme, it’s probably days, but we all know developers switching their mental model can be super tough.

5:12 Benjie: We recommend a better way. You should shift left with end-to-end tests before merge with a dedicated testing ephemeral environment per branch.

In this world, we develop, we create a pull request and immediately an ephemeral environment that is production-like and full-stack is there. You can start running your tests and get that feedback loop a lot shorter, so that’s super cool.

5:38 Benjie: “This is the way” to keep the theme from last ephemeral environment Shipyard Session.

A quick side note here: if you want to understand more about ephemeral environments and how to do them, some of the catches, and the best practices and whatnot – we actually did a prior Shipyard Session (or webinar), and you can find that on our blog or on our YouTube channel.

We dove into a lot more specifics on why use ephemeral environments and how to actually do them, and we gave a bit of a demo using Shipyard to help you on your journey.

6:19 Benjie: So “Why isn’t everyone doing ephemeral environments?” and “Why aren’t they all doing this easier, sooner version of end-to-end testing?”

Well, it’s a lot of work. You got to manage your state, you got to worry about your seed data, you got to worry about snapshotting data and moving that around.

Third-party dependencies become a real problem as well if you want to really do a full end-to-end test. You integrate with someone else and often times that is very hard to do, you need the full end-to-end test to do that. So having an ability to mock that out doesn’t always cut it.

You really need to use some advanced tools, like some of the stuff that Shipyard offers for third-party dependency stuff.

7:03 Benjie: Access controls — these are not environments that are meant for prime time. These are environments that are meant for testing and for internal usage and stakeholders playing with it, kicking the tires and whatnot. So it’s really important to have good security around these. Making these securely accessible, it’s a problem.

The other thing is that if you are creating environments on every pull request, or even in some cases, on every code change, you run into a problem where if you’re not keeping an eye on this stuff, your cloud costs are going to be extremely high.

You have to make sure that they’re up for the right amount of time and down for the right amount of time: they’re available when you need them, but they’re not just sitting around when you don’t.

Then obviously, the ongoing maintenance and oversight on the platform itself can be challenging and a full-time job, as Amrisha can dive into in a second.

7:53 Benjie: You know when you’re in a situation where you’re responsible for a production application, if your job all day long is dealing with environments for non-production, you kind of become an SRE for your internal teams and you can’t focus on the external stuff, which is where the customers live. Which is what this is all about.

So again, end-to-end testing is good for solid, secure, safe, reliable releases that you have faith in. Doing it earlier is better, but I’m going to let Amrisha take it from here.

9:25 Amrisha: Hi, my name is Amrisha. I’m the Infrastructure Engineering Team Lead at MaestroQA, and what we do is we build software for omnichannel quality assurance for modern support teams.

Our customers are MailChimp, Peloton, Zendesk, Etsy, GrubHub and many more. They use our software to improve agent performance and optimize their customer experience processes.

Something I’d like to point out is that we recently closed a Series A and we’re hiring, so if you’re interested in working with me after you hear me talk, please find me on LinkedIn and we can see if there’s any roles that fit your profile.

10:12 Amrisha: So we were beginning to build our QA team to build out our end-to-end test suite using Cypress, and after we built out our initial suite of tests, we realized that the runtime for all of those tests, end-to-end testing and all of our integrations was about 40 mins.

This was because we were running them in serial on CircleCI, and because they were taking so long, we couldn’t run them as often as we wanted to.

We needed to really bring that down, which we were able to do with CircleCI’s parallelized testing feature.

10:37 Amrisha: We brought our run time down to 5 mins, so we ran all hundred plus of our tests in under 5 mins against our staging environment.
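To illustrate why splitting a serial run across workers shortens the wall-clock time, here is a rough sketch of one common approach: hand the longest spec files out first, always to the least-loaded worker. This is not CircleCI’s implementation, and the spec names and durations are made up for the example.

```python
import heapq


def shard_specs(durations: dict[str, float], workers: int) -> list[list[str]]:
    """Assign spec files to workers, longest first, each to the least-loaded worker."""
    if workers < 1:
        raise ValueError("workers must be at least 1")
    # Min-heap of (total seconds assigned so far, worker index).
    heap = [(0.0, index) for index in range(workers)]
    heapq.heapify(heap)
    shards: list[list[str]] = [[] for _ in range(workers)]
    # Sort by duration descending, then by name so the result is deterministic.
    for spec, seconds in sorted(durations.items(), key=lambda kv: (-kv[1], kv[0])):
        load, index = heapq.heappop(heap)
        shards[index].append(spec)
        heapq.heappush(heap, (load + seconds, index))
    return shards


# Hypothetical timings in seconds: 720 seconds in serial, 360 per worker in parallel.
timings = {"login.cy.js": 300, "billing.cy.js": 240, "settings.cy.js": 120, "reports.cy.js": 60}
print(shard_specs(timings, workers=2))
```

With these made-up numbers both workers end up with 360 seconds of tests, so the run takes about half as long as running everything in serial.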

With that, we started running into new issues when our development team grew, and we were running a lot more PRs, a lot more branches, on any particular day.

With that came the need to test them, and anytime we tried to test them against our staging environment, we were running into a lot of collisions.

So either two tests would be running against one environment at the same time and driving the utilization up, or we would do a deploy mid-test from a different branch that would then change the test state.

11:34 Amrisha: We were just seeding a ton of data and every time we tried to reset it, you could monkey with another running test.

So we were just running into a lot of collisions at various levels and because of these collisions, we built in a lot of manual triggers, so it was also painful for anyone who wanted to run tests, particularly developers, who have to babysit their branches through our pipeline to try to get some data back on testing. That was the initial state that we were at.

12:02 Amrisha: Like I said, the challenge was a lot more changes, a lot more branches, a lot more PRs in flight at any particular time.

We needed to get a proper process in place for how and when we’re going to run these tests, and because we had a single environment, there was a ton of contention. So one thing we were able to solidify was: okay, regardless of whether you run it during the development process, we want to make sure that we run our test suite before it goes into production. So everything gets merged into master, and then deployed to staging.

12:37 Amrisha: Then we run our tests, but that would mean that there would be eight or nine (you know, between five and ten, depending on how many engineers we had and how we were scaling) pull requests merging in and being tested at once.

So anytime there was a failure, which was pretty often at that point, we wouldn’t know really which PR caused it or which PR was causing the regression, or whether that was intended or not.

It would delay our deploys every time this happened, because we had to go on a hunt to figure out which PR, “What do we need to revert?”, “What do we need to resolve?”, “Is it okay to deploy even with this failure because it’s a planned change in behavior that was just not caught in testing before?”

13:20 Amrisha: It was a very painful process.

Our deploys were getting delayed, and we needed to figure out how to do this testing earlier in the process before any merge and get feedback back to the developers when they were still very engaged in the changes being made, rather than hours or days later when they were being deployed.

13:47 Benjie: So it sounds like the feedback loop being that long is blocking your actual deploy. Because changing button color wasn’t that important, but it might break a test.

Whereas, you might have a critical bug fix that you needed to get up, but they’re all merged at the same time, and so all of a sudden you’re actually blocked from deploying because of your test. So it’s kind of the antithesis of the whole test thing.

You’re now moving slower because of your tests, and so that’s a common problem that everybody has. It’s a paradigm we’ve seen a lot at Shipyard. One we have ourselves, we all share this pain.

14:23 Amrisha: That’s where Shipyard came in for us. We were at the precipice of figuring out how to do this ourselves. I’m a DevOps Engineer to begin with.

I run the QA effort over at MaestroQA, but I’m a DevOps Engineer, Infrastructure Engineer, so I know how to do this. I also know that it’s going to be very difficult to manage the states of every different database. To manage uptime, to make DNS changes, networking changes, routing changes, and then also keep them secure because these are running in our cloud environment.

So I know exactly how hard this would be, but the pain to our developers and the QA process was becoming so great that it was something that we would have to figure out sooner, rather than later.

15:12 Amrisha: At that point, I was introduced to Shipyard and we started the process of seeing how well it would fit into our process.

One thing that it solved for us, which I didn’t address earlier, was we were actually running one database for our staging environment, regardless of how many test environments we pointed to it.

So we didn’t have the data isolation, and that was the cause of a lot of our pain points. With our setup with Shipyard, we have different databases running for every PR so every environment has its own truth that is different from the next environment and they don’t interfere with each other, which was key for some of our testing.
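One simple way to picture per-PR data isolation is a naming convention that gives every pull request its own database, so two test runs can never share state. The sketch below is only an illustration of that idea; the function name and the `<app>_pr_<number>` pattern are assumptions, not how Shipyard or MaestroQA actually provision databases.

```python
import re


def database_name_for_pr(app: str, pr_number: int) -> str:
    """Build a per-PR database name such as 'example_app_pr_128' (hypothetical convention)."""
    if pr_number < 1:
        raise ValueError("pr_number must be positive")
    # Collapse anything that is not a lowercase letter or digit into underscores.
    slug = re.sub(r"[^a-z0-9]+", "_", app.lower()).strip("_")
    return f"{slug}_pr_{pr_number}"


# Two open PRs get two different databases, so their seed data cannot collide.
print(database_name_for_pr("Example App", 128))
print(database_name_for_pr("Example App", 129))
```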

15:52 Benjie: So your tests are basically running on top of each other and so you’re getting flaky tests, but they actually weren’t necessarily broken. It’s just like the different test was running at a different time. That’s a huge problem that happens to a lot of people when you use a shared testing environment, even if you parallelize these tests.

I know this is an obvious thing to state, but we all kind of forget. We’ve all had that nightmare where like, all 87 tests are failing for some reason: “what happened?” and they’re like, “oh wait, I just ran a second test at the same time.”

So that isolation of data is kind of something I spoke about a second ago, but that became a really important thing for you guys?

16:32 Amrisha: Yes, the isolation of data became the main thing that we needed out of any solution. Offloading the additional overhead of managing these environments was a cherry on top, but really isolating that data, in a way that worked with our test suite, was key.

16:51 Benjie: Right now you guys use Circle with Shipyard and Cypress, is that correct?

Amrisha: Yes, we do.

Benjie: So you guys are using the CircleCI Shipyard Orb?

Amrisha: We are.

Benjie: I want everyone to know that CircleCI is where a lot of our developers interact with testing and with Shipyard. It’s also where we build our code centrally, so anytime there’s a new commit or a new PR, CircleCI does the heavy upfront work of building the code and then triggers Cypress once Shipyard is ready.

So Circle is kind of helping coordinate between GitHub itself, Shipyard, and Cypress. We use the orb to figure out when the environment is ready, and then we tell Cypress, “here are your details, here’s the UI, here’s the bypass token, go do your testing”, and then Circle also does the reporting for us and tells us when certain things are not passing.
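The “wait until the environment is ready, then hand the details to the test runner” step described above can be sketched as a polling loop. This is not the Shipyard orb or the Shipyard API: `fetch_status` is a hypothetical callable you would supply, and the `ready`, `url`, and `bypass_token` field names are assumptions made for the example.

```python
import time
from typing import Callable


def wait_until_ready(
    fetch_status: Callable[[], dict],
    timeout_s: float = 900.0,
    interval_s: float = 10.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> dict:
    """Poll fetch_status() until it reports ready, then return the environment details."""
    deadline = clock() + timeout_s
    while True:
        status = fetch_status()
        if status.get("ready"):
            # These are the details a test runner needs to reach the environment.
            return {"url": status["url"], "bypass_token": status["bypass_token"]}
        if clock() >= deadline:
            raise TimeoutError("environment was not ready before the timeout")
        sleep(interval_s)


# Simulated responses standing in for a real status endpoint.
responses = iter([
    {"ready": False},
    {"ready": True, "url": "https://pr-128.example.test", "bypass_token": "token-123"},
])
details = wait_until_ready(lambda: next(responses), sleep=lambda _seconds: None)
print(details["url"])
```

Passing `sleep` and `clock` in as parameters keeps the loop easy to test without real waiting; in a CI job you would leave the defaults in place and pass the returned details to the test runner, for example as environment variables.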

17:56 Benjie: So what we were able to accomplish with this is that now every PR that is open into our main branch is getting an isolated environment with isolated data. It all happens automatically: if I push a new commit, then my branch is automatically updated within about 15 minutes of that commit, without any babysitting of the CircleCI pipeline. It kind of takes care of itself. I don’t have to barter with anybody to take over a QA environment, at least before it’s merged into master and has to go into our pre-prod staging environment.

One thing that happened is that we had been running tests once before every deploy to production, which would be two or three times a day, and when we built out the Shipyard pipeline, we started running them about five or six times an hour.

We discovered that some of our tests, which would show up flaky once in a while, were actually really flaky when running so many times in a day. We were able to go and prune some of our testing and really do the audit of why this test is flaky and what we are doing here, and bring that test suite down to a core group of tests that we knew were stable and indicative of an issue. And then, because Shipyard is kind of taking care of this pipeline from a DevOps perspective, we’re able to focus more on production, and testing is happening seamlessly.

Amrisha: Developers are more able to focus on the code that they’re working on, because they don’t have to babysit things through CircleCI. The full end-to-end automation is actually pretty nice for us.

19:32 Benjie: I know one thing that we talked about is that one of the challenges for you, the person kind of overseeing the QA stuff, is building, getting internal usage from the developers themselves to trust the tests and to use the tests. I know that you said earlier that maybe that’s getting a little bit better. I mean, have you seen an uptick in people leaning on end-to-end tests for deploys and helping with confidence and overall process?

Amrisha: It’s much easier to get engagement with test results now that they’re running more often.

We’re directly pushing the results from these tests into the PR, so we’re making it really easy for developers to get data on whether their branches are passing tests or not. Giving them this data is one of the hurdles that we had to cross in order to get more adoption and more feedback around our QA process itself. So the engagement has definitely increased. The adoption we’re still waiting to measure.

20:43 Benjie: Any other cool side effects that you’ve seen from Shipyard or from ephemeral environments or end-to-end tests in general? It’s not just about Shipyard obviously, are there any other insights? Because I feel like there’re some people that are watching this that are about to take on the pain that you just described. So any other like helpful things? Is there anything else on the end-to-end test side that you’re like, “oh don’t do this”, or “you should do this as soon as possible”?

Amrisha: One thing that I think is key to point out is that we reduced the number of tests that we were running. We knew they were somewhat flaky before we started running them so often, but the extent wasn’t clear until we were running them on every commit of every PR. There were just some tests that were not able to keep up, because they were resource-intensive or because there was some flakiness in how we were doing the test. Getting the amount of data that we got from the number of runs that we did was necessary in order to get that feedback on our test suite itself.
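One way to turn that volume of runs into feedback on the suite itself is to count how often a test both passes and fails on the same commit, since the code did not change between those runs. The sketch below assumes run records of the form (test name, commit SHA, passed); the record shape and the sample data are invented for illustration.

```python
from collections import defaultdict


def flake_rates(runs: list[tuple[str, str, bool]]) -> dict[str, float]:
    """Return, per test, the fraction of commits on which it both passed and failed."""
    # outcomes[test][commit] is the set of results seen for that test on that commit.
    outcomes: dict[str, dict[str, set[bool]]] = defaultdict(lambda: defaultdict(set))
    for test, commit, passed in runs:
        outcomes[test][commit].add(passed)
    return {
        test: sum(1 for seen in by_commit.values() if len(seen) == 2) / len(by_commit)
        for test, by_commit in outcomes.items()
    }


# Hypothetical history: "export" disagrees with itself on commit a1, "login" never does.
history = [
    ("login", "a1", True), ("login", "a1", True),
    ("export", "a1", True), ("export", "a1", False),
    ("export", "b2", True), ("export", "b2", True),
]
print(flake_rates(history))
```

A test with a high rate here is a candidate for the kind of audit described above: add guards, rewrite it, or remove it from the core suite.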

So it isn’t just a one-way feedback process where we were running the tests and then telling developers what’s failing. It’s also two-way, in that it’s telling us as a QA team whether the tests we’re writing are robust or not. Given this ephemeral environment end-to-end testing, the amount of testing we’re doing is just increasing how much data we have as a whole team around our tests, our commits, and what workflows we’re testing. It’s just a wealth of data for us to draw insights from and act on.

So one thing was going back to QA and saying, “okay, do we really need all of these tests?”, “why are they always so flaky?”, “should we be doing them differently, putting in guards, removing them entirely, and re-approaching it a different way?”, and that also shows developers that this isn’t just something to dump data into their PR.

We’re looking at this as well, and it’s a shared responsibility.

22:46 Benjie: I mean one thing that we struggle with, or that I think every team struggles with, is: are these flaky tests actually a problem or not a problem? So shortening that feedback loop for the tests themselves, forget the code, but the tests themselves, we’ve seen a lot more ownership on the developer side. They’re stepping in, helping out with tests, and we’ve started getting really good about our end-to-end testing lately, partially because of working with you on this case study and this Shipyard Session, so thank you again for coming. We’re just seeing really good results where we’re leaning on the tests more. I think one of the things that I would say as an engineer, well, as a former engineer, is that we’re building confidence with the developers themselves, so that they’re actually going from “oh, testing is just this pain that slows everything down and works one time in a hundred” to “it’s something that I can rely on”.

Unit tests don’t cover everything, but you know that your database field is a string, or a number, or an integer, or whatever with unit tests. We’re kind of getting there a bit with our other side.

Amrisha: Well, one thing to add to that is that we have the Cypress and GitHub integration turned on, and when we run our Shipyard Cypress tests (that’s what we call them), we run them twice. So we run them on Shipyard, then we run them in our staging environment, so we have them as two different items in our CircleCI config.

The Shipyard/Cypress integration dumps into the PR directly, so if I’m reviewing a PR and I see that I have my ticket comment, my Shipyard comment, and my Cypress comment, it’s a lot easier for me to judge whether this PR is ready for merge or not. When I know that all my Cypress tests ran and all of them passed, that’s a good indicator that there’s a level of stability already here in this PR, which gives me confidence with approving it without needing to really dig into that.

Not that I’m reviewing a lot of the development PRs; I’m really reviewing some of the more DevOps-y stuff that’s in the repo. But if I know that I’m making a change that in some way tangentially affects the code, and I see tests pass for me, then I feel really comfortable about those changes myself, as a person making the change, but also as a reviewer.

Q&A

26:22 Ashley: So the first question we have in the chat is: can you use other tools other than Cypress, GitHub Actions, or CircleCI with Shipyard?

Benjie: So the way that it works under the hood is that Shipyard has an API that is kind of a “hey, go give me an environment, give me a token so I can bypass the SSO that is associated with every environment, and then tell me when it’s ready and I can go out and do that.” So you can do that with Argo, you can do Jenkins, we do have a GitHub Action that we actually released a week or two ago. Shout out to [Shipyard Engineer] AK on the team, he did a great job getting that lined up in the docs, we’ll be doing a blog post on that.

No matter what your CI system is, you can use Shipyard. We do currently have an orb for Circle and an Action for GitHub Actions. We do not yet have one for Jenkins. It’s ultimately a bash script, and we’re happy to help if you reach out.

27:47 Ashley: If I have multiple repositories front and backend, can I still use the same workflow that Amrisha was using: Shipyard, Cypress, CircleCI?

Benjie: Yeah, Amrisha can actually talk to maybe the future of what they want to do over at Maestro, but I can say that one of the things that’s really valuable for us, but also for a bunch of our other customers, is we have multi-repo support.

So if you want to do it as a mono-repo where you have all your code and all your infrastructure in one repository, we support that, but we also have the ability to mix and match repos, so there’s a use case of having a frontend repository, a backend repository, and maybe let’s say a data science repository or inference engine or something like that. You put all three of those together and when there is a change or a PR on any of those, it’ll trigger a brand-new environment and you could run your end-to-end test suite off of all that.

Honestly, kind of the mecca of all of this is testing off a multi-repo, because that’s where your frontend code lives, which is often very independent from your backend code. So that’s really the power of end-to-end tests: when you can test all those things without having to have a human do anything, and it kind of just works.

Amrisha – do you think you guys have some plans to start using mono-repo potentially?

Amrisha: Yeah, depending on how testing goes for those repos, but yeah there’s an opportunity for us to leverage this as well.

29:18 Ashley: Next question is for Amrisha: have you been able to measure an increase in the quality of your releases since you embraced end-to-end testing?

Amrisha: We’re not measuring an increase in the quality, but we’re looking for more adoption of the QA process within our internal team itself, so QA has been working on its own to build out our test suite. The main metric I’m looking for is getting more holistic adoption of QA itself, and doing testing as part of our development process, rather than the quality of our application.

29:56 Ashley: There is one more question on end-to-end tests at Maestro, which is: do you have a rule of thumb for what you end-to-end test?

Amrisha: Yes we do, we look for our top ten workflows on what our customers are using across different configuration buckets, and we try to implement those so that we can validate that the top ten most important workflows for our customers are validated on each build.

30:28 Benjie: I have a follow-up to that one: how do you know what the top ten workflows are? What do you guys use to measure that?

Amrisha: We have logs and metrics on what our customers are doing, so we analyze our production data and we translate that into what it would look like in the staging environment, and then write automated tests across those.
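Ranking workflows by how often they show up in usage data can be as simple as a frequency count. The sketch below assumes each log event has already been reduced to a workflow name; the event names are invented and this is not MaestroQA’s actual analysis.

```python
from collections import Counter


def top_workflows(events: list[str], n: int = 10) -> list[tuple[str, int]]:
    """Return the n most frequent workflow names with their counts, most frequent first."""
    return Counter(events).most_common(n)


# Hypothetical production events, one workflow name per logged action.
events = ["login", "export_report", "login", "edit_settings", "login", "export_report"]
print(top_workflows(events, n=2))
```

The resulting list is a starting point for deciding which workflows deserve an automated end-to-end test first.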

Benjie: Super cool. We’ve been playing with FullStory a little bit here and there and some other ones, but not sure what the right combo is for figuring that out. Very good. Ashley, do we have anything else, or is it wrap-up time?

31:09 Ashley: I think that’s a wrap!

31:13 Benjie: Alright great, well I want to give a big thanks to Amrisha and MaestroQA in general.

A few plugs here: check out MaestroQA.com, they are hiring.

And Shipyard, sign up if you want to check it out, go to Shipyard.build. Check out our blog, we’ve also got a case study on Maestro covering the details on what we talked about today.

As always, make sure to check out our community site for ephemeral environments, EphemeralEnvironments.io. PRs welcome!

Right now, we’re at the early stages of kind of giving everybody in the community the ability to do end-to-end tests early on, with a full-stack environment, and also giving product owners and other stakeholders access to features as they’re being worked on. EphemeralEnvironments.io is not necessarily for Shipyard, it’s a general resource for the community at large.

32:32: Down there is a link for the Shipyard Slack and our GitHub. We’ve got some sample repositories, some other cool stuff, and our Twitter of course.

Again, I want to thank Amrisha from MaestroQA for coming, and thank you all for attending as well. Have a great afternoon and feel free to reach out. Getting your testing right is an investment, but also has a huge payoff. Hopefully some of this was insightful for you and you can use it every day for yourselves.
