Using ephemeral environments to sandbox agentic workflows

When you’re coding with AI agents, you’ll need infrastructure to host their code changes so you can collect feedback, run tests, and confirm each feature performs as intended. Here’s what sandboxing looks like in agentic workflows, and why ephemeral environments in particular are a popular choice for it.

What is sandboxing?

Sandboxing is the practice of deploying your code changes to an isolated environment and observing how they behave under testing and normal use. Typically, sandbox environments closely mirror production, especially in infrastructure, services, data, and networking, so they can reveal what might go wrong in prod after a new feature rollout. Sandboxing is especially useful for code changes that might be risky or harmful (particularly around network vulnerabilities and open source libraries) and that need close monitoring before they can be trusted in a codebase.

Why do I need to sandbox my agentic workflows?

LLMs are prone to hallucinations, an inevitable byproduct of how they generate tokens. That’s only one of the risks of AI code generation: LLMs might also suggest poor-practice implementations, leaving your code vulnerable or poorly optimized. In addition to manually reviewing every AI code suggestion, you’ll want to extensively test and QA these changes before deploying them.

Sandboxing allows you to take risks with new features. Because sandboxed changes are decoupled from your other services and dependencies, you won’t need to worry about introducing vulnerabilities or taking down infrastructure.

Additionally, because agents write code so quickly, you’ll want to keep feedback loops short. As soon as your agent has written or revised a code snippet and you’ve reviewed it, you’ll want to deploy it to a sandbox and see whether it performs as intended.

The agentic workflow loop

As a developer, you want to stay in flow as much as you can. If you’re not in flow, it’s easy to lose context on a complex feature, so you’re best off tackling large chunks (or the whole thing) in one sitting. This especially applies to working with agents.

Here’s how you can design a fast (and reliable) inner loop.

1. Making a commit

When working with (or without) agents, creating small, standalone code changes is the best way to move fast while still getting sufficient testing and review. Splitting features into smaller tasks and committing them one by one is much safer than pushing large chunks. There’s a lower chance of bugs slipping through, and you (and your reviewers) can give every line more attention, thanks to the lower cognitive load.
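One way to keep commits small is to measure them before they land. The sketch below is only an illustration: it parses the summary line that `git diff --shortstat` prints and compares the total against a line budget. The 200-line threshold and the function names are assumptions made up for this example, not a recommendation from any tool.

```python
import re

# Hypothetical budget: tune it to whatever "small" means for your team.
MAX_CHANGED_LINES = 200


def changed_lines(shortstat: str) -> int:
    """Sum the insertions and deletions reported by `git diff --shortstat`."""
    insertions = re.search(r"(\d+) insertion", shortstat)
    deletions = re.search(r"(\d+) deletion", shortstat)
    return sum(int(m.group(1)) for m in (insertions, deletions) if m)


def is_small_change(shortstat: str, limit: int = MAX_CHANGED_LINES) -> bool:
    """Return True when the change fits within the line budget."""
    return changed_lines(shortstat) <= limit


# Example summary line in the format git prints.
sample = " 3 files changed, 42 insertions(+), 7 deletions(-)"
print(changed_lines(sample), is_small_change(sample))
```

In practice you might feed it the output of `git diff --cached --shortstat` from a pre-commit hook and ask your agent to split the task whenever the check fails.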

2. Pushing the commit to a sandbox

As soon as you push a code change, it should deploy to your sandbox environment via GitOps event triggers. Here, you’ll be able to see whether your code changes behave in the sandbox the way they do in your dev environment (the sandbox will have closer parity to prod than to your dev environment, so don’t be surprised if things break here).
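To make the event trigger concrete, here’s a minimal sketch of the decision a push webhook handler might make. The field names (`ref`, `after`, `deleted`, `repository.full_name`) follow the shape of a GitHub push payload; the `DeployPlan` type and the `sandbox-<branch>` naming scheme are assumptions invented for this example, and none of this is Shipyard’s API.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class DeployPlan:
    repo: str
    branch: str
    commit: str
    env_name: str  # hypothetical naming scheme, for illustration only


def env_name_for(branch: str) -> str:
    """Turn a branch name into a DNS-friendly environment name."""
    slug = re.sub(r"[^a-z0-9]+", "-", branch.lower()).strip("-")
    # DNS labels are limited to 63 characters.
    return f"sandbox-{slug}"[:63].rstrip("-")


def plan_from_push(event: dict) -> Optional[DeployPlan]:
    """Map a push payload to a deploy plan, or None when nothing should deploy."""
    ref = event.get("ref", "")
    if not ref.startswith("refs/heads/"):
        return None  # tags and other refs don't get a sandbox
    if event.get("deleted"):
        return None  # the branch was deleted, so there is nothing to deploy
    branch = ref[len("refs/heads/"):]
    return DeployPlan(
        repo=event["repository"]["full_name"],
        branch=branch,
        commit=event["after"],
        env_name=env_name_for(branch),
    )
```

Keeping this step a pure function makes it easy to unit test; the code that actually provisions the environment can live behind whatever platform you use.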

Pro tip: Shipyard is a great sandbox for agentic workflows. Push a commit and you’ll get an environment spun up and ready to use.

3. Running your test suite

Once your code is deployed to the sandbox, you can run your full test suite on the spot. If any tests fail, you can quickly push fixes and keep trying until your pipeline is green. Ideally, you get your tests to pass before looping in any peers for code review or user acceptance testing (UAT), so that you minimize the back-and-forth needed to get your code ready.
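Running tests "on the spot" usually means waiting until the freshly deployed sandbox is actually serving traffic. Here’s a rough sketch of that wait, assuming the sandbox exposes some URL you can probe. The helper names, timeout, and polling interval are all hypothetical choices for this example.

```python
import time
import urllib.request
from typing import Callable


def http_ok(url: str) -> bool:
    """Return True if the URL answers with a 2xx status."""
    try:
        with urllib.request.urlopen(url, timeout=10) as resp:
            return 200 <= resp.status < 300
    except OSError:  # covers connection errors, timeouts, and HTTP error statuses
        return False


def wait_until_ready(
    is_ready: Callable[[], bool],
    timeout_s: float = 300.0,
    interval_s: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Poll is_ready() until it returns True or timeout_s has elapsed."""
    deadline = clock() + timeout_s
    while True:
        if is_ready():
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)
```

A CI step could call `wait_until_ready(lambda: http_ok(sandbox_url))`, where `sandbox_url` is whatever address your environment reports, and only start the test runner once it returns `True`.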

4. Getting code review + feedback

Your sandbox environment is a good supplement to your open PR for code review. Instead of replicating your local dev environment, your reviewer can click a link and see the code changes running in the sandbox. A live environment gives good context for the changes you’ve made, and it makes them easy to compare against your production app.

Sandbox environments aren’t exclusive to code review. You can also share them with your designers or product managers to make sure the frontend and UX are up to par with what they requested.

5. UATing that commit

When you feel confident in a commit or feature, you can send the product owner a link to the sandbox environment for UAT. They can test user workflows and make sure the change fulfills all requirements. This is usually the final gate before a feature gets deployed.

6. Deploying to production

Now that you’ve gotten automated feedback from your CI/CD pipeline as well as manual review, your feature is production-ready! Your sandbox environment should automatically spin down upon a merge and deployment. If you need to push a hotfix, you can bring the environment right back.
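The spin-up and spin-down lifecycle can be pictured as a small mapping from pull request events to environment operations. This is a sketch under assumptions: the action strings match the ones GitHub sends on `pull_request` webhooks, while the `EnvAction` enum and the decision to treat a reopened PR as "bring the environment back" are illustrative choices, not a description of how any particular platform works.

```python
from enum import Enum


class EnvAction(Enum):
    CREATE = "create"
    UPDATE = "update"
    DESTROY = "destroy"
    NONE = "none"


def action_for_pr_event(action: str) -> EnvAction:
    """Decide what to do with a PR's sandbox for a given webhook action."""
    if action in ("opened", "reopened"):
        return EnvAction.CREATE  # a new or revived PR gets an environment
    if action == "synchronize":
        return EnvAction.UPDATE  # new commits were pushed to the PR branch
    if action == "closed":
        return EnvAction.DESTROY  # merged or abandoned: spin the sandbox down
    return EnvAction.NONE  # labels, comments, and so on don't affect the sandbox
```

Since a merged PR also arrives as `closed`, one rule covers both the merge and the abandoned-branch case.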

Ephemeral sandbox environments: the easy choice

Sandbox environments have been around for a while, and there are several ways to build and manage them. Nowadays, industry best practice is trending toward making them ephemeral, or short-lived.

Agentic workflows need realistic infra and quick feedback, which is why they’re well-suited to ephemeral sandbox environments. Getting an environment within minutes of you (or your agent) committing code is critical, as is having an environment that can support E2E tests. When you need review from multiple team members, an ephemeral environment makes setup and teardown easy, and it can be available whenever they need it, regardless of time zone.

The best part? Ephemeral sandbox environments exist only while you need them, which helps you avoid excessive cloud costs.
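A quick back-of-the-envelope calculation shows where the savings come from. Every number below is invented for illustration (the hourly rate, the count of environments, and how long each one lives), so plug in your own figures.

```python
HOURS_PER_MONTH = 730  # roughly 8,760 hours per year divided by 12


def always_on_cost(hourly_rate: float) -> float:
    """Monthly cost of one sandbox that never shuts down."""
    return hourly_rate * HOURS_PER_MONTH


def ephemeral_cost(hourly_rate: float, envs_per_month: int, avg_hours_per_env: float) -> float:
    """Monthly cost of sandboxes that only run while a change is in review."""
    return hourly_rate * envs_per_month * avg_hours_per_env


rate = 0.50  # hypothetical dollars per environment-hour
print(always_on_cost(rate))
print(ephemeral_cost(rate, envs_per_month=20, avg_hours_per_env=6))
```

The comparison only favors ephemeral environments while total environment-hours stay below what an always-on sandbox would run, which is why automatic teardown matters.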

Shipyard: instant testing + review for GenAI workflows

Agentic workflows thrive in sandbox environments. After all, you’ll need infrastructure to preview, test, and verify your LLM-assisted code changes (or any code changes). Better yet, using ephemeral environments for sandboxing keeps your feedback loops short and productive, because you can run all of your quality checks in the same environment. Sandbox environments are a huge asset for testing risky (read: LLM-written) code: since they’re entirely detached from the rest of your infrastructure, you can push them to their limits without worrying about breaking anything.

Shipyard gives you full-stack environments on every branch and/or PR, and these environments automatically update to reflect your latest commits. Sandboxing is often a heavy lift, especially when it comes to configuration and environment management. Shipyard abstracts away that complexity, so you spend more time developing and testing and less on DevOps. But don’t take our word for it! Try it yourself free for 30 days.
