The Flaky Test That Fixes Itself: Bitbucket's Agentic Pipelines Now Speaks Claude

June 8, 2026

Atlassian

Bitbucket

Automation

Developer Experience (DevEx)

Autodev

A macro view of a circuit board's interconnected pathways, a nod to the under-the-hood machinery of a CI/CD pipeline that now quietly handles its own repetitive chores.

Every engineering team has a ghost in the build. It's that one test that passes on Tuesday, fails on Wednesday, and passes again on Thursday for reasons no one can name. So we do the thing we all swore we'd never do: we hit "re-run" until the build goes green, mutter something about timing, and move on. The test never gets fixed. CI trust quietly erodes. And six weeks later, half the team treats a red build the way we treat a "check engine" light — technically important, practically ignored.

Atlassian just shipped something that goes straight for that ghost. As of mid-May, Bitbucket's Agentic Pipelines (the feature that hands repetitive engineering chores to AI agents) supports Claude as a provider alongside Atlassian's own Rovo Dev. Teams already using Claude Code can plug it directly into Bitbucket Pipelines without extra infrastructure or integration glue. The mechanism is almost suspiciously simple: add provider: claude to your pipeline configuration. Leave that line out, and Bitbucket defaults to Rovo Dev. No new platform project. No quarter-long rollout.

I'll be honest about why this caught my attention, and it isn't the headline. It's what it says about where Bitbucket is heading.

From "build and test" to "handle the boring stuff"

For years, Pipelines was a CI/CD engine: build, test, deploy, repeat. Agentic Pipelines reframes Bitbucket as an orchestration platform for AI agents that handle the low-value, high-effort, repetitive work surrounding code — things like updating READMEs, triaging security reports, cleaning up feature flags, and generating PR descriptions. The stuff that's too small to schedule and too annoying to enjoy.

There's a number Atlassian likes to cite here, and it stuck with me: development teams spend roughly 84% of their day doing things other than writing code. If even a slice of that 84% can be delegated to an agent embedded right in the pipeline, you're not "adding AI" for the brochure — you're giving expensive, talented people their afternoons back. That's the anti-hype version of the AI story, and it's the one I actually believe in.

The flaky-test agent, and the part that matters

Atlassian's flagship example is a flaky-test remediation agent, and the workflow is worth walking through because the final step is the whole point.

You point it at a flaky test. The agent pulls the test's full execution history, failure patterns, and surrounding code context, runs the test to observe the failure firsthand, hypothesizes a likely root cause — timing issues, shared state, environment dependencies, test ordering — proposes a plan, makes targeted changes, and re-runs the test to verify the fix holds.

And then it stops. It opens a draft pull request and tags the person who triggered the fix as the reviewer. That person reviews the change and merges it.

Read that again, because it's the difference between a tool I'd recommend to a client and one I'd quietly steer them away from. The agent does the archaeology nobody wants to do. The human keeps the judgment and the merge button. That's not a limitation Atlassian forgot to remove — that's good design. Agents are extraordinary at tedious investigation. They are not your release manager.
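To make the history-gathering step concrete, here is a minimal Python sketch of how a test's run history might be screened for flakiness before anyone, human or agent, starts digging. It is an illustration only, not Atlassian's implementation: the RunRecord shape, the function names, and the 0.5 threshold are all assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunRecord:
    """One recorded execution of a test (hypothetical record shape)."""
    commit: str   # commit the test ran against
    passed: bool  # outcome of that run


def flip_rate(runs: list[RunRecord]) -> float:
    """Fraction of consecutive run pairs whose outcome changed."""
    if len(runs) < 2:
        return 0.0
    flips = sum(1 for a, b in zip(runs, runs[1:]) if a.passed != b.passed)
    return flips / (len(runs) - 1)


def same_commit_disagreement(runs: list[RunRecord]) -> bool:
    """True if any single commit has both a passing and a failing run."""
    outcomes: dict[str, set[bool]] = {}
    for run in runs:
        outcomes.setdefault(run.commit, set()).add(run.passed)
    return any(len(seen) == 2 for seen in outcomes.values())


def looks_flaky(runs: list[RunRecord], min_flip_rate: float = 0.5) -> bool:
    """Heuristic screen: mixed results on one commit, or frequent flipping.

    The 0.5 default is an arbitrary illustrative threshold, not a standard.
    """
    return same_commit_disagreement(runs) or flip_rate(runs) >= min_flip_rate
```

A test that both passed and failed on the same commit is the strongest signal here, since the code did not change between runs. A single pass-to-fail transition that sticks looks more like a real regression than a flake, which is why the flip-rate threshold is deliberately not set too low.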

The fine print I'd make every client read first

Here's where the trusted-advisor hat goes on. Running Claude Code through Agentic Pipelines falls under Atlassian's third-party product terms, which means your source code, prompts, and logs are sent to Claude Code, and Atlassian states it is not responsible for the privacy, security, or costs involved on that side.

For a lot of teams, that's a perfectly reasonable trade. For a regulated client, a defense contractor, or anyone with strict data-residency commitments, it's a conversation to have before anyone types provider: claude — not after. The capability is excellent. Knowing exactly what leaves your boundary, and getting sign-off to send it, is just due diligence. The Rovo Dev default keeps that data inside the Atlassian estate, which is why it's the sensible starting point for the cautious.
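One way to enforce "conversation first, config second" is a small pre-merge check that flags the provider line whenever no sign-off is on record. The Python sketch below is hypothetical: it only scans text for a line reading provider: claude, makes no assumptions about the rest of the Bitbucket configuration schema, and leaves how a sign-off gets recorded (the signed_off flag) up to your own process.

```python
import re

# Matches a YAML-style line that sets provider to claude, e.g. "provider: claude".
# Leading whitespace, a list dash, quotes, and a trailing comment are tolerated.
# Lines that are themselves commented out do not match.
CLAUDE_PROVIDER = re.compile(
    r"""^\s*(?:-\s*)?provider\s*:\s*["']?claude["']?\s*(?:#.*)?$"""
)


def claude_provider_lines(config_text: str) -> list[int]:
    """Return 1-based line numbers where provider is set to claude."""
    return [
        number
        for number, line in enumerate(config_text.splitlines(), start=1)
        if CLAUDE_PROVIDER.match(line)
    ]


def governance_problems(config_text: str, signed_off: bool) -> list[str]:
    """List one problem per provider: claude line when no sign-off exists.

    signed_off is a stand-in for however your team records approval
    (a ticket, a file in the repo, a checkbox); that part is not modeled.
    """
    if signed_off:
        return []
    return [
        f"line {number}: provider: claude is set but no sign-off is recorded"
        for number in claude_provider_lines(config_text)
    ]
```

An empty result means either the Rovo Dev default is in use or the decision to send data to a third party has already been approved.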

The Avaratak Take

If you want to try this without turning your CI into a science experiment, here's how we'd roll it out:

Start with exactly one chore. Resist the urge to automate everything you've ever resented. Pick a single repetitive task — flaky-test cleanup is a great first candidate — and let the agent earn trust there before you expand its job description.
Keep the human in the loop on purpose. The draft-PR-and-reviewer pattern isn't training wheels; it's the operating model. Treat agent output like a sharp junior engineer's first pass: frequently right, occasionally confidently wrong, always worth a read.
Settle data governance up front. Decide whether provider: claude or the Rovo Dev default fits your compliance posture, and write that decision down where your team can actually find it.
Measure the thing. Track how many flaky tests actually get retired, and how many agent PRs you merge versus reject. Good numbers make the case for expanding. Bad numbers mean you learned that cheaply, for the price of one line of config.
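For the measurement step, a tally like the following Python sketch is enough to get started. The AgentPR record and its outcome labels are assumptions for illustration; in practice you would fill them from whatever PR export or tracking sheet your team already keeps.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class AgentPR:
    """One agent-opened pull request (hypothetical record shape)."""
    test_name: str  # the flaky test the PR targets
    outcome: str    # "merged", "rejected", or "open"


def summarize(prs: list[AgentPR]) -> dict:
    """Count outcomes and compute the merge rate over decided PRs."""
    merged = [pr for pr in prs if pr.outcome == "merged"]
    rejected = [pr for pr in prs if pr.outcome == "rejected"]
    still_open = [pr for pr in prs if pr.outcome == "open"]
    decided = len(merged) + len(rejected)
    # No rate is reported until at least one PR has been merged or rejected.
    merge_rate: Optional[float] = len(merged) / decided if decided else None
    return {
        "merged": len(merged),
        "rejected": len(rejected),
        "open": len(still_open),
        "merge_rate": merge_rate,
        # Distinct tests with at least one merged fix: a rough proxy for
        # "retired", since a merged fix does not prove the flake is gone.
        "tests_with_merged_fix": len({pr.test_name for pr in merged}),
    }
```

Open PRs are kept out of the merge rate so that a backlog of unreviewed drafts does not make the agent look worse, or better, than it is.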

It's still in open beta, and Atlassian has signaled that Claude Code is only the first third-party CLI, with more to come. Translation: the orchestration layer is opening up, and the teams that get comfortable now will have a head start when the catalog grows.

The exciting part isn't that your pipeline can suddenly think. It's that it can finally take the chores off your plate so your people can do the work only people should be doing. The agent fixes the flaky test. You decide whether the fix is right.

Always with your best interests first. That part doesn't get automated.

Weighing where AI agents fit in your Atlassian stack — and which guardrails belong around them? That's exactly what we do at Avaratak.
