Agentic AI for Mobile QA: How AI Agents Test Apps (Maestro)

Look, I’ll be straight with you. I’ve been vibe coding in Cursor for about a year and a half, and my workflow has basically been deploy, deploy, deploy.
I don’t even do manual testing. I’m not proud of it, but I know I’m not the only one.
I know most of the little apps I’m building aren’t enterprise production systems.
But it did get me thinking.
If I’m skipping testing because AI makes it so easy to move fast, how many teams are doing the same thing on software that actually matters?
That’s a little scary.
So when I sat down with Leland Takamine, co-founder and CEO of Maestro, and watched a coding agent build a feature, test it, catch its own mistake, and write a permanent end-to-end test on its own, I had the same reaction I think a lot of you are about to have: wait, this actually works?
Leland has plenty of street cred in the mobile space.
Before co-founding Maestro, he was a Mobile Platform Tech Lead at Uber.
This post breaks down what agentic AI testing really means across the mobile testing lifecycle and for mobile QA teams.
It covers why it’s showing up now, and how Maestro is one of the AI testing tools helping make it real on iOS and Android.
It’s pulled straight from my full conversation with Leland on episode a589 of the podcast.
By Joe Colantonio, host of the TestGuild Automation Podcast, 25+ years in software testing, and founder of TestGuild. This post is based on Episode a589 of the TestGuild Automation Podcast, an exclusive sit-down with the creator of Maestro.
Who this is for: If you’re a mobile developer or QA engineer already using a coding agent, or thinking about it, this one’s for you.
Key Takeaways
- Agentic mobile testing means the AI agent writes the code, generates a test, runs it on a live device, catches its own mistakes, and saves a permanent test, with no human tester tapping through the app.
- Maestro MCP is the open source tool that gives coding agents like Cursor and Claude Code the ability to drive iOS simulators and Android emulators directly.
- Deterministic YAML beats running AI at test time for CI/CD and regression testing pipelines.
- Maestro is open source, has over 14,500 GitHub stars, and is used in production at Microsoft, Meta, Amazon, and DoorDash.
What is agentic AI in mobile app testing?
Agentic AI testing is when an AI agent does more than write a test. It writes the test, runs it against a live device, watches what happens, and fixes its own work until the feature behaves the way it should. No human tapping through the app in between.
Here’s how Leland described the old way versus the new way. As a human engineer, you write the code and the test suite, build it, run it on an Android emulator or iOS simulator, and click through it yourself to make sure it works. If it doesn’t, you go back to the code, fix it, and test again, and you keep going until it’s right.
Now an autonomous agent generates that code for you.
But to really get value out of an agent, you want it running that whole loop on its own, exercising your user journeys.
As Leland put it, the agent needs “some way to do what you used to do, which is tap through the phone, use the actual application on a device, and verify that it’s functioning as intended.”
That’s closing the agentic loop. The agent generates the code and checks its own work.
In other words, autonomous AI agents aren’t just generating code anymore. They’re starting to validate their own work.
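To make the loop concrete, here’s a minimal Python sketch of the propose, run, check, fix cycle. Everything in it is hypothetical: `stub_agent` stands in for the coding agent, and `verify` stands in for driving the app on a simulator and checking the screen. It isn’t how Maestro or any particular agent is implemented, just the shape of the loop.

```python
from typing import Callable, Optional

# Hypothetical stand-ins: a "feature" is a function that deletes a bookmark.
Bookmarks = list[str]
DeleteImpl = Callable[[Bookmarks, str], Bookmarks]


def buggy_delete(bookmarks: Bookmarks, post_id: str) -> Bookmarks:
    # First draft: forgets to actually remove the post.
    return list(bookmarks)


def fixed_delete(bookmarks: Bookmarks, post_id: str) -> Bookmarks:
    # Second draft: removes the swiped post and keeps everything else.
    return [b for b in bookmarks if b != post_id]


def stub_agent(feedback: Optional[str]) -> DeleteImpl:
    # Stand-in for the coding agent: it "fixes" its code once it sees a failure report.
    return buggy_delete if feedback is None else fixed_delete


def verify(impl: DeleteImpl) -> tuple[bool, Optional[str]]:
    # Stand-in for "run it on the device and look at the screen".
    after = impl(["post-1", "post-2"], "post-1")
    if after == ["post-2"]:
        return True, None
    return False, f"expected ['post-2'] after swipe-to-delete, saw {after}"


def closed_loop(propose, check, max_iterations: int = 5):
    """Propose, run, check, and feed failures back until the check passes."""
    feedback = None
    for iteration in range(1, max_iterations + 1):
        candidate = propose(feedback)
        passed, feedback = check(candidate)
        if passed:
            return candidate, iteration
    raise RuntimeError(f"gave up after {max_iterations} iterations: {feedback}")


impl, iterations = closed_loop(stub_agent, verify)
print(f"passed on iteration {iterations}")  # passed on iteration 2
```

The part that matters is that the check runs against real behavior, not against the agent’s own description of what it did.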
I described it back to him as automated testing with a much tighter cycle, where the AI agent is a non-human consumer of the app. He said yeah, exactly.
Why agentic testing is showing up now
For as long as I can remember, mobile testing has been hard. For years, mobile automation promised a lot more than it delivered for many teams.
The tooling wasn’t there. Building real end-to-end coverage was expensive, which is one reason so many mobile teams struggled to maintain automation over time.
What’s different now is that coding agents created a brand-new need for it.
As AI generates more code changes, teams need a reliable way to validate those changes before they reach users.
Leland’s point stuck with me. Engineers are going more and more hands off and delegating more and more to agents.
That shift is forcing teams to rethink QA workflows that were designed around humans doing most of the validation.
So if testing workflows don’t adapt to that shift, people are going to stop testing.
Why?
Because velocity is the currency for engineering teams right now.
That’s what agents give them. And anything that breaks that, anything that becomes a bottleneck, gets thrown out. His words: “if testing becomes this bottleneck, people are gonna say, well, I’m not gonna do that anymore.”
So agentic testing isn’t a nice-to-have bolted on top of AI coding.
It’s one of the first examples of AI-powered testing that feels practical instead of promotional.
It’s the thing that keeps testing from being the speed bump that teams abandon.
Why traditional mobile automation breaks down here
If you’ve done mobile automation the old way, you already know the pain.
You’re writing test scripts and test cases in a separate native framework for each platform, wrestling brittle selectors and XPath, and burning most of your time on maintenance instead of new coverage.
That last part is the real killer. Leland said it flat out: the big cost of test automation isn’t creating the test, it’s maintaining it.
Anyone who’s spent years dealing with brittle test scripts knows exactly what he’s talking about.
This is where Maestro’s design starts to matter.
A Maestro test is just YAML, and it reads like plain English. You don’t have to be the person who wrote it to understand exactly what it does.
It’s self-documenting.
And because of that, maintenance gets a lot easier, because anybody can come in, read it, debug it, and edit it without diving into spaghetti code. As Leland said, “it’s very difficult to create spaghetti code with Maestro.”
One tool, one syntax, and it talks to the device through the accessibility layer, so it doesn’t care if your app is Flutter, React Native, or fully native. It works with all of them.
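Here’s a rough sketch of what “one syntax” looks like in practice. The flow text uses standard Maestro commands (`launchApp`, `tapOn`, `assertVisible`), and I’m assuming Maestro’s `${APP_ID}` variable substitution with the `-e` flag to point the same flow at the iOS and Android builds. The bundle IDs, screen text, and file name are all hypothetical, and the Python only writes the flow and builds the commands; it doesn’t run them.

```python
import shlex
import tempfile
from pathlib import Path

# One flow for both platforms. ${APP_ID} is substituted by Maestro at run time
# from the -e flag, so the steps themselves never mention iOS or Android.
SHARED_FLOW = """\
appId: ${APP_ID}
---
- launchApp
- tapOn: "Bookmarks"
- assertVisible: "No bookmarks yet"
"""

# Hypothetical identifiers for the same app on each platform.
APP_IDS = {
    "ios": "com.example.hackernews.ios",
    "android": "com.example.hackernews",
}


def maestro_command(platform: str, flow_path: Path) -> list[str]:
    """Build a `maestro test` invocation that injects the platform's app id."""
    return ["maestro", "test", "-e", f"APP_ID={APP_IDS[platform]}", str(flow_path)]


flow_path = Path(tempfile.mkdtemp()) / "bookmarks_empty.yaml"
flow_path.write_text(SHARED_FLOW)

for platform in APP_IDS:
    # Print the command you would run with that platform's simulator or emulator booted.
    print(shlex.join(maestro_command(platform, flow_path)))
```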
Enter Maestro: the tool behind the agentic loop
What makes Maestro interesting is that it’s not just another automation framework. It’s one of the first agentic testing tools designed specifically for AI-assisted development workflows.
Quick background, because it explains why this thing is built the way it is.
Maestro started about six years ago as a mobile performance testing company. To collect those benchmarks, they needed to automatically walk through an app, and they ran into the same wall everyone hits: there were no good options for reliable mobile automation.
So they built one internally, realized that was the actual thing the market needed, and open sourced it.
Today Maestro is used for mobile end-to-end testing at companies like Microsoft, Amazon, DoorDash, and Meta. Leland told me the React Native core team at Meta even uses Maestro to test the framework itself.
Here’s what makes it the right fit for agents specifically.
- It’s open source.
- The tests are human readable YAML, not a black box.
- There’s no SDK to install and no vendor lock-in.
- The flow lives right next to your code, so you own it, and an agent can work with it the same way it works with any other file in your repo.
- No hosted interface, no proprietary syntax hiding what’s actually happening.
That last point is the whole game. Agents are good at working with code, and Maestro is just code.
Maestro MCP and the agentic loop in action
This is the part that got me. Maestro shipped an MCP server, which is what lets a coding agent like Claude Code, Cursor, Codex, or Gemini drive the device autonomously.
Here’s what I watched Leland do live.
He had Cursor open on the left and an iOS simulator on the right, running an open source Hacker News app. The app let you bookmark posts, but you couldn’t delete a bookmark once you saved it. So the task was simple: add swipe-to-delete on the bookmarks tab, and validate it with Maestro.
He installed the Maestro MCP with one click, opened an agent chat, and typed essentially: add the ability to swipe to delete a bookmarked post, then validate with Maestro. That was it.
The agent explored the repo, wrote the actual Swift production code, and then started driving the simulator itself using Maestro. And it wasn’t perfect, which is exactly why it was convincing. It long-pressed to create a bookmark, then tapped the wrong thing trying to reach the bookmarks tab. Then it caught itself, picked a more stable selector using the element’s ID, and got it right. It hit one more weird app behavior, figured that out too, and ran the whole thing end-to-end. Green.
Then I asked, so what did you actually do? Leland’s answer: “All I did was ask it to build the feature and test it with Maestro.” It built the feature and validated it on its own.
Now, a quick honest note, because you’ll see this called self-healing tests all over the place. To be precise about what actually happened: the agent fixed its own selector while it was writing the test, not while the test was running later.
I think this distinction matters.
The test it saved is plain, deterministic YAML. Leland was clear that they do not want AI running at execution time, because that hurts the deterministic properties you need when a test runs in CI.
So the agent self-corrects during authoring, and what you ship is a boring, repeatable test.
That’s the part I think testers will actually trust.
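The selector fix in the demo is easy to reason about as a preference order. Below is a hypothetical sketch of that kind of authoring-time choice: prefer an accessibility ID that’s unique on screen, then unique visible text, and only fall back to tap coordinates. It’s my illustration, not Maestro MCP’s actual logic. The output mirrors the `id`, `text`, and `point` selectors Maestro’s `tapOn` accepts, and the choice is made once while authoring, then frozen into the saved flow.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Element:
    # Hypothetical snapshot of one on-screen element from the accessibility tree.
    text: Optional[str]
    accessibility_id: Optional[str]
    x: int
    y: int


def pick_selector(target: Element, screen: list[Element]) -> dict[str, str]:
    """Pick the most stable selector that matches exactly one element on screen."""
    if target.accessibility_id:
        same_id = [e for e in screen if e.accessibility_id == target.accessibility_id]
        if len(same_id) == 1:
            return {"id": target.accessibility_id}
    if target.text:
        same_text = [e for e in screen if e.text == target.text]
        if len(same_text) == 1:
            return {"text": target.text}
    # Last resort: coordinates are brittle across devices and layouts.
    return {"point": f"{target.x},{target.y}"}


# Two elements both read "Bookmarks": a screen header and the tab bar item.
header = Element(text="Bookmarks", accessibility_id=None, x=200, y=80)
tab = Element(text="Bookmarks", accessibility_id="bookmarks_tab", x=300, y=820)
screen = [header, tab]

print(pick_selector(tab, screen))     # {'id': 'bookmarks_tab'}
print(pick_selector(header, screen))  # {'point': '200,80'}
```

The header case shows why adding an accessibility identifier to the UI pays off: giving an element a unique ID moves it to the top of that list.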
It gets better.
When he asked it to turn that into a permanent test, the agent didn’t just record what worked. It added an accessibility identifier to the UI to make the test more stable, then wrote a reusable Maestro flow as a YAML file right in the repo. Anyone who’s worked with UI automation knows how fragile tests can become when UI changes start piling up.
The result is a deterministic test you can run in CI any time.
And because that flow lives in the repo, every test you add makes the next agent smarter about your app. Leland called it “compounding value” because each new flow makes the agent more and more familiar with your application.
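For a sense of what gets committed, here’s a sketch of a swipe-to-delete flow written into the repo. I didn’t see the exact file from the demo, so the app ID, the accessibility IDs, the `.maestro/` folder, and the `add_bookmark.yaml` subflow are all hypothetical. The commands (`launchApp`, `runFlow`, `tapOn`, `swipe`, `assertNotVisible`) are standard Maestro, and `runFlow` is how one flow reuses another, which is one concrete way that compounding shows up.

```python
import tempfile
from pathlib import Path


def render_delete_flow(app_id: str, row_id: str) -> str:
    """Render a deterministic Maestro flow for swipe-to-delete on the bookmarks tab."""
    return f"""\
appId: {app_id}
---
- launchApp
# Reuse an existing flow instead of repeating the bookmark steps.
- runFlow: add_bookmark.yaml
- tapOn:
    id: "bookmarks_tab"
- swipe:
    from:
      id: "{row_id}"
    direction: LEFT
- tapOn: "Delete"
- assertNotVisible:
    id: "{row_id}"
"""


repo = Path(tempfile.mkdtemp())  # stand-in for your project root
flows = repo / ".maestro"
flows.mkdir()
flow_file = flows / "delete_bookmark.yaml"
flow_file.write_text(render_delete_flow("com.example.hackernews", "bookmark_row_0"))
print(flow_file.read_text())
```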
One thing I appreciated: Maestro tried the full natural-language, AI-at-runtime approach last year and shut it down.
They found it was, in Leland’s words, “all kind of ceremony that you really don’t need.” Running AI at runtime hurts the deterministic properties of a test, which is exactly what you need when you’re relying on it in CI. The breakdown they landed on is clean: agents write Maestro flows, and you execute them deterministically in CI. As a bonus, that saves you tokens too.
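If you want to hold your suite to that rule, a small CI guard can refuse flows that would call a model at execution time. This is a hypothetical sketch: I’m assuming the AI-backed commands are the ones Maestro documents as `assertWithAI`, `assertNoDefectsWithAI`, and `extractTextWithAI`, so check the current command reference before relying on that list. The line-based check is deliberately simple and would miss unusual YAML layouts.

```python
import re

# Assumption: AI-backed Maestro commands as documented at the time of writing.
# Verify against the current Maestro command reference.
RUNTIME_AI_COMMANDS = {"assertWithAI", "assertNoDefectsWithAI", "extractTextWithAI"}

# Matches the command name at the start of a YAML list item, e.g. "- tapOn:".
STEP = re.compile(r"^\s*-\s*([A-Za-z]+)")


def runtime_ai_steps(flow_text: str) -> list[tuple[int, str]]:
    """Return (line number, command) for each step that would call a model in CI."""
    hits = []
    for lineno, line in enumerate(flow_text.splitlines(), start=1):
        match = STEP.match(line)
        if match and match.group(1) in RUNTIME_AI_COMMANDS:
            hits.append((lineno, match.group(1)))
    return hits


deterministic = """\
appId: com.example.hackernews
---
- launchApp
- assertVisible: "Top Stories"
"""

ai_backed = """\
appId: com.example.hackernews
---
- launchApp
- assertWithAI:
    assertion: "The feed shows a list of stories"
"""

print(runtime_ai_steps(deterministic))  # []
print(runtime_ai_steps(ai_backed))      # [(4, 'assertWithAI')]
```

In CI you would run this over every file in your flows folder and fail the build on any hit.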
So where do humans fit?
I had to ask the question on everyone’s mind. A lot of testers are worried about their jobs. A lot of testers don’t even have access to the code base. Is there still a role here?
Leland was honest about it, and I respect that he didn’t oversell. His user base is a pretty even split between mobile engineers and mobile testers, and Maestro doesn’t really cater to one over the other.
But the human role he kept coming back to is ownership. Say you’ve got an app with tons of features and barely any coverage. Who decides what’s actually important to test? Who guides the agent? Who’s ultimately responsible when it’s not perfect, because it isn’t perfect. As he put it, “until we have, I guess, AI VP of QA’s, I think that there needs to be somebody at some layer who has ownership over this.”
That’s the part the agent can’t own for you. The judgment about what matters still needs a person you trust.
He also gave me a couple of honest rapid fire answers worth repeating.
Flakiest platform? It hurt his Android soul to say it, but Android, because of device fragmentation.
Will AI produce more bugs or fewer? His take: per feature, AI will eventually produce fewer bugs than a human. But per week or per month, more, because we’re going to generate way more features and way more code. More volume, more surface area for mistakes.
Humans still own prioritization. The agent can generate tests, but it can’t tell you which customer journeys matter most to your business.
Which, honestly, is the whole reason testing matters more now, not less.
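Leland’s per-feature versus per-week point is just multiplication, and a worked example shows how both halves can be true at once. The numbers below are made up for illustration; they aren’t from Leland or from any study.

```python
def bugs_per_week(features_per_week: int, bugs_per_100_features: int) -> float:
    """Weekly bug count = shipping volume x defect rate."""
    return features_per_week * bugs_per_100_features / 100


# Invented numbers: the agent is twice as careful per feature but ships 4x as much.
human = bugs_per_week(features_per_week=5, bugs_per_100_features=40)
agent = bugs_per_week(features_per_week=20, bugs_per_100_features=20)

print(human, agent)  # 2.0 4.0
```

Halve the per-feature rate, quadruple the volume, and you still end up with twice as many bugs a week to catch.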
Traditional mobile automation vs agentic testing with Maestro
| What matters | Traditional automation (Appium, Espresso, XCUITest) | Agentic testing with Maestro + MCP |
|---|---|---|
| Test language | Java, Kotlin, Swift, plus client libraries | Human readable YAML that reads like plain English |
| Setup | Server, SDKs, platform tooling per language | No SDK, no vendor lock-in, flow lives next to your code |
| Cross-platform | Often a separate framework per platform | One syntax for iOS and Android via the accessibility layer |
| Who can read it | Whoever wrote the code, usually | Anyone, because it’s self-documenting |
| AI agent fit | Bolt-on, proprietary glue per tool | Native MCP server, agent drives the real device |
| Determinism in CI | Deterministic, but heavy to build and maintain | Agent writes the flow, you run it deterministically, no AI at runtime |
How to start with agentic mobile testing this week
I asked Leland the practical question: for a mobile team listening right now, maybe already using a coding agent, maybe not, what’s the first concrete thing they should do this week to start closing that loop?
His answer was dead simple. Go to docs.maestro.dev, find the MCP server section, and install Maestro MCP on whatever coding agent you already use.
That’s it.
It’s open source, the MCP server comes bundled with the CLI, and the only real cost is the AI model you’re already paying for.
It’s also well documented, with an up-to-date reference covering Maestro commands, selectors for UI elements, and workspace configuration for global settings.
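The docs walk you through the install for each agent, but it helps to know what ends up on disk. Here’s a hedged sketch that adds a Maestro entry to a project-level MCP config. I’m assuming the `mcpServers` JSON layout used by Cursor (`.cursor/mcp.json`) and Claude Code (`.mcp.json`), and that the bundled server starts with `maestro mcp`; confirm both against docs.maestro.dev for your agent.

```python
import json
import tempfile
from pathlib import Path


def add_maestro_mcp(config_path: Path) -> dict:
    """Add (or update) a Maestro MCP entry without touching other configured servers."""
    config = json.loads(config_path.read_text()) if config_path.exists() else {}
    servers = config.setdefault("mcpServers", {})
    # Assumption: the MCP server ships with the Maestro CLI and starts via `maestro mcp`.
    servers["maestro"] = {"command": "maestro", "args": ["mcp"]}
    config_path.parent.mkdir(parents=True, exist_ok=True)
    config_path.write_text(json.dumps(config, indent=2) + "\n")
    return config


project = Path(tempfile.mkdtemp())  # stand-in for your repo root
config_path = project / ".cursor" / "mcp.json"
# Pretend another (hypothetical) server is already configured.
config_path.parent.mkdir(parents=True)
config_path.write_text(json.dumps({"mcpServers": {"docs": {"command": "docs-server"}}}))

print(json.dumps(add_maestro_mcp(config_path), indent=2))
```

Merging instead of overwriting keeps any servers you already have wired into your agent.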
If you want to see the full live demo and hear Leland walk through all of it, including the natural language product they killed and why, check out the full episode here: episode a589 on TestGuild.
I went into this as a guy who doesn’t test his vibe coded apps. I came out thinking I no longer have an excuse.
I wasn’t expecting to be impressed, but seeing it actually build the feature, test it, and create the permanent test changed my mind.
FAQ
What is agentic AI in mobile testing?
Agentic AI in mobile testing is when an AI agent writes a test, runs it on a live device, checks its own work, and fixes problems on its own until the feature works. It closes the loop between generating code and verifying it without a human tapping through the app.
What is Maestro MCP?
Maestro MCP is Maestro’s Model Context Protocol server. It lets coding agents like Claude Code, Cursor, Codex, and Gemini drive a real iOS simulator or Android emulator: launch the app, tap, scroll, inspect the screen, run Maestro flows, and validate features end to end.
What coding agents work with Maestro MCP?
Maestro MCP works with Claude Code, Cursor, GitHub Copilot, Codex, Gemini CLI, Windsurf, and JetBrains AI Assistant. Any coding agent that supports the Model Context Protocol can connect to it.
Does Maestro MCP work with real devices or only emulators?
During development, Maestro MCP works with iOS simulators and Android emulators. Maestro Cloud, the paid tier, runs tests on hosted infrastructure at scale. Support for physical devices is something Maestro is building out this year.
How does Maestro close the agentic loop?
The agent writes the feature code, then uses Maestro to run commands against the live device and confirm the feature behaves correctly. If it makes a mistake, like picking the wrong selector, it corrects itself and reruns, then can save a permanent YAML test for CI.
How is Maestro MCP different from Appium MCP?
Both use the Model Context Protocol to let AI agents drive mobile devices. The difference is the test syntax. Appium uses WebDriver-based scripts in Java, Python, or JavaScript, while Maestro uses human readable YAML that the agent generates and that you can read and own without any SDK. Maestro also skips the WebDriver translation layer, which makes it faster to start and easier to maintain.
Is Maestro an Appium alternative?
Maestro is built on lessons from tools like Appium and uses one human readable YAML syntax for both iOS and Android. It talks to the device through the accessibility layer, so it works with native, React Native, and Flutter apps without a separate framework per platform.
Does AI replace mobile testers?
No. According to Maestro’s CEO, someone still has to own what’s worth testing, guide the agent, and stay accountable for coverage. The judgment about what matters to the business is the part the agent can’t own for you.
Is Maestro free?
Maestro is open source and free to run locally, and the MCP server is bundled with the Maestro CLI at no extra cost. Maestro Cloud is the paid option for running tests in parallel on hosted infrastructure at scale.
Does Maestro MCP support API testing?
Maestro is primarily focused on mobile UI testing and end-to-end testing. While teams may combine it with API testing tools as part of a broader testing strategy, its core strength is automating user interactions on iOS and Android devices.
Joe Colantonio is the founder of TestGuild, an industry-leading platform for automation testing and software testing tools. With over 25 years of hands-on experience, he has worked with top enterprise companies, helped develop early test automation tools and frameworks, and runs the largest online automation testing conference, Automation Guild.
Joe is also the author of Automation Awesomeness: 260 Actionable Affirmations To Improve Your QA & Automation Testing Skills and the host of the TestGuild podcast, which he has released weekly since 2014, making it the longest-running podcast dedicated to automation testing. Over the years, he has interviewed top thought leaders in DevOps, AI-driven test automation, and software quality, shaping the conversation in the industry.
With a reach of over 400,000 across his YouTube channel, LinkedIn, email list, and other social channels, Joe’s insights impact thousands of testers and engineers worldwide.
He has worked with some of the top companies in software testing and automation, including Tricentis, Keysight, Applitools, and BrowserStack, as sponsors and partners, helping them connect with the right audience in the automation testing space.
Follow him on LinkedIn or check out more at TestGuild.com.