Why Your AI Test Agent Should Speak English, Not Playwright

If you’re still writing Playwright scripts by hand in 2026, you’re solving yesterday’s problem.
I talked with Karim Jouini, CEO of Thunders.ai, on my TestGuild Automation Testing Podcast about what killed velocity at his last company, Expensya, a SaaS product operating in 60 countries and 17 languages.
They had one million lines of Playwright code.
When they did a rebrand and redesign, it took six months just to update the test automation.
Six months. Not to add features. Not to ship products. Just to update CSS selectors and element IDs so the tests would run again.
That’s not maintenance. That’s a death spiral.
- What We Got Wrong About AI Testing (And What Actually Works)
- Why This Isn’t Just a Thunders Thing (The Industry Is Moving Here)
- The Real Cost of Test Maintenance (Why Expensya’s Story Matters)
- The Manual Tester on Steroids Model
- How to Actually Vibe Code Your QA Stack
- The Mortgage App Example
- The Domain Module Play (Reusable Test Assets)
- From Script Writer to Agent Manager
- Why 95% of AI Testing Pilots Fail (And How to Be in the 5%)
- Common Questions About Interpreted Automation
- What happens when the AI can’t find an element?
- Can this replace all manual testing?
- What about mobile apps?
- How does this work with existing CI/CD?
- Try It Yourself
- The Bottom Line
What We Got Wrong About AI Testing (And What Actually Works)
Most AI testing tools today use AI to generate code.
You give it a prompt, it spits out a Playwright script, and now you own that code. You have to debug it, version it, and—most importantly—update it every time a developer changes a button ID or renames a CSS class.
You’re still stuck in the same maintenance loop.
The AI just moved the bottleneck from “writing the script” to “owning the script.”
Karim’s team is doing something different: Interpreted Automation.
Instead of generating a brittle script that breaks when the UI changes, the AI interprets your English instructions at runtime. When you say “Click the login button,” the AI looks at the page like a human would.
If the button changed from #login-btn to .auth-submit, the AI doesn’t care.
It sees the button, understands the intent, and clicks it.
Here’s the fundamental difference:
Generated automation:
- AI writes you a Playwright script
- You own the code
- When the UI changes, your script breaks
- You update selectors manually
- You maintain code forever
Interpreted automation:
- AI reads your English instructions at runtime
- You own the intent, not the code
- When the UI changes, the AI adapts
- No selector updates needed
- You maintain functional specs, not scripts
This is the shift from Generative AI (making suggestions) to Agentic AI (taking action). I think this is going to become the norm, so you might want to get used to it.
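To make the difference concrete, here’s a minimal side-by-side sketch in Playwright-flavored TypeScript. The `runStep` interpreter is hypothetical (my illustration, not Thunders’ API): the point is that the first test owns selectors while the second owns intent.

```typescript
import { test, expect, Page } from "@playwright/test";

// Generated automation: the selectors are baked in. If #login-btn becomes
// .auth-submit after a redesign, this test breaks and you fix it by hand.
test("login (generated script)", async ({ page }) => {
  await page.goto("https://app.example.com/login");
  await page.fill("#email", "user@example.com");
  await page.fill("#password", "s3cret!");
  await page.click("#login-btn");
  await expect(page).toHaveURL(/dashboard/);
});

// Hypothetical interpreter: a real one would send the instruction plus a
// snapshot of the live page to an LLM and perform the action it returns.
async function runStep(page: Page, instruction: string): Promise<void> {
  throw new Error(`No interpreter wired up for: "${instruction}"`);
}

// Interpreted automation: you own the English intent, not the selectors.
// The mapping from intent to element happens at runtime, on every run.
test("login (interpreted intent)", async ({ page }) => {
  await page.goto("https://app.example.com/login");
  for (const step of [
    "Enter the email user@example.com",
    "Enter the password",
    "Click the login button",
    "Verify we land on the dashboard",
  ]) {
    await runStep(page, step);
  }
});
```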
Why This Isn’t Just a Thunders Thing (The Industry Is Moving Here)
Thunders isn’t the only company betting on natural language interpretation over code generation.
If you’ve been paying attention to the new wave of AI-first testing tools launching in 2024 and 2025, you’ll notice a pattern: BDD-style natural language specifications are becoming the interface, not just the output.
Multiple vendors are moving in this direction. The difference is in the execution—some are still generating Gherkin that compiles to code you have to own, others are doing true runtime interpretation where the AI executes your English directly.
The question to ask when evaluating any AI testing tool in 2026:
“Am I getting code I have to maintain, or am I getting an interpreter that maintains itself?”
If the tool generates a Playwright script and hands it to you, you’re still in the maintenance game.
You’ve just automated the initial writing, not the ongoing ownership. If it keeps the interpretation layer and executes your natural language at runtime, you’re in a different world.
This shift is happening because the economics finally make sense. LLMs are cheap enough and fast enough to interpret on every test run. Five years ago, you’d never pay for an API call per test step—the cost would be insane. Today, it’s pennies compared to the engineering time spent updating selectors.
The broader trend: Natural language is becoming the source code for testing, not just a way to generate source code.
The Real Cost of Test Maintenance (Why Expensya’s Story Matters)
Before founding Thunders, Karim was CEO of Expensya—think Expensify for Europe, same scale, different market. The company sold for over $100 million, but when he and his co-founder did the post-mortem, they both agreed: Quality assurance was what held them back from unicorn territory.
Operating in 60 countries meant regulation changes constantly.
If every country changes their rules twice a year, you’re dealing with regulatory updates three or four times a week. Every new release became more painful than the last. Large enterprise customers—defense contractors, Fortune 500s—weren’t used to multi-tenant SaaS with daily updates. When you break their system, they don’t just file a ticket. They question the entire relationship.
Karim broke down the problem into four dimensions:
- Resource drain. A million lines of Playwright code meant developers and QAs spending time maintaining tests instead of evolving the product. That’s headcount you wish you had building features.
- Campaign complexity. A major release required four to five weeks of test campaigns just to get the right level of quality. They had corporate cards, bank integrations, travel agency integrations—the surface area was massive. That limits how many major releases you can ship per year.
- Maintenance hell. The rebrand story I mentioned earlier: six months just to update the tests. Not add coverage. Not improve quality. Just get the existing tests working again after a design refresh.
- Collaboration breakdown. Despite being an agile, AI-forward company (their product used AI pre-GPT), the testing process was waterfall. You couldn’t easily answer the question: “What did we test?” There wasn’t a transparent system to tell an enterprise customer what you tested in their specific setup with high confidence.
At 250 people, that lack of visibility is a huge problem.

The Manual Tester on Steroids Model
Karim doesn’t like the term “agent.” It means different things to different people. He prefers “QA assistant” or what I’d call the vibe tester.
The vision: Combine the quality and flexibility of a manual tester with the efficiency and scalability of test automation.
Think about a great QA on your team. You can throw a functional spec at them, explain the feature in a Slack message, onboard them in a day, and they figure it out. They test it. They give you subtle feedback about UX and performance. It’s high quality because they’re human.
But it doesn’t scale. The cost is insane. You end up with low coverage.
Now think about traditional automation. Super scalable once you build it, but expensive to set up. You need specialized profiles—test automation engineers who know Selenium or Playwright. Your functional leader has to transfer knowledge to someone else who has the technical expertise to write the tests. Immense inefficiencies. Immense maintenance.
Thunders is trying to be the manual tester on steroids: the quality of manual testing, the scalability of automation.
How to Actually Vibe Code Your QA Stack
You can’t vibe code your features and then go back to stone-age scripting for your tests. Here’s the workflow Karim described for using Thunders with the Model Context Protocol (MCP):
Step 1: Write the Functional Spec in English
When you’re building a feature in Cursor or VS Code, don’t just ask the AI to write code. Ask it to draft the functional specification in plain English. This becomes the source of truth for both the developer and the AI test agent.
Example: Instead of immediately generating React components, you’d first produce:
Feature: User login flow
- User enters email and password
- System validates credentials
- On success, redirect to dashboard
- On failure, show error message “Invalid credentials”
- After 3 failed attempts, show “Account locked” message
That’s your spec. It’s readable by humans, executable by AI.
Step 2: Call the Thunders MCP from Your IDE
Thunders integrates directly into MCP, which means you can stay inside your editor and call the test agent as a tool:
```bash
agent mcp call thunders-ai create-test from spec.md
```
The AI doesn’t generate code. It registers a “persona” (Karim’s term) that knows how to execute that spec in natural language. The persona understands what “login button” means, what “error message” looks like, what “redirect to dashboard” should do. (This is also becoming really common in tools like Openclase, which create these detailed persona.md-type files.)
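Outside the IDE, you can script the same call with the MCP TypeScript SDK. A minimal sketch, assuming a local Thunders MCP server binary called `thunders-mcp` that exposes a `create-test` tool; both names are guesses extrapolated from the command above, not documented API:

```typescript
import { Client } from "@modelcontextprotocol/sdk/client/index.js";
import { StdioClientTransport } from "@modelcontextprotocol/sdk/client/stdio.js";

// Spawn the (hypothetical) Thunders MCP server over stdio.
const transport = new StdioClientTransport({ command: "thunders-mcp" });
const client = new Client({ name: "qa-workstation", version: "1.0.0" });
await client.connect(transport);

// Ask the agent to register a test persona from the English spec.
// Tool name and argument shape are assumptions based on the article.
const result = await client.callTool({
  name: "create-test",
  arguments: { specPath: "spec.md" },
});
console.log(result.content);
await client.close();
```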
Step 3: The Self-Healing Loop
When a test fails in CI/CD, the agent doesn’t just throw an error. It analyzes the change (see the sketch after this list):
- What it expected: “I was looking for a ‘Submit’ button”
- What it found: “I found an ‘Arrow’ icon that performs the same function”
- What it does: Updates the English step automatically: “Click the arrow icon to submit”
- What it logs: Creates a Jira ticket for the dev team with the exact discrepancy
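In code, that loop looks roughly like the sketch below. It’s my rendering of the pattern, not Thunders’ implementation; `tryStep`, `updateTestCase`, and `fileTicket` are hypothetical stand-ins for the interpreter, the test store, and the Jira integration.

```typescript
interface StepResult {
  ok: boolean;
  expected: string;       // e.g. 'a "Submit" button'
  found?: string;         // e.g. 'an "Arrow" icon with the same function'
  rewrittenStep?: string; // e.g. 'Click the arrow icon to submit'
}

// Hypothetical integrations: the interpreter, the test store, and Jira.
declare function tryStep(step: string): Promise<StepResult>;
declare function updateTestCase(oldStep: string, newStep: string): Promise<void>;
declare function fileTicket(summary: string): Promise<void>;

// Sketch of the self-healing loop: on failure, adopt the functionally
// equivalent element if one was found, and log the discrepancy for devs.
async function runWithSelfHealing(step: string): Promise<void> {
  const result = await tryStep(step);
  if (result.ok) return;

  if (result.rewrittenStep) {
    await updateTestCase(step, result.rewrittenStep);
    await fileTicket(
      `Expected ${result.expected}, found ${result.found}. ` +
        `Step rewritten to: "${result.rewrittenStep}"`,
    );
    return runWithSelfHealing(result.rewrittenStep);
  }

  // No equivalent found: fail loudly with context, like a human QA would.
  throw new Error(`Step failed: "${step}". Expected ${result.expected}.`);
}
```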
One of Karim’s customers did a full platform upgrade—new framework, new architecture, everything rebuilt from scratch. Their Playwright tests would have had to be rewritten completely. With Thunders, the migration took four months instead of starting from zero.
Karim’s honest about this: it’s not 100% zero-touch. It’s 80–90% there. But that’s the difference between “we can’t afford this migration” and “we can actually ship this.”
The Mortgage App Example
Karim used this analogy: If you’re applying for a mortgage, the functional process is the same regardless of how the app is built. You’re filling the same fields. Maybe instead of one form, it’s a four-step wizard. Maybe the component library changed. But functionally, it’s still a mortgage application.
If you built your tests in Playwright or Selenium, you’d rewrite from scratch. Every selector changed. Every element reference is broken.
With interpreted automation, you make minimal changes. The AI understands “mortgage application form” regardless of whether it’s one page or four steps, regardless of the CSS framework.
The Domain Module Play (Reusable Test Assets)
This is where it gets interesting for consulting businesses and testing-as-a-service companies.
Karim mentioned a partner—a consulting company that specializes in car leasing software. They work with big brands like Chevrolet and Toyota, deploying solutions to manage the lifecycle of leased vehicles. Every company is different. Every version of the product is customized. They do deployment, professional services, customization, testing—the full stack.
Now they’re building all the test assets for what it means to test a leasing system using Thunders. Once they have that library, they can reuse it across different brands, different infrastructures, different product versions.
You could do the same thing for:
- Healthcare insurance enrollment flows
- Banking loan origination systems
- E-commerce checkout processes
- SaaS onboarding workflows
Build the domain knowledge once as interpreted test cases, reuse it everywhere. That’s not just efficiency. That’s a new business model.
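What does one of those reusable assets actually look like? A minimal sketch with made-up names: the module is parameterized English intent, so the same scenarios run against any brand’s deployment, whatever the UI framework underneath.

```typescript
// Sketch of a reusable domain module: the asset is English intent,
// not selectors. All names here are illustrative.
interface DomainScenario {
  name: string;
  steps: string[]; // plain-English steps, interpreted at runtime
}

const carLeasingModule: DomainScenario[] = [
  {
    name: "Start a new lease",
    steps: [
      "Search the vehicle inventory for an available model",
      "Open the lease application for the selected vehicle",
      "Fill in the lessee's personal and financial details",
      "Submit the application and verify a confirmation number appears",
    ],
  },
  {
    name: "End-of-lease return",
    steps: [
      "Open an active lease that is within 30 days of its end date",
      "Schedule a vehicle return appointment",
      "Verify the final settlement amount is displayed",
    ],
  },
];

// Reuse across brands: only the deployment URL and test data change.
// `runScenario` is a hypothetical interpreter entry point.
declare function runScenario(baseUrl: string, s: DomainScenario): Promise<void>;
for (const s of carLeasingModule) {
  await runScenario("https://leasing.example-brand.com", s);
}
```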
Karim’s position: Thunders won’t build these domain modules because they’re not experts in every industry. But the experts (the consultants, the testing shops, the QA teams embedded in verticals) can use Thunders to package their expertise.

From Script Writer to Agent Manager
Karim’s take on the future of QA: You’re not being replaced. You’re being promoted.
The goal isn’t to be the person who writes the 1,000th line of Selenium. The goal is to be the manager of an army of AI agents. You define the strategy, you set the guardrails, you review the work, and you let the interpreted engine handle the grunt work of clicking buttons.
This is the vibe tester model: senior QAs can use it, junior QAs can use it, even non-technical folks can use it. You’re not writing code. You’re writing intent.
Think about how you’d brief a human QA:
- “Test the checkout flow with an expired credit card”
- “Verify the dashboard loads correctly for a free-tier user”
- “Make sure the export function works with 10,000 rows”
That’s what you write. The AI figures out how to execute it.
Why 95% of AI Testing Pilots Fail (And How to Be in the 5%)
Karim quoted an MIT study: 95% of AI pilots fail. Not just in testing—across the board.
The reason? Companies don’t define what success looks like before they start.
His advice:
- Don’t get into AI before you know why you’re doing it. Have a super clear definition of success.
- Once you know, move fast. Because testers who use AI will replace testers who don’t.
The framework he’s selling to customers:
- Ship twice as fast
- Cover 10x more with the same resources
- Discover bugs much sooner
Those are measurable.
You can track velocity. You can measure test coverage. You can count time-to-detection for bugs.
If you can’t measure it, you can’t tell if the pilot worked.
Common Questions About Interpreted Automation
Look, I know what you’re thinking — Joe, how is this any different from Selenium/Playwright?
Selenium and Playwright are execution engines.
They click buttons, fill forms, navigate pages.
But they require you to write code that specifies exactly which button to click, which form field to fill.
Interpreted automation uses those same engines under the hood (Karim mentioned they use “the best technologies out there”), but you don’t write the code. You write the intent in English, and the AI figures out the execution at runtime.
What happens when the AI can’t find an element?
The AI does what a human QA would do: it logs the failure with context. “I expected a ‘Save’ button but couldn’t find one.” It creates a ticket. It doesn’t just crash silently.
The self-healing loop means it also suggests fixes. If it finds a different element that performs the same function, it proposes updating the test case.
Can this replace all manual testing?
Honestly, if a vendor said yes, that would be a pretty big red flag. Luckily, Karim’s position is no: Thunders is a manual tester on steroids, not a replacement for human judgment.
You still need humans for exploratory testing, UX feedback, edge case discovery.
But for regression testing, for repetitive scenarios, for coverage at scale—that’s where interpreted automation shines.
What about mobile apps?
At Expensya, they used different tools for web vs mobile (they mentioned Waldo for mobile, which is now disappearing). Karim didn’t go deep on mobile in this conversation, but the interpreted automation model should work the same way: you describe the action in English, the AI figures out how to execute it on iOS or Android.
How does this work with existing CI/CD?
Thunders integrates into standard CI/CD pipelines. When a test fails, it doesn’t just return an exit code. It returns context: what changed, what broke, what the suggested fix is, and a Jira ticket with the details.
The MCP integration means you can call it from your IDE, from your build scripts, from GitHub Actions—wherever you’re already running tests.
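In a pipeline, that might look like the wrapper below. Everything here is a sketch under assumptions; `runSuite` and its structured failure report are hypothetical stand-ins for whatever the real CLI or API returns. The point is that failures carry context, not just an exit code.

```typescript
// Sketch of a CI wrapper around an interpreted test run (hypothetical API).
interface FailureReport {
  step: string;       // the English step that failed
  expected: string;
  found: string;
  suggestedFix: string;
  ticketUrl?: string; // e.g. the Jira ticket the agent filed
}

declare function runSuite(suite: string): Promise<{ failures: FailureReport[] }>;

const { failures } = await runSuite("regression");
for (const f of failures) {
  // Surface the agent's analysis in the build log so the developer sees
  // what changed and what the proposed fix is, right in CI.
  console.error(
    `FAILED: ${f.step}\n  expected: ${f.expected}\n  found: ${f.found}\n` +
      `  suggested fix: ${f.suggestedFix}` +
      (f.ticketUrl ? `\n  ticket: ${f.ticketUrl}` : ""),
  );
}
process.exit(failures.length === 0 ? 0 : 1);
```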
Try It Yourself
Thunders has a two-week free trial. You can also see it in action without creating an account—just enter a URL, and the AI will discover your app, create a test case, and run it in about 40 seconds.
That “wow effect” Karim mentioned is real.
Watching an AI explore your app, generate test cases, and execute them in different personas (mobile, desktop, different user roles) in under a minute is pretty wild.
The Bottom Line
If you’re maintaining Playwright scripts by hand, you’re paying the maintenance tax every time the UI changes. If you’re doing manual testing exclusively, you’re paying the scale tax because humans don’t scale.
Interpreted automation is the play to avoid both taxes. You maintain functional specs in English. The AI handles the execution. When the UI changes, the AI adapts.
Karim’s story at Expensya—six months to update tests after a rebrand—is the canary in the coal mine. If you’re growing fast, if you’re shipping often, if you’re in multiple markets with different regulations, that maintenance debt will catch you.
The question isn’t whether you’ll adopt AI testing.
The question is whether you’ll adopt it before your competitors do.
Try Thunders Yourself Now Free
Resources:
• TestGuild Automation Podcast episode with Karim Jouini
• Thunders.ai free trial
• Model Context Protocol (MCP) documentation
Joe Colantonio is the founder of TestGuild, an industry-leading platform for automation testing and software testing tools. With over 25 years of hands-on experience, he has worked with top enterprise companies, helped develop early test automation tools and frameworks, and runs the largest online automation testing conference, Automation Guild.
Joe is also the author of Automation Awesomeness: 260 Actionable Affirmations To Improve Your QA & Automation Testing Skills and the host of the TestGuild podcast, which he has released weekly since 2014, making it the longest-running podcast dedicated to automation testing. Over the years, he has interviewed top thought leaders in DevOps, AI-driven test automation, and software quality, shaping the conversation in the industry.
With a reach of over 400,000 across his YouTube channel, LinkedIn, email list, and other social channels, Joe’s insights impact thousands of testers and engineers worldwide.
He has worked with some of the top companies in software testing and automation, including Tricentis, Keysight, Applitools, and BrowserStack, as sponsors and partners, helping them connect with the right audience in the automation testing space.
Follow him on LinkedIn or check out more at TestGuild.com.