AI Testing from Production Logs: Generate Smarter Regression Tests with Tanvi Mittal
About This Episode:
What if your production logs could automatically generate new test cases?
In this episode, Joe Colantonio sits down with Tanvi Mittal to break down how AI-powered log mining is changing the way teams approach software testing, quality engineering, and DevOps.
Most teams ignore production logs or use them only for debugging. But those logs contain real user behavior, real failures, and real edge cases—the exact scenarios your test suite is probably missing.
Learn how to:
- Convert production logs into automated regression tests
- Use AI to detect real-world failure patterns
- Apply shift-right testing to catch bugs earlier (and smarter)
- Handle the challenge of testing non-deterministic AI systems
- Reduce flaky tests and automation debt with real data
If you’re working with Playwright, Selenium, Cypress, or AI-driven testing tools, this episode will give you a completely new way to think about test coverage.
About Tanvi Mittal

Tanvi Mittal is an AI QA Engineer and Test Lead at US Bancorp, specializing in building quality systems for financial transactions and banking platforms. She focuses on applying AI and LLM-driven tools to areas like fraud detection and API compliance, and is an active open-source contributor and startup advisor helping teams rethink how testing works in an AI-driven world.
Connect with Tanvi Mittal
- Company: www.#home
- Blog: www.#loganalyser
- LinkedIn: www.tanvi-mittal
What You’ll Learn
Your production logs are generating gigabytes of data every day, and most teams treat them as nothing more than debugging artifacts. But what if those logs are actually a goldmine for testing? In this episode, Tanvi breaks down how AI-powered log mining can analyze real production behavior, detect the exact sequence of events that triggered a failure, and automatically turn that into a regression test so it never happens again.
We cover:
- Why your production logs are an untapped testing asset
- How to transform messy log files into clean Gherkin test scenarios
- The shift-right to shift-left approach: using production data to inform test design
- How to test non-deterministic AI systems (where the same input doesn’t always produce the same output)
- Security considerations for log mining in regulated industries like finance
- Getting started with open-source log mining tools
Listen time: ~28 minutes
Why Production Logs Matter for Testing
Here’s the reality: you’re running hundreds or thousands of test cases, but bugs still show up in production. Every. Single. Release.
That gap between what you test and what actually happens in production? That’s where Tanvi’s Log Miner tool comes in. Instead of treating logs as post-mortem debugging artifacts, this approach treats them as a continuous testing engine.
The core idea: If you take chunks of production logs from last month, process them with AI, and convert all the fraud alerts, risk analysis flows, and user journeys into Gherkin-formatted test cases, you can add those scenarios to your next regression suite. It’s a continuous feedback loop where production behavior informs what you test before the next release.
Tanvi calls this approach shift-right to shift-left — you’re using production data (shift-right) to improve your pre-production testing strategy (shift-left). Makes total sense when you think about it.
How Log Mining Works: From Messy Logs to Clean Test Cases
Production logs are noisy. You’ve got everything from developer debug statements to actual user transactions. So how do you go from that chaos to something a tester can actually use?
The Log Miner workflow:
- Input sources: CSV files, JSON files, or direct connections to Datadog and Elasticsearch
- Clustering: AI analyzes the logs and groups similar scenarios together
- Event mapping: The tool uses a mapping file specific to your domain (banking, e-commerce, etc.) to translate production events into test-friendly language
- Gherkin generation: Python libraries (sentence-transformers, FastAPI) convert clustered scenarios into proper Gherkin format
- Security layer: Sensitive data like account numbers are converted to hash tokens before processing
Important: Your logs need to have structured events with clear names. If your log just says “session ended,” there’s not much to work with. But if you’re logging events like “user clicked payment button” with associated account details and payment information, now you’ve got something the tool can parse and convert into meaningful test scenarios.
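To make the workflow concrete, here’s a minimal sketch (my illustration, not Log Miner’s actual implementation) that groups structured log events by session and renders each session as a Gherkin scenario. The event names, fields, and session format are all hypothetical:

```python
import json
from collections import defaultdict

# Hypothetical structured log lines: one JSON object per line,
# each with a session id and a clearly named event.
RAW_LOGS = """\
{"session": "s1", "event": "user clicked payment button", "amount": 50}
{"session": "s1", "event": "payment authorized", "amount": 50}
{"session": "s2", "event": "user clicked payment button", "amount": 20}
{"session": "s2", "event": "payment declined", "amount": 20}
"""

def group_by_session(raw: str) -> dict[str, list[dict]]:
    """Group parsed log events by session id, preserving order."""
    sessions = defaultdict(list)
    for line in raw.splitlines():
        record = json.loads(line)
        sessions[record["session"]].append(record)
    return dict(sessions)

def to_gherkin(name: str, events: list[dict]) -> str:
    """Render one session's event sequence as a Gherkin scenario."""
    lines = [f"Scenario: replay of session {name}"]
    keyword = "Given"
    for event in events:
        lines.append(f"  {keyword} {event['event']}")
        # First step is Given, second is When, the rest are And.
        keyword = "When" if keyword == "Given" else "And"
    lines.append("  Then the final state matches production")
    return "\n".join(lines)

for name, events in group_by_session(RAW_LOGS).items():
    print(to_gherkin(name, events))
```

The real tool adds AI clustering and a domain mapping file on top of this; the point here is just that structured events translate almost mechanically into Given/When/Then steps, while a bare “session ended” string would not.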
The Developer Conversation: What Makes a Good Log for Testing?
One side benefit of this approach: it forces the conversation between testers and developers about log quality.
If your logs aren’t meaningful enough to generate tests from, that’s a signal that your production observability might need work anyway. Tanvi mentioned this can help testers articulate to developers what additional logging would be valuable — not just for debugging, but for continuous test case generation.
Real talk: most developers over-log or under-log. Having a concrete use case like “we need these event types logged so we can auto-generate regression tests” gives them a target to aim for.
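As a hedged illustration of what that target might look like, compare a bare string against a structured event with the context a generated test would need. The field names here are illustrative, not a required schema:

```python
import datetime
import json

# Under-logged: nothing here that a test generator can use.
bad = "session ended"

# Test-friendly: a named event plus the context a generated scenario
# would need. Field names are illustrative, not a required schema.
good = json.dumps({
    "timestamp": datetime.datetime(2024, 1, 15, 10, 30).isoformat(),
    "event": "user clicked payment button",
    "session": "s1",
    "account": "acct_9f2c01ab",  # already masked, never raw PII
    "amount": 50.00,
    "currency": "USD",
})
print(good)
```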
Risk-Based Testing: Let Production Tell You What to Test
Here’s something I hadn’t thought about before this conversation: production logs can reveal risk areas you didn’t know existed.
Example: Maybe your logs show that this month you had 10,000 deposit transactions but only 1,000 withdrawal transactions. And when you look at your test suite, you’ve got the same number of test cases for each. That’s a mismatch. The logs are telling you where to focus.
Instead of running a thousand different test cases across the board, you could focus more heavily on deposits for the next release because that’s what’s actually happening in production. You’ve only got six days to test before the next release anyway — might as well test what matters.
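A back-of-the-envelope version of that allocation, weighting a fixed test budget by observed event volume (numbers taken from the example above, the formula is mine):

```python
# Production event counts from the example: deposits dominate.
event_counts = {"deposit": 10_000, "withdrawal": 1_000}
test_budget = 1_000  # total test cases we have time to run

total = sum(event_counts.values())
allocation = {
    event: round(test_budget * count / total)
    for event, count in event_counts.items()
}
print(allocation)  # deposits get roughly 10x the withdrawal coverage
```

A real risk model would also weight by failure rates and business impact, not raw volume alone, but even this crude ratio exposes the mismatch between uniform test suites and skewed production traffic.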
Testing Non-Deterministic AI Systems
We got into this question: if AI-driven systems are non-deterministic (meaning the same input doesn’t always produce the same output), how do you test them at all?
Tanvi’s take: You need human judgment. With deterministic systems, you can automate everything and trust the output. With AI systems, you need a human to evaluate whether the response is “good enough” even if it’s different from the last time you ran the same input.
This actually makes testing harder to automate away, not easier. If anything, you need more experienced testers who can apply judgment, not just check that field A equals value B.
My two cents: We’re a long way from AI replacing testers, especially in regulated industries like banking and healthcare where you can get sued or kill someone if you ship bad code. Human eyes are going to stay on that for a long time.
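One common pattern for this (not from the episode, just a hedged illustration) is to assert on properties of the output rather than exact strings, then route whatever passes the automatable gates to a human reviewer for the final “good enough” call:

```python
def check_ai_response(response: str) -> list[str]:
    """Property-based checks for a non-deterministic answer.

    Returns a list of failed-check descriptions; an empty list means
    the response passed the automatable gates and can go to human
    review. The checks here are illustrative, not exhaustive.
    """
    failures = []
    if not response.strip():
        failures.append("empty response")
    if len(response) > 2000:
        failures.append("response too long")
    # Domain guardrail: a banking assistant must never echo card data.
    if any(tok.isdigit() and len(tok) >= 13 for tok in response.split()):
        failures.append("possible card number leaked")
    return failures

# Two different (non-deterministic) answers to the same question can
# both pass the same property checks.
assert check_ai_response("Your deposit posted on Friday.") == []
assert check_ai_response("The deposit was posted Friday morning.") == []
assert "empty response" in check_ai_response("   ")
```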
Security Considerations: AI + Sensitive Data
Tanvi works with financial companies, so she gets this question immediately: “Where does the AI come into the picture? Is our data going outside our network? How are you securing logs that contain PII?”
Fair question. The Log Miner tool addresses this by:
- Processing logs locally (not sending production data to external AI services)
- Using hash tokens to mask sensitive data like account numbers
- Giving enterprises control over what data gets analyzed
Honestly, this question should be asked for any tool that touches production data. Finance, healthcare, legal — if you’re in a regulated industry, you need to vet where your data goes, AI or not.
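The masking step can be as simple as a salted, deterministic hash: the same account number always maps to the same token, so event sequences stay linkable across a session without exposing PII. A sketch (my own, not Log Miner’s actual code; the regex is a naive stand-in for real PII detection):

```python
import hashlib
import re

ACCOUNT_RE = re.compile(r"\b\d{10,16}\b")  # naive account-number pattern

def mask_accounts(line: str, salt: str = "per-run-secret") -> str:
    """Replace account-like numbers with stable hash tokens."""
    def token(match: re.Match) -> str:
        digest = hashlib.sha256((salt + match.group()).encode()).hexdigest()
        return f"acct_{digest[:10]}"
    return ACCOUNT_RE.sub(token, line)

masked = mask_accounts("deposit to 1234567890 from 1234567890 ok")
# Same account yields the same token, so flows remain correlatable.
assert masked.count("acct_") == 2
assert "1234567890" not in masked
print(masked)
```

Because the salt changes per run, tokens can’t be joined across exports, which limits the blast radius if a processed log set ever leaks.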
Getting Started with Log Miner
Log Miner is open source. You can grab it on GitHub and start experimenting.
What you need to get started:
- Decide which log source you want to use (Datadog, Elasticsearch, CSV exports)
- Understand the format Log Miner expects (structured events with clear names)
- Create a mapping file for your domain (banking, e-commerce, etc.)
- Run it on a chunk of historical logs (not terabytes — keep it manageable)
- Review the generated Gherkin scenarios and add them to your test suite
Tanvi recommends starting with a small dataset to see what it produces. Don’t try to process six months of production logs on day one. Start with a week’s worth, see what test cases it generates, and refine from there.
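The episode doesn’t show the mapping-file format, so treat this as a hypothetical shape: a per-domain lookup that translates raw production event names into tester-friendly Gherkin phrasing, with a pass-through fallback so unmapped events stay visible rather than silently dropped:

```python
# Hypothetical banking mapping: raw log event name -> Gherkin phrasing.
MAPPING = {
    "txn.deposit.initiated": "the customer starts a deposit",
    "txn.deposit.posted": "the deposit is posted to the account",
    "fraud.alert.raised": "a fraud alert is raised for review",
}

def translate(event_name: str) -> str:
    """Fall back to the raw name so unmapped events are still visible."""
    return MAPPING.get(event_name, event_name)

print(translate("fraud.alert.raised"))
print(translate("txn.unknown.event"))  # unmapped names pass through
```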
Additional Use Cases: Beyond Test Generation
We also talked about other ways teams are using log analysis:
Stack trace analysis: When you run thousands of regression tests, some fail because of actual bugs and some fail because of environment issues or flaky test data. Can you use AI to analyze stack traces and differentiate real failures from noise? Tanvi’s working on that now.
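A crude first cut at that triage (my sketch, not Tanvi’s work in progress) matches known environment-noise signatures before escalating anything as a candidate real bug; an AI classifier would replace the hand-written patterns, but the pipeline shape is the same:

```python
import re

# Signatures of failures that usually mean "environment", not "bug".
NOISE_PATTERNS = [
    re.compile(r"Connection(?:Error| refused| reset)", re.IGNORECASE),
    re.compile(r"TimeoutError|timed out", re.IGNORECASE),
    re.compile(r"StaleElementReferenceException"),
]

def triage(stack_trace: str) -> str:
    """Label a failure as environment noise or a candidate real bug."""
    for pattern in NOISE_PATTERNS:
        if pattern.search(stack_trace):
            return "environment-noise"
    return "possible-real-bug"

assert triage("selenium raised StaleElementReferenceException") == "environment-noise"
assert triage("AssertionError: balance was 90, expected 100") == "possible-real-bug"
```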
Anomaly detection: Production logs can surface unexpected user journeys or edge cases you’d never think to test. If you see a pattern in the logs that doesn’t match your expected user flows, that’s worth investigating.
Final Thoughts: Don’t Be Scared, Embrace AI
Look, there’s a lot of hype about “AI agents replacing QA engineers." I see it on LinkedIn constantly. But here’s the thing: we’ve been through this before with automation.
When test automation first became a thing, a lot of manual testers said “that’s not real testing" and ignored it. Then suddenly job requirements said “must know Selenium" and those testers were left behind. I see the same pattern with AI now.
My advice: Don’t be afraid. Mess around with AI tools. See what they can do and what they can’t do. Build something open source like Tanvi did. The testers who embrace AI and figure out how to apply it to their domain are going to be the ones companies want to hire.
And honestly? If your application is so trivial that an AI agent can fully test it with zero human oversight, maybe your application isn’t that valuable to begin with. For anything with real risk — finance, healthcare, legal — you’re going to want human judgment involved for a long time.
Transcript Highlights
“If we run those logs and see what we missed, we can add new test cases for the next release. It’s a continuous process where you keep adding based on what’s happening in production.” — Tanvi
“I’m still waiting for tools that can completely replace human QA from A to Z. Human judgment is very much essential when you talk about quality.” — Tanvi
“If your application is so trivial that an AI agent can fully test it, maybe your application isn’t that valuable. For healthcare and finance, you’re going to want human eyes on it — you could get sued or kill someone.” — Joe
Key Takeaways (for quick scanning)
✅ Production logs contain real user behavior that your test cases might be missing
✅ AI-powered log mining can cluster similar scenarios and auto-generate Gherkin tests
✅ Shift-right to shift-left: use production data to improve pre-production testing
✅ Requires structured logs with clear event names (not just “session ended”)
✅ Helps identify risk areas based on actual usage patterns
✅ Open source tools like Log Miner make this accessible to any team
✅ Security can be maintained with local processing and data masking
✅ Testing non-deterministic AI systems requires human judgment, not less
About TestGuild
TestGuild is a software testing media, education, and events company serving 40,000+ community members. Founded by Joe Colantonio, TestGuild produces the TestGuild Automation Podcast (580+ episodes), runs the annual Automation Guild conference, and provides free tools including Tool Matcher, Automation Scorecard, and Accessibility Scanner. TestGuild’s mission: help quality engineers stay ahead of the curve with practical, no-BS guidance on test automation and AI testing.
Rate and Review TestGuild
Thanks again for listening to the show. If it has helped you in any way, shape, or form, please share it using the social media buttons you see on the page. Additionally, reviews for the podcast on iTunes are extremely helpful and greatly appreciated! They do matter in the rankings of the show and I read each and every one of them.