AI Testing Made Trustworthy using FizzBee with Jayaprabhakar Kadarkarai

By Test Guild

About This Episode:

As AI tools like Copilot, Claude, and Cursor start writing more of our code, the biggest challenge isn’t generating software — it’s trusting it.

In this episode, JP (Jayaprabhakar) Kadarkarai, founder of FizzBee, joins Joe Colantonio to explore how autonomous, model-based testing can validate AI-generated software automatically and help teams ship with confidence.

FizzBee uses a unique approach that connects design, code, and behavior into one continuous feedback loop — automatically testing for concurrency issues and validating that your implementation matches your intent.

You’ll discover:

Why AI-generated code can’t be trusted without validation

How model-based testing works and why it’s crucial for AI-driven development

The difference between example-based and property-based testing

How FizzBee detects concurrency bugs without intrusive tracing

Why autonomous testing is becoming mandatory for the AI era

Whether you’re a software tester, DevOps engineer, or automation architect, this conversation will change how you think about testing in the age of AI-generated code.

Exclusive Sponsor

Discover TestGuild – a vibrant community of over 40k of the world's most innovative and dedicated Automation testers. This dynamic collective is at the forefront of the industry, curating and sharing the most effective tools, cutting-edge software, profound knowledge, and unparalleled services specifically for test automation.

We believe in collaboration and value the power of collective knowledge. If you're as passionate about automation testing as we are and have a solution, tool, or service that can enhance the skills of our members or address a critical problem, we want to hear from you.

Take the first step towards transforming your and our community's future. Check out our done-for-you awareness and lead-generation demand packages, and let's explore the awesome possibilities together now: https://testguild.com/mediakit

About

JP Kadarkarai founded FizzBee, an autonomous testing platform that validates software behavior and helps teams ship code with confidence. Before FizzBee, JP spent nearly 20 years building large-scale distributed systems and AI applications at Google and other startups. FizzBee’s mission is to help developers deliver high-quality software faster, especially as AI begins to write more of the code we run.

Connect with Jayaprabhakar Kadarkarai

Rate and Review TestGuild

Thanks again for listening to the show. If it has helped you in any way, shape, or form, please share it using the social media buttons you see on the page. Additionally, reviews for the podcast on iTunes are extremely helpful and greatly appreciated! They do matter in the rankings of the show and I read each and every one of them.

[00:00:35] Joe Colantonio Hey, do you want to know how to design reliable, scalable, distributed systems using model-based testing? Well, if that's you, you're in for a treat, because we have the creator of FizzBee in the house, JP. If you don't know, FizzBee is an autonomous testing platform that validates software behavior and

[00:00:52] helps teams ship code with confidence. Before FizzBee, JP spent nearly 20 years building large-scale distributed systems and AI applications at Google and other startups, so he really knows his stuff. We're talking AI, reliability, model-based testing. This is something I think you need to learn more about. You don't want to miss it. Check it out.

[00:01:11] Joe Colantonio Hey JP, welcome to the Guild.

[00:01:17] Jayaprabhakar Kadarkarai Hi Joe, it's nice talking to you and the Test Guild audience.

[00:01:22] Joe Colantonio Awesome. Great to have you. So I'm just curious to know, Google's probably a really good gig. What drove you to do a startup?

[00:01:30] Jayaprabhakar Kadarkarai No, actually, after Google I was at a couple of other places as well. I wanted to explore and work at a smaller company, and I always wanted to start my own company, so I moved from a bigger company to a slightly smaller one, then an even smaller one, and then decided maybe I should just jump in. That's about it. When I was working at my previous job, I noticed there was a need for a better way to design software, or a better way to prove that a design is good. That's where I got into formal methods. But back then, the formal methods tools were so complicated that very few people were using them. Then a couple of years ago, ChatGPT and Copilot started, and now there are maybe 10 or 20 coding agents out there. I feel like coding is not much of a bottleneck anymore. It's really about the instructions we give to the agent; the quality of those instructions becomes more important, and being able to validate the output of the agents is crucial as well. So that's where I thought, okay, this would be a valuable tool that I can build, and that's where FizzBee started.

[00:02:56] Joe Colantonio Do you see this as a problem everyone's facing now? You know, using Claude or Cursor or anything, when they're prompting, they're able to generate, like I said, a bunch of code, but how do you know it's doing what you want it to do, or if it's good code? Is this like the main pain point that you're aiming for?

[00:03:13] Jayaprabhakar Kadarkarai Yes. Based on the feedback from talking to people, that keeps coming up. On one side you'll hear, okay, I built an entire app with vibe coding, but those are not production apps. Yeah, it will work for most cases, but the moment you talk to people in the industry, they'll say, okay, it generates a lot of code, it's good for the first two to three months, and after that things start to get messy, because every edit is now becoming more complex and you don't know whether the test is wrong or the code is wrong. And with the many self-healing tools, even for testing, it's easy for trivial cases where you can say, okay, fix the test whenever my code changes. But the thing is that many times we don't even know whether the code is correct or the test is correct when a test fails, so self-healing testing solutions don't really solve that problem either. And whether it's for coding or other cases, people keep saying that hallucinations are real with LLMs, so you have to fact check, and even in coding that keeps coming back often. The tools are now getting much better at generating code, but we still cannot be fully confident. It's not like a compiler where, if it compiles, you never have to look at the output of the compiled code, right? Whether it's a higher-level one like TypeScript to JavaScript or a low-level one like a C compiler, you don't look at the generated output. Whereas with LLM-generated code, you're supposed to review it to ensure it is correct, but at the speed at which LLMs can generate code, we cannot keep up with reviewing it, and humans have a very short attention span. It's also boring work; people like to write code, not review code. So it becomes even harder, and it gets worse for tests. No one likes to review tests either. Even in the old days when people were writing tests manually, we could force them to write the tests, but when it came to reviewing, they would just skim through. That's always been a problem. I believe that with high-quality generated tests, we can be very confident that the output generated by the LLMs is valid, that the code generated is correct. And if it is correct, then we'll be able to ship faster.

[00:05:44] Joe Colantonio Love it, love it. I first learned about FizzBee on my news show; I came across your LinkedIn profile when you were announcing it. And I think you said something in there: that development is incremental, but testing is cumulative. I'm just curious, can you explain that? Does that make sense? Why is testing cumulative? Are you saying that the more and more code that's being generated, you just have to create more and more tests to validate it?

[00:06:10] Jayaprabhakar Kadarkarai That's right. Basically, every time you add a new feature, each feature is incremental; you're just adding one feature. But you're not testing only that particular feature, that particular code. You need to test how it works and interacts with all the previous features and the rest of the system; it's not testing that part in isolation. The whole suite of automated testing exists for that purpose anyway, to make sure no regression is introduced. But yeah, that's where the cumulative part comes in. That is also why, as a startup, when we don't have many features, we can just keep launching. As a company gets bigger, it has a lot of features, each customer has a different set of high-priority features, and you have a whole bunch of things, so it gets messy: what happens when each of them interacts with the others? That's why I say testing is cumulative.

[00:07:10] Joe Colantonio Absolutely. So if we're able to generate code for software, and an automation test framework is basically software too, why can't we just rely on something like Copilot to generate the tests for us? Why is that not enough?

[00:07:29] Jayaprabhakar Kadarkarai That's a good point, but like I said, so far that's the one solution most people are using anyway: we start with the tests, whether before or after, because it's just more code. The issue is that the tests themselves are still example-based tests. Instead of you saying "try with this example," the LLMs are going to create a whole bunch of examples for you to test, and the code itself would be written for those examples. The difficult part is, again, the reviewing. We still cannot be confident that it tested every scenario, or have any estimate of how good the quality of the tests is. And with a bunch of examples, the issue is that whenever there is a small feature change, even a small API change, now maybe a hundred different tests need to be updated. When I get a code review with hundreds of tests updated, do I have the patience or even the interest to review all of them? It's non-trivial. Again, some people are trying things like having two different LLMs, one to create the code and one to test it, but it's still a similar problem. Whereas the kind of solution that would work for these types of problems is something like property-based tests. In a property-based test, you're just defining the property; you're not listing down the examples. If you've used Hypothesis, or QuickCheck and similar tools in other languages, you just say, for example, for a naive square root problem: for the square root of any positive real number, if you multiply the result by itself, you should get back the original number. Now you can randomly generate any number of real-valued inputs and check that the property holds, which means the amount of test code written is trivially small, and you don't need multiple examples. It's a similar concept with fuzzing, another type of such test: you don't list down examples, the inputs are just randomly generated, and you can be confident the system won't crash on a random set of bytes. Those are the types of tests I believe will be more valuable going forward. And going back to property-based tests: property-based tests are good for stateless systems and are typically suitable for unit tests and small individual ..... But if your system itself is stateful, a naive example is a key-value store. You're inserting, deleting, and updating, and all of that can happen continuously, but we can also test multiple orderings of these events without having to list them down. All we need to say is, okay, these are the actions that can happen: you can put, you can list, you can delete, just CRUD. If these are the APIs, the test can generate maybe thousands of sequences, randomly generated with randomly generated inputs, and validate whether the state transitions are valid. That's the thing FizzBee is trying to do. For this, in addition to having your code, we need a way to know whether the state transitions are valid or not. That is where the model comes in. We define some trivial model to indicate how the system is expected to behave, and that's usually very concise, plus a mapping to indicate that this action in the model maps to this particular API in your implementation.
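To make the square-root property concrete, here is a minimal sketch using Python's Hypothesis library, which JP mentions; the function under test (math.sqrt) and the tolerance values are illustrative assumptions for this example, not anything FizzBee-specific.

```python
# Minimal property-based test sketch with Hypothesis: one property replaces
# hundreds of hand-picked example cases.
import math

from hypothesis import given, strategies as st


@given(st.floats(min_value=0, max_value=1e12, allow_nan=False, allow_infinity=False))
def test_square_root_round_trips(x):
    root = math.sqrt(x)
    # Property: sqrt(x) * sqrt(x) == x for any non-negative real number,
    # within floating-point tolerance. Hypothesis generates the inputs.
    assert math.isclose(root * root, x, rel_tol=1e-9, abs_tol=1e-9)
```

When the property fails, Hypothesis shrinks the input to a minimal counterexample, which is part of what makes reviewing such tests cheaper than reviewing hundreds of hand-written examples.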

[00:11:42] Joe Colantonio All right, so for older people like me, when they hear model-based testing, it was very much a manual approach and it was based on a lot of assumptions. I guess, first, for people that don't know, what is model-based testing? And how does your tool handle model-based testing with a more modern approach than maybe when I started?

[00:12:01] Jayaprabhakar Kadarkarai I've only been looking at it recently, with the latest tools. The thing is, the concept of model-based testing is obviously not new; it has been around for a long time. Basically, we need to create a simplified, in-memory model. For example, for a key-value store, we may be able to have a trivial model using a hash map or something like that. And then we compare whether the behavior matches between the two: if you do a put on the real system and a put on this trivial model, do they behave the same? Or take a more complex system like an email application. The real email application would be much more complicated, but in a trivial model you put everything into small lists and in-memory data structures. When you say send, it just puts the message into an outbox, which is again another list or something like that, changes the status to sent, and so on. Basically, you can put everything into an in-memory structure. That is the high-level concept of model-based testing: you have a rough model, and you validate whether your real implementation behaves like that model, or how well it maps to your actual model. The issue is that representing the model itself used to be hard, so different people have tried different approaches to defining the model. Many tools use the same programming language to represent the model. For example, I think Hypothesis in Python also supports model-based testing; they call it rule-based stateful testing, which is roughly an extension of property-based testing for stateful applications and is quite similar to model-based testing. The model should be trivial, and we should be able to quickly extract it from a design document itself. The model should also be independently validatable. That's what I was looking at, because if I'm going to design some system and the design itself is going to be more and more valuable, then the design itself should become testable: am I designing my product correctly? That's where formal methods come in. But formal methods are another thing that has been around for 20 or 30 years; very few people use them, everyone says we should, but they're seen as too complicated, partly because they used to be too mathematical. So coming from both of these, I felt that having a Python-like language to express your model would be more accessible to most engineers in the industry. And with LLMs coming in, we'd be able to generate that model as well. But unlike traditional formal methods, which are too complicated for people to review, this model is concise, it's within our attention span, and it's easy to understand and read. And unlike pure plain-text designs, which cannot be validated, we can actually validate it using formal methods techniques. Once you validate it, you can even set various assertions, including safety properties and liveness properties: some good thing should eventually happen. When you send a write, even if some replicas crash, the system should eventually become consistent. Those kinds of eventual guarantees are called temporal properties or liveness properties, and they can be validated at the design phase itself. These are very hard to test in the implementation, but we can validate them at the design phase.
Pretty much, we started off there as a formal methods tool, and then I was looking for a way to test whether the implementation matches the design.
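JP's mention of Hypothesis's rule-based stateful testing can be illustrated with a short sketch. The KeyValueStore class below is a hypothetical stand-in for a real implementation under test, and the plain dict next to it plays the role of the trivial in-memory model he describes; none of the names are FizzBee's.

```python
# Sketch of model-based (rule-based stateful) testing with Hypothesis:
# random sequences of put/delete/get are run against both the "real"
# system and a trivial dict model, and the two must always agree.
from hypothesis import strategies as st
from hypothesis.stateful import RuleBasedStateMachine, rule


class KeyValueStore:
    """Hypothetical stand-in for the real implementation under test."""

    def __init__(self):
        self._data = {}

    def put(self, key, value):
        self._data[key] = value

    def delete(self, key):
        self._data.pop(key, None)

    def get(self, key):
        return self._data.get(key)


class KeyValueStoreMatchesModel(RuleBasedStateMachine):
    def __init__(self):
        super().__init__()
        self.real = KeyValueStore()  # system under test
        self.model = {}              # trivial in-memory model

    @rule(key=st.text(max_size=5), value=st.integers())
    def put(self, key, value):
        self.real.put(key, value)
        self.model[key] = value

    @rule(key=st.text(max_size=5))
    def delete(self, key):
        self.real.delete(key)
        self.model.pop(key, None)

    @rule(key=st.text(max_size=5))
    def get_matches_model(self, key):
        # After any randomly generated sequence of actions, the real
        # system's answer must match the model's answer.
        assert self.real.get(key) == self.model.get(key)


# Hypothesis exposes the state machine as a unittest-compatible test case.
TestKeyValueStore = KeyValueStoreMatchesModel.TestCase
```

The tester only declares the possible actions; Hypothesis chooses the sequences. That is the same shift from listing examples to declaring actions that FizzBee applies at the system level.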

[00:16:01] Joe Colantonio Awesome. All right, so say you have someone vibe coding, not necessarily a hardcore coder. What's the next step then? What do they have to give FizzBee in order for it to understand what they vibe coded, when they may not even know what they coded? I don't know if that makes sense. Does it plug in to something like Cursor? Do you just say, hey FizzBee, like an MCP server, create a model based on whatever I created?

[00:16:25] Jayaprabhakar Kadarkarai Yeah, actually, right now I'm working on the MCP and agents for generating the FizzBee spec, but I have very basic Copilot instructions. I can give those instructions to Copilot, and it is able to generate a valid FizzBee spec, but it is not fully trained; we still have to review whether the model is correct or not. The thing is that FizzBee will also automatically generate interactive visualizations, like how you would explain a design on a whiteboard. So if you put in a design for, say, a key-value store implementation, it would generate an interactive visualizer where you can click a button to see, okay, put means this is how the data would be stored. Click another button and see, okay, if it is partitioned or sharded by key, this is how it would be stored. You can actually visualize it; it's generated automatically. That is another form of review: you can not only review the spec, but also review the visualization to understand whether it matches the expected behavior. That's the one part that still has to be a bit more manual; some work is needed. The rest is pretty trivially automated, because once you have the model and the code, it can extract the API from the code and say, okay, this action roughly maps to this particular action or API in your code. That is done automatically. That said, I'm also looking at another flow where you take that spec and give it to Copilot and say, okay, this is my design, implement it. The design itself is much more precise and validated compared to free-form text, so giving a precise instruction means the agent can refer to it without ambiguity and generate the code more correctly. It could still be wrong, but since we also have the tests, it can loop repeatedly until the tests pass. That way, your implementation will be correct on the first pass you see. That is the thing I'm going after.

[00:18:53] Joe Colantonio Very cool. So you're testing kind of non-deterministic code nowadays with AI. How do you make your tests deterministic? How do you make your tests repeatable or stable even when the code is constantly changing and you don't know what it's going to change? Is it because it knows the model, and it can say: you made this change, I know the API, it affects these components, based on the model I created ahead of time?

[00:19:19] Jayaprabhakar Kadarkarai Yeah, exactly. As long as the behavior did not change from your design, the model you have specified, it can heal itself. If the interface doesn't change, obviously I don't need to change anything, because every sequence that is valid in the model is eventually going to be tested against your code. But if your behavior itself changes, then this would be a problem: either your code might be correct but your model is not up to date, or vice versa. Now, is it a regression, or is it a new feature you added that was not modeled? We cannot make that decision; we can only say that there is a mismatch. It's up to the developer to decide whether to update the model or fix the code.

[00:20:13] Joe Colantonio How many more bugs does this catch? Do you have any case studies, beyond your own experience, of how many bugs it catches compared to not using something like FizzBee?

[00:20:23] Jayaprabhakar Kadarkarai Actually, this is pretty new right now, so I don't have any full-on case study. The autonomous testing was launched only two weeks ago; it's pretty new. But the design verification was used independently before as well. As a formal methods tool, without even testing the code, we have found many concurrency issues: what happens when two requests come in concurrently and there is a specific sequence of thread interleavings that can cause an issue. People from Confluent and Shopify and a few other companies have tried this out before. And that technique is not new; even with other tools like TLA+, Amazon is a big proponent of formal methods for their cloud infrastructure. MongoDB, and again Confluent, and pretty much every cloud infrastructure company have tried formal methods to verify their designs. It's just that the mapping from the design to the code, checking whether the code is correct or not, used to be done ad hoc. Very few people did it, as a research project, and it never made it into a product because it used to be too complicated.

[00:21:45] Joe Colantonio Yeah, so one of the things that caught my eye as well when you announced this, and like you said, it's brand spanking new, is that you mentioned it's able to find concurrency issues. How is it able to do that? A lot of times people need to use intrusive tracing and maybe custom hooks in the code. How does FizzBee get around that when finding concurrency issues?

[00:22:07] Jayaprabhakar Kadarkarai Okay, so this is again by comparing with the model. The thing is, we need to run multiple actions concurrently; that's the first part. Traditionally, in most testing, we don't check for concurrency because it introduces flakiness. A naive example: if a put and a clear happen concurrently, we don't even know what to set as the expectation, and because of that, many times people don't write those tests at all. Here, based on the timestamps of when these two different threads ran and how they overlap, I can check what could have happened: whether one came first or the other. If they are not overlapping, we know the one that completed first should have happened first. But if they are overlapping, I have to check both cases. This is called linearizability checking. But most real databases are not linearizable anyway, and still, with some unique, novel techniques, we can test non-linearizable systems with FizzBee too, like eventually consistent systems. The high-level idea is trying out all the combinations allowed by the timestamps to see if they match the design. Effectively, it is not checking whether your system is linearizable; it is checking whether your system matches the design. So having the design is crucial here.
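As a rough illustration of the timestamp-based overlap check JP describes, here is a sketch in Python. The Operation record, the model_accepts callback, and the brute-force use of permutations are all illustrative assumptions for this example, not FizzBee's internals, which JP notes use more refined techniques.

```python
# Illustrative sketch of checking a concurrent history against a model:
# record start/end timestamps per operation, then accept the history if at
# least one ordering that respects real time is allowed by the design model.
from dataclasses import dataclass
from itertools import permutations
from typing import Any, Callable, Sequence


@dataclass
class Operation:
    name: str       # e.g. "put(k, 1)" or "clear()"
    start: float    # timestamp when the call was issued
    end: float      # timestamp when the call returned
    result: Any     # observed return value


def respects_real_time(order: Sequence[Operation]) -> bool:
    """If op A finished before op B even started, A must come before B."""
    for i, later in enumerate(order):
        for earlier in order[:i]:
            if later.end < earlier.start:
                return False
    return True


def history_matches_model(
    ops: Sequence[Operation],
    model_accepts: Callable[[Sequence[Operation]], bool],
) -> bool:
    """Brute-force check; real checkers prune instead of enumerating
    every permutation."""
    return any(
        respects_real_time(order) and model_accepts(order)
        for order in permutations(ops)
    )
```

Non-overlapping calls pin their order, so only the genuinely concurrent operations multiply the cases that have to be replayed against the model.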

[00:23:42] Joe Colantonio Very cool. All right. So if someone's listening to us, they're like, all right, this sounds pretty cool. How could someone get started with FizzBee? Is it open source? Is it paid? Like someone wants to try it out, but what do they need to do?

[00:23:55] Jayaprabhakar Kadarkarai FizzBee, the formal methods tooling, is completely open source, so they can try it out right away. Just brew install FizzBee. It has binaries and works on both Mac and Linux. I haven't tried it on Windows; I've never had access to one, but it should work. The tutorials are pretty clean; most people would be able to start formal modeling within a couple of hours of training. I also have the FizzBee custom instructions that you can put into Copilot, which makes it super easy to build the model. I'm also working on an agent for this; in the long run, I want it to accept a design document and create a model out of it, but right now that part is not done. For the testing part, right now it's also free, but it is not open source. Again, brew install FizzBee.MBT, and you can just install it and try it out locally. Right now, I don't have any paid service. In the long run, yes: instead of developers running it on their machines, I want it to be a hosted service so they can put it in their GitHub Actions or whatever, and every time they merge, it should automatically trigger and test. That's long term; right now it's manual, and users have to install it locally and test.

[00:25:27] Joe Colantonio Do you see this as a mandatory technique going forward as we get more and more AI-generated code? I would think this type of approach would become more and more mandatory. I assume you think the same thing because you created this. Are there any other methods or things you see as needed as we go into 2026 with AI-generated code?

[00:25:46] Jayaprabhakar Kadarkarai Yeah, like you noted, I do believe this will be way more crucial, and it all comes under the broad umbrella of autonomous testing. Again, autonomous testing means different things to different people. Sometimes it's about execution; sometimes it's just about test generation, where they're automating the mechanics but the test cases are still example-based. In my case, even that is automated: you never have to review whether the test cases are complete or not. Those types of things, I believe, are going to be way more valuable. Think about it: let's say you have a website, and you declare the actions that can happen on that website. The tool should be able to randomly click through various buttons, fill in random data, and tell you whether the behavior matches your expectations. That is going to be more valuable than any tool where, even in free-form text, you say, okay, go click this button and then this button and see if this happens. That is not really autonomous in my opinion, because it's automating the mechanics, like how to click a button or what code to write to click that button. I don't even want to give the sequence of actions; I just want to say these are all the actions that are possible. That, I believe, is going to be the more robust and valuable solution in the long run. One other technique, which I think I mentioned before, is property-based testing, which is going to be crucial as well. And fuzz testing and those kinds of approaches where you don't have to list down exact test cases will, I definitely believe, be more valuable.

[00:27:34] Joe Colantonio So if someone's job is a software tester right now, just like a coder, is their job in jeopardy, or is it just going to morph? Is it going to morph so that they need to do more property-based plus model-based testing? Is it going to be more high level, just making sure the AI is doing what you expect it to do? From your experience, where does the human come into play in this world?

[00:27:56] Jayaprabhakar Kadarkarai Very rarely has automation taken jobs, even though people have been saying it will. Every company that says AI is going to take your job is the one hiring more engineers while other companies are not. So I don't see these jobs going away. It's just going to make people more productive and let them do the things humans are good at, because ultimately we are selling to humans. They understand what the expected behavior is, and they can build the model properly, saying this is how it should happen, instead of doing the mechanics of individual things. And since we're talking about automation: even 30 or 40 years ago, I remember people saying that a language like BASIC would mean anyone can write code and we wouldn't need programmers after a few years. But no, the software we build became more complex, so we needed even more specialized skills. I believe that's going to happen again. I don't know if I should give an example from manufacturing, but those jobs are still there; it's a huge sector. Automation rarely takes jobs.

[00:29:08] Joe Colantonio Love it. All right, JP, before we go, is there one piece of actionable advice you can give to someone to help them with their AI testing efforts? And then, what's the best way to find or contact you?

[00:29:18] Jayaprabhakar Kadarkarai To contact me, LinkedIn is one easy option. We have a Discord channel, and you can email me directly at jp@sb.io. From an advice perspective, I'll be biased in whatever I say, but I believe testing is going to be way more valuable. Instead of just creating more and more example-based tests, which you don't even review, look for better techniques; I believe property-based testing and model-based testing are going to be more valuable. And I don't want to say anything bad about property-based testing, because I believe model-based testing is an extension of property-based testing. So I always say they both will be extremely valuable, irrespective of what tool you use, whether it's FizzBee or other model-based testing tools. I believe this is going to be extremely useful.

[00:30:17] Also, you can find links to all that FizzBee awesomeness and everything else down below.

[00:30:22] Thanks again for your automation awesomeness. Links to everything of value we covered in this episode can be found over at testguild.com/A565. And if the show has helped you in any way, why not rate it and review it in iTunes? Reviews really help in the rankings of the show, and I read each and every one of them. So that's it for this episode of the Test Guild Automation Podcast. I'm Joe, and my mission is to help you succeed with creating end-to-end, full-stack automation awesomeness. As always, test everything and keep the good. Cheers.

[00:30:57] Hey, thank you for tuning in. It's incredible to connect with close to 400,000 followers across all our platforms and over 40,000 email subscribers who are at the forefront of automation, testing, and DevOps. If you haven't yet, join our vibrant community at TestGuild.com, where you become part of our elite circle driving innovation in software testing and automation. And if you're a tool provider or have a service looking to empower our guild with solutions that elevate skills and tackle real-world challenges, we're excited to collaborate. Visit TestGuild.info to explore how we can create transformative experiences together. Let's push the boundaries of what we can achieve.

[00:31:40] Oh, the Test Guild Automation Testing podcast. With lutes and lyres, the bards began their song. A tune of knowledge, a melody of code. Through the air it spread, like wildfire through the land. Guiding testers, showing them the secrets to behold.

{"email":"Email address invalid","url":"Website address invalid","required":"Required field missing"}
