About This Episode:
How does quality engineering fit into DevOps? In this episode, Lagan Khare, a Manager of Quality Engineering with more than 13 years of experience, shares her insights into QE and integrating automated testing into DevOps and Agile. Discover how QE differs from QA and how to implement QE with DevOps, metrics, CI/CD and more. Listen up!
Exclusive Sponsor
The Test Guild Automation Podcast is sponsored by the fantastic folks at Sauce Labs. Try it for free today!
About Lagan Khare
15 years of experience in the software industry. Currently managing Quality Engineering and pushing for engineering efficiencies across a wide portfolio of products, both legacy and cutting-edge teams, through empathetic leadership, technical guidance, and encouraging collaboration.
Full Transcript Lagan Khare
Joe [00:01:14] Hey, welcome to the Guild!
Lagan [00:01:17] Hey Joe! How are you?
Joe [00:01:17] Awesome, I guess probably before we get into it, is there anything I missed in your bio that you want the Guild to know more about?
Lagan [00:01:22] I think you covered it pretty good. I basically work to push our engineering efficiencies across a wide portfolio of products, both legacy and cutting-edge teams.
Joe [00:01:32] That's one of the things I love about your background. It seems like a real-world situation, and it's a huge enterprise, which I love. So I'm really excited to dive into what you did to actually drive the transformation at your company. The first part that I think people sometimes get hung up on is what quality engineering is. Back in the day, I used to call things quality assurance. So I guess to start off with, what is quality engineering?
Lagan [00:01:54] Basically, quality engineering is a set of interrelated responsibilities across three different spheres: customer advocacy, defect mitigation, and quality ops. To explain this a little further, quality engineers are basically customer advocates. We represent customers, we're in the shoes of the end-users, we're trying to test relevant things, making sure the requirements are correct and the product is working as per expectations. The second sphere is defect mitigation. We want to ensure that we are testing at every layer: there are unit tests, service tests, front-end testing, automation, and then performance engineering and all those good things. And then the most interesting one, which I feel is newer to this domain, is quality ops, which is how quality engineering can be integrated with ops. It deals with things like deployment ownership and environment ownership. Quality engineering is more than QA, because quality assurance is about making sure the product is of good quality and finding defects, but quality engineering takes on more than that. It means getting involved in the overall process of engineering, right from requirements gathering into the build process, into CI/CD, into releasing the software, into production monitoring, and collaborating with different disciplines across the entire SDLC.
Joe [00:03:28] Nice. So I guess a lot of people might then ask, how is this different than DevOps or Agile? Is it like one and the same? Is it something different?
Lagan [00:03:35] So I think it's definitely different from Agile and DevOps. But I also feel like in modern-day quality engineering, even though it's not the same thing, we have to work in collaboration with these different disciplines, because quality engineering on its own can only go so far if you are not collaborating with your dev partners, your partners in delivery management, your partners in the DevOps world. So it is different, but I'd say we should work towards bridging that gap. Some of the things we've done to bridge that gap are introducing squad-based testing, where we expect the entire team to help us with testing, so that QE doesn't become the bottleneck and there is a continuous flow of testing through all the sprints. At the same time, we're also working with the DevOps team on monitoring and synthetics, making sure that QEs are collaborating with them and helping drive those (unintelligible), because as quality engineers, we know the product, which areas are most important, where monitoring should be, and what kind of alerts should be set. So we're working with DevOps in that collaboration to provide insight into what is important, what should be monitored, and how to keep those tests updated. And then similarly with developers, we had a goal last year we called “find it, fix it,” again in terms of bridging the gap between QE and Dev, where we literally asked our quality engineers to find a defect and, if you know your product well and you know the tech stack, submit a PR for that defect fix.
Joe [00:05:18] Yeah, I love the concept of “find it, fix it”. You did mention something about squad-based testing that just got me thinking. How does this work in your organization? Do you have sprint teams with testers embedded in the sprint teams? Is it still a separate team but then you all get together to do squad-based testing? Could you tell me a little bit more about the process, I guess, of getting QE involved in your teams?
Lagan [00:05:40] Yeah. So, yes, you're right. We have different squads in our organization and QEs are embedded in those squads. As we all know, the Dev-to-QE ratio is not always one to one, and we don't want testing to become a bottleneck, so if there's one QE on a squad and they're not available, we don't want the testing to stop. The QE is still there to make sure the quality checks and the gates are in place, and we train the developers to help us with testing. We can start with manual testing, and devs can also contribute to automation testing. So for squad-based testing, the concept is to have everybody on the squad contribute to testing, because that avoids QEs being the bottleneck. And then once stories start getting stacked up in the testing column, teams usually decide on a work-in-progress limit. So if there are more than six or seven tickets in testing, then one of the developers will pick up a ticket from the testing column and help out with testing rather than picking up another ticket for development. The only rule is that a developer cannot test their own story, so they help test the stories and tickets developed by the other developers. Developers can also contribute to automation, regression testing, or release testing, and it then varies from team to team, depending on the team dynamics and what's comfortable for each team.
Joe [00:07:13] Awesome. So I think in your presentation you had squad-based testing underneath bridging the gap between QE and Dev. You also had something called, I think it was, “Have a Reliable Feedback Framework.” So how do you have a reliable feedback framework? How did you put that in place?
Lagan [00:07:29] So for the feedback framework, we wanted to have feedback at every level. It could mean two things, right? Feedback on the product, and then feedback on the quality engineering efforts and the kind of testing we're putting in. So we address feedback on different levels. For the product, we want to make sure that when the developers are testing, they know how to give feedback on the stories they're testing. There are rules in place, there is a proper test case management system, and there are proper channels where they can get feedback back to the original developer or the quality engineers about a particular ticket. For the feedback framework regarding how quality engineering is doing, we want to know whether the developers and the other team members understand where we are in terms of testing. How transparent is your quality engineering? Do you know what goes into your smoke test? Do you know what regression comprises? What do we do for production validation testing? So we want to make sure there is a channel to get all that feedback back to the quality engineering group.
Joe [00:08:37] Very cool. So you had another cool slide here on measuring how quality engineering is done. Now, you mentioned surveys, I believe, in the session. So is this a survey your company has done? Is it what other companies have done? What are the metrics that you found really helpful for actually measuring the QE transformation and how your teams are doing?
Lagan [00:08:54] Right. So we're actually measuring eight different metrics to measure engineering efficiencies. Four are the standard DORA metrics, the DevOps Research and Assessment metrics. That comprises lead time, which is basically the time it takes in business days from code commit to live in production. The second one is deployment frequency. We measure these metrics monthly, so how often we deploy to production in a month. Then another one is the change failure rate: what percentage of our releases are successful versus what changes need to be rolled back. The fourth one is the mean time to recover. Those are the four DORA metrics. Additionally, we track four more metrics: rework, production release activity time, critical automation percentage, and automation runtime. Rework is basically anything in your workflow that goes backwards. So if a story is being tested and you find a defect and it goes back to development, for any reason, it could be a config issue, it could be a build issue, it could be a defect, we count it, because we want to look into why there is so much rework, and these become really interesting topics for retro discussions or team meetings where we find areas for improvement. For example, if a team's rework is really high, say twenty-five, thirty, thirty-five percent, then are the acceptance criteria not clear? Are the developers and testers talking to each other? Is there enough clarity on the story? Is there enough context around the story? Are we doing refinement properly? So these metrics are really good indicators for figuring out what the problem area is and then trying to resolve it. Production release activity time is basically how long it takes a team to release to production and do the prod validation. We want to keep this number low if we want to increase deployment frequency, because if your deployments take hours to push out to production, we definitely don't want to do them again and again. And generally, when releases take that much time, people don't want to do them as often, which is the opposite of where we want to be. So that's again a really good metric to measure and track. Then critical automation is basically what percentage of your smoke test suite is automated. We want this to be one hundred percent. We're not tracking regression here because we want to be really careful about what we automate and how much we automate; it just becomes a big maintenance activity if we automate everything without prioritizing. As much as I love automation, I don't want to spend too much time automating tests that we're not going to run again and again. That's why we're tracking critical automation here. And then automation runtime, because it speaks volumes about the teams: does their critical automation run within minutes or within hours? If it takes two to three hours, then there's scope for improvement. Are we running the tests in parallel? Do we need to scale the infrastructure for testing? So these are some areas that give us insights into the state of the team and the improvement areas we can work on.
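For readers who want to see what the four DORA metrics Lagan describes look like in code, here is a minimal Java sketch of how they could be computed from a list of deployment records. The `Deployment` record and its field names are illustrative assumptions for this example, not the team's actual data model or tooling.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.List;

// Minimal sketch of the four DORA metrics described in the episode, computed
// from a list of deployment records. The Deployment record is a hypothetical
// data model used only for illustration.
public class DoraMetrics {

    // One production deployment: when its oldest commit landed, when it went live,
    // whether it had to be rolled back, and (if so) how long recovery took.
    record Deployment(Instant commitTime, Instant deployTime,
                      boolean rolledBack, Duration recoveryTime) {}

    // Lead time: average time from code commit to live in production.
    static Duration averageLeadTime(List<Deployment> deployments) {
        long avgSeconds = (long) deployments.stream()
                .mapToLong(d -> Duration.between(d.commitTime(), d.deployTime()).toSeconds())
                .average()
                .orElse(0);
        return Duration.ofSeconds(avgSeconds);
    }

    // Deployment frequency: deployments per month (the team measures this monthly).
    static double deploymentsPerMonth(List<Deployment> deployments, int monthsObserved) {
        return (double) deployments.size() / monthsObserved;
    }

    // Change failure rate: percentage of releases that had to be rolled back.
    static double changeFailureRate(List<Deployment> deployments) {
        long failures = deployments.stream().filter(Deployment::rolledBack).count();
        return 100.0 * failures / deployments.size();
    }

    // Mean time to recover: average recovery time across rolled-back releases.
    static Duration meanTimeToRecover(List<Deployment> deployments) {
        long avgSeconds = (long) deployments.stream()
                .filter(Deployment::rolledBack)
                .mapToLong(d -> d.recoveryTime().toSeconds())
                .average()
                .orElse(0);
        return Duration.ofSeconds(avgSeconds);
    }
}
```

The rework, production release activity time, critical automation percentage, and automation runtime metrics can be tracked the same way, as simple ratios or durations over whatever workflow data a team already captures.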
Joe [00:12:19] So you touched a little bit on test automation, but we haven't actually covered it in detail here. So I'm just curious to know, what is your high-level test automation strategy? And then maybe we can dive into certain areas.
Lagan [00:12:29] Sure. So for test automation, I believe that there has to be… so we don't want to create this big backlog of automation where we are always automating after the sprint is over, after the story is released. I'm a big proponent of considering automation in the definition of done. Now, automation should be done in different layers: there should be service-level automation, and there should be automation for the UI front-end stories and tickets. But again, there should be a careful discussion between the QE team and the stakeholders: what is the P1 test case? What is the critical-path scenario that at the very least should be automated? Which of our tests are stable? Which tests are likely to change? We should have some sort of agreement and buy-in from the entire team that this is something the test team is going to invest time in. We don't want to push automation to the last step, or do automation later, or just keep creating this backlog of automation which in reality we never get to. So my strategy is to automate the high-priority critical tests within the same sprint and include that piece in your definition of done. Once your story is tested, you report all the defects, the defects get fixed, it's accepted, you automate it, and then you push it out to production, so you're always up to date with your automation backlog.
Joe [00:13:53] So are there any lessons you learned from doing test automation that you iterated on to make it better? Like, how long does it take to run your smoke test? Did you find a bottleneck first? And how do you get the whole team not to lose confidence in the tests, like if they were failing? Were they flaky? Just tell me a little bit about the real-world implementation at that point.
Lagan [00:14:10] Right. So like I said, we focus on a higher deployment frequency. For this particular team, we wanted to get to a point where we were doing monthly deployments. Then from monthly, we automated a whole lot and wanted to get to weekly deployments, and now we're doing daily deployments, of course. But we were looking at this big test suite and we had a whole lot of flaky tests. We wanted the release to be a non-event; we wanted to be able to release on demand. So we tried breaking that problem down into different pieces: we looked at the test execution time, we looked at how many tests were failing, we looked at how the tests were organized, and then how the test results were being reported. On test execution, it was about three or four thousand test cases, and it was taking about five to six hours to run the regression testing. So we started collaborating with our DevOps partners and we scaled our infrastructure. Initially we were running on EC2 instances, and then we switched to Docker containers and ran a whole bunch of tests in parallel, and we were able to get it down to under 30 minutes of parallel execution. At the same time, we were getting a lot of false negatives, and for a three or four thousand test suite, even if five or ten percent of the tests fail, that's a lot for a person to review and validate. So we looked into options for getting rid of these flaky test results. We implemented a bunch of things; I remember we used IRetryAnalyzer, where we would take just the failed tests and execute them again until they pass, and if they're really failures, they would always fail. Then we also worked with our team to identify which tests have open defects, and if the defects haven't been fixed in forever, do we really need to continue running those tests? So we cleaned that up a little bit and prioritized the tests. We worked with the developers on questions like: if a product comes in different languages, do we need to run everything in every language? What is the best combination we can get out of this whole big test suite so that we're not repeating test cases? And then the other big problem was how to report the test automation results, because we wanted everything to be integrated, and we built it into the CI/CD pipeline. We looked at many options, and since we're using Jira for all of our project management, with Zephyr, we looked into the Zephyr API, I think it's called ZAPI. Basically what it does is integrate the automation test results back into Jira. Before that, we were using an external dashboard created by our team to log and review our automation results, but ZAPI seemed like a better, more integrated option for us. So we resolved the test reporting piece with that. Once we put all of that together, we could run our test cases in parallel in under 30 minutes and have our test results back in Jira. Then we created a Jenkins job, and we use Slack a lot, so we integrated Slack notifications for build failures and test run failures, whether they pass or fail. And that's how we solved the issue of getting to faster test execution.
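Lagan mentions using TestNG's IRetryAnalyzer to re-run only the failed tests so that flaky failures don't pile up as false negatives. For readers who haven't seen it, a minimal sketch of that interface might look like the following; the class, retry limit, and test names are made up for illustration and are not from the team's actual suite.

```java
import org.testng.IRetryAnalyzer;
import org.testng.ITestResult;
import org.testng.annotations.Test;

// Re-runs a failed test a limited number of times so transient (flaky) failures
// don't surface as false negatives, while genuine failures still fail every attempt.
public class FlakyRetryAnalyzer implements IRetryAnalyzer {

    private static final int MAX_RETRIES = 2; // illustrative limit
    private int attempts = 0;

    @Override
    public boolean retry(ITestResult result) {
        if (attempts < MAX_RETRIES) {
            attempts++;
            return true;  // tell TestNG to run this test again
        }
        return false;     // give up: treat the failure as real
    }
}

// Hypothetical usage on an individual test; a retry analyzer can also be attached
// suite-wide through a TestNG listener instead of per annotation.
class CheckoutSmokeTest {
    @Test(retryAnalyzer = FlakyRetryAnalyzer.class)
    public void userCanCompleteCheckout() {
        // ... UI or service assertions ...
    }
}
```

The trade-off of any retry approach is that it can hide genuinely intermittent product bugs, which is why the team paired it with cleaning up and prioritizing the suite rather than relying on retries alone.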
Joe [00:17:48] I wish people could see the slide you have. You have an example of your CI/CD pipeline, your flowchart, and it's really cool how you lay out pretty much what you just explained. It's a really great slide. Hopefully, with your permission, I'll be able to put it in the show notes for people to check out. I thought that was really helpful as well. So how do you segment tests? You mentioned, if they had defects, do you really need to run all these tests if there are known defects? If it's a language, do you really need to run all the languages? Do you do tagging? How does that work?
Lagan [00:18:17] Like I said, we use Jira for test case management, with Zephyr. So the logistics were pretty simple. We built different test suites and went through this manual process of revisiting and organizing our tests again. But I think the important part was to have that discussion between the stakeholders, like, “Is this important enough? Do we really need this test?” We actually got stuck with certain tests which we couldn't run in an automated fashion, and when we talked to the developers and the product owners, it came down to: these tests are not really that important, we don't want to run them with every small release, we could run them on a cadence, like a weekly or biweekly cadence. Another thing I think we should do is look at the usage of your applications: if you can get some data on which browsers are actually being used by the customers and what percentage of your customers are on which browsers, those are good insights for deciding how you want to run your tests and which browsers to run them on. So we looked at all of that and came to a conclusion like, “Okay, these are the tests we want to run on a weekly basis; these are the tests we absolutely need to run with every release.” We didn't use tagging or labeling, but we did customize Jira a lot to have these different test suites, and like I said, our automation results go back into Jira. So we've created these different categories of tests and organized them.
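Since the automation results flow back into Jira through ZAPI, a very rough sketch of what that reporting step could look like is below. It is a sketch only: the endpoint path, status codes, host, and auth header are assumptions based on the server flavour of ZAPI, not the team's actual integration, so check the documentation for your own Jira/Zephyr deployment before using anything like this.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Rough sketch of pushing an automation result back into Jira/Zephyr via ZAPI.
// The endpoint path, status codes, and credentials below are assumptions for
// illustration -- verify them against your Zephyr/ZAPI version.
public class ZephyrResultReporter {

    private static final String JIRA_BASE = "https://jira.example.com";    // hypothetical host
    private static final String AUTH_HEADER = "Basic <base64 user:token>"; // placeholder credentials

    private final HttpClient client = HttpClient.newHttpClient();

    // In many server-side ZAPI versions, execution status 1 = PASS and 2 = FAIL.
    public void reportExecution(long executionId, boolean passed) throws Exception {
        String body = "{\"status\": \"" + (passed ? 1 : 2) + "\"}";

        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create(JIRA_BASE + "/rest/zapi/latest/execution/" + executionId + "/execute"))
                .header("Authorization", AUTH_HEADER)
                .header("Content-Type", "application/json")
                .PUT(HttpRequest.BodyPublishers.ofString(body))
                .build();

        HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());
        if (response.statusCode() >= 300) {
            throw new IllegalStateException("Zephyr update failed: " + response.statusCode());
        }
    }
}
```

A reporter like this would typically be wired into the test framework's listener or the CI job so that every automated run updates the corresponding execution in Jira, which is what replaces the external dashboard Lagan describes.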
Joe [00:19:48] Nice. So I'm just curious how you handle this as well. You work for a large enterprise, a lot of teams, and over time your tests keep growing. Do you have any sort of pruning or deleting? Do you have any guidelines for doing that, or do you just keep the tests going? How do you know whether a test you wrote two years ago is still valid or not?
Lagan [00:20:05] I think it's very important to continuously review the test cases and what you're testing. You have to be proactive and have a review schedule, like on a quarterly basis or something, because over time tests do become invalid. And what happens is we know these are known issues, these are known failures, but we don't find enough time to actually go in and clean that up. So as a good practice, every team should keep looking into what's still valid for a particular product.
Joe [00:20:42] And so I think you've been at this company for a while. So I'm just curious to know what it was like before this transformation and what it was like after. Do you have anything tangible that showed you, “Okay, we made this big change, everyone's involved now in the quality engineering aspect”? So what did it get you?
Lagan [00:20:56] So like I said, Joe, we're pushing for engineering efficiency, which basically means getting to market faster and delivering value to our customers in a faster way. We look very closely at the lead time and we want to reduce it as much as possible, and oftentimes, more than infrastructure changes, it's a mentality shift. When you get into this mentality of shipping small batch-size releases, your deliveries become very low risk, because if a lot of commits, a lot of changes are bundled into one big release, it is risky, and the regression takes a really long time if it's not automated. So it's peace of mind for everybody, honestly, because if you're able to release low-risk changes more frequently, then the customers are happy and the stakeholders are happy, because they don't have to wait for a month or a week to get their features out in the market. We can just release and deploy as they are ready.
Joe [00:21:57] I'm just curious to know, with your organization as well, how do deployments work? Do your QEs get involved with deployment? How do the QEs get involved in deployment is my question.
Lagan [00:22:07] So we want QEs to be a part of the release process. We don't want quality engineers to be the people who just find defects; we really want QEs to wear multiple hats and become truly full-stack testers. So we work in collaboration with our DevOps partners and with developers, and we were pushing for these CI/CD pipelines where the deployments become simple enough that everybody can do them. We even had a team last year that said “everyone deploys,” and we literally got everybody on the squad to deploy, even the product owners and the delivery managers. Everybody deployed a release to production. But again, it has to be simple enough; it has to be a one-click release to production, and everybody has to be involved in the releases. So, yes, we want quality engineering to be involved in the releases.
Joe [00:22:59] Awesome. It sounds really cool what you've done with your organization with this transformation. So I guess before we go, is there one last actionable piece of advice you can give to someone to help them with their QE efforts? And what's the best way to find or contact you?
Lagan [00:23:10] Sure, you can find me on LinkedIn. My name is Lagan Khare. And the one piece of advice is collaboration. I think collaboration and relationship building are key for all the quality engineering people, because if the different disciplines, like development, the product owners, and DevOps, are all working together and collaborating, we can definitely gain efficiencies and make this whole delivery process much faster and less risky. So that's my one piece of advice: collaborate with your counterparts.
Connect with Lagan Khare
- Company: www.elsevier.com/
- LinkedIn: lagan-khare
Rate and Review TestGuild
Thanks again for listening to the show. If it has helped you in any way, shape or form, please share it using the social media buttons you see on the page. Additionally, reviews for the podcast on iTunes are extremely helpful and greatly appreciated! They do matter in the rankings of the show and I read each and every one of them.