Performance KPIs with Keshav Vasudevan

13 April 2021 at 8:34 PM

By Test Guild

About this Episode

Performance testing is a non-functional testing technique performed to determine the system’s parameters in terms of responsiveness and stability under various workloads. In today’s times where digital web applications are being used more than ever, performance testing is a critical part of ensuring optimal customer experience. But what do you measure, especially when working with multiple tech stacks, teams of different skill sets, and managing various release deadlines? In this episode, discover some real-life examples to understand the different protocol and browser-based metrics to measure how well your web application’s performance is doing across the UI and API layer.

TestGuild Performance Exclusive Sponsor

SmartBear is dedicated to helping you release great software, faster, so they made LoadNinja. It load tests your web application with thousands of browsers for fast, accurate data. Try it today.

About Keshav Vasudevan

Currently a Senior Product Manager at Aptible, Keshav is a product and marketing leader in technology with a passion for building and growing SaaS-based web and mobile applications. He has led product management and marketing of two successful SaaS products for SmartBear Software: LoadNinja and SwaggerHub.

Full Transcript Keshav Vasudevan

Joe [00:00:16] Hey, it’s Joe, welcome to another episode of the Test Guild Performance and Site Reliability podcast. In a few days, I’m actually turning the big 50, so I’m taking the rest of April off. Because of that, I’m going to my back catalog of all the awesome Perf Guild sessions we had at previous conferences and making those into podcast episodes to share some performance awesomeness with you. We’re not having interviews with people because I am away for the month of April. So this session is on performance KPIs by Keshav, who was at that time the product manager at SmartBear. He’s moved on since then, but he shared a lot with the Guild on the critical KPIs you should measure for your web application and how it appears in the eyes of your customers. So this talk goes over a bunch of real-world examples to help you understand the different protocol- and browser-based metrics that measure how well your application is performing across the UI and API layer.

Keshav [00:02:00] Thanks for tuning into my talk on how to paint a cohesive picture of your web application’s performance through measuring the right KPIs. I’m Keshav Vasudevan and I work as a product manager for all things performance testing at SmartBear Software. In my spare time, I love writing; check out my blog, guru.school, and you can also find me on Twitter. I love hearing from people about how they’re building their applications or delivering great quality for their customers. Feel free to reach out any time. And I want to preface this talk by mentioning that this is in no way an exhaustive list of all the metrics. There are so many different metrics you could measure in order to understand your application’s performance. Performance testing is an art. There’s a science and there’s a tactic to it, but there’s also creativity in getting to the right data and the right understanding of your application’s performance.

[00:03:00] I want to start this talk by presenting a case for better performance. Performance testing is more important than ever before. We have so many more businesses going online, especially during this time of the pandemic. And customers, given so many options, are spoiled for choice, which means the switching costs are lower than ever before. There’s a seven percent potential drop in sales for every hundred-millisecond delay in performance. People leave. About 40 percent of customers abandon a website when it takes over three seconds to load. And there is a potential abandonment rate of 87 percent if an actual sales transaction takes two seconds longer. So that’s how impatient we are, which means as people delivering great software, we have to account for performance testing. We have to make sure we’re delivering the best quality. So really, we are advocates of the end-users and their experience.

Before we even get into the metrics, it’s very important to understand what our testing goals are. Why are we even doing this? It’s one thing to say, “Hey, I want to just deliver the best experience.” But as performance testers, you need to have some clear goals. Some common goals could be to determine the responsiveness and the reliability of the system. Maybe you want to make sure that your application is meeting the SLAs that customers have come to expect from competitors or other players within the industry. Or maybe you want to determine the system’s capacity, which is the maximum amount of load it can take before it no longer meets the SLAs. Or maybe you want to handle growing workloads, right? Maybe your traffic team is showing you that the number of users logging in or signing up for your service is steadily increasing, and you want to ensure that your system can handle that. Or finally, you just want to diagnose some performance issues or identify bottlenecks. So start by identifying those goals, and then figure out which metrics you need in order to measure how close or how far you are from those goals.

[00:05:11] So performance metrics are just that. They’re used to calculate the performance parameters and help you identify leading and lagging indicators to measure against your performance goals. Now, leading and lagging indicators are terms borrowed from economics. Leading indicators are metrics that are more predictive of the system’s success or failure. Lagging indicators are not necessarily predictive; they are the result of the system failing. For example, say your DNS time was really high. That could be a lagging indicator, because it tells you that the DNS time is high, but that is not necessarily the cause. The cause could be poor network connectivity. Or maybe the DNS server and the client accessing this page are really far apart. So the leading indicator, the cause, is different from the actual result. By definition, leading indicators are harder to measure, which is why we use lagging indicators to get a sense of what’s actually happening in the system.

[00:06:31] Before you actually get started with the metrics, you need to first understand how your system performs when there’s no load. What is the ideal case, when your system is meeting the SLA? So always prefer starting your tests with one virtual user, or just you accessing your application, and noting down the baseline metrics. Then, when you actually do the load test in service of your goals, you can compare the results with the baseline to determine what’s breaking, whether there’s any degradation, and to diagnose the actual issue.
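The baseline comparison described above could be sketched roughly as follows. This is only an illustration: the function name `find_degraded`, the metric names, the sample numbers, and the 1.5x tolerance are all assumptions, not values from the talk.

```python
# Hypothetical sketch: compare load-test metrics against a single-user baseline.
# Metric names, sample values, and the tolerance are illustrative assumptions.

def find_degraded(baseline, under_load, tolerance=1.5):
    """Return metrics whose load value exceeds baseline * tolerance.

    Both arguments map a metric name to a time in milliseconds.
    The result maps each degraded metric to its load/baseline ratio.
    """
    degraded = {}
    for name, base_value in baseline.items():
        load_value = under_load.get(name)
        if load_value is None or base_value <= 0:
            continue  # nothing meaningful to compare against
        ratio = load_value / base_value
        if ratio > tolerance:
            degraded[name] = round(ratio, 2)
    return degraded


# Baseline captured with one virtual user; second set captured under load.
baseline = {"dns_ms": 20, "connect_ms": 35, "ttfb_ms": 180, "dom_load_ms": 900}
under_load = {"dns_ms": 22, "connect_ms": 40, "ttfb_ms": 720, "dom_load_ms": 2100}

print(find_degraded(baseline, under_load))
```

With these made-up numbers, only the time to first byte and the DOM load time exceed the tolerance, which would point the investigation at the application and client layers rather than the network.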

[00:07:10] Now, let’s actually dive into the metrics. To identify the right metrics, we start from the end-user experience: how is it affected when your system is under load? Then we work our way towards identifying what’s causing a potential degradation. So let’s say we are the end-users, accessing all of these web applications through the browser. You click on a link, and that sends a request across the wire. There’s going to be a DNS server somewhere in between, which identifies the right IP, and the request is sent to the actual app server. The app server is where the actual processing of the request takes place. Finally, the code finishes its execution and a response is sent back across the wire to the client. Once it’s back at the client, the browser receives this HTML content, begins processing it, and constructs the DOM, the skeleton structure of the page. Once the HTML has finished processing, the browser goes on to render the actual page. And across all of this, there are different events being captured by the browser, both for people to measure and for the browser to know when to go to the next step. So this is the typical transactional flow.

[00:08:46] Now, when we determine that something is broken, that the end-user experience has degraded, we can start diving into some of these different metrics and see how they compare to the baseline. So let’s start with the metrics from when the initial request is sent. Some network metrics here would be, for example, the DNS time. This is the time spent performing a DNS lookup, which is obtaining the IP address of the website from the DNS server. If the value is high, it could indicate that there are problems with reaching the DNS server and retrieving its response. Or maybe there’s some poor network configuration. Or maybe you just want to contact your DNS provider, right? There’s a ton of them available right now, and whoever your DNS provider is may be able to assist in figuring out what’s happening. The connect time is the time spent establishing a connection to the web server after the DNS lookup. It’s the TCP handshake phase. Again, if this value is high, which is a potential lagging indicator, it could indicate some routing problems. Maybe there’s a bad configuration or low efficiency of the server bandwidth.
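To make the two network metrics concrete, here is a minimal sketch that times a DNS lookup and a TCP handshake separately using Python’s standard `socket` module. The function name `measure_dns_and_connect` is hypothetical, and the numbers it produces are rough: the operating system may answer the lookup from a cache, and a load testing tool would collect these timings for you.

```python
import socket
import time


def measure_dns_and_connect(host, port, timeout=5.0):
    """Return (dns_seconds, connect_seconds) for one TCP connection."""
    start = time.perf_counter()
    # DNS time: resolve the host name to an address (may hit the OS cache).
    family, socktype, proto, _, address = socket.getaddrinfo(
        host, port, type=socket.SOCK_STREAM
    )[0]
    dns_seconds = time.perf_counter() - start

    # Connect time: the TCP handshake with the resolved address.
    with socket.socket(family, socktype, proto) as sock:
        sock.settimeout(timeout)
        start = time.perf_counter()
        sock.connect(address)
        connect_seconds = time.perf_counter() - start
    return dns_seconds, connect_seconds
```

A call such as `measure_dns_and_connect("example.com", 443)` (the host is a placeholder) returns both timings in seconds. Comparing them against the same measurement taken at baseline shows whether the lookup or the handshake is the part that degraded.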

[00:10:06] You can then go on to the second phase, which is the actual application processing. So, application metrics. One very famous metric is the time to first byte. This is the time spent waiting for the first byte of the response from the server. This includes, for example, processing the request, maybe accessing the database, and finally selecting and generating the response. Then there’s the response time, which is the time from when the client sends the composed request until the response is received or downloaded by the client. If these values are high, it could indicate a high server load from database queries, maybe some bloated web pages, or maybe some memory leaks. It could also indicate some caching issues, or maybe a lack of caching. Caching is useful for delivering content faster to customers, and CDN providers help with this. So if your response times are really high, maybe you want to talk to your CDN provider and see if there are some issues in your CDN. Or maybe it’s time to think about getting a CDN. So there could be some CDN issues as well.

Application server metrics like throughput could also be useful. Again, it’s a potential lagging indicator that helps you determine the simultaneous requests, or transactions per second, your application can handle. The TPS can correlate with response times if you are sending your requests sequentially or in consecutive cycles. In such a case, longer response times equal lower transactions per second, meaning your system is not able to handle as many transactions. This correlation doesn’t necessarily hold, though. If the requests are not sequential, or if they’re all happening simultaneously or in parallel, then you won’t be able to establish this correlation as easily. This example on the right over here shows you how the TPS clearly breaks when the system’s capacity is reached. One thing to remember is that a lower TPS doesn’t necessarily mean there’s some problem with the application server. It really depends on how it compares with the baseline, which is why it’s an art. Maybe there’s some caching going on, which means the server is automatically doing less work, which is a good thing. In the example over here, it’s clear that’s not the case, because you’re seeing the TPS build up to a certain point and then drastically drop.
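The relationship between response time and TPS for sequential requests can be illustrated with a few lines of arithmetic. This sketch assumes each virtual user sends its next request as soon as the previous one returns, with no think time; the function name `sequential_tps` and the numbers are hypothetical.

```python
def sequential_tps(virtual_users, avg_response_seconds):
    """Approximate TPS when each user sends requests back to back.

    Assumes no think time between requests, so each virtual user
    completes 1 / avg_response_seconds transactions per second.
    """
    if avg_response_seconds <= 0:
        raise ValueError("avg_response_seconds must be positive")
    return virtual_users / avg_response_seconds


# Same 50 virtual users, progressively slower responses.
for response_time in (0.5, 1.0, 2.0):
    print(f"{response_time:.1f}s response -> {sequential_tps(50, response_time):.0f} TPS")
```

Under these assumptions, doubling the response time halves the throughput. Once requests run in parallel or include think time, this simple formula no longer applies, which matches the caveat in the talk.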

[00:13:15] Error rates are also useful and give you a quick sense of whether there are any 400s or 500s, or unnecessary 300s, happening for some of these requests. I typically think that there’s never been a case where there’s not some percentage of errors, especially under strenuous amounts of load. But again, you have to figure out as a company or as a team what threshold of risk you want to allow for in your responses. The error rate tells you how many failed requests are occurring at a particular point in time of your load test. There are some health metrics that could be useful as well. Maybe at some point you’ll realize that, given how much traffic is coming in, you want to increase the physical memory available to process some of these requests. Maybe there’s some processor usage optimization that can happen, or you can look at disk time, which is the amount of time a disk is busy executing read or write requests. So there could be some health metrics that are useful for understanding what’s causing the performance degradation.
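As a rough sketch of the error-rate check, the snippet below groups HTTP status codes by class and compares the share of 4xx and 5xx responses against a team-chosen threshold. The function name `error_summary`, the sample data, and the 1 percent threshold are assumptions for illustration only.

```python
from collections import Counter


def error_summary(status_codes, max_error_rate=0.01):
    """Summarize HTTP status classes and check the error-rate threshold."""
    # Integer division by 100 maps 404 -> 4, 503 -> 5, and so on.
    counts = Counter(code // 100 for code in status_codes)
    total = len(status_codes)
    errors = counts[4] + counts[5]
    error_rate = errors / total if total else 0.0
    return {
        "total": total,
        "redirects_3xx": counts[3],
        "client_errors_4xx": counts[4],
        "server_errors_5xx": counts[5],
        "error_rate": error_rate,
        "within_threshold": error_rate <= max_error_rate,
    }


# Hypothetical status codes collected during one interval of a load test.
sample = [200] * 95 + [302] * 2 + [404] + [500, 503]
print(error_summary(sample))
```

Redirects are reported separately rather than counted as errors, since the talk only flags unnecessary 3xx responses as a concern.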

[00:14:25] Once the response is sent across the wire, we can look at some of the DOM processing and page rendering events and metrics as well. The DOM load time is useful, and again, this is something that the browser automatically collects, and many load testing tools help identify and capture it. This is the total time it takes to load and construct the DOM. The DOM is considered complete when a specific event fires, the DOMContentLoaded event; that’s when the browser has finished constructing the DOM. But it’s important to remember that this doesn’t mean the CSS and any JavaScript events have finished. They may still be processing at this time. It’s only when the onload event has completed that you know the HTML is fully parsed, processed, and rendered, including any scripts and stylesheets that are part of the page. If these values are high, it could indicate some heavy client-side assets. Maybe your CSS is kind of bloated. Maybe you have some big images, or maybe you’re using an image repository or other file asset repository, or have dependencies on third-party services, that are taking a while to process those requests. So these can be useful identifiers for figuring out what the actual issue is. Now, of course, if you want to go into full detail on some of the other client-side metrics, there are the W3C resources over here, from the official World Wide Web Consortium. This is where you can take a deep dive into the navigation timings themselves.
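The metrics discussed so far map onto attributes of the browser’s W3C Navigation Timing entry. The sketch below assumes such an entry has already been exported from the browser as a dictionary of millisecond timestamps; the function name `summarize_navigation_timing` and the sample values are hypothetical, while the keys follow the attribute names in the Navigation Timing specification.

```python
def summarize_navigation_timing(entry):
    """Derive the metrics from the talk from one navigation timing entry.

    `entry` maps Navigation Timing attribute names to millisecond
    timestamps measured from the start of the navigation.
    """
    return {
        "dns_ms": entry["domainLookupEnd"] - entry["domainLookupStart"],
        "connect_ms": entry["connectEnd"] - entry["connectStart"],
        # Time waiting for the first byte after the request was sent.
        "ttfb_ms": entry["responseStart"] - entry["requestStart"],
        "response_ms": entry["responseEnd"] - entry["requestStart"],
        # DOM constructed (DOMContentLoaded handlers finished).
        "dom_load_ms": entry["domContentLoadedEventEnd"] - entry["startTime"],
        # Page fully loaded, including stylesheets, scripts, and images.
        "page_load_ms": entry["loadEventEnd"] - entry["startTime"],
    }


# Made-up timestamps for a single page load.
sample_entry = {
    "startTime": 0, "domainLookupStart": 5, "domainLookupEnd": 25,
    "connectStart": 25, "connectEnd": 60, "requestStart": 62,
    "responseStart": 240, "responseEnd": 310,
    "domContentLoadedEventEnd": 980, "loadEventEnd": 1650,
}
print(summarize_navigation_timing(sample_entry))
```

In a browser, an entry like this can be read with `performance.getEntriesByType("navigation")`; how it gets exported to Python depends on the tooling and is left out here.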

[00:16:14] Now we can look at some real-world examples. I’ve taken a few examples from some of our customers who are using SmartBear’s load testing tools like (unintelligible). This is one case. The chart that you’re seeing over here tells you, for a specific step, which is a collection of actions on a page, or it could be the page itself, how long it takes for different browsers to execute those actions. We ran a one hundred and fifty virtual user load test for this example. You can see that as more and more virtual users come in, at certain points the time it takes for the browser to go through all those actions spikes. And remember, these are all real browsers going through those actions. You can see the spikes at the thousand-second mark or the fifteen-hundred-second mark. Now, when we look at the navigation timings, we’re able to identify some correlation. As soon as those spikes occur, we’re also seeing that the response times are really high. When we dug in a little more for this specific page and looked at the application’s code, what we found out was that those functions were inefficient (unintelligible). And if you recall, a high response time could mean that there are some poor database queries, for example. So this was an example where such a thing really did happen for our customer.
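One simple way to check whether spikes in step duration line up with spikes in response time is a correlation coefficient over the two time series. The sketch below is an assumption about how such a check might be done, not a description of what the load testing tool does; the function name `pearson` and the sample series are made up.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two series of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical samples taken at the same points in time during a load test.
step_duration_s = [5, 6, 5, 14, 6, 5, 15, 6]
response_time_ms = [200, 220, 210, 900, 230, 205, 950, 215]

print(round(pearson(step_duration_s, response_time_ms), 3))
```

A value close to 1 suggests the two metrics spike together. It does not prove that slow responses cause the slow steps, so the code-level investigation described in the talk is still needed.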

[00:17:54] Another example is this case over here, for a different page. What we did was look at a specific step, or specific page, that contains these actions. And what we saw was, again, that with a low number of virtual users, less than about 50 to 70, the system was behaving as expected. It was taking just about five to seven seconds to execute all the actions on that page. We were just meeting the SLA. But as soon as we went a little higher, when we reached that hundred and fifty VU mark, the system started taking a lot more time to go from the first action or the first event all the way to the final action. So, again, what we’re doing here is looking at the end-user experience, which is what this chart shows us, and then seeing how that translates into identifying the issue that’s causing this. And right over here, we noticed that as more virtual users came in, there was a high DOM load time, and the time to first byte was starting to peak, getting higher and higher as the system was under load. We dug in a little more using the network visualizer within the load testing tool. When the system was behaving smoothly, with less than 10 virtual users, and it was meeting the SLA, you can see over here how the different server resources on the DOM, which is constructed by the browser, were behaving. You can see the response times and what that waterfall looks like. However, when the system starts to have more virtual users, these metrics started to fluctuate and take a long time to process. There were many services and resources that took way longer to process under load, which helped us identify these services, try to fix them, and optimize the response times for some of them as well.

So this is how you can work your way from the end-user experience, which is what the script durations chart is showing you, towards identifying what could be causing the issue. And ultimately, it’s all about using the right tool to gather these metrics. Really, the right tool depends on what you’re trying to do. There are protocol-level load testing tools to gather the metrics we shared, and there are browser-based load testing tools. The browser-based load testing tools are great if you have a lot of client-side logic, like React or Angular applications, for example, where there are heavy client-side asynchronous calls happening. In the browser-based approach, there’s very little programming, because all you’re doing is driving the browser in order to capture those actions. As a consequence, you’re also generating really accurate load on your systems, because it’s really the truest form of load available. And it helps you capture some complex user flows as well. It’s great for teams who are more agile and who have a lot of developers or functional testers doing a lot of load testing, or maybe there’s a very agile performance testing team that’s keeping up with a team delivering a lot of new applications and updates to their applications.

[00:21:20] So some examples: we have the open-source Flood Element. There’s SmartBear’s LoadNinja, which helps you do this at scale without having to maintain the servers. There’s, of course, also TruClient, from Micro Focus, which allows you to do this as well. Then there are the protocol-based load testing tools. These are great for applications that have less client-side logic and are more traditional. They’re useful for testing teams that are more methodical and centralized, because even though it takes a lot longer to create the tests, if the application being developed is not necessarily being updated every week, for example, then these tests are easier to maintain. And they’re great for capturing request-response, traffic-level metrics. Some examples of these tools are, obviously, LoadRunner and JMeter, the popular open-source testing tool. So, yes, this is the end of the talk. I hope you found it useful. This, again, as I mentioned, is in no way exhaustive of the different metrics you could use. Really, performance testing is an art, and everyone in this community is an artist. I would love to hear from you if you have any feedback or additional questions, or if you just have a different way of doing performance testing or understanding metrics. Please feel free to reach out to me.

Connect with Keshav Vasudevan

Company: Aptible
LinkedIn: keshavv

Rate and Review TestGuild Performance Podcast

Thanks again for listening to the show. If it has helped you in any way, shape or form, please share it using the social media buttons you see on the page. Additionally, reviews for the podcast on iTunes are extremely helpful and greatly appreciated! They do matter in the rankings of the show and I read each and every one of them.