Performance Engineering for Beginners with Scott Moore

1 December 2020, 10:59 AM

By Test Guild

About this Episode

Want to learn more about effective performance engineering? In this episode, Scott will break down each phase of a successful performance engineering lifecycle. Discover how to start moving from just a performance tester to thinking like an engineer and getting involved earlier in the lifecycle. Scott will also discuss what the digital experience means from a performance tester's perspective. Listen up!

TestGuild Performance Exclusive Sponsor

SmartBear is dedicated to helping you release great software, faster, so they made two great tools. Automate your UI performance testing with LoadNinja and ensure your API performance with LoadUI Pro. Try them both today.

About Scott Moore

Scott is a Perfvangelist, Host of the Performance Tour (perftour.us) with over 27 years of IT experience with various platforms and technologies. Scott has worked with some of the largest applications and infrastructures in the world. Scott founded LoadTester Incorporated in 2004 and co-founded Northway Solutions Group in 2010. In 2015, Scott Moore Consulting LLC was set up to provide consulting services around Performance Testing, Performance Engineering, and Performance Monitoring. He is an active thought leader in the performance engineering space.

Full Transcript Scott Moore

Joe [00:03:46] All right, Scott, let's start off with what is performance engineering? I'm hearing a lot of performance testers thinking they're going to lose their jobs. Is performance testing part of performance engineering? What are the different pieces of performance engineering and how does it affect someone who is currently in a performance tester role?

Scott [00:04:01] So if you're hearing more and more about how performance engineering is going to take over performance testing, don't worry. It's not. Performance testing is just one discipline of performance engineering. And as you can see from these bubbles here, there's a lot to this. And probably this is not going to be known by one single person in an organization. It's probably going to be spread out among many roles, many people. And this could take an entire career to learn how to do this stuff. So don't down yourself just because you might be, quote-unquote, just a performance tester. There is a lot to that, especially if you're going to do it right. Now that would fall into the bubble labeled quality assurance. But you'll notice that at the beginning, at the top noon position, we talk about business and contracts. How many companies do you know that are able to get a guaranteed service level agreement for response time from any third-party vendor that's building software for them? I don't know of any, mainly because the vendor could never guarantee a response time because the customer's implementation is probably unique. There's probably some customization, but will we ever get there? I don't know. This is something that has a unique skill set associated with contracts and service level agreements, and that's a negotiation thing. That's probably before any code is even written. As we go clockwise here and move into the developer side of things, what are things that the developers should know as they build performant code and what are things that we as performance engineers can check to make sure they actually are doing that? We're going to talk about some of that next. So as we go around this circle here, just note that performance covers the entire life cycle of a project and a product, I would say from the beginning to death of a project. 
And as it rolls out into legacy, it should always be part of everything that you're thinking about. The days of testing performance into a product are over. It's too risky and it causes too much technical debt to go back and do it later.

Joe [00:06:07] So I just love that quote that Scott just gave, that the days of testing performance into an application are over. Being an older tester, I remember those days and it was a nightmare. And I really think as we shift left, those days really are over as we get more agile and more and more companies make the digital transformation to include all types of testing activities earlier and earlier in the software development lifecycle. So in the next section, Scott went over the importance of performance baselines and some reporting he uses to get some insight into an application before even diving into a quote-unquote performance load stress test.

Scott [00:06:46] I used the website here called WebPage Test to create this graphic, but we don't have to actually use this and there are others. But it was a great illustration of looking at a single web page. And again, we're talking about web-based applications mainly. But the output from this report, if you look across the top, it shows some pretty important timings. How long did it take the entire web page to load? How long did it take before it was actually usable? How long did it take for things to paint? These are good metrics that we can use just from opening up a single web page on a single browser. If these timings are beyond what the service level objective would be for your app, it's only going to get worse under load. At worst, it will fall over before it ever reaches the intended number of virtual users. But here, do we really need to run a load test if this timing was ten, fifteen, twenty seconds instead of three seconds? If we needed to dive in deeper, we can look at the waterfall view that you see there on the left-hand side. This is all the elements of a web page and how long it takes each one of them to come back across the network and render as well as how long it actually takes before that actual element was called for. That resource sometimes is coming from third-party locations and that can actually hold up your web page. So the developers might need to know how to use techniques to avoid that, to make it look like the page is coming back a lot sooner and rendering sooner even though those third-party requests may have to wait until the very end before they're retrieved. On the right-hand side, you see a screenshot and this is my actual web page, scottmoore.consulting. And at the bottom, you have these pie charts. It's something that I always like to illustrate when I show this graphic. Think of your web page like a pie. 
If you have too many ingredients and you try to put too much into the pie, it's going to run over and become more of a hot mess than a real pie, something that's good to eat, right? That's the same way we should think about the full web page. There's only so much you can put in there. Now, Google has something called the RAIL system. And what it boils down to is any fully rendered page should take less than five seconds, typically. It may be less for certain applications that may be more mission-critical. And for each individual service or call, it should take less than one hundred milliseconds or there should be a good reason why it does take longer than that. Now you can do the math and figure out how many services can make up a web page that takes five seconds before that's breached or how many elements that would be. The key is to put together a performance budget of how many requests and things you can put into a web page or an application before it becomes a hot mess. And you can do this by answering the question, how does it look for a single user?
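
The math Scott mentions can be sketched in a few lines of Python. This is only a rough illustration: it assumes the calls run strictly one after another and that each one uses its full one-hundred-millisecond allowance, which real browsers (fetching in parallel) will not match exactly. The function and constant names are made up for this example.

```python
def max_sequential_calls(page_budget_ms: float, per_call_ms: float) -> int:
    """Number of back-to-back calls that fit in the page budget,
    assuming every call uses its full per-call allowance."""
    return int(page_budget_ms // per_call_ms)


# Figures quoted in the talk; the variable names are illustrative.
PAGE_BUDGET_MS = 5000  # fully rendered page: under five seconds
PER_CALL_MS = 100      # each individual service or call: under 100 ms

calls = max_sequential_calls(PAGE_BUDGET_MS, PER_CALL_MS)
print(f"Sequential calls that fit in the budget: {calls}")  # prints 50
```

Under those assumptions, roughly fifty sequential calls is the ceiling before the five-second figure is breached, which gives a starting number for a performance budget.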

Joe [00:09:54] So another awesome tip by Scott that I love is the performance budget. Every application should have it for one user before you start ramping up your performance effort. So that's a great tip. So you might be asking, are there any other tools you can use to help you gather these statistics? And Scott then goes over some tools that he recommends or he has it listed under dev tools.

Scott [00:10:17] And you don't have to use WebPage Test to do this. You can actually use your browser and any browser should have a developer tools feature. This is Google Chrome. And typically Google's a little bit ahead of the curve on the type of information they display in dev tools. But that's beside the point. Whatever browser you have will work. And this is just a quick dev tools rendering of the performance tab where I refreshed this page and recorded it and it's showing that similar type of waterfall here in the center. And then they have some other graphics at the bottom that you can actually start learning how long it took before these calls were made, how many milliseconds it took, and what order they were called. Notice that there's this big space in the middle between about twenty-three hundred, twenty-four hundred milliseconds to thirty-five hundred, thirty-six hundred milliseconds where some external calls were made way later. And it makes it look like this web page took almost four seconds to render. Those were actually third-party requests and they caused this timing to actually be a lot longer than it was. But because my web page takes advantage of holding off for those third-party requests until the other local stuff is rendered, it actually doesn't feel this long when you look at the page. So let's actually do that right now. Let's take a look at my own website, scottmoore.consulting. I'm in Google Chrome. I'm going to go into more tools, developer tools, and you'll notice that we have multiple tabs here, elements, console, sources. There are three tabs I want to share with you. And these are things that you can do yourself or work with your development staff and make sure they're doing these things as soon as they have code that's renderable in a web application. 
The first thing you can do is just refresh this page while the network tab is recording and you'll see a waterfall view being created for you, and this will show you how long it's taking each one of these requests, the protocol used, the status code, the amount of time and the size. And notice at the bottom, we have these totals here: how many requests, how much data was transferred across the network and how long it's taking that web page to load. The second tab I want to show you is the performance tab. Let's reload the page again with no caching and let the performance tab do its job. It's going to profile that web page and then it's going to give you another view similar to the waterfall, as well as additional types of graphics that you can research and learn how to use. This is kind of out of the scope of this, but you can see that I get the initial elements, then I load that background again. And if I've got a big background image, that's going to take a lot longer to load this screen. But you can see certain elements that have to be reloaded, especially if there's new content there. So the third one I want to show you is the audit. And this one is great because this Lighthouse report, as Google calls it, can give you all kinds of information based on these categories. For now, we want to uncheck everything and we want to use desktop because that's what I'm using. But you could also choose mobile if you wanted to get sort of a worst-case under a dirty network condition. Mobile would be a great thing to do, especially if that's where your traffic is coming from. So let's go ahead and generate this Lighthouse report. This is going to give me a score as well as some things that I'm doing wrong or right. Sometimes it can take up to 15, 20 seconds. The score I get is 27 out of 100. And that's not very good, to be honest with you, but I know the reason why this is. 
So if you do get scores from things like this or, we're going to look at GTmetrix next, just remember, there may be good reasons why your developers are doing what they're doing. As long as they know why they're doing it, the score may not matter, but some of the statistics would, and this is how long it's taking for these metrics to occur. And these are the W3C metrics that come right out of the browser. And this is the same type of metrics you would see from WebPage Test and GTmetrix. So it's pulling out of the APIs that exist inside the browser itself. So you can see that there are items that it recommends that I pay attention to: deferring off-screen images, eliminating render-blocking resources. And these are things that I can have a conversation with my development crew about and address these. There's another thing that I could do. I could take this particular website and put it into GTmetrix or WebPage Test. Let's do that now. GTmetrix will also give you scores from YSlow and others. It will also give you waterfall views. And this is actually coming from Vancouver, Canada, using the Chrome desktop browser. And then it generates a report. I notice I'm getting pretty low scores, I would say. However, my time, 2.3 seconds, is not that bad and the total page size is pretty good. So if I look down, it's really low-hanging fruit that's causing these low scores. There are some images out there, some PNG images that are uncompressed and aren't scaled down for the size that is on the page. In other words, they're much larger than what's being displayed on the page. And that's more information that needs to be pulled down. So just by correcting those sizes, I could save ninety-three percent off the resolution or a reduction in size. And if I just solved that, it would probably bring me up to a B plus or an A score. While this is very subjective, the page details are not. So this is what I really like to pay attention to. 
How long is it taking for that page to render under those conditions (Vancouver, Canada, Chrome, desktop)? And how large is the size and how many requests? This is part of creating that budget. If you say we're only going to have a web page that's less than two megs in size total and we're only going to have less than 80 requests, once you breach those, you know that you need to do something about it, to adjust it. And then all of these other things that are suggested that you do just really become that low-hanging fruit. You might see things and hear about taking advantage of the content delivery network, but you might not be using the content delivery network because you're using an internal proxy for that. There are some things, like Apache, that you can use instead of a CDN and they accomplish the same thing. So again, there may be reasons, good reasons why your score isn't a hundred percent. That's not the goal. The goal is this. Can you reach a performance level that meets the service level objective of that page? So these are things that a performance engineer would think about and do. And if the developers aren't doing this, the performance engineer should be doing it or at least training them on how to do it. As soon as this web page is available in a development environment, I have access to these browser tools and if it's publicly available and it's already in production, I can see what this is on GTmetrix and WebPage Test. And I can also look at my competitors as well and see how they rank in this. This is very helpful information and this gets you started thinking like an engineer. I haven't even thought about running this under load at this point. But I do know that if this timing were over 10, 12 seconds and my service level objective was five seconds, we really should stop right here. 
If this is learned by the developers, that it should be under five seconds or whatever number you've set, and they stick to this budget, it'll never make it to the performance engineers until it meets this, which saves all of that time and effort by the testers doing their piece just to verify this very same thing. But if we get to this and the developers are happy pushing forward in the test, then the performance testers can just validate this and say, Okay, now we can push this under load and whatever that amount would be, let's say we want a thousand users to be able to use this website concurrently doing these business processes. That's when that verification becomes really valuable. But if you're not going to start looking earlier in the cycle here, then you're sort of wasting a little bit of time here. It's not going to help you get out there any faster. In fact, it's going to make you have to go back to the developers when the testers find this and fix the problem later. So you're actually creating technical debt at that stage.
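
A single-user budget check like the one Scott describes could be sketched as follows. The budget numbers (two megs, 80 requests, a five-second service level objective) and the 2.3-second load time come from the talk; the page size and request count in the example are hypothetical, as are all class and function names.

```python
from dataclasses import dataclass


@dataclass
class PageStats:
    load_time_s: float   # time for the page to fully load
    total_bytes: int     # total page weight
    request_count: int   # number of requests on the page


@dataclass
class Budget:
    max_load_time_s: float = 5.0            # service level objective
    max_total_bytes: int = 2 * 1024 * 1024  # "less than two megs"
    max_requests: int = 80                  # "less than 80 requests"


def budget_breaches(stats: PageStats, budget: Budget) -> list[str]:
    """Return a message per breached limit. The budget is phrased as
    'less than', so reaching a limit counts as a breach."""
    breaches = []
    if stats.load_time_s >= budget.max_load_time_s:
        breaches.append(f"load time {stats.load_time_s}s")
    if stats.total_bytes >= budget.max_total_bytes:
        breaches.append(f"page size {stats.total_bytes} bytes")
    if stats.request_count >= budget.max_requests:
        breaches.append(f"{stats.request_count} requests")
    return breaches


# 2.3 s is the load time quoted above; size and request count are made up.
page = PageStats(load_time_s=2.3, total_bytes=1_200_000, request_count=45)
problems = budget_breaches(page, Budget())
print("Ready for load testing" if not problems else f"Fix first: {problems}")
```

An empty result means the page is within budget for one user and is a reasonable candidate for a load test; anything else goes back to the developers first.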

Joe [00:19:24] Alright, hopefully, you are able to follow along. I know a lot of this is based on visuals that Scott was showing on his computer at that time of his presentation. So a great way to actually see this once again is to take advantage of that discounted rate by heading over to perfguild.com and getting one hundred dollars off your ticket by entering in the code 100 guild coin and you'll be able to see Scott's and all the other recorded sessions for the 2020 PERF Guild event. So next Scott went over performance testing and his views on performance testing and how that relates to performance engineering.

Scott [00:20:02] Now, this is the actual process that I use when I'm doing performance testing for a client still today and I've used this for several years. It's iterative. So it works for Agile and Waterfall. It really just depends on the amount of scope that you're looking at as you're doing these actual things, and it does have deliverables associated with it. I'm going to show you a few of my templates. Now, these are just to get you started. These are not written in stone. It's just to give you some ideas. And there's no reason to try to ask for copies of this because I want you to create your own templates. It's what's your best practice for you. So let's look at this. In general, we have four main stages. We have the planning phase, which is gathering all the information that I need before I start doing any kind of scripting or testing. I want to know everything about that stack. Is this a Java-based application? .NET? Is it using React or some other kind of JavaScript library framework? What kind of network is it on? Is it cloud-native? Is it on AWS, Google Cloud? Is it a mixture and a hybrid between on-premise and the cloud? Is it a SaaS-based product? How much visibility do you have into the back end of the infrastructure? Can you actually monitor the back-end servers in a performance testing tool or do you have to use third parties? I mean, all of these questions arise. How about the data? Is it the standard SQL Server or Oracle database or is it a data store like Elasticsearch or Big Data, Apache Spark? Something like that. We need to know what we're looking at first and then we need to ask those non-technical questions like who are the users? What are they doing and how often are they doing it? And try to pinpoint the exact business processes that the end-users are doing that have the highest risk, that load the backend the most and would have the biggest performance risk financially or in other aspects. 
Now, I can't tell you the number of times where I've been to a client and I begin to look at how they're doing performance testing today and their business processes don't even match what the end-users are actually doing. And just by calling a customer of theirs and sitting with them for maybe an hour or two, taking print screens, seeing what they do and trying to match that to the business processes, I found many times they never matched. And this was the reason why the company was struggling, trying to figure out why are we doing testing but we're not seeing why these bottlenecks are actually existing in production. Where's the disconnect? That's sort of where it begins. And all of this begins in the planning phase. When you get into the building phase, that's the actual automation piece. This is where you're scripting and you're using tools like LoadRunner, JMeter, NeoLoad and you're building out these automated processes and you're data-driving them. You're correlating data back to account for all this dynamic data that's being passed back and forth. Many times performance testers get kind of stuck in this, their own little bubble, their own little world, and how wonderful the geeky coding can be. Maybe they're using LoadRunner and using C#. Maybe they're not putting a lot of code in there. And they're using something like NeoLoad to record these scripts. They're still having to configure all of this and make sure that they run. The key here is making sure that when you play back each business process, it does match what the client's doing, and that the volumes that you've chosen and the data that you are using are enough for your test. All these things have to do with creating a real-life scenario as close to the real world as possible. And that takes a lot of questioning, a lot of investigation. But once you get that working, you will get amazing results. You just have to put the effort in on the front end. The execution phase that you see there is the actual testing. 
Now, I have a specific methodology that I use where I start with a baseline of a single user, a few users, about twenty percent of what I think I'm going to need, and then ramp up to one hundred percent. My tests are usually about one hour, not including ramp up and ramp down. And if I can ramp up to the full amount, let's say that's a thousand users, and I reach that amount, I will hold that for at least an hour while I'm monitoring the infrastructure resources. And then I'll ramp down either fast, if I don't care about looking at memory release, or, if I'm running something like a soak test and watching for slow memory release, very slowly. In either case, I'm probably going to chop off the ramp up and ramp down and see how this application behaves under these peak conditions. And if I can meet the peak hour, I know I can meet the other twenty-three hours that are in the day. So that's how I do my test. The final phase here is the confirmation phase and this is where you're actually looking at the analysis and the results and you're providing that back to the business. Now, I realize you sometimes have two different audiences. So you might prepare technical results. Maybe they come out of the performance testing tool and you have to sort of translate those for the technical people involved and say, Okay, here's the transaction response time, and here's how we're doing in terms of memory, CPU, network, storage. And here are some other metrics that we're also gathering from the code itself possibly. Let's say we were using detailed diagnostic tools like Dynatrace or Diagnostics from HP/Micro Focus. That type of information would all be correlated together. And your job as the performance tester would be to make the truth rise to the top by taking out what you don't need to see in the analysis, leaving only what needs to be seen as this is the unadulterated story that we've gotten from this test. 
And that final report may be a different report, a summary, maybe a short Word document for the non-technical people, maybe the upper management. And this may be a very quick thumbs up, thumbs down type of thing where they can easily make a business decision. Should we move forward? Are there showstoppers? And I'm going to show you some of the templates that I've used for the planning phase, as well as the final test reports. And I've got two different types here to show you.
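
The execution approach Scott outlines (baseline, a few users, about twenty percent, then one hundred percent, with ramp up and ramp down chopped off for analysis) could look something like this in Python. It is a sketch, not his actual tooling: the default of five for "a few users", the sample format and the function names are all assumptions for illustration.

```python
def ramp_stages(target_users: int, few_users: int = 5) -> list[int]:
    """User counts for each stage: single-user baseline, a few users,
    about 20% of the target, then the full target.
    few_users=5 is an assumed value for 'a few users'."""
    twenty_percent = max(1, round(target_users * 0.20))
    return [1, few_users, twenty_percent, target_users]


def steady_state(samples, ramp_up_s, hold_s):
    """Keep only samples taken while the full load was being held,
    dropping the ramp-up and ramp-down periods.
    Each sample is (elapsed_seconds, response_time_seconds)."""
    end = ramp_up_s + hold_s
    return [(t, rt) for t, rt in samples if ramp_up_s <= t < end]


print(ramp_stages(1000))  # prints [1, 5, 200, 1000]

# Hypothetical samples: 15-minute ramp up, then a one-hour hold.
samples = [(100, 0.5), (1000, 1.2), (3000, 1.4), (4600, 0.6)]
print(steady_state(samples, ramp_up_s=900, hold_s=3600))
```

Only the samples from the held peak hour remain, which is the window the analysis and final report would be based on.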

Joe [00:26:35] Alright. Really great stuff there. People always ask me what should be in a performance report. Scott then shows his screen and goes over some reports and the sections that he uses all the time for his performance engagements. It didn't translate well to audio, so I left it out and now I'm going to just go right into his section on continuous performance.

Scott [00:26:53] So if you're doing continuous testing, continuous integration, continuous production implementations, you're going to need to be in a pipeline and think about performance at the same time. So when you're using Bamboo and Jenkins to push through a set of unit tests and maybe there's a set of functional tests as well, there needs to be a performance test as well. Now, in this case, I'm using NeoLoad as an example instead of LoadRunner. I'm also using Elastic Stack as a monitoring solution. So I'm just trying to switch it around so I'm not favoring any particular tool. But in the CI pipeline, you may be using Bamboo or Jenkins. You could use a number of tools, JMeter, Gatling, NeoLoad, and you want to run a series of performance tests every time that you run the other types of tests. You probably want to throw some security penetration testing in there as well. Also, use that same tool in the test environment to get more of an integrated view instead of just unit things. Individual services, for example, would be on that CI dev pipeline, whereas in the testing and staging areas you'd want to get fully rendered pages that might be made up of multiple services. And in production, you also want to be monitoring those things and probably push monitoring into the test environment and the development environment as well, so that all these things that can be monitored are monitored, not over-monitored, but monitored. And in this case, what I had put together was a situation where we were running unit-level performance testing in dev, we were running an integrated test in the testing and staging environment, and in production, we were running a combination of a NeoLoad script, one of the same ones that we used in test, with a Selenium user to grab the fully rendered JavaScript-heavy front end. And then we were using Google Stackdriver to monitor the cloud machines because it was completely cloud-based. All of this information could be put into the Elastic Stack. 
So Elasticsearch has multiple components now, including an APM component. Through its logging, it can also pull in Stackdriver information. So by being able to correlate all of this data together and create these additional graphs, you can monitor performance through the entire pipeline. It's possible. It can be done. It is a lot of work and many people are trying to solve this problem as we speak. But this is an example of how it can be done.
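
One way a unit-level performance check could act as a pass/fail step in a CI pipeline is sketched below. It is tool-agnostic Python and does not reflect how NeoLoad, JMeter, Bamboo or Jenkins are configured; the 95th-percentile rule, the nearest-rank calculation, the sample timings and the function names are assumptions, and the 100 ms threshold is borrowed from the per-service figure mentioned earlier.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list of numbers."""
    if not values:
        raise ValueError("no response times collected")
    ordered = sorted(values)
    rank = math.ceil(pct / 100 * len(ordered))
    return ordered[max(rank, 1) - 1]


def performance_gate(response_times_ms, threshold_ms=100, pct=95):
    """Return 0 (pass) if the chosen percentile is within the threshold,
    otherwise 1 (fail), mirroring a process exit code."""
    observed = percentile(response_times_ms, pct)
    return 0 if observed <= threshold_ms else 1


# Hypothetical timings for one service call, in milliseconds.
timings = [62, 70, 75, 81, 88, 93, 97, 140]
print("gate result:", performance_gate(timings))  # prints gate result: 1
```

A pipeline step could pass the returned value to `sys.exit` so that a slow service fails the build the same way a failing unit test would.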

Joe [00:29:33] And Scott let's just wrap it up with the term I've been hearing more and more about, and that is digital experience.

Scott [00:29:40] According to Gartner and many lead analysts out there, digital experience is made up of what I would call a three-layer cake. At the top of that cake, we have what's called the end-user experience, meaning this is what this user would see if they opened up the browser, that application, and used it and had a stopwatch in their hand. They don't care about all the technology that goes behind that, what kind of network you're on, if it's mobile or desktop. They just care how long is it taking me to use this application. There are two ways to gather that information. On the left, we have synthetic timings. This is exactly how GTmetrix works. For example, we put our website in there. It ran a synthetic script that pulled that page down and analyzed it, but we initiated that. Think of that as running every 15 minutes in a 24-hour day and looking for changes and variations and also pulling it from different locations around the web. On the right, we have real user information. So RUM, real user monitoring, is another way to grab data as users are actually using the browser and it's pulling those same W3C metrics we talked about earlier. And it's combining that information with those synthetic timings to try to get a complete picture of what's actually happening. How close can I get to the last mile of where that user actually lives? Now, that's probably the most important metric. But if you don't have the additional information behind that in these lower levels of the cake, it's going to be difficult to figure out exactly what a problem might be, what the root cause might be. So we have this second layer that Gartner calls application discovery, tracing, and diagnostics. And that's a big mouthful. So let's just break it down. 
It's all the technology that runs under the covers from the .NET Core, PHP, the front-end JavaScript libraries, the database, and anything that might be running on Docker containers, Kubernetes, and the operating systems they might be running on. Everything down to the bare metal hardware. There are ways to monitor a lot of this stuff. And when you have a particular transaction timing, let's say a search, that is much higher than others, it would be very easy if you have this data available to drill down into the database layer, look at a specific metric and probably figure out in just seconds or minutes what the problem is going to be, especially if it's the DBA or a specialist looking at that problem. The last layer is a fairly new layer that people are beginning to talk about called AIOps. And in short, what that means is let the machines figure out what the machines are trying to say. All of that second layer when you combine all of that together, it's a lot of data. And what we have now are experts out there using their experience to try to drill down and find a problem manually. And it can be done. And depending on how good the person is, it can be done rather quickly. But instead of going that route, we can have automated ways and algorithms to look at all of this data and try to correlate it together and put together good possibilities of, hey, you're seeing this begin to breach this amount of this value per se. And then this is a problem. This is out of bounds. This is an anomaly. Let's isolate that and figure out what the problem might be based upon machine learning or what we know about this application. And so we're starting to see vendors like Akamas and others, the Davis AI that's coming out from Dynatrace. We're starting to see some of this starting to surface. 
And if it matures the way that it should, then taking all this data and making sense of it, if you can trust it, would shorten the time for triage situations and it would isolate performance problems, hopefully find them and help you fix them, even automatically fix them before customers are even aware there's a problem. It can also do things like figure out the best configuration for your AWS machine that you should choose, like choose this number of CPUs and this amount of memory combination to get the best performance for your value. So you're not overspending on your cloud bill, making sure you're not over-allocating resources to Docker containers so that you are right-sizing everything that you're doing. And you're not just giving the cloud vendors money that you shouldn't. So that's what the digital experience really all entails. It's being able to get all of this information and make sense of it. But the real important metric is that top layer of what the end-user is seeing, and end-user monitoring is typically what that's called. Very, very important.
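
The idea of flagging a timing that is "out of bounds" can be illustrated with a very simple statistical check on synthetic page-load timings. This is a deliberately naive stand-in, not how Dynatrace's Davis AI, Akamas or any AIOps product works: it assumes a fixed baseline window and a three-standard-deviation limit, and the timings and function name are hypothetical.

```python
from statistics import mean, stdev


def find_anomalies(timings_s, baseline_size, threshold_sigmas=3.0):
    """Flag timings after the baseline window that sit more than
    threshold_sigmas standard deviations above the baseline mean.
    Returns (index, timing) pairs."""
    if baseline_size < 2:
        raise ValueError("need at least two baseline samples")
    baseline = timings_s[:baseline_size]
    limit = mean(baseline) + threshold_sigmas * stdev(baseline)
    return [
        (i, t)
        for i, t in enumerate(timings_s)
        if i >= baseline_size and t > limit
    ]


# Hypothetical synthetic checks, one every 15 minutes, in seconds.
timings = [2.3, 2.4, 2.2, 2.3, 2.5, 2.3, 2.4, 6.8, 2.3]
print(find_anomalies(timings, baseline_size=6))  # prints [(7, 6.8)]
```

Real tooling correlates many signals across the lower layers of the stack; this only shows the basic shape of "learn what normal looks like, then isolate what is not".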

Joe [00:34:50] Alright Scott before we go, how about summarizing your session? And maybe tell us the best way to find or contact you.

Scott [00:34:56] So in summary, it's very easy to become a performance engineer. You just have to begin thinking in terms of the entire lifecycle. How can I improve performance anywhere and everywhere? How can it be improved? And can I improve the performance without ever running a test? But in performance testing, you can streamline it by using templates and you can create a repeatable process and you can use that slide that I shared with you as a beginning point to start your own process that is repeatable. And finally, remember that end-user experience is the king. If you start with the end-user's experience in mind and work your way backwards into the technical stuff that we all know and love, you're going to win every time because you always start with what the end-user is seeing, not in a particular tier, and you don't have those blinders on. Well, you know, I'm a network guy. My network is working fine. I'm a DBA. My database is working fine. As a performance engineer, you do look at all of that. But if all of those are fine, but the end-user experience is still poor, i.e. the response time is still high, then that's what you start with and you work your way backwards. So thank you so much for watching this presentation. I really appreciate it. I also want to mention that I am currently in the process of a performance tour. This is going to be going on for quite some time where I go to different cities, I talk to people and listen to the performance issues that they're having. We try to figure out, you know, have the conversation, try to figure out if we can solve these problems. You can find out more about this tour. It's actually an online show on YouTube and you can reach me through perftour.us. I'm on Twitter @loadtester and that's the LinkedIn URL to reach me. So I'd love to connect with you all and continue these conversations. So check that out. Let me know what you think. Give me feedback and I hope to hear from many of you. 
And I wish you the best and I hope you got something out of it.

Rate and Review TestGuild Performance Podcast

Thanks again for listening to the show. If it has helped you in any way, shape or form, please share it using the social media buttons you see on the page. Additionally, reviews for the podcast on iTunes are extremely helpful and greatly appreciated! They do matter in the rankings of the show and I read each and every one of them.