About this Episode
In this episode, James boils down what performance engineering is today, along with some of the core attributes that performance engineers look at in terms of performance.
TestGuild Performance Exclusive Sponsor
SmartBear is dedicated to helping you release great software, faster, so they made two great tools. Automate your UI performance testing with LoadNinja and ensure your API performance with LoadUI Pro. Try them both today.
About James Pulley
As one of the original and lasting LoadRunner experts, James Pulley's focus for the past decade has been assisting clients in finding answers to complex questions related to load, performance, response time, and scalability. James began his career at Microsoft, in the support arm for Microsoft's operating systems and database solutions, as an electronic support pioneer for Microsoft Windows on CompuServe. After Microsoft, James worked for Banyan Systems, a pioneer in the integrated directory services model for PC networks. From Banyan, James moved to Mercury Interactive for a two-year stint in Mercury's sales arm as a pre-sales systems engineer. In the last 18 years, James has been working independently as an executive, consultant, advisor, and evangelist for various companies big and small.
James is also one of the hosts on the performance testing podcast PerfBytes.
Connect with James Pulley
Full Transcript James Pulley
Intro[00:00:01]Welcome to the Test Guild Performance and Site Reliability podcast, where we all get together to learn more about performance testing with your host Joe Colantonio.
Joe Colantonio[00:00:16]Hey, it's Joe, and welcome to another episode of the Test Guild Performance and Site Reliability podcast. If you haven't heard, I actually turned 50 this month, so I try to take the whole month of April off. Because of that, I didn't have a lot of interviews lined up. So what I'm doing is I've gone to my back catalog of my Perf Guild and have taken the best of the best of sessions, chopping them up and making them more of an interview-style session. So this is a quasi-interview with James Pulley all about absolute performance. And let me have James explain what the session is really about first.
James Pulley[00:00:48] Welcome to the Perf Guild presentation on absolute performance, triple distilled for your engineering enjoyment. What we are going to do is try to boil down what performance engineering is today and some of the core attributes that we look at in terms of performance as performance engineers.
Joe Colantonio[00:01:11] Thanks, James! I'm really excited to jump into it. Before we do, though, if you don't know, I'm always looking for speakers and topics to cover in the performance and site reliability area. So if you have any ideas or suggestions for me, I highly recommend you go to TestGuild.com, click on the About and Contact menu option, and send me a quick email with your thoughts on what you'd like to hear in the future, or if you want to be a guest on an upcoming Perf Guild podcast, we'd love to hear from you. And also let me know if you're interested in having a Perf Guild session for 2021. I've done it every other year so far, so I did one in 2018 and one in 2020. I'm debating if I should do one in 2021 or wait till next year. So let me know your thoughts. We'd be interested in that as well. And so the two ways to do that, once again, are to go to TestGuild.com and click the About Contact menu option, or go to AskGuild.com, and it should redirect you directly to that contact form.
Joe Colantonio[00:02:08] This episode was brought to you by SmartBear. Listen, load testing is tough. Investing in the right tools to automate tests, identify bottlenecks, and resolve issues quickly could save your organization time and money. SmartBear offers a suite of performance tools like LoadNinja, which is a SaaS UI load testing tool, and LoadUI Pro, an API load testing tool to help teams get full visibility into UI and API performance so you can release and recover faster than ever. Give it a shot. It's free and easy to try, head on over to SmartBear.com/solutions/performancetesting to learn more.
Joe Colantonio[00:02:53] Hey, James, welcome to the Guild. Before we get into it, can you just tell us a little bit more about yourself?
James Pulley[00:02:58] My name is James Pulley. I'm a co-founder of PerfBytes with Mark Tomlinson. In my current role, I am a practice manager for TEK Systems Global Services Performance Engineering Group. In this group, we cover performance engineering practices all the way from the birth to the death of an application, and that goes all the way from requirements through development, traditional quality assurance practices, platform engineering, capacity planning, and simulation and modeling, so it's got quite a wide breadth associated with it. And I'm going to talk to the things that we look at across that entire lifecycle, believe it or not. A little bit more about myself: I moderate a number of forums related to software performance engineering practices and testing practices. You may have run across me in one or more of those on Facebook, Yahoo, Google Groups, LinkedIn, SQA Forums, or maybe a few others as well. And of course, I like barbecue. I think that kind of goes along with being a performance tester and performance engineer. There's something in our DNA that makes us gravitate towards barbecue. And for those of you who do not know my background, I had six years of formal art training. So I bring kind of an odd eye to performance testing and performance engineering because of that background.
James Pulley[00:04:33] So let's go ahead and get started. And I want to begin with something we're all too familiar with: time. And if we look, this is a pretty long measurement of time: an hour, twenty-three minutes, forty-five seconds, and six hundred and seventy-eight milliseconds. This is where a lot of performance testers stop in our industry today, and that's a problem. They simply measure time. They get a record of time, they get a report of time, they print out a nice graph of time, and then they hand it to somebody who looks at the graph and says, “Why? Why is it this long?” And this is where you get the break between a performance tester and a performance engineer. It's on this question of why. So let's look at what really drives response time and then how that shapes the questions we ask. The things that drive response time should be well known to a lot of people who are attending this conference, but it's worthwhile covering them. We begin with the CPU. It is a resource on the box or multiple boxes, and we are going to pair that with memory because this is also a component of our system architecture. Disk comes into play as well. And then there is our slowest resource, which is the network. Now, Scott Moore likes to call these the four horsemen. Why does he do that? Because it is how your application uses these four resources together that governs what that response time will be.
James Pulley[00:06:30] Under high loads, your application should have as light a touch on these resources as possible because other users coming in will also be contending for the use of these resources. So if we grab a really small amount and hold onto it for a short period of time, then we're going to have higher scalability in our system. So while Scott calls them the four horsemen, I actually refer to these four collectively as the finite resource pool, and no matter where you go in IT, there is a limit to one or more of these resources. Even if you go to the cloud and turn on auto-scaling, eventually you reach the limit of a host, where you can go no larger on a particular host. And if there's a truism or an axiom, it is that poor-performing code will expand to fill the available resource pool. So if you have a well-tuned app, you might be able to run a website on an Intel Atom processor or a Raspberry Pi and support a high user population. But if we don't, it's possible that the largest resource pool inside of Amazon will not even support a small number of users. So in addition to the finite resource pool, there are very specific questions we should ask of these resources, and we'll begin by taking a look at what happens when we allocate the resources. There is an optimum point for the allocation of a resource. I like to call it the point where a user gets interesting. And when I say interesting, if you're in an e-commerce context, this is the person that is beginning to convert. If you are in another business context, it is when you are beginning to go down the path of a particular business process, and as you get closer and closer to the end, we should preference you more and more in terms of resources. We shouldn't allocate everything right upfront.
And by God, we should not give everyone who comes in maximum resources, because if we do and we don't have at least 50 percent plus one of our users who are working to take advantage of the resource, we're going to have issues.
James Pulley[00:09:12] Many years ago, Mark Tomlinson and I were sitting in a room together at a company we affectionately refer to as Big Shoe, and he turned towards me and tapped me on the shoulder and said, “You know what? They give everybody a shopping cart.” And I just looked at him. I said, “That's too early.” And as it turned out, that was exactly the case. It was something that impacted their performance. It impacted their conversion rates. It impacted them on their most revenue-interesting events, spot sales. But it was a historical artifact in the system that just stayed in place because a marketing manager said, “Whenever somebody comes onto the site, we have to know how many items are in their shopping cart.” That means by default, everybody gets a shopping cart. Everybody grabs what's in their cart from cold storage. And we put a number up there. And that was too early an allocation, because on no e-commerce site does 50 percent plus one ever begin to convert.
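To make the shopping cart story concrete, here is a minimal sketch of deferring the allocation until a user gets interesting. It is not the Big Shoe code; the `Session` class, the `load_cart_from_cold_storage` helper, and the 3 percent conversion figure are all assumptions made up for illustration.

```python
class Session:
    """Hypothetical session object; all names here are illustrative."""

    def __init__(self, user_id, load_cart):
        self.user_id = user_id
        self._load_cart = load_cart  # callable that reads from cold storage
        self._cart = None            # nothing is allocated on arrival

    @property
    def cart(self):
        # Allocate only when the user first touches the cart,
        # that is, at the point where the user gets interesting.
        if self._cart is None:
            self._cart = self._load_cart(self.user_id)
        return self._cart


cold_storage_reads = []  # records every (simulated) cold storage hit


def load_cart_from_cold_storage(user_id):
    cold_storage_reads.append(user_id)
    return []


# 1,000 visitors arrive; assume only 30 of them ever add an item.
sessions = [Session(uid, load_cart_from_cold_storage) for uid in range(1000)]
for session in sessions[:30]:
    session.cart.append("shoes")

print(len(cold_storage_reads))  # 30 reads instead of 1,000
```

With eager allocation, all 1,000 visitors would have paid for a cold storage read. Under these assumptions, only the 30 who started to convert did.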
James Pulley[00:10:24] The next question we need to ask is related to how long we hold onto a resource before we collect it. Now, a whole industry has started up surrounding this issue, which is issues related to monitoring your garbage collection. Now, I prefer to think of myself as a systems programmer, which puts me in the realm of C. So when I free something up, it just goes away immediately. I don't have to worry about the garbage collector coming around and asking, you know, “Has it been marked and all? Is it ready to go yet? Let's get it out of here.” But this is an actual serious problem. Why? Because once again, you might have a marketing manager coming into play and saying, “Hey, we need to hold on to that session, because sessions are expensive. Hold on to that session for 30 minutes.” Well, that's probably longer than your average page-to-page transition time or even your maximum page-to-page transition time on a website, and as a result, you're locking up resources that could be used for other people coming through the system. Getting back to that issue of high scalability, when you have a highly scalable system and you have a high number of users, you need to hold on to your resources for as short a window as possible, because the resource you free is going to go to the next person who needs it.
James Pulley[00:11:51] We all probably remember healthcare.gov. This was a real problem on healthcare.gov: how long they locked up resources. People came onto the site and were asked to fill out an extensive amount of information in order to get a quote, when most people were really just looking for information, and that held that session open for that user for forty-five minutes, an hour, an hour and a half, until they were done completing their form and submitting everything back. This had the effect of locking up resources so other users could not get on the site. And eventually, the site simply ran out of resources and fell over. And this was a cloud-based site. You would think they would have auto-scaling turned on and things of that nature, but this is a real problem, and you can find it on your own web server today when you take a look at how your web server cuts sessions or how your database connection pool cuts sessions. If you hold on to them too long, you're simply going to run out of connection pool connections to go to the database. You're simply going to run out of connection handles to get to the web server. So be a little aggressive. Analyze how long you're taking on the system. In my case, my favorite statistic is the web page-to-page transition time: how long it takes to go from page one to page one plus one, or page two to page three. And that's usually somewhere in the two-to-three-minute range for most people. If you happen to look in your server configuration and see that your session is being held for 30 minutes, that it's being held for 1,800 seconds, well, perhaps it's time to trim that down a little bit, because you're going to have issues.
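The cost of a long session timeout can be estimated with Little's Law (average number in the system equals arrival rate times time in the system). The following is a back-of-the-envelope sketch, not a capacity model: the arrival rate of 10 new sessions per second is an assumed figure, and it treats each session as held for exactly the configured window.

```python
def sessions_held(arrivals_per_second, hold_seconds):
    """Little's Law: average number held = arrival rate x hold time."""
    return arrivals_per_second * hold_seconds


ARRIVALS_PER_SECOND = 10  # assumed for illustration only

# 1,800 s is the 30-minute timeout; 180 s is the three-minute
# page-to-page transition time mentioned in the talk.
for hold_seconds in (1800, 180):
    held = sessions_held(ARRIVALS_PER_SECOND, hold_seconds)
    print(f"hold {hold_seconds:>4} s -> about {held} sessions held at once")
```

Under these assumptions, trimming the hold time from 30 minutes to three minutes cuts the number of sessions held at any moment by a factor of ten, which is the resource that becomes available to the next person who needs it.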
James Pulley[00:13:46] Next up, we want to ask questions about how often you hit a resource. We refer to this as your frequency. We've all been in situations, or nearly all of us have, where we have a subquery that is out of control, a subquery that is populating a view or is part of a business process against a database. And the subquery has to be run ten thousand times. So each individual query may not be that expensive, but when you add up the cost of ten thousand runs, it absolutely is. I've actually been in a situation where I've observed this issue inside of a Java virtual machine as well, where we have a whole bunch of classes and subclasses and permutations of classes going on. And so we get thousands or tens of thousands or hundreds of thousands of calls till we get to the absolute primitive object and come all the way back up through the stack. So it's very important that you look at your frequency of calls as well. And then, to bring us around to the very end, we have to ask the question of quantity: “Are we oversubscribing?” Now, this is a very common issue involving the network. The network is the most expensive of your finite resource pool because we have to leave the bus and we have to go across what is, to the computer, an exceedingly slow connection to either send or receive data. And websites being what they are today and digital cameras being what they are today, it is very easy to have images of eight megabytes or greater on websites. Now, make no mistake about it, these pictures are beautiful. But there is a problem with having an eight-megabyte image on your front page. It does slow it down. This issue of quantity also drives the content delivery network industry. How often we use your network, that's frequency. How much data we're passing, that's quantity. We want to reduce the amount of data being sent, we want to put it in a location closer to the user on the network, and ideally, we'd like to reduce the frequency.
So we address quantity and frequency by adjusting the cache model. That's how performance gets improved: we take a look at each one of these elements in the finite resource pool, CPU, disk, memory, and network, and we look for a way to shave a little bit at a time from a lot of requests for each of those resources in order to improve performance. So if I can pull microseconds away from the CPU, if I can pull milliseconds away from the network, if I can make the disk more efficient, if I can reduce my memory footprint, all of that has benefits in terms of my performance.
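The frequency question can be illustrated with a small, self-contained sketch using Python's built-in `sqlite3` module. The `orders` and `customers` tables and the two function names are hypothetical; the point is only to count how many times the database is hit to produce the same answer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?)",
    [(i, f"customer-{i}") for i in range(100)],
)
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [(i, i % 100) for i in range(10_000)],
)


def names_one_query_per_row():
    """Runs the lookup once per order: cheap each time, costly in total."""
    calls = 1
    order_rows = conn.execute(
        "SELECT customer_id FROM orders ORDER BY id"
    ).fetchall()
    names = []
    for (customer_id,) in order_rows:
        row = conn.execute(
            "SELECT name FROM customers WHERE id = ?", (customer_id,)
        ).fetchone()
        calls += 1
        names.append(row[0])
    return names, calls


def names_single_join():
    """Produces the same result with a single statement."""
    rows = conn.execute(
        "SELECT c.name FROM orders o "
        "JOIN customers c ON c.id = o.customer_id ORDER BY o.id"
    ).fetchall()
    return [r[0] for r in rows], 1


slow_names, slow_calls = names_one_query_per_row()
fast_names, fast_calls = names_single_join()
print(slow_calls, fast_calls, slow_names == fast_names)  # 10001 1 True
```

Both functions return identical data, but the first touches the database 10,001 times. Against a real networked database, each of those round trips would also cross the slowest resource in the pool.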
James Pulley[00:17:00] Okay, so now let's look at this with kind of a jaded perspective, knowing where we are in the industry today for performance engineering tools. If I want a fully-featured tool, I need to be able to ask all of these questions. And there are a lot of tools in the market which only support a subset of these questions. And for a performance engineer, that's really a problem, because as I look at tools, I do need to have a balance in my tools. I need to be able to collect timing information, because that's my end-user experience in this case, but I also need to understand why, or what is driving that. And that's my monitoring collection. But interestingly enough, it's really in the analysis where we're adding the majority of our value as performance engineers. If you think of a traditional performance test cycle, we have this balance of capabilities and we have this ever-increasing value. I may spend a month building a test in order to generate my timing records, in order to collect my monitoring resources and things of that nature. And at the very end, I have a manager demanding to know the output in under an hour after the test has finished. That's a real problem, because it is in that one hour where we have our limited window to deliver maximal value, and this is where most organizations are struggling today. You spend all of this time building the test and executing the test, and in some cases, you're executing a test without the benefit of monitors, without the benefit of the analysis. You're just collecting timing records. So you have a limited opportunity to deliver value. I want to suggest to you today: go back to your organizations and find the earliest possible opportunity that you have to collect timing records. Once you have that, find the earliest possible way to collect monitoring data to match up to that.
That might actually be in the unit testing phase, where someone is looking at an individual query to a database or an individual web services call in development, and this could be long before you actually get to any sort of multi-user performance test.
James Pulley[00:19:47] There's a little dark secret that I should expose in this case, which is that about 80 percent of the performance issues that we find in multi-user performance tests can be found with one user or one unit test. There is evidence at that small of a level, that atomic level, that there is a performance issue, in the form of a time that is too long, where budgets have been exceeded. And then you could take a look at how resources are being used to find the root cause and send it back and get it fixed much earlier. For those organizations which are struggling to find value in their performance testing practice, consider a performance engineering perspective: how do we collect this information early? That information could be collected through RUM, real user monitoring integration; it could be collected from logs; it could be collected with other instrumentation inside of, say, a Java virtual machine. But ultimately you want to find the earliest possible location and time to collect that data.
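One way to collect a timing record at the single-user, unit-test level is sketched below. It is only an illustration of the idea of a time budget; the `lookup_price` function, the `timed` and `within_budget` helpers, and the half-second budget are all hypothetical.

```python
import time


def timed(fn, *args, **kwargs):
    """Call fn and return (result, elapsed seconds) as a timing record."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return result, elapsed


def within_budget(elapsed_seconds, budget_seconds):
    """True when the measured time has not exceeded its budget."""
    return elapsed_seconds <= budget_seconds


def lookup_price(sku):
    # Hypothetical unit under test; stands in for a query or service call.
    return {"sku": sku, "price": 42}


BUDGET_SECONDS = 0.5  # assumed budget; real budgets come from requirements

result, elapsed = timed(lookup_price, "ABC-1")
status = "within" if within_budget(elapsed, BUDGET_SECONDS) else "over"
print(f"lookup_price took {elapsed:.6f} s, {status} budget")
```

A check like this can run alongside ordinary unit tests, so a call that is already out of budget with one user is flagged long before any multi-user test exists.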
James Pulley[00:21:07] On a typical basis, here's how most performance engineering organizations develop. We begin with time. Hopefully, we progress from this point. We collect time, then we need to understand why certain things are going on. Monitoring is the key to this. And then finally we use it to get to the point where we can deliver value through analysis. This is not a case of turning on the lights one day and delivering value the next. Typically, for an organization, mechanically, it actually takes quite a while to get proficient just at collecting time. If you're doing it in a performance testing context, it could be three months to a year for different individuals to mature in just their testing tool use. If you elect not to train or not to have a mentor for your group, then that time is probably going to be extended. You may think that you're saving money, but you're actually impacting the delivery of value. Once you get to monitoring, we now have two components of that balanced toolset that we need to have. And finally, when you get to the analysis, this is really where you begin to deliver value. Once again, this reinforces the prior slide: the earlier you can provide some analysis, even if it's on a single-user basis, the better. I would even argue you should go back to the requirements and see whether there are no performance requirements, that is, whether you have no way to advise the developers what they should expect and how they should set budgets, and no way to advise the software-as-a-service vendor what your expectations are for them. If those are absent, you're almost always going to have an issue and struggle in the performance testing and performance engineering phases of your project, so find that early on. The earlier, the better, and if you can find it and fix it in the requirements phase, that will work miracles downstream in how it shifts your organization.
James Pulley[00:23:40] Now, as part of my last slide, I want to talk about the tools that are in use in the industry today and how those tools fit together in a performance engineering context. So I want to start with what you might call a fully-featured tool. A fully-featured tool has the ability to either pull, collect, or generate a timing record, a transaction, as we normally term it in performance testing. It has the ability to collect monitoring data directly from the tool. It doesn't require a third-party tool; it's within the tool itself. And then finally, you have the analysis that you are able to conduct within the tool itself. So think about that. Three legs on a stool: you have timing record collection, you have monitoring, and you have analysis, all within one tool. You have no other dependencies and you can move forward.
James Pulley[00:24:47] Next up, we have in the market a large number of partially capable tools for performance testing and performance engineering, and when I say partially capable, that means at least one of these three capabilities is absent. I have the ability to load and report, but maybe not monitor or analyze, things of that nature. So if I'm looking at a tool, if I'm only interested in response times and I've never done anything else and I'm a performance tester, I will look at another tool that only generates response times and say, “Hey, that's what I do, I generate response times, I collect timing records. That's perfect for me.” The problem is that when you come to use it in a performance engineering context, it doesn't have all the data points that you need. In order to deliver value, you're going to have to use another tool or collection of tools to add that. Then we have a class of tools in the market called load throwers. These are tools that can actually walk through a business process for a given interface, say, web, but they have no monitoring capability to speak of. They really have very limited reporting capabilities, and analysis capabilities and visualization capabilities are almost non-existent. So we have tools that say, “Hey, I can point at your website and I can turn them on and generate five hundred or five thousand or five hundred thousand users. And I collect a bunch of records for how long that takes for all those sessions. And then I can maybe print off a report and hand it to somebody.” And then the common question comes back: why? You will find that these tools are typically paired with other tools that do the monitoring and a third tool that will do visualization. You may have a fourth or fifth tool that does analysis and reporting.
So this becomes like a Lego project where I have a brick to do X, another brick to do Y, and another brick to do Z. You're not really getting the full capability within one tool. You have many tools to support and many tools to maintain. And then finally, at the bottom, we have this class of tools called denial-of-service tools, and my favorite tool in this category, because it has such a cool name, is Bees with Machine Guns. I mean, you can just see that whole swarm of bees with machine guns, and they're all coming to get your server. I mean, the person who named that is really just a genius. But what separates the denial-of-service tools from the load throwers is that the denial-of-service tools really operate on a somewhat uncontrolled basis. They don't really collect timing records. They don't really walk through a business process completely. They have very limited reporting. They assume that you're going to run all sorts of monitors on the back end. And basically, they're just there to turn a fire hose or a flamethrower on your website, and that's about it. Now, you can get some data out of a denial-of-service tool from the monitoring on the back end, maybe how your cache model is working, how it loads up your CDN provider, things of that nature. But a word of warning: CDN providers don't like these denial-of-service tools either. They see them as an attack, and sometimes they will turn off your load generator sites as a result. So just keep that in mind. Here are the key definitions as we move forward, for you to evaluate what you're using in your environment. We have fully-featured tools, which have that balance of capabilities. We can either generate a timing record or collect a timing record. We have monitoring capabilities within the tool. And then finally we have analysis and reporting capabilities. As we move further down, we have partially capable tools.
They don't have the balance of those three capabilities, and you get fewer and fewer capabilities as you go further and further down this list. As a general rule of thumb, the fully-featured tools are mostly on the COTS front, the commercial off-the-shelf tool front, and the denial-of-service tools and the other ones that are partially capable tend more towards the GNU or open-source front. So if you've ever wondered why there is a break between those tools, some people say, “Oh, tool X is perfect,” but they only generate timing records. If I had a fully-featured tool and I only generated timing records, I would see no value in those other capabilities, which are there but which I don't use. Likewise, if you're on the side where you have a tool that only generates X and you need to answer questions, now you're back to that Lego construction project of bringing in other tools to fill those gaps.
James Pulley[00:30:20] So lastly, let's take a look at the marketplace of tools out there. There are a lot of tools represented here. Some are on the open-source front, some are on the commercial off-the-shelf software front, but they all have some capability for either generating a timing record, collecting some monitoring data, or analyzing. Now, not every tool has all of them. For instance, let's say I was using Gatling in a development context. It might be perfect for exercising my web services call, but it might have limited capability to report the type of timing that I need, the type of analysis, the type of monitoring. I can use other tools to fill in that gap. That gets to the point I made earlier. As performance engineers versus performance testers, we want to look as early in the lifecycle as possible to begin collecting those items: collecting that timing record, collecting resource measurements for CPU, disk, memory, and network, and asking those questions of when we allocate a resource, when we collect it, how often we hit a particular resource, and how much of that resource we're using. Because if we can find that early, and if we can find it substantially early, we can save the company quite a bit of money. We can also improve our value to the organization in a performance engineering context, because they will no longer see you pigeonholed in a particular location: “Oh, James really only answers multi-user performance questions, so don't even ask a question until you get to multi-user performance.” Well, that's not true. I can help you answer questions related to single-user performance. In fact, I might have a RUM agent installed inside of functional testing and development so I can get data as early as possible and identify and point out, “Hey, we have some nonperforming code early. 
Let's go take a look at it on a single-user basis.” I might be collecting logs and looking at them using Splunk or the ELK stack, in which case I could potentially see a collection of timing record data as early as when the first lines of a web services call are built. I can see how the times are shaping up. I can see if they're within budget, and I can begin advising quite early, on an active basis versus a passive basis where data comes to me. I can say, “Hey, Mark, I see that service you're writing. It's out of budget. Should we take a look at it in greater depth?” Mark may come back to me and say, “Oh, no, no, no, I have a whole bunch of debug stuff in there right now to make sure that I'm getting all of the information back that I need. Once I pull that out, then we'll take a look at it.” So have that dialog early, have that dialog often, and use as many tools as you want to get there. But also keep in mind, when you come to performance testing, it's probably best to have a full-featured tool, just because it reduces the complexity of your test environment. Now, I'm going to get a lot of arguments about that. I'm fully aware, and I will tell you as a services provider, if you come to me with an open-source tool and say you must deliver with an open-source tool, I'm probably going to have to charge you more, because I want to deliver value and I have to go add other tools to the one that you've told me to use in order to deliver on that value proposition of analysis, collection of monitoring data, and timing records, bringing them all together in one location in order to deliver that value. Thank you very much for your time today. If you have any follow-up questions from this presentation, please feel free to reach out to me. firstname.lastname@example.org or email@example.com.
Joe Colantonio[00:34:48] Thank you, James, for your performance testing awesomeness, and thanks to everyone at PerfBytes for all you do for the performance engineering community. As I mentioned, this was taken from a 2019 PerfGuild online conference. I'll have a link in the show notes, so if you want to watch the whole video, you can still get the recordings for this event. To do that, all you need to do is head on over to TestGuild.com/64, and while there, make sure to click on the Try Them Both Today link under the exclusive sponsor section to learn all about SmartBear's two awesome performance test tools, LoadNinja and LoadUI Pro. So that's it for this episode of the Test Guild Performance and Site Reliability Podcast. I'm Joe. My mission is to help you succeed in creating full stack automation awesomeness, which includes performance testing and site reliability. As always, test everything and keep the good. Cheers.
Outro[00:35:43] Thanks for listening to the Test Guild Performance and Site Reliability podcast. Head on over to TestGuild.com for full show notes, amazing blog articles, and online testing conferences. Don't forget to subscribe to the Guild to continue your testing journey.
Rate and Review TestGuild Performance Podcast
Thanks again for listening to the show. If it has helped you in any way, shape or form, please share it using the social media buttons you see on the page. Additionally, reviews for the podcast on iTunes are extremely helpful and greatly appreciated! They do matter in the rankings of the show and I read each and every one of them.