Live Kubernetes Optimization in Production with Stefano Doni

Published on:
Stefano Doni TestGuild PerformanceFeature

About this Episode:

How cool would it be to optimize live systems running in production environments? In this episode, Stefano Doni, CTO at Akamas, will share some ways to meet the challenge of optimizing Kubernetes in production. Discover how AI technology can help you with automated context detections, safe recommendations against customer-defined SLOs, and other capabilities to achieve cost efficiency and application performance goals without compromising SLOs.

TestGuild Performance Exclusive Sponsor

SmartBear is committed to helping you release high-quality software, faster, so they created LoadNinja to make performance testing effortless. Save time by eliminating correlations with automated, browser-based performance testing that ensures your application performs reliably when it’s needed most. Try it free today.

About Stefano Doni


Stefano is obsessed with performance optimization and leads the Akamas vision for Autonomous Performance Optimization powered by AI. With more than 15 years of experience in the performance industry, he has worked on projects for major national and international enterprises. He has presented several talks at various conferences, including SREcon, CMG, Performance Summit, and PAC. In 2015, he won the CMG Best Paper award for contributing to capacity planning and performance optimization of Java applications. In 2017, he shipped one of the market's first Kubernetes capacity optimization solutions.

Connect with Stefano Doni

Rate and Review TestGuild Performance Podcast

Thanks again for listening to the show. If it has helped you in any way, shape or form, please share it using the social media buttons you see on the page. Additionally, reviews for the podcast on iTunes are extremely helpful and greatly appreciated! They do matter in the rankings of the show and I read each and every one of them.

[00:00:05] Get ready to discover some of the most actionable performance engineering, testing, and site reliability advice with some of the world's smartest engineers. Hey, I'm Joe Colantonio, host of the Test Guild and SRE Podcast, and my goal is to help you succeed in creating application performance awesomeness.

[00:00:26] Joe Colantonio Hey, it's Joe, and welcome to another episode of the Test Guild Performance and Site Reliability podcast. Today, we'll be talking with Stefano all about optimizing configurations of Kubernetes and some really cool technology that his company just announced. Really excited about this. But before we get into it, if you don't know, Stefano is obsessed with performance optimization and is the CTO at Akamas. With more than 15 years of experience in the performance industry, he's worked on multiple projects for both national and international enterprises. He's presented at a ton of different events like SREcon, CMG, Performance Summit, and PAC, and I think he's going to be speaking at KubeCon coming up in a few weeks, which I'm sure we'll talk about. In 2015, he won the CMG Best Paper award for contributing to capacity planning and performance optimization of Java applications. And in 2017, he shipped one of the first Kubernetes capacity optimization solutions on the market. I'm really excited to have him on the show. If you have anything to do with SRE, anything to do with Kubernetes, you don't want to miss it. Check it out.

[00:01:26] This episode is brought to you by the awesome folks at SmartBear. Listen, we know load testing is tough but necessary. So investing in the right tools to automate tests and identify bottlenecks to resolve issues fast saves your organization both time and money. And that's why SmartBear created LoadNinja, a SaaS load testing tool to help teams get full visibility into performance so you can release quality software faster than ever. Make testing effortless and give it a shot. It's free and easy to try. Head over to loadninja.com to learn more.

[00:02:00] Joe Colantonio Hey, Stefano, Welcome back to the Guild.

[00:02:03] Stefano Doni Hi, Joe. Thanks for having me today. I'm really excited.

[00:02:05] Joe Colantonio Awesome. Great to have you. Like I said, it sounds like you're doing a lot of great work at Akamas and I'm excited to find out more about that. Before we do it, is there anything I missed in your bio that you want the guild to know more about?

[00:02:16] Stefano Doni No, It was great. So thanks a lot for summarizing that.

[00:02:20] Joe Colantonio Sweet. So last time we spoke, I think it was last year, episode 81, so it's almost been a year; I think it was in November. I've been seeing a shift from performance to SRE, and I don't know if that's something you've been seeing as well. I think in the preshow you mentioned that Akamas is maybe rolling out more features to help SREs. Is that a direction you've been seeing more and more of?

[00:02:41] Stefano Doni Yeah, exactly. What we've been working on lately is a whole new set of capabilities in the product, the 3.0 version of Akamas, which brings some major innovations around exactly those topics. For those who are familiar with Akamas, up to now the way it worked was to use AI to help performance engineers automate the tuning of their application and infrastructure configurations, things like JVM configuration, database configs, or Kubernetes resource settings, mainly in pre-prod environments. So leveraging load testing and tooling to automate the configuration tuning there, focused on performance engineers, before the application is released to production or when a new business event might arise. That's what performance engineers did with Akamas before 3.0. And with 3.0 we're very excited because we bring a major innovation into the platform, which is the ability to optimize applications directly in production. That's a new shift for the product, and as you were mentioning, it's more geared toward SREs, people who are interested in taking care of the reliability, cost efficiency, and performance of applications running in production.

[00:04:00] Joe Colantonio Nice. Is this a shift, or is it just added benefits? So you still serve performance engineers, but you have more and more features now to help SREs as well?

[00:04:08] Stefano Doni Yeah, good question, Joe. Akamas 3.0 becomes a single platform that lets you address both scenarios. The performance optimization and performance engineering work in pre-production and staging environments still makes a lot of sense if you think about DevOps trends: being able to, for example, make sure configurations can sustain the high level of load that we're expecting, something you still don't want to discover in production. Or addressing cost scenarios, or chaos engineering scenarios, like identifying the best configuration for my application that maximizes reliability under simulated failure scenarios. So there are still a whole lot of interesting use cases for which you might want to use Akamas in pre-prod environments. And we're adding a new part of the product that lets you take the next step: once you have launched your application in production, you typically need to make sure it's optimized based on the new traffic and the actual way users exercise the application, which might be different from what you designed in your test scenarios, or to keep track of the changes that happen to the application and the infrastructure in production. So it's an addition, and Akamas 3.0 becomes a holistic optimization platform that can handle both scenarios.

[00:05:33] Joe Colantonio So this might be completely off the wall and incorrect, but is there a shift because pre-production is not as relevant as it once was? With cloud native and using all these services in production, you can never really get the real performance, the real optimization, in pre-production. Is that one of the reasons for this, or am I just off base there?

[00:05:51] Stefano Doni Yeah, that's a great topic. If I think about what we see in the market, we certainly see a shift toward moving directly to production, doing much more experimentation in production as part of the DevOps practice of shipping small increments of code to production and being able to quickly test what will be the best product for the user. It's a mindset shift that we're seeing, and of course it facilitates doing this kind of optimization directly in production, which is great for our use cases too. For some companies, we see that they're actually reducing their investments in performance testing. So I guess it depends a lot on the actual business needs. We see two trends: some organizations are not investing anymore in performance engineering, given the complexity, the skills required, and the investments needed, and are moving toward a more agile way to test things in production. Other kinds of companies, financial companies for example, we see investing more in ensuring reliability before shifting to production, to avoid incidents once they launch.

[00:07:06] Joe Colantonio So what are the benefits of testing in production, basically?

[00:07:10] Stefano Doni The thing is that when SREs look at applications in production, they typically have different kinds of use cases and different business needs. One thing we see is being able to optimize cost; that's for sure one of the biggest drivers. In the cloud-native space, much of the responsibility for optimizing cost lies with application teams, because at the end of the day the cost of a Kubernetes cluster depends mainly on the way the applications are configured: how big those containers and pods are. So that's one thing: being able to help teams reduce costs. The other side of the coin is performance, where teams are struggling to get the performance or reliability they need. Again, due to the tricky way that Kubernetes manages resources, we find that many, many teams are struggling with that, so they risk production incidents, which can cause a lot of damage, reputational damage and so on, due to the complexity of the environment and the stack. So those are the benefits and the drivers. And the other thing is that there's a skills shortage in the industry: it's not easy to find people who are able to do this kind of work, and it takes a lot of time even for them. So the idea is to provide a smart system that can help those people make better decisions in drastically less time.

[00:08:46] Joe Colantonio So I think last time we spoke, maybe with someone else, a lot of the issues we saw were due to configurations of Kubernetes. So how does this work now in live production? Is it all automated? And like you mentioned, there's a skill shortage, so are you trying to do all the heavy lifting? How does that work?

[00:09:01] Stefano Doni Yeah, that's great. The way we work in live optimization, when we're optimizing in production, is to try to address the whole problem. First of all, Akamas observes how your application is working in production. We observe the traffic, the patterns of response times, and several other application metrics, and Akamas is able to recognize the patterns in the workload, to understand if there are peaks, valleys, and so on, in order to understand how the application is behaving. And based on the AI models that we build, Akamas provides recommendations. If we think about how we worked in pre-prod environments, the whole optimization cycle was completely automated: nobody had to stare at the screen and approve the changes, because the whole process was automated. When we moved to production, the requirements changed a little bit, meaning that people want more control. So once Akamas identifies the next configuration to try, it doesn't simply change the application automatically while it's running in production. Instead, Akamas recommends a change to the user, and the user is able to review the changes: for example, I want to change the CPU and memory requests and limits for this Kubernetes pod, or I want to change JVM parameters like the garbage collector or the heap size. SREs want to have control and want to understand what will be pushed to production. So it's a human-in-the-loop approach, which is accepted by the market because they want that level of control.
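[Editor's note: to make the kind of recommendation Stefano describes concrete, here is a hypothetical sketch, not Akamas's actual output format. The deployment name, values, and JVM flags are illustrative; a recommendation is essentially a proposed diff against the current pod resources and runtime settings.]

```yaml
# Hypothetical recommended change to a Kubernetes Deployment,
# shown as suggested values with the previous ones in comments.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: orders-service        # hypothetical microservice name
spec:
  template:
    spec:
      containers:
        - name: app
          resources:
            requests:
              cpu: "500m"      # was: 1000m
              memory: "768Mi"  # was: 1024Mi
            limits:
              cpu: "1000m"     # was: 2000m
              memory: "1Gi"    # was: 2Gi
          env:
            - name: JAVA_TOOL_OPTIONS
              # was: "-XX:+UseParallelGC -Xmx896m"
              value: "-XX:+UseG1GC -Xmx512m"
```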

[00:10:38] Stefano Doni The next step is to roll out the change to production, and that completes the optimization cycle. That typically gets done by integrating with DevOps pipelines. Again, people don't want any tool changing the application while it's running in production without control. So what we do is integrate with the DevOps processes the customer already has. This typically means, for example, pushing a configuration to a repository, so not changing the actual cluster, but pushing a new config to the repository, where the DevOps team has the chance to review and merge the request, and then a pipeline, driven by Jenkins for example, actually rolls out the change to production. So that's how it works.

[00:11:24] Joe Colantonio Yeah. I'm not sure if you're familiar with visual validation. A lot of times it takes a baseline, then it sees a difference, and then it asks the user: do you want to make this change based on this difference? Is this the same kind of approach? Does it just prompt you like, hey, we noticed this, blah, blah, blah, do you agree? And if so, do you accept these changes?

[00:11:40] Stefano Doni Yeah, exactly. So we present the parameters. Our recommendation is, at the end of the day, a change to the configuration. We let the user see what the current configuration is and what the next one suggested by Akamas is, so you can actually see what the changes are and get familiar with each change, and users also have the ability to revise it. Let's say you want to skip a particular change or adjust the values of a parameter; you're allowed to do so. And in our UI we also present an analysis, and I think this is important in terms of explaining the details of what the AI is recommending, so that users can actually see in our product how the application is performing. For example, maybe response time is degrading, so the recommended change is a kind of corrective action: we want to increase the CPU limits that we assign to the pod in response to a performance issue. So that's the explainability capability that we provide in the platform, which I think is key for people to accept those changes.

[00:12:43] Joe Colantonio Nice. As the holiday season rolls on with Black Friday, will this help you scale up if you're having issues? And if so, does it say, hey, we can scale up to blah, blah, blah, but there's going to be this cost associated with it? Does it give you that type of data?

[00:12:55] Stefano Doni Yes. So what we do typically is right-size the containers, for example selecting the proper CPU and memory requests and limits for your containers. But we can also act on autoscaling parameters. That's another use case, where people typically already use the autoscalers from Kubernetes, like the HorizontalPodAutoscaler. There we can optimize the HorizontalPodAutoscaler settings, for example to reach a goal like: let's find the best configuration that keeps cost at a minimum but provides the required level of performance and reliability that I'm expecting.
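[Editor's note: for context, the Kubernetes HorizontalPodAutoscaler Stefano mentions is configured with parameters like replica bounds and a target utilization, which is what tuning trades off against cost. A plain sketch with illustrative names and values:]

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: orders-service-hpa     # hypothetical name
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: orders-service
  minReplicas: 2               # these bounds trade cost against headroom
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70   # scale out when average CPU exceeds 70%
```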

[00:13:34] Joe Colantonio Because you're focusing on live production, I think you mentioned SRE, but you also mentioned, I think, chaos engineering. Are there any features that help you with chaos engineering? Can you run little experiments to see what happens, and if it works, give some insight: hey, that worked, maybe you should push that configuration going forward, or anything like that?

[00:13:51] Stefano Doni Yeah, somewhat. On the chaos engineering side, we find that people today prefer to start in pre-prod environments; it's kind of a journey. Of course we know the end game would be doing things in production, but what we see is that people are starting to adopt chaos engineering in pre-prod. So that's one scenario we see, leveraging load tests, because at the end of the day you need to reproduce things in pre-prod environments. When working in production, one practice that we see a lot is using canary releases. For example, when Akamas recommends a change, the question is: when do we apply that change? Sometimes we apply the change to just a small portion of the pods that are serving, for example, a website, and then you can observe how this subset of pods performs, comparing it against the main control group in a way. So that's another way to try out different settings in production. Of course, you need to be much safer: in a pre-prod environment you're allowed to do anything you want, change any parameter, and so on; in the prod environment you need to be more careful. And that brings a whole lot of safety capabilities that we had to include in our platform to make sure the recommendations are good for production and are not causing any kind of adverse effects.

[00:15:23] Joe Colantonio Very nice. Because it's using AI and machine learning, if you're using it on the same application over time, does your technology get better, with better insights? Does it get smarter over time as it gets a larger dataset?

[00:15:35] Stefano Doni Yeah, that's a good question, and I guess that's super interesting, because that's really the core of the AI. We see, by the way, a lot of marketing around solutions that use AI to optimize applications. One distinctive aspect of what we've built is the learning piece. As you said, because we see different workloads, we can learn which configuration works best for each workload, and Akamas actually learns. We might identify that for this particular microservice, say, a specific garbage collector for the JVM works best, maybe contrary to another microservice, which prefers another setting of the JVM or the containers. This learning is key, because the AI can adapt to the specific microservice and make sure we extract cost efficiency from each application by learning and auto-tuning each application. That's very different from other approaches that just use thresholds, like: if you are above 80% CPU usage, just add CPUs. That's a much simpler approach, where no real learning is involved.

[00:16:44] Joe Colantonio So how much of this can you put on autopilot over time? Say you've been running it for a year, it's been working, and you've been accepting things, and you're like, look, I get the insights, but I don't have time for this. Can I just check this off: I accept it, do it automatically. Is that an option?

[00:16:59] Stefano Doni Yeah, that's a great question, Joe. We have two modes in production optimization: manual approval and fully automated. Manual approval is basically the process I've just described, human in the loop: you get recommendations that you review and, if you want, apply to production according to your processes. As the customer gains trust in the recommendations, they have the ability to switch to a fully autonomous mode, where Akamas can skip the manual validation step and, for example, either apply the changes directly to your Kubernetes cluster or, again leveraging your processes, change the configuration of the microservices in the repository, in a GitOps approach, and then trigger your pipeline completely automatically. That's kind of the future state, where customers can reach that kind of maturity, and we also envision a gradual path toward it, so that some parameters might be changed even without restarts of the application; those might be configured by Akamas automatically, without any prior approval. You also have the ability to define ranges and increments. For example, you can say: only change the parameter within these ranges, because I don't want the AI to pick very big or very small values for my CPUs, which might be risky. So you get those kinds of safety policies that you can define within the platform. And I think that's one of the main features, a main differentiator we have, because people want those kinds of controls to trust a solution like this.
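[Editor's note: the ranges-and-increments safety policy Stefano describes can be sketched in a few lines. This is a hypothetical illustration of the idea, not Akamas code: a recommended value is clamped both to a user-defined absolute range and to a maximum step away from the current value.]

```python
def apply_safety_policy(current, recommended, lo, hi, max_step):
    """Clamp an AI-recommended value to a user-defined safe range [lo, hi]
    and to a maximum increment per change (units are arbitrary)."""
    # Limit how far a single change can move away from the current value.
    step_limited = max(current - max_step, min(current + max_step, recommended))
    # Keep the result inside the absolute allowed range.
    return max(lo, min(hi, step_limited))

# Current CPU limit is 1000 millicores; the AI suggests 200m, but the policy
# allows at most a 300m change per step and never goes below 500m.
print(apply_safety_policy(1000, 200, lo=500, hi=4000, max_step=300))  # 700
```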

[00:18:48] Joe Colantonio Absolutely. I know in performance, since I've been involved all these years, it's always been: either throw more hardware at it or optimize your application to be more performant. And I guess as more people go cloud native, they start using Kubernetes, but this isn't a magic option. If your application isn't optimized to perform, it's not like this is going to fix all that. Do you have any lessons learned on Kubernetes and Java optimization challenges you've seen with your clients or customers?

[00:19:17] Stefano Doni Yeah, actually, we have been working with many customers around Kubernetes lately, and Java especially is still one of the most popular languages for modern microservices, leveraging frameworks like Spring Boot. So we have quite a few interesting lessons learned there. And I like your two pillars, as I'd call them: we have typically dealt with these kinds of problems by throwing hardware at them or by fixing the code. But that doesn't address the layer in between, which is the configuration. And it's not simply the configuration at the container level, but at the runtime level too. At the end of the day, what runs within a container or a pod is a runtime, like the JVM or the Go runtime, and the runtime is ultimately responsible for driving resource usage: the amount of memory or CPU a Kubernetes application will be using. Of course, that depends on the actual application and business logic, but there's a part that is the responsibility of how the JVM manages memory, how the JVM manages CPU, how the garbage collector works. What we have found is that one of the biggest issues is that people are suffering reliability problems in Kubernetes applications: the dreaded out-of-memory kills. Kubernetes is very different when it comes to managing memory compared to the operating systems we worked with in the past. There's no swapping, there's no slowdown: Kubernetes simply kills your pod, your container, the moment you reach the memory limit you configured. A lot of teams are struggling with that, and the issue is that, again, the way the runtime works is key.
For example, what we have found is that many organizations configure the JVM with the heap size very close to the memory limit. The problem is that the JVM allocates extra memory besides the heap, and that very easily triggers out-of-memory kills and all of a sudden brings services down in production. And it's a hard problem, because there are no well-defined configuration options for the JVM to keep the whole footprint within the limit. So we've had a lot of learning there. The key to the solution is that you need to properly configure not just the CPU and memory settings at the container level, but also the configuration of your runtime, especially the JVM. And we're seeing similar and interesting developments on other runtimes, which are following a similar path. It's not just the JVM: pretty much the same is happening in other garbage-collected languages, which include .NET, for example, or Golang, or even Node.js. So that's a common thing we're seeing.
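[Editor's note: to make the heap-vs-limit problem concrete, here is a hedged sketch with illustrative numbers. A JVM with `-Xmx` set close to the container's memory limit leaves no room for non-heap memory (metaspace, thread stacks, GC structures, direct buffers), so the total footprint can exceed the limit and the pod gets OOM-killed. One common mitigation is sizing the heap as a fraction of the container limit, e.g. with `-XX:MaxRAMPercentage` (JDK 10+).]

```yaml
containers:
  - name: app
    resources:
      limits:
        memory: "1Gi"
    env:
      - name: JAVA_TOOL_OPTIONS
        # Risky: heap alone nearly fills the 1Gi limit, leaving almost no
        # headroom for metaspace, thread stacks, GC overhead, and direct
        # buffers, so the container can be OOM-killed under load:
        #   value: "-Xmx960m"
        #
        # Safer: let the JVM size the heap as a fraction of the container
        # limit, keeping headroom for non-heap memory (75% is illustrative):
        value: "-XX:MaxRAMPercentage=75.0"
```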

[00:22:17] Joe Colantonio For those issues in live production, does it help with costs too? I guess if you don't have things sized right. Does it help you with pod right-sizing and node right-sizing also?

[00:22:27] Stefano Doni Yeah. Basically we do what we call application-aware optimization, by which we mean working at the different layers of the stack. We work at the pod level, being able to right-size the containers and the pods, the CPU and memory requests and limits of the containers. But we also identify the best settings for the runtime that runs within, like the JVM. So that's one part. The other part is that when doing these kinds of optimizations, we find people struggling because, due to the way Kubernetes memory management and CPU management work, you can have lots of side effects if you don't properly size the configuration. For example, another problem people typically run into is CPU throttling: you get slowed down by Kubernetes' CPU management mechanism if you're not properly setting your CPU limits. The lesson we learned is that the only way to be safe on both sides, avoiding reliability issues and ensuring performance while reducing costs, is to drive the optimization based on the actual application response time. That is key, because there are lots of solutions that only look at infrastructure metrics, like: I'm using 10% of my CPU, so let's bring down the CPU allocation because it's underused. But then you run the risk of introducing this kind of severe throttling, which can hurt response time, and we find people have a hard time managing that. So the way we tackle this problem is that we actually look at the workload, we gather metrics from observability tools, like application response time, throughput, and traffic patterns, to recommend configurations that reduce cost while minimizing the risk of introducing slowdowns or limitations.
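[Editor's note: the CPU throttling Stefano mentions is visible in the standard cAdvisor metrics that Prometheus scrapes on most Kubernetes setups. A rough sketch of a query for the fraction of CFS periods in which a container was throttled; the metric names are standard cAdvisor ones, but label names can vary by setup.]

```promql
# Fraction of CPU CFS periods in which the container was throttled over
# the last 5 minutes; values near 1 mean the CPU limit is too tight.
sum(rate(container_cpu_cfs_throttled_periods_total{container="app"}[5m]))
/
sum(rate(container_cpu_cfs_periods_total{container="app"}[5m]))
```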

[00:24:23] Joe Colantonio That's a good point. And observability has been a big trend the past few years. And it sounds like something you're highly integrated with as well.

[00:24:29] Stefano Doni Yeah, exactly. Because, again, many solutions try to solve the problem at the infrastructure level, so they only need monitoring or observability tools that look at the infrastructure. But we need lots of metrics from the whole stack in order to drive smarter recommendations. And again, our approach is not to ship our own agent but to rely on integrations. So we have native integrations, for example, with Prometheus, which is, I guess, the most common observability stack that we find for Kubernetes setups, but also Dynatrace, because it's one of the most common commercial tools that we find on the market.

[00:25:11] Joe Colantonio Absolutely. It sounds cool, but I'm sure people would love to see a demo, see it in action. You did mention, I think in the pre-show, something about KubeCon. So what is KubeCon, if people want to see this in action? Maybe give us a little more information. I believe you're going to be there.

[00:25:26] Stefano Doni Yeah, exactly. That's a great point, Joe. We will have a booth there. KubeCon is basically the leading CNCF, Cloud Native Computing Foundation, conference around the Kubernetes ecosystem. I think it's one of the biggest events we have today in IT. The North America edition will actually be in two weeks' time in Detroit, so I'm very excited to be there. We'll have a booth there with our team, so come by the conference if you can, and we'll be running a live demo. If you want to see the live product in action, that would be awesome, and it would be great to chat with the attendees too.

[00:26:10] Joe Colantonio And I believe Scott Moore is going to be there. I think everyone that listens to this show knows Scott Moore; he's always a hoot. He might be in his office. You should definitely check that out.

[00:26:17] Stefano Doni Yeah, Scott is also coming with us to KubeCon, and we'll have demos and be talking to customers. So I guess we're going to have a lot of fun.

[00:26:30] Joe Colantonio Awesome. Okay, Stefano, before we go, is there one piece of actionable advice you can give to someone to help them with their performance, SRE, or Kubernetes testing efforts? And what's the best way to find you, contact you, and learn more about Akamas?

[00:26:44] Stefano Doni All right. We have a website, akamas.io, where we've published several resources, most of our blog posts, around exactly those topics. We also have white papers that describe in more detail how we work in production, how we optimize Kubernetes microservices with an application-aware approach, so people can download those. And feel free to get in touch, especially practitioners, people who are having these kinds of issues. You can also find me on Twitter and on LinkedIn.

[00:27:16] Thanks again for your performance testing awesomeness. For links to everything we covered in this episode, head on over to testguild.com/p98, and while you're there, make sure to click on the "try them both today" link under the exclusive sponsor's section to learn all about SmartBear's two awesome performance test tool solutions, LoadNinja and LoadUI Pro. And if the show has helped you in any way, why not rate and review it in iTunes? Reviews really do matter in the rankings of the show, and I read each and every one of them. So that's it for this episode of the Test Guild Performance and Site Reliability Podcast. I'm Joe. My mission is to help you succeed in creating end-to-end, full-stack performance testing awesomeness. As always, test everything and keep the good. Cheers.

[00:28:02] Hey, thanks again for listening. If you're not already part of our awesome community of 27,000 of the smartest testers, DevOps, and automation professionals in the world, we'd love to have you join the fam at testguild.com. And if you're in the DevOps, automation, or software testing space, or you're a test tool provider and want to offer real-world value that can improve the skills of or solve a problem for the Guild community, I'd love to hear from you. Head on over to testguild.info and let's make it happen.
