About this DevOps Toolchain Episode:
In this episode, we explore the critical topic of cloud visibility for QA and DevOps with special guest Greg Arnette, co-founder and CTO of CloudTruth. Discover the impact of cloud visibility on cost, security, and reliability, shedding light on how it can shape the release process and influence the success of development teams. Greg presents insights into DevOps teams' challenges and highlights the importance of a robust cloud visibility strategy. He also introduces CloudTruth as a solution to address the pain points associated with managing configuration and secret sprawl. Listen in to understand how cloud visibility affects various layers of cloud infrastructure and its significance in driving cost-effective, reliable cloud operations. Greg also shares the seven key principles that can improve release velocity and reduce downtime, offering valuable advice for teams looking to enhance their DevOps efforts.
About Greg Arnette
Greg is the co-founder & CPO of CloudTruth. Prior, he was the founder & CTO of three cloud / SaaS companies in the email and data protection market. Greg has been creating cloud SaaS solutions for over a decade and is the author of the patent “Method and system for arbitraging computer resources in a cloud computing environment.”
Connect with Greg Arnette
- Company: www.cloudtruth
- Blog: www.cloudtruth.com/blog
- LinkedIn: www.gregarnette
- YouTube: www.CloudTruthVideos
- Git: www.cloudtruth
Rate and Review TestGuild DevOps Toolchain Podcast
Thanks again for listening to the show. If it has helped you in any way, shape or form, please share it using the social media buttons you see on the page. Additionally, reviews for the podcast on iTunes are extremely helpful and greatly appreciated! They do matter in the rankings of the show and I read each and every one of them.
[00:00:01] Get ready to discover some of the most actionable DevOps techniques and tooling, including performance and reliability, from some of the world's smartest engineers. Hey, I'm Joe Colantonio, host of the DevOps Toolchain Podcast, and my goal is to help you create DevOps toolchain awesomeness.
[00:00:19] Joe Colantonio Let me ask you: who manages the configurations for all your various test environments and other environments? When your teams don't have a good strategy, the resulting misconfiguration can really slow down the release process. I've seen this cause a lot of things like downtime, and a lot of teams getting frustrated. So what do you do? That's what this episode is all about. Hey, I'm Joe, and today we'll be talking with Greg all about cloud visibility for QA and DevOps and a whole bunch more. If you don't know, Greg is the co-founder and CTO of CloudTruth, and prior to that he was the founder and CTO of three cloud SaaS companies in the email and data protection market. Greg has created cloud solutions for over a decade, and he is the author of the patent "Method and system for arbitraging computer resources in a cloud computing environment," so he really knows his stuff. You don't want to miss this. Check it out.
[00:01:07] Joe Colantonio Hey, Greg, welcome to the Guild.
[00:01:10] Greg Arnette Thanks, Joe. Great to be here. Hey, everyone.
[00:01:13] Joe Colantonio Awesome to have you. I guess before we get into it, we've mentioned in the preshow something about cloud visibility, why it's important. Maybe you can give a little insight into why you think cloud visibility is so important nowadays.
[00:01:25] Greg Arnette Yeah, this is an important topic that all teams are facing right now. It impacts several really defining layers. Overall cloud visibility starts with the number of assets that you've provisioned from your cloud provider, which in turn becomes your cost basis for using the cloud and your surface area for security and reliability. And behind all these assets are configuration parameters, variables, and secrets to make sure they're all working properly across development, staging, and production environments. So visibility into all these moving parts is a requirement. It's a challenge. It's where the DevOps tooling market is headed. And it's a combination of tools created internally by DevOps and platform engineers and what third-party vendors are bringing in the form of innovation into the space.
[00:02:15] Joe Colantonio Nice. Can you talk a little bit more about the cost component of visibility and why it's important? I've seen some articles recently where they say companies have moved to the cloud and then, because of the cost, moved back in-house again. Is that due to maybe not having visibility into the cost?
[00:02:29] Greg Arnette It's a combination. Roughly between 2007 and 2011 or 2012, there was a huge initial rush into the cloud, and cloud users tried to replicate what they were doing in their own on-prem data centers, cookie-cutter style: take what an architecture looked like in, say, their VMware system or Dell-based or Equinix-based system, and replicate that in an AWS or Azure environment. That's the worst-case scenario. You're going to have spiraling cloud costs because you build out systems differently if they're cloud-native versus on-prem. Starting around 2012 and up to where we are now, it's all about cloud-native, born in the cloud. Nothing is being designed from the get-go for an on-premise system and then moved to the cloud. It's getting designed cloud-native. So we need to be thinking differently about cost-aware architectures in the cloud so that you're getting the benefit of why you went to the cloud in the first place, which is that you don't want to pay for resources that you're not using. You don't want a bunch of pre-provisioned compute standing idle, ready to take on a job, if there's no job to be had at midnight, and you also want to be able to burst up really quickly should you experience a very positive, sudden surge of usage: a lot more customers signing up, or mobile users taking more pictures, or whatever kind of app or system you're developing. The cloud brings together on-demand customer architectures, but if you don't pay attention to the signals, you're going to end up paying more money for your cloud bill than if you'd just stuck with an on-prem solution. The cloud allows you to be very flexible and experimental, but you have to be really aware that almost every line of code you're writing can be tied to a cost: how much is that piece of code going to cost you to run in the cloud these days? So that kind of cost-aware engineering is something new.
[00:04:16] Joe Colantonio Absolutely. Who's responsible for this? Is it the DevOps team? Is it QA? Is it the developers? Is that one of the things that makes it so people maybe aren't paying attention, because no one knows who's responsible for the cloud visibility piece?
[00:04:32] Greg Arnette Yeah, it's definitely a shared responsibility. Increasingly, there's been this shift in the industry towards a new discipline called FinOps. Everything today is an ops: it's DevOps, it's GitOps, it's DevSecOps. FinOps is the latest incarnation of how teams are deciding to manage their expense portfolio in the cloud, and it's a new discipline. It's carving together a slice of DevOps and a slice of the CFO's group. More and more companies, as they get larger, are putting cloud cost analysis into the hands of the CFO or a business operations leader who really cares about tying cloud expense to what it costs you to run a customer, a tenant in your system, and also how it affects the bottom line. This new hybrid approach is taking effect right now.
[00:05:22] Joe Colantonio Absolutely. I guess another thing I see people sometimes get hung up on is they don't know where to go to get all the information. They have all these platforms and all these dashboards. How do you address that for visibility to make sure maybe your CFO is seeing what he needs to see and be alerted when he needs to be alerted?
[00:05:39] Greg Arnette Yeah. So working backward: the CFO needs to be the recipient of a synthesized analysis, a business-friendly, jargon-free understanding of how cloud costs are being managed. There is typically a complicated process behind the scenes that's an aggregator of aggregators. You might have a bunch of different dashboards that are helping the Ops team run the system more efficiently, and then you have dashboards that help the revenue side of the organization understand how customers are growing, and marrying those together has been the challenge. So you start to see tools like the Datadogs and the aggregators of aggregators being able to synthesize different streams of data into a format that can be consumed by non-technical and technical people alike.
[00:06:24] Joe Colantonio Awesome. So besides CFOs, I think you mentioned it's important for QA folks to be aware of, or have visibility into, the cloud. I'm just curious to know your reasoning for that. Why do you think that'd be important for someone who maybe has a QA or testing role?
[00:06:38] Greg Arnette Yeah, that kind of takes the conversation from looking at costs to configuration parameters, variables, and secrets, which are at the heart of doing everything well in the cloud. Every asset you have in the cloud, every piece of software that you're writing, every third-party service that you're using has a configuration behind it. And there has to be a configuration for the development environment, the staging and test environment, as well as the production environments. The application developers are given tools, typically called platform engineering tools, that make their lives easier. So an application developer who's writing some new features or creating a net new app has to specify the configurations for that new feature and has to be able to communicate them successfully to the test and QA stage as part of the SDLC process, and then to the Ops team that's going to operationalize that code. It's that middle step that we see as being underserved right now. Testing and QA engineers are borrowing from application developer tools and borrowing from DevOps tools to get the information flow that they need to do their job well. And so QA and test engineers are managing a hybrid effect. They're taking software that was developed in a dev environment by an application developer. They're trying to simulate, as best as possible, the production environment that the software will then be living in, because they want to test it as close as they can to a real-world production scenario to get an accurate result. And yet they can't actually simulate a full production environment because that would be too expensive. Production environments have a lot of provisioned resources for redundancy, reliability, and security posture, and it's not cost-effective to keep a mirror image of production running all the time for test.
So QA and test engineers need to be very adept at, call it, spinning up environments that look like production, doing their testing, turning them off as quickly as possible to save money, and then spinning them up again to do their next tests. It's a delicate balance between an affordable environment that works for everybody and adequate test coverage for how that piece of software will work in a production environment, because you don't want to get surprised when your test coverage in staging didn't find the bug that occurred in production because production was configured slightly differently than staging, and that bug happened to exercise that difference.
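The staging-versus-production difference Greg describes can be checked mechanically before a release. A minimal Python sketch, assuming each environment's settings are available as a flat dictionary; the function name `config_drift` and every key and value here are hypothetical, chosen only for illustration:

```python
def config_drift(staging: dict, production: dict) -> dict:
    """Return keys whose values differ, or that exist in only one environment."""
    missing = object()  # sentinel so a real None value is not confused with "absent"
    drift = {}
    for key in sorted(set(staging) | set(production)):
        s = staging.get(key, missing)
        p = production.get(key, missing)
        if s != p:
            drift[key] = (
                "<absent>" if s is missing else s,
                "<absent>" if p is missing else p,
            )
    return drift


# Hypothetical settings, for illustration only.
staging = {"db_pool_size": 10, "log_level": "debug", "node_count": 2}
production = {
    "db_pool_size": 50,
    "log_level": "info",
    "node_count": 12,
    "tls_min_version": "1.2",
}

for key, (s, p) in config_drift(staging, production).items():
    print(f"{key}: staging={s!r} production={p!r}")
```

Some differences (node counts, pool sizes) are intentional cost savings; the point of a report like this is that the remaining differences are known rather than discovered in production.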
[00:08:51] Joe Colantonio That's interesting because it's a container and you have config files and probably YAML files that spin these things up. If a parameter is slightly different, like you said, if staging is not exactly the same, you may get different behavior. I could especially see that being an issue with performance.
[00:09:06] Greg Arnette For sure. Performance. That's where we see this a lot. Right-sizing the number of nodes in a cluster, for example, to handle the expected traffic load, or right-sizing the number of cores and the memory in the database to handle the number of simultaneous connections that production sees and that the staging environment maybe doesn't see unless you can do an artificial load assessment of burstable traffic activity. That's where QA and test groups need their own set of tooling that works alongside the existing DevOps tool stack. Often it could be the same tool, just used for a different set of use cases, but they need an appropriate, accurate way of doing it.
[00:09:45] Joe Colantonio Awesome. So you also mentioned security. So secrets obviously, if the secret gets into GitHub or something that's a problem. So are there other security issues someone needs to be aware of that they may not have visibility into right now with their cloud environments?
[00:09:58] Greg Arnette Yeah. Certainly. Leaked-secrets remediation is important. Say an application developer incorrectly puts a secret in a git repo, and maybe it's even an open repo: that forces a fire drill of rotating that secret because you just don't know who saw it or who got a copy of it. Also, leaking into log files is really a no-no these days. That's an area where both the application developers and QA and test need to be vigilant not to log a secret, or to redact it if it does get logged, or to have a mask that puts all asterisks in its place. It's about being more thoughtful. There's also a looming change in the industry that's being spearheaded by Google, which is going to change the default lifespan of TLS and SSL certs. Currently it's a little over a year, and they're moving to 30-day or 90-day lifespans, which will then force all the consumers of certs, which is everybody, to have an easy way of changing those certs, which affect the stability and security posture of a whole stack, reliably and more quickly than they've been accustomed to in the past.
[00:11:05] Joe Colantonio That's interesting, especially if you have automated testing and expect the browser to be in a certain state, and you have to update hundreds or thousands of virtual machines. This is interesting. Yes.
[00:11:16] Greg Arnette Right. So it's going to put more pressure on teams, especially in the QA and test realm, to have the facility to duplicate that, because you have to make sure the code works when your certs are being rotated every 30 days or 90 days versus once a year. Once a year, it's more ceremonious, but if it's happening quarterly, that's a lot of moving parts that have to be orchestrated carefully. It's the same as if you're going to change your database password. You have to make sure that the old password gets cycled off, the new password comes online, and all the clients that are going to talk to that database know the new password. There are a bunch of patterns, like blue-green deployment patterns and so forth, which talk about this kind of concept in the abstract. But actually orchestrating the delicate balance of changing the server and changing the client in the right order is another matter. You could have a period of time where clients can't talk to the database because they don't know the current password anymore. They weren't told there's a new one, versus the old one they have in their current config settings.
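The masking idea Greg mentions (asterisks in place of a logged secret) can be sketched with Python's standard `logging` module. This is a minimal illustration, not a complete defense: the regular expression and the list of sensitive key names are assumptions, and a real system would need a broader set of patterns.

```python
import logging
import re

# Assumed pattern: key=value pairs whose key looks sensitive.
SECRET_PATTERN = re.compile(r"(?i)\b(password|token|api_key|secret)=(\S+)")


def mask_secrets(text: str) -> str:
    """Replace the value of any sensitive-looking key with asterisks."""
    return SECRET_PATTERN.sub(lambda m: f"{m.group(1)}=********", text)


class RedactingFilter(logging.Filter):
    """Logging filter that masks secrets in the fully formatted message."""

    def filter(self, record: logging.LogRecord) -> bool:
        record.msg = mask_secrets(record.getMessage())
        record.args = ()  # the message is already formatted, so drop the args
        return True  # keep the record, now redacted


logger = logging.getLogger("deploy")
handler = logging.StreamHandler()
handler.addFilter(RedactingFilter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)

# The secret arrives as a format argument and is masked before it is written.
logger.info("connecting with password=%s to db", "hunter2")
```

Attaching the filter to the handler means every record written through that handler is redacted, regardless of which module produced it.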
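One common way to avoid the window where clients are locked out is to let the server accept both the old and the new credential for a short overlap. The Python sketch below is a toy model of that ordering, not something described in the episode: the `Database` and `Client` classes and the `rotate_password` function are hypothetical stand-ins for a real database and its consumers.

```python
class Database:
    """Toy server that can accept more than one valid password during a rotation."""

    def __init__(self, password: str):
        self.valid_passwords = {password}

    def connect(self, password: str) -> bool:
        return password in self.valid_passwords


class Client:
    def __init__(self, name: str, password: str):
        self.name = name
        self.password = password


def rotate_password(db: Database, clients: list, old: str, new: str) -> None:
    # Step 1: the server starts accepting the new password alongside the old one.
    db.valid_passwords.add(new)
    # Step 2: move clients over one by one; each can connect before and after.
    for client in clients:
        client.password = new
        if not db.connect(client.password):
            raise RuntimeError(f"{client.name} cannot connect with the new password")
    # Step 3: only after every client is updated is the old password retired.
    db.valid_passwords.discard(old)


db = Database("old-pw")
clients = [Client("api", "old-pw"), Client("worker", "old-pw")]
rotate_password(db, clients, old="old-pw", new="new-pw")
print(all(db.connect(c.password) for c in clients), db.connect("old-pw"))
```

Reversing steps 1 and 2, or running step 3 before every client is updated, reproduces the outage Greg describes.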
[00:12:14] Joe Colantonio All right. I think we covered a lot of pain points teams are probably facing. Obviously, you created a solution, CloudTruth. Maybe you could talk a little bit about what CloudTruth is and how it helps with some of these pain points we talked about.
[00:12:25] Greg Arnette Sure. Yeah. Thanks for asking about that. The genesis of CloudTruth is one of these examples of a solution that was designed to solve a pain point that my co-founder and I had in our prior jobs as SaaS and CTO leaders. We kept seeing that our DevOps teams were building these internal tools to manage what we call today config sprawl and secret sprawl, and it's a way of putting the horse back in the barn, so to speak, around the fact that many teams start off with a very simple approach to managing their config and secrets. They typically stuff them in JSON and YAML files in git repos and use a secret store, either one provided by the cloud vendor or something like HashiCorp Vault from open source. And as their technology stacks scale and their business thrives, they start to incur, very incrementally, this config sprawl and secret sprawl, to the point where it reaches a tipping point and is now seen as an impediment to fast release cycles. It's too many paper cuts in the DevOps release process. And that's what we're here to solve. So CloudTruth is a config ops platform that's designed to work alongside GitOps, Kubernetes, and infrastructure as code. It doesn't replace any tools that you're already using right now, except maybe a homegrown version of something that we do. It's net new functionality that plugs in alongside your continuous deployment process, becomes a single aggregated record of configuration and secret truth, and makes it easy to inject config everywhere it's needed. So if you're going to use Terraform to provision your cloud, we become the system that manages all your Terraform variables across multiple environments and projects. If you're using Kubernetes, we're the system that manages all your ConfigMaps, Secrets, and Helm charts from a central authority, giving you role-based access control, change management, and the ability to compare settings across multiple environments.
It brings together a bunch of wish-list items that DevOps teams have been craving for the past five years, which we've decided to launch as a separate product of its own.
[00:14:29] Joe Colantonio I'm trying to visualize this. Help me out. So you have all these systems. Does it sit on top of them? Say I'm a QA tester: I log into CloudTruth, and CloudTruth then has all the different areas where I could set configuration once, and it would go out to all the systems and update them?
[00:14:45] Greg Arnette Yeah. So CloudTruth is a robust platform. It's available as SaaS or self-hosted and provides an API, a CLI, and a web app, and then many configuration clients that can be used, like a Terraform provider, a Kubernetes operator, or a GitHub Actions plug-in. Wherever you need to consume config, CloudTruth can be the system that injects it. And that's the idea of the perfect config for every deploy, so that you're not focused on chasing down configuration faults as you're trying to get new releases out the door. There's a web application that becomes your visual configuration designer. If you're familiar with tools like Figma, which let you rapidly develop new user interfaces, we're like a Figma for config. You can design your config in a graphical way using our toolset. So you have a hierarchy. You have your dev environment, your staging environment, and your production environment. You can manage that hierarchy. You can set default values, and then those values can be inherited further down in the tree. You can have overrides. You can manage ephemeral dev environments that are constantly being launched and torn down. All the configuration settings behind all that become managed by our system. And then in terms of application config and microservices config, we're the system that will manage all the configuration for the software that your teams are writing, or your third-party services, or your platform services, or your cloud pieces. All of that has config as a requirement behind the scenes. So we become that central management layer. That's what we call a configuration command center, with a very robust dashboard and visual layer, and then a CLI that allows you to take the concepts of the platform and inject them into any kind of workflow that you have, through a command-line interface, an IDE plug-in, or one of the clients or SDKs that we provide.
[00:16:33] Joe Colantonio All right. I know this is probably overlooked a lot by teams. It almost sounds like this would help with communication as well between the different teams when they're creating software, to say, hey, here's the config we have, is this correct? And they can go to one place and say, no, that's not correct, this is going to cause an issue. So it's almost shifting left before you get to an issue in production. Am I understanding that correctly as a benefit as well?
[00:16:55] Greg Arnette Yes. You're spot on with this. A big source of release friction right now is this handoff from application developers to QA testers to operations on what the right config settings are for every module or component in your system. And they're going to be different across all these different environments. And then if you layer in a third dimension, such as tenants and customer configs, where you might have a certain customer in your production environment that's entitled to a certain set of functionality, that's another layer of abstraction that we help handle as well. It really improves the communications channel, or rather you don't need a separate communications channel anymore: this becomes the communications channel. So it all gets correctly configured at every phase in the deployment cycle.
[00:17:43] Joe Colantonio So how much effort is required on the tester's or the developer's part? Does it bubble up insights, or is it all just the team deciding what's right and what's wrong with the configs? Is there any type of analysis it does on its own to bubble up, hey, this config file looks a little funky based on our best practices, or anything like that?
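The hierarchy Greg describes, with defaults inherited down the tree and overridden where needed, can be modeled in a few lines. This is a generic Python sketch of environment inheritance, not CloudTruth's data model or API; the environment names, the `parent` field, and all values are hypothetical.

```python
# Hypothetical environment tree: each environment names its parent and
# stores only the values it overrides.
ENVIRONMENTS = {
    "default": {"parent": None, "values": {"log_level": "info", "node_count": 2, "region": "us-east-1"}},
    "development": {"parent": "default", "values": {"log_level": "debug", "node_count": 1}},
    "staging": {"parent": "default", "values": {"node_count": 4}},
    "production": {"parent": "default", "values": {"node_count": 12}},
    # An ephemeral environment that inherits from development.
    "dev-feature-123": {"parent": "development", "values": {"region": "us-west-2"}},
}


def resolve(env_name: str, environments: dict = ENVIRONMENTS) -> dict:
    """Merge values from the root of the tree down to env_name; children win."""
    chain = []
    current = env_name
    while current is not None:
        env = environments[current]
        chain.append(env["values"])
        current = env["parent"]
    resolved = {}
    for values in reversed(chain):  # apply root first, most specific last
        resolved.update(values)
    return resolved


print(resolve("dev-feature-123"))
print(resolve("production"))
```

Because an ephemeral environment stores only its overrides, launching or tearing one down touches a single small entry rather than a full copy of the config.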
[00:18:02] Greg Arnette Yeah, you can implement your best practices as a set of rules in the system. We call those guardrails. Every parameter value or variable can have a guardrail, such as what type it should be: an integer, a string, or a boolean. And then further, are there any constraints on what the value can be? For example, you might have a parameter that's going to set the number of nodes in a cluster. There's a cluster for production, a cluster for staging, and a cluster for development. In development, it might be a single-node cluster or two nodes, because you just need something minimal to program against. In staging you want that to look a lot like production, but without incurring the expense. That's a very discrete parameter. We can manage how easily it flows between the environments, and we can give it guardrails: developer environments can't be more than two nodes, staging can be between 2 and 10 nodes, and production can be between 2 and 20 nodes, for example. So you can't color outside the lines. You can implement your business logic and your technical logic as a set of rules and guardrails inside the platform. And we'll soon be supporting OPA, Open Policy Agent, rules as well, which bring in a wealth of external learnings that we can apply. It becomes an expert system. You can set appropriate default values, ranges, and very sophisticated rules on how you want to manage all this at scale.
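Greg's node-count example translates directly into a type check plus a range check per environment. The Python sketch below is illustrative only and does not reflect how CloudTruth implements guardrails: the `Guardrail` class is hypothetical, and the development minimum of 1 is an assumption (the episode only gives the maximum of two).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Guardrail:
    """A type constraint plus an inclusive numeric range for one parameter."""

    value_type: type
    minimum: int
    maximum: int

    def check(self, value) -> list:
        """Return a list of violations; an empty list means the value is allowed."""
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, self.value_type):
            return [f"expected {self.value_type.__name__}, got {type(value).__name__}"]
        if not self.minimum <= value <= self.maximum:
            return [f"{value} is outside the allowed range {self.minimum}-{self.maximum}"]
        return []


# Ranges from the episode; the development minimum of 1 is an assumption.
NODE_COUNT_RULES = {
    "development": Guardrail(int, 1, 2),
    "staging": Guardrail(int, 2, 10),
    "production": Guardrail(int, 2, 20),
}

for env, value in [("development", 3), ("staging", 4), ("production", "20")]:
    problems = NODE_COUNT_RULES[env].check(value)
    print(env, value, "OK" if not problems else problems)
```

Running the check at the moment a value is set, rather than at deploy time, is what turns the rule into a guardrail instead of a post-mortem finding.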
[00:19:19] Joe Colantonio Nice, how easy is it to integrate or implement in the current system? Like you said, as long as it's not like a homegrown system, is it like an API that people would just plug into?
[00:19:29] Greg Arnette There are a number of integration points, and that was designed in from the get-go, realizing that CloudTruth will be installed in existing live environments. It's very rare that you get a total greenfield experience these days to start up something new. We know, coming into the market, that there are existing workflows that are typically built around Bash or Python or Go code and scripts that link together different processes. We can interface at the right point, at the command-line layer, if you're chaining together a bunch of scripts, to make the integration as easy as possible. Typically we see customers adopting this project by project. To do it wholesale would be, call it, a boil-the-ocean exercise. It's just not practical these days. We sit alongside the tools that you might be using for CI/CD, for infrastructure as code, for container orchestration, and for the dev and test stack. So we don't replace anything. We provide automatic importers. We can scan your git repos. We can import all the config. We can integrate with services like AWS Secrets Manager and Parameter Store, Azure Key Vault, and HashiCorp Vault. We can reference all that and make it immediately useful. So yeah, project by project, you can retrofit the config strategy in a way that makes sense for how your team is designed to work.
[00:20:54] Joe Colantonio Nice. Following up, as an engineer it's always hard to convince management you need a certain tool. I think I was reading an article that you pointed me to that talks about how you can actually measure this with certain key performance metrics, like lead time and mean time to restore. How does that help? And is that something where you can say, hey, look, if we use this, it will improve our mean time to restore by 10% or whatever?
[00:21:19] Greg Arnette That's a great topic to click into. We're seeing that teams are being asked to be accountable for delivering software and new features for their respective customers, more so than they have in the past. There's a mandate right now to get more stuff done and to be as efficient as possible, given the economic climate that we're in. So these DORA metrics are taking hold. They're pretty simple to understand: What's your deploy frequency? What's your change failure rate? What's your mean time to recovery? If you peel away the layers behind all these important goals that teams are striving to improve on, what you find is that there's a common horizontal layer: how are you managing all the config and secrets for all your projects? And if you have an inadequate strategy, that's going to affect your core DORA metrics. We have testimonials from our customers that tell us they have gone from release cycles that would take months to get new features out in front of their customers down to weeks. And they attribute that to the adoption of our philosophy on how to manage config and secrets at scale, which we've codified as a kind of manifesto called the Seven-Factor Config principles. The idea is that there are seven important factors that teams can adopt. They can either use CloudTruth or build their own system. But if you adopt these seven core factors, you are guaranteed to see an increase in your release velocity, a decrease in downtime, and a decrease in security issues.
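The three metrics Greg names can be computed from a simple deployment log, which makes a before-and-after comparison concrete when arguing for a tool. A minimal Python sketch under stated assumptions: the record shape and the sample data are invented, and time to restore is measured from the deployment timestamp, which is a simplification of how incident start is normally tracked.

```python
from datetime import datetime, timedelta


def dora_summary(deployments: list, period_days: int) -> dict:
    """Compute deploy frequency, change failure rate, and mean time to restore."""
    total = len(deployments)
    failures = [d for d in deployments if d["failed"]]
    restore_times = [d["restored_at"] - d["at"] for d in failures if d["restored_at"]]
    mttr = sum(restore_times, timedelta()) / len(restore_times) if restore_times else None
    return {
        "deploys_per_week": total / (period_days / 7),
        "change_failure_rate": len(failures) / total if total else 0.0,
        "mean_time_to_restore": mttr,
    }


# Invented sample data: four deployments in 28 days, one of which failed
# and was restored 90 minutes later.
deployments = [
    {"at": datetime(2024, 1, 2, 10, 0), "failed": False, "restored_at": None},
    {"at": datetime(2024, 1, 9, 10, 0), "failed": True, "restored_at": datetime(2024, 1, 9, 11, 30)},
    {"at": datetime(2024, 1, 16, 10, 0), "failed": False, "restored_at": None},
    {"at": datetime(2024, 1, 23, 10, 0), "failed": False, "restored_at": None},
]

print(dora_summary(deployments, period_days=28))
```

Tagging each failed deployment with its cause (for example, a misconfiguration) would let a team measure how much of its change failure rate traces back to config and secrets.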
[00:22:47] Joe Colantonio All right. I do not mean to put you on the spot, I don't have them written down. What are the seven factors really quick?
[00:22:51] Greg Arnette Let's see if I can remember them. I might have to look at a quick cheat sheet here, but they start off with the idea that you want to keep your config DRY. That's the first principle. Keep your config DRY. Then decouple and externalize your config from source, because we see that's one of the bad practices teams are trying to work away from: nothing should be hard-coded these days. And you want to abstract your config, like the idea of "what's the log level for this service" versus "how do I fetch the value for the log level." The fourth principle is centralize: you want to centralize as much of this as possible. Centralized doesn't mean that you have to put all your eggs in one basket, but you want centralized access. You want a single API call to fetch the config for a particular component, and the config data for that component could be spread across a couple of different interfaces. But you don't need to care about that as a consumer. You just want to be able to make one simple API call or CLI call. The fifth principle is keeping your config obvious and well understood. Right now, many teams are suffering from the fact that their config and secret strategy is a black box to almost everyone on the team, except for a handful of people in the ops group who came up with the initial structure. And they've been kicking the can down the road rather than evolving it more methodically. The sixth principle is keeping your config secure. That's a no-brainer for secrets. Your secrets need to be encrypted. They're the crown jewels: access to your database, your file systems, and all the important interfaces that are managing sensitive data. So have a very thoughtful strategy on how to keep your secrets secure. But even your non-secret config data should be kept secure.
Just keeping it in a plain JSON or YAML file could be a way for a hacker to figure out what you're doing and give them the attack vector that they need. So even non-secret data should be kept secure. And then finally, the seventh principle is versioning. You need to be keeping track of the changes to these values, more so than just relying upon change tracking in your git repos. So, a dedicated system that starts to look like a configuration management database, or CMDB, which is a term from old-school IT, where CMDBs were popular for a while. There really hasn't been a modern, cloud-based reinvention of the CMDB yet. And we're starting to see whether what we're working on might be going down that path in the future.
[00:25:20] Joe Colantonio Greg, could you just tell us a little bit more about the free plan? What does it include? What can people get with that? Is that a good gateway to get them to know, okay, this is something we definitely need, let us buy the more advanced paid option?
[00:25:31] Greg Arnette Yeah. The free plan gives you access to all the important capabilities that we provide in the platform. How we separate it from the paid plans is whether you want to add a lot more team members, a lot more projects, or a lot more environments. We don't have an open-source version, so our free forever plan is designed to be our equivalent, the way that hobbyists can get started. We also have special pricing for nonprofits and early-stage startups so they can get access to the platform in a budget-friendly way. So you can get a taste of everything in the free edition. It's when you want to scale, or you want more of an enterprise support system, or you want to run the software in your own environment, that we have price plans for those.
[00:26:13] Joe Colantonio Okay, Greg, before we go, is there one piece of actionable advice you can give to someone to help them with their DevOps cloud efforts? And what's the best way to find or contact you, or learn more about CloudTruth?
[00:26:23] Greg Arnette The one piece of advice I would share, from what we've seen after talking to thousands of DevOps leaders over the past few years as we've launched this project, is that there's a notion out there that a config and secret strategy is relatively simple. It's just flat files and repos, you marry in a secret store, and that's all you have to think about. But the reality is that this simple notion is a facade masking something that's really complex, that touches multiple team members, and that's seen as a kind of tax on release velocity. I would encourage everyone to take a step back, really think about how you're managing your secrets and your config at scale, and take a hard look at all the different scripts in your DevOps tool stack that no one really wants to talk about anymore, which become the weak links in a reliable production deployment system that should just operate smoothly. And to get more information about CloudTruth, go to Cloudtruth.com. There's a bunch of videos and documentation. There's a free forever edition and a couple of different price plans based on the type and size of organization you are.
[00:27:31] And for links to everything of value we covered in this DevOps Toolchain show, head on over to TestGuild.com/p136. While you are there, make sure to click on the SmartBear link and learn all about SmartBear's awesome solutions to give you the visibility you need to deliver great software. That's SmartBear.com. That's it for this episode of the DevOps Toolchain show. I'm Joe. My mission is to help you succeed in creating end-to-end full-stack DevOps toolchain awesomeness. As always, test everything and keep the good. Cheers.
[00:27:54] Hey, thanks again for listening. If you're not already part of our awesome community of 27,000 of the smartest testers, DevOps, and automation professionals in the world, we'd love to have you join the FAM at Testguild.com. And if you're in the DevOps, automation, or software testing space, or you're a test tool provider and want to offer real-world value that can improve the skills of, or solve a problem for, the Guild community, I'd love to hear from you. Head on over to testguild.info and let's make it happen.