Mastering OpenTelemetry and Observability with Steve Flanders

11 September 2024, 09:07 PM

By Test Guild


About this DevOps Toolchain Episode:

Today, we've got an exciting episode lined up for you. Steve Flanders, the senior director of engineering at Splunk and one of the founding members of OpenTelemetry, is joining us.

In this episode, we'll explore the power of AI in observability, the intricacies of instrumentation and tracing signals, and the essential role of context and correlation in problem isolation and root cause analysis. Steve will also share insights from his upcoming book, Mastering OpenTelemetry and Observability, which aims to make this complex topic accessible to everyone from beginners to experts.

Listen in to discover valuable knowledge, from real-world applications like the Astronomy Shop demo app to concerns around personally identifiable information (PII) and the balancing act of instrumentation.

Whether you're a DevOps professional, a developer, or simply interested in the future of observability, you won't want to miss this conversation.

About Steve Flanders


Steve Flanders is a Senior Director of Engineering at Splunk, which was acquired by Cisco, where he is responsible for the Splunk Observability Cloud platform team and Splunk’s OpenTelemetry contributions. He is part of the founding team of the OpenCensus and OpenTelemetry projects and has over a decade of experience in the monitoring and observability space. Steve regularly speaks at conferences, including KubeCon, holds an MBA from MIT, and has a book coming out this fall called Mastering OpenTelemetry and Observability: Enhancing Application and Infrastructure Performance and Avoiding Outages.


Rate and Review TestGuild DevOps Toolchain Podcast

Thanks again for listening to the show. If it has helped you in any way, shape or form, please share it using the social media buttons you see on the page. Additionally, reviews for the podcast on iTunes are extremely helpful and greatly appreciated! They do matter in the rankings of the show and I read each and every one of them.

Transcript


[00:00:01] Get ready to discover some of the most actionable DevOps techniques and tooling, including performance and reliability, from some of the world's smartest engineers. Hey, I'm Joe Colantonio, host of the DevOps Toolchain Podcast, and my goal is to help you create DevOps toolchain awesomeness.

[00:00:19] Hey, it's Joe, and welcome to another episode of The Test Guild DevOps Toolchain. Today we'll be talking with Steve Flanders all about his upcoming new book, which may be new at the time you hear this, or it could be old. It all depends. It's Mastering OpenTelemetry and Observability: Enhancing Application and Infrastructure Performance and Avoiding Outages. A really important topic. If you don't know, Steve is a senior director of engineering at Splunk, where he is responsible for the Splunk Observability Cloud platform team and Splunk's OpenTelemetry contributions. He is part of the founding team of the OpenCensus and OpenTelemetry projects, and has over a decade of experience in the monitoring and observability space, so he really knows this stuff. You don't wanna miss this episode. Check it out.

[00:01:05] Hey, if your app is slow, it could be worse than an error. It could be frustrating. And one thing I've learned over my 25 years in industry is that frustrated users don't last long. But since slow performance isn't sudden, it's hard for standard error monitoring tools to catch. That's why I think you should check out BugSnag, an all-in-one observability solution that has a way to automatically watch for these issues: real user monitoring. It checks and reports real user performance data in real time so you can quickly identify lags. Plus, you can get the context of where the lags are and how to fix them. Don't rely on frustrated user feedback. Find out for yourself. Go to bugsnag.com and try for free. No credit card required. Check it out. Let me know what you think.

[00:01:56] Joe Colantonio Hey, Steve. Welcome to the Guild.

[00:01:58] Steve Flanders Hi. Thanks for having me.

[00:02:00] Joe Colantonio Awesome to have you. I guess before we get into it, I always like to ask authors of books. Why? Why are you writing a book? It takes so much time. Obviously, you're very busy. You're at a high level position at Splunk. So what's the deal?

[00:02:11] Steve Flanders Yeah, I've kind of always wanted to do it. It's been on my bucket list for a long time, and I feel like, at least for me personally, you have to really know the topic and you have to really be willing to put in the time to actually write the book. And I really felt like I had a fair amount of experience in both observability and with the OpenTelemetry project, and I felt like I had something to kind of give back. And personally, I learned a lot from others kind of coming up in the industry, reading blogs from others, going to conference talks, reading books. And so I'm just trying to hopefully give something back that I hope others will also get value from.

[00:02:43] Joe Colantonio I guess the first thing is the audience. I know OpenTelemetry has been a big thing recently. I've had a lot of guests on talking about it. What's your angle on it? Is this book for everyone? How does it flow, is it beginner to expert or expert only? How's that work?

[00:03:00] Steve Flanders Yeah, yeah. I try to take an approach where I hope that it's approachable to the widest audience possible. It doesn't matter whether you don't have a lot of familiarity with either observability or OpenTelemetry, or if you've been using it for years already and you already know it quite well. I'm hoping that there's something for everyone kind of in there. I approached it by trying to compare it to things that are maybe relatable to the real world. For example, telemetry data, metrics, traces, and logs, I compare to symptoms that you might have, like heart rate, blood pressure, height and weight, that a doctor might look at. Trying to relate it to concepts you might better understand if you don't understand the observability concepts. But at the same time, I'll drill down into specifics like how do you configure the OpenTelemetry collector? What does the YAML file look like? How can you use this for advanced use cases like tail-based sampling or filtering or aggregation rules? There's really a mix of content in there trying to cover the topic as broadly as possible.

[00:03:57] Joe Colantonio Nice. How did you become part of the founding team of OpenTelemetry?

[00:04:02] Steve Flanders Yeah, so I was part of a stealth startup called Omnition that was acquired by Splunk. And at Omnition, we were building a distributed tracing back end, which is now the Splunk APM product. And at that time, we really believed that there was kind of a shift going on in the industry where you were going to see basically open source instrumentation and data collection components come to the forefront, because we heard after tons of customer interviews that there was a real pain point around proprietary instrumentation and data collection and the inability to really move between vendors or kind of get the value or support that was necessary, and even the increasing costs from a vendor perspective. Very early on, we actually reached out to Google and found out that Google and Microsoft were actually partnering to build something called OpenCensus. And initially they were focused just on the instrumentation aspects. Basically, what you put in your application to generate the traces, metrics and logs. And while that was interesting to us at our startup, we were actually more interested in the data collection aspects. How can you deploy something like a host-based agent and actually receive this data, either push or pull based, process that data and then export it to one or more different back ends? And so we actually partnered with them. They had that as kind of their phase two initiative. We said, hey, what if we help build the collector aspect of this? That was known as the OpenCensus service. And then as people probably know, OpenCensus and OpenTracing merged to form OpenTelemetry. And the OpenTelemetry collector is actually based on the OpenCensus service. My initial kind of introduction to it was on the data collection aspects. Now, I have a team that's kind of contributing to all aspects of OTEL, whether it be the specification, the instrumentation or the data collection aspects.

[00:05:46] Joe Colantonio Are you surprised by the adoption of the standard because it seems like a lot of companies are embracing it, which I'm not surprised by it. Maybe I am a little. Maybe I'm cynical, but are you surprised by the adoption of it?

[00:05:56] Steve Flanders I don't think that I am, and that's because it's an actual real pain point for both sides of the coin here. End users have a pain point, whether that's a customer or an individual person, in that they only want to really instrument once, but they want to be able to continuously get value out of it. They can't afford to re-instrument to change the vendors that they're using or change the open source projects they're trying to leverage. They really need a single standard. That's one pain point. But on the flip side, vendors actually kind of need the same thing. So prior to a standard like OpenTelemetry, most vendors had to actually hire a team of engineers that would build instrumentation and data collection just for their proprietary backends. And so that's actually really expensive. It takes a lot of work because there's hundreds or thousands of different environments or integrations that you need to collect data from. And so standardizing this and having a single way to actually collect it and unifying those efforts means that now any vendor, whether it's open source or commercial, can really take advantage of this data and then enrich it as they see fit. Given that both sides kind of benefit at the end of the day, I'm actually not that surprised that it's kind of taken off. And I hope that it continues to kind of grow in adoption long term here.

[00:07:07] Joe Colantonio All right. So back to the book. Like I said, I sometimes go off the tracks here. The book I think starts off with the fundamentals of observability. Are there any core principles of observability that you think are really important for people to know about, to help with system reliability?

[00:07:22] Steve Flanders Yeah, I mean, observability is a pretty broad topic and it really means different things to different people. Some people think of observability and they only think of application performance monitoring, or APM. And while that is a part of observability, it's not the only part. APM uses traces, and traces are very powerful because they have context and correlation. But in addition to traces, you have at least metrics and logs. The three pillars of observability, as they're often referred to. But even the three pillars of observability are not the complete picture of observability. What we're seeing is that you really care about what the user experience is. So how is the performance from a browser perspective or a mobile device? How do you collect that data through real user monitoring or synthetic data, as an example? Profiling on the application also matters. Observability is just a broad topic overall, and I try to cover the foundations, like what does OTEL think of observability? What does the CNCF, which OTEL is a project within, think of observability? What does the concept actually mean? What are its origins, where did it come from? What is kind of my definition of observability? So trying to add a personal aspect to it. So yeah, I try to cover kind of broadly what the topic is and what you hope to achieve from it. And then I use it as a launching point for the other chapters in the book to really provide specifics on how hopefully you can achieve it in your environments.

[00:08:44] Joe Colantonio Nice. So that's a good launching point. You go from the opening chapter, then I think you launch into your intro to OpenTelemetry. So folks that listen, I assume they know, but just in case, what is OpenTelemetry and why is it kind of like a, I hate the word, game changer in the observability area, I guess?

[00:09:01] Steve Flanders Yeah, yeah. An easy way to think about OpenTelemetry is it is an open source standard for telemetry data. And telemetry data means things like traces, metrics and logs, but it extends to profiling, real user monitoring, synthetics and the like, or at least it will hopefully in the future here. But in addition to providing basically an open standard, it actually has a reference implementation. So it provides language-based instrumentation, things that you install within your application to generate what OTEL calls signals: these traces, metrics and logs. That is language specific. So you're going to see all popular languages supported: Java, .NET, Go, Python, PHP, Ruby, you name it. There's a lot of broad coverage there. And then it offers a data collection component known as the OTEL collector. And this is used to receive, process, and kind of export that telemetry data. And the collector is kind of powerful because it can actually do things to the data. It can enrich it, it can redact information, it can do additional processing like aggregation or filtering, depending on what your use cases are. But it kind of gives you control of that data. So you can do with it what you want, whether you want to send it to multiple different backends, whether you want to reduce the amount of data that's being sent to a back end, all that's fully within your control at the end of the day. But OTEL is a big project, right? Covering ten-plus programming languages, offering collector components, having a specification for all of this, and having a specification that's actually signal-aware: it has a full definition for traces, a full definition for metrics, and a full definition for logs. And then it has kind of cross-cutting components as well, like the notion of resources, which you can kind of think of as where does this application live in your environment? What host is it running on? What cloud provider? 
Or if you're running on premises, what container orchestration engine are you using? And it actually enriches all of this data. It provides context and correlation across all the different signals. That's really the power. It goes beyond just looking at an individual signal or an individual product line like APM. It looks broadly across observability.
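The receive, process, and export flow described here can be sketched in a few lines of plain Python. This is only an illustration of the idea, not the collector's real implementation or API: the `Record` class, the processor helpers, and the in-memory "backends" are all hypothetical names invented for this sketch.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Record:
    """Simplified telemetry record (hypothetical shape, not the real OTLP model)."""
    signal: str  # "trace", "metric", or "log"
    name: str
    attributes: Dict[str, str] = field(default_factory=dict)


Processor = Callable[[List[Record]], List[Record]]


def enrich_with_resource(resource: Dict[str, str]) -> Processor:
    """Build a processor that tags each record with resource metadata."""
    def process(batch: List[Record]) -> List[Record]:
        for record in batch:
            # Attributes already on the record take precedence over resource defaults.
            record.attributes = {**resource, **record.attributes}
        return batch
    return process


def drop_names_starting_with(prefix: str) -> Processor:
    """Build a processor that filters out records whose name starts with prefix."""
    def process(batch: List[Record]) -> List[Record]:
        return [r for r in batch if not r.name.startswith(prefix)]
    return process


def run_pipeline(batch, processors, exporters):
    """Receive a batch, run each processor in order, then fan out to every exporter."""
    for process in processors:
        batch = process(batch)
    for export in exporters:
        export(batch)
    return batch


# Two in-memory lists stand in for two different backends.
backend_a: List[Record] = []
backend_b: List[Record] = []
received = [Record("trace", "GET /checkout"), Record("log", "healthcheck ping")]
exported = run_pipeline(
    received,
    [enrich_with_resource({"host.name": "web-1", "cloud.provider": "aws"}),
     drop_names_starting_with("healthcheck")],
    [backend_a.extend, backend_b.extend],
)
```

The point of the design is that enrichment, filtering, and fan-out to several backends all happen in one place that the operator controls, rather than inside each application.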

[00:11:02] Joe Colantonio All right, so this may be a really newbie question. When I think of this, I think of production data. But are you able to utilize this while you're developing software, to kind of profile it, like shift left almost, to get this data before it goes to live users?

[00:11:15] Steve Flanders Yeah, yeah. This is actually a great question, right. Observability is common for production environments because you're trying to solve availability and performance problems in really complex environments. Today, you have a lot of microservices and distributed systems. You have a lot of dependencies on third parties and calls to external APIs or databases or what have you. So production, I think, is kind of a no-brainer. But then if you think beyond that, you have actual software development lifecycle type things. You want to make sure that you're not introducing regressions or performance issues. You want to understand the health and behavior of your system before it even makes it to production. And yes, OpenTelemetry, or observability in general, can help with this. Ideally, you should be looking at it through the entire life cycle, not just when you get it to the production front door, and you can actually enrich this with additional information. For example, there's this notion of events. You can think of them as specific types of log records. So an event might be: I trigger a CI/CD pipeline. Well, I want to be able to denote that with the data that I'm collecting, so I can figure out, for example, if something changed between build one and build two. And so that's an important piece of metadata that you're going to want to tag on there. But yes, you should be doing this through every cycle, whether it's development, QA, or if you have a staging environment or performance testing environments. All of these steps are great places to introduce the telemetry data and to start kind of observing your systems and understanding how they behave.
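As a rough illustration of why a deployment event is useful metadata, the sketch below splits latency samples at the time of an event and compares the averages on either side. The event name, its fields, and the sample values are all made up for this example; real OpenTelemetry events and metrics have a richer data model.

```python
from statistics import mean


def latency_around_event(samples, event_time):
    """Split (timestamp, latency_ms) samples at an event and average each side.

    Assumes at least one sample on each side of the event; statistics.mean
    raises StatisticsError on an empty list.
    """
    before = [ms for ts, ms in samples if ts < event_time]
    after = [ms for ts, ms in samples if ts >= event_time]
    return mean(before), mean(after)


# Hypothetical event record marking a CI/CD deployment at t=4.
deploy_event = {"name": "cicd.pipeline.run", "timestamp": 4, "build": "build-2"}
# Hypothetical (timestamp, latency in ms) samples for one service.
samples = [(1, 100), (2, 110), (3, 105), (5, 240), (6, 260)]
before_ms, after_ms = latency_around_event(samples, deploy_event["timestamp"])
```

Without the event there is only a latency jump; with it, the jump lines up with a specific build, which narrows down where to look.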

[00:12:36] Joe Colantonio Nice. Obviously a hot topic in the industry, regardless of what people are doing, is AI. Now with OpenTelemetry, I love it because I have all this data, but now it's kind of hard because you have all this data you have to analyze. So do you see AI being, unlike in other use cases, actually a good fit here, because it was kind of made for going through all this data, parsing it, and bubbling up insights and things like that?

[00:13:00] Steve Flanders Yeah. And I touch on this actually in the last chapter, on kind of the future of observability, because AI is still kind of an emerging technology and trend. I see a couple of short-term benefits and then many long-term ones. In the short term, I'm sure you've heard of generative AI, where you can basically ask an AI system a question and it will come back with an answer. This is actually really powerful because with query languages for backends, whether it's open source like Prometheus or some commercial proprietary back end that you want to get data out of from an observability perspective, it's really hard to learn those query languages and be able to construct the query in the right way. You either need to have a lot of expertise, read a lot of documentation, or do a lot of trial and error. But generative AI is making this way easier. Instead of me needing to understand a PromQL query, I can just ask it a question like, hey, why is my checkout service slow? And it will go construct the queries for me, go generate that data, come back and give me an answer saying, hey, I noticed that you have higher latency here. You just deployed a new build. Maybe it's related to that. At least from a generative AI perspective, there's a lot of benefits. But the future in my mind is, let's take the OTEL collector as an example. It will eventually have all the data in your environment if it becomes a single aggregator. And then you can do some very powerful things with AI, like adaptive filtering or sampling. You could do redaction of information because it identifies PII, personally identifiable information, and actually cleans that up for you. You can do things like send some of this to a data lake, maybe on your own premises, and then send the rest of your data to a data warehouse or observability platform. And you can dynamically change where the data is going. 
I think the real power of AI, from the observability perspective will be what I call edge computing, which is something like an OTEL collector running in your environments between like the edge of your environment and where you're trying to send that data. I think there's a lot of opportunity there, and you're actually starting to see a lot of vendors and startups kind of come up in this space. I'm very interested to see where that goes in the next few years.

[00:15:03] Joe Colantonio Sounds wild. So a little bit more about the OTEL collector then. Once again, I could be off base. I know Google and a lot of these ad platforms are rolling back what you can know about people coming to your website. Does the OTEL collector have parameters in place for that? Do you have to worry about that? Or is this completely different, not necessarily collecting real users' personally identifiable information?

[00:15:24] Steve Flanders Yeah. So if you think about application observability, so back end type systems, that is typically traces, metrics and logs. It may have some PII in there, but it's less likely. Usually a developer would have to explicitly add something in there that would contain it. One example might be you have a server that calls a backend database. Well, if you add the query parameters that you're issuing to that backend database, or you get the results of that query, that could contain some PII. But for back end systems, you're not going to get a lot of that from a telemetry perspective. But you mentioned things like real user monitoring, which is more like browser or mobile devices. That is much more likely to contain PII because I'll have geolocation data. I may know what cookie parameters you have or what headers you're passing, which could have your email address or name or other pieces of information that you entered into a form. So there is definitely some concern, at least from a real user monitoring perspective. Today, if you're generating real user monitoring data, you're probably sending it directly to an observability backend. You're probably not sending it to the OTEL collector. You could. I know of a few use cases where this actually happens today, but in general, because customers or end users are all over the world, you really want to send it essentially to the observability platform. And there they may have features where you can toggle what PII is kept versus not, what's redacted and what have you. If you have an air-gapped environment or an intranet, then you may actually use the OTEL collector for something like RUM data. And there are options, right? They're called processors in the collector. And you can process that data however you see fit, where you could actually redact or remove some of the information. You can hash that data. 
So if you want to have a unique identifier that doesn't actually have the raw string that might be leaking sensitive information, you can do that. You can aggregate this information and maybe generate metrics off of the PII and only send the aggregated data, so you're not leaking PII. There are definitely options depending on your use case. But I would say in the case of RUM, more of these are going to be observability platform specific and not necessarily OTEL specific.
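The redact, remove, and hash options mentioned here can be illustrated with a small standalone sketch. It does not use the collector's actual processors or configuration; the `scrub` function, the attribute keys, and the salt are hypothetical, and the email pattern is deliberately simple rather than a complete PII detector.

```python
import hashlib
import re

# Deliberately simple email pattern for illustration; real PII detection is harder.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def hash_value(value: str, salt: str) -> str:
    """Return a stable, truncated SHA-256 digest so values stay joinable but unreadable."""
    return hashlib.sha256((salt + value).encode("utf-8")).hexdigest()[:16]


def scrub(attributes: dict, hash_keys: set, drop_keys: set, salt: str) -> dict:
    """Drop some keys, hash others, and mask email addresses in everything else.

    Note: all kept values are converted to strings.
    """
    cleaned = {}
    for key, value in attributes.items():
        if key in drop_keys:
            continue  # remove the attribute entirely
        if key in hash_keys:
            cleaned[key] = hash_value(str(value), salt)
        else:
            cleaned[key] = EMAIL_RE.sub("[REDACTED_EMAIL]", str(value))
    return cleaned


# Hypothetical attributes from a browser session.
raw = {
    "user.email": "alice@example.com",
    "http.cookie": "session=abc123",
    "note": "contact bob@example.com now",
    "http.route": "/cart",
}
safe = scrub(raw, hash_keys={"user.email"}, drop_keys={"http.cookie"}, salt="demo-salt")
```

Hashing keeps a unique identifier for counting or joining without carrying the raw string, which is the trade-off described above.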

[00:17:30] Joe Colantonio Gotcha. And there's a whole chapter on the OpenTelemetry collector. And I guess that brings me to OpenTelemetry instrumentation, which is a chapter. Curious to know, are there any common challenges you see with people that try to implement OpenTelemetry with the instrumentation piece?

[00:17:45] Steve Flanders Yeah, I mean, instrumentation in general, I think, is actually very complex. If we take the tracing signal, that's probably the most complex signal today, because there are what's called spans inside of a trace, and in addition to generating a span, you have to pass what's called context. It's actually called context propagation. The idea here is, if I talk to you and you talk to someone else, I need to know that you talked to someone else. We do that by passing an identifier. I would tell you, Hi, I'm ID1, and you would tell the next person you're talking to, Hi, ID1 talked to me and now I'm talking to you. And we can actually stitch that information together. In software, that's typically done through headers. You pass a header like a trace ID with a unique value, and that'll get passed through every single request or transaction in your environment. Well, doing that in instrumentation is actually very difficult. Now, OTEL is trying to make this easier. For example, it has built-in context propagation for some of the libraries that are automatically instrumented. And it offers something called automatic instrumentation, which at runtime actually injects itself into your code and allows you to do things like pass context easily. Now, automatic instrumentation is nice in that I don't have to go manually update the code in my application in order to get the telemetry data, but it only works for things that the automatic instrumentation is aware of. This is known libraries or frameworks, typically open source or popular ones, and definitely not everything available today. If you have your own custom libraries or frameworks, automatic instrumentation may or may not work for you. And then you need to learn how to actually add instrumentation properly within your environments to make sure things like context propagation work. This is definitely challenging. 
But the good news is OTEL provides a lot of kind of choice along the way. You can start with automatic instrumentation, you can add manual instrumentation later, and even using automatic and manual at the same time is fully allowed. So there's a lot of flexibility in the model of the OTEL framework.
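The "Hi, I'm ID1" idea can be sketched with a header shaped like the W3C Trace Context `traceparent` header (version, trace ID, parent span ID, flags). This is a minimal standalone illustration that assumes well-formed input and skips validation; the `handle_request` function is hypothetical, and OpenTelemetry's own propagators do considerably more.

```python
import secrets


def new_traceparent() -> str:
    """Start a new trace: 32-hex-char trace ID, 16-hex-char span ID, sampled flag."""
    trace_id = secrets.token_hex(16)
    span_id = secrets.token_hex(8)
    return f"00-{trace_id}-{span_id}-01"


def child_traceparent(incoming: str) -> str:
    """Keep the caller's trace ID but mint a new span ID for this hop."""
    version, trace_id, _parent_span_id, flags = incoming.split("-")
    return f"{version}-{trace_id}-{secrets.token_hex(8)}-{flags}"


def handle_request(headers: dict) -> dict:
    """Hypothetical service hop: continue the incoming trace or start a new one.

    Returns the headers this service would attach to its own outgoing call.
    """
    incoming = headers.get("traceparent")
    outgoing = child_traceparent(incoming) if incoming else new_traceparent()
    return {"traceparent": outgoing}


# Three hops: the first has no incoming context, the next two propagate it.
hop1 = handle_request({})
hop2 = handle_request(hop1)
hop3 = handle_request(hop2)
```

Every hop shares the same trace ID while the span ID changes, which is what lets a backend stitch the separate spans into one trace. If any hop forgets to forward the header, the chain breaks at that point.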

[00:19:46] Joe Colantonio Nice. I think you also cover the power of context and correlation. My background: 20 years ago, I started as a performance engineer. We would run a performance test, and we had all these monitors. We'd see a spike in a response time and we'd be like, oh gosh, I don't know what relates to what, what's causing what, with all these different metrics coming in. How has it changed nowadays? With context and correlation, is that solved, where it tells you the SQL statement is running X amount of time out of this whole step that I ran?

[00:20:15] Steve Flanders Yeah. So again, context and correlation, I think, is a difficult challenge. Tracing, or distributed tracing, tries to make that easier by passing context. But metrics and logs by default don't have that context. I actually refer to metrics and logs more like symptoms in your environments. They tell you that something happened, but not necessarily why, because they're individual records or time series of things that occurred. But OTEL actually does allow passing of context with metrics and logs. That's one of the cool benefits that it provides. So across all signal types you have some context there. The other cool thing is what's called resources, which I mentioned earlier. It treats the infrastructure, or "where you are running" type aspects, as natively important to OTEL. So everything can contain resource information. That's valuable because it helps with at least problem isolation, if not root cause. By problem isolation I mean, let's say I have a thousand Java servers all over the world that run this checkout service, and all of a sudden I'm having a problem with checkout services. I don't know whether all services are impacted, or just a single region, or maybe just one pod or one host is having problems. But with resource information, which is basically just metadata tagged on to these signals, traces, metrics, and logs, I can start to answer those questions. I can be like, oh, hey, only this particular subset of my Java instances, maybe in one region, is having problems. And I can further break that down and say, oh, in this one region, I just deployed a new release. And so I'm now seeing differences between my old release and my new release. That's some of the context and correlation that's available there. A lot of this, if you use automatic instrumentation, is provided out of the box, which is kind of nice, but you can of course enrich it with your own conventions or your own metadata as well. 
There's a lot of cool ways to kind of get context and correlation through OTEL, some of it automatic, some of it manual, but it is trying to make it easier and trying to extend it beyond just a single signal in the observability space.
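The problem-isolation step described above amounts to grouping telemetry by a resource attribute and comparing the groups. The sketch below does this over a handful of hand-written span dictionaries; the span shape is hypothetical, and the attribute keys are written in the style of OpenTelemetry's semantic conventions but are used here purely as illustrative labels.

```python
from collections import defaultdict


def error_rate_by(spans, attribute):
    """Group spans by one resource attribute and return the error rate per group."""
    totals = defaultdict(int)
    errors = defaultdict(int)
    for span in spans:
        key = span["resource"].get(attribute, "unknown")
        totals[key] += 1
        if span["error"]:
            errors[key] += 1
    return {key: errors[key] / totals[key] for key in totals}


# Hypothetical checkout-service spans tagged with resource metadata.
spans = [
    {"resource": {"cloud.region": "us-east-1", "service.version": "1.4.0"}, "error": False},
    {"resource": {"cloud.region": "us-east-1", "service.version": "1.4.0"}, "error": False},
    {"resource": {"cloud.region": "eu-west-1", "service.version": "1.4.0"}, "error": False},
    {"resource": {"cloud.region": "eu-west-1", "service.version": "1.5.0"}, "error": True},
]

by_region = error_rate_by(spans, "cloud.region")
by_version = error_rate_by(spans, "service.version")
```

Slicing first by region and then by version mirrors the narrowing described in the conversation: from "checkout is failing" to "one region" to "the new release in that region".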

[00:22:14] Joe Colantonio Yeah, this is crazy. Back in the day, I worked for an insurance company. We had to install an .... on machines, ship it to the insurance company, and they'd have to plug it in and then ping a site to see if it's down. It's incredible what you can do nowadays. That's awesome. So speaking of that, sometimes people may be hearing this and they want to get their hands dirty, but they may not have an application, or they don't want to mess with their application. I think in the pre-show you mentioned something about a demo app. Is there a demo app someone could follow along with to really learn the concepts as they're reading?

[00:22:43] Steve Flanders Yeah. So OTEL provides one. It's called the OpenTelemetry demo, also known as the Astronomy Shop. The OpenTelemetry logo is a telescope, right? So the demo application is like a shop where you can buy things like telescopes or binoculars or things to that extent. You can think of it like an e-commerce application. So it has a bunch of services in the back end where you can authenticate, view the front end, you can check out, there's payment, currency, there's a cart service, and all of them are unique microservices written in different languages. So it's a polyglot architecture and they've all been instrumented with OpenTelemetry. They all send data to an OpenTelemetry collector. And that OpenTelemetry collector actually forwards that data to open source observability platforms like Prometheus and Jaeger. So it's basically a batteries-included version that you can just get up and get started with to actually experience the power of OpenTelemetry. Today, that demo environment only supports stable components in OpenTelemetry. So OpenTelemetry is an evolving project, and its GA, or generally available, equivalent is called stable in OpenTelemetry. Things that are not stable are called in development. That used to be called experimental; now it's called in development. In OTEL, in-development ones are actually not part of the demo environment today. They do wait for things to become stable before actually introducing them to the demo environment. But you can very easily deploy this locally on your system. It supports Docker, it supports Kubernetes, and you can play with it. You can turn things on and off. You can kind of customize it as you see fit, and they have a few extensions and things built into it too. It comes with a load generator tool, so it will actually start generating telemetry data for you. It comes with Grafana. It has built-in dashboards. 
You can actually see some of the telemetry data and what you can actually do with it. It really is kind of easy to even inject failures into it. It has feature flags. You can actually say, hey, I want to cause 10% errors on the cart service, and you will start seeing that in Prometheus or Jaeger and be able to kind of troubleshoot your environment. Really, really cool overall.

[00:24:45] Joe Colantonio Nice. So like any technology, I could see people pushing the boundaries or kind of abusing it. So I think you have a chapter on the anti-patterns and pitfalls, any examples of maybe things people should avoid or maybe not do, even though it may be tempting to do?

[00:24:58] Steve Flanders There's actually lots of things. So with great power comes great responsibility, right? Like let's talk about instrumentation for a second. You could literally instrument everything in your application, every function, method, whatever it is throughout your entire code base. You could add tons of metadata for every single unique thing about it, like what is the customer ID, what's the user ID, what's the transaction ID, what host or pod is it running on? All of that. That could be called over-instrumentation. So eventually you generate so much telemetry data that it's just overwhelming. A, you have to process all of that data, so there's a cost associated with that. B, you have to store it and be able to analyze it in dashboards and alerts. That gets very expensive over time. And C, not all that data is valuable, right? Like your end state is probably trying to be able to quickly solve availability and performance problems. You don't need to collect every piece of telemetry data in the entire world to do that. So one of the anti-patterns that's talked about is over-instrumenting. Of course, the opposite is also true, which is under-instrumenting. If you don't have enough of that telemetry data, you can't actually achieve observability, because you can't ask any question of that data. You don't have the data to actually ask those questions. And adding more instrumentation live, like during an active incident, that's not easy. Something like automatic instrumentation would require a restart. Manual instrumentation typically requires a restart. You may not be able to reproduce the issue after you restart, so you have to really strike the balance between under-instrumenting and over-instrumenting. And OpenTelemetry can help here. Automatic instrumentation is great, especially for known frameworks and libraries. It actually provides a good balance. And manual instrumentation you can kind of control, and you can learn over time whether you need more or less. But that's a really classic example, and I list thousands of different scenarios beyond instrumentation as well in that chapter.
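
One way to picture the balance Steve describes is an allowlist that keeps a few attributes useful for troubleshooting and drops the rest before telemetry is emitted. What follows is a minimal sketch, not OpenTelemetry API usage: plain dictionaries stand in for span attributes, and the names trim_attributes, ALLOWED_KEYS, and MAX_VALUE_LENGTH, as well as the choice of which keys to keep, are assumptions made for illustration.

```python
# Hypothetical sketch: a plain dict stands in for span attributes.
# ALLOWED_KEYS and MAX_VALUE_LENGTH are illustrative choices, not OpenTelemetry settings.
ALLOWED_KEYS = {"transaction.id", "host.name", "k8s.pod.name"}
MAX_VALUE_LENGTH = 64


def trim_attributes(attributes: dict) -> dict:
    """Keep only allowlisted keys and cap the length of each value."""
    trimmed = {}
    for key, value in attributes.items():
        if key not in ALLOWED_KEYS:
            continue  # dropped: not on the allowlist
        trimmed[key] = str(value)[:MAX_VALUE_LENGTH]
    return trimmed


raw = {
    "customer.id": "c-1001",      # hypothetical key, dropped in this sketch
    "user.id": "u-42",            # hypothetical key, dropped in this sketch
    "transaction.id": "t-9f3a",   # hypothetical key, kept
    "host.name": "web-01",
    "k8s.pod.name": "checkout-7d9f",
}
print(trim_attributes(raw))
```

Which keys belong on the allowlist is exactly the thing a team learns over time; the sketch only shows that the decision can live in one place.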

[00:26:56] Joe Colantonio Absolutely. So people definitely need to buy the book. We'll have a link for it in the comments down below once it's available. All right. So we're getting near the very end. So you talked about scalability and how to scale observability. I guess once you know these anti-patterns, hopefully you're following that advice, because I'm sure you'll find out if you're doing things wrong once you get to the scaling part. So any tips for scaling? Or is it like any project, I assume, a constant refining that needs to go on based on what you're learning from production and from what you've done?

[00:27:26] Steve Flanders Oh yeah, it's definitely a constant process because your environment will change. Maybe you have spikes in traffic which generate additional telemetry data. How do you handle that? And these are typical distributed systems type problems. If I have a stateless service, for example, the OTEL collector is stateless by default. So unless you configure something that's stateful, you could use something like an autoscaler where you're like, hey, I'm going to measure the CPU or memory over my OTEL cluster if it's running in gateway mode, and I will have it dynamically add or reduce the number of instances. That is possible. But if you configure the OTEL collector in a stateful way, for example, tail-based sampling requires state management, then you have to be very careful, because when you add or remove instances, you could actually cause disconnected traces, because the ability to route the data to the right instance is at least temporarily broken while the cluster kind of resyncs to its new size. So yeah, you have to be careful here. You have to understand what is available in terms of tools, what you need to be monitoring, and then what impact said changes have in your environments. Like anything, a lot of the recommendations I have is make sure you test this before you go to production, because you're gonna have a really bad time otherwise. It's already too late at that point. But I talk about it from an application perspective, I talk about it from an OTEL perspective, and I even talk about it from an observability platform perspective, because all of them have different dimensions of scale and things that you care about. And it's not just how do you scale up or scale out. It's also the cost associated with that and whether you're getting value out of performing those operations. Sometimes the answer to scalability problems is don't send as much data. Going back to over-instrumenting, if I reduce the amount of data, if I filter it earlier or if I aggregate it earlier, I don't have to scale the back end platform as much to support those use cases. And so those types of aspects are also discussed in the chapter.
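
The disconnected-trace risk can be made concrete with a small simulation. Tail-based sampling needs every span of a trace to reach the same instance, so spans are typically routed by trace ID. The sketch below assumes the simplest possible scheme, hashing the trace ID modulo the instance count; that scheme and the names route and moved_traces are assumptions for illustration, not a description of how the OTEL collector routes data.

```python
import hashlib


def route(trace_id: str, instance_count: int) -> int:
    """Pick an instance index for a trace by hashing its trace ID (assumed scheme)."""
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % instance_count


def moved_traces(trace_ids, old_count: int, new_count: int) -> list:
    """Return the trace IDs whose target instance changes after a resize."""
    return [t for t in trace_ids if route(t, old_count) != route(t, new_count)]


# Made-up trace IDs; real ones are 16-byte values.
trace_ids = [f"trace-{n:04d}" for n in range(1000)]
moved = moved_traces(trace_ids, old_count=3, new_count=4)
print(f"{len(moved)} of {len(trace_ids)} traces change instance when scaling 3 -> 4")
```

Any trace that is still in flight while its target instance changes ends up with spans on two instances, and neither one sees the whole trace when it makes its sampling decision. That is the reason to test a resize before production.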

[00:29:19] Joe Colantonio All right. So we already touched a little bit on the future of observability, which you cover in the book. Curious to know, though. I get shown demos all the time. I've been seeing automation engineers and testers creating solutions, like an SDK on top of OpenTelemetry, and then they're able to go from production to creating tests because they know what users are doing. They're doing crazy things with APIs and things like that. So are you surprised by how OpenTelemetry is being utilized? Do you see a future for that, people working on top of it to do other things that you may not have originally anticipated?

[00:29:51] Steve Flanders One of the things I love about OpenTelemetry is just how extensible it is. So one of its guiding principles is to be open source and vendor agnostic. That second part is actually very important. Being vendor agnostic means it needs to be able to support a variety of different open standards, and it needs to be extensible to the future as new things are being added, be that testing, be that profiling, be that things in CI/CD pipelines. Whatever the use case is from an observability perspective, it needs to be able to grow and evolve into it. And that was thought about from day one in the OTEL specification. That's really, really powerful. And I love seeing different use cases for OTEL. There's actually a few cool projects out right now that are doing LLM observability, going back to AI for a second. I love that, and it uses OTEL behind the scenes to actually collect this data. OTEL actually has semantic conventions, which is just a fancy way to say standardized names for metadata. They have it for LLM, just like they have it for HTTP or databases. These additional use cases are just so cool to me, and I love that the community in general is very friendly and inviting. And so you'll see people come in and come out depending on what their use cases are. And we can kind of all collaborate together. Long story short, I'm not surprised to see this, and I hope to see more of it going forward, because that really is the power of open source in my mind. That shows a really healthy community overall.
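
To make "standardized names for metadata" concrete, here is a minimal sketch that renames ad hoc attribute keys to conventional ones. The target names are attribute names as they appear in the OpenTelemetry semantic conventions to the best of my knowledge, and should be checked against the current specification. The ad hoc keys on the left, the CONVENTIONAL_NAMES table, and the normalize function are hypothetical.

```python
# Left: hypothetical ad hoc keys a team might invent.
# Right: names believed to match OpenTelemetry semantic conventions (verify before use).
CONVENTIONAL_NAMES = {
    "method": "http.request.method",
    "status": "http.response.status_code",
    "model": "gen_ai.request.model",
}


def normalize(attributes: dict) -> dict:
    """Rename known ad hoc keys to conventional names; leave other keys untouched."""
    return {CONVENTIONAL_NAMES.get(key, key): value for key, value in attributes.items()}


print(normalize({"method": "GET", "status": 200, "team": "checkout"}))
```

When every service uses the same names, a backend can correlate HTTP, database, and LLM telemetry without per-team mapping rules.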

[00:31:13] Joe Colantonio Absolutely, it must be a cool feeling to see how your work has grown and how people are utilizing it, for sure. All right, Steve. I got an early kind of outline of the book, but I know you've been working on it. Can you give me, like, an official date? I know it's kind of hard when you're writing a book, you never know. But when do you plan on launching the book? How can people get their hands on it? I know you probably travel around the world at conferences, and people can actually meet you and get the book signed once it's live. So all that stuff about the book: when is it going live, and how can people get their hands on it?

[00:31:42] Steve Flanders Yeah, so it's actually available for preorder right now. It's not released yet, but you can preorder it on your typical sites like Barnes and Noble or Amazon or something to that extent. I'm in the home stretch of kind of incorporating feedback and going through the last versions of it, so that's very exciting. Looking forward to being done with that process. And then right now, the book is scheduled to be released sometime in early October to early November. The idea is that it will be published before KubeCon North America, which is in Salt Lake City this year. I actually just found out that I had a session accepted there, so I will definitely be there and I will have copies of the book. I would love to meet people there who are also going to be at the conference. I believe I'll probably be doing a book signing as well. So that's probably the best way to get your hands on a potential free copy of it. And then of course, it'll be available on all your typical sites, be it a physical copy or online virtual copies, probably starting in the October-November timeframe.

[00:32:37] Joe Colantonio Okay, Steve, before we go, is there one piece of actionable advice you can give to someone to help them with their DevOps OpenTelemetry efforts, and what's the best way to find or contact you?

[00:32:46] Steve Flanders Yeah, so my general guidance is if you're not looking at OpenTelemetry today, you need to be. It doesn't matter what observability products you're using or where you are in your observability journey. I can almost guarantee you will get value out of OpenTelemetry, even if it's just that it provides data portability and the ability to be vendor agnostic. I really encourage you to kind of take a look at that. In addition, the observability space is rapidly evolving. OpenTelemetry is part of the CNCF. The CNCF has a lot of great resources on observability, including different platforms that are available, trends that they're seeing, surveys that they've done from an end user perspective. Definitely take a look at anything observability related in the CNCF. And then in terms of contacting me, I'm available on a variety of different sources. I have my blog at SFlanders.net. LinkedIn is always a great place to kind of find me. My blog has information about the book as well, and then you'll find me in the OTEL community too. There's a CNCF Slack instance for OTEL. I'm active there, or just in the OTEL GitHub repositories themselves.

[00:33:45] Remember, latency is the silent killer of your app. Don't rely on frustrated user feedback. You can know exactly what's happening and how to fix it with BugSnag from SmartBear. See it for yourself. Go to BugSnag.com and try for free. No credit card required. Check it out. Let me know what you think.

[00:34:07] And for links of everything of value we covered in this DevOps Toolchain Show, head on over to Testguild.com/p161. And while you're there, make sure to click on the SmartBear link and learn all about SmartBear's awesome solutions to give you the visibility you need to deliver great software. That's SmartBear.com. That's it for this episode of the DevOps Toolchain Show. I'm Joe. My mission is to help you succeed in creating end-to-end full-stack DevOps Toolchain Awesomeness. As always, test everything and keep the good. Cheers.

[00:34:40] Hey, thank you for tuning in. It's incredible to connect with close to 400,000 followers across all our platforms and over 40,000 email subscribers who are at the forefront of automation, testing, and DevOps. If you haven't yet, join our vibrant community at TestGuild.com where you become part of our elite circle driving innovation, software testing, and automation. And if you're a tool provider or have a service looking to empower our guild with solutions that elevate skills and tackle real world challenges, we're excited to collaborate. Visit TestGuild.info to explore how we can create transformative experiences together. Let's push the boundaries of what we can achieve.

[00:35:24] Oh, the Test Guild Automation Testing podcast. Oh, the Test Guild Automation Testing podcast. With lutes and lyres, the bards began their song. A tune of knowledge, a melody of code. Through the air it spread, like wildfire through the land. Guiding testers, showing them the secrets to behold.
