About this DevOps Toolchain Episode:
In this episode, host Joe Colantonio is joined by special guest Jeremy Burton, the CEO of Observe Inc. They delve into the world of observability, discussing its importance in troubleshooting unknown problems and the challenges in correlating data points to identify application issues.
Jeremy explains how Observe's goal is to streamline troubleshooting by consolidating all data in one place, making problem resolution easier. The conversation also tackles the potential impact of generative AI in automating troubleshooting steps and eliminating the need for human-written runbooks. From shifting from traditional anomaly detection to observability to the challenges of implementing observability tools, the episode focuses on the cultural shift and the importance of instrumentation in code for effective troubleshooting.
Join us as they explore the world of observability, its applications in troubleshooting, and the shift towards a customer-focused approach to identifying and resolving issues.
About Jeremy Burton
After a long career at large enterprise software companies such as Oracle, VERITAS and EMC, Jeremy is now CEO of a fast-growing startup, Observe Inc, a SaaS observability offering built on top of Snowflake. Observe promises to lower the mean time to resolve incidents by two thirds, at one third of the cost. Jeremy also sits on the boards of Snowflake and McLaren F1.
Connect with Jeremy Burton
- Company: www.observeinc
- Blog: www.observeinc.com/blog
- Twitter: www.Observe_Inc
- LinkedIn: www.observe-inc
- YouTube: www.@ObserveInc
- Git: www.observeinc
Rate and Review TestGuild DevOps Toolchain Podcast
Thanks again for listening to the show. If it has helped you in any way, shape or form, please share it using the social media buttons you see on the page. Additionally, reviews for the podcast on iTunes are extremely helpful and greatly appreciated! They do matter in the rankings of the show and I read each and every one of them.
[00:00:01] Get ready to discover the most actionable end-to-end automation advice from some of the smartest testers on the planet. Hey, I'm Joe Colantonio, host of the Test Guild Automation Podcast, and my goal is to help you succeed with creating automation awesomeness.
[00:00:19] Joe Colantonio Hey, we're all entering an era where simply monitoring a system isn't enough, and it's all about gaining deep insight and foresight. And this is the world of observability. Our guest today, Jeremy Burton from Observe, Inc., is here to turn everything that you thought you knew about data on its head. Prepare to have your perspective changed. Hey, I'm Joe, and welcome to another episode of the Test Guild DevOps Toolchain. And today, we'll be talking with Jeremy all about what I'm calling Stop Monitoring and Start Observing. So after a long career in large enterprise software companies such as, I think, Oracle, Veritas, and EMC, Jeremy is now the CEO of a fast-growing startup, I think they just got a lot of funding recently, called Observe, Inc., which is a SaaS observability offering built on top of Snowflake. Observe promises to lower the mean time to resolve incidents by 2/3, at 1/3 of the cost, is what I'm being told. And Jeremy also sits on the board of Snowflake and McLaren F1. Really excited to have him on the show. You don't want to miss it if you really want to learn more about observability.
[00:01:21] Joe Colantonio Hey, Jeremy, welcome to The Guild.
[00:01:26] Jeremy Burton Thank you. Big intro. Hopefully, I can live up to that big intro.
[00:01:33] Joe Colantonio I'm really excited. I know observability has been hot, hot, hot lately. I'm just curious to know why. I know you're an entrepreneur, and you must have looked at all these areas and asked, okay, what is maybe lacking, and where can I fill a need with something that's going to help people? Why observability?
[00:01:47] Jeremy Burton Yeah, I think it really starts with as an industry, building applications differently and you can probably track this back to maybe, I don't know, seven, eight, nine years ago when Kubernetes became commonplace as a sort of fabric for building applications on top of container-based development. And really the benefit there was that developers could go fast, they could write code at their own pace and then deploy it into production every day. And the problem that that creates for the poor folks that have to troubleshoot when things go wrong is that you've got a lot of change going into production every day. You don't have time. There was a time, believe it or not, when we used to test a lot of things before we pushed them into production. And the line now is you test in production. Why is that? Well, because you can't reproduce a 100,000 or a million users in the test. And so we essentially test in production. And so you see problems that you've never seen before. And the way you troubleshoot those problems is you have to be able to investigate. And that's really what observability is all about. It's investigating these unknown problems, things that you've never seen before.
[00:02:57] Joe Colantonio No, I love that. I think a lot of people, when they're older, like me, think back to the day when they had a raised floor and they actually had everything under their control. It was all deterministic. So you could test before you went to production and pretty much you were good. But nowadays you really can't, I guess because you have multi-cloud, you have different services that could be out of your control that are up and running. So is that why we see this even more now? Because of going cloud-native and all these moves that we've been doing lately?
[00:03:22] Jeremy Burton Developers sort of rule the day, right? They want it to go fast. They want instant gratification. I write my code, I want to push it into production. And that was say, the world of apps on the phone. People want regular updates. They expect that I'm going to get new capabilities every day. And so the pace of delivery of new features has increased dramatically. And so that forces you then to push more often to production. And so when you see these problems, obviously it's a race to figure out, okay, what went wrong and how do I fix it? Because if you don't figure that out, you're probably going to have a negative impact on the customer experience. And from big banks to small mom and pop shop, they do business online. And that ability to give people a good experience is a differentiator to the business or to the brand.
[00:04:10] Joe Colantonio How do you know what's going on? If it's so undetermined, almost, what's happening, anything could happen. I know that's why a lot of people do chaos engineering beforehand, because they're trying to test for this unknown activity. They have all this data, all these points, like how do you know what to even focus on?
[00:04:27] Jeremy Burton Yeah. I mean, look, this is the mantra of the company. We believe everything is a matter of data. We collect all the exhaust fumes that these digital applications emit, the infrastructure and the application, and we collect them together into one big database, and then we try and make sense of it all. Put another way, we help SRE teams and DevOps teams to make sense of those sort of digital exhaust fumes and track down what the problem was. So that was the genesis of the company: Hey, couldn't we collect all of this data? Not just the logs, not just the spans and the tracing data, not just the time series metrics, but can we collect it all and could we make sense of it, and could we help folks troubleshoot these unknown problems?
[00:05:08] Joe Colantonio So obviously, you have a platform called Observe. What makes it unique in this space? I've seen all the tools, some of which have their secret sauce. Does it make it easier to troubleshoot all these different points and give you insights? Like, what is it that makes it so?
[00:05:23] Jeremy Burton Yeah, there's a couple of things. But number one, we started later than most, so we're a relatively new company and obviously as a new company, you can take advantage of new technology in a new architecture. Number one, we put all the data in one database and that sounds like the most obvious thing in the world to do. But believe it or not, historically, that has not been the way because you take a company that's done very, very well over the years like a Splunk, they really built a database to help people analyze logs and they made it sort of bespoke for that purpose. And then a company like Datadog came along and they built a time series database. Why? Because they help people analyze metrics, CPU utilization, and all that good stuff. And so we were like, okay, enough of this different database for different types of data. Why don't we just put the data all in one database? And we can take advantage of new technology, Snowflake, because it can handle time series data, it can handle unstructured logs, it can handle semi-structured JSON blobs. So putting the data in one place is the foundational thing, and then the real magic is that we don't just allow folks to do like a breadcrumb search. I've got a container ID, let me go search thousands of terabytes of logs. We actually curate that data and we structure it. And that's what makes it easier to navigate and therefore troubleshoot. And so the magic in the product is taking this sort of messy, unstructured data and adding structure to it and adding relationships to it, which allows people to follow the path of the problem and identify the root cause.
[00:06:57] Joe Colantonio And so it seems like it takes logs, it takes metrics, but it's all in one location. So it's able to make correlations better because you have everything in one place that's able.
[00:07:06] Jeremy Burton That's right. The analogy that I always use to help people sort of grok it is if I put a blindfold on you and I dropped you in an unknown location and took the blindfold off, like, what would you do? If you're in San Francisco, you'd be like, oh, Golden Gate Bridge, Transamerica building, Golden Gate Park. And based on your position relative to those places, you'd be like, okay, I think therefore I'm in the Marina District. And so troubleshooting an unknown problem is exactly the same. You pick a data point, maybe not the Golden Gate Bridge, but it might be a metric. Okay, show me the logs for that metric. Show me the traces or the spans for that metric. Now, if I correlate around those points, I should be able to identify where in my application the problem is. And so we take that exact principle to heart in Observe. In most companies today, by the way, that correlation is done by a subject matter expert. There's always one or two people in the company who know everything. And typically today, what happens is there's an incident, they get everyone on the line and they say, okay, take a screenshot of what you see, Bob, and put it in the Slack channel. Look, Mary, you next. What do you see? And this goes on for the first hour of the troubleshooting call, and then one of these subject matter experts comes along and eyeballs the screenshots in the Slack channel and does the correlation in their brain. And our viewpoint is that should be done in software. And the reason it's not done in software is because the data is not in one place.
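To make the "correlate around a data point" idea concrete, here is a minimal Python sketch. It is not how Observe works internally; the `Event` type, the `service` join key, and the one-minute window are all assumptions chosen for illustration. It starts from one anchor signal (say, a metric spike) and gathers the logs and spans from the same service that occurred close to it in time.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Event:
    kind: str            # "metric", "log", or "span"
    service: str         # hypothetical shared key that links the signals
    timestamp: datetime
    detail: str


def correlate(anchor: Event, events: list[Event], window: timedelta) -> dict[str, list[Event]]:
    """Group events from the anchor's service that fall within `window` of it."""
    related: dict[str, list[Event]] = {"metric": [], "log": [], "span": []}
    for event in events:
        if event is anchor:
            continue  # the anchor is the starting landmark, not a finding
        same_service = event.service == anchor.service
        close_in_time = abs(event.timestamp - anchor.timestamp) <= window
        if same_service and close_in_time:
            related.setdefault(event.kind, []).append(event)
    return related
```

Calling `correlate(spike, all_events, timedelta(minutes=1))` returns the nearby logs and spans in one structure, which is the step the subject matter expert otherwise performs by eyeballing screenshots. It only works if all three signal types are queryable together, which is the point made above about keeping the data in one place.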
[00:08:39] Joe Colantonio Love it. Kind of similar, I started my career as a performance engineer, and in order to do performance engineering you had a database guy, a mainframe person, you had all these people look at their little piece and then you come together and go, okay, let's correlate. It sounds like this is all in one location, one dashboard. It's almost like a team flight deck, almost, where you could see everything at a high level as well?
[00:08:57] Jeremy Burton Yeah. I mean, you want what ___ is. People want to be able to see the top level view. One of our biggest customers is a big bank, right? And they use Observe to help troubleshoot their problems with their mobile application. And they want to see the customer journey from log in right the way through everything that they've done. And then when something goes wrong, they want to be able to drill down: Hey, show me that service. Okay? Now show me the metrics for that service. Show me the log for it. Show me the latency for it. Okay. Yeah, it's not there. Let's back it up. Let's go here. Today, that's like 100 people on a call each given a discrete view of what their area is seeing. And so really what we're trying to do is eliminate that chaos at the beginning of the troubleshooting call where you don't have all the data and you're not able to correlate it. If we can get rid of a lot of that, then actually finding the problem and then fixing it once you've identified the problem tends to be trivial in most instances. If we can eliminate a lot of that messing around at the beginning of the call, then I think we can help people make progress.
[00:09:58] Joe Colantonio I guess, finding an issue, though, like all this data, so much data like how do you and you have to do it in real time really fast. Does it bubble up insights or do you use something like generative AI to query it, to say, hey, tell me in plain English, what are you looking for? And I'll bubble up insights for you. How does that work?
[00:10:17] Jeremy Burton Yes, there are a couple of things there. First of all, the freshness of data. It takes about 15 seconds from when the data is created to it being in Observe, and then you can actually query it, number one. Number two, on your question around generative AI, I think we're probably not going to get to the Star Trek-like, Hey, computer, what's the problem? any time soon. But I'm pretty bullish on the impact of generative AI, because think about it this way: our founder, Jacob. Anything that goes wrong with our system, if you ask Jacob, he knows what to do. And so the thought with generative AI, yes, we can do the sort of the GPT help and the co-pilot and we've got all of those things in the product. But what we believe we can get to is if we could train an LLM on your prior incident history. LLMs are statistical engines, right? They don't inherently have an ability to reason, but they can statistically tell you what the next right thing to do is based on prior history. And so what we think we can do is for a particular class of incident, we could dynamically create a run book for that incident. So run books for years have been written and maintained by humans. It's sort of best practices for troubleshooting. But why can't you automatically generate those from prior incident history, number one? And number two is, if you can do that, why wait for the human to take the steps? Why not go and execute some of these steps? And so maybe when a human enters the troubleshooting flow, all of the data that they would need for the investigation is in front of them. And in most companies, that's probably half the time it takes to resolve the incident, getting the data together. So I think the premise is that you could halve the time to troubleshoot by just eliminating a lot of the puttering around before you get your hands dirty and start investigating, particularly with an unknown problem. Like if you've never seen the problem before, there's probably not a precise answer to get to the root cause. But there will be a general set of steps that you would follow in order to triage that and sort of move the ball further down the field. And I feel like we can start with the ball on the 50-yard line instead of the 10 or 20-yard line.
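The "statistically pick the next step from prior history" idea can be illustrated without an LLM at all. The Python sketch below swaps in a plain frequency count of step-to-step transitions, which is a far simpler technique than the model Jeremy describes; the function names, the `START` marker, and the shape of the incident history are assumptions made for the example.

```python
from collections import Counter, defaultdict

START = "START"  # hypothetical marker for the beginning of an incident


def build_step_model(history):
    """Count which step followed which, per incident class.

    `history` is a list of (incident_class, [step, step, ...]) records.
    """
    model = defaultdict(Counter)
    for incident_class, steps in history:
        previous = START
        for step in steps:
            model[(incident_class, previous)][step] += 1
            previous = step
    return model


def suggest_runbook(model, incident_class, max_steps=5):
    """Follow the most frequent next step until history runs out."""
    runbook = []
    previous = START
    for _ in range(max_steps):
        candidates = model.get((incident_class, previous))
        if not candidates:
            break  # no prior incident went further than this
        step, _count = candidates.most_common(1)[0]
        if step in runbook:
            break  # avoid cycling through the same steps
        runbook.append(step)
        previous = step
    return runbook
```

For an incident class that has never been seen, `suggest_runbook` returns an empty list, which mirrors the caveat above: prior history gives a general set of triage steps, not a precise answer for a brand-new problem.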
[00:12:26] Joe Colantonio Yeah, I love it. I mean, there's a lot of promise in AI, but it seems overhyped. But I think with machine learning, with all this history and logs, with a large language model specific to your incidents, I think it's made for machine learning almost.
[00:12:39] Jeremy Burton It's massively overhyped. But I think, I mean, in tech, I've been in it a long time. There's a little bit of fashion. We love a good trend and a bubble. And I think this one is absolutely a bubble. Absolutely 90% of the investment dollars that have been spent right now on companies, those companies are going to go out of business. But there will be some big winners. And I think for those of us that have worked with LLMs today and know the reality of training an LLM and the hallucinations, it isn't perfect. It's kind of a little bit ugly. But I think the promise in terms of getting people ramped and productivity, I think it's going to change the nature of the way we interact with machines, which I think we've talked about in tech for probably decades.
[00:13:21] Joe Colantonio Right. I think that the feasibility of triaging lists, giving you like a quadrant to look at rather than everywhere is a big win for sure.
[00:13:30] Jeremy Burton Yeah, that's the win because as you mentioned earlier, the data volumes are vast. How do you eliminate most of the data and then dive in? And that's where I think the LLM is going to help a lot.
[00:13:40] Joe Colantonio So what's the big benefit of this? Is it able to detect issues before your customers start complaining, so it's almost preventative, or is it that an incident happened and it's just trying to fix something that's plain old broken?
[00:13:56] Jeremy Burton Yeah, nice, it's a great question. I think certainly things like anomaly detection have been around for a while. So when you see something anomalous, you can alert and that may well be a sign of something more profound happening. I think with observability, it's an opportunity to sort of rethink how we're alerting and how we're finding problems. There are literally thousands of alerts set on every conceivable piece of infrastructure. And our viewpoint in the world of observability is much more like, okay, but what's important? Did it affect a customer? If it didn't affect a customer, maybe we don't care so much. And so can you actually set an alert or determine even whether a problem is affecting a customer? And when I mentioned earlier about taking this machine data and deriving relationships one of the pieces of magic in Observe is that you can see your customers and you can see your metrics by customer. And if there's an alert, you can see which customers are affected. And so maybe I don't necessarily need to pay attention to a thousand noisy alerts in my infrastructure, but I do if it is impacting a customer. And I think that is, I think observability. The opportunity, if you like, is to rethink how we're looking at our applications and systems and very much take a customer perspective on the problem. There'll always be ways that you can show outliers or show anomalies, and that might give you a heads up of a problem. But I think at the same time we can probably lose a lot of the noise in the alerts by actually alerting on the things we care about versus the underlying infrastructure that we might not.
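A small Python sketch of the customer-first alerting idea described here: given a mapping from infrastructure resources to the customers they serve, keep only the alerts that touch at least one customer and rank them by how many are affected. The `Alert` type and the `resource_to_customers` mapping are hypothetical; in practice that relationship would have to be derived from the telemetry itself, which is the part Jeremy calls the magic.

```python
from dataclasses import dataclass


@dataclass
class Alert:
    name: str
    resource: str  # e.g. a host, pod, or service identifier


def affected_customers(alert, resource_to_customers):
    """Look up which customers depend on the alerting resource."""
    return resource_to_customers.get(alert.resource, set())


def prioritize(alerts, resource_to_customers):
    """Return (alert, customers) pairs for customer-impacting alerts only,
    ordered so the alert affecting the most customers comes first."""
    impacting = []
    for alert in alerts:
        customers = affected_customers(alert, resource_to_customers)
        if customers:
            impacting.append((alert, customers))
    impacting.sort(key=lambda pair: len(pair[1]), reverse=True)
    return impacting
```

Alerts on resources with no known customers are dropped from the list rather than deleted, so a team could still review them later at lower priority.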
[00:15:34] Joe Colantonio I love that. I know every time I speak to an SRE, they're dealing with noise, and even with security, knowing what's really an issue and being able to prioritize it is huge for a team. Because you have all this information happening really quickly. You need to be able to triage quickly and pay attention to the things that really matter. So I love that approach.
[00:15:53] Jeremy Burton Yeah, which is to set discrete alerts with that context. And this is again, like on observability, what's it about? It's about having more context. Why do you need context? Well, if it's a problem you've never seen before, back to my analogy, if it's a location you've never been to before, you need context to establish where you are. And so I think the great observability tools are going to give SRE and DevOps teams that context that they never had. Should I care about these alerts? Did it affect anyone? Okay, it did, let me investigate. Okay, maybe I can ignore it.
[00:16:22] Joe Colantonio Absolutely. So isn't this a tool everyone should have? Are there challenges to implementing an observability tool, or is it just that it's new? Is it education? Is it a hard sell?
[00:16:33] Jeremy Burton Yeah, it's relatively new. It's becoming more widespread, and certainly we bump into new directors and VPs of observability quite often now. And so it's a little bit of a mindset shift in the organization, because I do believe in the idea of engineering for observability. It has been the case for many years that engineers write code and other people clear up the mess, and we are moving much more to a world where engineers have to take some responsibility for investigating the problems that are found in that code. And the best way they can do that is they can better instrument that code. And I do buy this idea that if you're going to go down the path of observability, it's not just renaming the monitoring team or the DevOps team the observability team. IT Ops became DevOps. They didn't do anything different, they just changed the name. What does DevOps mean? Development and operations working together. But many DevOps teams still feel like they're always on the hook to fix engineering problems. It's like, no, no, the Dev in DevOps should mean that they're also in the boat rowing in the same direction. I think there is definitely some work to do on building a culture of observability within a company. Then once you've got the team bought into that, you should have more instrumentation in your code. So with more instrumentation, it should be easier to find out where the problem lies. And so that to me is, we always talk about technology, but people are involved here as well. And there is a cultural aspect to this to get people to understand the world from maybe not their historical perspective. And certainly, engineers doing their share of work on call helps with that. I mean, if you're on the front lines of dealing with problems and you realize that your code hasn't got the level of instrumentation to answer the questions you need to answer, that's the best way to have an engineer run back to the desk and do it.
[00:18:25] Joe Colantonio Would then getting a system like this actually highlight what they need to focus in on for better instrumentation, almost?
[00:18:31] Jeremy Burton Yeah. There are a couple of approaches to observability. There's an approach which is, hey, if you've done logging, or if you've done custom metrics, you're doing it all wrong. Re-instrument your code with OpenTelemetry or something like that, and then everything will be perfect. I think for newer companies that's a viable approach. But I think for most larger companies, they've got years of code that is not instrumented with OpenTelemetry. And so we took the approach early on with Observe that we don't care whether it's structured OpenTelemetry events or whether it's unstructured logs. Give it all. We believe that even if you've just got your code instrumented with logs, we can make more use of those logs because we add structure to the logs after the fact. And then look, if there are blind spots, there's no magic here. If there are blind spots in the code, we can be precise about where the blind spots are. Don't re-instrument everything, but add this to this set of logs and then you'll be able to answer that question. We tend to pitch observability. We also tend to go after logs first because it tends to be a quick win. We can show the value of the product. And also I'd say particularly at this point in time, these systems are getting quite expensive. We can show folks how they can not only troubleshoot faster, but they can save a buck too.
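As an illustration of adding structure to logs after the fact, the Python sketch below parses a plain text log line into a record with named fields and pulls out any `key=value` pairs it finds. The log layout (`timestamp LEVEL service: message`) is an assumption made for the example, not a format from the episode, and this is not Observe's OPAL language.

```python
import re

# Assumed layout: "<timestamp> <LEVEL> <service>: <free-text message>"
LOG_PATTERN = re.compile(
    r"(?P<timestamp>\S+)\s+(?P<level>[A-Z]+)\s+(?P<service>[\w-]+):\s+(?P<message>.*)"
)
KEY_VALUE = re.compile(r"(\w+)=(\S+)")


def structure_log_line(line):
    """Turn one unstructured log line into a dict of named fields.

    Lines that do not fit the assumed layout are kept as raw text
    rather than dropped, so nothing is lost at ingest time.
    """
    match = LOG_PATTERN.match(line)
    if match is None:
        return {"raw": line}
    record = match.groupdict()
    # Promote embedded key=value pairs so they can be queried and joined on.
    record["fields"] = dict(KEY_VALUE.findall(record["message"]))
    return record
```

Once a field such as `customer_id` has been extracted this way, existing logs can be linked to customers or services without re-instrumenting the code. Where the field is simply absent from the message, that absence marks the kind of blind spot mentioned above.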
[00:19:52] Joe Colantonio I love that approach. A lot of times someone hears something that's cutting edge, and they may be on a very small team, a start-up, and they can implement all these. I used to work for, like, health care and insurance on legacy code. So being able to actually consume all these different formats, I guess, would be a big win for sure.
[00:20:11] Jeremy Burton I mean, my view, by the way, is that in the environments that you're familiar with, it is going to be logs with a bit of metrics for the foreseeable future. For the rest of this decade, that is going to be the predominant amount of telemetry that exists. I think OpenTelemetry is a good thing, but it's a new thing and new things tend to be adopted by new companies much, much more quickly than older, more traditional companies. But it is not the case that you can only have observability if you re-instrument your code with OpenTelemetry or something new. That should not be the case.
[00:20:44] Joe Colantonio Absolutely. So you did mention you are a newer company. I guess a little background with the company, how long you've been around and do you have any customer success stories. Like I mentioned, I think you had a claim about able to cut incident times for like 1/3 or something.
[00:20:59] Jeremy Burton Yes, we've been around just over five years. It took us about four years to build the product and then the last year to really get it to where we wanted it to be. So some things don't change in software, and I've been doing this a long time, and writing code still takes time and getting it right still takes time. I still have never seen a release that shipped early, so some things don't change. But yeah, we've got about 60 customers right now. Most of them are sort of 200 to 2000 employees in size. So you'd say sort of mid-sized companies. We're working with a couple of larger enterprises now. Our biggest customer is a big bank. They do about 120 terabytes of data a day, and we're working with them on how they troubleshoot their mobile banking application. And we're quite bullish on what we can do to improve their troubleshooting time. And that is probably the biggest use case for Observe: observability, troubleshooting, reducing time to resolve incidents. We also work with some customers on security, because in smaller companies the DevOps team tends to be the security team. If you've got a team of two or three folks, they get a hold of Observe and they realize that you can ingest logs. Well, you can ingest access logs, you can ingest security logs. We have a language in the product called OPAL, and you can go manipulate that data directly. So I'd say probably 20% of our users use the product not just for observability, but also sort of security and even compliance-side use cases because they're essentially running queries against data coming from the identity and access management systems and from the firewalls and things like that.
[00:22:35] Joe Colantonio Actually, they've run down security, so there is overlap. So it's almost like you can kill two birds with one stone. Obviously you need security, better security. So implementing observability, and especially Observe, sounds like it would help with security also.
[00:22:47] Jeremy Burton Yeah, I mean, maybe this is counter to a lot of folks. We've got a very broad definition of observability. We don't believe it just pertains to distributed applications. We believe that it should also pertain to the infrastructure. I mean, that's what the applications run on after all. We also think that, look, you can take telemetry information from a monolith as well and be able to troubleshoot that more quickly. You can take data from firewalls and make sense of those. And the term observability actually was coined back in the 60s by a guy who was studying control systems theory, a guy called Rudolf Kálmán. And it was really just about understanding or asserting what is going on inside a system from the external outputs. So it's quite a simple concept. It's like, Hey, can I figure out what's going on inside from the outputs? And security, if I've got access to outputs, which would be the logs, can I determine what's going on inside the network from that? Sure you can. So why wouldn't observability apply to security as well? I mean, I don't see why it wouldn't.
[00:23:57] Joe Colantonio How are you able to handle the data volume that we're seeing nowadays? Is it because of Snowflake? Is Snowflake a database? It's the first time I've heard of Snowflake. Like I said, I'm a newbie here.
[00:24:05] Jeremy Burton Yeah, no, that's a great question. And I think one of the fundamental differences that Snowflake has, and even though it was the biggest IPO in history, most people still don't really understand why it was such great technology. And I mean, I mentioned earlier that it can handle different data types. That's interesting. But that's not the main event. The thing about Snowflake is that to run queries, it doesn't build an index. And I was at Oracle for almost 10 years and that sort of bends your mind, that like hurts you. Because hey, if I want to find something in a book, what do I do? I'll look it up in the index. So how can it possibly be faster to not have an index? And the trick is that what Snowflake does is two things. It organizes the data it ingests in what's called partitions or micro-partitions, and then it organizes the data very efficiently in storage. And so it doesn't use an index, it scans data. But because the data is organized efficiently, it's able to eliminate a lot of the data when it's executing the query. It does what's called pruning. And so the punchline is that it's faster in many cases than an index-based database, but it doesn't build an index. It scans. The benefit then, the reason why that is so cool, is we just dump all the data we want Snowflake to query into S3, right, into Amazon. That's right. So it's cheap as chips to ingest it. We don't even charge for ingestion, and then Snowflake queries it from there. Every other system, when you ingest, it starts building an index, which gets expensive. And the more data you have, the bigger the index is, the more they have to charge. And so this really changes the economics of observability because it's fundamentally a different query engine and a different database technology.
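The scan-plus-pruning idea can be shown with a toy Python sketch. It is a simplified illustration of skipping partitions using per-partition minimum and maximum values, not Snowflake's actual implementation; the `Partition` type, the integer timestamps, and the fixed partition size are assumptions for the example.

```python
from dataclasses import dataclass


@dataclass
class Partition:
    rows: list    # list of (timestamp, message) tuples
    min_ts: int   # smallest timestamp stored in this partition
    max_ts: int   # largest timestamp stored in this partition


def build_partitions(rows, size):
    """Sort rows by timestamp and cut them into fixed-size partitions,
    recording each partition's min/max timestamp as lightweight metadata."""
    ordered = sorted(rows, key=lambda row: row[0])
    partitions = []
    for start in range(0, len(ordered), size):
        chunk = ordered[start:start + size]
        partitions.append(Partition(chunk, chunk[0][0], chunk[-1][0]))
    return partitions


def query_range(partitions, lo, hi):
    """Return rows with lo <= timestamp <= hi and how many partitions were scanned."""
    matches = []
    scanned = 0
    for partition in partitions:
        if partition.max_ts < lo or partition.min_ts > hi:
            continue  # pruned: metadata proves no row here can match
        scanned += 1
        matches.extend(row for row in partition.rows if lo <= row[0] <= hi)
    return matches, scanned
```

With 100 rows split into 10 partitions, a query for a narrow time range reads only one or two partitions and skips the rest. No per-row index is built or maintained at ingest, and the only extra data kept is two numbers per partition, which is the cost argument made above.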
[00:25:55] Joe Colantonio Okay, Jeremy, before we go, is there one piece of actionable advice you can give to someone to help them with their DevOps observability efforts? And what's the best way to find or contact you or learn more about Observe?
[00:26:06] Jeremy Burton Yeah, I think the number one topic we get inbound right now is cost. I think folks' budgets are under pressure, and rather than just looking at can I get a discount or can my vendor do this more cost-effectively, sort of pull the thread and ask why. Why is this more expensive? And usually what that comes down to is that the data volumes that people are seeing today are maybe an order of magnitude or two more than their current tooling was designed for. And so, yeah, I'd say ask the question why, and if someone can give me a discount, does that solve the problem or am I back here next year? Pull the thread on that, because what we're seeing is the architecture of incumbent tooling was designed for a different era and a different volume of data, and the economics are very, very different. So yeah, I'd say number one, if that's your pain point, pull the thread and ask why. Technically, why? Because the price list usually reflects the architecture of the product.
[00:27:12] Joe Colantonio Awesome. And the best way to learn about Observe?
[00:27:14] Jeremy Burton Best way to learn, Observeinc.com. If you go up there, obviously there's a free trial for those that are interested, and blogs and resources should give you everything from the very beginning of Observe, the founding story, right the way through to current day.
[00:27:26] And for links of everything of value we covered in this DevOps Toolchain show, head on over to TestGuild.com/p134, and while you are there, make sure to click on the SmartBear link and learn all about SmartBear's awesome solutions to give you the visibility you need to do the great software. That's SmartBear.com. That's it for this episode of the DevOps Toolchain show. I'm Joe. My mission is to help you succeed in creating end-to-end full-stack DevOps toolchain awesomeness. As always, test everything and keep the good. Cheers.
[00:27:49] Hey, thanks again for listening. If you're not already part of our awesome community of 27,000 of the smartest testers, DevOps, and automation professionals in the world, we'd love to have you join the FAM at Testguild.com. And if you're in the DevOps, automation, or software testing space, or you're a test tool provider and want to offer real-world value that can improve the skills or solve a problem for the Guild community, I'd love to hear from you. Head on over to testguild.info and let's make it happen.