About this DevOps Toolchain Episode:
Welcome back to another episode of the DevOps Toolchain podcast! In today's episode, Ivan Merrill, the head of solutions engineering at Fiberplane in Amsterdam, joins us. Ivan has 15 years of experience helping financial enterprises understand monitoring and observability, making him the perfect person to discuss streamlining the DevOps experience. Listen in as Ivan shares his insights on the fascinating topic of reliable and complex systems, drawing inspiration from industries like aviation and healthcare. We'll dive into the importance of Service Level Objectives (SLOs) and how they can help developers and operators determine what counts as good or bad system reliability. Ivan also introduces Fiberplane's collaborative notebooks for SREs and their innovative Autometrics tool. So grab your headphones and get ready to streamline your DevOps experience!
TestGuild DevOps Toolchain Exclusive Sponsor
Get real-time data on real-user experiences – really.
Latency is the silent killer of apps. It's frustrating for the user and under the radar for you, and it's easily overlooked by standard error monitoring. But now BugSnag, one of the best production visibility solutions in the industry, has its own performance monitoring feature: Real User Monitoring. It detects and reports real-user performance data, in real time, so you can rapidly identify lags. Plus it gives you the context to fix them. Try out BugSnag for free today. No credit card required.
About Ivan Merrill
Ivan Merrill is Head of Solutions Engineering at Fiberplane, based in Amsterdam. Before joining Fiberplane, Ivan spent 15 years in large financial enterprises helping teams understand the power of monitoring and observability, whilst leading large-scale deployments of monitoring tools on rather important banking systems.
Connect with Ivan Merrill
- Company: www.fiberplane.com
- Blog: www.fiberplane.com/blog
- LinkedIn: www.linkedin.com/in/ivan-merrill
- Twitter: twitter.com/AlmightyGiraff
Rate and Review TestGuild DevOps Toolchain Podcast
Thanks again for listening to the show. If it has helped you in any way, shape or form, please share it using the social media buttons you see on the page. Additionally, reviews for the podcast on iTunes are extremely helpful and greatly appreciated! They do matter in the rankings of the show and I read each and every one of them.
[00:00:01] Get ready to discover some of the most actionable DevOps techniques and tooling, including performance and reliability, from some of the world's smartest engineers. Hey, I'm Joe Colantonio, host of the DevOps Toolchain Podcast, and my goal is to help you create DevOps toolchain awesomeness.
[00:00:19] Hey, it's Joe, and welcome to another episode of the Test Guild DevOps Toolchain. Today, we'll be talking with Ivan Merrill all about streamlining the DevOps experience. If you don't know, Ivan is the head of Solutions Engineering at Fiberplane, based in Amsterdam. Before joining Fiberplane, he actually spent about 15 years at large financial enterprises helping their teams really understand the power of monitoring and observability. He really knows his stuff. He also led large-scale deployments of monitoring tools on some rather important banking systems, so he has a bunch of experience I think you're really going to get a lot of value from. You really want to stick around if you want to learn how to streamline your DevOps practice with SLOs and all the cool capabilities Fiberplane has. You don't want to miss this episode. Check it out.
[00:01:07] Hey, if your app is slow, it could be worse than down. It could be frustrating. And in my experience, frustrated users don't last long. But since slow performance is subtle, it's hard for standard error monitoring tools to catch. That's why BugSnag, one of the best production visibility solutions in the industry, has a way to automatically watch for these issues: real user monitoring. It detects and reports real user performance data in real time, so you can quickly identify lags. Plus, you get the context of where the lags are and how to fix them. Don't rely on frustrated user feedback. Find out for yourself. Go to BugSnag.com and click on the free trial button. No credit card required. Support the show and check them out.
[00:01:54] Joe Colantonio Hey, Ivan, welcome to The Guild.
[00:01:58] Ivan Merrill Thank you for having me.
[00:01:59] Joe Colantonio Great to have you. I always ask everyone, is there anything in your bio that I missed that you want the Guild to know more about?
[00:02:05] Ivan Merrill I don't think so. I think that pretty much sums it up, really. I'm passionate about monitoring and observability. That's kind of my job. I kind of fell into it, and I've been doing it for a long time now.
[00:02:18] Joe Colantonio Love it. What drew you to Fiberplane, then? What made you want to join Fiberplane and really build something?
[00:02:25] Ivan Merrill Honestly, the first thing is that it's a really cool product. I really like the idea. Fiberplane's core product is collaborative notebooks for SREs. Having been in large incidents, I'd seen that every team has their own data sources and goes away to check their own things, which could be private, so actually collaborating on anything was quite hard. You invariably ended up with lots of links to data and screenshots posted in your chat tool of choice, and that was quite hard to follow through and understand what was going on. I think the idea of having the whole incident investigated in the open is a great way to build trust amongst teams, which is so important.
[00:03:17] Joe Colantonio Absolutely. So you have over 15 years of experience, and I think SRE is newer than that. What has the evolution been like? When we talk about observability, I used to be involved with performance testing, and it was more like application performance monitoring, but that seems to have morphed. Any thoughts on SRE and this new kind of way of monitoring?
[00:03:36] Ivan Merrill Yeah. One of the things that always interests me with a lot of this stuff is that a lot of the principles we talk about in the SRE world have actually been practiced by really large-scale organizations for years, particularly in the banking world, where a failure comes with a potentially heavy fine and front-page news. Things like chaos engineering, we used to do that as part of our go-live process, where someone would try to break your thing, see how it failed, and you needed to work out the resiliency. We used to do regular site switches between data centers. All of these reliability practices have been around for many years, but I think SRE really codifies a lot of this stuff and makes it more accessible to people who aren't working in huge organizations with the resources to spend on it. In terms of the evolution, when I first started in the monitoring and observability world, it was all infrastructure monitoring: CPU thresholds, the old BMC tooling back in the day. It was very much check-based: run a check, is this CPU above a threshold? No? Everything's good. And you do that over and over again forever. It was very, very hard to equate any of it to user experience. Then obviously we got into the APM world, and that was great. I used to go around saying to everyone, your users don't care what your CPU metric looks like, they care what their experience is. It was interesting to listen to that real user monitoring ad at the beginning of this show, because that was one of the things we were interested in rolling out, and that focus is absolutely right. Now, the evolution into observability is really big for me, because it's no longer this check-based thing. Application performance management is great, and you do generate a lot of data with different aspects to it. But the idea of observability, understanding the state of your system and being able to interrogate it, is really powerful. And it moves on the way people think about these things: these tools and practices aren't just something to use in an incident, they actually drive a lot of other really important engineering processes.
[00:06:17] Joe Colantonio Love it. It almost seems like a mindset shift. Like you said, we used to look at CPU usage, disk usage, and memory, but now it's more about users. Do you think one of the reasons for that is that we're moving towards cloud-native applications? Before, we had systems under our control where we knew, okay, this can handle that. But now who knows what containers are scaling up, what services we're using that are outside our control, and we're like, I couldn't even test this if I wanted to.
[00:06:45] Ivan Merrill Yeah, absolutely. We've mentioned the word DevOps, which I think means by law we have to talk about the cattle-versus-pets analogy. And it's exactly that, right? The CPU and memory monitoring and all the architecture we used to run didn't matter so much back in the day, when there was a sysadmin who understood that pet. When I give talks I literally have a picture of my cat, Stephen, and it's like, I know what time he wakes up, what time he likes to go to bed, what time he likes his meals; if I give him a little tickle under the chin, I know his behavior. The mental model we could have of these systems when they were much smaller was more complete. I've been fascinated looking at some of the work around reliable systems, Dr. Richard Cook on how complex systems fail; all of this stuff is really, really interesting. We're moving into a space of complex systems where no one has a complete mental model, and that requires a fundamental shift in how we approach things and in what we do to actually detect whether they're working as expected, and where and why they're not.
[00:08:02] Joe Colantonio Absolutely. Is that another reason why we see a rise in service-level agreements and objectives? We've had them before, but it seems like they're really transforming the way we may have done alerting traditionally. I know you've written a few blog posts on this as well. Can you dive in a little bit on SLOs, how we can understand them better, and why they might be the preferred choice for a lot of businesses?
[00:08:26] Ivan Merrill Yeah, service-level objectives, SLOs. They come out of the SRE world originally. I like to think of them as a bar of quality that a team chooses to hold itself to. I'm building a system; what is the bar of quality I hold myself to? That's always how I've personally chosen to explain it. And that's an important distinction, because service-level agreements are things you put into contracts. You sign them with your customers; they are often legal. When I was running a monitoring team, our vendors gave me SLAs. SLOs are what we choose to hold ourselves to, and that is a very important distinction. Done right, they're based on what is important to your customer. So what you're trying to say is: give me some SLIs, service level indicators, for the really important interactions my customer has with my service or application, and what is the bar of quality we hold ourselves to for those interactions? There are a number of different things you can use here; the number of errors you serve to your customers and the latency are the two most obvious ones. It's about not being perfect, because we all know there's no service that's 100% available. It might be for a short period, but it's impossible to keep up. You can't have 100% perfection, but we often talk about four nines or five nines or however many nines you want. We want maybe 99.99% of our customers to get a successful request, and that is the bar of quality we wish to hold ourselves to. What it's doing is giving us a way to say, for every single interaction a customer has, was this good or bad? Did they get a 200 OK, or did they get a 500 or above, or however you choose to think about it? And that's really important. One of the things I found most powerful about going around engineering teams and implementing them is that it forces a level of thought that perhaps hasn't happened before. They might have thought, we've got nice HTTP error codes, that's great, but which ones do we care about? What if a customer types their password wrong? I've seen people have that pop up as an error, but that isn't an error to us. The HTTP code is appropriate for it, but we don't want to judge it as an actual error, so we need to exclude it from our calculations. What are the things we can definitively say are good, and what does that leave that's essentially definitively bad? That's a really powerful concept you can apply and start measuring and doing all kinds of things with. So yeah, it's a hugely powerful concept and literally books have been written on it, but at a simple level, it's a bar of quality that you choose to hold yourself to.
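To make the good-or-bad framing concrete, here is a minimal sketch in Python of the calculation Ivan describes. The status-code rules (counting a wrong-password 401 as good) and the traffic numbers are illustrative assumptions, not Fiberplane's implementation; each team has to pick its own rules.

```python
# Sketch: classify each request as good or bad for an availability SLI.
# Which responses count against the bar of quality is a team decision;
# here we assume 5xx is bad and everything else (including a 401 from
# a mistyped password) is good.

def is_good(status_code: int) -> bool:
    return status_code < 500

def sli(status_codes: list[int]) -> float:
    """Fraction of requests judged good -- the service level indicator."""
    good = sum(1 for code in status_codes if is_good(code))
    return good / len(status_codes)

# Hypothetical day of traffic: 9,990 OKs, 5 wrong passwords, 5 server errors.
codes = [200] * 9_990 + [401] * 5 + [500] * 5
print(f"SLI: {sli(codes):.3%}")  # 99.950% -- meets a 99.9% SLO, misses 99.99%
```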
[00:11:43] Joe Colantonio I love that term, bar of quality. How do we know our SLOs are aligned with what the customer really needs? Sometimes we have an idea of what our customer wants and needs, and we get so focused on it that we don't change based on what we're observing in production. How do you make sure they're aligned with your customers?
[00:12:02] Ivan Merrill Yeah. With difficulty. It's not easy. I think the first thing to recognize is that SLOs aren't permanent. They evolve, and your service evolves. I mentioned Dr. Richard Cook briefly; he talked about humans being the adaptive part of the system. We are constantly introducing change to our systems: new features, releases, maybe changing our architecture. So with that, it's right that our service level objectives change. If we add something that makes us more reliable, and we consistently are more reliable, then maybe we can increase our service level objective, if we think we can regularly meet it. But ultimately, it's about being not perfect, but good enough, and that will differ from service to service and interaction to interaction. Ultimately, what we're trying to do is make our service reliable enough. There's a great idea that I firmly believe in, that reliability is any product's number one feature. If it's not reliable, users won't trust it. If they don't trust it, they're going to look for alternatives. And if your system has no users, what's the point in it existing? So we need to make sure it's reliable enough. Now, what is reliable enough? That's a very difficult question, and I don't think I could provide a single answer, because it will be different for different systems. I've actually known services in the banking world that provided a response too quickly. Think about a decision like, are we going to accept this person as a customer? If someone receives a "no" instantly, they don't believe the computer really considered it. So in that particular case, a really low latency target is entirely unimportant; in actual fact, you almost want to add in a delay. It's entirely dependent on the customer interaction. And part of my advice to engineering people pushing out SLOs is: talk to other people in your business. Talk to your product owners, your marketing people, your customer support people. It's a bar of quality that you are providing, and it's what your team can provide, but it needs to be set in conjunction with other people. There's no point in me saying my bar of quality is four nines when our marketing team is expecting five nines and talking about us being the most reliable whatever. There's not a single right answer, but it needs to be agreed with everyone else. And normally, I think it's pretty quick to find out if your users are unhappy.
[00:15:02] Joe Colantonio No, it's a great point, especially that example: you give the customer what they're expecting, and even though it's technically slower, it doesn't feel wrong to them. I love that example. How do we get this information fed into our SRE teams and our DevOps teams? Does it all hinge on observability, on how it's baked into the infrastructure and applications, and on how we're able to get this data to come up with better and better SLOs?
[00:15:28] Ivan Merrill You're asking an observability person whether it's all based on observability. Yes, it absolutely is. Observability is always the answer to everything. I'm not sure that's entirely true, but there's that quite famous quote, isn't there: what you don't measure, you can't improve. Whether that's strictly true is up for debate, but it's certainly very hard to improve something you don't measure. There has to be a grounding in it. Now, that's not to say you need a full-on, absolutely perfect monitoring system. To me, fundamentally, for the actual math of the SLO, you need a way of deciding, for everything you're measuring, normally user requests, whether it's good or bad. Because ultimately, one of the most powerful aspects of service level objectives is providing a common language to the entire business around reliability and whether we were reliable enough. And in the DevOps world, that means actually aligning devs and ops on: was this good or bad? So at a simplistic level, if you have the ability to take whatever your unit of measurement is and say this one was good and this one was bad, then you have enough. And hopefully that will evolve into more granular or better quality SLOs. It's certainly something you can start quite simply and then evolve as you get more experience.
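One nice consequence of the simple good-or-bad starting point is that an error budget falls straight out of it. A back-of-the-envelope sketch, where the traffic volume is a made-up assumption:

```python
# Sketch: the error budget implied by a 99.9% ("three nines") SLO
# over a 30-day window. The traffic figure is hypothetical.

SLO_TARGET = 0.999
requests_per_month = 100_000_000

error_budget = (1 - SLO_TARGET) * requests_per_month
print(f"Budget: {error_budget:,.0f} bad requests per month")  # 100,000

# The same budget expressed as full-outage time, assuming steady traffic:
minutes_per_month = 30 * 24 * 60
print(f"Or about {(1 - SLO_TARGET) * minutes_per_month:.0f} minutes of downtime")  # ~43
```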
[00:17:03] Joe Colantonio Great. The last time I spoke with the folks at Fiberplane, I think it was November, and I noticed recently you have a new observability offering called Autometrics. We've talked about SLOs and why observability is so important. Is Autometrics one of the solutions someone can use to help with observability? I believe it's open source as well.
[00:17:22] Ivan Merrill Absolutely. And that's part of the reason we created it and put it out there for the world to use. When speaking to lots of different companies about the situation in their organization around monitoring and observability, how well their applications were instrumented, we kept seeing similar patterns of issues. Companies were struggling to get good instrumentation in the first place, because maybe developers were unsure how to do it or didn't know what to instrument. Even things like naming your metrics are difficult. And then, once you've got those metrics, how do you go about querying them? There are so many different tools, each with its own specific query language, and everything else like that. So what Autometrics is, is function-based metrics for a few different languages. You decorate your functions within your code base, within your IDE, and from that you automatically get the number of invocations of that function, the number of errors from those invocations, and the latency of those invocations. And because it's quite a curated way of doing this, because the metrics that are created are given very specific names, you can actually follow it through. If you know the number of errors and the number of invocations, you can work out a success rate percentage. And if you know the latency and you're given a latency target, then again you can work out, are we meeting that target? Which means SLOs can be very quickly generated based on those metrics, and alerts based on those SLOs. It's a great way to get started, a really great way to get some pretty good observability very simply. There are a couple of really big benefits to me. One thing I really like is that it takes away observability being this problem to overcome, this thing that requires a lot of effort, because actually it's really simple to go and decorate your functions with the Autometrics decorator, not have to think about anything else, and from that get good quality metrics, particularly metrics that make sense. I briefly touched on mental models at the beginning. For someone who is writing the code, who is building the system, their mental model is based on how they interact with it, and that's their code. That's the functions they write. Providing metrics in that paradigm, metrics that make sense to them and align with their mental model, means there isn't really a translation to be done between the code someone writes and the metrics that come out. That's incredibly powerful in helping people quickly and easily consume those metrics and actually do useful things with them.
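For a sense of what decorating a function actually looks like, here is a minimal sketch using the Python flavor of Autometrics. It follows the project's documented API as best as it can be reconstructed here (the Objective names in particular may have evolved), so treat the exact names as approximate and check autometrics.dev for the current form:

```python
# pip install autometrics -- sketch of the Python decorator.
from autometrics import autometrics
from autometrics.objectives import Objective, ObjectiveLatency, ObjectivePercentile

# An SLO the generated metrics can be checked against: 99.9% of calls
# succeed, and the 99th-percentile latency stays under 250 ms.
API_SLO = Objective(
    "api",
    success_rate=ObjectivePercentile.P99_9,
    latency=(ObjectiveLatency.Ms250, ObjectivePercentile.P99),
)

@autometrics(objective=API_SLO)
def create_user(name: str) -> dict:
    # Invocation count, error count, and latency are now recorded for
    # this function automatically, under metrics named after it.
    return {"name": name}
```

The mental-model point Ivan makes shows up right here: the metrics are named after `create_user`, the thing the developer actually wrote, rather than after some infrastructure concept.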
[00:20:28] Joe Colantonio Now, this just popped into my head: is this also a way for SREs to open up communication with developers? If developers are decorating their functions, it almost opens up a dialogue between SREs and developers. It's almost like a backhanded way to get them to talk to one another. Like I said, it just popped into my head.
[00:20:45] Ivan Merrill Absolutely. In my experience, it can be quite hard to go out to engineering teams and say, you need to instrument your code. It's not actually a very fun task. No one really wants to say, today we're going to be adding metrics to our code base; that's not something people look forward to. But being able to add in this decorator whenever you touch your code base, maybe you're refactoring some code, maybe writing a new feature, and simply get the metrics means it's really easy. And as I said, firstly you're getting some really great metrics that make sense to both the SREs and the developers. I also think it helps the SREs understand the code base better, because the metrics they're consuming are based on the functions in the code. Again, it's this idea of helping people speak the same language, which is really powerful.
[00:21:39] Joe Colantonio Absolutely. How does this all tie in, then? I think one of the main objectives of Fiberplane is to help people streamline their focus, especially around prioritizing with service level objectives and capabilities like that. Obviously, with all these metrics baked in, it's going to give you more information to prioritize with. But is there anything else folks can do with Fiberplane to help them with service level objectives, or anything in this area?
[00:22:07] Ivan Merrill Yeah, it's a good question. SLOs are the basis of so many important things. Done right, they're not just a way to say we're reliable enough or not reliable enough. We briefly touched on the fact that you can base your alerting on the back of them, and alert fatigue is a very real problem in the SRE world. But in terms of streamlining the overall process for developers and DevOps, firstly there's prioritization: you're able to better understand, do I need to prioritize reliability right now or not? And with the Fiberplane notebooks, you've got a continual record. You've got an incident you're going through, and hopefully your application is instrumented with Autometrics, so you can see things like your error rate and therefore your burn rate, which is another concept, maybe for another day, going into the complexities of service level objectives. You start to see which incidents are costing you a lot of time and a lot of money. I think of the Fiberplane notebooks as incident artifacts: as you're investigating an incident, you're recording that investigation as you go, which I think is really cool, and as I said, one of the reasons I wanted to get involved with this company. If you're doing all that, you can start to make more data-driven decisions, like, do we need to invest in our release process? Because the last five incidents that took so much time were all around releases. So is it right that we put in more automated testing earlier on in our continuous integration process, all that kind of stuff? SLOs start off as a way of measuring reliability, but once you bring them in and use them more thoroughly throughout the organization, as you get more mature with them, as you start using them as the basis for, do I have an incident, for investigating those incidents, for understanding what is causing us to breach our SLO target, they become a much wider, more powerful thing that can bring a huge amount of benefit. I could talk about this for a long, long time, but there are definitely lots of opportunities.
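Burn rate, which Ivan mentions in passing, rewards a small worked example: it measures how fast you are spending your error budget relative to plan. A sketch; the 14.4 paging threshold is a common multi-window alerting convention, not a Fiberplane-specific rule:

```python
# Sketch: burn rate = observed error rate / error rate the SLO allows.
# A burn rate of 1.0 spends the whole budget exactly over the SLO window.

def burn_rate(bad: int, total: int, slo_target: float) -> float:
    allowed_error_rate = 1 - slo_target      # e.g. 0.001 for 99.9%
    return (bad / total) / allowed_error_rate

rate = burn_rate(bad=72, total=10_000, slo_target=0.999)
print(f"Burn rate: {rate:.1f}x")             # 7.2x faster than budgeted
if rate > 14.4:                              # assumed fast-window threshold
    print("Page someone -- the budget will be gone in hours")
```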
[00:24:40] Joe Colantonio Absolutely. It sounds like a lot of data. Just like performance testing, there's a lot of data, and sometimes it's hard to tell, okay, what does this mean, or which one should I prioritize? Any tips on how to turn these complex SLOs and metrics into clear, actionable insights? Is any sort of visualization baked in, or anything like that, to help a person figure these out and triage quickly?
[00:25:04] Ivan Merrill Yeah. Certainly, if we come back to Autometrics, that open source project has what's called the Explorer, which is a really nice, simple UI that says: these are your service level objectives, these are the ones on track, these are the ones off target. And again, don't forget that each of your service level objectives, done correctly, is based on customer experience, and it comes down to a very simple percentage. If you've got two SLOs and one of them is at 95% and one is at 97%, it's a pretty clear thing; this is part of the common language they provide. One is better than the other. So there are things like that, and there are actually even some pre-built dashboards provided as well. Once you get to the point of collecting the data, there are a number of different ways you can visualize it, and I think everything is easier with a picture. Getting a clear view is really important. The Explorer, pre-built dashboards; there you go.
[00:26:15] Joe Colantonio Awesome. Okay, Ivan, before we go, is there one piece of actionable advice you can give someone to help with their DevOps and SRE efforts? And what's the best way to find or contact you, or learn more about Fiberplane?
[00:26:27] Ivan Merrill Okay, so the first one is probably the easier one. A couple of things: we've talked a lot about Autometrics, that's autometrics.dev. Get involved, it's open source. We'd really love to see people there; come join our Discord. And then you can go to Fiberplane.com to sign up for the notebooks. In terms of one bit of advice for SREs, one I've given before is that this is quite a new practice, and we all have to understand that we're continually learning it. Don't be afraid of failure, and understand that it's part of the learning process. I think service level objectives really epitomize that: you might start off with one service level objective, but it will continually evolve, and you should continually evolve it. That's something we must all accept. I always like to think that reliability in our tech world is pretty new compared to aviation or healthcare and the like. A lot of the people we're taking inspiration from come from those worlds, and they have an awful lot more experience with it, and a lot to teach us. We're constantly evolving it. Don't be scared of it. Give it a go and understand it. It probably won't be perfect the first time, but that's okay.
[00:27:53] And for links to everything of value we covered in this DevOps Toolchain show, head on over to TestGuild.com/p132, and while you're there, make sure to click on the SmartBear link and learn all about SmartBear's awesome solutions that give you the visibility you need to deliver great software. That's SmartBear.com. That's it for this episode of the DevOps Toolchain show. I'm Joe, and my mission is to help you succeed in creating end-to-end, full-stack DevOps toolchain awesomeness. As always, test everything and keep the good. Cheers.
[00:28:28] Hey, thanks again for listening. If you're not already part of our awesome community of 27,000 of the smartest testers, DevOps, and automation professionals in the world, we'd love to have you join the fam at TestGuild.com. And if you're in the DevOps, automation, or software testing space, or you're a test tool provider and want to offer real-world value that can improve skills or solve a problem for the Guild community, I'd love to hear from you. Head on over to TestGuild.info and let's make it happen.