Collaborative Notebooks for Debugging Your Infrastructure with Micha Hernandez Van Leuffen

About this Episode:

Want an easy way to query, visualize, and understand metrics and logs in your infrastructure? In this episode, Micha Hernandez van Leuffen, the founder of Fiberplane, shares a new way to enhance your SRE collaboration. Discover all about collaborative notebooks for resolving incidents, coordinating work related to downtime, building up a structured knowledge base, and more.

Try Fiberplane yourself: studio.fiberplane.com
Learn more about Fiberplane's features: docs.fiberplane.com

TestGuild Performance Exclusive Sponsor

SmartBear is committed to helping you release high-quality software, faster, so they created LoadNinja to make performance testing effortless. Save time by eliminating correlations with automated, browser-based performance testing that ensures your application performs reliably when it’s needed most. Try it free today.

About Micha Hernandez van Leuffen

Micha Hernandez van Leuffen is the founder of Fiberplane. He has previously exited a company focused on CI/CD to Oracle. Micha has dedicated his career to improving the workflows of developers. And now with Fiberplane, he's building a collaborative notebook platform for DevOps and SRE.

Rate and Review TestGuild Performance Podcast

Thanks again for listening to the show. If it has helped you in any way, shape or form, please share it using the social media buttons you see on the page. Additionally, reviews for the podcast on iTunes are extremely helpful and greatly appreciated! They do matter in the rankings of the show and I read each and every one of them.

[00:00:05] Get ready to discover some of the most actionable performance engineering, testing, and site reliability advice from some of the world's smartest engineers. Hey, I'm Joe Colantonio, host of the Test Guild and SRE Podcast, and my goal is to help you succeed in creating application performance awesomeness.

[00:00:26] Joe Colantonio Hey, it's Joe, and welcome to another episode of the Test Guild Performance and Site Reliability podcast. And today, we'll be talking with Micha, all about collaborative notebooks for debugging your infrastructure. If you don't know, Micha is the founder of Fiberplane. He previously exited a company focusing on CI/CD to Oracle. Micha also has dedicated his career to improving the workflows of developers. And now with Fiberplane, he's building a collaborative notebook platform for DevOps and SREs. So we're going to dive deep into this. And you don't want to miss this episode, check it out.

[00:00:59] This episode is brought to you by the awesome folks at SmartBear. Listen, we know load testing is tough but necessary. So investing in the right tools to automate tests, identify bottlenecks, and resolve issues fast saves your organization both time and money. And that's why SmartBear created LoadNinja, a SaaS load testing tool to help teams get full visibility into performance, so you can release quality software faster than ever. Make testing effortless and give it a shot. It's free and easy to try. Head on over to loadninja.com and learn more.

[00:01:31] Joe Colantonio Hey, Micha. Welcome to the Guild.

[00:01:37] Micha Hernandez van Leuffen Thanks for having me, Joe.

[00:01:39] Joe Colantonio Awesome. Before we get into it, is there anything I missed in your bio that you want the guild to know more about?

[00:01:43] Micha Hernandez van Leuffen Yeah. I also run a fund called NP-Hard Ventures, focused on investing in pre-seed and seed dev tools and infrastructure, building-blocks-for-tomorrow type companies. So same wheelhouse, but on the investing side.

[00:01:58] Joe Colantonio So that brings me to my first question, and it even relates to what I was going to ask you. I'm always curious about founders: you sold something to Oracle, and it sounds like you invest in other companies. Why start up another company?

[00:02:08] Micha Hernandez van Leuffen That's a good question. Yeah. So I guess the origin of Fiberplane came from, I would say, the frustration that we experienced in the previous company, which was Wercker, but also inside of Oracle, and that's specifically around debugging our infrastructure. We can dive into that more. But I'd say there was this thing crunching away in my head, and given some of the struggles that we had, I wanted to pick it up and start a new company around it.

[00:02:35] Joe Colantonio So the company you originally sold to Oracle was CI/CD based, and this sounds like more of an SRE solution. It seems like a lot of companies are going cloud-native. How has that changed the landscape for DevOps engineers who maybe were solely focused on CI/CD and are now doing SRE? Are there things that they're struggling with now that you think they weren't struggling with before?

[00:02:58] Micha Hernandez van Leuffen Yeah. So I'd say Kubernetes and containers and even serverless have become more prevalent. When we started Wercker, the previous company, we were super early in that container journey. The original version of Wercker was actually running on LXC, not even Kubernetes. There was no orchestrator. Once Docker came on the scene, and Kubernetes, we completely moved over. And that was around, I think, 2014 when we made that move, initially on CoreOS fleet. So we were using containers, but there was no Kubernetes yet. And then later on, in 2015, 2016, we went full-on Kubernetes. I think the Wercker story is indicative of what other companies are experiencing now, which is like, well, great. We've now got full-on containers, we're full-on Kubernetes, and now we've got a bunch of new problems. Right. And then specifically around the observability and SRE space.

[00:03:53] Joe Colantonio So it does also make it more complex for debugging applications and services as we go to Cloud Native because maybe they're not all under your control?

[00:04:00] Micha Hernandez van Leuffen Yeah, exactly. So that was actually the reason. Well, one of the reasons for starting Fiberplane was how hard it is now to debug your infrastructure. One of the challenges that we had at Wercker and inside of Oracle was like, great, we're running the CI/CD system. It's composed of a bunch of microservices, running in a multitude of containers, abstracted on top of AWS. With all these layers of abstraction and multiple moving parts, how do you reason about your infrastructure and all the services running inside of it? Right. And that for us was quite a challenge because we were doing CI/CD. So we were running users' arbitrary code inside of this distributed system, and that makes for some great downtime occasionally, or incidents, or stuff that we need to debug. And so that was part of our frustration. Like, we need better tools to reason about this stuff. And that was one of the reasons to start a new company.

[00:04:56] Joe Colantonio Nice. So also, I think maybe before cloud native, you didn't find issues in production as often as you do now, and therefore people are debugging or testing in production more often. Is that another reason why you see a need for a solution that lets people do production-based debugging and things like that?

[00:05:15] Micha Hernandez van Leuffen I think so, to some extent. I think also in CI/CD, the move towards Agile, whatever that means, has accelerated the need for that as well. We're building these things very fast. Instead of going from a 1.0 to a 2.0 version, now we're doing these incremental updates. Developer velocity has increased. You are testing in production, sort of. Obviously, you should do that, but we're moving quite fast. And as a result of that, there is this need for better tools.

[00:05:47] Joe Colantonio So, I don't know where I saw this. Something to do with dashboarding. A lot of companies focus on a dashboard like one place where everything can be done. I thought you had a blog post maybe on something on the single pane or the myth of everything in one place. Can we talk a little bit more about that?

[00:06:03] Micha Hernandez van Leuffen Yeah, I think that's one part of our thesis and the DNA of the company. And this is not to bash on the dashboarding companies out there, but dashboards are great for when you know in advance what you're going to measure. Right. Which effectively means knowing what can go wrong. It's this cockpit view of your infrastructure. You've set it up in advance. You've thought every scenario through, you've reasoned about everything, and you put this into a dashboard. And then, nowadays maybe less so, but originally, people put it on a TV inside the office that nobody looks at. And I think nowadays, and that's part of our thesis, well, that's great for the known knowns, but not so great for the unknowns. Right. And the unknowns are more prevalent now because of all of the reasons that I just mentioned. It's this move towards microservices. We're running things inside of containers. There's this orchestrator system like Kubernetes that's scheduling all of these things. We've got users acting on our services, right? You don't know everything in advance. So what we're thinking is that you need a more explorative form factor to reason about your infrastructure and your services, obviously inspired by Jupyter Notebooks and the data science space. Let's use this notebook where we can gather all these different metrics, your logs, your traces, and possibly also other types of data, your deploys, your builds, and then put that in this explorative form factor and reason about it.

[00:07:39] Joe Colantonio And so I assume that helps with collaboration. Is that one of the big benefits, that this helps SRE collaboration?

[00:07:45] Micha Hernandez van Leuffen Yeah. Yeah. So that kind of brings us to, I guess, the other origin point of the company, which is that, okay, great, there's a better form factor for looking at and debugging your infrastructure. That's one thing. But the other thing that I thought was striking is how we've benefited from all this collaborative software out there. Right. Notion for productivity, Google Docs, of course, we've got Figma in the design space, Framer for building websites in a collaborative manner. But it's weird how developer tools, and specifically DevOps tools, have not really benefited from this move toward collaborative software. I think, honestly, the things that we've got are the pull request, which is quite old by now, I think, since GitHub came around, and pair programming, collaborative programming, right, with Visual Studio Code Live Share and some of these collaborative IDEs. But other than that, we haven't really benefited as an industry, the DevOps or developer industry, from this whole move towards collaborative software. So that's the other piece that we are bringing to the table: that explorative form factor, but also in a collaborative way, where you can mention people as you would in Slack, you can start discussions, or you can assign people. And you have this system of record and shared visibility into your infrastructure and into how people are working on or iterating on that infrastructure.

[00:09:08] Joe Colantonio And I assume the reason this is also critical is, like you said, with the dashboard it's the known knowns, but SREs face a lot of unknowns about what's going on in production. So maybe you need this real-time type of collaboration. Is that why this is so important?

[00:09:21] Micha Hernandez van Leuffen Yeah. And I think the other piece of it is the status quo of you and I perhaps looking at an issue inside of our infrastructure: you start a Zoom and you share maybe your Datadog screen, which is another dashboard, and then you and I talk over it. Right. That's the status quo right now. I think COVID also accelerated that thinking quite a bit, like, okay, we need collaborative software front and center to do this in a more collaborative, transparent way with shared visibility.

[00:09:50] Joe Colantonio So I guess to really be able to dive into this more, let's talk a little bit more about Fiberplane. What is Fiberplane? And then maybe we can see how Fiberplane, or tools like it, helps with this collaboration process.

[00:10:02] Micha Hernandez van Leuffen Yeah. So imagine very much a Google Docs-like environment, right? You've got this collaborative writing and editing experience. That's one thing, right? And like Slack, as I mentioned, or Notion, you can @-mention people, tag people, and you've got your basic writing capabilities. Then where it gets interesting is we have the notion of what we call providers, which are effectively plugins that allow you to fetch data from, right now, your observability stack. Right. So we support, for instance, Elasticsearch, where you can fetch your logs, and we support Prometheus, where you can fetch your metrics. We've got a Sentry provider as well and a few others up our sleeves. So this allows you to pull all these metrics, logs, and traces into the notebook form factor and present this type of data, be it in a chart that is real-time, not a screenshot or an iframe but an actual Fiberplane-rendered chart of your metrics, or by letting you get some logs in there, display them, filter them down, and also highlight them and say, hey, these log lines seem interesting. That could be the culprit of the thing that I'm investigating. You're able to save that and build up that system of record. So that, in a nutshell, is what Fiberplane is. Then, on top of that, we have templates, very much like Notion or any other type of productivity software, where you can codify that knowledge as a template or as a runbook. Right? When you're running an investigation, go through these and these and these steps. But also, before you dive into anything, at least look at these charts or these logs by default. You're able to codify that inside the template and render these charts or logs beforehand. So a great example here would be, say, a PagerDuty alert goes off. That's something that we integrate with. Automatically a notebook gets created based off a template where you've maybe codified the service name that is down, which could be the billing API. And then we're able to render some charts related to that billing API, give you some logs related to that billing API, and then get the right people inside the notebook to start debugging that thing.
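
To make the template idea above a little more concrete, here is a minimal Python sketch of an alert turning into a pre-filled notebook. It is not Fiberplane's actual template format or API: the `Cell` and `Notebook` classes, the `INCIDENT_TEMPLATE` list, the `notebook_from_alert` function, the alert payload shape, and the two queries are all hypothetical names made up for illustration. The only idea taken from the conversation is that a template is parameterized by a service name (such as the billing API) and yields text, chart, and log cells.

```python
from dataclasses import dataclass, field
from string import Template


@dataclass
class Cell:
    kind: str      # "text", "metrics", or "logs" (hypothetical cell kinds)
    content: str   # prose, or a query a provider would run


@dataclass
class Notebook:
    title: str
    cells: list[Cell] = field(default_factory=list)


# Hypothetical runbook template: every cell may reference $service.
INCIDENT_TEMPLATE = [
    ("text", "Incident for $service. @oncall please join."),
    ("metrics", 'sum(rate(http_requests_total{service="$service",code=~"5.."}[5m]))'),
    ("logs", "service:$service AND level:error"),
]


def notebook_from_alert(alert: dict) -> Notebook:
    """Expand the template using the service named in the alert payload."""
    service = alert["service"]
    cells = [
        Cell(kind, Template(body).substitute(service=service))
        for kind, body in INCIDENT_TEMPLATE
    ]
    return Notebook(title=f"Incident: {service}", cells=cells)


if __name__ == "__main__":
    # Assumed alert shape; a real alerting payload would look different.
    notebook = notebook_from_alert({"service": "billing-api"})
    print(notebook.title)
    for cell in notebook.cells:
        print(f"[{cell.kind}] {cell.content}")
```

The design point is that the template holds the knowledge (which charts and logs to look at first, who to mention), while the alert supplies only the parameters.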

[00:12:16] Joe Colantonio So it sounds like it saves a lot of time for triage because it's only bubbling up the charts that are specific to that type of incident. And does it automatically alert people who are maybe tagged for that type of incident, so you know who should get on board to help collaborate?

[00:12:32] Micha Hernandez van Leuffen Yeah. So right now, indeed, within the product, within Fiberplane itself, people get notified that they should do something, or you can codify the @-mentions as part of the template, so that, say, Joe should always be in there and help out.

[00:12:48] Joe Colantonio Very nice. So does it do any sort of automatic triaging for you as well? Because it has all that data, does it bubble up insights? It sounds like it throws in the correct charts, but does it do anything that says, hey, maybe you need to look at this? This has been the way it was for the past month, but now we see a spike there, so that's where you should start looking or collaborating?

[00:13:08] Micha Hernandez van Leuffen Yeah, so I wouldn't say we're doing anything ML-driven yet, where we make suggestions based on certain data or spikes that we see. What we do support, and it's another good example of the provider model, the integrations, is pushing events. We have an eventing API to push events to Fiberplane, and a good example here would be deploys. So say you are using GitHub Actions for your CI/CD and you're doing deploys. Each time you do a deploy for your service, you're able to call an action, a Fiberplane action that we built, and it pumps in that deploy data, maybe with some labels attached to it as well, which is another powerful concept within Fiberplane. You label that event as billing API, this version, and what you're now capable of doing is overlaying the events with the metrics, right? So now maybe you see some kind of spike of 500s. And then before the spike you see a big fat red line, which is your deploy. Now, hey, maybe these things are correlated, right? Like, I just did a deploy of this service. The 500 error rate has gone up. Maybe these are related. So at least you're able to correlate these different data types, right? That's an important concept inside Fiberplane.
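
The overlay Micha describes, a deploy marker sitting just before a spike of 500s, can be sketched in a few lines of Python. This is only an illustration of the correlation idea, not Fiberplane's eventing API: the `Event` class, the `first_spike` and `deploys_before` functions, the threshold, the time window, and the sample data are all assumptions invented for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Event:
    timestamp: int            # seconds, on the same clock as the metric samples
    title: str
    labels: dict[str, str]    # e.g. {"service": "billing-api", "version": "1.4.2"}


def first_spike(series: list[tuple[int, float]], threshold: float) -> Optional[int]:
    """Return the timestamp of the first sample above the threshold, if any."""
    for timestamp, value in series:
        if value > threshold:
            return timestamp
    return None


def deploys_before(events: list[Event], spike_ts: int, window: int, service: str) -> list[Event]:
    """Events labeled with `service` in the `window` seconds leading up to the spike."""
    return [
        event for event in events
        if event.labels.get("service") == service
        and spike_ts - window <= event.timestamp <= spike_ts
    ]


if __name__ == "__main__":
    # Made-up 500 error rate samples (timestamp, errors per second).
    error_rate = [(1000, 0.2), (1060, 0.3), (1120, 4.8), (1180, 5.1)]
    events = [
        Event(900, "deploy", {"service": "auth-api", "version": "2.0.1"}),
        Event(1090, "deploy", {"service": "billing-api", "version": "1.4.2"}),
    ]
    spike = first_spike(error_rate, threshold=1.0)
    if spike is not None:
        for event in deploys_before(events, spike, window=300, service="billing-api"):
            print(f"{event.title} of {event.labels['version']} at t={event.timestamp}, spike at t={spike}")
```

Labels do the work here: because each event carries a service label, a deploy of an unrelated service is filtered out. A match is a hint that the deploy and the spike may be related, not proof.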

[00:14:33] Joe Colantonio Cool. So this whole concept of templates, I assume it's good for incident response. Are there any other uses for templates that you've seen become popular?

[00:14:40] Micha Hernandez van Leuffen Well, I think incident response is one thing. I would also say runbooks, like how we need to perform certain actions on our infrastructure, which might not always be an incident, like maybe some kind of upgrade or rollout. It also lets you codify best practices and processes within your organization as a template. Right. And that's not just useful for the incident response use case, but also maybe for onboarding, right? New people join the company, or new people join the team, and this is the process that we go through when we deploy a new service. So you're also capable of codifying that knowledge and building up that system of record.

[00:15:18] Joe Colantonio Very nice. And I think we also talked about how a lot of tools around observability and monitoring live in siloed environments. So it sounds like this also helps unlock these insights for the whole team.

[00:15:31] Micha Hernandez van Leuffen Yeah, exactly. I mean, that's the other piece, right, where for your logs you go to Kibana and for your metrics you go to Grafana or Prometheus. This brings all these different data sources into one form factor and one space to work in, and again makes it actionable with true collaboration.

[00:15:52] Joe Colantonio And this is different than a dashboard because it's real-time specific to what's happening in your environment at that time.

[00:15:57] Micha Hernandez van Leuffen Yeah, I think dashboards are still great for information gathering, right? I think if you look at the categories of observability, you've got the alerting piece, the notification piece, right, PagerDuty as an example. You've got Datadog in the dashboard space, or Grafana. It's great for information gathering, looking at, again, that cockpit view of the entire system. But when it comes to action, when you need to do something, you want actionable intelligence related to that specific service across different types of data. And again, it could be metrics, logs, traces, but also, as I pointed out, we support this eventing mechanism. It allows you to bring all these disparate types of information into one place and make it actionable.

[00:16:39] Joe Colantonio How long has Fiberplane been around?

[00:16:40] Micha Hernandez van Leuffen Yeah, it's a good question. So I think we started the company around two years ago, a little under two years, with not a single line of code written. So we've been building for quite some time. In all honesty, there's a lot to build, as we're building both this rich text editing environment, right, as you would come to expect from a Notion or a Google Docs, and at the same time doing this infrastructure piece. So we're kind of building two startups at the same time, I would say. Turns out, building a rich text editing experience is not trivial. So we've invested a lot of time and effort building our own editor and, as I said, the provider plugin model, being able to gather data from these different sources, these different systems. Yeah, there's a lot to build.

[00:17:25] Joe Colantonio So there's a lot to build. Is the reason you didn't have a line of code that you wanted to learn as you went along rather than have a preconceived notion of what people want? Did you have a preconceived notion of what you were going to build, and then over two years of watching customers use it, realize, actually, here is where they really need help?

[00:17:42] Micha Hernandez van Leuffen Yeah, exactly. So I think we had this thesis around the form factor. When we speak about Fiberplane, you can almost visualize what that means. Even if you just say collaborative notebooks for DevOps and SREs, you can kind of fathom what that could look like. Right. But then indeed we did a lot of customer discovery and talked to a lot of companies about how they're dealing with their infrastructure, how they've set it up, how they're dealing with incident response and debugging their services, what types of tools they're using, all of that. And then we adjusted the roadmap accordingly, and this is where we are now.

[00:18:20] Joe Colantonio Cool. So we're almost at the end of October. I always ask people this regardless of what the interview is about: for 2023, is there anything you see on the horizon that site reliability engineers need to be aware of, or maybe get on board with before it takes off?

[00:18:36] Micha Hernandez van Leuffen Yeah, I mean, obviously Fiberplane will be coming out in a couple of weeks. I'm very excited about that. So we'll be doing our launch so people can sign up and take it for a spin. And in particular, we're super interested in getting feedback around what providers, what plug-ins, people are interested to see. Again, right now we've got Elasticsearch, we've got Prometheus, we've got Loki, we've got Sentry. CloudWatch has been much requested in talking to our first users. So we're very interested to hear from the folks out there what they want to see in terms of new integrations to build.

[00:19:12] Joe Colantonio So what's the model going to be? Do you have an open-source version? Is it all a paid solution? Is it both?

[00:19:19] Micha Hernandez van Leuffen Yeah. So initially it will be freemium. You can just take it for a spin, and then I expect us to introduce the business model in the new year. We've got some ideas on what that should look like, but we're obviously also very keen to get feedback from the users and talk about pricing and what seems reasonable and all that. And then on the open source front, we will be open-sourcing our provider stack, the plugin model, the SDK or provider development kit to build new providers, and open-sourcing our templating stack as well, alongside our CLI, which we haven't really talked about yet. So we also have a companion command line interface, which is quite interesting. Apart from the usual stuff, where it allows you to interact with the Fiberplane managed service, so you can create notebooks and initiate or publish templates and initiate triggers and create notebooks from the triggers, all of that, it also has two interesting features. fp run is one thing, and this is for a scenario where, usually, when people are debugging their infrastructure, there might be AWS and dashboards and all of that, but usually they also operate from the terminal, right? They're behind their laptops, typing things, using all these different command line tools to interact with the infrastructure. kubectl is a good example. But the downside of that, from a team perspective, is that I have no idea what Joe just typed to help get the system back to normal. Right. So we've got this thing called fp run, which allows you to pipe the commands to a notebook, right? So you're able to type kubectl get pods or get logs and pipe that into the notebook. But on top of that, we can parse the data. So instead of just rendering plain text, we can understand that, hey, this is actually a log, and we're able to visualize it as a log. We're able to filter it as you would in a logging product. So that's interesting. And then the other thing that we have is ..... So imagine the same scenario, but it's more for long-running sessions, right? You're doing a bunch of stuff from within the terminal. You're able to record that entire session, all the commands that you're typing, for posterity. So your teammates can later see, hey, this is the path that Joe went through.
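
As a rough illustration of capturing a terminal command's output and recognizing it as logs rather than plain text, here is a small Python sketch. It does not show how the Fiberplane CLI works internally; `run_and_capture`, `parse_logs`, `LogRecord`, and the assumed line shape of timestamp, upper-case level, then message are all hypothetical. A tiny Python one-liner stands in for a real command such as a kubectl call, whose actual output format depends on the application writing the logs.

```python
import re
import subprocess
import sys
from dataclasses import dataclass

# Assumed log shape: "<timestamp> <LEVEL> <message>". Real logs vary widely.
LOG_LINE = re.compile(r"^(?P<ts>\S+)\s+(?P<level>[A-Z]+)\s+(?P<msg>.*)$")


@dataclass
class LogRecord:
    timestamp: str
    level: str
    message: str


def run_and_capture(command: list[str]) -> str:
    """Run a command and return its stdout, as a shared notebook cell might store it."""
    result = subprocess.run(command, capture_output=True, text=True, check=True)
    return result.stdout


def parse_logs(output: str) -> list[LogRecord]:
    """Keep only the lines that look like log records and structure them."""
    records = []
    for line in output.splitlines():
        match = LOG_LINE.match(line)
        if match:
            records.append(LogRecord(match["ts"], match["level"], match["msg"]))
    return records


if __name__ == "__main__":
    # Stand-in for a real infrastructure command; it prints two fake log lines.
    fake_command = [
        sys.executable, "-c",
        "print('2022-10-20T10:00:00Z INFO started');"
        "print('2022-10-20T10:00:05Z ERROR payment failed')",
    ]
    records = parse_logs(run_and_capture(fake_command))
    for record in records:
        if record.level == "ERROR":   # filter as you would in a logging product
            print(record.timestamp, record.message)
```

Once output is structured like this, filtering by level or time range becomes a simple operation on records instead of a text search, which is the benefit described in the answer above.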

[00:21:38] Joe Colantonio Oh, that's cool. It's almost like a knowledge base. So if an incident happens again, you're like, wow, all right, this is what fixed it before. Can you replay it, so it will automatically type the commands again for you and all that?

[00:21:47] Micha Hernandez van Leuffen Yeah, exactly right. You codify that process and indeed build up that system of record. And of course, maybe you can save that as a template. And then the next time something happens, you know what to do.

[00:21:59] Joe Colantonio Love it. So is there anything else we missed in terms of functionality? I mean, the command line sounds like the right analogy for developers. They're always using the command line. This seems like a normal extension.

[00:22:10] Micha Hernandez van Leuffen Yeah, exactly. No, I think we covered most of it. The providers, again: we're starting off with observability tools, obviously, so support for Elastic and Prometheus. And as I said, CloudWatch is coming up. But it's going to be interesting looking into non-observability providers, right? GitHub is a great example. Obviously we're looking into Slack as well, being able to interact with Fiberplane from within Slack.

[00:22:35] Joe Colantonio So as I mentioned in the preshow, I am not a site reliability engineer. So if a site reliability engineer is listening to this, what is the thing you find makes people gravitate towards Fiberplane, or want to try it out when it's officially released? Is there one game-changing piece of functionality where, when site reliability engineers hear it, they're like, oh, that makes sense, I need to get my hands on this?

[00:22:57] Micha Hernandez van Leuffen It's just easy to get started with, right? If you look at the product, it's a very friendly environment, which is actually geared toward perhaps stressful situations, if something is down. It's a tagline that we use a lot: when everything is on fire, it's actually a very pleasant experience to use, because it looks very much like a sort of consumer enterprise product. It has great usability and a good developer experience.

[00:23:25] Joe Colantonio So it seems like when the fire alarm goes off, rather than looking all over, like, okay, where do I look? Where do I start? This will help get them in line and focused?

[00:23:35] Micha Hernandez van Leuffen And as a team allows you to remain calm and collected.

[00:23:42] Joe Colantonio Nice. Love it. Okay Micha, before we go. Is there one piece of actionable advice you can give to someone to help them with their SRE testing efforts? And what's the best way to find or contact you?

[00:23:50] Micha Hernandez van Leuffen You can contact me via Twitter. I'm @Mis on Twitter and Fiberplane is @Fiberplane on Twitter. So feel free to reach out with a message. And in terms of advice, just sign up for the product, I would say. We're happy to do an onboarding call and get the feedback. I'm curious to see you go through it, and looking forward to having you help us inform our roadmap.

[00:24:19] Thanks again for your performance testing awesomeness. For links to everything of value we covered in this episode, head on over to testguild.com/p102, and while you're there, make sure to click on the try them both today link under the exclusive sponsor's section to learn all about SmartBear's two awesome performance test tool solutions, LoadNinja and LoadUI Pro. And if the show has helped you in any way, why not rate and review it in iTunes? Reviews really do matter in the rankings of the show, and I read each and every one of them. So that's it for this episode of the Test Guild Performance and Site Reliability Podcast. I'm Joe. My mission is to help you succeed in creating end-to-end full-stack performance testing awesomeness. As always, test everything and keep the good. Cheers.

[00:25:05] Hey, thanks again for listening. If you're not already part of our awesome community of 27,000 of the smartest testers, DevOps, and automation professionals in the world, we'd love to have you join the FAM at Testguild.com. And if you're in the DevOps, automation, or software testing space, or you're a test tool provider and want to offer real-world value that can improve the skills or solve a problem for the Guild community, I'd love to hear from you. Head on over to testguild.info and let's make it happen.
