Observability Testing Using OpenTelemetry with Swapnil Kotwal

By Test Guild

About This Episode:

In today's episode, we explore OpenTelemetry, a powerful tool that is transforming how you observe and diagnose your systems in production.

Check out Test Observability yourself now: https://testguild.me/browserstack

Have you ever faced unexplained system failures or performance issues and wondered how to identify and resolve them quickly?

Our guest, Swapnil Kotwal, Test Lead Engineer at SailPoint Technologies, brings his 12 years of expertise to shed light on how OpenTelemetry unifies fragmented data silos into a seamless observability stack. Swapnil guides us from the foundations of observability to practical implementations in distributed systems, showing how to achieve unparalleled clarity and flexibility in our infrastructure.

This episode is valuable, especially for test automation teams dealing with flaky tests and fragmented tools. So, let's embrace the observability revolution and future-proof our strategies together. Stay tuned!

Exclusive Sponsor

Test automation teams grapple with several daily challenges that hinder their efficiency and effectiveness. They often miss genuine test failures due to flaky tests and are forced to perform defect triage based on guesswork. The lack of a reliable measure for automation quality exacerbates the issue, while the time-consuming process of debugging test failures drains valuable resources. As a result, they frequently resort to repetitive manual test runs for verification, undermining the very purpose of automation.

Making it even more complex is the fragmented automation testing stack, where one has to hop onto multiple tools for test health tracking, test reporting, and test failure analysis. Triaging and debugging test failures after every run takes hours.

That’s why BrowserStack built Test Observability.

Test Observability provides visibility into every test execution in your CI pipelines, utilizes AI to remediate failed tests, and enables proactive improvement of test suite reliability. It helps with a bunch of things like:

  • Smart test reporting & build insights in real-time
  • Track all automated tests in one place – UI, API, or unit tests
  • Detect flaky tests, always-failing tests, and new failures
  • Debug failed tests faster with AI
  • and more!

Check it out for yourself now at https://testguild.me/browserstack.

About Swapnil Kotwal


For twelve years, I've navigated the ever-evolving tech landscape, building robust test frameworks for startups and security-critical infrastructure at SailPoint. From open-source projects like TestNG, Appium, and Gatling to performance engineering at the helm, my journey has been one of constant learning and innovation. Today, I'm laser-focused on architecting vendor-agnostic observability stacks, ensuring performance insights fuel data-driven decision-making. Join me as I share my insights on building the future of reliable and scalable systems.

Connect with Swapnil Kotwal

Rate and Review TestGuild

Thanks again for listening to the show. If it has helped you in any way, shape, or form, please share it using the social media buttons you see on the page. Additionally, reviews for the podcast on iTunes are extremely helpful and greatly appreciated! They do matter in the rankings of the show and I read each and every one of them.

[00:00:00] In a land of testers, far and wide they journeyed. Seeking answers, seeking skills, seeking a better way. Through the hills they wandered, through treacherous terrain. But then they heard a tale, a podcast they had to obey. Oh, the Test Guild Automation Testing podcast. Guiding testers with automation awesomeness. From ancient realms to modern days, they lead the way. Oh, the Test Guild Automation Testing podcast. With lutes and lyres, the bards began their song. A tune of knowledge, a melody of code. Through the air it spread, like wildfire through the land. Guiding testers, showing them the secrets to behold. Oh, the Test Guild Automation Testing podcast. Guiding testers with automation awesomeness. From ancient realms to modern days, they lead the way. Oh, the Test Guild Automation Testing podcast. Oh, the Test Guild Automation Testing podcast. With lutes and lyres, the bards began their song. A tune of knowledge, a melody of code. Through the air it spread, like wildfire through the land. Guiding testers, showing them the secrets to behold.

[00:00:34] Joe Colantonio Have you ever encountered unexplained system failures or performance issues in production and asked yourself, how do I diagnose and test problems quickly and accurately to avoid downtime? That's what this episode is all about. Today, I want to share with you an Automation Guild session from our last online conference about the power of OpenTelemetry. This session shows how OpenTelemetry unifies fragmented data silos into a seamless observability stack, future-proofing your infrastructure for unparalleled clarity and flexibility. This is an area I think more testers need to be involved in and aware of, so definitely check it out all the way to the end. And if you'd do me a favor, just leave a comment or review and let me know what you think about these straightforward, session-style podcast episodes. I'd like to know your thoughts and really appreciate it.

[00:01:23] Attention, test automation teams! Are you tired of flaky tests, endless debugging, and fragmented tools? Introducing BrowserStack's Test Observability, your all-in-one solution for smarter testing. It's cool because you can track all your tests in one place, detect flaky tests instantly, debug failures faster with their advanced AI-powered insights, and improve your automation stability with real-time analytics and custom alerts. No more guesswork, just results. With over 25 years in test automation, I've seen it all, and that's why I trust BrowserStack to streamline your workflow and help boost your productivity. Ready to transform your testing? Support the show and visit testguild.me/browserstack and see the difference for yourself.

[00:02:13] Swapnil Kotwal Hey Guild, I'm Swapnil Kotwal, test lead engineer at SailPoint. For 12 years I have navigated the ever-evolving test landscape, building robust test frameworks for startups and for security-critical infrastructure at SailPoint. From open-source projects like TestNG, Appium, and Gatling to performance engineering at the helm, my journey has been one of constant learning and innovation. Today, I'm laser-focused on architecting vendor-agnostic observability stacks, ensuring that performance insights fuel data-driven decision making. The observability landscape is undergoing a seismic shift, and OpenTelemetry is the driving force. Imagine a world where fragmented data silos disappear, replaced by a unified stream of metrics, logs, and traces that seamlessly reveals the inner workings of your system. That's the power of OpenTelemetry. It is a vendor-neutral framework that shatters the chains of proprietary tools, granting you the freedom to choose the best-fit tools. It's not just about data collection, it's about future-proofing your observability strategy. OpenTelemetry empowers both developers and operations teams with simplified instrumentation and streamlined analysis, fostering collaboration and crystal-clear insights. Don't get left behind in the observability revolution. Embrace OpenTelemetry and unlock the future of understanding your systems, with unparalleled clarity and flexibility. So today's topic is distributed system observability using OpenTelemetry. In today's SaaS world, systems are distributed, and hundreds of microservices are delivering your product. You need that clarity, you need that correlation between your traces, logs, and metrics, and that's where OpenTelemetry comes into the picture. I'll talk about myself a bit: that's what I usually look like, and you can follow me on the different platforms as well. Let's talk about OpenTelemetry, or telemetry in general. If you look at any spacecraft traveling through space, there is telemetry sitting on it and reporting on its state; you need certain instruments deployed on the spacecraft so that you can monitor or observe its condition. In a similar fashion, cloud-native applications need telemetry, and based on that telemetry you are able to monitor them. And that's how it looks. Is observability a new thing? No, it's not very new. We have been monitoring our systems using many vendors, like Datadog, Prometheus, Grafana, and so many others. But there was never a unified way of instrumenting your source code and exporting or emitting those traces, logs, and metrics, and that's the problem OpenTelemetry solves. It doesn't just define how you should instrument your source code; how you emit that data, how you collect it, and how you send it to your representation layer all come under OpenTelemetry. In the SaaS world, continuous CI/CD or continuous testing is not a very new thing, but this is a new space: we need to constantly monitor our systems and see how they are doing. Before our customers can feel the heat, we should know the state of the system so that we can mitigate before anything goes wrong. That's where continuous observability comes into the picture, and OpenTelemetry helps make it possible.

[00:06:08] Swapnil Kotwal This is the big picture. When you make your system observable, then you are able to monitor it. You need to put some payload on your SaaS system so that it emits the status of your service or application, so that you are able to emit those traces; that is the bigger umbrella of monitoring. These are the basic pillars of observability. In OpenTelemetry terms, we call them signals, and these three are not the only signals it has. These three are the ones we usually see as the representation of our data, but there are two more, baggage and context, which basically help to propagate context from one part of the stack to another; to propagate the context, baggage is used. Let's not go too deep into the technical part of it. These are the foundation pillars that OpenTelemetry has. You need logs so that you can list out the problems; you need request-scoped traces, which can show you the whole journey of your request, starting from the first service down to the databases and back again as the response is served, and, throughout this journey, why anything that is going wrong is happening; and you also need metrics to see, for example, whether CPU utilization is growing aggressively or anything else is happening that is outside our control. Metrics help in those cases as well. These are the pillars when your system becomes observable.
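To make the three signals concrete, here is a minimal Java sketch (not from the talk) of one request emitting a trace span, a metric, and a log record, with a baggage entry on the context. It assumes the OpenTelemetry Java API and a Log4j logger bridged by the OpenTelemetry Log4j appender; the instrumentation scope name "checkout-demo" and the attribute and metric names are invented for illustration.

```java
import io.opentelemetry.api.GlobalOpenTelemetry;
import io.opentelemetry.api.baggage.Baggage;
import io.opentelemetry.api.metrics.LongCounter;
import io.opentelemetry.api.metrics.Meter;
import io.opentelemetry.api.trace.Span;
import io.opentelemetry.api.trace.StatusCode;
import io.opentelemetry.api.trace.Tracer;
import io.opentelemetry.context.Scope;
import org.apache.logging.log4j.LogManager;
import org.apache.logging.log4j.Logger;

public class CheckoutHandler {
  // "checkout-demo" is an illustrative instrumentation scope name.
  private static final Tracer tracer = GlobalOpenTelemetry.getTracer("checkout-demo");
  private static final Meter meter = GlobalOpenTelemetry.getMeter("checkout-demo");
  private static final LongCounter orders =
      meter.counterBuilder("orders.placed").setUnit("1").build();   // metric signal
  private static final Logger log = LogManager.getLogger(CheckoutHandler.class);

  public void placeOrder(String orderId) {
    // Trace signal: one span for this request.
    Span span = tracer.spanBuilder("placeOrder").startSpan();
    // Baggage signal: key/value pairs that travel with the context across services.
    Baggage baggage = Baggage.current().toBuilder().put("order.id", orderId).build();
    try (Scope spanScope = span.makeCurrent(); Scope baggageScope = baggage.makeCurrent()) {
      span.setAttribute("order.id", orderId);
      orders.add(1);
      // Log signal: with the OpenTelemetry Log4j appender installed, this record
      // carries the current trace_id and span_id, which is what ties the pillars together.
      log.info("order placed");
    } catch (RuntimeException e) {
      span.recordException(e);
      span.setStatus(StatusCode.ERROR);
      throw e;
    } finally {
      span.end();
    }
  }
}
```

Because the metric, the log record, and any child spans are created inside the span's scope, a backend can correlate them by trace id instead of you stitching them together by timestamp.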

[00:07:54] Swapnil Kotwal I will talk about the basic differences between them, and I also have one interesting finding here. Observability means you need to make your system observable, and then it becomes monitorable; anything observable, we can monitor. With observability you ask, what is my system doing? With monitoring you ask, is my system working or not? The most beautiful part is that when some event happens, OpenTelemetry emits the traces, logs, or metrics, whereas with other platforms we have to constantly keep polling at certain intervals to check whether there is anything new to pull into the representation layer. In monitoring, traces are mostly not there; you have logs, events, and metrics. But there are again some challenges, like it won't scale as much as we want it to. When your systems are growing, you need proper observability. That is how they differ. With monitoring you know something is wrong; with observability you have the data to tell you what is wrong, how it happened, and why it happened. You can debug things without going in and touching the platform; you have data that tells you the whole story of what exactly is wrong with your system. Let me show how OpenTelemetry came into the picture. The brief history is that Uber did its own research for many years and came up with an open-source project called OpenTracing. It was mostly a specification for how your client-side platform should instrument your source code. On the other side, Google and other companies were working on the OpenCensus project; they were a bit ahead of OpenTracing, trying to define the specification and interfaces as well as the implementation, and they also came up with data collection and exporters. Around 2019 they came together, joined forces, and formed one common open-source project called OpenTelemetry. OpenTelemetry is the second largest CNCF project at this moment, after Kubernetes, and that is how active the community is. Okay, so let's see why we need observability. You want to reduce the mean time to reproduce an issue or to act on it, and whatever testing burden you have, you want to reduce that; you can get full control of it using OpenTelemetry. These are some of the topics I have listed out. You can collect that data and have one system that unifies your stack. Say, for example, some of your microservices are running on Azure, some on GCP, some on AWS, and maybe they are written in different programming languages. If you don't have one common solution that is compatible across vendors, platforms, and programming languages, you won't be able to streamline the whole system or see the big picture. That's where OpenTelemetry comes in: almost all programming languages are supported, the standards and protocols are the same, vendor neutral and platform agnostic, and the language libraries are consistent across platforms. One of the representation tools you can use with OpenTelemetry is Jaeger, which shows you what the traces look like and how you can read them to solve a problem. I will talk about it with an actual example. Say, for example, this is what any e-commerce website looks like.
You have a front end, the web browser, where you are trying to purchase something. The request goes to the checkout service, then the shipping service, then to payment, and it adds the item to the cart; this is how it ends up sending the confirmation email. You need to see how the journey of that particular request happened and find where the problem is, if there is one. That is where traces come into the picture. If you look at this waterfall tree, it shows the time, the type of request, which microservice handled it, and how much time was spent by that particular request in that particular method or in that microservice as a whole. This representation is what we call the trace, along with the time taken by the trace, or the request, to complete, and its breakdown underneath: whichever microservices were involved and the functions they called. You have to take care of this in the programming, by instrumenting it, as we will see. In a similar fashion, say you also want to see the metrics. If there is some lag happening at the checkout service, you need metrics to see, okay, CPU is reaching 100%, and only then do you know you need to look into it, that you need to increase the CPU or maybe the heap size. Similarly, if a new code change goes in and a lot of 500 responses are being served to the front end, the metrics will tell you that. In the case of logs, say something happens; what exactly happened? The logs tell the story: there is something wrong with the checkout service because the service is running out of memory and is not able to serve the incoming requests. You would see those logs, but you need to have all three pillars unified. You need to know their correlation, and you shouldn't have to do it manually; that is basically what we are going to solve here. It's not just a microscope, it's the clarity under that microscope: what exactly is happening. That is the data being emitted from the microservices, and for data collection there are standard agents and cloud integrations that OpenTelemetry provides. It also has two types of instrumentation: you can do manual instrumentation, where you actually modify the code to emit traces, logs, and metrics, or you can use the libraries to do auto-instrumentation. There are low-code or no-code ways to do your instrumentation and emit those traces, and there are no cardinality limits; that's the beauty of it. It reduces the burden on any company that is looking for this observability. Previously there were different libraries, implementations, data structures, and formats for each of these pillars, tracing, metrics, and logs; there was no single unified framework or platform, and that is what OpenTelemetry replaces. That's the beauty of it, and that is what we call cloud-native telemetry. I already mentioned this is the second most active project in CNCF today, so you don't need to worry about it; development is going really well and things keep moving.
What are the things you can solve with it? Since you know the breakdown of your request's journey, from the very top layer down through all the microservices and back again, you are able to identify where most of the time was spent by a particular request, and you have the logs and traces correlated with each other. You are able to see what the problem is, how to solve it, and how to make decisions on it; that's what the data gives you. At this moment, the available solutions are platform dependent and vendor dependent, which makes it cumbersome to integrate multiple platforms, programming languages, and services. Those are the gaps being solved by OpenTelemetry. As I mentioned, to solve those problems you need to decide what your goals are and how you will achieve them, and you need to work as a whole team on the problems in front of you. Let's talk about the architecture. As I mentioned, there is the specification, the API, the SDK, and some helpful data. The collectors are the ones that actually collect the data and send it to the representation tools, and there are vendor-agnostic client libraries for your instrumentation. All the major languages and platforms are supported at this moment, and you are able to seamlessly emit all of these pillars, traces, metrics, and logs, end to end on any platform. This is basically your application. We basically deploy an agent, a jar in case you are using Java, which emits the traces that are collected by the OTel Collector. The OTel Collector is a kind of sidecar, or an agent, that runs along with your application and sends that data to the upper layer. You may have multiple programming languages in use and multiple microservices here, and those microservices are being used to deliver this e-commerce website. This is the demo website, which sells telescopes and binoculars; I can go and add an item to the cart, review the cart, and place the order, and this website is being served by multiple services and microservices. And this is Jaeger, where we can see the traces that are happening. Let me expand the time window to maybe three hours and see whether I get any traces. This is what a trace looks like: the request originated at zero microseconds and took this much time in total; that is the representation of how much time it took to serve this request, and the breakdown of it. If I go here, I can see how much time each piece took: this one took 1.16 milliseconds, the next one took 100 microseconds, and so on. If you are able to see where the time has gone, you are able to debug it. There are also views that show the architecture; at this moment it is not rendering very nicely, but if you go here, it should show something like this, the relationships between your microservices and how they connect. It also shows the physical memory and how much time has been spent. These are some of the Grafana dashboards that we built using the OpenTelemetry data being sent to them. These are the microservices being used, and this is the request rate over the selected time range:
how many requests per second are being served by a particular service, so that we know which service is extremely busy in our application and needs to be autoscaled or further enhanced. You get that clarity here. The latency is also represented, and if any errors are happening, the error rates too; and there are a lot of other metrics we can look at.
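As a rough sketch of the two instrumentation styles mentioned above: auto-instrumentation in Java usually means attaching the OpenTelemetry Java agent at startup, while manual instrumentation means creating spans yourself. The code below is an illustrative, hand-written version of the checkout waterfall described in this segment, assuming the OpenTelemetry Java API; the span and scope names are invented, and in a real shop each child operation would live in its own service.

```java
import io.opentelemetry.api.GlobalOpenTelemetry;
import io.opentelemetry.api.trace.Span;
import io.opentelemetry.api.trace.SpanKind;
import io.opentelemetry.api.trace.Tracer;
import io.opentelemetry.context.Scope;

public class CheckoutFlow {
  // Auto-instrumentation alternative: start the service with the OpenTelemetry Java agent,
  //   java -javaagent:opentelemetry-javaagent.jar -jar shop.jar
  // and common HTTP/database spans are created for you. Below is the manual version.
  private static final Tracer tracer = GlobalOpenTelemetry.getTracer("shop-demo");

  public void checkout() {
    // Parent span: the whole checkout request (the top row of the Jaeger waterfall).
    Span checkout = tracer.spanBuilder("checkout").setSpanKind(SpanKind.SERVER).startSpan();
    try (Scope ignored = checkout.makeCurrent()) {
      // Each child span becomes one bar nested under the parent in the waterfall view.
      runChild("shipping-quote");
      runChild("charge-payment");
      runChild("send-confirmation-email");
    } finally {
      checkout.end();
    }
  }

  private void runChild(String name) {
    // Because the parent is the current span in the context, this span is
    // automatically recorded as its child, which is what builds the tree.
    Span child = tracer.spanBuilder(name).startSpan();
    try (Scope ignored = child.makeCurrent()) {
      // ... call the downstream work here; in a distributed setup the trace context
      // is propagated in HTTP headers (W3C traceparent) so the tree spans processes.
    } finally {
      child.end();
    }
  }
}
```

In Jaeger, the width of each child bar relative to the parent is exactly the "where did the time go" breakdown discussed above.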

[00:20:14] Swapnil Kotwal There are some metrics which actually show what exactly is wrong with your system: when the peak happened, and how the data was handled in that case. These are the metrics and their representation.

[00:20:28] Swapnil Kotwal The first thing we saw here was the traces; the second thing we saw was the metrics. If you look at Grafana, the Grafana panels are served by Prometheus, Jaeger, or OpenSearch. Here it is Prometheus that is basically showing the uptime of your system, how consistently it was running and whether there were any hiccups. If I select Jaeger as the data source, it shows the breakdown of your traces, the journey of your request, and here it basically showcases your logs with whatever problem statements you have. This is how you can use this seamless integration: everything carries a span id and a trace id, using which you can unify the data. You can search across the logs, and you can look into the traces and the metrics over time, and you are able to make sense of what exactly happened with your system. I will quickly demo how simple it is. If you look at this example with Log4j, we are using OpenTelemetry with Log4j as the logging provider, and what we are doing is simply emitting a record that represents what happened with this particular request, an error message with or without a span. As written at the bottom, if you want to showcase the span, it will show the span along with the error, info, or debug messages. This is where we instantiate our OpenTelemetry instance. If you look at it, there are three providers: a tracer provider, where you can build your own tracing with some strategy and include it; a logger provider, where you set up what kind of logging you want and how you want to process the records, in sequence or in batches; and, in a similar fashion, a meter provider. You set them all on the same instance of the OpenTelemetry SDK, and the platform starts emitting all three pillars of your system so that you can see them side by side. Here is one interesting tool: for example, if you want to troubleshoot some issue, you can simply go and check it from the dropdown shown here. In that fashion you are able to identify how much time was taken, 49 seconds were spent, but where did that time go, and what did my request look like? You need to be very careful while instrumenting: how many of the traces you are creating are meaningful traces? Say, for example, here I am emitting a lot of traces which don't make any sense. In this case, if you look at this trace, it originated here, and it shows certain methods: this method started somewhere here, but it is not part of the sub-methods that are here or here. The error originated here, which means, I think, the username and password were not correct. This is how you are able to pinpoint the problem, and that's the beauty of traceability; that is how you are able to solve the problem of troubleshooting production. With that I am concluding, and hopefully this session will help you with the testing, troubleshooting, and monitoring of your system health in production. So let's do it. These are some of the references I would say one should go and start with. Thank you so much. If you have any queries, any problems, or any feedback, please reach out to me at the email address given here or on my social media handles. Thank you so much.
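For reference, a condensed sketch of the SDK setup described in this segment, with the tracer, logger, and meter providers all registered on one OpenTelemetry instance, might look like the following. It assumes the OpenTelemetry Java SDK with OTLP exporters; the collector endpoint and service name are placeholders, and the Log4j appender wiring is only noted in a comment rather than shown.

```java
import io.opentelemetry.api.common.AttributeKey;
import io.opentelemetry.api.common.Attributes;
import io.opentelemetry.api.trace.propagation.W3CTraceContextPropagator;
import io.opentelemetry.context.propagation.ContextPropagators;
import io.opentelemetry.exporter.otlp.logs.OtlpGrpcLogRecordExporter;
import io.opentelemetry.exporter.otlp.metrics.OtlpGrpcMetricExporter;
import io.opentelemetry.exporter.otlp.trace.OtlpGrpcSpanExporter;
import io.opentelemetry.sdk.OpenTelemetrySdk;
import io.opentelemetry.sdk.logs.SdkLoggerProvider;
import io.opentelemetry.sdk.logs.export.BatchLogRecordProcessor;
import io.opentelemetry.sdk.metrics.SdkMeterProvider;
import io.opentelemetry.sdk.metrics.export.PeriodicMetricReader;
import io.opentelemetry.sdk.resources.Resource;
import io.opentelemetry.sdk.trace.SdkTracerProvider;
import io.opentelemetry.sdk.trace.export.BatchSpanProcessor;

public final class TelemetryBootstrap {
  public static OpenTelemetrySdk init() {
    // service.name is what Jaeger/Grafana group everything under; "checkout-service" is a placeholder.
    Resource resource = Resource.getDefault().merge(
        Resource.create(Attributes.of(AttributeKey.stringKey("service.name"), "checkout-service")));

    // Everything is shipped to a local OTel Collector over OTLP; swap the endpoint for your setup.
    String endpoint = "http://localhost:4317";

    SdkTracerProvider tracerProvider = SdkTracerProvider.builder()
        .setResource(resource)
        .addSpanProcessor(BatchSpanProcessor.builder(   // batch processing, as mentioned in the talk
            OtlpGrpcSpanExporter.builder().setEndpoint(endpoint).build()).build())
        .build();

    SdkMeterProvider meterProvider = SdkMeterProvider.builder()
        .setResource(resource)
        .registerMetricReader(PeriodicMetricReader.builder(
            OtlpGrpcMetricExporter.builder().setEndpoint(endpoint).build()).build())
        .build();

    SdkLoggerProvider loggerProvider = SdkLoggerProvider.builder()
        .setResource(resource)
        .addLogRecordProcessor(BatchLogRecordProcessor.builder(
            OtlpGrpcLogRecordExporter.builder().setEndpoint(endpoint).build()).build())
        .build();

    // One SDK instance carrying all three providers, so traces, metrics, and logs
    // share the same resource and context propagation.
    OpenTelemetrySdk sdk = OpenTelemetrySdk.builder()
        .setTracerProvider(tracerProvider)
        .setMeterProvider(meterProvider)
        .setLoggerProvider(loggerProvider)
        .setPropagators(ContextPropagators.create(W3CTraceContextPropagator.getInstance()))
        .buildAndRegisterGlobal();

    // To route Log4j records through this SDK (so they carry trace_id/span_id), attach the
    // OpenTelemetry Log4j appender from the opentelemetry-java-instrumentation project.
    return sdk;
  }
}
```

The three setters on OpenTelemetrySdk.builder() are the tracer provider, logger provider, and meter provider trio from the demo; batch versus sequential processing is chosen where the span and log record processors are built.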

[00:24:23] Joe Colantonio Thank you, Swapnil, for the OpenTelemetry awesomeness. For links to everything we covered in this episode, head on over to testguild.com/a507, and make sure to check out this episode's awesome sponsor, BrowserStack, and how they can help you with test observability. That's it for this episode of the Test Guild Automation Podcast. I'm Joe, and my mission is to help you succeed in creating end-to-end, full-stack automation awesomeness. As always, test everything and keep the good. Cheers!

[00:24:56] Thanks for listening to the Test Guild Automation Podcast. Head on over to Testguild.com for full show notes, amazing blog articles, and online testing conferences. Don't forget to subscribe to The Guild to continue your testing journey.

{"email":"Email address invalid","url":"Website address invalid","required":"Required field missing"}