Do you want to build more resilient software systems? Do you like to break things on purpose? If so, this episode is for you. We’ll be Test Talking about Chaos Engineering with Tammy Butow, a principal site reliability engineer at Gremlin. Tammy will explain how to test the ways in which your system responds to stress so you can identify and fix failures before they impact your customers—saving you and your company the embarrassment of software downtime, bad publicity and lost revenue. Listen up!
About Tammy Bütow
Tammy Butow is a principal SRE at Gremlin, where she works on chaos engineering—the facilitation of controlled experiments to identify systemic weaknesses. Gremlin helps engineers build resilient systems using their control plane and API. Previously, Tammy led SRE teams at Dropbox responsible for the databases and storage systems used by over 500 million customers and was an IMOC (incident manager on call), where she was responsible for managing and resolving high-severity incidents across the company. She has also worked in infrastructure engineering, security engineering, and product engineering. Tammy is the co-founder of Girl Geek Academy, a global movement to teach one million women technical skills by 2025.
Tammy is Australian and enjoys riding bikes, skateboarding, snowboarding, and surfing. She also loves mosh pits, crowd surfing, metal, and hardcore punk.
Quotes & Insights from this Test Talk with Tammy Bütow
- Back in 2010, the Netflix team created something called Chaos Monkey. Chaos Monkey became really popular because it came out right when Netflix was doing its massive migration to the cloud. What they wanted was to make sure that whenever engineers were building something on AWS, they didn't just assume that all of those machines would be there all the time.
- It's really exciting, since Gremlin is the first company to actually build something like that: something with a web UI and an API that allows you to programmatically run all of your chaos engineering experiments, including disaster recovery testing, which we call region failover in the chaos engineering world.
- We did actually recently build a new product called ALFI, which is application-level fault injection. And that's really a new concept. It's much more advanced. Usually people start at the infrastructure level, which we call ILFI (infrastructure-level fault injection), where you are injecting failure on the actual servers or containers, or injecting network-related failure. But the idea here is to actually inject failure into your application.
- The idea here is that you can do very precise fault injection. You can match against any attribute you're already using, so after you integrate ALFI into the application you can precisely scope attacks to things like customer IDs, locations, and device types. And that means that you can create chaos engineering experiments with a really small blast radius.
- One of the big things that we've been focusing on over the last year is really helping people learn how to get started with chaos engineering, because technically it can be quite complicated and a little bit scary before you start doing it. But once you've been doing it for a long time, you'll become very good at it and you will no longer be scared to run chaos engineering attacks.
- This is one of my big pieces of advice, and it is going to sound a little scary, but I like to say that it's important to focus on critical systems, not low-hanging fruit. That's my personal advice, and the reason is that if you inject failure into a critical system, you're able to show people why it's important. You can actually show them the positive impact you can have, because you're more likely to identify important things that you need to fix.
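To make the attribute-scoped fault injection Tammy describes a little more concrete, here is a minimal Python sketch. It is not Gremlin's ALFI API. Every name in it (`Experiment`, `InjectedFault`, `handle_request`, and the `customer_id` and `device` attributes) is a hypothetical stand-in, chosen only to show how matching on request attributes keeps the blast radius small.

```python
import random
from dataclasses import dataclass, field


class InjectedFault(Exception):
    """Raised in place of a real failure when an experiment matches a request."""


@dataclass
class Experiment:
    # Attributes a request must carry to be in scope (hypothetical names).
    match: dict
    # Fraction of in-scope requests that should fail, from 0.0 to 1.0.
    failure_rate: float = 1.0
    rng: random.Random = field(default_factory=random.Random)

    def applies_to(self, request_attrs: dict) -> bool:
        # A request is in scope only if every matched attribute is equal.
        in_scope = all(request_attrs.get(k) == v for k, v in self.match.items())
        return in_scope and self.rng.random() < self.failure_rate


def handle_request(request_attrs: dict, experiments: list) -> str:
    # Check for an active experiment before doing the real work.
    for experiment in experiments:
        if experiment.applies_to(request_attrs):
            raise InjectedFault(f"injected failure for {request_attrs}")
    return "ok"


# Scope the experiment to one test customer on one device type.
experiment = Experiment(match={"customer_id": "test-123", "device": "ios"})
```

With this setup, only requests from customer `test-123` on `ios` raise `InjectedFault`. Every other request takes the normal path, which is the "small blast radius" idea from the quote above. Lowering `failure_rate` would narrow the experiment further by failing only a fraction of the in-scope requests.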
Connect with Tammy Bütow
- Twitter: @tammybutow
- LinkedIn: tammybutow
- Company: Gremlin
Rate and Review TestTalks
Thanks again for listening to the show. If it has helped you in any way, shape or form, please share it using the social media buttons you see on the page. Additionally, reviews for the podcast on iTunes are extremely helpful and greatly appreciated! They do matter in the rankings of the show and I read each and every one of them.
Powered By SauceLabs
Test Talks is sponsored by the fantastic folks at Sauce Labs. Try it for free today!