
    Break Things on Purpose

    A podcast about site reliability engineering (SRE); Chaos Engineering; and the people, processes, and tools used to build resilient systems. Sponsored by Gremlin. Find us on Twitter at @BTOPpod.

    Episodes (49)

    Maxim Fateev and Samar Abbas


    In this episode, we cover:


    • 00:00:00 - Introduction
    • 00:04:25 - Cadence to Temporal
    • 00:09:15 - Breaking down the Technology
    • 00:15:35 - Three Tips for Using Temporal
    • 00:19:21 - Outro






    Transcript

    Jason: And just so I’m sure to pronounce your names right, it’s Maxim Fateev?

    Maxim: Yeah, that’s good enough. That’s my real name but it’s—[laugh].

    Jason: [laugh]. Okay.

    Maxim: It’s, in American, Maxim Fateev.

    Jason: And Samar Abbas.

    Samar: That sounds good.

    Jason: Welcome to another episode of Build Things on Purpose, part of the Break Things on Purpose podcast. In our build episodes, we chat with the engineers and developers who create tools that help us build and operate modern applications. In this episode, Maxim Fateev and Samar Abbas join us to chat about the problems with orchestrating microservices and the software they’ve created to help solve those problems.

    Jason: Hey, everyone, welcome to Break Things on Purpose, the podcast about reliability, chaos engineering, and all things SRE and DevOps. With me today, I’ve got Maxim Fateev and Samar Abbas, the cofounders of a company called Temporal. Maxim, why don’t you tell us a little bit more about yourself?

    Maxim: Hi, thanks for inviting me to the podcast. I have been around for quite a while. In 2002, I joined Amazon, and Amazon was a pretty small company back then, I think 800 developers; compared to its current size it was really small. I quickly moved to the platform team, among other things. And it wasn’t AWS back then, it was just the software platform, actually called [Splat 00:01:36].

    I worked in a team which owned the old publish-subscribe technologies of Amazon, among other things. As a part of the team, I oversaw the implementation of service-oriented architecture, Amazon [unintelligible 00:01:47] roll out services at large scale, and they built services for [unintelligible 00:01:51] Amazon, so all this asynchronous communication was something my team was involved in. And I knew by this time that this was not the best way to build large-scale service-oriented architectures, relying on asynchronous messaging, just because it’s pretty hard to do without central orchestration. And as part of that, our team conceived and then later built the Simple Workflow Service. And I was the tech lead for the public release of the AWS Simple Workflow Service.

    Later, I also worked at Google and Microsoft. Later I joined Uber. Samar will tell his part of the story, but together we built Cadence, which was, kind of, the open-source version based on the same ideas as the Simple Workflow. And now we are driving the Temporal open-source project and the company forward.

    Jason: And Samar, tell us a little bit about yourself and how you met Maxim.

    Samar: Thanks for inviting us. Super excited to be here. In 2010, I basically wanted to make a switch from traditional software development, like it used to happen back at Microsoft, to try out the cloud side of things. So, I ended up joining the Simple Workflow team at AWS; that’s where I met Maxim for the first time. Back then, Maxim had already built a lot of messaging systems, and then saw this pattern where messaging turned out—[unintelligible 00:03:08] believe that messaging was the wrong abstraction to build a certain class of applications out there.

    And that is what started Simple Workflow. And then being part of that journey, I was, like, super excited. Since then, in one shape or another, I’ve been continuing that journey, which we started back then in the Simple Workflow team while working with Maxim. So, later in 2012, after shipping Simple Workflow, I basically ended up coming back to the Azure side of things. I wrote this open-source library by the name of Durable Task Framework, which, it looks like, the Azure Functions team later ended up adopting to build what they call Azure Durable Functions.

    And then in 2015, Uber opened up an office here in Seattle; I ended up joining their engineering team in the Seattle office, and out of coincidence, both me and Max ended up joining the company right about the same time. Among other things we worked on together, around 2017, we started the Cadence project together, which you can think of as a very similar idea to Simple Workflow, but kind of applying it to the problems we were seeing back at Uber. And one thing led to another, and now we are here, basically, continuing that journey in the form of Temporal.

    Jason: So, you started with Cadence, which was an internal tool or internal framework, and decided to strike out on your own and build Temporal. Tell me about the transition of that. What caused you to, number one, strike out on your own, and number two, what’s special about Temporal?

    Maxim: We built Cadence from the beginning as an open-source project. And it never was, like, Uber management came to us and said, “Let’s build this technology to run applications reliably,” or workflow technology or something like that. It was absolutely a bottoms-up creation. And we are super grateful to Uber that those types of projects were even possible. But we practically started on our own, we built the first version of it, and we got resources later.

    And [unintelligible 00:05:09] just absolutely grew bottoms-up adoption within Uber. It grew from zero to over a hundred use cases within the three years that this project was hosted by our team at Uber. But also, it was an open-source project from the beginning; we didn’t get much traction the first, kind of, year or two, but then after that, we started to see awesome companies like HashiCorp, Box, Coinbase, and Checkr adopt us. And there are a lot of others, it’s just that not all of them are talking about that publicly. And when we saw this external adoption, we started to realize that, within Uber, we couldn’t really focus on external adoption; because we believe this technology is very widely applicable, we needed, kind of, a separate entity, like a company, to actually drive the technology forward for the whole world.

    Like, the most obvious thing: you cannot have a hosted version [unintelligible 00:06:00] at Uber, right? We would never create a cloud offering, and everyone wants it. So, kind of, one thing led to another, as Samar said, and we ended up leaving Uber and starting our own company. And the main reasoning was that we wanted to actually make this technology successful for everybody in the whole world, not just within Uber. Also, among the, kind of, non-technical but also technical reasons, one of the benefits of doing that was that we had actually accumulated a pretty large technical debt when running Cadence, just because we were in it for four years without a single backwards-incompatible change; since [unintelligible 00:06:37] production, we were still on the same cluster with the same initial users, and we never had downtime, at least, like, without relat—infrequent outages.

    So, we had to do everything in backwards-compatible manner. At Temporal, we could go and rethink that a little bit, and we sp...
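    The central-orchestration idea Maxim describes—a workflow engine driving each step, recording progress, and retrying failures, instead of services coordinating through ad-hoc asynchronous messages—can be sketched in plain Python. This is a toy illustration of the concept behind Simple Workflow, Cadence, and Temporal, not any of their actual APIs; every name below is made up.

    ```python
    def orchestrate(steps, state, max_retries=3):
        """Toy workflow orchestrator: run named steps in order, skipping steps
        already recorded as complete, and retrying transient failures."""
        for name, step in steps:
            if name in state["completed"]:  # durable progress: never re-run a step
                continue
            for attempt in range(max_retries):
                try:
                    state["results"][name] = step()
                    state["completed"].add(name)
                    break
                except Exception:
                    if attempt == max_retries - 1:
                        raise  # give up after max_retries attempts
        return state["results"]

    # Example: an order flow where the payment step fails once, then succeeds.
    calls = {"charge": 0}

    def reserve_inventory():
        return "reserved"

    def charge_card():
        calls["charge"] += 1
        if calls["charge"] == 1:
            raise RuntimeError("payment gateway timeout")  # transient failure
        return "charged"

    state = {"completed": set(), "results": {}}
    results = orchestrate(
        [("reserve", reserve_inventory), ("charge", charge_card)], state
    )
    print(results)  # both steps complete exactly once despite the failed attempt
    ```

    The point of the sketch is the `state` record: because progress is persisted centrally, a crashed or retried run picks up where it left off rather than re-sending messages and hoping every consumer handles duplicates.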

    John Martinez


    In this episode, we cover:


    • 00:00:00 - Introduction 
    • 00:03:15 - FinOps Foundation and Multicloud 
    • 00:07:00 - Costs 
    • 00:10:40 - John’s History in Reliability Engineering 
    • 00:16:30 - The Actual Cost of an Outage, Security, Etc.
    • 00:21:30 - What John Measures 
    • 00:28:00 - What John is Up To/Latinx in Tech





    Transcript

    John: I would say a tip for better monitoring, uh, would be to, uh turn it on. [laugh]. [unintelligible 00:00:07] sounds, right?

    Jason: Welcome to the Break Things on Purpose podcast, a show about chaos engineering and operating reliable systems. In this episode we chat with John Martinez, Director of Cloud R&D at Palo Alto Networks. John’s had a long career in tech, and we discuss his new focus on FinOps and how it has been influenced by his past work in security and chaos engineering.  

    Jason: So, John, welcome to the show. Tell us a little bit about yourself. Who are you? Where do you work? What do you do?

    John: Yeah. So, John Martinez. I am a director over at Palo Alto Networks. I have been in the cloud security space for the better part of, I would say, seven, eight years or so. And currently, I am in transition in my role at Palo Alto Networks.

    So, I’m heading headstrong into the FinOps world. So, turning back into the ops world to a certain degree and looking at what we can do, two things: better manage our cloud spend and gain a lot more optimization out of our usage in the cloud. So, very excited about the new role.

    Jason: That’s an interesting new role. I’d imagine that at Palo Alto Networks, you’ve got quite a bit of infrastructure and that’s probably a massive bill.

    John: It can be. It can be. Yeah, [laugh] absolutely. We definitely have a large amount of scale, in multi-cloud, too, so that’s the added bonus to it all. FinOps is kind of a new thing for me, so as I dig back into the operations world, I was very happy to discover that the FinOps Foundation exists and that there’s a lot of prescribed ways of looking at FinOps, at optimization—specifically in the cloud, obviously—as well as a whole framework that I can go adopt.

    So, it’s not like I’m inventing the wheel, although having been in the cloud for a long time, and I haven’t talked about that part of it but a lot of times, it feels like—in my early days anyway—felt like I was inventing new wheels all the time. As being an engineer, the part that I am very excited about is looking at the optimization opportunities of it. Of course, the goal, from a finance perspective, is to either reduce our spend where we can, but also to take a look at where we’re investing in the cloud, and if it takes more of a shift as opposed to a straight-up just cut the bill kind of thing, it’s really all about making sure that we’re investing in the right places and optimizing in the right places when it comes down to it.

    Jason: I think one of the interesting perspectives of adopting multi-cloud is that idea of FinOps: let’s save money. And the idea, if I wanted to run a serverless function, I could take a look at AWS Lambda, I could take a look at Azure Functions to say, “Which one’s going to be cheaper for this particular use case,” and then go with that.

    John: I really liked how the FinOps Foundation has laid out the approach to the lifecycle of FinOps. So, they basically go from the crawl, walk, run approach which, in a lot of our world, is kind of like that. It’s very much about setting yourself up for success. Don’t expect to be cutting your bill by hundreds of thousands of dollars at the beginning. It’s really all about discovering not just how much we’re spending, but where we’re spending it.

    I would categorize pitting the cloud providers against each other to be more on the run side of things, and that eventually helps, especially in the enterprise space; it helps enterprises to approach the cloud providers with more of a data-driven negotiation, I would say [laugh], for your enterprise spend.



    Jason: I think that’s an excellent point about the idea of that is very much a run. And I don’t know any companies within my sphere and folks that I know in the engineering space that are doing that because of that price competition. I think everybody gets into the idea of multi-cloud because of this idea of reliability, and—

    John: Mm-hm.

    Jason: One of my clouds may fail. Like, what if Amazon goes down? I’d still need to survive that.



    John: That’s the promise, right? At least that’s the promise that I’ve been operating under for the 11 years or so that I’ve been in the cloud now. And obviously, in the old days, there wasn’t a GCP or an Azure—I think they were in their infancy—there was AWS… and then there was AWS, right? And so I think eventually though you’re right, you’re absolutely right. Can I increase my availability and my reliability by adopting multiple clouds?

    As I talk to people, as I see how we’re adopting the multiple clouds, I think realistically though what it comes down to is you adopted cloud, or teams adopt a cloud specifically for, I wouldn’t say some of the foundational services, but mostly about those higher-level niche services that we like. For example, if you know large-scale data warehousing, a lot of people are adopting BigQuery and GCP because of that. If you like general purpose compute and you love the Lambdas, you’re adopting AWS and so on, and so forth. And that’s what I see more than anything is, I really like a cloud’s particular higher level service and we go and we adopt it, we love it, and then we build our infrastructure around it. From a practical perspective, that’s what I see.

    I’m still hopeful, though, that there is a future somewhere there where we can commoditize even the cloud providers, maybe [laugh]. And really go from Cloud A to Cloud B to Cloud C, and just adopt it based on pricing I get that’s cheaper, or more performant, or whatever other dimensions that are important to me. But maybe, maybe. We’ll remain hopeful. [laugh].

    Jason: Yeah, we’re still very much in that spot where, even with the basics—if I want a virtual machine, those are still so different between all the clouds. And I mean, even last week, I was working on some Terraform and the idea of building it modularly, and in my head thinking, “Well, at some point, we might want to use one of the other clouds so let’s build this module,” and thinking, “Realistically, that’s probably not going to happen.”

    John: [laugh]. Right. I would say that there’s the other hidden cost about this and it’s the operational costs. I don’t think we spend a whole lot of time talking about operational costs, necessarily, but what is it going to cost to retrain my DevOps team to move from AWS to GCP, as an example? W...

    Omar Marrero


    In this episode, we cover:

    • What Kessel Run is Doing: 00:01:27
    • Failure Never has a Single Point: 00:05:50
    • Lessons Learned: 00:10:50
    • Working the DOD: 00:13:40
    • Automation and Tools: 00:18:02






    Transcript

    Omar: But I’ll answer as much as I can. And we’ll go from there.


    Jason: Yeah. Awesome. No spilling state secrets or highly classified info.


    Omar: Yes.


    Jason: Welcome to Break Things on Purpose, a podcast about chaos engineering and building reliable systems.


    Jason: Welcome back to Break Things on Purpose. Today with us we have guest Omar Marrero. Omar, welcome to the show.


    Omar: Thank you. Thank you, man. Yeah, happy to be here.


    Jason: Yeah. So, you’ve been doing a ton of interesting work, and you’ve got a long history. For our listeners, why don’t you tell us a little bit more about yourself? Who are you? What do you do?


    Omar: I’ve been in the military, I guess, public service for a while. So, I was military before, left that and now I’ve joined as a government employee. I love what I do. I love serving the country and supporting the warfighters, making sure they have the tools. And throughout my career, it’s been basically building tools for them, everything they need to make their stuff happen.


    And that’s what drives me. That’s my passion. If you’ve got the tool to do your mission, I’m in and I’ll make that happen. That’s kind of what I’ve done for the whole of my career, and chaos has always been involved there in some fashion. Yeah, it’s been a pretty cool run.


    Jason: So, you’re currently doing this at a company called Kessel Run. Tell us a little bit more about Kessel Run.


    Omar: So, we deliver combat capability that can sense or respond to conflict in any domain, anywhere, any time. Or deliver award-winning software that our warfighters love. So, Kessel Run’s kind of… you might think of it as a software factory within the DOD. So, the whole creation of Kessel Run is to deliver quickly, fast. If you follow the news, you know DOD follows waterfall a little bit.


    So, the whole creation of Kessel Run was to change that model. And that’s what we do. We deliver continuously, non-stop. Our users give us feedback and within hours, they’ve got it. So, that’s the nature behind Kessel Run. It’s like a hybrid acquisition model within the government.


    Jason: So, I’m curious then, I mean, you obviously aren’t responsible for the company naming, but I’m sure many of our listeners being Star Wars fans are like, “Oh, that sounds familiar.”


    Omar: Yep, yep.


    Jason: If you haven’t checked out Kessel Run’s website, you should go do that; they have a really cool logo. I’m guessing that relates to just the story of Kessel Run being like, doing it really fast and having that velocity, and so bringing that to the DOD, is that the connection?


    Omar: Actually, it goes into the smuggling DevSecOps into the DOD, so the 12 parsecs. So, that’s where it comes from. So, we are smuggling that DevSecOps into the DOD; we’re changing that model. So, that’s where it comes from.


    Jason: I love that idea of we’re going to take this thing and smuggle it in, and that rebellious nature. I think that dovetails nicely into the work that you’ve been doing with chaos engineering. And I’m curious, how did you get into chaos engineering? Where did you get your start?


    Omar: I’ve been breaking things forever. So, part of that—delivering tools that our warfighters can use—that’s been my jam. So, I’ve been doing, you can say, chaos forever. I used to walk around, unplug power cables, network cables, turn down [WAN 00:03:24]. Yeah, that was it.


    Because we used to build these tools and they’re like, “Oh, I wonder if this happens.” “All right, let’s test it out. Why not?” Pull the cable and everybody would scream and say, “What are you doing?” It was like, “We figured it out.”


    But yeah, I’ve been following chaos engineering for a while, ever since Netflix started doing it and Chaos Monkey came out and whatnot, so that’s been something that’s always been on my mind. It’s like, “Ah, this would be cool to bring into the DOD.” And Kessel Run just made that happen. Kessel Run, the way we build tools, our distributed system was like, “Yep, this is the prime time to bring chaos into the DOD.” And Kessel Run just adopted it.


    I tossed the idea, I was like, “Hey, we should bring chaos into Kessel Run.” And we slowly started ramping up, and we build a team for it; team is called Bowcaster. So, we follow the breaking stuff. And that’s it. So, we’ve matured, and we’ve deployed and, of course, we’ve learned on how to deploy chaos in our different environments. And I mean, yeah, it’s been a cool run.


    Jason: Yeah, I’m curious. You mentioned starting off simply, and that’s always what we recommend to people to do. Tell us a little bit more about that. What were some of the tests that you ran then, and then maybe how have they matured, and what have you moved into?


    Omar: So, our first couple of tests were very simple. Hey, we’re going to test a database failover, and it was really manual at that point. We would literally go in and turn off Database A and see what happened. So, it was very basic, very manual work. We used to record them so we can show them off like, “Hey, check this out. This is what we did.”


    So, from there, we matured. We got a little bit more complex. We eventually got to the point where we were actually corrupting databases in production and seeing what happens. You should have seen everybody’s faces when we proposed that. So, from there, we’re running basically, we call it ‘Chaos Plus’ in Kessel Run.


    So, we’ve taken chaos engineering, the concept of chaos engineering, right, breaking things on purpose, but we’ve added performance engineering on top of it, and we’ve added cybersecurity testing on top of it. So, we can run a degraded system, and at the same time say, “All right, so we’re going to ramp up and see what a million users does to our app while it’s fully degraded.” And then we would bring in our cyber team and say, “All right, our system is degraded. See if you can find a vulnerability in it.” So, we’ve kind of evolved.


    And I call it, put chaos on a little bit of steroids here. But we call it Chaos Plus; that’s our thing. We’ve recently added fuzzing while we’re doing chaos. So, now we got performance chaos, our cyber team, and we’re fuzzing the systems. So, I’m just going to keep going until somebody screams at me and says, “Omar, that’s too much.” But that’s essentially a little bit of our ride in Kessel Run.


    Jason: That’s amazing. I love that idea of we’re going to do this test, and then we’re going to see what else can happen. One of the things that I’ve been chatting with a bunch of folks recently about is this idea, we always talk about, especially in the resilience engineering space, that failure never has a single point. It’s not a singular root cause; it’s always contributing factors. And the problem is, when you’re doing chaos eng...

    Carmen Saenz


    In this episode, we cover:

    • Intro and an Anecdote: 00:00:27
    • Early Days of Chaos Engineering: 00:04:13
    • Moving to the Cloud and Important Lessons: 00:07:22
    • Always Learning and Teaching: 00:11:15
    • Figuring Out Chaos: 00:16:30
    • Advice: 00:20:24




    Transcript

    Jason: Welcome to the Break Things on Purpose podcast, a show about chaos engineering and operating reliable systems. In this episode, Ana Medina is joined by Carmen Saenz, a senior DevOps engineer at Apex Clearing Corporation. Carmen shares her thoughts on what cloud-native engineers can learn from our on-prem past, how she learned to do DevOps work, and what reliable IT systems look like in higher education.

    Ana: Hey, everyone. We have a new podcast today, we have an amazing guest; we have Carmen Saenz joining us. Carmen, do you want to tell us a little bit about yourself, a quick intro?

    Carmen: Sure. I am Carmen Saenz. I live in Chicago, Illinois, born and raised on the south side. I am currently a senior DevOps engineer at Apex and I have been in high-frequency trading for 11 out of 12 years.

    Ana: DevOps engineers, those are definitely the type of work that we love diving in on, making sure that we’re keeping those systems up-to-date. But that really brings me into one of the questions we love asking about. We know that in technology, we sometimes are fighting fires, making sure our engineers can deploy quickly and keep collaboration around. What is one incident that you’ve encountered that has marked your career? What exactly happened that led up to it, and how is it that your team went ahead and discovered the issue?

    Carmen: One of the incidents that happened to us was, it was around—close to the beginning of the teens [over 00:01:23] 2008, 2009, and I was working at a high-frequency trading firm in which we had an XML configuration that needed to be deployed to all the machines that were on-prem at the time—this was before cloud—that needed to connect to the exchanges where we can trade. And one of the things that we had to do is that we had to add specific configurations in order for us to keep track of our trade position. One of the things that happened was, certain machines get a certain configuration, other machines get another configuration. That configuration wasn’t added for some machines, and so when it was deployed, we realized that they were able to connect to the exchange and they were starting to trade right away. Luckily, someone noticed from an external system that we weren’t getting the position updates.



    So, then we had to bring down all these on-prem machines by sending out a bash script to hit all these specific machines to kill the connection to the exchange. Luckily, it was just the beginning of the day and it wasn’t so crazy, so we were able to kill them within that minute timeframe before it went crazy. We realized that one of the big issues that we had was, one, we didn’t have a configuration management system in order to check to make sure that the configurations we needed were there. The second thing that we were missing is a second pair of eyes. We needed someone to actually look at the configuration, PR it, and then push it.

    And once it’s pushed, then we should have had a third person as we were going through the deployment system to make sure that this was the new change that needed to be in place. So, we didn’t have the measures in place in order for us to actually make sure that these configurations were correct. And it was chaos because you can lose money because you’re down when the trading was starting in the day. And it was just a simple mistake of not knowing these machines needed a specific configuration. So, it was kind of intense, those five minutes. [laugh].
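    The missing check Carmen describes—verifying every machine has the required configuration before anything is allowed to connect to the exchange—can be sketched as a simple pre-deploy gate. The key names and data layout here are hypothetical, purely for illustration.

    ```python
    REQUIRED_KEYS = {"exchange_endpoint", "position_tracking"}  # hypothetical keys

    def missing_config(machine_configs):
        """Return, per machine, the required config keys it lacks.

        A deploy pipeline can refuse to roll out while this dict is non-empty,
        so no machine starts trading without position tracking configured.
        """
        return {
            host: sorted(REQUIRED_KEYS - set(cfg))
            for host, cfg in machine_configs.items()
            if REQUIRED_KEYS - set(cfg)
        }

    # Example: one machine is missing the position-tracking section.
    fleet = {
        "colo-ny-01": {"exchange_endpoint": "...", "position_tracking": {}},
        "colo-ny-02": {"exchange_endpoint": "..."},
    }
    print(missing_config(fleet))  # {'colo-ny-02': ['position_tracking']}
    ```

    The same check doubles as the “second pair of eyes” in automated form: run it in CI on every configuration PR, and the mistake is caught before any machine is touched.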

    Ana: [laugh]. So, amazing that y’all were able to catch it so quickly because the first thing that comes to mind, as you said, before the cloud—on-prem—and it’s like, do we need to start making ‘BC’, like, ‘Before Cloud’ times when we talk about incidents? Because I think we do. When we look at the world that we live in now in a more cloud-native space, you tell someone about this incident, they’re going to look at us and say, “What do you mean? I have containers that manage all my config management. Everything’s going to roll out.”

    Or, “I have observability that’s going to make us be resilient to this so that we detect it earlier.” So, with something like chaos engineering, if something like this was to happen in an on-prem type of data center, is there something that chaos engineering could have done to help prepare y’all or to avoid a situation like this?

    Carmen: Yeah. One of the things that I believe—chaos engineering, for what it’s worth, I didn’t actually know what chaos engineering was till 2012, and the specific thing that you mentioned is actually what they were testing. We had a test system, so we had all these on-prem machines in different co-locations in the country. And we would take some of our test systems—not the production, because that was money-based, but our test systems that were on simulated exchanges—and what we would do to test to make sure our code was up-to-date is we actually had a Chaos Monkey to break the configuration.



    We actually had a Chaos Monkey and it would just pick a random function to run that day. It would either send a bad config to a machine or bring down a machine by disconnecting its connection, doing a networking change in the middle to see how we would react. [unintelligible 00:05:01] with any machine in our simulation. And then we had to see how it was going to react with the changes that were happening; we had to deduce, we had to figure out how to roll it back. And those are the things that we didn’t have at the time. In 2012—this was another company I was working for in high-frequency trading—they implemented chaos engineering in that simulation, specifically for them, and we would catch these problems before we hit production. So yeah, that definitely was needed.
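    The random-fault picker Carmen describes can be sketched as a small dispatcher over fault-injection functions. The fault functions below only return a description of what would be done—a real version would drive networking or deployment tooling—and every name is made up for illustration.

    ```python
    import random

    def drop_network_link(host):
        return f"disconnected {host} from the simulated exchange"

    def push_bad_config(host):
        return f"sent a malformed config to {host}"

    FAULTS = [drop_network_link, push_bad_config]

    def chaos_monkey(hosts, seed=None):
        """Pick one random fault and one random test host for today's run."""
        rng = random.Random(seed)  # seedable, so a game day can be replayed
        fault = rng.choice(FAULTS)
        host = rng.choice(hosts)
        return fault(host)

    print(chaos_monkey(["sim-colo-01", "sim-colo-02"], seed=7))
    ```

    Pointing this only at simulated-exchange hosts, as Carmen's team did, is what keeps the randomness safe: the blast radius is fixed before the fault is chosen.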



    Ana: That’s super awesome that a failure encountered four years prior to your next company, you ended up realizing, wait, if this company actually follows what they do have of let’s roll out a bad deploy; how does our system actually engage with it? That’s such an amazing learning experience. Is there anything more recent that you’ve done in chaos engineering you’d want to share about?

    Carmen: Actually, since I’ve just started at this company a couple of months ago, I haven’t—thankfully—run into anything, so a lot of my stories are more like war stories from the PC days. So.

    Ana: Do you usually work now, mostly on-prem systems or do you find yourself in hybrid environments or cloud type of environments?

    Carmen: Recently, in the last three to four years I spent in cloud-only. I rarely have to encounter on-prem nowadays. But coming from an on-prem world to a cloud world, it was completely different. And I feel with the tools that we have now we have a lot of built-in checks and balances in which even with us trying to manually delete a node in our cluster, we can see our systems auto-heal because cloud engineering tries to attempt to take care of that for us, or with, you know, infrastructur...

    Zack Butcher


    Welcome back to another edition of “Build Things on Purpose.” This time Jason is joined by Zack Butcher, a founding engineer at Tetrate. They break down Istio’s ins and outs and the lessons learned there, the role of open source projects and their reception, and more. Tune in to this episode and others for all things chaos engineering!

    In this episode, we cover:

    • Istio's History: (1:00)
    • Lessons from Istio: (6:55)
    • Implementing Istio: (11:26)


    Links:


    Episode Transcript: https://www.gremlin.com/blog/podcast-break-things-on-purpose-zack-butcher-founding-engineer-at-tetrate/ 

    Paul Marsicovetere

    Taylor Dolezal

    The Hill You'll Die On

    Taylor Dolezal Terraform Special

    Jose Nino


    Podcast Twitter: https://twitter.com/BTOPPod
    Podcast email: podcast@gremlin.com
    Jose's Twitter: https://twitter.com/junr03


    Episode transcript: https://gremlin.com/blog/podcast-break-things-on-purpose-jose-nino-staff-software-engineer-at-lyft

    Brian Holt

    Jérôme Petazzoni


    Podcast Twitter: https://twitter.com/BTOPPod
    Podcast email: podcast@gremlin.com
    Jérôme's Twitter: https://twitter.com/jpetazzo

    Episode Highlights:

    • Distributed databases at dotCloud & avoiding a major outage (2:18)
    • Multilayered Kubernetes lasagna (16:06)
    • Empowering others & what's important (24:22)

    Episode transcript: https://www.gremlin.com/blog/podcast-break-things-on-purpose-jerome-petazzoni-tinkerer-and-container-technology-educator

    J Paul Reed

    Veronica Lopez

    Mikolaj Pawlikowski


    Podcast Twitter: https://twitter.com/BTOPPod
    Podcast email: podcast@gremlin.com
    Miko's Twitter: https://twitter.com/mikopawlikowski

    Topics include:

    • Why Chaos Engineering? (1:29)
    • Miko's Book (6:55)
    • Chaos Engineering for Frontends (10:21)
    • eBPF (12:10)
    • SLOs (16:28)

    • What Miko is currently excited about (21:56)

    Episode transcript: https://www.gremlin.com/blog/podcast-break-things-on-purpose-mikolaj-pawlikowski-engineering-lead-at-bloomberg

    Alex Hidalgo


    Podcast Twitter: https://twitter.com/BTOPPod
    Podcast email: podcast@gremlin.com
    Alex's Twitter: https://twitter.com/ahidalgosre

    Topics include:

    • Alex's adventure into the absurd (3:00)
    • Google's pager list mishaps (9:37)
    • Crashing NYU's Exchange Server and Hyrum's Law (14:19)
    • Bartending makes you better (19:16)
    • Nobl9 (22:37)
    • What Alex is currently excited about (30:07)

    Episode transcript: https://www.gremlin.com/blog/podcast-break-things-on-purpose-alex-hidalgo-director-of-reliability-at-nobl9

    Ryan Kitchens