In this episode, we chat with Omar Marrero, Chaos and Performance Engineering Lead at Kessel Run, a company at the forefront of delivering "combat capability that can sense and respond to any conflict in any domain, anytime, anywhere." To say that Omar and Kessel Run are at the forefront is an understatement. Join the conversation as Omar and Jason discuss bringing chaos into the DOD, bringing the best software possible to the warfighters, convincing the DOD to get on board with chaos engineering, and more! Tune in for the rest.
Show Notes
In this episode, we cover:
- What Kessel Run is Doing: 00:01:27
- Failure Never has a Single Point: 00:05:50
- Lessons Learned: 00:10:50
- Working the DOD: 00:13:40
- Automation and Tools: 00:18:02
Links:
- Kessel Run: https://kesselrun.af.mil
- Kessel Run LinkedIn: https://www.linkedin.com/company/kesselrun/
Transcript
Omar: But I'll answer as much as I can. And we'll go from there.
Jason: Yeah. Awesome. No spilling state secrets or highly classified info.
Omar: Yes.
Jason: Welcome to Break Things on Purpose, a podcast about chaos engineering and building reliable systems.
Jason: Welcome back to Break Things on Purpose. Today with us we have guest Omar Marrero. Omar, welcome to the show.
Omar: Thank you. Thank you, man. Yeah, happy to be here.
Jason: Yeah. So, you've been doing a ton of interesting work, and you've got a long history. For our listeners, why don't you tell us a little bit more about yourself? Who are you? What do you do?
Omar: I've been in the military, I guess, public service for a while. So, I was military before, left that and now I've joined as a government employee. I love what I do. I love serving the country and supporting the warfighters, making sure they have the tools. And throughout my career, it's been basically building tools for them, everything they need to make their stuff happen.
And that's what drives me. That's my passion. If you've got the tool to do your mission, I'm in and I'll make that happen. That's kind of what I've done for the whole of my career, and chaos has always been involved there in some fashion. Yeah, it's been a pretty cool run.
Jason: So, you're currently doing this at a company called Kessel Run. Tell us a little bit more about Kessel Run.
Omar: So, we deliver combat capability that can sense or respond to conflict in any domain, anywhere, any time. Or deliver award-winning software that our warfighters love. So, Kessel Run's kind of... you might think of it as a software factory within the DOD. So, the whole creation of Kessel Run is to deliver quickly, fast. If you follow the news, you know DOD follows waterfall a little bit.
So, the whole creation of Kessel Run was to change that model. And that's what we do. We deliver continuously non-stop. Our users give us feedback and within hours, they got it. So, that's the nature behind Kessel Run. It's like a hybrid acquisition model within the government.
Jason: So, I'm curious then, I mean, you obviously aren't responsible for the company naming, but I'm sure many of our listeners being Star Wars fans are like, "Oh, that sounds familiar."
Omar: Yep, yep.
Jason: If you haven't checked out Kessel Run's website, you should go do that; they have a really cool logo. I'm guessing that relates to just the story of Kessel Run being like, doing it really fast and having that velocity, and so bringing that to the DOD, is that the connection?
Omar: Actually, it goes into the smuggling DevSecOps into the DOD, so the 12 parsecs. So, that's where it comes from. So, we are smuggling that DevSecOps into the DOD; we're changing that model. So, that's where it comes from.
Jason: I love that idea of we're going to take this thing and smuggle it in, and that rebellious nature. I think that dovetails nicely into the work that you've been doing with chaos engineering. And I'm curious, how did you get into chaos engineering? Where did you get your start?
Omar: I've been breaking things forever. So, part of that they deliver tools that our warfighters can use, that's been my jam. So, I've been doing, you can say, chaos forever. I used to walk around, unplug power cables, network cables, turn down [WAN 00:03:24]. Yeah, that was it.
Because we used to build these tools and they're like, "Oh, I wonder if this happens." "All right, let's test it out. Why not?" Pull the cable and everybody would scream and say, "What are you doing?" It was like, "We figured it out."
But yeah, I've been following chaos engineering for a while, ever since Netflix started doing it and Chaos Monkey came out and whatnot, so that's been something that's always been on my mind. It's like, "Ah, this would be cool to bring into the DOD." And Kessel Run just made that happen. Kessel Run, the way we build tools, our distributed system was like, "Yep, this is the prime time to bring chaos into the DOD." And Kessel Run just adopted it.
I tossed the idea, I was like, "Hey, we should bring chaos into Kessel Run." And we slowly started ramping up, and we built a team for it; the team is called Bowcaster. So, we follow the breaking stuff. And that's it. So, we've matured, and we've deployed and, of course, we've learned how to deploy chaos in our different environments. And I mean, yeah, it's been a cool run.
Jason: Yeah, I'm curious. You mentioned starting off simply, and that's always what we recommend to people to do. Tell us a little bit more about that. What were some of the tests that you ran then, and then maybe how have they matured, and what have you moved into?
Omar: So, our first couple of tests were very simple. Hey, we're going to test a database failover, and it was really manual at that point. We would literally go in and turn off Database A and see what happened. So, it was very basic, very manual work. We used to record them so we can show them off like, "Hey, check this out. This is what we did."
So, from there, we matured. We got a little bit more complex. We eventually got to the point where we were actually corrupting databases in production and seeing what happens. You should have seen everybody's faces when we proposed that. So, from there, we're running basically, we call it "Chaos Plus" in Kessel Run.
So, we've taken chaos engineering, the concept of chaos engineering, right, breaking things on purpose, but we've added performance engineering on top of it, and we've added cybersecurity testing on top of it. So, we can run a degraded system, and at the same time say, "All right, so we're going to ramp up and see what a million users does to our app while it's fully degraded." And then we would bring in our cyber team and say, "All right, our system is degraded. See if you can find a vulnerability in it." So, we've kind of evolved.
And I call it, put chaos on a little bit of steroids here. But we call it Chaos Plus; that's our thing. We've recently added fuzzing while we're doing chaos. So, now we got performance chaos, our cyber team, and we're fuzzing the systems. So, I'm just going to keep going until somebody screams at me and says, "Omar, that's too much." But that's essentially a little bit of our ride in Kessel Run.
Jason: That's amazing. I love that idea of we're going to do this test, and then we're going to see what else can happen. One of the things that I've been chatting with a bunch of folks recently about is this idea, we always talk about, especially in the resilience engineering space, that failure never has a single point. It's not a singular root cause; it's always contributing factors. And the problem is, when you're doing chaos engineering, you're usually testing one thing.
And then it's like, "Oh, I did the failover on that database and that worked." I've been suggesting that people now start to do, "Well, if this is in a degraded state, what are the contributing factors... if that's still working, what are the contributing factors that can lead to a major catastrophe?" That's one of the nice things that actually performing these failures allows you to do rather than just imagining them and trying to work up some sort of response process to your imagination.
Omar: That's our thing. So, from our perspective, that's what I charge the team to do is like, "Hey, we need to make sure these things are working." Comes back to my passion, right? We're delivering tools to the warfighters; the warfighter needs to have tools that work. And that's what Kessel Run does; that's what Kessel Run exists for.
We deliver that award-winning software that our airmen love. So, following that trend, that's where chaos comes in place. So, we're building fancy tools, and we got an awesome platform that supports it and all that stuff. We're just there to make sure, "Hey yeah, this is engineered correctly. It's responsive to fault or any kind of failure." And we just, I mean, we're literally blasting it with anything we can imagine to make sure it could support that.
Jason: I'm curious if you could dive into some details about one of your recent chaos engineering experiments. Was anything unusual or unexpected? And what did you learn from it?
Omar: So, I think one of the cool ones, which is the latest one, was that database corruption. There were a lot of questions on, "Hey, we have some tools in place we built. The engineering is in place to make sure that if the database goes down, nothing is impacting our system and whatnot. What would happen if the database gets corrupted?" For some odd reason. I don't know, that's probably going to happen once in a million, I don't know.
But it's like, "Hey, let's figure it out." So, my team came up with an experiment; we went and we started corrupting databases in staging. It's like, "All right yeah, that was cool." Oh, and then we went to the leader, she was like, "Hey, we want to do this in production and call an outage and see how the team responds." And at the same time, we're going to throw a whole bunch of curves.
We're going to disappear key people, we're going to make sure you don't have access to certain things. It was not just database corruption; we're going to throw curveballs at you like there's no tomorrow here. So, we did, and it was actually a pretty good experience. So, we figured out, hey, yeah, the database corruption just happens, whatnot, and the team, like our SRE team, actually figured it out. It took them a little bit because it was a lot of curveballs, but we learned, all right, if this does happen and we have all these issues happening at once, it's probably a non-realistic, I'd call it, fire drill, but it's something we got to prepare for just in case.
We've learned from it and we actually practiced it again. So, from the initial time it took us to go through the curveballs, we did another one, threw different curveballs at them, and that was like a no-brainer. They're like, "Yep. We got this. Don't worry about it. We ran this through once, so we know."
Which is why we do these things. You want to practice and then, if there's an outage, shorten the time, make sure it's not impacting. What was really cool to see is, like, it didn't matter how many databases we corrupted and how many curveballs we threw at the system, there was never an impact to the end-user, which is the goal. We practice chaos to make sure that it's always working. So, we validated that our system can tolerate all these curveballs and all these things we were doing at it. And it's something that we've never tried before, so it was pretty cool.
Jason: I love that you mentioned what you threw at people was maybe not realistic, it's not something that would happen in the real world, but I think it brings up that idea of when you're training for things, if you train harder, if you're an athlete and you train harder than you would normally in a game, and you're constantly stressing yourself, when it comes to that real-world situation, it just seems easy.
Omar: Yeah. And that's what the SRE team... because we do the normal, "Hey, we did the test," and then we go, [it's like 00:10:31], "This is what we saw." And then we actually asked for feedback from the teams. It's like, "Any way we could have done this test better?" The normal process.
And they're like, "We loved this. We've learned so much that helps us either automate more scripts or streamline our process." So, from our standpoint, we'll keep throwing curveballs. And I think they did that, aside from, hey, this is a very realistic scenario, and then we go to the... this is probably a little bit over the edge, but we still want to do it. We do both. It's good.
Plus, it doesn't keep a same [unintelligible 00:11:04]. We're used to it. All of a sudden you're throwing all these curveballs at the team, they can nitpick from all these lessons learned and put better processes in place, make it faster, better engineering. The team's awesome. All the team that supports Kessel Run, our SRE team, our platform team, everybody's super smart, super amazing, and I'm just there to test their ability to respond. Which is why I like my job.
Jason: You mentioned lessons learned, and I'm curious, as somebody who's been doing chaos engineering for quite a long time, actually, what are some of the top lessons that you would give, or the top advice you would give to our listeners as they start to do chaos engineering?
Omar: I would say, you start simple, and that's key. You start simple. If you really mention chaos to somebody who's not familiar, the first thing they're going to do is they're going to Google "chaos engineering," and what they're going to find out is Netflix and Chaos Monkey. That's an awesome tool, but do your research, figure out what other people are doing, and get involved in the chaos community world; there's a lot of people doing some cool stuff.
Start with a small test so you can see and get the data from there, and scale up. As you learn and as you go, you scale up. And it helps. Chaos scares, sometimes... or not sometimes, for the most part... your senior leadership, because you're telling them, "Hey, I'm going to come in and break stuff." So, doing small-scale tests allows you to prove and provide, hey, this is why it's beneficial.
The actual event is not chaos. We call it chaos engineering, but the actual event is very controlled. We know what we're doing, we're watching, we have somebody in place to say stop in case things are going haywire. So, you have to explain that while you're doing it. And just do it; it's just like testing, you have to test your applications, and the more testing you do, the better.
The closer you shift left the better, too, but you have to test. You got to make sure your apps are working. So, chaos engineering is just another flavor to that. The word chaos usually scares people. So, you just got to slowly do it and show them the value of doing chaos.
Hey, you're doing chaos, this is what it brings. Hey, we just proved your database can fail over. That's a good thing. And if it didn't fail over, it's like, how can we make it happen? So, that's a small-scale test that provides that feedback and data you need to say, this is why we have to adopt chaos engineering.
And as you're going, go crazy, right? As your leadership allows you to do stuff like, yeah, let's just do it. And work with your teams. Work with the SRE teams, work with the app teams and get feedback. What do you need?
What is your biggest problem? That's one thing I ask my team to do. So, every month, they go to the team and say, "All right, so what's your biggest hurdle? What right now is your... why don't you sleep?" And we go, "Okay, can we replicate 'the why don't you sleep' so we can let you sleep?"
So, that's an approach that's worked for us. And a whole bunch of our tests are based on that. It's like, okay, "What keeps you up at night?" We'll test it so you can sleep. And then next month, give me the next thing that keeps you up at night. And we go in and we test it.
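The "very controlled" event Omar describes (a small test, someone watching, and someone able to say stop) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `Experiment`, `run_experiment`, `FakeDatabasePair`, and `failover_experiment` are hypothetical names invented here, the database pair is simulated in memory, and none of it represents Bowcaster or any Kessel Run tooling.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Experiment:
    """A small, controlled fault injection (hypothetical structure)."""
    name: str
    inject: Callable[[], None]          # start the fault
    rollback: Callable[[], None]        # undo the fault
    steady_state: Callable[[], bool]    # True while users are unaffected
    stop_requested: Callable[[], bool]  # the person in place to say "stop"
    checks: int = 3                     # how many times to observe the system


def run_experiment(exp: Experiment) -> Dict[str, object]:
    observations: List[bool] = []
    # Never inject a fault into a system that is already unhealthy.
    if not exp.steady_state():
        return {"name": exp.name, "status": "skipped", "observations": observations}
    status = "passed"
    exp.inject()
    try:
        for _ in range(exp.checks):
            if exp.stop_requested():
                status = "aborted"
                break
            healthy = exp.steady_state()
            observations.append(healthy)
            if not healthy:
                status = "failed"
                break
    finally:
        # Always restore the system, whatever happened above.
        exp.rollback()
    return {"name": exp.name, "status": status, "observations": observations}


class FakeDatabasePair:
    """In-memory stand-in for a primary/replica pair (assumption for the demo)."""

    def __init__(self, failover_works: bool = True) -> None:
        self.primary_up = True
        self.failover_works = failover_works

    def stop_primary(self) -> None:
        self.primary_up = False

    def start_primary(self) -> None:
        self.primary_up = True

    def answers_queries(self) -> bool:
        # Queries succeed if the primary is up or the replica takes over.
        return self.primary_up or self.failover_works


def failover_experiment(
    db: FakeDatabasePair,
    stop_requested: Callable[[], bool] = lambda: False,
) -> Experiment:
    return Experiment(
        name="database-failover",
        inject=db.stop_primary,
        rollback=db.start_primary,
        steady_state=db.answers_queries,
        stop_requested=stop_requested,
    )


if __name__ == "__main__":
    print(run_experiment(failover_experiment(FakeDatabasePair())))
```

A "passed" result is the small-scale evidence Omar talks about showing to leadership, and a "failed" one is the prompt to ask how to make the failover happen. In both cases the rollback runs, which is what keeps the event controlled.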
Jason: And I like that iterative approach of, let's work on what's your biggest pain point? What keeps you up at night? And then let's solve that. And then what's the next thing? And keep working down that chain until, hopefully, nothing keeps you up at night.
Omar: Yeah, that'll be good. We all sleep and it's like, "Oh, this thing's on cruise control. Let's go."
Jason: You mentioned convincing management or the upper levels of management in allowing you to do this. What's that process like at Kessel Run? And then, what's that process look like as Kessel Run convinces the broader Department of Defense to adopt this?
Omar: Oh, that's a fun one. Yeah, so when we first brought it up, we got the, "What are you trying to do?" Because it was like, "Hey, we want to do chaos engineering." It was like, "Okay, yeah, we've heard a little bit about this. What does that mean?"
It's like, "I'm just going to break stuff." Which probably wasn't the smart approach at the moment, but that's what I said. And they're like, "No, wait. What do you mean?" And I'm like, "Yeah, and eventually I want to do it in production."
So, I just went all out. That was my presentation. You know, I've learned from that. It's like, okay, baby steps, Omar. But initially, it was like, "I want to do chaos and I want to get to production." They were like, "Yeah, sounds good, but I need a plan." I was like, "Okay. I'll come up with a plan. And we'll figure it out."
And so that's how we slowly started. And I stood up the team, Bowcaster, and from there we kind of, all right, how do we show the value of chaos engineering? How do we learn chaos and all that stuff? So, it was easy to get them to adopt it. It was the actual execution of tests that was a concern.
Because there was a lot of unknowns. We didn't know what we're going to break. We don't know how it's going to react. And how do we actually do this? And we slowly just kind of did those little tests. It was like, all right, we're going to do this, we're going to do that. And that's how we got it.
And now that we're moving to the rest of the DOD, that's a really cool adventure because our framework, what Bowcaster has built in Kessel Run, is what they want to move to the rest of the DOD. So, the Chaos Plus model is what's interesting. The fact that we are moving to the rest of the DOD is very cool because it's something I believe should be in the rest of the DOD. And we're happy to experiment. From the Kessel Run perspective, that's what we're here for.
We'll experiment and we'll let you know what fails, what doesn't fail, because we're an experimental lab. And, yeah. But the senior leadership in DOD in charge of all the software development and stuff like that, they're all over it. They just want to... hey, how do we make it happen? What do you need?
You'll see there's a different mind change now that chaos engineering is more familiar around the DOD and the tech space. "Hey, yeah. This thing called chaos engineering." It's not just, yeah, Netflix does chaos engineering. It's like, yeah, everybody's doing chaos engineering.
So, you see the little mind shift from, initially, when I brought it in. It was like, "Hey, I want to break stuff in production." And everybody's like, "Whoa, hold up there. There's [no 00:17:05] baby steps here, Omar." Now, it's like, "Hey, let's go and do it." It's like, "Yeah, let's do it. How do we execute? But let's do it." So, it's a very cool thing to see.
Jason: I'm wondering if maybe that readiness to adopt things like this... since you've spent time in the military; I haven't, but from what I understand, it sounds like the military has ideas of really, really doing testing. And in some cases, not production testing. We don't start wars just to train the military, but there is the idea of things like live-fire testing. Do existing practices within the military influence the perception of chaos engineering, and help people actually understand it better, maybe more so than with standard civilians and corporate enterprise?
Omar: Yes. Testing is very important in our systems. So, it's a different mindset, I would say. So, because in corporate world, it's all about the money, making the system work and make sure it's not going down because you lose profit. Or if you're... that's the mindset on that one.
For us, we are in charge of defending the nation, so our system has to be proven and ready to rock within seconds. So, we do a lot of tests, and chaos engineering is just one extra layer to those tests. And now that we are moving to this massive DevSecOps transformation, chaos engineering is key. There's no way we can do this without having chaos engineering involved. So, that's what our senior leadership is pushing.
Hey, yeah, this is another flavor of testing. It's important because we're building distributed complex systems across the cloud and whatnot, to support the DOD mission. So, chaos engineering is there. Same thing with the live-fire testing. We got to do live-fire testing to make sure that the ammunition is working, and the guns are working, and everything's working right. This is just a different flavor of live-fire testing, just on software, and applications, and infrastructure, and the whole deal.
Jason: You mentioned running game days and throwing curveballs, and that sounds like more of a manual game day where you've got people running the attacks and people responding. You've mentioned Kessel Run and really that velocity, and getting faster at things, and automating. Have you started automating the chaos engineering process as well?
Omar: So, we have and we're following the same approach as when we started. So, the baby steps approach. So, we are going to slowly work with the SRE team to automate some of these tests. And that's ongoing. My team's working on it right now, so we're getting there.
It's part of our slowly learning and kind of process. The manual, like, game days won't stop. Those will keep going because of the curveballs we want to keep throwing at the teams, but the automation is coming. The idea is to get the chaos engineering as close to the dev cycle as we can, so shift left as much as possible. And that's our next goal.
So, we're working on that. And I think a lot of it comes down to where do we do it. So, we work in different environments. It's not just what we call the internet right now. We have different environments, so how do we automate across all environments?
And part of it is how we architect that so it works. So, if we make it work on one environment, how does it work on all the environments? So, that's usually where our timelines are. So, trying to make sure that our architecture supports all environments versus having to spend a lot of resources, you know, all right, we're going to engineer one environment, we've got to engineer another environment, we've got to engineer another environment. We want to make sure just to... out of the box, here we go. But that is part of our goal, and we are starting baby steps, so the database failover test is probably the one we will automate first.
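One way to read "make it work on one environment, then on all the environments" is to write the test once against a small interface and supply one adapter per environment. The sketch below is a guess at that shape, not a description of how Kessel Run architects it: `DatabaseEnvironment`, `SimulatedEnvironment`, `database_failover_test`, and `run_in_all_environments` are hypothetical names, and the environments are simulated in memory.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Protocol


class DatabaseEnvironment(Protocol):
    """What the test needs from any environment (hypothetical interface)."""
    name: str

    def stop_primary(self) -> None: ...
    def start_primary(self) -> None: ...
    def is_serving(self) -> bool: ...


@dataclass
class SimulatedEnvironment:
    """In-memory adapter used for illustration; a real adapter would
    talk to that environment's own tooling instead."""
    name: str
    auto_failover: bool = True
    primary_up: bool = True

    def stop_primary(self) -> None:
        self.primary_up = False

    def start_primary(self) -> None:
        self.primary_up = True

    def is_serving(self) -> bool:
        # Serving continues if the primary is up or a replica takes over.
        return self.primary_up or self.auto_failover


def database_failover_test(env: DatabaseEnvironment) -> str:
    """The single test definition, written once for every environment."""
    if not env.is_serving():
        return "skipped"  # do not add a fault to an unhealthy environment
    env.stop_primary()
    try:
        return "passed" if env.is_serving() else "failed"
    finally:
        env.start_primary()  # always restore the primary


def run_in_all_environments(envs: Iterable[DatabaseEnvironment]) -> Dict[str, str]:
    return {env.name: database_failover_test(env) for env in envs}


if __name__ == "__main__":
    environments = [
        SimulatedEnvironment("dev"),
        SimulatedEnvironment("staging", auto_failover=False),
    ]
    for name, outcome in run_in_all_environments(environments).items():
        print(f"{name}: {outcome}")
```

With this split, adding an environment means writing a new adapter rather than re-engineering the test, which is the resource cost Omar says the team wants to avoid.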
Jason: As you've done chaos engineering, you're doing the game days manually; what was the process like in terms of tools and adoption? I think a lot of people start off and they hear of Chaos Monkey and so they immediately jump over and, "Cool, let me grab Chaos Monkey and see if I can use that." For any listeners that have tried that, you've probably quickly recognized that that tool, not so great for public consumption, was very much designed for Netflix. So, I'm curious if you could tell me more about your tools adoption, what have you used? What are you using now? What does that evolution look like?
Omar: Yeah, so we actually... the first thing I told my team was you are going to research tools. [laugh]. I know Chaos Monkey is out there, but I'm like, there's definitely more tools that we should look at. I'm sure there's been a whole bunch of tools created, depending on our platform. And that's what they did.
So, they went and they researched a whole bunch of tools. And they came back and they presented the tools they wanted to use, or kind of just integrate into our architecture. When the team started, right, so when we started that chaos team, the Bowcaster team was supposed to focus just on chaos engineering, but the more I kept thinking about it, it was like we need to focus on chaos and some other stuff. So, that's where the performance engineering and the fuzzing came into play, and bringing the cyber team into the game. So, from a tool perspective, when you look at us, Bowcaster the team is also the tool.
So, they have a tool, Bowcaster is the tool that we deploy across KR to do chaos engineering. Now, within that tool or that framework, there's the tools behind it. And there's a combination of open-source tools and other tools that we do there, but those just provide the engine for us to perform all of our tests on what we call Chaos Plus. So, Bowcaster is our tool. Yeah, it's the team and the tool is kind of weird, right?
But the team and the tool, so when you go into KR and you say, "Hey, I want to do chaos engineering." It's like, "All right. Go do chaos engineering with the Bowcaster tool that the Bowcaster team built." But the architecture behind that, there's a lot of tools. And that was the task I gave the team.
It's like, "I need you to research tools. I know Chaos Monkey is out there. I know Simian Army, I know all these tools that originally come up when you Google." It's like, if Netflix created it, that's the first thing that comes up. But there has to be more, especially in the Kubernetes world. There's a whole bunch of tools. So, that's what they did, and we took a combination of those tools and we built Bowcaster. And that's what we got.
Jason: That's an excellent point, though, about not just a chaos engineering tool. And I think a lot of times when people think of chaos engineering because it's chaos engineering it sounds like this well-defined practice of, this is it. If you have chaos engineering, you must have chaos engineers, and so it seems siloed when in actuality, it's just one of many practices that SREs and DevOps and all engineers should practice. So, this idea of, we're going to build a tool that has not just the chaos engineering, but all of these other things that you need, and providing that as a service is, I think, a fantastic idea.
Omar: That's always been the charter I've given the teams. Yes, we want to do chaos engineering; chaos engineering is awesome. We all dig it, we preach it, we're huge advocates of it, but what else can we provide? I mean, we're already degrading the system, so what else can we test? [unintelligible 00:24:25] break the system and blast it with a million users and see what happens. And it's like, "All right, system's degraded; we're blasting it. Let's see if we can hack it."
And maybe while that's degraded and getting blasted, maybe we figure out there's a vulnerability or something. So, that's always been the concept. It's like putting chaos engineering a little bit on steroids, we call it. And that's what Bowcaster does. Bowcaster's job is to build these things and support it.
And I'm sure we'll come up with other crazy stuff as we get feedback from the team, like, "Hey, it would be cool if you can do this." And we'll just build it into our framework and it will just be another service that Bowcaster provides aside from performance and chaos engineering.
Jason: Omar, thanks for coming on the show. Fantastic information. It's inspiring to see the journey of where you've come from and where you're headed, especially with the Bowcaster team at Kessel Run. Before we go, though, I wanted to ask, do you have anything that you want to plug or promote, job openings, upcoming speaking? Where can people find you on the internet to learn more about the stuff you've been doing?
Omar: So, Kessel Run, very active, so you can find us on LinkedIn: Kessel Run, or just go to our site, kesselrun.af.mil and you'll find a whole bunch of information there, careers, so if you're interested come work, we're cool people. I promise we do cool stuff.
And if you come work for Bowcaster, we'll hire you and you can break stuff with us, which is why we... can't get better than that, right? Yeah, come check us out, kesselrun.af.mil. Lots of information there, careers, you can follow us and yeah.
Jason: Awesome. Thanks again for coming on the show.
Omar: Thanks.
Jason: For links to all the information mentioned, visit our website at gremlin.com/podcast. If you liked this episode, subscribe to the Break Things on Purpose podcast on Spotify, Apple Podcasts, or your favorite podcast platform. Our theme song is called Battle of Pogs by Komiku and is available on loyaltyfreakmusic.com.