
Podcast: Break Things on Purpose | Leonardo Murillo, Principal Partner Solutions Architect at Weaveworks

Sit down with Ana and Jason for this week's show with Leonardo (Leo) Murillo, principal partner solutions architect at Weaveworks and former DJ, who joins us from Costa Rica. Leo offers his take on GitOps, points to a lot of excellent resources to check out, and shares his thoughts on automating reliability. He also explains how to account for the “DJ variable” and “party parameters,” alongside some fun anecdotes on DevOps. Leo is an ardent member of the Costa Rican cloud community, which he goes into some detail on. Tune in for another reliable episode!

Show Notes

In this episode, we cover:

  • 00:00:00 - Introduction
  • 00:03:30 - An Engineering Anecdote
  • 00:08:10 - Lessons Learned from Putting Out Fires
  • 00:11:00 - Building “Guardrails”
  • 00:18:10 - Pushing the Chaos Envelope
  • 00:23:35 - OpenGitOps Project
  • 00:30:37 - Where to Find Leo/Costa Rica CNCF

Transcript

Jason: Welcome to the Break Things on Purpose podcast, a show about our often self-inflicted failures and what we learn from them. In this episode, Leonardo Murillo, a principal partner solutions architect at Weaveworks, joins us to talk about GitOps, automating reliability, and Pura Vida.

Ana: I like letting our guests kind of say, like, “Who are you? What do you do? What got you into the world of DevOps, and cloud, and all this fun stuff that we all get to do?”

Leo: Well, I guess I’ll do a little intro of myself. I’m Leonardo Murillo; everybody calls me Leo, which is fine, though I realize that not everybody chooses to call me Leo, depending on where they’re from. Like, Ticos and Latinos, they’re like, “Oh, Leo,” like they already know me; I’m Leo already. But people in Europe and in other places are, kind of like, more formal out there, so for them it’s Leonardo; everybody else calls me Leo.

I’m based out of Costa Rica, and my current professional role is principal solutions architect—principal partner solutions architect at Weaveworks. How I got started in DevOps is how a lot of people have gotten started in DevOps, which is not realizing that they just got started in DevOps, you know what I’m saying? Like, they did DevOps before it was a buzzword and it was, kind of like, cool. That was back—so I worked probably, like, three roles back: I was CTO for a Colorado-based company before Weaveworks, and before that, I worked with a San Francisco-based startup called High Fidelity.

And High Fidelity did virtual reality. It was actually founded by Philip Rosedale, the founder of Linden Lab, the builders of Second Life. And the whole idea was, with the advent of the Oculus Rift and all this cool tech, to build the new metaverse concept. We were using the cloud because, I mean, we’re talking about a distributed system where you’re trying to transmit, with very low latency, positional audio and a bunch of different degrees of freedom of your avatars and whatnot; that’s massive scale, lots of traffic. So, the cloud was, kind of like, fit for purpose.

And so we started using the cloud, and I started using Jenkins, and figured out, like, Jenkins is a cron sort of thing; [unintelligible 00:02:48] oh, you can actually do a scheduled thing here. So, I started using it almost just to run scheduled jobs. And then I realized its power, and all of a sudden, I started hearing this whole DevOps word, and I’m like, “What’s this? That’s kind of like what we’re doing, right?” Like, we’re doing DevOps. And that’s how it all got started, back in San Francisco.

Ana: That actually segues to one of the first questions that we love asking all of our guests. We know that working in DevOps and engineering, sometimes it’s a lot of firefighting, sometimes we get to teach a lot of other engineers how to have better processes. But we know that those horror stories exist. So, what is one of those horrible incidents that you’ve encountered in your career? What happened?

Leo: This is before the cloud, and this is way before DevOps was even a thing. I used to be a DJ in my 20s. I used to mix drum and bass and jungle with vinyl; I never did the digital move. I used to DJ, and I was director for a colocation facility here in Costa Rica, one of the first few colocation facilities that existed in the [unintelligible 00:04:00].

I partied a lot, like every night, [laugh][unintelligible 00:04:05] party night and DJ night. We had 24/7 support because we were a colocation [unintelligible 00:04:12], so I had people doing support all the time. I was mixing in some bar someplace one night, and I don’t want to go into absolute detail of my state of consciousness, but it wasn’t, kind of like, accurate in its execution. So, I got a call, and they’re like, “We’re having some problem here with our network.” This is, like, back in Cisco PIX times for firewalls and, you know, like back then.

I wasn’t fully there, so I [laugh] just drove back to the office in the middle of the night, and I had this assistant, Miguel was his name, and he looks at me and he’s like, “Are you okay? Are you really capable of solving this problem at [laugh] this very point in time?” And I’m like, “Yeah. Sure, sure. I can do this.”

We had a rack full of networking hardware and there was, like, a big incident; we actually—one of the primary connections that we had was completely offline. And I went in and I started working on a device, and I spent about half an hour, like, “Well, this device is fine. There’s nothing wrong with the device.” I had been working for half an hour on the wrong device. They’re like, “Come on. You really got to focus.”

And long story short, I eventually got to the right device and I was able to fix the problem. But that was, like, a bad incident, though it wasn’t bad in a technical sense, right? It was a relatively quick fix once I figured it out. It was just at the wrong time. [laugh]. You know what I’m saying?

It wasn’t the best thing to occur that particular night. So, when you’re talking about firefighting, there’s a huge burden on the on-call person, and I think that’s something we’ve all experienced, and I think we should give out a lot of shout-outs and provide a lot of support for those that are on call. Because this is the exact price they pay for that responsibility. So, just as a side note that comes to mind: a lot of, like, shout-outs to all the people on call that are listening to this right now, and I’m sorry you cannot go party. [laugh].

So yeah, that’s one story of one incident way back. You want to hear another one? This is back in High Fidelity times. I don’t remember exactly what I was building, but it had to do with emailing users; I had to do something, I can’t recall exactly what. It was supposed to email all the users that were using the platform. For whatever reason—I really can’t recall why—I did not mock data in my development environment.

What I did was just use—I didn’t mock the data, I actually just used a copy of the production [unintelligible 00:07:02] the users. I basically just emailed everybody, like, multiple times. And that was very embarrassing. Another embarrassing scenario was, one day, I was working on a firewall that was local to my office, and I got the terminals mixed up, and I shut down not my local office firewall, but the one that was at the colocation facility. That was another embarrassing moment. So yeah, those are three, kind of, self-caused fires that required fighting afterwards.

Ana: The mock data one definitely resonates, especially when you’re starting out in your engineering career, where you’re just like, “Hey, I need to get this working. I’m trying to pull this data from a production service,” or, “I’m trying to publish a new email and I want to see how it all goes out. Yeah, why not grab a copy of what actually is being used by my company and, like, press buttons here? Oh, wait, no, that actually is hitting a live endpoint? I did not know that.”

Which brings me to the main question: what do you end up learning when you go through these fires? After you went through this incident where you emailed all of your customers, what is something that you learned that you got to take back?

Leo: I learned that you have to pay attention. It’s hard to learn without having gone through these experiences because you start picking up on cues that you didn’t pick up in the past. You start seeing things that you didn’t pay attention to before, particularly because you didn’t know. And I’m pretty sure, even if somebody had told me, “Don’t do this,” or, “Don’t do that. Be careful,” you still make those mistakes.

There are certain things that you only achieve through experience. And I think that’s one of the most important things that I realized. And I’ve actually seen the analogy of that with my children. There are certain things that, no matter how well I articulate them, they will not learn until they go through those experiences themselves. But I think that’s one of the things that I’d argue: you will go through this, and it’s not okay, but it’s okay.

Everybody makes mistakes. You’ll also identify how supportive your team is and how supportive the organization you’re working with is when you see the reaction to those errors. Hopefully, it wasn’t something too bad, and ideally there are going to be guardrails that prevent that really, really bad scenario, but it’s okay to make mistakes. You learn to focus through those mistakes, and you really should be paying attention; you should never take anything for granted. There is no safety net. Period.

So, you should never assume that there is, or that you’re not going to make a mistake. So, be very careful. Another thing that I learned is how I work in my development environment, the patterns that I apply there. Now I’m very careful to never have, kind of like, production [x 00:10:11] readily available within my development environment. And also to build those guardrails.

I think part of what you learn is that all the things that could go wrong might go wrong, so take time to build those guardrails. I think that’s important. Like anything else that comes with seniority, when you have a task to accomplish, the task itself is only a percentage of what you really should consider to reach that objective. And a lot of the time, that means building protection around what you’re asked, or thinking beyond that scope. And then leverage the team, you know? If you have people around you that know more, which is what’s great about community and collaboration, you’re not alone.

Ana: I love that you mentioned guardrails and guardrails being a way that you’re able to prevent some of these things. Do you think something like chaos engineering could help you find those guardrails when you don’t know that you don’t have a guardrail?

Leo: I think it definitely can. The more complex your job, the more complex your architecture, the more complex the solution you’re building—and we’ve seen an increase in complexity over time. We went from monoliths to microservices to fully distributed architectures of services. We went from synchronous to asynchronous to event-driven—like, there’s this increase in complexity that is there for a reason, because of an increase in scale as well. And the number of possible failure conditions that could arise from this hugely diverse and complex set of variables means that we’ve gotten to a point, with this complexity and new levels of scale, where there are currently more unknown unknowns than we’ve ever had.

The conditions that you can run into because of the different problem states of each individual component in your distributed architecture bring an orders-of-magnitude increase in the possible issues that you might run into. You get to a point where you really have to understand that you have no idea what could fail, and the exercise becomes identifying what can fail, or what the margins of stability of your solution are, because that’s, kind of like, the whole point: the boundaries. There’s going to be a set of conditions, a combination of conditions, that will tip your solution beyond that edge. And finding those edges of stability can no longer be something that just happens by accident; it has to be premeditated, it has to be planned for. This is basically chaos engineering.

Hypothesizing: given a set of conditions, what is the expected outcome? And through the execution of these hypotheses of increasing or varying scope and complexity, you start to identify that perimeter of stability of your solution. So, I guess to answer your question, yes. I mean, if you think about that perimeter of stability as the guardrails around your solution, within which it has to remain for your solution to be stable, that’s [unintelligible 00:13:48] chaos engineering. I was actually talking to somebody the other day; I’m the organizer for the Costa Rica Cloud-Native Community, the chapter for [unintelligible 00:14:00], and I have this fellow from [unintelligible 00:14:04] who works doing chaos engineering.

And he was talking to me about this concept that I had not thought about or considered: how chaos engineering can also be, kind of like, applied at a social level. What happens if person X is not available? What happens if another person has access to a system that they shouldn’t have? All these types of scenarios can be used to discover where more guardrails should be applied.
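To make the hypothesis-driven framing Leo describes a bit more concrete, here is a minimal, tool-agnostic sketch of an experiment declared as a hypothesis with an expected steady state and an abort condition. The services, thresholds, and helper functions (inject_latency, p99_latency_ms) are invented for illustration; a real experiment would use a chaos tool and real observability data instead of these stand-ins.

```python
import random
import time

# Hypothetical fault injector; a real experiment would call a chaos tool here.
def inject_latency(service: str, delay_ms: int) -> None:
    print(f"injecting {delay_ms}ms latency into {service}")

# Stand-in for querying a metrics backend for the p99 latency of a service.
def p99_latency_ms(service: str) -> float:
    return random.uniform(50, 400)

def run_experiment() -> bool:
    """Hypothesis: with 200ms of added latency on the payments dependency,
    checkout p99 latency stays under 300ms (the declared steady state)."""
    steady_state_threshold_ms = 300.0
    abort_threshold_ms = 1000.0  # abort condition: blast radius clearly exceeded

    inject_latency("payments", delay_ms=200)
    for _ in range(5):  # observe for a few intervals while the fault is active
        if p99_latency_ms("checkout") > abort_threshold_ms:
            print("abort: halting the experiment")
            return False
        time.sleep(1)

    verified = p99_latency_ms("checkout") <= steady_state_threshold_ms
    print("hypothesis held" if verified else "hypothesis falsified: found an edge of stability")
    return verified

if __name__ == "__main__":
    run_experiment()
```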

Jason: You know, you start to learn where the on-call person that’s completely sober, maybe, is unavailable for some reason, and Leo comes and [crosstalk 00:14:45]—

Leo: Right. [laugh]. Exactly. Exactly. That’s what you have to incorporate in your experiment, kind of like, the DJ variable and the party parameter.

Jason: It’s a good thing to underscore as well, right? Back to your idea that we can tell our children all sorts of things and they’re not going to learn the lesson until they experience it. And similarly, as you explore your systems and how they can fail, we can imagine and architect systems to maybe be resilient or robust enough to withstand certain failures, but we don’t actually learn those lessons or actually know if they’re going to work until we really do that, until we really stress them and try to explore those boundaries.

Leo: Wouldn’t it be fantastic if we could do that with our lives? You know, like, I want to bungee jump or I want to skydive, and there’s a percentage of probability that I’m going to hit the ground and die, and I can just introduce a hypothesis in my life, jump, and then just revert to my previous state if it went wrong. It would be fantastic. I would try many, many things. [laugh].

But you can’t. And it’s kind of like the same thing with my kids. I would love to be able to say, “You know what? Execute the following process, get the experience, and then revert to before it happened.” You cannot do that in real life, but that’s, kind of like, the scenario that’s brought up by chaos engineering: you don’t have to wait for that production incident to learn; you can actually “emulate,” quote-unquote, those occurrences.

You can emulate it; you can experience it without the damage, if you do it well. I think that’s also part of it: there’s a lot to learn about chaos engineering, and there’s a lot of progress in terms of how the practice of chaos engineering is evolving. I think there’s likely still a percentage of the population, or of the industry, that still doesn’t quite see chaos engineering beyond just introducing chaos, period. They know chaos engineering as letting Chaos Monkey kill instances at random and fixing things, you know, not in the more scientific context that it’s evolved into. But yeah, I think the ability to have a controlled experience where you can actually live through failure states, and incidents, and issues, and stuff that you really don’t want to happen in real life, but that you can actually simulate, accelerates learning in a way that only experience provides. Which is the beauty of it, because you’re actually living through it, and I don’t think anything can teach us as effectively as living through [unintelligible 00:17:43], through suffering.

Ana: I do also very much love that point: it’s true, chaos engineering does expedite your learning. You’re not just building and releasing and waiting for failure to happen; you’re actually injecting that failure, and you get to be like, “Oh, wait, if this failure were to occur, I know that I’m resilient to it.” But I also love pushing that envelope forward, that it really allows folks to battle-test solutions together: “I think this architecture diagram is going to be more resilient because I’m running it in three regions, and they’re all in just certain zones. But if I were to deploy to a different provider that only gives me one region, but they say they have a higher uptime, I would love to battle-test that together and really see. I’m throwing both scenarios at you: you’re losing your access to the database. What’s going to happen? Go, fight.” [laugh].

Leo: You know, one thing that I’ve been mentioning to people, and this is my hypothesis as to the future of chaos engineering as a component of solutions architecture: just as nowadays, if you look at any application or service, for that application or service to be production-ready you have a certain percentage of unit test coverage and a certain percentage of end-to-end test coverage and whatnot, and you cannot say, “I’m going to give you a production-ready application or production-ready system,” without solid testing coverage. My hypothesis is that [unintelligible 00:19:21]. And as a side note, we are now living in a world of infrastructure as code, and manifested infrastructure, and declarative infrastructure, and all sorts of cool new ways to deploy and deliver that infrastructure and the workloads on top of it.

My theory is that just as unit testing coverage is a requirement for any production-ready solution or application nowadays, a certain percentage of “chaos coverage,” quote-unquote—in other words, what percentage of the surface of your infrastructure has been exercised by chaos experiments—is going to also become a requirement for any production-ready architecture. That’s where my mind is at. I think you’ll start seeing that happen in CI/CD pipelines; you’re going to start seeing labels of 90% chaos coverage on Terraform repos. That’s the future I hope for, because I think it’s going to help tremendously with reliability, and allow people to party without concern for being called back to the office in the middle of the night. It’s just going to have a positive impact overall.
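As a rough illustration of what such a “chaos coverage” gate might look like in a pipeline, here is a minimal sketch. Chaos coverage is Leo’s hypothesis rather than an established metric, and the component names and the 90% threshold are invented for the example.

```python
# Illustrative only: the infrastructure components and the list of components
# exercised by chaos experiments are made up for this sketch.
infrastructure_components = {
    "vpc", "load-balancer", "checkout-service", "payments-service",
    "orders-db", "cache", "message-queue",
}

components_exercised_by_experiments = {
    "load-balancer", "checkout-service", "cache",
}

def chaos_coverage(components: set, exercised: set) -> float:
    """Percentage of the declared infrastructure surface touched by at least one chaos experiment."""
    return 100.0 * len(components & exercised) / len(components)

REQUIRED_COVERAGE = 90.0  # hypothetical gate, analogous to a unit-test coverage threshold

coverage = chaos_coverage(infrastructure_components, components_exercised_by_experiments)
print(f"chaos coverage: {coverage:.1f}%")
if coverage < REQUIRED_COVERAGE:
    raise SystemExit("pipeline gate failed: chaos coverage below required threshold")
```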

Ana: I definitely love where that vision is going because that’s very much what I’ve seen in the industry and the community. With a lot of the open-source projects out there—like, I got to sit in on a project called Keptn, which brings in a little bit more of those SRE-driven operations and tries to close that loop, and auto-remediate, and all these other nice things of DevOps and cloud. A big portion of what we’re doing with Keptn is that you also get a chance to inject chaos and validate against service-level objectives, so you really get to bring to the front, “Oh, we’re looking at this metric for business-level and service-level objectives that allow us to know that we’re actually up and running and our customers are able to use us, because they are the right indicators that matter to our business.” But you get to do that within CI/CD, so you throw chaos at it, you check that SLO, that gets rolled out to production, or to your next stage, and then you throw more chaos at it, and it continues being completely repeatable.
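The SLO-gated promotion Ana describes could look roughly like the following sketch. This is not Keptn’s actual configuration or API; the metric, objective, and numbers are invented, and it only shows the shape of the idea: run chaos in a stage, evaluate the service-level objective, and promote only if the objective still holds.

```python
from dataclasses import dataclass

@dataclass
class SLO:
    name: str
    objective: float  # e.g., 99.5 means 99.5% of requests must succeed

# Stand-in for querying real metrics after a chaos experiment has run in staging.
def measured_success_rate() -> float:
    return 99.7  # invented number for the sketch

def evaluate_gate(slo: SLO, measured: float) -> bool:
    ok = measured >= slo.objective
    status = "met" if ok else "violated"
    print(f"SLO '{slo.name}': objective {slo.objective}%, measured {measured}% -> {status}")
    return ok

if __name__ == "__main__":
    gate = SLO(name="checkout availability", objective=99.5)
    if evaluate_gate(gate, measured_success_rate()):
        print("promoting to the next stage")
    else:
        print("blocking promotion; the change would burn error budget")
```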

Leo: That’s really awesome. And I think, for example, SLOs, I think that’s very valuable as well. You prioritize what you want to improve based on the output of your experiments against that error budget, for example. There’s limited time, there’s limited engineering capacity, there’s limited everything, so this is also something where the output, the results, the insights that you get from executing experiments throughout your delivery lifecycle, as you promote and progress your solution through its multiple stages, also help you identify what should be prioritized because of the impact it may have on your error budgets. Because, I mean, sometimes you just need to burn budget, you know what I’m saying?

So, you can actually, clearly and quantifiably understand where to focus engineering efforts towards site reliability as you introduce changes. So yeah, I think it’s—and no wonder it's such a booming concept. Everybody’s talking about it. I saw Gremlin just released this new certification thing. What is it, certified chaos engineer?

Jason: Gremlin-certified chaos engineering practitioner.

Leo: Ah, pretty cool.

Jason: Yeah.

Leo: I got to get me one of those. [laugh].

Jason: Yeah, you should—we’ll put the link in the [show notes 00:23:19], for everybody that wants to go and take that. One of the things that you’ve mentioned a bunch is as we talk about automation, and automating and getting chaos engineering coverage in the same way that test coverage happens, one of the things that you’re involved in—and I think why you’ve got so much knowledge around automation—is you’ve been involved in the OpenGitOps Project, right?

Leo: Mm-hm. Correct.

Jason: Can you tell us more about that? And what does that look like now? Because I know GitOps has become this, sort of, buzzword, and I think a lot of people are starting to look into that and maybe wondering what that is.

Leo: I’m co-chair of the GitOps Working Group by the CNCF, which is the working group that effectively shepherds the OpenGitOps Project. The whole idea behind the OpenGitOps Project is to come to a consensus definition of what GitOps is. And this is along the lines of—like, we were talking about DevOps, right?

Like, DevOps is—everybody is doing DevOps and everybody does something different. So, there is some commonality, but there is not necessarily a community-agreed-upon single perspective as to what DevOps is. So, the idea behind the OpenGitOps Project and the GitOps Working Group is to basically rally the community and rally the industry towards a common opinion as to what GitOps is, and eventually work towards conformance and certification—like you guys are doing with chaos engineering—in an open-source community fashion. GitOps is basically an operating model for cloud-native infrastructure and applications. So, the idea is that you can use the same patterns and the same model to deploy and operate the underlying infrastructure as well as the workloads that are running on top of it.

It’s defined by four principles that might sound familiar to some, with some caveats. The first principle is that your desired state, how you want your infrastructure and your workloads to look, is declarative. There’s a fundamental difference between declarative and imperative: imperative means giving instructions to reach a certain state; declarative means just defining the characteristics of that state, not the process by which you reached it.

The second principle is that this desired state should be immutable and should be versioned, and this is very much aligned with the whole idea of containers, which are immutable and versioned, and the whole idea of Git, which, if used [unintelligible 00:26:05] following best practices, is also immutable and versioned. So, your declared state should be versioned and immutable.

The third principle is that this state should be continuously reconciled through agents. In other words, it eliminates the human component: you are no longer executing manual jobs and you’re no longer running imperative pipelines for the deployment component of your operation. You are allowing your [letting 00:26:41] agents do that for you, continuously and programmatically.

And the fourth principle is that this is the only way by which you interact with the system. In other words, it completely eliminates the human component from the operating model. So, for example, when I think about GitOps as a deployment mechanism, and progressive delivery within the context of GitOps, I see a lot of… what’s the word I’m looking for? Like, symbiosis.

Jason: Yeah. Symbiosis?

Leo: Yeah. Between chaos engineering and this model of deployment. Because I think chaos engineering is also eliminating a human component: you’re no longer letting humans exercise your system to find problems, you are executing those experiments through agents, you are doing so with a declarative model where you’re declaring the attributes of the experiment and the expected outcome of that experiment, and you’re defining the criteria by which you’re going to abort that experiment. So, if you incorporate that model of automated, continuous validation of your solution through premeditated chaos into a process of continuous reconciliation of your desired state through automated deployment agents, then you have a really, really solid, reliable mechanism for the operation of cloud-native solutions.
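As a rough sketch of the reconciliation model Leo describes (declarative desired state, continuously reconciled by an agent instead of by humans), consider the loop below. It is not how Flux or Argo CD is implemented; the desired state and cluster dictionaries are toy stand-ins, and a real agent would read the declared state from a versioned Git repository and apply changes through the cluster’s API.

```python
import time

# Toy stand-ins: in a real GitOps setup the desired state lives in Git and the
# actual state comes from the cluster.
desired_state = {"checkout": {"image": "checkout:v1.4.2", "replicas": 3}}
cluster_state = {"checkout": {"image": "checkout:v1.4.1", "replicas": 3}}

def reconcile_once(desired: dict, actual: dict) -> None:
    """One pass of the agent: make the actual state converge on the declared state."""
    for name, spec in desired.items():
        if actual.get(name) != spec:
            print(f"drift detected for {name}: {actual.get(name)} -> {spec}")
            actual[name] = dict(spec)  # stand-in for applying the change
        else:
            print(f"{name} matches the declared state; nothing to do")

if __name__ == "__main__":
    for _ in range(2):  # a real agent loops continuously
        reconcile_once(desired_state, cluster_state)
        time.sleep(1)
```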

Ana: I think a lot of what we’ve seen, especially as I sit in on more CNCF stuff, is really trying to get a lot of our systems to be able to know what to do next before we need to interfere, so we don’t have to wake up. So, between chaos engineering, GitOps, and Keptn, [unintelligible 00:28:32] how is it that you can make the load of the SRE and the DevOps engineer be more about making sure that things get better, versus “something just broke and I need to go fix it,” or “I need to go talk to an engineer to go apply a best practice”? Because now those things are built into the system as guardrails, or there are better mental models and things that are more accurate to the real conditions that can happen to a system.

Leo: Actually, I got sidetracked; I never ended up talking more about the OpenGitOps Project and the GitOps Working Group. So, it’s a community effort by the CNCF, and it’s open for contribution by everybody. If you’re in the CNCF Slack, there is an OpenGitOps channel there.

And if you go to github.com/open-gitops, you’ll be able to find ways to contribute. We are always looking to get more involvement from the community. This is also an evolving paradigm, which I think also resonates with chaos engineering.

And a lot of its evolution is being driven by the use cases that are being discovered by the end users of these technologies and the different patterns. Community involvement is very important. Industry involvement is very important. We’re an open community, and it would be fantastic to get to know more about what you’re all doing with GitOps, what it means for you, how these principles apply to the challenges that your teams are running into, and the use cases and problem spaces that you’re having to deal with.

Jason: I think that’s a fantastic thing for our listeners to get involved in, especially as a new project that’s really looking for the insight and the contribution from new members as it gets founded. As we wrap up, Leo, do you have any other projects that you want to share? How can people find you on the internet? Anything else that you want to plug?

Leo: I love to meet people around these subjects that I’m very passionate about. So yes, you can find me on Twitter. I guess it’s easier to just type it: it’s @murillodigital, but you’ll find that in the show notes, I imagine. As well as my LinkedIn.

I have to admit, I’m more of a LinkedIn person. I hope that doesn’t age me or make me uncool, but I never figured out how to really work with Twitter, so you can find me on LinkedIn. I’m also an organizer in the Costa Rica CNCF community.

So, for those that are Spanish speakers: I’m very much for promoting the involvement and openness of the cloud-native ecosystem to the Hispanic and Latin community. Because I think language is a barrier, and we’re coming from countries where a lot of us have struggled to basically get our heads above water, with fewer resources and more difficult access to technology and information. But that doesn’t mean that there isn’t a huge amount of talent in the region. There is. And so, there’s a recent initiative by the CNCF called Cloud Native TV, which is ten shows streaming on Twitch.

If you go to cloudnative.tv, you’ll see them. I run a show called Cloud Native LatinX, which is in Spanish, where I invite people to talk about cloud-native technologies and the cloud-native communities in the region.

And my objective is twofold. I want to demonstrate to all Hispanic and Latin people that they can do it, that we’re all the same; it doesn’t matter if you don’t speak the language. There is a whole bunch of people, and I am one of them, that speak the language and are there to help you learn, support you, and help you push through into this community. Basically, I want anybody that’s listening to come out and say, “These are actionable steps that I can take to move my career forward.” So, it’s every other Tuesday on cloudnative.tv: Cloud Native LatinX, if you want to hear and see more of me talking in Spanish. And the OpenGitOps Project: join in; it’s open to the community. And that’s me.

Ana: Yes, I love that shout-out to getting more folks, especially Hispanic and Latinx folks, more involved in cloud and CNCF projects. Representation matters, and folks like me and Leo come from countries like Costa Rica and Nicaragua; we get to speak English and Spanish, we want to create more content in Spanish and let you know that you can learn chaos engineering in English and you can learn about chaos engineering in Spanish, Ingeniería de Caos. So, come on and join us. Well, thank you, Leo. Muchísimas gracias por estar en el show de hoy, y gracias por estar llamando hoy desde Costa Rica, y para todos los que están oyendo hoy que también hablen español: pura vida y que se encuentren bien. Nos vemos en el próximo episodio. [Thank you very much for being on today’s show, and thank you for calling in today from Costa Rica. To everyone listening today who also speaks Spanish: pura vida, be well, and see you in the next episode.]

Leo: Muchas gracias, Ana, and thanks everybody, y pura vida para todo el mundo y ¡hagamos caos! [Thank you very much, Ana, and pura vida to everyone, and let’s make chaos!]

Jason: For links to all the information mentioned, visit our website at gremlin.com/podcast. If you liked this episode, subscribe to the Break Things on Purpose podcast on Spotify, Apple Podcasts, or your favorite podcast platform. Our theme song is called Battle of Pogs by Komiku and is available on loyaltyfreakmusic.com.
