In this episode Jason is joined by John Martinez, Director of Cloud R&D at Palo Alto Networks, to talk about the FinOps Foundation and the vast range of optimization opportunities to reduce spend in the cloud. John comes in with some extremely useful insights into how FinOps is laid out and its use of a "crawl, walk, run" approach. John and Jason discuss multi-cloud and go into the specifics of the costs associated with multi-cloud, as well as the security changes that come with it. Curious what the future of multi-cloud might look like? Tune in to this episode for those details and more!
Show Notes
In this episode, we cover:
- 00:00:00 - Introduction
- 00:03:15 - FinOps Foundation and Multicloud
- 00:07:00 - Costs
- 00:10:40 - John's History in Reliability Engineering
- 00:16:30 - The Actual Cost of Outages, Security, Etc.
- 00:21:30 - What John Measures
- 00:28:00 - What John is Up To/Latinx in Tech
Links:
- Palo Alto Networks: https://www.paloaltonetworks.com/
- FinOps Foundation: https://www.finops.org
- Techqueria.org: https://techqueria.org
- LinkedIn: https://www.linkedin.com/in/johnmartinez/
Transcript
John: I would say a tip for better monitoring, uh, would be to, uh turn it on. [laugh]. [unintelligible 00:00:07] sounds, right?
Jason: Welcome to the Break Things on Purpose podcast, a show about chaos engineering and operating reliable systems. In this episode we chat with John Martinez, Director of Cloud R&D at Palo Alto Networks. John's had a long career in tech, and we discuss his new focus on FinOps and how it has been influenced by his past work in security and chaos engineering.
Jason: So, John, welcome to the show. Tell us a little bit about yourself. Who are you? Where do you work? What do you do?
John: Yeah. So, John Martinez. I am a director over at Palo Alto Networks. I have been in the cloud security space for the better of, I would say, seven, eight years or so. And currently, am in transition in my role at Palo Alto Networks.
So, I'm heading headstrong into the FinOps world. So, turning back into the ops world to a certain degree and looking at what we can do on two things: better manage our cloud spend and gain a lot more optimization out of our usage in the cloud. So, very excited about the new role.
Jason: That's an interesting new role. I'd imagine that at Palo Alto Networks, you've got quite a bit of infrastructure and that's probably a massive bill.
John: It can be. It can be. Yeah, [laugh] absolutely. We definitely have a large amount of scale, in multi-cloud, too, so that's the added bonus to it all. FinOps is kind of a new thing for me, so as I dig back into the operations world, I'm very happy to discover that the FinOps Foundation exists. There are a lot of prescribed ways of looking at both FinOps and optimization (specifically in the cloud, obviously), and there's a whole framework that I can go adopt.
So, it's not like I'm inventing the wheel, although, having been in the cloud for a long time (and I haven't talked about that part of it), a lot of times, in my early days anyway, it felt like I was inventing new wheels all the time. Being an engineer, the part that I am very excited about is looking at the optimization opportunities of it. Of course, the goal, from a finance perspective, is to reduce our spend where we can, but also to take a look at where we're investing in the cloud. If it takes more of a shift as opposed to a straight-up just-cut-the-bill kind of thing, it's really all about making sure that we're investing in the right places and optimizing in the right places when it comes down to it.
Jason: I think one of the interesting perspectives of adopting multi-cloud is that idea of FinOps: let's save money. And the idea is, if I wanted to run a serverless function, I could take a look at AWS Lambda, I could take a look at Azure Functions to say, "Which one's going to be cheaper for this particular use case," and then go with that.
John: I really liked how the FinOps Foundation has laid out the approach to the lifecycle of FinOps. So, they basically go with the crawl, walk, run approach which, in a lot of our world, is kind of like that. It's very much about setting yourself up for success. Don't expect to be cutting your bill by hundreds of thousands of dollars at the beginning. It's really all about discovering not just how much we're spending, but where we're spending it.
I would categorize the pitting the cloud providers against each other to be more on the run side of things, and that eventually helps, especially in the enterprise space; it helps enterprises to approach the cloud providers with more of a data-driven negotiation, I would say [laugh] to your enterprise spend.
Jason: I think that's an excellent point, the idea that it is very much a run. And I don't know any companies within my sphere, or folks that I know in the engineering space, that are doing that because of that price competition. I think everybody gets into the idea of multi-cloud because of this idea of reliability, and...
John: Mm-hm.
Jason: One of my clouds may fail. Like, what if Amazon goes down? I'd still need to survive that.
John: That's the promise, right? At least that's the promise that I've been operating under for the 11 years or so that I've been in the cloud now. And obviously, in the old days, there wasn't a GCP or an Azure (I think they were in their infancy); there was AWS... and then there was AWS, right? And so I think eventually, though, you're right, you're absolutely right. Can I increase my availability and my reliability by adopting multiple clouds?
As I talk to people, as I see how we're adopting the multiple clouds, I think realistically, though, what it comes down to is that you adopt a cloud, or teams adopt a cloud, specifically for, I wouldn't say some of the foundational services, but mostly those higher-level niche services that we like. For example, for large-scale data warehousing, a lot of people are adopting BigQuery on GCP. If you like general purpose compute and you love the Lambdas, you're adopting AWS and so on, and so forth. And that's what I see more than anything: I really like a cloud's particular higher-level service, and we go and we adopt it, we love it, and then we build our infrastructure around it. From a practical perspective, that's what I see.
I'm still hopeful, though, that there is a future somewhere there where we can commoditize even the cloud providers, maybe [laugh]. And really go from Cloud A to Cloud B to Cloud C, and just adopt it based on the pricing I get that's cheaper, or more performant, or whatever other dimensions that are important to me. But maybe, maybe. We'll remain hopeful. [laugh].
Jason: Yeah, we're still very much in that spot where, despite even the basics, if I want a virtual machine, those are still so different between all the clouds. And I mean even last week, I was working on some Terraform and the idea of building it modularly, and in my head thinking, "Well, at some point, we might want to use one of the other clouds so let's build this module," and thinking, "Realistically, that's probably not going to happen."
John: [laugh]. Right. I would say that there's the other hidden cost about this and it's the operational costs. I don't think we spend a whole lot of time talking about operational costs, necessarily, but what is it going to cost to retrain my DevOps team to move from AWS to GCP, as an example? What are the underlying hidden costs that are there?
What traps am I going to fall into because of that? It seems cool; Terraform does a great job of getting that pain into the multiple clouds from an operations perspective. Kubernetes does a great job as well to take some of that visibility into the underlying (and I hate to use it this way) "hardware" [laugh], virtual hardware, that's like EC2 or Google Compute, for example. And they do great jobs, but at the end of the day we're still spending a lot of time figuring out what the foundational services are. So, what are those hidden costs?
Anyway, long story short, as part of my journey into FinOps, I'm looking forward to not just uncovering the basics of FinOps (what are we spending? Where are we spending it? What are the optimization opportunities?) but also taking a look at some of the more hidden types of costs. I'm very interested in that aspect of the FinOps world as well. So, I'm excited.
Jason: Those hidden costs are also interesting because I think, given your background in security...
John: Mm-hm.
Jason: ...one of the challenges in multi-cloud is, if I'm an expert in AWS and suddenly we're multi-cloud and I have to support GCP, I don't necessarily know all of those correct settings and how to necessarily harden and build my systems. I know a model and a general framework, but I might be missing something. Talk to me a bit more about that as a security person.
John: Yeah.
Jason: What does that look like?
John: Yeah, yeah. It's very nuanced, for sure. There are definitely some efforts within the industry to help alleviate some of that nuance and some of those hidden settings that I might not think about. For example, CIS Foundations as a community: the Foundations Benchmarks that CIS produces can be pretty exhaustive, and there are benchmarks for the major clouds as well. Those go a long way to try and describe, at least, what are the main things I should look at from a security perspective? But obviously, there are new threats coming along every day.
So, if I was advising security teams, security operations teams specifically, it would be definitely to keep abreast of what the latest threats are, and go take a look at what some of the exploit kits are looking for or doing, and adopt some of those hidden checks into, for example, your security operations center: what you react to, what the incident responses are going to be to some of those emerging threats. For sure it is a challenge, and it's a challenge that the industry faces and one that we go through every day. And an exploit that might be available for EC2 may be different on Google Compute or may be different on Azure Compute.
Jason: There's a nice similarity or parallel there to what we often talk about, especially in this podcast, where we talk about chaos engineering and reliability and that idea of let's look at how things fail and take what we know about one system or one service, and how can we apply that to others? From your experience doing a wide breadth of cloud engineering, tell me a bit more about your experience in the reliability space and, at all these great companies that you've worked for, keeping their systems up and running.
John: I think I'm fortunate to have had one of the best experiences ever. So, I'll have to dig way back to 11 years ago, or so [laugh]. My first job in the cloud was at Netflix. I was at Netflix right around the time when we were moving applications out of the data center and into AWS. Again, fortunate; large-scale, at the cusp of everything that was happening in the cloud, back in those days.
I was a systems engineer; that's where I transitioned from, systems engineering. And just a little bit of a plug there: tomorrow is Sysadmin Day, and I still am an old-school sysadmin at heart, so I still celebrate Sysadmin Day. [laugh]. But I was doing that transition from systems engineering into cloud engineering at Netflix, and had just helped move a database application out from the data center into AWS. We were also adopting in those days, very rapidly, a lot of the new services and features that AWS was rolling out. For example, we don't really think about it today anymore, but back then EBS-backed instances were the thing. [laugh].
Go forth and every new EC2 instance we create is going to be EBS-backed. Okay, great. March, I believe it was March 2011, one of AWS's very first, and I believe major, EBS outages occurred. [laugh]. Yeah, lots of, lots of failure all over the place.
And I believe from that, a lot of what (at least in Gremlin)... a lot of that Chaos Monkey and a lot of that chaos engineering really was born out of a lot of our experiences back then at Netflix, and the early days of the cloud. And I have a lot of the scars still on me. But it was a very valuable lesson that I take now every day, having lived through it. I'm sure you guys at Gremlin see a lot of this with your customers and with yourselves, right: the best you can do is test those failure scenarios and hope that you are as resilient as possible. Could we have foreseen that there was going to be a major EBS outage in us-east-1? Probably.
I think academically we thought about it, and we were definitely preaching the mantra of architect for failure, but it still bit us because it was a major cascading outage in one entire region in AWS. It started with one AZ and it kept rolling, and it kept rolling. And so I don't know necessarily, in that particular scenario, that we could have engineered (especially with the technology of the day) full-on failover to another region, but it definitely taught us, and me personally, a lot of lessons around how to architect for failure and resiliency in the cloud, for sure.
Jason: I like that point: it's something that we knew theoretically could maybe happen, but it always seems like the odds of the major catastrophes are so small that we often overlook them and we just think, "Well, it's going to be so rare that it'll never happen, so we don't think about it." As you've moved forward in your career, moving on from Netflix, how has that shaped how you approach reliability, this idea of "we didn't think EBS could ever go down and lead to this"? How do you think of catastrophic failures now, and how do you go about testing for them or architecting to withstand them?
John: It's definitely stayed with me. Every ops job that I've had since, it's something that I definitely take into account in any of those roles that I have. As the opportunity came up to speak with you guys, I wanted to think about reliability and chaos in terms of cloud spend, and how can I marry those two worlds together? Obviously, the security aspect of things, for sure, is there. It's expecting the unexpected and having the right types of security monitoring in place.
And I think that's, kind of going back to an earlier comment that I made about these unexpected or hidden costs that are there lying dormant in our cloud adoption, just like I'm thinking about the cost of security incidents, the cost of failure: what does that look like? These are answers I don't have yet but the explorer in me is looking forward to uncovering a lot of what that's going to be. If we talk a year from now, and I have some of that prescribed, and thought of, and discovered, I think it'll be awesome to talk about it in a year's time and where we are. It's an area that I definitely take seriously. I have applied it not just to operational roles, but as I got into more customer-facing roles in the last 11 years, advising customers both as a sales engineer and as head of customer success at a cloud security startup that I worked for, Evident.io, and then eventually moving here to Palo Alto Networks, it's like, how do I best advise and think about failure scenarios, reliability, and chaos engineering when I talk to customers? I owe it all to that time that I spent at Netflix and those experiences very early on, for sure.
Jason: Coming back to those hidden costs is definitely an important thing. Especially as you interact with folks in the FinOps world, I'm sure there's always that question of, "Why do I have so much redundancy? Why am I paying for an entire AZ's worth of infrastructure that I'm never using?" There's always the comment, "Well, it's like a spare tire; you pay for an extra tire in case you have a flat." But on the other hand, there is this notion of how much are we actually spending versus what does an outage really cost me?
John: Right. We thought about that question very early on at another company I worked at after Netflix and before the startup. I was fortunate again to work in another large-scale environment, at Adobe actually, working on the early days of their Creative Cloud implementation. Very different approach to doing the cloud than Netflix in many ways. One of the things that we definitely made a conscious effort to do, and we thought about it in terms of an insurance policy.
So, for example, S3 replication (so replicating our data from one region to another) was, in those days, an expensive proposition, but one that we looked at, and we intentionally went in with, "Well, no, this is our customer data. How much is that customer data worth to us?" And so we definitely made the conscious decision to invest. I don't call it "cost" at that point; I call that an investment. To invest in the reliability of that data, having that insurance policy there in case something happened.
You know, the chance of catastrophic failure in one region, especially for a service as reliable and as resilient as S3, is very minuscule, I would say, and in practice, it has been, but we have to think about it in terms of investing. We definitely made the right types of choices, for sure. It's an insurance policy. It's there because we need it to be there because that's our most precious commodity, our customers' data.
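One rough way to frame the "insurance policy" trade-off discussed above is to compare the yearly cost of the redundancy with the expected yearly loss it protects against. The sketch below is only an illustration: the function names and every number in it are hypothetical, not figures from Adobe, Netflix, or any cloud provider.

```python
def expected_annual_loss(annual_failure_probability, loss_if_it_happens):
    """Expected yearly loss = chance of the event in a year * cost if it occurs."""
    return annual_failure_probability * loss_if_it_happens


def redundancy_pays_off(annual_redundancy_cost, annual_failure_probability,
                        loss_if_it_happens):
    """True when the redundancy costs less per year than the loss it is expected to avert."""
    return annual_redundancy_cost < expected_annual_loss(
        annual_failure_probability, loss_if_it_happens
    )


# Hypothetical inputs: replication costs 30,000 a year, a regional data loss
# has a 1% chance per year, and losing the data would cost 5,000,000.
print(redundancy_pays_off(30_000, 0.01, 5_000_000))
```

Expected value alone can understate the case made here: when the loss is the customers' data itself, paying more than the expected loss may still be a reasonable investment.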
Jason: Excellent point about that being the most precious commodity. We often feel that our data isn't as valuable as we think it is and that the value for our companies is derived from all of the other things, and the products, and such. But when it comes down to it, it is that data. And it makes me think we're currently in this sort of world where ransomware has become the biggest headline, especially in the security space, and as I've talked with people about reliability, they often ask, "Well, what does Gremlin do security-wise?" And we're not a security product, but it does bring that up: if your data systems were locked and you couldn't get at your customer information, that's pretty similar to having a catastrophic outage of losing that data store and not having a backup.
John: I've thought about this, of course, in the last few weeks, obviously. Very, very public, very telling types of issues with ransomware and the underlying issues of supply chain attacks. What would we do [laugh] if something like that were to happen? Obviously, rhetorically, what would we do? And lots of companies are paying the ransom because they're being held at gunpoint, you know, "We have your data."
So yeah, I mean, a lot of it, in a situation like the example I gave before: could just the replication of, for example, my entire S3 bucket where my customer data is have thwarted a situation like that? And then you think about, kind of like, okay, let's think about this further. If we do it in the same AWS account, as an example, and the attacker obtained my IAM credentials, then it really comes down to the same thing because, "Oh, look, there's another bucket in that other region over there. I'm going to go and encrypt all of those objects, too. Why not, right?" [laugh].
And so, it also raises the question of the design principles and decisions: well, okay, maybe do I ship it to a different account where my security context is different, my identity context is different? And so there's a lot of areas to explore there. And it's a very good question and one that we definitely do need to think about, in terms of catastrophic failure because that's the way to think about it, for sure.
Jason: Yeah. So many parallels between that security and reliability, and it all comes together with that FinOps, and how much are you... how much do we pay for all of this?
John: Between the reliability and the security world, there's a lot of parallels because your job is about thinking, what are the worst-case scenarios? It's, what could possibly go wrong? And how bad could it be? And in many cases, how bad is it? [laugh].
Especially as you uncover a lot of the bad things that do happen in the real world every day: how bad is it? How do I measure this? And so absolutely there's a lot of parallels, and I think it's a very interesting point you make. And so... yeah so, Jason, how can we marry the two worlds of chaos engineering and security together? I think that's another very exciting topic, for sure.
Jason: That is, absolutely. You mentioned just briefly in that last statement, how do you measure it?
John: Yep.
Jason: That comes up to something that we were chatting about earlier, which is monitoring, and what do you measure, and ensuring that you're measuring the right things. From your experience building secure systems, talk to me about what are some of the things that you like to measure, that you like to get observability on, that maybe some folks are overlooking.
John: I think the overlooking part is an interesting angle, but I think it's a little bit more basic than that even. I'll go to my time in the startup, at Evident.io, mainly because I was in customer success and my job was to talk to our customers every day. They varied based on maturity level, but we were working with a lot of customers that were new in the cloud world, and I would say a lot of customers were still getting tripped up by a lot of the basic types of things. What do I mean by that? Some of the basic settings that were incorrect were things like EC2 security groups allowing port 22 in from the world, just the simple things like that. Or publicly accessible S3 buckets.
So, I would say that a lot of our customers were still missing a lot of those steps. And I would say, in many of the cases, putting my security hat on, the first thing you go to is, well, there's an external hacker trying to do something bad in your AWS accounts, but really, the majority of the cases were all just mistakes; they were honest. I'm an engineer setting up a dev account and, instead of figuring out what my egress IP is for my company's VPN, it's easier for me just to set port 22 to allow all from the world. A few minutes later, there you go. [laugh]. Exploit taken, right? It's just the simple stuff; we really as an industry do still get tripped up by the simple things.
I don't know if this tracks with the reliability world or the chaos engineering world, but I still see that way too much. And that just tells me that even if we are a cloud-mature company or organization, there's still going to be scenarios where that engineer at two in the morning just decides that it's just easier to open up the firewall on EC2 than it is to do, quote-unquote, "the right thing." Then we have an issue. So, I really do think that we can't let go of not just monitoring the basics, but also getting better as an industry at alerting on the basics when there are misconfigurations, and shortening that time to alert, because (especially in the security world) it really is critical to shorten that window between when that configuration setting is made and when that same engineer who made the misconfiguration gets alerted to the fact that it is a misconfiguration. So, I'll go to that: it's the basics. [laugh].
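The basic check John describes, finding security groups that allow port 22 in from the world, can be sketched in a few lines of Python. This is a minimal illustration, not a complete scanner: `find_world_open_ssh` and `rule_covers_port` are hypothetical helper names, and the input is assumed to have the same shape as the `SecurityGroups` list returned by the EC2 DescribeSecurityGroups API.

```python
# CIDR blocks that mean "anyone on the internet" for IPv4 and IPv6.
OPEN_CIDRS = {"0.0.0.0/0", "::/0"}


def rule_covers_port(rule, port):
    """True if an ingress rule applies to TCP traffic on `port`."""
    protocol = rule.get("IpProtocol")
    if protocol == "-1":  # "-1" means all protocols and all ports
        return True
    if protocol != "tcp":
        return False
    from_port, to_port = rule.get("FromPort"), rule.get("ToPort")
    if from_port is None or to_port is None:
        return False
    return from_port <= port <= to_port


def find_world_open_ssh(security_groups, port=22):
    """Return the IDs of groups with at least one rule opening `port` to the world."""
    findings = []
    for group in security_groups:
        for rule in group.get("IpPermissions", []):
            if not rule_covers_port(rule, port):
                continue
            cidrs = [r["CidrIp"] for r in rule.get("IpRanges", [])]
            cidrs += [r["CidrIpv6"] for r in rule.get("Ipv6Ranges", [])]
            if any(cidr in OPEN_CIDRS for cidr in cidrs):
                findings.append(group["GroupId"])
                break  # one finding per group is enough
    return findings
```

Running a check like this on every configuration change, rather than on a schedule, is one way to shorten the window between the misconfiguration and the alert.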
Jason: I like that idea of moving the alert forward, though. Because I think a lot of times you think of alerts as something bad has happened and so we're waiting for the alert to happen when there's wrongful access to a system, right? Someone breaks in, or we're waiting for that alert to happen when a system goes down. And we're expecting that it's purely a response mechanism, whereas the idea of let's alert on misconfigurations, let's alert on things that could lead to these, or that will likely lead to these wrong outcomes. If we can alert on those, then we can head it off.
John: It's all the way. And in the security world, we call it shifting left, shifting security all the way to the left, all the way to the developer. Lots of organizations are making a lot of the right moves in that direction for embedding security well into the development pipeline. So, for example, I'll name two players in the Infrastructure as Code space, as we call it in the security space. And I'll name the first one just because they're part of Palo Alto Networks now, so Bridgecrew; so very strong, open-source solution in that space, as well as over on the HashiCorp side where Sentinel is another example of a great developer-forward shift-left type of tool that can help thwart a lot of the simple security misconfigurations, right from your CI/CD pipelines, as opposed to the reaction time over here on the right, where you're chasing security misconfigurations.
So, there's a lot of opportunity to shorten that alert window. And even, in fact, I and my team have spent a lot of time in the last couple of years thinking about what can the bots do for us, as opposed to waiting for an alert to pop up on a Slack message that says, "Hey, engineer. You've got port 22 open to the world. You should maybe think about doing something." The right thing to do there could be something as simple as an alert making it to a Lambda function and the Lambda function closing it up for you in the middle of the night when you're not paying attention to Slack, and the bot telling you, "Hey, engineer. By the way, I closed the port up. That's why it's broken this morning for you." [laugh]. "I broke it intentionally so that we can avoid some security problems."
So, I think there's the full gamut where we can definitely do a lot more. And that's where I believe the new world, especially in the security world, the DevSecOps world, can definitely help embed some of that security mindset with the rest of the cloud and DevOps space. It's certainly a very important function that needs to proliferate throughout our organizations, for sure.
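The Lambda-based remediation John mentions might look roughly like the sketch below. It is an assumption-laden illustration, not his team's bot: the event shape (`group_id`, `port`) and the helper names are hypothetical, the optional `ec2_client` argument exists only so the logic can be exercised without AWS access, and the only AWS call used is boto3's `revoke_security_group_ingress`. It also assumes the offending rule is exactly "TCP on this one port from 0.0.0.0/0"; a rule covering a wider port range would need different handling.

```python
def world_open_rule(port):
    """The ingress permission to revoke: TCP on `port` from anywhere (IPv4)."""
    return {
        "IpProtocol": "tcp",
        "FromPort": port,
        "ToPort": port,
        "IpRanges": [{"CidrIp": "0.0.0.0/0"}],
    }


def close_world_open_port(ec2_client, group_id, port=22):
    """Revoke the world-open rule and return a message the bot could post."""
    ec2_client.revoke_security_group_ingress(
        GroupId=group_id,
        IpPermissions=[world_open_rule(port)],
    )
    return (
        f"Closed port {port} to the world on {group_id}. "
        "That is why it is broken this morning."
    )


def lambda_handler(event, context, ec2_client=None):
    """Entry point; expects a hypothetical event like {"group_id": "sg-...", "port": 22}."""
    if ec2_client is None:
        import boto3  # imported lazily so the sketch runs without boto3 when a client is passed in
        ec2_client = boto3.client("ec2")
    message = close_world_open_port(
        ec2_client, event["group_id"], event.get("port", 22)
    )
    return {"message": message}
```

Returning the message rather than posting it keeps the sketch independent of any particular chat tool; wiring it to Slack or another channel is left out deliberately.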
Jason: And we're seeing a lot of that in the reliability world as well, as people shift left and developers are starting to become more responsible for the operations and the running of their services and applications, including being on call. That does bring to mind that idea, though, back to alerting on configurations and really starting to get those alerts earlier: not just saying, "Hey, devs, you're on call so now you share the pain," but actually trying to alleviate that pain even further to the left. Well, we're coming up close to time here. So, typically at this point, one thing that I like to do is we like to ask folks if they have anything to plug. Oftentimes that's where people can find you on social media or other things. I know that you're connected with Ana through Latinx in Tech, I would love to share more about that, too. So.
John: For sure, yeah. So, my job in terms of my leadership role is definitely to promote a lot of diversity, inclusion, and equity, obviously, within the workspace. Personally, I do also feel very strongly that I should be not just preaching it, but also practicing it. So, I discovered Techqueria in the last year (in fact, it's going to be about a year since I joined), so techqueria.org, and we definitely welcome anybody and everybody.
We're very inclusive, all the way from if you're a member of the Latinx community and in technology, definitely join us, and if you're an ally, we definitely welcome you with open arms, as well, to join techqueria.org. It is a very active and very vibrant community on Slack that we have. And as part of that, I and a couple of people in Techqueria are running a couple of what we call cafecitos, which is the Spanish word for coffees, coffee meetings.
So, it's a social time, and I'm involved in helping lead the cybersecurity cafecito (we call it Cafecito Cibernético), which happens every other Friday. And it's security-focused, it's security-minded, and we go everywhere from being very social and just talking about what's going on with people personally (so we like to celebrate personal wins, especially for those that are joining the job market or just graduating from school, et cetera) to talking about the happenings. For example, a very popular topic of late has been supply chain attacks and ransomware attacks, so definitely very, very timely there. I'm also involved, being in the cloud security space, in bridging, sort of, two worlds between the DevOps world and the security world; more recently, we started up the DevOps Cafecito, which is more focused on the operations side. And that's where, you know, happy to have Ana there as part of that Cafecito and helping out there. Obviously, there, it's a lot of the operations-type topics that we talk about; lots of Kubernetes talk, lots of looking at how the SRE and the DevOps jobs look in different places.
And I wouldn't say I'm surprised by it, but it's very nice to see that there is also a big difference with how different organizations think about reliability and operations. And it's varied all over the place and I love it, I love the diversity of it. So anyway, so that's Techqueria, so very happy to be involved with the organization. I also recently took on the role of being the chapter co-director for the San Francisco chapter, so very happy to be involved. As we come out of the pandemic, hopefully, pretty soon here [laugh] right (as we're coming out of the pandemic, I'll say), I'm looking forward to that in-person connectivity and socializing again in person, so that's Techqueria.
So, big plug for Techqueria. As well, I would say for those that are looking at the FinOps world, definitely check out the FinOps Foundation. Very valuable in terms of the folks that are there, the team that leads it, and the resources, if you're looking at getting into FinOps, or at least gaining more control and looking at your spend, not so much like this, but with your eyes wide open. Definitely take a look at a lot of the work that they've done for the FinOps community, and the cloud community in general, on how to take a look at your cloud cost management.
Jason: Awesome. Thanks for sharing those. If folks want to follow you on social media, is that something you do?
John: Absolutely. Mostly active on LinkedIn at johnmartinez on LinkedIn, so definitely hit me up on LinkedIn.
Jason: Well, it's been a pleasure to have you on the show. Thanks for sharing all of your experiences and insight.
John: Likewise, Jason. Glad to be here.
Jason: For links to all the information mentioned, visit our website at gremlin.com/podcast. If you liked this episode, subscribe to the Break Things on Purpose podcast on Spotify, Apple Podcasts, or your favorite podcast platform. Our theme song is called Battle of Pogs by Komiku and is available on loyaltyfreakmusic.com.