Gaining Stability with Rule Based Policies with Shimon Tolts - DevOps 226
Shimon Tolts is the CEO of datree.io. Shimon talks the panel through an outage he experienced while working for a previous employer. He breaks down the situation and then explains the types of misconfigurations that caused his outage and how these things can cause problems in other applications as well.
Special Guests:
Shimon Tolts
Show Notes
Links
Picks
- Charles- Willow
- Charles- Traeger Grills
- Jeffrey- How to be most efficient in learning something new
- Jeffrey- Don't fall into the trap of just buying cheap products
- Shimon- daily.dev | All-in-one developer news reader
- Will- The Manual: A Philosopher's Guide to Life
Transcript
Hey, everybody, and welcome back to another episode of Adventures in DevOps. This week on our panel, we have Will Button. What's going on, everyone? We have Jeffrey Gromen. Hey there.
I'm Charles Max Wood from devchat.tv. And this week, we have a special guest, and it is Shimon Tolts. Hi, everyone. It's a pleasure to be here. Thank you very much for hosting me.
So fun. Yeah. Absolutely. I mean, you brought all the energy. It's it's funny we were talking beforehand, and you're just excited.
And I I love it. Do you wanna just introduce yourself real quick? Let people know who you are, what you do at is it d a tree, datree? Datree.io?
Yeah. Sure. Yep. So, yeah, my name is Shimon. I've been in the infrastructure, the R&D space for more than 10 years, and I've worked at large companies like Intel, and I worked at startups from, like, 30 employees until we were 1,000 employees.
And in my previous role, I was an engineering manager for a media company, and I grew together with the organization from 30 employees to 1,000. And I really saw how the struggle is real when you have 400 engineers and you're trying to make this work while breaking things and moving fast. So this is actually what brought us here and what brought me, almost 4 years ago, together with my cofounder, to open Datree, which helps prevent misconfigurations from ever reaching production environments, especially around Kubernetes. Alright. So the code that crashes my stuff, that's my fault.
The misconfigurations, that's somebody else's fault. I've just I'm just being clear. I hope my boss hears this. Anyway, so yeah. So we were we were talking before the episode, and you said that you experienced this outage.
And and this is where you learned a lot of the lessons that led to Datree. Do you wanna just talk about that? Kinda give us the background and the story so that we know what you screwed up? I mean, how that went. Sure.
Sure thing. So let me describe the the scenery first, you know, like in a book. So imagine a company, and you have 1,000 employees. 400 of them are developers. The company is born in the cloud, paying $1,500,000 for AWS every month.
A lot of stuff running. No one actually knows what everything is, but it kinda works. You know? And moving really, really, really fast, working in microservices, a lot of different programming languages. And the philosophy of the company is, you know, like small speedboats, like Amazon calls it, 2-pizza teams.
You know, you have your team, you find you have a problem, you find the best solution, you go, you do it, you're responsible for it. Sounds great. Removes bottlenecks, makes you move fast, and really gives great energy for people because they don't feel like they're a small small screw in a big organization. And my role there was the general manager of the infrastructure division. So my work was to find things that are relevant to all the other teams and build it as an infrastructure.
So for example, we built a data collection pipeline that ingested more than 200,000,000,000, with a b, events every month from 13 regions in AWS. And this is what we would do: every time there was a new technology or something that is cross-company, we would be responsible for it, my team. Sort of like special ops in a way. And one day, we had a production outage. Now this happens.
You know, we're all people. I make mistakes. Everyone makes mistakes besides Charles. It's it happens. And and, like, every company, we said, okay.
Let's postmortem the problem, understand what happened, let's find the root cause, and we did it. And a developer made a misconfiguration in one manifest file. And we said, okay. We we totally understand. People make mistakes, and we want we we believe in the philosophy of, you know, run fast and break things, but we we don't believe in the philosophy of let's make the same mistake 5 times.
You know? It's there is a limit to that as well. So we said, okay. So how do we make sure that this does not happen again? So first of all, you send the postmortem in an email to everyone in the company.
So we tried that. Nice. Doesn't really work. No one reads it. No one remembers it.
And I gotta tell you from the other side as a developer, getting emails every day telling me, like, use this package, use this configuration, check this thing. It's it's not scalable. Like, how am I supposed to remember everything? It's it's just not feasible. Yep.
And we said, okay. We we we did, you know, internal educational systems, and then we did an internal meetup and explained to everyone. And everyone agreed and everyone understands, but it didn't really work well because I think, and this is what we thought, and this is what drove us to actually open the company, is that it has to happen in an automated way within the development flow of the developer. Because every inch, every small thing that you you do in order to change the workflow of the developer, it's it's crazy. It's almost never going to happen, and if it's going to happen, it's going to be very, very painful for the developer, for the manager, and for the company.
So we said, how can we do something that will be seamless in the flow? Because when we spoke with people, they said, I want to know when I'm doing something wrong. I don't want to be the person that submits secret keys into our public GitHub repository. I don't want to be the person that takes production down, but sometimes I just don't know. And this is what drove us to actually open Datree and build a solution that hooks directly into the development workflow.
So it's a CLI utility. You can run it on your laptop, Linux and Mac. Just run datree test on your Kubernetes manifest file or Helm file, and then we provide out-of-the-box predefined policies. So I'll give you simple examples that seem, you know, really trivial, but people don't do it. It's like memory limit, CPU limit, a liveness probe, readiness probe, pulling containers from a centralized registry, not using the latest Docker tag, because then every time you build it, it's like going to the casino.
You don't know which version you're gonna get. I don't know what you're gonna have in production, like, you know? And next, after you have it on your computer, you install it in your CI/CD. And at this point, this is one of the most powerful things because you get a centralized policy management solution. So I, as the DevOps engineer, can identify a problem, think of a policy that I want to apply to all of my 100 or 50 or 5,000 engineers, and with a click of a button, I can enable a policy, and now all the projects that go through this CI/CD pipeline will actually comply with this policy.
And otherwise, it will fail. And the idea is that once it fails, it does not notify the DevOps person. It explains to the developer what they have to do, and shows them and links them to the wiki and to our docs, and tells them, hey, mister developer, hey, missus developer, this is how you can fix it.
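For illustration, here is a minimal sketch of a manifest shaped the way those out-of-the-box checks push you: resource limits, both probes, and a pinned image from a central registry. The names, registry host, ports, and values are all assumptions for the example, not anything Datree prescribes.

```yaml
# Illustrative only -- a Deployment that would pass the checks above.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: billing-service
  labels:
    team: billing            # ownership label, useful later for cost allocation
spec:
  replicas: 2
  selector:
    matchLabels:
      app: billing-service
  template:
    metadata:
      labels:
        app: billing-service
    spec:
      containers:
        - name: billing-service
          # pinned tag from the company registry, never :latest
          image: docker.acme.com/billing-service:1.4.2
          resources:
            requests:
              cpu: 250m
              memory: 256Mi
            limits:          # limits so a leak can't starve the node
              cpu: 500m
              memory: 512Mi
          livenessProbe:
            httpGet:
              path: /healthz
              port: 8080
          readinessProbe:
            httpGet:
              path: /ready
              port: 8080
```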
So we are very, very proud of it because I really believe that this is how I would want my organization to communicate those policies and practices to me as a developer. As an engineer, quit telling me what to do. No. It makes sense. And to be perfectly honest, you know, I'm a web developer for a fairly large financial firm.
And what's nice is a lot of this stuff does kinda get pushed into our CI/CD. But the other nice part of it is that generally, when these kinds of policy changes come down, and I don't think they're using Datree, I think they're just saying we're making this policy change, and we're configuring Jenkins to do it. But they generally are pretty good about going in and making the initial move. Right?
So they they move it to the to match the policy. And then from there, when it lines up with whatever we're doing, that's when we get to the point where it's like, okay. So then if we change something that messes it up, right, then it's on us. Okay. We can roll this back.
But, yeah, they're usually the ones that initially make it comply. And I just wanted to add that because I think there's some level of responsibility that goes both ways. And so that's what I like about this. But if you're the one that's making the initial change that's gonna cause it to fail in CI, then you probably also ought to be the person that's either working with somebody or doing the work yourself to make it comply in the first place. I totally agree.
And this is why when we designed our policy engine, we designed it to have several points of granularity. So first of all, you can see, how am I doing now? Let's scan my GitHub repositories and see, do I have any violations now or not? Secondly, you can enable a rule in a way that we call gradual rollout. So now every time that they make a change that is not compliant, it will tell them, listen.
On August 1st, this change will not be compliant. Right now it is passing, and that's okay. Totally fine. But just so you know, we're going to have a policy in place on August 1st. And this is the policy, and here you have time to actually prepare for it.
And then once August 1st hits, it fails as a warning, and not as enforcement. And then you have a grace period for adoption of the policy. And only then, at the very end, does it actually go to full enforcement. And you're totally right. This is the feedback that we got from our customers, and this is how they designed it, and we built it because this is how they want it.
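Purely as a sketch of that staging idea, and not Datree's actual configuration schema, a gradual rollout could be written down something like this; every field name here is hypothetical:

```yaml
# Hypothetical staged-rollout config -- invented for illustration.
policy: containers-must-pin-image-tag
stages:
  - until: 2021-08-01
    mode: announce   # change still passes; developers see an advance notice
  - until: 2021-09-01
    mode: warn       # change passes, but CI output carries a warning
  - after: 2021-09-01
    mode: enforce    # non-compliant changes now fail the pipeline
```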
Now that's super cool. I really like the approach. You mentioned the path of how you got here through the emails and the meetings and the workshops and stuff, but really all that is only relevant at the time. And doing it this way, I think one of the key things there is that you're meeting the developers where they are, because that's the right time to introduce the solution or the information, when it's relevant to them. Otherwise, it's just out of context. I totally agree.
You have to get the warning and the data in line. I call it in line. And this is why we're working on, we have a Helm plug-in, working on a kubectl plug-in, a VS Code plug-in, everything. And this is very important because if it's not convenient and if it's not in the developer's workflow, then, I'll give you a story. Okay?
I met with a big enterprise company, and they talked to me about a certain policy that we have that says, like, pull containers from the centralized registry of the company. So it's like Mhmm. docker.acme.com. Right? And and he's like, it's it's a good policy.
I want to use your solution instead of ours. And I'm like, what what's your solution? So what do you do today? Oh, we just blocked Docker Hub in our firewall, and no one can access it. And I'm like, what?
Problem solved. For real. Really, this is what he told me. I was shocked. And he's like, yeah.
We just block it at the DNS and firewall level, and that's it. And they can't pull it from there. And I think that this is absolutely not the way to do it. As we go forward, developers want a nice shift-left solution. Yeah.
That that that works until your developers get smart enough to figure out how to use a VPN or other ways of getting a proxy server. Yeah. Yeah. This is security by obscurity. It's yeah.
It's the wrong way. I was gonna say Jurassic Park: nature, or developers, will find a way. Yeah. Right. Absolutely.
You know, just to pile on, I totally love this idea too. And it really speaks, I think, to, like, the whole DevOps mentality of, like, flow and, like, pull requests versus push requests. Because I think, you know, the way you were describing it earlier, Shimon, was that it's basically just, you know, pushing stuff out, which never really works out that well. But if you, as Will said, you get the timing right so that now developers can pull that information as they need it. The timing is right.
The method is right. The information is there for them to pull. So, again, I know I'm just piling on, but I just feel like it actually fits really nicely and elegantly with the whole, like, DevOps, you know, sort of methodology. But, you know, I gotta say, of course, I gave a very radical example now. But when we meet with companies that are at this crossroad, because they say, okay.
Listen. We scaled. We had 30 developers. It was okay. 40, 50, like, now we have, like, 70 developers. It's COVID.
We're all working from home. You can't come to just a room and ask, hey. How do we do this and that? And he's like, we we need to put something in place. And then I see companies choose 2 different paths.
It's like 2 opposites that you can go to. And, of course, I think that the best solution is the middle ground, but, like, some companies go the whole way to, okay, so DevOps is responsible for the cluster, which is true in many organizations. They're responsible for the operational excellence of the cluster and for the day-to-day operations. But then the developers write the application.
Then what they say is, okay, so now every change that the developer makes to a Kubernetes manifest or Helm or anything that touches the infrastructure has to go through the ops team. Now what happens at this point is that there's a huge bottleneck. Eventually, it frustrates both sides, because the developers, they have the R&D backlog and the product sitting on them with timelines that they need to release stuff, and they're waiting for the ops team to approve it. The ops team, they don't wanna babysit developers and tell them, listen, you forgot, you're pulling the latest image, pin a version.
No. Because it's not interesting. They want to do cost reduction. They want to optimize the performance. They want to upgrade it.
They want to bring the new best versions. You know, do crazy POCs. And then, like, all all sides are are basically frustrated because they babysit developers, the developers don't get autonomy, and at the end of the day, just bottleneck the DevOps. And not to talk about the fact that SRE and DevOps teams are usually, like, 1 to 10 developers. So you might have, like, 10 DevOps people to a 100 or 200 developers.
So that that's one side. I I can definitely identify with this. I mean, I'm I'm working on a project right now that's on several timelines. Right? And, yeah, whenever things won't deploy, when it doesn't play nicely with the cluster, things like that, we get frustrated.
Right? And then my boss gets frustrated, and he it's like, why isn't this out there? Right? And then, you know, well, DevOps. Right?
And so then they go to DevOps, and it's the same thing. Right? And then the DevOps guys, sometimes it's okay. Well, let's they'll go figure out what it is, and it's something that they can fix. And sometimes they're coming back to us and saying, well, there's this problem.
And they don't wanna come back to us and manage us. And nobody else is happy, because whoever the powers that be are for the business needs, they just want it out. Right? And so, yeah, what you're talking about, we've run into this more than once over the last year. And then it's like what I'm telling you now, I heard it from several companies multiple times.
And just to tell you another example, the most common thing is that they come to DevOps and tell them, I have a deadline, and then they go like, yeah, but the CFO told me to do cost reduction on AWS. So what do I do? Do I listen to the CTO, to the CFO, or do I listen to the VP of R&D? Or, like, what's more important? And who knows?
I don't know. It's, it's hard. Now let's talk about the other side. The other side is actually when this happens, but DevOps does not assume responsibility, and they go the path of educating the developers, and they say, no.
We're not going to lock anything down, but we're going to put additional effort into educating the developers to make the right decisions, which is nice. The thing is, while this happens, you're really not sleeping at night, both because you're afraid and because you're getting, like, paged for things that are happening. And secondly, the developers, I find them terrified. They go like, I'm going to do something that is going to change production now. They go like, I'm a Java billing engineer.
I don't know Docker. I don't know Kubernetes. I'm an expert in Java billing. Not in, you know, Docker. I don't know.
And then you find them, like, almost crippled, because they're afraid. They say, I don't know. I'm not an expert, and I'm afraid to break it, and I don't wanna do it. And then the teams, they try to educate them and so on. But I think that what goes best is solutions like Datree, or you can take OPA, Open Policy Agent, with Conftest and Gatekeeper, and write your own policies.
And what I've heard from developers is that with the middle ground, as they call it: I feel like I have guardrails. Like, I'm riding the freeway, but I have guardrails, so I can do it by myself. I'm not bottlenecked by DevOps, but if I do something horribly wrong, the system will stop me. And then it's like a nice middle ground between the 2, which I think can greatly help both sides of the equation. I think that's one of the approaches I try to take, specifically in postmortems.
You know? Because in postmortems, a lot of the focus is on root cause and what went wrong. But I try to take it a little bit further than that and say, you know, the failure was not that this code did whatever. The failure is that the system didn't warn somebody that this was going to happen, right? We built an environment where a developer or an engineer was able to make a change that they shouldn't have been allowed to make.
And I think that's what you're describing there is they're free to do whatever they want, but there's the guardrails in place that keep them from doing something that they didn't intentionally wanna do. It's like in AWS, you go delete the resources, and it's like, there are 15 resources attached to this security group. I guess you don't want to delete this security group. So one one thing I'm curious about, Shimon, is is when you talk about your clients, I'm I'm curious to hear what are the common sort of misconfigurations. You know?
Like, well, I don't know. If you were to say if if, you know, if we were to ask you, hey. What are the top 5 or top 10? I'm really curious to hear, like, what you see commonly. It's just, yeah, common.
Great question. It's an absolutely great question. And, by the way, in our docs, hub.datree.io, we list all of the policies that we have, and you can view everything. So let's go over some categories and talk about them and talk about their severity. So one company, a very big messaging company, told me, I want an infra safety score.
I want this to run and for me to be to feel safe. Not not safe in regards to security safe. Also, it also has security aspects, but I want to to know that my safety score is high. So it starts from resource management. So in Kubernetes, for example, you can have CPU requests and CPU limit, memory limit, and memory requests.
This is very, very common, and it especially happens because the developer, she codes the app, and then she sends it to the cluster. She doesn't know what she is going to be paired with. Now the DevOps engineer works on workload management and optimizes the workloads on the different nodes so it will be cost effective. And the problem starts when you don't have memory and CPU limits, and then you have a memory leak in one of your containers, and then it starts affecting, it's like a noisy neighbor, but very, very noisy. And then Kubernetes starts to, it depends on how you configured it and so on, but it starts to kill services, starts to run out of memory.
There are different behaviors, none of them good, and none of them as expected. So I'd say this is the, like, no-brainer one that you should do. Now what we see companies usually do also is that they start and they apply a cluster-wide memory and CPU limit, because you can do it at the runtime level. The problem starts when, you know, you have different departments. One of them needs 4 gigabytes of memory, and it's great.
But then you have the AI engineers, and they're like, we need 40 gigabytes. And then if you don't configure it on the shift-left side, then you go like, okay, so I need to increase everyone's limit to 40 gigabytes, and then it's like nothing. It doesn't matter. So it's really important to set it on the resources side.
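One standard way to set this per team rather than cluster-wide is a Kubernetes LimitRange per namespace, so a backend namespace gets modest defaults while an AI namespace carries its own, much larger one. The numbers here are illustrative:

```yaml
# Defaults applied to containers in the "backend" namespace that omit
# their own requests/limits; an "ai" namespace would get a separate,
# larger LimitRange instead of raising everyone to 40 gigabytes.
apiVersion: v1
kind: LimitRange
metadata:
  name: default-container-limits
  namespace: backend
spec:
  limits:
    - type: Container
      defaultRequest:     # used when a container omits resource requests
        cpu: 250m
        memory: 256Mi
      default:            # used when a container omits resource limits
        cpu: "1"
        memory: 1Gi
```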
So that's one. The second one is, I would say, around workload management in terms of making sure that you have a liveness probe, a readiness probe, that your Docker container has a health check. It sounds so trivial. It sounds so simple. But so many times, people go and and create the workload, don't set those things.
They just, oh, they just HTTP to it. HTTP 200. It works great. And then they don't configure it on the workload level. And, not to talk about, you know, deeper things where you have a service and you want the health check to include that, maybe a connection to a database or to a cache.
And I would really, really advise, in order to increase your safety and stability, to really put an effort into your health checks, readiness, liveness. Because if you do it right and correctly, once things fail, and things always fail, it will be easy for you to find the root cause, and it will be easy for you to protect yourself and for Kubernetes to kill this workload and to get another workload running. So let me just stop you there because I always like to ask the dumb questions. So I think I understand what you're saying. But for some of our, for other people, you know, our listeners who may not have followed that whole train of thought.
Right? Because there's a lot there you just said in 2 minutes that I feel like we could unpack. Right? So I'm gonna say it, and you tell me how wrong I am or if I missed something. But so let's say we spin up, you know, we send out a new package.
We spin it up, and I'm the developer. So what I do is I just check to say, hey, can I, with my browser, hit it, and I get a 200 response back saying we're good? So that's only a piece of it, because maybe I'm only hitting, let's say, the load balancer. And so the load balancer is saying, I'm here.
Right? I'm answering you, but the application behind it is dead. Or maybe the application is alive, but the database behind it is dead. So unless I'm doing health checks that actually go through those steps, we may have just deployed something that broke everything, and I don't even realize it, because all I'm doing is pinging the load balancer and getting a 200 response, and everything looks good to me because I didn't check what's going on behind, you know, like, sort of peeling back the layers of the onion. Did I get that right, or am I missing something?
Absolutely right. This is one of the most common mistakes developers make: they just check the simple front-end web browser, and they don't do the entire process. And then when you do have a problem, it's so hard to debug it, because everything returns a great health check. So you don't understand what is actually the problem. So is this something that, you know, like, Datree does for you?
Is this something I've gotta figure out? Like, how does that, you know, how do you build that into a health check? Because that sounds like there's a lot of steps, and it really depends on the architecture of your application. Yeah. So we can talk about it from an engineering standpoint.
In terms of Datree, Datree is a tool, and it's a tool that you can use in order to say, listen. From now on, all of our Kubernetes workloads are going to have a liveness probe, a readiness probe. Now how you configure this liveness probe and readiness probe is up to you. Same thing. You're gonna put a memory limit.
If you put a memory limit of 64 megabytes and your server, some huge Java jar, I don't know, can't even load up, it's your problem. But what we will do is we will make sure that a policy exists and that it is configured on the resource. The next layer, and this is, like, one of the most common policies, is actually labels. Again, it sounds so simple. It sounds so trivial to put a label, and there are so many reasons why to put a label.
So I'll start with with the one that we're talking about now. So first of all, you can use labels in order to say what type of workload it is, in order to determine which type of policy in terms of resource management, for example, it should use. So then you could say this is from type AI, and they use those types of limits, and those are from type back end, front end. I don't know. Different teams call it in different names.
And you can use it in order to understand what are the relevant policies you should use. This is number 1. Number 2, cost management. This is also a very, very common use case that DevOps people have to deal with, which is constantly knowing how to assign the cost center, because they run the shared resources. And at the end of the day, they pay the check to AWS or Azure or wherever you run it. And then the internal company goes like, okay.
But how much do we need to bill each business unit inside of our organization? And then they go like, I don't know. We had, like, 5,000 servers. And then they go like, okay. Now it's mandatory.
Everyone should say which department this server belongs to, because otherwise we're not gonna know how much to allocate to it, because then you don't know what is the COGS, the cost of goods, of your business. And then the CFO doesn't know if the business is profitable or not profitable, can we hire people, can we not hire people? And it's crazy, because it's, like, board of directors decisions that go down to the CFO, that go down, down, down, down to the simple label that you need to put on your Kubernetes workload in order to know how much it costs. Can you define for us the difference between a liveness probe and a readiness probe? That is a great question.
So, liveness probe, let me say readiness first. A readiness probe is when I'm ready to serve traffic. So let's say I'm initializing myself. I need to start. I need to go create a cache, make sure that I can put it there and so on.
And then a liveness probe is when I'm running: am I running correctly? Can I continue communicating, for example, with my existing cache or whatever it is? I think it's too much; I like just the simple health check, you know, that goes end to end and does the check. In addition, by the way, I also suggest for companies, it has nothing to do with Datree and so on, but, like, to configure outside health checks that actually go and do a user activity on your services for real.
Because the worst thing you want is a customer calling and saying the service is down. You want to be the first to know. So I think that those are the the main things that I I would focus on. Yeah. It's a really good point.
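To make the distinction concrete: readiness gates traffic while the pod warms up, liveness restarts a container that stops answering while running. A sketch of the two side by side, with the paths, port, and timings being assumptions:

```yaml
# Fragment of a pod spec; endpoints and timings are assumptions.
containers:
  - name: api
    image: docker.acme.com/api:2.0.1
    readinessProbe:           # don't route traffic until the cache is warm
      httpGet:
        path: /ready
        port: 8080
      initialDelaySeconds: 10
      periodSeconds: 5
    livenessProbe:            # restart the container if it stops responding
      httpGet:
        path: /healthz        # ideally checks dependencies, e.g. the cache
        port: 8080
      periodSeconds: 15
      failureThreshold: 3
```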
I've been in a few outages where everything was working internally, but nothing was working externally. Yep. I'll tell you one of my most severe outages. It was so hard to debug. It was my previous company.
It was not an outage. It it was even worse. What's worse than an outage? Everything slows down and works really, really bad, and it it doesn't break. So you don't really know.
And it was a data pipeline that collected 200,000,000,000 events every month, and it was geolocation-based routing. So every time someone would click an ad, it would route to the closest AWS region and send the event there. And then we had 13 regions that would send everything to a centralized Kinesis, and then we would have workers that would process it. Now in order to do deduplication and add some attributes, we had a Redis cache. So all the workers would access the Redis cache in order to put in IDs, select IDs, and so on.
And at some point, the amount of messages increased, you know, slowly, slowly, slowly, slowly. And then at some point, the memory of the Redis got filled. So what did it do? It switched to swap. What's the problem with swap?
It's slow. And then all the requests started returning really, really slowly, and then you don't understand. You think, okay, there's a problem. So you put on more servers, and then they bombard the Redis even more.
And then you put in more workers. And, like, you're trying everything, you're just trying to debug everything, until finally we, like, opened it up. It was, like, oh my god, the Redis is running on swap. And then we had to increase the Redis memory, and then it fixed it. And if we had a liveness check that said, I'm going to perform a put of events to the Redis, and I expect it to take 2 to 4 milliseconds.
I'm just making this up. And if at some point this is more than 4 milliseconds, there's a problem. We would have immediately known where the root cause of this issue was, but we didn't. That's the truth. Yeah.
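A hedged sketch of the check he's describing, as a Kubernetes exec liveness probe on the worker: fail the probe when a Redis round trip blows its time budget. The host, port, the 50-millisecond budget, and redis-cli being present in the image are all assumptions:

```yaml
# Hypothetical latency-aware liveness check for a Redis-dependent worker.
livenessProbe:
  exec:
    command:
      - sh
      - -c
      # timeout kills the ping if the round trip exceeds the budget,
      # making the probe fail and surfacing the slow Redis immediately
      - "timeout 0.05 redis-cli -h redis.internal -p 6379 ping"
  periodSeconds: 10
  failureThreshold: 3
```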
Another good reason why postmortems are so important. I'm curious to know because here's my anecdotal, I guess, experience is that I find that very few organizations do postmortems well. And if they're doing them, I don't think that they do them in a very effective way. I think they do them in more of a finger pointing root cause analysis. Who caused the problem and who should we fire?
Right? And I just feel like Yeah. You feel that? Unfortunately, now listen listen. I'm on the security side.
So a lot of the post mortems I'm involved with are security incidents. So those might be a little bit, you know, handled a little bit differently than, like, you know, typical outage or or that sort of situation. But yeah. Will did it. Right?
Yeah. I mean so I I guess I should say, I I feel like most of the time, the postmortems just don't happen. I feel like the times that they do, it almost becomes a witch hunt, and those are very rare. But when they do happen, again, that's just my experience, they just just get nasty. So I I'm curious.
I wanna hear a better story, because I feel like my experience is not good. I've never experienced anything like it, thankfully. The organizations that I've worked with, my company, thank god we did not have a security incident where someone stole all of our records or something, because then I think you're, like, obligated to take action. And maybe most of the, like, your cases were those types of severe cases where, you know, it's borderline, like, federal. It's really a problem.
Right. And, yeah, what I'm referring to is more of, you know, engineers, and, like, the Redis story I just told you. Who are you gonna fire?
No one. It's just gonna make everyone better, you know, and tell them and then think of how we could've fixed it. And then we implemented the check. Believe me, every time there was a problem, the first thing everyone checked was the Redis. Everyone went to see that Redis is okay.
It was like a small baby that everyone takes care of. But I don't believe in the witch hunts. I I really believe in the culture where people come and and they say I made a mistake, and people help them understand. And, again, as long as there was no, like, negligence, I don't know, you know, something criminal or something like that Right. Sure.
Mistakes. Another story, there was an employee. It was her 1st day. The company was still using SVN and not Git. And on her first day on the job, she deleted the entire SVN tree.
Ouch. No. Nothing happened to her. Restored the backup. Yeah.
So IT restored the backup, and it's okay. But I think this is the main difference. I don't know. What is your experience? Personally, I've seen both ways.
You know, I remember in years past, postmortems were the lynch mob with the pitchforks and the torches trying to find out who we're gonna grab. But I think, in my experience, that's gone away over the last few years, to people being more willing to accept that mistakes happen. But it almost feels like a pendulum where now it's trying so hard, I guess, to make sure that someone doesn't feel attacked in the postmortem, that you never get to the root cause either. You know? And so I think you gotta struggle to find the happy medium there.
And, I mean, ultimately, you know, in a lot of these situations, someone did do something incorrect, and you've got to point that out in order to identify it. And when you point it out, you know, you're not, like, calling that person out or attacking their skills. It was just a mistake that happened, but it's important to fully understand what that mistake was so that you can build in the systems to prevent it from happening again. Yeah. Yeah.
I've I've I've been in the situation where and not because of a postmortem, but just because of other things. You know, I had a boss come in once on on one of the teams. I was team lead. And he basically walked in the room and said somebody's getting fired today. Right?
And you don't want people to feel that. Right? Because because I took him outside and I said, I said, if you're gonna pull this, they're all keeping their jobs. I'm just gonna quit. Right?
And it's because nobody should live in that kind of fear. Right? We're all trying to work on the same thing. But the flip side is is, yeah, I mean, if somebody is routinely reckless, right, it it's always Jim. Right?
It it's gone down 4 times this month, and Jim has been the one who messed it up every time. And this is all stuff that we've done training on, and so Jim should know better. You know? The first time, hey. Jim's a human.
2nd time, Jim's still a human. 3rd time, okay. Jim's a human, but Jim is starting to cause some problems. You can have the conversation about whether or not Jim needs to keep his job. But if people feel like they're going to be punished for making a mistake every once in a great while, then you're gonna slow the whole system way down.
And the whole point is, Shimon keeps pointing out, is we wanna keep moving fast. We wanna move fast. We wanna get stuff out. We wanna solve problems for our customers as quickly as possible, and at the same time, maintain some level of stability. Yep.
Totally agree. I think the last point I would make is that the whole idea of root cause analysis, even if, at the end of the day, you can tie it back to one person's typo or mistake or whatever, I personally feel like root cause analysis is generally flawed in that it's rarely one person. Right? It might be one person, you know, again, who typed it in wrong or did whatever, but there's a process breakdown as well. And there was an authority breakdown there, or, like, what Shimon was talking about before.
The guardrails didn't exist. You you just can't point it at one person. Like, it's the system broke down. Yes. It it resulted in somebody's mistake in a manifest file or something like that.
But if you go, you know, if you take it back, you look at it and you say, oh, wait a second, guys. It's because our process isn't all that great. He was trying to do the best he could. He didn't know or whatever it was. He can't be an expert in everything.
Made a mistake, but it's because the entire process broke down, not just because one person made a mistake. And I feel like that's the piece that, you know, when you're trying to do the root cause analysis, that's the piece that people just don't think about. I totally agree. Just to finish on this point, the best root cause analysis process that I have ever seen in my life is GitLab's. They went down, and they opened live docs that everyone could see, all the customers, everyone.
And they had sessions that were, like, an open Google Hangout or Zoom, I don't remember which, and anyone could join. And it was a totally transparent process of them debugging the outage that they had. And, of course, afterwards, they published everything, including, like, logs, like, crazy stuff, and, like, here's what happened, here it is for transparency, and here it is for you to learn how not to make our mistakes. And I really admired it.
Yeah. I think there's something to be said for gaining credibility with your customers whenever they find out about an outage from you instead of them telling you that there's an outage, and then you provide real-time or near-real-time updates to them up until the issue is resolved. Definitely. So I think we've all seen the scenarios where AWS has had an incident, and you find out about it either personally or on Reddit 3 or 4 hours before the AWS status page updates. That is, if the incident did not affect the status page itself, because that has happened as well.
There you go. Yeah. Well, that's interesting to me too. Right? Is that sometimes it's, hey, we screwed this stuff up, and so therefore, our app didn't run.
And then, yeah, we see these big companies that use a lot of AWS or other infrastructure out there on the cloud. And what winds up happening is, yeah, what we're kind of talking about, except they take down the entire us-east-1 region. Right. And everybody goes, why is the internet not working? And, yeah, it turns out that the internet relied on that region for a whole bunch of stuff, and it's gone.
And so those kinds of externalities, where it goes beyond even your code, your company, your infrastructure, your cloud setup, that's fascinating too. And in those cases, you know, as Will's pointing out, we all kinda wanna know. Right? Because it's affecting everybody.
This is why, you know, it was very interesting when Jeffrey said that, like, witch hunt and so on. And I think this is, like, the fine line between security and infrastructure, where the culture in infrastructure is like, yeah, we've all had, like, 17 outages and no problem. And then when it crosses this line, specifically, you know, privacy, security, you know, personally identifying information, then it's like, okay, something different is gonna happen here.
And it's interesting because in organizations like government organizations, there are special ways to investigate what happened, let's say, in the military when there was an operation. So they want to learn from it. So there are 2 paths of investigation. One path is, like, the regular path. They investigate, and, like, they can put someone in jail and so on.
And then there is what's called the professional combat review, where everyone can say what happened, they can even say, I killed someone, and they will not be liable for anything. Like, they can't do anything to them, and they have 100% immunity in this process. And this is done in order to make sure that we learn and that everyone says what really, really happened. And, like, everything you say there is classified. It cannot be used against you and so on.
So I think it's also an interesting thing to think about in our field. I totally agree. I feel like, yeah, like I said, the witch hunt, you know, mentality is a terrible one, regardless of what happened.
I mean, unless you are talking about, like we've said before, malfeasance or negligence or something like that Mhmm. Or, you know, beyond negligence, but, you know, really criminal negligence, like which rarely happens. Right? It's it's generally speaking, you know, it's a breakdown of process and and then just fix it. I mean, just work together and fix it.
Nobody wants to, I mean, I've just been involved in so many companies post-breach. And so everybody just wants things to go back to normal. It's like COVID. Right? Everyone just wants things to go back to normal.
Like, let's just get past this. Let's move on. You know? We do what we have to do, but let's stop reliving it on a daily basis. Yep.
Alright. Well, I think we're kinda getting toward a place where we can start to wrap up. Are there any other kind of big pieces of advice that we need to put out there before we go to our picks? I want to point out one thing, which I really believe in, which is, it's a big word called GitOps, but in general, make sure that all of your configuration and all of your assets, everything, is in code and in Git. And if you leave with one thing from this podcast, it's make sure that everything is infrastructure as code and in Git, because then you will be able to at least see what happened and what was the configuration and how did we configure it.
So this is my final small remark. Hear, hear. That's good advice. Alright. Well, let's roll into picks then.
Jeffrey, do you wanna start us off with picks? Alright. So, yeah, this is something I was just thinking about. I was actually thinking about it as we were talking, you know, just having our conversation here. So my pick isn't a specific thing.
It's more of just an approach. So I get asked all the time, like, how do you, you know, sort of continuously learn, and, you know, learn new technologies, and sort of stay on top of current threats. Technology in general is just a constantly changing space. But, I mean, honestly, I think it applies beyond technology. Our world is just constantly changing.
And how do you stay on top of that? And how do you do that without spending 8 hours a day just trying to read or learn or watch or whatever? And so a couple of things that I have learned. These are more just ideas than an actual, like, go and buy a product or something like that. You know, the way that we learn, I think, you know, different people do learn differently. But what I've seen is that, you know, there's so much out there now, like on YouTube, for instance.
I mean, there's so much content out there, but it takes a long time to go through, especially now that everything has ads in it. So now every video takes much longer to get through. Right? But if what you're trying to learn is very specific, it's sometimes harder to figure out how to learn it, because there are so many blog posts that are too generic or just repeating what everybody else has already said on the topic, and everyone just wants to put it into their blog to try and, you know, get whatever it is, SEO or traction out of the traffic, that sort of thing. Or you can try and, you know, pick it out of, like, a video, but you could be going through a 60-minute video trying to figure out where it is.
So I think part of it is, and there's no real answer here, but part of it is just figuring out what's the best medium for learning what I'm trying to learn. Am I just trying to get an overview of that subject? Then maybe a video is good. If I'm trying to learn something very specific, maybe going to, like, Stack Overflow. And I think building that skill set in yourself, of figuring out what is it that I'm trying to learn and what's the best way for me to get there, is something that we all have to sort of develop.
And I think a lot of us who've been doing this for years are probably thinking, yeah. I've been there. I've done that. I think I'm I'm there already. But I think for some of the people earlier on in their in their career, this might be something that you should really be thinking about is just how to be most efficient, learning something new.
And, obviously, it also goes back to figuring out what the best sources are. Because, like I said, there's a lot of content out there that's just regurgitating what's already out there and sort of dumbing it down sometimes, like, pulling out some of the details. So those sources, you know, you wanna toss, and you wanna just sort of figure out what are the right sources that, you know, give you the right information. So that's one piece. The other thing I was gonna say is, I think a lot of times we have this sort of natural tendency, when we do have to buy something, to look for the cheapest product out there.
Right? And I think that so often the cheapest product actually takes you more time, more energy, and you end up having to do things over, you know, over again or whatever, and it's not really the cheapest product. And I think, you know, again, as you go through that learning and figuring out what is it that I need, don't fall into the trap of just buying the cheapest product. Sometimes it's buying the more expensive product. I mean, sometimes it is the cheapest product.
If it's a use-once type of thing, you know, great. But if it's something that you are gonna continue to use, spend some time figuring out, does it make sense to invest in something a little bit, you know, better quality? So anyway, those are my 2 picks, methodologies, whatever, thoughts for the day. Nice.
Will, what are your picks? Alright. So I have been working my way through this book, The Manual from Epictetus. So he was a stoic philosopher, and I've actually tried to read Marcus Aurelius' Meditations in the past and not really sure how much I actually retained from that. So I came across this book and I really like it because it's just, it's very short.
Like, each page just has one particular quote or saying from Epictetus, and it's been really helpful to just kind of come to an understanding with the whole stoic philosophy, and that in combination with the daily emails, the email list from dailystoic.com. I start each day by reading those, and it's a really good way to kind of level set your mind before you get started in a day and put things in perspective, because I think that's helpful, especially with the amount of information. And if you can't avoid the news that's going on every day, it kind of helps you temper that message and keep things in more of a longer-range perspective. So The Manual from Epictetus and dailystoic.com are my picks for today. Nice.
What do I have for picks? So Father's Day, I've got a couple of picks, for stuff that I did or got for Father's Day. The first pick that I have is my wife's like, k. You get to control the TV, which never happens at my house, both because I don't watch a ton of TV and because my kids just are on video games all day during the summer. So, you know, I'll go down there and I'll just kinda see what's going on.
But, yeah. So on Sunday afternoon, I watched Willow, which is one of my favorite old timey movies. So, I'm gonna I'm gonna pick that because I enjoyed it. I really enjoyed it. Of course, all my kids, the second we turn it on, they're they they sat there for 10, 15 minutes, and they just cleared out of the room.
And I'm just like, guys, this is a good movie. Whatever. Whatever. Anyway, the other pick that I have, so my wife, I've been having issues, my grill has been falling apart for a few years, and I like cooking me some meat.
So my wife got me a Traeger smoker for Father's Day. Oh, nice. And it's got a couple of meat probes in it and stuff like that, which is super nice. It doesn't have Bluetooth or anything in it, I know some of the more expensive models do, but it's nice just because you can kind of cook it to temperature, and then, you know, you're ready to pull it out. Right?
And so anyway, made made a brisket on it for Father's Day. So good. Oh my gosh. You know, I've got some baby back ribs in the fridge that I I need to throw on there sooner rather than later. But, it's it's just it's so nice and and all of the stuff that you kind of cook on the slow cook end of things, they just come out so, so, so good.
Right? So the other forms of that, I guess, are like the crock pot or the sous vide. But, yeah, the smoker is nice too because it gets all this flavor in there. And anyway, yeah, I I am loving having a Traeger, so I'm gonna pick that. Shimon, what are your picks?
So, by the way, I'm gonna have a barbecue now, 15 of my friends are coming, and I have a Napoleon grill, and I really love grilling. And I also, I always measure the temperature of the meat, and I really, really love it. In terms of my picks, so I found daily dot dev. It's something cool that you can, daily dot, sorry, no, I'm mixing up several things here.
It's called daily dot dev, and it's a Chrome homepage extension. So when you open up a new tab, it actually shows you, like, stuff from news and stuff like that, but, you know, targeted at dev. So it's really, really nice because, it just gives you a, like, a thumbnail and a title, and it shows you, what's going on. So I thought it's it's something nice because it's really targeted towards our target audience, so it's nice. Right on.
So that's my small tip, besides the GitOps tip that I gave at the beginning. Awesome. If people wanna connect with you online, where do they find you? Yeah. So I'm at shimontolts on Twitter, and then you can always go to datree.io and see our website there.
You can try to message me on LinkedIn, but it's gonna be, you know, we can do a whole session about what LinkedIn has become in that regard. But, yes. So shimontolts on Twitter, that's the best place to reach out, and I look forward to hearing from you and listening to feedback from users, because this is what we love the most: when people come in, run our CLI, get some stuff, and then they write to us, this is great, but we hate this thing, and why can't I do this and that? And then we talk to them, and we hear their feedback, and this is how we prioritize our road map. So I encourage you to give us feedback about our product at datree.io.
Awesome. Alright. Well, we'll go ahead and wrap up here. Thanks for coming. This was a lot of fun.
Thank you very much for having me. It was really, really fun being here and geeking out about DevOps with you. I feel at home, so thank you very much for having me. Alright. Well, until next time, folks.
Max out.