WILL_BUTTON: Hey, what's going on everybody? I am the host of Adventures in DevOps and... wait. Yeah, that's the right channel name. Sorry. You know, sometimes I get that mixed up, like almost every time, I think. So if you're a frequent listener to the podcast, you know that I mess that up just about every time. So thanks for bearing with me on that. But today I'm not going to mess this part up, because today our guest is Drew Stokes, senior manager of software engineering for PagerDuty. And we're talking about incident management, and that's one of my favorite topics because it goes so deep and it crosses so many disciplines: your infrastructure and your application teams and marketing and sales and the executive suite. Depending on the level of the incident, you're just all over the board with this. So Drew, welcome to the show. I'm excited to have you here.
DREW_STOKES: Thanks, I'm excited to be here. It's good to meet you all.
Hey folks, this is Charles Max Wood. I've been talking to a whole bunch of people who want to update their resume and find a better job. And I figured, well, why not just share my resume? So if you go to topendevs.com slash resume and enter your name and email address, you'll get a copy of the resume that I use, that I've used through freelancing and through most of my career, as I've refined it and tweaked it to get me the jobs that I want. Like I said, topendevs.com slash resume will get you that, and you can just use the formatting. It comes in Word and Pages formats and you can just fill it in from there.
WILL_BUTTON: Right on. So tell us a little bit about how you got into the field of incident management, or incident response.
DREW_STOKES: That's a good question. Okay, yeah, so I've been in tech for a while, like most people here. I think it's been something like 16 years. And I think originally I was kind of trying to figure out my way, helping folks out with technology and networks. And then I got into front end development, moved into back end, and then dropped into SRE, and that's when I really got familiar with not just the process of mitigating incidents, but actually managing them and trying to learn from them. I did that for a while. And then for something like the last eight years, I've been primarily focused on a people management role. And there are a lot of ways in which people managers are involved in incident management as well, both as stakeholders, but also as facilitators and folks who are playing a supportive role for the people who are responding. So I've been in that space for a while now. And back in, I think it was May of 2021, I joined a startup called Jeli, which was founded by Nora Jones, who's the author of the Chaos Engineering book and the founder of the Learning from Incidents community. And that was where I really dropped into incident management in general, but specifically this opportunity to not just resolve incidents and mitigate the issues, but also to learn from them in order to improve future response and organizational performance. So there are a lot of really interesting ways to think about this space. And you mentioned at the beginning, it's really important. And part of the reason is because it's so cross-cutting, right? Because incidents are a lens through which you can see the way that your organization and your people operate. And that applies to customer service. That applies to executives and to the folks actually responding to the incidents. It's a really interesting space with a lot of opportunity, and you'll hear that word a lot in this conversation. We refer to incidents as opportunities.
WILL_BUTTON: Oh, for sure they are, because, you know, one of the things that I think about a lot is, just because we're in tech, we've all done the Google search for "is such-and-such service down" because you're having problems and you're like, did I do something wrong, or are they actually dead in the water right now? And to me, that's one of the hallmarks of a really, really well done incident response plan: customers know that you're having an incident because you told them, versus them discovering that something was broken.
DREW_STOKES: Yeah, there's a level of... Well, so one interesting aspect here is you mentioned another cross-cutting function there, which is you have the internal stakeholders and external stakeholders for these types of things. But there's also this layer, I think, that you're referring to here of, like, operational excellence and observability, right? Do you know that the system's broken before someone tells you that the system is broken? And a lot of the ways in which you can improve that process is through the learning process after the incident, right? So if you have an incident, for example, where a customer reports an issue, looking at the details of that timeline and what actually happened can help you figure out where you need to add additional instrumentation or alerting, or how to adjust your team's processes, you know, your software development life cycle or your release process, to better account for those kinds of unpredictable behaviors in the system. So it's really interesting: when you're dealing with not just complex software systems, but also complex organizations and groups of people, there are really interesting opportunities to figure out how we iteratively improve our understanding of the system and our understanding of its failure modes, so that we can inspire customer confidence and trust, right? Letting them know that there's an issue before they know.
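(To make the "add instrumentation or alerting" idea concrete, here is a minimal sketch in Python using the prometheus_client library. The metric names and the handler are hypothetical, invented for illustration rather than taken from anything discussed in the episode; the point is simply that a latency histogram and an error counter added after an incident give your alerting something to watch.)

```python
from prometheus_client import Counter, Histogram, start_http_server
import random
import time

# Hypothetical metrics you might add after an incident exposes a blind spot.
REQUEST_LATENCY = Histogram("api_request_seconds", "API request latency in seconds")
REQUEST_ERRORS = Counter("api_request_errors_total", "Total failed API requests")

@REQUEST_LATENCY.time()  # records how long each call takes
def handle_request():
    time.sleep(random.uniform(0.01, 0.2))  # stand-in for real work
    if random.random() < 0.05:             # occasional simulated failure
        REQUEST_ERRORS.inc()
        raise RuntimeError("simulated failure")

if __name__ == "__main__":
    start_http_server(8000)  # metrics exposed at :8000/metrics for Prometheus to scrape
    while True:
        try:
            handle_request()
        except RuntimeError:
            pass
```

An alerting rule watching the error counter or the latency histogram is then what tells you the system is broken before a customer does.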
WILL_BUTTON: Yeah, for sure. So early on in this, you mentioned something I want to highlight: mitigating an incident versus managing an incident. Can you elaborate on the difference between those two?
DREW_STOKES: Yeah, that's a great question. So there are a lot of different aspects of incident management in general, and I'll try to decompose them in a way that makes sense here. So you just referenced detection, right? There's a phase there of understanding whether or not there's an incident and trying to do something about it. And I think when we talk about managing incidents, what we're talking about is providing information and coordinating folks in incident response, right? Mitigating an incident is doing something to address the issue and get the system back to a stable state, or performing in a way that's expected with regard to external stakeholders. But I think for us, managing an incident is really about investigating what's going on, getting the necessary folks with the subject matter expertise into the room to contribute to that, and coordinating that group of people; in large organizations or really complex incidents, sometimes you have multiple work streams of investigation within an incident. And then communicating status out to stakeholders, your customer success team, your executives, in a way that allows them to stay informed, but does not have them jump in and start trying to get involved in the process in a way that can add additional complexity to the overall incident management. So from my perspective, I think management is a lot more about the process of coordinating and communicating during an incident. And mitigation is about that moment when you've identified and addressed the issue to stop whatever impact is associated with the incident, right? That's your signal to your external stakeholders that we are in a stable state, we think things are good, and there are various other steps after that. But for me, that's the primary difference.
WILL_BUTTON: Yeah, I think that's really important for someone who hasn't done a lot of incident response to understand: the management of it is equally important as the mitigating of it. And in many of the environments I've worked in, those are actually two key roles for any incident you have. The first responder, who's trying to find the cause and restore the service, but then you also have your primary communications individual, who is getting the information from that first responder and relaying it out, and doing so in a way so that everyone feels like they're in touch with what's going on and they aren't going around the back door, sending DMs to the first responder to get status updates.
DREW_STOKES: Yeah, yeah, one thing we talk a lot about is this incident management maturity model. We think about different buckets of, you know, engineering teams or organizations with regard to how they approach this. And I've been in, you know, multiple layers of the lower end of that maturity model. And sometimes it can be really difficult, yeah, to even understand who's doing what, and who do I ask for an update? You know, I've got a customer who needs an update now and we have an SLA in the contract, what's going on? It can be really difficult to even know who's doing what. In incident response tooling like Jeli, those roles are actually codified in the process. You're assigning an incident commander, you're assigning a communications lead to take care of that external communication, so here's the person to connect with if you need an update, or here's the person responsible for managing this incident, so that if you join in, you can say, hey, I'm here and I know about X, can I help? That sort of thing.
WILL_BUTTON: Right, and so that's one of the things that Jeli does for you: if you need to improve the maturity of your incident response plan, using something like Jeli can kind of help you say, hey, here are the people and the processes you need in place, and provide a framework, right?
DREW_STOKES: Yeah. I think every small organization goes through a phase where someone opens a Google Doc and writes down a runbook for how to run incidents, right? We wanted to provide some of that for you in a way that doesn't get in your way. So we've got a bot in Slack, right, that you can use to declare incidents, assign stakeholders, set stages, communicate status, all that sort of stuff, so that you don't really have to go in and trial-and-error that Google Doc and try to get folks enrolled in the process. There's just a thing nudging you along the way and helping to offload some of that cognitive burden when you're in the middle of managing an incident, right? Typically as an incident commander, you're thinking about a lot of things. Sometimes you're also trying to mitigate the incident, right? If it's 2 a.m., you may have a stretch of time where you're doing everything on your own. And so I think the more folks can find mechanisms and processes that help them reduce the number of things they're doing during management, so they can focus on getting the right folks in the room and finding the means to mitigation, the more successful the response process becomes, which results in better data for your post-incident analysis and then your cross-incident learning over time.
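(As a rough illustration of what a bot like that automates, here is a minimal sketch using Slack's Bolt for Python library. This is not Jeli's implementation; the slash command name, channel naming scheme, and tokens are all made up, and a real bot would also track stages, roles, and status updates.)

```python
from slack_bolt import App

# Placeholder credentials; a real deployment reads these from the environment.
app = App(token="xoxb-your-bot-token", signing_secret="your-signing-secret")

@app.command("/declare-incident")
def declare_incident(ack, command, client, say):
    ack()  # acknowledge the slash command so Slack doesn't show a timeout
    title = command.get("text") or "untitled incident"
    # Open a dedicated channel for the incident and record who declared it.
    slug = title.lower().replace(" ", "-")[:20]
    channel = client.conversations_create(name=f"inc-{slug}")
    channel_id = channel["channel"]["id"]
    client.chat_postMessage(
        channel=channel_id,
        text=(f"Incident declared: {title}. "
              f"Incident commander: <@{command['user_id']}>. "
              "Coordinate response and status updates here."),
    )
    say(f"Opened <#{channel_id}> for '{title}'.")

if __name__ == "__main__":
    app.start(port=3000)  # fine for a sketch; production bots run behind a public endpoint
```

The value is less the code itself and more that declaring, assigning roles, and announcing status become one consistent, low-effort motion.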
WILL_BUTTON: Yeah, it's one of those things where we've all done incident response wrong enough times that we kind of know better. I think it's like in software engineering: writing logs has been done for decades now, so you don't write your own logging engine. You just pull in a logging library, because you don't need to reinvent that wheel. And I think incident response is one of those. We don't need to reinvent this wheel. We can just buy a wheel that's already built.
DREW_STOKES: Yeah, we actually have a couple of customers of Jeli who are trying to replace their wheels, right? Because some of the large organizations who started this process 10 years ago had to make their own. I used to work at New Relic and we had a Slackbot we called Nerdbot, which was our incident response facilitation tool. But there's a cost associated with those things, right? You have to maintain them over time. Oftentimes they kind of fall to the bottom of the priority stack, and so iterating on your internal process becomes really hard. And I think that's where, if you go with something like Jeli's incident response bot, which is fairly opinionated but narrow in scope, right, it's just here are the set of criteria that we use for this thing, with some customizable features like automation, then you don't have to invent that wheel and then reinvent it iteratively for all time. And you also don't really have to answer a lot of those questions as your incidents become more complex. There are different phases of your incident response process. When you're a five-person team, you jump on a Zoom call, right, and you fix it. When you're 50 people in a major incident room, it's a very different experience and requires a different set of skills and supporting tooling. So, yeah.
WILL_BUTTON: Cool. You mentioned a couple of times the post-incident response plan. So elaborate on that a little bit for me.
DREW_STOKES: Yeah, this is another area where I think everyone kind of starts with a recognition that there's more that can be gleaned from these experiences, right? You have an incident, you respond to it, you fix it, maybe you shoot an email off to folks saying what happened and here's what we're gonna do to address it in the future. But as your system complexity grows and as your organization grows, there are many more opportunities to figure out how to change not just the system itself, right? To write better logs or increase visibility into the system's behavior, but also to change how the organization is structured around those systems, right? So one anecdote I like to share is from my time at a previous company: we had this custom feature flag system that had been around for, I don't know, eight or nine years or something. Everybody wanted to get off of it. It wasn't great. And every time there was an incident with that system, someone from the network engineering team would be pulled in because they were one of the original authors. They had nothing to do with the system anymore, but no one else knew how it worked. And so if you're just responding to and mitigating incidents and not looking any further, you don't see those types of organizational misalignment, right? Where you've got a primary owner or subject matter expert who is accountable for a whole slew of things that have nothing to do with this foundational service that's critical for business function. If you've got a feature flag system in, you know, a 14-year-old code base, it's got to work. So when we talk about post-incident learning, this is the next phase in maturity, right? You've figured out your response process. You know how to get the right folks in the room. You know how to move toward mitigation. And you're starting to capture some of the follow-ups that you want to take care of. Maybe we need more visibility, maybe this library in our service is out of date, and if we update it, we'll get better performance. Something like that. But it goes beyond some of those follow-ups as you start to cultivate a process around this. And there are a lot of different ways that folks do this. They refer to them as postmortems or learning reviews, or sometimes you're just getting in a room and talking about the incident without any structure. You start to uncover all of these really interesting aspects of not only the responding team, but the organization overall. And so some of the things that we're most interested in learning are: what did folks know when they responded to the incident? And what did they not know? What are the ways in which the folks involved communicated successfully, and maybe not so much? How did the organization's processes contribute to or prevent aspects of this specific incident? There's all kinds of interesting stuff to dig into, and you can look at it from a bunch of different angles. We have a lot of examples of our customers creating multiple investigations on an incident, where person A and person B both investigate, and then you see where the differences are. And I think that turns up a lot of interesting stuff. We've taken the approach in Jeli of writing incident narratives. So, post-incident learning review, postmortem, whatever you want to call it. Our feeling is that incidents are stories, and the way that people connect with information and learn is through storytelling.
And so we've taken the approach that we want to provide folks with a tool to tell a story backed by evidence, right? What was actually said during the incident, what metrics or data we were looking at, but also to kind of nudge folks in the direction of sharing their perspective and their assertions about what it means. When these two folks were talking, they were talking about different aspects of the system and they didn't realize it. What does that mean? What's the opportunity there to improve the incident management and the way that these teams are connected and communicating, those sorts of things?
WILL_BUTTON: Yeah, you see that a lot whenever you have people with different disciplines or different backgrounds, you know, a networking background versus a software engineering background. And I think that highlights one of the arts of post-incident response: creating those follow-up items and getting the right people engaged to recognize, prioritize, and address the things that you learned from that incident.
DREW_STOKES: Yeah, and you know, you mentioned different disciplines. There are different disciplines within the responding team, but incidents also provide this really unique opportunity to consider the different disciplines across an organization, right? So for your major incidents, it's not just your, you know, senior engineers from a specific team. It's also your customer support folks on critical accounts. It's also your group leads and your executives. All of these people have different priorities and perspectives and understanding with regard to the impacted systems and the impact on the business. Right? If I'm responding to an incident, my goal is to make the chart go down, whereas my executives' or salespeople's goal is to minimize the costs associated with customer impact, right? We've got SLAs with our customers for uptime and we need to keep that in line. And I think the different perspectives and priorities there result in that same kind of difference I mentioned earlier, where I may look at an incident and think it means one thing, but my group lead or my sales associate may look at it and think another thing. And the opportunity with incident narratives or post-incident learning is to try and bridge that divide between those different perspectives and help everyone cultivate a shared understanding of what it means across those dimensions, right? This is what this incident meant for business impact and process, for customer satisfaction, and for the sustainability of our critical services, and things like that.
WILL_BUTTON: Yeah, I've even worked in organizations where it involved the marketing team, because they were out scrolling Twitter, you know, catching tweets going around about the incident and responding to those, trying to minimize the blast radius there.
DREW_STOKES: Yeah, this is a whole other aspect that's really interesting, which is, like, where do incidents come from, right? Who says what an incident is? We've taken the approach that anyone can declare an incident. Some organizations we've worked with are very narrow in terms of who can declare them. But yeah, customer success, marketing, a random person from the internet, they're all sources of potential incidents. Automation and observability, those sorts of things. Once you start thinking about this space and you start exploring ways of benefiting from these lenses on the current state of your systems and organizational process, you start to see that there are opportunities everywhere, right? At Jeli internally, we create incidents for things that are not incidents. If we have a release going out that we think might be impactful to customers because it changes some aspect of the user experience, that's an incident. If we're trying to better understand database failover in RDS, for example, we run a game day as an incident. And doing that gives you this repository of information that you can use again to build that narrative and make those assertions about where we are and where we want to be with regard to how we're operating and the health and stability of our systems. So that's a really interesting anecdote about marketing. I love when those things come in from places you don't expect, right? You just get a message from someone that you haven't met before and they're like, hey, something's going on. Yeah, we better declare this.
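(For the RDS failover game day Drew mentions, the trigger itself can be a few lines of boto3. This is a hypothetical sketch with a made-up instance identifier and region; the interesting part is declaring the game day as an incident first and then checking whether your alerting and dashboards actually caught the failover.)

```python
import boto3

# Hypothetical Multi-AZ RDS instance; rebooting with ForceFailover=True promotes the standby.
INSTANCE_ID = "orders-db"

rds = boto3.client("rds", region_name="us-east-1")
rds.reboot_db_instance(DBInstanceIdentifier=INSTANCE_ID, ForceFailover=True)

# Wait until the instance reports available again, then reconcile what your
# dashboards and pages showed during that window against the game-day incident timeline.
waiter = rds.get_waiter("db_instance_available")
waiter.wait(DBInstanceIdentifier=INSTANCE_ID)
print(f"{INSTANCE_ID} is available again; compare the incident timeline with your alerts.")
```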
WILL_BUTTON: Yeah, you see someone from marketing enter one of the tech Slack channels, and you know this is not going to go well. So I think one of the cool types of companies I like to work with fits the model of Jeli, because you actually use your own product. When you build and release it, your team actually uses it to manage your own incidents. And I think that is really, really cool, because you get firsthand experience of what it's like to be your own customer, and you can understand what your customers are actually seeing when they're trying to use your tool.
DREW_STOKES: Yeah, one thing that was really interesting in the early days about working with our customers, and it's interesting now as well, we'll have to talk about PagerDuty at some point later, but one thing that was really interesting is that the customers we work with are really passionate about their process and those opportunities to learn. And so we get to work really closely with them on, you know, understanding their process and building tooling that works for them. We work with F5 and Indeed and Honeycomb and Zendesk. These are, you know, large influential organizations who are at the cutting edge of this process. So there's this bi-directional information share where, you know, we can build features that support those organizations' processes, but then we can also adopt some of those organizational processes because they make a lot of sense and they work well for us. We were doing a product demo for an important group of people the other day and we noticed some lag in one of our features, and I actually declared an incident with Jeli about the performance of Jeli, the incident response tool. And we ran that in parallel during the demo. And there was this moment where I was just like, this is so cool, running an incident with the tool that we're demoing to people. And there wasn't actually an issue. It was a Wi-Fi lag thing. So everything was good, and that's okay. That's also a learning opportunity. But yeah, it's been really exciting to watch things evolve over time and be a beneficiary of that system, as well as trying to evolve it for our customers and find that alignment across orgs, which is really unique. Most of your incident response and post-incident learning is within an org. We've had the unique opportunity to extend that outward. So it's been fun.
WILL_BUTTON: Right on. One of the things I'm interested to get your opinion on is, over the years, I've developed the opinion that there is a difference between mitigating the issue and resolving the issue. And I refer to that in terms of, like, during the incident, you know, say your API service is slow. It's okay during the incident to throw more servers at it. You know, we're gonna mitigate the issue by adding more servers or adding more memory, or do something to make the symptoms of the problem go away. But then there's this defining moment of, okay, customer impact has been resolved, but now we've got to go back and find the root cause, because adding the additional servers did not fix the issue, it fixed the symptoms. And I'm interested to get your opinion on that.
DREW_STOKES: Yeah, it's a really good distinction that you're making there. And I think it has a lot to do with prioritization and understanding, right? So oftentimes, especially in major incidents, there's a priority involved there to minimize customer impact, right? Because customer impact means lost revenue. Incidents are expensive, both in terms of time and in customer satisfaction and trust. And so I think there are two ways, in my experience, that you mitigate before resolution. And the first, the one I'm mentioning now, is about minimizing the impact in favor of getting things back on track. So, like you said, throw some additional servers at the API and that'll address the symptom, but we still don't understand what's going on under the hood, right? And the second is that sometimes you can choose not to mitigate an issue. I've been in situations where we've had customer impact, but the priority of understanding what's going on has exceeded the priority of needing to address that impact. Maybe because it's, you know, one user at one customer rather than all customers in a major incident. And that second bit I think is really interesting, because you can use the incident and the levers you can pull during the incident to create the conditions for learning while it's happening. Right? So if you mitigate the incident with the API, it means that you have an opportunity to explore what was actually going on. Maybe you isolate one of those servers and you start to dig into which function calls (if you've got distributed tracing, which is amazing), which specific function or endpoint is causing the delay in the response, right? That's causing a delay across all responses. And you can take advantage of that system state, which, you know, if you reboot the servers, if you add a ton of them, those conditions go away and you lose your opportunity to understand what's going on. And so there are a lot of different ways to look at it. I think mitigation and resolution, for folks outside of incident response, that's a mental framework for understanding: are we good now, and are we good for the long term, right? But as a responder, those two events are really key in terms of communicating within the response group what our level of understanding is and what priority decisions we're making with regard to customer impact or system stability or what have you. Sometimes incidents are not resolved for days after the actual incident. Especially for large, complex incidents, sometimes you just have to get things to a steady state and let them stay there until you have a chance to enroll more folks or get a deeper understanding of what's going on. And sometimes those fixes are not things you can roll out, you know, as one hot fix. Sometimes they're major upgrades or major changes to foundational business logic. So I'm glad you made that distinction, because they're really important, and I think oftentimes folks outside of the incident are just like, when are we mitigated? But you can't lose sight of that timeframe between mitigation and resolution, because that's where a lot of the exploratory understanding comes out.
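(The distributed-tracing aside is worth a quick sketch. With OpenTelemetry's Python SDK, wrapping a handler and its sub-steps in spans is enough to see which step carries the latency; the span names and timings here are invented, and the console exporter just keeps the example self-contained.)

```python
import time
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter

# Console exporter for the sketch; real setups export spans to a tracing backend.
trace.set_tracer_provider(TracerProvider())
trace.get_tracer_provider().add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))
tracer = trace.get_tracer("api.example")

def get_order(order_id: str):
    with tracer.start_as_current_span("get_order") as span:
        span.set_attribute("order.id", order_id)
        with tracer.start_as_current_span("load_from_cache"):
            time.sleep(0.01)   # fast path
        with tracer.start_as_current_span("query_database"):
            time.sleep(0.4)    # the slow step stands out in the span durations

get_order("12345")
```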
Time is of the essence when it comes to identifying and resolving issues in your software, and our friends at Raygun are here to help. With Raygun Alerting, get customized error, crash, and performance alert notifications sent directly to your preferred channel, with integrations including Microsoft Teams and Slack. Set thresholds on your alerts based on an increase in error count, a spike in load time, or new issues introduced in the latest deployment, along with custom filters that give you even greater control. Assign multiple users to notify the right team members, with alerts linked directly to the issue in Raygun, taking you to the root cause faster. Never miss a runaway error. Make sure you're quickly notified of the errors, crashes, and front-end performance issues that matter the most to you and your team. Try Raygun Alerting today and create a world-class issue resolution workflow that gives you and your customers peace of mind. Visit raygun.com to learn more and start your 14-day free trial today. That's raygun.com.
WILL_BUTTON: For sure. And one of the things that I try to insist on is that for mitigating the issue, we're allowed to make live changes in production, but the actual root cause fix has to go through our normal development cycle of making the changes in dev, pushing the changes to a staging environment, validating them, and then promoting those changes to production. So it has to follow that flow.
DREW_STOKES: Yeah, and that goes back to that prioritization opportunity, right? So once you've addressed the business-impacting issue, then you've got to get back to your fundamentals, right? And your business processes and compliance and all of that. And so detangling those two things allows you to respond in a way that helps the business, and then address the issue in a way that helps the business, and do those in different ways. Because especially when you're further along in your maturity model, when you're a larger organization, there are a lot of things that can stand in the way of quickly addressing an issue, right? If you don't create a path for doing that, then incidents end up taking longer and having a lot more impact. So, yeah, and the other thing we've learned in all of this is that every organization is different, right? Some organizations have response processes that specifically call out different ways of mitigating impacting issues and different ways of capturing follow-ups for those, right? Sometimes the incident's not closed until you've resolved it, and sometimes it's closed at the point that it's mitigated and you've captured the follow-ups you want to take action on, you know, as a result. Sometimes folks keep talking about the incident after it's been closed and they want all of that for their post-incident learning review as well. There are just so many different ways to tailor this whole incident management process to help an organization be more successful.
WILL_BUTTON: Yeah, one of the places I worked years ago was at a healthcare provider, and we provided medical services for hospitals across the US for trauma patients. And so for every incident that we had, whenever we broke out an incident room, we actually had a person from our quality team who would join the call as well and let us know, every five or ten minutes, how many patients across the United States couldn't receive life-saving healthcare because our stuff was broken. And so we had a very unique incident response model there that doesn't really apply anywhere I've been since then, but there were still lessons that I've taken away from that. You know, number one is mitigate the issue as quickly as possible.
DREW_STOKES: I'm so interested to hear: how did that information help or hinder mitigation for your teams?
WILL_BUTTON: It really set the priority and kept us focused, you know, because as that number went up, you started to understand the impact that this was having. And this was not a case of, oh, the development team sucks, or their network is terrible. In many, many of our incidents, it was because of user error at one of the trauma centers, but it's still not okay to say, oh, well, they're just doing it wrong, because you have to realize that at the same time, while you're on the phone with that person, they're up on a table in the emergency room doing chest compressions on this patient. So they're going to give it their best shot, but they may not be the most attentive user at that time, and you just gotta work with that.
DREW_STOKES: Yeah, you're highlighting a perfect example of why I think we are so focused on post-incident learning, and it's because the most important aspect of these complex technical systems that we're all building and maintaining is the people involved, right? And when you're in an incident response room, a major incident room, whatever, and you've got someone reminding you of the impact, especially when that impact is not just on dollars, but also on people's lives, you create the conditions for this profound human creativity, right? In terms of figuring out, back in the incident management space, what can we do as a team to come up with a creative solution here and get us back to good, you know, temporarily? And I think if you're not reflecting on and talking about those moments in incident response in your post-incident learning review, or a narrative review, whatever you call it, then you're missing out on all of those examples of the ways in which the people are helping support the system and keep things moving. You know, we hear a lot in tech and DevOps and elsewhere that automation is the key to sustainability and more reliable systems. And there are things that we can automate, you know, especially assigning roles during incident management and response. But there's a lot of human involvement, tweaking of the system and adding, you know, capacity. Not technical capacity in terms of the number of network requests you can handle or things like that, but capacity in terms of the system's adaptability. And I would love to be a fly on the wall for one of those incidents that you mentioned, because I imagine folks really came together and came up with some creative solutions to find a way to mitigate those incidents and get things back to good, so that they could figure out, you know, what the long-term solutions were. That's such an exciting space.
WILL_BUTTON: Yeah, for sure. And you know, the majority of the role was communication. All of my coworkers there had exceptional technical skills, but their communication skills were just a plus-one on top of that. And I think that's what made it work so well. And I still say that to this day: DevOps is not a technical role. There's a technical component, but it really is communications, building the technical framework but then communicating that out to your customers, the engineers that you support, and getting the feedback from them to understand, what's the difference between what I built and what they thought I built?
DREW_STOKES: Yeah, it's really great when you have those folks who know how to be in a critical situation, maintain effective communication, and find a solution to the issue. One thing we talk a lot about is, like, how do you scale that? How do you externalize those skills? Oftentimes we find that the folks who are most effective in incidents don't have the capacity or time to help uplevel or train folks into that discipline, right? It requires a lot of different skills. You need the technical expertise, you need experience with the systems involved, and you need a good handle on effective communication, not just for communicating the status of the incident, but also for communicating with the folks that you are directing if you're in an incident commander role, for example. This is another area where, if you invest in learning from these things, you can create artifacts that folks pick up when they join the organization, right? In almost every large org I've been in, there's a Confluence space or a Google Drive folder or something full of post-incident reviews. Sometimes I'll just go in and read those, right? And you start to learn, you know, who are the folks who demonstrate an ability to respond to some of the most significant incidents. What are they doing? How are they doing that? What skills or actions have they taken that stood out in the learning review that I should try to cultivate as a responder? And so that can be a really interesting space too: not just learning about the system and what things we can change to improve performance over time, but how are we leaving breadcrumbs for the new folks coming into the org who are growing into that discipline? Because trial by fire during a major incident can be a really stressful, kind of terrifying experience. So the more you can give, you know, these anecdotal or story-based accounts of how things go in your organization, the more comfortable folks can feel when they step into that role.
WILL_BUTTON: Yeah, I think it's one of those areas where there's like a mentoring path there. And as I have gotten older and been doing this for a while, I've realized that a larger part of my job is sharing that context, because you can put the documentation down, but then there's also the unspoken or the unwritten part of that. There's the mood, the feel, the context of the situation. And I think that's been a problem for far beyond my lifetime, and the only way we've been successful at solving it up to this point is just through that mentoring type role, where you bring people in even though you know that they aren't ready to be the lead on this. You bring them in just so that they can witness it and start making notes for themselves.
DREW_STOKES: Yeah, and that's where a process or a policy around incident response and incident learning based on transparency can be really helpful, right? Sometimes you get a lot of folks joining a major incident room who are trying to contribute in ways that may not actually help with mitigation. But a lot of times we find, in large organizations that have policies angled toward transparency, folks just join to understand and learn in the moment and also after the fact. So, you know, the incident learning review calendar is always a place that I go to try to figure out which of these incidents are gonna be most helpful for me in understanding the way this organization operates and the critical systems, right? In a past role, we had a Kafka platform that was, you know, involved in a lot of incidents, not because the Kafka platform was a problem, but because everything was built around it, right? So anytime there was an issue with any system, it kind of tied back to there. And that presents a really interesting lens: you know, how do these folks communicate with the broader org, and what changes are we making to shore up some of those critical dependencies? And, you know, just being able to join a conversation about that, not having been involved in the response or having anything to do with the teams involved, can be a really powerful opportunity for you to learn about the teams that you're working with and the underlying technologies, especially for folks like me. It's been eight years since I was, you know, maintaining those types of platforms. And so picking up on some of that nuance, so that I can support the folks who are around those systems, can be really helpful. There's a line there, though. You gotta make sure that expectations are clear, right? If you're participating in something for the purpose of learning, you're kind of a sponge rather than someone who's bringing opinions, you know, not having understood the circumstances of this specific incident. So you need a healthy culture and set of expectations around this, but I've seen a lot of orgs that do it well, and it is a game changer, you know, for helping to provide scalable mentorship and opportunities for folks to get a better understanding of the details.
WILL_BUTTON: Yeah, one of the things you commented on that I think just can't be elaborated enough is transparency. And I've worked in multiple places, and when I first started my career, it was in many instances a fireable event if you created an incident, and for that reason people would try to hide and cover up their incidents, which led to no one learning from them. And these days, you know, I almost parade it around, you know, hey, I broke this, because there's a learning opportunity there. And I think it's really important to be open and to build the environment where people aren't afraid to say that they made mistakes. Even the dumb mistakes, we all do them. You learn from it. Actually, at some point in my career, a boss of mine told me, and it's an anecdotal story, but it's still effective: someone created an incident that cost several hundred thousand dollars and said, am I going to be fired now? The boss responded, no, I just spent $200,000 on your education. Why would I fire you now?
DREW_STOKES: Yeah, and this is where I think, like, it's really difficult to build trust, right? It's really easy to damage trust. It's really difficult to build it. And so if you're approaching your incident management life cycle and process from the perspective of trying to support folks doing what they can to help the business be successful, you get a lot of really impactful contribution and collaboration with regard to keeping systems healthy and things like that. But if you over-index on the measurable metrics, well, we're humans, right? Every human will game a metric. You start to cultivate some of those types of environments where, what's the consequence of me doing the right thing here? Is it going to reflect poorly on me? Is it gonna cause an issue? And so thankfully, I think every organization that Jeli has worked with over the past two and a half years since I joined, or two and a half plus years, has taken the approach that, yeah, these are blame-aware learning reviews, right? We know that folks make mistakes, that they don't have sufficient context in the moment, and that they can learn from those experiences and change their approach next time, versus this kind of older model, we'll say, of prioritizing the public visibility of how things are going. And maybe we don't declare an incident for that one, we just try to fix it quickly. Early in my career, I was learning how to use Microsoft SQL databases, and we had a large SharePoint site. It was another medical audit company. And I learned what certain database commands do by running them. Fortunately I had enough experience to quickly restore it before anyone noticed, but that was an environment where I didn't feel comfortable, you know, broadcasting that I had just, in the process of learning some new commands, dropped an entire database. So yeah, it can be a tricky balance, but you know, sunlight is the best medicine, right? Transparency in these types of environments allows folks to do what's necessary to get things back to good. And I think the more you can socialize and demonstrate that transparency, the more effective your organization is gonna be and the more folks are gonna wanna contribute to that mission, whatever it is.
WILL_BUTTON: Yeah, yeah, absolutely agreed. So let's talk a little bit about what's going on with Jeli these days.
DREW_STOKES: Yeah, so Jeli has been like the most interesting experience of my career, I think. I mentioned I joined in 2021, I think it was. Jeli was just a post-incident analysis tool at that time. So we had this notion of building narratives and not much else. And we recognized that part of the post-incident learning process involves having good data, and the way that you get good data is you get consistent in your process. And so we ended up building this incident response bot, and we also went to the other end of the spectrum and started introducing features for cross-incident analysis. And so this is: after an incident, let's spend some time learning, but then how do we roll up those learnings into themes across incidents that help the organization make decisions around growing teams to support services or changing direction with regard to build versus buy, those sorts of things. And so we've been working on a lot of cool stuff for the last two and a half years. And then in, what was it, I think November 2nd, the public announcement that we were merging with PagerDuty went out, which has been really exciting and also a lot to take in; it's been a month, right? And so PagerDuty is something like 1,100 employees as of January of this year. We were 21. We're kind of in the process of figuring out how to bridge that divide. And one thing that I'm really excited about is, you know, Jeli has spent a lot of time differentiating itself as a product in the post-incident learning area, and I think we've brought a lot of kind of novel approaches and opinions to incident response in general. PagerDuty has been doing this for 14-plus years, right? And they created the category within which Jeli could become a company, which is pretty cool. And so what we're looking to do now is to take that practice, post-incident learning, and really get folks from the earlier phases of the maturity model, where they're just doing incident response and maybe a post-incident learning review in a Google Doc, and bring them into the modern era, right, and start creating incident narratives and doing interviews. PagerDuty has something like 27,000 free and paid customers. There's a huge opportunity there to help folks understand a better way of benefiting from all that incident data. So that's my focus right now: figuring out how do we bring those two worlds together while keeping an eye on preserving that kind of post-incident learning tooling and opportunity. So yeah, a lot of exciting stuff on the horizon. We are going into a new year, so I think things will look very different on the PagerDuty side and probably also improve on the Jeli side as well. It's gonna be really interesting.
WILL_BUTTON: Yeah, I think it's a natural fit, you know, cause PagerDuty is hands down a great tool for notifying people that there's something requesting their attention. But what you do after that is kind of up to you. And so it seems like a natural fit to just roll that right into Jeli and help people, just from a business perspective, take this huge PagerDuty customer base and guide them into the thing that they thought they were doing all along.
DREW_STOKES: Yeah, one focus for us has always been, you know, how can we improve the quality of our customers' post-incident learning reviews? And how can we allow the folks conducting those investigations to focus on what matters? We've talked to organizations where, you know, there was one problem manager at a company that used Microsoft Teams, and part of their job was to go through every Teams channel and find transcripts associated with an incident and put them in ServiceNow. Nobody should be doing that. That's just toil. That's not productive. One thing I'm especially excited for with this partnership with PagerDuty, or this acquisition by PagerDuty, is they've got a ton of data. When you're building your post-incident narrative, your timeline, you're adding evidence and you're trying to help folks understand the details of an incident, and the more data you have to substantiate those claims and those events that you're highlighting in the incident, the more folks can learn from not only the overall shape of the incident, but the systems involved, how they're used, and the underlying technology. And so there's an element there that's really exciting, which is just, we have a lot more data to allow our users to work with. But I also think, like I said, PagerDuty has been known for a really long time as kind of an industry leader in scheduling and alerting, right? Ack the page, I got paged, I'm going to go fix it. There is a better way, right? Like, there are ways to tie that process into the incident response process and the post-incident learning piece. And I think that's going to be our focus over the next several months: figuring out how do we give PagerDuty more mechanisms for supporting responders throughout the entire incident management lifecycle, not just the detection phase, which a lot of folks know and are familiar with. But PagerDuty has a full Operations Cloud, which most folks I've talked to don't even know exists. This is the AI and automation for improving the signal-to-noise ratio when it comes to events, this is all of the mechanisms around running actual incidents, and then this is the post-incident as well. PagerDuty has a feature today called Postmortems, which is fairly straightforward. It's your Google postmortem doc, but we think there's a lot of opportunity to not require that folks go and create these datasets manually, but just provide that information so they can use it to build better narratives, better learning, all the things. And yeah, I think it's a natural fit too. I mean, I've been using PagerDuty for basically my entire career, and being able to bridge that gap between the paging and scheduling and then the things that I need to do to help my team be successful is gonna be huge, from my experience.
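(As a rough idea of what "starting from the data" can look like, here is a hedged sketch that pulls one incident's log entries from the PagerDuty REST API to seed a timeline. The API token and incident ID are placeholders, and which fields you keep would depend on your own narrative format.)

```python
import requests

API_TOKEN = "your-pagerduty-api-token"  # hypothetical placeholder
HEADERS = {
    "Authorization": f"Token token={API_TOKEN}",
    "Accept": "application/vnd.pagerduty+json;version=2",
    "Content-Type": "application/json",
}
BASE = "https://api.pagerduty.com"

def fetch_timeline(incident_id: str):
    """Pull the log entries for one incident as a starting point for a post-incident timeline."""
    resp = requests.get(f"{BASE}/incidents/{incident_id}/log_entries", headers=HEADERS)
    resp.raise_for_status()
    entries = resp.json().get("log_entries", [])
    # Return (timestamp, summary) pairs, oldest first, to seed the narrative draft.
    return sorted((e["created_at"], e.get("summary", "")) for e in entries)

for ts, summary in fetch_timeline("PXXXXXX"):  # hypothetical incident ID
    print(ts, summary)
```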
WILL_BUTTON: Yeah, I think having access to that data is gonna lead to better collaboration after the fact. Because, for me, I've always struggled with that: after the incident's over and you're trying to do the review of it, trying to remember what things happened in what order and all of those steps that you took. So if you've got something that can prompt you with reminders and kind of pre-populate that narrative for you, I think it's just gonna lead to better results at the end.
DREW_STOKES: Yeah, there is nothing better than having a starting point when you are trying to investigate an incident, right? If you open an empty Google Doc, it's a hard time. But in Jeli today, you start with the incident transcript, all the conversation that happened in Slack, and data about who was involved. So much better than starting from nothing. And that's especially true when your incident response process uses multiple data sources, like multiple incident response channels or your Datadog charts or what have you. So we're not really looking to do the post-incident narrative for you. We're looking to give you a point to start from, because that saves time, it saves energy, and it lets you focus on the things that only you can create within your post-incident narrative. The investigator is a conduit through which the folks who were involved and all of the organizational miscellany come together into a coherent story about what happened and what it means. So we really want to provide a foundation on top of which folks can have these conversations. And I think there's a lot of opportunity there with this kind of broader spectrum of data and integrations within, you know, customers' existing processes.
WILL_BUTTON: Yeah, it reminds me a lot of, like, there's a people skill there. You put two people who don't know each other in a room and anything could possibly happen. They could strike up a conversation, they could sit there in silence. You know, there's just no way to gauge it. But then if you give them a conversation starter, then you can sort of guide the results from there. And I think that's the real value of what the post-incident narrative does: it's that conversation starter.
DREW_STOKES: Yeah. I mean, certainly for us, as you mentioned, we use Jeli internally and we do our own learning reviews. I think the exercise of, you know, mitigating the incident, putting together the learning review, those are valuable experiences for the folks involved. But getting everyone in the company, because we can do that at 21 people, into a room to talk about what happened, to ask questions, to figure out what did you know, what did you not know, what did I know even though I wasn't involved, those sorts of things. That's where you get really interesting, kind of exponential increases in understanding. And the thing that most excites me about these learning reviews is that it's not just the understanding of the technical or the organizational process, it's the understanding of each other. Right? How I communicate in these environments, how you communicate, what your expectations are, what sorts of things I need to be better about communicating during response. It's a retro, right? And the software can't operate itself. And so if the people are working effectively together, then the software is working effectively. And if they're not, then it's not. And I think that's one of the really big opportunities, especially for large, complex organizations in novel economic environments, to figure out, you know, how do we improve our efficiency and our collaboration so that we can do what needs to be done. Really exciting.
WILL_BUTTON: Yeah. Oh, it is really exciting. I'm looking forward to seeing how this plays out for y'all.
DREW_STOKES: Yeah, I'll have to let you know. You know, we're in a phase right now where there are too many good things for us to do, so we've got to figure out the next best thing and focus on that. But yeah, that's the spot I want to be in. Endless opportunity ahead of us. We just got to figure out how we're going to get that to our customers as quickly as possible.
WILL_BUTTON: Yeah, for sure.
Hey, have you heard about our book club? I keep hearing from people who have heard about the book club. I'd love to see you there. This month we are talking about Docker Deep Dive. That's the book we're reading. It's by Nigel Poulton. And we're going to be talking about all things Docker: just how it works, how to set it up on your machine, how to go about taking advantage of all the things it offers, and using it as a dev tool to make sure that what you're running in production mirrors what you're running on your machine. It's just one of those tools a lot of people use, and I'm really looking forward to diving into it. That's February and March. April and May, we're going to be doing The Pragmatic Programmer, and we're going to be talking about all of the kinds of things that you should be doing in order to further your career, according to that book. Now, of course I have my own methodology for these kinds of things, but we're going to be able to dovetail a lot of them, because a lot of the ideas mesh really nicely. So if you're looking for ways to advance your career, you're looking to learn some skills that are going to get you ahead in your career, then definitely come join the book club. You can sign up at topendevs.com slash book club.
WILL_BUTTON: Cool. So anything else you'd like to share with us about incident response, Jeli, PagerDuty, any topics at all?
DREW_STOKES: Yeah. If you're not already using PagerDuty, take a look. It's the best thing for paging that I've ever found. And if you want to give Jeli a try, there's a free trial on the site, and we start you off with some pre-built learning reviews so you can see what they look like. Start playing around in there. And if you have any questions, I'm sure you'll be able to find me in the show notes here. But it's been really great to meet you, Will, and thank you so much for the opportunity to chat.
WILL_BUTTON: No, thank you. It's been a great conversation. I've enjoyed it. And if you're up for it, I'd love to have you back on the show.
DREW_STOKES: That'd be great. I would love that so much.
WILL_BUTTON: All right, cool. Well, thanks for listening, everyone. And I will see you all next week.