The Sounds of Silence: Lessons From an API Outage with Paul Zaich - RUBY 652

Paul Zaich from Checkr tells us about a critical outage that occurred, what caused it and how they tracked down and fixed the issue. The conversation ranges through troubleshooting complex systems, building team culture, blameless post-mortems, and monitoring the right things to make sure your applications don't fail or alert you when they do.

Special Guest: Paul Zaich

Transcript

Charles Max Wood [00:00:05]:
Hey, everybody. And welcome to another episode of the Ruby Rogues podcast. This week on our panel, we have Luke Stutters.

Luke Stutters [00:00:11]:
Hello.

Charles Max Wood [00:00:12]:
We have Dave Kamura. Hey, everyone. I'm Charles Max Wood from dev chat dot tv. Quick shout out about most valuable dot dev. Go check it out. We have a special guest this week, and that is Paul Zaich.

Paul Zaich [00:00:25]:
Zaich. Well done. Thank you.

Charles Max Wood [00:00:27]:
Now you're here from Checkr. You gave a talk at RailsConf about how you broke stuff or somebody broke stuff. Do you wanna just kind of give us a quick intro to who you are and what you do, and then we'll dive in and talk about what broke and how you figured it out?

Paul Zaich [00:00:42]:
Sure. So I've been a software engineer for about 10 years. Recently, in the last year or so, I transitioned into an engineering management role. But I've worked at a number of different small startups, and joined Checkr in 2017 when the company was at about 100 employees, 30 engineers. I contributed as an engineer for a couple years to our team, and then have recently transitioned, like I said, into an engineering management role at the company.

Charles Max Wood [00:01:09]:
Very cool. I actually have a checker t shirt in my closet that I never wear. It's check r for those that are listening and not reading it. Yeah. So why don't you kinda tee us up for this as far as, yeah, what happened, what broke? Yeah. Give us kind of a preliminary timeline and explain what Checkr does and why that matters.

Paul Zaich [00:01:29]:
Sure. So Checkr was founded in 2014. Daniel and Jonathan, our founders, had worked in the on-demand space at another company and had discovered that it was very difficult to integrate background checks into their onboarding process. Background checks tend to be a very important final safety step for a lot of these companies to make sure that their platform is gonna be safe and secure for their customers. And so in 2014, they started an automated background check company. Initially, the biggest selling point was that Checkr abstracted away a lot of the complexity of the background check process, collecting candidate information and then executing that flow, and exposed that via an API that was developed in a Sinatra app. Three years later, in 2017, I had just joined, about 4 or 5 months before this particular incident happened. Fast forward to that point, we were running, I'd say, a few million checks a year for a variety of different customers.

Paul Zaich [00:02:33]:
Most of those customers use our API, like I said before, to manage that process, and they do most of the collection and interface with the candidate on their side in their own application.

Charles Max Wood [00:02:46]:
Oh, that's interesting. Yeah. A lot of the background check portals that I've seen are like the fully baked portal instead of being a background service that somebody else can integrate into their own app.

Paul Zaich [00:03:00]:
Yeah. Exactly.

Luke Stutters [00:03:00]:
Did you do a background check on me before this episode?

Paul Zaich [00:03:03]:
I did not. There are a lot of very important guidelines and stipulations governed by the Fair Credit Reporting Act that make sure that you have to have a permissible purpose for running a background check. So in this case, most of our customers are using the permissible purpose around employment as the reason for actually running that check.

Charles Max Wood [00:03:25]:
Well, that's no fun. I know. Right? I wanna know everybody's dirty secrets. Interesting. So, yeah, why don't you tell us a little bit about what went down with the app. Right?

Paul Zaich [00:03:37]:
So like I said, in 2017, Checkr at this point was a pretty important component of a number of customers' onboarding processes. But we had started off small, and things grew quickly. In a lot of ways, we were just trying to keep the lights on and scale the system along with our customers as they continued to grow. On-demand was growing a lot in this time as well. So in 2017, we were doing some fairly routine changes to a data model. I wasn't directly involved with that, but we were changing something from an integer ID to a UUID in the references, and there were some backfills that needed to happen. And so an engineer executed a script on a Friday afternoon, which is always a great idea to do. They actually ran the script at about 4:30 PM.

Paul Zaich [00:04:30]:
Probably went and grabbed something, had a little happy hour, and then headed home. About an hour later, we started to receive a few different pages to completely unrelated teams that didn't really know what was going on in terms of this backfill. And it didn't look like anything too serious. It was just an elevated number of exceptions in our client application that does some of the candidate PII collection. And so that team decided to snooze that and just kind of ignore it.

Charles Max Wood [00:05:01]:
Yeah. So people that aren't aware, PII is an acronym for personally identifiable information and is usually protected by law.

Paul Zaich [00:05:08]:
Thank you.

Charles Max Wood [00:05:09]:
Anyway, go ahead.

Paul Zaich [00:05:10]:
So, come Saturday morning, this has been going on for about 12 hours now. This exception comes in again. And at that point, someone on our team actually decided to escalate that and get more stakeholders involved. We had a variety of other issues going on. We had just migrated from one deployment platform to Kubernetes, and so we had some issues getting onto the cluster. There were too many of us trying to get on at the same time. So we all ended up having to actually go into the office and get on the physical network to finally get on and debug the issue.

Paul Zaich [00:05:44]:
So we had a couple of other confounding issues come up at the same time that made the process of response even worse. So finally, this is maybe 10 o'clock in the morning, 10 or 11 in the morning, after being able to take a look at that, we finally identified what the issue was. We were only responding successfully to about 50 to 60% of requests to one of the most critical endpoints on our system, which is the request to actually create a report. After you've collected the candidate's information, you say, please execute this report so we can get that back, and that's a synchronous request that you make using our API. And that request was failing about 40 to 50% of the time with a 404 response, which isn't really expected. So at that point, we were finally able to pin down the issue, and it came back to this script. It turned out that when you went to create a report, we would look to create these additional sub-objects called screenings.

Paul Zaich [00:06:41]:
And due to the script, we'd actually created an issue where a validation would cause the reports to fail to create in this edge case. There were some confounding issues with the way that we had set up the data modeling to begin with that we were trying to work around, and this exception happened. But once we finally fixed the issue, that's when we shifted more into what actually went wrong and what the real issues were that caused this outage.

Charles Max Wood [00:07:08]:
Gotcha. So I'm curious, as you worked through this, what did you add to your workflow to make sure that this doesn't happen again? Because, I mean, some of it's gonna be technical. Right? It's testing, or, you know, maybe you set up a staging environment or something like that. And some of it is going to be, hey, when this kind of alert comes up, do this thing. Right? Because it sounded like you did have some early indication that this happened.

Paul Zaich [00:07:30]:
Right. So I think the first and most important thing is that really from the beginning, we have had what we call a blameless culture. I think it's a common term now in the industry, but the idea there is to really focus on learning from issues, not trying to find who made the particular mistake, and to look at what processes you're missing and what changes you need to make in your code base that would have prevented the problem from happening. So not trying to focus on the individual mistake that was made. As part of that, we did a postmortem doc, and we went through and identified things like, one, we should really have a dedicated script repository that goes through a code review process. So that's one thing we implemented. And we added some safeguards to address this particular issue with the data models as well.
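
For readers who want a concrete picture, here is a minimal sketch of the kind of batched, reviewable backfill that could live in such a dedicated script repository. It is not Checkr's actual script: the Report model, screenings association, and screening_uuid column are illustrative, and the dry-run flag is just one possible safeguard.

```ruby
# A hedged sketch (not Checkr's actual script) of a reviewed backfill:
# batched, with a dry-run flag so reviewers can see exactly what it touches.
# Report, screenings, and screening_uuid are illustrative names.
class BackfillReportScreeningUuids
  BATCH_SIZE = 1_000

  def initialize(dry_run: true)
    @dry_run = dry_run
  end

  def call
    Report.where(screening_uuid: nil).in_batches(of: BATCH_SIZE) do |batch|
      batch.each do |report|
        screening = report.screenings.first
        next if screening.nil? # skip (and ideally log) rather than leave a bad reference

        if @dry_run
          puts "Would set report #{report.id} screening_uuid to #{screening.uuid}"
        else
          # update_column skips validations/callbacks on purpose here;
          # a code review would debate whether that is safe for this model.
          report.update_column(:screening_uuid, screening.uuid)
        end
      end
    end
  end
end

# Run from a rake task or console:
# BackfillReportScreeningUuids.new(dry_run: true).call
```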

Paul Zaich [00:08:24]:
But I think for everyone, the bigger issue was really the fact that we missed the outage for so long. We did actually have some monitoring in place for this particular issue that should've paged for the downtime that we were experiencing in report creation. But it turned out that our monitors were really just not set up in the most effective way to trigger for that particular type of outage. In this case, it was a partial outage, and that requires a much more sensitive monitor for us to detect. Everything we designed beforehand was much more targeted towards a complete failure of our system.

Dave Kamura [00:09:04]:
And so was this something that could have been caught by automated tests?

Paul Zaich [00:09:09]:
This particular issue most likely could not have been caught by an automated test because it was so far outside of the norm of what we expect the data to look like. I mean, we had, of course, unit tests for everything that we were running, and we had request specs as well. We did not have an end-to-end environment set up, like a staging environment where you could run these tests end to end. But, again, the data in this particular case was very old, and we were essentially doing a migration where that data was in a state that wasn't represented anywhere in our code base at this point. So I'm not sure we could have anticipated this particular issue.

Luke Stutters [00:09:52]:
What was the fallout? Did everyone, like, phone up and get really angry? Oh, from a customer's perspective? Yeah. The best bit of outage stories is the kind of human cost, whoever has to answer the phone the next week.

Charles Max Wood [00:10:05]:
Code drama.

Paul Zaich [00:10:06]:
Right. That's always one of the hard parts. Especially as your application and your service become more important to customers, the impact to customers is more and more extreme. In this case, it was a Friday night, so it wasn't something where a lot of our customers were actively monitoring on their end as well. Fortunately, we were able to see that retries were happening, and many of our customers use a retry fallback mechanism, so they were able to just allow those to run through. But it was particularly tricky in this case because there wasn't actually, like, a record ID for many of these particular responses. Fortunately, we do keep API logs.

Paul Zaich [00:10:50]:
So we were able to see exactly which requests failed for each of our customers. We were then able to reach out to our customer success team, and they were able to start to share the impact with each of those customers pretty quickly. I will say that we've done a lot of work to make our customer communication a lot more polished since then, and that's something that we're really focusing on now as well: being able to get more visibility to customers sooner. One of the most important things when it comes to monitoring is that you really wanna be able to find the issue and start to investigate it first. You don't want a customer to identify it first. You should really understand what's happening in your system before anyone else detects that issue.

Luke Stutters [00:11:33]:
And I guess for this kind of product, where your customers are consuming your API, you're also at the mercy of their implementation too. So, you know, they're making a kind of call against you, and if that call is failing, you've got to hope that their system can cope with that as well.

Paul Zaich [00:11:55]:
Exactly. If some of these requests were happening in the browser, or were not set up to automatically retry, that could be a much worse impact on the customer.

Luke Stutters [00:12:05]:
Can we talk about the blameless culture for a bit? This is a new idea. When I was managing engineering teams, I used to have what I call the finger of blame. So I used to do it the other way around. I would hold up my finger in a meeting, and I'd introduce the finger as the finger of blame. And then we'd work out who the finger of blame should be pointing to. Now, more often than not, of course, it was me. So the finger of blame is a double-edged finger, but it was a kind of way of, you know, people take it very seriously when they mess up that kind of stuff. So you kind of have to get your team back on board.

Luke Stutters [00:12:41]:
So it's a way of kind of lightening the mood after that week's disaster. But a blameless culture, as you said, is the more sophisticated way of doing it, instead of pointing a jovial finger at the person who messed up. What does that look like? I mean, do you just go around telling people it's not their fault? Or how do you implement a blameless culture in what sounds like quite a big engineering team?

Paul Zaich [00:13:09]:
I think it starts, for us it really started, with our CTO and cofounder, Jonathan, making that a priority from pretty much day one. Basically, from the beginning of our process, when we've had issues or incidents, we've done a postmortem doc. We've had a process around that, and it's always been very forward facing: very much about what could we have done better, what can we improve, what are the things we should be doing going forward. So I think having that first touchpoint and really having that emphasis from the beginning was really important. As you're building out a bigger engineering team, the critical thing is to keep that culture going. And I think that's definitely a challenge to continue doing, but as we've grown, we've been able to do that so far.

Paul Zaich [00:14:02]:
So I think that was step number one. I think the second piece of it is trying to understand when it's more of a process issue versus something that someone particularly did wrong. A lot of incidents occur because you're trying to make different prioritization decisions, and you're trying to make sure that you anticipate things like failure modes in advance, and sometimes you just miss those. Those are the particular cases where I think the management team needs to really take responsibility for it. It's not an individual issue that caused that particular downtime; it was not necessarily that one piece of code. An example, I think, actually, is this one, where we had some technical debt. We were trying to clean it up, and that was a good thing.

Paul Zaich [00:14:53]:
But I think we didn't necessarily have everything in place to be able to address that technical debt effectively. And that's not necessarily one engineer's responsibility to be out in front of.

Charles Max Wood [00:15:03]:
Yeah. One thing I just wanna add is that I like the blameless culture in the sense that unless somebody is either malicious, which I have never ever encountered, or chronically reckless, which I've also never encountered, everybody is usually trying to pull in the same direction. You know? If somebody has that issue, you identify it pretty fast, and you usually are able to counter it before it becomes a real problem. But, yeah, to put that together, the rest of it is, hey, look, we're on the same team.

Charles Max Wood [00:15:36]:
We're all trying to get to the same place. So let's talk about how we can do this better so that it doesn't happen again. Because next time it might be me, right, who misses a critical step. And I don't want you all fingering me either. I mean, I wanna learn from it, but we don't want people walking around in fear. Instead, if somebody screws up, we want them to come forward and say, hey, I might have messed this up, before it becomes an issue next time.

Paul Zaich [00:16:02]:
Absolutely. And I think one other thing to highlight here is that when you don't have a blameless culture, folks are gonna be very afraid to speak out when they do see an issue, whether they think it was their mistake or someone else's. They're not gonna want to escalate that issue and make sure that it gets attention. So one of the best side effects of having a blameless culture is that you get a really engaged response, and everyone's gonna work together to try to address the issue. I think that even cascades down to customer communication as well, because when you're really engaged in trying to do that, then you're doing the best thing for the customers as well: you're trying to address these issues head on, not trying to find ways to kind of sweep them under the surface.

Charles Max Wood [00:16:50]:
Yeah. And this is important. Sometimes I think people hear this, and they're gonna go, that sounds a little scary. But you want people to take chances sometimes. Right? You want people to kinda take a shot at making things better. That opens it up to them to do that. Right? It's, oh, well, you know, I tried this tweak on the Jenkinsfile, or I tried this tweak on the Kubernetes setup, or I tried this tweak on this other thing. And a lot of times, those things pay off.

Charles Max Wood [00:17:19]:
But if you don't give people the freedom to go for it, a lot of times you're gonna miss out on a lot of those benefits. And, again, as long as they're not being reckless about it. Right? So they're taking the steps. They're verifying it on their own system and things like that. Then you benefit much, much more from people being willing to take a shot. So, with the blameless culture, I'm curious. You get together and you start identifying what the issue is. What does that look like then as far as figuring out what's going on? 'Cause you're not pointing fingers, but you are looking for the commit that caused the problem.

Charles Max Wood [00:17:51]:
Right?

Paul Zaich [00:17:51]:
You are. I think at the end of the day, you're going to try to find the root cause. Right? You're gonna look for that commit. You're gonna look for the log. Maybe it was a script that was logged into

Luke Stutters [00:18:03]:
Mhmm.

Paul Zaich [00:18:03]:
Your logging system. Whatever it is, you're gonna look for that and look for the root cause. So, honestly, a lot of times you know what caused the issue and whether it was something that was specifically run by a specific person. They probably feel a little bit of guilt there, but there's no reason to lay on more. And I think everyone, like you said, feels a lot of responsibility around the work that they're doing already, so there's no reason to overemphasize that. What that looks like is, typically, the team that is impacted is really gonna own that postmortem, and that's one way for them to feel like they're resolving the incident, or the issue that caused the incident. This has definitely become a bit of a different process as the team has grown.

Paul Zaich [00:18:48]:
When we were at 30 engineers, I think it was a little bit easier just to know exactly who should work on those types of mitigations; typically, it's pretty isolated to a specific team. As the team is growing and the system is growing, that's definitely become more of a challenge, because sometimes incidents happen because of different issues that multiple teams have introduced, or maybe there are multiple teams that need to be involved in the mitigation. In that case, we've definitely been trying to evolve our postmortem process and the action items. So we have a program manager, and one of her responsibilities is specifically around making sure that we are coordinating some of those efforts and meeting some SLAs. So we've added some additional rules and coordination around the process as we've started to grow. A lot of it was just on the individual teams initially, and now, as we've grown, there's more process involved.

Paul Zaich [00:19:44]:
I think that's a pretty common thing that you have to introduce as teams grow.

Luke Stutters [00:19:48]:
I will say that if you've got relatives who are in the medical profession, especially if they're pathologists, even the use of the term postmortem makes me uncomfortable, because those are no fun at all. But, yeah, it's also a word that we use. It just makes me go, oh, it's creepy. It's all zombies. I don't know. Postmortem brings me flashbacks to episodes of The X-Files in the nineties when Dana Scully was taking an alien apart.

Charles Max Wood [00:20:20]:
Yep. But, it does give you a little perspective too. Right? Because usually in our post mortems, we're talking about what went wrong with the system, not that somebody actually died because of this. Right?

Luke Stutters [00:20:31]:
I just got a weird brain. Alright? That's what my brain thinks.

Charles Max Wood [00:20:35]:
Well, some software, it is life supporting, you know, a lot of the medical equipment stuff out there. But, you know, in this case, yeah, it yeah. We all wanna keep our jobs as well. So, I mean, it's not like we can just blow it off either. So yeah. So I wanna get back to the topic at hand, though, and talk a little bit about, what kind of monitoring did you have before and what kind of monitoring do you have now in order to catch this kind of thing?

Paul Zaich [00:21:01]:
So we used a number of different types of monitoring at the time. We were pretty heavily reliant on exception tracking, and we also had some application performance monitoring as well, commonly called APM; a couple of examples of that would be something like New Relic or Datadog as a product as well now. And we did also use a StatsD cluster that sent metrics over to Datadog, and I think we had just started using that maybe a few months before this particular incident occurred. So like I alluded to before, we had some monitors for this particular issue, but they were pretty simplistic. They basically just looked for a minimum threshold on the number of reports that were being created. And we had to set that threshold to be very low over, like, an hour period because traffic is variable.

Paul Zaich [00:21:55]:
You never know exactly how many reports are gonna get created. There are times of day where we receive very few requests, and then there are other times where we see large spikes. So we just had very simplistic monitoring in place for some of these key metrics at that point. And at that point, we were still very heavily reliant on, like I said, exception tracking, using error trackers like Sentry that could then alert if you had certain thresholds of errors over a period of time. In this particular case, exception tracking wasn't very useful because we were responding with a 404. There wasn't actually an unhandled exception reaching the tracker. It was just an ActiveRecord::RecordNotFound, something like that, that went unhandled by our code and was automatically turned into the 404 response.

Paul Zaich [00:22:42]:
So it was an expected behavior, but there wasn't an exception that could have been caught.
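
As a rough illustration of the gap Paul describes, here is a minimal Rails sketch, not Checkr's code, of one way to make these "silent" 404s observable: rescue ActiveRecord::RecordNotFound yourself, emit a counter that a monitor can watch, and still return the 404 the client expects. The dogstatsd-ruby and sentry-ruby gems and the metric name are assumptions.

```ruby
# A minimal sketch of counting 404s that never reach the exception tracker.
# Assumes the dogstatsd-ruby and sentry-ruby gems; names are illustrative.
require "datadog/statsd"

STATSD = Datadog::Statsd.new("localhost", 8125)

class ApplicationController < ActionController::Base
  rescue_from ActiveRecord::RecordNotFound do |error|
    # Count the miss so a monitor can alert on an elevated 404 rate,
    # even though no exception ever reaches the error tracker.
    STATSD.increment("api.record_not_found",
                     tags: ["controller:#{controller_name}", "action:#{action_name}"])
    # Optionally record it in Sentry as a message rather than a crash.
    Sentry.capture_message("404: #{error.message} (#{controller_name}##{action_name})") if defined?(Sentry)
    head :not_found
  end
end
```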

Charles Max Wood [00:22:47]:
Yeah. That makes sense. Somebody typed this question in. It was one of the panelists. Did you get that answered? I don't know if it was Luke or Dave.

Luke Stutters [00:22:54]:
It was me. Just to be clear, was this incident a monitoring problem or an alerting problem? Because it sounds like an alert did go off at some point.

Dave Kamura [00:23:06]:
Sounds like it was a people problem because they snoozed the alert.

Paul Zaich [00:23:09]:
I think this was more of a monitoring problem overall. As Dave mentioned, there was a component where a page was snoozed, but I think that was still a failure of our monitoring, because in this case that was just a faint signal of what the true issue was. It was a downstream client application that had paged earlier on, and it wasn't clear at all what the issue was. I think when you're developing a system for alerting, you need to have clear action items. That's where custom metrics, building application metrics as you grow, become really important: having a clear signal of what's wrong so that someone knows where to investigate. In this case, it was a client application in the browser.

Paul Zaich [00:24:07]:
There's a lot of noise there, and I can easily understand why someone would just snooze something like that. In my opinion, it wasn't really a people issue in this particular case.

Dave Kamura [00:24:18]:
Yeah. I think we've all been there before, where we get an alert from whatever monitoring we're doing, and the error looks serious, but you kinda read it and go, oh, you know what? This is probably just a one-off situation. And then it turns out it is actually a big deal that needs to be addressed as soon as possible. So I know I've been there before. And to really track this, I use Sentry for my error tracking, and I get email and text notifications with that. One of the nice things about it is that it'll show the number of occurrences and whether they are unique or not. So I can see if, okay, this particular error is only coming from one user, or I can see we're getting a hundred errors coming from a hundred different users.

Dave Kamura [00:25:16]:
So there's a more widespread problem. So I think, you know, definitely getting the notifications, but then having proper analytics on your errors so you can actually see the scope of how big this is can really kinda weigh in on the importance.

Luke Stutters [00:25:33]:
Yeah. Makes sense. I imagine, Dave, you've been through, like me, many different monitoring platforms: Datadog, you said New Relic. So which are the good monitoring platforms, or which ones make you go, this is the platform that works really well for this API situation?

Dave Kamura [00:25:54]:
I think it all depends on what you're doing. So if you have a heavy JavaScript front end kinda deal and if you also have a lot of Ruby back end code, I know Sentry can handle both of those situations. Other people will go with another solution. So, I personally found Sentry to be my flavor of choice, but, you know, mileage will vary based on what other people have.

Paul Zaich [00:26:22]:
It also depends on where you are in terms of your application's use cases, what the customer profile looks like, how large the company has grown, how many people are supporting it. When you're early on, when you're building a new application, a new product, by definition the developers on it are gonna really understand the whole system very well. So exception tracking probably is gonna give you most of what you need to know in terms of being able to understand what's going on. As the system starts to grow, and especially as you have more discrete teams, I think that's where things like StatsD become more useful, because you need to be able to set up specific use cases for core parts of your application. And I would maybe say that the bar there is when you start to hit the point where you have a significant number of paying customers using specific features. Then you need to start to hone in on one or two key processes where, if they break, it's absolutely critical that you know immediately. That's kind of the point that Checkr was at in 2017. We really needed to have very clear intelligence and visibility into specific parts of our system.

Paul Zaich [00:27:38]:
And we were trying to move in that direction when the incident happened. We've continued to invest in that area going forward. I think it's becoming even more important as we get larger, because there are just so many different systems interacting together that no one really understands the whole system at this point. And the only way to really know how the different systems are working together, and to make sure everything's working properly, is to have some of these custom metrics defined for specific key processes.

Luke Stutters [00:28:08]:
Do you find that putting really large screens on the office wall helps make your application more reliable?

Paul Zaich [00:28:15]:
That's a good question. We're all remote now, so at this point we've had to experiment with that. We did have some of those in our office. I've been trying to find ways to make metrics more visible to our team as we've shifted to 100% remote due to the pandemic. There's also a challenge for our business in particular, where many of our processes are very asynchronous, and they can take hours to days to fully execute. So finding ways to short-circuit that and know that those things are broken can be challenging at times. One of the things we have to do is look at data over time as well, and not just look at real-time metrics. So one thing I've been experimenting with is trying to create more automated reports, going to a sort of Slack channel that we can look at, so people can review that.

Paul Zaich [00:29:07]:
And we've also implemented basically a biweekly review during our retro, where we just look at our metrics and some of the longer-running trends so that we can see if those look correct. Is there anything that's wrong? We can talk about it and see if there are things that we wanna actually action based on that review. So we're trying to find some ways to do check-ins that don't require us to be all in the office together.
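
A hedged sketch of the kind of automated report Paul mentions: a small scheduled script that posts a daily summary to a Slack incoming webhook. The webhook URL, the numbers, and the models named in the usage comment are placeholders, not Checkr's actual setup.

```ruby
# A minimal sketch of posting a daily metrics summary into a Slack channel
# via an incoming webhook. The webhook URL and models are illustrative.
require "net/http"
require "json"
require "uri"

SLACK_WEBHOOK_URL = URI(ENV.fetch("SLACK_WEBHOOK_URL")) # e.g. https://hooks.slack.com/services/...

def post_daily_report(created:, failed:)
  total = created + failed
  failure_rate = total.zero? ? 0.0 : (failed.to_f / total * 100).round(2)
  message = {
    text: "Report creation, last 24h: #{created} created, #{failed} failed (#{failure_rate}% failure rate)"
  }
  Net::HTTP.post(SLACK_WEBHOOK_URL, message.to_json, "Content-Type" => "application/json")
end

# Run from cron or a scheduled Sidekiq job; Report and FailedRequest are
# hypothetical models standing in for wherever these counts come from:
# post_daily_report(created: Report.where("created_at > ?", 1.day.ago).count,
#                   failed:  FailedRequest.where("created_at > ?", 1.day.ago).count)
```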

Luke Stutters [00:29:30]:
The Slack channel truly is the giant performance monitor of 2020. That is literally what tells me whether the stuff is working at the moment. I'm thinking there are a lot of people in the same boat. So it sounds like you're saying that once you get to a certain stage, the off-the-shelf monitoring isn't really gonna cut it. So you have written custom monitoring for your application. Is that correct?

Paul Zaich [00:29:56]:
We have implemented what I consider custom metrics. We use Datadog, so a lot of this is out of the box; you can use their implementation. But you're adding some code as part of your application. Maybe it's a callback on your Active Record model: when something is created, you send a message onto a queue, and that then triggers a message into StatsD that goes to Datadog. Anyway, it's a pretty lightweight implementation in terms of what you have to do, but you're adding specific events that you wanna track.

Paul Zaich [00:30:32]:
And then you can create your own monitors and alerting around those, or correlations between different events in your system. So you could potentially look at a custom metric and compare it to the HTTP statuses that are coming through or the latency of an endpoint, and then you could correlate those two metrics as well. So there are some more advanced things you can do there if you need to. But, again, it's not really a lot of custom work. It's just adding some specific points in your code base that you feel are really important to track. One example of this for Rails users: I believe there's something like this already set up in Datadog for Sidekiq. So we instrument a lot of our Sidekiq jobs, and we can see when the lag is growing on one of those queues.

Paul Zaich [00:31:20]:
We can see what the average completion time is and look at the p90 completion time for different types of jobs. So you get a lot of visibility into your Sidekiq workers and processes very easily, basically for free.
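
To make the pattern concrete, here is a minimal sketch, assuming the dogstatsd-ruby gem and an illustrative Report model, of emitting a custom metric from an Active Record callback plus a Sidekiq queue-latency gauge. It is not Checkr's actual instrumentation, just the general shape Paul describes.

```ruby
# A minimal sketch of custom StatsD/Datadog metrics from a Rails app.
# Assumes the dogstatsd-ruby and sidekiq gems; metric names are illustrative.
require "datadog/statsd"
require "sidekiq/api"

STATSD = Datadog::Statsd.new("localhost", 8125)

class Report < ApplicationRecord
  # Count every successfully created report so a monitor can alert
  # when the creation rate drops or the failure rate climbs.
  after_create_commit do
    STATSD.increment("reports.created")
  end
end

# Periodically (e.g. from a scheduled job), report how far behind each
# Sidekiq queue is; Sidekiq::Queue#latency returns seconds since the
# oldest job in that queue was enqueued.
def report_sidekiq_latency
  Sidekiq::Queue.all.each do |queue|
    STATSD.gauge("sidekiq.queue.latency", queue.latency, tags: ["queue:#{queue.name}"])
  end
end
```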

Dave Kamura [00:31:33]:
And if you're gonna use Slack for your error notification, now I'm not dissing that at all. You know, I have a few applications that actually do that. It just triggers a Slack notification. But if you're only capturing the error message and not a stack trace along with it, then that error message is pretty much useless because it tells you you have a problem somewhere in your millions of lines of code, but we're not gonna tell you where it's at.

Paul Zaich [00:32:00]:
Just to be clear, we capture all of our errors in Sentry. We do have some alerting that goes to Slack, but I would also want to emphasize that anything that truly has any chance of being a serious issue should never be just an email, a Sentry alert, or a Slack alert. You really should have some kind of escalation: maybe it's a text, maybe it's an actual incident response system like PagerDuty where you can have an escalation policy. For us, that's what we're using. It should be synchronous alerting that really forces someone to look at it. You can't rely on something asynchronous like Slack for a serious response on issues.
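
For reference, a hedged sketch of what that synchronous escalation can look like from Ruby, using PagerDuty's Events API v2 instead of (or alongside) a Slack message. The routing key and source name are placeholders, and error handling is omitted for brevity.

```ruby
# A minimal sketch of triggering a PagerDuty incident via the Events API v2.
# The routing key is a placeholder for your service's integration key.
require "net/http"
require "json"
require "uri"

PAGERDUTY_EVENTS_URL = URI("https://events.pagerduty.com/v2/enqueue")

def page_on_call(summary:, severity: "critical")
  event = {
    routing_key: ENV.fetch("PAGERDUTY_ROUTING_KEY"),
    event_action: "trigger",
    payload: {
      summary: summary,                 # e.g. "Report creation 404 rate above 40%"
      source: "api.example.internal",   # illustrative source name
      severity: severity                # critical / error / warning / info
    }
  }
  Net::HTTP.post(PAGERDUTY_EVENTS_URL, event.to_json, "Content-Type" => "application/json")
end
```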

Dave Kamura [00:32:46]:
Now this is a little off topic, but you know what issue I found with that? I use my cell phone for everything. It's where I have my email, get my text messages, phone calls, and all that stuff. And so I would like to keep it on full volume late at night when I'm sleeping, so if a critical alert does arrive, then I can get notified. But my issue is that I would never get any sleep because my phone would just go off. So I need to figure out some way that I can set up a particular phone number or something to override any kind of sleep mode that I have on my phone right now, or get a different phone for that purpose. That seems a bit overkill.

Paul Zaich [00:33:30]:
You can actually do that. You can do that, I believe, at least with iOS. You can set up an override where you snooze everything else, and then, I think you just have to put whatever numbers you think you're gonna receive critical notifications from in your personal contacts, and that'll actually ring through.

Dave Kamura [00:33:47]:
I need to quit being lazy then and just do that.

Luke Stutters [00:33:50]:
Back in 2015, I was working in the States. And due to various issues, I was still effectively responsible for a bunch of servers in the UK. And I'd gone to see a film, put my phone on silent, and, of course, all the servers melted halfway through Skyfall or whatever movie it was. Tom Cruise did not alert me of the impending server disaster while he was dealing with the aliens. So I came out, and everyone was very upset. So I ended up writing custom alerting with a custom app using the Android automator that, when it received a text message with the magic string in it, would actually turn up the volume and then play The Beatles' "Help!" at full volume. And that worked very well.

Luke Stutters [00:34:40]:
But what it didn't have, which I like on the PagerDuty system, is the acknowledgement. So you can see, yeah, I've sent the message. Has that person seen that message and tapped the yes-I-am-aware-the-server-is-melting button? Yeah. I've got, I think it's the bedtime settings in iOS.

Charles Max Wood [00:35:00]:
And, yeah, I've just told it if it's a number in my contacts, then ring. And if it's not, then don't. So, yeah, it'll go off, but it'll only go off if it's, yeah, if it's in my contacts. So, yeah, then I just add whoever or whatever to my contacts, and I'm set. Yeah.

Dave Kamura [00:35:14]:
That should work well for my use case because no one ever calls me.

Charles Max Wood [00:35:17]:
Yeah. It's just the scammers. Right?

Luke Stutters [00:35:18]:
That's... that sounds like a tragic thing to say, Dave.

Dave Kamura [00:35:22]:
Now I have the Verizon call filter, which actually works pretty well. It's reduced the 15, 20 phone calls I would get a day down to, like, 1.

Charles Max Wood [00:35:32]:
Yeah. The iPhone has that feature too where you can essentially tell it, don't ring unless the number's in my contacts.

Dave Kamura [00:35:37]:
Yeah. I got burned by that pretty bad one time. My wife was over at the pool. She had forgotten her phone or she had lost it, and so she borrowed somebody's phone there. And because that random person wasn't in my contacts, I never got her phone call. My phone just stayed silent. So I had to disable that pretty quick. That'll teach you.

Luke Stutters [00:35:58]:
Can I ask you about composite monitors? Because that is a phrase I have not heard before. I'm familiar with a rate monitor. My understanding of that is, you know, if it drops really quickly, it goes off, but if it drops slowly, it doesn't go off. But what is this composite monitor?

Paul Zaich [00:36:16]:
So a composite monitor is basically a combination of several different metrics that you're measuring, tied together with AND or OR statements. So, maybe referencing what I was talking about before, you might have a custom metric that you're looking at, and you wanna look at how many of those events are coming through. And then you might also wanna look at, in this case, the error rate by HTTP status, maybe how many 400 errors you're getting relative to 200s. You could basically do something where you have an AND statement between those two different measures and their boolean evaluations. Or you could do something where you have an OR. So you could say, these are basically signaling the same type of issue that I wanna alert on, but I'm going to look for these different conditions all in the same monitor.

Luke Stutters [00:37:09]:
So you look at multiple different things at once. Is that so that you can combine those to effectively set a much lower threshold and get a higher signal to noise? So you say something like, allow this number of 404s, or this much, I don't know, server load, or this number of other errors, but if you get all three at the same time, then it triggers something different, or does it use a lower number? What's the advantage of using that logic instead of just saying, here is the maximum number of 404s, here's the maximum number of, like, errors? How does that actually translate into a better metric?

Paul Zaich [00:37:55]:
Right. So I think it gives you the ability to tune things so that something potentially has higher fidelity on when it alerts. It depends on how you wanna use it, but in this case you could set the thresholds higher, and you could have something where it's like, well, if there aren't any errors coming through, then maybe we're okay with that even though the number's a little bit lower. Or, again, you can also tune this to be more sensitive. And in this particular incident, if we had had some error monitoring around 404s in addition to the threshold that we had, which was pretty low, I think we would have been alerted on that within maybe an hour. So you can do things there that give you more sensitivity without necessarily causing a lot more false alarms.

Paul Zaich [00:38:47]:
And that's something that you have to be really careful with, with any kind of monitor on a team: you really need to make sure that you are not creating false alarms. I would say that's almost as important, or equally important, as the sensitivity of the alarm. Because if you're creating false alarms all the time, it's just human nature to basically start to ignore those or not really give them the review that they need. So if you're doing that all the time, you're probably gonna miss something inevitably when there's actually a real issue.
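
A hedged illustration of the shape of the monitors Paul describes, written as the kind of definitions you might send to Datadog's monitor API. The metric names, thresholds, and monitor IDs are invented, and the exact query syntax should be checked against Datadog's documentation.

```ruby
# Illustrative Datadog monitor definitions; each hash could be POSTed to
# something like Datadog's /api/v1/monitor endpoint (or managed in Terraform).
low_volume_monitor = {
  type: "metric alert",
  name: "Report creation volume is low",
  # Alert if fewer than 50 reports were created over the last hour.
  query: "sum(last_1h):sum:reports.created{*}.as_count() < 50"
}

elevated_404_monitor = {
  type: "metric alert",
  name: "Elevated 404 rate on report creation",
  # Alert if more than 5% of report-creation requests returned 404.
  query: "sum(last_15m):sum:api.reports.404{*}.as_count() / sum:api.reports.requests{*}.as_count() > 0.05"
}

composite_monitor = {
  type: "composite",
  name: "Report creation is degraded",
  # 111111 and 222222 stand in for the IDs of the two monitors above;
  # requiring both ("&&") lets each individual threshold stay sensitive
  # without paging on every small dip in traffic.
  query: "111111 && 222222"
}
```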

Charles Max Wood [00:39:19]:
Makes sense. Alright. We're getting close to the end of our time. Yeah. Are there any other stories or examples or lessons that you wanna make sure somebody listening to this, gets?

Paul Zaich [00:39:30]:
I just wanna emphasize that this is a growing process that I think every team should go through. It's something that is going to evolve over time. As your product becomes more important to customers and continues to grow, you need to be constantly reviewing what your approach to this is. What's gonna work for a brand new product, a brand new startup, a brand new company isn't necessarily gonna be the right fit as you continue to grow, and it's something that you need to reevaluate. As your product starts to be something that's really a critical service for your customers or for other teams at your company, you just need to continually set the bar higher and make sure that you're continuing to grow observability across the stack.

Charles Max Wood [00:40:17]:
Alright. Well, one more thing before we go to picks, and that is if people wanna get in contact with you, how do they find you on the Internet?

Paul Zaich [00:40:23]:
You're welcome to reach out to me on Twitter or GitHub at kyzeitsch, or you can reach out to me on LinkedIn as well.

Charles Max Wood [00:40:31]:
Awesome. Yeah. We'll get links to those, and we'll put them in the show notes. Let's go ahead and do some picks then. Dave, do you wanna start us off with picks?

Dave Kamura [00:40:37]:
Yeah. Sure. So I went to the doctor the other week, and they said I had high blood pressure, which I attribute to raising kids and them stressing me out. So I got this blood pressure monitor that syncs up with my iPhone, so it keeps a historical track of it. And it's been really nice. And I guess it's accurate. I don't know. It says it's high, so I guess it's doing something.

Dave Kamura [00:41:03]:
So it is the Withings, and it's a wireless rechargeable blood pressure monitor.

Charles Max Wood [00:41:09]:
Cool. Luke, how about you?

Luke Stutters [00:41:11]:
That's really interesting. Is this something you wear all the time, Dave?

Dave Kamura [00:41:17]:
No. It's just like the doctor's one where you roll up your sleeve, put it on your arm, and, you know, it starts to squeeze your arm. It's not like a wristwatch or anything. So I do it a couple of times a day. That'll raise your blood pressure. Just kidding.

Luke Stutters [00:41:30]:
Yeah. Just checking it, just obsessing about it. I suppose that's good. It's not real time. Otherwise, that'd be even more stressful, because you'd be sitting there and it'd go off and say, yeah, blood pressure's going up. You'd get caught in the feedback loop.

Charles Max Wood [00:41:44]:
Cool. How about you, Luke? What are your picks?

Luke Stutters [00:41:46]:
Oh, I've been fighting the code this week, Chuck. I've been building strange command line interfaces in Ruby, and I've been using a little application which is installed by default on most Ubuntu-based systems, called whiptail. This is an old-school text-style interface for when you can't put a GUI on it for various reasons. So it kind of makes it look more professional. You know? It makes it look like a real piece of software. And using this in Ruby has been a real pain, because you need to do funny things with file descriptors to get the user data out. So it turns out a very nice man by the name of Felix C. Sturgesman has written a gem to do it all for you in

Charles Max Wood [00:42:32]:
Way to go, Felix.

Luke Stutters [00:42:33]:
So yeah. You know, all of that work I did was totally unnecessary, and you too can build amazing old-school ASCII-looking interfaces using the gem. It's called EFE, and it's on GitHub under the Odfusk. And there's loads of really interesting utilities on the Oktopaster GitHub. If you dig in, there's some interesting low-level stuff for when you wanna kind of Ruby yourself up on the command line. So well worth a look.
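
For anyone curious about the file-descriptor trick Luke alludes to, here is a hedged sketch of calling whiptail directly from Ruby without a gem. whiptail draws its dialog on the terminal and writes the user's answer to stderr, so the usual shell trick is to swap descriptors and capture that; the function name and prompt are illustrative.

```ruby
# A minimal sketch of capturing whiptail's answer from Ruby via backticks.
def whiptail_input(prompt, height: 10, width: 50)
  # 3>&1 1>&2 2>&3 : fd 3 copies stdout, stdout goes to the terminal via
  # stderr, and stderr (where whiptail writes the answer) is routed back
  # to fd 3 so the backticks can capture it.
  answer = `whiptail --inputbox #{prompt.inspect} #{height} #{width} 3>&1 1>&2 2>&3`
  $?.success? ? answer.strip : nil # nil if the user hit Cancel
end

# name = whiptail_input("What should we call this server?")
```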

Charles Max Wood [00:43:03]:
Awesome. Alright. I'm gonna throw out a couple of picks. The first one is I'm still working on this, so keep checking in most valuable dot dev and summit dot most valuable dot dev. I think I've mentioned it on the show before, but I'm talking to folks out there in the community. We've talked to a number of people that you've heard of, that you know well, that you're excited to hear from. But, yeah, I'm gonna be interviewing them and asking them what they would do if they woke up tomorrow as a mid level developer and felt like they didn't quite know where to go from there. So a lot of folks, that's where they kind of end up.

Charles Max Wood [00:43:33]:
Right? They get to junior or, mid level developer, and then it's, okay. I'm proficient. Now what? Yeah. There are a lot of options, a lot of ways you can go. I'm hoping to have people come talk about blogging, podcasting, speaking at conferences, and all the other stuff, and then just how to stay current. You know, how they keep up on what's going on out there. So I'm gonna pick that. I've been playing a game on my phone just when I have a minute, and, you know, I wanna sink a little bit of time into it.

Charles Max Wood [00:43:57]:
It's called Mushroom Wars 2. It's on the iPhone. I don't know if it's on Android. Yeah, liking that. And then, yeah, I'm also putting on a podcasting summit. So if you're interested in that, you can go to podcast growth summit dot co, and we'll have all the information up there. If you listen to the Freelancers Show, the first interview I did was with Petra Manos.

Charles Max Wood [00:44:19]:
She's in Australia, so I was talking to her in the evening here and morning there, which is always fun with all the time zone stuff. But she talked about basically how to measure your growth and then how to use Google's tools, not just to measure your growth, but to figure out where to double down on it and get more traffic. So it was awesome. I'm talking to a bunch of other people that I've known for years in the podcasting space, and I'm super excited about it too. And I should probably throw out one more pick, so I'm gonna throw out Gmelius. That's g-m-e-l-i-u-s. And what it is is a tool. It's a CRM, but it also has, like, scheduling.

Charles Max Wood [00:44:55]:
So, like, ScheduleOnce or, what's the other one? Calendly. It allows you to set up a series of emails. It'll do automatic follow-up for you and stuff like that. And so it just does a whole bunch of email automation, but it runs out of your email account, your Gmail account. That's the big nice thing about it: you don't get downgraded by SendGrid or something if your emails aren't landing. And so that's another thing that I'm just really digging. So I'm gonna shout out about that. Paul, what are your picks?

Paul Zaich [00:45:29]:
I really enjoyed something that was in the Ruby Weekly newsletter this last week. There's a Ruby one-liners cookbook. It has a bunch of different one-liners you can actually just shell out to and make those calls, and it explains how you can do a lot of things that you would do with a shell script very easily with Ruby.

Charles Max Wood [00:45:50]:
Awesome. I'll have to check that out. Sounds like a decent episode too, whether we just go through some of those and pick our favorites or whether we get whoever compiled it on. Thanks for coming, Paul. This was really helpful. I think some folks are probably gonna either encounter this and go, yeah, I wish we were doing that, because the last time we ran into something like this, it was painful. Or some folks hopefully will be proactive and go out there and set things up so that they're watching things and communicating about the way that they handle issues and the way that they avoid them in the first place.

Paul Zaich [00:46:20]:
It's been a pleasure.

Charles Max Wood [00:46:21]:
Alright. We'll go ahead and wrap this up, and we will be back next week. Till next time, Max out, everybody.