Error Reporting and Bug Monitoring with James Smith - RRU 278

In this episode of React Round Up we chatted with James Smith from Bugsnag. We talked about the importance of error monitoring and reporting, and how to actually implement those workflows in your production apps. James shared a number of tips for React developers, like what are the most common errors and how you can help prevent them (hint: linters help a lot). We also got into mobile, and what developers can do to protect against third-party SDK errors.

Hosted by: TJ VanToll
Special Guests: James Smith

Transcript


Welcome to another episode of React Round Up. My name is TJ VanToll and I'm flying solo here on the panel today, but that is all right because I have James Smith with me. James, why don't you go ahead and introduce yourself and tell everybody a bit about why you're famous?

Thanks, TJ. Yeah, I'm James Smith. I'm the CEO and cofounder of a company called Bugsnag, and Bugsnag detects when software breaks. But prior to running a company and being a founder, I built software for web and mobile applications in various industries for quite a while. I like to think of myself these days as a retired software engineer. I know just enough to be dangerous.

Awesome. So, bug monitoring software. I'm gonna start with a softball, an easy question: why do developers need bug monitoring software? Why isn't it enough to just throw it out there and rely on, you know, user reports and QA and that sort of stuff to find these bugs?

Well, a surprising number of companies that we work with still do that, and they're coming to us to rehabilitate. But I think that modern software development has changed hugely. I joke a lot about how, you know, 25, 30 years ago when you delivered software, you printed it onto a CD or floppy disks and that software was done. These days, most software is running in an environment where it can be updated and fixed and patched. And, also, I think that people are adopting more principles like agile and lean, where sometimes you're gonna build something that's not ready intentionally, and you're gonna say, look, we're gonna release this to customers early because we don't even know if customers are gonna like this yet.

And so the concept of continuing to work on something after it's shipped to customers is now the default in most companies, in most cases. So you can't have this, like, perfect 5 month QA process, printing to a gold master CD and shipping out to Best Buy when it's done. You now have this living, breathing piece of software. So I think that's the main thing that's caused this evolution.

I think that most people have now taken to squishing down that QA period and replacing it from both the left-hand side and the right-hand side. From the left-hand side, probably almost everyone is adding in really nice automated testing: unit testing, integration testing, linting, and things like that. And from the right-hand side, QA is getting pushed down by production awareness and production monitoring, where error monitoring products like Bugsnag are key.

Yeah. That's interesting stuff. And I think maybe the next thing: could you just paint me a picture of what bug monitoring or error reporting software actually looks like? So suppose I'm working at a big company. I've got a giant React app. Maybe I've also got a React mobile app. What are my steps? What is my experience actually like? How do I install this? And then what sort of thing am I looking at once I deploy this app out to production?

Yes. It depends on the type of software you're running. But in a React application or a web app stack, for example, you want to be able to monitor runtime errors and bugs that are affecting your end users, your customers out in the wild. In order to do that, the way that our product works is we have SDKs or libraries that you install via your package manager. So our software actually runs as part of your code. It's linked in as part of your code rather than being something that ingests log files or anything like that. So if you're running a React app, you can just npm or Yarn install Bugsnag.

If you've got a Rails API powering your back end, you add it to your Gemfile and do bundle install. And the same is true of pretty much every single platform that we run in. Once you've installed that SDK, you set an API key in code or configuration depending on the platform, and then Bugsnag basically sits in the background taking up almost zero resources until we detect that a problem has occurred. What counts as a problem differs on each platform. In a React app, for example, we will detect any exception that bubbles up to the window.onerror handler in the browser.

We will detect any unhandled promise rejections. And in React specifically, we'll hook into React error boundaries as well. So you could use a Bugsnag-provided error boundary and wrap parts of your code base in a Bugsnag wrapper, and we'll then automatically report those errors off to bugsnag.com and send diagnostics along with them. The process is pretty similar on every platform, mobile, desktop, web browser, just with slight differences in the types of error that we catch.
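As a rough illustration of the two browser capture points mentioned above (uncaught errors and unhandled promise rejections), here is a minimal sketch. It is not Bugsnag's SDK: `installGlobalHandlers`, `CapturedError`, and `HandlerHost` are hypothetical names, and `HandlerHost` is a stand-in for the small part of `window` the sketch uses, so it can run outside a browser.

```typescript
// Hypothetical shape of what would be sent to a monitoring service.
interface CapturedError {
  source: "onerror" | "unhandledrejection";
  message: string;
}

// Minimal stand-in for the part of `window` this sketch touches.
interface HandlerHost {
  addEventListener(type: string, listener: (event: any) => void): void;
}

function installGlobalHandlers(
  host: HandlerHost,
  report: (captured: CapturedError) => void
): void {
  // Uncaught exceptions that bubble all the way up to the global handler.
  host.addEventListener("error", (event) => {
    report({ source: "onerror", message: String(event?.message ?? event) });
  });

  // Promises that reject with no rejection handler attached.
  host.addEventListener("unhandledrejection", (event) => {
    const reason = event?.reason;
    report({
      source: "unhandledrejection",
      message: reason instanceof Error ? reason.message : String(reason),
    });
  });
}
```

In a browser, the host would be `window` and `report` would hand the captured error to whatever monitoring library the app uses. React error boundaries are a separate, component-level mechanism and are not shown here.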

It's interesting. We were talking before about how the last time I used some sort of error reporting software was quite a few years ago. And I remember the first time I did it, I was absolutely astonished at what it was spitting out. I think when you work on a big piece of software, you know there are some bugs out there. Right? You've got some Jira tickets that have been open for a while.

You're like, oh, yeah, we'll get to that, that's sort of a hard problem to solve. But I remember actually putting this stuff in and getting errors where you had absolutely no clue what they meant.

And I guess one thing I'm curious about, because my one problem with using this, and this was years ago when these things were probably a lot less refined, was that it got hard to make sense of all the errors. There was just this huge mess of errors. So I'm curious about the sort of things you do. I imagine you have some sort of aggregation algorithms that try to make sense of it, like, okay, these bugs are all the same. Or do you try to help developers get at the root cause? Like, maybe this is browser specific or that sort of thing?

Yeah. It's funny. Apps have been chucking out log files for ages. A lot of the time, you don't think to actively look at log files; you just go in there when you absolutely have to. And they're generating gigabytes and gigabytes of data that maybe you never look at.

Then there was this leap from reactive error monitoring to proactive error monitoring, and there were some really early players in the space. There was a product called Hoptoad, which rebranded to something else, which was super early in the Rails space. And people were like, wow, this is really cool, but holy moly, do I have a lot of stuff coming in. I think the leap that products like Hoptoad initially made, and that we've refined over the years, is aggregation, number 1, as you say. First off, can we say that, look, we've had 10,000 exceptions or crashes, but actually all of these 10,000 exceptions or crashes came from the exact same bug, the same underlying line of code?

So at the most basic level, we have what we call grouping algorithms that look at the line of code where the bug originated, and it differs depending on the platform. We can be a lot more sophisticated in some areas, where we'll look at how similar the code is, take a snapshot of, like, 7 lines before and after where the crash happened, and apply code similarity heuristics. On other platforms, we keep it very simple. We don't need that level of complexity; we'll say, this was this type of exception, say a runtime error, on line 59 of User.java.

And that's enough for us to say with pretty high confidence that this is a unique bug in this version, for example. So getting that aggregation and grouping in place is step 1. But even then, you asked earlier how people move from customer reports and customer feedback to having a proactive system like this. Well, we have a t-shirt with this on it: not all bugs are created equal.
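The simple form of grouping described above (same exception type, same file, same line) can be sketched as follows. This is only an illustration under assumed names: `CrashEvent`, `groupingKey`, and `groupEvents` are hypothetical, and Bugsnag's real grouping rules, including the code similarity heuristics, are not described in enough detail in the episode to reproduce.

```typescript
// Hypothetical, simplified shape of one reported crash.
interface CrashEvent {
  errorClass: string; // e.g. "RuntimeException"
  file: string; // file of the frame where the error originated
  line: number;
}

// Events with the same class, file, and line are treated as one bug.
function groupingKey(e: CrashEvent): string {
  return `${e.errorClass}@${e.file}:${e.line}`;
}

function groupEvents(events: CrashEvent[]): Map<string, CrashEvent[]> {
  const groups = new Map<string, CrashEvent[]>();
  for (const e of events) {
    const key = groupingKey(e);
    const bucket = groups.get(key) ?? [];
    bucket.push(e);
    groups.set(key, bucket);
  }
  return groups;
}
```

With this keying, 10,000 crashes from line 59 of User.java collapse into a single entry whose bucket holds all 10,000 events.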

And if you just went through this list and said, I'm gonna fix every single bug that my Bugsnag tool is reporting to me, then you're gonna waste your life away. You're gonna be spending time on stuff that really doesn't matter, especially on the client side, especially when it comes to JavaScript and React applications, because it's the Wild West. You've got browsers all over the place. You've got Chrome extensions that are causing problems and injecting content into the DOM. So the next layer on top of that aggregation is sophisticated prioritization tools.

So figuring out things like: which one affected the most customers; which one affected the customers that are paying us the most money, if you're gonna keep it straightforward; or which one affected customers that are in key states or key flows, like a login or sign-up flow, for example. So we try to capture as much information as we can at run time and then allow you to create filters, bookmarks, and prioritization rules inside of Bugsnag.

It's funny. I didn't even really think about that, but you're right, because someone could be just using some garbage Chrome extension, or maybe they're even developing their own Chrome extension that's just, you know, totally screwing things up. And if you try to debug that, my god, you're just gonna be spending days, weeks.
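One way to picture the prioritization rules James mentions (key flows first, then reach) is a simple sort. This is a sketch under assumptions: `BugSummary` and `prioritize` are hypothetical names, and the ordering rule is just one plausible choice, not how Bugsnag ranks errors.

```typescript
// Hypothetical summary of one grouped bug; the fields are illustrative.
interface BugSummary {
  id: string;
  usersAffected: number;
  inKeyFlow: boolean; // e.g. seen during login or sign-up
}

// Key-flow bugs first, then by number of users affected (descending).
function prioritize(bugs: BugSummary[]): BugSummary[] {
  return [...bugs].sort((a, b) => {
    if (a.inKeyFlow !== b.inKeyFlow) return a.inKeyFlow ? -1 : 1;
    return b.usersAffected - a.usersAffected;
  });
}
```

The function copies the input before sorting, so the original list is left untouched.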

Can you even know, for example, how badly it affected the user? Like, is there a way of knowing that this is just an error, but it didn't actually affect the user experience, versus this is actually, I don't know, forcing the UI to be unresponsive or maybe even crashing the tab or something? Are there ways that you can tell at that level of detail?

Yeah. The simplest way to do that is to look at what we call the error handler. I kinda mentioned this. Let's use JavaScript or React as an example here. There are various ways that we automatically detect that a bug has happened. For example, we wrap event handlers; we wrap the callbacks of event handlers.

If an exception happens in an event handler, that doesn't necessarily always bubble up to your window.onerror. It might just mean that your click failed to do what you expected your click to do because the callback crashed halfway through. That's still bad, but it's almost always less bad than a bug that bubbles all the way up to window.onerror, which basically means no JavaScript is executing in that script tag anymore, especially because most people are using bundlers these days and bundling all their JavaScript up into application.js.

If your JavaScript stops executing in that script tag, you're burned. That's it. The whole page stops responding. There are other things, like a promise rejection handler. Again, if it's in a promise that ran asynchronously, it's probably not the end of the world.

So that's the most straightforward way to look at it: if it was a click event, that's bad, but not as bad as if the entire page locked up. The performance aspects, though, are a lot more subtle. We have code snippets to detect certain things like freezes, and my favorite one is our frustration detection snippet. It detects rage clicks. I said earlier, if you've got code that made an onClick handler fail, fine.

But how can you detect if your developers forgot to hook up onClick at all to a button? You've just got a button that looks like it's clickable, and it does literally nothing. So we've got some snippets you can drop in that will detect things like when you click on the same DOM element multiple times within a particular time window, and then it will send a message to Bugsnag saying someone's rage clicking this button.
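The rage click idea above (the same element clicked several times within a time window) can be sketched like this. It is not Bugsnag's snippet: `RageClickDetector` is a hypothetical name, and the defaults of 3 clicks within 1000 ms are assumptions chosen only for illustration.

```typescript
// Thresholds are assumptions; they are meant to be tuned per application.
class RageClickDetector {
  private clicks = new Map<string, number[]>();
  private threshold: number;
  private windowMs: number;

  constructor(threshold = 3, windowMs = 1000) {
    this.threshold = threshold;
    this.windowMs = windowMs;
  }

  // Records a click on an element and returns true when the number of
  // clicks on that element inside the time window reaches the threshold.
  recordClick(elementId: string, timestampMs: number): boolean {
    const recent = (this.clicks.get(elementId) ?? []).filter(
      (t) => timestampMs - t < this.windowMs
    );
    recent.push(timestampMs);
    this.clicks.set(elementId, recent);
    return recent.length >= this.threshold;
  }
}
```

In a page, a single click listener could call `recordClick` with an identifier for the clicked element and the event timestamp, and send a message to the monitoring service whenever it returns true.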

Things like that are just as frustrating, maybe more frustrating, than a full page freeze. But it's kind of up to the developer to decide, yeah, this is the one that's causing customers the most pain.

Yeah. You said snippets. So is your model that there's some default handling, and then there are extra things you can add on that you might not wanna give everybody? I'm assuming it attaches to the event handler, so there's some small performance hit, and you might not wanna go nuts with it, that sort of thing.

Yeah. It's more that our product is opinionated yet extensible. Pretty much all of our SDKs are plugin-based. Our JavaScript one, we just released a new version of this 2 days ago, and the whole thing's built around plugins, even internally. So things sometimes start off as snippets and then graduate into official plugins that we ship as defaults. Some of them, like the rage click one, are more interesting than actionable in a lot of cases. If we ever evolve that to the point where we are confident that something is going wrong based on these rage clicks, then we'll put it in as a default plugin inside of our JavaScript SDK.

But actually, it's one of those things that people wanna tune. How many clicks is a rage click? What time period should I measure? All that kind of stuff. So the stuff that's on by default, the opinionated stuff, is what we think are typically the most important negative signals inside of your application. But, yeah, it's all plugin-based, and we try to expose as much as possible in terms of API so that you can hook in your own plugins and do your own stuff.

Like you said earlier about reporting handled versus unhandled exceptions: sometimes you've got your try catch. And my favorite piece of code to read ever is when it says try catch, and all there is in the catch is a comment that says "should never get here." And inevitably, it's gonna get there. So what most of our customers do is they'll put a Bugsnag.notify(e), or error, whatever it is, in there.

That way you know if it got there, and then you can decide if that's a problem or not. But, yeah, opinionated yet extensible is our product philosophy.

I kinda like that for the catch block, because I've totally been that person who goes into the catch and thinks, I don't even know how in the world this would happen, but I feel like I can't just leave this empty. Right? So I have to put something in there.

It will happen. Yeah. The ones that crack me up all the time, and I think this is because I'm getting to be a crusty old programmer now, are try catch blocks where the catch block has just a comment and nothing else in it, and switch case statements where the default case says "should never get here." It's just like, cool. Let's make sure it doesn't.

What's funny too is that for part of my life I wrote Java code, and Java had, I think it was assert false or something like that. There was something you could put in your code, almost at the compiler level, so that if this code ever executed, it would have a way of informing you. Right? Whereas in JavaScript, outside of some tool like you're describing, there's no built-in way of doing that.

There's no way of saying, hey, just let me know if this code ever runs. JavaScript will just merrily go ahead, ignore that comment, and go right on its way, and who knows what's gonna happen.
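To make the "should never get here" pattern concrete, here is a small sketch that reports instead of leaving only a comment. The `notify` parameter is a hypothetical stand-in for whatever the app's monitoring library provides (the episode mentions `Bugsnag.notify`), and `sidesOf` and `parseOrDefault` are made-up examples.

```typescript
// Hypothetical reporter; a real app would pass its monitoring SDK's notify call.
type NotifyFn = (error: Error) => void;

function sidesOf(shape: string, notify: NotifyFn): number {
  switch (shape) {
    case "circle":
      return 0;
    case "square":
      return 4;
    default:
      // "Should never get here": report it so you find out when it does.
      notify(new Error(`Unexpected shape: ${shape}`));
      return 0;
  }
}

function parseOrDefault(json: string, notify: NotifyFn): unknown {
  try {
    return JSON.parse(json);
  } catch (e) {
    // Report the handled error, then carry on with a fallback value.
    notify(e instanceof Error ? e : new Error(String(e)));
    return null;
  }
}
```

The code keeps running in both cases, which matches the trade-off discussed here: the app does not crash, but the team still learns that the supposedly unreachable path was hit.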

Well, I've seen people throw their own custom exceptions in those cases. But would you rather kill JavaScript execution completely and completely screw your app if it ends up in that case? Or have it run and then know about it, because maybe it was okay that it hit that case? And I think that in JavaScript and in client-side land in particular, you don't have the luxury of being in a controlled environment. You can't just open the log file.

The logs are on someone else's machine, and that machine and that environment are completely out of your control. So, like you said, when you add these solutions in, sometimes you're surprised by how many bugs appear. Mainly that's because a lot of people don't think about it during development. And then when they do eventually turn something on, they're like, uh-oh, look at all these edge cases that happened to loads of customers.

Yeah. It's actually, in a way, a testament to JavaScript. When I think back to Java, the code would completely stop if this happened, and in a way, that's kind of a bad thing too. Right? Because if a customer hits it, the app just throws a runtime error and totally dies.

It's kinda nice in JavaScript that some of these errors can exist and things are mostly okay. Right? You still wanna know about it, but the user might still be able to do the tasks that they came to do. So you don't necessarily want to just completely crash in these situations. So I kind of like the notification approach.

It's good and bad. It's almost a dirty word these days, although it's actually having a bit of a renaissance, but in PHP, in original PHP, not cool new PHP, you could have an error and then the code would keep executing. It would say, oh, we had an error.

Okay. Let's just keep going. So you end up sometimes having, you know, 20 compounding errors on a page because this one variable wasn't initialized, and then it just kept on trying all the way down the page. And I think that we found pretty quickly that resilience was not helpful in that case.

You ended up getting into worse and worse situations as the code kept on trying to execute down the page. I think that the trade-off is, you know, it's okay to have, like, a click handler fail in some particular case while the rest of the app continues to work. But it makes it much harder to diagnose and debug and reproduce problems. So you end up getting a report from a customer saying, I got into this state, and then how the heck is the developer or the support person gonna reproduce that to get back into that state? That's the hard part with allowing code to continue to execute.

For sure. I wanna turn the conversation over to mobile here in a second, because I know that opens a whole other can of worms. But I do have, I think, one last question on the web side. Since you're sort of the aggregator of bugs in a sense, are there any really common things that you find people do, or things that your average developer should be aware of? Like, common mistakes that people overlook, to just look out for and be cognizant of?

Yeah. It sounds so obvious, but it's by far and away, by an order of magnitude, the most common type of bug that we see, and that is uninitialized variables. Still, in 2020, null pointer exceptions and uninitialized variables are the number one cause of bugs. It's no surprise that languages like Swift try to come in and say, right, let's force things not to be null or uninitialized at the compile level. The other thing I think we see all the time, in JavaScript especially, and, again, no surprises here, is type errors and problems caused by unclear typing or type coercion.

A lot of these things, I think, can be solved by having really nice linting in place or by using a typed variant of JavaScript. All of our new code, the Bugsnag React app, is now in TypeScript. We have a ton of linting in place. I think we use Airbnb's ESLint rules off the bat. We're trying to keep things very tight before they even get merged in a PR.

That's because we see all of these problems that come up. But, yeah, null pointer exceptions, uninitialized variables, type errors: still in 2020, the biggest problems.

Yeah. It's funny. It's amazing how simple linting tools can catch so many of these things. I'm curious, when you say uninitialized variable, what the specific scenario is. So I could declare a variable, I don't know, x, right, that I'm gonna use. How is this an error that's not caught by the developer during testing? Is it a different scenario, like some if check or something, some case that they're not accounting for?

Yeah. One of the things is that we see all these bugs coming in, but we can't see the full source code of our customers. You know, it's a sensitive area. We don't want them to have to give us access to that, so we keep it isolated. But what we see in our own code, and what I've seen in my career at least, is that it's when the developer has overconfidence in the order or structure of code execution. You're like, well, it's gonna go off into this function over here, it's gonna fill in all this data, and then we'll run the next thing. But there are about 50 ways that the function that's meant to fill in all these variables could fail.

And here's, I think, a really straightforward one. I wrote a blog post about this years ago, but one of the most common bugs that we see in JavaScript land, for legacy applications, is "jQuery is not defined." "jQuery is not defined" is a bug because most people historically would pull in jQuery from a CDN, and then they would run their code afterwards. So they would expect jQuery or the dollar symbol to be defined. But because of the way that JavaScript engines run, if one script tag fails, the next script tag will still try to run.

So if your next script tag relies on there being a dollar symbol or jQuery defined, but it's not, you're kind of burned. And if you're using some kind of module system or bundle system that has interdependencies, that can be the case as well. But it's true of any code that expects something else to be available and set up. If that fails, you're out of luck.
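A defensive version of the "jQuery is not defined" scenario might look like the sketch below: check that the global actually loaded before relying on it. The names `getGlobalDependency` and `initApp`, and the idea of returning a status string, are assumptions made for illustration; the `scope` parameter stands in for the browser's global object so the example runs anywhere.

```typescript
// Returns the named global if it loaded, or undefined if its script failed.
function getGlobalDependency<T>(
  scope: Record<string, unknown>,
  name: string
): T | undefined {
  const value = scope[name];
  return value === undefined || value === null ? undefined : (value as T);
}

function initApp(scope: Record<string, unknown>): string {
  const jq = getGlobalDependency<(selector: string) => unknown>(scope, "jQuery");
  if (typeof jq !== "function") {
    // Degrade gracefully instead of crashing with "jQuery is not defined".
    return "fallback";
  }
  jq("#app");
  return "ready";
}
```

The fallback branch is also a natural place to send a report, since it means an earlier script failed to load on a real user's machine.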

So, yeah, it's really just being overconfident about code paths running and not failing.

Who would do that, though? I mean, luckily, I'm not guilty of that, so we're good to go. I do wanna get into mobile because this seems like even hairier territory. I imagine, from the web perspective, your code is still gonna run for the mobile web, so it's a very similar sort of workflow. But you work with React Native as well. Is that correct?

That's right. Yeah. I talk about client side being the Wild West, and people who are JavaScript developers for the browser know that browser environment differences are a constant pain in the butt. Now, if you think it's a pain in the bum developing for, you know, 3 or 4 different major browsers, there are 20,000 to 40,000 different Android devices out there in the wild. And every single manufacturer of Android devices, LG, Samsung, whoever, puts their own little flavor and spice on Android. What I mean by that is they can do things such as actually edit the core operating system of Android.

There were really crazy bugs back in the day on Android, when it was a bit more uncontrolled, before Google stepped in and said, hey, stop messing with this. I forget who it was now, so I don't wanna shame the wrong vendor, but a manufacturer of Android devices edited JSON parsing code in core Android to do something different. So if you were an Android developer expecting JSON to be parsed and handled in a particular way, it would work absolutely fine on every vendor apart from this one, and then your code wouldn't work. And if you didn't have that in your test farm of devices, or you didn't have something like Bugsnag in production, you wouldn't know about it, apart from someone saying, hey, I've got an LG phone or whatever it is, and your app isn't working.

So, yeah, it's not as easy as spinning up a couple of VMs. You can't have all 20,000 to 40,000 Android devices sat on your desk from all these different vendors. So mobile is hairy, and the more hairy the environment is, the harder it is to build really good, high-quality SDKs for these platforms.

But, yeah, we do support React Native, and we have to deal with that plus all the other layers of React Native.

And what does the actual high-level implementation look like? Because on the web, I imagine, and this is overly trivialized, it's a window.onerror handler and then a lot of logic around that. On native, does React Native provide hooks into this? Or do you have native code that gets into this and finds all the errors? How does that work?

Or how does that work? The last so there's there's a React Native is one of these environments where I actually think that when when React Native first came out, people were like, oh, great. This is right once deploy multiple places. This is gonna be awesome. But in reality, I mean, that's what kind of expose for these days.

But I think in reality, a lot of people are using React Native to do retrofit work. They're taking existing native applications and they're putting in replacing some chunks of it or some components of it with React Native. Now because there is JavaScript code and there is iOS and Android code running inside of most of these applications, we need to make sure that we catch bugs in every single layer. So imagine, like, a layer cake. At the top, you've got the JavaScript runtime.

Even that, you've got different types of JavaScript runtime running because you got JS core versus whatever else is is being distributed. So you've got that JS runtime differences, then you've got the operating system iOS or Android layer differences. We have to capture objective c errors and swift errors and and and all sorts of stuff on iOS and then JVM errors on, the Android side. And then one layer down, you've got things like MDK errors, c and c plus plus errors happening in Android and then the same equivalent happening on iOS. So tons of layers, they will have to communicate.

Recently, React Native actually, I don't know how recently it was, but, React Native now supports React error handlers, which is great. Error boundaries. Sorry. And so that's something that Bugsnake's always supported, and now you can use those in React Native. But, effectively, what we have to do is we have to be able to capture bugs at every single layer and reliably report them.

And sometimes we have to be able to report bugs before the React Native engine has even initialized because there might be some Android code that ran before the React Native code initialized. So, yeah, we we just recently released the new major version of our React Native notifier and put it all under 1 mono repo, Bugsnag JS, to make sure that this initialization logic is just buttoned up. Yeah. And, yeah, that's crazy. I I have some background in native scripts.

So, like, similar technology to React Native, like JavaScript running on on mobile. And I remember one thing we struggled with is getting really good JavaScript stack traces. Are you able to when you catch the air on, like, native lands, like, give people the, like, this is the line of code where there was a problem? Does and is that something, like, React Native exposes for you, or do you have to do some, like, magic to try to access that? There's a lot of magic.

In every layer, the stack traces are almost certainly obfuscated in some way, either intentionally or unintentionally. In the JavaScript layer, we rely on the source map standard, which isn't as standard as you'd think. It's implemented very differently on each platform. But if we have an obfuscated JavaScript stack trace, we're not gonna give that to the developer, because they're gonna be like, what the heck is, you know, function xy2? That's not what I wrote.

We wanna show people the line of code that they wrote rather than what it ended up being obfuscated into. So we automatically ingest and apply source maps to the JavaScript layer so that we can present the original stack trace to the developer. But, as I said, because it's a multilayer system, we also have to do that on iOS and Android. On Android, a lot of people use something called ProGuard, and ProGuard obfuscates the Java stack traces. So we have to reapply the mapping to get the original stack traces back, and the same is true with iOS.

iOS is almost even worse because, and I hope I'm not offending iOS developers here, it's effectively C. It's low level. So if a crash happens, what we get is a memory address. It looks more like a classic core dump. We have to take that memory address and apply something called a dSYM file to it to produce the original stack trace.

So, yeah, magic at all layers of the stack.

Yeah. I've run across the ProGuard thing with my NativeScript experience as well, because a lot of people don't think about this. When you think of iOS and Android apps, you think the source code is compiled and obfuscated, and you can't just download and use it. But with React Native, you're running JavaScript code. If you don't take any additional steps, your JavaScript code is just hanging out there right in your bundle.

And a lot of people, especially, I don't know, people dealing with sensitive work or company data, don't wanna expose that. So they do some additional obfuscation on top of what you'd normally do on the web, and you end up with some absolutely garbled nonsense. I'm actually pretty impressed that you're able to sort of undo that, because that's like reverse reverse engineering in a sense, to get at the parts you're interested in.

I'm glad that there are some standards here, and we've been a part of evolving the source map standard a little bit as well. There's a bit of reverse engineering required, but mostly we're just trying to follow the rules and say, right, let's take this back apart. But, yeah, without the aggregation and grouping, and without the deobfuscation work that we do to provide original stack traces, I think that a product like Bugsnag would be a lot less valuable.

For sure.

The other thing I wanna get into is I know one of the key things you do is help protect against, like, I think you say erratic SDKs, right, or other SDKs you use. And I know you were telling me a story of Facebook, their SDK sort of going down. So why don't you share, I guess, what I'm talking about in terms of third party SDKs in a React Native world and sort of what you can do about that. Well, I was joking about "jQuery is not defined" earlier and things like that. But, you know, with modern software, you don't write the whole thing yourself from scratch.

You're relying on other people's open source packages and SDKs. And anyone who has been a React Native or iOS developer recently and uses the Facebook SDK will be very familiar with the fact that there were two outages within two and a half months on the Facebook SDK that caused iOS applications to crash at boot if they were using Facebook's authentication platform. And this is super frustrating because, so first off, I like to say I don't wanna anger the ops gods. And so, you know, Facebook had this issue. You know, it sucks.

But, like, you know, I give them a break a little bit. It's a tricky one to deal with. But in reality, it's a really difficult one for developers to deal with as well. If you are Spotify, you use the Facebook SDK to allow people to authenticate. One day, without any code changes happening at all on your side, suddenly your app stops working and you get a ton of Bugsnags, or whatever you're using for error monitoring, coming in.

And what happened was, in this particular case, Facebook's SDK reaches out to Facebook's API to say, hey, tell me information about how I should initialize. Facebook's API responded differently to how the SDK was expecting. So instead of giving a structure, a dictionary, it came back with a Boolean. And so the code that was reading that JSON payload basically just wet the bed and was just like, what do I do here?

So it kinda sucks because normally developers think about bugs that are introduced as part of a code change. But in this case, it wasn't a code change. The data changed, and it was data that wasn't even part of my application. It was data that was part of a third party SDK. And so yeah.
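To make that failure mode concrete, here is a minimal TypeScript sketch of reading a settings payload defensively, so that a bare Boolean arriving where a dictionary was expected falls back to safe defaults instead of crashing. The `SdkSettings` shape, the `loginEnabled` field, and `parseSettings` are hypothetical names invented for illustration; this is not the Facebook SDK's actual payload or code.

```typescript
// Hypothetical shape of the settings an SDK expects from its server.
interface SdkSettings {
  loginEnabled: boolean;
}

const DEFAULT_SETTINGS: SdkSettings = { loginEnabled: false };

// Validate the payload's shape at runtime instead of trusting it.
function parseSettings(json: string): SdkSettings {
  let payload: unknown;
  try {
    payload = JSON.parse(json);
  } catch {
    return DEFAULT_SETTINGS; // malformed JSON
  }
  // A bare boolean, null, or array is not the dictionary we expect.
  if (typeof payload !== "object" || payload === null || Array.isArray(payload)) {
    return DEFAULT_SETTINGS;
  }
  const record = payload as Record<string, unknown>;
  const loginEnabled = record["loginEnabled"];
  return {
    // Only accept the field if it has the expected type.
    loginEnabled:
      typeof loginEnabled === "boolean" ? loginEnabled : DEFAULT_SETTINGS.loginEnabled,
  };
}
```

The design choice here is to treat every server response as untrusted input, even when the server belongs to the same vendor as the SDK.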

Holy moly. We had all of these apps that use the Facebook SDK, which is almost every consumer mobile application, completely die on boot. Some of them didn't, and we found that quite interesting. And so, because we're a crash monitoring solution, an error monitoring solution, we saw all of the crashes coming in from all of these major consumer mobile applications that use our products. So we had a bit of a deluge.

And luckily, my infrastructure team's built an architecture that was auto scaling, and we barely noticed the blip, which is fantastic. But, yeah, we were like, well, why does this app have this volume of crashes, while this app, it's fine? In reality, it's really sensible defensive programming that some developers had done and others hadn't. So one of them was wrapping the SDK in their own error hooks.

So if this crash happened, it couldn't bubble up to the top and crash the application. That one's pretty straightforward. Easier said than done, though, because a lot of asynchronous work was happening in that SDK. The other one, which is a bit more aggressive, and which I actually think is a really good best practice in general for anyone using SDKs, is wrapping the SDK initialization in a feature flag. So we saw some shapes of error chart coming in that were like, woah.

There's tons of crashes, then it immediately went down to 0, because these customers of ours were able to turn off Facebook's SDK by updating a feature flag remotely, which then did not initialize that code for their customer base. I wish I could tell you which customers because then I'd give a shout out, but it's obviously private stuff. But, you know, there's all these ways that now, if you're relying on third party code, you have to be super aware of all the ways that that code could change based on external dependencies and protect against that. Yeah. I'm actually quite amazed that some people actually were that proactive to account for this because I think in the NativeScript apps I wrote, I never once made an assumption.
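The two defensive patterns James describes, wrapping SDK initialization in your own error handling and gating it behind a remotely controlled feature flag, might look roughly like this in TypeScript. Everything here is a sketch under assumptions: `initThirdPartySdk`, `isEnabled`, `init`, and `reportError` are hypothetical names, not part of the Facebook SDK or Bugsnag.

```typescript
// All names here are hypothetical; this is not any real SDK's API.
type InitResult = "initialized" | "disabled" | "failed";

function initThirdPartySdk(
  isEnabled: () => boolean,              // remotely controlled feature flag
  init: () => void,                      // the SDK's own initialization call
  reportError: (error: unknown) => void, // e.g. forward to your error monitor
): InitResult {
  if (!isEnabled()) {
    return "disabled"; // flag is off: never touch the SDK at all
  }
  try {
    init();
    return "initialized";
  } catch (error) {
    reportError(error); // record the failure, but keep the app running
    return "failed";
  }
}
```

As James notes, this is easier said than done. A `try`/`catch` like this only covers errors thrown synchronously during the call; failures in asynchronous work the SDK schedules later, or crashes in native code, need their own handling. That limitation is part of why the feature flag is the more dependable kill switch.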

I mean, okay. So it's one thing if, like, you call to a third party, like, API or something. Right? Like, those are the situations where usually you would have some error handling. Like, I'm building a mapping app, and I need to get, like, locations to show on markers, and I call some service.

Well, I'm gonna have some error handling for that because this is a call. But I never accounted for the service itself. Right? Because almost all of these things in native have, like, some sort of an init call. Right?

You pass it an API key, so you initialize it. And usually those things don't even have error handling hooks. Right? Like, at least in my experience, they always work. Yeah.

Right. Like, it's not like you call, like, facebook.sdk.init, and you have to pass it an onError handler. It's just assumed you do it. Right? And, like, normally then later on in your code, you just have to, like, make sure it's there sort of thing.

But you never account for the fact that, like, what if it did, like, something erroneous or something that I totally didn't expect? So I'm absolutely amazed because I would definitely fall in the camp of people that, like, hard crash for sure in this sort of situation because I sort of assume these things are always gonna be there. I gotta believe that the people who did that either have been bitten by this problem in the past, and in their postmortem retrospective, they were like, let's do this. Or they were used to working in an environment where SDKs are less reliable.

And one of those is, in my former life, I used to work for a company that made gaming SDKs, mobile gaming SDKs. And, notoriously, ad provider SDKs were the crashiest SDKs because you've got these companies where they're experts in monetization. They're experts at building relationships with developers and publishers. But maybe their SDKs aren't the hottest SDKs around, and people are swapping them in and out all the time to get the best deal. The business team is saying, right.

We need to swap out to use this ad provider because they're giving us a better deal. But no one's saying, well, are they well known developers? Do they have a good, high quality SDK? So I know at least in the gaming space, the mobile gaming space, people are very wary about adding in new ad SDKs and therefore probably more likely to protect against problems. Yeah.

And the other thing too is that in native land, it's really hard, if not in some cases impossible, to actually fix these problems on the fly. Like, I mean, there are some things you can do, like, in a React Native world to, like, hot swap production code, but it gets wonky at times. So I'd imagine too, like, a lot of these would require full, like, app updates through the App Store, Google Play, and such to actually fix too. So, like... Yeah. Like a fairly significant business loss, I imagine, for some of these people.

Yeah. I wouldn't even wanna think about the actual dollar amount of that because it would stress me out too much. But I know that it was very rare that people used flags to turn these things off. It's very rare that people are using CodePush or something like that to hot patch these things. I know from looking at the data that a lot of our customers that saw these problems effectively just rode the wave and waited for Facebook because they couldn't do anything else.

They had to wait for Facebook to fix it because this was a multiple hour issue. But, yeah, Facebook's gonna fix it faster than they're gonna be able to do it. Yeah. And in the end, I just think that Facebook rolled back the code changes on their side that made the data structure change.

Gotcha. So one of the things I wanna get into is, like, workflow from a company perspective and, like, sort of any recommendations you might have. So, like, if we take this example, I guess, what's the ideal, like, developer experience? Because obviously, like, you don't wanna be notified every time, like, somebody gets "jQuery is not defined" on your page. But you might wanna know if your entire iOS and Android user base is suddenly hard crashing instantly.

So, I guess, what sort of systems do you support and what do you maybe recommend for notifications? Like, are people getting emails for this or, like, how quickly are people getting notified and able to respond when something like this happens? So the way we've built Bugsnag at least is that we try to fit into existing workflows. So if you are using Jira or any other project management or issue tracking tool, we support that. And we don't just support it as a, you know, create an issue in that tool.

We typically have what we call a two way synchronization. So if you link a Bugsnag bug to Jira and someone marks it as fixed in Jira or whatever you're using, that will mark it as fixed inside of Bugsnag. And so we support pretty much every major issue tracker and project management tool, and we try to have a two way sync on all of those platforms. We even do things like if you've marked it as fixed in your Jira tool or whatever it is, and then Bugsnag detects it's still happening in a later release, we'll automatically reopen it and mark it as a regression. So we don't wanna be a product that comes in and says you have to completely change the way you're doing things.
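The reopen-on-regression rule can be sketched in a few lines of TypeScript. This is only an illustration of the logic as described, not Bugsnag's implementation: `TrackedIssue`, `applyErrorEvent`, and the use of plain numeric build numbers to order releases are all assumptions made for the example.

```typescript
type IssueStatus = "open" | "fixed";

// Hypothetical issue record; releases are ordered by numeric build number.
interface TrackedIssue {
  status: IssueStatus;
  fixedInBuild?: number; // build in which the fix is supposed to have shipped
  regression: boolean;
}

// Apply a newly received error event for this issue.
function applyErrorEvent(issue: TrackedIssue, eventBuild: number): TrackedIssue {
  const fixedIn = issue.fixedInBuild;
  if (issue.status === "fixed" && fixedIn !== undefined && eventBuild >= fixedIn) {
    // Still happening in a release that should contain the fix: reopen.
    return { status: "open", regression: true };
  }
  // Events from builds older than the fix do not reopen a fixed issue.
  return issue;
}
```

Ignoring events from older builds matters on mobile, where users can keep running an unfixed version long after the fix has shipped.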

We wanna be a little nudge in the right direction. So integrations with project management tools is a key way that we do that. We also integrate with alerting and chat tools. And so most people are using Slack these days or MS Teams or something like that. You can configure on an application by application basis.

I want these types of errors to go into these types of chat rooms. So if you are a JavaScript developer, you might wanna see all new errors that we haven't seen before pop up as a message. We've recently launched something, about a month, a month and a half ago, that we call the alerting and workflow engine, and this is basically a more sophisticated way of routing those things. So you can say, look, I work on the payments platform in the React application, and that is defined as living under these URLs or having this package name in the code path.

So you can now set up alerts that match those patterns to go into a particular Slack channel or to alert at a higher frequency because they're key code paths. So I've said this a couple of times now, but the client side is the wild west. It's a little bit insane to turn on error alerts for every new error. We have the option in Bugsnag to turn on alerts for every occurrence of each error. And that makes sense again if you've honed it down to say, I wanna see on my Rails application every time a credit card fails to pass through because of a problem on our side.

But on the web, your code is running on so many different devices in so many different environments that it's a bit much to have all of those coming in. So, really, the default that we have is alert me on any new type of bug that hasn't happened before, and then you can go in and, again, opinionated yet extensible, you can go in and then fine tune exactly what you wanna see. We also integrate with everything else. You know, like, PagerDuty.

We integrate with webhooks. We integrate with Splunk. You name it. We've got a connection to it. Gotcha.
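The routing James describes, a quiet default plus louder rules for key code paths, could be modeled roughly as follows in TypeScript. The rule shape, the field names, and the channel names are hypothetical and chosen for illustration; they are not Bugsnag's configuration format.

```typescript
// Hypothetical description of an incoming error.
interface ReportedError {
  url: string;             // where in the app the error happened
  isNewErrorType: boolean; // true if this kind of error was never seen before
}

// Hypothetical routing rule for a key area of the application.
interface RoutingRule {
  urlPrefix: string;
  channel: string;
  notifyOnEveryOccurrence: boolean;
}

// Returns the chat channel to notify, or null to stay quiet.
function routeAlert(
  error: ReportedError,
  rules: RoutingRule[],
  defaultChannel: string,
): string | null {
  for (const rule of rules) {
    if (error.url.indexOf(rule.urlPrefix) === 0) {
      // Key code paths may alert on every occurrence.
      return rule.notifyOnEveryOccurrence || error.isNewErrorType ? rule.channel : null;
    }
  }
  // Default: only alert on error types that have not been seen before.
  return error.isNewErrorType ? defaultChannel : null;
}

const exampleRules: RoutingRule[] = [
  { urlPrefix: "/payments", channel: "#payments-alerts", notifyOnEveryOccurrence: true },
];
```

With these example rules, a repeat error under `/payments` still alerts the payments channel, while a repeat error elsewhere stays silent and only new error types reach the default channel.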

Yeah. That makes sense. I didn't even think about that angle, but I could totally see when I'm setting this up saying, like, well, if something goes on here, like, I wanna log, but I don't necessarily wanna know about it. But if a payment fails or if, like, a user registration, something, like, highly valuable goes, I might wanna ping someone, like, immediately. And cool.

Yeah. I could see, like, the Slack bit of it being pretty nice also. Well, you wanna build trust in Slack. You don't wanna be one of those tools, those products that's a noisy bot. And so what we typically do is we tune it down by default.

And the other thing we do is spike detection. So if we detect that there's an unusual increase in errors on a project, that will ping into Slack by default. But, also, that's the sort of thing that people hook up to PagerDuty, Opsgenie, or whatever your on call system is. So we're seeing more and more development teams, rather than just operations and infra teams, having on call rotas. So you get woken up at 4 in the morning if there's an unusual spike in activity.
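James does not say how Bugsnag decides what counts as an unusual increase, so the following TypeScript sketch uses one common, simple approach as a stand-in: compare the latest error count against the mean and standard deviation of recent history. The function name `isSpike`, the default threshold of 3, and the floor of 1 on the standard deviation are all assumptions for illustration.

```typescript
// Flag a spike when the latest count exceeds the baseline mean by more
// than `threshold` standard deviations. A generic heuristic, not
// Bugsnag's actual algorithm.
function isSpike(history: number[], latest: number, threshold = 3): boolean {
  if (history.length === 0) {
    return false; // no baseline yet, so nothing to compare against
  }
  const mean = history.reduce((sum, n) => sum + n, 0) / history.length;
  const variance =
    history.reduce((sum, n) => sum + (n - mean) ** 2, 0) / history.length;
  const stdDev = Math.sqrt(variance);
  // Floor the deviation at 1 so a perfectly flat baseline does not make
  // every tiny change look like a spike.
  return latest > mean + threshold * Math.max(stdDev, 1);
}

// Made-up errors-per-minute counts for illustration.
const recentCounts = [10, 12, 9, 11, 10];
console.log(isSpike(recentCounts, 500));
console.log(isSpike(recentCounts, 12));
```

In the Facebook SDK scenario, crash counts jumping from around ten a minute to hundreds would clear a threshold like this immediately, which is the kind of event worth paging someone for.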

If that Facebook bug happened, for example, in the middle of the night. Yeah. Yeah. Because that's, like, the one situation where you actually probably would want to be woken up at 4 in the morning. That's... Yep.

Worth looking into for sure. Oh, this has been sort of fascinating to me. Are there any topics that we have yet to cover that you'd like to get into, or do you have any other advice you'd like to give out there to people, you know, with a decent sized React audience, that they should know just about bug reporting in general? I feel like the obvious one is, like, if you're not using some kind of production error monitoring, you absolutely have to these days. It just feels like people who aren't doing it these days are just sticking their fingers in their ears and hoping for the best.

And so I think most of the audience is already using something, even if they've homegrown their own thing that hooks into window.onerror and sends an email or something like that. So that's kind of the first thing. But I also think that, like you said before, error monitoring and stability management's come a really long way in the past 5, 10 years, and you don't need to declare bug bankruptcy anymore. You don't need to just turn this all on and be like, oh, that's too much. We just give up.

One thing I talk about all the time is that engineering and product teams actually should be really, really aligned on what a bad bug is. The definition of when is the right time to work on bug fixes versus getting that new feature live. So have a tool that monitors this in production and then have alignment inside your development team that says, we are gonna fix bugs that pass this threshold, and we're gonna stop new product development if our stability drops below a particular percentage. So I genuinely think you need to align between product and engineering, or some people call it business and engineering, but those 2 teams need to have alignment in order for you guys to know when the right time is to fix bugs and clean up technical debt. So they're the 2 things that I evangelize.
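An agreement like that can be written down as a simple check. The TypeScript sketch below assumes stability is measured as the percentage of sessions that did not crash, and that the team has agreed on a target percentage; the names `ReleaseStats`, `stabilityPercent`, and `shouldPauseFeatureWork` and the sample numbers are hypothetical.

```typescript
// Hypothetical per-release session counts.
interface ReleaseStats {
  totalSessions: number;
  crashedSessions: number;
}

// Percentage of sessions that did not crash.
function stabilityPercent(stats: ReleaseStats): number {
  if (stats.totalSessions === 0) {
    return 100; // no sessions yet, so nothing has crashed
  }
  return (100 * (stats.totalSessions - stats.crashedSessions)) / stats.totalSessions;
}

// True when stability has dropped below the target that product and
// engineering agreed on, meaning bug fixing takes priority.
function shouldPauseFeatureWork(stats: ReleaseStats, targetPercent: number): boolean {
  return stabilityPercent(stats) < targetPercent;
}

// Made-up numbers: 50 crashed sessions out of 10,000.
const latestRelease: ReleaseStats = { totalSessions: 10000, crashedSessions: 50 };
console.log(stabilityPercent(latestRelease));
console.log(shouldPauseFeatureWork(latestRelease, 99.9));
```

The value of writing the rule down is that the decision to stop feature work becomes a number both teams agreed on in advance rather than a debate during an incident.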

Cool. Yeah. And I definitely agree with that from my experiences as well. Like, I think, like you said, it's still far too common for, I mean, really huge organizations, or in some sense really important apps, to just be totally flying blind. Right?

Just not doing anything. So it's a good note. I think it's a good note to end on. So why don't we go ahead and move into the picks? So I have just one pick today.

I've gotten into, I don't even remember, like, with the weird way the Internet works, what got me into this. But there's this guy, have you ever heard of the guy named Wim Hof? He's... Oh, yeah. Like, I think he's this Dutch guy that's, like, famous for going outside, like, shirtless and, like, climbing mountains in the snow sort of thing. And for some reason this just fascinated me.

So I've got a book on him called What Doesn't Kill Us, and I'm a few chapters into it. I've been just listening to the audiobook and it's sort of fascinating. It talks through the science of, like, is it just this dude's genetic makeup that allows him to do it, or is it just training? Right? Like, getting more accustomed to this.

And the answer seems to be a little bit of both, but it's interesting because it definitely takes more of, like, a scientific perspective on, like, how in the heck is this possible. So I'd recommend it if you're at all curious about that. I'd check that out. So it's a silly pick. Is that the guy who can mentally control his body temperature?

Well, yeah. That's where it gets into, like, a little bit of craziness because, like, you find yourself as you read through this going back and forth between, like, this guy is just Looney Tunes crazy a little bit. But at the same time, he subjects himself to, like, scientific studies a lot. Right? Like, unlike a lot of these people that are, like, you know, clerics and crazy, like, sort of nutcases, he actually submits himself to scientific studies a lot.

Like, he's been put through all sorts of rigorous tests and such as well. So, like, some of what he claims has actually been proven, but then that part of, like, he's controlling this with his mind. That's the stuff that's sort of like, but it's interesting. It's interesting nevertheless, I guess. Yeah.

On my picks, the thing that I've been playing with a lot recently is this new game that just came out just a few days ago called Fall Guys. I don't know if you've heard of this. No. It's on Steam, and it's the PlayStation Plus free game of the month, and it is ridiculous. It's like a battle royale game, but crossed with, like, Mario Party minigames.

And it brings out the best and worst in people, but it's insane. It's so much fun, and it's a relatively small developer that's built it. And I imagine that their servers are getting hammered right now. But it's an insanely fun game. But, yeah, apart from that, I don't get much time to do hobbies and side things at the moment, but my fun pick at the moment is this weird scene called the console portabilizing scene.

I come from a gaming background. I'm a big gamer, but there's this huge scene of people who take games consoles and then chop them up and make them portable. And so my little side hobby has been taking Nintendo Wiis, chopping them up, and making them into, like, a Game Boy form factor and things like that. So that's my other little pick, little side hobby. That could be a lot of fun.

I've wanted to before because, like, the virtual console type stuff is fun, but it's not quite the same as, like, holding the actual thing in your hand. Like, you get a nostalgia over time of, like, man, there's something that kicks in when you just hold, like, an old Sega Genesis or Super Nintendo or NES controller. I don't know if it's just nostalgia or what it is, but it's different than just playing it virtually. So it's pretty cool. Yeah.

You wanna sit down and actually complete the games. I found with emulators and things you'd pull them up and just go through 10 games and say, oh, that was interesting, next game, rather than sitting down and having that couch experience. Yeah. Yep. Well, James, this has been a great chat.

I think my last question for you is where can people find you if they wanna, you know, ask any further questions, follow all of what you do? I'm on Twitter at @loopj, l-o-o-p-j. I'm not super active on that. I'm trying to get back into it. And then I'm kind of relatively active on the conference and speaking scene as well.

So check out my Twitter or Bugsnag's Twitter to see which conference I'm at next. But I do talk a lot about technical debt on the conference scene. So if you're around and you see me, drop in, and I'll say hi. Awesome. Well, thanks again for joining us, and that's been another episode of React Round Up.

So have a good one, everyone. Thanks, TJ.