Jonathan_Hall:
Ladies and gentlemen, welcome to an exciting episode, or maybe a boring episode, we'll see, of Adventures in DevOps.
Will_Button:
I'm sorry.
Jonathan_Hall:
We never know until we're done.
Will_Button:
We want to be up front with that.
Jonathan_Hall:
I'm your host, Jonathan Hall.
Jonathan_Hall:
Yeah, yeah. Come back in a half an hour and I'll tell you if it's a good episode.
Will_Button:
Yeah
Jonathan_Hall:
I'm your host for the day, Jonathan Hall. And today with me as frequently happens is Mr. Will Button.
Will_Button:
What's going on man?
Jonathan_Hall:
Hey, I'm just trying to chill here. It's kind of hot here where I live now. Kind of like, I guess it is everywhere in the rest of the world these days. I was reading that this is like a global heat wave happening.
Jonathan_Hall:
So I'm drinking some ice water.
Will_Button:
I've been seeing a lot about that, yeah. I mean, I'm in Arizona, so it's pretty normal.
Jonathan_Hall:
You have a global heatwave year round pretty much, yeah.
Will_Button:
Yeah, pretty much, yeah. But I do have a lot of empathy for people who aren't used to it and are suddenly hit
Jonathan_Hall:
Yeah.
Will_Button:
with this. It's tough.
Jonathan_Hall:
Here in the Netherlands, until recently, it's been rare to have an air conditioner. So
Will_Button:
I'm going to take a break.
Jonathan_Hall:
I have one because I work in the attic where it tends to get warmer than the rest of the house anyway. But man, we wish we had one in the whole house these last couple of days. Anyway, I'm glad the freezer works, makes nice ice for us to chill our water and our other beverages. So I'm sitting here sipping some ice water and I think we should talk about logging and alerting and monitoring. What do you think?
Will_Button:
I think that's a pretty good idea because there's a lot of ways, well first of all, it's one of those things you can just skip completely and never realize that you need it until you really, really need it. And then once you do implement it, there's a lot of ways that you can implement it that still aren't very helpful. So yeah, I'm actually looking forward to
Jonathan_Hall:
Definitely.
Will_Button:
this.
Jonathan_Hall:
Cool. And the reason this comes to mind is because I'm working on an app at the new job I started. You may recall a month and a half ago, I started working for a new company. And they're using a microservices architecture, but all of their apps are poorly configured for logging; some are barely configured for logging at all. And that makes it really difficult to troubleshoot when things go wrong. And so I like the theme of logging and observability in general, because it goes back to what we've talked about on a few other episodes in the last month or so: time to recovery. When you have a problem, it's really important to be able to diagnose it and solve it quickly, especially if you're doing continuous delivery, and logging is central to that. So, you know, we've had a couple of outages here, or bugs, or various other problems in production that have taken in some cases hours or days to solve, because the lack of logging makes it literally impossible to figure out what's going on. So it's a battle I'm struggling with, I'm fighting constantly. And I've fought this battle at other companies too; I'm no stranger to this problem. It's a very common problem, especially in early startups where people don't think about this, I guess. Yeah, I think it's a good topic to discuss. If you are on a team where you have any development responsibility, take logging seriously. It's important. And I'll talk about some of the reasons why I think it is.
Will_Button:
So let me jump in right there. Who do you think owns logging? Is it the development team, operations team, IT, SREs?
Jonathan_Hall:
I think all of the above. I mean, it can depend a little bit on how your team is structured and built.
Will_Button:
Great.
Jonathan_Hall:
But I don't think that anybody gets a free pass on this one.
Will_Button:
I agree with you on that.
Jonathan_Hall:
The developers are definitely responsible for logging in an intelligent way. But you know, honestly, if you have, I don't know what they call this in different companies, but if you have a team that's sort of the, like an internal platform team, or maybe they do like internal libraries, if you're in a
Will_Button:
Mm-hmm.
Jonathan_Hall:
large enough company, you might have a team that manages your frameworks or your libraries. The bulk of the responsibility for logging in particular is gonna fall on that team.
Will_Button:
Yeah.
Jonathan_Hall:
But it, I mean, any everyday developer needs to be involved. And I'll explain why, and we can go into details maybe in a little bit. But one of the biggest problems I see with logging on most applications is there's no strategy involved. So you tend to see an application, I mean, you can tell where the buggy code is because there's more logs there because that's where people
Will_Button:
Hahaha
Jonathan_Hall:
have been adding debug lines. Ha ha ha.
Will_Button:
console.log baby
Jonathan_Hall:
Exactly. And that's a bad, I mean, if you see that, that just means that code base is not well maintained or not well designed. So what you really need,
Will_Button:
Yeah.
Jonathan_Hall:
so let's imagine an HTTP service, because this particular project is mostly gRPC. There's a little bit of HTTP, but HTTP is easy to think about because it's pretty ubiquitous. So we'll just use that as a simple example. You don't want all your controllers to have like log.error or whatever you might have, whatever your library looks like. You don't want to be calling log error, log error, log info, log info, log debug. You don't want that scattered around your code everywhere. What you really want in most cases is to have a central log handler that's part of your framework, so that whenever your web controller generates an error, that middleware or that framework, whatever it is, can handle that in an appropriate way. It can serve it to the customer, to the client, if necessary, turn it into a 400 error and say, sorry, it's invalid JSON or whatever the thing is. And when appropriate, it can log it somewhere that your developers can see. But that's the first, simplest thing that I see most people get wrong: they log an error and then they either return it or they rethrow it, depending on which language you're using. They rethrow the exception. And that's terrible, that's a terrible pattern. Because every time you do that, you're doing the wrong thing. You're guaranteed to do the wrong thing, and here's why. When you rethrow or return an error, then it's the caller's responsibility to either handle it or log it. Logging is a form of handling, right? So
Will_Button:
Great.
Jonathan_Hall:
that caller is either gonna log it and then rethrow it again, or they're gonna
Will_Button:
Hahaha
Jonathan_Hall:
actually resolve the problem and then it's not an error anymore, and then you have all these error logs that... shouldn't be there. So it's really important to handle your logs in one place and one place only. If you log your error, then you should not re-throw it because then you're handling it twice. If you re-throw it, you should not log it. Pick one or the other, don't do both. So that's the first mistake I see people make all the time because they just take this sort of hack and slash debug mentality to their logging. It's like, oh, I don't know what's happening here. I'll add a log and... and then I'll check the logs and see what's happening. And then usually those logs sit there for years later, confusing everybody else who might come along.
Will_Button:
Yeah, you know, we talk in software development a lot about the dry principle, but I think logging is one of those areas where we do duplicate code all over the place and fail to make that connection that we're not following the dry principle in there. You know, we think of dry as far as business logic, but it actually applies to logging as well.
Jonathan_Hall:
Definitely, definitely. So that's the first problem I see. And when you solve that correctly, it's primarily the responsibility of whoever builds that library, which happens to be me in this case. I'm building this library as we speak, you know? And then once I've built the library, then hopefully, if it's a well written library and it serves the needs of your developers well, it's fairly simple for your developers just to use it correctly. They still need to use it correctly, so it still is their responsibility. And maybe there is a time when it's appropriate to do an explicit log, because you're doing some debugging, or there's something weird happening, or it's an error that you can't rethrow because of the situation or whatever. So there are times when it's appropriate to do your own logging, but in my experience, 95% of those logs should be handled in this centralized place, which cleans up your code. It makes your code so much easier to read. And it makes your logs easier to read, so when something actually goes wrong, you can see what's going on.
Will_Button:
Yeah, one of the hidden benefits to doing it is now your logs all follow the same format. A lot of us now are using structured logs, but if your structured logs aren't all using the same structure, it makes it really hard to analyze those logs across a bunch of different environments and sections of your code. But doing it this way, all the logs are going to be in the same format, which makes it a lot easier to build dashboards, reports, and visualizations to see what's really going on.
Jonathan_Hall:
Definitely a good point. Now I can hear people saying, okay, I hear you, Jonathan, you shouldn't log and rethrow an error. But I need to log extra context that's not included in that error. So, you know, at my middleware layer, it doesn't have the context. It doesn't know which user made the request, or it doesn't know whatever details you might care about logging. And that's a valid concern, and that's one reason I see this pattern happen all the time. The answer to that is to annotate your errors with that metadata. And how you do that depends on your language. In Go, for example, you can wrap an error value with arbitrary other data with a custom type. If you're using JavaScript, it's just an object. You can add arbitrary data to your object, and re-throw it again. So whatever language or framework you're using, just, yes, that metadata is important, but that doesn't mean you need to log it at the low level. It means you need to add it to the error, and then... hand that error back up the stack so that when it gets logged, it gets logged with all that extra data that you care about. This also has another benefit in many applications that you get one, or have the potential to have a single error or single log per request, rather than 15 logs per request. In other words, you could roll up, if you have three errors that happen on a single request, you could roll those up into a single log if you want to. which can make your logs easier to read and parse.
Will_Button:
Oh yeah, that's a definite pet peeve of mine of duplicated logs, because you captured it down at the low level, so you logged it out, and then you re-threw the error, and then whatever you re-threw it to, it also logs it out.
Jonathan_Hall:
Yeah.
Will_Button:
And so you're staring at all these logs, and then finally realize, oh wait, it's all just this one problem that we reported multiple times.
Jonathan_Hall:
It's all the same thing.
Jonathan_Hall:
Yeah. So this library I'm working on logs everything to standard error, which gets picked up, we're using Amazon, so it gets picked up by CloudWatch, and we can do the Insights search on it and so on. But I've also added support for Sentry.io. So anything that's logged at error level goes to Sentry and alerts us on Slack and by email and however else we configure it, which is really nice. Basically that means that anytime we have a developer facing error, we consider it a bug and it has to be fixed. So any panic or crash gets captured as a 500 level error in HTTP parlance. So it gets treated as an internal server error. And then for other errors, we can look at the status code that we send. If it's a 400, our administrators, our developers don't care about that. So we give that to the client, but we don't send it to Sentry. We don't consider that an application error, right? It's a user error. So that's an important distinction to make too. I've seen a lot of developers who just think, oh, this is an error. Bad password is an error, so log error. Yes, it's an error, but it's not an application error. It's a user error. So that distinction is really important. You can't just think error means error. Another good example is not found. If you do a query and you don't find something in the database, yes, that's an error from the database standpoint. Is that an application error? Is it a user error? It depends. There's no common answer, right? You have to think this through. Unfortunately, you have to be a smart person to do coding correctly
Will_Button:
Hahaha!
Jonathan_Hall:
and to handle these errors correctly. But if you are a smart person, if you can think through the situation. you can decide is this an application error or is it a user error and then handle it appropriately.
Will_Button:
by the way if you're listening to this podcast you are a smart person.
Jonathan_Hall:
Of course. Of course.
Will_Button:
So you whenever you.
Jonathan_Hall:
That's not to say that if you're on the show, you're a smart person, because I know I'm not a smart person most days. I make a lot of these mistakes.
Will_Button:
Oh yeah, hear hear.
Jonathan_Hall:
I'm paying penance now.
Will_Button:
Yeah, no, I totally agree. I'm not smart. I'm just persistent and hard-headed. Ha ha ha ha. So when you send the errors to Sentry, you're doing that using the Sentry SDK or library from directly within the application, right?
Jonathan_Hall:
Right, right. So
Will_Button:
Right.
Jonathan_Hall:
this is a Go application. So we're using Logrus as the backend, which is pluggable. It's sort of like Log4j or something, a pluggable log framework. So we have two plugins into that. One is for Sentry and one is for our JSON logging to the console.
Will_Button:
Gotcha. So the alternative way of doing that would be having your centralized log facility identify and raise those errors as well. So, there's…
Jonathan_Hall:
Sure, I mean, yeah, it doesn't matter if you use Sentry or some other services.
Will_Button:
Yeah.
Jonathan_Hall:
What's important is that the important logs, which I like to use the error level to distinguish, but maybe someone else uses some other distinction, but I like to use error for that. So the important thing is that anything logged at error level gets seen by human eyes that can resolve it. I still log, I mean, I log every request at info level, and there's a lot of like startup and shutdown. Like if we get a Ctrl-C, a SIGINT or something to shut down, I log that, but I log that at info level. Cause I
Will_Button:
Yeah.
Jonathan_Hall:
do want to know, if I'm looking at my logs to debug something, I want to know when did the service restart? When did certain events happen? When did caches get flushed? You know, things like that should go to info level, because when I'm debugging something, I want to look at the raw logs and see that. But I don't want to be alerted every time the service restarts normally. I only want to be alerted when an application error occurs. So that's why those levels are important.
Will_Button:
Makes sense.
Jonathan_Hall:
So I think the important thing is, oh, I guess I could talk about some other anti-patterns I've seen lately with regard to logging. A really common one I see, especially during the initialization process: your application starts, you connect to a database, you connect to a third party API, whatever. And an error is gonna occur in those situations, right? Maybe the database is offline for the moment, so you get a connection failure. I have seen so many of these applications do one of two extremely wrong things. Either they just panic with an error, couldn't connect to database, panic, or they completely ignore the error. Maybe they log it, but then they keep running. And these are both wrong for different reasons. So let me talk about panic first. Panicking, or even worse in Go parlance, log.Fatal, is really problematic because it ties your logging to your control flow. And when you think about it as the consumer of an API, and I don't mean like a REST API, I mean like an SDK or something like that, I don't know how many times I've used an API or a library where it panics when there's an error, and it just drives me freaking nuts. It's like, I want you to tell me the error, and then I can do something with it if I want to. And if I want to, then I'll panic or exit or whatever. But I think we need to design our programs with the same mentality. If you're doing test-driven development or otherwise running lots of unit tests, you tend to do this anyway. But the problem with a database init that just panics when it can't connect to the database is that you don't have that option to recover if something goes wrong, right?
Will_Button:
Right.
Jonathan_Hall:
And then the opposite problem I see is when it either ignores the error and just continues, assuming that that database will never be offline or whatever. Maybe the database usually is online, but suppose it's not that one time. Maybe you log the error and keep going, or maybe you don't log the error, but you keep going. You're gonna have a really hard time when you make that first database call and that object has not been initialized. You're gonna get a
Will_Button:
Anyway.
Jonathan_Hall:
null pointer dereference or something like that. So, you know, logging and error handling are just so closely related. You almost need a single library to handle both of them in some cases. Just think about these things when you're writing code. And it's hard to know the importance of these until you've been bitten by them. But I don't know anybody who's been bitten by these and made the same mistakes again; you just have to learn.
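A sketch of the init pattern Jonathan is arguing for: the constructor returns the error rather than panicking, log.Fatal-ing, or swallowing it, and only main decides what to do with it. `newStore` and the address are hypothetical; the dial stands in for a database connection.

```go
package main

import (
	"fmt"
	"net"
	"time"
)

// store stands in for a database client or similar dependency.
type store struct{ conn net.Conn }

// newStore dials its backing service and returns any error to the
// caller instead of panicking or ignoring it. The caller can then
// retry, degrade gracefully, or exit, as the situation demands.
func newStore(addr string) (*store, error) {
	conn, err := net.DialTimeout("tcp", addr, 500*time.Millisecond)
	if err != nil {
		// Annotate and return; do NOT log here, and do NOT panic.
		return nil, fmt.Errorf("connecting to store at %s: %w", addr, err)
	}
	return &store{conn: conn}, nil
}

func main() {
	s, err := newStore("127.0.0.1:1") // almost certainly refused
	if err != nil {
		// main owns the decision: here we log once and exit cleanly.
		fmt.Println("startup failed:", err)
		return
	}
	defer s.conn.Close()
}
```

This keeps control flow and logging decoupled: unit tests can exercise the failure path, and the one caller that does decide to exit is also the one place the error is logged.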
Will_Button:
Yeah, on a similar note with that, I see that quite frequently in health checks too. If you're running a containerized solution, your health check doesn't need to just return an HTTP 200 if it's up. It should also validate that you can connect to the database if that's a requirement
Jonathan_Hall:
Yeah.
Will_Button:
for your application and if not, fail the health check.
Jonathan_Hall:
Right. Yup. And sometimes, depending on what you're doing, I mean, I've worked on a few applications where that health check was fairly sophisticated in terms of, is the database up? Is the cache initialized? You know, there could be various things that need to be working first. And maybe those things come and go. I mean, if you take the time, this isn't like microservices 101, but if you take the time. You can build a health check that's smart enough to look at over the last five minutes, how many database queries have failed. If it's more than a certain threshold, then you report that service is unhealthy or whatever. So there's a lot of flexibility you can do there, but at minimum, yes. Just make sure that everything is working before you return 200.
Will_Button:
Seems reasonable.
Jonathan_Hall:
So I think a really closely related topic that falls more on the operations side than on the developer side is just basic monitoring. And this is something we lack on this team I'm working with too. We have some monitoring, like, you know, we monitor just the sort of the generic dashboards that AWS gives you that show
Will_Button:
Mm-hmm.
Jonathan_Hall:
how much percent, what percent memory you're using and CPU and stuff like that. But what we don't have is something that makes sure that every service is responding correctly. I think we do pings, but pings, just as you described, aren't really sufficient to validate that your service is working. And so just a couple of weeks ago, I had an incident where I had made a change. I had actually added logging to a service and made a mistake, and the service wasn't working correctly. But I didn't realize it wasn't working correctly until about three hours later, because there was
Will_Button:
Yeah
Jonathan_Hall:
no monitoring on this service to tell me that it wasn't working correctly. So I didn't notice until a customer noticed. So
Will_Button:
Oh, no man.
Jonathan_Hall:
these are sort of obvious, in hindsight, obvious things. You want your service to log and you want your service to be monitored so that if it stops working, you notice immediately, not three hours later when a customer complains.
Will_Button:
Yeah, that ties directly into why I emphasized earlier the need for a consistent logging format, because one of the things I like to do is log out, or visualize, what normal operation is. So the count of HTTP 200 requests per minute, the count of 400 requests per minute, the count of
Jonathan_Hall:
Yeah.
Will_Button:
500 requests per minute, so that you can visualize that and notice trends. So in your instance there, you probably would have seen the number of HTTP success responses go down, and possibly the number of HTTP error responses go up, which could have alerted you before the customer actually noticed. To me, that's the gold standard. If a customer notices that there's a problem in production before I do, I take that as a personal failure.
Jonathan_Hall:
Right. Yeah. I think in this particular instance, that alone would not have been sufficient, because it was a very low volume service. It probably gets 20 requests a day on average, and most of those are during business hours. So, you know, it's very spiky and very
Will_Button:
Yeah.
Jonathan_Hall:
low volume, but active monitoring would have found it. So whichever approach you take, of course, depends on how busy your service is and how high volume it is, but yes. You need something a little bit smarter than a call from a customer to detect a problem.
Will_Button:
Yeah, and I think that's a good segue. You mentioned that you were using, shipping your logs to CloudWatch logs. And I think
Jonathan_Hall:
Mm-hmm.
Will_Button:
that's an important point to elaborate on: the use of a centralized logging service. Writing the logs out to standard error is a good start, but it's not enough. You've got to get those logs off of that instance or that container or wherever they are into a central service, so that you can visualize the logs across all of the running instances for your application.
Jonathan_Hall:
Absolutely. Especially if you're using Kubernetes. I mean, any containerized orchestration, you have the same essential problem, but Kubernetes is especially annoying with containers that have died. The logs are just gone, generally.
Will_Button:
gone forever.
Jonathan_Hall:
What? It crashed 100 times yesterday? Let me go look at the logs. Oh, never mind.
Will_Button:
Never mind. The good news is it's likely to crash again soon, so we'll catch it then.
Jonathan_Hall:
So yeah. Or if you're fast enough. Yeah. So at least AWS by default lets you look at logs for containers that have ended, I believe it does, but Kubernetes doesn't. So yes, you definitely want centralized logging. Even if you're using old bare metal servers that you actually physically installed, with spinning hard disks, centralized logging is still very useful. I've been using centralized logging longer than I knew what Docker was. In fact, I'm sure it was before Docker existed. It just makes your life so much simpler.
Jonathan_Hall:
So definitely.
Will_Button:
Yeah, I really like CloudWatch Logs. It's simple, easy to use. And they've got that, I don't know what to call it, but it's like a SQL type syntax that you can use to query your logs that
Jonathan_Hall:
Yeah.
Will_Button:
I think is pretty cool. The ELK stack is another one that I've used quite a bit. And then there's Splunk also, which I have a love-hate relationship with.
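The SQL-like syntax Will is referring to is CloudWatch Logs Insights. A query along these lines, assuming your structured JSON logs include a `status` field (field names will vary with your log format), gives per-minute counts by status code:

```
fields @timestamp, status
| filter ispresent(status)
| stats count(*) as requests by bin(1m), status
```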
Jonathan_Hall:
I think the ones I've used are CloudWatch, Google's, I can't remember what they call theirs, their default one. I've used the ELK stack before, but my favorite ones have been some of the more commercial ones, actually. I used Graylog for a while with some success, but I'm trying to remember what it was. The other one from, I think it's the same company that does Sentry. I don't remember what they do. Loggly maybe? I think that's what they're called. Loggly.
Will_Button:
Oh yeah, yeah, yeah.
Jonathan_Hall:
Um
Will_Button:
I don't think I've ever used Loggly, but I've heard it quite frequently.
Jonathan_Hall:
And they're a little bit more expensive, but the main thing I like about it over, especially Google or Amazon's default one is permissions handling is easier. It's easier to grant permission to all your developers than it is to mess with IAM credentials. But if you have that problem solved on your team, then that's not an issue.
Will_Button:
Right? And if you do, give us a shout, because we'd probably
Jonathan_Hall:
Yes.
Will_Button:
like to have you as a guest on the podcast.
Jonathan_Hall:
Oh, even as I joined this company and they gave me all the access I needed, for the first couple of weeks, almost daily, I'd send them a screenshot: um, I don't have permission for this thing. And it was something stupid. Like, in fact, one time I wanted to get the logs for a container that wasn't working, and I thought that's what I was clicking on. They gave me the permission, I clicked the link, and all it did was take me to GitHub to show me the repo where the service was defined. Like, I really didn't need permissions to go to GitHub. I already have that account.
Will_Button:
Hahaha
Jonathan_Hall:
I just didn't have access for the hot link, but I didn't care about that anyway. Yeah, it's a pain to set up those permissions in AWS and Google.
Will_Button:
Yeah, cause it... it gets out of control very quickly.
Jonathan_Hall:
Yeah, I mean, and not to complain, I mean, there's a reason that it's that complex.
Will_Button:
Yeah.
Jonathan_Hall:
But when all you're talking about is logs, it doesn't need to be that complex. We're reading logs for goodness sake. We're not charging credit cards and deploying servers and whatever.
Will_Button:
But that's a really good point. When you do log something, you should ensure that you're not logging sensitive data, like the credit card number that you just failed to process.
Jonathan_Hall:
That, yes, that is a great point. So for all the faults of the logging system on this application, it did have a filter in place to try to filter out obvious PII and sensitive information. Like if you had a key called password, it would block out the value. I guess that's okay. I mean, I have mixed feelings about it. But the real head slapper came when I saw that we were logging the entire gRPC requests, which included authentication credentials and personal addresses and stuff like that. So like, why are we even bothering to try to filter this stuff if we're logging all of it in the first place? So yes, be very conscious of that. I don't know if I wanna say illegal. It might be illegal to log that stuff, depending on where you are. It probably violates GDPR or other privacy laws, depending on your jurisdiction.
Will_Button:
Mm-hmm.
Jonathan_Hall:
It certainly, if it's not illegal outright, it could get you in legal trouble. You could be sued if there was a data breach and someone found their credit card numbers and logs that were leaked or whatever. So it's a really serious thing. Don't do that. And of course, you just don't want to give all your developers access to... customer data like that, that's in the logs. I mean, the developers already have enough access to stuff. If they wanna be nefarious, they can. Don't make it easier. Ha ha ha.
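A minimal sketch of the kind of redaction filter being described, applied once at the central logging site. The key list is purely illustrative, and as Jonathan's story shows, a real filter also has to handle nested payloads (like a whole gRPC request body), not just top-level keys.

```go
package main

import (
	"fmt"
	"strings"
)

// sensitiveKeys is an illustrative deny-list of field names whose
// values must never reach the logs.
var sensitiveKeys = []string{"password", "authorization", "card_number", "ssn"}

// redact returns a copy of the fields with sensitive values masked.
// Run this at the central logging site, before anything is written.
func redact(fields map[string]string) map[string]string {
	out := make(map[string]string, len(fields))
	for k, v := range fields {
		masked := v
		for _, s := range sensitiveKeys {
			if strings.Contains(strings.ToLower(k), s) {
				masked = "[REDACTED]"
				break
			}
		}
		out[k] = masked
	}
	return out
}

func main() {
	fmt.Println(redact(map[string]string{
		"user":     "alice",
		"Password": "hunter2",
	}))
}
```

Even with a filter like this, the safer default is the one implied above: don't log whole request payloads at all, and explicitly choose which fields are allowed into the logs.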
Will_Button:
For sure. One of the companies I worked with years ago, I worked closely with the attorney for that company and he gave me some great advice in terms of retention, because you retain data like logs. His advice was, don't keep anything longer than you absolutely have to. If you do, it just opens up the opportunity to get sued for something that you were never required to keep anyway.
Jonathan_Hall:
Mm-hmm.
Will_Button:
Do you still have audio, Jonathan?
Jonathan_Hall:
Yeah, my computer just like froze up for a second. I think I'm okay though. Yeah, I don't know what happened. CPU spiked, I don't know what. Something in the background. Probably that virus I was downloading.
Will_Button:
That malware is not going to spread itself.
Jonathan_Hall:
Alright, what else shall we discuss?
Will_Button:
So now we've, okay, so we've got our logs, we're sending them to a centralized location. What do we do with them there?
Jonathan_Hall:
Yeah, I think the most important thing is to have them available when there's a problem to be diagnosed and debugged. You can, as you've already alluded to, also do all sorts of analytics on them, which can be sent to Grafana or other graphing tools for dashboards, to show you error rates or success rates or whatever metrics you might care about, either from a business standpoint or from a performance slash availability standpoint. Yeah, so I think that the three main things that logs are useful for are debugging things that have already happened, visualization and business metrics, and then using them to create alerts based on thresholds, or in some cases as simple as: if an error occurs, generate an alert, as I described that I do with Sentry.
Will_Button:
Yeah, you mentioned business metrics. I think that's worth elaborating on, because I know a lot of companies I've worked with spend significant sums of money using third party services to track how many people have used the app, how many sign-ups were there, what's our retention, what's our churn. And if you set your logging up in the way that we've described here, you actually already have all of that information. You just need to build a dashboard for it. And so doing so can save your company a pretty significant amount of money by not having to use those third party services. Plus you get all the other benefits that we've already talked about from having your logging set up this way as well.
Jonathan_Hall:
Yeah, that's totally right. I mean, you could easily build monitoring on how many times the signup endpoint was called, for example, or the delete-my-account endpoint. You know, you could measure your churn just by measuring those two things, right? So there's a lot of business value you can get from these logs, if you do that.
Will_Button:
for sure. I think one other thing worth talking about, I kind of alluded to it earlier, is log retention. One reason is from a security standpoint, like I mentioned; the other is from a cost standpoint, because I've burned myself on this by not setting log deletion policies, and then a year or two later, looking at the AWS bill, going, why is this so expensive? Only to find out that, you know, I've got logs from three years ago sitting in a bucket somewhere.
Jonathan_Hall:
So I can think of three reasons to keep logs, and each affects retention differently. The most obvious is to debug stuff, and for that purpose, more than 30 days is probably not useful if you're doing continuous delivery, because within 30 days your code has changed so drastically that a 30-day-old bug is going to be impossible to debug anyway.
Will_Button:
Great.
Jonathan_Hall:
If a customer calls and says, hey, last month I couldn't do something,
Will_Button:
Hehehehe
Jonathan_Hall:
your only recourse is: try again and see if it's still broken.
Will_Button:
Hahahaha!
Jonathan_Hall:
If it is, I'll have fresh logs to look at. So from a debugging standpoint, something like a week or two is probably all you really need, maybe 30 days maximum for some cases. The next one I can think of is an audit log, which records when people make changes to the system, and you probably want that for much longer. That could be: who deleted my account? When did you delete your account? Which IP address did you connect from when you deleted your account? These are the sorts of things where a customer might call, or a court or some other authority might ask for details about who did something and when. That sort of logging likely needs to go in a different system anyway; it's probably not something you log to standard output or standard error, but something you keep in a separate database table. And that kind of logging you likely want much longer, maybe forever. The third kind of retention I can think of is for regulatory reasons. If you're in finance or something, maybe you're required to keep at least transaction logs for seven or ten years, or whatever your jurisdiction calls for. Again, that's probably not the same kind of log, but it's worth keeping in mind, and maybe most worth keeping in mind so that you think about these as different types of logs and treat them differently. Don't just call log.Info("the new balance is $6") and think that's good enough. You probably want to keep that in a more pristine, special location.
Will_Button:
Yeah, for sure. So that feels like a pretty thorough discussion of logging. What about observability?
Jonathan_Hall:
That's a huge topic, isn't it?
Will_Button:
Yeah.
Jonathan_Hall:
So, yeah, I think that good logs and basic service monitoring are kind of the bare minimum observability you need. It's embarrassing if you don't have that and you're trying to solve a problem. And I know that because I've been embarrassed several times in the last week.
Will_Button:
Hahaha
Jonathan_Hall:
But you can go way, way beyond that. If you're using Datadog or other application monitoring systems, you can get all sorts of insights into the way the application is working: which libraries are using the most memory, how many goroutines or threads are running. All these sorts of things can be useful. I like to take the approach of adding that stuff when I need it to solve actual problems, rather than adding it all up front. Now, maybe in a mature team or company you've already gone to the effort of adding it, so you just have the library and build everything on top of that. That's fine. But talking about building from scratch, I wouldn't spend a lot of time adding application-level monitoring like that until I need it, for the simple reason that there are probably more important things I can be doing to drive the business forward until it's proven that I need that stuff.
Will_Button:
Yeah, I agree 100 percent. It's, I kind of treat that type of observability as kind of almost like building an MVP, a minimum viable product. Build the bare minimum that you need and then as you get requests from your customers, in this case, the business will be your customer, add the features that they need when they need them.
Jonathan_Hall:
Yep. Just this last week, we were troubleshooting an out-of-memory crash on one of our services. So we temporarily added pprof, which does some of this profiling for you. It's not nearly as detailed as Datadog or something; it comes built into Go. We added it to one of our services and it helped us debug what was going on. But like you said, we needed it for that one thing. We temporarily hacked it in, solved our problem, and pulled it back out again. That's all we needed for that case.
Will_Button:
So when you did that, did you consider leaving that code in there, but putting it behind a feature flag?
Jonathan_Hall:
So we talked about that, about adding it to this library I'm building that handles our logging and everything else. The problem with that is it actually slows down the runtime. pprof is so integrated with the Go runtime that it can have a serious performance impact. So you don't
Will_Button:
Gotcha.
Jonathan_Hall:
really want to run that in production, except maybe if you have a bug you can't reproduce anywhere else; then you might run it in production temporarily. For that reason, we didn't want to build all of our Docker images with it enabled. We could add it to the library behind a build flag, just to make it easier to enable on demand for testing purposes. In our case, we were easily able to reproduce the bug in our test environment, or even on my local laptop. So it isn't something I would ever have wanted to build into a production image, but it might have been nice to have in the library just to flip it on quickly.
Will_Button:
Right on. Makes sense. So for me, some of the base metrics I try to track for observability, because I deal a lot with HTTP API-type services: I've mentioned throughput already, requests per minute by response type, whether it's a 200, 400, or 500 response.
Jonathan_Hall:
Mm-hmm.
Will_Button:
The other thing I like to track is latency: what's the average latency for the response times? That's nice to see when you're troubleshooting an issue. Is the latency increasing steadily over time, or was there a dramatic increase we can tie back to a specific event? In my experience, those have almost been more helpful than CPU and memory statistics for troubleshooting
Jonathan_Hall:
Oh yeah.
Will_Button:
problems, or identifying problems earlier, especially with Node.js. I've worked with Node.js quite a bit, and oftentimes you don't see spikes in CPU and memory until the server is checking out, saying goodbye, and calling it quits,
Jonathan_Hall:
Hehehehehehe
Will_Button:
but you'll see latency increase long before that, as the server starts struggling with garbage collection.
Jonathan_Hall:
That's a good point. That's actually something I added to this logger we're now adding to our services: for every request, it logs the response time, just the wall-clock time. Some languages let you also measure CPU utilization per request, which is nice, but Go doesn't really let you do that because of its multi-threaded nature. But that means it would be super easy, if we ever want to, to just query our logs for request time. We could do it by endpoint, by HTTP status, or as an average of everything. It'd be really easy to build whatever we want from that data.
Will_Button:
And again, because you built it into your library, it's something you get for free. You just have to take the time to either write the query or build a dashboard to expose it.
Jonathan_Hall:
Which goes back to this DRY concept you were talking about. So many of these functions in our code had done timing logging explicitly. At the beginning of the function, it recorded the start time, and at the end it logged the elapsed time. There are hundreds of places we can delete that now, because it's duplicated and stored for free in this library.
Will_Button:
All right.
Jonathan_Hall:
Yeah, I don't know what else to say. I guess one other thing that can be really useful, especially if you're doing microservices, is propagating a trace ID across services. And that's something
Will_Button:
Oh yeah.
Jonathan_Hall:
we're using AWS X-Ray for, but you could easily do it with your own thing: basically an HTTP header, for example, that carries a unique ID, probably randomly generated wherever the request is initiated within your system. Then each service propagates it onward. If you have service A talk to B talk to C, you propagate that ID to all three services and make sure it gets logged on all three. That way you can say, oh, this request that came in from customer so-and-so on service A is actually the same thing that caused this error on service C. It can really help with that sort of debugging.
Will_Button:
Yeah, now that's a great point. That's one of those that you typically don't think of until you've spent hours or days trying to identify a sporadic problem only to find out that it only happens to this one customer because they've got this special setup or this special feature or whatever.
Jonathan_Hall:
Exactly. Yep. Yeah, I think my main advice to anybody who's struggling with this is the next time or think back to the last time you were trying to solve a problem and it was difficult, what would have made it easier and do that thing next. If it was a lack of logs that made it hard to troubleshoot, then add logs. If it was a lack of monitoring, add monitoring. If it was a lack of trace ID between services, add that. Don't try to do everything at once, just do the one thing that would have made your last troubleshooting session easier.
Will_Button:
Yeah, for sure. And then you can always build on top of that. You don't have to solve all of the problems in a single pull request.
Jonathan_Hall:
Exactly. And please don't. I don't want to review that pull request.
Will_Button:
Right? Well, cool. I'm glad we had this chat.
Jonathan_Hall:
Yeah. Are you going to go add some logging to your systems now? Or is that still just on my shoulders?
Will_Button:
It's all on you, man.
Jonathan_Hall:
Alright.
Will_Button:
I'll assign the ticket to you.
Jonathan_Hall:
Yeah, assign it to me.
Will_Button:
Like, what's this strange repo that just assigned a ticket
Jonathan_Hall:
Hahahaha
Will_Button:
to me? I don't even have access to it.
Jonathan_Hall:
Sounds like my day job.
Will_Button:
Yeah!
Jonathan_Hall:
Shall we think of picks? You have any? I didn't even think about it.
Will_Button:
No, that's the reason I'm stalling in between my sentences here, because I was like, wow, what am I going to pick this week?
Jonathan_Hall:
Ha ha ha.
Will_Button:
Because I really haven't done anything new. I've been... I'm still reading Balaji Srinivasan's The Network State and enjoying that. Probably still butchering his last name because I can never remember how it's spelled. Yeah, so I guess that's my pick that's not a pick. It's just super interesting on how to think about the role that technology is playing in modern society, both good and bad.
Jonathan_Hall:
Alright, I don't really have a pick either. I'm sure I can think of something. Gosh, maybe not.
Will_Button:
That was a pretty tasty-looking glass of water. You could pick water.
Jonathan_Hall:
I could pick water or air conditioning.
Will_Button:
Yeah, for sure.
Jonathan_Hall:
You know what, I'm gonna pick sentry.io, why not?
Will_Button:
Oh, there you go.
Jonathan_Hall:
Yeah, I like them and I'm using them a lot. And you know what, I'm gonna pick a library I wrote to help with some of this stuff too.
Will_Button:
Nice.
Will_Button:
Hmm.
Jonathan_Hall:
It's called gitlab.com slash flimsy slash ale. A-L-E, like the pale beer. I noticed the trend in Go web frameworks to name them after alcoholic beverages, like Gin and Martini and whatever. So like
Will_Button:
Right?
Jonathan_Hall:
what's one of those I would actually drink that's short and easy to type? I thought India pale ale would be too long. So anyway, I chose ale.
Will_Button:
Right on.
Jonathan_Hall:
But it's basically a collection of HTTP middlewares, and there's an error-handling package in there, a sub-package. I'm making heavy use of it with this logging and error handling I'm doing for this company. Yeah, check it out. There are no guarantees about the stability of the API, but check it out; if anybody likes it, maybe I'll release a version one and actually guarantee stability. It's still at version 0.19 or something, I don't know. But I've been using it. This represents code I've been using for three different clients now over the last three or four years, so it is actively used in production at at least two or three companies.
Will_Button:
That's more than some repos can claim.
Jonathan_Hall:
So there's that, yeah. So gitlab.com slash flimsy slash ale, that's my pick.
Will_Button:
Right on.
Jonathan_Hall:
Awesome.
Will_Button:
That's got me wondering. So we've got Gin and Martini that are web frameworks, ale that's the logging and error handling. What would it take to create the library Everclear?
Jonathan_Hall:
I don't know, but they're a good band. I used to listen to them a lot.
Will_Button:
Oh, they are a band, yeah.
Jonathan_Hall:
Yeah.
Will_Button:
I was thinking of the drink. It's like pure grain alcohol, like 100% alcohol.
Jonathan_Hall:
All right, all right. I think you have your project for the week, Will. You need to write a library called Everclear.
Will_Button:
Yeah. All right, stay tuned for next week's pick. Ha ha ha.
Jonathan_Hall:
All right. Until then, thanks for listening to Adventures in DevOps. Cheers.
Will_Button:
Alright, see you man.