Practical Observability: Logging, Tracing, and Metrics for Better Debugging - RUBY 656
Today, they dive deep into the world of observability in programming, particularly within Rails applications, with special guest, John Gallagher. Valentino openly shares his struggles with engineering challenges and the frustration of recurring issues in his company's customer account app. They explore a five-step process Valentino has developed to tackle these problems and emphasize the critical role of defining use cases and focusing on relevant data for effective observability.
Hosted by:
Valentino Stoll
Special Guests:
John Gallagher
Show Notes
In this episode, they talk about the emotional journey of dealing with bugs, the importance of capturing every event within an app, and why metrics, logs, and tracing each play a unique role in debugging. They also touch on tools like Datadog, New Relic, and OpenTelemetry, discussing their practical applications and limitations. Valentino and John shed light on how structured logging, tracing, and the concept of high cardinality attributes can transform debugging experiences, ultimately aiming for a more intuitive and robust approach to observability.
Join them as they delve into the nexus of frustration, learning, and technological solutions, offering valuable insights for every developer striving to improve their application's resilience and performance.
Transcript
Valentino Stoll [00:00:04]:
Hey, everybody. Welcome to another episode of the Ruby Rogues podcast. I am your host today, Valentino Stoll, and we are joined by a very special guest today, John Gallagher. John, can you introduce yourself and tell everybody a little bit about yourself and why we had you on today?
John Gallagher [00:00:21]:
Sure. Thanks for having me on. My name's John Gallagher, and I am, pardon me, a senior engineer at a company called BiggerPockets, and we teach people how to invest in real estate, based in the US. I also run my own business on the side called Joyful Programming to introduce more joy to the world of programming. And I'm on today to talk a bit about observability, which is one of my many passions. I'm a bit of a polymath. This is one of the things that is really, really important to me and I'm passionate about. I'm particularly passionate about introducing this into Rails apps.
John Gallagher [00:01:00]:
So thanks for having me on.
Valentino Stoll [00:01:02]:
Yeah. And thank you for all the joy you're bringing to people, I hope. You definitely picked the right language. If you're not familiar with this podcast, Ruby is a very joyful experience, personally. So it's very cool. I've loved digging into all of the observability talk that you have on Joyful Programming. And it's a very important topic that I feel is definitely kind of overlooked if you're starting up. Maybe you get some, like, you know, bug alerting or something like that in place as, like, a standard.
Valentino Stoll [00:01:53]:
But kind of anything performance monitoring wise is kind of like a, oh, no. Like, something happened. Let's look into it now. I feel like it's like the the typical, flow of things, as people start up. Do you wanna just give us, like, a high level, like, what is observability and why should we care? What, you know, we could drill into the the details of it after.
John Gallagher [00:02:21]:
Brilliant. Well, I don't actually think anybody should care about observability, and I don't care about observability as a thing because it's just a means to an end. And what's the actual goal? It doesn't matter how you get there, but the goal is being able to, number 1, understand your Rails app in production. And number 2, be able to ask unusual questions. Not questions that you've thought of a day, 2 days, 3 weeks ago because that's not really very useful or interesting. If we knew exactly the questions to ask in the future of our apps, everything would be easy. Just be like, how many 200s have we had in the last week? It's kind of a boring question to ask. Maybe a bit useful.
John Gallagher [00:03:14]:
I find the more obvious the question, the less useful it is. So, observability is the practice of making a black box system more transparent. So I like to think of it like this: imagine your entire Rails app, all the hosting, everything to do with that app is wrapped up in an opaque black box. And somebody says, how does it work and why is this thing going wrong? You would have no hope of understanding it. If the box is completely transparent and you can see everything, which of course is completely impossible in software, but in theory, you'd have this completely transparent box and you can ask all these questions and you get instant answers. That's like 100% observability and of course that is absolutely impossible. And so what we're trying to do with observability is understand what is going on. Not just when it goes wrong, although that's the obvious use case: we have an incident. The most critical point where observability comes into play is an exact scenario that I landed in 2 weeks into a new role I had.
John Gallagher [00:04:29]:
So it was 2 weeks in, the site had gone down, I am in the UK and the rest of my team were in the US, and there were 2 other engineers in my time zone. And all of us had been at the company for a total of 5 weeks. Right? So we've got this app, it's down, it's on fire, and we need to put the fire out. And the 3 of us just looked at each other, we're like, what? Should we just restart the dynos? Yeah. So we restarted the dynos, we crossed our fingers, and it was pure luck that the app came back up. That is the exact opposite of what we want. And we've now moved to a situation where we can ask our app a whole load of very unusual questions, and we will get an answer to that. Why is there a peak of 404s on iOS at 3 AM? Looks like a lot of them are coming from this IP address.
John Gallagher [00:05:30]:
Okay. What's the IP address doing on the site? Okay. Interesting. How many users are using that IP address? 5. So only 5 people use it, and you keep going. So that's the point of observability to me: to be able to ask unusual questions that you haven't thought of already, dynamically, and explore the space and come to some conclusions.
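The kind of exploration described here can be sketched against structured request events. This is a minimal illustration in plain Ruby, assuming each request is logged as one JSON object per line; the field names (`at`, `status`, `platform`, `ip`, `user_id`) and the sample data are hypothetical, and in practice a log tool would run these queries rather than application code.

```ruby
require "json"
require "time"

# Hypothetical sample data: one JSON object per request, as a structured
# logger might emit. All field names are assumptions for illustration.
def sample_log_lines
  [
    '{"at":"2024-05-01T03:02:11Z","status":404,"platform":"ios","ip":"203.0.113.7","user_id":1}',
    '{"at":"2024-05-01T03:05:40Z","status":404,"platform":"ios","ip":"203.0.113.7","user_id":2}',
    '{"at":"2024-05-01T03:09:02Z","status":404,"platform":"ios","ip":"203.0.113.7","user_id":3}',
    '{"at":"2024-05-01T03:15:27Z","status":404,"platform":"ios","ip":"198.51.100.4","user_id":9}',
    '{"at":"2024-05-01T03:20:00Z","status":200,"platform":"ios","ip":"203.0.113.7","user_id":1}',
    '{"at":"2024-05-01T14:00:00Z","status":404,"platform":"android","ip":"203.0.113.7","user_id":4}'
  ]
end

def parse_events(lines)
  lines.map { |line| JSON.parse(line, symbolize_names: true) }
end

# Question 1: which IPs produce the 404s on one platform during a given UTC hour?
# Returns [ip, count] pairs, busiest IP first.
def not_found_by_ip(events, platform:, hour:)
  events
    .select { |e| e[:status] == 404 && e[:platform] == platform }
    .select { |e| Time.iso8601(e[:at]).utc.hour == hour }
    .map { |e| e[:ip] }
    .tally
    .sort_by { |_ip, count| -count }
end

# Question 2 (the follow-up): how many distinct users sit behind one IP?
def distinct_users_for(events, ip)
  events.select { |e| e[:ip] == ip }.map { |e| e[:user_id] }.uniq.size
end

events = parse_events(sample_log_lines)
top_ip, _count = not_found_by_ip(events, platform: "ios", hour: 3).first
puts "#{top_ip} is used by #{distinct_users_for(events, top_ip)} users"
```

Neither question was planned when the events were written; both are answerable only because each event carries enough attributes to slice by.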
Valentino Stoll [00:05:54]:
Yeah. I think that's a great overview. And your debugging reminded me of, I had the lucky experience of running Rails with Ruby 1.8.7. And every once in a while, you just had to, like, give the server a little kick because it started to grow in memory size, and just, you know, giving it a quick little flush would, like, reset things. And you're just like, oh, I guess that's how we're gonna do it, until we can get, like, some insight into what's happening. Right? And I think that definitely underlines the importance of observability in general. Like, you know, how do you get those insights to begin with? And maybe that's a great starting point. Like, where do you start, like, looking at it, like, adding these insights? Right? Like, is there a modular approach you could take, or is it more of, like, a you should look at doing everything all at once kinda thing?
John Gallagher [00:06:57]:
You should definitely not look at doing everything all at once. As I think we can all agree in software, doing everything all at once is the recipe for disaster no matter what you're doing. That's not
Valentino Stoll [00:07:07]:
just I mean, there's no vendor you could just, like, pay money to and, like, you get 100% observability?
John Gallagher [00:07:12]:
There are vendors that tell you that you can do that. Whether you actually can or not is a different matter. Spoiler alert, you can't. So, I just wanna back up a little bit and talk about the feelings because I think it's the feelings that is where all of this started for me. Pardon me. So, I got into observability and it's funny because for the first kind of year of my journey doing this, I didn't even realize I was doing observability. I'd heard about this observability thing and it was out there in the universe. I'm like, okay.
John Gallagher [00:07:46]:
Maybe I should learn that, I should learn that, and I kept using the should. I should learn this, but actually I have loads of other stuff to do. I've got loads of other things. I know what it is. I know it comes from control theory and there's a Wikipedia page that's really complex and really confusing. Whatever. I've got real work to do. But what I know is that I kept coming across these bugs in Bugsnag, Sentry, Airbrake, choose your error reporting tool.
John Gallagher [00:08:11]:
They all help you to a degree, but they're not a silver bullet. And I kept coming across these defects over and over, and the story was exactly the same. Come across a defect, I'd see the stack trace in the error reporting tool, and I would look at it and, first emotion right out the gate, complete confusion. What is going on here? No idea. So I dig a little bit into the code. I dig a little bit into the stack trace, so it's coming from here, and this thing is nil. Classic. Right? This thing is nil.
John Gallagher [00:08:43]:
Where was it being passed in as nil? I don't know. So now I'm like, well, I can't just say I can't fix this. So I now have to, well, do what exactly? I don't have any information to go off. Well, I guess we'll do that bug right. Let's look at the next one, and this just kept happening. And I would find myself going through all the bugs in the backlog, and I couldn't fix any of them. And I just wasted 4 hours looking at things, asking questions that I couldn't explain, looking at things I didn't understand. And for years, I thought the problem was with me.
John Gallagher [00:09:19]:
I honestly thought I'm just not smart enough. I'm not a good engineer. Blah blah blah blah blah. Bug fixing isn't really my thing. I'm just not really good at it. And then after many, many years of this, I was in a company, and I just got really sick of this. We'd just released a brand new app, and it was a customer account app. And we were getting all these weird bug reports.
John Gallagher [00:09:43]:
People saying they couldn't log in, people kept saying I can't reset my password. And every time this happened, we would add a little bit of this ad hoc logging and then put the bug back in the backlog, and then it would come up again and come up again. And after a while, I was just like, this is ridiculous. We're highly paid engineers. This is not a good way to work. So then I started looking into it. We were using Kibana at the time, or rather I should say we were not using Kibana at the time. Kibana was there. We were paying for it.
John Gallagher [00:10:12]:
And and I was like, I've heard this is something to do with logging. So where do we do our logging? People like, I have no idea what this even is. Let's open it up. And there was just all this trash, all this rubbish. I was like, what's this? How is this supposed to be useful? People are like, oh, wait. We don't really look at that. It's not very useful. So so how do you figure out bugs? They're like, well, we just, yes, we just figure it out.
John Gallagher [00:10:37]:
Well, yes, but we're not figuring it out. So all of this was born through frustration. And so what I did back then is what I recommend everybody does now, to answer your question, come back to the point, John, which is take a question that you wish you knew the answer to, a very specific question. Not why is our app not performing as we want, but a very, very specific question. So take your big, big question, and at the time this was why are people being locked out of the app? Why can they not reset the password? They're clicking on this password link, and they're saying it's expired or it goes nowhere or it doesn't work. Okay. Why is that happening? So that's quite a general question, and you wanna break it down into some hypotheses. So that's the first thing. I have a 5 step process, and this is step 1.
John Gallagher [00:11:38]:
I'll go through the 5 step process in a minute. So step 1 is think of a specific question. So a specific question in this case might be, okay. I've got one customer here. There's many, many different types of defects. So this one customer here is saying it was expired. I went to the web page and the link said it had expired. Okay.
John Gallagher [00:12:02]:
When did they click on that link? What response did the app give to them? And when did the token time out? Right? So those are 3 questions. Now they're not going to get us to the answer directly, but they are 3 very specific questions that we can add instrumentation for. So I take one of those questions. When did the token time out? Great question. So in order to do that, we need to know when the token was created and what the expiry of the token was. This is just a random example off the top of my head. So you'd be like, okay. Well, we need to know the customer ID.
John Gallagher [00:12:46]:
We need to know the token. We don't actually need to know the exact token, but we need to know the customer ID. We need to know the time that the token was created and the expiry time of that token. Is it 15 minutes? Is it 2 hours? Whatever. So I would then look into the code, that's the next step. So we've done step 2. Step 2 is define the data that you want to collect: user ID, token expiry, and an event saying the token has been created now for this user ID.
John Gallagher [00:13:23]:
K. So that's the second step. The third step is build the instrumentation to do that. So whatever you have to do, maybe you need to actually add structured logging to your entire app. I don't know. Maybe you've got structured logging fine, but there's nothing listening to it. Maybe the tool just can't actually measure what you want it to measure, so maybe you need to invest in a new tool, whatever it is.
John Gallagher [00:13:47]:
And then you build some code to instrument just that very small piece of functionality. And then once you've done that, you wait for it to deploy, and then you look at the graphs, you look at the logs, you look at the charts, whatever output you've got. And what normally happens is, for me, I look at the charts and I say that is not what I wanted at all, actually. I've misunderstood the problem. I've misunderstood the data I want. Now that I see it, just like you would with, agility, true agility, not agile because agile means something else now. But true agility is you do a little bit of work, you develop a feature, you show the customer they say, not quite right. Go back.
John Gallagher [00:14:29]:
Adjust it. Closer, but still not quite right. But if you ask them to describe it exactly right from the beginning, it doesn't align with what they want at all. You need to show them, and it's only by showing them that you get feedback. And the same is true for ourselves. It's only by looking at the graphs and logs that I realize that actually isn't what I wanted to begin with, or it is, or I'm onto something there. And so then, I've used the graph. Maybe it was unusable.
John Gallagher [00:14:59]:
Maybe I couldn't query the parameter. Maybe there's all sorts of things that might be happening there. So then the last stage is improve. And so from improve, you can go back to the very beginning, ask a different question, or maybe you just want to iterate on the instrumentation a bit, deploy it again. Oh, that's more like it. Okay. So now we know the token expiry. What's the next question we want to ask? Well, why did like, when did the user actually hit the site? Was it after the token expiry or before? Okay.
John Gallagher [00:15:28]:
Sounds like an obvious question, but maybe maybe it's after, which would indicate the token really had expired. Oh, it's before. How could it be expired when it was before? Oh, hang on. What's the time zone of the token? Now we're getting into it. Right? So you log the time zone. Holy cow. The time zone of the token is out of sync with the time zone of the user. That's what it is.
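The password reset walkthrough can be sketched as a single structured event. This is a minimal illustration in plain Ruby using only the standard library `Logger`; the event name, the field names, and the `reread_wall_clock_as_utc` helper (a stand-in for whatever code path might drop the time zone) are all hypothetical, not taken from the app discussed here.

```ruby
require "json"
require "logger"
require "stringio"
require "time"

# Builds a logger that writes one JSON object per line to the given IO.
def json_logger(io)
  logger = Logger.new(io)
  logger.formatter = proc do |severity, time, _progname, payload|
    JSON.generate({ level: severity, at: time.getutc.iso8601 }.merge(payload)) + "\n"
  end
  logger
end

# Step 2's data: who the token belongs to, when it was created, when it expires.
# Returns the computed expiry so callers can reuse it.
def log_token_created(logger, user_id:, created_at:, ttl_seconds:)
  expires_at = created_at + ttl_seconds
  logger.info({
    event: "password_reset_token.created",
    user_id: user_id,
    created_at: created_at.getutc.iso8601,
    expires_at: expires_at.getutc.iso8601,
    # Logging the offset is what makes a time zone mismatch visible later.
    created_at_utc_offset: created_at.utc_offset
  })
  expires_at
end

# Compares two instants; Ruby's Time comparison is independent of time zone.
def clicked_before_expiry?(clicked_at:, expires_at:)
  clicked_at < expires_at
end

# Hypothetical bug: the offset is dropped and the wall-clock time is re-read as UTC.
def reread_wall_clock_as_utc(time)
  Time.utc(time.year, time.month, time.day, time.hour, time.min, time.sec)
end

io = StringIO.new
created_at = Time.new(2024, 5, 1, 12, 0, 0, "-05:00") # 17:00 UTC
expires_at = log_token_created(json_logger(io), user_id: 42, created_at: created_at, ttl_seconds: 15 * 60)
clicked_at = Time.utc(2024, 5, 1, 17, 5, 0)

puts io.string
puts clicked_before_expiry?(clicked_at: clicked_at, expires_at: expires_at)                           # true
puts clicked_before_expiry?(clicked_at: clicked_at, expires_at: reread_wall_clock_as_utc(expires_at)) # false
```

With the creation time, expiry, and offset all in one event, the "expired before it should have" case becomes something you can read straight off the log rather than guess at.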
Valentino Stoll [00:15:54]:
Yeah. I love that analogy of identifying the use case in order to expose what to observe and where to insert, you know, all of these pieces that are missing, or identify them really. Right? Not to just insert them, but to identify them. I think that's very important in general, like, trying to identify the actual use cases in order to know what you even wanna capture to begin with. Right? Like, yeah, we can throw a wall of logs at a resource like Kibana, and it's not very useful. But once you start to abstract the ideas and use cases, and how people are actually, like, using the thing that you've built, you know, you can definitely isolate what it is that you actually care about. And I think you're right. Like, that is kind of the whole importance of observability: identifying that use case and exposing what you actually care about, as far as all these things that are, because, I mean, you know, there's HTTP logs.
Valentino Stoll [00:17:05]:
There's, like, all kinds of logs and information available that's just, like, being emitted all the time. Like, how do you know and identify, you know, which are really important? And I think it just depends. Right? Like, what are you, yeah, what are you trying to capture? So it's a great, like, step wise way to just, like, start to figure that out. Right? Because, yeah, I guess depending on your role and depending on what, you know, your responsibilities are, that could change and that could be different. And your observability needs will change with that. So, identifying that is probably most important, I think.
John Gallagher [00:17:44]:
But but as with everything else, I would say, if you're really not feeling any pain, don't bother. Just don't bother. I'm not into kind of not really interested in telling people what they should be doing or could be doing. I mean, goodness me. We we hear enough of that in engineering, don't we? You should really learn a language every year. You should be blah. You should be blah. Sick of it.
John Gallagher [00:18:08]:
Absolutely sick of all these gurus telling me what to do and what I should be learning, and very few of them talk about, well, what's the benefit to me? And in order for me to do anything, in order for me to change as a human being in any way, learn anything, I have to feel the pain of it. If you're not feeling the pain, don't bother. But if you are feeling pain, if deploys are really glitchy, for me, the kicker is if I keep asking questions I don't have the answer to, that's a concern. And if they're just minor, oh, like, why did I wake up 10 minutes late today? Who cares? It's not important. But if the site's gone down for the 4th time this month, and every time the site goes down, we lose at least 5 grand, 10 grand, maybe even more. And even worse, every single time the site does go down, we just kind of get it back up more by luck than good judgment. This kind of feeling of, we kind of got away with it that time. That's okay.
John Gallagher [00:19:15]:
Oh, and there was this weird thing, and, you know, we've still not really figured that one out, but that's okay. We'll just put it in the backlog. It's the operational risk. You've gotta decide. Are you comfortable with that operational risk or not? Is it big enough? And in my experience, you've kind of got to hit rock bottom with this stuff as I did. There were loads and loads of bugs that I could have investigated and added logging for and fixed, but, you know, it's pushing a boulder up a hill. It's not actually worth it. And it was only when it reached my threshold of pain.
John Gallagher [00:19:47]:
I was like, you know what? I have to do something about this now. This is just ridiculous. We're professional people. We're being paid a lot of money, and it's not working. The app that we've delivered is not working. What's more, we don't know why. But also I do just want to add, and this may broaden out the conversation a little bit, and we may want to keep it narrow on Rails apps, but I've realized that observability principles go way beyond how does our web app work. It applies to any black box.
John Gallagher [00:20:23]:
So as an example, a few years ago, I was working at a company and their SEO wasn't great. And they just kind of were like, oh, you know, we'll we'll we'll try and fix it, and they they had several attempts to fix it. None of them really worked, and, every attempt was the same. They would get some expert in. The expert would give us a list of a 100 things to do. We would do 80 of the 100, and then nothing would really improve. And then they'd be like, well, we did everything you said, and then they'd move on to another. And rinse and repeat, keep doing that.
John Gallagher [00:21:03]:
And then one day, within 4 weeks, 20% of the site traffic disappeared, and nobody could tell us why. Nobody understood why. Observability. Now Google is a black box, so, you know, you're not going to be able to instrument Google, but there's lots of tools that allow you to peer into the inner workings of Google: SEMrush, Screaming Frog, all these kind of tools. They are, in my opinion, actually to some degree in the observability space. You know, everybody thinks of them as marketing tools, search engine optimization tools, whatever. They're allowing you to make reasoned guesses about why your searches are performing the way they are. And then you can actually take action on that because now you have some data.
John Gallagher [00:21:53]:
Oh, this keyword dropped from place 4 to place a 100. Why is that? Okay. Let's try a. Let's try hypothesis a, put that live, and see if Google will respond to that. Oh, we're now up to, you know, position 80, whatever it is. So the idea of observability goes way, way beyond Datadog and New Relic and, obviously, all of those people in the observability space, but I I see it as a much, much wider, much more applicable topic.
Valentino Stoll [00:22:26]:
Yeah. I hear you there. And I'm also all, like, you know, let's not just add New Relic to every app that we deploy, or, you know, is Bugsnag even needed for every app? Like, these are questions that I ask myself too. Like, what value are you getting from all these auxiliary services that give you the observability into, like, just blanket things? Right? Like, at what point do you, like, stop that kind of mentality and be like, alright. Well, you know, every Rails app should at least be able to get insight into the logs so that you can see what the application is doing. Like, well, how long do you capture that? Like, what kind of time frame? Do you have any, like, default standards where you're like, well, I know that I'm gonna need to look at this at some point in the application cycle. Like, what are your defaults?
John Gallagher [00:23:24]:
Great question. I would say if you're making a small app with very little traffic, and it's thresholds like anything else. You're making a small app with very little traffic. I have a client at the moment I'm consulting for and I've made them an app and it has maybe flipping 20 visits a day or something. 20 hits a day. So I installed Rollbar, free version of Rollbar. Anything goes wrong, I get a notification. It's fine.
John Gallagher [00:24:03]:
The further up the stack you move, the more the defaults change. For a Rails app that's mission critical, I'm not even gonna say mission critical, but just serving a decent number of hits a month, 10,000, 20,000, I don't know. I've tried a lot of observability tools, and there's no one yet that I can unreservedly recommend. They've all got their pros and cons. Datadog is a good option if money is no object. I kinda don't wanna get into the tooling debate because it's kind of a bit of a red herring, I think, in many ways.
John Gallagher [00:24:48]:
There's various cost benefit trade offs there. But in terms of the defaults, in terms of what you observe, requests has got to be up there. So for every app that I have in my care of any significant size, I would always say install Semantic Logger. Semantic Logger is the best logger I've found. It does JSON out of the box. It's quite extensible. There are many problems with it, but it's the best option that we've got. So that's number 1.
John Gallagher [00:25:18]:
Rails already logs every request for you, and Semantic Logger will format that in JSON for you. There are some notable missing defaults in Semantic Logger, and I'm working on a gem at the moment that will add some even more sensible defaults into it. So for example, I believe that request headers do not get logged out of the box. Certainly request body does not get logged out of the box. Request headers might be. The user agent doesn't get logged out of the box. I mean, this is just pretty basic stuff. Right? But I have a setup that I use that logs a whole load of things about the requests out of the box.
John Gallagher [00:26:06]:
I like adding user ID out of the box. It depends what kind of setup you have for authentication, but at the very, very least, if somebody's logged in, their ID should be logged in every single request. That is, you know, absolute basic stuff. A request ID is also a really, really useful one. I have a complex relationship with logs and tracing because tracing is essentially the pinnacle of observability. I hear a lot of people say logging, like logging is the be all and end all. Logging is a great place to start, but tracing is really where it's at. And I can go into why that is in a bit, but logging is a great default.
John Gallagher [00:26:51]:
Logging is a good place to start. Start with Semantic Logger. Basically, every single thing that's important in any request should be logged. So that's every header. Obviously, you need to be careful with sensitive data in headers. Rails has, I can't remember what it's called, but there's the filtering module that you can add in. And sometimes Semantic Logger doesn't give you that by default, so you need to be a bit careful. A good default as well is logging all background jobs.
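The request defaults described here (request ID, user ID, user agent, headers with sensitive values masked) can be sketched in plain Ruby. Rails does ship parameter filtering (`config.filter_parameters`, backed by `ActiveSupport::ParameterFilter`), which may be the module being reached for above, but that is a guess. The sketch below deliberately avoids Rails and Semantic Logger; the event name, field names, and the list of sensitive headers are all assumptions.

```ruby
require "json"
require "securerandom"

# Header names treated as sensitive here are an assumption; adjust per app.
SENSITIVE_HEADERS = %w[authorization cookie x-api-key].freeze

# Returns a copy of the headers with sensitive values masked.
def filter_headers(headers)
  headers.to_h do |name, value|
    [name, SENSITIVE_HEADERS.include?(name.downcase) ? "[FILTERED]" : value]
  end
end

# Builds one structured entry per request. The shape is hypothetical.
def request_log_entry(http_method:, path:, status:, headers:, user_id: nil, request_id: SecureRandom.uuid)
  {
    event: "http.request",
    request_id: request_id,
    user_id: user_id, # nil when nobody is logged in
    method: http_method,
    path: path,
    status: status,
    user_agent: headers["User-Agent"],
    headers: filter_headers(headers)
  }
end

entry = request_log_entry(
  http_method: "GET",
  path: "/account",
  status: 200,
  headers: { "User-Agent" => "ExampleBrowser/1.0", "Authorization" => "Bearer secret-token" },
  user_id: 7
)
puts JSON.generate(entry)
```

Filtering at the point where the entry is built means the secret never reaches the log pipeline, which is simpler to reason about than scrubbing it afterwards.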
John Gallagher [00:27:30]:
Background jobs are one of the most painful areas of observability that I've experienced, and we still haven't really cracked it. We have some very, very basic logging out of the box in Semantic Logger. I believe it logs the job class, the job ID, and a few other things, but it doesn't log the latency, which is a huge, huge missed opportunity. And I don't believe it logs the request ID from when it was enqueued. So when a job is enqueued, by default, Semantic Logger will trigger a little entry in the logs: this job is enqueued, and it will tell you what request it came from. But on the other side, when it's picked up and the job is performed, that request ID is missing. So you need to kind of go into the request ID, find the enqueued job, find the job ID, and then take that next leap.
John Gallagher [00:28:26]:
So, I mean, it's a bit clunky, but it's manageable. So in short, Semantic Logger gives you some okay defaults out of the box, but there's some really basic things that it still misses. And so background jobs, requests, those are the 2 really, really big ones to start out with, but as you can imagine, there are a ton more.
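The two gaps mentioned for background jobs, a missing request ID on the perform side and no queue latency, can be sketched with a toy in-memory queue. This is not how Semantic Logger, Active Job, or any real job backend is implemented; `ToyQueue`, its method names, and the log field names are hypothetical and exist only to show the idea of carrying the request ID and enqueue time inside the job payload.

```ruby
require "securerandom"

# A toy in-memory queue standing in for a real job backend.
class ToyQueue
  attr_reader :log

  # The clock is injectable so latency can be checked deterministically.
  def initialize(clock: -> { Process.clock_gettime(Process::CLOCK_REALTIME) })
    @jobs = []
    @log = []
    @clock = clock
  end

  # Carry the originating request ID and the enqueue time in the payload.
  def enqueue(job_class, request_id:)
    job = {
      job_id: SecureRandom.uuid,
      job_class: job_class,
      request_id: request_id,
      enqueued_at: @clock.call
    }
    @jobs << job
    @log << { event: "job.enqueued", **job.slice(:job_id, :job_class, :request_id) }
    job[:job_id]
  end

  # On perform, log the same request ID plus queue latency and run duration.
  def perform_next
    return nil if @jobs.empty?

    job = @jobs.shift
    started_at = @clock.call
    yield job if block_given?
    finished_at = @clock.call

    entry = {
      event: "job.performed",
      **job.slice(:job_id, :job_class, :request_id),
      queue_latency_s: (started_at - job[:enqueued_at]).round(3),
      duration_s: (finished_at - started_at).round(3)
    }
    @log << entry
    entry
  end
end

queue = ToyQueue.new
queue.enqueue("WelcomeEmailJob", request_id: "req-abc")
queue.perform_next { |job| job[:job_class] } # the real work would happen here
queue.log.each { |entry| puts entry }
```

Because both log entries share `request_id` and `job_id`, going from a request to the work it triggered is a single filter instead of the two-hop lookup described above.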
Valentino Stoll [00:28:49]:
Yeah. You mentioned some key pieces I always think of with observability in general, which is, like, separating the pieces into their own puzzle. Right? Like, we have logs, which are kind of just like our data, and then we have individual metrics where we're, like, snapshotting the logs for particular segments like traffic or number of people using it, like the number of jobs that are running. And then there are traces, which we could dig into next because I have a lot of love for all of the standards that are coming out of this with OpenTracing and things like that. I'd love to dig in there. But, also, like, alerting. Like, you know, how does anybody know that there's ever a problem?
John Gallagher [00:29:40]:
So much to talk about.
Valentino Stoll [00:29:42]:
Yeah. I mean and and I love, I love, like, thinking about it in these separate groups and categories because I think it also helps, to think about, like, the overarching theme, which is, like, getting insight, but also, like, getting meaningful insight, and, like, when you want. Really, like, the only the only reason ever anybody ever cares about observability anyway is, like, when something goes wrong, you know, or something is problematic that causes something to go wrong, and you wanna either catch it early or, you know, try and remediate? That's right. And so, like, where do you find, like I mean, background jobs are, like, kind of like I I feel like the first instance where people realize, like, oh, like, we need to start looking at, you know, what it's doing. Right? Like, you start throwing stuff in the background. You're like, okay. Great. Like, it's doing the work.
Valentino Stoll [00:30:38]:
And then you don't maybe realize if you're on the same node, like, well, those, you know, slow requests can block the web requests. Right? And then, okay, well, if you split those up, finally you got that resolved, but then, okay, well, one problematic, you know, job can back up a queue that it's on. You know, like, where do you to to me, like, the background processing aspect is, like, why we have tracing to to begin with, because it does like it's concurrency. Right? So it's like that's where everybody, like, ends up hitting their pitfalls is as soon as you start doing things, like, all at once, like and thinking, oh, like, we just throw it in the background and, like, process things as they come. And as things start to scale, it causes more problems as you try and find figure out timing and stuff like that. Like, where do you find the most important pieces of, like, making sure that you, you know, are capturing the right segments and the right flows, you know, in that process?
John Gallagher [00:31:43]:
Yeah. There's so many things you touched on there I want to come back to. To answer your question, first of all, it's the 5 steps that I walked through. Yeah. The short answer is if you have a specific question that you cannot answer, what we're really talking about is the implementation details of how you answer that question. So what question you pick determines a whole load of stuff. I can't just give you a bog standard answer because it depends. I hate saying that, but it does.
John Gallagher [00:32:18]:
So I think the first question is to ask the question, figure out what data is missing, and then choose the right piece to add into your logs. I feel like I've maybe not understood your question maybe.
Valentino Stoll [00:32:36]:
Yeah. I mean, it's more of like a an open open question. I I guess, like, when trying to think about like, I one of my biggest debugging pitfalls is, like, trying to, like, reconstruct the state of what happened when something went wrong. It's like Right. I feel like that's, like, one of the most typical, things is, like, okay. Something happened. Like, well, like, it's the data has changed since something had happened. Yeah.
Valentino Stoll [00:33:07]:
Maybe the change resolved the issue, but, like Yeah. You know, trying to figure out what that is and and going running through those questions. Right? Like, how do you think about, like, reconstructing data or reconstructing the state of a issue? Like, is that not the right way to go about it, or do you try and, like, do something else?
John Gallagher [00:33:29]:
Fantastic question. So, and and this gets to the root of why the 3 pillars are complete nonsense. K. So there'll be a lot of
Valentino Stoll [00:33:39]:
What are the what are the 3 pillars?
John Gallagher [00:33:42]:
Metrics, traces, and logs. Okay. Nonsense. They're not 3 pillars. The analogy I like to use is saying that observability is 3 pillars and its traces, logs, and metrics is a bit like saying programming is 3 pillars. It's arrays, integers, and strings. It's the same kind of deal. It's it's no.
John Gallagher [00:34:07]:
It's nothing to do with those things. Well, it is because you use those every day. Yes. But you're kinda missing the point. So, thanks to some amazing work by people at Honeycomb and Charity Majors, and reading their stuff and reading their incredible work, I've realized that metrics, traces, and logs are missing the point. The point is, we want to see events that happened at some point in time, and that neatly answers your question about how do you reconstruct the state of the app. I mean the short answer is of course you can't. If you're not in an event-driven system, if you're in a CRUD app, if you're storing state to the database, there is no way you can go back in time and accurately recreate it.
John Gallagher [00:34:55]:
But we can give it a reasonably good start. And we can do this by capturing the state of each event when it happened. Forget about observability tools and logging and structured logging and tracing just now. Imagine if, when that incident happened, let's say my expired token would be a good example. There are several points in that timeline that we want to understand. Number 1, when the token was created. Number 2, when the user hit the website and maybe there's a third one, when the account was created, let's say that. So imagine if at each of those three points, we had a rich event with everything related to that event in it. So when the account was created, we had the account ID, the status of the account, whether it's pending or not, the creation date, the customer, the customer ID, blah blah blah blah blah.
John Gallagher [00:35:55]:
And then when the user visited the site, what was the request? What was the request ID? What was the user ID? What was the anonymous user ID, etcetera, etcetera. And then when the token was created, what was the expiry? What was the this? What was the that? What was the user ID? Okay. So if we have those 3 events and we have enough rich data gathered with each of the events, we can answer your question. Does that make sense so far? There's a whole load of more blah blah blah, but is does that make sense so far?
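A minimal Ruby sketch of the three rich events described above, using only the standard library. The `EventLog` class, the event names, and every attribute value are assumptions made up for illustration; the point is only that each event is written as one structured JSON line carrying the IDs and state known at that moment.

```ruby
require "json"
require "logger"
require "time"

# Hypothetical helper: writes one JSON object per line so each event is queryable.
class EventLog
  def initialize(io = $stdout)
    @logger = Logger.new(io)
    # Emit only the JSON payload, without the default Logger prefix.
    @logger.formatter = ->(_severity, _time, _progname, msg) { "#{msg}\n" }
  end

  # Records an event name plus whatever context is known right now.
  def emit(name, **attributes)
    event = { event: name, at: Time.now.utc.iso8601(3) }.merge(attributes)
    @logger.info(JSON.generate(event))
    event
  end
end

log = EventLog.new

# 1. The account is created: capture its state at creation time.
log.emit("account.created", account_id: 42, account_status: "pending", customer_id: 7)

# 2. The token is created: capture its expiry and who it belongs to.
log.emit("token.created", token_id: "tok_123", user_id: 99, account_id: 42,
                          expires_at: (Time.now.utc + 3600).iso8601)

# 3. The user hits the site: capture the request and user identifiers.
log.emit("request.received", request_id: "req_abc", user_id: 99,
                             anonymous_user_id: "anon_5", path: "/login")
```

Because every line shares keys such as `account_id` and `user_id`, the three events can later be filtered and lined up on a timeline, even though the database row has since changed.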
Valentino Stoll [00:36:28]:
No. I think that you you're making some great points of, like, capturing the transactional, like, user information or user's actions. Yes. And the
John Gallagher [00:36:38]:
same, also other events that are happening in the system. Yep. So there's user did something, computer did something, computer enqueued a background job, performed a job, etcetera, etcetera. So the way I think about it is everything that happens in your app, whether it's initiated by the computer, an external data source, or a user (it's basic event storming stuff, really), creates an event. If you don't capture enough data with that event, that is it. The data is lost forever. I'm assuming you're not doing event sourcing and assuming you're not in an event-driven system.
John Gallagher [00:37:13]:
So the way I think about it, at the most core fundamental level, is whether it's structured logs, traces, metrics, whatever it is, we need a way of capturing those events. And more importantly, ideally, we need to link the events together, and this is really, really, really important. So let's say somebody hits our app and it creates the token. Well, there's 2 parts to that. They hit the app, there was a request to our app, and then in the call stack somewhere, the token is created. Those two things are 2 separate events, but they're nested. We want to capture that causal relationship. One calls the other.
John Gallagher [00:37:54]:
One is a subset of the other. One is a parent, one is a child, whatever, however you wanna put it. Without that causal link, we're lost again. We don't know what's caused what. So there are some, like, 3 or 4 ideas here. Number 1, events. Number 2, contextual data with each of those events. And number 3, nested events, if you like, causal relationships between events.
John Gallagher [00:38:23]:
And with those three things, you can debug any problem that that you would like is my is my claim. And so if you just keep that model in mind, let's examine traces, logs, and metrics and see where they fall short, see which one meets those criteria. So tracing gives us all 3. You can so for those of you I I should explain what tracing is because I was confused about what tracing even was for absolutely years. So tracing allows you to when somebody hits your app, a trace is started. So there are 2 concepts in tracing. There's traces and there are spans, and then there's the data associated with spans. But let's just leave that to one side.
John Gallagher [00:39:10]:
So when somebody hits your app with a request, a trace is started. And so the trace would be like, okay. I've I've started. Here I am. You can append any data that you want to me whilst I'm open. It's like opening the cupboard door and then you keep putting stuff in the cupboard, and then once the cupboard door's closed, you can't put any more stuff in it. Very simple analogy. So we open the door, we start the trace, and so it it goes down to the controller level.
John Gallagher [00:39:39]:
The controller says, oh, I'm going to glom on some data into whatever the existing trace is about the the method, the the post body, the request, blah blah blah blah blah, headers, whatever it is. I'm gonna glom that onto the current trace. And then we get down into maybe you've got a service object. I know some people hate them. I love them. Blah blah blah. Whatever. That's not the podcast about, John.
John Gallagher [00:40:03]:
So you get into a service object, and the service object says, oh, whatever is in the current trace, I want you to know you hit me, and you hit me with these arguments. Cool. I'm gonna append that to the trace as well. And then we enqueue a background job. That event gets added onto the trace. And then even more excitingly, there's a setting in OpenTelemetry where, when the job is picked up and performed, the trace is kept open, and there's a whole load of debate about whether this is a good idea or not, but you can do it. You can keep the trace open until that job is started. And so the job says, ah, I've kicked off now.
John Gallagher [00:40:41]:
It gloms a whole load of stuff on. Maybe you make an API request in the job. It gloms a whole load more stuff on into the trace. And then it comes all the way back up the stack, and you have this trace with all this nested context. And when it's saying I'm gonna glom this data onto the trace, that's called a span, and a span is nested. So you can have spans nested inside spans inside spans. So, essentially, it's this big tree structure. And you might have seen this before. It's the flame graph that you get in Datadog and New Relic and all these kind of things.
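The tree of nested spans described above can be sketched in plain Ruby. This is a toy tracer written for illustration, not the OpenTelemetry API; `ToyTracer`, the span names, and the attributes are all assumptions. It only shows the shape: opening a span nests it under whichever span is currently open, and attributes can be added while it is open.

```ruby
require "securerandom"

# Toy tracer for illustration only; this is NOT the OpenTelemetry API.
class ToyTracer
  Span = Struct.new(:id, :parent_id, :name, :attributes, :children)

  attr_reader :root

  def initialize
    @stack = [] # spans that are currently open, innermost last
  end

  # Opens a span under the currently open span (if any), yields it so the
  # caller can add attributes, and closes it when the block returns.
  def in_span(name, attributes = {})
    parent = @stack.last
    span = Span.new(SecureRandom.hex(4), parent&.id, name, attributes.dup, [])
    if parent
      parent.children << span
    else
      @root = span
    end
    @stack.push(span)
    begin
      yield span
    ensure
      @stack.pop
    end
  end

  # Returns one line per span, indented by nesting depth.
  def render(span = @root, depth = 0)
    line = "#{'  ' * depth}#{span.name} #{span.attributes}"
    [line] + span.children.flat_map { |child| render(child, depth + 1) }
  end
end

tracer = ToyTracer.new

tracer.in_span("POST /accounts", "http.method" => "POST") do |request_span|
  request_span.attributes["user_id"] = 99 # context added while the span is open

  tracer.in_span("CreateAccount.call", "account_id" => 42) do
    tracer.in_span("enqueue WelcomeEmailJob", "queue" => "default") { }
  end
end

puts tracer.render
```

Each child records its `parent_id`, which is the causal link that a flat log line does not carry on its own.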
John Gallagher [00:41:13]:
And everybody looks at these things and thinks they're really pretty, and they are. Indeed they are. So that is the that's the pinnacle of observability in my head. Traces give it us all. And we can say, as you can do in any of these observability tools that support tracing, you can do some really cool stuff. Show me all the requests that were at 200 that enqueued a job where the job lasted for more than 3 seconds. Holy cow. Now we're cooking with gas.
John Gallagher [00:41:42]:
We've got everything that we need. Show me all the spans that indicated anything to do with the background job where it was a 500 response, but the user was logged in, and, and, and. And so we can start to not only query the spans, but query the parents of the spans. So you've got all these nested causal relationships, and it gets ridiculously powerful. So that's traces. Cool. Let's look at logs. What do logs give us? Well, it gives us events. That's all logs are really.
John Gallagher [00:42:14]:
It's a series of events that happened. Does it give us the ability to nest events inside one another? Nope. Sorry. Your luck's out. You can you can log causation IDs and you can link them together, and obviously, you can log request IDs and filter everything by the request ID, but there's no concept in the log of this log is nested inside this other log. So that information, poof, goodbye, is gone. Don't have it. But you have the rich data in every event.
John Gallagher [00:42:48]:
Let's look at metrics. What does metrics give you? Doesn't give you the events. Doesn't give you the nesting, and it just gives you some aggregated numbers. So I don't think of them as 3 pillars. They're 3 rungs of a ladder. The very top rung is tracing. Awesome. The next rung down is logs.
John Gallagher [00:43:11]:
Pretty good. And metrics are useless. Now when I say metrics are useless, people get upset with me and say, oh, well, I look at metrics all the time to understand my app. Yeah. Okay. But if you derive metrics from higher rungs, that's totally cool, totally fine. But what's a really bad idea is to directly say, I'm going to send this metric right now to my back end, and people do this all the time. People think this is a good idea.
John Gallagher [00:43:40]:
It's okay. I mean, it's better than nothing. Right? It's it's just depends on the fidelity of information you want. But the problem is there's 2 problems actually, but the main one is you've sent that data. Okay? You've sent it to Prometheus, Datadog, whatever. You sent that one data point. So then you look in the metrics and you say, holy cow. We're getting all these 500.
John Gallagher [00:44:01]:
Why is that? I'll sit here and wait as long as you want. You're not gonna be able to tell me the answer to the question unless it's blindingly obvious, unless you can say, oh, well, this this other bit of data over here is, like, correlates with it time wise and maybe. It might be that. Yeah. Okay. It might be that. How do you know it's that? Well, we're we're having to guess. Guessing is not a strategy.
John Gallagher [00:44:23]:
Hope is not a strategy. I don't really want to debug by just flipping guessing. I want to know, and the only way of knowing is having traces. So the way I like to think of it is tracing is the pinnacle. Logs can be derived from traces, which is why it's the 3 rungs of a ladder, and everything can be derived as a metric from the 2 rungs above. So if you've got only logs, you don't have any nested context, but you can get metrics from logs. Fine. If you just have metrics, I would say you're not in great shape because you can't understand why without pure guessing.
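A small Ruby sketch of deriving metrics from the rung above, assuming the events have already been parsed from structured logs. The event shape and the numbers are invented for illustration.

```ruby
# Hypothetical events, as they might look after parsing structured log lines.
events = [
  { event: "request.completed", status: 200, path: "/login",    duration_ms: 35 },
  { event: "request.completed", status: 500, path: "/accounts", duration_ms: 910 },
  { event: "request.completed", status: 500, path: "/accounts", duration_ms: 870 },
  { event: "request.completed", status: 200, path: "/accounts", duration_ms: 120 }
]

# Metric 1: number of responses per status code.
responses_by_status = events.map { |e| e[:status] }.tally

# Metric 2: the 500s broken down by path. A counter that was sent
# pre-aggregated could not be split this way after the fact.
errors_by_path = events.select { |e| e[:status] >= 500 }.map { |e| e[:path] }.tally

# Metric 3: average request duration in milliseconds.
average_ms = events.sum { |e| e[:duration_ms] } / events.size.to_f

puts responses_by_status.inspect
puts errors_by_path.inspect
puts average_ms
```

The derivation only goes one way: the counts can be computed from the events, but the events cannot be rebuilt from the counts.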
John Gallagher [00:45:00]:
And it amazes me how many people push back on this idea and think just having some metrics is enough. It's nowhere near enough. Not in my experience. If somebody wants to refute me and come on this podcast or have a chat with me after, I would love to listen to how metrics allow you to debug, like, very, very deliberately and get the exact data that you need. You can send off dimensions to metrics, and then your metrics bill explodes within about 5 seconds, especially if it's high cardinality data like IP addresses. I've made that mistake before. We're gonna send, a dimension of IP with our metrics so that we can understand what's going on. In a week, my manager usually messages me usually less than a week saying, can you you you could turn that off.
John Gallagher [00:45:43]:
We just got a Datadog bill of, like, $5. Whoopsies. I guess
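The cost problem with high cardinality dimensions comes down to multiplication. Many metrics backends keep a separate time series for each unique combination of tag values, so the worst case is the product of the distinct values per tag. The figures below are illustrative assumptions, not measurements, and real traffic rarely produces every combination.

```ruby
# Worst-case number of time series for one metric: the product of the
# number of distinct values each tag can take. Illustrative figures only.
def series_count(**distinct_values_per_tag)
  distinct_values_per_tag.values.reduce(1, :*)
end

without_ip = series_count(status: 5, endpoint: 40)
with_ip    = series_count(status: 5, endpoint: 40, ip: 50_000)

puts "without ip tag: #{without_ip} series"
puts "with ip tag:    #{with_ip} series"
```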
Valentino Stoll [00:45:50]:
I do have, like, maybe some specific instances where metrics alone can help, like, identify things. And that's more where it's like the granular metrics are the things that you're actually looking like, care about. Right? Like, let's say, for example, like, back to the Sidekiq background jobs example, like, if you notice, like, your queues piling up and you happen to have your dashboard of metrics just looking at queue size and looking at throughput, like, you can easily say, oh, like, there's something blocking it and gives you kind of a point of where to look at, in this specific instance. Or, as an example, like, also, you know, you can notice, like, there's a leak in memory by monitoring, you know, your memory consumption of the app, and just looking at the metrics for that and getting an alert and saying, why is the memory not stopping growing, after a certain amount of time? I mean, they these are, like, you know, very specific examples that I'm giving. But, like, I agree. Like, if if you're looking for, like you know, it's not gonna tell you, like, if your users are, like, back to your, like, token expiration. Like, are people having a problem with our application that we've made? Like, you know, and, like, we keep getting these, you know, client, you know, emails coming in like, oh, I can't, like, sign in to your app. Like, what's happening? You know? You can't just, like, take that and be like, oh, yeah.
Valentino Stoll [00:47:25]:
It's obviously the token's, like, expiration. Right? Like, it's your customer's emails aren't gonna, like, translate directly to that, and you're not gonna know right away, without having your tracing in in in place.
John Gallagher [00:47:39]:
So so a few a few
Valentino Stoll [00:47:41]:
things there.
John Gallagher [00:47:42]:
Number 1, you bring up a really good exception I'd forgotten to mention, conveniently. If it's infrastructure stuff, if it's like memory, hard disk space, all that kind of stuff, fair game for metrics, fine. Yeah. The second thing is I I'm quite hyperbolic, so I'm quite an extreme person. So when I say they're useless, I don't mean literally they're completely useless. I think of metrics as a hint. Hey, there's something going on over here. Cool.
John Gallagher [00:48:09]:
That's that's not useless. Obviously, it's useful. But then the next question is why? And if you've got a super simple system, then it's probably like the 3 things, and you go, well, there's only 3 jobs in the system, so cool. And maybe you've segregated your metrics by background jobs, which is fair. You know, it gives you a place to look. It gives you a starting point. But I've yeah. Yeah.
John Gallagher [00:48:35]:
They're they're useful in the aggregate, and they're useful at giving you a hint. But and, yes, they're useful in in terms of, like, making sure the infrastructure's still running. But I see a lot of people depending on them. And I you know, there's a guy I really respect, Used to work with him, called Lewis Jones. And him and I have gone back and forth on this over over LinkedIn, and he is convinced I'm wrong about this. He's like, we run everything through metrics. Metrics are awesome. You're just on cloud 9 if you think you can trace everything.
John Gallagher [00:49:07]:
And there's also a significant weakness with tracing as well, which is you can't trace everything. If you've got relatively low throughput, or even medium throughput, you can make it work. If you trace every single request and you're doing millions of requests a day, I dread to think what your bill is going to be. And that's where head sampling and tail sampling come into it, and we can get into that if you would like.
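For readers who have not met the terms: head sampling decides whether to keep a trace at the start of the request, before anything is known about how it turns out, while tail sampling decides after the trace has finished (for example, keeping every trace that ended in an error). Below is a minimal sketch of the head-sampling half in Ruby. `HeadSampler`, the 10% rate, and the trace ID format are assumptions for illustration, not any vendor's implementation.

```ruby
require "digest"

# Hypothetical head sampler: the keep/drop decision depends only on the
# trace ID, so it can be made before the request does any work and is the
# same on every service that sees that trace ID.
class HeadSampler
  BUCKETS = 10_000

  def initialize(rate)
    @threshold = (rate * BUCKETS).round
  end

  def sample?(trace_id)
    # Hash the ID into one of BUCKETS buckets and keep the lowest ones.
    bucket = Digest::SHA256.hexdigest(trace_id)[0, 8].to_i(16) % BUCKETS
    bucket < @threshold
  end
end

sampler = HeadSampler.new(0.10)
trace_ids = Array.new(10_000) { |i| "trace-#{i}" }
kept = trace_ids.count { |id| sampler.sample?(id) }

puts "kept #{kept} of #{trace_ids.size} traces"
```

The trade-off is that a head sampler drops roughly nine in ten traces here without knowing which ones would have been interesting, which is the gap tail sampling tries to close at a higher processing cost.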
Valentino Stoll [00:49:35]:
I mean, I would love to dig more into tracing in general and maybe more the distributed aspect of it because, I think what you're talking about is very important. Like, you know, if we're just talking about tracing through, like, a single request in a Rails app, it's not as useful as maybe, where tracing really comes into play is where there's multiple things that start happening. Like, once you start having more than one application and the, you know, the data starts trickling from one application to the other, even in the Sidekiq example. Right? If you're throwing stuff into the background, how does that data snapshot transition through the background jobs? Especially if you have ones that start depending on each other. How do you then manage the queue, like, in the making sure that you know where it started and, you know, where it's going. Because sometimes you can catch a problem before it starts by having the traces in play and know where it's heading. Right? And so I would love to dig into those aspects. Like, where do you like, what tooling or maybe we shouldn't talk about tooling specifically, but, like, what aspects of tracing are most important for, like, holistically looking at your system outside of, like, you know, running through your your quest like, I I think at this point, we're beyond, like, having your questions of what you're trying to look at and that you already know what those questions are.
Valentino Stoll [00:50:59]:
And where do you start, like, setting up tracing? Because I know, like, at Doximity, we'd use OpenTracing as, like, an open standard for tracing and observability across, like, platforms, languages, and things like that. Do you find that the industry standards are, like, heading in the right direction? Or, like, where are the pitfalls there? Like, because I know it's like, it just introduces a lot of dependencies once you start to adopt a lot of these things.
John Gallagher [00:51:29]:
Totally. So I should say I am singing the praises of tracing, but it's a slightly utopian vision that I'm painting because 90% of the work I've done is with logging. Purely because it's simple to get going. It's more of a known quantity. And that's why, in a lot of my talks, I'm not talking a lot about tracing, and I'm talking about structured logging. Because I think structured logging gives you this kind of event based mindset that you can then start extending to tracing, and the reverse is not true. Like, you can't take that event based kind of mindset into metrics because metrics is just about aggregation. Right? So, but I have, like, recently, I've been doing a lot of queries in our Rails app and I've been going to, we use New Relic.
John Gallagher [00:52:21]:
Sorry. We use Datadog at work. And I've been going to Datadog's tracing, interface and really trying to answer my questions there instead of in in logging. So, we have both tracing and logging. Our tracing, is hobbled a little bit just purely because of cost reasons, and our logging is not so hobbled. So are the standards heading in the right direction? Yes. But it's going to take a really long time to get there. It's my short answer.
John Gallagher [00:52:58]:
There is a lot of, there's a lot of different ways of going about tracing. The most promising as we all know is OpenTelemetry. But I mean, I read some, pretty harsh critiques of OpenTelemetry. It's kind of a topic that generally divides people. If you don't know anything about OpenTelemetry, it sounds an absolute utopia. And I got really excited when I started researching into it. The more you dig into it, the more you realize how much complexity there is to resolve and how many challenges that project faces in order to resolve them. And so, I mean, what it's trying to resolve is 30, maybe 40 years, possibly even more of legacy software.
John Gallagher [00:53:45]:
Right? Because that's how long logging has been around. And they're trying to aggregate all of that into one single standard. Good luck. It's a very, very difficult problem to solve. And they're doing an incredible job, but it's very, very difficult. So OpenTelemetry is where I'd start with the answer to your question. OpenTelemetry is 100% the future. I've not seen anything that rivals it. And OpenTracing, I believe, came first, and then evolved into OpenTelemetry in my understanding.
John Gallagher [00:54:17]:
Apologies if I've got that slightly wrong. And so, yeah, I think there's a few options if you're in Ruby. None of which are ideal. So the OpenTelemetry client in Ruby is, not ready for prime time. It's quite behind the current standards in OpenTelemetry. It doesn't obey any of the latest semantic standards, for example. I have, I've played around with it in an example project, and when it's working, it's absolutely incredible. It's next level brilliant.
John Gallagher [00:54:53]:
There are a few problems with it. It's extremely slow, so I tried to use tracing on our test suite at work using this OpenTelemetry tracing, and it just it's like I can't remember the numbers, but it really slowed down our test suite to the point where it really just wasn't practical to use because we were trying to measure the performance of the test suite. So, you know, yeah, I could've been doing something stupid there. It's very possible that I just wasn't using it the right way. So sorry, OpenTelemetry folks, if I've got that wrong. I know, I think, a lady called Kaylee who is from New Relic, and she and, I'm so sorry. The names escape me.
John Gallagher [00:55:37]:
But there's a whole bunch of people in the Ruby space who are working really hard on OpenTelemetry, but it's just that, like, the OpenTelemetry project is moving so fast. That's your problem. So that's option number 1, OpenTelemetry. You could maybe fork it and tweak it yourself. The second option and what we use at work is because we're using Datadog, we use Datadog's tracing tool, which is pretty good. But then even with tracing or logging, I feel like we're kind of maybe 20 years behind where everybody else is in programming in terms of observability. Because one of the questions I often have when I look at this stuff and even think about tracing, I maybe have, like, 5, 6, 7 questions that even I can't resolve. Just what do I trace? How much detail do I trace in? How much is this gonna cost me? And we're still in the stone age with a lot of this stuff.
John Gallagher [00:56:35]:
So I don't have any good answers for you in that regard. So we use, the vendor tooling for tracing. I'm sure has its own version of that. In fact, I know they do. I know Sentry does. There are certain other providers that don't have any tracing capabilities at all. So I would say for now, the best option we have is relying on the vendor tracing tools, I would say.
Valentino Stoll [00:57:02]:
Yeah. It's funny you mentioned Datadog. We've had, Ivo on before, from Datadog, to talk about a lot of the, like, I think memory profiling. He works on a lot of, like, granular Ruby performance tooling. Really interesting stuff. Yeah. I mean, I would love to see maybe some more, I don't know, higher level examples of, like, making use of OpenTelemetry in the Ruby space in general. Because I think that that level I mean, especially with all of the Solid Queue, like, or solid trifecta or whatever stuff that's coming around, it would be nice to see something like, tracing specifically introduced to Rails.
Valentino Stoll [00:57:47]:
That that would make, you know, more sense in that ecosystem because, yeah, I mean, where do you where do you start profiling stuff is, like, kind of like an intro to tracing. Yeah. So, like, if you wanted to see, like, the the request, it reminds me of, was it Rack Mini Profiler Yeah. Tool. Right? Where you you can just see a little tiny tab that says, oh, it took this number of seconds to load this particular page you wanted to get. And you can click on and expand and see, oh, well, what did your application do at each step of the way and see how long each thing took. Right? And I think of that as, like, a trace, a lot of the times. Right? And and it's very, like, useful, like, even when you're just starting out to see that.
Valentino Stoll [00:58:32]:
Right? And it helps you visualize the and so I got I feel like maybe that's what's missing is a lot of, like, visualization aspects of all this tracing stuff. Is there something that you, look at or find useful when you're starting to dig into, like, structuring the traces and, things like that?
John Gallagher [00:58:52]:
Definitely. That's leading me up to my one of my big kind of rants, passions, whatever, within the observability space. And I don't see anybody talking about this. I feel like it's either I'm onto a really great idea or it's an unbelievably idiotic idea for some reason I don't know about. It's usually the latter as a spoiler. Okay. So when I'm looking at traces, there's almost never enough information. Almost never enough information.
John Gallagher [00:59:28]:
And this is why Charity Majors and the team at Honeycomb and Liz Fong-Jones always talk about, have wide, context-aware events. That's their mantra. Wide, context-aware events. And events, we've already talked about. Context, we've already talked about. We haven't talked too much about the wide. So wide means lots of attributes. So their take on it is, add as many attributes as you can to every event and make them high cardinality attributes.
John Gallagher [01:00:03]:
What does that mean? It took me about 3 months to wrap my head around what high cardinality means. It means anything ending in an ID. There you go. That's an easy that's an easy explanation. So a request ID. Anything that looks oops. Sorry. That was me in my microphone.
John Gallagher [01:00:20]:
Anything that looks like, anything that is a unique identifier for anything. So that's user ID, request ID, but anything that is a domain object, and this is the real, missed opportunity I think that we have in the Rails community and in the observability community potentially in general. When something goes wrong or even when something goes right, let's say, let's take the token as an example. When that token is created, the token is a domain object. Now, okay, it's to do with authentication, so it's not really a domain object in a way. But let's say that customer is signing up for an account. The account definitely is a domain object. And if you want to understand what I mean by domain object, I just mean an object that belongs to the domain, the business domain in which you're operating.
John Gallagher [01:01:19]:
It's a business object, a domain object, call it what you will. But, when a when the CTO or the even better, the CEO or somebody in marketing talks about this customer account, they talk about people creating accounts. They use that word account. That's your first clue that it's a really important concept in the domain. So that's what I say when I mean domain objects. I mean words that non technical people use to describe your app. Right? So they're domain objects. Why are we not adding every relevant domain object to every event? We don't do it.
John Gallagher [01:01:56]:
And so what you'll see is people do this kind of half hearted, oh, well, we'll add the ID to the current span or the current trace or even the current log. We'll add the ID and that's okay. That'll be enough. But you're not capturing the state of the object. Why not just take the object, in this case the account, convert it into a hash, and attach it to the event. Why can't we do that? Now there's a number of reasons why we actually can't do that in some cases. If you're billed in terms of the size of your event, so if you're billed on data, obviously that's going to get expensive fast. But if you're billed on pure events, as in your observability provider, your observability tooling is saying for every x number of events or x number of logs per month, we will charge you this much but the size doesn't matter, then this is a perfect use case to be taking those rich domain objects, converting them into a structured format, and dumping them in the log or the trace.
John Gallagher [01:03:06]:
And so I've, kind of thought about this quite a lot and I've come up with a few quite simple ideas that people can use starting tomorrow in their Rails apps. Not without their problems, but the first of which is, I don't know if anybody's worked with to_formatted_s. So, to_formatted_s for date time strings. And we have this idea in Ruby, don't we, of duck typing. We have an object, and really good OO designers say that you shouldn't understand anything about that object. You just know it's got 4 methods on it, and it can be an account. It can be an invoice. It can be many different things.
John Gallagher [01:03:49]:
So, my approach, and I'm testing this approach out at work at the moment, is instead of having to_formatted_s, have to_formatted_h. What does that mean? It means you're going to format the domain object as a hash. And so to_formatted_s allows you to pass in a symbol to define the kind of format that you want. So it could be short, ordinal, long, humanized, and it will output a string. It'll output a stringified version of that date in these different formats. So my idea is why can't we have a method on every single domain object in our Rails app called to_formatted_h, and you pass it in a format. Format could then be OpenTelemetry. It could be any one of a number: short, compact.
John Gallagher [01:04:42]:
And so for every trace, the way I like to think of it is I want to add into that trace every object that's related to that, and you could format those in OpenTelemetry format, for example, or you could have a full format or a long format, whatever you want. And so that way, you can say, oh, I want a representation of the account that is short, and it's just got the ID. And that's a totally minimal skeleton and that's enough for me. But actually here, the work I'm doing is a bit more involved. So I want to call to_formatted_h with full, and that will give the full account, like the updated at, the created at, everything about it, and then that will be sent to my logs and traces. And I now have a standardized way of observing what's going on with all the rich data of my app state at that point with all the relevant domain objects in it. So that's my dream that I'm headed towards with this gem. So that's kind of the way I think about structuring it.
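One way the `to_formatted_h` idea could look in plain Ruby. The method name comes from the discussion above; everything else here (the `FormattedHash` module, the `hash_format` declaration, the `:short` and `:full` format names, and the `Account` struct standing in for a Rails model) is an assumption sketched for illustration, not the speaker's gem.

```ruby
# Hypothetical mixin: lets a domain object declare named hash formats.
module FormattedHash
  def self.included(base)
    base.extend(ClassMethods)
  end

  module ClassMethods
    # Declares which attributes belong to a named format.
    def hash_format(name, *attributes)
      hash_formats[name] = attributes
    end

    def hash_formats
      @hash_formats ||= {}
    end
  end

  # Returns the object's state as a hash in the requested format.
  def to_formatted_h(format = :short)
    attributes = self.class.hash_formats.fetch(format) do
      raise ArgumentError, "unknown format: #{format.inspect}"
    end
    attributes.to_h { |attribute| [attribute, public_send(attribute)] }
  end
end

# Plain Ruby stand-in for an ActiveRecord model.
Account = Struct.new(:id, :status, :customer_id, :created_at, :updated_at,
                     keyword_init: true) do
  include FormattedHash

  hash_format :short, :id
  hash_format :full, :id, :status, :customer_id, :created_at, :updated_at
end

account = Account.new(id: 42, status: "pending", customer_id: 7,
                      created_at: Time.utc(2024, 1, 1),
                      updated_at: Time.utc(2024, 1, 2))

short_view = account.to_formatted_h(:short) # minimal skeleton: just the ID
full_view  = account.to_formatted_h(:full)  # full state for logs and traces

puts short_view.inspect
puts full_view.inspect
```

Because every account is serialized through the same declaration, the keys are identical wherever an account is logged, which is what makes a query such as "every event where the account status was pending" possible.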
John Gallagher [01:05:48]:
And I think about the like, people I see people doing all this ad hoc kind of well, this is this is an ID and then we'll call the the job ID job underscore ID, I suppose. Well, what's the account? We can call that, accounts underscore ID. And I just like to think of it as imagine your domain object. So an account has a customer, a customer has a some bank details. Bank details is a bad idea, but address maybe. And so we could have these different formats that, load nested relationships or not. And, obviously, there's you gotta be careful about the performance, problems with that. And so you'll have the exact structure of your domain object in your logs, in your traces.
John Gallagher [01:06:31]:
That for me is a dream. And then every single time an account is logged, it's in the same structure. Awesome. So I know that an account is always gonna have an ID. It's always gonna have whatever other attributes. It can have a pending status, whatever it is. And so, therefore, I can say, show me every trace where the account was pending. Boom.
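Editor's sketch: to illustrate why a consistent structure makes "show me every trace where the account was pending" a one-line question, here is a small plain-Ruby example. The event names and the in-memory `events` array are made up for illustration; in practice the same filter would be a query in whatever log or trace tool you use.

```ruby
require "json"

# Hypothetical events. Each one nests the account under the same key
# with the same shape, rather than ad hoc names like job_id or accounts_id.
events = [
  { name: "invoice.created", account: { id: 1, status: "active" } },
  { name: "invoice.failed",  account: { id: 2, status: "pending" } },
  { name: "login.succeeded", account: { id: 2, status: "pending" } },
  { name: "login.succeeded", account: { id: 3, status: "active" } }
]

# Because the structure never varies, one filter covers every event type.
pending_events = events.select { |event| event.dig(:account, :status) == "pending" }

pending_events.each { |event| puts JSON.generate(event) }
```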
Valentino Stoll [01:06:55]:
Yeah. I love that idea. And it reminds me a little of the introduction of the Rails tagged logger, which was kind of a start to this idea of, okay, capture all of these pieces with this tag. It's almost a pseudo trace, I call it. But it does go along with that formatting aspect of, like, okay, format all the things like this in a specific way. And I agree that there's definitely a lot to unwind there. We'll have to have you on more when you put this together as a gem or something, because I would love to dig into that.
Valentino Stoll [01:07:45]:
Cool. Yeah. I mean, I love the idea of the domain objects, and extracting those out into a formattable way that you can then trace and follow through, because that design decision is definitely missed a lot. And seeing things like Packwerk as an example was a great step in the right direction, I thought, and I'd like to see more of that kind of evolve in the Rails ecosystem, of abstracting the domains into their own kind of segments and then being able to format them for traceability and things like that. I think you're on to a lot here.
John Gallagher [01:08:24]:
And then, I mean, the thing that I think is unbelievably ironic is all I'm talking about is convention over configuration. And is that not why we all got into Rails? I know Ruby is a different thing, but Rails is all about convention over configuration. And the entire area of observability strikes me as something that could do with a massive dollop of convention over configuration, and that's what OpenTelemetry is trying to do. The one last thing, and I know that time is getting on, but one last thing I want to just say on that is, the other huge opportunity is adding context to errors. So we have these exception objects in Ruby, and people store strings with them. And it's like, what? How am I supposed to understand anything from a string? And then people try and put IDs in the strings and, I don't know. So at work, I've made this extremely simple thing, basically a subclass of StandardError where you can attach context. So when you create the error, you pass in structured context.
John Gallagher [01:09:27]:
So if our logs are structured, surely our errors should be structured as well. Makes sense. Right? So you can say, this error happened and here was the account associated with it when that error happened, and here's a user, and here's this. So it gets attached to the error, and then using Rails' new error handling, Rails.error.handle, if you've not used it before, look it up. It's absolutely awesome. It's one of my favorite things that they've added to Rails recently, relatively recently, in the last few years. And you can basically have listeners to these events, to these errors, beg your pardon. It will catch the errors, and then the context is encapsulated in the error. So you can pass these errors around, and then you can do interesting stuff with that context.
John Gallagher [01:10:16]:
And all I do is pull out all the context and send it straight into the logs. And that has absolutely changed the way I debug, because whenever there's an error and it has all that rich data, you just look in the rich data and you're like, oh, that was the account. That was the Shopify ID. That was the product ID. I've got it. And then you just look at the ID and your externals. Oh, right. Okay.
John Gallagher [01:10:38]:
It's out of sync, whatever it is. It makes life so much easier. So that's something I'm really passionate about as well, having domain objects encapsulated within errors. So we've got structured errors, not just structured logs.
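Editor's sketch: a structured error along the lines John describes might look like this. The `ContextualError` class, its `context:` keyword, and the `error_to_log_line` helper are hypothetical names chosen for illustration; the sketch uses plain Ruby rather than Rails' error reporter so that it runs on its own. In a Rails app, the same context extraction would presumably live in a subscriber registered with the error reporter.

```ruby
require "json"

# Hypothetical base error that carries structured context alongside the message.
class ContextualError < StandardError
  attr_reader :context

  def initialize(message = nil, context: {})
    super(message)
    @context = context
  end
end

# Stand-in for an error listener: merges the error and its context
# into a single structured log line.
def error_to_log_line(error)
  context = error.respond_to?(:context) ? error.context : {}
  payload = { error: error.class.name, message: error.message }.merge(context)
  JSON.generate(payload)
end

begin
  raise ContextualError.new(
    "Product out of sync",
    context: { account_id: 42, shopify_id: "gid-123", product_id: 7 }
  )
rescue ContextualError => e
  log_line = error_to_log_line(e)
  puts log_line
end
```

Errors that do not carry context still produce a log line with the class and message, so the helper can be pointed at any exception.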
Valentino Stoll [01:10:52]:
Yeah. I mean, that's definitely one thing that I look for when I'm installing dependencies. Right? Like, does the gem have its own base error class that can give metadata about whatever it's raising the errors about? Like, more than just a string of some error that you then have to figure out. Having that extra metadata, just because you can. You can just add attributes to a class, right, and say, this error has these attributes. It has meaning associated with the error. I think more people doing that is definitely gonna make that easier to do, first of all.
Valentino Stoll [01:11:30]:
But yeah. And then also getting more people to take on that convention. I completely agree with you there. Yeah. I mean, we are getting to time here. Are there any last pieces you wanted to quickly highlight or mention before we move into picks?
John Gallagher [01:11:50]:
I think the main thing is if you're listening to this and anything that I'm saying is resonating, forget about the domain object stuff. That's like getting really into the nitty gritty. But coming back to the beginning, if you're frustrated by your debugging experience, if you're thinking, why am I not smart enough to understand this? Chances are the problem is not with you, it's with the tools. So if you improve the tools, not only do you make your life easier and better, you level up everybody around you because all the engineers can use the same tools. And that's what we've experienced at BiggerPockets. And that culture of observability has really worked its way into our culture so that now anybody is equipped to go into the logs and ask any question that they want. So it is a long road, but it all starts with a single step. And so if you are feeling that pain, feel free to reach out to me.
John Gallagher [01:12:47]:
I can go through all my socials in a minute, but feel free to reach out to me. Ask me any questions. I'm happy to jump on a Zoom call for half an hour and help you for free. But basically, it all starts by taking very small steps towards a very specific question. Don't try and add observability, because you'll still be here next Christmas. So, take heed. There is hope. And if anything that I say resonates, please feel free to reach out to me, and I'll help you figure it out.
Valentino Stoll [01:13:19]:
That's awesome. Yeah. I also echo that sentiment: tooling is so important. And, you know, open tracing definitely is a great framework. And if we can improve that in the Ruby space, we'll definitely be reaping the rewards as well. So let's move into picks. John, do you have anything that you want to share first, or do you want me to go?
John Gallagher [01:13:53]:
Am I limited to one pick? Because I have many.
Valentino Stoll [01:13:56]:
No. Okay. Go ahead.
John Gallagher [01:13:57]:
Cool. So the first one is a new language, and I already thoroughly trounced the idea that we should be learning one programming language a year, or rather I just dissed it off without actually giving much justification. So I'm going to go back on what I just said and say that this language has changed the way I think pretty much forever, and it's changed the way I see Ruby and Rails and just programming in general. And the language is called Unison. Now it's a very, very strange, unusual language. It's maybe not that readable in places, and it's also extremely new. I mean, it's been going for 5 or 6 years, but what they're trying to do is incredibly ambitious. But look it up.
John Gallagher [01:14:51]:
It's, yeah, it's an incredibly interesting language, and it will expand your mind. That's certainly what it's done for me. And so it's kind of a language that's targeted at creating programs that are just much, much simpler, but actually more difficult to get your head around. It's a completely new paradigm for distributed computing, basically. And it's absolutely fascinating. So I would highly suggest checking that out. I know that Dave Thomas at EuRuKo, when I spoke at EuRuKo recently, was on the stage championing Unison, and he called it the future of programming, and I could not agree more. It's an incredible language made by some incredibly smart people.
John Gallagher [01:15:38]:
So that's number 1. Number 2, there is a static site builder. I've used pretty much all the static site builders on planet earth, and this is my favorite. It's called 11ty. It's a really odd name. But I'm embarking upon this project at work that really is exciting me, which is how do you serve UI components from a dynamic app, so Rails, and meld them into a static site builder without having a pile of JavaScript that you have to wade through. So I want to offer my UI components in Rails, and I want to deliver them extremely fast through a static site that's just a blog, without having to run that blog on Rails. So Eleventy is my go-to tool for doing all that stuff.
John Gallagher [01:16:30]:
It also encompasses this thing called WebC, which is my new favorite templating language. Yes, I know, another templating language. I promise it's really good. It's not another retread of all these other templating languages that are very, very niche and very whatever. So WebC is compatible with web components, and it's a fantastic way of making HTML-like components that are server-side rendered. And I would love to see a plugin for that come to Rails, because it is absolutely phenomenal. So those are my 2 favorite things at the moment.
John Gallagher [01:17:10]:
If anybody's trying to wrestle with UI components in Rails and trying to extract them out of Rails components also, I would love to chat through that with anybody who's interested in that kind of area, because I think there's a potential to really break new ground. How about you?
Valentino Stoll [01:17:29]:
Yeah. Thanks. I'll definitely be digging into some of those. Yeah. I was in New York City the other day for the Ruby AI happy hour that they've been doing every couple of months. This time they did demos, and I demoed this real-time podcast buddy that I've made. It's called Podcast Buddy. And it just kind of listens in the background and, in real time, keeps track of the topics and the discussions and some example questions worth mentioning, or maybe some topics to transition to.
Valentino Stoll [01:18:07]:
And it's a lot of fun. I just did it for fun, but I recently refactored it to use the async framework. And shout out to Samuel Williams. Just phenomenal, so well put together. The documentation is coming along. It is lacking in some areas, but I was able to just completely refactor the code so that it works with async and runs things as they come in, and it's streaming the Whisper transcripts. It performs actions in the background, in the same thread, all managed with async. I just love it.
Valentino Stoll [01:18:45]:
So check out Podcast Buddy and check out async. You can't go wrong. There's Async WebSocket now. You can handle even WebSockets asynchronously, just completely seamless, HTTP/2 and HTTP/1 compatible. Love it. So, check those out. And, John, if people wanna reach out to you on the web, or just in general, how can they reach you?
John Gallagher [01:19:14]:
Thank you. Yeah. So I'm on LinkedIn. That's the platform I'm most active on, and my LinkedIn handle is synaptic mishap, which is, yeah, I really regret that. Sorry, everybody. But, yeah, if you just search for John Gallagher, g a, double l, a g h e r, and maybe Rails or observability, you should be able to find me. I've got quite a cheesy photo, a black and white photo of me in a suit.
John Gallagher [01:19:44]:
It's a horrible photo. And I blog at joyfulprogramming.com. It's a Substack. So is this still a blog anymore? I have no idea, but that's where I write. I'm on Twitter at synaptic mishap, and my GitHub handle is John Gallagher, all one word. So, yeah, Joyful Programming is the main source of goodies for me. I've also got a fairly minimal YouTube channel called Joyful Programming. So feel free to reach out to me, send me a connection request, ask me any question.
John Gallagher [01:20:18]:
I would love to engage with some Ruby folks about observability. Tell me your problems, and I'll try and help you wherever I can.
Valentino Stoll [01:20:25]:
Awesome. I love it. Keep up the great work and keep shouting from the mountaintops about observability, pulling those pillars down, and just focusing on the important stuff. Right? I love it. So until next time, everybody. I'm Valentino. Thanks, John, for coming on, and I look forward to next time.
John Gallagher [01:20:49]:
Thanks for having me, Valentino. It's been amazing.
Valentino Stoll [01:20:52]:
Awesome.
Hey,
Valentino Stoll [00:00:04]:
everybody. Welcome to another episode of the Ruby Rogues podcast. I am your host today, Valentino Stoll, and we are joined by a very special guest today, John Gallagher. John, can you introduce yourself and tell everybody, a little bit about yourself and why we had you on today?
John Gallagher [00:00:21]:
Sure. Thanks for having me on. My name's, John Gallagher, and I am, pardon me, a senior engineer at a company called BiggerPockets, and we teach how to invest in real estate based in the US. And, I also run my own business on the side called Joyful Programming to introduce more joy to the world of programming. And I'm on today to talk a bit about observability, which is one of my many passions. I'm a bit of a polymath. This is one of the things that is really, really, important to me and I'm passionate about. I'm particularly passionate about introducing this into Rails apps.
John Gallagher [00:01:00]:
So thanks for having me on.
Valentino Stoll [00:01:02]:
Yeah. And thank you for, all the joy you're bringing to people, I hope. You definitely picked the right language. Ruby, if if you're not familiar with this podcast, Ruby is a very, joyful, experience, personally. So it's very cool. I love I've loved been digging into all of the observability talk, that you have on on joyful programming. And it's a kind of a very important topic that I feel is definitely kind of overlooked if you're starting up. Maybe you get some, like, you know, bug alerting or something like that in place as, like, a standard.
Valentino Stoll [00:01:53]:
But kind of anything performance monitoring wise is kind of like a, oh, no. Like, something happened. Let's look into it now. I feel like it's like the the typical, flow of things, as people start up. Do you wanna just give us, like, a high level, like, what is observability and why should we care? What, you know, we could drill into the the details of it after.
John Gallagher [00:02:21]:
Brilliant. Well, I don't actually think anybody should care about observability, and I don't care about observability as a thing because it's just a means to an end. And what's the actual goal? Doesn't matter how you get there, but the goal is being able to, number 1, understand your rails app in production. And number 2, be able to ask unusual questions. Not questions that you've thought of a day, 2 days, 3 weeks ago because that's not really very useful or interesting. If we knew exactly the questions to ask in the future of our apps, everything would be easy. Just be like, how many how many 200, have we had in the last week? It's kind of boring question to ask. Maybe a bit useful.
John Gallagher [00:03:14]:
I find the more obvious the question, the less useful it is. So, observability is the practice of making a black box system more transparent. So I like to think of it, imagine your entire rails app, all the hosting, everything to do with that app is wrapped up in an opaque black box. And somebody says, how does it work and why is this thing going wrong? You would have no hope of understanding it. If the box is completely translucent and you can see everything, which of course is completely impossible in software, but in theory, you'd have this completely translucent box and you can ask all these questions and you get instant answers. That's like 100% observability and of course that is absolutely impossible. And so what we're trying to do with observability is understand what is going on. Not just when it goes wrong, although that's the obvious use case is we have an incident the most, you know, the most, critical point where observability comes into play is an exact scenario that I landed in 2 weeks into a a new role I had.
John Gallagher [00:04:29]:
So it was 2 weeks in, the site had gone down, I was in the u I am in the UK and my tea the rest of my team were in the US and there were 2 other engineers in my time zone. And all of us have been in at the company for a total of 5 weeks. Right? So we've got this app, it's down, it's on fire, and we need to put the fire out. And we just the 3 of us looked at each other, we're like, what? Should we just restart the dynos? Yeah. So we restarted the dynos, we crossed our fingers, and it was pure luck that the app came back up. That is the exact opposite of what we want. And we've now moved to a situation where we can ask our app a whole load of very unusual questions, and we will get an answer to that. Why are there a peak of 4 o fours on iOS at 3 AM? Looks like that a lot of them are coming from this IP address.
John Gallagher [00:05:30]:
Okay. What's the IP address doing on the site? Okay. Interesting. How many users are using that IP address? 5. So only 5 people use it and you keep so that's the point of observability to me, to be able to ask unusual questions that you haven't thought of already dynamically and explore the space and come to some conclusions.
Valentino Stoll [00:05:54]:
Yeah. I think that's a great overview. And your, your your debugging, reminded me of the I I had the lucky experience of, running reels with, Ruby 1.87. And every once in a while, you just had to, like, give the give the server a little kick because it started to grow in memory size, and just, you know, giving it a quick little flush, like reset things. And you're just like, oh, I guess that's how we're gonna do it, until we can get, like, some insight into what's happening. Right? And I think that's definitely underlines the importance of observability in general. Like, you know, how do you get those insights to begin with? And maybe that's a great starting point. Like, where do you start, like, looking at it, like, adding these insights? Right? Like, what's the is there a modular approach you could take, or is it more of, like, a you should look at doing everything all at once kinda thing?
John Gallagher [00:06:57]:
You should definitely not look at doing everything all at once. As I think we can all agree in software, doing everything all at once is the recipe for disaster no matter what you're doing. That's not
Valentino Stoll [00:07:07]:
just I mean, there's no vendor you could just, like, pay money to and, like, you get a 100% of serviceability.
John Gallagher [00:07:12]:
There are vendors that tell you that you can do that. Whether you actually can or not is a different matter. Spoiler alert, you can't. So, I just wanna back up a little bit and talk about the feelings because I think it's the feelings that is where all of this start for me. Pardon me. So, I got into observability and it's funny because for the first kind of year of my journey doing this, I didn't even realize I was doing observability. I'd heard about this observability thing and it was out there in the universe. I'm like, okay.
John Gallagher [00:07:46]:
Maybe I should learn that, I should learn that, and I kept using the should, I should learn this actually. I have loads of other stuff to do. I've got loads of other things. I know what it is. I know it comes from controls here and there's a Wikipedia page that's really complex and really confusing. Whatever. I've got real work to do. But what I know is that I kept coming across these bugs in Bugsnag, Sentry, Airbreak, choose your error reporting tool.
John Gallagher [00:08:11]:
They all help you to a degree, but they're not a silver bullet. And I kept coming across these defects over and out, and the story was exactly the same. Come across a defect, I'd see the stack trace in the error reporting tool, and I would look at it and it it first first emotion right out the gate complete confusion. What is going on here? No idea. So I dig a little bit into the code. I dig it a little bit into the stack trace so it's it's coming from here, and this thing is nil. Classic. Right? This thing is nil.
John Gallagher [00:08:43]:
Where was it being passed in as nil? I don't know. So now I'm like, well, I can't just say I can't fix this. So I now have to, well, do what exactly? I don't have any information to go off. Well, I guess we'll do that bug right. Let's look at the next one, and this just kept happening. And I would find myself going through all the bugs in the backlog, and I couldn't fix any of them. And I just wasted 4 hours looking at things, asking questions that I couldn't explain, looking at things I didn't understand. And for years, I thought the problem was with me.
John Gallagher [00:09:19]:
I honestly thought I'm just not smart enough. I'm not a good engineer. Blah blah blah blah blah. Book fixing isn't really my thing. It's I'm just not really good at it. And then after many, many years of this, I was in a company, and I just got really sick of this. We we just released a brand new app, and it was it was a customer account app. And we're getting all these weird bug reports.
John Gallagher [00:09:43]:
People saying how to log in, people kept saying I can't reset my password. And every time we did this, we would add a little bit of kind of this ad hoc logging and then put the the bug back in the backlog, and then it would come up again and come up again. And after a while, I was just like, this is just this is ridiculous. We're highly paid engineers. This is not a bad way. So then I started looking into we were using Kibana at the time or rather I should say we were not using Kibana at the time. Kibana was there. We were paying for it.
John Gallagher [00:10:12]:
And and I was like, I've heard this is something to do with logging. So where do we do our logging? People like, I have no idea what this even is. Let's open it up. And there was just all this trash, all this rubbish. I was like, what's this? How is this supposed to be useful? People are like, oh, wait. We don't really look at that. It's not very useful. So so how do you figure out bugs? They're like, well, we just, yes, we just figure it out.
John Gallagher [00:10:37]:
Well, yes, but we're not figuring it out. So all of this was born through frustration. And so what I did back then is what I recommend everybody does now to answer to your to answer your question, come back to the point, John, which is take a question that you wish you knew the answer to, a very specific question, not why is our app not performing as we want, not as in, like, why do our you know, a very, very specific question. So so take your big, big question, and at the time this was why are people being locked out of the app? Why can't they not reset the password? They're clicking on this password link, and they're saying it's expired or it goes nowhere or it doesn't work. Okay. Why are those people like, why is that happening? So that's quite a general question, and you wanna break it down into some hypotheses. So that's the first thing. I have a 5 step process, and this is step 1.
John Gallagher [00:11:38]:
I'll go through the 5 step process in a minute. So step 1 is think of a specific question. So a specific question this might case case might be, okay. I've got one customer here. There's many, many different types of defects. So this one customer here is saying it was expired. I went to the web page and the the link said it had expired. Okay.
John Gallagher [00:12:02]:
When did they click on that link? What response did the app give to them? And when did the token time out? Right? So those are 3 questions. Now they're not going to get us to the answer directly, but there are 3 questions, very specific questions that we can add instrumentation for. So I take one of those questions. When did the token time out? Great question. So in order to do that, we need to know when the token was created and what the expire of the token was. This is just a random example off the top of my head. So you'd be like, okay. Well, we need to know the customer ID.
John Gallagher [00:12:46]:
We need to know the token. We don't actually need to know the exact token, but we need to know the customer ID. We need to know the time that the token was created and the the expiry time of that token. Is it 15 minutes? Is it 2 hours? Whatever. So I would then look into the code, that's the next. So so we've done step 2. Step 2 is define the data that you want to collect. User ID, token expiry, and an event saying the token has been created now for this user ID.
John Gallagher [00:13:23]:
K. So that's the second step. The third step is build the instrumentation to do that. So whatever you have to do, maybe it's you need to actually add structured logging to your entire app. I don't know. Maybe it's that you've got structured logging fine, but you there's nothing listening to it. Maybe. Maybe the tool just can't actually measure what you want it to measure, so maybe you need to invest in a new tool, whatever it is.
John Gallagher [00:13:47]:
And then you build some code to instrument just that very small piece of functionality. And then once you've done that, you wait for it to deploy, and then you look at the graphs, you look at the logs, you look at the charts, whatever output you've got. And what normally happens is, for me, I look at the charts and I say that is not what I wanted at all, actually. I've misunderstood the problem. I've misunderstood the data I want. Now that I see it, just like you would with, agility, true agility, not agile because agile means something else now. But true agility is you do a little bit of work, you develop a feature, you show the customer they say, not quite right. Go back.
John Gallagher [00:14:29]:
Adjust it. Closer, but still not quite right. But if you ask them to describe it exactly right from the beginning, it doesn't align with what they want at all. You need to show them, and it's only by showing them that you get feedback. And the same is true for ourselves. It's only by looking at the graphs and logs that I realized that's actually isn't what I wanted to begin with, or it is, or I'm onto something there. And so I keep then sort of I've used the graph. Maybe it was unusable.
John Gallagher [00:14:59]:
Maybe I couldn't query the parameter. Maybe there's all sorts of things that might be happening there. So then the last stage is improve. And so from improve, you can go back to the very beginning, ask a different question, or maybe you just want to iterate on the instrumentation a bit, deploy it again. Oh, that's more like it. Okay. So now we know the token expiry. What's the next question we want to ask? Well, why did like, when did the user actually hit the site? Was it after the token expiry or before? Okay.
John Gallagher [00:15:28]:
Sounds like an obvious question, but maybe maybe it's after, which would indicate the token really had expired. Oh, it's before. How could it be expired when it was before? Oh, hang on. What's the time zone of the token? Now we're getting into it. Right? So you log the time zone. Holy cow. The time zone of the token is out of sync with the time zone of the user. That's what it is.
Valentino Stoll [00:15:54]:
Yeah. I love that I love that analogy of identifying the use case in order to expose what to observe and and where to insert, you know, all of these pieces that are missing or identify them really. Right? Not to just insert them, but to identify them. I think that's very important, I think, in general is, like, trying to identify the actual use cases, in order to know what you even wanna capture to begin with. Right? Like, yeah, we get to throw a wall of logs at at a source resource like Kibana, and, it's not very useful. But, once you start to abstract the ideas and use cases, and how people are actually, like, using the thing that you've built, you know, you can definitely isolate, what it is that you actually care about. And I think that I think you're right. Like, that is, like, kind of the whole importance of observability is is identifying that use case and exposing what what you actually care about, as far as all these things that are because, I mean, you know, there's HTTP logs.
Valentino Stoll [00:17:05]:
There's, like, all all kinds of logs and information available that's just, like, omitting all the time, like, how do you know and identify, you know, which are really important. And I I think it just depends. Right? Like, what are you, yeah, what are you trying to capture? So it's a it's a great, like, step wise way to just, like, start to figure that out. Right? Because, yeah, I guess depending on your role and depending on what, you know, your responsibilities are, that could change and that could be different. And your observability needs will change with that. So, identifying that is probably most important, I think.
John Gallagher [00:17:44]:
But but as with everything else, I would say, if you're really not feeling any pain, don't bother. Just don't bother. I'm not into kind of not really interested in telling people what they should be doing or could be doing. I mean, goodness me. We we hear enough of that in engineering, don't we? You should really learn a language every year. You should be blah. You should be blah. Sick of it.
John Gallagher [00:18:08]:
Absolutely sick of all these gurus telling me what to do and what I should be learning and what I and very few of them talk about, well, what's the benefit to me? And in order for me to do anything, in order for me to change as a human being in any way, learn anything, I have to feel the pain of it. If you're not feeling the pain, don't bother. But if you are feeling pain, if deploys are really, glitchy, if you keep ask for me, the kicker is if I keep asking questions I don't have the answer to, that's a concern. And if they're just minor, oh, like, why did I wake up 10 minutes late today? Who cares? It's not important. But if the site's gone down for the 4th time this month, and every time the site goes down, we lose at least 5 grand, 10 grand, maybe even more. And even worse, every single time the site does go down, we just kind of get it back up more by look than good judgment. This kind of feeling of, we kind of got away with it that time. That that's okay.
John Gallagher [00:19:15]:
And all there was this weird thing, and we you know, and it's still not really figured that one out, but that's okay. We'll just put it in the backlog. It's the operational risk. You've gotta decide. Are you comfortable with that operational risk or not? Is it big enough? And in my experience, you've kind of got to hit rock bottom with this stuff as I did. There were loads and loads of bugs that I could have investigated and added logging for and fixed, but, you know, it's pushing a boulder up a hill. It's not actually worth it. And it was only when it reached my threshold of pain.
John Gallagher [00:19:47]:
I was like, you know what? I have to do something about this now. This is just ridiculous. We're professional people. We're being paid a lot of money, and it's not working. The app that we've delivered is not working. What's more, we don't know why. But also I do just want to add, and this may broaden out the conversation a little bit, and we may want to keep it narrow on Rails apps, but I've realized that observability principles go way beyond how does our web app work. They apply to any black box.
John Gallagher [00:20:23]:
So as an example, a few years ago, I was working at a company and their SEO wasn't great. And they just kind of were like, oh, you know, we'll try and fix it, and they had several attempts to fix it. None of them really worked, and every attempt was the same. They would get some expert in. The expert would give us a list of 100 things to do. We would do 80 of the 100, and then nothing would really improve. And then they'd be like, well, we did everything you said, and then they'd move on to another. And rinse and repeat, keep doing that.
John Gallagher [00:21:03]:
And then one day, within 4 weeks, 20% of the site traffic disappeared, and nobody could tell us why. Nobody understood why. Observability. Now Google is a black box, so, you know, you're not going to be able to instrument Google, but there's lots of tools that allow you to peer into the inner workings of Google: SEMrush, Screaming Frog, all these kind of tools. They are, in my opinion, actually, to some degree, in the observability space. You know, everybody thinks of them as marketing tools, search engine optimization tools, whatever, whatever, whatever. They're allowing you to make reasoned guesses about why your searches aren't performing the way you'd expect. And then you can actually take action on that because now you have some data.
John Gallagher [00:21:53]:
Oh, this keyword dropped from place 4 to place 100. Why is that? Okay. Let's try hypothesis A, put that live, and see if Google will respond to that. Oh, we're now up to, you know, position 80, whatever it is. So the idea of observability goes way, way beyond Datadog and New Relic and, obviously, all of those people in the observability space. I see it as a much, much wider, much more applicable topic.
Valentino Stoll [00:22:26]:
Yeah. I hear you there. And I'm also, like, you know, let's not just add New Relic to every app that we deploy or, you know, is Bugsnag even needed for every app? Like, these are questions that I ask myself too. Like, what value are you getting from all these auxiliary services that give you the observability into, like, just blanket things? Right? Like, at what point do you, like, stop that kind of mentality and be like, alright. Well, you know, every Rails app should at least be able to get insight into the logs so that you can see what the application is doing. Like, well, how long do you capture that? Like, what kind of time frame? Do you have any, like, default standards where you're like, well, I know that I'm gonna need to look at this at some point in the application cycle. Like, what are your defaults?
John Gallagher [00:23:24]:
Great question. I would say, if you're making a small app with very little traffic... and it's thresholds, like anything else. You're making a small app with very little traffic. I have a client at the moment I'm consulting for and I've made them an app and it has maybe flipping 20 visits a day or something. 20 hits a day. So I installed Rollbar, the free version of Rollbar. Anything goes wrong, I get a notification. It's fine.
John Gallagher [00:24:03]:
The further up the stack you move, the more the defaults change. For a Rails app that's mission critical, and I'm not even gonna say mission critical, but just serving a decent number of hits a month, 10,000, 20,000, I don't know. I've tried a lot of observability tools, and there's no one yet that I can unreservedly recommend. They've all got their pros and cons. Datadog is a good option if money is no object. I kinda don't wanna get into the tooling debate because it's kind of a bit of a red herring, I think, in many ways.
John Gallagher [00:24:48]:
There's various cost benefit trade offs there. But in terms of the defaults, in terms of what you observe, requests has got to be up there. So for every app that I have in my care of any significant size, I would always say install Semantic Logger. Semantic Logger is the best logger I've found. It does JSON out of the box. It's quite extensible. There are many problems with it, but it's the best option that we've got. So that's number 1.
John Gallagher [00:25:18]:
That will log every... like, Rails already logs every request for you; that will format it in JSON for you. There are some notable missing defaults in Semantic Logger, and I'm working on a gem at the moment that will add some even more sensible defaults into it. So for example, I believe that request headers do not get logged out of the box. Certainly request body does not get logged out of the box. Request headers might be. The user agent doesn't get logged out of the box. I mean, this is just pretty basic stuff. Right? So I have a setup that I use that logs a whole load of things about the requests out of the box.
John Gallagher [00:26:06]:
I liked adding user ID out of the box. It depends what kind of setup you have for authentication, but at the very, very least, if somebody's logged in, the ID of them should be logged in every single request. That is absolutely, you know, absolute basic stuff. A request ID is also a really, really useful one. I I have a complex relationship with logs and tracing because tracing is essentially the the pinnacle of observability. I hear a lot of people say logging, like like logging is a be all and end all. Logging is a great place to start, but tracing is really where it's at. And I can go into that why that is in in in a bit, but logging is a great default.
John Gallagher [00:26:51]:
Logging is a good place to start. Start with Semantic Logger. Basically, every single thing that's important in any request should be logged. So that's every header. Obviously, you need to be careful with sensitive data in headers, like, use your Rails, Active... can't remember what it's called, but there's the filtering module that you can add in. And sometimes Semantic Logger doesn't give you that by default, so you need to be a bit careful. A good default as well is logging all background jobs.
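As a rough sketch of the kind of per-request log entry described here, the following builds one JSON line carrying the request ID, user ID, user agent, and headers with sensitive values masked. It uses only the Ruby standard library and is not Semantic Logger's or Rails' API; `SENSITIVE_HEADERS`, `filter_headers`, `request_event`, and all field names are hypothetical.

```ruby
require "json"
require "time"

# Hypothetical list of headers whose values should never reach the logs.
SENSITIVE_HEADERS = %w[authorization cookie x-api-key].freeze

# Mask sensitive header values, keep everything else as-is.
def filter_headers(headers)
  headers.to_h do |name, value|
    [name, SENSITIVE_HEADERS.include?(name.downcase) ? "[FILTERED]" : value]
  end
end

# Build one structured log event for a completed request.
def request_event(request_id:, user_id:, method:, path:, status:, headers:)
  {
    event: "request.completed",
    timestamp: Time.now.utc.iso8601(3),
    request_id: request_id,
    user_id: user_id,
    method: method,
    path: path,
    status: status,
    user_agent: headers["User-Agent"],
    headers: filter_headers(headers)
  }
end

event = request_event(
  request_id: "req-123", user_id: 42, method: "POST", path: "/tokens", status: 201,
  headers: { "User-Agent" => "curl/8.0", "Authorization" => "Bearer secret" }
)
puts JSON.generate(event)
```

In a real Rails app the same fields would more likely be attached through the logger's own configuration; the point of the sketch is only the shape of the event.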
John Gallagher [00:27:30]:
Background jobs are one of the most painful areas of observability that I've experienced, and we still haven't really cracked it. We have some very, very basic logging out of the box in Semantic Logger. I believe it logs the job class, the job ID, and a few other things, but it doesn't log the latency, which is a huge, huge missed opportunity. And I don't believe it logs the request ID from when it was enqueued. So when a job is enqueued, by default, Semantic Logger will trigger a little entry in the logs: this job is enqueued, and it will tell you what request it came from. But on the other side, when it's picked up and the job is performed, that request ID is missing. So you need to kind of go into the request ID, find the enqueued job, find the job ID, and then take that next leap.
John Gallagher [00:28:26]:
So, I mean, it's a bit clunky, but it's manageable. So in short, Semantic Logger gives you some okay defaults out of the box, but there's some really basic stuff that it still misses. And so background jobs, requests, those are the 2 really, really big ones to start out with, but as you can imagine, there are a ton more.
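One way to close the two gaps mentioned here (the missing request ID on the perform side and the missing latency) is to carry both through the job payload. This is a minimal sketch in plain Ruby, not how Semantic Logger, Active Job, or Sidekiq actually do it; `enqueue_payload`, `perform_event`, and the field names are hypothetical.

```ruby
require "json"

# Payload stored on the queue: job arguments plus context from the web request.
def enqueue_payload(job_class:, job_id:, request_id:, args:, now: Time.now)
  {
    "job_class" => job_class,
    "job_id" => job_id,
    "request_id" => request_id, # carried over from the request that enqueued the job
    "enqueued_at" => now.to_f,
    "args" => args
  }
end

# Log event written when a worker picks the job up.
def perform_event(payload, now: Time.now)
  {
    event: "job.started",
    job_class: payload["job_class"],
    job_id: payload["job_id"],
    request_id: payload["request_id"],
    # Time spent waiting in the queue before a worker started the job.
    queue_latency_ms: ((now.to_f - payload["enqueued_at"]) * 1000).round
  }
end

enqueued = Time.at(1_700_000_000)
payload = enqueue_payload(job_class: "SendWelcomeEmail", job_id: "job-1",
                          request_id: "req-123", args: [42], now: enqueued)
puts JSON.generate(perform_event(payload, now: enqueued + 2.5))
```

With the request ID on both log entries, one filter on `request_id` shows the request, the enqueue, and the perform together, without the extra hop through the job ID.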
Valentino Stoll [00:28:49]:
Yeah. You mentioned kind of some key pieces I always think of with observability in general, which is, like, separating the pieces into their own puzzle. Right? Like, we have logs, which are kind of just like our data, and then we have individual metrics that we're, like, snapshotting the logs for particular segments, like traffic or number of people using it, like the number of jobs that are running. And then there are traces, which we could dig into next because I have a lot of love for all of the standards that are coming out of this with OpenTracing and things like that. I'd love to dig in there. But, also, like, alerting. Like, you know, how does anybody know that there's ever a problem?
John Gallagher [00:29:40]:
So much to talk about.
Valentino Stoll [00:29:42]:
Yeah. I mean and and I love, I love, like, thinking about it in these separate groups and categories because I think it also helps, to think about, like, the overarching theme, which is, like, getting insight, but also, like, getting meaningful insight, and, like, when you want. Really, like, the only the only reason ever anybody ever cares about observability anyway is, like, when something goes wrong, you know, or something is problematic that causes something to go wrong, and you wanna either catch it early or, you know, try and remediate? That's right. And so, like, where do you find, like I mean, background jobs are, like, kind of like I I feel like the first instance where people realize, like, oh, like, we need to start looking at, you know, what it's doing. Right? Like, you start throwing stuff in the background. You're like, okay. Great. Like, it's doing the work.
Valentino Stoll [00:30:38]:
And then you don't maybe realize if you're on the same node, like, well, those, you know, slow requests can block the web requests. Right? And then, okay, well, if you split those up, finally you got that resolved, but then, okay, well, one problematic, you know, job can back up a queue that it's on. You know, like, where do you to to me, like, the background processing aspect is, like, why we have tracing to to begin with, because it does like it's concurrency. Right? So it's like that's where everybody, like, ends up hitting their pitfalls is as soon as you start doing things, like, all at once, like and thinking, oh, like, we just throw it in the background and, like, process things as they come. And as things start to scale, it causes more problems as you try and find figure out timing and stuff like that. Like, where do you find the most important pieces of, like, making sure that you, you know, are capturing the right segments and the right flows, you know, in that process?
John Gallagher [00:31:43]:
Yeah. There's so many things you touched on there I want to come back to. To answer your question, first of all, it's the 5 steps that I walked through. Yeah. The short answer is if you have a specific question that you cannot answer, what we're really talking about is the implementation details of how you answer that question. So what question you pick determines a whole load of stuff. I can't just give you a bog standard answer because it just depends. I hate saying that, but it does.
John Gallagher [00:32:18]:
So I think the first question is to ask the question, figure out what data is missing, and then choose the right piece to add into your logs. I feel like I've maybe not understood your question maybe.
Valentino Stoll [00:32:36]:
Yeah. I mean, it's more of like a an open open question. I I guess, like, when trying to think about like, I one of my biggest debugging pitfalls is, like, trying to, like, reconstruct the state of what happened when something went wrong. It's like Right. I feel like that's, like, one of the most typical, things is, like, okay. Something happened. Like, well, like, it's the data has changed since something had happened. Yeah.
Valentino Stoll [00:33:07]:
Maybe the change resolved the issue, but, like... Yeah. You know, trying to figure out what that is and running through those questions. Right? Like, how do you think about, like, reconstructing data or reconstructing the state of an issue? Like, is that not the right way to go about it, or do you try and, like, do something else?
John Gallagher [00:33:29]:
Fantastic question. So, and and this gets to the root of why the 3 pillars are complete nonsense. K. So there'll be a lot of
Valentino Stoll [00:33:39]:
What are the what are the 3 pillars?
John Gallagher [00:33:42]:
Metrics, traces, and logs. Okay. Nonsense. They're not 3 pillars. The analogy I like to use is: saying that observability is 3 pillars and it's traces, logs, and metrics is a bit like saying programming is 3 pillars. It's arrays, integers, and strings. It's the same kind of deal. It's... no.
John Gallagher [00:34:07]:
It's nothing to do with those things. Well, it is because you use those every day. Yes. But you're kinda missing the point. So, thanks to some amazing work by people at Honeycomb and Charity Majors, and reading their stuff and reading their incredible work, I've realized that metrics, tracing, and logs are missing the point. The point is, we want to see events that happened at some point in time, and that neatly answers your question about how do you reconstruct the state of the app. I mean, the short answer is, of course, you can't. If you're not in an event driven system, if you're in a CRUD app, if you're storing state to the database, there is no way you can go back in time and accurately recreate it.
John Gallagher [00:34:55]:
But we can give it a reasonably good start. And we can do this by capturing the state of each event when it was... forget about observability tools and logging and structured logging and tracing just now. Imagine when that incident happened; let's say my expired token would be potentially a good example. There are several points in that timeline that we want to understand. Number 1, when the token was created. Number 2, when the user hit the website, and maybe there's a third one, when the account was created, let's say that. So imagine if at each of those three points, we had a rich event with everything related to that event in it. So when the account was created, we had the account ID, the status of the account, whether it's pending or not, the creation date, the customer, the customer ID, blah blah blah blah blah.
John Gallagher [00:35:55]:
And then when the user visited the site, what was the request? What was the request ID? What was the user ID? What was the anonymous user ID, etcetera, etcetera. And then when the token was created, what was the expiry? What was the this? What was the that? What was the user ID? Okay. So if we have those 3 events and we have enough rich data gathered with each of the events, we can answer your question. Does that make sense so far? There's a whole load more blah blah blah, but does that make sense so far?
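The expired-token example can be sketched as three kinds of rich events and two small queries over them. Everything below is invented for illustration: the event names, fields, IDs, and timestamps are hypothetical, and a real system would query a log or tracing backend rather than an in-memory array.

```ruby
require "time"

# Made-up events, each carrying the context that existed when it happened.
EVENTS = [
  { name: "account.created", at: "2024-05-01T09:00:00Z", account_id: 7, user_id: 42, status: "pending" },
  { name: "token.created", at: "2024-05-01T09:00:05Z", user_id: 42, token_id: "tok-1",
    expires_at: "2024-05-01T10:00:05Z" },
  { name: "request.received", at: "2024-05-01T11:30:00Z", user_id: 42, request_id: "req-9",
    path: "/confirm", token_id: "tok-1" },
  { name: "request.received", at: "2024-05-01T11:31:00Z", user_id: 99, request_id: "req-10",
    path: "/login", token_id: nil }
].freeze

# Everything that happened for one user, in time order.
def timeline_for(events, user_id)
  events.select { |e| e[:user_id] == user_id }
        .sort_by { |e| Time.iso8601(e[:at]) }
end

# Requests that presented a token after that token's recorded expiry.
def expired_token_requests(events)
  tokens = events.select { |e| e[:name] == "token.created" }
                 .to_h { |e| [e[:token_id], e] }
  events.select do |e|
    token = tokens[e[:token_id]]
    e[:name] == "request.received" && token &&
      Time.iso8601(e[:at]) > Time.iso8601(token[:expires_at])
  end
end

puts timeline_for(EVENTS, 42).map { |e| "#{e[:at]} #{e[:name]}" }
puts expired_token_requests(EVENTS).map { |e| e[:request_id] }.inspect
```

Because the expiry was captured on the token event itself, the question "did this user arrive after their token expired?" can be answered later, even if the database row has since changed.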
Valentino Stoll [00:36:28]:
No. I think that you you're making some great points of, like, capturing the transactional, like, user information or user's actions. Yes. And the
John Gallagher [00:36:38]:
Same also for other events that are happening in the system. Yep. So there's user did something, computer did something, computer enqueued a background job, performed a job, etcetera, etcetera. So the way I think about it is everything that happens in your app, whether it's initiated by the computer, an external data source, or a user, creates an event. That event, if you don't capture enough data, that is it. The data is lost forever. I'm assuming you're not doing event sourcing and assuming you're not in an event driven system.
John Gallagher [00:37:13]:
So the way I think about it at the most core fundamental level is, whether it's structured logs, traces, metrics, whatever it is, we need a way of capturing those events. And more importantly, ideally, we need to link the events together, and this is really, really, really important. So let's say somebody hits our app and it creates the token. Well, there's 2 parts to that. They hit the app, there was a request to our app, and then in the call stack somewhere, the token is created. Those two things are 2 separate events, but they're nested. We want to capture that causal relationship. One calls the other.
John Gallagher [00:37:54]:
One is a subset of the other. One is a parent, one a child, however you wanna put it. Without that causal link, we're lost again. We don't know what's caused what. So there are some, like, 3 or 4 ideas here. Number 1, events. Number 2, contextual data with each of those events. And number 3, nested events, if you like, causal relationships between events.
John Gallagher [00:38:23]:
And with those three things, you can debug any problem that that you would like is my is my claim. And so if you just keep that model in mind, let's examine traces, logs, and metrics and see where they fall short, see which one meets those criteria. So tracing gives us all 3. You can so for those of you I I should explain what tracing is because I was confused about what tracing even was for absolutely years. So tracing allows you to when somebody hits your app, a trace is started. So there are 2 concepts in tracing. There's traces and there are spans, and then there's the data associated with spans. But let's just leave that to one side.
John Gallagher [00:39:10]:
So when somebody hits your app with a request, a trace is started. And so the trace would be like, okay. I've I've started. Here I am. You can append any data that you want to me whilst I'm open. It's like opening the cupboard door and then you keep putting stuff in the cupboard, and then once the cupboard door's closed, you can't put any more stuff in it. Very simple analogy. So we open the door, we start the trace, and so it it goes down to the controller level.
John Gallagher [00:39:39]:
The controller says, oh, I'm going to glom some data onto whatever the existing trace is, about the method, the post body, the request, blah blah blah blah blah, headers, whatever it is. I'm gonna glom that onto the current trace. And then we get down into maybe you've got a service object. I know some people hate them. I love them. Blah blah blah. Whatever. That's not what the podcast is about, John.
John Gallagher [00:40:03]:
So you get into a service object, and the service object says, oh, whatever is in the current trace, I want you to know you hit me, and you hit me with these arguments. Cool. I'm gonna append that to the trace as well. And then we enqueue a background job. That event gets added onto the trace. And then even more excitingly, there's a setting in OpenTelemetry where, when the job is picked up and performed, the trace is kept open, and there's a whole load of debate about whether this is a good idea or not, but you can do it. You can keep the trace open until that job is started. And so the job says, ah, I've kicked off now.
John Gallagher [00:40:41]:
It gloms a whole load more stuff on. Maybe you make an API request in the job. It gloms a whole load more stuff onto the trace. And then it comes all the way back up the stack, and you have this trace with all this nested context. And when it's saying I'm gonna glom this data onto the trace, that's called a span, and a span is nested. So you can have spans nested inside spans inside spans. So, essentially, it's this big tree structure, and you might have seen this before. It's the flame graph that you get in Datadog and New Relic and all these kind of things.
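The "open the cupboard, put things in, close it" model can be illustrated with a toy tracer. This is deliberately not the OpenTelemetry API; `ToyTracer`, `in_span`, and `print_tree` are hypothetical names, and the sketch keeps only the two ideas from the description: spans carry data, and spans nest into a tree.

```ruby
# Toy tracer: a stack of open spans that builds a tree as code runs.
class ToyTracer
  Span = Struct.new(:name, :attributes, :children, :parent)

  attr_reader :root

  def initialize
    @current = nil
    @root = nil
  end

  # Open a span, run the block inside it, then close it again.
  # Spans opened while the block runs become children of this span.
  def in_span(name, **attributes)
    span = Span.new(name, attributes, [], @current)
    if @current
      @current.children << span
    else
      @root = span
    end
    @current = span
    yield span
  ensure
    @current = span.parent if span
  end
end

# Render the span tree as indented lines.
def print_tree(span, depth = 0, out = [])
  out << "#{'  ' * depth}#{span.name} #{span.attributes}"
  span.children.each { |child| print_tree(child, depth + 1, out) }
  out
end

tracer = ToyTracer.new
tracer.in_span("POST /tokens", status: 201) do
  tracer.in_span("CreateToken.call", user_id: 42) do |span|
    span.attributes[:token_id] = "tok-1" # data added while the span is open
    tracer.in_span("enqueue SendTokenEmail", job_id: "job-1") {}
  end
end
puts print_tree(tracer.root)
```

The indented output is the same tree that tracing tools draw as a flame graph, with the parent and child relationships preserved.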
John Gallagher [00:41:13]:
And everybody looks at these things and thinks they're really pretty, and they are. Indeed they are. So that is the that's the pinnacle of observability in my head. Traces give it us all. And we can say, as you can do in any of these observability tools that support tracing, you can do some really cool stuff. Show me all the requests that were at 200 that enqueued a job where the job lasted for more than 3 seconds. Holy cow. Now we're cooking with gas.
John Gallagher [00:41:42]:
We've got everything that we need. Show me all the spans that indicated anything to do with the background job where it was a 500 response, but the user was logged in, and, and, and. And so we can start to not only query the spans, but query the parents of the spans. So you've got all these nested causal relationships, and it gets ridiculously powerful. So that's traces. Cool. Let's look at logs. What do logs give us? Well, they give us events. That's all logs are really.
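A query such as "requests that were a 200 and enqueued a job that ran for more than 3 seconds" only works because each span knows its parent. The sketch below shows that join over a handful of made-up span records; the `SPANS` data, field names, and `requests_with_slow_jobs` are hypothetical, and real tools expose this through their own query languages.

```ruby
# Made-up flat span records; parent_id links a job span to the request that caused it.
SPANS = [
  { id: 1, parent_id: nil, name: "GET /dashboard", status: 200, user_id: 42 },
  { id: 2, parent_id: 1, name: "job.perform", job_class: "RebuildReport", duration_s: 4.2 },
  { id: 3, parent_id: nil, name: "GET /health", status: 200 },
  { id: 4, parent_id: nil, name: "POST /export", status: 500, user_id: 7 },
  { id: 5, parent_id: 4, name: "job.perform", job_class: "ExportCsv", duration_s: 0.3 }
].freeze

# Requests with the given status that have a child job span slower than min_seconds.
def requests_with_slow_jobs(spans, status:, min_seconds:)
  by_id = spans.to_h { |span| [span[:id], span] }
  spans.select { |span| span[:name] == "job.perform" && span[:duration_s] > min_seconds }
       .map { |job| by_id[job[:parent_id]] } # hop from the child span to its parent
       .compact
       .select { |request| request[:status] == status }
       .uniq
end

puts requests_with_slow_jobs(SPANS, status: 200, min_seconds: 3).map { |s| s[:name] }.inspect
```

With flat logs the `parent_id` hop has to be rebuilt by hand from request IDs and job IDs, which is the limitation of logs discussed next.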
John Gallagher [00:42:14]:
It's a series of events that happened. Does it give us the ability to nest events inside one another? Nope. Sorry. Your luck's out. You can you can log causation IDs and you can link them together, and obviously, you can log request IDs and filter everything by the request ID, but there's no concept in the log of this log is nested inside this other log. So that information, poof, goodbye, is gone. Don't have it. But you have the rich data in every event.
John Gallagher [00:42:48]:
Let's look at metrics. What does metrics give you? Doesn't give you the events. Doesn't give you the nesting, and it just gives you some aggregated numbers. So I don't think of them as 3 pillars. They're 3 rungs of a ladder. The very top rung is tracing. Awesome. The next rung down is logs.
John Gallagher [00:43:11]:
Pretty good. And metrics are useless. Now when I say metrics are useless, people get upset with me and say, oh, well, I look at metrics all the time to understand my app. Yeah. Okay. But if you derive metrics from higher rungs, that's totally cool, totally fine. But what's a really bad idea is to directly say, I'm going to send this metric right now to my back end, and people do this all the time. People think this is a good idea.
John Gallagher [00:43:40]:
It's okay. I mean, it's better than nothing. Right? It just depends on the fidelity of information you want. But the problem is, there's 2 problems actually, but the main one is you've sent that data. Okay? You've sent it to Prometheus, Datadog, whatever. You sent that one data point. So then you look in the metrics and you say, holy cow. We're getting all these 500s.
John Gallagher [00:44:01]:
Why is that? I'll sit here and wait as long as you want. You're not gonna be able to tell me the answer to the question unless it's blindingly obvious, unless you can say, oh, well, this this other bit of data over here is, like, correlates with it time wise and maybe. It might be that. Yeah. Okay. It might be that. How do you know it's that? Well, we're we're having to guess. Guessing is not a strategy.
John Gallagher [00:44:23]:
Hope is not a strategy. I don't really want to debug by just flipping guessing. I want to know, and the only way of knowing is having traces. So the way I like to think of it is tracing is the pinnacle. Logs can be derived from traces, which is why it's the 3 rungs of a ladder, and everything can be derived as a metric from the 2 rungs above. So if you've got only logs, you don't have any nested context, but you can get metrics from logs. Fine. If you just have metrics, I would say you're not in great shape because you can't understand why without pure guessing.
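The "derive metrics from a higher rung" idea can be shown in a few lines: count statuses from structured log lines, then go back to the same lines to ask where the 500s came from. The log lines and field names below are made up for illustration.

```ruby
require "json"

# Made-up structured log lines, one JSON object per line.
LOG_LINES = [
  '{"event":"request.completed","path":"/tokens","status":201,"duration_ms":35}',
  '{"event":"request.completed","path":"/login","status":500,"duration_ms":120}',
  '{"event":"job.started","job_class":"SendTokenEmail"}',
  '{"event":"request.completed","path":"/login","status":500,"duration_ms":98}',
  '{"event":"request.completed","path":"/health","status":200,"duration_ms":2}'
].freeze

# Parse the lines and keep only request events.
def request_events(lines)
  lines.map { |line| JSON.parse(line) }
       .select { |event| event["event"] == "request.completed" }
end

# The metric: how many requests ended with each status code.
def status_counts(lines)
  request_events(lines).map { |event| event["status"] }.tally
end

# The follow-up question a bare metric cannot answer: which paths produced the 500s?
def error_paths(lines)
  request_events(lines).select { |event| event["status"] >= 500 }
                       .map { |event| event["path"] }
                       .tally
end

puts status_counts(LOG_LINES).inspect
puts error_paths(LOG_LINES).inspect
```

Going from events to a count is a one-liner; going from a stored count back to the events is not possible, which is the asymmetry the ladder describes.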
John Gallagher [00:45:00]:
And it amazes me how many people push back on this idea and think just having some metrics is enough. It's nowhere near enough. Not in my experience. If somebody wants to refute me and come on this podcast or have a chat with me after, I would love to listen to how metrics allow you to debug, like, very, very deliberately and get the exact data that you need. You can send off dimensions to metrics, and then your metrics bill explodes within about 5 seconds, especially if it's high cardinality data like IP addresses. I've made that mistake before. We're gonna send a dimension of IP with our metrics so that we can understand what's going on. In a week, usually less than a week, my manager messages me saying, can you turn that off?
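The reason a high cardinality dimension blows up a metrics bill is simple multiplication: each distinct combination of dimension values can become its own time series. A rough sketch with made-up numbers (the counts below are illustrative, not taken from any real system or pricing model):

```ruby
# Upper bound on distinct time series for one metric:
# the product of the number of distinct values of each dimension.
def series_count(dimensions)
  dimensions.values.reduce(1, :*)
end

low_cardinality = { status: 5, endpoint: 40 }          # hypothetical counts
with_ip = low_cardinality.merge(ip_address: 50_000)    # hypothetical count

puts series_count(low_cardinality) # 5 * 40
puts series_count(with_ip)         # 5 * 40 * 50_000
```

Adding the one IP dimension multiplies the worst case from 200 series to 10,000,000, which is why the same attribute is usually better kept on log events or spans than on a metric.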
John Gallagher [00:45:43]:
We just got a Datadog bill of, like, $5. Whoopsies. I guess
Valentino Stoll [00:45:50]:
I do have, like, maybe some specific instances where metrics alone can help, like, identify things. And that's more where the granular metrics are the things that you actually care about. Right? Like, let's say, for example, back to the Sidekiq background jobs example, if you notice your queues piling up and you happen to have your dashboard of metrics just looking at queue size and looking at throughput, you can easily say, oh, there's something blocking it, and it gives you kind of a point of where to look in this specific instance. Or, as an example, you can notice there's a leak in memory by monitoring the memory consumption of the app, and just looking at the metrics for that and getting an alert and saying, why is the memory not stopping growing after a certain amount of time? I mean, these are, like, you know, very specific examples that I'm giving. But, like, I agree. Like, it's not gonna tell you, like, back to your token expiration, are people having a problem with our application that we've made? Like, you know, we keep getting these client emails coming in like, oh, I can't, like, sign in to your app. Like, what's happening? You know? You can't just, like, take that and be like, oh, yeah.
Valentino Stoll [00:47:25]:
It's obviously the token's, like, expiration. Right? Like, it's your customer's emails aren't gonna, like, translate directly to that, and you're not gonna know right away, without having your tracing in in in place.
John Gallagher [00:47:39]:
So so a few a few
Valentino Stoll [00:47:41]:
things there.
John Gallagher [00:47:42]:
Number 1, you bring up a really good exception I'd forgotten to mention, conveniently. If it's infrastructure stuff, if it's like memory, hard disk space, all that kind of stuff, fair game for metrics, fine. Yeah. The second thing is I I'm quite hyperbolic, so I'm quite an extreme person. So when I say they're useless, I don't mean literally they're completely useless. I think of metrics as a hint. Hey, there's something going on over here. Cool.
John Gallagher [00:48:09]:
That's that's not useless. Obviously, it's useful. But then the next question is why? And if you've got a super simple system, then it's probably like the 3 things, and you go, well, there's only 3 jobs in the system, so cool. And maybe you've segregated your metrics by background jobs, which is fair. You know, it gives you a place to look. It gives you a starting point. But I've yeah. Yeah.
John Gallagher [00:48:35]:
They're they're useful in the aggregate, and they're useful at giving you a hint. But and, yes, they're useful in in terms of, like, making sure the infrastructure's still running. But I see a lot of people depending on them. And I you know, there's a guy I really respect, Used to work with him, called Lewis Jones. And him and I have gone back and forth on this over over LinkedIn, and he is convinced I'm wrong about this. He's like, we run everything through metrics. Metrics are awesome. You're just on cloud 9 if you think you can trace everything.
John Gallagher [00:49:07]:
And there's also a significant weakness with tracing as well, which is you can't trace everything. If you've got relatively low throughput or even medium throughput, you can make it work. If you trace every single request and you're doing millions of requests a day, I dread to think what your bill is going to be. So that's where head sampling and tail sampling come into it, and we can get into that if you would like.
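For readers who have not met the two terms: head sampling decides whether to keep a trace at the moment it starts, while tail sampling decides after the trace has finished, when its outcome is known. The sketch below shows one common way to do each; it is not any vendor's or OpenTelemetry's implementation, and `head_sampled?`, `tail_sampled?`, and the thresholds are hypothetical.

```ruby
require "digest"

# Head sampling: decide up front, from the trace ID alone.
# Hashing the ID makes the decision deterministic, so every service
# that sees the same trace ID reaches the same keep/drop answer.
def head_sampled?(trace_id, rate)
  bucket = Digest::SHA256.hexdigest(trace_id)[0, 8].to_i(16) # 0 .. 2**32 - 1
  bucket < rate * 2**32
end

# Tail sampling: decide once the trace is complete.
# Here, keep every error and every slow trace, and drop the rest.
def tail_sampled?(trace, slow_ms: 1000)
  trace[:status] >= 500 || trace[:duration_ms] >= slow_ms
end

trace_ids = (1..1000).map { |i| "trace-#{i}" }
kept = trace_ids.count { |id| head_sampled?(id, 0.1) }
puts "head sampling kept #{kept} of #{trace_ids.size} traces"
puts tail_sampled?({ status: 500, duration_ms: 20 })
puts tail_sampled?({ status: 200, duration_ms: 20 })
```

Head sampling is cheap but blind, so it can throw away the one failing request; tail sampling keeps the interesting traces but needs every span buffered until the trace ends.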
Valentino Stoll [00:49:35]:
I mean, I would love to dig more into tracing in general and maybe more the distributed aspect of it because I think what you're talking about is very important. Like, you know, if we're just talking about tracing through, like, a single request in a Rails app, it's not as useful as maybe where tracing really comes into play, which is where there's multiple things that start happening. Like, once you start having more than one application and, you know, the data starts trickling from one application to the other, even in the Sidekiq example. Right? If you're throwing stuff into the background, how does that data snapshot transition through the background jobs? Especially if you have ones that start depending on each other. How do you then manage the queue, like, making sure that you know where it started and, you know, where it's going? Because sometimes you can catch a problem before it starts by having the traces in play and knowing where it's heading. Right? And so I would love to dig into those aspects. Like, what tooling, or maybe we shouldn't talk about tooling specifically, but, like, what aspects of tracing are most important for, like, holistically looking at your system outside of, like, you know, running through your questions? Like, I think at this point, we're beyond, like, having your questions of what you're trying to look at, and you already know what those questions are.
Valentino Stoll [00:50:59]:
And where do you start, like, setting up tracing? Because I know at Doximity, we'd use OpenTracing as, like, an open standard for tracing and observability across, like, platforms, languages, and things like that. Do you find that the industry standards are, like, heading in the right direction? Or, like, where are the pitfalls there? Like, because I know it just introduces a lot of dependencies once you start to adopt a lot of these things.
John Gallagher [00:51:29]:
Totally. So I should say I am singing the praises of tracing, but it's a slightly utopian vision that I'm painting because 90% of the work I've done is with logging. Purely because it's simple to get going. It's more of a known quantity. And that's why, in a lot of my talks, I'm not talking a lot about tracing, and I'm talking about structured logging. Because I think structured logging gives you this kind of event based mindset that you can then start extending to tracing, and the reverse is not true. Like, you can't take that event based kind of mindset into metrics because metrics is just about aggregation. Right? So, but recently, I've been doing a lot of queries in our Rails app and I've been going to... we use New Relic.
John Gallagher [00:52:21]:
Sorry. We use Datadog at work. And I've been going to Datadog's tracing, interface and really trying to answer my questions there instead of in in logging. So, we have both tracing and logging. Our tracing, is hobbled a little bit just purely because of cost reasons, and our logging is not so hobbled. So are the standards heading in the right direction? Yes. But it's going to take a really long time to get there. It's my short answer.
John Gallagher [00:52:58]:
There's a lot of different ways of going about tracing. The most promising, as we all know, is OpenTelemetry. But I mean, I've read some pretty harsh critiques of OpenTelemetry. It's kind of a topic that generally divides people. If you don't know anything about OpenTelemetry, it sounds like an absolute utopia. And I got really excited when I started researching into it. The more you dig into it, the more you realize how much complexity there is to resolve and how many challenges that project faces in order to resolve them. And so, I mean, what it's trying to resolve is 30, maybe 40 years, possibly even more, of legacy software.
John Gallagher [00:53:45]:
Right? Because that's how long logging has been around. And they're trying to aggregate all of that into one single standard. Good luck. It's a very, very difficult problem to solve. And they're doing an incredible job, but it's very, very difficult. So OpenTelemetry is where I'd start with the answer to your question. OpenTelemetry is 100% the future. I've not seen anything that rivals it. And OpenTracing, I believe, came first, and then evolved into OpenTelemetry, in my understanding.
John Gallagher [00:54:17]:
Apologies if I've got that slightly wrong. And so, yeah, I think there are a few options if you're in Ruby, none of which are ideal. So the OpenTelemetry client in Ruby is not ready for prime time. It's quite behind the current standards in OpenTelemetry. It doesn't obey any of the latest semantic standards, for example. I've played around with it in an example project, and when it's working, it's absolutely incredible. It's next-level brilliant.
John Gallagher [00:54:53]:
There are a few problems with it. It's extremely slow, so I tried to use tracing on our test suite at work using this OpenTelemetry tracing, and, I can't remember the numbers, but it really slowed down our test suite to the point where it just wasn't practical to use, because we were trying to measure the performance of the test suite. So, you know, yeah, I could've been doing something stupid there. It's very possible that I just wasn't using it the right way. So sorry, OpenTelemetry folks. I know, I think a lady called Kaylee who is from New Relic, and, I'm so sorry, the names escape me.
John Gallagher [00:55:37]:
But there's a whole bunch of people in the Ruby space who are working really hard on OpenTelemetry, but it's just that the OpenTelemetry project is moving so fast. That's your problem. So that's option number 1, OpenTelemetry. You could maybe fork it and tweak it yourself. The second option, and what we use at work, is, because we're using Datadog, we use Datadog's tracing tool, which is pretty good. But then even with tracing or logging, I feel like we're kind of maybe 20 years behind where everybody else is in programming in terms of observability. Because one of the questions I often have when I look at this stuff and even think about tracing, I maybe have, like, 5, 6, 7 questions that even I can't resolve. Just what do I trace? How much detail do I trace in? How much is this gonna cost me? And we're still in the stone age with a lot of this stuff.
John Gallagher [00:56:35]:
So I don't have any good answers for you in that regard. So we use, the vendor tooling for tracing. I'm sure has its own version of that. In fact, I know they do. I know Sentry does. There are certain other providers that don't have any tracing capabilities at all. So I would say for now, the best option we have is relying on the vendor tracing tools, I would say.
Valentino Stoll [00:57:02]:
Yeah. It's funny you mention Datadog. We've had Ivo on before, from Datadog, to talk about a lot of the, like, I think memory profiling. He works on a lot of, like, granular Ruby performance tooling. Really interesting stuff. Yeah. I mean, I would love to see maybe some more, I don't know, higher-level examples of making use of OpenTelemetry in the Ruby space in general. Because I think that level, I mean, especially with all of the Solid Queue, or Solid trifecta or whatever, stuff that's coming around, it would be nice to see something like tracing specifically introduced to Rails.
Valentino Stoll [00:57:47]:
That would make, you know, more sense in that ecosystem because, yeah, I mean, where do you start profiling stuff is kind of like an intro to tracing. Yeah. So, like, if you wanted to see the request, it reminds me of, what is it, the Rack Mini Profiler tool. Right? Where you can just see a little tiny tab that says, oh, it took this number of seconds to load this particular page you wanted to get. And you can click on it and expand and see, oh, well, what did your application do at each step of the way, and see how long each thing took. Right? And I think of that as, like, a trace a lot of the time. Right? And it's very useful, even when you're just starting out, to see that.
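The "what did the application do at each step, and how long did each thing take" view can be sketched in plain Ruby. This is a toy illustration of the idea of nested, timed spans, not how Rack Mini Profiler or any tracing library is implemented; the `ToyTrace` class and the step names are hypothetical.

```ruby
# A toy tracer: records nested, timed steps ("spans") for one request.
class ToyTrace
  Span = Struct.new(:name, :depth, :duration_ms)

  attr_reader :spans

  def initialize
    @spans = []
    @depth = 0
  end

  # Time the given block and record it as a span nested under the current one.
  # Returns whatever the block returns.
  def span(name)
    span = Span.new(name, @depth, nil)
    @spans << span
    @depth += 1
    started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
    yield
  ensure
    @depth -= 1
    elapsed = Process.clock_gettime(Process::CLOCK_MONOTONIC) - started
    span.duration_ms = (elapsed * 1000).round(2)
  end

  # Render the spans as an indented tree, one line per step.
  def report
    @spans.map { |s| "#{'  ' * s.depth}#{s.name}: #{s.duration_ms}ms" }.join("\n")
  end
end

trace = ToyTrace.new
trace.span("GET /accounts/1") do
  trace.span("load account") { sleep 0.01 }
  trace.span("render view") { sleep 0.005 }
end
puts trace.report
```

The indented report is the same shape as the expandable tab described above: a parent step for the request, with each child step and its duration underneath.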
Valentino Stoll [00:58:32]:
Right? And it helps you visualize it. And so I feel like maybe that's what's missing: a lot of the visualization aspects of all this tracing stuff. Is there something that you look at or find useful when you're starting to dig into, like, structuring the traces and things like that?
John Gallagher [00:58:52]:
Definitely. That's leading me up to one of my big kind of rants, passions, whatever, within the observability space. And I don't see anybody talking about this. I feel like it's either I'm onto a really great idea or it's an unbelievably idiotic idea for some reason I don't know about. It's usually the latter, as a spoiler. Okay. So when I'm looking at traces, there's almost never enough information. Almost never enough information.
John Gallagher [00:59:28]:
And this is why Charity Majors and the team at Honeycomb and Liz Fong-Jones always talk about having wide, context-aware events. That's their mantra. Wide, context-aware events. And events, we've already talked about. Context, we've already talked about. We haven't talked too much about the wide. So wide means lots of attributes. So their take on it is, add as many attributes as you can to every event and make them high-cardinality attributes.
John Gallagher [01:00:03]:
What does that mean? It took me about 3 months to wrap my head around what high cardinality means. It means anything ending in an ID. There you go. That's an easy explanation. So a request ID. Anything that looks, oops. Sorry. That was me in my microphone.
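A small sketch of what a wide event with high-cardinality attributes might look like, in plain Ruby. The helper names and attribute names are assumptions for illustration. Strictly, an attribute is high cardinality when it takes many distinct values across events, which is why "anything ending in an ID" is a handy rule of thumb.

```ruby
require "json"
require "securerandom"

# Build one "wide" event: a name plus as many attributes as are available.
def wide_event(name, **attributes)
  { event: name, timestamp: Time.now.utc.to_f }.merge(attributes)
end

# Cardinality of an attribute: how many distinct values it takes across events.
def cardinality(events, key)
  events.map { |e| e[key] }.uniq.size
end

events = 5.times.map do |i|
  wide_event(
    "token.created",
    request_id: SecureRandom.uuid, # unique per request: high cardinality
    user_id: 1000 + i,             # one per user: high cardinality
    account_id: 200 + (i % 2),
    environment: "production"      # same on every event: low cardinality
  )
end

puts JSON.generate(events.first)
puts "request_id cardinality: #{cardinality(events, :request_id)}"
puts "environment cardinality: #{cardinality(events, :environment)}"
```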
John Gallagher [01:00:20]:
Anything that looks like, anything that is a unique identifier for anything. So that's user ID, request ID, but also anything that is a domain object, and this is the real missed opportunity I think that we have in the Rails community and in the observability community, potentially, in general. When something goes wrong, or even when something goes right, let's take the token as an example. When that token is created, the token is a domain object. Now, okay, it's to do with authentication, so it's not really a domain object in a way. But let's say that customer is signing up for an account. The account definitely is a domain object. And if you want to understand what I mean by domain object, I just mean an object that belongs to the domain, the business domain in which you're operating.
John Gallagher [01:01:19]:
It's a business object, a domain object, call it what you will. But when the CTO or, even better, the CEO or somebody in marketing talks about this customer account, they talk about people creating accounts. They use that word account. That's your first clue that it's a really important concept in the domain. So that's what I mean when I say domain objects. I mean words that non-technical people use to describe your app. Right? So they're domain objects. Why are we not adding every relevant domain object to every event? We don't do it.
John Gallagher [01:01:56]:
And so what you'll see is people do this kind of half-hearted, oh, well, we'll add the ID to the current span or the current trace or even the current log. We'll add the ID and that's okay. That'll be enough. But you're not capturing the state of the object. Why not just take the object, in this case the account, convert it into a hash, and attach it to the event? Why can't we do that? Now there's a number of reasons why we actually can't do that in some cases. If you're billed in terms of the size of your event, so if you're billed on data, obviously that's going to get expensive fast. But if you're billed on pure events, as in your observability provider, your observability tooling, is saying for every x number of events or x number of logs per month, we will charge you this much but the size doesn't matter, then this is a perfect use case to be taking those rich domain objects, converting them into a structured format, and dumping them in the log or the trace.
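To make the trade-off concrete, here is a rough sketch comparing an ID-only event with one that carries the object's full state. `AccountRow` and its fields are hypothetical stand-ins for a real model, and the byte count is only a proxy for what a provider that bills on data volume would see.

```ruby
require "json"

# Hypothetical domain object; a real app would use its own model here.
AccountRow = Struct.new(:id, :status, :plan, :email, :created_at, keyword_init: true)

# The "half-hearted" version: only the ID is attached.
def id_only_event(account)
  { event: "account.updated", account_id: account.id }
end

# The full-state version: the whole object is converted to a hash and attached.
def full_state_event(account)
  { event: "account.updated", account: account.to_h }
end

# Serialized size in bytes, a rough proxy for cost when billed on data volume.
def payload_bytes(event)
  JSON.generate(event).bytesize
end

account = AccountRow.new(
  id: 42, status: "pending", plan: "pro",
  email: "user@example.com", created_at: "2024-01-01T00:00:00Z"
)

puts "id only:    #{payload_bytes(id_only_event(account))} bytes"
puts "full state: #{payload_bytes(full_state_event(account))} bytes"
```

Both calls produce exactly one event, so under per-event billing the richer payload costs the same; under per-byte billing it does not.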
John Gallagher [01:03:06]:
And so I've kind of thought about this quite a lot and I've come up with a few quite simple ideas that people can use starting tomorrow in their Rails apps. Not without their problems, but the first of which is, I don't know if anybody's worked with to_formatted_s. So, to_formatted_s for date time strings. And we have this idea in Ruby, don't we, of duck typing. We have an object, and really good OO designers say that you shouldn't understand anything about that object. You just know it's got 4 methods on it, and it can be an account. It can be an invoice. It can be many different things.
John Gallagher [01:03:49]:
So my approach, and I'm testing this approach out at work at the moment, is instead of having to_formatted_s, have to_formatted_h. What does that mean? It means you're going to format the domain object as a hash. And so to_formatted_s allows you to pass in a symbol to define the kind of format that you want. So it could be short, ordinal, long, humanized, and it will output a string. It'll output a stringified version of that date in these different formats. So my idea is, why can't we have a method on every single domain object in our Rails app called to_formatted_h, and you pass it a format. The format could then be OpenTelemetry. It could be any one of a number: short, compact.
John Gallagher [01:04:42]:
And so for every trace, the way I like to think of it is, I want to add into that trace every object that's related to it, and you could format those in OpenTelemetry format, for example, or you could have a full format or a long format, whatever you want. And so that way, you can say, oh, I just want a representation of the account that is short, and it's just got the ID. And that's a totally minimal skeleton and that's enough for me. But actually here, the work I'm doing is a bit more involved. So I want to call to_formatted_h with full, and that will give the full account, like the updated at, the created at, everything about it, and then that will be sent to my logs and traces. And I now have a standardized way of observing what's going on with all the rich data of my app state at that point, with all the relevant domain objects in it. So that's my dream that I'm headed towards with this gem. So that's kind of the way I think about structuring it.
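One way the `to_formatted_h` idea could look in plain Ruby, as a sketch only. John's gem is not shown in the episode, so the `FormattedHash` mixin, the `hash_format` macro, and the `:short` and `:full` format names are all assumptions. The method mirrors the way Active Support's `to_formatted_s` takes a format symbol, but returns a hash instead of a string.

```ruby
# Hypothetical mixin: each class declares named formats, and to_formatted_h
# returns the attributes that belong to the requested format.
module FormattedHash
  def self.included(base)
    base.extend(ClassMethods)
  end

  module ClassMethods
    # Declare which attributes a named format exposes.
    def hash_format(name, *attributes)
      hash_formats[name] = attributes
    end

    def hash_formats
      @hash_formats ||= {}
    end
  end

  # Pick a named format and get a hash back, analogous to to_formatted_s.
  def to_formatted_h(format = :short)
    attributes = self.class.hash_formats.fetch(format) do
      raise ArgumentError, "unknown format: #{format.inspect}"
    end
    attributes.to_h { |attr| [attr, public_send(attr)] }
  end
end

# Hypothetical domain object standing in for a Rails model.
class Account
  include FormattedHash

  attr_reader :id, :status, :created_at, :updated_at

  hash_format :short, :id
  hash_format :full, :id, :status, :created_at, :updated_at

  def initialize(id:, status:, created_at:, updated_at:)
    @id = id
    @status = status
    @created_at = created_at
    @updated_at = updated_at
  end
end

account = Account.new(id: 42, status: "pending",
                      created_at: "2024-01-01", updated_at: "2024-02-01")
p account.to_formatted_h        # minimal skeleton, just the ID
p account.to_formatted_h(:full) # everything declared for the :full format
```

Because any object that responds to `to_formatted_h` can be attached to an event, the logging or tracing code does not need to know whether it was handed an account or an invoice, which is the duck-typing point made above.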
John Gallagher [01:05:48]:
And I see people doing all this ad hoc kind of, well, this is an ID and then we'll call the job ID job underscore ID, I suppose. Well, what's the account? We can call that accounts underscore ID. And I just like to think of it as: imagine your domain object. So an account has a customer, a customer has some bank details. Bank details is a bad idea, but address maybe. And so we could have these different formats that load nested relationships or not. And, obviously, you've gotta be careful about the performance problems with that. And so you'll have the exact structure of your domain object in your logs, in your traces.
John Gallagher [01:06:31]:
That for me is a dream. And then every single time an account is logged, it's in the same structure. Awesome. So I know that an account is always gonna have an ID. It's always gonna have whatever other attributes. It can have a pending status, whatever it is. And so, therefore, I can say, show me every trace where the account was pending. Boom.
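When every event nests the account in the same structure, "show me every trace where the account was pending" becomes a simple lookup by path. The sketch below runs that query over an in-memory array; the event shapes and the `events_where` helper are hypothetical, and a real observability tool would run the equivalent query in its own query language.

```ruby
# Select events whose value at the given nested path equals the given value.
# Events that lack the path are skipped, because Hash#dig returns nil for them.
def events_where(events, path, value)
  events.select { |event| event.dig(*path) == value }
end

# Hypothetical events, each nesting the account in the same structure.
events = [
  { event: "invoice.sent",
    account: { id: 1, status: "pending", customer: { id: 10, address: { city: "Leeds" } } } },
  { event: "invoice.paid",
    account: { id: 2, status: "active", customer: { id: 11, address: { city: "York" } } } },
  { event: "healthcheck" } # no account attached
]

pending = events_where(events, [:account, :status], "pending")
puts pending.map { |e| e[:event] }.inspect

in_york = events_where(events, [:account, :customer, :address, :city], "York")
puts in_york.map { |e| e.dig(:account, :id) }.inspect
```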
Valentino Stoll [01:06:55]:
Yeah. I love that idea. And it reminds me a little of the introduction of the, you know, the new Rails logger where you could tag, you know, the tagged logger was kind of like a start to this idea of, okay, capture all of these pieces with this tag. And it's almost a pseudo trace, I call it. But it does go along with that formatting aspect of, like, okay, format all the things like this in a specific way. And I agree that there's definitely a lot to unwind there. We'll have to have you on more, you know, when you put this together as a gem or something, because I would love to dig into that.
Valentino Stoll [01:07:45]:
Cool. Yeah. I mean, I love the idea of, like, the domain objects, and extracting those out into a formattable way that you can then trace and follow through, because that design decision is definitely missed a lot. And seeing things like Packwerk as an example was a great step in the right direction, I thought, and I'd like to see more of that kind of evolve in the Rails ecosystem, of abstracting the domains into their own kind of segments and then being able to format them for traceability and things like that. I think you're on to a lot here.
John Gallagher [01:08:24]:
And then, I mean, the thing that I think is unbelievably ironic is all I'm talking about is convention over configuration. And is that not why we all got into Rails? I know Ruby is a different thing, but Rails is all about convention over configuration. And the entire area of observability, it strikes me, could do with a massive dollop of convention over configuration, and that's what OpenTelemetry are trying to do. The one last thing, and I know that time is getting on, but one last thing I want to just say on that is, the other huge opportunity is adding context to errors. So we have these exception objects in Ruby, and people store strings with them. And it's like, what? How am I supposed to understand anything from a string? And then people try and put IDs in the strings and, I know. So at work, I've made this extremely simple, basically a subclass of StandardError where you can attach context. So when you create the error, you pass in structured context.
John Gallagher [01:09:27]:
So if our logs are structured, surely our errors should be structured as well. Makes sense. Right? So you can say, this error happened and here was the account associated with it when that error happened, and here's a user, and here's this. So it gets attached to the error, and then using Rails' new error handling, Rails.error.handle, if you've not used it before, look it up. It's absolutely awesome. It's one of my favorite things that they've added to Rails recently, relatively recently, in the last few years. And you can basically have listeners to these events, to these errors, beg your pardon. It will catch the errors, and then the context is encapsulated in the error. So you can pass these errors around, and then you can do interesting stuff with that context.
John Gallagher [01:10:16]:
And all I do is pull out all the context and send it straight into the logs. And that has absolutely changed the way I debug, because whenever there's an error and it has all that rich data, you just look in the rich data and you're like, oh, that was the account. That was the Shopify ID. That was a product ID. I've got it. And then you just look at the ID and your externals. Oh, right. Okay.
John Gallagher [01:10:38]:
It's out of sync, whatever it is. It makes life so much easier. So that's something I'm really passionate about as well, having domain objects encapsulated within errors. So we've got structured errors, not just structured logs.
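A minimal sketch of a structured error along the lines John describes: a `StandardError` subclass that carries a context hash, plus a helper that turns any error into a structured log payload. The class names, the `error_payload` helper, and the context keys are assumptions, not his actual code. In a Rails app a helper like this could plausibly be called from an error subscriber, but that wiring is left out so the example stays plain Ruby.

```ruby
require "json"

# Hypothetical base class: an error that carries structured context
# alongside its message.
class ContextError < StandardError
  attr_reader :context

  def initialize(message = nil, context: {})
    super(message)
    @context = context
  end
end

# Hypothetical domain-specific error built on the base class.
class ProductSyncError < ContextError; end

# Turn any error into a structured payload; context is included when present.
def error_payload(error)
  payload = { error_class: error.class.name, message: error.message }
  payload[:context] = error.context if error.respond_to?(:context)
  payload
end

begin
  raise ProductSyncError.new(
    "product out of sync",
    context: { account_id: 42, shopify_id: "example-123", product_id: 9 }
  )
rescue ContextError => e
  # Send the structured payload to the logs instead of parsing IDs out of a string.
  puts JSON.generate(error_payload(e))
end
```

The IDs arrive in the log as separate, queryable fields, so there is nothing to extract from the message string.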
Valentino Stoll [01:10:52]:
Yeah. I mean, that's definitely one thing that I look for when I'm looking for, you know, installing dependencies. Right? Like, does the gem have its own, you know, base error class that it then can, you know, give metadata about whatever that it's raising the errors about? Like, more than just, like, a string of some error that then you have to figure out what it is. Like, having that extra metadata that you could just because you can. You could just add attributes to a class. Right? And say, this error has these attributes. Like, it it has, you know, meaning associated with the error. I think more people doing that is definitely gonna be making that easier to do, first of all.
Valentino Stoll [01:11:30]:
But yeah. And then also getting more people to take on that convention. I completely agree with you there. Yeah. I mean, we are getting to time here. Are there any last, you know, pieces you wanted to quickly highlight or mention before we, you know, move into picks?
John Gallagher [01:11:50]:
I think the main thing is if you're listening to this and anything that I'm saying is resonating, forget about the domain object stuff. That's like getting really into the nitty gritty. But coming back to the beginning, if you're frustrated by your debugging experience, if you're thinking, why am I not smart enough to understand this? Chances are the problem is not with you, it's with the tools. So if you improve the tools, not only do you make your life easier and better, you level up everybody around you because all the engineers can use the same tools. And that's what we've experienced at BiggerPockets. And that culture of observability has really worked its way into our culture so that now anybody is equipped to go into the logs and ask any question that they want. So it is a long road, but it all starts with a single step. And so if you are feeling that pain, feel free to reach out to me.
John Gallagher [01:12:47]:
I can go through all my socials in a minute, but feel free to reach out to me. Ask me any questions. I'm happy to jump on a Zoom call for half an hour and help you for free. But basically, it all starts by taking very small steps towards a very specific question. Don't try and add observability because you'll still be here next Christmas. So, take heed. There is hope. And if anything that I say resonates, please feel free to reach out to me, and I'll help you figure it out.
Valentino Stoll [01:13:19]:
That's awesome. Yeah. I also echo that sentiment that, you know, tooling is so important. And, you know, OpenTracing definitely is a great framework. And if we can improve that in the Ruby space, we'll definitely be reaping the rewards as well. So let's move into picks. John, do you have anything that you want to share first, or do you want me to go?
John Gallagher [01:13:53]:
Am I limited to one pick? Because I have many.
Valentino Stoll [01:13:56]:
No. Okay. Go ahead.
John Gallagher [01:13:57]:
Cool. So, the first one is a new language, and I already thoroughly trounced the idea that we should be learning one programming language a year, or rather I just dissed it without actually giving much justification. So I'm going to go back on what I just said and say that this language has changed the way I think pretty much forever, and it's changed the way I see Ruby and Rails and just programming in general. And the language is called Unison. Now it's a very, very strange, unusual language. It's maybe not that readable in places, and it's also extremely new. I mean, it's been going for 5 or 6 years, but what they're trying to do is incredibly ambitious. But look it up.
John Gallagher [01:14:51]:
It's, yeah, it's an incredibly interesting language, and it will expand your mind. That's certainly what it's done for me. And so it's kind of a language that's targeted at creating programs that are just much, much simpler, but actually more difficult to get your head around. It's a completely new paradigm for distributed computing, basically. And it's absolutely fascinating. So I would highly suggest checking that out. I know that Dave Thomas at EuRuKo, when I spoke at EuRuKo recently, he was on the stage and he was championing Unison, and he called it the future of programming, and I could not agree more. It's an incredible language made by some incredibly smart people.
John Gallagher [01:15:38]:
So that's number 1. Number 2, there is a static site builder. I've used pretty much all the static site builders on planet earth, and this is my favorite. It's called 11ty. It's a really odd name. But I'm embarking upon this project at work that really is exciting me, which is how do you serve UI components from a dynamic app, so Rails, and meld them into a static site builder without having a pile of JavaScript that you have to wade through. So I want to offer my UI components in Rails and I want to deliver them extremely fast through a static site that's just a blog, without having to run that blog on Rails. So Eleventy is my go-to tool for doing all that stuff.
John Gallagher [01:16:30]:
It also encompasses this thing called WebC, which is my new favorite templating language. Yes, I know, another templating language. I promise it's really good. It's not another retread of all these other templating languages that are very, very niche and very whatever. So WebC is compatible with web components, and it's a fantastic way of making HTML-like components that are server-side rendered. And I would love to see a plugin for that come to Rails because it is absolutely phenomenal. So those are my 2 favorite things at the moment.
John Gallagher [01:17:10]:
If anybody's trying to, wrestle with UI components in Rails and trying to extract them out of Rails components also, I would love to chat through that with anybody who's interested in that kind of, area because it I think it's yeah. There's a potential to really break new ground. How about you?
Valentino Stoll [01:17:29]:
Yeah. Thanks. I'll definitely be digging into some of those. Yeah. I was in New York City the other day for the Ruby AI happy hour that they've been doing every couple months. This time, they did demos, and I demoed this real-time podcast buddy that I've made. It's called Podcast Buddy. And it just kind of, like, listens in the background and, in real time, keeps track of the topics and the discussions and some example questions worth mentioning or maybe some, you know, topics to transition to.
Valentino Stoll [01:18:07]:
And it's a lot of fun. I just did it for fun, but I recently refactored it to use the async framework. And shout out to Samuel Williams. Just phenomenal, like, so well put together. The documentation is coming along. It is lacking in some areas, but I was able to just completely refactor the code so that it works with async and runs things, you know, as they come in, and it's streaming the Whisper, you know, transcripts. It performs actions in the background, just, like, in the same thread, all managed with async. I just love it.
Valentino Stoll [01:18:45]:
So check out Podcast Buddy and check out async. You can't go wrong. Async WebSocket now. You can handle even WebSockets asynchronously, just, like, completely seamless, HTTP/2 and HTTP/1 compatible. Love it. So, check those out. And, John, if people wanna reach out to you on the web, or just in general, how can they reach you?
John Gallagher [01:19:14]:
Thank you. Yeah. So I'm on LinkedIn. That's a platform I'm most active on, and, my LinkedIn handle is synaptic mishap, which is, yeah. I really regret that. Sorry, everybody. But, yeah, so if you just search for John Gallagher, g a, double l, a g h e r, and maybe rails or observability, you should be able to find me. I've got quite a cheesy photo, black and white photo of me in a suit.
John Gallagher [01:19:44]:
It's a horrible photo. And I blog at joyfulprogramming.com. It's a Substack. So is this still a blog anymore? I have no idea, but that's where I write. I'm on Twitter at synaptic mishap, and my GitHub handle is John Gallagher, all one word. So, yeah, Joyful Programming is the main source of goodies from me. I've also got a fairly minimal YouTube channel called Joyful Programming. So feel free to reach out to me, send me a connection request, ask me any question.
John Gallagher [01:20:18]:
I would love to engage with some Ruby folks about observability. Tell me your problems, and I'll try and help you wherever I can.
Valentino Stoll [01:20:25]:
Awesome. I love it. Keep up the great work and keep, you know, shouting from the mountaintops about observability, pulling those pillars down, and just focusing on the important stuff. Right? I love it. So until next time, everybody. I'm Valentino. Thanks, John, for coming on, and I look forward to next time.
John Gallagher [01:20:49]:
Thanks for having me, Valentino. It's been amazing.
Valentino Stoll [01:20:52]:
Awesome.