Understanding Observability in Elixir with Dave Lucia - EMx 195

Dave Lucia is the CTO of Bitfo, a media company that builds high-quality educational content in the cryptocurrency space. He has been an Elixir developer for about six years and is the author of “Elixir Observability: OpenTelemetry, Lightstep, Honeycomb”. He joins the show to talk about how they built their system and other websites like DeFi Rate and ethereumprice.

Special Guest: Dave Lucia

Show Notes


About this Episode

  • Observability
  • OpenTelemetry
  • OpenTracing
  • Analyzing and making data useful
  • Tools used for tracing and metrics

Transcript


Sascha_Wolf:
Hey everybody and welcome to another episode of Elixir Mix. This week on the panel we have Allen Wyma.
 
Allen_Wyma:
Hello?
 
Sascha_Wolf:
and me, Sascha Wolf, and as usual, we have a special guest, and that is Dave Lucia. And I hope I didn't botch this. So Dave,
 
Dave_Lucia:
Nailed it.
 
Sascha_Wolf:
Why don't you... Nailed it, that's amazing. I'm getting better at this. So Dave, why don't you tell the audience why we invited you and what we are going to talk about today?
 
Dave_Lucia:
Sure. Well, hi again. I'm Dave Lucia. I'm CTO at a small media company called Bitfo. I think you've invited me on to talk about a blog post I wrote called Observability with Elixir and LightStep and Honeycomb. Something like that is the title. So, you know, have a general chat about observability in the Elixir ecosystem and some of the tools you could use. And yeah, a little bit about me. I've been using Elixir for about six years now and love the ecosystem. Really excited about, you know, where we're heading as a community.
 
Sascha_Wolf:
Yeah, exactly. That sounds about right. Something rings a bell here; there was this blog post. Funnily enough, two weeks ago we already had an episode about OpenTelemetry, because we actually got one of the Erlang maintainers on the show, whose name escapes me right now, I'm very sorry. But I do remember from the blog post that OpenTelemetry is also something you're using, and it's also something we are using at my current job. So I'm actually kind of interested to see how the journey is going. So Dave, you said that you're the CTO of a small media company, right? So I would assume that you were involved in the product building from the get-go, right? So what-
 
Dave_Lucia:
Uh, yeah, yes and no. Sorry, finish your question.
 
Sascha_Wolf:
Okay, I wanted to add one little thing and then you can go running. So I would be interested to hear how the development kind of started, and where and when observability entered the equation, and if OpenTelemetry was always the thing you went for at the platform you're now using, or if there's a little story to that, right?
 
Dave_Lucia:
Very good question. Okay, so the answer is: have I been involved since the beginning? Yes and no. No, because when I joined, which was back last April, we already had four websites, and all of them were running on WordPress and PHP and that whole fun ecosystem. So I inherited a lot of legacy code, and it had no observability; we'll get into that in a second. One of my first focuses after I joined was relaunching one of our flagship websites, which is defi-rate.com: moving it to a more modern CMS that focuses on the data, rather than data and presentation all in one, as well as building out a data backend that could support all of the financial cryptocurrency data, which is the domain that we supplement with our editorial content. So as I was building out this relaunch of our flagship website, observability came in at the very beginning. Probably one of the first things I did after deploying the app was getting OpenTelemetry integrated and set up (I actually started with Honeycomb) and getting full observability of the system, so that as I iterated on the project I could observe how queries were impacting app performance, how my Oban jobs were working in the background, all of that. So that's been there since the beginning of my new tech stack. On our legacy tech stack, we actually started out with no observability, and we have incrementally more observability now through something called Pingdom, which is basically a service that will ping your website every few minutes and just check: is it responding? And that's been useful, because some of our legacy websites just randomly go down for no reason.
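For listeners wiring this up themselves: the kind of initial OpenTelemetry setup Dave describes for a Phoenix/Ecto app usually boils down to a few dependencies plus exporter configuration. This is only a rough sketch, not Dave's actual setup; the package versions, the `:my_app` name, and the Honeycomb endpoint and header details are assumptions you should check against current docs.

```elixir
# mix.exs (assumed packages; check hex.pm for current versions)
defp deps do
  [
    {:opentelemetry, "~> 1.0"},
    {:opentelemetry_exporter, "~> 1.0"},
    {:opentelemetry_phoenix, "~> 1.0"},
    {:opentelemetry_ecto, "~> 1.0"}
  ]
end

# config/runtime.exs: ship spans to Honeycomb over OTLP
config :opentelemetry, :processors,
  otel_batch_processor: %{
    exporter:
      {:opentelemetry_exporter,
       %{
         endpoints: ["https://api.honeycomb.io:443"],
         headers: [{"x-honeycomb-team", System.fetch_env!("HONEYCOMB_API_KEY")}]
       }}
  }

# application.ex: attach framework-level instrumentation on boot
def start(_type, _args) do
  OpentelemetryPhoenix.setup()
  OpentelemetryEcto.setup([:my_app, :repo])
  # ...rest of the supervision tree...
end
```

With something like this in place, every Phoenix request and Ecto query shows up as spans without hand-instrumenting each call site.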
 
Sascha_Wolf:
Because PHP. Yeah. I can sympathize with that, to be honest. Like, the first experience I had in the context of backend development was with PHP. And while I think the ecosystem there has come a long way, I still enjoy and prefer working with something like Elixir nowadays. So, yeah.
 
Dave_Lucia:
And you know what, I think this is unfair to PHP, because modern PHP is actually quite good from my perspective. And I'm saying that at like an arm's length, where I don't necessarily want to be writing that PHP, but I respect people who do and are very productive with it.
 
Sascha_Wolf:
Yeah, like I said, they've gone a long way.
Sascha_Wolf:
They've gone a long way. I definitely agree with that. Like, I mean, fun fact, one thing they kind of do the same as Elixir: every request has its own process, right? So there are some similarities there already. Yeah, okay. I'm actually kind of curious, because when you say observability, right, and observability was something from the get-go and something you also incrementally increased, what exactly are you thinking about? Because my impression is that the word in and of itself is starting to get into this buzzword realm, you know? Like, everybody means something else when they say observability. So what do you mean when you say that?
 
Dave_Lucia:
So I subscribe to the Charity Majors, Liz Fong-Jones, Ben Sigelman description of what observability is, which is just understanding your running system. Like, I have software, it runs in production, I test in production, things go wrong in production. I want to have insight into the black box that is my server. And so anything that helps me do that, whether it's a tool, a practice, a philosophy, I put under the bucket of observability. Now we can get into more specifics of, okay, what types of data make up observability: you've got metrics, logs, error reporting, tracing, endpoint monitoring. All these things fall under the bucket of observability. And, you know, I think when people think observability today with OpenTelemetry, they might jump right to tracing, like, if I'm not doing tracing, I don't have observability. That's just not true. There are degrees of observability, of how observable your product is, and there are a lot of different tools and a lot of different types of data that can help you achieve observability.
 
Sascha_Wolf:
Yeah, I think that makes a lot of sense. It's also similar to my definition of observability, in that it's meant to be this outcome, right? Like, understanding what's happening. And I mean, metrics and logs and traces are all tools in that fashion. Some are more powerful, some are less powerful, but they all help to paint a hopefully more complete picture of a running system which you want to understand. Okay, so based on what you just said, and based on what your blog post is about, I would assume that your relaunch is written in Elixir.
 
Dave_Lucia:
That's right.
 
Sascha_Wolf:
So why don't you tell us a little bit about that? Like, how did that come to be, how long has this thing now been running, and is this still something you're tweaking and iterating on, or do you feel you've come to a stable kind of observability plateau?
 
Dave_Lucia:
Um, well, let's hold two different ideas. So one is the observability and the other is like the evolution of
 
Sascha_Wolf:
Mm-hmm.
 
Dave_Lucia:
let's call it the platform that we're building. So maybe it's helpful to give more context on who we are as a company. We're a media company that's focused on building mostly evergreen educational content, but also some news content, in the cryptocurrency space. So we have a website called defirate.com as well as a few others, bitcoinprice.com, ethereumprice.org. And they're primarily there to provide information, as I said, educational content, to people who are searching on Google and want to find an answer to some very specific thing. And then we have long-form content, like several thousand words, on various topics like this. So it's very much an SEO-driven, affiliate-marketing, subscription-based business that focuses on providing this utility and service to people for free. But with that, you know, there are products that we might feature, and if you sign up for them or interact with them, we make money through driving people to those products. So as we think about building the system: when I joined, we started with these WordPress websites, and the first problem that I noticed is, okay, we've got content mixed with presentation mixed with data. And the data that I'm talking about is financial data. So whether it's interest rate information, APRs... sorry, I've got a dog. He's also excited about financial data.
 
Dave_Lucia:
Sorry, one second, he's gonna come over.
 
Sascha_Wolf:
He's helping you with observability.
 
Dave_Lucia:
Yeah, these are my dogs. I've made this joke before, but, you know, I have built-in observability in my house: whenever the mailman is here, I get an alert, and it's very helpful.
 
Sascha_Wolf:
Yeah, probably the oldest system of observability in human history.
 
Dave_Lucia:
That's right. So, okay, we have all this financial data that we're mixing in with our editorial content. And my first thought process is: okay, we've got all of these different websites, these different CMSs. The first thing that I would want to consolidate is the collection and distribution of this financial data. And that was the perfect use case for Elixir. Previously it was basically a bunch of cron jobs running in some PHP framework that was just shoving data into MariaDB, and that works just fine. But, you know, I think we have pretty ambitious goals as a company for the types of offering and pairing of educational content with financial data, and to be able to scale these processes and ensure high quality of data, I wanted to have really good tools for maintaining, collecting, and distributing data. So the tech stack of choice in Elixir is: we have Oban jobs that run on an hourly basis, as well as WebSocket connections, which are either collecting data every hour or streaming data from exchanges. That all gets inserted into a time-series DB; we personally use Postgres and TimescaleDB. TimescaleDB is a Postgres extension that adds time-series functionality onto Postgres, and it's quite nice, and maybe a whole topic in itself. So all of these things are very important to the business. When data collection goes down, that's a problem. And so adding tracing to our collection of data was probably one of the first things I did, to make sure that we could regularly see when our jobs are running; they should be running every hour. If any of those jobs fail, I want to know about it. They pretty much never fail, unless the third party that we're collecting data from is having a maintenance window, which I was surprised to find out happens quite often, and for long periods of time. So a lot of these crypto exchanges will just regularly go down for like three, four, five hours, and I really don't know why or what they're doing.
 
Sascha_Wolf:
Maybe they're not using Elixir. I'm just kidding.
 
Dave_Lucia:
Yeah, I couldn't tell you, but they're going down, they're not available, and I get an alert about it. It's actually sometimes more annoying than anything, because it's like, alright, we know, we get it: every hour I'm getting another alert that exchange XYZ is down for maintenance. So that's where Elixir fits in. As I mentioned, we're a news website; we use Next.js and the JavaScript React ecosystem for that, and our CMS is a proprietary JavaScript CMS called Sanity. So there are quite a few moving pieces, not all of them Elixir, but this is one of those cases where I wanted to pick the right tool for the job. I've built some CMSs at previous companies from the ground up, actually in Elixir, and I didn't think that was the best value we could be bringing to the company, where CMSs are a bit of a commodity these days. So I'll pause there. That was a lot of information, but let's see where we want to dive in.
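The collection pipeline Dave describes, hourly Oban jobs writing into a TimescaleDB hypertable on Postgres, could be sketched roughly like this. The table, module, and column names are illustrative assumptions, not Bitfo's actual schema.

```elixir
# Ecto migration: a plain Postgres table promoted to a Timescale hypertable
defmodule MyApp.Repo.Migrations.CreatePrices do
  use Ecto.Migration

  def up do
    execute "CREATE EXTENSION IF NOT EXISTS timescaledb"

    create table(:prices, primary_key: false) do
      add :time, :utc_datetime_usec, null: false
      add :symbol, :text, null: false
      add :price, :decimal, null: false
    end

    # TimescaleDB partitions the table into time chunks behind the scenes
    execute "SELECT create_hypertable('prices', 'time')"
  end

  def down do
    drop table(:prices)
  end
end

# config.exs: run a collection worker at the top of every hour via Oban's
# cron plugin (MyApp.Workers.CollectRates is a hypothetical worker module)
config :my_app, Oban,
  repo: MyApp.Repo,
  plugins: [
    {Oban.Plugins.Cron, crontab: [{"0 * * * *", MyApp.Workers.CollectRates}]}
  ]
```

One nice property of this design is that Oban persists jobs in the same Postgres instance, so a failed hourly run is visible as data, which pairs well with the tracing and alerting Dave describes.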
 
Sascha_Wolf:
No, I think it makes sense to give that context, and thank you for sharing a bit of that journey. I always think it's super interesting to see how people in the community use Elixir and for what, because from my experience there are a billion reasons why people might choose Elixir, and what you've told us so far makes sense to me. Actually, one little thing you just mentioned stood out to me, also in the context of observability specifically. You said you're scraping data from these exchanges and they regularly go down for maintenance, and then you said this one sentence, like, these alerts are sometimes kind of annoying, right? And I think that's one of the biggest challenges in observability: you have data coming in, and there is some signal in that data, some piece of information which is interesting to you, right? But there's often much more than just that little piece of information, and you have to find it in all the data which is available to you. I like to use this metaphor, and I feel it's also used quite often: noise versus signal. How much signal do you have, and how much noise is there? And I'd be interested to hear what kind of approach you've taken in adding observability to your product. Maybe also to the legacy system, because I fear that's the situation a lot of people are probably in, where you have an existing system and you might want to make it a little bit more observable. But also in maybe taking that as a consideration from the get-go for a new system, and what kind of choices you personally have been making to, well, hopefully get more signal than noise, right?
 
Dave_Lucia:
Really good question. I gave a talk at OLLIConf, I think it was last summer or two summers ago, I can't keep track of time anymore. And one of the things I talked about is that observability is a garden: you need to tend to your garden. Sometimes, you know, you might have added alarms, and those alarms are just not adding value; they're noise, as you called it. So you want to pull those out of your garden and get rid of them. It's also the idea that, as you're building a product, you don't just slap in some libraries and call it done, like, alright, we're observable now, we've got alerts, we've got metrics, we're done. It's something that you need to maintain over time, no matter what, because depending on how your product evolves, the value of things that you added early on or later on may increase or decrease. And so it's kind of never done, and that's why I like the metaphor of tending to a garden, where you always need to be thinking and reevaluating, replanting, moving things around, and tailoring to the needs of your business or your product. So with that being said, you asked about the legacy products, and the way I like to think about approaching observability for legacy products is: I don't want to know the internals of the system. I want to keep those in maintenance mode. Eventually we will replatform them and pull them onto our new tech stack, or maybe we'll just bring them down; who knows what we'll do. But the point is that right now I don't want to invest in those things, but I do want to keep them up, running, and available. And so the most valuable thing to me is just knowing: are the lights on or are they off?
Like, it wouldn't be very valuable for me to know that this one particular function in this PHP app is crashing and causing some issue. I wouldn't even know where to begin to fix it. If that was going on all the time and taking down the system, I might care more. But I really just care about: are the lights on? And so for me, that's when I was like, okay, I just really care about whether a user can hit the site and get a page back. And that's where I turned to Pingdom, as I mentioned before. So really all I'm getting is that binary: is it on, or is it off? Am I getting a 200 response? I am getting a little bit more information with, you know, the response times; spoiler alert, they're slow. You can add more things, like synthetic monitoring, where you have actual user journeys get reported and,
 
Sascha_Wolf:
Mm-hmm.
 
Dave_Lucia:
you know, see how users might be experiencing the page. Maybe I could add Sentry and get some error reporting on the page. But again, to me those are not high-value items right now, because the site's in maintenance mode; I just want to keep it up. If we decide we want to invest more in it, that's when I'd start thinking about how I can get deeper into the system and have a stronger understanding. But for now, I want to keep it at arm's length.
 
Sascha_Wolf:
Allen, I just saw you unmute for a second. Do you have a question?
 
Allen_Wyma:
Uh, not quite. I'm just kind of taking in everything and thinking about your lights-on-or-off approach. I mean, that's one thing to think about, but going back to your analogy: yeah, okay, the lights are on, but if you plug in one more thing, will you trip a circuit breaker, right? That's always something that's in my head. And I'm thinking about, like, when you do come to an existing legacy project that you're obviously working with, I'm guessing that you start layering these things on just like that, right? Is it on or off; like I said, is it going to overload the circuit; and then start adding in small things here and there. Is that basically how you're working right now, I'm guessing?
 
Dave_Lucia:
Right. And I think an important point to be made is that I'm not changing the system. The system exists as it is. There is content being written, so it's being interacted with; it's also serving content. So there is risk there: the risk is non-zero that things will break, and things will break in new and unexpected ways. One of these new and unexpected ways, and probably the thing I hate the most about WordPress, is that any person who has admin access to the system can just go ahead and install and update plugins. And plugins are a vector for security vulnerabilities, bugs, taking down the entire system, wiping your database, running the system out of memory, any bad scenario that you could think of. Which is one of the main reasons why I think WordPress is not a very good choice, as popular as it is. Being on the tech side of the house, I want to make sure that our systems are running and reliable and available, and I think the fundamental design choices of WordPress are antithetical to those goals. So, you know, while I think about these systems, I'm not changing them actively, and that's why I only take this lights-on, lights-off approach. If I was changing them, or if we valued their content higher than we do as a business right now (we value our other properties a little bit higher), I would probably start thinking about instrumenting more metrics, more logging, getting more alerts, getting deeper into the system and getting more observability. But I'm not there yet, and so that wouldn't provide value to me; it would just be noise right now.
 
Sascha_Wolf:
Yeah, that makes...
 
Allen_Wyma:
I was also thinking about something, because you're working with WordPress, right? And I think everybody knows that it's, well, super dangerous to work with because of all the security vulnerabilities. Do you also think that adding in something like checking what version of WordPress you have, and all the plugins, and then seeing if they're up to date, et cetera, could also be a form of observability too?
 
Dave_Lucia:
Absolutely. I mean, if you're going to run WordPress, then maybe you want to look at some of the more enterprise installations that, I don't know, maybe have some of those tools out of the box. The first thing that I was looking for, when I wasn't sure if we were going to keep WordPress or not, was: can I generate a lock file for dependencies and install them as I would with, like, a mix deps.get kind of thing? There is something that exists; I think it's called Fabricator, I could be wrong about that. But yeah, I don't know if I'd put that under the camp of observability, but it's certainly a protection. Maybe "has a dependency changed in the running system" would be an observability thing.
 
Sascha_Wolf:
So I think we've kind of got a complete picture of where you're standing and what your reasoning has been for some of the choices you've made, right? I'm kind of interested in where it went from there. I mean, like you said, Pingdom is this method of choice for just figuring out: hey, is this thing up, right? Is this thing running? I would assume you probably do a bit more than that for your new Elixir applications; I mean, OpenTelemetry is in the title of the blog post. But all of this is just at the level of collecting information. You're still collecting data, you're collecting information, and now what? What do you do with that? Because I've been at places where there were three separate and distinct spots you could go to to get information on your running system. Right? Like, one was potentially an AWS logging thing, then there maybe was also a Sentry page, and so on and so forth. Multiple distinct spots to get information from. And I think that's also, at least for me personally, part of the observability question: how do you make this accessible, right? How can you help people find the answers to their questions quickly? And what does that mean for you personally? How do you tackle this? To be honest, I feel that's a challenging problem.
 
Dave_Lucia:
That's a big question.
 
Sascha_Wolf:
Yeah, no.
 
Dave_Lucia:
So I think, before we get to what we do with the data, like, how does it help us, how do we analyze it, I think we need to talk about OpenTelemetry: what it is and what we're getting from it.
 
Sascha_Wolf:
Fair enough.
Dave_Lucia:
So OpenTelemetry is a standard. It's a working group. It's a series of language-specific libraries and SDKs that allow you to instrument your application to produce, collect, and distribute tracing data, metrics, and soon logging; I don't know if logging is stabilized yet. In the past, all of these things would typically be done by vendor-specific means. Maybe logging could be an exception, where you ship it to standard out or some file and then something collects it. But in general, for producing tracing information... I remember, before OpenTelemetry was widely available in Elixir, I was using Spandex.
 
Sascha_Wolf:
Yeah, me too.
 
Dave_Lucia:
Yep. And Spandex was a tracing library that helped you annotate your functions; I think it had a little decorator thing. But the idea is that you mark spots in your code that have a starting point and an ending point, which is called a span. And as you call functions and send messages across processes, these spans build up a tree, and this tree is called a trace. With this trace you can see, for a particular request or a particular background job or user interaction, what is happening in the system and where the system is spending time. And on each span you can see information. For example, if we're looking at a user request, the top-level span of the trace is going to be the route that was hit, and you might have information like a slug, the user agent that made the request, the IP address, the HTTP verb, yada yada. Underneath that, the first thing that might happen in your application is that it's going to check some cookie and query a database to see: is this user logged in? Then you're going to see a bunch of spans going from the top left of the screen all the way to the bottom right, and that's the N+1 query you didn't know you had in your application, where you're seeing a query to the database followed by another query to the database, going down like that.
 
Sascha_Wolf:
I've seen that. I've seen that thing, yeah. Ha ha.
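The span tree Dave is describing can also be produced by hand with the `OpenTelemetry.Tracer` macros from the `opentelemetry_api` package. A minimal sketch; the module, span, and attribute names here are made up for illustration, not from Dave's code:

```elixir
defmodule MyApp.Checkout do
  require OpenTelemetry.Tracer, as: Tracer

  def handle_request(user_id) do
    # Top-level span: the route that was hit; child spans nest under it
    Tracer.with_span "GET /checkout" do
      Tracer.set_attributes(%{"user.id" => user_id, "http.method" => "GET"})

      Tracer.with_span "auth.check_session" do
        # the cookie check and "is this user logged in?" query goes here
      end

      Tracer.with_span "db.load_cart" do
        # each DB call gets its own span; a loop of these produces the
        # telltale top-left-to-bottom-right N+1 staircase in the trace view
      end
    end
  end
end
```

Spans opened inside `with_span` automatically become children of the enclosing span, which is what builds the tree Dave mentions.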
 
Dave_Lucia:
So this is what tracing helps you do. It's almost like a live visual stack trace, where you're sampling different parts of your application: you've chosen places to instrument and produce spans very intentionally, or you've used libraries that instrument framework-level stuff, and that framework-level instrumentation produces spans. All of this is then collected and shipped off somewhere. Now, in the past, when we're talking about Spandex, you might have had a vendor that you were using, maybe you chose Datadog, and with Datadog you probably had a Datadog agent that you needed to run, which is either a library or another process that runs right next to your application. Your application sends data to it, either over a Unix socket or over HTTP or some other means, and that agent's job is to take the data and ship it off to that vendor-specific thing. So OpenTelemetry was trying to solve this problem collectively, where the ecosystem has so many different vendor-specific agents, there are many different ways to instrument your application, and if you want to switch from one vendor to another, it's almost like a whole rewrite of your application. With OpenTelemetry, you now have a vendor-agnostic way to produce metrics, to instrument tracing and spans, and hopefully to produce logging. There's something called the OpenTelemetry Collector, whose job is to take all that data from your application; you can then configure the Collector to say, send this one over to Datadog, send that over to Splunk, send that to LightStep and this one to Honeycomb. And you can do that and switch vendors in a few lines of configuration, or at least that is the goal. So all of this comes with OpenTelemetry.
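The fan-out Dave mentions lives in the Collector's YAML configuration. Roughly, using exporter names from the collector-contrib distribution (the endpoint values and environment variable names here are placeholders, not a verified setup):

```yaml
receivers:
  otlp:
    protocols:
      grpc:

exporters:
  otlp/honeycomb:
    endpoint: api.honeycomb.io:443
    headers:
      x-honeycomb-team: ${HONEYCOMB_API_KEY}
  datadog:
    api:
      key: ${DD_API_KEY}

service:
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [otlp/honeycomb, datadog]
```

Switching vendors is then a matter of swapping an exporter in this file rather than re-instrumenting the application.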
 
Sascha_Wolf:
Just to interject real quick here for our listeners: if you'd like to hear a little bit more about OpenTelemetry and its development history, two weeks ago we had Tristan Sloughter on the show, the Erlang maintainer for OpenTelemetry, where we talked about the 1.0 version of OpenTelemetry in Erlang. That was quite an interesting discussion. So if you now think, hey, I want to learn more, maybe go listen to that.
 
Dave_Lucia:
Perfect. Well.
 
Sascha_Wolf:
Yeah, go ahead, Dave. Sorry for the interjection here.
 
Dave_Lucia:
No, that's awesome, and I'm glad I know that, because Tristan basically wrote like 80 or 90 or 100% of the implementation for Erlang and Elixir, with the rest of the OpenTelemetry working group. Okay, so let's talk about analyzing this data and actually doing something useful with it. Data is not intrinsically useful; it's only useful if it provides value to the business. You can produce all the data you want, you could have a million metrics and a billion traces, but unless you know how to funnel that down and sift through it and get to the problem you're trying to solve or the thing you're trying to understand, it's not very useful. So I've been using two different tools, specifically for tracing and metrics, over the last few years: Honeycomb and LightStep. What's interesting about Honeycomb and LightStep is that they provide analysis and correlation tools that help you find what is relevant to your problem as soon as possible. So let's imagine that production goes down on my website, defi-rate.com. It just stops working, that Pingdom check fails, I get an alert: uh-oh, we're down, the site's 500ing. So where do you go? What's the first thing you do to uncover that issue? Maybe you're getting alerts, maybe you're not. Now I need to figure out where to even start looking. You might have a hunch, but maybe you don't. So the first thing you do is take that route that is failing and go into your observability tool. So I'm going to go into LightStep, I'm going to see the 500s, and I'm going to click on an example trace. And maybe that will bring me right to the root of the problem, where something changed and now this database call is failing, and that's bubbling up and breaking everything. Sometimes it's not so simple. Sometimes you have a bunch of services all communicating together, each with their own database, and maybe one service up the chain that manages user authorization is the thing that's failing, and that's why everything is breaking.
What you can do with LightStep and Honeycomb is select sample traces. So you might see, okay, this failure has been happening for the last hour; you're going to select that. You're also going to take a baseline of when things were good, so you'll select, say, two hours ago, when everything looked good. And what LightStep can do is compare the spans of the failure-mode time with the baseline time, do a correlation analysis, and bring up spans where there is a statistically significant change in what happened in that span. So maybe it will correlate it to, like, this one user is actually the only person causing this outage on the site, or, hey, this one database in this one specific region, every query to it is failing. And then that's where you could be like, okay, I need to go into RDS and restart something, or run some maintenance job, or talk to our operations team. But these tools are helping you analyze and find the root cause of your problems.
 
Sascha_Wolf:
Yeah, that makes a whole lot of sense. And is that then also something where you try to make this the one source of truth, where you would say, okay, if I have a question about the active and running system, then I just go, for example, to LightStep or to Honeycomb? Or do you still have a Datadog? Because that is my experience of where some of these efforts tend to fall apart. Where it's, okay wait, for this you have to go there, and for this you have to go there, and for this you have to go there. So, yeah, I'd like to hear your perspective on that.
 
Dave_Lucia:
I agree with you. At my last job, I worked at a sports betting company called Simplebet, and we kind of were running into this issue where we had Datadog for some things, Grafana for others, LightStep
 
Sascha_Wolf:
Hmm.
 
Dave_Lucia:
for everything else. And really all I wanted was everything to go into LightStep, but the problem with LightStep and any of these products is that they're all evolving. So when we first started using it, you could only get trace information. But what's really nice with the evolution is that they added a feature that allows you to bring metric data into the product. And the reason why you might care about metric data versus trace data is that they tell different stories, right? Trace information is going to tell you about specific users, and you could do certain rollups. LightStep has a really nice feature called Streams where you can chart all the different traces and see P95s and stuff, kind of like metrics produced from traces. But you might have other application-specific metrics, and they're going to tell you a story in aggregate, which is, you know, the number of requests coming in altogether, or the memory of my system. And what you can do with LightStep is bring in those metrics, and then when you're having these failures, you could look at this and be like, oh, actually memory spiked at this point. And you could click into that memory spike and do that same, here's the regression, here's the baseline. And then it's going to pull up relevant traces that had values change in those periods. So I think the real value is in bringing all of this information together. And if you could do the same thing with logging information, which is really just an event time series, where I can now pull in relevant logs that were in this regression period, you kind of have a complete picture of this regression, where you've got the aggregate, you have the big-picture story, you've got a really specific story about a trace that's failing.
And if you have the logs, you have every step along the journey of what is going on. And that should give you a really fast window into solving the problem that you're having. But I don't think these tools are there yet. They're evolving. I haven't used Honeycomb in a few months, but they're releasing new product all the time. And I think very soon we're going to have products that can do all of these things, highly integrated together.
 
Sascha_Wolf:
Yeah, I...
 
Allen_Wyma:
would be really cool if we could actually stick the data from the BEAM into these collectors, right? Because it'd be really awesome to see all that stuff together. So not only does server memory go high, but you find out it's your BEAM process. And then see, okay, what's going on? Am I getting flooded with, I don't know, somebody left in String.to_atom instead of String.to_existing_atom?
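The hazard Allen is alluding to can be sketched in a few lines. A minimal illustration; the parse_status helper is hypothetical, just for demonstration:

```elixir
# String.to_atom/1 allocates a new atom for every unseen string, and
# atoms are never garbage-collected, so feeding it untrusted input can
# eventually exhaust the BEAM's atom table and crash the node.
# String.to_existing_atom/1 raises instead of allocating.

# :ok already exists as an atom, so this succeeds:
:ok = String.to_existing_atom("ok")

# A hypothetical helper that converts user input defensively:
parse_status = fn input ->
  try do
    {:ok, String.to_existing_atom(input)}
  rescue
    ArgumentError -> {:error, :unknown_status}
  end
end
```

This is the pattern that keeps user-supplied strings from growing the atom table unboundedly.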
 
Dave_Lucia:
Right. I mean, that's just a metric, right? You have all this BEAM-specific information. There's a great library, I think Bryan Naegele wrote it, the Prometheus exporter for Elixir. And there are just drop-in libraries that will produce metrics for atom count, process count, GC, all that kind of stuff. And you can pull that right into your system. I think a problem with a lot of these tools is just that there are all these competing standards that have happened over time. And OpenTelemetry is still very much a work in progress, and depending on the language you're using, it's in varying degrees of complete. So it's kind of like, can I do X, Y, or Z with OpenTelemetry? And the answer is, well, maybe. You might have to wait a few months, or it might've just been released but it's not compatible with this vendor yet, yada yada. So everything is kind of cutting edge right now, and that's a bit of the difficulty.
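Those drop-in exporters mostly poll the BEAM's built-in introspection functions on a timer and publish the numbers as gauges. The raw data they report is available directly from the runtime; a stdlib-only sketch of the kind of snapshot they take:

```elixir
# A snapshot of the kind of VM gauges a Prometheus/telemetry exporter
# polls and publishes. All of these come straight from the runtime:
beam_metrics = %{
  atom_count: :erlang.system_info(:atom_count),
  atom_limit: :erlang.system_info(:atom_limit),
  process_count: :erlang.system_info(:process_count),
  total_memory_bytes: :erlang.memory(:total),
  run_queue: :erlang.statistics(:run_queue)
}

IO.inspect(beam_metrics, label: "beam metrics")
```

Comparing atom_count against atom_limit over time is exactly how you would catch the String.to_atom leak Allen describes before it takes the node down.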
 
Sascha_Wolf:
Yeah, I believe it. And I just want to know, Allen, why is it we always try to talk at the same time? It's crazy. It's like a recurring theme at this point. There are probably memes out there. But yeah, I can very much relate to what you just said earlier, Dave. I still remember this moment when, at a job a few years back, we had our first setup of Spandex and some traces. I think we used Zipkin, or, I don't know, something which was running on our own infrastructure, just to showcase them. And then we had this first kind of outage scenario, and I booted this thing up, and then I could see the traces and I could see the logs for each individual trace, and that was like magic. Suddenly I felt like, hey, I can really understand a lot more about the system and how it's behaving. And I feel you even undersold it a little bit, because we had an event-driven system, and there was even a span for things that happened down the pipeline, no longer in the stack. Even when the initial request had already stopped, you could still see what kind of cascading changes were happening throughout the system. And in this case, it was an event handler behaving badly. So that was just a moment of magic. And I feel like you can't really grok what kind of visibility you can get into a running system, with all these different tools combined, until you've seen it in action.
 
Dave_Lucia:
That's true. I think the most mind-blowing thing, as you stated, about tracing is the distributed nature of it, where
 
Sascha_Wolf:
Yeah.
 
Dave_Lucia:
you could see how requests might move across different services. I actually worked on kind of the same problem at Simplebet, where we were using Commanded, which is a CQRS library.
 
Sascha_Wolf:
No,
 
Dave_Lucia:
And
 
Sascha_Wolf:
it's the same
 
Dave_Lucia:
we just,
 
Sascha_Wolf:
with us. Yep, yep, yep. Ha ha
 
Dave_Lucia:
okay.
 
Sascha_Wolf:
ha.
 
Dave_Lucia:
So we had no understanding of what was happening inside of
 
Sascha_Wolf:
Exactly.
 
Dave_Lucia:
Commanded. Which is why I wrote OpenTelemetry Commanded, to provide instrumentation into Commanded. So from there, we're like, oh wow, look at, like you said, look how many events are being processed by our event handlers. We didn't intend to do that. And so that actually led to more investment in optimizing different parts of Commanded to be able to handle different use cases. But again, this is coming back to the larger problem of observability, where unless you can see what's happening inside of something, you might not know what's going on
 
Sascha_Wolf:
Yep.
 
Dave_Lucia:
and things might be happening contrary to your expectation.
 
Sascha_Wolf:
Yeah. Yeah. And maybe on a side note there, something that Tristan also spoke about in our episode is that he'd like to make some of the goodness the BEAM already has in observability built in, because the BEAM as a piece of software is actually quite observable if you know what you're doing. But some of those things are kind of a different paradigm, again, from what OpenTelemetry and some of these platforms are providing. So I'm kind of excited to see where the future is going. That was something I hooked him on too: where do you see this going? Now tell me. He didn't really give away too many juicy details. But I'm actually interested to hear from you, Dave, because from what I've heard, kind of as a subtext of what you've been saying, you feel like platforms like LightStep and Honeycomb are doing something different, doing more than, for example, the, I would argue, very well-known Datadog and similar platforms. So why did you, specifically in this case, go ahead and choose a platform like LightStep compared to something like Datadog? What's the difference, and what's the argument that made you choose this over that?
 
Dave_Lucia:
Right. So Datadog is going to provide you with a million different dashboards and almost infinite flexibility in what you can see, what you can visualize, and how you can see it. But the problem, going back to observability being a garden, is one, that's very manual, but two, it doesn't necessarily provide you value when you're having an issue. Just because you can chart every single thing in CloudWatch and put it into Datadog doesn't mean that you'll ever really care about this one very specific chart in Kubernetes showing, I don't know, something about your workloads, right? So you kind of get into this place where you get dashboard overload, and you don't really know where to go to find the thing that's going to help you find the issue.
 
Sascha_Wolf:
Mm-hmm.
 
Dave_Lucia:
In other words, I never found that tools like Datadog were providing me with the tools to get to the root cause of the problem faster. They provided a million different ways to visualize data, but once I had that data visualized, sifting through it and getting to the problem that I cared about always felt very difficult. They also don't have very good tools for different types of data and how you might want to bring them together. You know, when I was talking about, okay, we've got a user request and one database is failing in a very specific region, what would be the process you'd go through in Datadog to get to that problem? Maybe you'd be like, hmm, let me look at the databases, and you go and dig through some saved dashboards and look at your databases and be like, I think these are all okay. It's basically like you'd go on a treasure hunt
 
Sascha_Wolf:
Hahaha!
 
Dave_Lucia:
rather than kind of having, I don't know, some spy tools. I don't know what the right metaphor is here, but you need that magnifying glass, or, you know, to tap into the matrix and get really close to relevant information. That's the thing: how do I get to relevant information quickly? Datadog provides a lot of information, but not necessarily the tools to get to the relevant part.
 
Sascha_Wolf:
So you could be saying it's signal versus noise again, kind of, right?
 
Dave_Lucia:
Exactly. Yeah. A lot of noise, a lot of noise. And the other challenge here is that with those tools, because you get so invested in creating dashboards, it's like the pets-versus-cattle thing, where you become very particular about your dashboards. You don't want to delete them, you don't want to delete this data, because we might need it someday, who knows? And that leads to a place where your Datadog bill just goes up and up and up over time, because we're just shoving more data into the system, because that's kind of...
 
Sascha_Wolf:
I don't know what you're talking about.
 
Dave_Lucia:
I'm sorry, I'm hating on Datadog. It's a good tool. They've built some really great stuff. I think they'd be greater if they invested in what I see as the future of these tools, which is the direction that LightStep and Honeycomb are going in. I know there are some other ones that might provide similar functionality. I've heard of Dynatrace; I haven't particularly used it. But what I've liked about LightStep and Honeycomb in particular is that they have a lot of developer advocacy that's really about education, bringing people into these tools, why they're valuable, and how to use them. And I think that's really important. Liz Fong-Jones and Charity Majors in particular do a great job of writing and talking about the practices, you know, what makes a good technology organization in terms of philosophies, and how that relates to observability, all those sorts of things.
 
Sascha_Wolf:
Okay. I don't think... we didn't start talking at the same time this time. I was not talking. Hahaha.
 
Dave_Lucia:
Hahaha!
 
Allen_Wyma:
Trying to
 
Sascha_Wolf:
or
 
Allen_Wyma:
keep quiet. I don't want to interrupt.
 
Sascha_Wolf:
Okay, first of all, thank you, Dave, for all of this, for your willingness to share all this. It's definitely a topic that's very relevant for me right now, because, as you just said, we are currently using Datadog at my current job, and we don't really feel we get the value out of it that we would like to see. Part of that is because at the moment there are a billion different things going on, right? So you never really feel you have the time to fiddle with these tools. But maybe part of that is also what you just laid out; I've never considered it from that particular perspective. So, if you had one piece of advice, or multiple pieces of advice, right? One essential learning from your journey, for our listeners, most of whom probably don't have the level of expertise and experience with observability topics that you do. What would that be? If you could, I don't know, give a hint or a piece of advice to yourself 10 years ago, what would that be, goddammit?
 
Dave_Lucia:
Well, the first thing I'd say is there are a lot better resources out there today than there used to be. So I'm going to Google this really quickly. Okay. So there is a book written by the people I keep mentioning. Let's see, sorry, give me a second here. Okay, it is Charity Majors, Liz Fong-Jones, and George Miranda. They wrote a book called Observability Engineering: Achieving Production Excellence. I have not read this book yet. I am very excited to dig into it, but I think this is probably a great starting place for anyone who's really curious about, how do I get deeper on this? Where do I start? What are the practices? This book just came out, so it's very new, very relevant, written by some of the most vocal people in the space about these practices. So I think that's a really good place to start, and a place where I wish I had started a few years ago. The other thing I'd advise is to get started by pulling in instrumentation libraries that work at the framework level. So follow the guide on the OpenTelemetry website for Elixir. Get some trace instrumentation in there by dropping in OpenTelemetry Phoenix, OpenTelemetry Ecto, and whatever other libraries you're using that have a trace instrumentation library available. Start there. That's actually the best place to start. You're starting at the edges of your application, at the inbound and the outbound of your system. And then from there, you could start to connect LightStep. You could probably be on their free tier, Honeycomb as well; they both have generous free tiers. Play around with it, get an understanding of, what is this tool giving me that New Relic is not? And maybe you'll find that, okay, it's not providing me value, but I'd kind of be surprised if you didn't see the power of it. There are also some really good writings by Ben Sigelman. He's kind of what sold me on LightStep a few years ago, when he started tweeting about
how he went from a production issue through all the steps in between of doing that correlation analysis with traces, choosing the regression and the baseline, what we talked about before. He went through a visual example in this tweet thread, and that was like, wow, you could do that? That is insane. The power of looking across services and seeing how they interact with each other, that is a superpower. So look through those writings, read through their documentation. It's excellent, both LightStep's and Honeycomb's. And I think you'll be in a much better position. Read my blog posts for getting started. And the other thing is, there's a lot that needs to be done on documentation in OpenTelemetry. And I'd encourage people who do have some expertise, myself included, to contribute to the documentation, to write about it. There could be much easier ways to get OpenTelemetry into your application. Like, I'd love a mix otel.init. There's an issue for this on OpenTelemetry that I would love to contribute to and haven't found the time yet, but someone should pick that up. And it should follow in the guiding light of Surface, which has, I think, the best mix surface.init, where it actually parses the syntax and adds into it, as opposed to the string interpolation that Phoenix does. We could have really good onboarding experiences, and the tools are all there. We just need to build it. So, call to action for people listening: it'd be a great time to contribute, because it's all so early still.
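For reference, the framework-level drop-in setup Dave describes usually amounts to a couple of dependencies and two setup calls at boot. A sketch assuming the current package names; exact versions and options vary by release, and MyApp/:my_app are placeholders:

```elixir
# mix.exs -- pull in the API/SDK, the OTLP exporter, and the
# framework-level instrumentation libraries mentioned above:
defp deps do
  [
    {:opentelemetry, "~> 1.0"},
    {:opentelemetry_exporter, "~> 1.0"},
    {:opentelemetry_phoenix, "~> 1.0"},
    {:opentelemetry_ecto, "~> 1.0"}
  ]
end

# lib/my_app/application.ex -- attach the telemetry handlers once at
# startup, before the endpoint and repo begin handling traffic:
def start(_type, _args) do
  OpentelemetryPhoenix.setup()
  OpentelemetryEcto.setup([:my_app, :repo])

  children = [MyApp.Repo, MyAppWeb.Endpoint]
  Supervisor.start_link(children, strategy: :one_for_one, name: MyApp.Supervisor)
end
```

With just this, inbound HTTP requests and outbound Ecto queries show up as spans, which is the "edges of your application" starting point Dave recommends.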
 
Sascha_Wolf:
Nice, thank you. Alan, do you have anything you would like to ask or otherwise I would say we can close this up.
 
Allen_Wyma:
No, I just put the book onto my ever-growing, never-shrinking backlog of things to read and read again.
 
Sascha_Wolf:
Oh boy, I can sympathize with that. Okay, Dave, if people want to get in touch with you, and also, I mean, maybe want to work with you, are you hiring right now? How could they do that?
 
Dave_Lucia:
Yeah. So you can reach me on Twitter or on the Elixir Slack. I'm davydog187. I'll kind of stick to that handle until I die. You can find me on Twitter and the Elixir Slack on that handle. I am hiring, always, but maybe not so relevant to this podcast, I'm looking for front-end developers. So if you're into React, Tailwind, and all that fun stuff, give me a holler and we should talk. But otherwise, yeah, really excited about the space and always want to chat with people who are excited too.
 
Sascha_Wolf:
Nice. Okay, then I would say we transition slowly to picks. Allen, come on. Rust book, come on. We haven't had one in so long.
 
Allen_Wyma:
Yeah, I haven't. But I do have a couple on my radar that I'm going to be looking at pretty soon. I just shared a book called Distributed Services with Go. Some stuff came up where somebody asked me to build something in a distributed way, and also using Go. So I decided to crack open this book I bought a long time ago. And I think the book is actually really good. I'm more than halfway through, near the end of it, to be honest. There's a lot of information in there, and I think it could work for anything distributed. So I think it's a really good book. Check it out if you want to know more about distribution. Something I learned is how to actually generate certificates. I didn't know that Cloudflare had this thing called CFSSL. Have you seen that before?
 
Sascha_Wolf:
Nope.
 
Allen_Wyma:
You can generate mutual TLS certificates pretty easily. There's just a ton of stuff in there. So if you're interested in doing distributed services, with service discovery and mutual TLS certificates and all kinds of crazy stuff, even though it's written in Go, I think it could work for nearly any programming language, because it's pretty agnostic about most of these kinds of things.
 
Sascha_Wolf:
Right. And...
 
Allen_Wyma:
I think I can see everybody looking around, putting the book into their lists right now.
 
Sascha_Wolf:
I was actually looking at my notes at my picks. So...
 
Allen_Wyma:
Oh, okay. I was like, Oh man, I'm making an impact. I'm an
 
Sascha_Wolf:
I
 
Allen_Wyma:
influencer.
 
Sascha_Wolf:
just... The daddy already are for me, Alan. You know that. So I have only one pick this week and it's a shameless self-pick. It's also the... topic of last week's episode, but I now went ahead and published my new library called XUnion which has the tag line, tag unions for Elixir, just that. And yeah, readme is finished, documentation is finished, it's 0.10 on hacks. To be honest, I consider it basically 1.0, but there's still some little things I would like to see and I would also get some feedback from people before I pretend that this is production ready. But just to reiterate again, it's a little helper library to generate code for tech unions like union types or discriminated unions or something that's called variants. And it has this little DSL of a macro and it generates a bunch of structs from that. And then you can pattern match on these structs. And it also gives you a little guard to say, hey, is the given value one of these structs, so you kind of have a little bit of this. union type goodness without having to write off spoiler plate because I've written the spoiler platen some more interesting. So yeah, go check it out. It's now public on hex. And I really would love to get some feedback. Okay.
 
Allen_Wyma:
I think you already got some feedback from somebody, right? You promoted it in the Telegram group, and then immediately somebody said, oh, it's broken.
 
Sascha_Wolf:
No, it's not broken. They had a typo in their code. It's
 
Allen_Wyma:
I
 
Sascha_Wolf:
not
 
Allen_Wyma:
saw
 
Sascha_Wolf:
my
 
Allen_Wyma:
that
 
Sascha_Wolf:
fault.
 
Allen_Wyma:
yeah.
 
Sascha_Wolf:
Hahaha!
 
Allen_Wyma:
I just thought that was very funny, I was like...
 
Sascha_Wolf:
Wasn't
 
Allen_Wyma:
When I
 
Sascha_Wolf:
me,
 
Allen_Wyma:
saw
 
Sascha_Wolf:
wasn't
 
Allen_Wyma:
that.
 
Sascha_Wolf:
me. Wasn't my code at this time. Okay. Dave, do you have any picks for us?
 
Dave_Lucia:
Well, I had one, but now that you did a self-pick, I'm gonna add some self-picks too. So
 
Sascha_Wolf:
Yay.
 
Dave_Lucia:
my first pick is the Software Unscripted podcast by Richard Feldman. José Valim was actually tweeting about this a few weeks ago, that he really enjoyed the podcast, and I've been following Richard for a long time. He's well known in the Elm community. He wrote, I think, one of the Elm books. He worked with Evan, the creator of Elm, for a while, and works at NoRedInk. He's also working on a language called Roc, which is kind of an Elm and Haskell derivative for the backend. I'm quite interested in that. So yeah, his podcast is great. He has a lot of really good guests, and I've just been getting a lot out of it. I've been listening to all the episodes. So go check it out, really great podcast. My two shameless self-picks: I also have two new libraries that I'm working on that are open source. One is the topic of my talk at Code BEAM America next month. It's a library for making it easier to work with TimescaleDB in Ecto. So go check it out, bitfo/timescale. There are a lot of open issues, and it'd be a really great place, if you're interested in getting involved in open source, to pick one of the issues; the learning curve's not that high. The other is another database-related project. Right now it's called EctoDateRange, but we're going to rename it to EctoRanges. It's focused on providing Ecto custom types for all the different Postgres range types. There are actually a few other libraries that do exactly this, but I was not happy with the flexibility of those libraries and wanted to provide some other conveniences. So this is my take on solving that problem. Please get involved, use them, and yeah. My picks.
 
Sascha_Wolf:
Nice, thank you. Okay, then it was a pleasure having you on the show Dave, I really enjoyed this.
 
Dave_Lucia:
Yeah, thanks for having me. Really fun to chat with you all.
 
Sascha_Wolf:
Okay, then for everybody else, thank you for listening to all our rambling, and tune in next time when we have another episode of Elixir Mix. Bye bye!