Observability in the Beam: An In-Depth Exploration of Tools and Solutions - EMx 228
Adi, Allen, and Sascha join this week's panelist episode. They dive deep into the world of observability, tracing, and monitoring. They talk about the advantages of using open telemetry directly and how it can be translated into different formats. They also explore the benefits of using tools for understanding and improving code performance during development. Additionally, they take a look at different levels of observability, from Phoenix Live View and Live Dashboard to telemetry and tracing operations in large pipelines.
Show Notes
Sponsors
Social
Picks
- Adi - trace_pattern function in Erlang
- Allen - GigCityElixir 2023 - Amos King
- Sascha - Everhood on Steam
Transcript
Sascha Wolf (00:01.059)
Hey everyone, welcome to another episode of Elixir Mix. This week on the panel we have Adi Iyengar.
Adi (00:07.8)
Hello.
Sascha Wolf (00:09.035)
and Allen Wyma.
Allen (00:11.651)
Hello.
Sascha Wolf (00:12.979)
and your host, me, Sascha Wolf. And we have no special guest this week. It's a cozy panelist episode. And today we are going to talk about observability on the BEAM at large. I'm basically gonna give you the gist of how this came to be: for a while now at work I've been involved in getting our observability setup better and up to speed. And the thing there is, I...
always wished for some kind of resource where I could go to and kind of ask, okay, I'm building software on the beam. What kind of observability tooling does make sense for me? And what kind of observability tooling does answer which types of questions? You know, because if you just very briefly, maybe do a cursory search, you find something like Erlang tracing, like performance tracing, you find something like telemetry, you find stuff like open telemetry, which is an open standard.
But of course you also find software as a service solutions like AppSignal or Datadog and so on and so forth. There's a whole slew of different things out there, all, to varying degrees, promising to be the be-all and end-all of observability. But when you come at it from an engineering perspective of, hey, this is not my area of expertise, I just want to get shit done and...
please help me understand what makes sense for my use case. There isn't really a single resource you can go to, at least I don't know about any. And this episode today wants to maybe not solve that problem, but kind of give an overview on like what kind of different solutions we've had experience with, and also what kind of different solutions we've looked at and maybe dismissed and why. And also maybe some tips and tricks we learned along the way. So yeah.
Adi, what is your experience with observability at large?
Adi (02:11.922)
Yeah, I mean, it's been different across different types of projects, different scopes of projects, right? I guess I'll just get right into it. I mean, Allen mentioned earlier, Phoenix LiveView now ships with LiveDashboard, or rather Phoenix itself ships with LiveDashboard now, which is built on LiveView. I think that's a very good place to start, get more insights into processes, what they're doing; the more processes you register,
the more insight you'll get. Obviously you won't have the handle_event or all the tracing information of every call, but at least you can find state information over there. I think that's probably what I consider level one. Level two would be more log driven, collecting more log information and aggregating that along with an
Adi (03:11.998)
easy tool to integrate with that kind of provides that interface. And I guess like level three would be like, you know, trying to dig into telemetry or open telemetry and like finding ways to, you know, trace things in the order that they happen so you can like figure out bottlenecks and like a huge pipeline of operations, right? Like, and again, that's something you need like when you have a huge project where you know.
Latency matters a lot, bottlenecks matter a lot, where you spend your time improving performance matters a lot. So yeah, I think it's, again, need-based. I've worked in all three types of projects and, you know, it is fun no matter what level you're at. It's fun.
Sascha Wolf (03:58.815)
Yeah, I agree. There's like one little tidbit where I would actually disagree, in that tracing specifically, and in this case I mean distributed tracing, is one of the three pillars of observability. Usually observability is split into three pillars: tracing, logs, and metrics. And tracing is actually something I found very, very useful, not even in big, not necessarily in big projects, but in projects that are event driven,
because that is usually a type of software where it can be hard to understand causalities, right? Like what happens because of which thing, and what does that trigger then, and so on and so forth. And in that case, having something for distributed tracing where you can kind of see this causality. So like I can see, okay, this is the first request that came in, that triggered that event, that ended up in these three handlers, which did some stuff, you know?
And that is super, super helpful to understand how things actually work under the hood. And not even necessarily when the project is big or like highly performant, just when you have maybe, I guess, when the complexity of the project is high, not necessarily when the load is high. Um, yeah.
Adi (05:11.414)
I was actually, I mean, I would agree. I think eventually a project will reach a point, very quickly in fact, where you would want that kind of information. Actually, it's funny, it reminded me of something I did in 2015. I was a junior engineer at that time and I wanted tracing. I don't even know, I think I didn't know OpenCensus was there; it was a huge tool, right? I didn't even know about all of these things. So what I did in Phoenix was I added a plug that, always at the end of the
controller lifecycle, with the connection that's returned, would always raise an error. So I could grab the stack trace, print it, then rescue the error and return the connection. So that stack trace, like the entire trace of where the error came from, gave you some information of, you know, what path the conn
object, for lack of a better word, the construct, took in that path. But yeah, I mean, with what you're saying, Sascha, I can relate to it. Like even as a junior developer, I felt the need to do something like that. It gets very important. The more tracing you have, the more transparency you'll get into how that works.
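[Editor's note: a minimal sketch of the kind of hack Adi describes, in a hypothetical Phoenix app. The module name is made up, and __STACKTRACE__ stands in for whatever stacktrace mechanism was available at the time; this is a curiosity, not a recommendation.]

```elixir
defmodule MyAppWeb.Plugs.PoorMansTrace do
  # A rough sketch of the trick described above: at the end of the request,
  # raise and immediately rescue an error just to capture a stack trace.
  import Plug.Conn

  def init(opts), do: opts

  def call(conn, _opts) do
    register_before_send(conn, fn conn ->
      try do
        raise "trace me"
      rescue
        _ -> IO.inspect(__STACKTRACE__, label: "request stacktrace")
      end

      conn
    end)
  end
end
```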
Sascha Wolf (06:33.975)
Yeah, and also modern tracing solutions. And that is, I mean, OpenCensus is something that no longer exists; I think it merged into what is now OpenTelemetry. Um, and modern tracing solutions, which, I mean, you do get, for example, from the software as a service providers like Datadog and AppSignal, but in theory you can also host this stuff yourself. There is, um, I think it was built by Uber, Jaeger, which is a thing that can also
Adi (06:42.111)
Yeah.
Sascha Wolf (07:02.243)
process OpenTelemetry traces and kind of visualize those. But they also come along with support for logs and other attributes. So like on your traces, you can actually then potentially see the logs that have been generated over the duration of any particular span.
Adi (07:23.682)
You're talking about Jaeger, right?
Sascha Wolf (07:26.007)
Jaeger, yeah. Yeah. Maybe I pronounced that wrong, really. I think it was Uber, but don't quote me on that. But yeah, maybe like for our listeners, if you're not familiar with like what distributed tracing and all of the things we just mentioned mean. So like the idea for distributed tracing is that you can kind of correlate what happens in different parts of the system. And usually you do that through spans. And span, one span is always like a fixed
Adi (07:29.378)
I did not know that Uber built that, that's really cool.
Sascha Wolf (07:56.087)
like a thing that happens over a certain duration. So it has a fixed start time, it has a fixed end time, and over that duration there might be, for example, additional events that happen, log messages that get attached, so on and so forth. And the big deal about why it's distributed is that you can relate spans with each other. So for example, you can say that one span is the child of another span. Coming back to the example of earlier, right? Like you emit maybe an event when a request comes in, and you can, as meta information,
give the span ID along for this event, of the span that represented the request. Then you can use that span ID in any of the event handlers to say, okay, I'm now starting a new span for handling the stuff, and I'm pointing to that span ID as the parent of the new span that I started. And then the systems I mentioned earlier, the things that can kind of collect and correlate those things, they can show you, okay, this is now the request that came in, and those are all the other spans, all the other things that happened afterwards. And that is,
I feel this is one of those things which is really hard to understand how useful it is until you see it the first time in action. Because it is really useful and it is a little bit like magic in the sense of that I have not yet seen any other observability, any other monitoring technique that gives you this level of insight into a running system.
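[Editor's note: as a rough illustration of the parent/child relationship described here, a minimal sketch using the Elixir opentelemetry_api package. It assumes the OpenTelemetry SDK and an exporter are configured elsewhere; the module, span, and attribute names are made up.]

```elixir
defmodule MyApp.Orders do
  # with_span/2 opens a span and makes it the current one, so any span
  # started inside the block becomes its child automatically.
  require OpenTelemetry.Tracer, as: Tracer

  def handle_request(order_id) do
    Tracer.with_span "http.request" do
      Tracer.set_attribute("order.id", order_id)

      # Child span: a backend like Jaeger or Datadog will show this nested
      # under "http.request" in the trace's Gantt-style view.
      Tracer.with_span "order.handler" do
        do_work(order_id)
      end
    end
  end

  defp do_work(_order_id), do: :ok
end
```

Across process or service boundaries you would propagate the span context explicitly, which is what the distributed part of distributed tracing refers to.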
Adi (09:17.906)
Yeah, totally. I think also, I think, just to help people view it, like they usually, like this, you know, I think a trace is a collection of spans, right? Like you show a trace as like a Gantt chart, right? Like this function is part of this span, it might have child spans too. And then the next function is called, it lasted this long. And it allows you to really
Sascha Wolf (09:30.486)
Mm-hmm.
Adi (09:46.102)
Again, it lets you trace where the problems are. Very, very useful in performance and bottleneck investigation. It's, I think, yeah, like Sascha said, like the best tool available right now. I think the whole span thing came from, and Sascha correct me if I'm wrong, OpenTracing, right? Like OpenTracing, which also... Yeah, yeah.
Sascha Wolf (10:05.924)
Yes, it was OpenTracing and OpenCensus, and I think they both kind of merged into what is now OpenTelemetry.
Adi (10:10.162)
into OpenTelemetry. Yeah, OpenTelemetry.
Sascha Wolf (10:14.731)
Yeah. But again, I mean, like this is like open telemetry, while it has very good support for the beam, it is an open standard. There is a whole bunch of other tools in the beam ecosystem. So there is telemetry, which is honestly, the naming is hella confusing because we have telemetry and we have open telemetry. Although open telemetry does have good support in the beam. It's not strictly a beam thing, right? But telemetry is a beam thing. Then you have, of course, like the usual logging, you have the performance tooling.
like to trace performance bottlenecks on the BEAM, like fprof and eprof and cprof. Honestly, even to this day, I'm not entirely sure what the 100% difference between those things is. I know they exist and I've used them a few times. But if you would ask me right now to tell you what is the difference between eprof and fprof and cprof, I'd be stumped. So, and then again, like you have...
Oh, probably the other thing, I mean, Observer, right? Like Allen mentioned earlier, Observer and like the monitoring tools you have there. And all of those exist and all of those are useful. But even now, at this point where I've been in Elixir for over six years, I still have a hard time articulating which of those tools is useful in which particular situation.
Adi (11:13.654)
Yeah.
Adi (11:37.558)
Totally agreed. Yeah, I know fprof slows down the runtime. How I made myself remember that is that it fs up. You're such a fprof, but.
Sascha Wolf (11:50.131)
Hahaha!
Sascha Wolf (11:57.399)
Alan, what has your experience been with all of these different toolings? Because I mean, before we started recording, you said like the whole tracing thing is not something you've looked that much into. So what is your poison of choice?
Allen (12:08.486)
Yeah, I mean, I was definitely wanting to look into, like, the distributed tracing, because I had a bunch of microservices for a previous project. That's something I wanted to look at, but like I said, I didn't get a chance, the project had to stop. But for, like, plain tracing, I just want to say that I don't know how many people actually use it. I mean, it is somewhat of an advanced thing.
I mean, I first got introduced into tracing because of the book from, was it, from Francesco. I forgot the name of the book. It's like building scalable applications with Erlang or something like that. That was super useful to kind of get into it. But of course that's written in Erlang. But I did use, and I'm going to pick this later on, but just to kind of talk about if you guys have looked at Gig City Elixir 2023, Amos King's talk about shepherding.
He goes on how to do this, the tracing and elixir. And I have a true life story actually, where we actually use this to solve an issue where we couldn't connect to, or sorry. So on testing, people couldn't log in, but on production we could. And we had no idea why. And we tried adding in logs and we couldn't figure out why. Because basically we're using an OAuth login using Microsoft as a provider.
But we added in tracing and we used his guide, Amos' guide on his video to do it. And it was super easy, like basically anybody could follow just kind of gave the video to somebody and they did it. We found out, hey, it's because the token that we use for OAuth 2 has actually been expired, has to be rotated. But we could not find any way to figure out how to log that, but tracing in just a couple of seconds just made it work. So, I mean, we're not doing like crazy scale stuff, but.
stuff like that could be super helpful because sometimes it's really difficult to check these kinds of things in your local environment, in production. I mean, you know, so I think those are really more powerful I think sometimes rather than like distributed tracing and having these kinds of like crazy stuff set up. But the other thing I think that's also kind of useful too is for another project, what we also are doing is we actually added extra metadata to our logs. So not only do we see
Allen (14:30.134)
you know, like what they're doing, but we have like, we know who's being affected. So it's easy for us to kind of come back to them and say, you know, what were you doing or like what, you know, what were you working on? Like, cause we actually added in some stuff to the logs so that way we can have more idea about why certain things are failing. I don't know if you guys have done anything like that in your own stuff. Are you actually adding extra metadata to your logs to try to figure out what's going on when you see something that happens?
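[Editor's note: a quick sketch of the log metadata idea being discussed, using the standard Logger. The metadata keys are made-up examples.]

```elixir
# In config/config.exs, allow the chosen keys to be printed / shipped:
#
#     config :logger, :console,
#       metadata: [:request_id, :user_id, :bounded_context]

require Logger

# Set process-local metadata once, e.g. in a plug after authentication:
Logger.metadata(user_id: 42, bounded_context: "billing")

# Every later log call in this process carries the metadata automatically,
# so a log aggregator can filter by user_id or bounded_context.
Logger.info("token refresh failed")
```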
Adi (14:55.606)
Yeah, definitely done that.
Sascha Wolf (14:59.515)
Yes, um...
Allen (14:59.522)
Sascha looks bored out of his mind. Is it maybe too easy for you? But I mean, these little things are very simple, but they're very powerful and they give you a lot of information automatically, right?
Sascha Wolf (15:08.983)
No, we've also been, like, uh, sprinkling additional metadata into our logs in a few places, where we kind of want to automate some of that. Because, um, if some of our listeners remember, the system we're building has more of a separation between different, um, subdomains, different bounded contexts. So like we were talking about, for example, maybe having the bounded context the thing happens in always as metadata on a log message, but not having to pass it every time manually, right? So like those are the kinds of things that we've also been exploring.
Metadata for logs is something that is also very useful when you want to correlate problems, and when you want to search for specific entries. Usually, in my experience, most log aggregators these days allow you to also query by metadata fields. Yeah, the thing is, I do regularly look at logs, but these days,
often my first go-to is not the logs, but it's actually traces and spans. Because, I mean, as you mentioned earlier, it can sometimes give you an insight into parts of a system that are also grokkable through logs, but they require more brain power. And one of the biggest things for me, for example, was in one of the older systems I worked at, where we also had a tracing setup, and we actually had every single call to our Ecto repos traced. We had a little...
wrapper around the Ecto repos, which on every single call kind of started a span. And we had like one endpoint that was giving us a little bit of performance issues. And through the logs it was really hard to grok which of those it was. And that was kind of the motivation for adding this tracing around every Ecto repo call. And then we did that, and then it became immediately obvious which of the calls was the bottleneck, because you really have this visual thing, right? Like the...
As I mentioned earlier, it's this Gantt chart. And the longer a span takes, the wider it gets. So you could really see in that UI, OK, this query is fast, this query is fast. Oh, this is slow. You could visually see it without even having to grok how long of a duration every query took. You could just see it intuitively.
Sascha Wolf (17:35.215)
I would like to actually circle back to some of the tooling available on the BEAM. For example, cprof, eprof, fprof. Because some of those also allow you to do that. Maybe not the Gantt chart level of visualization, but for example, I think there's a flame graph you can generate. Was it from fprof? It might have been from fprof. And then I think cprof, yeah, that is something I think I've only used once. So I'm curious how much...
Adi (17:55.37)
Yeah, that's Fprof.
Sascha Wolf (18:01.403)
What is your experience, Adi and Allen, with those tools? Have you ever used these built-in performance measuring tools the BEAM kind of ships with? And if yes, for what? And what would you say are useful scenarios, especially for our listeners, where those tools are a tool they can reach for?
Adi (18:22.518)
Um, I guess, I guess the thing is that, uh, I've used these tools a few times, right? Like sometimes when I'm doing local benchmarking for certain things, I do use these tools. Um, fprof just, I think at times it just increases the runtime of that, I would say honestly like two or three X, which is not very convenient. Um, I think eprof is like a more lightweight version of that. I think cprof, the C I think stands for count.
I might be mixing them up or something, but it's somewhat like that. But yeah, I've used them a few times and sometimes I'll run them locally. I use them. But I think the most efficient way that I found is actually, I mean, what was the tracing tool in Erlang? Is it recon_trace? Am I making this up? There's a tracing tool in Erlang that can do...
capture very simple function calls. I might be making this up. There is a name with the tool. I might be making the name up, but I think something.
Sascha Wolf (19:27.531)
There is a Recon Trace tool, I'm not familiar with it though.
Adi (19:30.066)
Okay, okay, okay. Yeah, I think that's what I use if I'm doing something locally. That's what I rely on, because obviously the transmitters and the processes that capture the traces, connect them, and transmit them to your provider, Datadog or whoever, are not as available locally, right? So
recon_trace is very easy to set up and very easy to get information from, and it has a nice syntax, and it prints to a file which kind of looks like a Gantt chart, which is so helpful. Benchee is, I think, amazing. It just gives you awesome data, like the whole, you know, standard deviation, median, average, all that very, very useful data.
So yeah, those are what I use when I run locally. But my instinct always is to, once I get very high level, with low amount of local data, I want to run it in an environment, like a production-like environment, and then get more data. And I eventually use Datadog, AppSignal, all these services for that. What I really like about AppSignal, and I'm a huge fan, because the thing
I think Datadog's interface is very complicated. It is a huge learning curve. And by the time you get to a point where you have created the dashboards based on how you want to view data, and others are forced to learn it that way, it's just so much investment. AppSignal gives you logs. You can add structured logging as well. There's an awesome blog post by Sophie DeBenedetto that I will call out at the end of the podcast.
Sascha Wolf (20:49.935)
I agree.
Adi (21:15.686)
And it adds logs. And using the timestamps, you can connect the Gantt chart of traces back and forth. You can go from traces to example logs and back and forth. It just was life-changing for me when I discovered this in 2018. And that's my go-to. Unfortunately, at my work we don't use AppSignal; I had to create something like this in Datadog, which doesn't work nearly as well. But everywhere else, I use
AppSignal. I have built MVPs for a lot of startups, I always use AppSignal, and it's always everyone's go-to place to see the impact of making, you know, say a performance improvement, or changing the critical path of the application. So yeah. But yeah, your question again: eprof and all these tools, I've used them for local stuff. But I feel like I've graduated to using,
um, Benchee to get more local insights. Sorry, it was a long answer.
Sascha Wolf (22:19.507)
No, I think it was a very valuable answer. Allen, what's your...
Allen (22:23.402)
I've used Benchee a lot before, actually. I never tried fprof, but maybe I should take a look at that. But Benchee is super simple. I used Benchee in a particular project where I was parsing thousands of XML files. And I was using, was it XML something? I forgot the name now. It's kind of required when you use AWS. Anyways, it basically was blowing out my memory
whenever I was trying to figure out what's going on. So I actually ended up rewriting some of the code, like half the code, in Rust, which is so much more performant, and I ran Benchee to kind of, like, profile the two to see the difference in speed. It was just insane. I wasn't getting proper metrics for some things, I think that's still a problem right now, probably because you're going outside the BEAM, I'm guessing. But, uh, definitely, it was helpful to kind of
Adi (23:13.983)
Yeah.
Allen (23:20.718)
Again, I mean, it was pretty straightforward. But I would like to kind of try out these other ones, like the fprof and everything. I didn't even know about them because Benchee is already so great.
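[Editor's note: for reference, a tiny Benchee comparison along the lines Allen describes. The two parser modules are made-up stand-ins.]

```elixir
# Compares two hypothetical implementations; memory_time enables the
# memory measurements Benchee can also report.
sample_xml = "<items><item>1</item><item>2</item></items>"

Benchee.run(
  %{
    "elixir parser" => fn -> MyApp.XmlParser.parse(sample_xml) end,
    "rust nif parser" => fn -> MyApp.RustXmlParser.parse(sample_xml) end
  },
  time: 5,
  memory_time: 2
)
```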
Sascha Wolf (23:33.835)
Yeah, Benchee is super useful for getting an insight into how fast certain things are performing and to compare those, right? Like fprof and cprof and eprof, again, I would have to look into the specifics there, but they are useful for understanding the performance, yeah, more in like a tracing form, of a particular piece of code. And I think you can kind of summarize what you just said, Adi, right? Like those tools, Benchee, cprof, eprof, fprof, those are very useful for when developing.
Adi (23:48.414)
more like tracing.
Sascha Wolf (24:02.947)
Like when you are actually on your machine building stuff, like you want to figure out, OK, why is this thing slow, for example, or which of those different approaches is faster. So before it hits production, I would say. I'm not saying that they are useless in a production context, but I personally have never. You can, in theory, go into a remote console, then use them on a running system. I've.
Adi (24:14.794)
Yeah.
Sascha Wolf (24:29.931)
Literally never had the need to do that. Probably some of the older Erlang people would now scoff at me. But I think for our listeners, as somebody that's getting into the ecosystem, then it's a good rule of thumb to think about these tools as the things you use when building things, when trying to understand the performance of a particular piece of code. Not necessarily when you want to understand
code that is running right now in production. That is something, as you just mentioned, where platforms like AppSignal and Tracing, and I'm not sure where ReconTrace would fall into that, are more interesting. Yeah, okay.
Adi (25:12.574)
Yeah, it would.
Yeah, it gives you, so, with Mix, Mix also has tracers, right? Like you can get the compile-time information by writing a compiler tracer and capturing all those remote function calls and stuff, but recon_trace is doing that for runtime.
Sascha Wolf (25:34.175)
Okay, interesting.
Sascha Wolf (25:38.735)
And then you have, I mean, you have telemetry. Adi, Allen, does any one of you feel up for defining what telemetry actually does?
Allen (25:51.062)
Telemetry, I think, is kind of more like getting some metrics about certain pieces of your code. A lot of times it's about memory usage or time. I think a lot of times we use it, like, to time when something happens. So a really good use case that maybe not everybody knows about: when you're doing development locally with Phoenix and something hits your endpoint, it tells you how long it took to give a response.
That's actually done using telemetry, where, I'm pretty sure what happens underneath is that there's a start time and then an end time when it gets returned within your endpoint. And that's basically telemetry right there. Is that how you would say it, Sascha? I don't know. There probably could be a more elegant way to say it. But that's for sure one example.
Adi (26:42.579)
Yeah, good session.
Sascha Wolf (26:44.768)
That is also how I used to think about telemetry, because I never really looked into it in depth. And I've had to do that over the past few days, and I've come to understand that telemetry is actually, well, it is used for doing that. But it's surprisingly simple, because telemetry itself doesn't really do much, to be honest. It's not even that, that is the thing.
Adi (27:08.278)
I think it's an event store.
Sascha Wolf (27:11.607)
It's not even an event store. Like, what telemetry fundamentally does is it gives you this module you can call to say, okay, this thing happened, right? It has an execute function, which is more of, okay, this thing happened right now, the event happened with some measurements and some metadata. And it also has a span function, which under the hood just calls execute two times, right? Like start and potentially
not just stop, but also an exception case, when an exception gets raised, stuff like that. So it wraps that somewhat nicely. But that's it. Those are the things it does. And then, on the telemetry level, you can register handlers, which can do whatever with that information. And that is it. That is what telemetry does. Not more, not less. The thing is...
that there's a whole bunch of additional libraries you can use, for example telemetry_metrics. And that is something which you can then use to, for example, have a handler which aggregates metrics. And that is something which usually ships with Phoenix. It kind of comes predefined with a bunch of telemetry metrics, like CPU usage and memory and, good gosh, I don't know, a whole bunch of other things. But those, under the hood, use the telemetry events that get published
through this interface. And the interesting tidbit there is that telemetry itself doesn't even use any kind of processes. Like, there are no processes involved. Every time you call execute, that actually goes through the list of handlers that are subscribed to this particular event and just calls them inline. And under the hood, that particular handler then might put that into a process, right? Like it might
cast it to a process group or whatever, or put it into some kind of pipeline. Um, but by default it all stays inside of the same process. And I always looked at telemetry, and also at the events published by Phoenix through it and other libraries, and I kind of thought it was like this complex thing, as you just said, Adi, like an event store or something like that. And it's surprisingly simple. And I think some of the confusion, at least for me, came from the fact that
Sascha Wolf (29:27.643)
You have OpenTelemetry and you have telemetry, and you kind of would presume that they go in a similar direction. But turns out they really don't. Like, the only thing they have in common is they have a concept of a span. That's about it. And what I've now come to understand is, why do we even have this thing? Right? It's that before there was telemetry, you still had libraries like Phoenix or Absinthe or
a whole bunch of other libraries, which you as somebody that puts these things into production as an engineer, where you still would be interested, okay, like, how is this performing? Like, I kind of want to monitor this. I kind of want to peek inside and get an idea about how long some things are taking, what kind of things are happening. But there was no unified API for making that happen. So if a library bothered with that at all, some just didn't, right? Then it...
did its own homebrewed thing. And that is where telemetry comes in. Telemetry is kind of the unified model for tackling this. To say, okay, you as a library, you can use telemetry to emit events and to have spans. And then your consumers, the people that are actually installing you and using you, they can subscribe to these events and do whatever with them.
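[Editor's note: to make that concrete, a minimal sketch of the telemetry API being described. The event names, handler, and metadata are made up.]

```elixir
defmodule MyApp.CheckoutTelemetry do
  require Logger

  # Handlers are plain functions, attached once (e.g. at application start),
  # and they run inline in whatever process called execute/span.
  def attach do
    :telemetry.attach(
      "log-checkout-stop",
      [:my_app, :checkout, :stop],
      &__MODULE__.handle_event/4,
      nil
    )
  end

  def handle_event([:my_app, :checkout, :stop], %{duration: duration}, metadata, _config) do
    ms = System.convert_time_unit(duration, :native, :millisecond)
    Logger.info("checkout took #{ms}ms", order_id: metadata[:order_id])
  end
end

# Emitting side: span/3 wraps a function and emits :start and :stop
# (or :exception) events, with a duration measurement on the :stop event.
:telemetry.span([:my_app, :checkout], %{order_id: 123}, fn ->
  result = :ok
  {result, %{order_id: 123}}
end)
```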
And for example, what I'm building now at work, after I really grokked this, and I had to read through the source code of this thing to really understand, okay, there's not more to this than the things I just laid out: the abstraction layer I'm building is actually using telemetry. Like, it is using telemetry, and now I'm building an OpenTelemetry handler that takes these telemetry events and translates them into
OpenTelemetry concepts, so like into spans and events and logs and so on and so forth. And in theory, you could even, if you wanted to, replace your logging mechanisms with telemetry execute calls and then have a handler that would actually put them into logs. Like, that is a thing you could do. I'm not sure why you would do that, but you could. Or the other way around, you could have, like, a logger formatter, or a logger backend, I forget, right? You could have a log formatter that re-emits these things as telemetry events.
Adi (31:38.604)
Yeah.
Sascha Wolf (31:43.843)
because there is not much magic happening under the hood. And yeah, that is actually what telemetry is. So I hope, for whoever is listening right now, this kind of dispels the magic around the library. Because while the documentation says that this is this unified library meant for monitoring purposes, I really didn't grok it until I used it and looked under the hood.
And I'm not even sure why, maybe because it looks like more from the outside.
Adi (32:21.042)
Yeah, I think I probably didn't say it right. That's kind of what I meant: a way to capture and handle certain events, but making sure to take their time into account as the span comes in. But yeah, you're right. It's a way to dynamically dispatch events while making sure that time is taken into account.
Sascha Wolf (32:44.823)
Yeah, but not even using any of the OTP goodies we are kind of accustomed to see. Yes, yes. And that is, I think that is for me, at least the thing that tripped me up.
Adi (32:50.27)
Right, right, there's no process, nothing, yeah.
Adi (32:55.906)
There's no state to it, yeah, yeah. And I think that's awesome, right? It really is so agnostic of how you store the state of your span between different execute calls, right? Different handler calls. And I think it's great. If your spans, for example, and in my company it's a problem, if your spans can go really long and...
you have multiple calls being made to your application, and you store the state of each span, you could start using memory very quickly. And with telemetry not relying on that, not storing a state somewhere, and just assuming the state is captured agnostic of the application, outside of the scope of the application, it's a very good architecture.
Sascha Wolf (33:50.359)
Yeah, the really nice, the really, really powerful tidbit, and that is what I've only been grokking now, is actually... because on the surface it might seem like, okay, this is neat, but why not use OpenTelemetry directly? Right? And I would have said to that, like, yeah, you're right, a few weeks ago. But now I've actually understood, okay, the thing here is:
I can now have this handler that translates it into OpenTelemetry concepts, but I can also have a handler that kind of aggregates it into a metric and makes it accessible through Phoenix LiveDashboard. I can have a handler that enriches my logs with it. I can have a handler doing something I don't even know about right now, because it is all so bare, without any bells and whistles, that in terms of the extensibility there, the sky is the limit, basically.
If you can think of it and the information is available through the events, you can probably build it. And that is kind of the beauty of it. One of the things, again, coming back to the architecture we're going with, with these different separate bounded contexts inside the application, we've already been thinking: can we have dashboards for each of these bounded contexts, be they in Datadog or be they locally served?
Buying into telemetry here, and using that as our mechanism to track events, like monitoring events, and spans and traces, gives us that ability. We can still export them to OpenTelemetry and have them in Datadog. But we can also aggregate them and represent them in a different way that makes sense for that particular subdomain, without jumping through multiple hoops, like going through a Datadog agent that ships everything to Datadog and then maybe aggregating it there, and so on and so forth.
Right.
Sascha Wolf (35:43.639)
So maybe coming back to the very first question, what is telemetry? Telemetry is a library that allows you to decouple your metrics and your monitoring logic from the actual business code, by occasionally sprinkling calls to telemetry into that code to say, okay, this thing happened now, this thing happened now.
Then you can, in a different place, subscribe to those things and make monitoring happen, basically. That is what telemetry does. Not more, not less.
Sascha Wolf (36:27.403)
Any other tools you found useful?
Adi (36:30.558)
I just wanted to highlight what you said, Sascha. It's nothing more, nothing less. To give an example, you can put a telemetry execute in your router pipeline and use that as a rate limiting mechanism. It could have nothing to do with logs or monitoring. Just literally, every time someone makes a request, IP address to endpoint, store that somewhere in an agent, and literally make it independent
Sascha Wolf (36:56)
Yeah, yeah.
Adi (37:00.178)
of your actual code. And yeah, anyway, it just literally can be used for anything.
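[Editor's note: a hedged sketch of the kind of thing Adi means. The Phoenix endpoint event name is real; everything else is made up, and this is not a production-grade rate limiter.]

```elixir
defmodule MyApp.RequestCounter do
  # Counts requests per remote IP in an Agent, driven purely by a telemetry
  # handler, so the router code itself never mentions rate limiting.
  use Agent

  def start_link(_opts), do: Agent.start_link(fn -> %{} end, name: __MODULE__)

  def attach do
    :telemetry.attach("count-requests", [:phoenix, :endpoint, :stop], &__MODULE__.handle_event/4, nil)
  end

  def handle_event(_event, _measurements, %{conn: conn}, _config) do
    Agent.update(__MODULE__, fn counts ->
      Map.update(counts, conn.remote_ip, 1, &(&1 + 1))
    end)
  end

  def count(ip), do: Agent.get(__MODULE__, &Map.get(&1, ip, 0))
end
```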
Sascha Wolf (37:07.071)
As long as the information you need is present, right? That is the whole tidbit. The information needs to be passed along, but that's it.
Adi (37:09.654)
Great.
Allen (37:25.29)
Yeah, there was something else that came to my mind, but now I lost it. I think the other thing, too, is general metrics. Maybe we kind of cover this about CPU usage and this kind of stuff. And also, even using Observer to peek into processes and seeing if their mailbox is filling up to have this kind of idea about, do I need to add more workers? This kind of stuff. Or even like.
database query time, but I guess you can just use telemetry with that. I mean, definitely use telemetry to kind of log query times and figure out, like, do I need to add something, or bump up my database specs? Bumping up database specs obviously helped a lot. So did adding proper indexing. That helped a lot too.
Adi (38:12.79)
Yeah, that's a good point on the database. I think with the database, you'd also need insights from the database too, right? Just getting the client-side insights, like how long a query takes, obviously will help to some extent. But then there could be a million reasons why database queries are slow. It could very well be CPU waits, certain tables being replicated, for example, if you're using replicas and stuff. So yeah, I think that's a good point.
Yeah, the database level insights also would be very useful.
Sascha Wolf (38:53.091)
I think, I mean, Ecto, for example, is also instrumented using telemetry, so you have events there. Um, the one thing, because, I mean, we now have this open standard, and now we're leaving BEAM land a little bit, but we have this open standard of OpenTelemetry, and, I mean, if you look at the different implementations for OpenTelemetry, the BEAM is really kind of spearheading this. Um, and what I've always wondered is, are there
Adi (38:57.099)
Yeah.
Sascha Wolf (39:21.379)
plans, or is there already integration of this kind of open standard of observability into technologies like Postgres? Because, I mean, why not, right? Like Postgres, if you wanted to, could also emit metrics and logs and traces in an OpenTelemetry-compatible format. Same for Redis or any other technology out there. And I feel...
I'm not, I mean, like maybe any of you knows, I genuinely don't. I do know that I thought this already like five years ago, that would be super neat, right? Like if I could kind of extend my traces beyond the reaches of my own system, beyond the things I control, and enrich the traces through the things that are happening inside of those systems.
Adi (40:12.274)
Yeah, I think connecting the traces would obviously be very hard. But I know that Postgres has their... I know that different proprietary tools, like Datadog, that's what you use a lot, Sascha, right? They have a very good OpenTelemetry collector for Postgres. It's just hard to connect your application traces with Postgres. It just would be excellent, especially if you start using PgBouncer or some other connection pooling in between, on top of Ecto's pools, right? It just would be
Sascha Wolf (40:40.9)
Yeah.
Adi (40:42.046)
almost impossible to connect if that's where you're going.
Sascha Wolf (40:44.811)
Yeah, I am going there. But as far as I know, OpenTelemetry does also specify standards for doing that, like for how to communicate trace information between separate services across the network. So again, if something like, say, the maintainers of Postgres really wanted to, they could build that in.
Adi (41:08.91)
I think the... yeah, I'm sure you can. I just don't... I think it's beyond... I think at least the way OpenTelemetry works in a distributed fashion is like having some kind of a unified context, right? I think that, especially if you have, like, active sessions, that would just be very hard to track, right? Like, that's the whole point of Ecto connection pools, to keep
connections alive and make different queries related to different traces, different pipelines. I am sure there's a solution. With the current way open telemetry works, I just can't think of one. But maybe I'm missing something.
Sascha Wolf (41:59.687)
Yeah, I'm also not an expert on the particular implementation details, but I've always looked at it from the consumer side of, like, this would be really neat. You know, like if I could kind of see... I mean, like I said earlier, we spotted this performance issue in one of our Ecto queries by actually instrumenting every Ecto repo call. But just imagine if we could have kind of gotten that for free, because something like Postgres
Adi (42:09.407)
Oh yeah.
Sascha Wolf (42:26.091)
actually exports OpenTelemetry-compatible data, and you can just import that and correlate that with the calls you do to the database. It would be so powerful. And I've been surprised to see, honestly, and if any of the listeners out there know about this, please let me know, but I've been surprised to see that, now that we do have these open standards, not more of the
Adi (42:36.556)
Yeah.
Sascha Wolf (42:54.635)
software that runs the world, so to speak, and Postgres is kind of part of the software that runs the world, is investing time into having an implementation of this standard and allowing people to monitor their software this way.
Adi (43:17.03)
Yeah, I mean, I do think Datadog's Postgres, forget the word they have, whatever adapter they have, I think that is pretty good. Yeah, I think how it works is I think it ties a query to a connection or a session based on a time frame. So as long as at your application level, you can tell that this span used this connection. You can like with some.
sort of, you know, there'll be some margin of error, obviously, right? But within that margin of error, you can tell that these are related. I know that there have been efforts made by Datadog, and AppSignal kind of has a database tool too, not as good as Datadog's, but I think...
Sascha Wolf (44:23.643)
Okay, so maybe let us summarize. We have tooling like eprof, fprof, Benchee and so on and so forth for local performance optimization and measuring. We have tools for understanding a running system at large, to which logs contribute of course, but also metrics and traces. And we have telemetry, which is honestly kind of the glue in between, making it more agnostic
and removing the necessity for every library to homebrew it. What else is there? I mean, we also briefly talked about Observer and LiveDashboard. Like, what about all of the other metrics and information you can export from a running BEAM instance? Right, like the schedulers, for example, you can get information on those. And I think LiveDashboard does ship
with some OTP-specific information out of the box in terms of showing that to you. I don't have it in front of my mind's eye right now. What is the story there? What kind of information do you get there, and how do you get it? What is the interface looking like? Does any of you know, Adi, Allen?
Allen (45:45.198)
I think you can see some of the running processes. And there's also some, I forgot, it's been a while, but maybe I have to look it up. But I think there's a way that you can take a look at the requests coming in.
Adi (46:01.546)
Yeah, I think one of the things that was useful for us was also memory allocation; in Elixir, obviously, a process copying the entire memory can get very problematic at points. I think that's what I remember using Observer for. I think one of the top tabs had memory allocators, and there's a processes tab as well. So, yeah, that's useful. I think it also tells you, for each process...
I forget, is it the get_state function that returns what the current function being executed in the process is? I forget what function you call, but you can see, for each process, what function the process is currently executing, in one of the top tabs. I think it's called, like, trace or something. But yeah, I haven't used Observer in, I think it's been like four or five years,
but it was a very neat tool.
Allen (47:01.47)
Yeah, you can also plot some of the metrics, right? So if you have some specific custom telemetry, you can plot them out and see what they were over time. So it's called the request logger, where you can see information such as, say,
For certain requests, what are all the logs that came out of that request? And there's also something that's really interesting, too, is the OS data, or the home page. You can see what version of Phoenix you're running, version of Erlang you're running, how much memory you're using, and even some specific information about the computer itself. I think I almost ran into a problem where we're actually running out of disk space. So you can also see that kind of stuff, too, for the OS data. And also the memory, general memory, how much you have. This is quite a lot of interesting stuff. It's kind of like you can see things
Observer would normally give you. But also some other stuff that I think Observer doesn't give you, like how much disk space you have left on the host and things like that.
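[Editor's note: for listeners who want the custom-metric plotting Allen mentions, a sketch of the metrics list LiveDashboard reads, in the shape the Phoenix generators produce. The my_app import event is a made-up example.]

```elixir
defmodule MyAppWeb.Telemetry do
  # A trimmed-down version of the module Phoenix generates; LiveDashboard's
  # metrics page is pointed at this module's metrics/0 in the router, e.g.
  #   live_dashboard "/dashboard", metrics: MyAppWeb.Telemetry
  import Telemetry.Metrics

  def metrics do
    [
      # Built-in Phoenix and Ecto events
      summary("phoenix.router_dispatch.stop.duration", unit: {:native, :millisecond}),
      summary("my_app.repo.query.total_time", unit: {:native, :millisecond}),

      # VM metrics emitted by :telemetry_poller
      last_value("vm.memory.total", unit: :byte),

      # A custom event of your own, e.g. :telemetry.execute([:my_app, :import], %{rows: n})
      summary("my_app.import.rows")
    ]
  end
end
```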
Sascha Wolf (48:01.135)
Does any of you know how the dashboard actually collects all this information under the hood? I mean, I also know that it, for example, shows the numbers of registered atoms. And I know that those are things you can readily get from the runtime. But for example, the specific details of what is the mailbox of a particular process, something that you get through observer, for example. Do you know what kind of interface, what kind of integration, how that works?
Adi (48:27.106)
I think it is this tracing. Yeah.
Sascha Wolf (48:28.827)
tracing, okay, but like what specifically? I mean, it's all of this, like, I mean, that is again, because like, this is like OTP land, right? This is core library, standard library things. So how are those things instrumented? Do you know that?
Adi (48:44.766)
I think, from what I remember, and Allen, correct me if I'm wrong, it looks like you've used Observer more recently than I have, I think you can get Observer data for process state and traces and stuff only after you start Observer. That makes me think that it uses tracing. Traces have not been started unless you start Observer, right, because that's a relatively expensive operation. The process information, like I said, that you can get at runtime, that you can get anytime, like, you know,
the number of schedulers and all that stuff. So again, that's why I think it uses tracing. It is also not real time. I remember, I think it probably has a refresh interval to decrease load, too. I remember, if you're running a lot of operations, it wasn't very, yeah, real time.
Allen (49:39.038)
Yeah, for sure, I've started up Observer after starting up a lot of work in the background. And looking at the load chart, you kind of see it starting after some time. So for sure, it must be after it started.
Adi (49:39.254)
That's my guess.
Sascha Wolf (49:58.615)
I'm mainly asking because, I mean, we briefly talked about telemetry. I mean, not honestly briefly, we talked a bunch about telemetry, but telemetry is outside of the standard library, right? Like it's a library you install. So I'm actually curious to understand how these standard library things are instrumented and what are they using under the hood and how easy it is also like for a day-to-day engineer to hook into that. It's probably not super hard, but I mean, I've been working with Elixir for over six years now and I don't know.
Sascha Wolf (50:33.643)
And apparently the two of you also don't.
Allen (50:36.722)
Sorry, the question was about how the telemetry stuff works. I didn't quite.
Sascha Wolf (50:40.223)
No, no, not how the telemetry is, but how, how the standard library is instrumented to get this information. Like what, I mean,
Adi (50:47.456)
I will, I'm willing to bet 100 bucks that it is tracing.
Sascha Wolf (50:53.075)
But what exactly are they? What is the module?
Adi (50:55.646)
I think there's a library, like... yeah, I think there is, I forget, I think it's called ttb or something like that. There's a tracing library. Let me look that up right now.
Adi (51:06.53)
ttb, Trace Tool Builder. I think that's what it uses.
Sascha Wolf (51:14.327)
Okay, there's also Erlang Tracer, I just saw that. There's also a module that exists. But interesting. I never had the need to dig into any of those details on this level. But again, like I mean, like when I come back to the very first question I asked at the start of the episode, right, it would be so cool if I could get kind of an overview, like if I would have a guide to go as okay.
these exist, if you wanted to, you can reach for them. Because even to this day, I discover modules and details about the beam where I think, oh, this is useful to know. And sometimes even, oh, I only wish I would have known about this, I don't know, a year ago. And one of these secret tip modules is like, how's it called again?
There's an Erlang module which you can use as a kind of global key-value storage beyond ETS. I always forget what it's called. But basically it allows you to write to it, and the writing is a slow operation. It's very global... global, yeah. No, it's not global, it's not global. Persistent term. :persistent_term is the module. And it is some... yeah. And honestly, like I...
Adi (52:23.403)
Is it just global?
Is the name global? Okay.
Adi (52:34.518)
Alright, yeah.
Sascha Wolf (52:37.407)
I didn't have to reach for it very often. The thing about persistent_term is, reading from it is, for one, super fast. Writing to it becomes slower the more keys you basically write to it. So it's really optimized for reading, but in some use cases it is really useful. And, um, for example, I've used it in one context where we had to read a configuration file from disk and then had to refresh it regularly, right? So there was one process that read from the file, put it into persistent_term,
and every minute read it again, check if it changed, and put it in there. But if you don't know about these things, that is the thing. If you don't know about them, then how could you use them? And I feel that, especially in the terms of observability and monitoring and so on and so forth, the beam has some really, really powerful modules and capabilities, but I probably don't know half of them. And it would be so useful.
Adi (53:10.614)
Yeah.
Sascha Wolf (53:34.715)
to have one place to go to and get an overview of, like, okay, for example, OTP is instrumented like this. And maybe some of our listeners are coming more from the Erlang side and they think, hey, there are actually some guides on that, here's a link, and I can link to that. But obviously I didn't find them. So yeah, I just wish there was a place I could go to that kind of answers, like, what kind of tooling
now does make sense for my monitoring, for my observability use case.
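[Editor's note: a minimal sketch of the persistent_term refresh pattern Sascha described a moment ago. The module, file path handling, and refresh interval are made up.]

```elixir
defmodule MyApp.ConfigCache do
  use GenServer

  @key {__MODULE__, :config}

  def start_link(path), do: GenServer.start_link(__MODULE__, path, name: __MODULE__)

  # Reads from :persistent_term are very cheap and involve no copying.
  def get, do: :persistent_term.get(@key, nil)

  @impl true
  def init(path) do
    reload(path)
    schedule()
    {:ok, path}
  end

  @impl true
  def handle_info(:reload, path) do
    reload(path)
    schedule()
    {:noreply, path}
  end

  # Writes are the expensive side (updating a term triggers a global scan),
  # so only write when the file contents actually changed.
  defp reload(path) do
    contents = File.read!(path)

    if contents != :persistent_term.get(@key, nil) do
      :persistent_term.put(@key, contents)
    end
  end

  defp schedule, do: Process.send_after(self(), :reload, :timer.minutes(1))
end
```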
Sascha Wolf (54:14.751)
Okay, is there anything you would like to add? Any nuggets of wisdom we would like to give to our listeners before we go to picks?
Allen (54:24.674)
There's just way too much stuff in the BEAM. It's hard to keep track. I don't know how the hell Adi learns all this stuff. And even you too.
Sascha Wolf (54:31.192)
I'm sorry.
Adi (54:33.266)
It just, it just, I, I went, I didn't have much of a life.
Sascha Wolf (54:40.088)
Hahaha!
Allen (54:41.59)
I mean, you do have a life. You don't have much life outside of your PS5, right? So that's my understanding. Oh yeah, I'm just impressed with all the stuff that the BEAM has, it's insane. And, like, I still have to defend why Elixir and Erlang is, like, a...
Adi (54:47.694)
Good point, good point.
Sascha Wolf (54:53.083)
Okay.
Allen (55:05.054)
a good choice for making systems out of, because it's just not talked about. It's funny that people use it a lot, but nobody ever really talks about it compared to other things. So, like, a lot of times they're like, oh, Pinterest is using it, Apple is using it, you know, Pepsi is using it. It's just insane. And people still knock on it. I don't know, it's like we're spending our time building things with it so we don't have time to talk about how awesome it is. It's kind of weird.
Sascha Wolf (55:36.287)
I agree, I agree. I mean, it kind of makes sense when you consider that the beam has been around for like three decades now, right?
Adi (55:46.731)
Yeah.
Sascha Wolf (55:49.139)
Okay, then let us go to picks. Adi, what is your no-life pick?
Adi (55:56.19)
I guess since Sascha was saying he wished...
yeah, he knew more about some of these hidden gems in Erlang, I will mention one of the ones that came to mind. You know, Erlang has a trace function, right? You can trace all the process calls. Great for mocking and all that, right? That's :erlang.trace if you're using Elixir. There's also an :erlang.trace_pattern function. That is great for tracing function calls, which I have not seen a blog post about. So...
I know people use :erlang.trace a lot for process message passing tracing, but check out trace_pattern. That is so useful. Very useful to know for testing, you know, to make sure certain functions are called, and that delegates delegate properly when testing delegates. Yeah, it's super neat.
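[Editor's note: a small IEx-flavored sketch of the trace_pattern pick. The traced function is just Enum.map, and the traced process is a spawned one so the shell can act as the tracer.]

```elixir
# Spawn a process that will call the function we care about when told to.
pid =
  spawn(fn ->
    receive do
      :go -> Enum.map([1, 2, 3], & &1)
    end
  end)

# Trace calls made by that process, with the shell as the tracer,
# and set a pattern for the specific function we want to see.
:erlang.trace(pid, true, [:call])
:erlang.trace_pattern({Enum, :map, 2}, true, [:local])

send(pid, :go)

# In IEx, flush() then prints something like:
#   {:trace, #PID<0.123.0>, :call, {Enum, :map, [[1, 2, 3], #Function<...>]}}
flush()
```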
Adi (56:56.822)
Sascha's on mute.
Sascha Wolf (56:57.791)
Interesting, I've never heard about it. So yeah, good pick. Allen, what are your picks?
Allen (57:05.994)
Yeah, as I mentioned earlier, I wanted to pick that video, Shepherding, from Amos King, from GigCityElixir 2023. I think it's a great video. He talks about different ways that you can kind of figure out what's going on, what's going wrong with your program. And he actually does it, like, real time: he made a program, there's a bug in it, and he talks about how he
started from the outside in and started to figure out, OK, well, I think this is what's wrong. And it started instrumenting to see what is the problem and figured out and fixed it. Anyways, I think it's a really great video for this topic, at least.
Sascha Wolf (57:41.959)
Nice. I love that you both have picks that are relating to the topic because I don't. Sorry. My pick for this week is actually like a purely entertainment based pick because I recently got a Steam Deck and I really like it. It's a really nice piece of hardware.
Allen (58:02.062)
You copied off of me. I've been talking about Steam Deck for a long time now. I think you said, you told me you didn't want to get a Steam Deck. If I remember correctly, you're like, no, I'm a PC guy. I'm not getting a Steam Deck.
Sascha Wolf (58:04.759)
Yeah, yeah.
Sascha Wolf (58:10.993)
Yeah, I think I said, like, I've really been thinking about getting a Steam Deck, but it's so expensive.
Adi (58:15.65)
That was me, that was me, Allen. I think I was the anti-Steam Deck one.
Sascha Wolf (58:20.134)
Oh.
But yeah, it's a really nice piece of technology there. The thing for which I got it for was like, I have a whole bunch of games in my library, which I really want to play, but smaller like indie-ish games. And I honestly, I don't sit down in front of my desktop PC to play those. I just don't. And like when I do that, I play something usually with friends online, right?
There are some of these games which are more, I don't know, for me they are more like cozy games and even if they are difficult to play, and I kind of want to huddle on the couch or like in my armchair and like with a nice cup of tea and play them there. And that is what the Steam Deck has been allowing me to do. For example, and then that can maybe lead into my next pick. A game I've been playing on my Steam Deck and I really, really enjoyed is Everhood. And Everhood is a pixel.
role playing game in like a pixel style. The thing is, it's a rhythm role playing game because you have battles against enemies, but it's kind of like music and rhythm and you have to dodge projectiles that are coming in the rhythm of the music. It's really difficult and weird and trippy, but super great.
So if you are into maybe a bit more difficult games, and also if you are into games that are breaking out of well-trodden and well-known formulas, then Everhood you should definitely check out.
Adi (01:00:02.186)
Nice.
Sascha Wolf (01:00:05.631)
Okay, folks, then I hope you found this useful. I hope it kind of gave you an all-around view on observability and tracing and metrics on the BEAM. Even if I don't think we could satisfy the initial question, but I also didn't expect us to, I hope you learned a thing or two. And maybe you come back to this episode at some point and think, hey, they talked about this one thing, let me check it out again. And honestly, if that happens, then job done.
It was a pleasure talking to you, Adi, Allen.
Adi (01:00:39.022)
Same here.
Sascha Wolf (01:00:40.88)
And I hope you all enjoyed listening to us, and tune in next time for another episode of Elixir Mix.