Deep Dive into Metrics and Monitoring with Prometheus and Grafana - JSJ 645
Show Notes
Dive into a fascinating discussion blending the worlds of literature, gaming, and tech. In this episode, Chuck and Dan explore the intriguing connections between The Hobbit and The Lord of the Rings, including an extraordinary tale about Israeli pilots translating The Hobbit during wartime. They share insights into Guy Gavriel Kay's standalone novel Tigana, inspired by Renaissance Italy, and discuss the complexities and strategies of board games like Monopoly and Letters from Whitechapel.
But that's not all. The episode takes a technical turn as the speakers delve into the dynamic world of application monitoring with Prometheus. They unpack the mechanics of event loop lag, heap usage, and GC storms, and share how Prometheus's query language (PromQL) and its integration with Grafana can help proactively manage and solve performance issues. Hear about real-time alerting, sophisticated querying, and the practical applications of these tools in companies like Next Insurance and Sisense.
This episode is packed with information - from managing performance metrics and alerting systems to insightful discussions on favorite standalone fantasy novels and the productivity hacks that keep our hosts on top of their game. So, sit back and join us for an engaging and informative session on Top End Devs!
Sponsors
Socials
Picks
- Charles - Letters from Whitechapel | Board Game
- Charles - TrainingPeaks | Empower Your Training
Transcript
Charles Max Wood [00:00:05]:
Hey. Welcome back to another episode of JavaScript Jabber. This week on our panel, we have Dan Shappir.
Dan Shappir [00:00:12]:
Hello from very hot Tel Aviv. Well, isn't that a surprise summer in Israel?
Charles Max Wood [00:00:19]:
Right. It's also pretty hot here. It's been getting up to 100 degrees, which is, like, 40, 41 degrees Celsius.
Dan Shappir [00:00:26]:
That's actually hotter than it is here.
Charles Max Wood [00:00:30]:
It's So
Dan Shappir [00:00:30]:
we've got it better.
Charles Max Wood [00:00:32]:
It's been that way here for over a month. It's been
Dan Shappir [00:00:37]:
frying your eyeballs?
Charles Max Wood [00:00:39]:
Just about. Anyway, I'm Charles Max Wood from Top End Devs, and we're gonna be talking about monitoring and alerting using Prometheus and Grafana. Now I've played with Grafana. This was proposed by Dan, and so I figured he's probably gonna do most of the talking. But, yeah, just to give a little context, I don't even know what Prometheus is. So, yeah.
Dan Shappir [00:01:12]:
Yeah. So to backtrack a little bit before we get into specifically what Prometheus and Grafana are, I wanna talk a little bit about monitoring and alerting in general. You know, I've been working on stuff like performance monitoring and performance optimizations for a good number of years now, as you all know, if you've been listening to this podcast. And probably one of the most important lessons that I've learned is that you need to have monitoring in place and alerting in place before you start any work on improving things. And in this context, I love a quote from Peter Drucker. Peter Drucker more or less created the field of business management. And the quote is, if you can't measure it, you can't improve it. And I'm really a big fan of that.
Dan Shappir [00:02:16]:
If you're unable to measure something, there's really no way for you to know if you're making progress, if you're improving things, or if you're actually degrading things or making no impact at all. And one of the things that I've done in this context is that whenever I, say, join a new company or start a new project, I often find myself under pressure to start delivering improvements, you know, from the get go, as quickly as possible. And I always push back on that in order to make sure that we have proper data collection and proper monitoring in place before we start making any improvements. By the way, in this context, it's probably not surprising that I joined Sisense, which is a BI analytics company, because I'm really a big believer in that. And, you know, as an extra benefit, one of the big benefits of having some sort of measurement and monitoring solution in place is that after you've made improvements, you'll have graphs to show the impact of your hard work. And from my experience, that's really beneficial when you're, let's say, looking for a raise or advancement or something like that. But really, in order to be successful in any project that requires taking a system and trying to, let's say, improve it, having some sort of a monitoring capability in place is crucial. Now the question obviously then becomes, what do I actually use for that? I mean, you know, it's fairly straightforward to collect a lot of data these days, but what do you do with it? Where do you put it? How do you process it? How do you visualize it, etcetera? And in this context, I want to talk specifically about 2 things, which are Prometheus and Grafana.
Dan Shappir [00:04:26]:
And first, I'll start with a riddle for you, Chuck. Mhmm. Do you know, like, who Prometheus was? Like, you know
Charles Max Wood [00:04:35]:
He was a Titan that brought us fire. Right? Wow. Greek mythology. And he was punished by having, what, an eagle eat his entrails out for eternity or something like that?
Dan Shappir [00:04:49]:
He was tied to a mountain, I think in the Tartarus Mountains, and then Zeus's eagle would come once a day and eat his entrails. And because he's immortal, he cannot die, so he suffers forever. By the way, in later times, the Greeks or Romans kind of thought that this was, like, too bad of a fate, so they had Hercules release him or something along these lines. Right. Now I can't prove it, but I think that the reason that this project was called Prometheus is because it's about bringing knowledge, because fire in this context is, like, synonymous with knowledge, knowledge from the gods to the humans. So it's about bringing knowledge to us developers about how our systems are operating. So that's the mythology of what Prometheus is. Now let's talk about what Prometheus as a service is.
Dan Shappir [00:05:50]:
So Prometheus is free, open source software that you can install on premises, or, I think, use as a service as well, and it's used for event monitoring and alerting. It was originally created something like 12 years ago at SoundCloud, when they came to the conclusion that none of the third party solutions for monitoring were sufficient for their needs. And after they built that system and used it internally, about 4 years later, they donated it to the Cloud Native Computing Foundation, which is the same foundation that also hosts the Kubernetes project. So Mhmm. You know, this is another successful project from that foundation. I think the people who are mostly working on it these days are the people from the company that does Grafana. But, again, it's open source, and you can see the source code on GitHub, and there are a lot of contributors. I actually personally contributed to one of the satellite projects around Prometheus, which is the Prometheus client for Node, which makes it possible to connect Node.js to Prometheus in order to monitor Node.
Dan Shappir [00:07:20]:
So I contributed specifically to that part of the project.
Charles Max Wood [00:07:26]:
I'm gonna stop you just for a minute. I've been posting the links in the comments, but they don't go to X, and that's where most of our live listeners are. So Prometheus is prometheus.io, and Grafana is at grafana.com. So anyway
Dan Shappir [00:07:45]:
Yeah. That's true. Okay. So where were we? So we were talking a little bit about the history of Prometheus, both the mythological figure and the project. Now let's talk a little bit about what it actually is. So it's a service used for event monitoring and alerting. It records real time metrics in something called a time series database, which is kind of a special type of database, and we'll talk about it in more detail and how it differs from the databases that most of us are familiar with. It allows for something called high dimensionality, which I also will try to explain.
Dan Shappir [00:08:34]:
It supports flexible queries and real time alerting. And as I said, it's free software. It's licensed under the Apache 2 license. So that's what Prometheus is. So let's say you want to use Prometheus in your organization. What you would do is that you would install the Prometheus service and then hook it up to your various services that you want to monitor. Now it's a monitoring solution for back end infrastructure. So things like Node.js
Dan Shappir [00:09:12]:
or for, you know, the JVM, or for something that's, say, written in Go, which is not surprising because Prometheus itself is actually written in Go. So maybe it's a shame that we don't have AJ on the show this time.
Charles Max Wood [00:09:28]:
Yeah. Maybe. But your point is that, you know, any language or system could have a driver that pushes the
Dan Shappir [00:09:36]:
There's basically a connector for anything. There are also connectors for a lot of general services. So if you want to monitor, let's say, we were talking about Kubernetes. You can monitor Kubernetes. Kubernetes has a built in connector for Prometheus. So you can look at how pods are functioning, or the Kubernetes cluster itself. There are connectors for various AWS services and so on and so forth. So you can collect a lot of data from, you know, third party services and infrastructure, and you can also attach it, like, create applicative level monitoring.
Dan Shappir [00:10:20]:
So you can monitor the behavior of your own applications that are running on platforms such as the JVM, such as Node.js, or in Go, etcetera. Right. More or less any programming language that you can think of. Now the way that you configure the system, so, again, not very surprising given perhaps that it's written in Go, the configuration is YAML files. And, again, this kind of correlates with Kubernetes. So YAML, for those of our listeners who somehow don't know, is a configuration format. You can think of it as, to an extent, kinda sorta similar to what we usually do with JSON files, but it's a different format.
Dan Shappir [00:11:09]:
It has certain advantages over JSON. For example, it supports comments. Right. It's used for a lot of administrative stuff. So any DevOps person is likely very familiar with YAML.
Charles Max Wood [00:11:26]:
The default configs for Rails are all done in YAML as well. So Mhmm.
Dan Shappir [00:11:32]:
Yeah. So you basically create configuration files for Prometheus in YAML. And the way that Prometheus works is kind of the reverse of what you might expect. So you might think that you somehow configure various systems to push data into Prometheus, but that's not how it works. The way that it actually works is that Prometheus pulls data into itself. So in the YAML file, you tell Prometheus the addresses of the various services that you want to monitor and the rate at which it should effectively ping those services. And it effectively does an HTTP GET to endpoints exposed by these services and downloads data from them. So it actually pulls the data from them into itself.
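As a rough illustration of the pull model described here, a minimal Prometheus scrape configuration might look something like this sketch. The job name, target addresses, and interval are placeholders, not details from the episode.

```yaml
# prometheus.yml (sketch): tell Prometheus what to scrape and how often.
scrape_configs:
  - job_name: 'node-app'             # hypothetical job name
    scrape_interval: 60s             # how often Prometheus does its HTTP GET
    metrics_path: /metrics           # the endpoint the service exposes (the default)
    static_configs:
      - targets: ['node-app-1:3000', 'node-app-2:3000']   # made-up addresses
```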
Dan Shappir [00:12:33]:
Now the advantage of this approach is that, first of all, they don't need to be aware of where the Prometheus server is. So all the configuration is centralized. They just need to open the port, you know, listen on it, and that's more or less it. It also means that they can work with multiple Prometheus servers at the same time, because they all just pull the data. Because pulling the data doesn't clear the data. It's not as if, you know, they give the data and then forget it. They retain the data. Those services are expected to retain their data in memory, so you can just hit any one of them at any time and pull from them their current state.
Dan Shappir [00:13:17]:
I hope that's clear.
Charles Max Wood [00:13:19]:
Yep. Makes sense to me. I kinda like it too just from the standpoint of, so I've used other systems, paid systems. You know, we've been sponsored in the past by Sentry and Raygun and stuff like that that grab a lot of this information. Though I think we're talking kind of a level below that, right, where we're not talking specifically about the information that's being sent. We're just talking about how the information gets into the system at this point. I like the fact that it's like, okay. I'm gonna periodically check, and then I don't have 10,000,000 hits on the service on the other end.
Charles Max Wood [00:13:57]:
Right? Because I'm not pushing it to it. It's pulling it. And so it only does the work that it has to do. Right?
Dan Shappir [00:14:04]:
Yeah. Yeah.
Charles Max Wood [00:14:05]:
I don't I don't have all this extra network crap going on.
Dan Shappir [00:14:09]:
Yeah. So it basically does an HTTP GET, let's say, once every minute to any one of those services. The response is essentially text in their own format. It's quite readable, actually. So you could literally go to one of those endpoints and just hit it with your browser and see what the response would be. By the way, obviously, you probably want to make sure that those endpoints are not externally exposed. Mhmm. So, you know, that everything stays behind the firewall.
Dan Shappir [00:14:41]:
Right. The one kind of caveat to that is that sometimes you have, like, short lived services. Think about something like a Lambda. In that case, what they have is something called a push gateway, which is like a standalone service that those short lived jobs can push data into, because they are really short lived. So you can't assume that they'll hang around until they're pulled again. So they can push their data into that push gateway, which holds on to that data, and then Prometheus pulls that data from the push gateway. So it's kind of an intermediary service for those special cases. But in most cases, and the cases that I've used it in, you know, it was with long lived servers or services, and then, you know, that's just the way it worked.
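For short lived jobs like the ones mentioned here, prom-client ships a Pushgateway helper. A minimal sketch might look like this, assuming a gateway reachable at a placeholder URL and a made-up job name; the exact API surface can vary a bit between prom-client versions.

```js
// Sketch: a short-lived worker pushes its metrics to the Pushgateway before exiting.
const client = require('prom-client');

const registry = new client.Registry();
const jobsProcessed = new client.Counter({
  name: 'batch_jobs_processed_total',          // hypothetical metric name
  help: 'Jobs processed by this short-lived worker',
  registers: [registry],
});

async function runBatchJob() {
  // ... do the actual work ...
  jobsProcessed.inc();

  // Push the accumulated metrics so Prometheus can later pull them from the gateway.
  const gateway = new client.Pushgateway('http://pushgateway:9091', {}, registry);
  await gateway.pushAdd({ jobName: 'nightly-batch' });
}

runBatchJob().catch(console.error);
```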
Charles Max Wood [00:15:37]:
Right. That makes a lot of sense too. Yeah. Like, I'm thinking, like, serverless functions and things like that. Or, you know, we've also run background jobs a lot in the apps that I have, right, where it pulls it off the queue and then runs it. And so, yeah, in either of those cases, you don't want or necessarily need something hanging out so that it can say, oh, you're gonna query me within the next minute or so. It just says, poof, I'm done, and then hands it off. Right?
Dan Shappir [00:16:06]:
Yeah. So, for example, in the case of Node.js, there's a Prometheus client for Node. I think the project is literally, as I recall, called prom-client. So you just npm install it. And then, you know, it's in there. You just give it the port to listen on, and then, you know, let's say it uses Express or something like that, and it basically collects the information and exposes that on that port for you, and you don't really need to do anything to start monitoring basic system level stuff. Now the Prometheus server gets the data in and then saves it into its own persistent database.
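A minimal sketch of what that setup can look like with prom-client and Express, assuming a placeholder port; recent prom-client versions return the metrics text as a promise, so it's worth checking the version you're on.

```js
// Expose Prometheus metrics from a Node/Express service with prom-client.
const express = require('express');
const client = require('prom-client');

// Collects the default system-level metrics (CPU, heap, event loop lag, etc.).
client.collectDefaultMetrics();

const app = express();

// The endpoint that the Prometheus server will periodically scrape.
app.get('/metrics', async (req, res) => {
  res.set('Content-Type', client.register.contentType);
  res.end(await client.register.metrics());
});

app.listen(3000);
```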
Charles Max Wood [00:17:01]:
Right.
Dan Shappir [00:17:02]:
And that database is what I called a time series database. And what I mean by that is, don't think of it as something like a relational database or maybe a NoSQL database. It's not really that sort of a thing. It's something called a time series database. So, basically, it has metrics, and it saves the value of a metric at every point in time. So it really, like, keeps on collecting metric data. So there's no such thing really as schemas or something like that.
Dan Shappir [00:17:47]:
It just has metrics that it collects data into, and it's data collected over time. So, like, think of, I don't know, let's say you're a farmer, and you've got a field and you're measuring the temperature in the field. So you've got the temperature measurement, let's say, every minute that you got from, you know, whatever thermometer device you use to monitor the temperature, and it just gets recorded into that persistent storage. And you can go backward and forward in time and look at any point in time what the temperature was.
Charles Max Wood [00:18:30]:
Right. So I'm trying to imagine what this looks like for an app. So is it measuring, like, how much CPU it's using and how much memory it's using along with some of the other oh, we'll get into that. Okay.
Dan Shappir [00:18:42]:
Yeah. We'll get into that. But, basically, as I was saying, you can I'll
Charles Max Wood [00:18:45]:
I'll keep my enthusiasm down for a minute then.
Dan Shappir [00:18:48]:
You know what? Let's talk about that a little bit. So it's really monitoring 2 types of information. One, you can think about it as system level stuff. So in the context of Node, a Node server, that might be CPU usage or memory slash heap usage or event loop lag. Or if it's an Express service, the number of requests per period of time or the duration of the responses. Stuff like that. Okay. So those are system level things, and they're collected for you automatically.
Dan Shappir [00:19:31]:
So as soon as you npm install the prom client for Node, and it is loaded into your project, your Node server, then all that stuff is automatically collected for you. And when the Prometheus server, you know, hits that port, that information, those metrics, are available from the get go. On top of that, you can add applicative level metrics that you push into Prometheus using an API. So for example, if you've got, like, your own, let's say, internal queues that you want to monitor the usage of, or your own business logic processes that you want to measure the duration of, you can measure those as well. Okay? So you've got both the system level stuff and the applicative stuff. And, by the way, one system level thing that is really important in the context of Node and may be less obvious or familiar to some of our listeners is something called the event loop lag. Are you familiar with that?
Charles Max Wood [00:20:52]:
I am not.
Dan Shappir [00:20:55]:
So as you know, the way that JavaScript works is that it's all based on an event loop. Right. Be it either in the browser or in Node. JavaScript, the way that it works is, whenever something happens, like, if it's in the browser, then something arrives over the network or the user does some sort of an interaction, a mouse click, a keyboard press. Whenever something happens, an event is triggered, and that puts the event information in a queue. And JavaScript, which, as you know, runs kind of in a single threaded type approach, pulls the top item out of the queue, processes it, and then moves to the next item in the queue, next item in the queue, and so forth until, you know, if the queue becomes empty, then it effectively idles until more stuff is put into the queue. So far so good?
Charles Max Wood [00:22:04]:
Yep.
Dan Shappir [00:22:05]:
Now the problem is that usually, rather than idling, what happens really is that information gets into the queue at a very rapid pace, at a high clip. So new events might be placed into the queue before the JavaScript engine is ready to process them, because it's still busy processing other stuff. Like Right. You think about, let's say, a Node server running Express, the events coming in are the HTTP requests. If HTTP requests arrive at too high a rate, then the Node service might not be able to process them quickly enough and will get overloaded. Right. So what the event loop lag actually measures is the amount of time from when a message is placed into the queue until it's taken out of the queue in order to be processed. So if that period of time is small, then you know that your service is really responsive.
Dan Shappir [00:23:21]:
If it gets to be too high, then it means, you know, your system is overloaded and it's not responsive enough. You know, think about a service, an Express service, that takes, I don't know, 2 seconds to pull something out of the queue. It means that the browser, the client side, is waiting for 2 seconds before its event is even processed. So, obviously, that's a bad thing. And it's especially bad if it keeps on growing, because then eventually your server would just become unresponsive. Right. So that's information that the prom client is actually able to extract out of Node and expose to Prometheus. So that's one of the system monitoring things that's really useful to look at when you're monitoring a Node service. Another thing that's really useful to look at is, for example, heap usage.
Dan Shappir [00:24:29]:
Because if you've got, let's say, some sort of memory leak, then Right. You'll see that after garbage collection, a GC, rather than going all the way down, your memory utilization just keeps going up and up and up. And, again, if it keeps on going, what will happen is that the Node service will try to do GCs more and more and more in order to free memory, but fail to do so, and you get what's known as a GC storm, where effectively all the service does is just try to free up memory that it can't, and it becomes totally stuck. So that's another thing that you can look at in the context of Prometheus.
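For reference, prom-client's default metrics include gauges for event loop lag and heap usage. The metric names below are the ones recent prom-client versions expose as I recall them, so it's worth double-checking against your own /metrics output.

```promql
# Event loop lag reported by prom-client's default metrics.
nodejs_eventloop_lag_seconds

# Heap usage; a sawtooth that keeps trending upward can hint at a leak.
nodejs_heap_size_used_bytes

# Rough growth rate of heap usage over the last 30 minutes.
deriv(nodejs_heap_size_used_bytes[30m])
```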
Charles Max Wood [00:25:17]:
Well, it seems like on both of those measures, on the lag in, what is it? Node event loop lag.
Dan Shappir [00:25:26]:
Exactly.
Charles Max Wood [00:25:27]:
It seems like, because I'm sitting here, and I'm thinking, okay. So it's gonna tell me if it doesn't have the resources to handle whatever's coming at it. But for me, I find it useful because, I mean, let's just take podcasting for an example. Right? Like, I don't go and religiously obsess over the numbers. Right? I don't go look at the metrics. But for an app, if you're checking the metrics on a regular basis, it seems like you could start to see this event loop lag steadily increasing and go, okay. We are getting to the point where we need to start looking at, right? Instead of all of a sudden being, woah. Woah.
Charles Max Wood [00:26:12]:
Woah. We're, you know, we're way over the edge. Right? And so it it allows you to be proactive, right, instead of reacting to people are complaining it's slow.
Dan Shappir [00:26:22]:
Yeah. And, actually, what you really want is to have good alerting, and we'll get to that as well because
Charles Max Wood [00:26:29]:
I was gonna yeah.
Dan Shappir [00:26:30]:
Yeah. Because realistically, you're not going to check the graphs for all your services every morning or every afternoon. What you really want is a system that alerts you in case something is wrong, and you usually want an alert to say not that, you know, the system is broken. You want an alert to tell you, you know, the system is running hot. You should do something before it breaks.
Charles Max Wood [00:26:59]:
Right.
Dan Shappir [00:27:00]:
And, yeah. And Prometheus is great for that as well, because you can specify alerts, and so that's the other part. So, I was talking about how all the data is collected into the Prometheus service and then saved into persistent storage. But then you can also do queries on top of that data. Prometheus has its own query language. It's called PromQL, and it looks nothing like SQL, you know, even though it's a query language. It's a totally different query language.
Charles Max Wood [00:27:38]:
Well, NoSQL is the query language, and it doesn't look like SQL even.
Dan Shappir [00:27:41]:
Exactly. And you can do 2 sorts of things. You can have Grafana as a visualization environment. So Grafana is the service that you run in the browser, and it can show you all sorts of graphs from various data sources. And one of the data sources is Prometheus, and you can write PromQL queries that extract data and then graph this data in whatever dashboard you're using. So that's one possible usage. Another possible usage is something called the alert manager, which is another component of Prometheus that comes along with it. It's a standalone service that, at regular intervals, runs PromQL queries, gets the data from them, and sees if alerting criteria are met.
Dan Shappir [00:28:40]:
And if they are, it can then push alerts into emails or Slack or PagerDuty or whatever, so that you can generate alerts out of the Prometheus data. So for example, going back to that farmer and field and thermometer example, you could say that if the temperature goes above, I don't know, 30 degrees on the ground, send an alert. Something along these lines.
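In practice, a rule like the farmer example would usually live in a Prometheus rules file, with firing alerts routed through Alertmanager to email, Slack, or PagerDuty. A sketch, with a made-up metric and label name:

```yaml
# Sketch of an alerting rule for the farmer/thermometer example.
groups:
  - name: field-alerts
    rules:
      - alert: FieldTooHot
        expr: field_temperature_celsius > 30
        for: 5m                      # only fire if it stays hot for a while
        labels:
          severity: warning
        annotations:
          summary: "Temperature above 30 degrees at sensor {{ $labels.sensor }}"
```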
Charles Max Wood [00:29:11]:
Yeah. That makes sense. Yeah. So I just wanna throw in here real quick, because I think sometimes we kinda treat the time series data as kind of monolithic, in ways like treating it, like, for the day or the week. Right? I was actually looking on the Discord server that we use for the hosts, and Adventures in DevOps, they were gonna do an episode where they were talking about holiday rushes. Right? And so one day to the next, it may vary, or you may get a lot of your traffic in the morning or the evening. And so, you know, by having these alerts, you can start to pick up some of the patterns.
Charles Max Wood [00:29:52]:
And the other thing is, you can turn around and you can say, okay. Not only do I know that something's happening now, but I can go look at the current state of things or get a snapshot from the Prometheus data and then start to solve whatever the issue is. Right? Whether it's I need more resources, or, oh, I didn't realize, but I built something into the application that makes it memory heavy. And so my heap size is going out of control, and I'm running out of memory or whatever.
Dan Shappir [00:30:21]:
For sure. But definitely, also, the fact that data can vary over time, even if regularly, can make some of this stuff pretty challenging, but still doable. PromQL is a very sophisticated query language, and I'll give examples of some of the challenges that I've run into when creating alerts. So for example, at a lot of companies that I've worked at, companies like Wix or Next Insurance, there was a lot more traffic, let's say, over the weekdays than over the weekends, which is not surprising. But the downside of that was that data over the weekend would fluctuate a lot more, because the sampling size was significantly lower. When you think about it, let's say, talking about Next Insurance, you might have 10,000 sessions in a weekday, in a working day Mhmm. But only 100 over the weekend.
Dan Shappir [00:31:30]:
And when you've got only 100 sessions, then 10 bad sessions can really impact your performance numbers. And then we would see that, if we weren't careful, we would start getting alerts over the weekends, because the data was less stable, because there were just fewer sessions. So we basically kind of had to come up with queries that were also dependent on the number of sessions, not just the duration of the sessions. So if there were too few sessions, we would ignore the other criteria of the duration of the sessions. So that might be something. You know? It makes the queries more challenging, but you wanna take these sorts of things into account. So quickly going back to when is Prometheus a good match and when isn't it a good match? What type of data would you want to put into Prometheus, and what kind of data would you probably wanna put somewhere else? So Prometheus is a good match when you're recording pure numeric time series data. Okay.
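To make the weekend idea from a moment ago concrete, a query along these lines gates the latency condition on traffic volume; the metric names and thresholds here are hypothetical, not the actual Next Insurance queries.

```promql
# Only flag slow sessions when there is enough traffic for the number to mean something.
(
  histogram_quantile(0.9, sum by (le) (rate(session_duration_seconds_bucket[1h]))) > 2
)
and
(
  sum(rate(session_duration_seconds_count[1h])) * 3600 > 500   # more than ~500 sessions/hour
)
```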
Dan Shappir [00:32:47]:
It's appropriate for machine centric monitoring, your monitoring systems, and for highly dynamic service oriented architectures, because, you know, this whole pull type mechanism makes it very easy to adjust to additional instances coming up or going down in a very dynamic sort of a manner. And, again, I'll talk about it, assuming we have the time, something about how it's really important when the data is multidimensional, both in the collection and the querying, and I'll explain what that is a bit later on. Now when is it not a good match? First of all, it's not a good match when you need 100% accuracy. So, for example, if you're looking at stuff like billing, when the numbers have to be perfect, then Prometheus is not a good solution. Prometheus, in most cases, kind of averages out data. Because, again, when you're looking at stuff like CPU usage, it doesn't really matter if you're running at 93 or 93.2 or even 94%. But, you know, obviously, if you're looking at stuff like, you know, your taxes, you probably need to be accurate. It's not
Charles Max Wood [00:34:11]:
a pro taxes.
Dan Shappir [00:34:12]:
Sorry?
Charles Max Wood [00:34:13]:
They said don't talk to me about taxes right now.
Dan Shappir [00:34:18]:
Yeah. Yeah. Anyway, it's also not appropriate when you're recording non numeric data. So, for example, if you're recording stuff like email addresses or phone numbers, even though they seem like they're numeric, they're not really, or, you know, street addresses, stuff like that, that's not appropriate. And it's also not appropriate when the data has to be totally persistent, when, you know, you can't afford to lose any data. As I said, Prometheus pulls the data out of the various services at regular intervals. Let's say every 1 minute.
Dan Shappir [00:35:04]:
If that service crashes during that one minute, you've lost the data since you previously pulled it. And that's obviously not something that you can live with in a lot of cases where you need persistent data. And, you know, if you're dealing with bank accounts and stuff like that, you can't afford to lose transactions or stuff like that. So
Charles Max Wood [00:35:30]:
Yeah. But it seems like most people are using it by putting this client onto their application, and so it's only recording those specific kinds of data. You're talking about a custom use of Prometheus where you might push other data into it as well.
Dan Shappir [00:35:49]:
Exactly. Exactly. Now it can be a challenge because, again, let's say you're pulling the data in every minute. And if your server has a problem where, in a certain scenario, memory consumption, like, rockets out of control and then the server crashes, then you might not be able to catch that if it, you know, happens within the span of a few seconds. So you can either try to increase the rate at which you pull and hope that you're lucky, or look for some other solution.
Charles Max Wood [00:36:25]:
Right.
Dan Shappir [00:36:27]:
As I said, it has integrations for Node, JVM, Go, Python, Ruby. And in terms of systems, stuff like Kubernetes, GitLab, AWS, Jira, MongoDB, Redis. For visualization, usually, you'd use Grafana, even though it also has its own built in simple visualization capabilities.
Charles Max Wood [00:36:48]:
Okay.
Dan Shappir [00:36:48]:
And it's compatible with OpenTelemetry. So if you're using OpenTelemetry for collecting telemetry information, you can also use OpenTelemetry on top of Prometheus and have OpenTelemetry put the data that it collects, the relevant time series data, into Prometheus. So, there are several types of metrics, and they're appropriate for different scenarios. So very quickly going over the different metric types, you've got something called a counter. Counter, think about, you know, let's say, the club that you wanna go into, and there's somebody at the door with this kind of a counter device, clicking it every time somebody goes in, like, counting the number of people inside, because, you know, the fire regulations only allow up to x number of people to be inside at the same time. So they need to count the number of people going in and the number of people coming out to make sure that, you know, they don't exceed those limits. So a counter is really something like that. It's a metric that can only increase, and you basically add 1, or add n, which is like adding 1 n times.
Dan Shappir [00:38:18]:
So you basically just add into it, and it keeps on getting higher. Can you think about things that you would measure using something like a counter?
Charles Max Wood [00:38:31]:
Yeah. Like the number of requests that come in or
Dan Shappir [00:38:34]:
So, funnily, that's literally the first example that I have. So the request count is exactly such a thing, tasks completed, error count, all these kinds of things that only go up. They only increase, at least until you restart the service.
Charles Max Wood [00:38:55]:
Right.
Dan Shappir [00:38:56]:
And the cool thing about a counter is that, because of this behavior of only increasing, you can compute the rate of increase, which makes it possible to have predictions. Because if you can calculate the rate, then you can also predict where you'll be in a certain amount of time. Right.
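A sketch of an applicative counter with prom-client; the metric and label names are made up. The rate of increase would then be computed at query time with something like rate(orders_processed_total[5m]).

```js
// Counter sketch: a monotonically increasing metric with a label.
const client = require('prom-client');

const ordersProcessed = new client.Counter({
  name: 'orders_processed_total',
  help: 'Total number of orders processed',
  labelNames: ['status'],
});

// Somewhere in the business logic:
ordersProcessed.inc({ status: 'ok' });        // add 1
ordersProcessed.inc({ status: 'failed' }, 3); // or add n at once
```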
Charles Max Wood [00:39:24]:
Makes sense.
Dan Shappir [00:39:25]:
Yeah. So that's the simplest type of metric. Again, it's used automatically for things like a request counter or a tasks completed counter or an error counter, stuff like that. But you can also create your own applicative counter if you want to count your own stuff. Right. The next type of metric is called a gauge, and it records a value that can go up or down, or literally be set at any value that you want. So you can literally say the value of this gauge right now is x. Mhmm.
Dan Shappir [00:40:03]:
Again, can you think of when you might use the gauge?
Charles Max Wood [00:40:07]:
This would be like queue size or,
Dan Shappir [00:40:11]:
Chuck, are you looking at my slides?
Charles Max Wood [00:40:13]:
No.
Dan Shappir [00:40:15]:
Because, again, that's the first item that I have on my slide for using it: queue size. Exactly. Queue size, memory usage, CPU usage, number of requests currently in progress, all these sorts of things where you just set it to a particular value at each point in time. Right. So it can go up and down or be set to a particular value. The thing about it, though, is that because it's so arbitrary, you can't use it to assess rate of change. Because if you can just jump between numbers, there's no really meaningful rate that you can think about.
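A gauge sketch for the queue size example, again with made-up names:

```js
// Gauge sketch: a value that can go up, go down, or be set outright.
const client = require('prom-client');

const queueSize = new client.Gauge({
  name: 'job_queue_size',
  help: 'Current number of jobs waiting in the internal queue',
});

function onEnqueue() { queueSize.inc(); }              // one more job waiting
function onDequeue() { queueSize.dec(); }              // one fewer
function onRebuild(length) { queueSize.set(length); }  // or set it to a known value
```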
Charles Max Wood [00:41:02]:
Yeah. You you can average it out. But
Dan Shappir [00:41:06]:
Yeah. Yeah. The next one is slightly more complicated, and I actually have presentations that I do, that I so far have done internally. I'm kind of looking for a conference that wants a talk about this. But if you want to be able to measure things like, sorry, like averages or, even more importantly, percentiles, like the median or the 90th percentile or stuff like that Mhmm. Then you use a metric called a histogram. Now it might seem surprising why it's called a histogram if it's used to measure percentiles, for example, and I'll try to explain it in a minute. But can you think about when you want something like percentiles?
Charles Max Wood [00:42:00]:
I would think like memory usage or
Dan Shappir [00:42:03]:
No. Memory usage, we talked about counter. When when
Charles Max Wood [00:42:06]:
you What I what I meant was, you know, percentage of, like, memory used or percentage of resources used.
Dan Shappir [00:42:15]:
So what they have is
Charles Max Wood [00:42:16]:
something not understanding what it is.
Dan Shappir [00:42:18]:
So I'll give examples, and I think then it will click for you. So think about something like a request duration. Like
Charles Max Wood [00:42:28]:
Oh, I gotcha.
Dan Shappir [00:42:29]:
So you want to say, like, my median request, I gotcha. Yeah. Rendering duration is x, or the 90th percentile is y. Another example might be the response size. So you might say, my average response size or my median response size is such and such, and it goes up to something else when I'm looking at the 99th percentile.
Charles Max Wood [00:43:02]:
Mhmm.
Dan Shappir [00:43:04]:
So you kind of want to be able to get the measurements, but then use them in order to calculate, as I said, percentiles.
Charles Max Wood [00:43:14]:
Right.
Dan Shappir [00:43:15]:
Now it's called a histogram because the way that it's actually implemented is that, internally, you create buckets. Let's say you're talking about request duration. You'd say I have a bucket from 0 to 10 milliseconds, from 10 milliseconds to 50 milliseconds, from 50 milliseconds to 100 milliseconds, and anything above 100 milliseconds. Okay? And each measurement goes into one of these buckets. And then you look at how many fell into each one of those buckets, and then you can use it to calculate percentiles. Right.
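A histogram sketch matching those buckets (prom-client takes bucket boundaries in seconds); the names are made up. At query time, a percentile would come from something like histogram_quantile(0.9, sum by (le) (rate(http_request_duration_seconds_bucket[5m]))).

```js
// Histogram sketch: measurements are counted into buckets, percentiles come later at query time.
const client = require('prom-client');

const requestDuration = new client.Histogram({
  name: 'http_request_duration_seconds',
  help: 'HTTP request duration in seconds',
  buckets: [0.01, 0.05, 0.1],   // 0-10ms, 10-50ms, 50-100ms, plus an implicit +Inf bucket
});

// Around each request:
const end = requestDuration.startTimer();
// ... handle the request ...
end();   // records the elapsed time into the matching bucket
```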
Charles Max Wood [00:43:53]:
No. That makes sense. So
Dan Shappir [00:43:56]:
So it's like this count for each point in time.
Charles Max Wood [00:44:00]:
Right. The thing that I'm imagining is anything that you would put, like, on a bell curve or something like that. Right? And so then, yeah, your 90th percentile is, you know, out on the end. Right? So these are kind of rare, but they're also kinda the extreme that you have to deal with. Right?
Dan Shappir [00:44:19]:
While the median or whatever is in the middle. It's a histogram that gets recorded and updated at every interval. So it's like a histogram over time.
Charles Max Wood [00:44:30]:
Oh, interesting. So you could you could literally see, like, the hump move or whatever.
Dan Shappir [00:44:35]:
Exactly. You say, this is my histogram now. This is my histogram a minute later. This is my histogram another minute later, and so on
Charles Max Wood [00:44:42]:
and so forth. Okay.
Dan Shappir [00:44:47]:
So, hopefully, that's kind of clear. Now there's another metric called summary. I don't wanna get into it. It's more rarely and uncommonly used, so we'll skip it for now. You know, we're starting to run long in any event. Because what I really wanted to talk about was the concept of labels and dimensionality. So let's go back again to that field example of the farmer wanting to measure, let's say, humidity and temperature in their field.
Dan Shappir [00:45:18]:
But it's a big field, so they're not using just a single measurement device. They've got devices spread out in different locations in the field. So one way in which you might think about it is that we'll create a separate metric for each one of those measurement devices.
Charles Max Wood [00:45:34]:
Mhmm.
Dan Shappir [00:45:35]:
But the way that Prometheus looks at it is that we've got a single temperature metric, but for each measurement, we associate labels with it. And that label is a textual value. It might be the name or the ID of the specific measurement device. So let's say I've got 4 devices in the field. They're called a, b, c, and d. I've got the single temperature metric, but each measurement is associated with either a or b or c or d. Mhmm. Is that clear?
Charles Max Wood [00:46:13]:
Yep.
Dan Shappir [00:46:15]:
So now, the benefit of this approach is that it's very dynamic. If I, you know, buy some more land and I now also need e and f measurements, I don't need to create new metrics. It's the same metric
Charles Max Wood [00:46:36]:
Right.
Dan Shappir [00:46:37]:
But the new values coming in are associated with new label values. So I've got a label called, you know, the device name, but it can come in with any arbitrary textual value. And then when I query the data, I can say, show me just the values for the metric for when the label is equal to a or Mhmm. To a or b, or a and b. Because when you do queries in the PromQL query language, you can either set a label in the query to have a specific value or use a regular expression.
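A sketch of the single-metric-plus-label idea, with hypothetical metric and label names; in PromQL you would then filter with field_temperature_celsius{sensor="a"} or a regex matcher like {sensor=~"a|b"}.

```js
// One metric, many label values: new sensors need no schema change.
const client = require('prom-client');

const fieldTemperature = new client.Gauge({
  name: 'field_temperature_celsius',
  help: 'Temperature reported by a sensor in the field',
  labelNames: ['sensor'],
});

fieldTemperature.set({ sensor: 'a' }, 31.5);
fieldTemperature.set({ sensor: 'b' }, 29.8);
fieldTemperature.set({ sensor: 'e' }, 30.1);   // a brand new sensor, same metric
```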
Charles Max Wood [00:47:26]:
Okay. So I I'm imagining that you could do this if you're spinning up, say, another Docker container or server.
Dan Shappir [00:47:35]:
Exactly. So you might have the pod name. You you would have a label called pod name, and the value is the name of the pod. Right.
Charles Max Wood [00:47:47]:
Yep.
Dan Shappir [00:47:48]:
And you can have any number of labels associated with each metric, and those labels can have any number of values.
Charles Max Wood [00:48:01]:
Okay.
Dan Shappir [00:48:02]:
Now that, I think, highlights the potential issue of dimensionality, because it makes the system very flexible.
Charles Max Wood [00:48:15]:
Mhmm.
Dan Shappir [00:48:15]:
But you'll see the problem in a minute. So let's say I have a metric, and I have 3 labels associated with it. So it's actually they defined a three-dimensional space for that metric because each measurement has, a coordinate that is specified by the values of those three labels.
Charles Max Wood [00:48:41]:
Okay.
Dan Shappir [00:48:43]:
Is that clear? I'm kind of waving my hands. Think about it. So think about, again, going back to the field example. Let's say that instead of having each measurement device in the field have its own name, it has a coordinate in the field. So it has an x value and a y value. So it's a 2 dimensional space.
Charles Max Wood [00:49:14]:
Okay.
Dan Shappir [00:49:17]:
Now, in reality, like, again, when you're measuring things, you might have many more labels than that. So it becomes an n dimensional space.
Charles Max Wood [00:49:28]:
Right.
Dan Shappir [00:49:31]:
And for each axis, the number of points on that axis is the number of different values that label might have.
Charles Max Wood [00:49:45]:
Okay.
Dan Shappir [00:49:48]:
So let's say I have 3 labels and each one has 10 potential values. How many different numbers in that space can I have?
Charles Max Wood [00:50:03]:
Thousand.
Dan Shappir [00:50:05]:
Exactly. So I need to remember a 1,000 different numbers for that single metric.
Charles Max Wood [00:50:15]:
Right. And make sense of it, hopefully.
Dan Shappir [00:50:17]:
Yeah. But the problem is that if I'm not careful with the number of labels that I use, and especially the number of different values that each label might have, I can literally blow up my memory. Right. And, again, going back to how Prometheus works, let's say you have a Node server and Prometheus pulls the data from the Node server once every minute. So that Node server keeps all that data in its memory.
Charles Max Wood [00:50:54]:
Oh, okay.
Dan Shappir [00:50:56]:
If you're not careful, you'll blow up Node's memory. So I'll give a real example. We were using Prometheus to monitor data associated with page performance at Next Insurance. And so one of the labels was the URL. Mhmm. We had something like 2,000 different pages. Okay. So one of the dimensions was the URL, and it had 2,000 possible different values.
Dan Shappir [00:51:44]:
Right. Actually, it could have been a lot worse if we hadn't been careful and didn't filter out the query parameters and stuff like that. Exactly. And another label that we had was device type, because we wanted to distinguish between desktop and mobile. And another label that we had was the browser type, because we wanted to distinguish between performance, let's say, on Chrome and on Safari. Right. So think about it. It's 2,000 different URLs times 10 different browsers times 3 or 4 different types of devices, or 5 types of devices.
Dan Shappir [00:52:34]:
And all of a sudden, for each one of those three-dimensional coordinates, you need to remember a metric. So it's a number. So it's millions of numbers that you need to Right. That you're holding in memory, and we literally kind of blew up the memory space.
Charles Max Wood [00:52:55]:
Right.
Dan Shappir [00:52:57]:
So, yeah, you need to be careful when you're, like, cavalier about the number of labels and dimensions that you're using. It's called cardinality, and, basically, you need to be aware of high cardinality. Right. But the other benefit is that the system is extremely flexible. Like Right. If there's a new URL, you don't need to do any modification in the system. It automatically adjusts to that, because it just associates another label value with that particular metric.
Dan Shappir [00:53:38]:
Right. Now there are all sorts of things about naming conventions, about how you name your metrics and how you name your labels. I won't go into that. I will say that there's a really cool API for working with all the different metric types. So we talked about counter, a gauge, and histogram. So if you import the prom-client npm package, then you get those APIs so you can push your numbers with associated labels into the system, and they just get recorded. Right.
Dan Shappir [00:54:35]:
So, again, you might have, let's say, some sort of business logic process that, you know, does all sorts of things. Goes to one database, goes to another database, does all sorts of things, and you want to measure its duration overall. Or maybe you've got your own, like, custom queue for something, and you want to measure the size of that queue. Or something like that, or a pool. Let's say you're utilizing your own custom pool of resources, and you want to make sure that you've made the pool not too big, not too small. Then, again, you could be looking at the size of that pool and what its utilization is at each point in time. So you can, through those APIs, measure your own business logic values. And Mhmm. The example that I gave is we were actually using it to measure Core Web Vitals.
Dan Shappir [00:55:44]:
If you remember, we've had a lot of,
Charles Max Wood [00:55:47]:
we've had yeah. A lot of conversations about it.
Dan Shappir [00:55:50]:
Exactly. Now those are metrics that are actually measured on the browser side. So you might ask, how did they make their way into Prometheus? Well, we would collect this data on the client side, post it to the node server, and the node server would use that API to put it into Prometheus.
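A sketch of that browser-to-server flow, assuming a made-up route, metric name, and buckets; the browser side would typically use something like the web-vitals library to post its measurements.

```js
// Receive browser-side measurements and record them into a Prometheus histogram.
const express = require('express');
const client = require('prom-client');

const lcp = new client.Histogram({
  name: 'web_vitals_lcp_seconds',
  help: 'Largest Contentful Paint as reported by browsers',
  labelNames: ['page'],
  buckets: [0.5, 1, 2.5, 4],
});

const app = express();
app.use(express.json());

// The client posts { page, lcpSeconds } here after measuring in the browser.
app.post('/vitals', (req, res) => {
  const { page, lcpSeconds } = req.body;
  lcp.observe({ page }, lcpSeconds);
  res.sendStatus(204);
});

app.listen(3000);
```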
Charles Max Wood [00:56:13]:
Right.
Dan Shappir [00:56:15]:
So this way, you can collect data not just from the Node side, but also from the browser side. Again, it could be system level stuff like Core Web Vitals, or it could be applicative stuff. You know, whatever your web application happens to be doing that you would like to measure and monitor.
Charles Max Wood [00:56:39]:
That's cool.
Dan Shappir [00:56:40]:
Yeah. So now the final piece that I wanted to mention is how to get data out of Prometheus. So as I mentioned, there's a query language called PromQL. That's p r o m q l. And like I said, it's definitely not SQL. It can do stuff like filtering data, aggregating data, running all sorts of predictive functions, taking quantiles, which is a fancy name for percentiles, and averages and whatnot. And it can be used to answer questions like, what's the 95th percentile latency in each data center over the past month? Right. You know, so very sophisticated queries.
Dan Shappir [00:57:33]:
Or Mhmm. How full will the disks be in 4 days? So here's an example of a predictive query. Mhmm. Or which servers are the top five users of CPU? All these sort of queries. And you can either use these queries in order to graph things or in order to create alerts.
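Rough PromQL sketches of those three questions; the metric names are illustrative (the disk metric, for instance, assumes a node_exporter-style setup).

```promql
# 95th percentile latency per data center over the past month.
histogram_quantile(0.95,
  sum by (datacenter, le) (rate(request_duration_seconds_bucket[30d])))

# Will the disk fill up within 4 days? predict_linear extrapolates the recent trend.
predict_linear(node_filesystem_avail_bytes[6h], 4 * 24 * 3600) < 0

# Top five CPU-consuming instances.
topk(5, sum by (instance) (rate(process_cpu_seconds_total[5m])))
```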
Charles Max Wood [00:58:00]:
Okay. And that's what Grafana does?
Dan Shappir [00:58:05]:
Is it
Charles Max Wood [00:58:05]:
it uses PromQL to do this stuff?
Dan Shappir [00:58:08]:
Exactly. So in Grafana, what you do is you create a graph, and in the graph, you specify Prometheus as the data source. Okay. And you write the PromQL query, and then it basically graphs that query over time.
Charles Max Wood [00:58:25]:
Mhmm.
Dan Shappir [00:58:28]:
And it has all sorts of fancy graphs. So you can do, like, line charts and bar charts and heat maps and all sorts of really fancy graphs, and it's really powerful. If you're interested in this sort of stuff, I highly recommend just going onto YouTube, let's say, and searching for PromQL. There's also the Prometheus website. There's documentation and tutorials for PromQL as well. So we can post links to that later on in the notes. Obviously, if this was a talk, then I would actually be showing examples of PromQL, but, you know, I can't really do that. But You're not gonna
Charles Max Wood [00:59:24]:
read it out loud?
Dan Shappir [00:59:27]:
Exactly. So you would put PromQL in 1 of 2 places, actually, 1 of 3 places. So Prometheus itself has its own simple web interface where you can put in PromQL queries and get data either in tabular form, basically just a tabular representation of numbers, or in simple graphs. If you want much more sophisticated graphs, then you can use Grafana. And Grafana actually has a pretty sophisticated editor for PromQL, so it does automatic completion for you. It knows all the metrics in the system and the labels and the different label values, so it can actually do auto complete for you when you're typing in the queries. Nice. Yeah.
Dan Shappir [01:00:22]:
It's really nice. As I think I said, they're kind of sister projects. It's the same people working on both of them, so they made sure that the integration is really nice. And you can also use PromQL in the alert manager, where you would type the queries in the YAML files, and Mhmm. Then they get run automatically at regular periods. And if they come back with data that matches the query, then an alert gets raised. So you run a query. If it comes back with no data, no alert.
Dan Shappir [01:01:03]:
If it comes back with data, then there's an alert. So an alert query might be, the number of CPUs that are at utilization higher than 80%. Right. And if the number is 0, then there's no alert. If the number is greater than 0, then there would be an alert.
Charles Max Wood [01:01:25]:
Makes sense.
Dan Shappir [01:01:27]:
Something like that. And it's actually even more sophisticated than that, because you can say you want to avoid, like, momentary peaks. So you might say it needs to be higher than 80% over a period of 2 minutes. Right. Stuff like that.
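In a rules file, that "higher than 80% over 2 minutes" idea maps to a for: clause. A fragment like this would sit inside a rule group like the earlier sketch, and the expression assumes node_exporter-style CPU metrics.

```yaml
- alert: HighCpuUsage
  expr: (1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m]))) * 100 > 80
  for: 2m        # only fire if it stays above 80% for 2 minutes, ignoring momentary peaks
  labels:
    severity: warning
```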
Charles Max Wood [01:01:45]:
Makes sense.
Dan Shappir [01:01:46]:
Yeah. So it's it's a very sophisticated and powerful system.
Charles Max Wood [01:01:52]:
So is this something, it sounds like something you used at Next Insurance. Is this something that you're using today? Or
Dan Shappir [01:02:00]:
Yeah. We're also using it at Sisense. So for example, when we deploy our service, our customers have the ability to run our services on premises, and then they can collect monitoring data for our own services inside an instance of Prometheus that we install with it.
Charles Max Wood [01:02:25]:
That's cool.
Dan Shappir [01:02:30]:
But like I said, it's a really powerful system, and it's really flexible. And, again, if you want to do monitoring or alerting, and probably you wanna do both, then it's a very good solution for that. Nice. Yeah. And that more or less covers it. Like I said, I have a talk, a presentation on it, in which I, you know, am able to show actual examples and visuals. So it's, you know, more informative. If you're interested in that, if you're running a conference and you're interested in that, I'm shopping this talk around.
Charles Max Wood [01:03:17]:
Nice. Well, I mean, this is something that I've kind of been contemplating figuring out how to put into my own systems. Right? I mean, I send data into, I have a self hosted version of Sentry, is the thing I've been using lately. And it collects some of this kind of data, but, you know, I've gotten more and more into self hosting my own kind of thing and running through some of this. Right? And it sounds like this is kind of right up that alley too, where it's, okay. When I deploy, right, it just, you know, makes sure that I have a Grafana and a Prometheus instance set up that it just reports to.
Dan Shappir [01:04:02]:
Exactly. Now, obviously, there is overlap between these sorts of systems, but I think their focus is kind of, like, not exactly the same. So Right. Let's say, if you're doing, like, error logging, then, you know, Sentry is the tool for you. Yeah. I think I said that Prometheus is not the appropriate tool for textual information. So if you're keeping stuff like stack traces and stuff like that, then you wanna use something like Sentry. You know, you could theoretically count the occurrences of a particular type of error, but, yeah, you probably wanna use Sentry for something like that.
Dan Shappir [01:04:45]:
Also, Sentry has stuff that's specifically intended for performance monitoring in certain scenarios. But if you wanna do more general purpose monitoring and system level monitoring, and monitoring applicative stuff, then Prometheus could be the solution for you.
Charles Max Wood [01:05:07]:
Right. I also like that, you know, as it aggregates that data effectively, what you can learn from it is only limited by, I guess, what it's measuring, but also what you can come up with to query out of it.
Dan Shappir [01:05:21]:
Oh, yeah. For sure. I created some amazing queries using PromQL. So I'll give that example. So at Next Insurance, we have something like 50 services in a service oriented architecture. Mhmm. Which were communicating with each other over various API endpoints. And we wanted alerts to be fired when any endpoint in the system became too slow.
Dan Shappir [01:05:56]:
But then the question was, what does too slow mean? Because Right. You know, it can't be absolute numbers, because if something usually takes 50 milliseconds and then all of a sudden grows to 100 milliseconds, that means it's too slow. But if something always runs at 100 milliseconds and then becomes 110, well, that's a smaller change and you might want to ignore it. So you can't look at absolute numbers. So I created really sophisticated queries that basically looked at behavior over time and tried to see if something became significantly slower compared to its own specific behavior from, you know, previous durations. And, also, again, wanting to only look at those endpoints which were sufficiently used, because if there was some endpoint that was hardly ever used, then we probably don't care about it. Right. So, yes, you can create really sophisticated queries like that.
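One way to sketch "slower than its own baseline, but only for busy endpoints" in PromQL, with hypothetical metric names; offset 1w compares against the same window a week earlier. This is an illustration of the idea, not the actual queries used at Next Insurance.

```promql
(
  histogram_quantile(0.95,
    sum by (endpoint, le) (rate(http_request_duration_seconds_bucket[1h])))
  >
  1.5 * histogram_quantile(0.95,
    sum by (endpoint, le) (rate(http_request_duration_seconds_bucket[1h] offset 1w)))
)
and
sum by (endpoint) (rate(http_request_duration_seconds_count[1h])) > 1   # only sufficiently used endpoints
```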
Dan Shappir [01:07:08]:
And that was exactly because, with all these endpoints, you don't want to have to tell a developer, well, you're in charge of a service that has a hundred different API endpoints, so you need to look every day at a hundred different graphs to figure out if there's a problem. Instead, you want an alert to be sent to the Slack channel of that particular team if any one of those endpoints all of a sudden became slower in a way that potentially impacts the system as a whole.
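Getting that alert to the right team's Slack channel is the Alertmanager side of the setup. Here is a minimal sketch of what that routing could look like, assuming the alert rules attach a team label as in the sketch above; the team name, channels, and webhook URLs are placeholders rather than real values.

```yaml
route:
  receiver: default-slack          # fallback when no team-specific route matches
  group_by: [alertname, route]
  routes:
    - matchers:
        - team = "payments"        # hypothetical team label set by the alert rule
      receiver: payments-slack

receivers:
  - name: default-slack
    slack_configs:
      - channel: '#alerts'                                        # hypothetical channel
        api_url: 'https://hooks.slack.com/services/REPLACE_ME'    # placeholder webhook URL
  - name: payments-slack
    slack_configs:
      - channel: '#payments-oncall'                               # hypothetical channel
        api_url: 'https://hooks.slack.com/services/REPLACE_ME'    # placeholder webhook URL
        send_resolved: true
        title: '{{ .CommonAnnotations.summary }}'
```

The matchers syntax requires a reasonably recent Alertmanager; older configs use match instead.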
Charles Max Wood [01:07:49]:
Right.
Dan Shappir [01:07:50]:
And, yeah, that's one of the projects that I did at Next Insurance, and we used Prometheus and Grafana for that.
Charles Max Wood [01:08:01]:
Very cool.
Dan Shappir [01:08:03]:
So when an alert like that was sent to the Slack channel, the on-call member of the team could actually click a link, see the graph for the performance of that endpoint in Grafana over time, and say, okay, we actually have a problem, or no, this is just a fluke or something that we can ignore.
Charles Max Wood [01:08:27]:
Yeah. Yep. It spiked for such and such a thing and no big deal.
Dan Shappir [01:08:33]:
Yeah. We were doing something. We ran some sort of a backup service and, you know, that's why it impacted everything. And
Charles Max Wood [01:08:40]:
Right. Or you migrated some data on the back end and it slowed it down for 2 minutes and then
Dan Shappir [01:08:46]:
Exactly.
Charles Max Wood [01:08:47]:
Yeah. Awesome. Alright. Well, we put some links into the comments on Twitch and YouTube and Facebook, if you wanna go find those, and we'll try and get them into the show notes as well. But let's go ahead and do some picks.
Dan Shappir [01:09:05]:
For sure, although I don't have that many. Okay, I actually do have something. For some inexplicable reason, I decided to start tweeting out my favorite standalone fantasy novels. You know, fantasy tends to be written as lengthy series of books, or at least trilogies. But occasionally I just want to read one book that stands on its own merit, because it starts and it ends, and you can move on to the next thing. So I started to tweet out my list of the top standalone fantasy books. And you know what? I'll pick them one at a time. So let's see, which one was my first? Can you think of one? There's an obvious choice, by the way, which is, I think, The Hobbit.
Charles Max Wood [01:10:09]:
That's... I was thinking that, but then I was like, I don't know if he considers that part of a series, because it is sort of a prequel to The Fellowship of the Ring and that series.
Dan Shappir [01:10:19]:
Yeah. But, yeah, I think
Charles Max Wood [01:10:22]:
it's a self-contained story. Yeah.
Dan Shappir [01:10:24]:
Yeah. First of all, it's totally self-contained. And the second important thing is that it was written well before The Lord of the Rings. Tolkien actually wrote The Lord of the Rings because the publisher was so happy with the success of The Hobbit that they wanted more stories in that world, and it was his idea to transform it from a children's story into a story for grown-ups. So The Hobbit was originally written for his children, although a lot of adults like that story as well.
Dan Shappir [01:11:01]:
Oh, yeah. You know what, I'll tell you another interesting story about The Hobbit. There are a couple of translations of The Hobbit into Hebrew, but one of the translations is especially interesting. It's called the pilots' translation. There was a war between Israel and Egypt in the early seventies, in between the Six-Day War and the Yom Kippur War; it's called the War of Attrition. And various pilots were shot down and captured by the Egyptians and held as POWs for several years, kind of like the American POWs in Vietnam.
Charles Max Wood [01:11:56]:
Mhmm.
Dan Shappir [01:11:56]:
And they were looking for ways to pass the time. One of them got their hands on the English version of The Hobbit, the original, and they decided to translate it into Hebrew as a way to pass the time. When they were released, they took the translation with them, and it literally got published. It's called the pilots' translation of The Hobbit.
Charles Max Wood [01:12:23]:
That's so fascinating.
Dan Shappir [01:12:24]:
From English into Hebrew. So, anyway, The Hobbit would be one. But because that's the obvious one, I'll give another one, and that book is called Tigana. It's by Guy Gavriel Kay. Interestingly, to kind of close the loop, he's the person that worked with Tolkien's son on publishing The Silmarillion.
Dan Shappir [01:12:58]:
They had a lot of notes left over from Tolkien, so they took all those notes, kind of rounded out and filled in the story, and then released it as The Silmarillion after Tolkien had died. He worked on that with Tolkien's son, but he also wrote several books of his own. One of them is a standalone novel called Tigana. You might actually find it especially interesting because it's very much inspired by Renaissance Italy. The setting is a fictional world, obviously, with magic and stuff like that, but the situation, the scenario, is very reminiscent of Italy in the Renaissance, with little warring principalities influenced by large external powers, and so on and so forth, and the culture and whatnot. And it's a great book.
Dan Shappir [01:13:58]:
It's amazing the amount of story and settings and scenery and characters that he was able to cram into this one book. It's a pretty thick book, but still one book, and it's very highly recommended. As I said, it's called Tigana, that's T-i-g-a-n-a, by Guy Gavriel Kay. And that would be my pick for today.
Charles Max Wood [01:14:28]:
Awesome. I'll have to check that one out. We were watching the Lord of the Rings movies, and my daughter was saying that she'd never read them. I had it on Audible, so she's been listening to it.
Dan Shappir [01:14:45]:
But if she's already watched the movies, can she get into it?
Charles Max Wood [01:14:52]:
I think so. She's pretty into other things, like Harry Potter and Percy Jackson, and she got into those books too after she'd seen the movies. So, yeah. I'm gonna jump in with some picks. Now I have to say my board game group hasn't gotten together in a little bit. It's just kinda that season of life. I'm trying to think of a game I should pick. What was the game that we played last Sunday with the kids? I don't remember.
Charles Max Wood [01:15:36]:
Anyway, there are all kinds of games out there that you can play. I'll just go with one of my favorites. It's funny because I've never actually completely played through this game. But, no, I take it back. I have played through it once. It's called
Dan Shappir [01:15:59]:
Monopoly. Does anybody ever play Monopoly all the way through?
Charles Max Wood [01:16:11]:
My kids do.
Dan Shappir [01:16:14]:
All the way through?
Charles Max Wood [01:16:16]:
Yeah. I haven't played Monopoly in a long time. There are reasons that I don't love the game. I play it for nostalgia, but
Dan Shappir [01:16:27]:
Probably because it's not such a fun game.
Charles Max Wood [01:16:31]:
Yeah. Well, there are a number of issues that make it... anyway.
Dan Shappir [01:16:40]:
I think, I seem to recall, that somebody once told me that the original motivation for the creation of the game Monopoly was to prove the futility of capitalism.
Charles Max Wood [01:16:53]:
Yeah. Well, it's not pure capitalism. Anyway, the game that I'm gonna pick is called Letters from Whitechapel. I understand it's like Scotland Yard, but I've never played Scotland Yard. One player plays as Jack the Ripper, and the other players play as the police deputies, so they're trying to catch Jack.
Dan Shappir [01:17:23]:
Is it kind of like Clue or something?
Charles Max Wood [01:17:25]:
No, in the sense that you're not trying to figure out who did it or anything like that. The way it works is, you have the women that Jack the Ripper kills (I just dropped the link, but I didn't label it). When he murders somebody in the game, he has so many moves that he can make to get back to his hideout, and you play it in 5 rounds. The police deputies are trying to block him off, so they'll go investigate different spots on the board, and there are hundreds of spots on the board. They'll investigate a spot and they'll either find a clue, which means that Jack was there, because when Jack moves through a space, he writes it down on his sheet.
Charles Max Wood [01:18:19]:
And one of my favorite things to do when I'm Jack the Ripper is actually to loop back around, so they'll find clues all around this spot where I went through twice, and then they don't know which direction I went coming out of there. But, anyway, if you manage to get back to your hideout after all 5 rounds, then you win as Jack the Ripper. And if the deputies investigate a spot and Jack's there, they have to specifically say that they're staging an arrest at that spot, and if they do that properly where Jack is at, then they win.
Dan Shappir [01:19:03]:
I have to tell you, the subject matter seems a bit dark for a game.
Charles Max Wood [01:19:10]:
It is, but the overall gameplay is fun. What happens is you start to narrow down where the hideout is. Right? And then you know where he's going. So wherever the murder happens, you can start to fan out along where he might travel through, get a sense of where he's at, and then be able to arrest him. So that's kind of the gameplay. Let me look it up on BoardGameGeek so I can get the weight. The reason that I haven't finished this one is that I tried to play it with my family, and it's kind of a longer game. I mean, it can run, like, 2 hours or longer.
Charles Max Wood [01:20:02]:
And so they just, especially my kids, lose interest. And it's not the kind my wife likes to play; she likes light, airy games, and this one's a heavier game. So all the strategizing and stuff, she just doesn't love. It has a weight of 2.64 on BoardGameGeek, so it is on the heavier side for a game that a casual player would play. But, anyway, it's a fun one. So, yeah, I'll pick Letters from Whitechapel as my board game pick. And then I've been doing a whole bunch of stuff with just being more... I took way too long on that pick, so I'll be brief on these other ones.
Charles Max Wood [01:20:46]:
I've been trying to be a little bit more efficient with my time, and I've also been working on getting back into shape, so I picked a marathon and started training for it. So here are a few picks that I'm just gonna put out there. My training program I do in TrainingPeaks, and that's just trainingpeaks.com; I'm gonna put it in the comments as well. Effectively, it just gives you a calendar, and you can buy workout plans. The ones I've bought cost anywhere from $5 all the way up to, like, $50.
Charles Max Wood [01:21:23]:
I think the marathon and triathlon ones were more expensive, but it literally just puts the workouts into your calendar. And I have a Garmin Forerunner watch, so it just syncs them down to my watch. All I have to do is go into my training calendar, pull the workout, and run it through the watch. So I did that this morning, went for a run, and it was awesome. As far as being more efficient, I've also picked up and been using Linear, which seems to be pretty popular in development circles. It's linear.app.
Charles Max Wood [01:22:02]:
I'll put that in the comments as well. It's basically a project management board like any of the other ones that you're used to. And then what I've been doing is taking anything that I intend to do anytime soon out of Linear, or other stuff that I need to be doing. For example, when Dan and I were talking before the show, he was like, hey, we ought to get people like this on the show, or, hey, I was gonna reach out to someone, and I was like, oh, I know them. And so, as we were having that conversation, I put them into another system called Motion. I'll look and see if they have an affiliate program, but I'm just gonna put the link in; usemotion.com is the app.
Charles Max Wood [01:22:54]:
And the way this works is, you put your tasks in. Let me put the link in: usemotion.com. Sorry, I'm typing and talking at the same time. So you put your tasks in, and then it pulls from my Google Calendar, so it knows when I have something scheduled, like this episode. And then what it does is it says, okay, I'm gonna put the other tasks that you've put into Motion into your calendar.
Charles Max Wood [01:23:25]:
And you can set it up so that it'll actually add them to the Google Calendar, which is what I did, and then I told it not to mark them as busy, so they show up as free time. It's got tasks set up for Tuesday, Wednesday, Thursday, and Friday this week as well, but they show up as free time. And what that means is that if somebody books a time on my calendar, it'll just shift everything around it. So it essentially tells me what to work on. And then finally, the last thing that I've been using is FocusBlox, and I'm actually gonna do another premium episode with Manny, who's the guy that created it. But, effectively, what it does is you get on a Zoom call, you do a breathing exercise before you start, and you commit to, this is what I'm gonna do this hour. And sometimes I hit it and sometimes I don't.
Charles Max Wood [01:24:13]:
Sometimes it's things out of my control, or it turns out to be harder than I thought it was gonna be, or whatever. But it keeps me on task, because what they make you do is put your phone away from your desk so you focus. And then at the end of the hour, you do another breathing exercise, you report on whether or not you got your thing done, and then you do it over again. I usually get 2, 3, or 4 focus blocks in during an afternoon. I try and schedule all the regular stuff in the mornings, and then I'll do that in the afternoons. So a lot of my prospecting, like if I'm trying to find sponsors or stuff, that's on there first thing in the morning.
Charles Max Wood [01:24:56]:
Getting through my inbox is a first-thing-in-the-morning thing, after I go for a run and things like that. So, anyway, those are my picks. I think it's focusblocks.com, and I know I have an affiliate link for that, so we'll put that in the show notes. It doesn't cost you anything extra, but I get a kickback if you use it. But, ultimately, if these things save you a bunch of time and make you more effective, great. And if I get a kickback because I found an affiliate link for it, even better.
Charles Max Wood [01:25:29]:
But, ultimately, this is what I'm using. So, yeah. That's what I've got for picks. So we'll wrap it up. Until next time, folks.
Dan Shappir [00:02:16]:
If if you're unable to measure something, there's really no way for you to know if you're making progress, if you're improving things, or if you're actually degrading things or making no impact at all. And one of the things that I've done in this context is that whenever I, like, say I join a new company or start a new project, I often find myself under pressure to start delivering improvements, you know, from the get go as as quickly as possible. And I always push back on that in order to make sure that we have proper data collection and proper monitoring in place before we start making any improvements. By the way, in this context, it's probably not surprising that I joined Sisense, which is a BI analytics company because I'm really a big believer in that. And, you know, as an extra benefit, if you're doing this kind of work, one of the big benefits of having some sort of measurement and and monitoring solution in place is that after you've made improvements, you'll have graphs to show the impact of your hard work. And from my experience, that's really beneficial when you're let's say, you're looking for a raise or or advancement or something like that. But but really, in order to be successful in any project that requires taking a system and trying to, let's say, improve it, then having some sort of a monitoring capability in place is crucial. Now the question obviously then becomes, what do I actually use for that? I mean, you know, it's fairly straightforward to collect a lot of data these days, but what do you do with it? Where do you put it? How do you process it? How do you visualize it, etcetera? And and in this context, I want to talk specifically about 2 things, which are Prometheus and Grafana.
Dan Shappir [00:04:26]:
And first, I'll start with a riddle for you, Chuck. Mhmm. Do you know, like, who Prometheus was? Like, you know
Charles Max Wood [00:04:35]:
He he was a titan that brought us fire. Right? Wow. Greek mythology. And, he was he was punished by having, what, an eagle eat his entrails out for eternity or something like that?
Dan Shappir [00:04:49]:
He was tied to a mountain on the well, I think with the Tartarus Mountains, and then, Zeus's eagle will come would come once a day and eat his entrails. And because he's immortal, he cannot die, so he suffers forever. By the way, later times, Greeks or Romans kind of thought that this was, like, too bad of a fate, so they had Hercules release him or something along these lines. But Right. But the the now I can't prove it, but I think that the reason that this project was called Prometheus is because it's about bringing knowledge because fire in this context is like, synonymous with knowledge knowledge from the gods to the humans. So it's about bringing knowledge to us developers about how our systems are operating. So that's the mythology about, of what Prometheus is. Now let's talk about what Prometheus as service is.
Dan Shappir [00:05:50]:
So Prometheus is free software as an open source that you can install on premises or you can use as a service, I think, as well that is used for event monitoring alerting. It was originally created something like 12 years ago, at SoundCloud when they came to the conclusion that none of the third party solutions for monitoring were sufficient for their needs. And after they built that system and used it internally, about 4 years later, they donated it to the Cloud Native Computing Foundation, which is the same foundation that, also, hosts the the Kubernetes project. So Mhmm. You know, this is, another successful project from that foundation. I think the, the people who are mostly working on it these days are the people from the company that does Grafana. But, I but, again, it's open source, and you can see the source code on on GitHub, and there are a lot of contributors. I actually personally contributed to one of the satellite projects around, Prometheus, which is the Prometheus client for Node, which makes it possible to connect Node JS to Prometheus in order to monitor, nodes.
Dan Shappir [00:07:20]:
So I contributed specifically specifically to that part of the project.
Charles Max Wood [00:07:26]:
I'm gonna stop you just for a minute. I've been posting the links in the comments, but they don't go to x, and that's where most of our live listeners are. So Prometheus is prometheus.io, and Grafana is at grafana.com. So anyway
Dan Shappir [00:07:45]:
Yeah. That's true. Okay. So, so where were we? So we were talking about, a little bit about the history of Prometheus, both the mythological figure and the project. Now let's talk a little bit about what it actually is. So it's a service used for event monitoring alerting. It records real time metrics in something called a time series database, which is kind of a special type of database, and we'll talk about it in more detail and how it differs from the databases that most of us are familiar with. It allows for something called high dimensionality, which I also will try to explain.
Dan Shappir [00:08:34]:
It supports flexible queries and real time alerting. And as I said, it's free software. It's licensed under the Apache 2 license. So that's what Prometheus is. So let's say you want to use Prometheus in your organization. What you would do is that you would install the Prometheus service and then hook it up to your various services that you want to monitor. Now it's a monitoring solution for back end infrastructure. So things like Node.
Dan Shappir [00:09:12]:
Js or for, you know, the JVM or for something that's, say, written in Go because that's not surprising because Prometheus itself is actually written in Go. So maybe it's a shame that we don't have AJ on the show this time.
Charles Max Wood [00:09:28]:
Yeah. Maybe. But but your point is is that, you know, any any language or system could have a driver that pushes the
Dan Shappir [00:09:36]:
There there's basically a connector for anything. There are also connectors for a lot of, general services. So if you want to monitor let's say we were talking about Kubernetes. You can monitor Kubernetes. Kubernetes has a built in connector for Prometheus. So you can look at how pods are functioning or the Kubernetes cluster itself. There are connectors for various, AWS services and and so on and so forth. So you can collect a lot of data from, you know, third party services and infrastructure, and you can also attach it, like, create applicative level monitoring.
Dan Shappir [00:10:20]:
So you can monitor the behavior of your own applications that are running on platforms such as the JVM, such as Node. Js, or in Go, etcetera. Right. More more or less any programming language that you can think of. Now the way that you configure the system so, again, not very surprising given perhaps that it's certain in Go, the configuration are YAML files. And, again, this is kind of correlates with Kubernetes. So YAML, for those of our listeners who somehow don't know, is a configuration format. You can think of it in to an extent kinda, sorta similar to what we usually do with JSON files, but it's a different format.
Dan Shappir [00:11:09]:
It has certain advantages over JSON. For example, it support comments. Right. It's used by a lot of for a lot of administrative stuff. So any DevOps person is likely very familiar with YAML.
Charles Max Wood [00:11:26]:
The default configs for Rails are all done in YAML as well. So Mhmm.
Dan Shappir [00:11:32]:
Yeah. So you basically create configuration files for Prometheus in YAML. And the way that Prometheus works is kind of the reverse of what you might expect. So you might think that you somehow configure various systems to push data into Prometheus, but that's not how it works. The way that it actually works is that Prometheus pulls data into itself. So in the YAML file, you tell Prometheus the address of the various addresses of the various services that you wanted to monitor and the rate at which it should effectively ping those services. And it effectively does an HTTP get to an endpoints exposed by these services and downloads data from them. So it actually pulls the data from them into itself.
Dan Shappir [00:12:33]:
Now the advantage of this approach is that, first of all, they don't need to be aware of where the Prometheus server is. So it's the the all the configuration is centralized. They just need to open the port, you know, listen on it, and that's more or less it. Also means that they can work with multiple Prometheus service servers at the same time because they all just pull the data. Because pulling the data doesn't clear the data. It's not as if, you know, they give the data and then forget it. It it they they retain the data. Those services are expected to retain their data in memory, So you can just hit any one of them at any time and pull from them their current situate, state.
Dan Shappir [00:13:17]:
I hope that's clear.
Charles Max Wood [00:13:19]:
Yep. Makes sense to me. I kinda like it too just from the standpoint of, so I've used other systems, paid systems. You know, we've been sponsored in the past by Sentry and Raygun and stuff like that that that that grab a lot of this information. Though I think we're talking kind of a level below that, right, where we're not talking specifically about the information that's being sent. We're just talking about how the information gets into the system at this point. I like the fact that it's like, okay. I'm gonna periodically check, and then I don't have 10,000,000 hits on the service on the other end.
Charles Max Wood [00:13:57]:
Right? Because I'm not pushing it to it. It's pulling it. And so it only does the work that it has to do. Right?
Dan Shappir [00:14:04]:
Yeah. Yeah.
Charles Max Wood [00:14:05]:
I don't I don't have all this extra network crap going on.
Dan Shappir [00:14:09]:
Yeah. So it basically does an HTTP get, let's say, once every minute to any one of those, services. The response is essentially text in in their own, format. It's quite readable, actually. So you could literally go to one of those endpoints and just hit it with your browser and see what the response would be. By the way, obviously, you probably want to make sure that those endpoints are not externally exposed. Mhmm. So, you know, that everything stays behind the firewall.
Dan Shappir [00:14:41]:
Right. The the one kind of caveat to that is that sometimes you have, like, short lived services. Think about something like a Lambda. In that case, what they have is something called a push gateway, which is is like a stand alone service that those short lived jobs can push data in because they they are really short lived. So you can't assume that they'll hang around until they're they're pulled again. So they can push their data into that push gateway, which holds on to that data, and then Prometheus pulls that data from the that push gateway. So it's kind of an intermediary service that that you know, for those special cases. But in most cases, and the cases that I've used it in, you know, it was with long lived servers or services, and then, you know, that's just the way it worked.
Charles Max Wood [00:15:37]:
Right. That that makes a lot of sense too just in the yeah. Like, I'm thinking, like, serverless functions and things like that. Or, you know, if you I we've also run background jobs a lot on in the apps that I have right where it pulls it off the queue and then runs it. And so, yeah, in either of those cases, yeah, you don't want or necessarily need something hanging out so that it can say, oh, you're gonna query me within the next minute or so. It just says, poof, I'm done, and then hands it off. Right?
Dan Shappir [00:16:06]:
Yeah. So there's for example, in the case of Node. Js, there's a Prometheus client for Node. I I think it's literally the project is literally, as I recall, called prom client. So you just NPM install it. And then, you know, it's in there. You you just give it the the port to listen on, and then it does all you know, it just it let's say it uses express or something like that, and then it basically collects the information and exposes that to that port for you, and you don't really need to do anything to start monitoring basic system level stuff. Now pro the Prometheus server gets the data in and then puts saves it into its own persistent database.
Charles Max Wood [00:17:01]:
Right.
Dan Shappir [00:17:02]:
And that database is what I call the time series database. And what I mean by that is that Prometheus doesn't store data, like don't think of it something like a database. You know, we tend to think of something like a relational database or maybe a NoSQL database. That's not really that sort of a thing. It's something called a time series database. So, basically, it has metrics, and it basically saves the value of a metric at every point in time. So so it it's really, like, keeps on collecting metric data. So there's no such thing really as schemas or something like that.
Dan Shappir [00:17:47]:
It just has metrics that it collects data into, and it's data collected over time. So, like, think of I don't know. Let's say you're let's say you're a farmer, and you've got a field and you're measuring the temperature in the field. So you've got the temperature measurement, let's say, every minute that you got from from the whatever, you know, the the thermometer device that you use to monitor the temperature, and it just gets recorded into that persistent storage. And and you can go backward and forward in time and look at any point in time what the temperature was.
Charles Max Wood [00:18:30]:
Right. So I'm trying to imagine what this looks like for an app. So is it measuring, like, how much CPU it's using and how much memory it's using along with some of the other oh, we'll get into that. Okay.
Dan Shappir [00:18:42]:
Yeah. We'll get into that. But, basically, as I was saying, you can I'll
Charles Max Wood [00:18:45]:
I'll keep my enthusiasm down for a minute then.
Dan Shappir [00:18:48]:
It it you know what? Let's talk about that a little bit. So it's really monitoring 2 types of information. One, you can think about it as system level stuff. So in the case context of node, a node server, that might be CPU usage or memory slash heap usage or event loop lag. Or if it's an express service, the number of requests per period of time or the duration of of, the responses. Stuff like that. Okay. So those are system level things, and they're collected for you automatically.
Dan Shappir [00:19:31]:
So as soon as you, NPM install the prompt client for node, and it is loaded into your project, your node server, then all that stuff is automatically collected for you. And when the Prometheus server, you know, hits that, port, that information, those metrics are available from the get go. On top of that, you can add applicative level metrics that you push into Prometheus using an API. So for example, if you've got, like, your own, let's say, queues, internal queues that you want to monitor the usage of or your own business logic processes that you want to measure measure the duration of, you can measure those as well. Okay? So you've got both the system level stuff and the applicative stuff. And, by the way, one system level stuff that is really important in the context of node and may be less obvious or familiar to some of our listeners is something called the event loop lag. Are you familiar with that?
Charles Max Wood [00:20:52]:
I am not.
Dan Shappir [00:20:55]:
So as you know, the way that JavaScript works is that it's all based on an event loop. Right. Be it either in the browser or in node. JavaScript, the way that it works is you've got, whenever, something happens, like if it's in browser, then it's something arrives over the network or the user does some sort of an interaction, a mouse click, a keyboard press. Whenever something happens, an event is triggered, and that puts the event information in a queue. And JavaScript, which behaves as as you know, runs kind of in a single threaded type approach, pulls data out the the most the top data out of the queue, processes it, and then moves to the next item in the queue, next item in the queue, and so forth until, you know, if the queue becomes empty, then it effectively idles until another, stuff is put into the queue. So far so good?
Charles Max Wood [00:22:04]:
Yep.
Dan Shappir [00:22:05]:
Now the problem is that usually rather than idling, what happens really is that information comes into gets into the queue at a very rapid pace at a high clip. So it might new new events are placed into the queue before the the before the JavaScript engine is ready to process them because it's still busy processing other stuff. Like Right. You think about, let's say, a node server running express, the events coming in are are the HTTP requests. If HTTP requests arrive at a too high rate, then then then the node service might not be able to process them quickly enough and will get overloaded. Right. So what the event loop lag actually measures is the amount of time from when a message is placed into the queue until it's taken out of the queue in order to be processed. So if that period of time is is small, then you know that your service is really responsive.
Dan Shappir [00:23:21]:
If it gets to be too high, then it means, you know, your your system is overloaded and and it's not responsive enough. It you know, if think about think about a service, an express service that takes, I don't know, 2 seconds to get to pull something out of the queue, it means that the browser, the client side is waiting for 2 seconds before it's, the it's, event is even processed. So, obviously, that's a bad thing. And it's especially bad if it keeps on growing because then eventually your service server would just become unresponsive. Right. So that's information that, that Prometheus that the the system, the prompt client is actually able to extract out of node and exposes that into Prometheus. So that's one of the system monitoring things that's really useful to look at when you're, monitoring a node service. Another thing that's really useful to look at is, for example, heap usage.
Dan Shappir [00:24:29]:
Because if you've got, let's say, some sort of, memory leak, then Right. You'll see that after, garbage collection at GC, rather than going all the way down, your memory just you utilization keeps going up and up and up and up. And, again, that if if it keeps on going, what will happen is that the the the node service will try to do GCs more and more and more and more in order to free memory, but fail to do so and you get what's known as a GC storm. Or effectively, the the the service, all it does is just try to free up memory that that it can't, and it becomes totally stuck. So that's another thing that you can look at in the context of of, of Prometheus.
Charles Max Wood [00:25:17]:
Well, it seems like on both of those measures, on the the the lag in, what is it? Node event loop lag.
Dan Shappir [00:25:26]:
Exactly.
Charles Max Wood [00:25:27]:
It seems like because I'm sitting here, and I'm thinking, okay. So it's gonna tell me if it doesn't have the resources to handle whatever's coming at it. But for me, I I find it useful because, I mean, let let's just take podcasting for an example. Right? Like, I don't go and religiously obsess over the numbers. Right? I I don't go look at the metrics. But for for an app, if you're checking the metrics on a regular basis, it seems like you could start to see this lag, event loop event loop lag steadily increasing and go, okay. We are getting to the point where we need to start looking at right? Instead of all of a sudden being, woah. Woah.
Charles Max Wood [00:26:12]:
Woah. We're, you know, we're way over the edge. Right? And so it it allows you to be proactive, right, instead of reacting to people are complaining it's slow.
Dan Shappir [00:26:22]:
Yeah. And, actually, what you really want is to have good alerting, and we'll get to that as well because
Charles Max Wood [00:26:29]:
I was gonna yeah.
Dan Shappir [00:26:30]:
Yeah. Because realistically, you're not going to check the graphs for all your services every morning or every afternoon. What you really want is you want a system that alerts you in case something is wrong, and you usually want an alert to say not that, you know, system is broken. You You want an alert to tell you, you know, system is running hot. You should you should do something before it breaks.
Charles Max Wood [00:26:59]:
Right.
Dan Shappir [00:27:00]:
And, and yeah. And and Prometheus is great for that as well because you can specify alerts, and so that's the other part. So, I was talking about, how all the data is collected into the Prometheus service and then saved in into persistent storage. But then you can also do queries on top of that data. Prometheus has its own query language. It's called PromQL, and it looks nothing like, SQL, you know, even though it's a query language. It's a totally different query language. No Well, SQL
Charles Max Wood [00:27:38]:
is the query language, and it doesn't look like SQL even.
Dan Shappir [00:27:41]:
Exactly. And you can do 2 sort of things. You can have Grafana as a visualization environment. So Grafana is the service that you run-in the browser, and it can show you all sorts of graphs from various data sources. And one of the data sources is Prometheus, and you can write from QL queries that extract data and then graph this data in in whatever dashboard you're using. So that's one possible usage. Another possible usage is something called the alert manager, which is another component of Prometheus that comes along with it. It's a it's a standalone service that does regular that every at regular intervals, it runs, PromQL queries, gets the data from them, and sees if they're if alerting criteria are are met.
Dan Shappir [00:28:40]:
And if they are, it can then push alerts into, emails or Slack or PagerDuty or whatever so that you can generate alerts out of the out of the Prometheus data. So for example, going back to that farmer and field and and thermometer example, you could say that if the temperature goes above, I don't know, 30 degrees on the ground, send an alert. Something along these lines.
Charles Max Wood [00:29:11]:
Yeah. That makes sense. Yeah. So When I just wanna throw throw in here real quick because, I I think sometimes we kinda treat, the the time series data as kind of monolithic in in ways like treating it, like, for the day or the week. Right? I was actually looking on the Discord server that we use for the hosts and adventures in DevOps. They were gonna do an episode where they were talking about holiday rushes. Right? And so one day to the next, it may vary, or you may get a lot of your traffic in the morning or the evening. And so, you know, by having these alerts, you can start to pick up some of the patterns.
Charles Max Wood [00:29:52]:
And the other thing is is you can turn around and you can say, okay. I not only do I know that something's happening now, but I can go look at the current state of things or get a snapshot from the Prometheus data and then start to solve whatever is the issue is. Right? Whether it's I need more resources or, oh, I didn't realize, but I built something into the application that makes it memory heavy. And so my heap size is going out of control, and I'm running out of memory or whatever.
Dan Shappir [00:30:21]:
For sure. But but definitely also the fact that data can vary over time even if regularly can make some of this stuff pretty challenging but still doable. PromQL is a very sophisticated query language, and I'll give examples of some of the challenges that I've run into when creating alerts. So for example, at a lot of companies that I've worked, the companies like Wix or or Next Insurance, there was a lot more traffic, let's say, over the weekdays than over the weekends, which is not surprising. But the downside of that was that data over the weekend would fluctuate a lot a lot more because the sampling size size was significantly lower. There when you think about it, let's say, you have, let's say talking about Next Insurance. So you might have 10,000 sessions in a week in a working in a weekday in a working day Mhmm. But only a 100 over the weekend.
Dan Shappir [00:31:30]:
And when you've got only a 100 sessions, then 10 bad sessions can really impact your performance numbers. So and then we would see that, we if we weren't careful, we would start getting alerts over the weekends because the data was less stable because there were just fewer sessions. So we basically kind of had to come up with queries that were also dependent on the number of sessions, not just the duration of the sessions. So if there were too few sessions, we would ignore the the other criteria of the duration of the sessions. So that might be something. You know? It it makes the queries more challenging, but you wanna take these sort of things into account. So quickly going back to when is Prometheus a good match and when isn't it a good match? When what type of data would you want to put into Prometheus and what kind of data you would probably wanna put somewhere else? So Prometheus is a good match when you're recording pure numeric time series data. Okay.
Dan Shappir [00:32:47]:
It's appropriate for machine centric monitoring, your monitoring systems, when it's highly dynamic service oriented architectures because, you know, this whole pool type in mechanism makes it very easy, to adjust to additional instances coming up or going down in a very dynamic sort of a manner. And, again, I'll talk about it assuming we have the time, something about it's really important when the data is multidimensional and both in the collection and the querying, and I'll explain what that is a bit later on. Now when is it not a good match? First of all, it's not a good match when you need a 100% accuracy. So, for example, if you're looking at stuff like billing, when it the numbers have to be perfect, then Prometheus is not a good solution. Prometheus, in most cases, kind of averages out data. Because, again, when you're looking at stuff like, CPU usage, it doesn't really matter if you're running at 93 or 93.2 or even 94%. But if, you know, obviously, if you're looking at stuff like, you know, your taxes, you probably need to be accurate. It's not
Charles Max Wood [00:34:11]:
a pro taxes.
Dan Shappir [00:34:12]:
Sorry?
Charles Max Wood [00:34:13]:
They said don't talk to me about taxes right now.
Dan Shappir [00:34:18]:
Yeah. Yeah. Anyway, it's, not it's also not appropriate when you're recording non numeric data. So, for example, if you're recording stuff like email addresses or phone numbers even, even though they seem like they're numeric, they're not really, or or, you know, street addresses, stuff like that, that's not appropriate. And it's also not appropriate when the data has to be totally persistent. When, you know, when it's, you can't afford to lose any data. As I said, Prometheus pulls the the data pulls the data out of, the various services at regular intervals. Let's say every 1 minute.
Dan Shappir [00:35:04]:
If that service crashes during that one minute, you've lost the data since you previously pulled it. And that's obviously not something that you can live with in a lot of cases where you need, persistent data. And, you know, if it's if you're dealing with bank accounts and stuff like that, you can't afford to lose, transactions or stuff like that. So
Charles Max Wood [00:35:30]:
Yeah. But it it seems like most people are pulling are using it by putting this client onto their application, and so it's it's only recording those specific kinds of data. You're talking about a custom use of Prometheus where you might push other data into it as well.
Dan Shappir [00:35:49]:
Exactly. Exactly. Now it can be a challenge because, again, let's say you're pulling you're pulling the data or pulling it in every minute. And if your server has a problem that in a certain scenario, memory consumption, like rockets out of control and then the server crashes, then you might not be able to catch that if it, you know, if it happens within the spans the span of a few seconds. So, you can either then try to increase the the the the rate at which you pull and hope that you're lucky or look at for some other solution.
Charles Max Wood [00:36:25]:
Right.
Dan Shappir [00:36:27]:
As I said, it has integrations for Node, JVM, Go, Python, Ruby. And in terms of, systems, stuff like Kubernetes, GitLab, AWS, Jira, MongoDB, Redis. For visualization, usually, you'd use Gafana even though it also has its own built in simple visualization capabilities.
Charles Max Wood [00:36:48]:
Okay.
Dan Shappir [00:36:48]:
And it's compatible with OpenTelemetry. So if you're using OpenTelemetry for collecting telemetry information, you can also use OpenTelemetry on top of Prometheus and have OpenTelemetry put a date put its data, the data that it collects into the relevant data, the the, the time series data into Prometheus. So, there are several types of metrics, and they're appropriate for different scenarios. So very quickly going over the different metric types, you've got something called a counter. Counter, think about, you know, the if you're let's say, think about the club, let's say, that you wanna go into, and there's somebody at the door with this kind of a counter device, clicking it every time somebody goes in, like counting the number of people inside because, you know, the fire, regulations only allow up to x number of people to be inside at the same time. And they so they need to count the number of people going in and the number number of people coming out to make sure that, you know, they don't exceed those limits. So counter is really something like that. It's a metric that that can only increase, and you basically add 1 or add an or your n, which is like adding 1 n times.
Dan Shappir [00:38:18]:
So you basically just add add into it, and it keeps on go getting higher. Can you think about things that you would measure using something like like a counter?
Charles Max Wood [00:38:31]:
Yeah. Like the number of requests that come in or
Dan Shappir [00:38:34]:
So, funnily, that that's that's that's literally the first example that I have. So the request count is exactly such a such a thing, tasks completed, error count, all these kind of things that only go up. They only increase until, at least until you restart the service.
Charles Max Wood [00:38:55]:
Right.
Dan Shappir [00:38:56]:
And the cool thing about a counter is that because of this behavior of only increasing, you can compute the rate of increase, which makes it possible to have predictions. Because if you can, calculate the rate, then you can also predict where you'll be in a certain amount of time. Right.
Charles Max Wood [00:39:24]:
Makes sense.
Dan Shappir [00:39:25]:
Yeah. So that's the simplest type of metric. Again, it's used automatically for things like a request counter or a task complete counters or an error counter, stuff like that. But you can also create your own applicative counter if you want to count your own stuff. Right. The next type of metric is called a gauge, and it records a value that can go up or down or literally be set at any value that you want. So you can literally send say the the value of this gauge right now is x. Mhmm.
Dan Shappir [00:40:03]:
Again, can you think of when you might use the gauge?
Charles Max Wood [00:40:07]:
This would be like queue size or,
Dan Shappir [00:40:11]:
Chuck, are you looking at my slides?
Charles Max Wood [00:40:13]:
No.
Dan Shappir [00:40:15]:
Because, again, that's the the first item that I have on my slide for for using it is the queue is queue size. Exactly. Queue size, memory usage, CPU usage, number of requests in currently in progress, all these sort of things that you just set it to a particular value at each at each point in time. Right. Now the the key so it can go up and down or be set to a particular value. The thing about it though is that because it's so arbitrary, you can't use it to assess rate of change. Because if you can just jump between numbers, there's no really meaningful, rate that you can think about.
Charles Max Wood [00:41:02]:
Yeah. You you can average it out. But
Dan Shappir [00:41:06]:
Yeah. Yeah. The next one is slightly more complicated, and I and when I, I actually have presentations that I do, that I so far have done internally. I'm kind of looking for a conference who wants to talk about this. But if you want to be able to measure things like, histogram like, sorry, like averages or even more importantly percentages, like the median or the 90th percentage or stuff like that Mhmm. Then you use a metric called a histogram. Now that might seem surprising why it's called a histogram if it's used to measure percentiles, for example, and I'll explain it I'll try to explain it in a in a minute. But can you think about when you want something like percentiles?
Charles Max Wood [00:42:00]:
I would think like memory usage or
Dan Shappir [00:42:03]:
No. Memory usage, we talked about counter. When when
Charles Max Wood [00:42:06]:
you What I what I meant was, you know, percentage of, like, memory used or percentage of resources used.
Dan Shappir [00:42:15]:
So what they have is
Charles Max Wood [00:42:16]:
something not understanding what it is.
Dan Shappir [00:42:18]:
So so I'll give an I'll give examples, and I think then then it will click for you. So think about something like a request duration. Like
Charles Max Wood [00:42:28]:
Oh, I gotcha.
Dan Shappir [00:42:29]:
So you want to say, like, my median request, I gotcha. Yeah. Rendering duration is x or the 90th percentile is y. Another example might be the response size. So you might say, my average response size or my median response size is so such and such, it goes up to something else when I'm looking at the 99th percent arc.
Charles Max Wood [00:43:02]:
Mhmm.
Dan Shappir [00:43:04]:
So you kind of want to be able to get the measurements, but then use them in order to calculate, as I said, percentiles.
Charles Max Wood [00:43:14]:
Right.
Dan Shappir [00:43:15]:
Now the it's called the histogram because the way that it's actually implemented is that internally, you create buckets. You can say let's say, like let's say you're talking about request duration. You'd say I have a bucket from 0 to 10 milliseconds, from 10 milliseconds to 50 milliseconds, from 50 milliseconds to 100 milliseconds, and anything above a 100 milliseconds. Okay? And each measurement goes into one of these buckets. And then you look at how many fell into each one of those buckets, and then you can use it to calculate percentiles. Right.
Charles Max Wood [00:43:53]:
No. That makes sense. So
Dan Shappir [00:43:56]:
So it's like a discount for each point in time.
Charles Max Wood [00:44:00]:
Right. The thing that I'm imagining is anything that you would put, like, on a bell curve or something like that. Right? And so then then your yeah. Your 90th percentile is, you know, out on the end. Right? But so these are kind of rare, but they're also kind the extreme that you have to deal with. Right?
Dan Shappir [00:44:19]:
This is what the medium or whatever is in the middle. Histogram it's a histogram that gets recorded and updated at every interval. So it's like a histogram over time.
Charles Max Wood [00:44:30]:
Oh, interesting. So you could you could literally see, like, the hump move or whatever.
Dan Shappir [00:44:35]:
Exactly. You say, this is my histogram now. This is my histogram a minute later. This is my histogram another minute later, and so on
Charles Max Wood [00:44:42]:
and so forth. Okay.
Dan Shappir [00:44:47]:
So, hopefully, that's kind of clear. Now there's another metric called summary. I don't wanna get into it. It's it's it's more rarely and uncommonly used. So we'll skip it for now. You know, we're starting to run long in any event. Because what I really wanted to talk about was the concept of of labels and dimensionality. So let's go back again to that field example of the farmer wanting to measure, let's say, humidity and temperature in the field, in in their in their field.
Dan Shappir [00:45:18]:
But it's a big field, so they're not using just a single measurement device. They've got devices spread out in different locations in the field. So one way in which you might think about it is that we'll create a separate metric for each one of those measurement devices.
Charles Max Wood [00:45:34]:
Mhmm.
Dan Shappir [00:45:35]:
But the way that Prometheus looks at it is that we've got a single temperature metric, but for each measurement, we associate labels with it. And that label could is is textual value. It might be the name or the ID of the specific measurement device. So let's say I've got 4 devices in the field. They're called a, b, c, and d. I've got the single temperature metric, but each measurement is associated with either a or b or c or d. Mhmm. Is that clear?
Charles Max Wood [00:46:13]:
Yep.
Dan Shappir [00:46:15]:
So now what's the benefit of this approach is that it's very dynamic. If if I if I add an if I, you know, if I buy some more land and I now need also an e and f met, measurements, I don't need to create new metrics. It's the same metric
Charles Max Wood [00:46:36]:
Right.
Dan Shappir [00:46:37]:
But it's associate but there are the new values coming in are associated with new labeled values. So I've got a label called, you know, the device name, but it can come in with any any arbitrary textual value. And then when I query the data, I can say, show me just the met the values for the metric for measure for when the label is equal to a or Mhmm. To a or b or a and b. Because when you do queries in the pro prompt QL query language, you can either set a label like in the query to have a specific value or use a regular expression.
Charles Max Wood [00:47:26]:
Okay. So I I'm imagining that you could do this if you're spinning up, say, another Docker container or server.
Dan Shappir [00:47:35]:
Exactly. So you might have the pod name. You you would have a label called pod name, and the value is the name of the pod. Right.
Charles Max Wood [00:47:47]:
Yep.
Dan Shappir [00:47:48]:
And you can have any number of labels associated with each metric, and those labels can have any number of values.
Charles Max Wood [00:48:01]:
Okay.
Dan Shappir [00:48:02]:
Now that I think, highlights the potential issue of dimensionality because it's it makes the system very flexible.
Charles Max Wood [00:48:15]:
Mhmm.
Dan Shappir [00:48:15]:
But you'll see the problem in a minute. So let's say I have a metric, and I have 3 labels associated with it. So it's actually they defined a three-dimensional space for that metric because each measurement has, a coordinate that is specified by the values of those three labels.
Charles Max Wood [00:48:41]:
Okay.
Dan Shappir [00:48:43]:
Is that clear? I'm kind of waving my hands here, so think about it. Going back to the field example, let's say that instead of each measurement device in the field having its own name, it has a coordinate in the field. So it has an x value and a y value. So it's a two-dimensional space.
Charles Max Wood [00:49:14]:
Okay.
Dan Shappir [00:49:17]:
Now, in reality, when you're measuring things, you might have many more labels than that. So it becomes an n-dimensional space.
Charles Max Wood [00:49:28]:
Right.
Dan Shappir [00:49:31]:
And for each axis, the number of points on that axis is the number of different values that label might have.
Charles Max Wood [00:49:45]:
Okay.
Dan Shappir [00:49:48]:
So let's say I have 3 labels and each one has 10 potential values. How many different numbers in that space can I have?
Charles Max Wood [00:50:03]:
A thousand.
Dan Shappir [00:50:05]:
Exactly. So I need to remember a thousand different numbers for that single metric.
Charles Max Wood [00:50:15]:
Right. And make sense of it, hopefully.
Dan Shappir [00:50:17]:
Yeah. But the problem is that if I'm not careful with the number of labels that I use, and especially the number of different values that each label might have, I can literally blow up my memory. And, again, going back to how Prometheus works: let's say you have a Node server and Prometheus pulls the data from the Node server once every minute. That Node server keeps all that data in its memory.
Charles Max Wood [00:50:54]:
Oh, okay.
Dan Shappir [00:50:56]:
If you're not careful, you'll blow up Node's memory. So I'll give a real example. We were using Prometheus to monitor data associated with page performance at Next Insurance, and one of the labels was the URL. We had something like 2,000 different pages. So one of the dimensions was the URL, and it had 2,000 possible different values.
Dan Shappir [00:51:44]:
Actually, it could have been a lot worse if we hadn't filtered out the query parameters and stuff like that. And another label that we had was device type, because we wanted to distinguish between desktop and mobile. And another label that we had was the browser type, because we wanted to distinguish between performance, let's say, on Chrome and on Safari. So think about it: it's 2,000 different URLs times 10 different browsers times 3, 4, or 5 different types of devices.
Dan Shappir [00:52:34]:
And all of a sudden, for each one of those three-dimensional coordinates, you need to remember a metric value. So it's millions of numbers that you're holding in memory, and we literally kind of blew up the memory space.
Charles Max Wood [00:52:55]:
Right.
Dan Shappir [00:52:57]:
So yeah, you need to be careful about being cavalier with the number of labels and dimensions that you're using. It's called cardinality, and, basically, you need to be aware of high cardinality. But the other side of it is that the system is extremely flexible. If there's a new URL, you don't need to make any modification in the system. It automatically adjusts, because it just associates another label value with that particular metric.
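The practical takeaway from that story is to keep label values bounded. Here is a hedged sketch; the helper and the numbers in the comments are illustrative, not the actual code used at Next Insurance.

```typescript
// Collapse raw URLs into a bounded set of label values, so the URL label can't
// explode the series count. Query strings and numeric IDs are typical culprits.
function routeLabel(rawUrl: string): string {
  const { pathname } = new URL(rawUrl, 'https://example.com');
  return pathname
    .split('/')
    .map((segment) => (/^\d+$/.test(segment) ? ':id' : segment)) // /users/123 -> /users/:id
    .join('/');
}

// 2,000 URLs x 10 browsers x 5 device types is already 100,000 label
// combinations, each its own time series; with a histogram's buckets behind
// every combination, that's millions of numbers sitting in memory.
console.log(routeLabel('https://example.com/users/123?utm_source=ad')); // "/users/:id"
```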
Dan Shappir [00:53:38]:
Right. Now, there are all sorts of things about naming conventions, about how you name your metrics and how you name your labels, that I won't go into. I will say that there's a really cool API for working with all the different metric types we talked about: counter, gauge, and histogram. If you import the prom-client npm package, then you get those APIs so you can push your numbers, with associated labels, into the system, and they just get recorded.
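A minimal sketch of that API, assuming the prom-client npm package and an Express server; the metric names are common conventions used here for illustration, not anything prescribed.

```typescript
import express from 'express';
import { Counter, collectDefaultMetrics, register } from 'prom-client';

// Node-level metrics (event loop lag, heap usage, GC) come for free.
collectDefaultMetrics();

// A counter with labels, incremented from application code.
const httpRequests = new Counter({
  name: 'http_requests_total',
  help: 'HTTP requests handled by this service',
  labelNames: ['method', 'status'],
});

const app = express();

app.get('/hello', (_req, res) => {
  httpRequests.inc({ method: 'GET', status: '200' });
  res.send('hello');
});

// Prometheus pulls everything recorded above from this endpoint on each scrape.
app.get('/metrics', async (_req, res) => {
  res.set('Content-Type', register.contentType);
  res.end(await register.metrics());
});

app.listen(3000);
```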
Dan Shappir [00:54:35]:
So, again, you might have, let's say, some sort of business logic process that does all sorts of things: goes to one database, goes to another database, and so on, and you want to measure its duration overall. Or maybe you've got your own custom queue for something, and you want to measure the size of that queue. Or let's say you're utilizing your own custom pool of resources, and you want to make sure that you've made the pool not too big and not too small; then, again, you could be looking at the size of that pool and what its utilization is at each point in time. So through those APIs, you can measure your own business logic values. And the example that I gave is that we were actually using it to measure Core Web Vitals.
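For those business-logic cases, the same package gives you a histogram timer and a gauge. A sketch with invented names:

```typescript
import { Gauge, Histogram } from 'prom-client';

// Overall duration of some multi-step business process.
const enrollmentDuration = new Histogram({
  name: 'policy_enrollment_duration_seconds',
  help: 'End-to-end duration of the enrollment flow',
  buckets: [0.1, 0.5, 1, 2, 5, 10],
});

// Current size of a custom in-process queue (or pool utilization, same idea).
const queueSize = new Gauge({
  name: 'render_queue_size',
  help: 'Jobs currently waiting in the render queue',
});

const queue: unknown[] = [];

export function enqueue(job: unknown): void {
  queue.push(job);
  queueSize.set(queue.length); // gauge tracks the current size
}

export async function enrollPolicy(): Promise<void> {
  const end = enrollmentDuration.startTimer(); // starts the clock
  try {
    // ...go to one database, then another, call an external service...
  } finally {
    end(); // records the elapsed seconds into the histogram
  }
}
```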
Dan Shappir [00:55:44]:
If you remember, we've had a lot of,
Charles Max Wood [00:55:47]:
we've had, yeah, a lot of conversations about it.
Dan Shappir [00:55:50]:
Exactly. Now those are metrics that are actually measured on the browser side. So you might ask, how did they make their way into Prometheus? Well, we would collect this data on the client side, post it to the node server, and the node server would use that API to put it into Prometheus.
Charles Max Wood [00:56:13]:
Right.
Dan Shappir [00:56:15]:
So this way, you can collect data not just from the Node side, but also from the browser side. Again, it could be system-level stuff like Core Web Vitals, or it could be applicative stuff: whatever your web application happens to be doing that you would like to measure and monitor.
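A sketch of that browser-to-Node path, assuming an Express endpoint and the prom-client package; the route, payload shape, and metric name are made up, and the browser is assumed to POST JSON such as what a web-vitals callback reports.

```typescript
import express from 'express';
import { Histogram, register } from 'prom-client';

const app = express();
app.use(express.json());

// Largest Contentful Paint as reported by browsers, labeled by page route
// (a normalized route, not the raw URL, to keep cardinality down).
const lcp = new Histogram({
  name: 'browser_lcp_seconds',
  help: 'Largest Contentful Paint reported from the browser',
  labelNames: ['route'],
  buckets: [0.5, 1, 1.5, 2.5, 4, 6],
});

// The browser posts { route, value } here, e.g. from a web-vitals onLCP callback.
app.post('/rum', (req, res) => {
  const { route, value } = req.body as { route: string; value: number };
  lcp.observe({ route }, value / 1000); // web-vitals reports milliseconds
  res.sendStatus(204);
});

// Prometheus scrapes this on its regular interval, browser data included.
app.get('/metrics', async (_req, res) => {
  res.set('Content-Type', register.contentType);
  res.end(await register.metrics());
});

app.listen(3000);
```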
Charles Max Wood [00:56:39]:
That's cool.
Dan Shappir [00:56:40]:
Yeah. So now, the final piece that I wanted to mention is how to get data out of Prometheus. As I mentioned, there's a query language called PromQL. That's P-R-O-M-Q-L. And like I said, it's definitely not SQL. It can do stuff like filtering data, aggregating data, running all sorts of predictive functions, taking quantiles, which is a fancy name for percentiles, averages, and whatnot. And it can be used to answer questions like: what's the 95th percentile latency in each data center over the past month? You know, so very sophisticated queries.
Dan Shappir [00:57:33]:
Or, how full will the disks be in 4 days? That's an example of a predictive query. Or, which servers are the top five users of CPU? All those sorts of queries. And you can use these queries either to graph things or to create alerts.
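To make those examples concrete, here is the kind of PromQL involved, run against Prometheus's HTTP query API from Node. The metric names assume node_exporter plus a hypothetical request-duration histogram; treat the expressions as sketches.

```typescript
// Querying Prometheus directly over its HTTP API (Node 18+, global fetch).
const PROMETHEUS = 'http://localhost:9090';

const examples = {
  // 95th percentile latency per data center over the past month
  p95PerDatacenter:
    'histogram_quantile(0.95, sum by (le, datacenter) (rate(http_request_duration_seconds_bucket[30d])))',
  // Which filesystems will run out of space within 4 days, extrapolating the last 6 hours
  disksFullIn4Days:
    'predict_linear(node_filesystem_avail_bytes[6h], 4 * 24 * 3600) < 0',
  // Top 5 CPU-consuming servers right now
  top5CpuServers:
    'topk(5, sum by (instance) (rate(node_cpu_seconds_total{mode!="idle"}[5m])))',
};

async function runQuery(promql: string) {
  const url = `${PROMETHEUS}/api/v1/query?query=${encodeURIComponent(promql)}`;
  const res = await fetch(url);
  const body = await res.json();
  return body.data.result; // array of { metric: {...labels}, value: [timestamp, "value"] }
}

runQuery(examples.top5CpuServers).then(console.log);
```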
Charles Max Wood [00:58:00]:
Okay. And that's what Grafana does?
Dan Shappir [00:58:05]:
Is it
Charles Max Wood [00:58:05]:
it uses PromQL to do this stuff?
Dan Shappir [00:58:08]:
Exactly. So in Grafana, what you do is you create a graph, and in the graph, you specify Prometheus as the data source. Okay. And you write the PromQL query, and then it basically graphs that query over time.
Charles Max Wood [00:58:25]:
Mhmm.
Dan Shappir [00:58:28]:
And it has all sorts of fancy graphs. You can do line charts and bar charts and heat maps and all sorts of really fancy visualizations, and it's really powerful. If you're interested in this sort of stuff, I highly recommend just going onto YouTube, let's say, and searching for PromQL. There's also documentation and tutorials for PromQL on the Prometheus website, so we can post links to that later on in the notes. Obviously, if this was a talk, then I would actually be showing examples of PromQL, but, you know, I can't really do that here.
Charles Max Wood [00:59:24]:
You're not gonna read it out loud?
Dan Shappir [00:59:27]:
Exactly. So you would put PromQL in one of three places, actually. Prometheus itself has its own simple web interface where you can put in PromQL queries and get the data back either as a tabular representation, basically just numbers, or as simple graphs. If you want much more sophisticated graphs, then you use Grafana. And Grafana actually has a pretty sophisticated editor for PromQL: it knows all the metrics in the system and the labels and the different label values, so it can do auto-complete for you when you're typing in the queries.
Dan Shappir [01:00:22]:
It's really nice. As I think I said, they're kind of sister projects, so they made sure that the integration is really good. And you can also use PromQL in the alerting rules, which feed into the Alertmanager. You type the queries into YAML files, and then they get run automatically at regular intervals. If a query comes back with data that matches, then an alert gets raised. So you run a query: if it comes back with no data, no alert.
Dan Shappir [01:01:03]:
If it comes back with data, then there's an alert. So an alert query might be: the number of CPUs that are at utilization higher than 80%. If the number is 0, then there's no alert. If the number is greater than 0, then there would be an alert.
Charles Max Wood [01:01:25]:
Makes sense.
Dan Shappir [01:01:27]:
Something like that. And it's actually even more sophisticated than that, because you can say you want to avoid momentary peaks. So you might say it needs to be higher than 80% over a period of 2 minutes. Stuff like that.
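Conceptually, that alert flow looks like the sketch below: run a query, and if it returns any series, notify. In a real setup you would put the expression into a Prometheus alerting rule with a `for: 2m` clause and let the Alertmanager handle routing; the Slack webhook here is a placeholder.

```typescript
const PROMETHEUS = 'http://localhost:9090';
const SLACK_WEBHOOK = 'https://hooks.slack.com/services/XXX/YYY/ZZZ'; // placeholder

// Instances whose average CPU utilization over the last 2 minutes exceeds 80%.
const query =
  '(1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[2m]))) * 100 > 80';

async function checkAndAlert(): Promise<void> {
  const res = await fetch(
    `${PROMETHEUS}/api/v1/query?query=${encodeURIComponent(query)}`,
  );
  const { data } = await res.json();
  if (data.result.length === 0) return; // no data, no alert

  const hosts = data.result.map((r: any) => r.metric.instance).join(', ');
  await fetch(SLACK_WEBHOOK, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ text: `CPU above 80% on: ${hosts}` }),
  });
}

setInterval(checkAndAlert, 60_000); // evaluate once a minute
```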
Charles Max Wood [01:01:45]:
Makes sense.
Dan Shappir [01:01:46]:
Yeah. So it's a very sophisticated and powerful system.
Charles Max Wood [01:01:52]:
So it sounds like something you used at Next Insurance. Is this something that you're using today? Or...
Dan Shappir [01:02:00]:
Yeah. We're also using it at Sisense. So, for example, our customers have the ability to run our services on premises, and then they can collect monitoring data for our services inside an instance of Prometheus that we install alongside them.
Charles Max Wood [01:02:25]:
That's cool.
Dan Shappir [01:02:30]:
But like I said, it's a really powerful system, and it's really flexible. And, again, if you want to do monitoring or alerting, and probably you wanna do both, then it's a very good solution for that. And that more or less covers it. Like I said, I have a talk presentation on it in which I'm able to show actual examples and visuals, so it's more informative. If you're running a conference and you're interested in that, I'm shopping this talk around.
Charles Max Wood [01:03:17]:
Nice. Well, I mean, this is something that I've kind of been contemplating figuring out how to put into my own systems. Right? A self-hosted version of Sentry is the thing I've been using lately, and it collects some of this kind of data. But I've gotten more and more into self-hosting my own stuff, and it sounds like this is kind of right up that alley too, where when I deploy, it just makes sure that I have a Grafana and a Prometheus instance set up that it reports to.
Dan Shappir [01:04:02]:
Exactly. Now, obviously, there is overlap between these sorts of systems, but I think their focus is not exactly the same. Let's say, if you're doing error logging, then Sentry is the tool for you. I think I said that Prometheus is not the appropriate tool for textual information. So if you're keeping stuff like stack traces, then you wanna use something like Sentry. You could theoretically count the occurrences of a particular type of error, but, yeah, you probably wanna use Sentry for something like that.
Dan Shappir [01:04:45]:
Also, Sentry has features that are specifically intended for performance monitoring in certain scenarios. But if you wanna do more general-purpose monitoring, system-level monitoring, and monitoring of applicative stuff, then Prometheus could be the solution for you.
Charles Max Wood [01:05:07]:
Right. I also like that, as it aggregates that data, what you can learn from it is only limited by what it's measuring and by what you can come up with to query out of it.
Dan Shappir [01:05:21]:
Oh, yeah. For sure. I created some amazing queries using PromQL. I'll give an example. At Next Insurance, we have something like 50 services in a service-oriented architecture, which communicate with each other over various API endpoints. And we wanted alerts to be fired when any endpoint in the system became too slow.
Dan Shappir [01:05:56]:
But then the question was, what does too slow mean? Because it can't be absolute numbers: if something usually takes 50 milliseconds and then all of a sudden grows to 100 milliseconds, that means it's too slow. But if something always runs at 100 milliseconds and then becomes 110, well, that's a smaller change and you might want to ignore it. So you can't look at absolute numbers. So I created really sophisticated queries that basically looked at behavior over time and tried to see if something became significantly slower compared to its own specific behavior from previous periods. And, also, we wanted to only look at those endpoints which were sufficiently used, because if there was some endpoint that was hardly ever used, then we probably don't care about it. So, yes, you can create really sophisticated queries like that.
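The general shape of such a query might look like the sketch below; the metric names, the 1.5x threshold, the windows, and the one-request-per-second floor are all invented for illustration, not the actual Next Insurance rules.

```typescript
// Fires for any endpoint whose recent p95 latency is well above its own p95 at
// the same time yesterday, but only if the endpoint is getting meaningful traffic.
export const slowEndpointsQuery = `
  (
    histogram_quantile(0.95,
      sum by (endpoint, le) (rate(http_server_duration_seconds_bucket[15m])))
    >
    1.5 * histogram_quantile(0.95,
      sum by (endpoint, le) (rate(http_server_duration_seconds_bucket[15m] offset 1d)))
  )
  and on (endpoint)
  sum by (endpoint) (rate(http_server_duration_seconds_count[15m])) > 1
`;
// If this returns an empty result, nothing is abnormally slow and no alert fires;
// any series it does return names an endpoint worth a notification.
```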
Dan Shappir [01:07:08]:
And that was exactly because, with all these endpoints, you don't want to have to tell a developer, well, you know, you're in charge of a service that has 100 different API endpoints, so you need to look every day at 100 different graphs to figure out if there's a problem. Instead, you want an alert to be sent to the Slack channel of that particular team if any one of those endpoints all of a sudden became slower in a way that potentially impacts the system as a whole.
Charles Max Wood [01:07:49]:
Right.
Dan Shappir [01:07:50]:
And, yeah, that's one of the projects that I did at Next Insurance, and we used Prometheus and Grafana for that.
Charles Max Wood [01:08:01]:
Very cool.
Dan Shappir [01:08:03]:
So when an alert like that was sent to the Slack channel, the on-call member of the team could actually click a link, see the graph of the performance of that endpoint in Grafana over time, and say, okay, we actually have a problem, or no, this is just a fluke or something that we can ignore.
Charles Max Wood [01:08:27]:
Yeah. Yep. It spiked for such and such a thing and no big deal.
Dan Shappir [01:08:33]:
Yeah. We were doing something. We ran some sort of a backup service and, you know, that's why it impacted everything. And
Charles Max Wood [01:08:40]:
Right. Or you migrated some data on the back end and it slowed it down for 2 minutes and then
Dan Shappir [01:08:46]:
Exactly.
Charles Max Wood [01:08:47]:
Yeah. Awesome. Alright. Well, we put some links into the comments on Twitch and YouTube and Facebook. If you wanna go find those, we'll try and get them into the show notes as well. But let's go ahead and do some picks.
Dan Shappir [01:09:05]:
For sure. Although I don't have that many. Okay, I actually do have something. So for some inexplicable reason, I decided to start tweeting out my favorite standalone fantasy novels. You know, fantasy tends to be written as lengthy series of books, or at least trilogies. But occasionally, I just want to read one book that stands on its own merit, because it starts and it ends, and you can move on to the next thing. So I started to tweet out my list of the top standalone fantasy books. And you know what? I'll pick each one at a time. So let's see which one was my first. Can you think of one? There's the obvious choice, by the way, which is, I think, The Hobbit.
Charles Max Wood [01:10:09]:
That's what I was thinking, but then I was like, I don't know if he considers that part of a series, because it is sort of a prequel to The Fellowship of the Ring and that series.
Dan Shappir [01:10:19]:
Yeah. But I think
Charles Max Wood [01:10:22]:
it's a self-contained story. Yeah.
Dan Shappir [01:10:24]:
Yeah. First of all, it's totally self-contained. And the second important thing is that it was written well before The Lord of the Rings. Tolkien actually wrote The Lord of the Rings because the publisher was so happy with the success of The Hobbit that they wanted more stories in that world. And it was his idea to transform it from a children's story into a story for grown-ups. The Hobbit was originally written for his children, although a lot of adults like that story as well.
Dan Shappir [01:11:01]:
Oh, yeah. You know what? I'll tell you another interesting story about The Hobbit. There are a couple of translations of The Hobbit into Hebrew, but one of the translations is especially interesting. It's called the pilots' translation, because there was a war between Israel and Egypt in the early seventies. In between the Six-Day War and the Yom Kippur War, there was what's called the War of Attrition. And various pilots were shot down and captured by the Egyptians and held as POWs for several years, kind of like the American POWs in Vietnam.
Charles Max Wood [01:11:56]:
Mhmm.
Dan Shappir [01:11:56]:
And they were looking for ways to pass the time. One of them got their hands on the English version of The Hobbit, the original, and they decided to translate it into Hebrew as a way to pass the time. And when they were released, they took the translation with them, and it literally got published. It's called the pilots' translation of The Hobbit.
Charles Max Wood [01:12:23]:
That's so fascinating.
Dan Shappir [01:12:24]:
From English into Hebrew. So, anyway, The Hobbit would be one. But because that's the obvious one, I'll give another one, and that book is called Tigana. It's by Guy Gavriel Kay. Interestingly, to kinda close the loop, he's the person that worked with Tolkien's son on publishing The Silmarillion. They had a lot of notes left over.
Dan Shappir [01:12:58]:
They had a lot of notes left over from Tolkien. So they actually took all those notes, kind of rounded out and filled in the story, and then released it as The Silmarillion after Tolkien had died. He worked on it with Tolkien's son, but he also wrote several books on his own. One of them is a standalone novel called Tigana. You might actually find it especially interesting because it's very much inspired by Renaissance Italy. The setting is a fictional world, obviously, with magic and stuff like that, but the situation is very reminiscent of Italy in the Renaissance, with the little warring principalities influenced by large external powers, and so on and so forth, and the culture and whatnot. And it's a great book.
Dan Shappir [01:13:58]:
It's amazing, the amount of story and settings and scenery and characters that he was able to cram into this one book. It's a pretty thick book, but still one book, and it's very highly recommended. As I said, it's called Tigana, that's T-I-G-A-N-A, by Guy Gavriel Kay. And that would be my pick for today.
Charles Max Wood [01:14:28]:
Awesome. I'll have to check that one out. We were watching The Lord of the Rings movies, and my daughter was saying that she'd never read them. I had it on Audible, and so she's been listening to it. So
Dan Shappir [01:14:45]:
but if she's already watched the movie, can she get into it?
Charles Max Wood [01:14:52]:
I think so. She's pretty into other series, like Harry Potter and Percy Jackson, and she got into those books too after she'd seen the movies. So, yeah. I'm gonna jump in with some picks. Now, I have to say my board game group hasn't gotten together in a little bit; it's just kinda that season of life. I'm trying to think of a game I should pick. What was the game that we played last Sunday with the kids? I don't remember.
Charles Max Wood [01:15:36]:
Anyway, there are all kinds of games out there that you can play. I'll just go with one of my favorites. It's funny, because I've never actually completely played through this game. No, I take it back, I have played through it once. It's called
Dan Shappir [01:15:59]:
Monopoly. Does anybody ever play Monopoly all the way through?
Charles Max Wood [01:16:11]:
My kids do.
Dan Shappir [01:16:14]:
All the way through?
Charles Max Wood [01:16:16]:
Yeah. I haven't played Monopoly in a long time. There are reasons that I don't love the game. I play it for nostalgia, but
Dan Shappir [01:16:27]:
Probably because it's not such a fun game.
Charles Max Wood [01:16:31]:
Yeah. The game that I'm thinking of... well, there are a number of issues that make it... anyway.
Dan Shappir [01:16:40]:
I think I recall somebody once telling me that the original motivation for the creation of Monopoly was to prove the futility of capitalism.
Charles Max Wood [01:16:53]:
Yeah. Well, it's not pure capitalism. Anyway, the game that I'm gonna pick is called Letters from Whitechapel. I understand it's like Scotland Yard, but I've never played Scotland Yard. One player plays as Jack the Ripper, and the other players play as the police deputies who are trying to catch Jack.
Dan Shappir [01:17:23]:
Is it kind of like Clue or something?
Charles Max Wood [01:17:25]:
No, in the sense that you're not trying to figure out who did it or anything like that. The way it works is, you have the women that Jack the Ripper kills. I just dropped the link, but I didn't label it. When he murders somebody in the game, he has so many moves that he can make to get back to his hideout, and you play it in 5 rounds. The police deputies are trying to block him off, so they'll go investigate different spots on the board, and there are hundreds of spots on the board. When they investigate a spot, they might find a clue, which means that Jack was there, because when Jack moves through a space, he writes it down on his sheet.
Charles Max Wood [01:18:19]:
One of my favorite things to do when I'm Jack the Ripper is actually to loop back around, so they'll find clues all around this spot where I went through twice, and then they don't know which direction I went coming out of there. Anyway, if you manage to get back to your hideout after all 5 rounds, then you win as Jack the Ripper. And if the deputies investigate a spot where Jack is, and they specifically say that they're staging an arrest at that spot, then they win.
Dan Shappir [01:19:03]:
I have to tell you, the subject matter seems a bit dark for a game.
Charles Max Wood [01:19:10]:
It is, but the overall gameplay is fun. What happens is you start to narrow down where the hideout is, right? So wherever the murder happens, you can start to fan out along the routes he might travel through, get a sense of where he's at, and then be able to arrest him. That's kind of the gameplay. Let me look it up on BoardGameGeek so I can get the weight. The reason that I haven't finished this one is that I tried to play it with my family, and it's kind of a longer game. I mean, it can run, like, 2 hours or longer.
Charles Max Wood [01:20:02]:
And so they, especially my kids, just lose interest. And my wife likes to play light, airy games, and this one's a heavier game, so all the strategizing and stuff, she just doesn't love. It has a board game weight of 2.64, so it is on the heavier side for a casual player. But, anyway, it's a fun one. So, yeah, I'll pick Letters from Whitechapel as my board game pick. And then, since I took way too long on that pick, I'll be brief on these other ones.
Charles Max Wood [01:20:46]:
I've been trying to be a little bit more efficient with my time, and I've also been working on getting back into shape, so I picked a marathon and started training for it. So, a few picks that I've got here that I'm just gonna put out there. My training program I do in TrainingPeaks, and that's just trainingpeaks.com; I'm gonna put it in the comments as well. Effectively, it just gives you a calendar, and you can buy workout plans. The ones I've bought cost anywhere from $5 all the way up to, like, $50.
Charles Max Wood [01:21:23]:
I think the marathon and triathlon ones were more expensive, but it literally just puts the workouts into your calendar. And then I have a Garmin Forerunner watch, and it just syncs them down to my watch. So all I have to do is go into my training calendar, pull the workout, and run it through the watch. Right? I did that this morning, went for a run, and it was awesome. As far as being more efficient, I've also picked up and been using Linear, which seems to be pretty popular in development circles. It's linear.app.
Charles Max Wood [01:22:02]:
I'll put that in the comments as well. It's basically a project management board like any of the other ones that you're used to. And what I've been doing is pulling anything that I intend to do anytime soon out of Linear, plus other stuff that I need to be doing. For example, when Dan and I were talking before the show, he was like, hey, we ought to get people like this on the show, or, hey, I was gonna reach out to someone, and I was like, oh, I know them. So as we were having that conversation, I put them into another system called Motion. I'll look and see if they have an affiliate program, but I'm just gonna put the link in; it's usemotion.com.
Charles Max Wood [01:22:54]:
The way that this works is, you put your tasks in. Let me put usemotion.com in the comments; sorry, I'm typing and talking at the same time. So you put your tasks in, and then it pulls from my Google Calendar, so it knows when I have something scheduled, like this episode. And then what it does is it says, okay, I'm gonna put the other tasks that you've put into Motion into your calendar.
Charles Max Wood [01:23:25]:
You can set it up so that it'll actually add them to the Google Calendar, which is what I did, and then I told it not to mark them as busy, so they show up as free time. It's got tasks set up for Tuesday, Wednesday, Thursday, and Friday this week as well, but they show up as free time. What that means is that if somebody books a time on my calendar, it'll just shift everything around it. So it essentially tells me what to work on. And then finally, the last thing that I've been using is FocusBlox. I'm actually gonna do another premium episode with Manny, who's the guy that created it. Effectively, what it does is you get on a Zoom call, you do a breathing exercise before you start, and you commit to: this is what I'm gonna do this hour. And sometimes I hit it and sometimes I don't.
Charles Max Wood [01:24:13]:
It's things out of my control, or it turns out to be harder than I thought it was gonna be, or whatever. But it keeps me on task, because they make you put your phone away from your desk so you focus. And then at the end of the hour, you do another breathing exercise, you report on whether or not you got your thing done, and then you do it over again. I usually get 2, 3, 4 focus blocks in an afternoon. I try and schedule all the regular stuff in the mornings, and I'll do the focus blocks in the afternoons. So a lot of my prospecting, like if I'm trying to find sponsors or stuff, that's on there first thing in the morning.
Charles Max Wood [01:24:56]:
Getting through my inbox is a first-thing-in-the-morning thing, after I go for a run and things like that. So, anyway, those are my picks. I think it's focusblocks.com. I know I have an affiliate link for that, so we'll put that in the show notes. It doesn't cost you anything extra, but I get a kickback if you use it. Ultimately, I mean, if these things save you a bunch of time and make you more effective, great. And if I get a kickback because I found an affiliate link for it, even better.
Charles Max Wood [01:25:29]:
But, ultimately, this is what I'm using. So, yeah. That's what I've got for picks. So we'll wrap it up. Until next time, folks.