Jonathan_Hall:
Hello everybody, welcome to another exciting episode of Adventures in DevOps. Today I'm your solo host, Jonathan Hall. Will Button is apparently in Dubai trying to get rich or something, I don't know. I hope so. I hope he shares some of his wealth when he gets back. But I'm really excited today to have with us Aviad. Did I say that right?
Aviad_Mor:
You said it exactly right.
Jonathan_Hall:
Oh, that's the first time I think I've ever gotten somebody's name right. The first time. Welcome.
Aviad_Mor:
Yeah, I'm very happy to be here. I'm Aviad, CTO and co-founder at Lumigo, and I'm really excited to be on this podcast.
Jonathan_Hall:
That's great. I'm excited to have you here too. So you submitted a topic proposal about observability, which is great because it's a topic I have a lot of questions about and that I think is interesting at the same time. I try to, quote, do observability, but I know there's so much I'm not doing right. So I'm going to try to pick your brain today and learn everything I can. And hopefully our audience will learn something too. But before we dive into the topic, do you want to tell us a little bit about maybe your background, how you came to know something about observability, maybe where you work, if that's relevant.
Aviad_Mor:
Yes, sure, I'll be happy to. So I've been in different R&D roles for the last 20 years. I started out as a developer in the Linux kernel, on very physical servers, and as time went by, you know, started working on cloud applications, where you get more and more distributed applications, different components. And a few years ago I started working with serverless, which was amazing, but very quickly I noticed that there are a few pieces missing. One of them is the ability to understand exactly what's going on in my system. And a few months later, together with my co-founder, we founded Lumigo, which aims to do exactly that: bring visibility, observability, to cloud native environments, where you can do a lot of fantastic things, but sometimes it can be very hard to gain that observability. So it really started out as a pain that I had while developing, while I had to make sure that production was running as it should, and getting very, very nervous when production had any issues and I knew I had to debug it quickly. And for the last couple of years, what we've been doing is working in the cloud, specifically the AWS cloud, and building this observability platform, very focused on cloud native environments. And we've been meeting a lot of people who have these exact issues and seeing how we can solve those issues for them. And at the same time, our own development team is working on the AWS cloud. So we're basically eating our own dog food on a daily basis, making sure that we're able to see whatever we need, because our R&D team is a very tough customer. They have
Jonathan_Hall:
Mm-hmm.
Aviad_Mor:
very high requirements. So all day long, basically, that's what we're working on.
Jonathan_Hall:
Nice. You said something that sparked my interest immediately. You said you used to hack on the Linux kernel. What did you do there?
Aviad_Mor:
So, in the first part of my career, I was working for a cybersecurity firm. So
Jonathan_Hall:
Okay.
Aviad_Mor:
especially in the back end, so it was receiving a lot of data, going over a lot of IP packets, TCP packets, and making sure that everything is safe and okay, nothing suspicious is getting through there.
Jonathan_Hall:
Nice.
Aviad_Mor:
Yes.
Jonathan_Hall:
So, onto the topic of observability. I think my journey into observability started... I'm old, I used to run a dial-up ISP as a kind of hobby back in high school. And so I installed MRTG, I don't know if you're familiar with the tool. I imagine a few of our listeners are, but it's a really old tool that is probably completely obsolete by now. It stood for Multi Router Traffic Grapher, and it would basically give you graphs of inbound and outbound traffic on different network interfaces. That's all it did. That's the only observability it gave you. I mean, you could shoehorn CPU utilization or disk space into that, as long as you didn't mind that it was labeled bytes in and bytes out.
Aviad_Mor:
Right. Yep.
Jonathan_Hall:
And then a few years later, I graduated to Nagios, and honestly, it's been years since I've really been deeply involved in setting up a tool at that level. But I'd be curious to hear a little bit: what are the differences in observability from the old-school bare metal, like a server rack that you're monitoring, to, you know, Docker containers or Kubernetes, and now serverless? What are the similarities and differences across these sorts of things
Aviad_Mor:
Right.
Jonathan_Hall:
and why is a tool like yours useful, uh, to address those specific needs?
Aviad_Mor:
Right, so that's a great question, because observability, monitoring, understanding what's going on with your system, has always been important, right? You always needed it 20 years ago, 30 years ago, and today. If you want to make sure that everything is okay, you need to make sure that you understand what's going on. I think there are a few differences. Probably the main one is that systems have become more and more distributed. So where once you had a monolith, you were able to tap into one major process, or even a few of them, hopefully all of them on the same server, and very quickly you were able to understand what's going on. And if, for example, you had a crash in your system, you'd be able to see the full stack, basically from whatever triggered that process all the way to the point where it actually crashed. And today it's much more complex. Sometimes it's easy, right? If somebody, for example, adds code which causes a division by zero, it's very easy to recognize it, make the change in the code, and we're good. But now you have a lot of different components in your system, which is a good thing, right? You're able to build this puzzle with whatever the cloud vendor provides you. Just as an example, if we're looking at AWS, you get an API gateway for your APIs, maybe Lambda and some Docker containers running for your compute, DynamoDB for your database, and so on. But now you have all these different parts, and just knowing what's going on with each one of them is simply not enough, because more than once you'll get to a point where, if we're talking about a crash, you'll see a crash, you'll check your code, and nothing changed in the last day, week, month. And that's very typical of a distributed system, where something changed somewhere else, maybe by another team, not directly connected to your Lambda, for example, or to your container. But because it changed the way, let's say, the data looks, that change reached all the way to your container, and now you have an issue. So you need to very quickly be able to understand what's going on to get to the root cause, and in order not to spend too much time and energy, be able to hop between all the components that took part in that chain of events. So that's one of the main differences. It's no longer enough to just get the information about one piece of your system. It used to be that one piece might be the whole system. But now there are hundreds of pieces in your system very easily, and you can even get to thousands. So you need to be able to tie
them all together. So that's one of the differences.
Jonathan_Hall:
Yeah, really good.
Jonathan_Hall:
It sounds like that same difference probably applies pretty equally to both microservice architecture and serverless. Or I suppose maybe there's a greater scale because you maybe have more pieces if you do serverless. But otherwise, is it similar concepts or does the serverless aspect change things too?
Aviad_Mor:
Right, so I think that first of all,
Jonathan_Hall:
Good point.
Aviad_Mor:
when people say serverless, a lot of times different people mean different things.
Jonathan_Hall:
Mm-hmm.
Aviad_Mor:
So I'll say what I mean when I say serverless. Serverless is the whole ecosystem where you're able to run things without having to take care of anything, or almost anything, related to DevOps. So if we're talking about compute, it will be Lambda. If we're talking about GraphQL, it will be AppSync, and S3 for file storage, and so on. So there are a lot of different services which are serverless. And containers, especially when we're looking at Kubernetes and ECS or EKS, which again allow you to run at scale, are similar, meaning that very quickly you can get to a high number of components, and you need to be able to tie them all together. And I'll also add that a few years ago, we were looking at serverless and containers, and there was this question: who will win the war? Who will win the fight? Will people pick serverless or will people pick containers? And one of the interesting trends that I'm seeing is that people are just picking both. It's not a war, it's an integration. It's really great. You pick, for each part of your application, whatever fits best. So we see a lot of container-based applications which have a lot of serverless parts in them. You'll have, for example, a Flask app getting the information in, but then you'll have some queue, like SQS or Kinesis, taking part in moving the data from place to place. And honestly, I think that's one of the things that makes observability in the modern stack even more complex, because it's not just about connecting all the dots.
Jonathan_Hall:
Right.
Aviad_Mor:
It's also that you have a lot of different types of dots. So it's not like you can write one specific piece of code once and add it to all your different pieces. You need to be able to connect the relevant data in order to trace the information for each of those, let's call them dots, for each of those components, separately. And a lot of those components are completely managed. Which is amazing, right? The cloud vendor takes care of everything for you. But then if you have, for example, in the middle of your system, an EventBridge, right, which is an event bus that moves the different events around... that's a great piece of software which AWS takes care of for you. But if you're not able to trace through it, and you can't add any tracer to an EventBridge, you're going to have a problem. You're going to see different parts of your system.
Jonathan_Hall:
Mm-hmm.
Aviad_Mor:
You won't be able to understand how it's connected. And in order to get that observability, to get that tracing at the level that you want, you need to be able to connect the dots, including the dots which are fully managed, that you have no control over, because now they're an integral part of your system.
Jonathan_Hall:
Mm-hmm. That reminds me of a question I'm going to ask a little bit later, based on an experience I had a few years ago. But first, I'd like to lay some groundwork about what we should be thinking about when we talk about observability. Because I think, like you said, serverless means different things to different people. I've noticed the same thing about the word observability. Some people include logging. Some people don't include logging. What pillars would you say exist within the framework, or under the umbrella, of observability? What are the main things that we should be looking at and considering?
Aviad_Mor:
Right. So I think that the most important thing about observability is the bottom line. The bottom line is being able to understand. And in our world, to understand usually means to be able to debug, to get to the root cause, to understand your system, even about things that you weren't able to know in advance would be interesting to you. Being able to query it, being able to see the different pieces of data after the fact. So after the system stopped working, not needing to go back, add different prints in the code or different traces, and only then pray that you'll hopefully be able to reproduce that problem. Then the question is, how do you get that observability? And that's exactly, Jonathan, where we get to those three pillars. The famous three pillars are logs, metrics, and traces. Basically, metrics allow you to know that everything is OK, or maybe not OK, just by seeing the numbers: you have a lot of issues, you have a lot of high latency, or something like that. They're great for seeing a lot of information in a very condensed way. Then you have logs, which are great for showing you a lot of information for specific things. And usually these are things that the developer was able to think of in advance: OK, if we get to this point in the code, I want to make sure that I know A, B, and C. And then you have tracing. Tracing is all about the ability to connect everything together in a very modeled way, without having to think of things in advance: you put the tracers in the right places and collect the data so you'll be able to correlate everything and see the full story end to end. Now, the way I see it, a lot of times people talk just about the three pillars, but I think that traces are the base for observability in a distributed environment, because you can very easily be overloaded with a lot of metrics and a lot of logs, and they'll contain a lot of data, but it will be very hard to really understand what you need out of it. Especially, you know, when something goes wrong, because it's never a question of if something will go wrong. It's always, when will it go wrong? And probably at 3 a.m., when you're not ready for it. So when you have the right tracing in place, you're able to see the full story. You're able to correlate all those different components and not go over tons of logs and never-ending metrics. You're able to just see the logs and metrics which are relevant for what's interesting to you right now. For example, a customer triggered an API and started a chain of events, let's say 20, 30, 40 components taking part in that. It's being able to see only those 20, 30, 40 components, and not all the things those components did, but the specific triggered events. It's being able to see the data being passed, the information going in and out of each of those components, for that specific API trigger. And then, when you add the layers of logs and metrics, but only for the correlated events, then you get observability. Then you get the information you need, and you're able to go back and forth only in what's interesting
for you.
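To make the three pillars concrete, here is a minimal, vendor-neutral sketch using the OpenTelemetry Python API and the standard logging module; the service name, metric name, and attribute keys are made up for illustration and this is not Lumigo-specific.

```python
# A minimal, vendor-neutral sketch of the three pillars with the OpenTelemetry
# Python API plus the standard logging module. Names here are illustrative only.
import logging

from opentelemetry import metrics, trace

tracer = trace.get_tracer("checkout-service")
meter = metrics.get_meter("checkout-service")
orders_counter = meter.create_counter("orders_processed")  # metric: condensed numbers over time
log = logging.getLogger("checkout-service")                # log: details you thought of in advance

def handle_order(order_id: str) -> None:
    # trace: one span per unit of work, correlated across services by trace context
    with tracer.start_as_current_span("handle_order") as span:
        span.set_attribute("order.id", order_id)
        log.info("processing order %s", order_id)
        orders_counter.add(1, {"status": "ok"})
```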
Jonathan_Hall:
So let's explain a little bit what tracing looks like to somebody who's maybe never seen it or conceptualized it. Is it as simple as a unique identifier on a request that is sort of sticky across services? Or is it more than that? I imagine it's maybe a little bit of both.
Aviad_Mor:
Right, so
Jonathan_Hall:
Yeah.
Aviad_Mor:
I think that it's a little bit of a lot of things, including both of those. And again, I'll actually start at the bottom line. The important thing about tracing is being able to see all the connected parts, all the components that took part in a specific call, or, if we take it to a higher level, being able to see the full architecture, how your different databases and your different queues are related to each other, so you know what's happening in your system. And then you get to the how, and this is exactly what you were talking about. Sometimes it's passing along IDs. Sometimes it's inferring from pieces of information that you're able to get. For example, when I was talking about those managed services, there you're not able to add any ID, just because you don't have any control over what AWS is running. So what you're able to do, for example, is look at the information being passed, and just by understanding the metadata for a specific service, if you know how to look at the metadata being entered on one side of the queue and then at the metadata being output on the other side of the queue, you're able to make that correlation. So it's a lot about collecting the right information. Some of that information is IDs that you add, but a lot of it is looking at the right part, and then sending it back to a correlation engine. And the correlation engine needs to be able to understand what's connected to what, because again, it's a very distributed environment. It gets pieces of information, which are traces, from different parts, and the correlation engine is able to make all the relevant connections.
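As a rough illustration of the "sticky ID" idea Jonathan asked about, here is a small sketch of an edge service minting a trace ID and passing it downstream in a W3C traceparent header; the URL, the print-based logging, and the function name are assumptions for the example.

```python
# A minimal sketch of passing a correlation ID between services: the edge mints
# a trace ID, sends it downstream in a W3C traceparent header, and every hop
# tags its own telemetry with that ID so a correlation engine can stitch the
# story back together. Header layout per https://www.w3.org/TR/trace-context/.
import uuid
from typing import Optional
from urllib import request

def call_downstream(url: str, trace_id: Optional[str] = None):
    trace_id = trace_id or uuid.uuid4().hex            # 32 hex chars, new trace at the edge
    span_id = uuid.uuid4().hex[:16]                    # 16 hex chars for this hop
    req = request.Request(url, headers={
        "traceparent": f"00-{trace_id}-{span_id}-01",  # version-traceid-spanid-flags
    })
    print(f"trace_id={trace_id} calling {url}")        # tag local telemetry with the ID
    return request.urlopen(req)
```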
Jonathan_Hall:
Okay.
Aviad_Mor:
You know, I want to add one more thing, which is that I think tracing is not just about being able to make those connections. That's maybe the most important part, but it's also being able to collect, in a modular way, the relevant information. So it's not only seeing, for example, that a Lambda is putting information into a DynamoDB, and that DynamoDB is now triggering another Lambda. It's also being able to collect the information about the input and the output for that Lambda: the event, the environment variables, and maybe the HTTP calls. What were the relevant headers, body, and response? So the developer doesn't have to add logs saying, I want to collect this information, in each and every part; it's automatically collected. So as opposed to logs, tracing is collecting the right things beforehand. And you can always add additional information
as you go along.
Jonathan_Hall:
So this brings me back to the question that I thought of a few minutes ago. A few years ago, I was working at a company that had a microservices architecture, and they were heavily based on, I think it was Google Pub/Sub. And I think they had NATS running on top of that. And we kept running into this problem that, basically, rather than the services directly communicating over gRPC or HTTP, they would publish something into the Pub/Sub and then a consumer would read it. And of course that added a little bit of latency, but more important, it added this sort of black box between our services, where we didn't necessarily know which instance was even taking a request. And it made debugging really hard. We didn't have your tool in place, obviously. So I can very easily see the problems, because I've seen them
Aviad_Mor:
Yeah.
Jonathan_Hall:
in person. How was that solved in a general way? I mean, you've talked about that a little bit. Is it a matter of passing that trace data along with every Pub/Sub message? Or, yeah, just maybe fill in the details a little bit there on a situation like that.
Aviad_Mor:
Yeah, sure. So I think that you're really touching here
Jonathan_Hall:
Mm-hmm.
Aviad_Mor:
the main pain point, those black boxes, right? Because you can have all of this great compute, containers and serverless, and there you have a lot of control. But then when you're using, for example, a Pub/Sub, you're feeling blind, right? And it's not only that you can't see what's going in there; you're missing the picture, because you know something is going inside and something is coming out the other side, but you're not really able to connect and correlate what went in and what went out. And that's a lot of our secret sauce, and I'm saying it with a smile because a lot of our tools are open source, so it's not really hard to check that out, but it's actually solved in a different way for every service out there. I'll name a few, just as an example: we have SNS and SQS and Kinesis, which are different queues, and then we have EventBridge, and then we also have DynamoDB Streams. All of these are like black boxes, which we're able to trace over. And for each of them, we had to solve it in a different way. For some of them, by looking at the metadata being sent and returned by the API requests, and by saving that metadata, we're able to correlate it without adding any IDs. And then, for example, with containers, where we're also able to follow requests going over different parts of the system, a container calling another container over HTTP, there are cases where you do need to add some ID in order to be able to make that connection. So it's actually not one answer. For each and every service that we're tracing, we sit down, and this is one of our areas of expertise, to find the best way to make that correlation between the inbound and the outbound. And wherever possible, we don't change the data at all, which is always the best way, because we don't want to garble up your information. And wherever needed, an ID will be added.
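As a rough, vendor-neutral sketch of the general technique (not Lumigo's actual implementation), this is roughly what carrying trace context through a queue's message attributes might look like with OpenTelemetry propagators; the queue client, message fields, and handler are hypothetical stand-ins.

```python
# A minimal sketch of tracing through a queue "black box": inject the current
# trace context into the message attributes on publish, extract it on consume,
# and continue the same trace. `queue.send`, `message.attributes`,
# `message.body`, and `process` are hypothetical placeholders.
from opentelemetry import trace
from opentelemetry.propagate import extract, inject

tracer = trace.get_tracer("orders")

def publish(queue, payload: bytes) -> None:
    attributes: dict = {}
    inject(attributes)                        # adds the W3C traceparent to the attributes
    with tracer.start_as_current_span("publish"):
        queue.send(payload, attributes)       # hypothetical client call

def consume(message, process) -> None:
    ctx = extract(message.attributes)         # rebuild the upstream context
    with tracer.start_as_current_span("consume", context=ctx):
        process(message.body)                 # hand off to the real business logic
```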
Jonathan_Hall:
Okay. So that means you need something, I'm guessing, on the sending and the receiving side, a sidecar container or an SDK or something. What do you offer there? Or maybe you have different solutions for different scenarios.
Aviad_Mor:
Right. So yes, we have different solutions for different scenarios just because
Jonathan_Hall:
Mm-hmm.
Aviad_Mor:
we want to find the best one for each scenario. For example, in AWS Lambda, what we do is add a Lambda layer, so there's minimal effect on the Lambda itself. In a container, we'd add an OpenTelemetry library, our own distro of an OpenTelemetry library, which collects the relevant information. But maybe the most important part, which is something we invest a lot of time in, is making all of this fully automated. So we have different ways of adding our tracers, but for the developer, in the end, it always means the same thing, which is no code changes. We invested a lot of time in doing that, and we keep investing time. The way we see it, the way I really believe in, is that developers, DevOps, SREs, they don't want to spend time working for their observability solution. So a good solution allows them to just add the observability, either by adding one line with the relevant library, one time, or, for Lambdas, for example, just clicking in the UI, and behind the scenes we're able to add that Lambda layer automatically with an extension. And that is very important. It's not only important for day one, where you don't have to deploy and make a lot of changes and have a high learning curve, it's also important for day two, three, and four, and so on, because of two things. A, you're going to have a lot of different services. So if you're going to have to make changes in each and every service, it's going to be hard work all the time, because the modern environment is ever changing. It's very easy to add services, which is great, but you don't want to have that overhead of, for every service I add, I'm going to have to work on the observability. And the second part is you want everyone on the team to be able to do it. So it doesn't have to be only the senior architect, if he or she has the time to do it. You want any developer, even if they joined one week ago and they're already deploying to production, or even to their own development environment, to have that ability to observe, because that's going to save them time, that's going to save them energy, and that's going to allow them to develop much, much quicker. So the automation is not just nice to have, it's not just because, hey, I don't want to work, it's because,
Jonathan_Hall:
Mm-hmm.
Aviad_Mor:
you know, you shouldn't have to work. Let us work for you, and you can just work on your system and make sure that your business logic is the best in breed.
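As a vendor-neutral stand-in for the "one line, no code changes" idea, here is roughly what auto-instrumenting a Flask service looks like with the open-source OpenTelemetry Flask instrumentation; Lumigo's own distro and Lambda layer wire up the equivalent for you, so treat this as a sketch, not their API.

```python
# A sketch of the "one line, handler code untouched" idea using vanilla
# OpenTelemetry auto-instrumentation for Flask. Assumes the
# opentelemetry-instrumentation-flask package is installed.
from flask import Flask
from opentelemetry.instrumentation.flask import FlaskInstrumentor

app = Flask(__name__)
FlaskInstrumentor().instrument_app(app)  # the "one line": every request now gets a span

@app.route("/orders")
def orders():
    return {"status": "ok"}              # the handler itself is untouched
```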
Jonathan_Hall:
So, on that point, what's the onboarding like if somebody decides to start using your service? How long does it take to start getting meaningful observability? I'm sure it depends on how complex their system is and so on and so forth. But, you know, how long does it take to get your first component wired up and running?
Aviad_Mor:
Right, so minutes. So literally minutes. And again, this is something, the type of things that
Jonathan_Hall:
Mm-hmm.
Aviad_Mor:
we work very hard on, so it will be very, very simple. And again, we believe, and we see it in the everyday people who use our product, the developers, the DevOps who use our product, that they're not there because I sat with their CTO and explained to them that Lumigo is the best thing, and now they're going to all the teams and making them use Lumigo. No, it's never that. It's always some developer: they have an issue, they have a long latency in their API and they need to solve it very quickly. They're looking for a solution and they just start using Lumigo. And after that, my aim is for them to fall in love with Lumigo, or at least be very excited that Lumigo is saving them a lot of time, and then they can show their team: hey, look at Lumigo, maybe try this out. And, you know, they just start with a completely free tier, because this is the way into the hearts of the developers. So what we did is make the onboarding very, very simple. It's actually a three-step wizard where you start seeing traces automatically. We are also a very visual tool. So again, you can very quickly start seeing your environment. It doesn't have to be going over a lot of information. You can always dive into your raw data, you can always dig as deep as you want, but you immediately get a very visual understanding: OK, this is your system, this is what every API looks like, this is what every API does on the back end, and this is where you have issues. Automatically, if you want, you start getting alerts. So again, you don't have to think in advance, what alerts do I need? We try to set them up based on a lot of experience that we have.
Jonathan_Hall:
Nice.
Aviad_Mor:
And in minutes, you can start debugging on the system.
Jonathan_Hall:
What are some of your competitors? Maybe people are having a difficult time really placing you in the market, or maybe they're using a competitor. Who are some of your competitors, and what are the differences?
Aviad_Mor:
Mm-hmm. Right, so I think when we're talking about observability, when we're talking about monitoring, usually the biggest one out there is Datadog. Very well known, and it's a great product. What we're doing is focusing on a specific type of environment, which is the cloud native environment, where it's highly distributed, as we said: containers, serverless. If you have, for example, on-prem servers, I'm going to say in advance, don't pick Lumigo, right? We decided to focus on this very rapidly growing type of application, where it's not just similar to what we used to have with a few changes, but rather needed to be built from the ground up: something which allows traceability to be very easy, very automated, and most important, being able to correlate all the different parts, including all of those black boxes that we mentioned before, right? And there's a lot of black boxes, because as we move forward in time, there are more and more managed services. It makes developing easier, it makes DevOps much easier, but it makes the ability to get visibility much, much harder. And because of that, it makes debugging much harder. So we're focused on that, automating it, making it very easy, and also bringing together the monitoring part and the debugging part. So DevOps and developers, they go together. I'm not the first person to say that, that's for sure. But a lot of times we still see tools which are separated. One will tell you something is wrong, and another one will help you find what is wrong. One will tell you there's high latency or high CPU, and another one will tell you why there is high CPU or high latency. And because we have developers and DevOps in mind, and because we're using the tool ourselves, it's very important for us to bring that together, to have a seamless workflow. Not just throw a lot of numbers at you, but allow you to get from the point of, OK, it took me a few minutes to deploy, but now I got an alert, I understand something is going on, to being taken exactly to the place where you can get to the root cause. In a simple way, in a way where you won't be spending all your energy moving from screen to screen, or having to think of what dashboards you need to get in place just so you'll know what's going on, to having that automatically
being done for you.
Jonathan_Hall:
Nice. Speaking of black boxes, and speaking of "don't use Lumigo on bare metal," what about Kubernetes? Especially if you're running your own Kubernetes. I can imagine Lumigo would be great for your services on Kubernetes, but does it help with Kubernetes itself? Either on bare metal or in the cloud, if you're using Amazon or Google or something.
Aviad_Mor:
Right. So, yes, it helps with Kubernetes. We actually already have Kubernetes users. You know, some of our best customers are using Kubernetes, or the AWS version of it, EKS, and they're running into these issues of being able to see the full traces end to end. And we're seeing it especially in places where they have different types of services: a lot of containers running on Kubernetes, and they need to be able to see how those containers are talking to each other, how they're affecting each other. And in addition, they also have what we earlier called serverless services. They don't make any distinction. In some of their applications it's completely EKS, and they use Lumigo there, and in others they're using it as a way to see how the data is being moved between, let's say, a Kinesis stream and an Express application, and so on and so on. So in all those places where it's highly distributed, we see people using Lumigo. And I'll just say, there's another thing that I think doesn't get as much press, but we see a lot of people using it as well, which is ECS. That's AWS's way of making it a little bit simpler to run containers, and a lot of people we see are using it. That way, they're able to start using containers without the hassle, because Kubernetes can be a little bit of a hassle. So if they don't want to start with that hassle, they use ECS.
Jonathan_Hall:
Mm-hmm.
Aviad_Mor:
I personally don't mind. They can use whatever they want, right? But every time we say Kubernetes, I always try not to forget the people using ECS as well.
Jonathan_Hall:
Of course, of course, yeah.
Jonathan_Hall:
Great.
Aviad_Mor:
Yeah.
Jonathan_Hall:
So you talked about how you focus a lot on making this automated so that you don't need to make code changes or very minimal code changes to start taking advantage of the features. Are there additional features or capabilities you can get if you're willing to make code changes? I mean, if you have specific needs or something, or is that a silly question?
Aviad_Mor:
No, no, that's a great question. I can tell you, when I think about the product, we always have to take into consideration the everyday developer, who maybe is just on PagerDuty right now and needs to go in and make a quick fix, and for them we want to have the system as simple as possible. But then we also always have to think of the power users, right? So, OK, it's great that we do everything automatically, but they want to add additional information. They want, for example, to be able to see the user IDs of the crashed container. They want to see the ticket ID, which is lying somewhere in their code in the Lambda, and they want to make sure that it's being pulled out. And also, we're able to recognize a lot of different issues, you know, of course any crashes, but also timeouts and out of memory and so on and so on, a lot of types of issues you're able to see out of the box. But then you have those application-level errors, those programmatic errors. The code is running fine, but the coder knows that if we got to this point in the code, if we're counting
Jonathan_Hall:
Mm-hmm.
Aviad_Mor:
the number of users and we got to a negative number, something is wrong. So nothing is going to crash, but that coder still wants to see that as an issue. So you can always add programmatic errors, application-level errors. You can always add tags, which is very important, allowing you to get a much richer experience of the data that you later consume on the Lumigo platform. And that's especially important because not only will you be able to see, in the case of an issue, who that user was, you'll also be able to see it when you query the system. It's not always about the issues. It's not always about something not working. A lot of times it's something working in a fuzzy way, something working but not the way you intended it to. So for that, we have a very strong search engine to search all the data collected by Lumigo, which is usually data you won't be able to see in any other place, because it's traced information, and then you can check things after they happened. And even if, in advance, you didn't know that there would be some correlation between a specific service and high latency, or a specific user and a low number of API calls, or anything else, you're actually able to query all of that on your own. It doesn't have to be things that were thought of in advance. And that, the way I see it, is very
Jonathan_Hall:
Mm-hmm.
Aviad_Mor:
related to what we were talking about, observability. You know, there's a lot of data. Some of it you want to be flagged very quickly, right? An issue, you want to be flagged very quickly so you'll know
Jonathan_Hall:
Mm-hmm.
Aviad_Mor:
after the fact.
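A vendor-neutral sketch of the programmatic-error and tag ideas, using the OpenTelemetry API rather than Lumigo's own SDK (whose exact calls may differ): the code never crashes, but the developer flags a business-level problem and attaches a ticket ID for later slicing and dicing. The names below are made up.

```python
# A sketch of an application-level (programmatic) error plus a custom tag,
# expressed with the OpenTelemetry API as a stand-in for a vendor SDK.
from opentelemetry import trace
from opentelemetry.trace import Status, StatusCode

tracer = trace.get_tracer("billing")

def reconcile_users(active: int, deactivated: int, ticket_id: str) -> int:
    with tracer.start_as_current_span("reconcile_users") as span:
        span.set_attribute("ticket.id", ticket_id)       # custom tag on the trace
        remaining = active - deactivated
        if remaining < 0:                                 # mirrors the "negative user count" example
            span.set_status(Status(StatusCode.ERROR))     # no crash, still flagged as an issue
            span.record_exception(ValueError("negative user count"))
        return remaining
```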
Jonathan_Hall:
Do you provide tools for front-end monitoring? A lot of applications include either a web front-end or a mobile app of some sort. Maybe it's just a dashboard or an admin console. But is that outside the scope of Lumigo? If my Android app crashes when I'm trying to order something.
Aviad_Mor:
So yeah, that's a great question, basically because one of the questions I ask myself when I go to sleep every
Jonathan_Hall:
Okay.
Aviad_Mor:
night is, what should we trace next? So we don't trace Android yet, and we get a lot of different requests and we see a lot of different use cases. So that's always the hard question, because we're always adding more and more tracers for different use cases. What we do that comes very close to your Android example is tracing from the part of your back end which is closest to that front end: either CloudFront, or AppSync if you're using GraphQL, or API Gateway for the regular REST APIs. And those actually contain a lot of information about what's coming in from your front end. So a lot of our users take that information and correlate it with what's happening in the browser, or on Android or iPhone, in order to see what was going on in the front end that maybe caused issues down the road, very deep inside the backend.
Jonathan_Hall:
Mm-hmm. Very nice. And that's a great segue to my next question, which is, what features do you have coming in the future?
Aviad_Mor:
Yeah, so, yeah, I love that question. So, a lot of surprises, a lot of great stuff. But I think one of the main things is the advanced traceability for the different containers and the different applications running on Kubernetes, like you mentioned before. We see a lot of different applications, and a lot of times they're using OpenTelemetry. We're open to vanilla OpenTelemetry, so whatever is using the open source OpenTelemetry is great. But on top of that, we're adding our own pieces of information and our own tests to make sure that you're getting best-of-breed traceability, and that allows us to run forward with that and add that monitoring to Kubernetes. I'll add two more things for what's coming next. We talked about automation, and automation is always very important, so we're looking at how to do even more advanced automation, to have it as easy as possible for all the different environments. And another thing is AppSync. Even today, AppSync is something that you can see and use in Lumigo, but we want it as an integral part of all the transactions. We're seeing rising use of AppSync by different users, and we're addressing that and allowing them to debug it
Jonathan_Hall:
Very nice.
Aviad_Mor:
just like any other service.
Jonathan_Hall:
So you've mentioned open source a couple of times, that a lot of your stuff is based on open source. You've mentioned OpenTelemetry. I can imagine a lot of people's mental gears turning, thinking, oh, this sounds great, I'm gonna go out and install OpenTelemetry and do this myself. What's to stop them? Whoa.
Aviad_Mor:
So first of all, not me, I'm not going to stop them. So, and
Jonathan_Hall:
hahahaha
Aviad_Mor:
I'm all for open source, and I think it really depends on the person and how much work they want to invest in each part of the things they do, and whether they want to invest in the core part of their system or in the ecosystem around it, for example observability. I'll take Elasticsearch as an example. If we have a great DevOps leader and she wants to invest time and install Elasticsearch, she can do that. Or she can go out and use some hosted Elasticsearch solution. She can do either one, but it's a classic buy versus build. I usually think that
Jonathan_Hall:
Mm-hmm.
Aviad_Mor:
wherever it's not part of your core system, use somebody else's solution. So you don't have to spend any time. Usually your development time, your DevOps time is the most crucial thing and has the highest effect on how quickly you can develop your own application. But I also think
Jonathan_Hall:
Mm-hmm.
Aviad_Mor:
that when we're looking at OpenTelemetry, it's not as well baked as Elasticsearch, right? It has a lot of different pieces, but it's not simply, OK, I'm just going to install it. It actually has different components you have to take care of: the tracer, the collector, and then a way to correlate everything, and of course to visualize it. And it's not something that, right now, you can just take and install. I think it is great in the way that it has a very clear protocol and a lot of open source libraries, and it enables you to bake it into your own library if you're doing your own open source. So your open source can go out with OpenTelemetry already inside it. And once you've added a tracer, you can send the data to different places. But even those tracers for different libraries, each library requires a different tracer, and they have different levels of maturity. So again, if you want to take it on yourself, you can do it, and I'm never the person to discourage you from doing that. But when you use a solution like Lumigo, what you're getting is not only a great way to consume it and visualize it and see all the issues and have the Slack integration and an easy way to have everybody on board. Also, for example, with the tracers, we have a very harsh testing environment. So any tracer we say we support, it means it went through our own integration tests. And we're adding a lot to that open source. By the way, we're pushing what we add upstream, so everybody can enjoy it. But
Jonathan_Hall:
Mm. Mm-hmm.
Aviad_Mor:
our own distro is always the newest and has the best data that you can find.
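For a sense of the wiring you take on if you self-host the pieces, here is a minimal OpenTelemetry SDK setup pointing at your own collector; the endpoint is an assumption, and storage, correlation, and visualization remain separate problems you'd still have to solve.

```python
# A minimal sketch of self-hosted OpenTelemetry tracing wiring: SDK, OTLP
# exporter, and a collector you run yourself. Assumes opentelemetry-sdk and
# opentelemetry-exporter-otlp-proto-grpc are installed; localhost:4317 is an
# assumed local collector endpoint.
from opentelemetry import trace
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor

provider = TracerProvider()
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="http://localhost:4317", insecure=True))
)
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("checkout-service")
with tracer.start_as_current_span("self-hosted-hello"):
    pass  # spans now flow to whatever collector and backend you operate yourself
```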
Jonathan_Hall:
Very nice. Is there anything else that you think I should have asked that I didn't?
Aviad_Mor:
No, so first of all I'll say that you had great questions and I think they touched a lot of important points and when I say important points sometimes that means painful points. But that's exactly why we're here, right? That's the challenging part to ease the pain and to give a very easy solution. Now I don't think there's any more questions. Maybe one more thing, which is something that I'm seeing more and more, which is the question is who's using these observability tools, which has a very strong correlation to who's using these cutting edge technologies, these modern cloud environments, because everybody is talking about them, but not everybody is using serverless and Kubernetes and... We're getting there, but
Jonathan_Hall:
Mm-hmm.
Aviad_Mor:
not everybody is using it, and that's fine. And I wanted to say one of my observations about it is we're seeing a lot of startups using it. For startups, it's almost easy because they have zero legacy and they're always keen to use the latest technology, and that's great. But it's something that we very clearly see in enterprises. So it's never the whole enterprise saying, okay, we're going to throw away all our legacy code and move to cloud native technologies. And they really shouldn't do that. That's okay, they're not doing it. If it works, don't change it.
Jonathan_Hall:
Yeah.
Aviad_Mor:
But we're definitely seeing, as time goes by, more and more enterprises where they say, OK, we have a new technology, or we have this team, usually one of the stronger teams in that company, and they're starting to use these cloud native technologies. And then, after it's hopefully successful, it goes from team to team to team. So a lot of times we see the use of our own observability tools both in startups, of course in medium companies, but also in a lot of enterprises, which you might think take time to adopt new technologies. We're actually seeing a lot of classic big enterprises having these teams, and then groups, and then bigger groups, using the new technologies, and of course using the whole ecosystem around them.
Jonathan_Hall:
That's great. Cool, well, I've learned a lot. I was hoping I would learn, since this is, as I mentioned at the beginning, this is a topic that I find interesting, but I had a lot of questions about. So thank you for educating me and our listeners. How can people find
you? You've mentioned the name, but let's just spell it, make sure it's crystal clear how to find the company.
Aviad_Mor:
Right, so that's Lumigo, L-U-M-I-G-O, and all you have to do is Google it. I'm a firm believer that I can talk all day long, but just look at it and try it out for yourself. Either connect it, which, as I said, takes a few minutes, and we have a very big free tier because we really want people to use it before they decide, or just go to the sandbox and play around with it, see it for yourself and see what it can do for you. And while you're there, you'll be able to see a lot of blogs we've written, not only about observability, but also about the cloud native technologies. So everybody can enjoy anything that we've learned ourselves.
Jonathan_Hall:
Very cool. And if people are interested in reaching out to you directly, are you on social media, or how about the contact information?
Aviad_Mor:
Right, so I'm always happy to talk to people, hear new ideas, answer any questions. You can check me out on Twitter, look for my handle, aviadmor, A-V-I-A-D-M-O-R, or you can find all my information again on the Lumigo website, and feel very free to reach out. And if you have any comment, any question, or just want to talk, I'm very happy to do that.
Jonathan_Hall:
Wonderful. Well, let's move on to picks. What we like to do here is let every host and guest pick one or two things or three or
four or whatever. Go crazy if you like. Do you have any picks for us this week?
Aviad_Mor:
Right, so, yes, this is actually something that has zero connection to technology or to my day-to-day work. It's just a book I read and I really
Jonathan_Hall:
Awesome.
Aviad_Mor:
loved. It's called Islands in the Stream by Hemingway and it's a very, very sad story. There's a terrific song, Islands in the Stream. which basically has nothing to do with the story itself. It's a great song, but it's not about the story itself. But I think it's beautiful how a story can be at the same time, very, very sad, but also beautiful and can stay with you a long time after you read it. So that's a book I read and I highly recommend it.
Jonathan_Hall:
Nice. Great. We'll check it out. My pick this week is also a book, although it's not a novel. I'm actually in the middle of reading it right now. It's, I think, the third book I've read by the author, Cal Newport, and the book is called A World Without Email: Reimagining Work in an Age of Communication Overload. And as I said, I haven't finished it, but the premise of the book is that email is disruptive. I think we've all heard this before. You know, your little inbox goes off every, I don't know, 17 minutes or whatever the average is, and it's very distracting. And then if you consider additionally things like Slack or other instant messaging tools, we kind of live in a workplace full of interruptions. So his premise is that this is probably not healthy. It's not the way our brains are designed to operate. So he tries to offer new ways of thinking about this. And I'm hoping he gives me some suggestions. He's promised he will. So, suggestions on how I can improve my effectiveness
by overcoming some of these disruptive patterns. So that's my pick, A World Without Email by Cal Newport.
Aviad_Mor:
Right. So I really relate to the subject. By the way, I also read a great book about a very close subject, by Nir Eyal, about basically how you're able to, how maybe you'll be able to disconnect and not be disrupted by all those pings, which is so, so hard. So, yeah, it's a very important subject. And yeah,
Jonathan_Hall:
Mm-hmm.
Aviad_Mor:
yeah.
Jonathan_Hall:
Yeah.
Aviad_Mor:
Jonathan, I wanted to say I had a terrific time. I had a lot of fun. Very interesting.
Jonathan_Hall:
Great. Glad to hear it. I'm glad you came on. It was a fun conversation.
Aviad_Mor:
Okay, goodbye. Cheers.
Jonathan_Hall:
I guess we'll say goodbye until next time. Cheers. Okay.