Will_Button:
What's going on everybody? Welcome to another episode of Adventures in DevOps. Joining me in the studio today, as usual, is Jonathan Hall.
Jonathan_Hall:
Hey guys! And girls! And anyone else!
Will_Button:
Eh.
Jonathan_Hall:
Dogs, puppies, kittens, whatever, whoever's listening, hi!
Will_Button:
All are welcome. All are welcome. I'm today's host, Will Button, and joining us also today is Ava Naeini. How are you?
Ava_Naeini:
In the name of God, the Compassionate, the Merciful. Hello everybody and nice to meet you all remotely.
Jonathan_Hall:
Amazing.
Will_Button:
It's nice to meet you. We're excited about this. So tell us a little bit about yourself.
Ava_Naeini:
I'm a founder. I've worked in technology for about 12 years. My career started as a quality engineer at Salesforce, and then I wanted to build things, so I became a developer. Over time, I realized that I wanted to do more, to exercise my human side, my people side. I didn't want to always be at the desk coding; the world wanted me for more. So I left Salesforce and started pursuing my own path. Down that path I started working as a consulting engineer, again to pursue working with people more, and the customer success division seemed like a more suitable fit for me. When I was working at Confluent, I was able to see market gaps and product gaps when it comes to monitoring and performance testing distributed systems, which is pretty challenging given how many metrics and configurations these systems have. That led to Pulse Operations, and Pulse, the first product, is a machine learning engine that predicts and detects anomalies. It's going to be paired with the UI and tooling required to monitor and get better insight into the performance of Confluent Platform.
Will_Button:
Wow. Let me see if I can rephrase that just to make sure I understand. You've built a monitoring and observability platform that uses machine learning for testing distributed systems.
Ava_Naeini:
It's a little bit of a strong word. It's a concept. I built a POC for the UI to be able to explain the concept, so it's like a clickable POC. For the patent, I have a provisional patent, and I'm submitting the utility patent at the end of this month. The ML part is really something that a lot of people are trying to figure out, as well as the fault tolerance of the tool. Because when you have a monitoring tool paired with this kind of infrastructure tool, it's at the core level. Confluent usually sits at the core of a distributed system, and the core is the riskiest, most important piece of software, because it's the backend that supports everything else. So you can imagine, when you're building a monitoring suite for that, you want it to be another layer of fault tolerance and accuracy. And despite the fact that there are many tools in the market that provide monitoring for Confluent Platform, during my consultancy I realized that what I really need is not being provided with the quality of UI/UX, especially navigation, because oftentimes DevOps ends up looking at 20 metrics in one single view. A human brain is just not equipped to process that many visual cues at the same time. So there needs to be prioritization. There needs to be beauty in the UI. That's really my feminine side: I think it's very important to have a flowy and beautiful UI/UX when people are spending time learning, as I said, the insight, learning how the distributed system is performing. It makes a human more motivated when something is easy to use. And I felt that the focus that's needed, anomaly detection, prioritizing metrics, configuration management, all at the same time, does not exist, plus the fault tolerance and ease of use. There's a lot more I could dive into, but I'd like to give you a chance to ask questions.
Jonathan_Hall:
Yeah, great. I want to start with, in your intro, you said that there are a lot of challenges with monitoring distributed systems. Now, I've worked with distributed systems, so I can guess what some of them are. But I'd love to hear what you think the main challenges are that your concept addresses.
Ava_Naeini:
Sure. The main challenge is the sheer number of configurations and metrics. In the past, when you wanted to set up a system, there were like five to ten things that you needed to know to set it up. But now it's becoming twenty to fifty before you can say, this is production ready, I have confidence in this. So oftentimes when you're starting a project, for example on Confluent, or other tools even, I'm not saying it's specific to Confluent, it could be a major issue across the field, you have a certain amount of time to configure it and get it to deployment, get it into different environments. And at that stage, there's no way you would have all the knowledge about the metrics and configurations. So you want some of those decisions to be predetermined for you, and you want a centralized place for knowledge. Now, I'm not saying the knowledge is not there. The knowledge is scattered. Gathering it together in a tool, so that it's condensed and centralized, with a lot of the decisions already pre-made for you, needs to happen. And that's something I felt is missing. I've seen architects, software engineers, DevOps, they kind of take random paths. And when
Jonathan_Hall:
Hehehehe
Ava_Naeini:
you take a random path, it's really hard to build confidence. And when you don't have confidence, you don't want to have production issues, but that, unfortunately, is what happens many, many times. And it creates, to be transparent, a lot of anger and resentment. The worst part is that the lack of a centralized library for understanding it prevents knowledge from developing. Before you try to launch or deploy, fear is the first thing created in people's minds, because they don't know, and there's no way for them to find out unless they have, like, a year beforehand, and usually people don't have that much extra studying time, you know? So I feel like knowledge is missing at times. Centralized, powerful knowledge that's categorized is missing. So there's a certain frame of mind that needs to be established.
Jonathan_Hall:
Nice. I can keep asking questions, but I want to give you a chance, Will.
Will_Button:
No, go for it. Go for it. I'm soaking all this in, thinking, what a...
Jonathan_Hall:
All right.
Will_Button:
It just really resonates with me, how much time I've spent on incidents and production outages, just looking for information. And I always have a lot of confidence. I mean, it's not based on anything. It's completely unfounded, but I do have confidence. Ha ha ha.
Jonathan_Hall:
Ha ha ha
Ava_Naeini:
Yes, but it takes a very long time to get to that. You know, it takes at least a year or two to develop that confidence. So in the initial phases of
Will_Button:
Yeah.
Ava_Naeini:
the project, let's say they want to roll out within six months, and I've onboarded teams before, they do their studies, but again, because the central tool does not exist and the decisions are not made for them, they often need to make too many decisions. And we don't want to reinvent the wheel. As engineers, we are all about optimization. So let's say you have 10 different groups that are trying to do the same things, and they all go through the same cycle for, like, six months. How can you shorten that cycle and provide that centralized knowledge so that six months is one month now, you know? That's optimizing human learning when it comes to making so many technical decisions, so confidence is built faster and incidents are prevented. And system status monitoring is another thing Pulse is equipped with, which goes back to what I mentioned about prioritization. We often have more than 20 charts in full view. For Confluent Platform at the core, I believe there are maybe up to 40 metrics, and there's way more than that. But let's say out of those 40, like 10 are important. So when you have an incident, how can your UI elastically show you what is causing it, or bring the correlated metrics that are abnormal to the front view, so you don't have to look for them? It does that for you ahead of time, basically following your human intuition.
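To make the prioritization idea concrete, here is a minimal sketch of one way "bring the abnormal metrics to the front" could work: score each metric's recent window against its own baseline with a z-score and sort. This is an illustration under my own assumptions (the metric names, window size, and scoring method are all hypothetical), not a description of how Pulse actually ranks metrics:

```python
import numpy as np

def rank_metrics(history: dict[str, np.ndarray], window: int = 60) -> list[tuple[str, float]]:
    """Score each metric by how abnormal its recent window looks
    (z-score of the window mean against the longer baseline) and
    return metrics sorted most-abnormal first."""
    scores = {}
    for name, series in history.items():
        baseline, recent = series[:-window], series[-window:]
        mu = baseline.mean()
        sigma = baseline.std() or 1e-9   # avoid division by zero on flat series
        scores[name] = abs((recent.mean() - mu) / sigma)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Hypothetical usage: surface the 10 most abnormal of 40 charts.
rng = np.random.default_rng(0)
history = {f"metric_{i}": rng.normal(size=1000) for i in range(40)}
history["metric_7"][-60:] += 5           # inject an anomaly into one metric
top10 = rank_metrics(history)[:10]       # these would be pulled to the front view
print(top10[0])                          # ('metric_7', ...)
```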
Jonathan_Hall:
So you said that this is at the proof of concept stage. Talk to us a little bit about where the project is and what your next steps are.
Ava_Naeini:
Yes, it's at the POC stage. The idea actually started in 2021. The learning model needs data to be able to test and experiment with it, and we haven't been able to get access to metrics data for Confluent Platform with incident reports. I didn't want to go down the path of simulation. My advisors did not suggest it, and I personally think simulation might not work, but ultimately it might be a path we have to take, to prove, basically, to show that the engine is accurate enough. I did machine learning research in my graduate studies for about four years, and I worked with industry for about three years, and I was very passionate about this field in my graduate studies. Then I realized some of the research is missing in the industry, and there is this gap between industry and academia. That's another reason I'm working on Pulse. So when it comes to getting it to the next stage, I want data to be able to show the accuracy that's required for industry. That's also a reason we're really not seeing many real machine learning tools, I would say, or expressions, really, in the field of computer science. We hear about it, but we haven't really been able to explore it. And Pulse is not just an idea; it's
Jonathan_Hall:
Yeah.
Ava_Naeini:
not just for distributed systems. You can think of it as a highly accurate machine learning engine; it can predict anything. It can be used in biology, healthcare, monitoring any signal. If you have a highly accurate engine and you want to monitor somebody's heart rate and have alerts, you can attach it, and it will do that for you. So it has to be highly pluggable. That's the ultimate goal.
Jonathan_Hall:
Now when it's ready, what technology stack will it plug into or will it be agnostic in that regard?
Ava_Naeini:
It will be agnostic. The idea is that there are metric servers, and these metric servers basically export the metrics for you. So it's really technology agnostic. Confluent Platform deals with a lot of libraries, but we really do not deal with the implementation of the platform. We just deal with data. So you can think of us as a data tool, in some ways.
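For readers who want a picture of what "metric servers export the metrics for you" means in practice, here is a rough sketch that scrapes a Prometheus-style plain-text /metrics endpoint, a common way Kafka and Confluent metrics are exposed via JMX exporters. The URL and the parsing simplifications are assumptions for illustration, not part of Pulse:

```python
import urllib.request

def scrape_metrics(url: str) -> dict[str, float]:
    """Fetch a Prometheus-style plain-text metrics endpoint and return
    {metric_name: value}, skipping comments; labels and optional
    timestamps are ignored for simplicity."""
    with urllib.request.urlopen(url) as resp:
        text = resp.read().decode("utf-8")
    samples = {}
    for line in text.splitlines():
        if not line or line.startswith("#"):
            continue                     # skip HELP/TYPE comment lines
        parts = line.split()
        if len(parts) < 2:
            continue
        try:
            samples[parts[0]] = float(parts[1])
        except ValueError:
            continue                     # skip malformed samples
    return samples

# Hypothetical usage against a broker's metrics exporter:
# metrics = scrape_metrics("http://kafka-broker:8080/metrics")
```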
Will_Button:
So you just accept any data feed and then work with that.
Ava_Naeini:
Yes, yes. But I wanted to train it with Confluent Platform because there's correlation between its metrics. Instead of a single signal, I think it's something more difficult and more challenging: not only predicting anomalies, but also learning correlations. So it's a bit of a higher goal, but I feel like that's what needs to happen. And if I want to get into that a little bit, I would say you can think of Pulse as a model for human cognition. It started when I went to meetings and was doing performance testing, and I was following my mental model to see: how am I looking at the signals, how is my mind searching for things, and how can that happen in the implementation, with memory, with trends? I'm sure a lot of people have tried to build models for predicting stock values, and I started doing trading, and that also helped me, because I started realizing how I use the trend and the previous windows, the previous trends, and that's the memory concept. A Markov chain has a memory model as well, so my plan is to use that a little bit too. It would use a hybrid combination of different methodologies, as well as rich feature extraction. But this gets more into the ML field.
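As a toy illustration of the Markov-chain "memory" idea she mentions (my own construction, not the patented method), a first-order chain over discretized metric values can score how surprising each new transition is:

```python
import numpy as np

def fit_transition_matrix(series: np.ndarray, n_states: int = 10):
    """Discretize a metric into quantile states and count state-to-state
    transitions; the normalized counts act as the chain's 'memory'."""
    edges = np.quantile(series, np.linspace(0, 1, n_states + 1)[1:-1])
    states = np.digitize(series, edges)
    counts = np.ones((n_states, n_states))          # Laplace smoothing
    for a, b in zip(states[:-1], states[1:]):
        counts[a, b] += 1
    return counts / counts.sum(axis=1, keepdims=True), edges

def surprise(value_prev: float, value_now: float, probs, edges) -> float:
    """Anomaly score: negative log-probability of the observed transition.
    Rare transitions (ones the chain has little memory of) score high."""
    a, b = np.digitize([value_prev, value_now], edges)
    return -np.log(probs[a, b])
```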
Jonathan_Hall:
Nice.
Will_Button:
So in your quest for data, is there some way that, if someone listening to the podcast has data, they can connect it, or you can consume it, or they can share it with you?
Ava_Naeini:
Yes, I would love that. There's a lot of open source data, but if anyone has access to Confluent data, metrics data, that would be fabulous for us. Or a data set that helps with the correlation, so it's not just one-dimensional. The multi-dimensionality
Jonathan_Hall:
Mm-hmm.
Ava_Naeini:
is what makes it a little bit more difficult. And ultimately, I thought maybe we could try building Pulse with audio data sets; there are open source data sets. So at the moment, I'm looking for someone who does machine learning engineering as well, because what I'm doing for the most part is more the data science, architecture, and design. I really became passionate about systems as part of writing the patents; I like that as well. But there are a lot of details in every single component that need to be thought through, and I feel like for that I definitely do need help. So if you are listening to me, and if you're passionate, interested, or have experienced similar issues and challenges and you've learned some
Jonathan_Hall:
Hehehe
Ava_Naeini:
lessons, I would love to hear them as well.
Jonathan_Hall:
Awesome. I think we need to start a new segment about job search and job fulfillment. Last week we had our guests looking for work and here we have people looking to fill roles. We just need to start this as a regular segment on the show.
Will_Button:
Hahaha!
Jonathan_Hall:
What else can we talk about? I feel like it's almost too big to bite off, right? Too much to talk about. Normally, when we have somebody on promoting a product, we talk about what it takes to get started, but your product doesn't quite exist yet, so it's a little bit hard to ask what the onboarding is like, what size team should join, and what maturity they need to be at. But it sounds like, at the minimum, you're trying to help brand new teams ramp up faster, right? So if you're in a startup, or you're going to be starting a startup, then this would be a product for you. What do you envision to be the technical knowledge required to use the product? In other words, do I need a really technical CTO, or a really technical administrator or operations person, to set this up? Or can I just be, maybe, a two-person shop with a full-stack developer? Is it something that they can really take advantage of too?
Ava_Naeini:
Sure, sure. The normal setup for a specific application is that you have an admin for the software, the tool you're using that's generating metrics. When it comes to Confluent Platform, there's usually a team of four or five engineers. They would be the ones in charge of deployment and Kafka operations, and they would be the ones using the tool. And
Jonathan_Hall:
Okay.
Ava_Naeini:
in terms of what you asked about, we do have a product. As I said, it's a clickable POC. So it is
Jonathan_Hall:
Mm-hmm.
Ava_Naeini:
more than a concept. That's why I'm hesitant to use the word "concept."
Jonathan_Hall:
Oh, nice.
Ava_Naeini:
Because of the research. I'm not in R&D, I'm more in research. But in terms of product maturity and visualizing features, I feel like I have a very good frame of mind, and
Jonathan_Hall:
Mm-hmm.
Ava_Naeini:
the part that was difficult, and is difficult, is conveying the idea, because sometimes people think we're trying to boil the ocean, and it can be seen like that, but once you learn the details of the components, it starts becoming more simplified, more than just a concept. So these DevOps people, these architects, they start with Confluent Platform, they've done their deployment, and the next stage is: okay, how can we look into this? Because, again, you're running this tool but you don't have insights, and you want to see things, you know? It's like your eyes. So they install Pulse and tie it together. There's a tool called Kafka Tool, which is basically a tool for working with Kafka, and Pulse follows a similar installation pattern: you associate the tool with the cluster, then you set up the alerts, then you run the configuration report, and then it starts running for you. So the magic starts working. And there's disaster recovery as well, which I didn't talk about. Because
Jonathan_Hall:
Okay.
Ava_Naeini:
you want Pulse to be fault tolerant, you want it to sort of self-heal, to monitor itself as well as the application it's running on. And if it fails, you want to restart another instance. We have a disaster recovery mechanism built into it as well, so it can fail over to the other instance if anything happens.
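The self-monitoring and failover behavior she describes reduces, in its simplest form, to a watchdog: a second process health-checks the monitor and launches a standby when it stops responding. A minimal sketch; the health endpoint and launch command are hypothetical placeholders, not Pulse's actual interface:

```python
import subprocess
import time
import urllib.request

HEALTH_URL = "http://localhost:9090/health"   # hypothetical health endpoint
START_CMD = ["pulse", "--standby"]            # hypothetical standby launch command

def healthy(url: str, timeout: float = 2.0) -> bool:
    """Return True if the monitored instance answers its health check."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        return False

# Simple watchdog loop: if the primary stops answering, start a standby.
while True:
    if not healthy(HEALTH_URL):
        print("primary unhealthy, starting standby instance")
        subprocess.Popen(START_CMD)
        break
    time.sleep(10)
```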
Jonathan_Hall:
Fascinating. So you said that you have a clickable, what did you call it?
Ava_Naeini:
A POC.
Jonathan_Hall:
Proof of concept, but you call it more than that. Yeah, is that something publicly available, or is it something you're only showing to certain people?
Ava_Naeini:
It's publicly available, but I feel like it needs to get to a better stage for it to become fully marketable.
Jonathan_Hall:
Alright. Fair enough. Fair enough.
Will_Button:
That's the stage of every early-stage startup though, right? I always wanted to do just a few more things.
Ava_Naeini:
Good. Yes, yes. I feel like we have imposter syndrome; we're going through the imposter syndrome.
Will_Button:
Yeah
Jonathan_Hall:
Ha ha ha ha.
Ava_Naeini:
The feeling that it's not perfect enough. And that's why this is my effort to break through that a little bit, to be able to get more momentum, you know. Because, as I said, I felt like initially, as software engineers, machine learning engineers, we're sort of loners. We go into the problem and we start doing things. But with time, I feel like, really, for this kind of effort that I initiated, I do need more help and support. It's physically beyond my abilities at times, and beyond my skills as well. There's the UI side: I work with my designer, and they were able to help me get it visually sorted out a little bit, but it needs more engineering work.
Jonathan_Hall:
Are you working on this mostly alone or do you have a team?
Ava_Naeini:
I do have a team in Ukraine that helps me with the design part. I have an advisor, and a designer who's pretty advanced. I've worked a little bit with my advisors from the past, but we don't have the full-on establishment the way I'd like it to be, you know. I don't have the pace that I need; the pace is just a little bit below what I would be happy with.
Jonathan_Hall:
Mm-hmm. Okay.
Ava_Naeini:
And COVID unfortunately became an issue, because I feel like people started closing themselves off when it comes to investment opportunities.
Jonathan_Hall:
Yeah.
Ava_Naeini:
The idea is also a little bit overwhelming. I've learned, through the mentorship programs I've had and through hiring for different types of roles, that sometimes people can get a little bit overwhelmed. So it took me a while to understand what kind of personalities and skills I need, so as not to be a very demanding boss.
Jonathan_Hall:
Hehehehe, yeah.
Will_Button:
Hahaha
Ava_Naeini:
I feel like I'm still going through that growing time a little bit.
Jonathan_Hall:
All right, good.
Will_Button:
Yeah, there's definitely a huge mental shift that you have to go through to switch from being like an employee to launching and running your own business.
Ava_Naeini:
And Confluent Platform, I know we talked about it a little bit at a high level, but for this tool I would prefer to work with people who have gone through journeys similar to what I went through with the platform, so that intuitively we're able to communicate, almost telepathically, on the issues without having to discuss them, and the reason behind the creation of this tool is obvious to the person working on it. Because many times people are like, oh, there are monitoring tools, there are anomaly detection tools. But because they haven't worked on this platform hands-on, it takes a while for them to absorb the reality, and it could really take more than a while. It took me about two years, I would say, the onboarding time for Confluent Platform, to say I really understand it. It can take many more years, but, you know, I would say it took about a year or two for me to feel comfortable.
Will_Button:
Right.
Jonathan_Hall:
What else can we talk about?
Will_Button:
So what's the next milestone for you in getting this ready to make it publicly available?
Ava_Naeini:
The machine learning engineering. There's developing the models, developing the
Will_Button:
Yeah.
Ava_Naeini:
models. There's a lot of research in the anomaly detection space that I'm trying to integrate and build on as part of it, plus I'm trying to add my own flavors to it. When it comes to machine learning, there are two pieces, and I hope your audience doesn't get bored with this, but any machine learning component has two big pieces. One is feature extraction; the other is the methods and models. How you do pruning in these two phases, how you parameterize your model, that's the secret sauce. And for that, you want data, and you want actual code in place to be able to choose heuristics. So, development
Jonathan_Hall:
Mm-hmm.
Ava_Naeini:
efforts, I would say, engineering. And I can always
Will_Button:
Right.
Ava_Naeini:
use pointers from a more experienced mind, somebody who has worked in this space, possibly auditing some of the things we're doing. I feel confident, but the more minds, the merrier. It'll be better for the product to have more ideas.
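Her two-piece framing, feature extraction plus methods/models, maps onto standard anomaly-detection pipelines. Here is a compact sketch using generic windowed features and an off-the-shelf IsolationForest; it illustrates the framing only, not the "secret sauce" pruning and parameterization she refers to:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

def extract_features(series: np.ndarray, window: int = 30) -> np.ndarray:
    """Piece one: feature extraction. Turn a raw metric stream into
    per-window features (mean, std, slope, max jump)."""
    rows = []
    for i in range(window, len(series)):
        w = series[i - window:i]
        slope = np.polyfit(np.arange(window), w, 1)[0]
        rows.append([w.mean(), w.std(), slope, np.abs(np.diff(w)).max()])
    return np.array(rows)

# Piece two: the method/model, here a generic off-the-shelf detector.
rng = np.random.default_rng(1)
series = np.sin(np.linspace(0, 50, 2000)) + rng.normal(0, 0.1, 2000)
X = extract_features(series)
model = IsolationForest(contamination=0.01, random_state=0).fit(X)
scores = model.decision_function(X)       # lower score = more anomalous
print("most anomalous window ends at t =", int(np.argmin(scores)) + 30)
```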
Jonathan_Hall:
Yeah, cool. Well, hopefully somebody listening hears you talking and they're like, oh, I'd love to do that. And they'll get in touch. That would be great. Ha ha
Ava_Naeini:
I hope so.
Jonathan_Hall:
ha.
Will_Button:
I think it's a super cool concept, the whole idea of using machine learning to surface the metrics and the data that's relevant to the task you're working on. That's just such a cool idea. I don't think I've ever spoken with anyone who's approaching it from this point of view. But after you say it, it's like, oh, well, yeah,
Jonathan_Hall:
Hehehehe
Will_Button:
that just kind of makes sense.
Ava_Naeini:
Yes, if you had seen a page with, like, 25 charts, you would have been like, oops, and your brain would have frozen. I think you would have gotten to the same place. And yeah, that's, I think, the reason. My mind... I forgot, sorry. My mind just left me.
Jonathan_Hall:
I was going to ask, are there other products that do something similar, that you're aware of, that we could compare this to?
Ava_Naeini:
Datadog has anomaly detection. I haven't used it in practice, and when I worked in the field, I didn't hear of people using it in practice. And as I said, Datadog is a monitoring tool, and there are multiple problems that need to be solved. A monitoring tool versus a monitoring tool for a specific application are two completely different things when that application has a lot of knobs. You know how there are different service centers that specialize in different cars? We are that specialized service center. I've never used this analogy before, it just came to me. Different makes need different specialists, because they understand different things, and that understanding is in that person's mind. So we can't just say someone understands mechanics in general; that's not going to be enough, because the prioritization and the metrics are not present in the data. And alerting, I'm not sure if I talked about it briefly, we also do alerting, and possibly self-healing, but I haven't explored that path very deeply. When I filed the provisional patent, the idea was basically that when issues happen, engineers would have set up the tool, and the tool is able to auto-correct before it gets to the SREs and escalation cases are filed. That said, we would still have notifications for the changes that are made, but we are preventative and proactive rather than reactive. That is also something that I think helps a lot, because you feel confident about the tool, and you have this tool that supports you. The existing tools do not provide that kind of intelligence in their approach, or that ease of use, you know. The human element, I feel, is missing a little bit in the design of these tools, because sometimes, as engineers, we think very analytically, and our minds come up with ideas that may not be very practical. So I felt like I had to bring that. I felt like that was missing, that it was something that needed to be created; whether it had to be me or somebody else, it was really missing for me. And performance testing, I want to touch on this one, because I'm very
Jonathan_Hall:
Yeah, please.
Ava_Naeini:
passionate about this one. You know, when we talk about confidence, confidence comes from having information, in our line of work. And to get that information, let's say you are at the pre-production stage or at the QA stage, and you want to take your tool and say, this is production ready, now we can go GA. How can you do that? You have to do some simulations with the load. And I felt that the tools people use, or their approach to testing, is preliminary. It's not advanced enough. In my mind, I wanted more complicated load testing. I wanted elaborate ways to test performance and to develop the confidence to say, that's production ready. And I realized people sometimes skip it, and as a result they aren't able to advance, or, as I said, the confidence is not at the level it needs to be. They kind of take a chance, and sometimes they succeed, sometimes they fail. I didn't like that. I couldn't say, okay, I ran these kinds of experiments with production-like loads, and this is what I'm seeing, you know? That has to be there, I feel, especially for the leadership.
Jonathan_Hall:
Yeah.
Ava_Naeini:
And I always felt there is a gap between leadership and engineers. And I like to say this because I really care about DevOps. I felt they're overworked many times. I felt they're abused many times. I felt they're sleep-deprived, and their success is delayed because they are literally overworked. You cannot put more hours into something and necessarily become more productive. And then on the other side, maybe the leaders were pushy; they had strict deadlines that they'd promised to their bosses. And I felt like these forces are not fair, and justice is missing at times. So how can we be more open about what we need as engineers, so we don't have that communication gap, or secrecy, or pretentiousness, which at times I did not like? You know, fake promises that I felt would sometimes be made. I feel like that's not going to be a path to success. Honesty and truth need to be there when it comes to shipping software that's going to production at enterprise scale, or government, or finance, or dealing with serious things. And many times when we would do consultancy, we would be bombarded with a lot of issues that wouldn't even have been created had something like Pulse existed.
Will_Button:
I think you touched on one of the things that, I'm definitely guilty of it, not being able to reproduce accurate production data. I'll look at production, and you've got all these random humans out there using your system, and I've taken the approach of, well, there's no way we can guess what behaviors and actions they're going to take, and in what quantities, so we're just going to go to production and deal with it.
Ava_Naeini:
Yes, yes. And I feel like, if there needs to be pivoting along my path, because, as I said, it's a learning experience, I will take that. At times I even questioned the ideas, like, are people going to like this? But the truth is that a lot of these missing factors that I experienced led to me feeling that something new needed to be born. As I said, I have introduction videos where I look very tired, and I said the tired look on my face is proof that something new needs to be born. And it wasn't just
Jonathan_Hall:
Hehehehehehe
Ava_Naeini:
that. Yes, I stayed silent for too long, and, as I said, I overworked myself. It's a habit; I overworked myself because I figured I have to be able to solve this, if I spend more time on this, if I read more about this. But then I realized there was a bigger effort that hadn't taken place. A platform has been built with hundreds of settings. I said 50, but if you consider it as a whole, Confluent Platform has a core and it has components; the core has a lot of metrics, and the components also have metrics and configurations. So you say, I have 100 things that I need to set up and 100 things that I need to monitor. How am I going to do that? It's impossible, you know? So it's about making the impossible possible in new ways, using higher intelligence.
Jonathan_Hall:
I can see this being so useful. At many of the places I have worked, or many of the companies I work with, I specialize in working with small companies and small teams, helping them do better software delivery, and usually monitoring is one of the last things they think about. They're busy fixing the bugs that customers are complaining about. They don't have time to go looking for their own bugs.
Ava_Naeini:
Yes, yes, yes.
Jonathan_Hall:
But with a tool like this, I think it could really help, especially if it's... I mean, yeah, there's Datadog and there's all the other monitoring tools. And everybody knows that they need to do that. But
Ava_Naeini:
Yes.
Jonathan_Hall:
it's not the low-hanging fruit. But if you could just throw one tool at it that would sort of self-organize and start giving you useful information, that would be amazing.
Ava_Naeini:
And I want to mention this, actually, thank you for bringing that up. Accuracy is something I mention many times in my writing when I discuss Pulse, because, let's say, even Datadog, I don't have any knowledge of how accurate their anomaly detection tool is. And I'm not saying it's
Jonathan_Hall:
Mm-hmm.
Ava_Naeini:
good or bad, but that needs to be transparently exposed. If we had this engine that detects signal anomalies very accurately, we could use it for many, many more applications, not just monitoring distributed systems. As you said, any application that is generating data, where you want to monitor that data. But how come we don't have that tool yet? And I don't think that tool needs to be very fancy to set up. I like to mention this because I feel the reason people are hesitant to do monitoring is that they're learning tool A, and in order to make sure tool A is good, they need to use tool B. And tool A and tool B are at a similar level of complexity when it comes to learning them. So when people need tool A, they forget about tool B; they deprioritize it. But there need to be tool Bs that are very easy, or, as I said, the machine learning is there, and we develop custom tools based on it for the specific A, B, C, D tools that exist, so people can use them easily. They don't have to learn a lot, because, again, there's a limit to human learning and time and the physical abilities of a person at the end of the day. And you can't just scale by hiring more people, because the more people you hire, the communication between them becomes another task; I'm not sure if you guys have experienced that. Another thing I was able to observe is that we have different units. When you talk about distributed systems, it gets a little more complex: you have teams that develop services, teams that manage and monitor Kafka, teams that want to enroll their services into Kafka, and they all have to be able to coordinate if they are using, for example, the same or similar clusters. And this human communication also sometimes was not there. So the hope is that, with Pulse, instead of not having those communications and making mistakes, or never scaling with the tool, you can say: this is our tool, this is our capacity, you can see where we are; this is what we can offer you if you want to use Kafka; this is the leftover capacity that we have, based on what we know. These conversations are extremely important, and I was able to be part of them at Salesforce. At Salesforce, we would develop custom tools, and before every feature we would do performance testing. We had different phases: planning, then development, QA, and performance testing, which, I believe, came after the QA of the functional features. And we would spend at least a month at the end of every release cycle doing performance testing. I loved that, because that's the confidence: we had an articulate performance plan, articulate scenarios for what we were doing. And when it came to Confluent Platform, I realized customers do not do that. And I was like, how can they say this tool is... how can they feel good about it, not knowing what's happening?
Will_Button:
Yeah.
Ava_Naeini:
And then when it fails, they get angry. But you didn't do the right thing. So, you
Will_Button:
Yeah.
Ava_Naeini:
know, that's my story.
Will_Button:
That hit close to home.
Jonathan_Hall:
Hehehe
Ava_Naeini:
Yes, yes, we are best friends! We are all in the same place! It's like we are pretending the pink elephant is not in the room.
Will_Button:
Right?
Jonathan_Hall:
Ha ha.
Ava_Naeini:
But it is possible, you know, with enough patience, I think, with enough research. And just to say a little bit about my experience with research: unfortunately, research is many times driven by grants. I don't know if your audience is on Clubhouse, but I heard a researcher there discussing how people who do scientific research, and Pulse requires scientific research, need to feel creatively free, so they can flow freely and explore their ideas, you know. That is also something that's important, and sometimes I feel like it's not favored in the enterprise space, or in industry, as much. But as we go through technology evolution, as we go through scalability and the exponential complexity of various platforms, we have to start incorporating solid research units, with proper time delegated to causes that are research-driven, to be able to develop these tools. Ten or twelve years ago, when I joined as a developer, I even worked with the discovery team at Salesforce, which was more like a machine learning team at the time, sort of similar to a research organization, but my efforts were mostly development. We had data science research, but I felt like it didn't meet my standards of research: it was either too abstract or non-existent at times. So, irrelevant or too abstract. How can we find more of a middle ground? And that is a global change, I feel, when it comes to technology companies having a budget for research inside their companies.
Will_Button:
I think one of the other unique aspects of this is that it comes at it from a different angle. We've talked about how everyone has to have monitoring. You say, okay, I plugged into a monitoring service, and here's my dashboard with some XY charts and some squiggly lines on it, and I'm like, great, I did monitoring. And then you kind of just forget about it. I think the approach you're taking here is cool because it's looking at it from not just "do you have monitoring," but "is your monitoring actually providing meaningful information to you." And if you've been doing this for a long time, like we have, you fall into that rut of, oh, I have to do monitoring, let me set up monitoring the same way I have for the last 30 years. So you need that fresh outside perspective, someone to say, wait, why are we doing it this way? We don't have to anymore.
Ava_Naeini:
Yes. That's exactly what's happening; people are outside their loop. And a shift, I would say, we're expecting a shift. I feel like maybe that's what's happening, and shifts sometimes take time; they're sometimes backed by a lot of frustration as a push force. So I feel like we're going through that, and I feel like, organically, whether it's me or somebody else, we want that, especially for Confluent Platform. Even in fields like genetic research, we're dealing with larger amounts of data, and we have a lot of data that nobody is physically able to look at. How can we have better data tools, more centralized, more cost-driven tools? I started familiarizing myself with Databricks, Snowflake, different data tools, just to see how they move the data. Confluent calls itself "data in motion." And I would say it's a little bit different when it comes to Confluent Platform, because Confluent is not only a cloud tool; it also has on-prem instances. A lot of industries like finance, government, and healthcare that are data-sensitive use their own private cloud, and even if you develop a very advanced monitoring tool in the cloud, they would not be able to use it. So we have challenges when it comes to data security there as well. Hosting it so that it can perform the learning on the customer's machines, all of these things have to be thought about.
Will_Button:
Right on. What do you think, John? Any more questions?
Jonathan_Hall:
I just want to know how I can follow this project as it's developing and stay up to speed.
Ava_Naeini:
Sure. You can find our page on LinkedIn. You can find me on LinkedIn, Ava Naeini, and there's a page for the product as well, at www.pulseops.ai. We have blogs there; feel free to interact with our blog. You might have to create an account. Feel free to message us, and I will loop you in on the efforts that have been going on so far.
Jonathan_Hall:
Awesome.
Will_Button:
Right on. Well, should we move on to some picks?
Jonathan_Hall:
Sure, let's do it.
Will_Button:
Alright, do you got any ready for us?
Jonathan_Hall:
I have a pick.
Will_Button:
All right, let's have it.
Jonathan_Hall:
My pick is a tool called LazyGit. Because I'm lazy, and I use Git.
Will_Button:
I'm sold already.
Jonathan_Hall:
It's a little command-line... I shouldn't call it command line. It's an ncurses-based, console-based Git interface. I think it's written in Go, which makes it even better, because everything that's written in Go is cool, right? No,
Will_Button:
Right?
Jonathan_Hall:
that's totally not true by the way.
Will_Button:
Hahaha.
Jonathan_Hall:
I've seen some really bad Go projects. But it is written in Go, and basically it's on GitHub: github.com slash jesseduffield, J-E-S-S-E-D-U-F-F-I-E-L-D, slash lazygit. Just Google for LazyGit, that's the easy way to find it; I think it's the first hit most of the time. You just type lazygit on the command line, and it pops up this little multi-window thing that you can quickly tab through. I honestly don't know most of what it does. The one thing I really love about it is that it makes it super easy to do an interactive rebase. I type lazygit, tab twice to the list of commits, and then I can just move my commits up and down in the list, and in real time it tries to do the rebase and tells me if there's a conflict or not. So gone are the days of typing rebase minus i, then moving things around in my editor, then running through and discovering that there are a bunch of conflicts and I did something wrong and have to start over again; that never happens anymore. So that's the one feature
Ava_Naeini:
Dangerous.
Jonathan_Hall:
I use all the time. I'm sure there are a million other great features; that's the one I use, and it makes me happy, because I'm lazy and I use Git.
Ava_Naeini:
It makes me anxious, changing the order of it.
Will_Button:
Hehehehe. For sure. Alright, Ava, do you have any picks for us?
Ava_Naeini:
I like Lucidchart; I think it's a cloud tool. It was a tool that I wasn't using much; I had some fear of design, I would say, in the past, of architecting the complexity. Then, when I started using it a little bit more, I became comfortable with it, and it allowed me to really express my ideas visually. And the beauty of expressing your ideas visually as you're working, especially at the architectural level, is that as you're designing, you're getting ideas, because you're getting everything out of your mind and in front of your face. It's really easy to use, it's very, very nice. I haven't really been able to work with a tool that's so nice, I would say, because I'm passionate about the beauty side of software as well; I always tell my designer, please prioritize beauty and colors. So that's one tool, Lucidchart; I like their design. And Canva is another tool that I really like. I wrote my first children's book with it; it's called New Age Language Definitions. If you're interested, it hasn't been published yet, but I'm very close. So Canva as well is very nice, with the templates and colors. I feel like these tools let you get to the self-expression side of yourself, whether analytically or emotionally, very quickly, and they're very easy to use. They helped me a lot with expressing myself.
Will_Button:
Right on. Yeah, one of the cool things I like about Lucidchart and other tools like it is you throw that information out there, and, for me, I quickly come to the reason why something's not going to work. But if you've laid it out graphically, you can just drag and move things around, and it allows you to get your thoughts out there and then organize them quickly in a way that you can't really do with other tools, or with paper, which is usually my go-to.
Ava_Naeini:
Yes, yes, and I was used to writing my ideas down, like technical specs.
Will_Button:
Yeah.
Ava_Naeini:
And I like doing that as well. Again, back to the Salesforce time: I developed all of these skills very articulately because I was there for seven years, and I inherently take them with me to different places. The smaller companies, unfortunately, don't do as good a job with written communication, which I really miss sometimes about Salesforce. And when it came to Lucid, I felt, as you said, that when things are a little bit more complicated, there's a gap, and with visual presentation, collaboration can be taken to the next level; collaboration at the architecture level becomes easier. I actually haven't been able to experience that much in my field, but I feel like architects need to start using it more. Or people like myself, who were scared of it before, need to start using it more, you know. It's something that can help any engineer get to the next level. If you just start playing with your random ideas in Lucidchart, you can come up with the next best thing.
Will_Button:
Yeah, right?
Ava_Naeini:
it is, it is, yes.
Will_Button:
Cool. So my pick for the week: as some of the people listening to the podcast might know, I do videos over on my YouTube channel, DevOps for Developers, and in the past I would just kind of have an idea, hit record on the camera, and then roll with it, which leads to me wandering off topic every once in a while. So I just bought the Padcaster, and it's so cool. It mounts on the front lens of my camera, and there's an app on my phone where I script out the video, and then I put the phone in this little thing in front of my camera, and it scrolls all the text up in front of the camera lens. So I'm looking at the camera lens, so you think I'm looking at you whenever I'm recording the video, but I'm actually reading the text on the screen. And that's been pretty cool for helping me stay on topic whenever I'm trying to convey a point, much like I'm doing very poorly right now. So let me cut to the chase: Padcaster, for teleprompting. That's my pick.
Jonathan_Hall:
Nice. I've been thinking about teleprompting for my own videos, so I'll check this one out.
Will_Button:
It's been pretty cool. It's been super cool.
Jonathan_Hall:
Thanks.
Will_Button:
All right, well, I believe that's a wrap. Thank you so much for joining us. This has been a cool conversation. I'm going to go check out your website. And I'm excited to see progress and see where this takes you.
Jonathan_Hall:
Me too.
Ava_Naeini:
Thank you so much for having me.
Jonathan_Hall:
Thanks for coming. Hope to see you soon.
Will_Button:
All right, we'll see you guys next time.
Jonathan_Hall:
All right, bye.