A History Of ML And How Low Code Tooling Accelerates Solution Development - ML 099
In this episode, Ben talks with Rosaria Silipo, a Software Engineer and Developer Relations advocate at KNIME. They discuss the benefits of low-code ML, delve into the history of ML development work as it has changed over the past few decades, and share a few stories about the importance of pursuing simplicity in implementations.
Hosted by:
Ben Wilson
Special Guests:
Rosaria Silipo
Show Notes
On YouTube
Sponsors
- Chuck's Resume Template
- Developer Book Club starting with Clean Architecture by Robert C. Martin
- Become a Top 1% Dev with a Top End Devs Membership
Transcript
Ben_Wilson:
Hey everybody, welcome back to Adventures in Machine Learning. I'm your host today, Ben Wilson. And today we're going to be talking with Rosaria, who is an employee at KNIME and has a very fascinating history. We were just talking, before we started the podcast recording, about some history of ML and data science work and algorithms, and we're going to dive a little bit into that today. We're also going to talk about tooling. It's a common theme in our recent podcasts: what are the development challenges of doing different levels of solution engineering for problems? We figured it would be really exciting to talk to somebody who's working with, and working on, a tool that enables that in a slightly different paradigm than some of the stuff we were discussing in previous episodes. Before we get started, Rosaria, could you introduce yourself please?
Rosaria_Silipo:
Thank you, welcome everybody. So I'm Rosaria Silipo. I work as the VP of Data Science Evangelism at KNIME. Yeah, so I've been working in the data science space for, I don't know, I can't even remember; I started in the early 90s in the data analysis space.
Ben_Wilson:
So that was back before people were coining the term data scientist.
Rosaria_Silipo:
Yeah, if you want, I can give you a history of how it was called over the years, because I did the whole loop, right?
Ben_Wilson:
Mm-hmm.
Rosaria_Silipo:
I always tell that to the young hires, that I did the whole loop. I started at the beginning of the 90s, when it was artificial intelligence. Then it changed to data mining. Then it changed to big data, because, you remember, at the beginning of the 2000s there was the data mining again, but considered on a large scale of data. Then it became data science, and now we have come back around to artificial intelligence.
Ben_Wilson:
Yeah, some of the fun, funky job titles I've had in the past are stuff like Advanced Analytics Engineer, Data Modeler, and Analyst. Just plain Analyst, Data Scientist, and Machine Learning Engineer. Really strange how every one of those titles, except for maybe Machine Learning Engineer, all those other titles are effectively the same human, same skill set. We just like to apply labels to things. So back,
Rosaria_Silipo:
Yeah.
Ben_Wilson:
back when you got started in, in the 90s, when they were calling it artificial intelligence back then, um, what was it like to develop a solution?
Rosaria_Silipo:
So I did my thesis at that time on neural networks. So we had to program it. I mean, maybe not alone, there was a whole team, a whole group at the university, but it was a lot of self-made code. There were no libraries that you could download and reuse or things like that. So it was a lot of C, C++, then writing the code, or adding the part of the code, that would push the classification accuracy one step further, one percent at a time.
Ben_Wilson:
Mm-hmm.
Rosaria_Silipo:
Yeah, so that's how we used to work. So when you got a consulting job, it was always programming at the end, as I said, C or C++. And moving the data was also just programming. You would move the data from one side and write it into a data warehouse, for example, and then the data would be processed in the program, and the data warehouse would be a database.
Ben_Wilson:
So when you're talking about data movement, and to any listeners out there who have never had to move data prior to modern tooling in the clouds, or anything like a software SaaS vendor who makes all of this somewhat simple these days: it's nontrivial if you're interfacing with the outputs of a mainframe, and you have this proprietary data format. You have to write serializers and deserializers, and you have to write, basically, data streams that move bulk data from one place to another. And what was the development lifecycle like for that? When you're like, okay, we have this new data set that we need to provide insights on, and the data is sitting in the mainframe, how long would it take to get something into a state where you could even start doing feature engineering work?
Rosaria_Silipo:
Yeah, that was a long time. So you would work in a lab, and if something was not already available because somebody else had developed it, then it would be a classic software development cycle: the development, the testing, the productionizing, the whole thing. It would take a few weeks, depending on how complicated the task was. But moving data around was already a few weeks of work by itself. Absolutely, also because you had to do some aggregation. I mean, I was
Ben_Wilson:
Mm-hmm.
Rosaria_Silipo:
talking about data warehousing. So when you build a data warehouse, you need to define some KPI or some kind of aggregation that reduces the dimensionality of the data, from whatever it is that you collected raw to some more informative numbers. So to reach these more informative numbers, you had to perform some kind of operations, and depending on how complicated these operations were, it would take a few weeks. And that's another thing I've noticed, since we're going through history. It used to be the data warehouse. Then, during the big data era, everybody was lashing out at data warehousing, saying that with big data you just need the data lake: you throw everything in there and you don't think about it anymore, and when you need it, you pick it up, you extract the data, and you do whatever is supposed to be done with it. But now I see that they are going back to the concept of the data warehouse. So I see the point of just putting data somewhere because maybe you will need it someday, but an organized repository of data makes all of the later analysis much easier.
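For readers who want the warehouse aggregation step Rosaria describes made concrete: a minimal sketch in pandas of rolling raw records up into a few informative KPIs before loading. The table, column names, and KPIs here are all hypothetical, not anything from the episode.

```python
import pandas as pd

# Hypothetical raw transaction-level data, as it might land from an operational system.
raw = pd.DataFrame({
    "store": ["A", "A", "B", "B", "B"],
    "day": pd.to_datetime(["2023-01-01"] * 3 + ["2023-01-02"] * 2),
    "amount": [12.5, 7.0, 3.2, 8.8, 4.4],
})

# Reduce dimensionality: collapse raw rows into per-store, per-day KPIs
# (revenue, transaction count, average basket) before loading the warehouse.
kpis = (
    raw.groupby(["store", "day"])
       .agg(revenue=("amount", "sum"),
            transactions=("amount", "count"),
            avg_basket=("amount", "mean"))
       .reset_index()
)
print(kpis)
```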
Ben_Wilson:
Yeah, it's interesting that you bring that up. It's something that I've noticed as well, because I'm old enough to have done stuff prior to Hadoop and all of that sort of thing. Where, if you're going to talk about creating a feature set, yeah, that's a large investment of time and resources where you're forced to apply engineering principles and, you know, agile principles, generally, from that time period when I was doing it. Where you would say, right, what is our MVP here? What are the things that we absolutely have to have in this data set? And then what are the nice-to-haves, so if we have extra time, we'll add them. But you would write pipelines and develop, you know, data transaction code to get that clean data into a data warehouse. And the reason that you would have to do that is just because of that time commitment, and how much money is on the line for the value of that. And then with the advent of, like you said, the data lake, it was like requirements got relaxed. People wanted to move faster because it seemed easier. And nowadays, when people are trying to get production machine learning solutions out there in the data science world, they're saying, oh, we need to think about stuff like testing and data validation checks and monitoring. And it's like, yeah, that's what everybody was doing 20 years ago, because you had to, because there was so much money on the line. Yeah, it's, it's a dream.
Rosaria_Silipo:
Yeah, it was a problem of time, of course, because you need it again and you can reuse it. It was also a problem of money, because you didn't have that much space on the hard disk. So
Ben_Wilson:
Yeah.
Rosaria_Silipo:
you had to, it had to fit. So you could have, for example, the tapes, and you would store the original data on the tapes, just in case somebody wanted to see it. But on the hard disk, you would only keep what was absolutely necessary. And of course, reducing the size to something that fit was another necessity. Anyway, I feel old.
Ben_Wilson:
I mean, even if you're talking about throwing even more money at server farms back in the day, uh, I've worked at a couple of companies where you'd walk into the server room, if you had access, of course, or you'd just see it through darkened glass. You kind of look in at all the blinking lights and, wow, this is amazing. And then you look and you're like, wait a minute, those are SANs. How many tape racks are in there, or how many hard disks are in each of those SANs? And you start counting and you're like, at $400 a pop, even if they're using mid-tier hardware for these disks, and they're changing out 70 of these a day just because of drive failures. You start thinking about how much money that is. It's amazing just
Rosaria_Silipo:
Yeah, absolutely.
Ben_Wilson:
how that's, that's changed nowadays with the cloud, and the fact that you can now have a laptop sitting on your desk with more storage space than an entire, you know, server SAN had 25 years ago. It's mind-blowing.
Rosaria_Silipo:
Yeah, and then talking again about the time, I remember we were doing some predictions on heart attacks, something like that. So we had the ECG signal, and we were supposed to extract the measures from the ECG signal. So there was a lot of feature preparation at the end to be able to feed a neural network. Most of the work was in the preparation, in the feature space preparation, so in the data preparation to feed the neural network. You had to extract this wave, you had to do this, you had to do that. It was long, and then you had to make sure that it was correct, because if you give it wrong measures, then the neural network gives you wrong predictions. So yeah, that was definitely...
Ben_Wilson:
So to extract meaningful data from something of that data volume, were you just doing Fourier transformations to get change points?
Rosaria_Silipo:
Yeah, so we were, now it was many years ago, but we were taking the derivative and the second derivative, and then we would find the change point, for example where the QRS complex starts, and then we would do all the derivatives to find the changing slope. And in some other analyses we would measure the R-R distance, and then we would perform a Fourier transformation or a wavelet transformation, and then we would get the spectrum to analyze, I don't know, differences, now I don't remember exactly, but differences in people who would die earlier or would have a heart attack or something, to see if there was any predictive feature for that. But it was not automated.
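A minimal sketch of the kind of pipeline Rosaria describes, detecting peaks, taking the intervals between them, and looking at the spectrum, using NumPy and SciPy on a synthetic stand-in signal. The sampling rate, thresholds, and the toy waveform are all assumptions for illustration, not the original work.

```python
import numpy as np
from scipy.signal import find_peaks

fs = 250  # assumed sampling rate in Hz
t = np.arange(0, 10, 1 / fs)
# Synthetic stand-in for an ECG: sharp periodic spikes plus a little noise.
ecg = np.sin(2 * np.pi * 1.2 * t) ** 63 + 0.05 * np.random.randn(t.size)

# First derivative highlights steep slopes, the kind of change point she mentions.
slope = np.gradient(ecg, 1 / fs)

# Detect the main peaks with height/spacing constraints (thresholds are assumptions).
peaks, _ = find_peaks(ecg, height=0.5, distance=int(0.4 * fs))

# Peak-to-peak intervals in seconds: the R-R style feature fed downstream.
rr = np.diff(peaks) / fs

# Spectrum of the interval series, as in the Fourier analysis she describes.
spectrum = np.abs(np.fft.rfft(rr - rr.mean()))
print(rr[:5], spectrum[:5])
```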
Ben_Wilson:
Right. That was the point of that question. I remember, many years ago, I was working on a problem in a factory where we had millisecond-level data, and we wanted to check the performance of, basically, irregularities in the signal over a certain moving threshold. So it wasn't a stationary series of data. There was some sort of trend associated with the recipe that was being used, and that wasn't captured directly as a parameter anywhere, unfortunately, in the raw data output. So I was tasked with, like, hey Ben, figure out where each of the peaks are. And I was like, but I don't have the slope of the data. I can infer the slope, so I wrote very simple, basic algebraic equations to determine what that slope was. I was like, all right, I got that. Now I need to find a magnitude from that slope and I'll just cut the data and find all the peaks. And, as we were saying before we started recording, those statisticians, uh, all those people, you know, the 65-and-older crew that was working at that company, all of the geniuses who knew so much; one of them happened to be walking by my desk, saw what I was doing, and was like, what are you working on, son? I was trying to explain, I was like, I'm trying to get all of these inflection points. And he's like, have you heard of the fast Fourier transform? You could take that whole thing; he's like, just use this algorithm. And I was like, well, I don't know where to get that equation. He's like, give me a half an hour, I'll send it to you. And a half an hour later, he has it all written out in Python, because I was doing all of this in Python, Python 2.3 or something way back in the day. And he wasn't a native Python developer or anything. He was just a savant, and he had that algorithm, and a clever way of implementing it, seared into his brain. And I asked him later, I was like, well, how did you do that so fast? He's like, well, I can do it for you in Fortran, I can do it for you in COBOL, and I can also do it for you in C-based languages. He's like, Python's easy, I just figured this will work. He's like, I just used this, like, real simple SciPy library, and I ran it. And it was so insanely fast and performant, and worked exactly for what I was trying to do, that that gentleman saved me probably eight weeks of work due to his knowledge. But nowadays you can get that package.
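For the curious, the FFT trick from the story is a few lines with today's libraries. A minimal sketch, assuming a synthetic drifting signal with a hidden periodic component; detrending first matters because, as Ben notes, the series was non-stationary. The 17 Hz component and all the numbers are invented for illustration.

```python
import numpy as np
from scipy.signal import detrend

fs = 1000.0  # millisecond-level data, as in the story
t = np.arange(0, 5, 1 / fs)
# Synthetic stand-in: a recipe-driven drift plus a periodic irregularity.
signal = 0.3 * t + 0.8 * np.sin(2 * np.pi * 17 * t) + 0.1 * np.random.randn(t.size)

# Remove the trend so the spectrum reflects the oscillation, not the drift.
stationary = detrend(signal)

# FFT and the matching frequency axis.
spectrum = np.abs(np.fft.rfft(stationary))
freqs = np.fft.rfftfreq(stationary.size, d=1 / fs)

print(f"dominant frequency: {freqs[spectrum.argmax()]:.1f} Hz")  # ~17 Hz here
```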
Rosaria_Silipo:
Now you have libraries for everything.
Ben_Wilson:
Yeah, you can go and download a fast Fourier transformation on PyPI.
Rosaria_Silipo:
Absolutely.
Ben_Wilson:
So with the ability nowadays of people who are doing applications modeling work or data science work, do you think something's lost in the world of that practice today, with people not being able to, or not having had to, go through that early on in their careers?
Rosaria_Silipo:
That's an interesting question. So I put myself in there too. Now it has become so easy that sometimes you just apply things and you are not completely sure of what you are applying. So when I train people, you know, in machine learning, or also in signal processing, even though signal processing is not so popular anymore, I often ask them to calculate some things by hand, to really write the algorithm step by step, because they have to realize what happens inside. Otherwise you just drag and drop, or you just use the library, and then you don't know what's coming out or what's going in. At least a few times in their life they need to go by hand, yeah.
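Rosaria's "calculate it by hand at least once" applies nicely to something like entropy, which comes up again later in the conversation. A minimal worked example: Shannon entropy of a toy class distribution, written out term by term instead of pulled from a library. The counts are made up.

```python
import math

# Class counts for a toy node: 8 positives, 2 negatives.
counts = [8, 2]
total = sum(counts)

# H = -sum(p_i * log2(p_i)), computed term by term.
entropy = 0.0
for c in counts:
    p = c / total
    entropy -= p * math.log2(p)

print(f"{entropy:.3f} bits")  # 0.722 bits; a 5/5 split would give 1.0
```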
Ben_Wilson:
Yeah, I mean, I couldn't agree more.
Rosaria_Silipo:
Yep.
Ben_Wilson:
With some of the advancements in productivity that are out there, you know, your company's tool is a perfect example of something that empowers the informed to save many, many hours and many days of effort. By saying, okay, I know the tools that I need, I know the components I need in order to solve this problem, because I understand what these algorithms can and can't do, and what these steps and these transformations do. So I can, in a very low-code or pseudo-no-code environment, create a DAG of operations that need to happen and get that solution into a state where I can evaluate it much faster than if I had to roll my own and build everything from scratch. So it's great that these tools exist. Um, but if you don't know what those components are... In most software suites out there that attempt to solve this, that attempt to save a lot of time for practitioners, you know, SAS has Enterprise Miner, which is sort of a graphical interface to a certain degree, and you can, of course, supplement that with SAS scripting code, all the way to tools like DataRobot, which is similar with their graphical user interface. In the hands of the uninformed, do you think it's potentially dangerous for a company? Like, let's say we have a company that just says, hey, I need data scientists. I'm going to hire 10 people straight out of school, we're going to buy this tool, and we're going to start making millions of dollars with these 10 people.
Rosaria_Silipo:
This is another interesting question. So there is a colleague of mine who says that if you give him KNIME, it's like having a bazooka without knowing how to use it. So he does all the drag and drop, with all the possible algorithms, and then he gets some result and he doesn't know how to interpret it. But I have another colleague who says that as long as you know the basics, you don't need to know, for example, that if I have this kind of data set, the measure of the entropy is, I don't know, this much; you just trust the algorithm, you know that it's calculating the right entropy, and then the decision tree at the end gives you some classification. So, yeah, I don't know. I used to think that it's like a bazooka, and then you need to make sure that you know what you are doing. But I think that in the last few years some consolidated, standardized processes for machine learning have emerged, and these standard procedures protect you from the most common mistakes. You can know, for example, that if you have a data set split into training and test sets, whatever is in the test set must never leak into what the training sees. That, in theory, is what protects you from overfitting, among other things. I don't know, I think the truth is in the middle: you still have to know what you are doing. So I would say that in most cases, when you have a problem, you need to understand what is happening in the algorithm, but the standard procedures in the training process and in the model evaluation help you a lot. And this is interesting, because as data science and AI have matured, they have become more and more engineering and less and less research and development. So you acquire some standard processes that, if you follow them, keep you from making the most common mistakes. And that is what happens when a field matures: it moves away from research and more into engineering, without going into the tiny, tiny little details. So maybe now it's changing a bit. It used to be, definitely, that you would have a bazooka and then you don't know what you were doing.
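One of the standardized procedures Rosaria alludes to is keeping the test set out of every fitting step, including preprocessing. A minimal scikit-learn sketch of that discipline: split first, then let a Pipeline fit the scaler on training data only. The synthetic data is just a stand-in.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# Split first: nothing from the test set may leak into training.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# The pipeline fits the scaler on training data only, so the test set
# never influences the preprocessing statistics.
model = make_pipeline(StandardScaler(), LogisticRegression())
model.fit(X_train, y_train)
print(f"held-out accuracy: {model.score(X_test, y_test):.2f}")
```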
Ben_Wilson:
Or sometimes it's just a landmine, where, you know, a bazooka is directional; a landmine is
Rosaria_Silipo:
I am not an expert.
Ben_Wilson:
you're holding something that's active. You're going to take yourself out and your whole team out, and, if you're not careful, your whole company. I've seen implementations at companies, talking to practitioners, where they don't quite understand what the algorithm's doing. You know, they either read a blog post or, you know, read a really high-level book on ML, like on applied ML. And they're like, well, I hear this package won a lot of Kaggle competitions, or this algorithm
Rosaria_Silipo:
Mmm.
Ben_Wilson:
is really famous. So therefore it must be good for all problems and nothing could be further from the truth. And a lot of
Rosaria_Silipo:
Yeah.
Ben_Wilson:
times when I've talked to. to groups like that, they're talking to me because they have an issue with something that they built. It was working great for the first three months. Then all of a sudden the predictions now make, it's not that they're not making money anymore, it's they're actively losing money because they're relying on those decisions from a model that overfits so poorly to their training data set. They didn't even know that they needed to retrain or how to retrain properly or how to evaluate. And their code base, because they're using, you know, either derived examples from blog posts or something they copied out of a book somewhere, or they just took somebody's Kaggle submission and copied all of the code and put it
Rosaria_Silipo:
Yeah, yeah.
Ben_Wilson:
into a notebook. They don't understand many things about it. They don't understand how the code works, so from that engineering perspective it's a black box to them. And then, from an algorithm perspective, they don't know why the model is overfitting. They don't know, like you mentioned, that entropy calculation,
Rosaria_Silipo:
Yeah.
Ben_Wilson:
it's shocking to me how many people I've asked, like, hey, how does a random forest decide where to split for regression versus a classifier? And they go... When I'm teaching people, I always send them, like: all right, here's the homework you're going to do. You're going to go read about differential entropy, and you're going to tell me what that means, how an algorithm could calculate this, and how it actually, physically, does it from an engineering standpoint. And you see the light come on in somebody's eyes when they finally start understanding, like, oh, that's how this algorithm works. Oh, I really shouldn't have this data in there then, because it's a binary and it's just splitting continuously on that binary. Yep.
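Ben's homework question has a concrete answer: classification trees score candidate splits by impurity reduction (entropy or Gini), regression trees by variance reduction. A minimal sketch of both criteria evaluated on one candidate split; the toy arrays and the split mask are invented for illustration.

```python
import numpy as np

def entropy(labels):
    # Shannon entropy of a label array: the classification impurity measure.
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -(p * np.log2(p)).sum()

def split_gain_classification(y, mask):
    # Information gain: parent entropy minus weighted child entropies.
    w = mask.mean()
    return entropy(y) - w * entropy(y[mask]) - (1 - w) * entropy(y[~mask])

def split_gain_regression(y, mask):
    # Variance (MSE) reduction: the regression analogue of information gain.
    w = mask.mean()
    return y.var() - w * y[mask].var() - (1 - w) * y[~mask].var()

y_cls = np.array([0, 0, 0, 1, 1, 1])
y_reg = np.array([1.0, 1.1, 0.9, 5.0, 5.2, 4.8])
mask = np.array([True, True, True, False, False, False])  # candidate split

print(split_gain_classification(y_cls, mask))  # 1.0 bit: a perfect split
print(split_gain_regression(y_reg, mask))      # large variance reduction
```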
Rosaria_Silipo:
Yeah, when I do the courses, when I teach, usually I have a little competition inside the class, and then I say, well, let's see who gets the highest accuracy. And there is always one student, always, every time, who comes back with 100% accuracy. And then I usually tell them: if it looks too good to be true, then it probably is too good to be true. But you know, maybe not, I don't know, but you should at least double-check.
Ben_Wilson:
Mm-hmm.
Rosaria_Silipo:
Yeah, yeah, it's another issue, as you say, that they usually take the most popular algorithm of the time and use it. And then you get these humongous models to recognize, I don't know, some easy, easy task, whether you are male or female or something like that, something extremely easy, and they could have done it with a much, much smaller, much easier model. I remember there was another colleague of mine, he tried to predict the winner of the European Soccer Cup; that was in July. And he managed it: he predicted who would play the final game of the European Soccer Cup. And I remember that I was reading what he did, and he had actually used a linear regression. And I said, what? Something so relatively random, because soccer is a relatively random game. I
Ben_Wilson:
Hehehe
Rosaria_Silipo:
mean, something so random and you just managed to do the right prediction, training a linear regression. I mean... Yeah, the data, okay, he downloaded all the data from the FIFA website.
Ben_Wilson:
Mm-hmm.
Rosaria_Silipo:
And, you know, just with the linear regression, he was able to predict the right teams in the final game. So not even a deep learning, not even a complicated, I don't know, a GAN or something or some
Ben_Wilson:
Ha ha!
Rosaria_Silipo:
super complicated network. So yeah, sometimes that's also the other problem, right? They don't know all possible algorithms and
Ben_Wilson:
Yes.
Rosaria_Silipo:
they go for the latest and greatest and possible and they end up with a humongous model that they can't handle because
Ben_Wilson:
Mm-hmm.
Rosaria_Silipo:
to put in deployment something so big and making sure that it doesn't overfit is not that easy. And you know, something like that could have been much easier done with a smaller model, a traditional model. So something like that, that's another. Yeah, another side of sometimes not knowing what you are doing with the algorithms.
Ben_Wilson:
Yeah, that brings me to a little competition that we had at a previous company I was at. You'll probably like this story. The challenge that was given from the head of engineering to everybody on the team, and basically everybody in the department, so software engineers and data scientists, data engineers, everybody, got a chance to come up with a way to solve this problem, and it was just for fun. And what it was was photos of all employees. They basically had a photo studio that took a full-body image of every employee on their first day at work. And they wanted to determine just, hey, can you predict male or female? That's all we wanted to do. So we had submissions from software engineers and data engineers and a couple of data scientists, and everybody just went to the latest Inception model. They were like, oh, I'm going to use PyTorch, I'm going to use TensorFlow, and I'm going to retrain on all of our employee images that have been labeled. And I'm going to need some GPU instances so that I can get enough epochs on this. And they're tweaking the layer specifications, trying all of these different things that they were seeing on these websites. And then me and this other guy, who was an ex-data engineer, were like, I bet there's an easier way that we can do this. So what's the average height of a woman versus the average height of a man? And
we knew that the camera tripod was fixed in location, so it was always the same aspect ratio and the same distance to where people were standing. So you have a fixed reference in each image of how big a human is relative to the distance. So all we did was just use very simple, like, JPEG manipulation tools. We just, you know, basically rastered over each image until we saw a change in the distribution of pixel color, because it was on this, like, blindingly white background. So we knew where somebody's hair started, and we just recorded the row at which we got an indication of human skin or hair. And then we just said, let's do a distribution of that and say, okay, where does everybody fall? And we hit something like 97% accuracy on that; everybody else couldn't get above about 70%.
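A minimal sketch of the trick Ben describes: scan image rows top-down until the pixels stop looking like the white backdrop, use that row as a height proxy, and threshold it. The image, the background threshold, and the cutoff row are all hypothetical; the real cutoff would be fit from the two height distributions.

```python
import numpy as np

def first_subject_row(image, background_min=240):
    # image: (H, W, 3) uint8 array on a near-white studio backdrop.
    # A row belongs to the background while every pixel stays near white.
    for row_idx, row in enumerate(image):
        if (row < background_min).any():  # something darker than backdrop: hair/skin
            return row_idx
    return image.shape[0]

def classify_by_height(image, cutoff_row=120):
    # Fixed camera means the top-of-head row is a proxy for height.
    # cutoff_row is a hypothetical threshold fit from the height distributions.
    return "taller group" if first_subject_row(image) < cutoff_row else "shorter group"

# Synthetic image: white backdrop with a dark 'person' starting at row 100.
img = np.full((480, 640, 3), 255, dtype=np.uint8)
img[100:, 200:440] = 60
print(first_subject_row(img), classify_by_height(img))
```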
Rosaria_Silipo:
And because probably the complex network that they were using was trying to implement exactly the same thing, but they overfit because they had too many parameters, probably.
Ben_Wilson:
Yeah, we went back into some of the deep learning implementations and we basically started pulling out different layers and seeing what the deep learning CNN was seeing at certain layers and just like basically re-rendering an image
Rosaria_Silipo:
Mm-hmm.
Ben_Wilson:
from that. And it was picking up on stuff like the shape outline of clothing and what it seemed like was sort of posture. It was detecting like, what is the shape of somebody's legs and feet? And like, that's completely irrelevant. So there wasn't enough training data because it wasn't, we didn't have like,
Rosaria_Silipo:
Yeah.
Ben_Wilson:
you know, 500,000 employees or something. And what I was explaining to everybody who did those approaches, who couldn't understand why our approach worked so much better, was: well, you're reusing a model that was designed to look at just general images from,
Rosaria_Silipo:
Many more, yeah.
Ben_Wilson:
Yeah. So many different classes in there, and that network is massive. And it's also expensive to train properly. So I've always been a proponent of, like, simple is best in ML solutions.
Rosaria_Silipo:
Yeah, when they ask me how they should start, I always tell them: start simple, and then if it doesn't work, it's easier to make it more complicated. It's much harder to go the other direction, to start complicated and make it simpler.
Ben_Wilson:
100%. I could not agree more.
Rosaria_Silipo:
Yeah, yeah, yeah, always, always start simple. You might be surprised.
Ben_Wilson:
And it's fast too.
Rosaria_Silipo:
It's fast too, exactly, and if you make a mistake it's not a big issue. And you understand, because usually the simpler the model, the easier it is to understand what it does. So maybe you also get some insight into whatever the model is trying to do.
Ben_Wilson:
And also, another big initiative that people have these days. And I hear it from practitioners that are serious about getting ML into production, but their years of experience aren't there yet. So they're like, hey, I've been a data scientist for two or three years, so I have a lot of experience. Not really, but maybe as compared to your peers at your company, you do. But they're like, well, management is telling us that we need to do bias and fairness measurement on our model. I'm
Rosaria_Silipo:
Oh, yeah.
Ben_Wilson:
like, okay. Um, and we also need to explain why it's making the predictions. So I heard about this thing, you know, called SHAP. I'm going to get Shapley values and start looking at how people are evaluating things like that. And then when they test that out, they're like, well, it's really hard to get this, or it takes a long time to get these results from a deep learning model. But then I tried it on a linear regressor and it finished in less than two seconds.
Rosaria_Silipo:
Yeah, I don't see it.
Ben_Wilson:
Like, yeah, do you know what a linear regression is made of? It's just coefficients on a very simple equation. It's pretty easy to chug through a bunch of data with that.
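The reason the linear case finishes in seconds: for a linear model with independent features, the exact Shapley value of feature i is just the coefficient times the feature's deviation from its mean, so no sampling machinery is needed. A minimal sketch computing them directly on synthetic data; the coefficients and data are invented.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=200)

model = LinearRegression().fit(X, y)

# Exact Shapley values for a linear model with independent features:
# phi_i(x) = w_i * (x_i - E[x_i])
phi = model.coef_ * (X - X.mean(axis=0))

# Sanity check: contributions plus the base value reconstruct each prediction.
reconstructed = phi.sum(axis=1) + model.predict(X).mean()
print(np.allclose(reconstructed, model.predict(X)))  # True
```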
Rosaria_Silipo:
I think also, and this is another topic, that with the start of deep learning this has been a bit forgotten: the more you prepare the data, the easier the job is for the model to learn, right? So there is this data preparation part that is more like a data engineering task, but of course it makes you understand what your data is, and it also makes the job for the upcoming model much easier. It removes the noise, it removes the wrong things, it prepares the information ready to be used by the model. So I think the data engineering part of our job is a bit underestimated. Sometimes I think we should do more of that, instead of just throwing things into a network and then seeing what the network does. So I think that, you know, the engineering around the model is an important part. A lot of these fantastic models that you see around, the AI models, actually they are masterworks of data engineering, because they prepare the data, and they also export the model in a way that it's useful. So yeah, I think the data engineering is another part that is often underestimated.
Ben_Wilson:
I couldn't agree more. Uh, and I have seen that in a lot of people I've talked to that get into the profession of data science thinking that all they're going to be doing is effectively like a Kaggle competition. Like, hey, I'm just going to try to get the greatest accuracy I can, or I'm going to get the lowest R squared.
Rosaria_Silipo:
Don't get me started.
Ben_Wilson:
And they think that that process of creative modeling is what the job entails. And I've always told... you know, I've had people that are still in college contact me on LinkedIn, like, hey, could I talk to you for an hour sometime? Sure, yeah, I'll tell you exactly what it's like working as a data scientist, and as an ML engineer, and as a software engineer, and then you can choose your path about which one sounds the most interesting to you. And it's always interesting how their perception of what data science work is, is so divorced from reality. I tell them, hey, 85% of your work has nothing to do with models. It's data acquisition, data cleansing, statistical analysis of those data sets. It's determining, hey, do I have a potential issue if I feed this into this algorithm? If I have, you know, covariance in the data set, how do you check for that? How do you independently validate the variance amongst these features that are going in there? How do you validate that you don't have spurious signals in your data, and how do you remove them? And then another 40% within that chunk is: how do you pitch your idea to a business?
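A minimal sketch of the kind of pre-modeling checks Ben lists: a correlation matrix for covariance among features, and variance inflation factors for redundancy. The synthetic frame, with "c" deliberately built as a near-copy of "a", stands in for a real feature set.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
n = 500
df = pd.DataFrame({
    "a": rng.normal(size=n),
    "b": rng.normal(size=n),
})
df["c"] = 0.9 * df["a"] + 0.1 * rng.normal(size=n)  # nearly redundant with "a"

# Pairwise correlations: a first pass at the covariance structure of the features.
print(df.corr().round(2))

# Variance inflation factor: VIF_j = 1 / (1 - R^2_j), from regressing
# feature j on the others; values well above 10 flag redundant features.
def vif(frame):
    out = {}
    for col in frame.columns:
        X = frame.drop(columns=col).values
        y = frame[col].values
        r2 = LinearRegression().fit(X, y).score(X, y)
        out[col] = 1.0 / (1.0 - r2)
    return out

print(vif(df))
```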
Rosaria_Silipo:
Ah, that's the whole communication thing, yeah.
Ben_Wilson:
and it's critical.
Rosaria_Silipo:
Communication skills are important. And then another thing they need to understand is that creating the model, the fantastic model that they create, is one thing, but then they also need to use it. They need to be able to export this model, and the whole deployment part is usually not taught, or, I don't know, somehow it doesn't get into their heads at the university. When I am invited to
Ben_Wilson:
Hehehe
Rosaria_Silipo:
something in their brain opens, because they see what it is useful for, and you know,
Ben_Wilson:
Mm-hmm.
Rosaria_Silipo:
and then you can actually make money out of that because you can use it in a particular way, or you can save people because you can use it in a particular way. They see the implications; it's not just an accuracy number. And that's definitely something that I would like to see more of, for example, when I receive a CV. Because, yeah, sure, you have done the model, you got a nice accuracy, you are very resourceful and knowledgeable, but I need also some practical application at the end. In a company you need to make money, and then you need to make it work, or you need to do whatever the company's goal is. If it's a non-profit, maybe it's not money, it's something else, but still, at the end, it has to do something and be useful for something. So definitely. And also, another thing: it would be nice if people could learn some of the communication skills too, also writing a report.
Ben_Wilson:
Exactly. And then the other part that I also see people struggle with, once they get into doing this work in industry... it's something I've noticed with people that have started in this profession probably in the last 10 years. Ten years ago, when I was looking at people that were starting to work in ML and data science, a lot of them were coming from pure academia, like, hey, I just finished my PhD in statistics or physics or computer science. And they're used to working in a lab environment where they were working on a project and getting peer review constantly, and they have advisors that they can ask, and they're also asking their fellow classmates. But it seems like the recent influx of people comes from different backgrounds that don't go through that academic rigor. It's more, I don't want to say it's ego-focused, although I have seen that, of like, hey, I'm the one working on this project. I'm doing this in my notebook. I'm going for that amazing best accuracy. And if there's nobody who's a technical lead or a manager who knows anything about data science work or software in general, something can get shipped to production that is completely unmaintainable
Rosaria_Silipo:
Yeah, yeah.
Ben_Wilson:
or is full of bugs. And you just leave it up to that one person to fix their own code, or their script, basically, that they've written. And it might've been a useful project, and it may have been making money, it might have been serving results on a website or something. But if it's going down all the time because the code is written terribly or it's just unstable... I mean, do you see that as a really big issue, people working in deployed ML who don't know how to write a unit test?
Rosaria_Silipo:
So
Ben_Wilson:
Ha ha ha
Rosaria_Silipo:
that's another topic. So another thing that I think has changed. I don't know if you remember that article about the sexiest profession of the 21st century. It was an article, I think in Harvard Business Review... when was that? Was that before the pandemic? So it must have been 2018, 2019. And they wrote something like, the sexiest job of the 21st century is the data scientist.
Ben_Wilson:
Mm.
Rosaria_Silipo:
And there was, I don't remember exactly, this idea that the data scientist is a superhero who knows how to do everything: how to build the model, the data preparation, the deployment, the presentation, and so on. And I think that has changed. That perception was connected to how everything was done in the lab, where the data scientist was doing most of the work.
Ben_Wilson:
Right.
Rosaria_Silipo:
The truth is, it's not like that. Nobody builds the models alone in a few months and takes care of everything themselves. There is the data engineer, who moves the data around and takes care of productionizing what has been produced. In the end, it's a lab. In a lab there are different competencies, and there must be people who take care of developing
Ben_Wilson:
Hehehe
Rosaria_Silipo:
and productionizing the models reliably. That means all the tests: unit tests of the models, or more general tests, stress tests, how many people will use it at the same time, and so on. So not only the people who created the model, but also people who take care of the quality of the exported model and the reliability of the model. That has changed a lot. It's not a superhero anymore; it's a collaboration between different professional figures,
Ben_Wilson:
No.
Rosaria_Silipo:
which is what I have seen in different data science labs.
Ben_Wilson:
So working with universities as you do now, do you think that that job persona is being served?
Rosaria_Silipo:
Ah.
Ben_Wilson:
the one that is focused on MLOps, effectively. I'm like, hey, I need to stitch together all the work from the data engineer
Rosaria_Silipo:
Yeah.
Ben_Wilson:
and the data scientist to do DevOps deployment; and, like, the person who's building that CI/CD pipeline, who's
setting up the monitoring system, who's making sure that every REST request that comes in is scaled through auto-scaling, that there's a load balancer in there, that we have the ability to do, you know, shadow deployments or A/B testing, and that every prediction that comes out is being written back to a data warehouse so that we can do evaluation, so we can do, you know, attribution analysis. Do you think that universities are thinking about that and preparing people for that role?
Rosaria_Silipo:
They are starting. So I see some courses now, I've seen a few that are starting this kind of more engineering profile, but they are starting; it's only the beginning. I mean, some courses are starting to have this kind of specialization in the data space professions that had been completely abandoned, and now it's coming back in some of these data engineering courses. So it's starting. I'm not sure they are producing as many data engineers as we need yet. But yeah, I see some signals.
Ben_Wilson:
Yeah, I've noticed that as well, just from the volume of people that have contacted me or that I interact with. Everybody wants to do ML until they know what it is, usually. They're like, wow, that doesn't really sound that interesting, just using, you know, scikit-learn or these R packages. Like, I really want to solve the problems. Well, yeah, that's data science, but stop thinking about the algorithm. It's important, you need to know how it works, but the exciting part of it is talking to the business and hearing: we have this problem that we think your team might be able to help us with. And then you get that creativity of talking to a dozen people at the company and saying, hey, do you want to meet? Let's figure this out. Like, this is a challenge, and let's use our creativity and our insight and wisdom and build something amazing together, and then saying, okay, we need two software engineers and we need, you know, four data engineers. When people see it in action for the first time, when they're new to these roles, they get really excited, and they're like, this is what I want to do. But there's not a lot of people thinking about some of those other roles. It seems like people still see, like, oh, the data scientist is the center of all of this. And it's like, not really. By volume of work that's done, the data engineer and the software engineer are doing most of it, and the data scientist is tuning stuff and testing hypotheses, and the analyst as well. And I've even seen ML teams in industry that don't even have analysts embedded in them. You're like, well, who's doing the statistical analysis of all your data? I didn't see the data scientist issue a report on your features.
Rosaria_Silipo:
Yeah.
Ben_Wilson:
So who's doing that? Like, oh, we didn't know we needed to do that. Yeah, that's pretty important.
Rosaria_Silipo:
Yeah, yeah, yeah. So, I think this is the big change in the organization of the lab in the last few years, maybe with the pandemic, I don't know. But this is the biggest change, definitely: you don't need one person. One person alone cannot do everything. You need different professional figures.
Ben_Wilson:
Yeah, definitely not. Not in today's cloud architecture. And particularly if you're working for a big company and a lot of money's on the line, there could be a hundred people touching a project at a large company, as it should be. You should have all those people looking at it.
Rosaria_Silipo:
Yeah, as it should be, exactly. It has become a kind of software development. I mean, it's not software, but it's this kind of process.
Ben_Wilson:
So shifting gears to tooling now,
Rosaria_Silipo:
Mm-hmm.
Ben_Wilson:
you know, the company that you work for, when we're looking at how tools have evolved to support data engineering tasks with respect to data science things, like the consumption of data for purposes of business insight or to
Rosaria_Silipo:
Mm-hmm.
Ben_Wilson:
make money, or creating efficiency. Well, as we said earlier, you know, 20-something years ago, 30 years ago, everything was roll-your-own. Very painful, very time-consuming, but probably fun, I mean, doing everything from scratch the first time that you do it. But as time has gone on, there have been open source tools developed that simplify that process, and a whole generation of people are entering into data science and ML with free-to-use packages that are out there. But even though it's free and there's a lot of examples out there, it doesn't mean it's simple. So companies have sprung up that offer, you know, managed services to do this, to simplify it; not to make it easier, but to make it not harder, if you get what I mean. Like, it
Rosaria_Silipo:
Yeah.
Ben_Wilson:
it makes it so that it's more challenging to break things when you're using a managed service. Uh, but it also means that you're not dealing with stuff that you probably don't want to be spending time dealing with, like dependency
Rosaria_Silipo:
Yeah.
Ben_Wilson:
management and.
Rosaria_Silipo:
definitely a good thing. So KNIME is based on a graphical user interface. So it's based on this drag-and-drop thing.
Ben_Wilson:
Mm-hmm.
Rosaria_Silipo:
So you have these blocks instead of writing instructions. It's visual programming. So it removes the coding barrier. Of course, it doesn't remove the algorithm barrier; you still need to know the algorithms. But the coding barrier, I mean, it's easier, especially for the data engineer. There are many connectors for different data sources. You just use a connector node and it does everything under the hood; you don't have to know what it does, it connects to whatever source, you get the data, and that's it. So, actually, the data engineering part is much, much easier.
Ben_Wilson:
Hehehe
Rosaria_Silipo:
And then you have all these other blocks, and you perform all the things that you're supposed to perform. Each block is dedicated to a given task, and then, you know, you bring your data from A to B and you build your sequence of blocks to perform this operation. Should people code, should people not code? They always ask me that. I think people should do what they feel most comfortable with. If they feel comfortable coding, they can code; if they feel more comfortable using the drag and drop, they use the drag and drop. Sometimes they can't program, so they can only use the drag and drop. Sometimes they can program, but it's faster to build the sequence of nodes than to write the coding instructions. Sometimes they use the drag and drop for prototyping, to see what they want to do, and then they code it. Everything is possible. KNIME is open source, but what matters to me is that it is open. So that means that it's open to be integrated also with Python, with R, with JavaScript, with Java, with, I don't know, all possible programming languages. So, I mean, as I said, you can decide to code, you can decide not to code, you can decide to code a bit in some of the Python nodes, for example, and do the rest in low code. You decide whatever you feel comfortable with. I mean, of course, we could code everything. We could write everything in assembler, right? That's also an option.
Ben_Wilson:
Let's not.
Rosaria_Silipo:
But I don't know why.
Ben_Wilson:
Yeah.
Rosaria_Silipo:
So yeah, sure. I mean, you can do it once, as I said, you learn it, you know what it is. But then when it's routine work, then you can use the low code approach that works too.
Ben_Wilson:
Yeah, and the praise that I have for your company's tool, in places that I've seen it employed extremely successfully... I'm a huge fan of it because of what you mentioned. I mean, first off, I've used it for prototyping. If I'm under the gun and I'm like, hey, I've got three days to figure out what approach I'm going to use here, and what feature data I can trash, and what needs to be augmented or cleaned up, you can't beat a tool like that, because you can get that insight and issue a bunch of builds to see what the results are. If you know how to do it manually and you're using a tool like that as a seasoned professional, you can get those answers so much faster and be like, all right, I know exactly the direction, or, out of the 12 things that I need to test, I can rule out 10 of them. And I can get that answer in two hours. Rather than: can I go in and open up an IDE and write, you know, 12 different scripts that may or may not be a bunch of copy-paste, or can I write these 11 functions that I'm going to call from all of these different scripts? That's going to take me six hours to write all that code.
Rosaria_Silipo:
Exactly.
Ben_Wilson:
And it's throwaway code. I'm not using that for the production implementation. I'm doing a design, I'm trying to figure out what direction I'm going in, and I need rapid prototyping. So the tool is awesome for that. But, in addition, what I wanted to call out is: for people that understand the math and the statistics behind these algorithms and their approaches, but whose choice in career progression has not led them down the path of, you know, air-quote big data, or into doing stuff with software. They're accountants, they're
Rosaria_Silipo:
physicians.
Ben_Wilson:
actuaries. Yeah, exactly. Scientists working in ecology, or paleontologists, you know, people that are highly intelligent, highly educated human beings, but they've specialized their knowledge in a place where it's irrelevant for them to learn how to write Python code in an object-oriented fashion. They just don't need to do that. So with tools like this, you can get something that gives you the solution that you need, that solves a problem, and make it so that it's consumable. And if you need to take that solution and build some massive infrastructure around it... Like, hey, you know, the paleontologist built this amazing model that figures out where we're going to dig next season, and we need to make sure that this is updated for 1,300 locations around the planet, and we're going to divert research funding to this, so we need this to really perform. You can take that entire pipeline and all of the attributes associated with it and use it as your template, or as your blueprint, for building that extreme-scale production deployment. Which might be like, hey, we can hand this to software engineers and they can just run with it.
Rosaria_Silipo:
Yeah, there are many solutions. Once you decide to go for deployment, there are many solutions.
Ben_Wilson:
Yes.
Rosaria_Silipo:
Of course, you need a better thing than just the prototype you have built, and then you have a bunch of best practices to implement. You also have a bunch of other software; for example, I don't know, KNIME has two products, and the other one would be the one that helps with the IT infrastructure. I mean, there are a bunch of solutions for that. But definitely, KNIME was designed from the beginning to be fast, fast in creating a data application, also for people who can't program; they can't, or they don't want to. I mean, sometimes it's just a time issue.
Ben_Wilson:
Can confirm.
Rosaria_Silipo:
Yeah, for example, we had a project with a hospital, and you know, with nurses and doctors, they don't have the time to program. There is no way you can convince them to learn to write a line of a program; they are too busy. And they wanted to isolate some episodes, and they would build their pipeline. I mean, it was not too complicated, and they managed to isolate the episodes they wanted to find, and they were very happy with that. So that's exactly the goal.
Ben_Wilson:
Yeah, that speaks volumes to the direction that a lot of... that the industry in general is going with tools like the ones your company offers: the democratization of approaches to advanced, yesteryear research-only algorithms that are now in the mainstream, putting the power of that into the hands of the people that really know the problem. Because you can have somebody who's one of the finest software engineers on the planet, who knows everything there is to know about, you know, distributed systems and networking, and has all this massive volume of knowledge about computers. But if you tell them to solve something in a hospital, they're not going to have any
Rosaria_Silipo:
No, of course.
Ben_Wilson:
idea where to go. They can be like, uh, I need to interview a bunch of doctors and nurses and say, what is this problem? And where's the data? What does that data mean? Yeah.
Rosaria_Silipo:
No, no, of course. I work with these people from time to time, and I have no clue. I need to rely on what they tell me we should do. So sometimes I help them, but the knowledge stays with them.
Ben_Wilson:
Yes.
Rosaria_Silipo:
I don't know, I am not an expert in that domain. This happens often, and that's where collaboration between the people who are a bit more into the data science and the domain experts is necessary. And if you empower them to work with you, or even to work alone, they can build fantastic solutions. It's really nice, sometimes, to work with them and see how they find these things that they wanted to find; they're all very, very excited. Yeah, we work with a number of professionals often, and they are not necessarily coders. Sometimes they know how to code, they just don't have the time. Yeah. Also pharmacists, and we've worked with a lot of lawyers.
Ben_Wilson:
Hmph.
Rosaria_Silipo:
So that's another one, you know, because they use the text processing extension, and with the text processing extension they try to find, for example, previous cases and how they were solved, or, I don't know, something like that. And yeah, that's how they do it.
Ben_Wilson:
Interesting.
Rosaria_Silipo:
And you know, they build together this pipeline of text processing blocks, and then at the end they get to isolate, or to search into, the database of public cases. Now don't ask me, I'm not a lawyer. So these previous cases: they find out how the case was solved, what the case was, and if there is a precedent, this kind of thing. It happens very often. Paleontologists, no, I have never worked with a paleontologist. That has never happened to me. But it would be cool. It would be very cool.
Ben_Wilson:
It would be.
Rosaria_Silipo:
Yeah, we will.
Ben_Wilson:
Yeah. That's a whole lot better than sitting on the search interface for LexisNexis, where all of those legal cases are, and trying to write whatever archaic, ancient search query format they use there. I've seen that interface before. I had to use it at a previous company. And it was like, really? This is what we're at? This feels like the 1970s.
Rosaria_Silipo:
Yeah.
Ben_Wilson:
Yeah. Really painful. All right, well, this has been a fantastic discussion and I could continue this for hours, but I'm sure you have stuff to do. So
Rosaria_Silipo:
Yeah.
Ben_Wilson:
before we leave today, could you tell people where to go in order to test out your company's products and also how they can get in contact with you if they have questions?
Rosaria_Silipo:
Well, let's start with the company product. The open source product is the KNIME Analytics Platform. You go to KNIME.com, and on our website you find the whole description of the product. There is a download button; you download it and you start using it. If you need help, still on the web page there is a Learning tab, and under Learning you find all sorts of options: courses, beginner spaces, examples, challenges. You find all sorts of tools that you can use to learn. The challenges are the new ones; I'm very proud of those. We had 40 challenges this year. We just had the award for the people who did the best solutions.
Ben_Wilson:
interesting.
Rosaria_Silipo:
Yeah, so you can download some of those. We are going to start with a season two, but that's just a rumor; you didn't hear it from me. Next year. So this is the software. If you want to talk to me, then I'm on LinkedIn. I'm the only Rosaria Silipo, or one of the few Rosaria Silipos, on LinkedIn, so you will find me very, very easily. Then you can send me a message; I check my LinkedIn profile often.
Ben_Wilson:
All right, Rosaria, it's been an absolute pleasure talking with you today. This was a really fun discussion, and I'd just like to thank you for coming on the show and discussing all of these topics. So, until next time, everybody, I've been your host.
Rosaria_Silipo:
Thank you for inviting me.
Ben_Wilson:
You're most welcome. So until next time everybody, uh, I've been Ben Wilson and we'll see you next time.