Michael_Berk:
Hello, everyone, and welcome back to another episode of Adventures in Machine Learning. Today we have two guests; they both work at DagsHub. The first guest is Dean: he's been a quantum optics researcher, a podcast host, and is now a CEO. And we also have Guy, who's worked as a software engineer doing front-end, back-end, and mobile development. For the past three years they've been building a company called DagsHub, which leverages popular open source tools to version data sets and models, track experiments, label data, and then visualize those results. So there's the quick hitter. Do you guys mind elaborating a bit?
Guy:
Didn't even want to go over it.
Dean_Pleban:
Okay, I'll start. So hi everyone, it's a real pleasure to be here. My name is Dean, and as Michael said, I'm the CEO and co-founder of DagsHub. My background, professionally, is computer science and physics. On the physics side, I basically went to study so I could understand quantum theory and machine learning at a non-pop-sciency level, and I feel like I decently succeeded at both of those tasks. On the quantum side, I did a bunch of research on quantum optics and quantum key distribution, for those familiar. On the machine learning side, I tried to insert myself into as many different courses as I could, but my comfort zone has always been, and I think still is, computer vision. Part of that is that in my personal time I really enjoy design, and I always wanted to understand how a tool like Photoshop can take code and algorithms and use them to make something that's really beautiful and visually appealing. I've had opportunities to work on natural language processing and a bit of reinforcement learning, but computer vision is still the place where I enjoy myself most. And then, yeah, Guy and I have known each other, have been friends, since kindergarten, so we've known each other for 27 years at this point, which is crazy. The opportunity presented itself to start working on something, and we'll probably tell the story of DagsHub in a moment. So that's my intro, and I'll let Guy go.
Guy:
Yeah, hi. Nice to meet everyone. I don't think I have a lot to add. I guess I wouldn't presume to be a front-end developer; that's the one area I can't claim anything in. I've done a lot of other stuff, though: cybersecurity, backend, DevOps, data engineering, machine learning. All of it has kind of culminated. I got into machine learning around 2015 and realized what was going on, and the development and DevOps background, together with the research work I was starting to do in deep learning, naturally led me to work on DagsHub once I realized it's something that's needed in the world.
Michael_Berk:
Got you. So the million dollar question: what is the vision of DagsHub?
Guy:
That's a question for the CEO, usually.
Dean_Pleban:
Maybe. So basically, when we started working on this, the idea was simple and straightforward, and it has evolved since then in many ways. But originally the thought we had was: Git is great, and GitHub is maybe even greater. Both of them created a foundation that enables software teams everywhere to work collaboratively and get their work to production at scale. But the moment you add data into the mix, things stop working as nicely. The parallel tooling at the time either didn't exist, or everything that existed seemed to us like it didn't work the way we imagined a sensible solution would work. Originally it was even narrower than that: there is versioning for code and a platform built on top of it, but there's no proper versioning for data, and no platform built on top of that. That's what led our first research efforts. Since then, we've grown that into thinking about versioning as the substrate on which collaboration, teamwork, and community are built, and on which the workflows that permeate the industry and enable teams to get their models to production at scale are built. So the vision for DagsHub is basically to be a central hub where teams can manage their project components, whether it's their code, their data, their models, experiments, or pipelines, and to do that with an emphasis on teamwork. Data science shouldn't be a single-player game. I think a lot of times collaboration is treated as a side effect, not as a core problem that needs to be solved. So we look at that and ask: what would it mean to do this work as a team, to review each other's work, to understand what someone else is working on, and to build that into the platform? So maybe that's a good answer. Guy, do you want to add anything?
Guy:
Yeah. Concretely, we want someone from the other side of the planet, whom you don't even know, to be able to jump into an open source machine learning project and, without asking for permission first, fork it and not just modify the code, but also add new data, label the data differently, run new experiments, and then ask to merge: create a merge request, pull request, whatever you want to call it. You can do this on DagsHub now, and you couldn't before. And then the maintainer should be able to do that review process in a way that is actually data-science oriented and not just code-diffing oriented. Basically, enable real open source for machine learning.
Michael_Berk:
What's the difference between code-diffing oriented and machine-learning oriented?
Guy:
Okay, so first of all, when you do a machine learning project, in my mind the data itself is part of the source code. You have the code files themselves, you have the data; if you change one of them, you get a different result. Take the same code with different data, and you get something totally different. This means that when you contribute new data, or not even contributions, just modifications, new labels, you want to be able to see what the new contributions are. You want to see statistics on it, diffing on it that makes some kind of sense. That's just on the data side. I think also, on the experiment side, you want the ability to run experiments, document them, and compare them. And unlike Kaggle, you don't have one single metric where you can say, ah, the AUC is higher, so this experiment is better. No, you look at lots of other metrics and parameters, and you need to look at everything at once to be able to decide: yes, I want this to be the new tip of the main branch, and deploy it. It's a lot of product work to make that convenient, I think.
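A minimal sketch of the kind of multi-metric comparison Guy describes, assuming MLflow as the experiment tracker; the experiment name and metric names are hypothetical:

```python
import mlflow

# Hypothetical experiment; assumes runs were already logged to it.
exp = mlflow.get_experiment_by_name("churn-model")
runs = mlflow.search_runs(experiment_ids=[exp.experiment_id])

# Every logged metric becomes a "metrics.<name>" column, so you can rank
# runs on several metrics at once instead of a single leaderboard score.
cols = ["run_id", "metrics.auc", "metrics.precision", "metrics.recall"]
print(runs[cols].sort_values("metrics.auc", ascending=False).head(10))
```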
Ben_Wilson:
It seems like a truly fundamental aspect of collaboration, particularly in today's world, where, yeah, you might work at the same company, but it's not like the person collaborating with you is sitting right next to you, where you can swivel your monitor over and say, hey, can you look at this? Does this make sense? Being able to say, here's a commit that I have, or here's a branch, can you pull that and check? Do I need to adjust my labels? Do I need to do some different feature engineering here? How could we possibly hit this edge case it seems to overfit on? How do we generalize this? It seems really important. My big question is: that functionality, as well as some of the validation you're talking about, is a lot of work to build. Could you walk through how you've either built it yourself or integrated other tools, and what the process is when you say, hey, we have this new idea that's important to data scientists? How does your team approach feature creation or feature integration?
Guy:
I would be happy to say we have a super systematized and data-driven solution to this question, but really we look at what makes sense to us and what users ask for. I don't have an easy answer. Dean, do you have any thoughts?
Dean_Pleban:
Well, maybe to take one step back: I totally agree, Ben, with what you said earlier, that this is a fundamental aspect of collaboration. When we were doing user interviews early on with DagsHub, we were surprised by how many people literally did that monitor-swivel thing. That was before COVID, so it was still reasonably possible; now I think everyone understands why you can't count on having that ability. Aside from what Guy said, which is trying to apply common sense or first-principles thinking and then deriving how things should be built, the way we approach this is, first, we made a decision that is somewhat disconnected from the realities of building our product: wherever we could, we wanted to integrate with open source tools. That was fundamentally important to us, because we imagined the ideal solution would be built this way. So even if at some crossroads there was an alternative that wasn't open source, we heavily prioritized the open source alternative. The second thing is, and I'm sure you understand this as well as we do, this market is still changing relatively rapidly, even compared to software development, so everyone is having a really hard time choosing which tools to use in their arsenal. And we have a meta-level version of that problem, because we're not choosing tools for ourselves; we're choosing tools for our users, who have much more diverse use cases than a single data science team. So we actually developed an unconscious methodology, which we then laid out in slides so we could give talks about it. There's a talk that both Guy and I gave called Solving MLOps from First Principles, where we don't talk about any specific tools, because the goal was not to do an ad for something, but to break down how to think about selecting tools, under the assumption that that's what we had to do for ourselves. The general methodology, if I'm sticking to the five-point plan we present, is: first, understand really well which problem you're trying to solve. I think this also applies to data science teams listening to this podcast who are thinking, I have some issue and I want to use this latest buzzword to solve it. If you take a few more minutes to really dive into the problem you're trying to tackle, maybe even literally writing it down on paper or in a Google Doc, along with the criteria you actually care about, that will many times lead you to a solution that is counterintuitive, because it's very easy to go for the shiniest thing. So the moment you have your problem set out, you evaluate solutions, hopefully not too many. That's also part of the discipline you should have: it's very easy to say, I'm going to evaluate 50 solutions and only then decide, but hopefully, after you define the problem, you can rule out most of them and have three solutions to evaluate, which is reasonable. Then you choose the one that fits the bill best. For us, again, as a platform and not as a data science team, modularity is super important, so we prioritize that where possible. And ideally it's not only open source itself; the formats should also be pretty generic, so even if it's not modular, it's very easy to plug the outputs or inputs into other systems. Those are the two main considerations I see ourselves going back to.
Guy:
Yeah, I think we try to focus on jobs to be done, which is a bit easier said than done. So we think: okay, we have this job to be done. For example, users want to track metrics and hyperparameters. We look at the available open source solutions for this. We have sometimes rolled our own version, but we really don't like doing that; we want to make things interoperable. And I just don't like vendor lock-in, basically, as a user. So I really want to use things that have an existing ecosystem, like Git itself as a basis for what we do, like MLflow, of course. So we think: okay, users want to track their experiments, metrics, and hyperparameters, and that's a good job to be done. But then we try to go the extra step and ask: what is the bigger picture here? What do they then do with this? And how can we make it work with other parts of the product in a more systematic way, without breaking interoperability? Like, just the fact that you can link things more easily: when is the user actually looking at metrics and hyperparameters? Can we show them at the moment they're needed, instead of having to tie things together in a Google Doc that says, here is the code, here is the experiment, here is the notebook?
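A minimal sketch of that tracking job, using MLflow's logging API; the run name, parameters, and training loop are illustrative stand-ins:

```python
import mlflow

with mlflow.start_run(run_name="baseline"):
    # Hyperparameters are logged once per run.
    mlflow.log_param("learning_rate", 1e-3)
    mlflow.log_param("batch_size", 32)
    for epoch in range(10):
        val_loss = 1.0 / (epoch + 1)  # stand-in for a real training loop
        mlflow.log_metric("val_loss", val_loss, step=epoch)
```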
Ben_Wilson:
Yeah, I really appreciate what both of you said about modularity and maintaining the Rosetta Stone for these disparate open source toolsets, and looking at it through the lens of how a user is going to use this. Before I started working on an engineering team, working on a tool that people use in open source, I always used to wonder: why don't they build all these features that everybody on my team is always trying to use, or that we have to build ourselves? Now, looking at it from the other side, you end up seeing opinions from so many different people. If you canvas all of your customers and ask what's most important to them, then look at a ranking of their requests and stack up different customers, sometimes they're completely at odds with one another. What's most important for this group is this esoteric feature that behaves like type A; the next most important customer wants that same feature, but behaving like type B. Well, I can't build both; that doesn't work. So you take a step back when you're designing and ask: can we abstract this, and make it so that maybe they can easily implement both of these things, and we're not going to restrict them? That makes me think of what you said, Guy, about vendor lock-in. I'm also a very passionate hater of that. You'd usually rather buy versus build when you're talking about complex things, because nobody wants to maintain complex systems if they don't have to.
Guy:
Mm-hmm.
Ben_Wilson:
But if there's some specific functionality that's only available in a pay-to-play product, it sometimes gets frustrating. You're like, well, do I really need that? It's really nice to have, but could I build a simpler version of it for myself? It seems like that's the thought process your whole company is using to build things, and it actually mirrors our own; that's how we do feature development too. It's great to hear that we're not alone in how we do that.
Guy:
כן, אני חושב שזה חשוב לתאר שאין לך ענקה ספירה, כמו שאתה אומר, איזה דברים הם מאוד ידידים ואייסולטים, כמו ומחורים, כמו נגיד, אני רוצה להשתתף על אוטו מל סולושן, אני יכול להשתתף על עצמך, זה יהיה איזה חליף, אבל אם זה כמו אינטרפס פרדיקט, שאני יכול להשתתף על דata ובקודם, ואם יש איזושהי פרדיקט גנרי שאני יכול להשתתף בין פרדורים או שם אחרות עדן פרדורים, בטח, למה לא? אבל אם זה כמו שבייסים את כל עבור המשפט על משהו Like, I don't want to call out anyone specific, but like SAP, I guess I saw does this with companies. Like the whole company is basically built around the SAP system. If you take it out, there's no company anymore. That's the part that scares me. Yep. And
Ben_Wilson:
And
Guy:
I
Ben_Wilson:
there's
Guy:
think also,
Ben_Wilson:
even
Guy:
yeah.
Ben_Wilson:
open source tools that are like that too. You look at their API contract, and you're like, hang on, in order for me to interface all my other things with this, I have to use this specific type of functionality to interface with their package. And it can sort of lock you into that design and make it so that you can't use these other tools.
Guy:
The advantage there, though, and why I really like open source, is that you always have the escape hatch.
Ben_Wilson:
Yes.
Guy:
And I've used that escape hatch a lot of times in my career. Like usually you don't want to, but you know that when it makes sense, you can do that, you can fork off, you can do the thing you need to do to make it work with your stuff.
Ben_Wilson:
Exactly.
Ben_Wilson:
And it's really powerful. I've talked to a number of people who are now contributors to MLflow who would send code to one of us individually, a maintainer: hey, can you check out my fork? You go and look at it and say, hey, this isn't proprietary, right? Are you cool with rolling this into MLflow? And sometimes the response is, we didn't know that we could do that. Like, yeah, you can! This is great, please contribute this, we'll help you get it in. Then you see that person nine months later and they've had 40 commits to the repository, and instead of working on their fork, they're making it so everybody can benefit from it. So yeah, open source is fundamentally awesome. How often do you guys have time to do that? Because I know you interface with a ton of open source packages; it's probably one of the most comprehensive integration suites out there. Do you get time to do that?
Guy:
Yeah, not as often as I would like, but yes, we do it relatively regularly. We fix bugs in open source projects that we find, and we contribute features when it makes sense. It happens pretty often. I think we've fixed a few bugs in MLflow in the past. We also wanted to contribute a big feature, but due to the product considerations you described, it didn't work out in the end.
Ben_Wilson:
Sorry.
Guy:
No, it's okay. I understand it's hard. But yes.
Dean_Pleban:
I think, to me, the... yeah, sorry, go ahead.
Michael_Berk:
No, you got it.
Dean_Pleban:
I was just going to say that one of the things that was so counterintuitive to us, and that led to the start of DagsHub, was: how can it be that this happens so often with software, but basically never with data science? To this day, the only contributions we've seen from external parties are code contributions. I can't think of any famous non-code contributions outside of DagsHub and smaller projects, which is kind of sad. Maybe sad is the wrong word, but I wish it were different, and that's part of the reason we're on this journey.
Guy:
I think it also used to be that way for software, in a sense.
Dean_Pleban:
Yeah.
Guy:
When GitHub came along, that was part of what inspired me about working on DagsHub. Think about how hard it was to contribute to an open source project before GitHub. Just lowering the barrier to entry is super important.
Michael_Berk:
That's a really good point. And I feel like there's a lot of new infrastructure just from the past 35 or 40 years of software development; a lot more tooling that allows you to collaborate online. But software is, let's say, 20 years ahead of ML, or whatever number you want to use. So, because there's a lot of wisdom in this room on this topic: what areas of ML are still very young relative to software?
Guy:
Hmm. Um...
Ben_Wilson:
From my perspective, EDA and attribution in open source are still relatively nascent compared to tools you can buy. If we're talking about modeling, say I want to fit a linear regression and tune the hyperparameters: that stuff has been in the open source community for years. In Python, you can say, I'm going to use sklearn, built on algorithms that go back to the R and CRAN packages. It's been around for a couple of decades, it's so proven, and everybody uses it; it's ubiquitous throughout the profession. But in proprietary tools, some of these things have been around for 40 years. Look at SAS software: they've had the ability to fit a linear regression since the early 1980s, maybe, and it just works. With statistical evaluation of data, visualizations of a feature set, and being able to display all of those statistics in a high-level API, there's no comparison between the capabilities of tools like SAS and what's in open source. There are packages out there that are really great; they're just not as widely used or as refined as what you can get with proprietary tools right now.
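For reference, a minimal sketch of the fit-and-tune workflow Ben mentions, using scikit-learn on synthetic data:

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

# Synthetic regression data.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + rng.normal(scale=0.1, size=200)

# Grid-search the regularization strength with 5-fold cross-validation.
search = GridSearchCV(Ridge(), {"alpha": [0.01, 0.1, 1.0, 10.0]}, cv=5)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 4))
```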
Guy:
Really.
Ben_Wilson:
Just general analysis of a data set. It's not quite as refined as it is in some of the proprietary products I've seen.
Dean_Pleban:
Interesting. The first things that came to my mind when you said EDA were the auto-EDA open source solutions; I think there are basically two, pandas-profiling and Sweetviz. Every time, I'm surprised again by the fact that I'm speaking with data scientists who haven't heard of them, and then I feel like I'm doing them a big favor by telling them these exist, because they're nice. But as far as I understand, they're not really suitable for very large-scale data and things like that. So it's not an enterprise-grade open source tool; it's more a cool thing you can use on your small data set to understand it better. But yeah, EDA, at least in my mind, is statistical analysis that's very case by case, so it's very customer- or use-case specific, as opposed to some of the other things. This is actually an interesting topic. When we got started with DagsHub, there were a few areas we felt this way about, and we thought: if we try to solve for them, we should do it later, because right now every company needs something so different from one another that building a solution that the rules we described earlier would actually apply to would be really, really hard. I think EDA is still one of those things: abstracting statistical analysis of data in such a way that, as a company, as a vendor, you could build something that would be really useful across different data types and modalities is really hard. Maybe it's also specifically harder for us because, at DagsHub, we think first and foremost about unstructured data. Statistical analysis of MRI scans is probably very different from audio, which is probably very different from 3D models, et cetera. With tabular data you're a bit more constrained, but the meaning of the data is very different, so people probably still want to see very different things.
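A minimal sketch of the auto-EDA tools Dean names; pandas-profiling is published these days as ydata-profiling, and the input file here is hypothetical:

```python
import pandas as pd
from ydata_profiling import ProfileReport  # earlier releases: pandas_profiling

df = pd.read_csv("data.csv")  # hypothetical file
# Builds a standalone HTML report with per-column stats and histograms.
ProfileReport(df, title="Auto EDA").to_file("report.html")
```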
Guy:
I don't mind sharing that the first really proper user research we did, a demo of a fake feature, was trying to display auto-EDA on tabular data files. We displayed all the statistics and the histograms and this and that, and we showed it to a few people, and they all said the same thing: I really like that you show me how many rows and how many columns it has, and I really like that I can see the first few rows of the data set. Everything else is, like, bullshit, because...
Ben_Wilson:
Yeah!
Guy:
Uh, like, yeah. It just, you know, was telling them numbers they couldn't do anything with.
Dean_Pleban:
They were really nice people, and they gave us honest feedback, which is hard to come by.
Guy:
But the insight was valuable, because they said: look, it depends what the data actually is. It could be, say, a file where several lines relate to a single user session or something; what does the distribution of a column mean then? It's totally meaningless if I need to group rows first. So it could be nice, sure, why not, but the chance that I'd actually get something useful from it is very low. So we kept the row and column count, and we show the head, but that's it for now. We realized the other features aren't really useful for a lot of people right now.
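The stripped-down view they kept amounts to something like this in pandas; the file name is hypothetical:

```python
import pandas as pd

df = pd.read_csv("data.csv")  # hypothetical file
print(df.shape)   # row and column count
print(df.head())  # first few rows
```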
Ben_Wilson:
Yeah, we...
Guy:
Maybe because we don't have good enough auto-EDA.
Ben_Wilson:
Well, we noticed the same thing when I used to interact with customers. There would be one group with the exact same response you just described: just give me a plot, just show me the table, I'll issue queries against it, I'll figure out where problems might lie. And then there's this other subset that comes from enterprise-grade tools, on-prem, so generally SAS users, and that's how I learned about that functionality. They're like, let me show you something. I was actually demoing pandas-profiling and showing, hey, this is all the stuff you can get out of this, and they're like, no, no, no, let me show you what a simulator is. They write a couple of lines of proc statements and kick off a modeling run, and a simulator shows up. Here's what happens if we constrain this feature, and it shows how the actual predictions would change. Here's what happens when we drop these four columns; it's GUI-based, they just start dropping columns, and in real time the actual prediction is changing. Here's what happens if we set additional logic after the model, and here's what happens if we remove the covariance between these two features, and it actually runs a simulation. I'm like, yeah, we're not building that. That looks like about 40 years of research and patents went into building it. And they're like, yeah, that's why we pay $8 million a year to them. I'm like, yeah, OK. That's what I was saying about the difference between what's out there in open source versus a company that employs 600 mathematicians over a 40-year period working on something like that.
Dean_Pleban:
So maybe this is pandas-profiling in 40 years.
Guy:
I wonder how that translates...
Ben_Wilson:
I think it will be.
Guy:
Yeah.
Ben_Wilson:
I think it'll eventually get to that point, because there are so many people doing, I hate to use the word serious, but things that are expensive to get wrong: applications of ML that are now moving toward open source tooling. You look at some groups I've had interfaces with. Hey, I'm talking to somebody from JPL right now. Oh, what are you using our platform for? You're tracking this satellite going and rendezvousing with an asteroid; pretty sure you don't want to get the modeling wrong on the simulations you're doing. And they're like, yeah, we write a lot of our own algorithms, but it'd be great if this were in open source. So I think eventually those sorts of things will migrate to open source projects to provide that extra functionality.
Michael_Berk:
Also, sort of double-clicking on the EDA portion: listening to you guys talk, there were two components. The first is looking at problematic areas of features and informing feature engineering. And the second, which is usually a step prior, is figuring out the context of the model and the data, what will go wrong or potentially could go wrong. I think those are two very, very different steps. If we borrow the concept from software engineering: prior to a project, you know what you're building for, what platforms you're building for, SLAs, that type of thing. Yeah, maybe not always, but ideally...
Ben_Wilson:
Sometimes.
Michael_Berk:
Sometimes, yeah. And then from there, you can actually go in and build the tooling, and leverage existing tooling, to develop whatever framework you want. It seemed like there were two components to that EDA piece. Do you guys have thoughts on that? Did you hear that, or am I making it up?
Guy:
So again, the two components you're describing are informing the feature engineering and giving the wider context.
Michael_Berk:
Yes.
Guy:
I think we definitely... so, first of all, we didn't approach this as wanting to build another BI tool that lets you slice and dice the data and aggregate and everything, because there are a lot of those; we don't want to create another one. What we set out to do is mainly solve the context issue: you jump into a project you don't know, and you want to get, as fast as possible, an impression of what the data is and what they're attempting to do with it in this project. We solve the context issue by letting the code, the data, and the data pipeline live in the same Git commit. That gives you all of the context, along with READMEs, wikis, things we also provide as part of every project. The other thing is: OK, now I look at the data and I want to figure out, more or less, what it is as fast as possible, and together with the code and the other information, you can figure out the context as fast as possible. We never intended to replace the deeper analysis that follows. We wanted to solve the problem of: is this even interesting for me? Especially in open source, or in larger organizations, where a big part of the problem is discovering some feature engineering that someone else has already made and reusing it. That's the problem we were trying to solve. Is this even what I'm looking for? Is this data total garbage?
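A sketch of what code and data living in the same commit can look like from the consumer side, assuming DVC, one of the open source tools DagsHub integrates; the repo URL, path, and revision are hypothetical:

```python
import dvc.api

# Read a data file pinned to a specific revision of a project, so the data
# you pull matches the code at that same commit or branch.
with dvc.api.open(
    "data/train.csv",
    repo="https://dagshub.com/user/project",
    rev="main",
) as f:
    header = f.readline()
```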
Ben_Wilson:
So, on the modularity of your project, if we were to create a scenario here: let's say you created a project and chose certain tools within DagsHub, defining your pipeline, saying I'm doing the visualization of my features with product A, an open source tool; then I'm using sklearn, this type of model, for the modeling; then I'm using, say, SHAP for explainability at the end; and then I have a fairness validation that's also tacked on at the end. If I were to pull that entire commit, I get the data, I get the pipeline. Do I get the ability to change out components? To say, well, I don't want to use that EDA tool; to evaluate the feature data, I want to use this other one. Would you be able to rapidly do that?
Dean_Pleban:
Yeah, so I'll start, and then you can add everything I miss. The way we're thinking about this is: if the tools are properly modular or open source, and the platform is built well, it should enable you to pull all of that and understand what's going on. Maybe the hard thing to say is that it's probably a combined effort: us building the platform, and you building the project so that your own project is properly modular. In some cases we can actually do a conversion between formats for you, and then you can say, I want a different format for something, and your life is great. But that almost by definition means the format we're converting is finite in complexity. Maybe the best example here is pandas. Pandas is very well known for being the default way many data scientists interface with slicing and dicing their data, but it's also pretty well known for not being the most efficient of the data processing frameworks. So many times, when you get to a certain scale of data, pandas doesn't stand up to the strain. Then what people want is to say: what if I could click a button and convert my pandas code to some other framework that has the same processing abilities but is much more efficient or scalable? There are some frameworks that do that, but all of them have caveats, and the main caveat is that they don't support all of the pandas functionality, because some of that functionality is fundamentally not as scalable. I think the same limitation would apply here, in the sense that we can't convert your code to a different framework; that's still going to be on you. One of the things Guy didn't talk about, and I feel like we see some of our users already adopting this, though maybe we've done a bad job of telling people it's possible, is that we invested a lot of work into visualizing notebooks in a more interactive way, and also diffing notebooks and doing a bunch of things on top of notebooks. Notebooks, like pandas, have advantages and disadvantages. One of the advantages is that you can use them as an interactive shell for your work, with outputs you can then share with people, especially if you have decent visualizations for them. So in the SHAP example, when you pull this, you could get a notebook that has the explainability report you want, and it would be easily modifiable if you need to add another function or something like that. And on DagsHub, you'd still see that in an interactive way, and that provides the initial jump-in point. I don't know if that exactly answers your question. Guy, do you want to add anything?
Guy:
Um, yeah. Everything Dean said is true. We try to give everyone that kind of escape hatch; we don't want to create a platform that is very on-rails, very opinionated. We do try to do things where they make sense, like supporting specific data types that aren't common outside of DagsHub but are kind of standard, like 3D model formats and things like this, or medical imaging. But we don't give you a platform that says: plug in your EDA here, plug in your hyperparameter tuning here. We let you choose those things, and we try to give you powers on top of things like notebooks to solve those workflow issues, like how do I show my explainability report, how do I comment on it in an interactive way. And yeah, what Dean said about pandas is true: there's no silver bullet, as far as I'm aware. Even if you write, let's say, perfectly Dask-compatible pandas code, and Dask is the tool to make pandas scalable in theory, in practice the moment you use something Dask doesn't support, Dask won't go along with you. And when you hand it over to production, the data engineers will trip over your code and that will be that: you'll have to rewrite it. There's no free lunch, I think.
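A minimal sketch of the Dask pattern Guy refers to; dask.dataframe mirrors part of the pandas API, and the file pattern and column are hypothetical:

```python
import dask.dataframe as dd

df = dd.read_csv("logs-*.csv")         # lazily scans many files; hypothetical
counts = df.groupby("user_id").size()  # pandas-style API, run in parallel
print(counts.compute())                # .compute() materializes the result
```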
Ben_Wilson:
I mean, that pretty much answered my question, which was: if you have an EDA notebook or an explainability notebook that's post-processing production code, building the model and tuning it and so on, you could have all of those artifacts as code. But you're also saying, hey, we're packaging the data up as well. So it's not just the data in the notebook; you could say, I want my raw input data that goes into my EDA, I want to version-control that, and I want to version-control the output of it too. So that if, further down the chain, I have three explainability notebooks, I can choose when to re-execute those, but also version-control the output of each of them as data, as an artifact. That's pretty darn cool. And you can share that with people. Yeah, fascinating.
Guy:
And comment on it.
Dean_Pleban:
Yes.
Ben_Wilson:
Comment on actual data artifacts? Do you do that with, like, metadata in addition to the data? Interesting. So you can comment row-wise on the data set?
Dean_Pleban:
Yes: on data, on notebooks. Well, again, in our thought process we start by thinking about unstructured data. So in our mind it would be more like commenting on bounding boxes within images, on rows in text data, on notebook cells, and things like that. The original idea came from the fact that a lot of teams we spoke with have lengthy discussions about different components of their projects, but those discussions happen in a different place, and then you need to somehow connect them back. The typical example is: how do you talk about results? Well, you screenshot your MLflow charts, send the screenshot in Slack, and talk about whether or not the results are good enough. So our idea was: why not have a discussion thread below each file and each component within your project, so that when you talk about the model architecture, you can do it on the model file itself, or on the artifact itself, and so on? And then we add more context: again, bounding boxes, rows, notebook cells, things like that.
Guy:
Yeah. We don't have commenting on rows yet, but we will; we've just prioritized other data types first. And what Dean said about the Slack channel is, I think, fundamental. I don't remember where I read it, but a very illuminating product tip I heard is that Slack is like the except clause in Python: the piece of code that catches all the exceptions so the program doesn't crash. Slack is that kind of catch-all. So if you find yourself repeatedly sharing screenshots and discussing things in Slack, or in recurring meetings, it means there's a missing feature of some product that is waiting to happen, or that you haven't realized exists, or something like that. The same goes for Google Docs that a bunch of people collaborate on: if you find yourself creating a template for a Google Doc and a bunch of people are commenting on and editing it, it's probably a product that should exist.
Michael_Berk:
Are you guys trying to augment in-person-style working, so that you can be just as productive remotely? Is that part of the vision?
Dean_Pleban:
Yeah. When COVID started in earnest, we wrote a blog post about that: this is now the de facto way people work, and we need to adapt to that and provide a solution. But I think what we're saying is a bit broader, because if you think about the functionality of GitHub with regard to open source software, that existed long before COVID. The buzzword here is a central source of truth: you want a place where the core of the knowledge surrounding a project exists. If you need to go to five different places, and you start hearing things like, oh yeah, the person who left the company knew that, but we don't know where it's saved right now, then you're probably doing something detrimental to productivity, even if you're working in the same room. And today many companies have at least a hybrid policy, so some days you work from home, and then you definitely can't swivel your monitor toward the other person. So you need a way to log this. Even for us at DagsHub: you use Notion, you use Discord, all of these tools that help you communicate. But when you're working on a specific project, ideally that context exists within the project. That's how we're thinking about the augmentation DagsHub provides. So that's it.
Guy:
It's also about collaborating with yourself. I love talking about this, because everyone knows the feeling of coming back to code you wrote a while ago, looking at it, and asking: who's the idiot who wrote this? Then you run git blame and see that it was you. I record things because I know I'm going to forget what I did and where it is and what happened. And yeah, I think data scientists, when we started out, were a lot more in the mindset of: I'm a researcher and I'm messy, and that's okay, I'm a scientist, not an engineer. It's much less the mindset now. People see the value of recording the work, even if it's just for themselves, a lot more. That's my feeling.
Ben_Wilson:
And I can't express strongly enough how much I wish you guys, and this tool, had existed 10 years ago, when I was doing team-lead stuff for data science work in industry, applying software engineering fundamentals, agile fundamentals, to the data science team and pissing everybody off. They're like, what's with all these PRs? Why do we have to submit our code? Why do you care? Where do you want to see it? Here's my score on the validation data. I'm like, well, I need to reproduce it, because I'm going to simulate production. Can you send me over your data set? Well, OK, here's the query. Issue the query, model doesn't train; features are wrong or renamed. And then you get back to them: no, you did something with this query. Where's the actual query you used to train this model? Oh, let me dig that up, I've got to go through my SQL history. And then you waste days trying to recreate something. Or somebody hurries up and finishes a project before they go on a three-week vacation, and it needs to ship while they're gone. Well, I can't even review this, or I have to recreate the project by reverse-engineering what they did. So that feature alone would have made me an instant customer of your product, and forcing everyone to use it, because even on a team of five data scientists and ML engineers, you're talking about saving person-months a year with that functionality, particularly for the TL, who has to be checking everybody's stuff.
Dean_Pleban:
Yeah.
Guy:
I think we had one early conversation with a head of research at a well-known company. We pitched the product: never lose work again because things aren't reproducible, and so on. He was kind of skeptical; how much work is actually lost because of this? So he said, let's call in a couple of researchers from the team, and called in two. We presented the idea to them as well and asked: how much work are you losing because of non-reproducible stuff? And they said, I don't think we lose work, or time, because of that. No, no. They both kind of agreed on that. And then one said, oh, wait, remember a couple of weeks ago, when that guy left without saying how he made this intermediate engineered data set or something? And she was like, oh yeah, yeah, we worked two weeks to recreate that.
Ben_Wilson:
Mm-hmm.
Guy:
Yeah. People do it without even thinking about it.
Ben_Wilson:
Yeah. And if you're doing applied ML, not research-based stuff, where you're just supporting a business and you have 50 models in production, that scales. Every little change that happens, when you're trying to get something through staging and validate that it's going to work for a retraining event: if you don't have that snapshot, that's what I'm doing for the next 16 hours, figuring out what went wrong.
Guy:
Mm-hmm.
Michael_Berk:
Yeah, 100%. So we're coming up on time; I'll do a quick wrap, and then we'll see how people can reach out to Dean and Guy. First, to summarize: DagsHub is a central repository where people can manage their data science projects, and the underlying goal is to save your organization person-months each year, maybe even more. They leverage open source tools and make them fit together modularly, and they also have really good collaboration tools, like discussions on versioned data and metadata, so it's really easy to collaborate not only with other people, but with your future self. We also talked about some other topics. One: EDA is still pretty young. I firmly believe that as well, and it's also just a really hard thing to carry across different use cases. And: current open source tools are a bit behind proprietary ones. In 40 years, we should have much better tooling, but right now, if you're not going to pay $8 million, you're probably not going to have the best tools. So, first: did I miss anything? Anything else people want to say?
Guy:
Mm-hmm.
Dean_Pleban:
I like that summary. I'm curious whether we could get to a point where a machine learning model creates it in real time, or maybe it already has. But yeah.
Guy:
Ahh, yeah.
Michael_Berk:
Yeah, no, we actually have a model running in the background, and I just read what it said.
Guy:
I'd like to use the opportunity to put the idea of open source auto-EDA on the record as a dream, a personal dream of mine.
Michael_Berk:
Please.
Guy:
So it's on the record.
Michael_Berk:
So if people want to learn more, either about you guys or about DagsHub, how can they reach out?
Dean_Pleban:
So for DagsHub, the website is dagshub.com. A personal recommendation: I think we have a blog with a lot of objectively interesting content, because we build on open source tools. If you're looking to solve a problem, we might have a post about how to solve it from first principles, so you should check that out. But if you want a solution for yourself and your team to manage your projects, dagshub.com is the place to find out more. Personally, I'm pretty active on LinkedIn; my last name is Pleban, P-L-E-B-A-N. Feel free to either follow or connect with me there. And on Twitter, my handle is my first name, Dean, and then P-L-B-N. So you can follow me there,
Guy:
Mm-hmm.
Dean_Pleban:
happy to chat and answer questions. Generally, I really enjoy speaking with practitioners and enthusiasts about what they're working on and what their challenges are. So even if it's unrelated to DagsHub, if you want to share the last thing you spent 16 hours on and felt like you shouldn't have had to, I'd really love to hear it. So, yeah.
Guy:
Yeah, of course I also have LinkedIn and Twitter, but the easiest way to reach us is dagshub.com. It's not just a SaaS product for companies: if you want to work on open source projects, you can just sign up. We have prominent links to our Discord, which is very active; you can just pop in and say, hey, I wanted to talk, and we'll be happy to. There are also a lot of discussions about non-DagsHub topics there, ML theory or tooling or whatever. And if you're working on interesting open source machine learning, or want to: first of all, we're sponsoring Hacktoberfest. I guess it may be over by the time this podcast is out, but we also like to support cool open source projects, or even blogs, things that are interesting for the community. So talk to us, and if you're building something interesting, we might support it.
Michael_Berk:
Amazing. Well, thank you, Dean and Guy, for joining. It has been Michael Berk and...
Ben_Wilson:
...and Ben Wilson. Thank you. It's been a pleasure, gentlemen.
Guy:
Thank you.
Michael_Berk:
Have a good one, everyone.
Dean_Pleban:
Thank you for having us. Bye.