Data Visualization and Hugging Face - ML 131

In today's episode, we chat with Sylvain Lesage from Hugging Face, a specialist in data visualization and software engineering. Dive in to discover insights about Hugging Face's software engineering environment, invaluable data visualization techniques, and more!

Transcript


Michael (00:01.185)
Welcome back to another episode of Adventures in Machine Learning. I'm one of your hosts, Michael Burke, and I do machine learning and data engineering at Databricks. And I'm joined by my newly appointed software engineering co-host.
 
Ben (00:14.366)
Ben Wilson, I integrate with tools like what our guest company builds at Databricks.
 
Michael (00:22.897)
Nice. Yeah. So great segue. Today we have Sylvain Lesage. He got his PhD in machine learning and scientific computing. And since then he's taken a variety of software engineering roles. And honestly, there are a few too many to recount, so I won't. But most recently, he's been focusing on data visualization and software engineering at Hugging Face.
 
So Sylvain, I was going through your website for a little bit too long last night, looking at all the cool things you've put together, and I was wondering why you build so many things, so many tools. And to give a little bit of context for the listeners who might not have seen the website yet: he has a little widget that solves the equations for the mechanical design of a spring. He also has stuff for the Graham scan algorithm, which finds the boundary of the smallest convex polygon that encloses all points of a set. So is this just for fun? Is it to learn? Why do you do all this?
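
For listeners who want to see the convex hull idea concretely, here is a minimal Python sketch of a Graham scan. It is only an illustration of the general algorithm, not the code from Sylvain's site; the function names `cross` and `graham_scan` are made up for this example.

```python
from functools import cmp_to_key


def cross(o, a, b):
    """Z-component of the cross product (a - o) x (b - o).

    Positive means o -> a -> b is a counter-clockwise (left) turn.
    """
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def graham_scan(points):
    """Return the convex hull vertices in counter-clockwise order."""
    pts = sorted(set(points))  # drop duplicate points
    if len(pts) < 3:
        return pts

    # Pivot: the lowest point (ties broken by smallest x).
    pivot = min(pts, key=lambda p: (p[1], p[0]))

    def by_angle(a, b):
        # Sort counter-clockwise around the pivot; nearer points first on ties.
        c = cross(pivot, a, b)
        if c != 0:
            return -1 if c > 0 else 1
        da = (a[0] - pivot[0]) ** 2 + (a[1] - pivot[1]) ** 2
        db = (b[0] - pivot[0]) ** 2 + (b[1] - pivot[1]) ** 2
        return (da > db) - (da < db)

    rest = sorted((p for p in pts if p != pivot), key=cmp_to_key(by_angle))

    hull = [pivot]
    for p in rest:
        # Pop points that would create a clockwise or straight turn.
        while len(hull) >= 2 and cross(hull[-2], hull[-1], p) <= 0:
            hull.pop()
        hull.append(p)
    return hull


if __name__ == "__main__":
    square_with_extras = [(0, 0), (2, 0), (2, 2), (0, 2), (1, 1), (1, 0)]
    print(graham_scan(square_with_extras))
```

Points that lie inside the polygon or on one of its edges are dropped, so only the corner points remain.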
 
Sylvain LESAGE (01:25.391)
For the spring, it was my brother, who is a mechanical engineer. He needed to design a spring, so he asked me if I could help, and I did it in a few days over Christmas vacation. So it was very, very fun but clearly not what I do every day. And I wanted to work with three.js, so it was very interesting.
 
Michael (01:55.933)
Got it, so you did it because you're a kind person and you wanted to help out fellow engineers.
 
Sylvain LESAGE (02:02.163)
I was very curious whether I could do it.
 
Michael (02:08.288)
Got it. Yeah. So from my, my perspective, I absolutely love learning by doing, and is that your experience? Cause you also have a very fancy PhD and have held a variety of professional positions. So how do you approach learning new topics?
 
Sylvain LESAGE (02:25.099)
Yes, by the way, I have a special work experience because I did a lot of different things. I lived in France, in Bolivia. I did a special data engineering. I was also working on networks and web development, data visualization, data engineering. So I like very much changing topics and trying and learning things, obviously by doing them. So yes, I'm always very curious, and I like to have challenges and to try to prove myself able to do them.
 
Ben (03:16.598)
I hear that a lot from colleagues that also are like this, where it's not so much building something to solve a problem when you're not at work. It is part of that, but it's more of that process of saying, like, I want to feel incompetent again. Like I want to see how much I don't know and then figure that out. And everybody that I've known that does that, you start seeing it bleed into their work, where they can figure things out faster than they could have years before, like new problems. Do you feel like it's changed the way that you approach engineering in general, because you do a lot of that stuff?
 
Sylvain LESAGE (04:09.583)
Yes, that's true. I've always been somewhat incompetent in my job. I even was an executive director for a public office in Bolivia, the one for the domain names of the country. And I was not good at that. It was like administrative work. But I learned a lot. I also worked and spoke for Bolivia at the United Nations in Geneva.

I don't like being a diplomat, but it was something I learned too. Yes, I like to be, like, in danger and to try to work through it. So yes, it's something I like very much, and I think it's the best way to learn. I like very much to learn new things. But yes.
 
Michael (05:14.633)
Yeah, for-
 
Sylvain LESAGE (05:16.771)
The issue is that I don't really take the time to become very good at one thing. I never really took the time to focus on one thing. So maybe I'm missing other experiences.
 
Ben (05:36.578)
I don't know, you're asking the wrong guy because I'm the same way. Jack of many trades, but a master of none. But I've noticed that on teams, when you have people like that, that are very curious and want to learn new things and try out different approaches, having them
 
Sylvain LESAGE (05:43.927)
Yes.
 
Ben (06:05.75)
you know, tackling issues like somebody, I don't know if it's like that at your team, but we're kind of structured that way. We have a couple of people on a given team who are sort of generalist software engineers, and they can figure out how to design these things. They see the big picture, they understand how integrations work with other tools at the company and how a user might use something. But that person is not necessarily the best software developer, like executing code and getting it shipped. But then you have these other people that are amazing at just churning out perfect production code because they've gone so deep into this one area. Do you find, at places that you've been before, that mixed teams like that work pretty well?
 
Sylvain LESAGE (06:41.135)
Thanks.
 
Sylvain LESAGE (07:00.579)
Yes, currently we are a team of five, I think. And we have some colleagues who are Python geniuses and who are really on another level. I really write very basic Python code, but clearly I'm contributing at different levels: to JavaScript and the front end product, data visualization, the backend, the network, Kubernetes. So a lot of different things. I'm not good at any of them, but I am able to contribute to all of the stack. And on the other side, for example, there are the people who work on the datasets library, which is very specific. They are contributing and producing a lot of code every day, but clearly more focused on this topic. I think it works very well. We are able to fix issues very quickly. If it's basic, any member can take the issue, and if it's a bit harder, we always have somebody more specialized who is able to go deeper. So yes, it works well, but we are also a small team, so we can't escalate with...
 
Sylvain LESAGE (08:51.447)
every level of expertise. Everybody in the team is asked to go outside of their comfort zone.
 
But we like very much working with small teams. It's very efficient for what we do.
 
Michael (09:14.589)
Yeah, that makes a lot of sense. Sort of, the smaller the team, the more tactical you can be, and often general skill sets serve small teams well, because you do have to wear many hats. If you don't have a hundred people, you don't have your designated XGBoost engineer, your dedicated JavaScript, whatever the developers are. And so, um, it's really interesting to hear that. How would you describe the culture of your team, beyond generalist? Because I've been joining some of Ben's MLflow OSS standups and sort of learning more about that, and I work with external teams every day, multiple times a day. So what are some of the core tenets of your team's culture that you find lead to success?
 
Sylvain LESAGE (09:59.927)
I think our team works the same as Hugging Face in general, where we have some basic principles for working asynchronously. So really, in our team, we do a one-hour meeting every two weeks, but we didn't do it last time.

So it's like one month without a meeting this month. And it works very well. It's useful to speak from time to time to set up a bit of strategy or the tactics for the next weeks. But we don't have this kind of bureaucratic Scrum protocol. We really don't use that.

So it's basically Slack, GitHub issues, and that's it. And the other thing is we are really free to work on what we want. We have some priorities, but really there is no micro-management. There is even no management, I would say.

So we are autonomous and we see how every member of the team can contribute better. There are a lot of daily messages on Slack to see how everyone can help the others. But every member of the team feels responsible for the product. And so it works very well. And also, one thing I didn't mention is that everybody is working remotely. And we are not all in the same time zone. So.
 
Sylvain LESAGE (12:17.067)
this kind of setup works very well for our team. For example, the last two months, I have been working from Bolivia with six hours.
 
Sylvain LESAGE (12:34.275)
difference with the French time zone, and it worked. There were no differences. So it's very robust.
 
Ben (12:49.026)
I think a lot of our listeners who maybe work at companies that don't subscribe to that philosophy are probably thinking, yeah, the software is probably not that good, or the product is not that good. Allow me to assuage those doubts. On and off, pretty much for the last 10 months, from the initial feature implementation to follow-up PRs and feature requests that we've added, which we're still continuing to do, I've interfaced with APIs that you've had your hand in building many times, probably dozens per week whenever we're working on a feature. I've never been anything but delighted, particularly with the datasets APIs and how Hugging Face Hub works. It's cleverly designed. It's implemented properly, so as a user you don't have to worry about, oh, how do I fetch from this particular S3 bucket and get authentication keys and do all of this stuff to find this particular reference by some unique ID. The APIs are very simple. You just get this tag reference. It fetches all of the required data and brings it to your local machine.

And now you're operating off of a local cache. So the way that you described how your team works, 100% remote, distributed, not exactly following agile methodology with Scrum, it's working for your team. Do you think that the entry bar for a team like that is potentially age gated?
 
Ben (14:44.742)
or experience gated. Could you take, could you start a new team with, with five members that have less than a year experience and do you think this would, that would work? Or do you need somebody who's senior has been doing this for 20 years to kind of be the mentor for that, that team remotely?
 
Sylvain LESAGE (15:09.559)
By the way, I think it works without requiring a lot of experience. In the team, we have different levels of experience. And by the way, the team leader has a lot less experience than the rest of the team. That includes everybody.
 
Sylvain LESAGE (15:37.603)
I think if you hire people who are interested in the project and have a good technical level and you give them autonomy and responsibility, I think it works well. There is self-coordination and organization that every team will find.
 
But with clever people, you generally get a good working team. So I'm not sure that age or experience is the right requirement for a team to work this way. By the way, I think it's really a very good setup for a team. And I don't want to work in another setting anymore.
 
Ben (16:44.454)
I'm personally loving remote work as well myself. And our team is globally distributed just like your entire company is. And it works amazingly well for us because we trust one another. We communicate asynchronously or through non-traditional means and get things done.
 
One thing that I'm curious about hearing from you in particular: before we started recording, you showed us a new product that you just released two days ago. I kind of fanboyed a little bit on it. It's like, wow, this is so fast and it's so well done and the front end looks super nice and it just is very well implemented. When you get a new project like that, which I'm sure you've done things like this in your work history, but when you get something that's new or you have a new idea, what's your process that you use for going from that ideation to a prototype?
 
Sylvain LESAGE (17:51.071)
Yes, we generally try, but we don't take a lot of time to try prototypes. We think more globally than just our team: we share with more team members at Hugging Face when we have ideas of what new features we will work on, in particular with the CTO, and see how it will have an impact on the product. And when we decide on which feature we want to try, we look at which stack we can use behind it. The idea is to use the simplest technology possible. And it works well. For now. I don't know if it will always work. In particular, we are scaling and adding more and more datasets. But the idea we have taken is to make every processing step as unitary as possible, atomic. So we can scale and deploy on more and more workers when we need them. So.
 
Sylvain LESAGE (19:21.747)
And one way we saw to be able to do this is to rely on the simplest technologies: to have a file. So we use the datasets library to convert every dataset to the Parquet format, which is a very useful format for running a lot of queries or statistics on a dataset. And then all the following processing is done on the Parquet files, in particular using DuckDB, which you can run as one process that uses a file and returns results. You can run any SQL query on the file with DuckDB. So it's basically the idea we have for everything we are doing on datasets at Hugging Face for the product, to show things to people. And it works very well. So maybe we had luck, maybe it was a good idea. I don't know, but for now it's working well. And we didn't have to do much prototyping for that. We just converted to Parquet, checked that the queries were fast, that our server could serve the requests and the responses quickly. Then that's it.

And on the front end side, yes, we did some prototypes. I use the Observable platform a lot. I don't know if you know it. It's a platform made in particular by Mike Bostock, who created D3.js. And I used it a lot when I was doing more data visualization as a freelancer. And I keep using it to prototype
 
Ben (21:34.837)
Mm-hmm.
 
Sylvain LESAGE (21:48.155)
the front end data visualization. But really we are not doing a lot of research or prototyping. We only try to develop the feature, and when it's working more or less, we ship it. And if it's not working well, we fix it. There is like a culture of shipping as fast as possible and fixing when we need to fix it, not overthinking it and not forecasting every possible scenario, but rather reacting fast when we see some issues.
 
Michael (22:39.713)
Got it. So sort of the move fast and break things approach. That's really interesting. And I've seen that employed at a variety of Databricks teams as well. But typically when you have very like multi-million dollar ETL pipelines, you do need some guarantees on reliability. So it definitely depends upon the type of technology that you're working on. But I sort of have a more high level question, which is.
 
Data visualization, in my opinion, is sort of a slept-on topic. It's very pretty. It's good for LinkedIn posts, where you're like, oh, look at my scatter plot or my heat map or whatever it might be. Um, but I was wondering how you, as a legitimate expert in data visualization, think about good data visualization. What is good design? What is bad design? Are there any principles that you follow when you're looking to communicate a story with data?
 
Sylvain LESAGE (23:35.543)
There are like two kinds of data visualization. One is exploratory, and the other is explanatory, to explain things. So in the case of Hugging Face, we want to give users a way to understand their data and to change things, to filter, to search, and to get insights on the data. But we don't want to take an editorial role and tell them, this is what you want to see or to know in your data. So it's more about giving exploratory tools to people. We are just starting to integrate data visualization into the dataset viewer, but it's the direction we are working in.

Obviously, a lot of people are doing data visualization for newspapers, for example. And in this case, it's the other way. You are looking at and analyzing the data. And when you have some insights, you are trying to do a very cute and innovative data visualization to help people understand one key insight that you could extract from the data. So you are using basically the same tools to do that, but the process is different. And in our case, at Hugging Face, we are giving people the tools to play with the data and to extract what they want.
 
Michael (25:33.877)
So, a question on that: how do you know what dials and knobs to add in, and how do you simultaneously keep the product simple? Because the feature that Ben was talking about, that you guys built a couple of days ago, it's basically a histogram plot on the top of each column. And then some unique values are some road displays under that histogram. Very similar to what Kaggle does. Why? I mean, histograms are awesome. I absolutely love them, but how did you know that histograms were the right decision versus a scatter plot versus a box plot versus the 70 other plots that communicate very similar information? And how did you not actually put in 50 different widgets? Like, how do you distill down to the essential product?
 
Sylvain LESAGE (26:17.391)
Yes, first we looked at what websites are doing: Kaggle, Observable, Deepnote, et cetera. And there is like a standard way of doing things that seems correct. It's very important to do data visualization that people will understand, and there are standards that everybody understands. And we don't want you to have to think, oh, how will I read this data? When you see the histogram for a numerical column, you know what it means. You can hover the bars and you understand what is written below as the ideas and number of samples for that class. So...

And as it was the first data visualization we integrated in the dataset page, we wanted it to be as simple as possible, what people will expect. And we will iterate a bit. For example, you can see that currently we are not showing anything over the string columns.

And we have a lot of NLP datasets, and there are like two kinds of strings. Some are classes. So you have a label, for example, good, bad, I don't know. In this case, we could show the frequency of every class, as we are currently doing on some.

We have a specific type of column, which is class label. And so we show a specific widget for them. And the majority of string columns are prompts, long sentences. And so we have to think about what we will show, if we show something. If we show something, it must be very simple to catch. You have to understand what...
 
Sylvain LESAGE (28:44.847)
...this means. We are thinking of showing a histogram of the length of the strings and seeing if it works for people. If it doesn't work, we will change it. But we don't want to support every kind of data and to show very complex widgets. We prefer to go one step at a time and see if it works. If we get a lot of comments that it's not understandable, we will remove it.
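
The two string-column summaries discussed above, class frequencies for label-like columns and a histogram of string lengths for free-text columns, can be sketched in a few lines of standard-library Python. The function names `class_frequencies` and `length_histogram` and the `bin_width` parameter are assumptions made for this illustration, not the dataset viewer's real implementation.

```python
from collections import Counter


def class_frequencies(labels):
    """Count how often each class label appears, most frequent first."""
    return Counter(labels).most_common()


def length_histogram(strings, bin_width=10):
    """Bucket string lengths into fixed-width bins.

    Returns a sorted list of (bin_start, count) pairs, where a bin covers
    lengths from bin_start to bin_start + bin_width - 1.
    """
    counts = Counter((len(s) // bin_width) * bin_width for s in strings)
    return sorted(counts.items())


if __name__ == "__main__":
    labels = ["good", "bad", "good"]
    prompts = ["a", "hello", "a longer prompt here"]
    print(class_frequencies(labels))
    print(length_histogram(prompts, bin_width=10))
```

Keeping each summary to a single list of pairs mirrors the point made in the conversation: one simple widget per column type that a reader can understand at a glance.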
 
Ben (29:27.55)
Yeah, there's something really to be said, in my opinion, for the simplistic approach and not overloading users. It reminds me of a ticket I responded to a couple of months ago where somebody was using MLflow. And they had an issue where they were saying, well, my plots aren't rendering the way that I want within the UI. Initially I'm thinking, oh, that's our visualization.

I need to verify that the generated plots that we do and the plot builder work. I go and test it, and everything's working fine on my end. And then I look deeper into the ticket at what they're reporting and ask them. And it was charts that they were generating as artifacts to be stored to the runs. So it's just, you know, PNGs that they're saving off as part of logging. And I looked into the run that they shared, and there's 1500 pictures that are attached to each run. Now, like, yeah, it's really slow. Yeah, you're writing 1500 files that are all at 4K resolution, effectively. These are huge pictures that you're generating. But I looked at a couple of the pictures that they shared, and even being in this field for going on a decade and a half, I'm trying to figure out what the plot was for. It was something that I had never seen before, but it was so busy. There were so many lines. So it was like a shadowed histogram in the background. And then there were plot points of individual data points, and overlaid on top of that were different shapes, color coded. And then it looked like there were probability distribution functions being mapped within each histogram, vertical. So just on their side. And I asked the first time, what are these plots for? And what do you get out of these plots? Because it looks like this is six plots in one. And I set a timer on my phone. When I first looked at it, I'm like, I'm just going to start a timer. I'll stop the timer when I can explain what this chart does
 
Ben (31:52.234)
in less than four sentences. And it took me 73 seconds to get to that point, just working it through in my head, thinking, what is going on here? So what are your thoughts when presented with somebody's feature request, when you're talking about a data visualization or an idea that they have? For example: it would be great if, for the text string, I could get the length of each string, the average, the median, the token count as it would be seen by a Transformers model. What do you think of when people give you that sort of feature feedback with requests?
 
Sylvain LESAGE (32:32.963)
Basically, we have a lot of ideas and a lot of experts who are producing this kind of metrics and graphs and charts. For example, to assess the quality of the dataset, to see if the dataset is well balanced. If...
 
Sylvain LESAGE (32:59.659)
if there are biases inside the dataset. So there are a lot of very advanced measurements, but the issue is that it's too hard to understand them. And the Hugging Face Hub is designed for everybody. Any user that wants to do machine learning
 
Michael (33:18.017)
All right, I think this will need an edit. I wonder if he dropped or if it's just a buffering.
 
Sylvain LESAGE (33:28.503)
has to understand what we are showing. So we have this obligation, which is to make...
 
Michael (33:31.265)
Looks like a drop.
 