Michael_Berk:
Hello everyone, welcome back to another episode of Adventures in Machine Learning. I'm one of your hosts, Michael Berk, and I'm joined by my co-host.
Ben_Wilson:
Wilson.
Michael_Berk:
And before we get started, I sort of want to explain a little bit of the origin story of this episode. I used to work in AB testing at a company called Tubi, that's Fox's streaming service, so sort of think Hulu and Netflix, and the org had really good AB testing, surprisingly advanced for the organization's maturity. And so a lot of my job was staying on the cutting edge, reading company blogs or white papers. And so one rainy afternoon, as one does, I was flipping through recent publications and I stumbled upon some research by Mike McCourt, who is our guest today. And I was very grateful to hear that he had 90 minutes free on one Friday and was willing to speak with us today. So I'm really excited about what we're gonna be covering. And the reason is Mike is head of engineering at SigOpt, which is a company that was acquired by Intel in 2020. He has over 19 publications, some of which we'll hopefully get into today, and four patents. And yeah. So Mike, let's jump right in. Can you explain a bit more about what SigOpt actually does?
Micheal_McCourt:
Absolutely. Thank you both for inviting me here today. It's my pleasure. Always happy to talk, talk about SigOpt, talk about stuff that's exciting, and make sure that we get the cool topics to the forefront of the conversation. SigOpt is a company that was started in 2014, very, very late 2014, in order to provide what at the time we sort of just phrased as black box efficient optimization through our REST API to our customers, who are trying to, I'll simply say, optimize all sorts of things. And some of those things are machine learning models, in particular, maybe optimizing the validation accuracy of the machine learning model. It could also be used to minimize the training error in a back test for a financial model. And in fact, over the years, many of SigOpt's customers turned out to be financial customers, which was an interesting twist, and I got a chance over the years to learn a lot from that. We also had, and much of the work that I'm most proud of is, the enablement of optimal design for physical products, for actual things that are being manufactured. And even before I started there, this was happening. There was an example, and the name of the company escapes me right now. George Bonacci, I remember, was the first scientist; I can't remember the name of the company. He was making shaving cream, and he was trying to design shaving cream to be both very robust and smell nice, but also then not fall apart. Because if you mix it the wrong way and the emulsion is wrong, then it just turns into water or something like that. And in reality, what was exciting was he was able to use a tool like SigOpt to come up with a really effective formulation of his shaving cream. So that topic, whether you call it formulations engineering, whether you call it computational chemistry, whether you call it material science, the application of a tool like SigOpt for optimal design in each of these settings has been very exciting. Some of the work that we've been able to do over the years has been with material scientists and making sure that they know, hey, a tool like SigOpt, even though we think of it as maybe a hyperparameter optimization tool, and we certainly market it like that, it really has broad usage. And I've been so lucky to work at a company that was willing to explore that, to not just say, hey, we're going to do hyperparameter optimization only, we're only going to talk to ML people, we don't want to hear about anything else. I've been very fortunate to have some really visionary founders as part of the company who said, no, you know what, if there's an opportunity, let's go after it, let's do some good work. We've been able to do that and develop new features, including the one in the article that you mentioned, which branches beyond, I think, what's often discussed in any sort of ML training or HPO, and is most useful in these physical situations, in these maybe expensive, either computational simulations or even physical manufacturing. One of the best experiences of my life was being able to actually touch something that was manufactured as a result of a design that we suggested to the folks. And I'm just very fortunate, very fortunate to be a part of it.
Michael_Berk:
Yeah, that's super cool. And the paper that we were referencing essentially touched on Pareto frontiers. And these are sort of the optimization frontiers of multi-constraint or multi-objective optimizations. So for instance, if you wanna optimize the price of X while minimizing the amount of materials you're using to create X, well, those are two objectives that you're looking to optimize, and sort of the boundary is the Pareto frontier. And we were doing a lot of that work with AB testing, where, let's say, we want to optimize revenue while increasing engagement when they're sort of inversely related. So it was a really, really interesting paper and had a lot of novel concepts that I really enjoyed. So do you typically think about optimizations in terms of a Pareto frontier, or just sort of a search? Can you sort of explain how you go about thinking about optimizations in general?
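To make the Pareto idea concrete, here is a minimal sketch of how a Pareto frontier can be identified among a batch of already-evaluated points. This is a generic illustration in plain NumPy, not anything from the paper or from SigOpt; the random candidates and the two maximize-everything objectives are invented for the example.

```python
import numpy as np

def pareto_front_mask(points):
    """Boolean mask of the points on the Pareto frontier.

    `points` is an (n, k) array where every objective is to be maximized.
    A point is on the frontier if no other point is at least as good in
    every objective and strictly better in at least one.
    """
    n = points.shape[0]
    on_front = np.ones(n, dtype=bool)
    for i in range(n):
        # Does any candidate dominate point i?
        at_least_as_good = np.all(points >= points[i], axis=1)
        strictly_better = np.any(points > points[i], axis=1)
        if np.any(at_least_as_good & strictly_better):
            on_front[i] = False
    return on_front

# Toy example: 200 candidates scored on two competing metrics,
# e.g. revenue vs. engagement.
rng = np.random.default_rng(0)
candidates = rng.uniform(size=(200, 2))
frontier = candidates[pareto_front_mask(candidates)]
print(f"{len(frontier)} of {len(candidates)} points are Pareto-optimal")
```

The relevant detail for the discussion that follows is that membership is a strict dominance check: a point drops off the frontier if any other point beats it by any margin at all.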
Micheal_McCourt:
Absolutely. And I want to start by clarifying, especially for an ML audience: when I talk about optimization here, I am not talking about gradient descent. Gradient descent is a very important tool for optimization, but that's not what I'm talking about here. Really what I'm talking about is circumstances where in general you don't have gradients, almost certainly you don't have gradient information. You also don't have any specific structure associated with the problem. The domain that you're looking to optimize over is, let's say, between, I don't know, four and 50 dimensional. So not on the order of the billions that, of course, we're talking about when we do gradient descent nowadays, but also not one or two dimensional. There's an old phrase I think Warren Powell always used to say: everybody's a hero in one dimension. So we're not talking about sort of the most simple situations here. We're also not talking about the situations where you'd use either linear programming or gradient descent. And also I want to say that the domains we're often talking about optimizing over usually have some sort of categorical element to them. So we're not optimizing over a continuous domain. One of the parameters might be red, green, blue. One of the parameters might be using the hyperbolic tangent or the ReLU activation function. So I think it's important to mention that, because a lot of times when people hear optimization, they think Adam. And that's reasonable. But when I'm talking about it here, really what I'm talking about is being efficient, being extremely sample efficient. You're only going to get a small number of, mathematically speaking, function evaluations, or from an ML perspective we could say trainings. From a financial perspective, we'll talk about back testing. From a physical products perspective, we talk about actually building the thing, actually using the material and building the thing. And that is slow and it's expensive. And perhaps, especially from a slowness perspective, you may only have a week, two weeks to build something out. If it's going to take you three hours each time you want to test it, you need to come up with an answer pretty quickly. And that's what we focus on. And in particular, while it is obviously logical to talk about this in the context of optimization, because we are trying to make something go up or down, that's definitely what's happening, I am trying to balance the use of the term optimization against the topic of mathematical optimization. And in particular, in the article that you're referencing, one of the key elements here is this idea that optimization itself is a very poor formulation for things that are happening in the real world. Mathematically speaking, it's a logical formulation: I have something, I want to see the best performance, find the point that gives me the best performance. That's not surprising, it's very logical. The issue is, and the thing the paper really tries to delve into, is the idea that what you are studying when you conduct this numerical optimization is not actually what you're interested in optimizing. Let's take a financial model, for instance. When I study historical data and I try and build a model that's going to perform well on historical data, that's all well and good. I found something that's going to perform well in the past, but that's not really what I'm trying to optimize. What I'm really trying to optimize is performance in the future.
I'm trying to optimize how something is going to perform next week. Now, how it performed last week is a reasonable proxy for how it's going to perform next week, but it's not the same. They're not the same. And so when you go and run a mathematical optimization strategy and say, I'm trying to find the one point that is the answer, even if you miraculously happen to find the one point in a, you know, at least partially continuous domain that actually is the answer, it's the answer for last week, not the answer for next week. So exhausting a huge amount of energy to find the answer to the wrong problem is not what I think of as a worthwhile investment. And really what the crux of the paper is trying to focus on is this idea that, hey, what you are trying to do is not what you can do. What you can do is just point in the right direction. So don't bother trying to zero in on the one point or on the Pareto frontier. And admittedly, the Pareto frontier is more than one point, but it's still brittle. Being on or off the Pareto frontier is a matter of the tiniest fraction of things. A point is on or off based on 0.0001 of one of its objectives, one of its metrics, one of its outputs. That's not appropriate, it's not the real world. And I don't think people even use it that way. A lot of times when people get the Pareto frontier, they don't just want everything that's on the frontier; even things that are close to the frontier they're interested in. Why is that? They're interested in it for a couple of reasons. Number one, noise. The measurements that you're taking when you do your training, even if the training itself is deterministic, the data you observed to conduct the training was collected in a noisy condition. So the objective you're studying is noisy. And then number two, people know intuitively that what they're doing isn't what's going to play out in the future. So people want data that isn't even on the frontier. People want data that isn't even the answer, because it helps them understand what's going on. And that was the crux of the paper. We wanted to embrace that. We wanted to say, hey, let's find not a point but a region. Instead of collapsing this, let's say, eight-dimensional search space that we're playing in down to a zero-dimensional point, I want to find an eight-dimensional region inside this eight-dimensional space. And then there's a lot of follow-ups as to how to make that happen practically, what the implications are for multiple objectives, and how to use the results of this, whether this is actually the answer then or what's the next step. I'm happy to dive into each of those, but that was the motivation of the paper. And where this came from was, again, the material science work, a collaboration with the University of Pittsburgh. Dr. Paul Leu at the Laboratory for Advanced Materials in Pittsburgh is working on glass manufacturing. And when you're trying to build a piece of glass like this, in particular, you can see a reflection here. Oftentimes, you actually don't want reflection in glass. Suppose you're looking at your cellular telephone. If your cellular telephone is reflecting light, that means you can't see what's behind the glass, which is really what you're trying to look at. Also, you don't want glass that holds fingerprints; you want it to somehow be both hydrophobic and oleophobic, so neither water nor oil sticks to it. Omniphobic. You don't want anything to stick to the glass.
That's important for phones. That's important for solar panels. That's important for all sorts of things. How do you design that? Well, that's a complicated question, because you can build these numerical simulations to estimate how something's actually going to occur. But again, they're wrong. They're just models. They're estimates of what's going to happen. So going out and trying to find the answer to the wrong problem isn't as helpful as you might like it to be. There are other reasons why the constraint active search methodology, which is the methodology that I'm talking about here, is useful, in particular for involving expert opinion. But really, the key crux of things is the idea of converting the problem from saying, I want to find this single point, to saying, I want to find some region, some answer, and I want to find an interesting selection of points from that.
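As a rough illustration of the "find a region, not a point" framing, here is a toy sketch in the spirit of constraint active search. It is not the published algorithm (which selects evaluations sequentially and sample-efficiently); it just rejection-samples a design space, keeps everything that meets user-supplied acceptability thresholds, and hands back a spread-out shortlist. The two metrics, the thresholds, and the eight-dimensional domain are all invented for the example.

```python
import numpy as np

rng = np.random.default_rng(7)

def metrics(x):
    """Stand-in for an expensive evaluation: returns (reflectance, haze).

    Both are "lower is better"; in reality this would be a physical
    experiment or a costly simulation.
    """
    reflectance = 0.3 * np.sum((x - 0.4) ** 2)
    haze = 0.5 * np.sum(np.abs(x - 0.7))
    return np.array([reflectance, haze])

# User-supplied statement of "good enough" (threshold values are invented).
thresholds = np.array([0.20, 1.10])

# Sample an 8-dimensional design space and keep every viable point.
samples = rng.uniform(size=(2000, 8))
values = np.array([metrics(x) for x in samples])
viable = samples[np.all(values <= thresholds, axis=1)]
if len(viable) == 0:
    raise SystemExit("No viable designs found; relax the thresholds.")

def diverse_subset(points, k):
    """Greedy max-min selection: a spread-out handful, not one 'optimum'."""
    chosen = [points[0]]
    for _ in range(k - 1):
        dists = np.min(
            np.linalg.norm(points[:, None, :] - np.array(chosen)[None, :, :], axis=2),
            axis=1,
        )
        chosen.append(points[np.argmax(dists)])
    return np.array(chosen)

shortlist = diverse_subset(viable, k=min(5, len(viable)))
print(f"{len(viable)} viable designs found; shortlisting {len(shortlist)} for the experts")
```

In the real method the points are chosen one at a time so the budget covers the viable region efficiently; brute-force sampling like this is only affordable here because the stand-in metrics are cheap.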
Ben_Wilson:
Yeah, it's really interesting that you bring up that aspect of, hey, this isn't just for AI folks or data scientists doing applied ML, but also for the physical space. Because I'm one of the, I don't know how many people are out there with my background, but I came from the physical space where we were applying some of these theories. Definitely not with the sophistication that your company approaches this with. But that foundational idea of design of experiments, where you take experts and say, all right, what are our hypotheses? This is critically important in R&D when you're thinking about, we're going to create a new product that does something that none of our products currently do. Where do we even get started? And you can brute force it. And this also applies to ML, which I found years later when I moved into machine learning. I was like, oh, this is the same as a random search, when you're just guessing at stuff and hoping for the best. Or you can approach the problem with, well, what are our boundary constraints of what we can do? Like, what are the tools, what are they physically capable of doing? How high can we set the temperature? How low can we set the temperature? How high and low can the pressures be, the gas flow rates and stuff. And nobody does that in R&D, because you're either going to break your tool or you're just going to make garbage. And everybody knows, the experts know, like, no, we're not going to run at that minimum temperature. It's not going to work. And that's grid search from an ML perspective, when you're just saying, hey, generate all these different experiments. But when you're talking about that in the physical world, time is money, gases cost money, turning on the heater in a tool costs money. So we would do something similar to what you're talking about, which is informed hypothesis testing through design of experiments, saying, what do we think our ranges are? And people would come up with ideas and say, here's my theory on how we could do this. And then we would have a voting round in a room, sitting around a table. And everybody would
Micheal_McCourt:
Awesome.
Ben_Wilson:
say, you get 10 experiments to explore this space. Let's figure out what kind of works. And then we'll iterate on that. And then three days from now, after we've reviewed everything, we're going to do 10 more. And that's what we were doing. We were constraining that n-dimensional space to an area where we say, okay, this
Micheal_McCourt:
Perfect.
Ben_Wilson:
is where we need to move past the exploratory phase into, let's figure out how to make a product here. And if your tool does that, I am angry that it didn't exist 10 years ago. Cause
Micheal_McCourt:
that
Ben_Wilson:
that...
Micheal_McCourt:
you have perfectly nailed it. And that exact topic you talked about, where you get the experts in the room and they vote on it, that is that expert opinion element that we were trying to support. When we talk about optimization, the answer is whatever the highest point is, the point. When we talk about constraint active search, the goal was: find a region. The region is, in this case, implicitly defined: users give their statement of what they think is an acceptable level of performance. We try and find a bunch of points inside that region. The goal of finding that selection of points is exactly what you talked about. Now we're going to give these to the experts. We're going to give these five, 10, 15, 20, however many points we were able to find, to the experts. They're going to pick the three or four that are actually going to go into manufacturing, that are going to take six months to get manufactured. I love to hear, I love to hear that that is taking place right now, because we hear it from people. But, and this is the difference between making something happen practically and then writing an article about it, I can say, oh, I talk to people for whom this happens. But how often do you see an article say, 10 of us got in the room and thought about what we wanted to test, and then we picked this to test? Like, nobody writes an article about this. When I tried to cite people and say, well, I promise this is happening, some of the reviewers were like, are you sure people actually do this? I'm like, I promise
Ben_Wilson:
Oh yeah.
Micheal_McCourt:
people do this. And the reviewers, the ICML reviewers, were surprisingly accommodating, surprisingly supportive on that front. They were like, we love to hear it, good. We just want to make sure that you're actually talking to real people and not just inventing a fake problem out there. So I really need to give some kudos to the reviewers, because it's easy, especially at an ML conference, it's easy to be like, oh, you're not doing computer vision, we don't need that. But the reviewers were very, very accommodating, very excited, very open-minded. I appreciate it.
Ben_Wilson:
I mean, I would argue with that statement a little bit, or strongly actually,
Micheal_McCourt:
Please,
Ben_Wilson:
with
Micheal_McCourt:
please.
Ben_Wilson:
not your position, the general consensus position of, like, oh, it's AI, it's ML, computation is cheap, we can just iterate through a bunch, do we need things like this? And I would argue, to your point that you made earlier about finding this one point, that's what traditional cross validation is. It's just saying, give me the lowest RMSE, give me the Bayesian information criterion score that works the best. Well, if you tweak your parameters just a little bit and all of a sudden it falls off the face of the earth, because of some sort of split of the data that made that one iteration work really well, or maybe you're converging and optimizing at a point where it's not a linear relationship along that gradient, you're finding, like, a local minimum and you've optimized to that. Well, if the data changes just a little bit, the model is just garbage. That's why, whenever I talk to practitioners... and it's funny, I'm working on a project right now that uses something similar to SigOpt, just the open source tools, like Hyperopt and Optuna. And
Micheal_McCourt:
Yeah.
Ben_Wilson:
if you're not using tools like that, which they do give you, you know, that one result at the end, but you
Micheal_McCourt:
Yep,
Ben_Wilson:
can see
Micheal_McCourt:
absolutely.
Ben_Wilson:
that, the history, and you can consume the history
Micheal_McCourt:
Yep.
Ben_Wilson:
and parse it and say, what's the population distribution around these values? How many
Micheal_McCourt:
Very true.
Ben_Wilson:
are good around here? And I want to select that range for my parameters that I'm going to use in production, but so few people do that. So I think your paper is awesome and I think more people should read it and talk about it, about how important this is.
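A small sketch of the workflow Ben describes, using Optuna since it was just mentioned: run a study, then look at the spread of parameters among the best trials rather than keeping only `study.best_params`. The toy objective and the "top 10%" cutoff are assumptions made for the example; Optuna itself needs to be installed (`pip install optuna`).

```python
import numpy as np
import optuna

optuna.logging.set_verbosity(optuna.logging.WARNING)

def objective(trial):
    # Toy stand-in for a real training run.
    lr = trial.suggest_float("lr", 1e-5, 1e-1, log=True)
    depth = trial.suggest_int("depth", 2, 12)
    # Pretend validation loss: best around lr ~ 1e-3 and depth ~ 6, plus noise.
    loss = (np.log10(lr) + 3) ** 2 + 0.05 * (depth - 6) ** 2
    return loss + float(np.random.normal(scale=0.1))

study = optuna.create_study(direction="minimize")
study.optimize(objective, n_trials=100)

# Instead of keeping only study.best_params, look at the population of
# good trials and summarize the region they occupy.
completed = [t for t in study.trials if t.value is not None]
completed.sort(key=lambda t: t.value)
top = completed[: max(1, len(completed) // 10)]  # roughly the best 10%

lrs = np.array([t.params["lr"] for t in top])
depths = np.array([t.params["depth"] for t in top])
print("lr range in top trials:    %.2e .. %.2e" % (lrs.min(), lrs.max()))
print("depth range in top trials: %d .. %d" % (depths.min(), depths.max()))
```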
Micheal_McCourt:
I definitely also want to call out, you mentioned open source tools. Hyperopt, James Bergstra, outstanding tool, not necessarily still being updated very actively. But when that paper came out, I remember it hitting, I remember people being like, wow, I love it. And Optuna, I have good friends at Preferred Networks working on that, Shotaro Sano, Masahiro Nomura, and those guys are doing really fantastic work. I think Hideaki Imura will be speaking at our session at the upcoming International Conference on Industrial and Applied Mathematics in October in Tokyo, where we're going to be bringing together some of these experts in sample-efficient optimization, thinking about things, talking about things. So I definitely want to give a shout out to them for doing outstanding work.
Ben_Wilson:
Yeah, it's a great
Micheal_McCourt:
Also
Ben_Wilson:
tool.
Micheal_McCourt:
to the great people at Meta, not fashion, Meta, who are doing the BoTorch project, which is also one of these tools. And I'll throw in a minor plug for ourselves. We will also be open sourcing SigOpt
Ben_Wilson:
Nice!
Micheal_McCourt:
in a few months here, subject to getting through the legal stuff, the licensing and so on. But this is something that we've been pushing for since we joined Intel. One of Intel's things is open source. Everything open source, put it out, get it out there. Let's make sure that we can build a community around it. We're very proud, as a byproduct of the acquisition, to be able to contribute to that Intel software ecosystem. And look for that ideally by ICML this year, which will be right here in Honolulu. So I'm also looking forward to that.
Ben_Wilson:
Yeah, you and I need to talk after this podcast is over about that. That's fantastic news. And it is
Micheal_McCourt:
Very good.
Ben_Wilson:
great when you work for a company that's really big, that also really believes in open sourcing tools.
Micheal_McCourt:
Agree.
Ben_Wilson:
And
Micheal_McCourt:
Agree.
Ben_Wilson:
we've had a number of, you know, developer groups and heads of engineering at these companies come on, and people just want to enable everybody to not have to sign contracts and stuff. Definitely, supported tools are generally better, particularly for enterprise uses. You want somebody who's maintaining that, and you're paying to maintain that. And everybody's got to eat. But there's also this thing to say about giving stuff away for free that allows advanced users to not just use it, but contribute to it, making everything better. That's what we live and breathe by at Databricks, that whole collaboration. And there's a lot of collaboration that happens, which hopefully you'll see very shortly when you go into that open source world, where open source contributors for other packages start talking to one another on the sidelines and saying, hey, we'd like to integrate with you, or hey, let's do some PRs on each other's projects and let's make them both better. So it's
Micheal_McCourt:
Absolutely, absolutely. We're
Ben_Wilson:
exciting.
Micheal_McCourt:
definitely looking forward to that. It's in some ways it's maybe overdue and we're glad to see it happen. We definitely wanna see people take advantage of it who haven't been able to thus far for one reason or another. And in particular, yeah, make sure that we're actually integrating into as much as we can and being a part of the community, exactly like you said.
Ben_Wilson:
Yeah, I mean, I've heard testimonials. When I was in the field at Databricks working with customers, I've seen your product being used at a dozen different major companies, and everybody's had nothing but the best to say about it. And that's sort of the constant thing from everybody who comes in contact with it: man, I hope they open source this someday. This thing's awesome. So yeah, everybody loves it. It's a great
Micheal_McCourt:
Glad to
Ben_Wilson:
job
Micheal_McCourt:
hear
Ben_Wilson:
to
Micheal_McCourt:
it.
Ben_Wilson:
you and your team at building a fantastic
Micheal_McCourt:
Thank you.
Ben_Wilson:
product.
Micheal_McCourt:
I will pass that along. I'll pass that along. And also, I'll give thanks to everybody who's part of the team, and especially everybody who's still with us today, working on the open sourcing project. Really outstanding work, folks. Thank you very much.
Michael_Berk:
Now, what are sort of the benefits of using SigOpt versus a traditional open source tool like Hyperopt or Optuna?
Micheal_McCourt:
The main benefit that I think of when using the hosted system is we're managing the computation for you. And in particular, some of these computations that we're doing are expensive, which is just logical. We talked earlier about this idea of sample efficiency. When you are taking hours or days to do a neural network training, to build some sort of a financial model, or to run some Navier-Stokes solution on some nasty cavity problem or something, every test matters. So we want to invest our time to make sure that you can get the best result possible within the limited amount of time that you're going to be running your test, your optimization, your search. And by hiding that from the users, they only feel the pain of the 100 millisecond, we hope, API call that they're doing. And if you are running one of these other tools, you'll have to incur that burden somewhere. So that's one key benefit to the system. I also think we manage parallelism quite well, quite effectively, both from a mathematical standpoint, and I'm very proud of the work we've done there behind the scenes, but also from an infrastructure standpoint. So if instead of running one training on one machine, you wanna run 10 trainings on 10 machines, or whatever your budget allows, you can do that. And the sort of reward or dividend per run decreases slightly as a result of parallelism, because at the end of the day, mathematically, we have to make more decisions with less data. If I have zero pieces of data and I need to tell you what to do on the 10th thing, that's gonna be slightly less good than if I had nine pieces of data and I tell you what to do on the 10th thing. And of course, that's how SigOpt and these other tools are all powered, this intelligent decision-making process. And if you talk to people on one branch of things, they might use this phrase, sequential decision making. I think nowadays that's sort of very specifically maybe for RL, but even a lot of RL people, I think, can think about what they're doing in these terms. And some RL people may even be using this Bayesian optimization methodology to push things forward. I'll comment that another thing that I think SigOpt does quite well is bounce back and forth between different strategies internally. I mentioned Bayesian optimization. Bayesian optimization, leveraging Gaussian processes, which was my own research topic back when I was a student, this is extremely powerful. And I'm happy to talk about why I think it's extremely powerful, but let's just say it's extremely powerful. But that doesn't mean it's perfect for every situation. If you're running something at 50, 60, 70 parallel with a Gaussian process, then you're going to have to do a lot of work. I'm personally, we'll say, dubious that Gaussian process tools are the right thing there. I'm happy to be told otherwise. I'm happy to have a discussion. There are some really outstanding, brilliant researchers out there with whom we have fantastic discussions all the time about this. But even me, as big a GP fanatic as exists, thinks that other tools are useful in the mix. And I think SigOpt does a fantastic job bouncing between different options and making sure that the right tool is being used for the right circumstances.
In addition to that, the new methodology, constraint active search, I believe it's the right tool for, not just the single objective problems we talked about, but really pushing these multi-objective problems. When you're talking standard multi-objective, or multi-criteria, multi-metric, problems, you're talking about the solution being the Pareto frontier. That is the standard definition of the solution. It doesn't mean it's not a useful piece of information, but like I said, it's brittle. There's something dissatisfying about it, not from a mathematical standpoint, but from a practical standpoint. It feels like you're leaving data, it feels like you're leaving answers on the table. And in particular, what I find frustrating about the way a lot of tools that search for the Pareto frontier are trying to do so, is they really define success as just, I want to spend all my energy clarifying every point that's on the Pareto frontier, and maybe missing a big chunk of the region that's full of useful, viable outcomes for eventual production, or for presenting in front of experts and saying, hey, you have thoughts in your mind about how this is actually going to work later on, what can we do with this information? SigOpt has not just this implemented, but in particular has this running right now, scaling up to six objectives. We could go higher, but we've only tested it well for six, in my opinion. And I'm really excited about that, because when you talk about a Pareto frontier in two metrics, I should say two objectives, okay, the Pareto frontier is this nice little line. Fine, I can live with it. I can look at it. I can feel that I'm okay with that. You get to three objectives and you're talking a surface. I can still visualize it, so I'm happy about the fact that I can visualize it, but how many points do you really need in order to approximate a surface kind of well? Hundreds, thousands, I don't know, a lot. So if you need hundreds of points on the frontier, how many total runs do you need to do to get to a hundred winners or a thousand winners? You're talking hundreds of thousands before you know it. And that's just for three objectives; suppose you're trying to go to four, five, six objectives, it doesn't scale. That is the curse of dimensionality. When people talk curse of dimensionality, that is what they mean: the pain of trying to study things in increasingly many dimensions just renders standard attempts to do so infeasible. In contrast, the nice thing about this constraint active search methodology that we have available in SigOpt today is that you can run this with higher numbers of metrics. And because it converts the problem from a straight optimization problem into a sort of yes, no statement, yes, this is a viable point, no, this is not a viable point, the complexity of adding additional metrics, the computational cost on our end, increases, but the conceptual complexity doesn't increase at all. You're simply adding another metric, you're simply maybe whittling down the region into something smaller and smaller, but it's not actually changing the conceptual complexity of the solution, the way that the complexity of the Pareto frontier increases significantly as you increase the number of metrics. So that's another reason why I'm really excited about it, why it's really useful.
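For readers who have not seen the Gaussian-process machinery McCourt refers to, here is a minimal single-objective Bayesian optimization loop with a GP surrogate and an expected-improvement acquisition, built on scikit-learn and SciPy. This is a textbook-style sketch, not SigOpt's internals; the one-dimensional objective, the grid-based acquisition maximization, and the budget of 15 evaluations are simplifications chosen for brevity.

```python
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

rng = np.random.default_rng(3)

def expensive_objective(x):
    """Stand-in for a slow training or simulation; lower is better."""
    return np.sin(3 * x) + 0.3 * (x - 0.5) ** 2

def expected_improvement(mu, sigma, best, xi=0.01):
    """EI for minimization, given the GP posterior mean and std."""
    sigma = np.maximum(sigma, 1e-9)
    improve = best - mu - xi
    z = improve / sigma
    return improve * norm.cdf(z) + sigma * norm.pdf(z)

# A few random evaluations to start, then a sample-efficient loop.
X = rng.uniform(0, 2, size=(4, 1))
y = expensive_objective(X).ravel()

gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), alpha=1e-6, normalize_y=True)
grid = np.linspace(0, 2, 500).reshape(-1, 1)   # crude acquisition maximization

for _ in range(15):
    gp.fit(X, y)
    mu, sigma = gp.predict(grid, return_std=True)
    x_next = grid[np.argmax(expected_improvement(mu, sigma, y.min()))].reshape(1, -1)
    X = np.vstack([X, x_next])
    y = np.append(y, expensive_objective(x_next).ravel())

best = np.argmin(y)
print("best x found: %.3f, objective %.3f" % (X[best, 0], y[best]))
```

The sample-efficiency argument is that each new evaluation is chosen using everything observed so far, which is also why heavy parallelism dilutes the per-run benefit: more suggestions must be made before their results come back.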
Ben_Wilson:
So it's interesting how you brought up the parallelization and computational burden of these. At Databricks, we have distributed Hyperopt, which is pseudo-distributed. So we didn't take the Hyperopt package and library and say, hey, we built a mathematical formula that you can distribute to N number of cores and it'll just work, because that's impossible. You can't take that Tree of Parzen Estimators and have it dynamically update in real time and
Micheal_McCourt:
Yep,
Ben_Wilson:
modify things that
Micheal_McCourt:
pretty
Ben_Wilson:
are
Micheal_McCourt:
tough.
Ben_Wilson:
already running. So you're constrained by, yeah, you can run more tests at the same time. You wanna run a thousand at a time, go nuts. You know, it's not gonna improve it at all.
Micheal_McCourt:
Yeah. It might not.
Ben_Wilson:
It'll take a random search of the space to seed, to get that initial seed
Micheal_McCourt:
Yeah.
Ben_Wilson:
and that's random
Micheal_McCourt:
Yeah.
Ben_Wilson:
throughout the space, it'll
Micheal_McCourt:
Yep.
Ben_Wilson:
spam that out to all these executors on Spark. But then when it's iterating on the data that's been shuffled around into all of these partitions, it's still using the standard Hyperopt library. So we do a barrier
Micheal_McCourt:
Gotcha.
Ben_Wilson:
execution mode where we'll do like, hey, 10 iterations, see how good you get, send the data back to the driver
Micheal_McCourt:
shop it around.
Ben_Wilson:
and say, who's best? Or what's the range of best values that we've gotten from all
Micheal_McCourt:
Yep.
Ben_Wilson:
these different tests? Now
Micheal_McCourt:
I like
Ben_Wilson:
send
Micheal_McCourt:
it.
Ben_Wilson:
another iteration out and do that. Improvements to the way that that works, that's what really excited me about what you said SigOpt is working on: thinking through how we can do this a little bit better and not be constrained by that whole iterative distributed process, and make it so that somebody could get not just that right answer that everybody's focusing on, but, hey, what's that gradient within this dimension that we should be thinking about? So that when we want to do
Micheal_McCourt:
information.
Ben_Wilson:
something like retraining, and I've heard from a lot of customers and I've had to implement it myself in the past, where we have automated retraining, like active retraining systems. Oh, I'm drifting down and we're losing
Micheal_McCourt:
Yeah.
Ben_Wilson:
money from this algorithm. We need to kick off training again. Well, you don't want to start, you don't want to have your training code be the same code that you use during initial development
Micheal_McCourt:
Yeah.
Ben_Wilson:
of saying,
Micheal_McCourt:
Yeah.
Ben_Wilson:
Here's my massive search space that I'm going to be looking
Micheal_McCourt:
Yep.
Ben_Wilson:
through. You want to constrain that down to say over
Micheal_McCourt:
Yes,
Ben_Wilson:
the
Micheal_McCourt:
you do.
Ben_Wilson:
last five runs that we've done retraining, what were the best parameters that we had? And what's the distance between those spaces? Add 10% on continuous
Micheal_McCourt:
Mm-hmm.
Ben_Wilson:
values on those boundaries. And if it's categorical, if it's really like one thing is just dominating everything else, then okay, let's just choose that and fix it. We don't
Micheal_McCourt:
Yeah.
Ben_Wilson:
need to search over that. But being able to do that, and to have that result from your optimization algorithm to say, this is the space we should be in, is so incredibly powerful to make that process of active retraining or passive retraining so
Micheal_McCourt:
Mm-hmm.
Ben_Wilson:
much cheaper. Cause you don't have to do a thousand
Micheal_McCourt:
Yep.
Ben_Wilson:
iterations. You can say, retraining, just do 20. We already know this space we're supposed to be in. I just want to tweak it and get the best for this one particular run.
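Here is a small sketch of the retraining trick Ben outlines: take the best parameters from the last few runs, pad the continuous ranges by roughly 10%, and pin any categorical value that has been dominating. The run records, parameter names, and the 80% pinning rule are all invented for illustration; the output would feed whatever search tool you use for the short retraining sweep.

```python
from collections import Counter

# Best parameters from the last five retraining runs (values invented).
past_best = [
    {"lr": 0.0012, "depth": 7, "activation": "relu"},
    {"lr": 0.0009, "depth": 6, "activation": "relu"},
    {"lr": 0.0015, "depth": 7, "activation": "relu"},
    {"lr": 0.0011, "depth": 8, "activation": "relu"},
    {"lr": 0.0010, "depth": 6, "activation": "tanh"},
]

def narrowed_space(runs, pad=0.10, pin_threshold=0.8):
    """Build a tight search space around what recently worked.

    Numeric parameters get their observed min..max range padded by `pad`;
    a categorical parameter is pinned when one value won at least
    `pin_threshold` of the time.
    """
    space = {}
    for name in runs[0]:
        values = [r[name] for r in runs]
        if isinstance(values[0], (int, float)):
            lo, hi = min(values), max(values)
            width = (hi - lo) or abs(hi) or 1.0
            space[name] = (lo - pad * width, hi + pad * width)
        else:
            winner, count = Counter(values).most_common(1)[0]
            if count / len(values) >= pin_threshold:
                space[name] = [winner]             # pinned: no search needed
            else:
                space[name] = sorted(set(values))  # keep the contenders
    return space

print(narrowed_space(past_best))
# e.g. {'lr': (0.00084, 0.00156), 'depth': (5.8, 8.2), 'activation': ['relu']}
```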
Micheal_McCourt:
And I couldn't agree more with what you said. I'm going to take one second here to confirm that I remember the title. But we just actually put out a paper to TMLR, the Transactions on Machine Learning Research, which is part of the JMLR's new initiative to publish stuff they feel doesn't fit somewhere else. And in particular, this paper was a perfect example of that. It is a discussion of exactly, as you're talking about, how constraint active search provides a strategy for, we maybe use the phrase hot swapping, models that are in production. So effectively, you could have your constraint active search leave you with four, five, six different winners. And then you can put all of them effectively in production in some sort of a multi-armed bandit situation. And that gives you the ability, exactly as you're talking about, to maybe do a passive retrain or swap in an alternate model that's already ready to go and has been, as you've seen, performing better on live data as opposed to just the data that you've observed in the past. And the article is called, there it is, Bridging Offline and Online Experimentation. That was one of the follow-ups to the original ICML paper, and it was a collaboration with Junpei Komiyama at NYU's Stern School of Business. And I love the idea. I love being able to tackle these practical problems that we're talking about here. In some ways, that's the key. Mathematically, it's like, oh, we'll just retrain it. Do it better. Get the answer. Get the best point. Why would you do anything else? But then from a practical perspective, no, no, no. There's a track that I really like where people just talk about things. They have the discussion, they build up the community around practical circumstances. I think that having that, and having journals like TMLR, which are used to talk about things that maybe don't fit perfectly in the theoretical context but are still of relevance to the community... I think the community, it's growing. It's growing, it's adapting, it's changing. It's an interesting question: where do things belong? How do we talk about them? How do we build the community, grow the community in a way which is engaging but also focused? I think it's complicated. And I think that it really takes everyone in the community working together, being friendly with each other, that's nice, open-minded. And then also it takes a vision. And that's one thing that, I'll be honest, I worry slightly about on the ML publications side, is that right now I don't see someone, there is no boss of ML, but I don't see the heads of the conferences or the heads of the proceedings necessarily having a specific vision for how these questions are going to pop up. How are we gonna deal with the practical elements versus other elements?
Exactly as you talked about before, how are we going to deal with, and they're trying, I give them 100% of the credit for trying to deal with this question of, very brittle models, or these cases where, okay, you tuned the whole thing, or you found some set of hyperparameters, the thing works perfectly with that split of data that you have, everything is great, put the paper out, and then anybody who tries to reproduce it, it just sort of collapses, or something like that. Or even these other situations. This is complicated, but it cost you 50 million dollars or whatever in AWS credits to train this thing and then you publish it. I want you to put it out there. That was a huge amount of money, a huge investment. I'm glad you're putting the information out there. But then it's this complicated thing when you're putting out research that nobody could even attempt to reproduce in any viable set of circumstances. To say nothing of the fact that, almost immediately, as soon as you put some of these things out there, they're out of date. I've had reviewers say to me, why didn't you cite this arXiv paper? I have my own personal beef with arXiv. Why didn't you cite this arXiv paper that was put out after you submitted this paper? I'm like, well, I don't know what to say about that. And so we as a community, I think it needs to come from the top down, because the community is growing so fast. And you've got so many people with so many different backgrounds in there that somehow you need authorities on the topic to say, hey, reviewers, be a little more open-minded, be a little less dependent on just spamming arXiv for, hey, what's the latest thing I'm going to compare it against. Not complaining when there aren't 95 different benchmarks that you compare something to. And also, frankly, and it's my own personal opinion, not letting papers get published just because they compared against 95 different benchmarks but didn't put any concept, any hypothesis, any desired outcome into their thing. It's like, well, then they just tested a bunch of things and said what they did. And it's like, I don't know, I'm not sure that that's what we should be publishing. Whatever, that's my controversial take of the day.
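Going back to the hot-swapping idea from the TMLR discussion above: a common way to put several offline "winners" into production at once is a multi-armed bandit. The sketch below uses Thompson sampling with simulated Bernoulli feedback; it is a generic illustration of that setup, not the method from the Bridging Offline and Online Experimentation paper, and the four candidate models and their true success rates are made up.

```python
import numpy as np

rng = np.random.default_rng(11)

# Hypothetical live success rates of four "winners" handed over by an
# offline search; in production these would be unknown.
true_rates = np.array([0.52, 0.55, 0.50, 0.58])
n_models = len(true_rates)

# Thompson sampling with Beta priors: route traffic to whichever model
# currently looks best, while still exploring the others.
wins = np.ones(n_models)
losses = np.ones(n_models)
pulls = np.zeros(n_models, dtype=int)

for _ in range(5000):
    sampled = rng.beta(wins, losses)          # one posterior draw per model
    choice = int(np.argmax(sampled))
    reward = rng.random() < true_rates[choice]
    wins[choice] += reward
    losses[choice] += 1 - reward
    pulls[choice] += 1

for i in range(n_models):
    print(f"model {i}: {pulls[i]:5d} requests, "
          f"estimated rate {wins[i] / (wins[i] + losses[i]):.3f}")
```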
Ben_Wilson:
I happen to agree with you on pretty much all those points. When you work with companies as a vendor, which you're familiar with, you know within a half an hour of talking to a team, when they're talking about a project, like, okay, they thought this through, they know what they're looking for, they're gonna use our tool to solve a real use case. It's either gonna make them more money, make them lose less money, or do something interesting in the real world. And then you also know, after talking to enough people and enough teams, within probably five minutes whether or not they know what they're talking about at all and whether this is just a pipe dream. Somebody read a blog post, somebody read a paper, and they're like, we need to reproduce exactly what's in this. So sometimes those companies have an obscene amount of just rainy day funds, let's call it that, and they're willing to pay a company of experts to come in and build something. And I've had that handed to me about a half a dozen times in the last four years, where you show up in a room full of people and there's a massive TV screen, you know, the size of my car, at the head of the table, and you're looking at it and you're like, all right, there's a white paper on the screen. Okay, let's read through it. What do you want to do? You're just reading through it and you're like, you know there's a Python package that does exactly what you're looking for, and it'll probably run 10,000 times faster than what we're talking about here. And it's free, you just have to pip install it. And they just straight up say, we're not interested, we want you to help us implement this paper. And I'm looking through pages of mathematical formulas, like, okay, let's get to work. And one of the ones we did
Micheal_McCourt:
Yeah
Ben_Wilson:
was, somebody was like, we want to do, you know, basically dot product and matrix inversion operations on a distributed
Micheal_McCourt:
Okay.
Ben_Wilson:
matrix on Spark. Like we want that algorithm. Like, why? They're like, we just
Micheal_McCourt:
Yeah,
Ben_Wilson:
need it.
Micheal_McCourt:
why?
Ben_Wilson:
So
Micheal_McCourt:
Hahaha!
Ben_Wilson:
it was a paper. The paper was really well written, super fascinating. Had three citations after a year being published. And
Micheal_McCourt:
Yeah.
Ben_Wilson:
it was just so far beyond the scope of what anybody would really need to do. But they paid us to implement it and we implemented it. But then we used an algorithm that's in Spark ML that's free for them
Micheal_McCourt:
Hmm.
Ben_Wilson:
to use and it performed better. I was like, this is why we didn't, like, this is why this algorithm doesn't exist in Spark because it's so bespoke. And it's cool that it worked for that paper, but it doesn't really work in the real world.
Micheal_McCourt:
When I hear about that in particular... I grew up in numerical linear algebra, so I'm definitely a big fan of matrix computations. Charlie Van Loan was my advisor. And I so rarely get to talk about that. In particular, if I'm not mistaken, the Google people, the DeepMind people, I don't remember who, put out an article talking about how they did a search and found better ways to multiply matrices together of various sizes, which of course I found extremely exciting, very fun. Exactly like you're talking about, I don't know exactly how practical it's going to be, though they claimed they were able to see some nice pickups in some of their TensorFlow stuff, which is really cool. But yeah, exactly to your point, sometimes what gets done in a research setting can be important, and it's important that you get the idea out there so that people know about it, but whether it actually gets traction can be subject to a lot of practical considerations that aren't, and in my opinion shouldn't be, the question of whether an article gets published. Certainly not the question of whether an article gets published in SIAM Matrix Analysis or whatever. That's not the forum for practical concerns. The forum is exactly as you're talking about: hey, when you go in to answer questions for somebody, is this viable or not? Trust me, I'm an expert. I know you shouldn't do this, but if they're paying, it is what it is.
Michael_Berk:
Yeah, and that's a great segue to one more topic that I wanted to cover in the very limited time that we have. So Mike, you started off with a background in mathematics, then you went and got your PhD, worked in some national labs, and then eventually became head of engineering at SigOpt. And so you've,
Micheal_McCourt:
National Lab.
Michael_Berk:
yeah, let's go. He held up his mug for those of you listening
Micheal_McCourt:
Sorry,
Michael_Berk:
in.
Micheal_McCourt:
yes, sorry.
Michael_Berk:
And I was wondering how you think about developing a team culture that promotes innovation. Because now that you're the head of a team, it falls a lot
Micheal_McCourt:
Mm-hmm.
Michael_Berk:
on your shoulders. So how do you think
Micheal_McCourt:
Oh, yes.
Michael_Berk:
about that?
Micheal_McCourt:
Oh, yes. Excellent, excellent, excellent question. Extremely important topic, and something that I think companies really, really, really need to be thinking about right now. Universities also do. And in particular, I think professors who are building up research groups should be asking themselves, how do you build an effective group, an effective culture? I believe it's important. I think it needs to come from both sides. I think the boss needs to be looking for a variety of different folks. And in particular, I keep calling out SIAM. SIAM, for those of you who don't know, is the Society for Industrial and Applied Mathematics; that was where I grew up. And when I grew up there, I felt like I was doing good work. I still think I did some good work there. There's some good work getting published at those conferences. But I would argue SIAM people are very underrepresented in the tech industry right now. And I'm gonna argue, I have to, that some of that falls on math departments not encouraging people to get out of their shell, not encouraging people to go take internships. I remember offering an internship to somebody over the fall semester, and his advisor was like, nah, I think it's more important that the kid's in class. And look, as somebody who used to teach, I used to teach at the University of Colorado, I believe class is important, class is very important. But it's a unique opportunity to be able to spend a chunk of time in San Francisco, not just doing good work, but frankly also doing research. We wanted to publish an article together. And it got spiked. And I was like, wow. So, OK. So some of this comes from the supply side. People need to be available. They need to be excited about it. They need to see it as an opportunity. And right now I'm speaking specifically to PhD candidates: do not assume you are going to get the faculty position. It may happen. I hope you do, if that's what you want. I 100% hope you do. But do not assume you will. Check out your options. On the other hand, yeah, I think it is important that team leaders who are recruiting, who have the ability to do personal recruiting... I've been very fortunate, both at the startup and since I've been at Intel, to be able to go out and recruit my own people, to be able to go out to job fairs myself, pull people into the mix, say, hey, let's talk, I want you, let's make this happen. I've been very fortunate to be able to do that. I think at some companies, you get whatever HR gives you. So maybe there's a gap there that needs to be figured out. I'd love to see more companies reaching out beyond just recruiting at ICML and NeurIPS and ICLR. It takes energy, it takes effort though, because you read some of these resumes, these CVs: can you parse them? If somebody says, hey, I did this sweet project on numerically solving boundary value PDE problems in MATLAB, does that translate to your ability to do stuff in a tech company? Frankly, the answer is probably no. Very little of that immediately translates. I hate to say that, as somebody who spent plenty of my graduate studies doing exactly that. It breaks my heart to say it, but honestly, no. When you work on a project all by yourself, without any collaboration, with very little project oversight until sort of the end, and then you check with the committee, like, is this OK or not, that's not representative. When you don't use version control, it's not representative of how teams actually function, of how iterations happen when there's no code review.
So I am going to put a burden on industry in general to say, hey, we need to make sure to send this message out. I have gone to universities and I've given talks where I explain to folks, I'm like, hey, if you want kids in your department to start being able to go to industry, this is the stuff you need to do. And these aren't complicated things. These are all things that you're doing anyways; you just could be doing them a little bit differently and your kids would be ready to go. And then when they talk to employers, they could actually say the words that they need to say in order to get hired and actually talk about, yeah, we used GitHub, yeah, we did pull requests, we did code review. Yeah, I have these example projects, or key projects, projects that define my PhD. You spent years doing it. You put none of the code out there. No README, no blog content for any of this. It's all just piled up in this 400-page tome that's probably just sitting on your mother's shelf. Go ahead, my mother has mine on her shelf. But the point is, that's not how you're getting out there. That's not how you're gonna be a part of something cool. So I think some of it's from the supply side and some of it is the industry side. If the industry wants to get these incredibly talented, outstanding statisticians and mathematicians and people from operations research, the people who are part of the INFORMS community also, then they should be putting good jobs out there. They're all really smart. They're all really capable. I think it comes from both sides, and I'd like to see more of that from everybody, and I need to do that too. I need to keep being out there and making that happen. And if that happens, I think we're going to have a better workforce, a workforce that has more solid skills, and we're going to have more students being able to join and make big contributions. So there's my plug for math education, but also getting out, getting out of your bubble and making things happen.
Ben_Wilson:
Yeah,
Michael_Berk:
Nice.
Ben_Wilson:
it's amazing what we see in the difference with a first-year intern. I mean, somebody that's doing a CS program, we get an intern who's coming from, you know, Berkeley or Stanford, they've already done internships in their undergrad, you know, working for some SF-based company, and they already know the ropes there. But if you get a PhD who's coming from math or physics, who maybe never did CS work in the past, and you see them four semesters later, the next year, when they're doing something
Micheal_McCourt:
Mm-hmm.
Ben_Wilson:
and they're like, all right, I'm finishing up my defense and getting all this stuff ready to go through final review, I just wanted to take six months and work at a startup and learn some more. When you see them again after that period, how much they've retained, and how they don't need any real supervision. You just give them a project
Micheal_McCourt:
Yeah.
Ben_Wilson:
that's generally the actual cool work that everybody wants to do, but we don't have time to do; the intern has time to do it. So you give them this amazing project and say, hey, can you do this in four months? And the success rate that I see on implementations is like over 95%. It's insane how great these people are. But they would not be able to do that if they didn't have that foundation of, how do I do a peer review? Why do I need
Micheal_McCourt:
Yep.
Ben_Wilson:
to do a peer review? That,
Micheal_McCourt:
Yep.
Ben_Wilson:
and I think some academic institutions in postdoc and in doctoral research, they just, it's like an isolating sort of environment. You're not getting that
Micheal_McCourt:
Yes,
Ben_Wilson:
constant
Micheal_McCourt:
very much
Ben_Wilson:
peer
Micheal_McCourt:
so.
Ben_Wilson:
feedback. And that's what I
Micheal_McCourt:
Very
Ben_Wilson:
thought
Micheal_McCourt:
much so.
Ben_Wilson:
that Stanford and Berkeley did something really great with, you know, AMPLab and RISELab at Berkeley, where they said, everybody sits in a room together and you're going to work together and you're going to ask
Micheal_McCourt:
Yeah,
Ben_Wilson:
each other questions
Micheal_McCourt:
right.
Ben_Wilson:
and you're going
Micheal_McCourt:
Yeah.
Ben_Wilson:
to debate and it's going to be,
Micheal_McCourt:
Yep. Yep.
Ben_Wilson:
and what's come out of that, Apache Spark, Ray,
Micheal_McCourt:
really amazing stuff.
Ben_Wilson:
you know, from any
Micheal_McCourt:
Ray
Ben_Wilson:
scale
Micheal_McCourt:
too,
Ben_Wilson:
now,
Micheal_McCourt:
yeah.
Ben_Wilson:
uh, these amazing projects that have, like, rebuilt entire industries, with what's come out of that collaboration within that lab and with them working with industry. So I cannot
Micheal_McCourt:
Mm-hmm.
Ben_Wilson:
endorse your statement any stronger than you already did. I 100% agree.
Micheal_McCourt:
Very good to hear. I like to hear that. And yeah, I think there's an opportunity. It just, it takes, it's just gonna take a little open mind, a little bit of energy and change on both sides and big stuff can happen.
Michael_Berk:
Amazing. Well, I will wrap and then we can kick it over to Mike to see if there's any ways that people can get in contact. So.
Micheal_McCourt:
By all means, yeah, please.
Michael_Berk:
Cool. So just to sort of recap what SigOpt does: they focus on unstructured optimizations, whether it be financial models, machine learning, or shaving cream. And some core benefits of the tooling are computational management, and then they just have really smart algorithms that bounce between, let's say, Bayesian optimization or other methods that are very specialized for a given use case. And hopefully some of this will be open sourced in the next few months. I know I'm waiting. I have a project where I will implement it if it's out in time. Um,
Micheal_McCourt:
Very good.
Michael_Berk:
and then one thing that really stuck with me is this sort of interesting reframe of optimization to think about acceptable regions instead of optimal points when searching is expensive. This is a really effective reframe to improve your efficiency, essentially, and this is called constraint active search. And then finally, on the innovation piece, bosses obviously need to look for good people, but also the people need to be interested and ready to join. And the underlying tenet, from what I was hearing in this conversation, is that teams just need to be productive, and often people don't take the time to invest in learning tools. You need to be able to be efficient with GitHub. If you're using Google Docs, you need to know about headers in Google Docs and be able to write clear English. They're all very valuable skills. So with that, Mike, if people want to learn more, where can they find you?
Micheal_McCourt:
Absolutely. First things first, our SigOpt emails are still alive for a little bit longer, so feel free to hit me up at mccourt@sigopt.com. You can also, probably a better transition, reach me at michael.mccourt@intel.com. You can find me on LinkedIn, and we have a Twitter handle, @sigopt, where you'll find some fun conference pictures and follow-ups to some of the exciting work you're seeing from people who are using SigOpt today. We always try and hype our users' work, because that's what we do it for.
Michael_Berk:
Amazing. Well, until next time, it's been Michael Berk and my co-host
Ben_Wilson:
Wilson.
Michael_Berk:
and have a good day everyone
Ben_Wilson:
Take it easy.
Micheal_McCourt:
Take care.