Data Platform Innovation: Navigating Challenges and Building a Unified Experience - ML 147
Nick Schrock is the Founder of Dagster Labs. He is also the Creator of Dagster and the Co-creator of GraphQL. They delve into the world of data engineering, software development, and ML orchestration. In today's episode, they explore the challenges and intricacies of standardizing data movement, handling data access in various systems, and migrating data across different platforms. They share insights on the importance of building a system that spans multiple data platforms, the decision-making process behind tool development, and the impact of lineage in managing and migrating data. Join them as they uncover the complexities of open-source projects, API evolution, and the future of data engineering.
Transcript
Welcome back to another episode of Adventures in Machine Learning. I'm one of your hosts, Michael Berk, and I do data engineering and machine learning at Databricks. Today, I'm joined by my co-host,
Ben Wilson [00:00:20]:
Ben Wilson. I work at Databricks designing stuff for really smart people to implement. Nice.
Michael Berk [00:00:27]:
And today, we have a guest named Nick Schrock. He started his career as a software engineer at Microsoft, then worked at a couple smaller organizations before joining Facebook. And then after a short 8-year stint, he founded Dagster Labs, the company behind Dagster, a cloud-native pipeline orchestrator that includes lineage, observability, and high testability. And he's also the co-creator of GraphQL, an open source data query and manipulation language for APIs. So, Nick, I was playing around with Dagster and really love the simplicity and the power that you get out of the box. So was there a project or a pain point that originated this idea? Or did you just say one day, I wanna build this?
Nick Schrock [00:01:07]:
It was pretty much one day, I just wanna build this, I would say. The story is that, you know, I left Facebook in 2017, and I was figuring out what to do next. And I was talking to all sorts of companies inside and outside the valley about what their biggest technical liabilities were, kind of in an open-minded sort of way. And this notion of data and infrastructure kept on coming up over and over and over again. Companies of all vintages, all sizes. And when I dug into it, I like to say I found the biggest mismatch between the criticality and complexity of the domain and the tools and processes to support that domain that I've ever seen in my entire career. The only analogous thing in my experience was web development, say, 2008 to 2010, before the browsers were good, before React, before GraphQL, before TypeScript, before all that stuff. Like, engineers were building these, like, extraordinarily complex web apps on substandard frameworks and technologies, and it was just a complete nightmare.
Nick Schrock [00:02:22]:
And if you look at the front-end, you know, web-app ecosystem and you fast forward in time to today, it is completely and wholly unrecognizable, in a good way. You know? And I think, most importantly, it's become a true discipline. And, you know, it is really transformational. So when I was looking at this in 2018, 2017, data engineering in particular was, like, in this state for sure. And I think it's still in that state largely. There's a long way to go. So, you know, I was adjacent to these problems at Facebook. But at Facebook, I worked in the mainline application stack.
Nick Schrock [00:03:01]:
So, you know, GraphQL kinda came out of a bunch of work internally to build our internal object model and a higher-level kinda API to query the data at Facebook. So I was kinda intimately involved with that, and I would call it data infrastructure adjacent. But as I was looking into the space, you know, I was immediately attracted to the orchestration layer. You know, the projects I like to work on have a few common properties, generally. One, engineers in pain, using, like, incredibly slow workflows, using abstractions that don't make sense. Makes me, like, fundamentally very upset, actually, and angry to see that, which is, you know, not the most wholesome motivation, but it is what it is. You know, because, like, it seems like the universe is, like, unordered, and we could, like, fix it just by, like, having better ideas.
Nick Schrock [00:04:01]:
Yeah. Another thing I'm really attracted to is a problem that matters. And one of the things about data pipelining, you know, data pipelines are the core data infrastructure, in my opinion. And the assets that are produced by these data pipelines power dashboards, which, like, inform, like, nearly all decision making in modern organizations, or they power ML systems, which do automated decision making, and increasingly more and more so. And the fact was that, like, no one really understood what their data systems were doing. They were terrified to change them. Like, there was, like, very little confidence and trust in the data. And these are the things that determine whether you get mortgages or not. So it's, like, an incredibly important foundation of society run by millions of developers and engineers.
Nick Schrock [00:04:49]:
So it's, like, a really important problem. And, you know, I also like having, like, a novel technical solution, which I think software-defined assets qualifies as, and that's the core abstraction of Dagster. And then last, it's a strategic point of leverage in the organization. So GraphQL has this property where it's kinda this, like, middle-tier layer of software that manages all the interactions between all clients and all servers in a system. So, you know, it effectively is just, like, an intermediate layer that provides an enormous point of leverage. Similarly, I believe the orchestration layer has similar properties to that. The orchestration layer invokes every computational runtime, which in turn touches every single storage system. Every practitioner who is putting a data asset into production has to interact with it at some level, because all data comes from somewhere and goes somewhere.
Nick Schrock [00:05:49]:
So properly conceived, I believe the orchestration layer is this, like, incredibly powerful point of leverage to build a generalized control plane for a data platform. And I came to that conclusion very quickly when I kinda looked at this. So that's kinda where it all came from. I know that's a long-winded answer. And then I started digging in, and the rest is history, as I said.
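The "software-defined assets" idea Nick mentions can be sketched in a few lines of plain Python. This toy registry is not Dagster's actual API; all names here are invented for illustration. The point is that once assets and their upstream dependencies are declared in code, both execution order and lineage fall out of the declarations:

```python
# Toy sketch, not Dagster's real API: declare assets and dependencies in code;
# the "orchestrator" derives execution order and lineage from the declarations.

ASSETS = {}

def asset(deps=()):
    """Register a function as a named data asset with declared dependencies."""
    def register(fn):
        ASSETS[fn.__name__] = (deps, fn)
        return fn
    return register

@asset()
def raw_events():
    return [1, 2, 3]  # stands in for an extract step

@asset(deps=("raw_events",))
def daily_count(raw_events):
    return len(raw_events)  # stands in for a transform step

def materialize(name):
    """Recursively materialize upstream assets, then the asset itself."""
    deps, fn = ASSETS[name]
    return fn(*(materialize(d) for d in deps))

# Lineage is simply a read-out of what was declared; nothing to document by hand.
lineage = {name: deps for name, (deps, _) in ASSETS.items()}
```

Here `materialize("daily_count")` runs `raw_events` first and returns 3, and `lineage` already records that `daily_count` depends on `raw_events`.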
Ben Wilson [00:06:15]:
I've got a follow-on question about that, from your unique perspective as somebody who has worked on establishing a standard that is used ubiquitously in industry because it's so awesome. What do you think the tipping point is for both large and small companies that are trying to solve a problem in-house in whatever way they can? How much of that has to happen before a bunch of people get together and say, let's establish an open standard that people can conform to, publish it, and then build that middle layer to simplify the world?
Nick Schrock [00:06:59]:
So you're kind of asking, like, at what point do you standardize? Or, I'm trying to tease apart the question a bit. Like
Ben Wilson [00:07:09]:
What have you seen? Like, what is the impetus for somebody like you and your team to say, GraphQL solves a ton of our problems, let's open source this so everybody can benefit from it?

Nick Schrock [00:07:19]:
I could tell the GraphQL story. I think that is a good one. So GraphQL was built very quickly. You know, effectively, I conceived of the idea in February 2012. And then we built it and shipped it in 3 months to power the feed of our mobile app, which was kind of the most high-stakes surface at the time in our applications, because it was gonna drive most of our usage and revenue. So we had to make a lot of decisions very quickly, but it got extremely wide adoption very quickly as well. Because, like, in the Facebook app, feed is, like, the ultimate point of leverage.
Nick Schrock [00:08:10]:
Because every team wants to inject feed stories in order to get distribution. That's how they, like, do a good job. So once you kind of capture feed, the entire rest of the app follows. So we actually had an interesting internal problem in that we got adoption too quickly internally, and we were kind of falling behind, both because of that underlying technical reality as well as downward pressure from Zuck to move everything to native. You know, he'd, like, go down the timeline and be like, why isn't profile native yet? And then they'd, like, freak out and wanna build GraphQL endpoints in the native app to do it. So, you know, in effect, we had seen the success of React, which was inspirational. So the CEO of Dagster Labs, who is obviously a close collaborator of mine, is Pete Hunt.
Nick Schrock [00:09:07]:
He's the co-creator of React. And if you think GraphQL is a universal standard, React is, like, orders of magnitude bigger, I think. So we saw that. And we also wanted to open source a front-end library called Relay, which is kinda like, you know, if React and GraphQL had a baby, it would look like Relay. It's kinda like this, like, GraphQL-specific front-end framework that also matches up with React. So, you know, I think that kinda pushed us to it. And I talked to my co-creator, Lee, about it. And I was like, yeah, we should open source something.
Nick Schrock [00:09:50]:
And, yeah, as I always say, convincing Lee to do something is a process. And then he comes back, like, a couple weeks later, after I've talked to him about it, like, about 5 times. He's like, yeah, I'm into this open sourcing thing. But you know what we should do? We should standardize it, write a document about it, and not really open source any software. And, he says, we should also redesign everything. And I'm like, what are you talking about? Like, I was talking about, like, open sourcing the thing we had. And then, like, you know, I do a lecture on the second-system effect and all this stuff.
Nick Schrock [00:10:24]:
Right? But it turns out Lee's a genius, and he's right. And he kinda, like, came back... like, I don't know if it was, like, Ayahuasca or something, but he came back from the mountain with this, like, perfectly formed idea of, like, what the system should look like. And it took a while for me to get there on some of the issues, and I had some feedback. But, you know, we had kinda built up this intellectual capital over years of using the system. And I think we had incredibly good instincts on what the broader developer population needs because of the experience with React. Because React is unique in that all the other Facebook technologies were, like, internally facing, whereas React has this huge user base. So I think anyone on the React team had much better instincts on what a broader developer population needed.
Nick Schrock [00:11:12]:
And then, you know, that spec, which Lee largely wrote and I participated in, has really stood the test of time. And now it's evolving slowly in an incremental, like, structured way as part of a process. So I think that we were able to develop, on the GraphQL side, a bunch of intellectual capital and then kinda marry it with some of the experience that the React team had, to build something that, like, was pretty good out of the box, and good enough that people could build their own implementations in their own programming languages without oversight from us, which, by the way, I thought would never happen. I was like, who's gonna... like, we're expecting all these people to build runtimes in all these programming languages? And Lee's like, sure, why not? It's fun. I'm like, what are you talking about? Like, this is crazy. It's not gonna work.
Nick Schrock [00:12:01]:
It completely worked. I was completely wrong. So that's the GraphQL story. On the Dagster side of things, it's more of a challenge, because we're building it out in the open. And I don't think it's at the point of maturity yet where we could write a document to standardize it and then, like, it would be good for all of time. We're just not quite there yet. And, yeah, I think that's one of the challenges we face, to be, you know, perfectly transparent. You know, but I think we're much closer than we were, say, 2 years ago.
Michael Berk [00:12:46]:
Question from or sorry. Go ahead, Ben.
Ben Wilson [00:12:49]:
No. I was just gonna say, to add on to that, the way you phrased it sounded like you guys don't have that figured out, but I just wanted to add that nobody has that figured out, as the universal standard, because it's pretty much impossible to do until you get the generators of data and the computation platforms to agree on at least some direction of how the industry would wanna move
Nick Schrock [00:13:18]:
That's right.
Ben Wilson [00:13:18]:
Instead of just new tools coming out all the time, a lot of them proprietary. People saying, I can make a ton of money off of this, building this closed system that doesn't interface with anything, or that provides this API endpoint that could be archaic, or so cutting edge that nobody really knows what to do with it. And people are like, well, now I need to build a connector to this, and I need to support this. Instead of that horror, you know, like, hey, let's think about this intermediate layer where we can all agree on a conceptualized idea of how to handle data movement from place to place.
Nick Schrock [00:13:56]:
Yep. That's right. And I think the other question is, like, what are you being opinionated on versus not? Right? Like, our version of a standard is not standardizing on a storage format, like Parquet, or on an in-memory format, like Arrow, or on a computational runtime, or on SQL, because, you know, this has been kind of the era of the SQL maxis, but you can't capture all compute encoded in SQL. So then what are you standardizing on? And that's where I think this notion of a kind of middle layer, which is agnostic to compute platform and storage system but allows you to structure the complexity of your data pipelines in a way that provides a homogeneous layer over underlying heterogeneity, is kind of the path forward. And that feels a lot like... you know, I guess I'm probably presuming too much knowledge of GraphQL for the audience, but in that way, it also feels like GraphQL, where GraphQL is like this query language Mhmm. You know, that is effectively a front end for an application server.
Nick Schrock [00:15:21]:
And through that, you can orchestrate all sorts of back-end interactions with microservices or any sort of back end, and it all appears in, like, a standardized format, even though beneath it there's, like, perhaps stunning heterogeneity. And it actually just, like, organizes the complexity that's going on. And that's where I think the direction needs to go for data platforms.
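The homogeneous-layer-over-heterogeneity idea can be sketched in plain Python rather than real GraphQL. The backends and field names below are invented; the point is that the client asks one layer for the fields it wants, and each field's resolver hides which heterogeneous backend supplies it:

```python
# Illustrative sketch (not real GraphQL): one uniform query interface whose
# resolvers fan out to heterogeneous backends the client never sees.

def fetch_name_from_rdbms(user_id):
    return "Ada"                      # imagine a SQL lookup here

def fetch_orders_from_service(user_id):
    return [{"sku": "A-1"}]           # imagine an HTTP microservice call here

RESOLVERS = {
    "name": fetch_name_from_rdbms,
    "orders": fetch_orders_from_service,
}

def resolve(fields, user_id):
    """One homogeneous entry point over heterogeneous back ends."""
    return {field: RESOLVERS[field](user_id) for field in fields}

result = resolve(["name", "orders"], user_id=42)
```

The client's "query" is just a list of field names; swapping a backend means changing one resolver, not every caller.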
Ben Wilson [00:15:47]:
Yeah. Because otherwise... I mean, speaking as somebody who has somewhat limited knowledge of front-end development based on my current job. But we do own a front end. You know, myself and other people primarily do the back-end stuff for that. Yep. But when you're trying to develop a React app that's connecting to your back-end services, you need to know: where is my data, what interface do I need for that, what does that access layer look like? And when you start adding more and more features onto your front end, you're now no longer communicating with a single interface.
Nick Schrock [00:16:23]:
That's right.
Ben Wilson [00:16:24]:
You're saying, hey, each of these features could have independent access, you know, paradigms to the back-end storage layer, which just explodes your code complexity.
Nick Schrock [00:16:34]:
Yep. And
Ben Wilson [00:16:35]:
then when something changes, like, oh, we need to migrate our schema, how many places in your code do you now have to update and test and make sure that things don't blow up? It just trashes developer productivity. And what I used to do, before doing what I do now, is what Michael does now. So, a lot of data engineering stuff with customers, and I can confirm: nobody has this figured out. You talk to any company, regardless of size, and you ask them, for this ML model that you're about to train, where does your data come from? Like, really, where did it come from? Do you know what back-end source system generated the data that went into the RDBMS table, which then went into this ETL pipeline, which then went into the data warehouse, which now goes into your query layer? They're just like, I don't know. This team just puts the data there.
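When dependencies are declared up front, Ben's "where did this really come from?" question becomes a graph traversal instead of archaeology. The edge map below mirrors his hypothetical chain and is purely illustrative:

```python
# Sketch: with declared dependencies, provenance is a walk over a graph.
# The asset names mirror the chain described in the conversation.

EDGES = {
    "query_layer": ["warehouse_table"],
    "warehouse_table": ["etl_output"],
    "etl_output": ["rdbms_table"],
    "rdbms_table": ["source_system"],
    "source_system": [],
}

def upstream(asset, edges):
    """Return all transitive upstream sources of an asset, nearest first."""
    seen = []
    for parent in edges.get(asset, []):
        if parent not in seen:
            seen.append(parent)
        for ancestor in upstream(parent, edges):
            if ancestor not in seen:
                seen.append(ancestor)
    return seen

sources = upstream("query_layer", EDGES)
```

One call answers the question nobody at the hypothetical company could: the query layer ultimately traces back through the warehouse and ETL output to the RDBMS table and its source system.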
Nick Schrock [00:17:27]:
Yep. Yeah. And in most organizations, they try to reconstruct that, and there's kinda, like, different gradations of how they do it. Right? There's, like, a manual, spreadsheet-driven process where they just, like, document everything. And the second they make that, it's out of date. No one can trust it. Then maybe you bring in a cataloging system.
Nick Schrock [00:17:47]:
Right? And there's some automation there, but it's still post hoc observation of everything that's happened. And it ends up being, like, full of junk and, like, doesn't feel like a true system of record, in most places we see it. Right? Whereas, yeah, what we believe is that you have to have lineage and provenance effectively come for free as part of the programming model. Yep. And that has to go across technologies. Right? Because, like, as you know, you go to a client, and they almost certainly have both a Databricks account and one of the cloud data warehouses.
Nick Schrock [00:18:24]:
Like, they coexist everywhere. So you need a system that, at minimum, spans those two things. And then there's, like, just all sorts of garbage in every single platform. By garbage, I mean, well-constructed technology, totally thoughtfully adopted and constructed. No. But, like, I mean, I've been there too. I'm not just being flip.
Nick Schrock [00:18:46]:
You know? So, like, that's why we think it's so exciting to view the orchestrator as not just this, like, narrow thing which runs something every day, but as a system of record for the structured data assets in your organization, and to have that system of record simply be an output of the programming model that we use. So, yeah, that's kinda what we end up doing. It ends up feeling more like a, you know, a standard for how you construct data platforms at companies. Because, you know, our line is that every company has a data platform. It's a question of whether they explicitly acknowledge it or not. And that platform is, like, inevitably multi-technology, multi-tool, multi-persona, because data drives, and is consumed by, all sorts of people and all sorts of use cases across every org.
Michael Berk [00:19:51]:
Yeah. Ben and I were chatting about a topic that's related, which is how are you deciding... so the pitch for Dagster, or basically a central orchestration layer, is you have lineage and that comes for free, as you mentioned. Can you provide some examples of shiny things that you guys decided not to build?
Nick Schrock [00:20:14]:
Shiny things that we said not to build.
Ben Wilson [00:20:17]:
Why?
Nick Schrock [00:20:20]:
Well, you know, we very much view ourselves as a data engineering toolkit, built for data and ML engineers who embrace software engineering best practices. So we're not in the business of drag-and-drop tooling. We're not in the business of capturing things that are outside source control, for example. We're not in the business of integrating tons of manual workflows into our cataloging tool, for example. You know, on the ML side, which is what this podcast is about, in general, I find that MLOps is generally expressed in the world as being its own silo of tooling. And I don't think that makes sense. I think it's a layer. There should be MLOps-specific tools that are built on top of a generalized data engineering platform.
Nick Schrock [00:21:21]:
And so that means that you do your pipelining in the base layer, i.e., in Dagster or something else. But then things like model serving and hyper-specific ML tools should be the province of other things. You know? Like, we're not particularly interested in building a, like, complicated, you know, MLflow-style experiment tracking layer, which we were talking about leading up to the podcast. I think that's a whole other company. Yeah. I don't know. How many engineers work on MLflow? There's gotta be a lot of features in there.
Ben Wilson [00:22:03]:
On the open source side, it ranges from 8 to 20.
Nick Schrock [00:22:08]:
Right.
Ben Wilson [00:22:08]:
In a given quarter. On the Databricks side, I think we're up around 100, 115.
Nick Schrock [00:22:17]:
Yeah. So that's, like, a Series D company.
Ben Wilson [00:22:19]:
Yeah. And that's just working on the ML side of stuff at Databricks, which is the smallest department.
Nick Schrock [00:22:26]:
Yeah. But it's a good question, Michael. Like, I think we have to fight our own, you know, our own vice of having eyes bigger than our stomachs. You know? But, like, for example, we do have a cataloging capability, but we think about it as a layer of cataloging that provides kind of bare-bones, targeted experiences related to operational matters. But then there are other data catalogs, which are way more sophisticated and incorporate both, you know, manually inputted metadata and, like, processes around that and, like, a very complex web app. And, like, that is not in our wheelhouse. What we do is empower the engineers to write code, which ends up producing metadata, which is displayed coherently in a cataloging platform, which is available as a web app on our system. But also, you know, we want downstream consumers to be able to consume that, so they don't have to reconstruct lineage on their side. They can actually focus on going up the stack in the cataloging space, which I think is actually a win-win for both participants.
Michael Berk [00:23:52]:
Yeah. Exactly. I'm currently working on a project that's an Azure Databricks to GCP Databricks migration, and we're trying to get lineage to inform how to move resources from one cloud to another. And working backwards is just hell. Like, we're parsing audit logs. We're basically getting as creative as we can. So it makes a lot of sense that if you start with a really robust, programmatically driven orchestration layer that gives you lineage, and everything is sort of hard-coded together, it's the right way to do it, and it makes it so much easier.
Nick Schrock [00:24:25]:
That's right. Like, I think it was actually Michael Armbrust, you know, one of the vaunted engineers at your company. I remember talking to him a few years ago, and he was like, you know, we're finding a lot of success getting people on streaming, Spark Streaming, even if they don't use it for streaming. And I'm like, what are you talking about? He's like, well, streaming is much more constrained in terms of the lineage, because, like, you have to declaratively say that b is downstream from a. So even if you're, like, actually just running batch through the streaming process, you end up with better outcomes, because it's a more constrained programming environment. So I thought that was a pretty interesting comment. I don't know how much that's played out over the years, but,
Ben Wilson [00:25:12]:
We have an entire product line devoted exactly to that. It's called DLT, Delta Live Tables, and it is exactly this.
Nick Schrock [00:25:20]:
Right.
Ben Wilson [00:25:20]:
So micro-batch transactions, using streaming as a batch service, just constraining it to say, where did I leave off? Where's my latest token that I'm gonna be processing? Process that through the system and record metadata about it, say, how many records did I pass through, and what was my throughput rate? Put that up in metadata. But, yeah, like, customers jumped on that. Customers that you would never imagine would be interested in spinning up a Kafka service and connecting to it for real-time streaming. They don't need it. But just that lineage. Yep. Yeah. It makes sense.
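The micro-batch pattern Ben describes can be sketched in a few lines of Python. This is not DLT's actual API; the names and the in-memory "checkpoint" are invented to show the shape of the idea, which is to treat a batch source like a stream by persisting an offset ("where did I leave off?") and recording run metadata:

```python
# Illustrative sketch of micro-batch processing with a persisted offset.
# In a real system, the checkpoint and the metadata would be durable.

source = list(range(10))       # stands in for a table or log to drain
state = {"offset": 0}          # stands in for a durable checkpoint

def process_micro_batch(batch_size=4):
    """Process the next slice after the checkpoint, then advance it."""
    start = state["offset"]
    batch = source[start:start + batch_size]
    state["offset"] = start + len(batch)
    # run metadata, like the record counts Ben mentions
    return {"records_processed": len(batch), "batch": batch}

run1 = process_micro_batch()   # picks up records 0 through 3
run2 = process_micro_batch()   # resumes at record 4, nothing reprocessed
```

Each run picks up exactly where the last one stopped, which is what makes the batch workload look like a constrained stream with built-in lineage between runs.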
Ben Wilson [00:26:00]:
I'm really excited about the prospect, and I've been having conversations internally with another maintainer of MLflow about this whole data-centric concept of ML, exactly as you said it. Like, hey, it's a service that is data-centric. It happens to produce objects or functions that operate on that data. You know, the ML people and data scientists call them models. But at the end of the day, it's a function. It's taking data. It's operating on it, and it outputs something.
Ben Wilson [00:26:32]:
We can call it whatever we want, but they're all functions. Even Gen AI, whatever you're doing, it's a function. And the thing that is really impactful on how that function behaves is the data, and knowing how to reconstruct it reliably, how to migrate it between systems. Yeah. You do all this work in dev. You know, like, hey, a lot of data scientists take a Parquet file or a CSV file to generate their features, and then they kick it off to an ML engineer for deployment. And now they have to spend 10 days, like, a full sprint, or maybe 6 sprints, to take that script code and convert it into something that's deployable. To figure out, like, where do I get my features from? Like, where did you get this data from? It's a cool data set that you used for training, but I have no idea where this data is.
Ben Wilson [00:27:31]:
But having that lineage from dev to say, here are the 18 source tables that constructed this data set. And then you can just take that definition and say, okay, deploying to staging now, because all of those data sources are mapped in staging. Let's make sure this runs. Does it retrain? Does it work? Okay. We're good. Ship to prod. The only way to do something like that is using, like, a universal standard, I think.
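The promotion flow Ben outlines can be sketched as a re-binding of declared sources. The catalog and table names below are invented for illustration: if the logical source tables behind a training set are declared in code, moving from dev to staging to prod means re-mapping the same declarations, not rewriting the pipeline:

```python
# Sketch: logical source declarations bound to per-environment physical tables.
# Catalog names are hypothetical stand-ins.

SOURCES = ["events", "users", "orders"]   # logical inputs to the feature set

CATALOGS = {
    "dev": "dev_catalog",
    "staging": "staging_catalog",
    "prod": "prod_catalog",
}

def bind(env):
    """Map each logical source name to its physical table in one environment."""
    return {name: f"{CATALOGS[env]}.{name}" for name in SOURCES}

dev_tables = bind("dev")
staging_tables = bind("staging")   # same definitions, different bindings
```

Promoting the pipeline is then a matter of calling `bind("staging")` or `bind("prod")` and re-running the same asset definitions against the new bindings.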
Nick Schrock [00:27:59]:
Yep. Totally.
Michael Berk [00:28:01]:
Yeah. And I think this also reflects a trend that we've been seeing, or at least I personally have been seeing. We were chatting with Praveen Paritosh in episode 116, and he was mentioning this new sort of concept in ML, which is actually not that new, but it's data engineering focused. And they had, essentially, a hackathon where, holding the model constant, you're allowed to do whatever you want to the data. How can you minimize model accuracy? So, make the worst possible model.
Nick Schrock [00:28:30]:
You can
Michael Berk [00:28:30]:
you can throw, like, crazy feature engineering at it, but the cross-validation steps and the model itself will remain unchanged. And this sort of reverse engineering of the process of model building really shows how the data, like, that's the valuable thing. Like, yes, linear regression versus a decision tree, that matters. But if you have bad data, it's really tough to make a good prediction consistently.
Nick Schrock [00:28:53]:
Yeah. I think, still, in all these systems, all the value is in the data pipelining and the quality and understandability of the data. And I think it's actually gonna play out in Gen AI too. Like, you know, the foundational models are getting commoditized. So what the value-added thing in specific companies or apps is gonna be is their ability to understand their own proprietary data and, you know, introduce that into the Gen AI process, whether that be via RAG or some other methodology. I didn't fully read the article, I saw it go by, but Databricks' Matei Zaharia, your CTO, pushed out something about compound AI systems, which I think is getting at this. Yes.
Nick Schrock [00:29:42]:
These Gen AI apps will end up being... you know, the LLM component should be thought of more as a computing primitive that you can use to assemble a coherent application, rather than just, like, complete magic, where the LLM consumes something and something good happens. Like, that's not the way it's gonna work. Like, it's gonna take thoughtfulness. And let me put it this way: I think the elimination of software engineers has been wildly overblown. But, of course, I'm gonna say that because I'm a software engineer, but I happen to believe it's true.
Ben Wilson [00:30:25]:
The only thing that I've seen with this Gen AI renaissance is we're just hiring more people. You need more people to wrangle these things. And the thing that's come up most recently for me, that I've started thinking about, is the application of first principles to all of the new hype. And the ones that are successful... because I'm doing evaluations now, as we've mentioned before, about tracing for these Gen AI apps.
There's a lot of ways to collect logs, like, a lot. Every big tech company out there over the last 3, 4 decades has had to do some sort of instrumentation on what data is being passed around as, like, message traffic. And knowing, hey, what is my REST endpoint actually generating, and how do I capture failure rates, and how do I instrument that? And then, over, you know, decades of working with that stuff, people said, let's just come up with a standard. Everybody interfaces with this, and you see how quickly something like that gets adopted. All the big tech companies are like, yes, our services integrate with this. This is the way. If you're a user of our platform, you get this for free. Right.
Ben Wilson [00:31:47]:
And it'll also connect to these 30 services that all subscribe to the standard as well. So from an end-user, applied perspective, where you're trying to use tooling, having everything on that standard, you don't even have to worry about it. But then, when we're talking about the new hype coming up, you get a new generation of developers that are all working on this, and not all of them have a lot of history and experience and knowledge about how standards were developed over time. Yep. Hey, I've got a problem. I need to log information about this. And you look at these engines that are interfacing with Gen AI to build applications, like the, you know, RAG-type stuff.
Ben Wilson [00:32:36]:
And you can see who's been there, done that before, who started this open source project. You're like, oh, yes, this is an ex-Meta engineer, or this is somebody from Google. And you look at their history, and a lot of times, I get to chat with them on one-on-ones and stuff. And I'm like, oh, I see why you subscribed to the standard, and you did a really good job with this. This is awesome. They're like, why would I do anything else? But then you get somebody who's new, who's never been through that pain. And it's just like, one, my users say I need logging, so I'm gonna write my own instrumentation.
Nick Schrock [00:33:15]:
Right.
Ben Wilson [00:33:15]:
When you look at the structure of it, you're like, what are you doing here? There's no parent-child relationship between these events. Yeah, you're collecting all of the records of stuff, but you don't have any way of disambiguating when these events occurred. It's just a collection of data, like, here's the inputs and outputs from the system and from the user. What happened when? And how do I visualize this so I can debug it? Because there's no relationship here. There's no DAG that you've created. In the system, there is a DAG, because there are temporal events that happen and child processes kick off. But it's just interesting to me, and I wanna know your take on that, as somebody who's worked in standards and has thought through all these problems and has invented this stuff.
Ben Wilson [00:34:10]:
When you see new generations of people building tooling, and you know they should be doing it a certain way because you went through that and suffered through what happens when that doesn't happen. What are your thoughts about stuff like that?
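The structure Ben is describing, events that carry a parent-child relationship so you can reconstruct a DAG of what happened when, can be sketched in a few lines. This is a toy tracer with made-up names, not the API of any real tracing library:

```python
import time
import uuid

class Span:
    """A single traced event with an optional parent, forming a DAG of calls."""
    def __init__(self, name, parent=None):
        self.name = name
        self.span_id = uuid.uuid4().hex[:8]
        self.parent_id = parent.span_id if parent else None
        self.start = time.time()
        self.children = []
        if parent:
            parent.children.append(self)

    def child(self, name):
        # Child spans record causality: which call kicked off which.
        return Span(name, parent=self)

# A RAG-style request: retrieval and generation are children of one root span.
root = Span("handle_request")
retrieve = root.child("retrieve_documents")
generate = root.child("call_llm")

assert retrieve.parent_id == root.span_id
assert [c.name for c in root.children] == ["retrieve_documents", "call_llm"]
```

Without the `parent_id` link, you have exactly the bag of disconnected records Ben complains about; with it, a UI can render the call tree and order events.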
Nick Schrock [00:34:27]:
Well, I guess I'm less beholden to standards than you might think. You know, now Meta engineers have a good reputation; they come in, and they're the ones who know what to do. But in 2009, that was not our reputation. We were seen as script kiddies writing PHP. Like, who are these clowns? And then our open source stuff, you know, when React was initially introduced to the JavaScript community, it was, like, rejected on impact. It had an XML syntax which was embedded in JavaScript and then precompiled, or, you know, people call it transpilation, but same thing. And people were like, that's not what we do. You know, separation of concerns, blah blah blah.
Nick Schrock [00:35:29]:
And it was a big fight to kind of persuade the market that, no, actually, this is the right way of doing things. And a similar thing happened with GraphQL and Relay and all this. You know, we were fighting model view controller orthodoxy. So we explicitly were like, oh, that standard is bad. We should not do that. Right? Because the whole thing was, like, you must separate your views from your data, and we're kinda like, well, if you're displaying something, there's an implicit coupling with fetching that data. Like, you can't pretend it's not there.
Nick Schrock [00:36:08]:
So, yes, we agree with separating concerns, but, actually, the entire industry was conceiving of separating the wrong concerns. Because the problem that they were fighting was that people would, like, fetch from a database in a loop and then produce HTML strings. And the problem there is not that the view and the data logic are commingled; it's that you're coupling the string generation with synchronous calls to a database, which is, like, very bad.
Ben Wilson [00:36:34]:
And
Nick Schrock [00:36:34]:
And so what you want, actually, is the requirements for viewing data colocated in a centralized, coordinated mechanism that can make the fetching efficient. So, you know, anyway, another example here is, for some reason beyond understanding, microservices. Yeah, I'm, like, a huge microservices hater, and that has become, like, standard wisdom best practice. And I think, applied to startups or small organizations, it should be the tool of last resort, not a first resort. But if you have someone that comes in from Amazon or Google, right, they'll come and be like, well, this is the way we did it, so this is the way you should do it. And I would say that the new grad out of college who then says, aren't we just building a web server? is the one who's correct. Right.
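The idea Nick is describing, colocating each view's data requirements so one coordinator can batch the fetching instead of every view making its own synchronous query, can be sketched roughly like this. All names here are illustrative, not GraphQL's or Relay's actual API:

```python
# Each view declares the fields it needs; a coordinator merges those
# declarations and can issue one batched fetch instead of N separate queries.

def profile_view():
    # Declares requirements only; does no fetching itself.
    return {"user": ["name", "avatar_url"]}

def posts_view():
    return {"user": ["name"], "posts": ["title"]}

def collect_requirements(views):
    """Merge per-view field requirements into one fetch plan per entity."""
    merged = {}
    for view in views:
        for entity, fields in view().items():
            merged.setdefault(entity, set()).update(fields)
    return merged

requirements = collect_requirements([profile_view, posts_view])
# One coordinated fetch can now cover every view's needs:
assert requirements["user"] == {"name", "avatar_url"}
assert requirements["posts"] == {"title"}
```

The views still "own" their data needs (colocation), but the actual I/O is centralized, which is the separation Nick argues the industry got wrong with strict MVC.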
Nick Schrock [00:37:27]:
So, I guess I always check myself when I have the instinct to do copious beard stroking and wag my finger at the, kind of, beginner's mindset. Because I think often people are grasping at something that makes intuitive sense, and there should be a good reason why they can't do it. So, yeah, at least in organizations that I'm a leader in, I'm definitely willing to tolerate probably a little more chaos than most. I think sometimes I go too far that way and should kind of impose a little more. So I guess, yeah, it always comes back to first principles thinking. Because I think a lot of people reach for the familiar, like, oh, we did it this way over here. But then you have to have the discipline to apply that idea to a new context, which is like, okay.
Nick Schrock [00:38:29]:
What is the problem that was actually solved? How relevant is it to this new context? Are all the ideas applicable? Are only some of them? Or is it complete hogwash? You know? So it's judgment grounded in experience, but always context aware.
Michael Berk [00:38:47]:
Can you elaborate on what a first principle is to you? I know that's a very open ended and potentially seemingly obvious question, but I think there's a lot of nuance in how you approach defining a first principle and how you go about getting them.
Nick Schrock [00:39:03]:
That is a very good question, and I'm frustrated with myself that I don't have a good answer off the top of my head for it. I guess I have to derive the first principles on the fly here. But, yeah, here's one, for example. If you're building a framework, a good first principle is the open-closed principle, which says that a system should be closed to modification but open to extension or pluggability. So you should have principled thinking about, like, this is what's fixed, this is what's pluggable, and let's figure out, given the constraints of the system that we want to impose, how we maximize flexibility for our developers. So I think that's an example of a first principle.
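A minimal sketch of the open-closed principle as Nick states it: the framework's core run loop stays fixed (closed to modification), while behavior is added through a plugin registry (open to extension). All names here are illustrative:

```python
# Registry of extension points; users add behavior here, never by editing core.
PLUGINS = {}

def register(name):
    """Decorator that registers a step under a name."""
    def wrap(fn):
        PLUGINS[name] = fn
        return fn
    return wrap

def run_pipeline(steps, value):
    # Fixed core: this loop is what's "closed". Users only register new steps.
    for step in steps:
        value = PLUGINS[step](value)
    return value

@register("double")
def double(x):
    return x * 2

@register("inc")
def inc(x):
    return x + 1

assert run_pipeline(["double", "inc"], 10) == 21
```

Adding a new transformation never touches `run_pipeline`; the contract between the fixed and pluggable parts is exactly what the principle asks you to decide up front.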
Nick Schrock [00:40:08]:
But, yeah, I would need to, like, ruminate and go on a hike and really think about it more broadly. That one's just top of mind because it came up internally a couple weeks ago. I don't know, does that strike you as first principles thinking? I guess the other thing is what the process of first principles thinking is, rather than an instance of it. And to me, that's more about trying always to strip away jargon and really talk about, at the core, what you're doing and the trade offs that you're doing it with. And I think leaders who are good at first principles thinking generally end up communicating in jargon free ways. And so that's kind of another indication. Like, whether you like them or not, like, oh, like, I know.
Nick Schrock [00:41:10]:
Like, talking about Elon is, like, a potentially triggering event. But, you know, Elon or Jobs or some of these venerated figures in tech, I think what's interesting about them, if you listen to them talk, there's, like, zero jargon. Mhmm. Like, they speak in very plain spoken words about highly technical subjects, words that my mother could largely understand, which is remarkable. Whereas, kind of, the normal leaders end up speaking in this highly jargoned way. I probably do it all the time without being self aware. But I think you're engaging in first principles thinking if you can boil down the trade offs you're making into fairly jargon free language.
Nick Schrock [00:42:04]:
That's, like, another process here. So these are more like ruminations than a fully constructed thought. But
Michael Berk [00:42:13]:
Nice. Yeah. So the origin of that question is that a lot of my work involves saying, do you really wanna do it that way, and why? And suggesting a solution that sort of circumnavigates the problem and makes everything just simpler and better. That's a key part of the resident solutions architect role. And I know Ben has been doing this for years. So, Ben, I was curious how you think about sort of cutting through the noise and finding the correct answer.
Ben Wilson [00:42:44]:
There is no correct answer. There's just answers that are better than others. But, really, my guiding first principle is always do your homework. Particularly with framework development, exactly as Nick just said. That used to be a huge pet peeve of mine when I was, you know, in the field or talking to companies. People's first reaction is, we need an ETL framework on Databricks, which is an ETL framework. Spark is a framework for processing distributed data, and you wanna build an interface on top of that? You're basically building a custom DSL in Python that interacts with PySpark, which interacts with Scala, which interacts with the JVM. It drove me nuts, because nobody is sitting there saying, what do we actually need to do with our data? The translations or transformations that we're doing to our data between the source system we're loading from and where we're writing to, are those correct? Are we validating that our data is not corrupted or missing or flawed in some way? Because those are the things that burned me when I worked for companies in the past.
Ben Wilson [00:44:17]:
I own some ML ETL pipeline that I wrote, usually poorly. And just randomly some day, a column starts producing all the same number. And you find out about it 3 days later or something, when the production monitoring that you set up for how well this model is performing goes haywire. Like, why did everything just start going crazy? What's going on? You go and dig in, waste an entire day of work, until you find out, oh, this column, for some reason, in this dataset is broken. Go talk to the front end devs who generate this field that goes into the database that gets synced over into the data warehouse. They're like, oh, yeah. Yeah. Sorry.
Ben Wilson [00:45:03]:
We had a regression. Should have told you about that. But, you know, that whole process of building stuff that would break, and not using tools that could have let me know about it sooner, or let me know that the transformations or manipulations I'm doing on data would be causing production problems. I learned a lot from that, but it all boils down to that whole concept of, I wanna build a framework to interface with this thing, when what you really should be thinking about is, does this already exist? Can I just use something that somebody has built, that works really well, and adapt it to this use case? And then, after doing your research and exhausting, you know, 8 or 10 answers to that question, saying, no, this doesn't exist, or this doesn't meet my needs, and only then building something. So that homework aspect of, what do I actually need to solve here, is so critical. And it's part of everything that we do on our team in engineering now: designs that show you've done your homework.
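The failure mode Ben describes, a column silently collapsing to one repeated value, is cheap to guard against before a write. A hypothetical check, not any particular validation library:

```python
def find_constant_columns(rows, min_rows=3):
    """Return column names whose values are identical across all rows.

    rows: list of dicts, one per record. Small-data sketch; a real pipeline
    would push this check into SQL or a validation framework.
    """
    if len(rows) < min_rows:
        return []  # too few rows to call anything "flatlined"
    suspects = []
    for col in rows[0]:
        values = {row[col] for row in rows}
        if len(values) == 1:
            suspects.append(col)
    return suspects

rows = [
    {"user_id": 1, "score": 0.97},
    {"user_id": 2, "score": 0.97},
    {"user_id": 3, "score": 0.97},
]
# "score" has flatlined -- fail the pipeline here instead of three days later.
assert find_constant_columns(rows) == ["score"]
```

Wiring a check like this between load and write is exactly the kind of tooling Ben says would have turned a three-day production mystery into an immediate failure.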
Michael Berk [00:46:23]:
Yeah. That makes sense. Another high level question for both of you. So we've talked a lot about the past: how ETL has sucked, how lineage has been painful, and how having a central orchestration layer is valuable. What are the next 3 to 5 to 10 years of data engineering going to look like? What are gonna be the big innovations? What are gonna be the big new technologies that make our lives a lot easier?
Nick Schrock [00:46:51]:
It's a big question. You know, I might give answers that are not satisfactory, because I still think there's so much low hanging fruit on basic nuts and bolts things in this domain. Like, it still is not standard, I think, for there to be a button that turns green when you know that your PR into a data platform is not going to break anything. Most companies that I know, or nearly all of them, do not have the tools and processes in place to even get to that state, where you're very confident that code you're about to merge and deploy is not gonna break anything, and, also, to get to that state in a reasonably productive way. So, literally, if we get to a state in 3 years where broad swaths of the industry can live in that world, it sounds like the most boring, pedestrian thing. But I think that level of productivity in an org is transformational, because productivity for an engineer is not just the ability to complete tasks at a certain rate. Productivity is about reducing the complexity you hold in your head so you can start to think bigger and compound on top of your system, rather than drowning in all the fires that you're lighting along the way. So I think a great 3 year time frame goal is to get every practitioner who's working on a data platform, across the industry.
Nick Schrock [00:48:37]:
To make it possible so that they feel like they're in a productive workflow and can develop software quickly and with confidence. And I think that there's a bunch of stuff that needs to happen in order to make that work. You know, I'm inspired by, like, DuckDB on this front, where you can kinda run the same data warehouse code on your laptop, in the cloud, or in your web browser. That last fact is bananas to me, by the way. And then tools like Modal, which is a startup by Erik Bernhardsson that makes it very fast to upload new code to a cloud and run it about as fast as you can on your laptop. I think that type of stuff is great for ML workflows. I like to think that Dagster has a part to play there, because we're an orchestrator that really thinks about the full end to end development process, rather than just as a deploy-only thing that runs stuff on a schedule. We're more of a programming model.
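The "same code everywhere" property Nick credits to DuckDB comes from being an in-process engine: the warehouse logic is just a library call. As a rough illustration, sqlite3 stands in here because it ships with Python; DuckDB's Python API is similar in spirit:

```python
import sqlite3

def daily_totals(conn):
    """Warehouse-style aggregation as plain library calls on any connection."""
    conn.execute("CREATE TABLE events (day TEXT, amount REAL)")
    conn.executemany(
        "INSERT INTO events VALUES (?, ?)",
        [("mon", 1.0), ("mon", 2.0), ("tue", 5.0)],
    )
    return conn.execute(
        "SELECT day, SUM(amount) FROM events GROUP BY day ORDER BY day"
    ).fetchall()

# The same function runs unchanged on a laptop, in CI, or in a cloud worker,
# because the engine lives inside the process rather than behind a server.
assert daily_totals(sqlite3.connect(":memory:")) == [("mon", 3.0), ("tue", 5.0)]
```

That portability is what makes the dev-loop Nick wants (test the pipeline logic locally, ship it with confidence) feel like ordinary software engineering.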
Nick Schrock [00:49:34]:
So, yeah, I mean, call me a bit of a fogey, maybe, but everyone's so hot on, you know, the Gen AI capabilities, and they're incredible. But I still think there's so much value in the data pipelining and building the core assets to drive all this stuff. And the developer workflows and the engineering workflows are just still so broken. They are so broken. So that's, like, one thing, which is just, like, if we get this basic bread and butter stuff right, the world is a whole lot better. Kind of the next thing after that, the other thing I'm excited about, is the so-called single pane of glass in the data platform. Is there actually a trusted source of truth where you can understand, observe, and operate your data platform, and, you know, all due respect to the company you guys work for, without it just being one vertically integrated stack deployed by one company? Because there's one way the world could end up, which is that there's people who are on the Snowflake stack, and there's people on the Databricks stack, or there's people on the Microsoft stack, and then they have to, like, choose what stack they're on early in their careers.
Nick Schrock [00:50:58]:
And, like anything, there'll be tools of different quality at different layers of the stack, and there are no portable skills between them, and that's, like, a very sad world. To me, that's like going back in time to a world where you had to decide when you're 25, like, I'm a Microsoft person, or no, I'm an Oracle person, and then you're kinda stuck in this one proprietary stack for a while. There's a pot of gold at the end of the rainbow for companies that do that, so I don't begrudge them for doing that. You know, I think Databricks should build its own catalog and its own orchestrator and all this stuff, because it makes sense for them to do that.
Nick Schrock [00:51:35]:
But, you know, what I'm really looking forward to, say, in, like, a 5 year time frame is a world of much more, I'll call it, ordered heterogeneity, where there is heterogeneity at the storage layer, the compute layer, and all sorts of tooling, but it's bounded. Oh, my mic dropped again. I'm just articulating too much. I almost hit my mic. And, you know, it's very trite, so I apologize in advance, but there is sort of an iPhone moment to be had, I think, with the right technology. Obviously, I think it's Dagster. The thing that's cool about the iPhone is that it presents this very ordered layer of software that provides lots of guarantees and user experience, but you can still install thousands of apps. There's still a vast array of capabilities.
Nick Schrock [00:52:26]:
And if we can get a toolkit in companies' hands where their data platform can feel like that, where it's like, okay, we can bring in a different technology. Like, you know, we're using Snowflake at first for analytics, but we wanna bring Databricks in for ML capabilities. Right? Bringing that in while still providing a unified experience to all your stakeholders is, like, a very exciting future. We're very far away from that, but I do think it's what needs to happen. Lastly, you got me going here, so I guess I came up with some good ideas. The other thing I'm really excited about is a merging of software and data engineering.
Nick Schrock [00:53:08]:
So meaning that, you know, one of our lines that we use now is that data engineering is software engineering. It's a software engineering subspecialty. I think the reverse should and will occur too, which is that data is so instrumental to the way that every software application is written. Meaning, like, you need to have analytics to understand what's going on. There's gonna be more ML capabilities, so that data feeds into ML models, which feed back into the behavior of the system. This whole silo between software and data engineering should disappear. And instead, you should have people who know enough about data engineering sprinkled around all your product teams so it becomes much more of a merged discipline. You know, starting with basic things, all the way from, like, hey.
Nick Schrock [00:53:54]:
If you add a logging field, let's have a system in place so you don't break your data warehouse. Yep. Right? In retrospect, it's gonna seem wantonly reckless that, at any point, an application developer could just break the data warehouse. It's crazy. But there's more profound things too. Like, I was very inspired.
Nick Schrock [00:54:22]:
I was talking to an engineering leader at Watershed, which is basically a company that does climate change carbon credit compliance for other companies. And it's basically a data pipelining product, because companies upload this data in a structured format, and they do transformations on it to produce reports and all sorts of stuff. But they have all of their application developers writing data pipelines, in effect. So they have one code base. It's all JavaScript. Their front end and their back end are all JavaScript, and the business logic for their data pipelines is also in JS. And they do their kind of distributed compute on DuckDB.
Nick Schrock [00:55:12]:
It's like a bananas awesome stack. I was, like, so inspired when I talked to them. But I think what's cool about that is they structured their system so that every software engineer is accountable for the data pipelining consequences of what they do. And they can actually share business logic, like, they can literally run the same code in the web app and in their data pipelines, which I think is awesome. You know, they obviously had to do a bunch of work to make that work well and build a bunch of custom infrastructure. So I guess those are the three things that I'd be excited for, which I think will happen on some long time frame. You know, getting basic dev flow working across companies' data platforms, a single pane of glass so that there's a coherent experience across a heterogeneous toolset, and then the merging of software and data engineering.
Michael Berk [00:56:07]:
What do you think is standing between Dagster and the iPhone moment?
Nick Schrock [00:56:15]:
That is a good question. I still think we have to work on incremental adoptability and progressive disclosure of complexity in our system. One of the challenges of building out in the open is that you need to move forward and experiment with ideas and test stuff, but the moment you do that, you have a public API you support forever. So there's a lot of one way doors here. That's been a challenge throughout the project, for sure. And I think, as a result, we've accumulated a lot of cruft and things that we can't undo, but we have to move forward. You know? And we're well aware of these problems and totally acknowledge that the feedback we get on this is correct.
Nick Schrock [00:57:24]:
But I think, like, one of our biggest challenges is making it so that, in greenfield cases, it feels smoother to get the system going, as well as being able to land at companies that have existing data platforms and making that process feel less risky and more incremental. That is, like, very critical.
Michael Berk [00:57:51]:
Okay. I'm following, I think. And what just out of curiosity, what are some of those core one way doors that you've walked through and and can't go back?
Nick Schrock [00:58:06]:
I mean, you can always go back. And some of it's, like, not very profound, you know. Like, if you choose a stupid name or something. Right? It's very easy to do. Naming is hard. You know, and then 6 months later, you decide that the name is dumb. Now it's like a whole thing. You have to deprecate it, roll that out to your community. You have to get people to migrate their code if you wanna eliminate it.
Nick Schrock [00:58:31]:
Yada yada yada. So it's just the general notion of public API surface. That's what I'm thinking about. Like, web apps are easy to change. You just push it. The UI re-forms, people are annoyed for a bit, and then they move on, because they haven't written code and infrastructure that binds against it. But building an infrastructure product is challenging that way, especially if you make strict guarantees on backwards compat, which we do.
Nick Schrock [00:59:03]:
You know? Like, you are not allowed to break people's code.
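One common way to honor that guarantee while still fixing a regretted name is a deprecation shim: the old name keeps working but warns, so users migrate on their own schedule. The function names below are made up for illustration, not Dagster's or anyone's real API:

```python
import warnings

def load_assets(path):
    """The new, preferred name."""
    return f"loaded:{path}"

def load_solids(path):
    """The old name we regret: kept as a shim so existing code still runs."""
    warnings.warn(
        "load_solids is deprecated; use load_assets",
        DeprecationWarning,
        stacklevel=2,  # point the warning at the caller's line, not this shim
    )
    return load_assets(path)

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    result = load_solids("pipeline.yaml")

# Old callers get identical behavior plus a nudge to migrate.
assert result == "loaded:pipeline.yaml"
assert caught[0].category is DeprecationWarning
```

After a deprecation window (often a major version, announced well ahead, as Ben suggests next), the shim can finally be deleted.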
Michael Berk [00:59:09]:
Yeah. That makes a lot of sense.
Ben Wilson [00:59:12]:
So maybe on a major version release, and you tell everybody 6 months ahead of time: this is coming, by
Nick Schrock [00:59:17]:
the way. Right.
Ben Wilson [00:59:18]:
We have the same thing with MLflow. You have to do backwards compatibility testing for everything. So in nightly builds, we're testing the tools that we integrate with, the libraries. We're testing their main branch on GitHub. We're pulling it every night and testing it along with all of our other version tests, just for that reason.
Nick Schrock [00:59:39]:
Right.
Ben Wilson [00:59:39]:
Yeah. It's a lot of infrastructure you have to build. A lot of people look at a tool like that, and they're like, well, yeah, there's a couple hundred thousand lines of code in here. But then you look in CI. You're like, whoa, that's a lot of tests. Like, yep.
Ben Wilson [00:59:54]:
And a lot of those are dynamically created too.
Nick Schrock [00:59:57]:
Yeah. Well, to be clear, I actually think Databricks is pretty inspirational on this front, insofar as they were building out in the open, and they went from RDDs to DataFrames to Spark SQL. And it was kinda like a layer cake, in a way. But now I doubt that much net new code is being written against RDDs, for example. Probably some, but I assume the runtime still, you know, will support older code. I don't know what the version break has been on that.
Ben Wilson [01:00:28]:
Spark 4 is a foundational shift.
Nick Schrock [01:00:32]:
Okay.
Ben Wilson [01:00:33]:
Yeah. So that's a big one. You won't easily be able to interface with RDDs anymore.
Nick Schrock [01:00:41]:
Okay. I mean, it's been, like, 15 years or something, so that's a pretty good run, I think, for supporting a legacy technology. Yep. Is Spark 4, like, specialized for Photon or something? Or
Ben Wilson [01:00:56]:
Yeah. A lot of back end compiled C code, and a new abstraction layer with, basically, multiuser process isolation. So it's a very safe environment where you can say, hey, 30 people can log in to the same cluster, you can't see what anybody else is doing, and we'll dynamically allocate resources at the cluster level to make sure that everybody can execute their code in a reasonable amount of time.
Nick Schrock [01:01:19]:
Got it.
Ben Wilson [01:01:21]:
But, yeah, that took years to go through the Apache PMC, you know, committee voting, people saying, well, I still really like this language interface this way, and eventually convincing a huge group of people to say, yes, this is the way forward.
Nick Schrock [01:01:42]:
Yeah. I think people dump open source projects into Apache way too early in their life cycle. I think it's what you do if you wanna deliberately slow down. You know, it's like the CIA sabotage manual, which is like, how do you slow down companies? Refer everything to committees, blah blah blah. For open source software, one of them would be: get it into Apache. Mhmm. Yeah.
Nick Schrock [01:02:03]:
Yeah. You want things to stabilize quite a bit before you do that, because I think it does reduce innovation.
Ben Wilson [01:02:13]:
Yes. Could not agree more, and that's why we didn't do that with MLflow. Yep. Linux Foundation's great, having their support.
Nick Schrock [01:02:20]:
That's But Yeah.
Ben Wilson [01:02:21]:
Apache That's we're good.
Nick Schrock [01:02:24]:
Yeah. I mean, if it's the only way to extricate the IP, if that's what you wanna do, then it is what it is. But, much like microservices, it should be a tool of last resort. And, also, with GraphQL, we've had a very positive experience with the Linux Foundation.
Ben Wilson [01:02:37]:
Yes. Yeah. They're awesome.
Nick Schrock [01:02:40]:
Yep.
Michael Berk [01:02:41]:
Cool. So
Nick Schrock [01:02:42]:
Once I found out that if we put GraphQL in Apache, they're like, well, you have to use Jira. I'm like, what are you talking about? Like, bye.
Ben Wilson [01:02:55]:
Yeah. There's a lot of requirements in there where contributors are like, why are we doing it this way? Like
Nick Schrock [01:03:03]:
Why do you care about our task management system? Like, are you kidding me? Anyway, all due respect, I know there's, you know, different strokes for different folks.
Ben Wilson [01:03:14]:
Yeah. There are big projects in there. There's also a lot of dead projects in there, but a lot of important things.
Michael Berk [01:03:22]:
Yeah. So before we say anything too controversial, I think we should wrap. I know we're also coming up on time. So this was a really interesting episode. I have, like, 7 topics that I would like to talk about for multiple hours each, but we don't have that kind of time. So I'll quickly summarize some things that stood out to me. If you're looking for business opportunities, you should look for pain, importance, and sometimes novel technology. And if there's a central point of leverage, that's an advantage as well.
Michael Berk [01:03:54]:
On the first principles side, don't look down upon the beginner's mindset. Typically, fresh perspectives are really valuable. And then, also, you should be really context aware and go deep into the problem space. If you really understand something, you can typically operate with first principles more easily. And then a good example of true understanding is being able to explain it like you're talking to a 5 year old. Excuse me. So don't use jargon, and be able to match different levels of complexity based on your audience. And then for
Nick Schrock [01:04:24]:
5 year old might be a high bar. Let's
Michael Berk [01:04:27]:
let's say You're right.
Nick Schrock [01:04:28]:
Let's say a Yeah.
Michael Berk [01:04:30]:
What's what's a good age, Nick?
Nick Schrock [01:04:33]:
I would say someone who's, like, a 20 year old, say.
Michael Berk [01:04:39]:
Okay. Cool. Or or your grandma?
Nick Schrock [01:04:42]:
Right.
Michael Berk [01:04:44]:
And then regarding the future, there's still a lot of low hanging fruit in the data engineering and orchestration world. That's things like global quality checks, having a central source of truth, sort of that single pane of glass, and borrowing techniques from software engineering to increase reliability. All three of these things are gonna be very valuable in the next 3, 5, 10 years. So, Nick, if people wanna learn more about you, your companies, your projects, where should they go?
Nick Schrock [01:05:10]:
Yeah. So dagster.io is Dagster's, you know, kind of homepage. And we didn't actually talk about Dagster that much, but, you know, it's a place where you go to build data pipelines, and a data orchestration platform, but it is very targeted towards software and ML engineers who embrace software engineering best practices. And kind of the core notion is that instead of thinking in tasks, you should actually think about the data assets that you're producing out of the pipeline. And then lots of things fall out from that, including increased developer productivity. You know, your organization gets way more context on your data, and increased trust. So very exciting times.
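The "think in assets, not tasks" idea Nick closes with can be sketched in pure Python: declare what data exists and what it depends on, and derive the execution order from that. This mimics the shape of Dagster's `@asset` decorator, where dependencies come from parameter names, but it is a toy, not Dagster's actual API:

```python
# Registry of declared assets: name -> function that computes it.
ASSETS = {}

def asset(fn):
    """Register a function as a named data asset."""
    ASSETS[fn.__name__] = fn
    return fn

def materialize(name, built=None):
    # Dependencies are the function's parameter names, resolved recursively,
    # so the "orchestrator" derives the task order from the asset graph.
    built = {} if built is None else built
    if name not in built:
        fn = ASSETS[name]
        deps = fn.__code__.co_varnames[:fn.__code__.co_argcount]
        built[name] = fn(*(materialize(d, built) for d in deps))
    return built[name]

@asset
def raw_orders():
    return [10, 20, 30]

@asset
def order_total(raw_orders):  # depends on raw_orders by parameter name
    return sum(raw_orders)

assert materialize("order_total") == 60
```

The payoff Nick describes (lineage, context, trust) falls out of this framing: the graph of assets is explicit, so the system knows what `order_total` was built from without anyone wiring up tasks by hand.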
Michael Berk [01:06:00]:
Yeah. And it's an awesome tool. A lot of our listeners are highly technical and typically marry the software engineering, data engineering, and ML components. So I'm sure this will be of interest to all
Nick Schrock [01:06:11]:
of them.
Michael Berk [01:06:13]:
So, yeah, great episode. Until next time, it's been Michael Berk and my co-host, Ben Wilson. Have a good day, everyone.
Ben Wilson [01:06:20]:
We'll catch you next time.