Organizing Data Workflows With Prefect - DevOps 144
Anna Geller is the Lead Community Engineer at Prefect. She joins the show with Jillian and Jonathan to discuss her article, "Scheduled vs. Event-driven Data Pipelines — Orchestrate Anything with Prefect," and to explain the company she works for, Prefect, which helps users coordinate their workflows. She also discusses the many advantages Prefect can provide to its users.
Special Guests:
Anna Geller
Show Notes
About This Episode
- Importance of Prefect
- Different problems that Prefect can solve
- Scheduled vs. Event-driven Data Pipelines
Sponsors
- Chuck's Resume Template
- Developer Book Club starting with Clean Architecture by Robert C. Martin
- Become a Top 1% Dev with a Top End Devs Membership
Links
- Scheduled vs. Event-driven Data Pipelines — Orchestrate Anything with Prefect
- prefect.io
- Medium: Anna Geller
- LinkedIn: Anna Geller
- Twitter: @anna__geller
- Prefect Community | Slack
Picks
- Anna - Everything Everywhere All at Once (2022) - IMDb
- Anna - Top Gun: Maverick (2022) - IMDb
- Jillian - American Thanksgiving
- Jillian - prefect.io
- Jillian - Watch The Dragon Prince | Netflix Official Site
- Jonathan - transistor.fm
- Jonathan - Jonathan Hall / Transistor.fm Go SDK · GitLab
Transcript
Jillian_Rowe:
Hello, hello, I am here with our panelists today for another show of Adventures in DevOps. And with me, I have Jonathan Hall.
Jonathan_Hall:
Hey everyone.
Jillian_Rowe:
And our guest, Anna Geller, who is the lead community manager at Prefect, which I'm pretty excited about because I actually use Prefect.
Anna_Geller:
Hello everyone, I'm grateful to be on your show and thank you for inviting me.
Jillian_Rowe:
Oh, thank you for coming on. So tell us a little bit about what you do. First of all, like, what is Prefect? How did you get to be there? Any other kind of exciting tidbits you have for us?
Anna_Geller:
Yeah, sure. So I live in Berlin, Germany. So it's quite cold right now here. And I work as a lead community engineer at Prefect. Before joining Prefect, I was doing data engineering, data science, Python programming, and data engineering consulting. And especially in consulting, I got often frustrated by the existing tools to build data workflows. And I started to look for alternatives. And I wrote a couple of blog posts about what I've learned along the way. And that's how I started blogging and how I came across Prefect.
Jillian_Rowe:
That's great. And then was it like kind of a direct path? They asked you to be the, what is your exact title? Lead Community Engineer, right? That's the one
Anna_Geller:
Exactly.
Jillian_Rowe:
we're going with? Okay, okay, cool. So it was that kind of a direct path that you became the Lead Community Engineer at Prefect?
Anna_Geller:
Not at all. So at first I joined as a solutions engineer and there was actually quite a funny story that during KubeCon our person who was maintaining this community support had to be at KubeCon and we didn't have coverage there. So I jumped in and it turned out that I wasn't too bad and I just started doing more of it and yeah, I think that's how I got into community. In general, I think with community, the challenge is to provide help at scale. So kind of like my mantra is to earn users' trust in a healthy way at scale. Because I think, especially when your community grows, it starts to become difficult to give people enough attention and support. And some questions are repeated, so you get asked about the same thing multiple times, so it can become challenging. And I think that doing this in a really right way, in a healthy way is difficult, and I found this challenge really cool.
Jillian_Rowe:
I think that's really great. I mean, you guys must be doing such a good job because I know Prefect has really taken off just over like the last few years that I've seen it and that I've been using it. So it must be, you know, it must be going well. So what are kind of the, you know, the tips that you would say, what are some good directions to go in if you're like, okay, this is an open source software and you want to get the kind of visibility on it that clearly Prefect has, because Prefect does have an open source component as well, right, as well as the cloud. So what would you say is a good way to go about that?
Anna_Geller:
Yes, I think the best way is to really start with the problem that you are trying to solve. And once you are clear about that, everything becomes more obvious how you go from there. So I think in our existing space, we have so many tools that are quite opinionated in what they allow you to do. So for example, you may have a lot of tools that allow you to do data ingestion or data transformation. And they do it really well. But what's often missing is this tool that is unopinionated and that gives you a choice how you want to build your data flows. And usually when we say that Prefect is a tool that lets you coordinate your data flow, when we say coordinate, we mean this wide spectrum between observability and orchestration across teams and systems. I think you can probably confirm that you've used it. What is your experience?
Jillian_Rowe:
Uh, yeah, I have used it and that is what, you know, what people like about it. They like that it's very open and that you can, you know, you can kind of throw almost any problem at it. So, you know, for example, I don't even use it just for orchestrating data flows. Like I do, but I also use it just generally as a scripting language because I find it very well designed. And you know, when you're, when you have kind of complex, um, I don't know, we'll say ETL for the purposes of this conversation. Instead of having to write in all my own like logging statements, like this is the, you know, starting, ending, did it end successfully, try, catch all that kind of stuff. Prefect has all that sort of built in there just by putting that task decorator on the function. And then I get like all, you know, like the logging, the monitoring, the scalability, all that kind of thing directly, directly out of the box.
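For context, here is a minimal sketch of the pattern Jillian describes, assuming Prefect 2.x; the function and file names are illustrative:

```python
# Hedged sketch: decorating plain functions gives you state tracking,
# retries, and logging without hand-written try/except or print statements.
from prefect import flow, task, get_run_logger

@task(retries=2, retry_delay_seconds=10)
def parse_records(path: str) -> int:
    logger = get_run_logger()  # log lines show up in the Prefect UI
    records = open(path).read().splitlines()
    logger.info("Parsed %d records from %s", len(records), path)
    return len(records)

@flow
def etl(path: str = "input.csv"):
    # Start/end states, failures, and retries are recorded per run.
    parse_records(path)

if __name__ == "__main__":
    etl()
```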
Anna_Geller:
100%
Jonathan_Hall:
So I have a question. Sorry to talk over you there. Clearly, Jillian knows and uses this product, which allows you to have a nice technical conversation here. I don't know what your product does. I'm looking at the website and it says an open source data workflow orchestration. I don't know what a data workflow is or why it needs to be orchestrated. Pretend I'm really stupid, talk to me like I'm five, and explain to me and the audience who isn't into data, in layman's terms: what is this and who should care about it?
Jillian_Rowe:
Go ahead. And I think Jonathan got cut off again.
Anna_Geller:
Exactly. Yeah. So I think a good way to describe it is that we want to give our users visibility across the entire data stack. And why is it important? Jonathan asks. It's important because people working with data typically use various databases, data ingestion and transformation tools, tools for analysis and machine learning. And often they need a place where they can see how all those tools are working together. And they often need to ensure that those tools are working well together. And I think this is where this coordination plane becomes critical because it allows you to see what's the state of my data infrastructure right now. Is my data platform healthy? Are some agents or infrastructure components down? Did we miss some SLA on some workflows? Or maybe did our critical data warehouse refresh pipeline fail? And if you currently consider all those questions, you will realize that there is no single tool currently that can provide you with all those capabilities. For example, there are tools for platform engineers to monitor uptime and SLAs for infrastructure, right? Things like Honeycomb, Datadog, but there is really no such tool for data flows and data infrastructure, or at least I would say not a coherent one. And I think that's the reason why we are building this coordination plane and why this problem becomes really important to solve.
Jonathan_Hall:
I will trust that that was a great answer. Listeners don't know this necessarily, but I just dropped off. My internet's been flaky today, so I apologize for that. I think I got my last sentence cut off, but.
Anna_Geller:
Yeah, maybe I can-
Jillian_Rowe:
Yeah, we're not... Oh no, go ahead.
Anna_Geller:
Yeah, I think maybe I can just continue by just discussing the problems that Prefect tries to solve, right?
Jonathan_Hall:
Wonderful, yes.
Anna_Geller:
Yeah. So I think the first problem that Prefect solves is providing visibility into your data flows. So at any time you should be able to confidently say what failed, what succeeded, what crashed, what got canceled, and maybe if something got canceled, who did that? And I think this visibility to see is... is this foundation of data flow observability. You need to be confident that something that was critical to you actually worked. And once you actually have that visibility, you can add logic and configuration to more proactively configure how you want to react to specific data flow results. So for example, at the very least, you may want to have notifications, so that when something happens, something fails, something crashes, something gets canceled, you'll get an alert and you at least know that something is not right. And in general, I think with Prefect, we have this very helpful philosophy of incremental adoption, which means that you never have to use or learn more about the product than you actually need. So what it means in practice is that as your use case grows, you can just start adopting more features when you need them. And the basic setup is... You install the Prefect client, which is essentially a Python library. You just do pip install prefect. You connect to your Prefect Cloud account with a single CLI command, and you can start running your Python scripts, which have a single flow decorator. And then it's already observable with Prefect. You gain this visibility. You can track its execution state in the UI. And there's, there's for example, an additional component, which are, which are tasks. But it's actually enough if you only use flows. So flow is this main component and tasks are optional. So if you need more visibility and configurability, you can start using tasks only when you need them. And one specific use case that we often observe, for example, with our users is that they may have observed their flow and they notice that there is some potential to speed up the process if you run some parts of this workflow concurrently or in parallel. And only then you have to learn about a feature that we have for parallelization, which are task runners. So this means that you didn't even have to learn what a task runner is, unless you actually needed them. So I think this is a really, really cool design philosophy that we have and a good constraint to some extent.
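A rough sketch of the incremental-adoption path Anna outlines, assuming Prefect 2.x; the pipeline itself is illustrative:

```python
from prefect import flow, task

# Step 2 (optional): promote individual functions to tasks only when you
# want per-step visibility; plain functions also work inside a flow.
@task
def extract() -> list:
    return [1, 2, 3]

@task
def transform(data: list) -> list:
    return [x * 2 for x in data]

# Step 1: a single @flow decorator is enough to make a script observable.
@flow
def my_pipeline():
    data = extract()
    print(f"loaded {len(transform(data))} rows")

if __name__ == "__main__":
    my_pipeline()  # run locally; state is tracked once connected to Cloud
```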
Jillian_Rowe:
Yeah, you know, I've really liked that feature too, because I work with a lot of data scientists and, um, and also like just straight up biologists who come from a lab and maybe don't really want to have to code that much. And so throwing a tool at them that they have to really learn a lot of the tech is just, it's not going to work. Like it's, it's just not going to work. They don't care about the tool. They care about the science that they're doing, but being able to, you know, throw something at them and be like, okay, if you just put the small decorator on your Python function. When you're doing like your whatever it is that you're doing, right? You're parsing your DNA or something like that really helps them to adopt, you know, like some of these tools. So I found Prefect is really good as, I mean, both like a data workflow orchestrator, but also just as like an additional, almost like a scripting language where you just put it in and you give it to the scientists and you say, okay, here you go. But then this gives you some insight into your data flows as well.
Anna_Geller:
Exactly. So I think what you mentioned is also interesting. You mentioned you were working with this kind of like sensitive data, right? And what, what Prefect has is, is this hybrid execution model. So when you work with this sensitive data, you can't just use some SaaS product, because you can't, like, expose this data to a third-party system. And because of this problem, many, many practitioners decide to either self-host some tool on-prem or build some product entirely in-house. And what Prefect does is we provide this secure environment where you can run your flow and Prefect never even sees your code or data. So that's something that is really interesting. And on the topic of actually today's discussion, which is scheduled and event-driven data pipelines: you can imagine use cases where you run some workflows on-prem. And maybe you want to trigger some, emit some event from this workflow and perform some action based on this event. Like this, this would be possible very easily with cloud these days, right? Where you have different platforms such as AWS Lambda, Google Cloud Functions, etc. But if you, if you want to combine it with some on-premise workflows, it becomes suddenly quite difficult to, to combine that and gain this, like, visibility into both workflows which are running in cloud, event driven, and workflows which are maybe scheduled and running on-prem. So I think that's also a very interesting lens.
Jillian_Rowe:
Great, what about, so we kind of started to talk a little bit about your blog posts. There is a blog post that will be in the show notes, it's on Medium and the Prefect blog, and it's "Scheduled vs. Event-driven Data Pipelines — Orchestrate Anything with Prefect." Do you wanna start to tell us a little bit about that blog post and like why you wrote it? What was the context for you writing it? Why did you feel like the world needed to know specifically about this use case of Prefect?
Anna_Geller:
This is a very common engineering challenge, that you don't want to run only something on schedule, but sometimes you want to run something based on something else happening. So I think this use case that we described in this blog post is very common, that you want to trigger some workflow only when a new file arrives in S3. And it sounds like such a trivial problem to solve, but when you look at implementation, there are so many ways you could approach it. So one of them is that you just have a service that checks periodically if this new file arrived in S3, and if so, it just triggers some action. And if not, maybe you sleep and wait, you know, and try again in a couple of seconds or minutes. And already this kind of workflow is quite... It's quite wasteful, because on the one hand, you are wasting resources because you have this long-running service that you need to maintain. And on the other hand, that is also not very scalable. So imagine you have to do it for hundreds of files and processes. So those are kind of the challenges. And also there's this other perspective: what if this file didn't arrive at the expected time? You suddenly need to start building hacks such as pausing execution and, like, checking states, and essentially you write code that has some expectations about the world, and you're using this static definition, and you need to define in advance what you're going to do if those expectations are violated. So for example, do you fail the process if the file didn't arrive, or do you wait and try again later? And in contrast, if you... If you approach this from this event-driven perspective, this process becomes suddenly much simpler because your code only runs when needed. And it's easier to troubleshoot, it's easier to define this expectation that you have. And also you can have this kind of SLA approach, right? Like, what do you do in the absence of this event?
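To make the cost of polling concrete, here is a sketch of the naive approach Anna describes; it assumes boto3 is available and the bucket and prefix names are illustrative:

```python
import time

import boto3  # AWS credentials are assumed to come from the environment

s3 = boto3.client("s3")

def wait_for_file(bucket: str, prefix: str, poll_seconds: int = 60, timeout: int = 3600) -> bool:
    # A long-running service that burns resources while it waits, and that
    # needs a hand-written rule for what to do if the file never shows up.
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        listing = s3.list_objects_v2(Bucket=bucket, Prefix=prefix)
        if listing.get("KeyCount", 0) > 0:
            return True  # the file arrived; trigger the downstream workflow here
        time.sleep(poll_seconds)
    # The static expectation Anna mentions: decide in advance whether this
    # counts as a failure or a retry-later situation.
    raise TimeoutError(f"No object under {prefix} arrived in {bucket}")
```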
Jillian_Rowe:
Yeah, that's great. I like the definition of, like, let's think about what we do when things go wrong, because it's one of those tasks most of the engineers don't like to think about, but bad things, bad things happen when you don't, right? So of course, of course, we all want to do that.
Anna_Geller:
Exactly. And I think like in general, there are those different design patterns, right? So, uh, there is one, like when you do everything purely based on events, um, it's called, it's often called choreography, where you have this kind of like, um, each component only reacts to specific events, messages or signals. And there is like no central, uh, conductor, no central component that tells other processes what needs to happen and when. And I think this design approach is really great for some use cases, but it's worse for other use cases, where you have more predictable workflows, where you know what will happen, you know what needs to happen, you know when it needs to happen. And I think this is a really interesting lens: you actually need both. You need orchestration, and you need these event-driven workflows, this kind of choreography-type design pattern, working in tandem to build something which is scalable, reliable, and just easy to maintain.
Jillian_Rowe:
Yeah, very cool. And then I mean, with that, just in case anybody is wondering, you can still do the sort of typical scheduled workflows, right? Like, I want this to run at midnight every night from now until the end of time, we can do that with Prefect here, right?
Anna_Geller:
Exactly. So I think the distinction that is good to have is that scheduling is best for those more predictable and often batch processing workflows, when you're generally clear about those expectations, what needs to happen. So for example, you ingest some data, you then transform it. You then maybe run some data quality checks to ensure that this data is in the expected format and schema. You may then update some reports or generate some metrics, or trigger some reverse ETL, trigger a machine learning process, et cetera. So I think this scheduling approach is often really desirable because it makes things easy. You know that this process roughly runs every day at 2 AM, and then it runs for like five hours, so that at 7 AM this workflow, which took a couple of hours to run, is ready. And it makes it easy to reason about. It's not suitable for everything. And I think like an interesting perspective is like, if we look at some process end to end. So for example, let's say you have some anomaly detection process based on machine learning. So for example, you would start with maybe processing some images to train this model, right? You would just pre-process your data, you would generate your model. And then, like, you have this model. But you need to really fit it into the whole process. Your ingestion and transformation might need to start first, but then you need to trigger something more advanced later on. So just combining those different patterns together is where a combination of scheduled and event-driven really shines. Because, for example, once you have this model, you may want to serve it based on events. So for example, if a new user from your application uploads a file, you may want to generate a prediction, whether there's an anomaly there or not. And suddenly you need this event-driven approach. So I think this combination is really, really nice.
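For reference, the predictable nightly batch case might look like this, assuming Prefect 2.x; the flow body and the schedule are illustrative:

```python
from prefect import flow

@flow
def nightly_refresh():
    ...  # ingest -> transform -> data quality checks -> refresh reports

# In Prefect 2 the schedule lives on a deployment, not in the code, e.g.:
#   prefect deployment build flows.py:nightly_refresh -n nightly --cron "0 2 * * *"
#   prefect deployment apply nightly_refresh-deployment.yaml
```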
Jillian_Rowe:
Yeah, that's all great. What kind of industries are you seeing this take off in? I know I'm in biotech and I've seen a ton of companies adopting Prefect and Prefect Cloud, but I'm wondering what other industries are out there that are using it.
Anna_Geller:
So I'm not sure if this is industry dependent. I think to some extent I see it across industries that almost everyone has schedules, something that runs on schedule and something that needs to run only when something happens. So it would be hard to say. I think there are definitely use cases that are better suited for a certain industry and require dedicated tools. But that's also something that is, I think, great about our product: it's so unopinionated that you can actually reflect those different use cases and industries in a single tool. So like, essentially, I think a good way to say it is that we give a choice of how you want to build data flows and we don't just give one way to do it. Like, you always have options, you can always change your design later on. It's adaptable and... We essentially give you those building blocks and you can decide what you do with them.
Jillian_Rowe:
Very neat. Yeah, I know it's not like a tool that's specific to any industry. I was just wondering if any in particular had jumped on it the way that I've seen biotech, like definitely has jumped on it, which is all good, all good things.
Anna_Geller:
Yeah, so I think in general, this orchestration is such an interesting problem, right? That almost everyone needs to have retries when something fails. And this is industry independent. You always need alerts when something fails. You always need to ensure that your machine can actually handle your workflow. So maybe you need some concurrency limits. So there are those different use cases that... I think orchestration is universal.
Jillian_Rowe:
I'm so happy that we're finally talking about data on the show. I know I brought this up before
Jonathan_Hall:
Hehehe.
Jillian_Rowe:
the show, but sometimes we'll be talking about something and I'm, you know, clearly they're not mentioning the data in their blog post or whatever it is that we're discussing, but I'm thinking about the data. And then I have, I have opinions. So it's nice that we're having a show that, you know, that we're talking all about the data.
Anna_Geller:
So speaking of opinions, maybe what's your opinion about observability? And like, do you have like some, like, do you like how this space is evolving?
Jillian_Rowe:
Yeah, it's so much easier to do things like this now and to have these really like HIPAA and GxP compliant workflows, because part of the deal with HIPAA compliance, that's like a type of healthcare data compliance that's pretty common throughout biotech, is that you do have to have observability into your data and your processes and your pipelines. And you need to make sure that you are keeping data, and then you're not losing data is a big one. And then like, if you have these... kind of workflow runners, they need to be, a lot of times they need to be like siloed and private, because you can't have, you know, bits of data, you know, touching other types of data, or, you know, if you have like different studies or different patients, the data has to be siloed off. Or if you're doing, if you're a biotech company that does like analysis as a service, you have to have like separate workflow runners. So, I mean, all of these are good things for being able to run, you know, like HIPAA compliant and GxP style workflows, because then it's kind of built into the system, right? You can say, okay, we know that it, we know that the, the process completed and that the data was written to the database, because we, we have this data pipeline here, and you know, Prefect, or I suppose you could choose another one for kind of the purpose of argument, and say, okay, like, like here it is. So for example, one of my favorite stories with that is during COVID there was a company in the UK that was being, they were getting like funding to record contact and trace data, right? And they were using an Excel sheet, and the Excel spreadsheet went to, I don't remember how many thousands of rows it goes to, but then it would just stop and the data would stop being written. And I think like, well, what if they were using something, you know, like a data pipeline, where we made sure that the data was being actually written properly to, probably not a spreadsheet, right? Like a database, and then there was like a count, and then you could actually like have just a quick check to say, okay, we wrote this data, let's just check and make sure that it's there, right? So that would be your kind of sanity check, I suppose. Like I always like to put those in any of my data pipelines, where I think something happened and then I have sort of like a test within the pipeline, kind of like the way the software engineers write their software tests. I'll actually put a test to say, okay, like this did happen, right? I didn't hallucinate this, this really is here and we have this data and it should look something like this and this is the way that it should be. So I like the fact that you can have those all within the data pipeline. And then when you get the observability, you know, hopefully, nobody's getting any very big fines for saying they're HIPAA compliant and then losing data, right? That's the thing that we're always afraid of, at least in biotech.
Anna_Geller:
Yeah, that's the hope. Yeah. And definitely Excel is not a database, right? That's what we learned from the story. Yeah, I think an also interesting perspective is: how many different kinds of observability are there, right? So I think if you ask 10 different people what they think observability means, they will give you a different answer. And I think one part of it is because there are so many different kinds of observability. Like one of them is this application observability, where you are tracking metrics, traces, logs, and like how things are performing over time. So for example, number of API requests with an error status code, et cetera. But like when you look at the data observability, you want to ensure, as you mentioned, that your data is in the expected format. Your data is of the right quality. It has the right schema. The volume changes, how volume changes over time. You need to ensure that it all makes sense. But this is this data observability, which is completely different from application observability and application monitoring. But then on the other hand, the problem that Prefect solves is this data flow observability, where you have changes to flow and task run orchestration states, maybe calls on your blocks, or just essentially things that reflect what happened in the user's data stack at a specific time. And something that you want to observe and take action on to drive orchestration. So for example, when your Databricks job completed, you want to start your dbt Cloud job, you know, and you need to make sure that you only start this once this finished, you know, or the same like with maybe Fivetran data ingestion or, um, a custom scraper that you built just in pure Python and orchestrated with Prefect. So I think, like, having different tools for those different kinds of observability, and also having, for example, Prefect for data flow observability, helps to solve this problem. But definitely it's probably a challenge that we still need to solve as an industry in the next years.
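A sketch of the ordering guarantee Anna describes; wait_for_databricks_job and trigger_dbt_cloud_job are hypothetical stand-ins for your own API calls, not Prefect integrations named on the show:

```python
from prefect import flow, task

@task
def wait_for_databricks_job(job_id: int) -> str:
    ...  # poll your Databricks job via its API (hypothetical placeholder)
    return "SUCCESS"

@task
def trigger_dbt_cloud_job(job_id: int) -> None:
    ...  # call the dbt Cloud API to start the job (hypothetical placeholder)

@flow
def warehouse_refresh():
    state = wait_for_databricks_job(123)
    if state == "SUCCESS":
        trigger_dbt_cloud_job(456)  # downstream step only runs on success
```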
Jillian_Rowe:
Yeah, I definitely think it's interesting. I mean, different people are trying to, um, you know, to solve it in different ways, especially when you look at the, like the data observability versus the, you know, application observability. Um, I haven't like quite dug into it, but I saw that there was a blog post somewhere about getting Prefect and the library Great Expectations, which allows you to like make kind of assumptions, I would say, about your data. Like, give it a data set, Great Expectations will create a set of assumptions about your data, and then you can kind of test those assumptions against future data sets, I suppose, and how that would integrate with Prefect. And just in general, I've seen Prefect integrates with a lot of the kind of, you know, PyData ecosystem and a lot of the tools that are out there. I mostly use it with Dask because that's just kind of, that's just kind of like what I'm doing on a day-to-day basis, is using like Prefect and Dask and, um, just trying to get things to run as quickly as possible. But I am seeing lots of other tools out there as well.
Anna_Geller:
Absolutely. And also like with this observability and data quality monitoring, you have also those different kinds of ways how you can approach it, from this perspective of whether you do it actively or passively. So for example, data quality checks are something that you very actively have to implement. You need to actually go ahead and first inspect your data, define, like, what you expect from your data, right? And then, like, encode it as your data quality check, embed it into your data flow and maybe schedule it. So this is this kind of active, this control part. And this is also part of orchestration. You need to actually make sure that this runs at the right time and when it's needed. But there is also this passive mode of orchestration where you just, sorry, of observability, where you just want to see how things change over time, what is happening currently right now in your, in your stack. And also maybe use, use different, different tools for that purpose. So for example, when you look at, let's say, Datadog, you can, you can see those different metrics, traces, possibly you don't need to, like, actively go and ask, okay, what is, what is the current CPU threshold on, on this machine, right? And the same, for example, let's say Monte Carlo. Monte Carlo also takes this passive approach to data observability, where you just connect your data warehouse to the platform and you can just start observing it passively without doing anything. So those different modes probably also help to get different lenses of observability.
Jillian_Rowe:
Yeah, I do want to say: run a count, everybody, run a count. Okay. You have, you have a data pipeline. You should know how many, how many records of data you should have at the end, I would think, for the most part.
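Jillian's "run a count" check as a Prefect task might look like this, assuming Prefect 2.x; the sqlite database and table name are illustrative:

```python
import sqlite3

from prefect import flow, task, get_run_logger

@task
def row_count(db_path: str, table: str) -> int:
    # Count what actually landed in the database.
    with sqlite3.connect(db_path) as conn:
        return conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]

@task
def assert_count(actual: int, expected: int) -> None:
    logger = get_run_logger()
    if actual != expected:
        # A failed task fails the flow run, which is visible in the UI
        # and can drive notifications.
        raise ValueError(f"Expected {expected} rows, found {actual}")
    logger.info("Count check passed: %s rows", actual)

@flow
def load_and_verify(db_path: str = "contacts.db", expected: int = 10_000):
    # ... the load step would run here first ...
    assert_count(row_count(db_path, "contact_trace"), expected)
```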
Anna_Geller:
How many null values? Yeah.
Jillian_Rowe:
We hope. Yeah. Things like that.
Anna_Geller:
For sure.
Jillian_Rowe:
Yeah, that's great. Do we have anything else we want to talk about for the event driven versus scheduled pipelines? Because I wanted to talk about the workflow orchestration as well, because I think Prefect has a really interesting model for that, or whatever
Anna_Geller:
Yes.
Jillian_Rowe:
else you want to talk about. It's your show.
Anna_Geller:
Okay, cool. Yeah, I think with this kind of observability lens, what's really fascinating to me is that you have a lot of, you have many platforms these days that allow you to collect metadata about your execution, right? You collect metadata about your data. You collect metadata about your data flow. But often they just provide you only with this state of the world, where you can maybe see how things work in a dashboard, but you cannot do anything with it. Like they actually stop at this point where you can only see, let's say, that this failed, and okay, that's great, but you need to actually take manual action to do something with it. And what is interesting to me is to actually take it one step further and take automated action. So for example, you observe that this workflow failed, maybe you need to cancel this other process that depends on it. You know, so kind of like, we often see people, for example, just going into real time from, from batch, you know, they say, oh, we need to switch to real time, but like, what's the, what's the problem? What problem are you trying to solve? If, if all you want to do is to just refresh your dashboard more frequently, you know, maybe see the numbers changing, changing faster, it may not be like the right, the right motivation to do it. But if you want to actually take automated action based on what you observed, based on what's happening, then those, those event-driven and real-time systems become important and really drive value to the business. So I think this, this kind of aspect of, of making it actionable is, is, is important.
Jillian_Rowe:
So do you mean to tell me that if I just have the big TV up with the dashboard so that when my boss walks in, they think I'm doing stuff, that might not be the way to go?
Anna_Geller:
No.
Jillian_Rowe:
No, no? Okay, all my years in IT lied to me. Okay, all right, I have to adjust my worldview for that one a little bit. Have you ever been in like an office where they just have like the big TVs with all the dashboards that nobody ever actually looks at?
Anna_Geller:
I was, I was, and I was always fascinated by it.
Jillian_Rowe:
Yeah. I agree, I was too. It's, I don't know, just for show and to make it look like we're doing stuff.
Anna_Geller:
Yeah. And maybe, maybe also before we move into orchestration, like one important aspect also of, of, of this, like, taking action, is often you don't want to only take action based on what's, what's happening, but also if something doesn't happen. Right? So for example, you have this critical job that, like, it must be executed by 9 AM. For example, this critical sales table should be updated every morning by 9 AM. And if it's not, maybe automatically open a new support ticket in Monday or Jira, or notify people, because it's really critical that this finishes by 9 AM. So this is also something where you need to take action very quickly, but it's not the kind of event that you just fire and do something immediately. It's more proactive, in that you need to actually look at the system as a whole and then take some automated action.
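A generic sketch of reacting to the absence of an event, as Anna describes; the freshness query and the ticketing call are illustrative placeholders, not a specific Prefect feature:

```python
import datetime as dt

from prefect import flow, task, get_run_logger

@task
def table_updated_today(table: str) -> bool:
    ...  # query your warehouse's last-updated metadata here (placeholder)
    return False

@task
def open_ticket(message: str) -> None:
    ...  # call Jira/Monday/Slack here (placeholder)

@flow
def sales_table_sla_check():
    # Schedule this flow itself for 9 AM; it fires on what did NOT happen.
    logger = get_run_logger()
    if not table_updated_today("critical_sales"):
        open_ticket(f"critical_sales missed its 9 AM SLA on {dt.date.today()}")
    else:
        logger.info("SLA met")
```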
Jillian_Rowe:
Yeah, I think it's always really important to kind of identify where the computer can help and where a person needs to jump in and start saying, okay, like what's, what's happening here? What do we need to do? So for these kinds of jobs, like you said, this job has to go by 9 AM, and if it's not there, I don't know, start ringing the bell until somebody, until somebody figures out why.
Anna_Geller:
For sure, sometimes a conversation can solve more problems than any automated process. Can you join us on this aspect?
Jillian_Rowe:
Are you back now?
Jonathan_Hall:
I don't know.
Jillian_Rowe:
Maybe? Yeah, guys, we haven't like kicked Jonathan off the show or anything. He's just having technical difficulties, so.
Jonathan_Hall:
Speaking of observability, I really wish I could look at the logs for my router, because I have no idea what's wrong with my network right now.
Jillian_Rowe:
I don't know either.
Anna_Geller:
Exactly, so you need a dashboard of how your router is.
Jonathan_Hall:
Exactly. Exactly.
Jillian_Rowe:
Start pushing data through and see how fast it goes, and that's your new Prefect dashboard.
Anna_Geller:
Oh, we don't do dashboards yet, but that's a good idea. I will take it as a feature request.
Jillian_Rowe:
Prefect has a dashboard, the Orion UI, right?
Anna_Geller:
Ah, okay. You mean this type of effort. I thought you meant you'd actually visualize your, your, your router. That looks...
Jillian_Rowe:
Oh, no, no, no, I was just saying that. In case Jonathan needs a new project, he can make that. Maybe he could use Prefect as the backend. I am actually using Prefect as the backend for my latest open source project that I'm trying to get companies to throw some cash my way for, and so far they're not, so far they're not, but I'm sure they will one of these days, and then I'll give you guys all the accolades that you deserve.
Anna_Geller:
After they watch this, or listen to this podcast, they will be on board.
Jillian_Rowe:
That's right, that's right. I don't know if any of my clients listen to the podcast. Well, I know like a couple of them do, but not, it's more like my peers rather than specifically my clients. I don't know. So I guess the other thing with Prefect that I wanted to talk about, that I think was a pretty big jump from other kinds of data workflow managers that I've seen, is this idea that the code doesn't necessarily have to live with the orchestrator. And what I mean by that is like, let's take something like Airflow, and like I'm not ragging on Airflow, right? Like I love Airflow. I still use it quite a bit. Most of my clients have jumped over to Prefect, but you know, that's, uh, that's life, right? But one of the things that always really frustrated me with Airflow is that the code kind of had to live within Airflow. So there had to be some way of getting it on there, which over time became more complicated than it should have been, especially, uh, when I was working with data scientists and they weren't really great at things, you know, like CI/CD or running tests or this kind of stuff, because often they would just want to throw all their Python libraries in there, their R libraries in there, or whatever. And they may or may not play nice with like the Airflow core or whatever other like plugins or libraries or anything like that that we had going on. So one of the things that I really liked about Prefect is that you can run from like multiple kinds of locations, and it's sort of like built in there, right? You can run from like a Dockerfile. You can just run from like Python code, which is still probably what I do the most, but I like having options, because I do. And I can't think of any other ones right now. Are there other options besides, you can remotely host, you can have like remote storage, so your data doesn't have to go to Prefect? Prefect kind of will pull your code from a remote, like let's say a remote S3 location or remote Docker image, all that kind of stuff. Am I missing anything there?
Anna_Geller:
Yes, I think you described it really well. So I think one kind of paradigm is taking code from, um, from GitHub, from, from your code repository. Another one is baking code into your Docker image, um, so your code lives where all your dependencies live together. And the third pattern is, is using cloud storage buckets. So S3, GCS, Azure Blob Storage, uh, where you can just upload your code and Prefect will retrieve it at runtime, so that there's still this barrier that Prefect never sees your code. It just, like, grabs it when it needs to, and only needs this metadata, this pointer to where your code lives. And that's everything we need. We just need to know where your code exists, and you can just change it later if you want to. Maybe, maybe you started with GitHub, but then you notice, like, your team is doing a monorepo and it doesn't work. You need to maybe split it into different repositories, or maybe switch to a cloud storage bucket, and you can very easily switch between those different blocks. So I think in general, like, this concept of blocks is really where Prefect shines. So you have those modular building blocks which you can piece together to, to, to create your deployment. So you may start with, as you said, you have the storage block that defines where your code lives. Then you have your infrastructure block that defines where your code should be executed. It could be a serverless container. It could be a Kubernetes job. It could be just a local process on a VM. And you can very flexibly switch between those when needed. So for example, you may just run everything on Kubernetes right now, but maybe at some point you realize, oh, you don't have the resources to maintain this Kubernetes cluster. So you could switch to a serverless container infrastructure block and just keep, keep using the same workflow. See, like, you don't need to change your code. You don't need to really change your process. You just point to a different block. And the reason that it works is actually because all those components are API-first. So you define your blocks via API, and you can also change them via API call. And I think this API-first approach is critical to enable such flexible patterns.
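A sketch of the blocks Anna describes, assuming Prefect 2.x; the bucket, namespace, and block names are illustrative:

```python
from prefect.filesystems import S3
from prefect.infrastructure import KubernetesJob, Process

# Storage block: where the flow code lives. Prefect only stores this pointer.
S3(bucket_path="my-team-bucket/flows").save("prod-code", overwrite=True)

# Infrastructure blocks: where the flow should run.
Process().save("local-dev", overwrite=True)
KubernetesJob(namespace="data").save("prod-k8s", overwrite=True)

# A deployment then points at blocks by name, so swapping Kubernetes for a
# serverless container is a pointer change, not a code change, e.g.:
#   prefect deployment build flows.py:my_pipeline -n prod \
#       -sb s3/prod-code -ib kubernetes-job/prod-k8s
```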
Jillian_Rowe:
Yeah, I have to say that's one of my favorite parts, because I know like the, or one of my favorite aspects of Prefect, because I know like all my clients want to do slightly different things, and you would think like, oh, it wouldn't be that difficult to incorporate this slightly different thing, until you actually try to do it, and then you realize that it is, in fact, gonna be a major pain if somebody wants to do S3, and somebody else wants to do GitHub, and somebody else wants to store their code in Docker. So I've always liked that.
Anna_Geller:
Yeah. And the worst part is, like many tools, they force you to actually rewrite your code to switch between different things, you know. Like, often, I think we've seen in the past that many products adopted this workflow-as-code philosophy, and Prefect took it one step further and adopted this code-as-a-workflow approach. So we kind of assume that you already defined your logic in code. Why would you need to rewrite it for orchestration? Your code is already the best representation of a workflow. And that's, that's all we need. So no need for any branching DSL. Just write native Python if/else, and use a for loop or while loop where needed. And also, based on your, on your code and your stack, you define those modular building blocks, and you can use them to, to express how, how this code should behave. Like, whether we need to grab this code from S3 or from GitHub, or whether we should run it on ECS or on Kubernetes. It can be adjusted anytime without modifying the code.
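A sketch of the "code as workflow" idea: branching and looping are plain Python inside the flow, no DSL. This assumes Prefect 2.x, with illustrative task bodies:

```python
from prefect import flow, task

@task
def load_partition(day: str) -> int:
    return 42  # illustrative row count

@task
def alert(msg: str) -> None:
    print(msg)

@flow
def backfill(days: list, dry_run: bool = False):
    for day in days:          # an ordinary for loop
        if dry_run:           # ordinary if/else, no branching operators
            print(f"would load {day}")
        elif load_partition(day) == 0:
            alert(f"no rows for {day}")
```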
Jillian_Rowe:
Yeah, I'm a big fan of the ECS because I'm cheap. I'm too cheap to be running Kubernetes clusters all the time, right? I want my AWS Batch Fargate clusters.
Anna_Geller:
Yeah, ECS Fargate is probably underestimated in how much you can do with it, right? So, like, no need to manage any clusters here, you just register your task definition and you can run your container. And it can be anything. It can be a long-running data processing job. It can be an application. You can create a service to ensure that this container is running 24/7 without disruption. So this is definitely a very, very flexible service. And we see so many community users and so many enterprise clients adopting ECS Fargate. And I think in general, I'm a huge fan of serverless, and I'm really happy to see that other cloud vendors also started adopting this approach. So if you look at, for example, Azure, Azure now has Azure Container Instances, which allows you to do essentially the same as what ECS Fargate does. And they actually also kind of improved this by adding GPUs. So you can actually run a container serverless with a GPU, which is quite unique. I haven't seen it in any other cloud vendors yet. So that's a very, very interesting progression. And on Google, you have Cloud Run. So a Cloud Run job could be kind of a container that you just run, and a Cloud Run service could be to serve applications based on your requests. And yeah.
Jillian_Rowe:
Very cool. I didn't know that about Azure, you know, that you can use GPU with your serverless functions. It's very neat. I don't think AWS has that. I think with AWS, I think you're still limited to like 32 gigs or something like that on Fargate, although I don't know, it gets improved all the time. So I'm sure at some point they'll throw more on there.
Anna_Geller:
Yeah, so AWS has, as you said, they've even increased the CPU, but GPU is still not there, unfortunately. There is an open issue. I hope they will pick up, but for now, no GPUs on Fargate.
Jillian_Rowe:
Alright, there we go guys, we have an issue to go vote on. We all want GPUs on AWS Fargate, I guess.
Anna_Geller:
Yes.
Jillian_Rowe:
Uh, yeah, so I was thinking, um, I know another really interesting aspect of Prefect, at least for me, is that you get this kind of parallel computing and concurrency without really having to code it in. So you don't have to set up, um, thread queues or, you know, run GNU Parallel or like whatever, MPI, you don't have to write MPI, let's stick with that. You don't have to write MPI to be able to have like multi-threaded code that runs concurrently. Prefect is just kind of smart enough, especially with Prefect 2.0, that it can kind of figure out, like if you're running a for loop, for example, it can automatically parallelize that, right? And I'm wondering like what black magic happened so that it can do that, because I find that very interesting.
Anna_Geller:
Yeah, we have excellent engineers. So I think Michael Adkins and Chris Pickett are the people who are building this. So kudos to them. I also think it's quite magical that you don't need to think about this. You just define which task runner you want to use. Do you want to run it on Dask? Okay, no problem. Just specify this task runner. You can optionally point it to an existing cluster so that you don't have to create one on demand. And the same with Ray. Do you want to switch from Dask to Ray? No problem. Just change this task runner. And the concurrent one is, is particularly interesting, because especially in data engineering, we have so many workflows that depend on IO. So you are extracting data from, from some systems, which is, which is an IO operation. You're loading it to somewhere else. So, like, automating it this way, uh, with just the concurrent task runner, which is the out-of-the-box default setting, is really nice, so that you don't even have to maintain Dask or Ray if you want to take advantage of this concurrency, which we have out of the box with async support. And also, I think with async, that's also worth noting that you can very easily switch between synchronous and asynchronous execution. This is also something where you don't need to think about it. You just define async if you want asynchronous execution. And both just work. There's also kind of magic there that is worth noting.
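A sketch of the task-runner swap Anna describes; DaskTaskRunner comes from the prefect-dask collection (pip install prefect-dask), and the cluster address is illustrative:

```python
from prefect import flow, task
from prefect_dask import DaskTaskRunner  # RayTaskRunner lives in prefect_ray

@task
def fetch(url: str) -> int:
    return len(url)  # stand-in for an IO-bound call

# Switching runners is one argument; the flow body doesn't change.
@flow(task_runner=DaskTaskRunner())  # or DaskTaskRunner(address="tcp://dask-scheduler:8786")
def scrape(urls: list):
    # .submit() hands each call to the runner so tasks execute in parallel.
    futures = [fetch.submit(u) for u in urls]
    return [f.result() for f in futures]
```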
Jillian_Rowe:
I think that's very cool. I haven't jumped that much into the async Python world. I know it's used quite a bit in, like, some of the API libraries, like FastAPI and things. So I know I'm like a couple of years behind the times on that, but I'll just, I'll just keep using Dask until, I don't know, I'll shut the door and turn off the lights behind me, I guess, when I'm the last one out when it comes to that. What about, I haven't used Ray at all. How is that, like, are people adopting that? What is it being used for? I know it's like kind of a high performance machine learning library, but how is that being used with Prefect?
Anna_Geller:
So we have several users who started adopting Ray. I don't have any data right now at hand to tell you how many users started using it. But I think Ray is very promising. Like there are great engineers who are working on continuously improving this product. Yeah, I'm very excited. And I think actually the best part is that you don't need to worry about it if you don't need to. Right? If everything works great for you with Dask, you don't need to learn about Async, you don't need to learn about Ray. It exists there when you need it, and if you need it, but until then you can just be happy with Dask and Prefect.
Jillian_Rowe:
I mean, that's true, but I'm also an engineer with Shiny Object Syndrome, so I don't really know how long that's going to last for me, but you are factually correct, at least. I don't
Anna_Geller:
Yeah.
Jillian_Rowe:
actually have to worry about Ray and async and all this kind of stuff, but it is nice to know it's there for me when I need it, if I do more of a machine learning library kind of approach.
Anna_Geller:
Yes, I think one good way to just get started with it is to just try out this Ray task runner and just replace the Dask one with Ray and see how it works, see the Ray dashboard, and you can see whether you like it.
Jillian_Rowe:
That is a good idea. That is a good idea. I've been thinking about trying to see if like more machine learning could be applied to some kind of next generation sequencing techniques. Like if you can get the same or similar results in less time, because often people are generating so much data, and there are these kinds of tried and true methods that people use, but what if there was another way? What if there was another way that was faster and you got like the same or similar results? But that's like a research... I think that's a research project for another day. But if I do it, I'll probably do it with, you know, Prefect and Dask and Ray and all this kind of stuff.
Anna_Geller:
Awesome.
Jillian_Rowe:
And then I think before the show, we were talking a little bit about kind of open source software and your involvement in that. Um, we're all pretty big proponents of open source software on the show. Well, how did Prefect decide to be, you know, at least the, the Prefect software, right, is all open source. And then there's this kind of, uh, you know, like hosted version that you can sign up for. And obviously that's paid, because AWS costs money, people, it costs a lot of money. But, you know, how did, how did that all come about?
Anna_Geller:
So we operate in this PyData space. And I think just being, being available there as, as open source, it, it really helps even with, with distribution. People can, can easily find it. And, and also, like, purely from, from this perspective of just learning this as a new tool: you just install this library. You start using it. You can see whether it, whether it solves the problem that you are trying to solve. And then you can decide whether you want to buy this cloud product or not. I think this testability, this flexibility of what you can do with open source, is something that changed so many things, right? Like, can you imagine living without Postgres? It's, like, hard to imagine the industry without it. So I think just getting started and incorporating it with other tools is, like, where open source really helps.
Jillian_Rowe:
Yeah, I've also found over the last few years, you know, you're talking about the PyData space. It seems that everybody is trying to get into the PyData space and also make their tools, you know, so that they play well with the other tools in the PyData space, um, you know, specifically so that there is quite easy adoption among the different tools. And so that, you know, if you get one tool to work, well, then you have the feature sets from all the other tools as well. And I think that's really important for, um, you know, adoptability and maintenance and all that kind of thing.
Anna_Geller:
And when you want to ensure that those tools are working well together, you install Prefect.
Jillian_Rowe:
Exactly, exactly. Install Prefect for all the things. So where can people go if they want to get help with Prefect? They're like, all right, cool, I'm sold. Where do they go?
Anna_Geller:
So the first place where they can find us is prefect.io. That's our website. We have also Slack. So prefect.io/slack is where you can join our Slack community. You can ask any questions. You can also find me there. Apart from that, we have Discourse. So discourse.prefect.io is where you can also find answers to existing questions. You can find lots of resources to get started. And finally, of course, our GitHub. So you can find our GitHub repository. You can submit any feature request, bug report, if there are any.
Jillian_Rowe:
Is there also a discourse forum? I thought maybe there was for a bit, but I don't know how much that's kind of maintained or whatnot or if it's still around.
Anna_Geller:
I'm actually the one maintaining that.
Jillian_Rowe:
Okay, there we go.
Anna_Geller:
So discourse.prefect.io is where I'm around all the time.
Jillian_Rowe:
Okay, that's always good to know. Did you have anything else you wanted to add or any points, anything important or not so important that we didn't get to?
Anna_Geller:
So maybe kind of as a message: just don't get overwhelmed by the amount of tools that are out there. Just always look at what problem you are trying to solve and see what tool can help you accomplish your goal. And of course, as a call to action, try Prefect, just install it, see how it goes, and ask any questions if you have one on Slack or Discourse.
Jillian_Rowe:
Cool, great, okay. I think, all right, well, I think we are all ready to move to picks. Jonathan, are you there? You look kind of frozen. Oh, no, he's not there.
Anna_Geller:
He's frozen.
Jillian_Rowe:
I know, he's frozen. I don't know what we're gonna do, because only one of us, by the way, has the ability to press the record button and the stop record button, and that person is Jonathan today. So the show is just gonna go on forever without him. So I guess that's what we're doing. Moving on over to American Thanksgiving.
Anna_Geller:
Very, very cool. I love that.
Jillian_Rowe:
He's back. Jonathan, you wanna do your pick while you're here, or do you wanna, like, conserve your energy?
Jonathan_Hall:
I can, yeah, see if I stick around long enough to actually say something here.
Jillian_Rowe:
Yeah, we'll see.
Jonathan_Hall:
So my pick for the week is transistor.fm because I had a feature
Jillian_Rowe:
Hahahaha!
Jonathan_Hall:
request from them, and it's kind of a funny story. I hope I can get it said before I drop off again. I emailed their support and said, hey, I would like to be able to upload a transcript to my podcast through your API. Is this possible? Or what would it take to make it possible? And they replied less than two hours later and said, sorry, that's not an option. We've had a few people request it. We may add that feature in the future with some other API improvements. I was like, all right, great. So I started shopping around for other podcast hosts that would let me do this. The next day they wrote back and said, we've added that feature, would you test it for us? That's pretty cool. So they've added the feature and it works. So I'm in the process of automating my podcast uploading, including transcripts. But I thought that was pretty great customer service. I mean, despite their first response saying, sorry, we don't do that, in the end they got it done faster than I could have expected. So that's cool. So that's my first pick. My second pick, hashtag shameless self-plug, is a package I wrote to talk to Transistor's API in Go. So if you're a Go programmer and you want to talk to transistor.fm's API, you can use my new little package that does that. It's at gitlab.com slash Flemzy, that's with a Z, slash transistor. So those are my two picks for the week.
Jillian_Rowe:
Cool, and did you have a pick?
Anna_Geller:
Yes, I do. So I'm a movie fan, so my picks will be movies. I'll just say my two favorite movies of the year so far. The first one was Everything Everywhere All At Once. You've probably watched it. I think it's just a great story. It's funny, visual, touching. It doesn't have this typical movie arc. It's quite different and refreshing. And the second pick is Top Gun Maverick. It kind of exceeded my expectations when it comes to this movie. I didn't expect it to be good, like a follow-up after 30 years, but it was really, really good. Without feeling unnatural, great action scenes, not boring at all. So yeah, those are my two picks.
Jillian_Rowe:
Great, I think I'm gonna pick American Thanksgiving. It's tomorrow and I'm actually pretty excited for that, because for the first time since COVID started and all the, like, crazy supply chain issues, I can actually find all the things that my kids want, because, as it turns out, my kids are considerably pickier than I am. But this year we got all the things. We have like turkey and stuffing and mashed potatoes and stuff to make pumpkin pie. So I'm pretty excited for all that. And I guess for a, you know, for a tech pick, what the hell, I picked Prefect. I actually do use it quite a bit. This isn't me just, like, being nice on the show. I do use it on an almost daily basis, or, like, you know, when I'm writing code anyways, I do tend to use Prefect quite a bit. And I really do like it as both a workflow orchestrator and even just an additional add-on to kind of any of my ETL sort of pipelines. I like the logging and the, um, the, like... hey, stupid, this failed messages. I like those very explicit, for me, because, you know, sometimes you're busy and you need those. And then I guess I'll pick The Dragon Prince again, the fourth season. I might've picked it last week. I don't remember. If I didn't, it's a really good show. It's on Netflix. I've been watching it with my kids. We've been, like, binge watching it. And it's something that kind of the whole family likes, which is very, very difficult to find, something that my eight year old, my 11 year old, I like, the husband likes, everybody likes. So that's it. Those are the picks. Anybody have any kind of last minute remarks?
Jonathan_Hall:
Yeah, I can't wait to listen to this episode to hear everything I missed.
Anna_Geller:
I'm sorry.
Jillian_Rowe:
You know, Jonathan, with your pick though, I kind of, I kind of feel like the mystery of why Jonathan is getting kicked off of Riverside has been solved. So I would just, maybe I would just keep that in mind. That was why I started laughing, is because we're on a different podcast host and I don't know. I thought that it was funny.
Jonathan_Hall:
I don't think anybody's DOSing me because of my pick. Certainly not before the fact, but if so, that would make for a very interesting episode in the future to debug that, yeah.
Jillian_Rowe:
Yeah, but haven't we said other things before? We'll have to figure that out. All right, everybody. Well, until next week, bye-bye.
Jonathan_Hall:
Cheers!
Anna_Geller:
Thank you. Bye.