Scaling and Shopify with Kir Shatrov - RUBY 633


Special Guests: Kir Shatrov

Show Notes

Today’s guest Kir Shatrov is a production engineer on Shopify based in London, UK. Today, he and the panel are discussing capacity planning. Kir believes that capacity planning becomes a priority when your company starts losing money and your customers are suffering. When someone does get to the point of scaling their app, it’s important to look at the limitations of the hosting service. It is also important to remember that scaling is not a job that is ever completed.
Kir talks about his experience and time with Shopify and what types of changes have happened in the four years he’s been with the company. Kir explains that when Shopify was founded about 12 years ago, they were some of the first contributors to Rails, and Rails was just a zip file they shared over an email. This is important to know because the monolith code for Shopify has never been rewritten, so they put a lot of care into keeping it working. He talks about some of the techniques Shopify uses to avoid splitting into microservices when scaling their organization and how the multiple instances of the database are structured and managed from an ops point of view. He talks about what aspects of Shopify are open source and the approach to the architecture of the background jobs system.
The panel discusses what should be done if you want to scale your project and move away from background jobs. Kir talks about what criteria his company uses to determine what moves to a background job and when it is too much to background something. The show finishes with Kir sharing some of his favorite tips, tricks, and approaches he’s used at Shopify.

Transcript


Hey, everybody, and welcome to another episode of Ruby Rogues. This week on our panel, we have Nate Hopkins. Hello, everybody. Andrew Mason.

Hello. I'm Charles Max Wood from Devchat.tv. And this week, we have a special guest, and that's Kir Shatrov. Kir, do you wanna say hi? Let us know who you are.

Hi. My name is Kir. I'm a production engineer at Shopify, where I work on scalability of the platform, and I'm based in London, UK. Nice. Now Shopify doesn't have to deal with any scalability.

Right? I mean, they only run, like, half the shopping carts on the web and things like that. Right? Oh, yeah. So I'm curious as we dive into this.

You know, you gave us a couple of articles. One was on the state of background jobs. The other one was on, like, capacity planning for web apps. I kinda wanna start with this and dive mostly into when should I start caring about this. Right?

Because if I have a small app, it matters a lot less for a while. And then eventually, I'll get enough users or enough people using the capacity to actually go, alright, now I really need to start thinking about this. So yeah. Where do you find the cutoff point is for this kind of thing?

Definitely. There is a lot of talk and technologies that it's natural for engineers to be super interested in, but the price of over-engineering things and choosing solutions that are maybe too complex for the stage where your project is right now can be too high. And often, the most resourceful thing you can do is just deploy it on Heroku and let it run, and it will cost a few hundred dollars on your Heroku bill. For me, I think the cutoff point is around the time when you start losing control of maybe your hosting cost, or you start noticing that whatever scalability problems you have start hurting your customers.

And you start losing money either as a result of your customers being unhappy or as a result of the thing costing a lot more to run than the company can afford to run the business in a reliable way. Yeah. That makes sense. It's interesting too that you've kinda tied it to those two practical breakpoints. Right?

A lot of people, they try and tie it to, well, I have a certain number of users or I have a certain size of an app or, you know, a certain amount of server capacity, stuff like that. And it's interesting to me that a lot of this, you know, you've tied it back to, oh, it's impacting the customers or, oh, it's impacting my bottom line. And then it's like, oh, okay, how do I deal with this? I also think it's interesting that you mentioned that it's easy to do if you just hand it off to Heroku and let them handle it.

And I know that I haven't heard it as much from Nate, but I've definitely heard it from Eric over at CodeFund that that's kind of his approach. He doesn't wanna deal with DevOps. He just wants to push it to the cloud and then, you know, let them handle it, and he's willing to pay for Heroku to do it. Yeah. That's our philosophy right now.

But, I mean, we're also short-staffed. Right? Yeah. So we've got two, well, really, we're just one and a half developers on the project, though we've got plenty of contributors that help us fix bugs and things like that. But there's only two of us that are full time, you know, looking at code, and Eric's really only about half time looking at code, if that.

Right? Right. So we don't have the time or the bandwidth to really delve deep into, you know, the ops story. Mhmm. That makes a lot of sense.

So I'm curious, Nate, at what point would you guys consider moving off of Heroku? I mean, would it be a cost thing or would it be something else? You know, we found product-market fit, and we are trying to scale it now. We're trying to scale on the sales side. So as soon as we have enough customers and enough consistent revenue flowing in to allow us to kinda back off and look at our operations story, that's probably the time.

So I would say we're probably maybe six months away from, you know, having the luxury of being able to look at that. Yeah. That makes sense. So, Kir, as somebody gets to that point, you know, and I think this might be a relevant conversation for Nate. But, you know, when they get to that point and they're thinking, okay, we're gonna scale this, maybe they move it off of Heroku and onto, you know, a Kubernetes cluster, or they move it onto, you know, a virtual private server, something like DigitalOcean or something.

What things should they be looking at then to scale their stuff up? For any hosted services, like, for instance, it's common to use hosted databases as a service. I think it's important to look at whatever limitations that service has, because any hosted service would have some kind of those. I remember I read a blog post where an app had a very specific requirement for some Postgres extension that they'd been using. And they switched, I think, three providers that gave them Postgres as a service.

And they were unhappy with each, and they obviously spent a lot of effort. And finally, they got to running Postgres on their own, because having that very specific extension as a requirement was a huge point for them. When choosing a provider like that, it's important to understand any limitations. And from another angle, I think there are so many scalability-related problems that you can run into that usually you start looking at the one that's most critical right now. Like, I've been part of projects where they ran into scalability issues with the database layer, with MySQL or with Postgres.

And as they fixed it and iterated on it and their database could accept a lot more load, they came to another bottleneck. And that bottleneck is different every time depending on the business, depending on the patterns of usage that are coming from your customers. So it's fixing one thing at a time, one by one. And sometimes that's a never-ending story, especially if the company grows large and there is a team that works just on scalability, which is currently the case for my team at Shopify. Yeah.

That's a terrific point, in terms of this really not being a job that ever completes. Right? It's something that you're always having to stay on top of, especially if the company is enjoying any level of success. One cool thing about CodeFund is, even though we're on Heroku, we're able to leverage some of the more advanced Postgres features like table partitioning and things like that, which has enabled us to continue to scale on that platform. We're hosted on 160-plus sites right now.

And so we're seeing between 2.5 and 3 million requests a day pipe through the server. And now, we are paying a premium for Heroku, but we're still, I think, under $800 a month on our production setup. And we're probably a little over-provisioned in anticipation of spikes and things like that. And so, you know, we don't quite have the fine-tuned control that we would like to have. You know, your point on Postgres, as you want to customize it and install your own plugins and things like that into the database layer, that would be something that would be fantastic. Because since we are using table partitioning, I know there are some plugins that just are not, you know, broadly available on the Heroku platform that would be kind of a luxury for us to use, and we've kinda had to work our way around some of those things.
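For anyone curious how the table partitioning Nate mentions looks in practice, Postgres declarative partitioning can be driven from an ordinary Rails migration. A minimal sketch, assuming Postgres 11 or newer and a hypothetical impressions table rather than CodeFund's actual schema:

```ruby
# Hypothetical migration: a range-partitioned table plus one monthly partition.
# A scheduled job (or an extension like pg_partman) would create future partitions.
class CreatePartitionedImpressions < ActiveRecord::Migration[6.0]
  def up
    execute <<~SQL
      CREATE TABLE impressions (
        id           bigserial,
        campaign_id  bigint      NOT NULL,
        displayed_at timestamptz NOT NULL,
        PRIMARY KEY (id, displayed_at)
      ) PARTITION BY RANGE (displayed_at)
    SQL

    execute <<~SQL
      CREATE TABLE impressions_2019_08 PARTITION OF impressions
        FOR VALUES FROM ('2019-08-01') TO ('2019-09-01')
    SQL
  end

  def down
    execute "DROP TABLE impressions"
  end
end
```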

I'm curious about your experience and time with Shopify. How long have you been with the team, and what types of changes have happened since you've been with the company? I've been at Shopify for almost four years, and I've always been part of the production engineering department, which deals with the infrastructure and is less exposed to the product. And just that department grew so much while I've been here, from maybe 30 people to now more than 100. And all of those people are working on the infrastructure and reliability.

And with the motto that our job is to keep the site up. There's another aspect to scaling here, going from 40 to 100 people. Like, how has the team scaled? Like, what's the dynamic been like? Yeah.

It's interesting to follow the dynamics of team scaling, and in every organization I imagine it's a different story. It affected so many things. Like, for instance, at the time when I joined, Shopify is based in Canada, and most of the infrastructure engineers were in just one office. Now people who work on the infrastructure are based in three offices, and there are also a lot of remote people like me. And then as you grow, you end up investing in some of the things that you would never invest in before, and have teams who work just on one part of the development environment, for instance, or just on background jobs infrastructure, something that I wouldn't have imagined three years ago.

So what is the technical portfolio for Shopify? And, like, how has it changed since you joined? I mean, obviously, that's a great question. There's been a lot of new tools and techniques and stuff that have come out, you know, just over the last four years. And so I'm curious what the evolution of tooling has looked like at Shopify.

Yeah. That's a great point of discussion. So, I think, first, there is some context I wanted to give to our listeners. When Shopify was founded about 12 years ago by Tobi Lütke, Tobi was one of the first contributors to Rails, and he knew David, DHH, and they exchanged some emails. And at around the time when he started the company, when he started Shopify on Rails, Rails was just a zip file that they exchanged over an email.

It wasn't even some specific version published on a gem server, because I'm not even sure if there were any gem servers at that point. So from that day when he started on Rails, that app still exists. It was never rewritten. It's a monolith that has been around for more than a decade. We tend to put a lot of love into it to make sure that the developer experience stays great, unlike what often happens, where a monolith is just too slow and too hard to work with, developers get so much friction, and they decide to go splitting it up or calling the monolith legacy.

It never happened for us. I've gotta interject and just ask a question on your monolith. In terms of, like, I know Shopify is a very large company, how many developers have their hands in the monolithic code base? My rough guess would be from 100 to 200 people, given that R&D in total is a lot more, because there would always be people working on other parts of the stack, also mobile developers and so on, as you can imagine. So back to your point about how the stack has changed.

In terms of tools that are familiar to listeners of this podcast, it's still pretty much a classical Rails app with all the things that come with it. In terms of the infrastructure, I think the biggest shift that I have observed at the company was the move from physical data centers to the cloud, to Kubernetes. And that's another whole interesting story, because we were able to move to Kubernetes in the cloud one shop at a time. Given that we have millions of them, we wanted to make this process as continuous and finely controlled as possible. So we just took one shop, moved it to the cloud, and progressed, and we were able to control that.

It's fascinating to me that you have upwards of 200 developers working on a monolithic Rails code base. Like, some conventional wisdom that I've heard in other circles and certainly bumped into in my career has been that if you're gonna scale your organization, you apply Conway's Law and, you know, break out into microservices. And the conventional wisdom seems to be that that's really the only way to do it, and you guys are a terrific counterpoint to that. What are some techniques you've used to facilitate it? I think one of the biggest has been adopting domain-driven design and splitting that monolith into, I would not call them namespaces, but kind of components.

At least, that's what we call them. There is nothing very secret or special about it. It's basically just a way to structure your app directory so that each team, each component, gets their part. Therefore, it helps a lot to establish ownership, because, for instance, as soon as you see an exception in production in some exception tracking service that you use, you see that the exception is coming from components/support/app/models/something. You immediately know that's the support component, and you have all the metadata to find people who can help with that.

And even an on-call escalation or a Slack channel where you can chat and point it out. And we started leveraging that to automate some of the things. Like, for instance, if an exception happened in that component, we'll send a notification to their Slack channel, not to some generic Slack channel with tons of exceptions from all over the company. Establishing that ownership is, I would say, the main technique. Okay.
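A rough sketch of the ownership routing Kir describes, not Shopify's actual tooling; the component names, Slack channels, and exception-tracker hook are all hypothetical:

```ruby
# Map each component directory to its owning team, then route production
# exceptions by looking at the first application frame in the backtrace.
COMPONENT_OWNERS = {
  "components/support"  => { team: "support",  slack: "#support-devs" },
  "components/checkout" => { team: "checkout", slack: "#checkout-oncall" }
}.freeze

def component_for(backtrace)
  frame = Array(backtrace).find { |line| line.include?("/components/") }
  return nil unless frame

  COMPONENT_OWNERS.find { |prefix, _| frame.include?(prefix) }&.last
end

# Hooked into whatever exception tracker is in use (hypothetical API):
# tracker.on_exception do |error|
#   owner = component_for(error.backtrace) || { slack: "#all-exceptions" }
#   notify_slack(owner[:slack], error.message)
# end
```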

So kind of a domain-driven design, and then you give a team, like, full-stack responsibility, or at least all the areas of the stack that that particular domain piece may touch. Right? So that could slice all the way through the front end, all the way down into the model layer? Yeah. It's not as strict as you can imagine, and there would always be cases of reaching out directly from one Active Record model to another across components, across different domains. And that's not great.

We try to build tools to discourage people from doing that and to help them know what the right patterns are. Like, for us, it's mostly entry points that are typed and declared and documented. So this is kinda shifting gears a little bit. I'm really curious about the database infrastructure, because I know at Shopify, essentially, you've sharded the database, or maybe not sharded, but there are multiple instances of the database, right, that back this. How is that structured, and how do you manage that from an ops perspective?

Oh, yeah. That's also a great discussion point. So also to give some context to the listeners: all the well-known Rails companies, like Shopify, GitHub, Basecamp, to name a few, were founded around 10 years ago. At that time, MySQL was the best-known database that everyone knew how to run and operate. People were the most familiar with it.

And some others, like Postgres, were maybe not as good or as established at that point. So that's one huge reason why this subset of companies, including us, are all based on MySQL. And, yeah, I think it was around 2014, 2015, when we realized we could no longer fit everything into one DB. We figured out we had to find a way to scale horizontally.

And for a multi-tenant SaaS application, there is a great way to do that. Since your tenants are always isolated, you don't have any joins between multiple tenants. So you can put tenants on different shards, on different partitions, and manage those independently, which also reduces the blast radius. If you have 100 shards and one is down for whatever reason, only 1% of your customers are getting a negative experience, and you go and fix that as soon as possible, but it's not all of the platform.

So we invested a lot into sharding. In terms of application logic, it's mostly done at the Rails layer. We have a Rails team at Shopify that helps to steer that in the best direction possible, at least from the Rails point of view. And from the ops point of view, it's just a lot of shards that can be located even in different regions, which also allows us to isolate some tenants geographically. So let me just recap to see if I've got the picture in my mind correct.

So we've got a Rails monolith that's structured with kind of these domain areas of responsibility. That's how you structure your teams. And the way you've scaled this, at least up to this point in the conversation, is you're just dealing with, like, mountains and mountains of data. So you've sharded your multi-tenancy across different database nodes. But for the developer, it can just look like a typical Rails application.

Correct. And something to add is that our goal is to keep all that sharding complexity hidden away from developers who write product features. For them, it may feel like there is just a database with a lot of tables that represent the business model. But underneath, there would be some smart shard selection that happens at the beginning of the request, for instance, that would select the right database. And I mentioned this just for MySQL, for the relational database, but we've realized that it makes no sense to have sharded MySQL but just one global Redis.

Because regardless of how well you shard, that one global Redis or that one global memcache would still be a single point of failure. And, as you can imagine, we learned that lesson by experiencing those single points of failure. So our philosophy is that every resource would be sharded. So there would be a smaller instance of Shopify that has its own MySQL, its own Redis, its own memcache, which helps with this isolation. So with each web server, essentially, or maybe a partition of web servers that scale horizontally, all of those would not necessarily have a local copy of memcache and Redis, but maybe just a shared one for that cluster of web servers?

One thing I should note is that stuff like web servers is still all shared capacity. It's only the resources that are isolated. So any web server can talk to any partition, or any, like, smaller instance of Shopify. It's mostly a matter of selecting the right path depending on who the customer is. So now I'm a little curious, because there's obviously a pretty significant coordination piece there, you know, when the request initially comes in.

And then you assign the correct memcache server, the correct Redis server, and the correct MySQL server. How much of that infrastructure did you guys have to build at Shopify, and how much are you leaning on the database providers for those things? Honestly, I think it's mostly all built in-house. And to give a bit of context about that, it's mainly a component called Sorting Hat. I like the name.

That is using some... Yeah, sounds magical. Right. The Sorting Hat is using a global lookup table to find which domain, which shop, is on which partition. It gets the partition, and then it goes to the location of that partition, which can be US West, US Central, US East, somewhere else.

And then it just hits the right database located in that region, routed all through Rails and mostly through HTTP headers. And what I find very interesting is that we were able to build all of that on top of NGINX, since NGINX allows you to write scriptable Lua modules where you can implement any kind of logic. In those Lua modules in NGINX, you can query your database to look up where that tenant lives, and then you just proxy that through NGINX, and you manipulate the headers and just make this work. So it's quite a lot of infrastructure that we had to write.
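Conceptually, the tenant-to-shard lookup he is describing might look something like this on the Rails side. This is a sketch, not Shopify's code: ShardDirectory is a hypothetical lookup table, and connected_to(shard:) assumes Rails 6.1 or newer with shards declared via connects_to.

```ruby
# Rack middleware sketch: pick the tenant's shard at the start of the request
# and run the rest of the app against that connection.
class ShardSelector
  def initialize(app)
    @app = app
  end

  def call(env)
    request = Rack::Request.new(env)

    # Hypothetical global lookup table: shop domain -> shard/pod name,
    # e.g. "shard_us_east_12". At Shopify this lookup happens even earlier,
    # in NGINX/Lua, as Kir describes.
    shard = ShardDirectory.lookup(request.host)

    ApplicationRecord.connected_to(shard: shard.to_sym) do
      @app.call(env)
    end
  end
end
```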

But at the same time, as I talk to colleagues at different companies, it's all custom-tailored, and there is rarely the same stack or the same use case. So that would also be maybe a bit hard to share and abstract. So yeah. How much of that infrastructure tooling is open source? Is that all secret-sauce internal stuff, or have you open sourced some of it?

We try to open source quite a few things. There are also a lot of conference talks, which we'll link in the show notes, that give a way better overview of the architecture than I just explained. The routing layer itself, I wouldn't say it's open sourced, but there is lots of information out there for someone who would want to build and use the same techniques. So that's probably a good segue into, you know, additional scaling aspects. So you've addressed a lot of the persistence layer, or pretty much the entire persistence layer, horizontal scalability, but you still have response times to deal with.

Right? And so one way to make response times fast is through background jobs. And I know you've got quite a bit of expertise there. What is the approach and architecture of Shopify's background job system? Well, and just to pile on here real quick, it seems like when people start talking about scaling Ruby apps or Rails apps or Sinatra apps or whatever, this is one of the first things people reach for.

Right? Because any long-running task, they just, you know, shunt it off to a background job and, you know, report errors back to the user if they have to. And it shortens the response time, because then it's, hey, go do this job instead of I'm going to grind through the work of doing this job. Yeah. And before you jump in with an answer too, I mean, one thing to bear in mind is, like, some of this stuff is just baked into Rails with ActiveJob.

But, I mean, you don't even have to set up Redis or anything like that to support it. Right? It'll run it on a background thread out of the box. So what is the path for a developer? Like, kind of Chuck's lead-in question.

You start on a small project that may be a little hobby thing, then it starts to get some traction, and then maybe it turns into a business. What does the evolution of that background job handling look like over time? Oh, yeah. And to note that, like, myself, for some of my pet projects, I run background jobs exactly like that, in a background thread in those Puma processes. Yeah.

Just because it makes no sense to pay for extra, for instance, Sidekiq dynos on Heroku for those pet projects. And exactly as you pointed out, it makes sense to start with something as brutal as a background thread. And then I'm really happy that the Ruby community has a project like Sidekiq, and Mike Perham, who is behind that project, who has pushed the community to adopt some best practices around background jobs and also offers 99% of what the community needs as an open source project. And for the remaining 1%, when you get to that point, you can buy a Pro or an Enterprise edition. And I'm pretty sure that when anyone is at that point, that's actually quite affordable software to buy as a company.
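For reference, running jobs on an in-process thread the way Kir and Nate describe is a one-line setting with Active Job's built-in async adapter. A minimal sketch; the app, job, and mailer names are hypothetical:

```ruby
# config/application.rb
module TinyApp
  class Application < Rails::Application
    # The :async adapter runs jobs on an in-process thread pool: no Redis,
    # no extra worker dynos. Pending jobs are lost on restart, so it only
    # suits small or non-critical workloads.
    config.active_job.queue_adapter = :async
  end
end

# app/jobs/welcome_email_job.rb
class WelcomeEmailJob < ApplicationJob
  queue_as :default

  def perform(user)
    UserMailer.welcome(user).deliver_now
  end
end

# Enqueued exactly as it would be with Sidekiq or Resque behind Active Job:
# WelcomeEmailJob.perform_later(user)
```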

And just like most of the community, who are using Redis and Sidekiq, Shopify is very similar in terms of setup, because we've been around for such a long time. We started with Resque, if anyone remembers. That was a pre-Sidekiq-era library to basically achieve the same thing. So we still run Resque. We run Redis.

We got to rewrite most of Resque's internals because we're multi-tenant, and we want to share some of the capacity and reuse that between tenants, which we can dive into later if you'd like. I guess the first question from you and from some of the listeners could be why we're not on Sidekiq. And the answer, I would say, is mostly the legacy part and also how much we know the stack and how much we've customized it for ourselves at this point. But we're also starting some smaller apps at the company, some smaller Rails apps. In fact, in addition to the monolith, we probably have a couple hundred other smaller Rails services for something very specific, or maybe something just employee-facing.

And all of that would use the recommended set of libraries, which includes Sidekiq. Yeah. That makes sense. I'm also working on a software as a service. I'm sponsoring one of the bigger conferences that serves that niche, podcasting, in August.

And so I anticipate that things are gonna grow. And, yeah, I have a lot of things that I am pushing into background jobs right now just because, you know, I wanna get the response times down. But one thing that I'm wondering about, and I'm kinda tempted to go with Heroku, but part of me, I don't know, I have this mental block about paying for something that I could probably figure out the scaling on myself, or at least do, you know, a couple of minor things to help with the performance and scaling that way. So what should I be looking at next?

It seems like you all have kind of gone toward the cloud, and I'm wondering if that's the right answer. Or, you know, beyond background jobs, what's the next step? A step to reduce response time? No, it's more a step to just get it to scale, you know, to be able to handle more traffic without having the site slow down. Right.

There would always be some kind of bottleneck, which, if you have a good set of tools, should be possible to find. And for us, that bottleneck has changed over time. And I would guess there is no single answer, because maybe there is something in the web server, in a controller, still spending quite a lot of time, which slows down the response time. Or maybe it's the database that's the bottleneck. Or maybe it's Redis.

Or maybe Rails reaches out to some external service that is not located too close to it, which increases latency and also impacts response time. Yeah. That makes sense. I'm curious, what criteria do you use to determine what should move into a background job? Obviously, you may hit some latency on a particular request and see something that is kind of low-hanging fruit to move to a background job.

But just because you moved it to a background job doesn't mean you've actually addressed the root of the problem. You've just moved it out of the request flow. Right? Oh, yeah. And a very common pattern that I see people do with jobs is, for instance, you want to iterate over all the users in your app and do something with each of them.

Maybe remind them that they need to add a credit card, or maybe something expired, or you want to send them an engagement email. When you start, you have just 100 users, so that job works off pretty quickly, under a minute maybe, depending on what kind of work that is. Then you grow to 1,000, to 100,000, to a million. And a job to iterate over a million users and check the balance of each of them starts taking days or weeks.

And how do you solve that? And it's just so easy to introduce that problem. You just do User.find_each in a job, and it works, up until the point when it stops. The way we solved it, and that's actually all open source, we'll also link it in the show notes.

We've solved that by making every job interruptible and preserving a cursor, so that a job would progress for a bit, and then maybe it would get restarted for some reason. But basically, this allows us to iterate over really long collections, do some work with them, and never lose the work that has been done. Nice. Yeah. That's really cool.
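Shopify's approach is open sourced as the job-iteration gem Kir mentions. Per the gem's documented interface, an interruptible, cursor-based job looks roughly like this; the reminder method is a hypothetical stand-in for the credit card example above:

```ruby
class ExpiredCardReminderJob < ActiveJob::Base
  include JobIteration::Iteration

  # The gem asks for an enumerator plus a cursor; it checkpoints the cursor
  # so the job can be interrupted (deploy, shutdown, shard move) and resumed
  # without redoing work that already finished.
  def build_enumerator(cursor:)
    enumerator_builder.active_record_on_records(User.all, cursor: cursor)
  end

  def each_iteration(user)
    user.send_card_expiry_reminder # hypothetical domain method
  end
end
```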

I'm gonna check out Shopify's job-iteration. That sounds really, really interesting. One of the things that we've done at CodeFund is, when we're iterating across a collection, of course, we'll do, like, a find_in_batches, and then we will just enqueue the smaller work. So when the large job fails, it's essentially idempotent and can just be rerun again without impacting, you know, things that may have been half processed or halfway chunked through. Yeah.
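The fan-out pattern Nate describes might look roughly like this; the job names and check_balance method are hypothetical, and Kir's caveat below about Redis memory applies to the enqueued backlog:

```ruby
# Parent job slices the collection and enqueues small, idempotent batches,
# so a failed batch can simply be retried without redoing everything.
class EnqueueBalanceChecksJob < ApplicationJob
  def perform
    User.select(:id).find_in_batches(batch_size: 100) do |batch|
      CheckBalancesJob.perform_later(batch.map(&:id))
    end
  end
end

class CheckBalancesJob < ApplicationJob
  def perform(user_ids)
    User.where(id: user_ids).find_each(&:check_balance)
  end
end
```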

That's the approach that I take as well. You know? An interesting side effect of that could be that, again, this leads to a fan-out of a million jobs, because if you have 10 million users and each batch is a size of 10, for instance, like, the numbers don't really matter. But the point is that with a fan-out of so many jobs, we need to remember that something like Redis is always limited in memory. And there have been so many times, I would say across every organization where I've worked, that people would push Redis into an out-of-memory state.

And, unfortunately, I would love to have a great solution for that. But every time we want to do something like you described, iterate in batches, enqueue something, we have to be mindful about what's behind that. And most commonly... Yeah. I've been bit by that as well. You start dropping jobs because there's no memory left.

It certainly happens at times when jobs might be failing. Right? Sidekiq gives you some pretty nice fail-safe capability where it will reattempt those jobs. But if you've got a bug and not a lot of memory dedicated to your Redis instance, then, of course, you may start losing work that may be critical to the business. Yeah.

I could see that. I haven't run into that myself, but I could definitely see that happening. This is a great reminder about all sorts of databases that exist out there, and maybe it will push someone to learn about that. Because in the end, Redis is an in-memory database, which is bound by some RAM that you give it. It can be a gigabyte, can be 4, can be 16.

And that backlog of jobs would not be backed by something that can be written to storage that's bigger than RAM, which would be disk if it's, for instance, MySQL or Postgres. So something that we would really like to find is a store that could persist those things on disk with performance not too far, and features not too far, from Redis. Redis does have the capability to write to disk. Right?

To flush itself out to disk. Yeah. So that only helps to have a snapshot in case the computer where Redis is running reboots, but it still doesn't allow you to store more than the RAM that you have. Yeah. I mean, that's probably a great argument to move to the cloud.

Right? Because on Heroku, it's just a one-button click when you see the memory filling up to scale out or scale up your Redis storage capacity. Yeah. And a lot of cloud databases or cloud instances, they have methods for compensating for that, and so they will just migrate you to a bigger instance or, you know, basically allocate it new memory without you even having to click it. As far as that workflow is validated and people are certain that it will work, that's a great feature of cloud providers.

One of the thoughts that I've had architecturally, which would be kinda neat on the background processing side, is that some jobs obviously are a bit more ephemeral and less critical, and they could be handled in a little bit more localized fashion. So it'd be neat to build a routing layer that was intelligent, where you maybe had three stages of Redis or just background job storage. Right? One could be, this is very ephemeral and not very important, so we'll just let it be handled in-process on a separate thread. So we'll route that job over there.

Or it may be that the job is still kind of ephemeral, but a little bit more important. So we could have a dedicated Redis instance sitting on the web server that has just a small set of dedicated memory for that, and you could push those jobs there to handle some of that back pressure. And then for the really important stuff, you could heft those off to, like, your appliance tier of Redis storage that, you know, gives you the full capacity across the entire application? Oh, yeah.

We haven't done something like this for jobs, though I think it could help a lot. But in general, like, in terms of building systems, I think this is a common case of defining priority for different workloads, which also allows you to shed some of the load. So, for instance, it doesn't have to be jobs. It could be something as basic as web requests. And there are requests that go to something that's very important to the business, maybe checkouts, which has the highest priority.

Then you have something medium priority, that's maybe browsing, just the admin. And then you have something low priority, like checking out robots.txt or checking out the sitemap or hitting an API. And by declaring priorities for those requests, when you're under load, you can shed some of the ones that you don't need. And this idea comes mostly from the largest companies in the industry. Like, Google has lots of papers and books on how they do it.

And as you can imagine, every request to a Google service would have some kind of priority. And they've actually shared those. Like, I'm pretty sure that mail is higher priority than watching videos on YouTube. That's really interesting. And one of the neat things about Sidekiq is, if you couch that in terms of background jobs, Sidekiq provides some of that facility just out of the box, even for a simple deploy.

Right? Because you can prioritize. You can say this is in the critical queue, this is in the default queue, this is in the low-priority queue.

And Sidekiq will drain the higher-priority queues first. Now you could start there and then eventually expand out and say, well, I'm gonna give a set of dedicated worker virtual machines or dynos or whatever to process a particular queue, and I may even give a separate dedicated Redis instance or tier, you know, for that particular queue. But you can start with just a simple Redis instance and the default Sidekiq configuration. I'm gonna say, just for anyone listening, because we're talking about, like, scaling large systems, right, like Shopify. But if you're starting a Rails app, for me, the go-to is pretty much, I always reach for Redis, Postgres, and Sidekiq, along with, you know, everything else that comes out of the box with Rails.
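The queue setup described here might look roughly like this with stock Sidekiq; the queue names and jobs are illustrative rather than a prescription:

```ruby
# Workers declare a queue; the Sidekiq process is then started with queues
# in strict priority order, for example:
#   bundle exec sidekiq -q critical -q default -q low
class ChargeCustomerJob
  include Sidekiq::Worker
  sidekiq_options queue: "critical", retry: 5

  def perform(order_id)
    # charge the customer...
  end
end

class RebuildSitemapJob
  include Sidekiq::Worker
  sidekiq_options queue: "low"

  def perform
    # regenerate sitemap.xml...
  end
end
```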

That's pretty much what I always go for when I start a new project. Yeah. I mean, I've used Resque in the past for a lot of projects, and then, yeah, I moved to Sidekiq for my newer stuff. But yeah. When is it too much to background something?

Right? So I wrote a gem that allows me to essentially background any method that hangs off of an Active Record model, which is really convenient. But what I found is it makes it almost too convenient, where if something seems to be slowing down a request, you can just do a .defer before the method name, and it would stick it into the background, which is great. But it got abused, and we ended up with far too much running in the background, hitting those problems you're talking about, like exhausting memory and stuff. So how do you determine what should be backgrounded?
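Not Nate's actual gem, but a sketch of the pattern he describes: a .defer proxy that turns any model method call into a background job, which is exactly as easy to overuse as he warns.

```ruby
class DeferredMethodJob < ApplicationJob
  def perform(record, method_name, *args)
    record.public_send(method_name, *args)
  end
end

module Deferrable
  # user.defer.recalculate_statistics enqueues a job instead of running inline.
  def defer
    Proxy.new(self)
  end

  class Proxy
    def initialize(record)
      @record = record
    end

    def method_missing(name, *args)
      DeferredMethodJob.perform_later(@record, name.to_s, *args)
    end

    def respond_to_missing?(name, include_private = false)
      @record.respond_to?(name, include_private) || super
    end
  end
end
```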

That's a good question. And frankly, as someone who's spent quite a lot of time on that part of the stack, I'm not sure if there is a single answer. I think it's somewhat related to, for instance, if it's Active Record and SQL queries, how heavy are those queries? If your request timeout is 30 seconds and just one SQL query that's, for some reason, heavy, some kind of aggregation, takes 10, and you maybe need to run a few of those, there is no way to fit that into a web request. And, of course, it might not make a lot of sense to do premature optimization, and it can be fine to just start with everything in the web request, in a controller.

And then you find out that's the thing where your app spends most of its time in the web request, and you just move that to a job. Because for simple apps, maybe it will never be a job, and it will scale fine for the next few years. Yeah. I wonder if a good approach would be to first, well, this probably very much depends on if you've got paying customers that are being impacted. Right?

So if paying customers are being impacted and you've got just some inefficiency in a query or some aspect of a web request, maybe you background that. But you also put it in some type of planning process where you revisit that job and try to actually optimize the real root of the problem. Yeah. I tend to use background jobs when I have a performance issue in the request pipeline, like we've talked about before. And then if there's a problem with running it in a background job, you know, it's timing out or something's breaking or something like that, then I revisit it from there.

I don't know if there's a silver bullet. I think a lot of times it's context-specific, and you just have to say, okay, I'm moving this out of the request pipeline. Okay, now it's having a problem here, so now I've got to address the issue there. And, yeah, you know, eventually, it kind of bubbles itself up to the top of your tech debt queue and you address it.

So one thing before we wrap up, do you have, like, some favorite tips or tricks or approaches that you use at Shopify or have used at other employers that make this easier, or, you know, something that you did that you're proud of? Yes. For someone who is curious about performance and fixing those kinds of bottlenecks, my best advice would be to study the whole set and variety of tools that you can use. These tools range from ones as high-level and web-based and simple as Miro and some of the similar services that you can connect to your app and see insights from, to more system-level tools like, for instance, strace. The number of times where strace saved me, or some of my colleagues, in the middle of a service disruption, it's so hard to count those.

And my advice is not necessarily about strace, but knowing the wide variety of tools that you can use. Some of those tools are very Linux-specific and system-level. Some of them are Ruby-level, like rbspy, a great tool by Julia Evans, or rbtrace. And then there are some services that offer those kinds of things. So if you know that range of tools and you know which one is the best for something that you're looking for, you pick it up and fix the thing.

I know we're gonna wrap up soon. I've got just a couple of questions to put you on the spot here. One is, do you know the request volume that Shopify does per second? The public number that I can say is about 80,000 requests per minute. And what about background jobs?

How many background jobs are being processed per minute? That's a great question. And, to be honest, I don't remember those numbers off the top of my head. Yeah. Yeah.

It probably suffices to say that it's a lot. Right? Yeah. It's a lot, and it can be very spiky. And there is a huge difference between steady state and spiky state.

Because Shopify is also hosting some of the world's largest sales, sometimes for celebrities. Sometimes it's World Cups and some special sales where millions of people try to crash Shopify stores. Yeah. I can imagine. CodeFund is tiny in comparison.

Since January, we've done over 300 million. Wow. That still feels like a lot to me. Yeah. We keep changing what's in the background, what's not in the background.

So we've had that number kind of artificially inflated at times, but still, yeah, that's a lot of background work. Yeah. Makes sense. Alright. Well, I'm gonna push us to picks.

Nate, do you wanna start us off with the picks? Sure. So I guess one pick for me today is open source, how fantastic open source is. I've got a thing on the side that I'm doing for my brother-in-law, and it's basically a CRM.

So I went kinda diving around GitHub for open source tools that I might be able to set up for him. And I found Fat Free CRM, which is a Rails-based CRM. It's a bit antiquated in, you know, the way it looks in terms of the UI and UX, but it's pretty fantastic. The data model's solid, and it meets all of his needs, which is terrific. The other pick I've got is cats.

So we've got a Maine Coon and a Russian Blue, and they just provide so much joy for my girls and for the family in general. So highly recommend getting a pet, and especially a cat. Nice. I'm gonna step in here with a couple of picks. The first one that I have is a challenge that I've been doing.

This is a challenge that has been less fun with a broken arm, but, you know, I started it because I just really wanna prove to myself that I can do this. And, yeah, doing it with a broken arm, I wasn't gonna wait to heal, because it's several weeks to heal a broken arm. Anyway, the challenge is called 75 Hard. It comes off of The MFCEO Project podcast with Andy Frisella, and I've picked that on the show before, his podcast. But, anyway, it's basically a challenge that he made up, and it essentially is a challenge to prove that you can, you know, do what you've got to do for 75 days.

So there are five rules, and if you violate any of the rules, then you have to start the 75 days over. The first rule is you have to work out twice a day for at least 45 minutes each time, and one of the workouts has to be outside. So if it's raining, if it's cold, if it's hot, if there's a hurricane, you know, whatever, you're gonna work out outside. And basically, he says that that's just to push you through the, you know what, sometimes you have to do stuff when the conditions aren't ideal. The other rules: you have to drink a gallon of water every day, you have to read 10 pages of a book every day, and you have to choose a diet and stick to it, no cheating, every day for 75 days.

So a lot of diets, you know, people are like, well, I take a cheat day every week. No cheat days. No cheat days on 75 Hard. And then the last one is you have to post a status photo to social media. And so, yeah, I've restarted twice so far.

The first time, I forgot to read the 10 pages, which was dumb. It was the one thing I kind of took for granted that I do, and I didn't do it. The other one, I got a salad from Costa Vida, and I didn't realize that I hadn't told them to take the rice out of it, and I've been doing the keto diet. So yeah. So I started over.

I felt really dumb about that. I was like, I know they put rice in it. I don't know why I didn't ask them to take it out. So, yeah, it's just kind of learning to adapt to some of this stuff, but I'm definitely enjoying the process. And incidentally, just to throw it out there.

So I've been doing the challenge for about a week and a half, and, you know, I'm currently on day 2, just to throw that in there, because I had to restart. The flip side is that I've lost 10 pounds in a week and a half. That's a serious program. Like, you gotta be committed.

Yeah. But he says it's a mental toughness challenge. Right? You're gonna go, and some days you're just gonna have to push through and do some stuff that you really don't feel like doing. Yeah.

Like, the run that I have scheduled today, it'll probably be both of my 45-minute workouts together, because it's one of my longer training runs for the marathon I'm gonna run in October. And, yeah, I'm really feeling it today, especially with my arm and everything else. I do not wanna go out there and do it, but, you know, I've gotta suck it up and go do it. So anyway, but, yeah, you know, I've gotta go do two workouts tomorrow, and tomorrow's a holiday. So Yeah.

Yeah. Anyway, so that's my pick. If you wanna go follow me on Instagram, I think my handle is Charles Max Wood. I've been posting my social media posts there. I tend to try and post them to Twitter and Facebook as well, but I'm not always great about that.

I'm pretty consistent on Instagram. So anyway, Kir, do you have some picks for us? To be honest, I don't know the format very well. If you can just do the TL;DR. Are there one or two things that you think everybody in the world should know about?

Don't put it that way. Right. This one, I think, would be interesting for the main audience, like Ruby developers. A couple of weeks ago, I followed a hacking guide from the MRI committers that shows you how to build Ruby, how to change some simple source in C, how to rebuild it again and see how it works, which also allows you to try all the new features that are coming with Ruby 2.7, because you build it from the master branch. So you can go and try stuff like pattern matching.

That's something that you're excited about. And the reason why it can be interesting for any Ruby developer to try is because you get to see all the magic behind it, just all the C code. And it becomes no longer just that thing that some Ruby committers that I have no idea about build. It becomes something that you can understand a little bit better, maybe. And I think that hacking guide was also made to reduce the barrier to start contributing to that open source.

So I think this point ties back to the point that Nate brought up about open source being awesome. We'll link that too. Very cool. Yep. Cool.

One more question. If people wanna find you online, see what you're working on these days, how do they find you? Yeah. It's kirshatrov on Twitter or kirs on GitHub. Awesome.

Alright. Well, thank you for coming. This is really interesting. I wanna ask, like, a dozen more questions, but we just don't have time, so maybe we'll have you come back. Thanks for inviting me.

We'll be happy to come back. Alright. Well, let's go ahead and wrap this one up, folks. And, we'll come back next week with another episode. Thanks a lot.

Bye. Bye.