Building for Disaster Resilience - DevOps 177

Doug Neumann is a career software engineer. He joins the show alongside Jonathan and Will to talk about disaster recovery. He begins by explaining the difference between high availability and disaster recovery. He also talks about how each one affects your system and how together they make you resilient to the outages and errors you might experience. They also dive into the process of implementing it, and much more!

Hosted by:

Jonathan Hall

Will Button

Socials

LinkedIn: Doug Neumann

Transcript

Will (00:01.361)

What's going on everybody? Welcome to another episode of Adventures in DevOps. Joining me in the studio today, my co-host Jonathan Hall.

Jonathan (00:10.354)

What's up guys?

Will (00:12.289)

And today, our special guest, we have with us Doug Neumann from RPO. How are you, Doug?

Doug Neumann (00:18.898)

I'm doing great Will, it's great to be here, thanks for having me.

Will (00:21.533)

Thank you for being here. I'm looking forward to this conversation. So to give the listeners a little context, you want to tell us a little bit about your background?

Doug Neumann (00:30.282)

Yeah, sure. I am a career software engineer. I've worked at startups, worked at big companies like Microsoft. I've written code, I've led software teams. I'm probably better at leading than actually coding. And I'm not that good at leading, Will, but all that said, I do have production code out there these days. And

Will (00:46.151)

Yeah.

Doug Neumann (00:55.702)

It's fun, it's a joy. So from a DevOps perspective, I think I come to DevOps more from the Dev side than the Ops side, but I've spent a ton of time working on cloud infrastructures and building out operational infrastructures. So I kind of have a pretty good sense for how that gets done these days as well.

Will (01:17.261)

Right on. Speaking of cloud infrastructure, we're going to be talking about disaster recovery today. So I'll just

Jonathan (01:26.458)

I don't know anything about that, so I'm looking forward to learning. I mean, I think that's what happens when I leave the room.

Will (01:33.372)

Right?

Doug Neumann (01:34.047)

The disaster part or the recovery part? Okay, I see.

Jonathan (01:35.502)

I'm the disaster part, the recovery happens after I leave, yeah.

Will (01:37.833)

It's a tag team operation, you know, teamwork makes the dream work. So seriously though, I mean, I run in AWS. I have my servers in multiple regions according to AWS best practices. Why would I need to think about disaster recovery?

Doug Neumann (01:56.586)

Yeah, so just a little context in general, I am a founder of a startup these days that builds a disaster recovery solution for AWS. And so, I believe why you invited me, Will, to come join you guys here today is to kind of talk and answer these specific questions. That's right. And fundamentally, I think resilience is something that everybody builds for, they aspire for, they hope that they have achieved.

Will (02:12.057)

Oh, that's right.

Doug Neumann (02:26.254)

Everybody still seems to have outages. And um...

you know, disaster recovery is kind of part of your strategy for achieving resilience and being able to recover from these outages, especially the big ones, the most existential ones that can oftentimes be threats to the continuity of a business. So in general, we think of resilience as having two halves. There's high availability and there's disaster recovery. High availability is what you build into your workload

so that your system doesn't go down. Disaster recovery is what you enable so that if you go down despite that high availability, you can always get back up. And in a lot of ways it's kind of like insurance.

Will (03:15.889)

Yeah, I don't have that either. I'm kidding. Um, no, so disaster recovery, to put it in, like, a little practical application: we talk about high availability. We run in US East 1, 'cause that's where y'all run too, right? Everything's in US East 1 in AWS.

Doug Neumann (03:17.794)

You don't have insurance, yeah. All right. Yeah.

Doug Neumann (03:34.738)

Well, our workload is a multi-region active workload across three regions, including US East One, which is perhaps the least reliable region that we could have chosen. Yes, it has a reputation for sometimes going down.

Will (03:46.837)

No!

Will (03:54.685)

So high availability is all about using the multiple availability zones in a specific region. And then disaster recovery involves having a way to fail over to a different region should the odd scenario of US East 1 not be available, right?

Doug Neumann (04:18.282)

Well, that is a way to look at it. I think in general, high availability is about architecting a system so that you have redundant components that can...

automatically pick up from each other. Sometimes that's because you're load balancing your workload across multiple systems and if one of those goes down the load balancer stops sending it traffic. Other times it's because you have two redundant systems, one of which is taking traffic, but if it goes down your system is automatically going to fail over to the other and that failover can happen through various means. But in all these cases high availability is hands-off. You don't have to

do anything in order to recover from whatever that event is; the system automatically keeps operating. And to the point you were making, the most common way we see this in AWS is that people deploy workloads across multiple availability zones. Each availability zone is one or more distinct data centers, and that gives you high availability in the face of traditional data center outages. Those are things like power outages, fires,

the data center flood a couple months ago. I think that one might still be offline. But for those kinds of outage events, availability zones give you a great architectural underpinning for building high-availability resilience.
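The load-balancing case Doug describes can be sketched in a few lines. This is a minimal illustration, not how any AWS load balancer is implemented: the `LoadBalancer` class, its method names, and the `az-a`/`az-b`/`az-c` backend labels are all hypothetical, and health is set by hand rather than by a real health check.

```python
from itertools import cycle


class LoadBalancer:
    """Round-robin balancer that skips backends marked unhealthy."""

    def __init__(self, backends):
        self.backends = list(backends)
        self.healthy = set(self.backends)
        self._ring = cycle(self.backends)

    def mark_down(self, backend):
        # A real system would call this from a failed health check.
        self.healthy.discard(backend)

    def mark_up(self, backend):
        if backend in self.backends:
            self.healthy.add(backend)

    def route(self):
        # Look at each backend at most once per request.
        for _ in range(len(self.backends)):
            candidate = next(self._ring)
            if candidate in self.healthy:
                return candidate
        raise RuntimeError("no healthy backends")


lb = LoadBalancer(["az-a", "az-b", "az-c"])
lb.mark_down("az-b")  # simulate one availability zone failing
print([lb.route() for _ in range(4)])  # ['az-a', 'az-c', 'az-a', 'az-c']
```

Nobody has to intervene when `az-b` fails: traffic simply stops going there, which is the hands-off property that separates high availability from disaster recovery.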

Will (05:36.146)

Yeah

Doug Neumann (05:51.274)

You can build high availability across multiple regions in the cloud as well. And generally that looks more like, you know, it's more like NoSQL systems distributed across large geographies,

actively taking traffic in multiple regions at the same time and eventually consistent across those regions. And I'm throwing out a ton of buzzwords here, Will. I'm sorry that I'm just getting into the weeds so quickly. But fantastic, great. You know, with the right architectural investments, you can build yourself a high availability.

Will (06:24.141)

No, no, keep going. I've almost got bingo on my buzzword bingo card.

Doug Neumann (06:36.862)

architecture across multiple regions, in which case you're resilient to much grander outages in the cloud than just a fire in a single data center. That said, there's a lot of architecture that goes into that, and that's not appropriate for everybody. And so that's for many people where disaster recovery comes into play. We're not going to build.

the high availability across multiple regions, but we still want to make sure that if a region goes offline for an extended period of time, we can get back online ourselves and we can use disaster recovery to get us there.

Doug Neumann (07:18.53)

Did I throw too many buzzwords at you?

Doug Neumann (07:23.37)

Nope. Okay. You looked deep in thought, so I, uh...

Jonathan (07:25.426)

I'm trying to think of a good question. Oh, definitely, yeah.

Will (07:25.993)

I think we're ready.

Will (07:29.625)

I am. I'm thinking like, so from a, so let's talk just a little bit about the actual implementation of that. Cause there's, you know, I think a lot about when it comes to technology, there's like this saying of there's three ways to do it. There's do it yourself. There's.

do it with me or I'll do it for you. So if we think about the do it yourself mode for accomplishing this, you kind of touched on that a little bit with talking about like the database specific where you have databases in multiple regions and they're eventually consistent, which is a lot of architecture. And that's gonna be highly dependent on the...

Doug Neumann (08:06.359)

Yeah.

Will (08:24.613)

the skill set, the team that you have, and the different technologies that you're working with. And all of those have to line up in your favor for that to work. But a lot of times we don't get the, we don't get a vote.

in what systems we're working with. So that kind of puts us in a different model. And I think that's where RPO, your company, comes in and helps. So what does it look like from that perspective?

Doug Neumann (08:48.11)

care.

Doug Neumann (08:57.458)

Yeah, so effectively, RPO takes single-region workloads in AWS and knows how to continuously replicate those into an alternate region of AWS. So we provide disaster recovery within AWS. And we should talk about cyber recovery too, because that's another scenario that we handle that's not about regions, but rather about bad actors. But we'll come back to that in a second.

But the way that RPO works is you point us at your production environment, and we go scan it and find everything that you're using in there. All of your data, all of your infrastructure, and the configuration. And we know to back up all of those things using what's built into the cloud. So, you know, there's multiple ways to back up an EC2 instance in AWS, and we know how to do that in multiple ways, depending on the service levels that you're trying to achieve with your disaster recovery strategy.

There are different mechanisms for doing this with an RDS database. There's different mechanisms for doing this with an S3 bucket. And then there's all this networking and security and identity and autoscaling, containers, serverless, all of these things that live around the data. And so we know how to scan all of that, how to back up all those distinct things, typically using the native mechanisms in the cloud platform, and then go restore that into another region.

when you need it. And it all happens through automation. Whether or not you've got Terraform, CloudFormation, Pulumi, you know, your infrastructure as code of choice, we can recreate all those environments and get you back online as quickly as Amazon can launch your systems, again, typically in minutes.

Will (10:49.618)

Right on.

Doug Neumann (10:50.742)

So, I mean, fundamentally, it's just backup and recovery, but the recovery can happen in a different region, and the backup needs to back up a lot more than the data. It needs to back up the entire cloud environment.

Will (10:54.129)

Yeah

Will (11:04.541)

So does that mean you're running duplicates in multiple regions? So like from a cost perspective, did you just double your operating costs?

Doug Neumann (11:14.982)

No, it means that you have to have your data staged in what we call warm storage. So it's not hot, but it's not archived in Glacier where it's going to take you 24 hours to recover it. But in warm storage...

And so you're going to pay the cost of storing your data in a disaster recovery environment. That's fundamental to any DR strategy is having your data backed up in a different location and therefore paying for it a second time. But you don't have to have running systems, running servers, running databases. We're not spinning up NAT gateways, load balancers, all that kind of stuff. The cloud is elastic. We can just bring those things up on demand and that way it doesn't have to impact your

Will (11:44.434)

Right.

Doug Neumann (12:02.256)

A well-done disaster recovery strategy only adds 2 to 3% to the overall cost of your cloud infrastructure.

Will (12:11.061)

Oh wow, that seems like a pretty reasonable cost for having a good disaster recovery strategy.

Doug Neumann (12:19.37)

Yeah, it is. I'll be honest, it depends on the shape of your data. If you have a bunch of massive data sets and hardly any compute, well, you're paying for the cold storage or the warm storage of the data. The ratio of data to compute doesn't get you to that 2 to 3 percent. But in a typical workload that we see, you know, it's just a couple percentage points added to your cloud bill to have everything ready to go in the other region.
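The dependence on the "shape of your data" is simple arithmetic: the added cost is roughly what you pay to store a second copy of the data, divided by the whole bill. A rough sketch follows; the function name and every dollar figure and per-GB price in it are made up for illustration and are not AWS pricing or RPO's numbers.

```python
def dr_overhead_percent(monthly_cloud_bill, backup_gb, price_per_gb_month):
    """Added DR cost as a percentage of the existing monthly bill.

    Counts only the second copy of the data held in the recovery region,
    since no servers or databases run there until a failover.
    """
    dr_storage_cost = backup_gb * price_per_gb_month
    return 100.0 * dr_storage_cost / monthly_cloud_bill


# Hypothetical compute-heavy workload: $50,000/month bill, 20 TB of backups.
typical = dr_overhead_percent(50_000, 20_000, 0.05)

# Hypothetical data-heavy workload: same bill, 400 TB of backups.
data_heavy = dr_overhead_percent(50_000, 400_000, 0.05)

print(f"typical: {typical:.1f}%  data-heavy: {data_heavy:.1f}%")
```

With these invented inputs the first workload lands in the low single digits and the second does not, which is the caveat Doug raises about massive data sets with hardly any compute.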

Will (12:22.821)

For sure.

Will (12:49.031)

Right on.

What do you think, Tarzan?

Jonathan (12:56.786)

This is so far out of my field of expertise, it's hard for me to even think of intelligent questions to ask, which is sad because I used to work for a data or a disaster recovery company. I just didn't work on that product.

Doug Neumann (13:02.314)

What? Let's talk.

Will (13:08.527)

Yeah

Doug Neumann (13:10.538)

Yeah. We did talk though, I mean, you mentioned, Will, like the DIY versus have-someone-do-it-for-you kind of thing. And I'll totally acknowledge, like, RPO is on the have-somebody-do-it-for-you end of the spectrum. We're a pretty turnkey product. You connect us to a cloud environment, tell us you want to recover in this other location. And we just take over and do that. Yeah, there are...

Will (13:23.931)

Right.

Doug Neumann (13:36.274)

A lot of people, especially DevOps practitioners, that want to build stuff themselves and that have invested in a lot of cloud automation and are trying to figure out how to solve these problems on top of those frameworks. And we work with a lot of those kinds of customers. In general, you know, your...

infrastructure as code that you would use to rebuild the environment doesn't contemplate data recovery. It doesn't contemplate things like secrets in Secrets Manager and how you're going to restore those, or users in Cognito user pools, and various things that are kind of, they're not data but they're not configuration, they sort of live in between. And you know we help people

Will (14:04.201)

Mm-hmm.

Doug Neumann (14:24.458)

figure out how to reason about those things and how RPO can solve that for them. But if they're going to DIY it themselves, we have plenty of conversations saying, well, this is how we do it. And this is how you could go build that yourself as well.

And so in general, when we talk to DevOps practitioners about disaster recovery, and they're asking, well, what do I need to do to my CDK investments in order to enable this? It's really about understanding, well, how are you backing up your data? How are those backups gonna integrate with your code in order to recover things when you are restoring a system with data? How are you going to replicate your other configuration,

Doug Neumann (15:07.581)

where your container images are and stuff like that. And there are ways you can do everything that RPO does and build it yourself. And we end up talking through a lot of those things.

Will (15:19.825)

For sure, and I think that model fits a lot to just like the general trend in software engineering over the past, I don't know, probably 10 or 15 years of stop building things that are the same for everyone. Things like authentication, you know, everyone has authentication.

like diving right into, no, your system's good, Jonathan. It's good, don't change a thing. Another example that has come up at every company I've worked at is a ticketing system, whether you use Jira or GitHub issues or, you know.

Jonathan (15:43.351)

I don't. Was I supposed to?

Jonathan (15:48.011)

Okay, good.

Will (16:04.445)

there's hundreds of them out there, or get mad at all of them and decide to build your own and then six months later regret that decision. I think this kind of follows into that model because you touched on a lot of things there like replicating your secrets and your ECR repositories. In addition to the data that you know that you have to replicate in cloud environments, there's a lot of subtle things that happen in the background that you don't realize that you needed to be replicating until you try to fail over.

and then those things are continuously changing. So you could do it once, but then if you're using AWS, they introduce a new feature that makes a change to it. And so you've got to not only run your core business and do your core daily operations, but then also follow the AWS change logs to see how that impacts your DR strategy. Or you can just use something like RPO where that's your core business,

100 percent.

Doug Neumann (17:07.53)

Yeah, totally. There was this, you know, a time when AWS in particular used to talk a lot about the undifferentiated heavy lifting of IT. And, you know, all of the things that businesses that weren't necessarily technology businesses were having to pay people to do, such as racking servers and wiring data centers and things that aren't strategic and material to the bank that you're trying to run.

or something. And so, you know, the promise of the cloud has been that, that they can take over those workloads and let the bank focus on what actually helps them provide better service to their customers. Disaster recovery, unfortunately, was not included in that, that vision. So banks have to build their own disaster recovery. And what we're doing with RPO is just saying, we can take that undifferentiated heavy lifting off of your plate and let you go focus on banking. Banking is maybe a bad example.

but...

Will (18:04.497)

Yeah, unless you're a banker.

Doug Neumann (18:08.918)

Yeah, it's true. Yeah.

Will (18:13.225)

Cool, so you mentioned before we started recording here some horror stories that help illustrate the need for disaster recovery, so entertain us.

Doug Neumann (18:23.914)

Well, I mean, I'll start with like the story that got me into this, which was, um,

This was in 2017. This is a pretty famous event. So there's probably a lot of people listening who lived through this. But back in 2017, I ran a software team that was at the telecom. Telecoms like to own their infrastructure. They're kind of allergic to the cloud. Software engineers like the cloud. They like bright, shiny objects. And so my team was pushing workloads into the cloud in an environment where that was not always appreciated. And

Will (18:57.127)

Yeah

Doug Neumann (18:58.91)

I was the one standing up in meetings saying, well, the cloud is more resilient than these data centers we have. And we're

building more resilient software because of it. And resilience is really important in telecom. There's this five nines of availability is the standard that you're trying to meet. And we were doing great with that in AWS until one day an Amazon employee was performing a routine maintenance operation, made a typo, took down S3, everything builds on S3. It took down pretty much everything in US East one and.

We had a five hour outage in the middle of the day, and that was an eternity. And that was honestly the event in my career where I realized that there was a whole new level of resilience that my team had not even contemplated that we needed to be prepared for.
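To put the five-hour outage next to the five-nines standard, the arithmetic can be worked out directly. This is a small sketch; the function names are illustrative, and it assumes a 365-day year.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600


def allowed_downtime_minutes(availability_percent):
    """Downtime budget per year for a given availability target."""
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)


def availability_after_outage(outage_minutes):
    """Yearly availability if this outage is the only downtime all year."""
    return 100 * (1 - outage_minutes / MINUTES_PER_YEAR)


budget = allowed_downtime_minutes(99.999)   # five nines
outage = 5 * 60                              # the five-hour outage, in minutes

print(f"five-nines budget: {budget:.2f} minutes per year")
print(f"years of budget consumed: {outage / budget:.0f}")
print(f"availability for that year: {availability_after_outage(outage):.3f}%")
```

Five nines allows a little over five minutes of downtime a year, so a single five-hour outage uses up decades of budget and leaves that year at roughly three nines.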

Will (19:36.302)

Yeah

Doug Neumann (19:50.01)

And the thing is, Will, like when you're in that, in the moment there and you're two hours in on this outage and they're not telling you what's going on and you're, you know, the executives are breathing down your neck saying, get us back online. And your answer is, I can rebuild the system but there's no data. The data is all trapped in US East one.

And we don't know, is it gonna be over in another hour? Is it gonna be a day? Is it gonna be a week? What's happening? How long is it gonna take? And luckily, historically, I think the longest outages Amazon's ever had have pushed 24 hours in duration, which is a long time, but that's a lot better than 72 hours.

what could be longer. But all that said, we were unprepared for that particular event. And so we went after that event, we went and built multi-region redundancy for those workloads to make sure that particular outage wouldn't happen again, but we were only focused on that one outage, not an outage of the EC2 control plane, not an outage of Kinesis like happened a couple of years ago, the day before Thanksgiving, the other services we were using, we only now were resilient

to S3 outages, and that probably really wasn't good enough. So yeah, I mean I usually start with that particular story, although that's maybe not the most interesting one, it's just a personal one. We talked a little bit, touched on the idea of cyber disasters, and so much of the conversation that we have with people is about resilience to cloud outages and that's

Will (21:25.565)

Mm-hmm.

Doug Neumann (21:31.406)

important, but there's a lot more ransomware going around these days. There are a lot more bad actors that you need to be concerned about. And most people's investments in security focus on keeping those people out of the environment. But the rubber hits the road when you have to recover from a bad thing that happens once they get in. And so I oftentimes like to spend more time talking about these kinds of events. Like there's a...

I think it's MGM casinos in Las Vegas that are currently offline because of a ransomware attack. I don't know, maybe they've gotten back online in the past 24 hours, but it's been days that they can't, as I understand it, book new hotel reservations. And the casino, sometimes I hear it's online, sometimes it's offline. They're losing a lot of money because of a ransomware event. There's a pretty famous outage from a couple years ago where...

an employee at Cisco who worked on WebEx got mad and just deleted 456 EC2 instances out of their production environment. It took them two weeks to rebuild that production environment. If you think about it, probably everybody listening on this podcast who is a DevOps practitioner working in the cloud has some level of administrative access to a production environment.

Will (22:37.981)

Hahaha

Doug Neumann (22:54.29)

And you have to give that to people. That's, they can't do their jobs without that. But if that person makes a bad decision or if that person's access gets stolen by somebody else, then that gives them carte blanche to go do very destructive things, criminal things. The person that did that got thrown in jail. But.

That said, it's kind of too late. They already had the two-week outage, and thankfully they had the data so they could recover it. It just took them two weeks to rebuild all those systems. You know, disasters come in many different flavors that you have to be prepared for.

Jonathan (23:21.598)

Mm-hmm.

Will (23:33.393)

For sure. And it's not always malicious. I mean, I think it's human nature to go to the malicious intent first, but it could just be, you know, that you were pointed at the wrong environment. You had the wrong environment variable set on your system. Or the one that I've seen over and over again in my career is someone pushes their AWS access keys up to GitHub.

And it literally takes seconds for someone to discover those on GitHub once you commit and push.

Doug Neumann (24:06.366)

Yeah, it's amazing how quickly they can discover that and take advantage of that. The one that's bit me multiple times in my career is I thought I was connected to the test database before I deleted this data. It turned out that was a production database. And I have on many occasions been scrambling to go figure out how am I gonna restore those rows from a backup. Another relatively recent story, this was last year.

Will (24:18.399)

No.

Doug Neumann (24:36.062)

Atlassian. They had two teams, one I guess was responsible for writing a script to clean up some data, another team was responsible for running that script.

Jonathan (24:45.442)

I heard about this, I think.

Doug Neumann (24:47.102)

and they didn't communicate effectively about the arguments to the script. And so when they ran the script, they gave it the ID of the entire site, a subsection of which needed to be deleted, rather than just the ID of that subsection. And that caused them to go delete hundreds of customers'

Will (24:51.453)

Yeah

Doug Neumann (25:10.286)

Jira installations, I believe. So, yeah, it took them three weeks or so, as I recall, to go back and recover that data and get those customers back online. Classic, just miscommunication. There was nothing malicious intended there, but teams don't always communicate well and these things happen. So they're definitely accidental at times.

Jonathan (25:35.378)

So I'm curious to hear a little bit about, I mean, I think when people hear disaster recovery, I think, laypeople in particular, people who don't think about this as part of their daily job, tend to think huge disasters. The data center floods, tornado or a hurricane, or some sort of natural disaster, or some backhoe goes over the power line and the whole data center is disconnected from the world, things like this. But you just touched on another topic of more...

isolated disasters like somebody accidentally deleted 6,000 rows from a database. You don't need a wholesale data restore. So on the one hand, it's probably less of a disaster because the business is still running probably. It's affected a small subset of customers. On the other hand, it maybe seems like a harder thing to deal with because if we just had to restore everything, maybe that would at least feel simpler because we know where we are. We know that everything's gone, everything has to come back.

Doug Neumann (26:12.835)

Mm-hmm.

Jonathan (26:34.246)

when you have this sort of isolated thing, whether it's human error or one disk got corrupted or whatever the case is, how do you plan for these sorts of cases, so that, in particular, so that you can recover quickly? Because in the Atlassian case, you don't want to restore the whole database because that probably takes hours or at least longer than necessary. You just need to restore a small subset of things. How do you plan for these sorts of unexpected things?

Doug Neumann (26:59.722)

Yeah, so first I think what you're hitting on is at some level, your recovery process needs to be aware of the application that's being recovered there. You need to know, like, what is the schema of this database so that I can go extract those 6,000 rows and put them back in? How am I going to filter that table down to just the rows that I want based off of what I deleted?

and whatnot. And so there's a very surgical component that requires you to be extremely deep on the application you're running to be able to do that. And I think in those particular cases you're going to have to have a human involved that understands that application. You just need to make sure that you've got

everything prepared so that they can do that as efficiently and effectively as possible. So first and foremost, you have to make sure you have that database backed up and backed up relatively recently or according to whatever your tolerance is for possible data loss.

And then, you know, it's a question of, do we have a bastion host in place, so that if we launch a new RDS database we can actually get into this environment? And then do we have all the right tools in place so we can pull those rows out of one database and restore them back into the other? And just prepping through a few scenarios like that and thinking about what needs to be there in order for us to...

effectively enact that. But you're only going to get so far with that, Jonathan. I think there's always going to be some component of, do we have the right people on the job in the moment that understand how to take this seed of a solution and actually get us out of this problem?
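The first item on that checklist, having each database "backed up relatively recently" relative to your tolerance for data loss, is easy to check mechanically. Here is a minimal sketch, assuming you can already list the time of the latest backup per resource; the `stale_backups` function, the resource names, and the timestamps are all hypothetical.

```python
from datetime import datetime, timedelta, timezone


def stale_backups(last_backup_times, tolerance, now=None):
    """Return names of resources whose latest backup is older than the tolerance.

    last_backup_times maps a resource name to the timezone-aware datetime
    of its most recent backup; tolerance is the acceptable data-loss window.
    """
    now = now or datetime.now(timezone.utc)
    return sorted(
        name for name, taken in last_backup_times.items()
        if now - taken > tolerance
    )


now = datetime(2023, 9, 15, 12, 0, tzinfo=timezone.utc)
backups = {
    "orders-db": now - timedelta(minutes=10),
    "users-db": now - timedelta(hours=7),
}
print(stale_backups(backups, timedelta(hours=1), now=now))  # ['users-db']
```

A check like this only tells you a backup exists and is fresh; it does not replace rehearsing the restore itself.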

Jonathan (28:45.405)

Yeah.

Doug Neumann (28:47.926)

So I totally acknowledge that what we do with RPO is at the infrastructure layer; we can't go in and do that surgical recovery of the 6,000 rows. We can recover that entire database, but that might be too big of a hammer for you. But what we do instead is we focus on the end of the recovery spectrum that is harder for people to build: how am I gonna take my running workload out of region A and drop it in region B, pick up with no data loss in just a few minutes and keep going.

Jonathan (28:57.339)

Mm-hmm.

Doug Neumann (29:19.02)

That's a pretty hard problem to solve that we come in and give people a turnkey solution around.

Jonathan (29:25.57)

But even then in this case of 6,000 rows deleted, you could presumably restore that database, not replacing the current one, but just sort of a backup, and then query that backup from the restored data to select those 6,000 or whatever you need to do in that case, right, to perform that surgery.
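The approach Jonathan outlines (restore the backup alongside production, then copy back only what is missing) might look like the sketch below. It uses two in-memory SQLite databases as stand-ins for the production database and the restored backup; the `orders` table, its columns, and the `restore_missing_rows` helper are invented for illustration, and a real recovery would need application knowledge to decide which rows to bring back.

```python
import sqlite3


def restore_missing_rows(prod, backup, table, key):
    """Copy rows present in the backup but absent from production.

    Table and key names are trusted identifiers here, not user input.
    Rows written to production after the backup was taken are left alone.
    """
    cols = [row[1] for row in backup.execute(f"PRAGMA table_info({table})")]
    col_list = ", ".join(cols)
    placeholders = ", ".join("?" for _ in cols)
    existing = {r[0] for r in prod.execute(f"SELECT {key} FROM {table}")}
    key_index = cols.index(key)

    restored = 0
    for row in backup.execute(f"SELECT {col_list} FROM {table}"):
        if row[key_index] not in existing:
            prod.execute(
                f"INSERT INTO {table} ({col_list}) VALUES ({placeholders})", row
            )
            restored += 1
    prod.commit()
    return restored


def make_db(rows):
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, item TEXT)")
    db.executemany("INSERT INTO orders VALUES (?, ?)", rows)
    db.commit()
    return db


backup = make_db([(1, "a"), (2, "b"), (3, "c")])
prod = make_db([(1, "a"), (4, "d")])  # rows 2 and 3 deleted; row 4 is newer than the backup
print(restore_missing_rows(prod, backup, "orders", "id"))  # 2
```

Production ends up with rows 1 through 4: the two deleted rows come back, and the row created after the backup is untouched, which a wholesale restore would have lost.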

Doug Neumann (29:41.15)

Yeah, that's exactly the surgery that you have to do. And the best you can really do is just make sure you're prepared for surgery. You don't necessarily know what you're going to have to cut out of the patient. But make sure you've got a scalpel and a clean... gurney? My medical metaphor is failing me all of a sudden here, guys. I didn't go to med school for a reason. I wasn't smart enough, but maybe it would have helped today. OK, fantastic. Yeah.

Jonathan (30:05.486)

I watched M*A*S*H enough, it all makes sense to me. Ha ha ha.

Will (30:07.723)

Oh for sure.

Doug Neumann (30:09.366)

And I'm sure that's very accurate medicine they were practicing in the jungle, yeah.

Jonathan (30:12.254)

Of course, right? Just ask Hawkeye.

Will (30:17.139)

It probably aligns a lot closer to our version of work than not. You know, just where it feels like we're in the trenches and getting shelled while trying to do technical work.

Doug Neumann (30:34.882)

Definitely, I think the conditions are less than ideal, oftentimes.

Jonathan (30:40.442)

Now you said that RPO handles a lot more than data. So maybe we can expand on that a little bit. I mean, suppose I have a Kubernetes cluster running hundreds or thousands of microservices and I depend on some other services, S3 and RDS and who knows what else. Does RPO take care of all of that or is it moving towards taking care of all of that? Where does it sit in terms of handling everything I might need to care about?

Doug Neumann (31:07.062)

Yeah, I mean, that's where we shine is that exact scenario. So, RPO knows how to back up not just the AWS stuff, but all your Kubernetes config, back up your persistent volumes that you've got inside of Kubernetes, understand the relationship between those pods that you're running there and the security and networking.

Jonathan (31:10.352)

Mm-hmm.

Doug Neumann (31:31.586)

container repos and things that are defined outside of Kubernetes at the AWS layer, and then be able to wire all that back up. So in that scenario, we can extract all the configuration from Kubernetes, go create you a new cluster in another region, push that in, but rewrite it all on the fly so that it references the new subnet IDs for where these things should

IAM roles for identity and all that kind of stuff. And so that's the big picture, a great example of what RPO is focused on doing: solving that bigger-picture problem.
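The "rewrite it all on the fly" step can be pictured as walking a configuration document and swapping every source-region identifier for its recovery-region counterpart. This is only a sketch of the general idea, not RPO's implementation: the `rewrite_ids` function, the config layout, and all subnet and role identifiers below are hypothetical.

```python
def rewrite_ids(config, id_map):
    """Return a copy of a nested config with mapped identifiers replaced.

    Walks dicts and lists recursively; any string that exactly matches a
    key in id_map is swapped for the mapped value, everything else is kept.
    """
    if isinstance(config, dict):
        return {k: rewrite_ids(v, id_map) for k, v in config.items()}
    if isinstance(config, list):
        return [rewrite_ids(v, id_map) for v in config]
    if isinstance(config, str):
        return id_map.get(config, config)
    return config


# Hypothetical mapping from source-region IDs to recovery-region IDs.
id_map = {
    "subnet-0aaa": "subnet-0bbb",
    "arn:aws:iam::111111111111:role/app": "arn:aws:iam::222222222222:role/app",
}
source = {
    "name": "api",
    "subnets": ["subnet-0aaa"],
    "serviceAccount": {"roleArn": "arn:aws:iam::111111111111:role/app"},
    "replicas": 3,
}
print(rewrite_ids(source, id_map))
```

The hard part in practice is building `id_map`, since that requires knowing which resource in the recovery region corresponds to each one in the source.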

Jonathan (32:15.61)

How does this differ, aside from the data aspect, which I understand is its own thing? How does that, the whole Kubernetes management, everything, how does that differ from somebody setting up Terraform or whatever infrastructure-as-code sort of scenario? And maybe the answer is it doesn't, it's just most people don't do that, I don't know.

Doug Neumann (32:28.95)

Yeah.

Doug Neumann (32:34.794)

Well, I think it solves a couple problems that people's Terraform doesn't typically solve. So first and foremost is data. When you build a Terraform config to go build out a cloud environment and launch a Kubernetes cluster and push some configuration into it, you typically don't go and say, oh, by the way, I want these persistent volumes created from these snapshots of data.

And so that's a big part of the problem. But there's more than just that. And the example I really like to push on is your definitions in Kubernetes, or, if you're using the Elastic Container Service, your task definitions. They're gonna reference Docker containers that you've probably built and deployed into an ECR repository.

In a disaster scenario, a lot of people are thinking, well, I'm just going to run my build process again. And they don't necessarily recognize that their build process has a lot of external dependencies on things that live out on the internet. Frequently, these are packages, open source packages that are being downloaded and packaged together and wrapped into Docker containers that are pushed. Those packages probably 80% of the time live in AWS and they get downloaded from AWS. And if you're trying to be resilient to an AWS outage, you might not actually have the ability to go and access those things.

So one of the things that's really first principle that we solve with RPO is we should be able to recover you without any external dependencies outside of your own cloud environment. So rather than relying on a rebuild of a Docker container, we take the ones that you already built and pushed into your source environment, and we just restore them into an ECR repo in your recovery environment. So, you know, it's really about knitting all that back together and making sure that it's a full fidelity clone of what you deployed into your production environment.

Doug Neumann (34:59.89)

and just taking away all the considerations of how do I make this portable, how do I make this stateful, how do I be resilient to these kinds of things that happen beyond my control in the event of a real recovery.
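To make the "no external dependencies" point concrete, here is a small sketch of two checks a team could run against the image references in its own manifests: repoint ECR images at a recovery region, and flag anything that is pulled from outside ECR altogether. It assumes the standard ECR image URI layout (account, then `dkr.ecr`, then region); the function names, account number, and image list are hypothetical, and this is not a description of RPO's internals.

```python
import re

# Standard ECR image URI: <account>.dkr.ecr.<region>.amazonaws.com/<repo>[:tag]
ECR_URI = re.compile(
    r"^(?P<account>\d{12})\.dkr\.ecr\.(?P<region>[a-z0-9-]+)"
    r"\.amazonaws\.com/(?P<rest>.+)$"
)


def to_recovery_image(image: str, recovery_region: str) -> str:
    """Point an ECR image reference at the same repo in the recovery region."""
    match = ECR_URI.match(image)
    if match is None:
        # Not an ECR image, so there is nothing to repoint.
        return image
    return (
        f"{match['account']}.dkr.ecr.{recovery_region}"
        f".amazonaws.com/{match['rest']}"
    )


def external_images(images: list[str]) -> list[str]:
    """List images that live outside ECR, i.e. external dependencies."""
    return [image for image in images if ECR_URI.match(image) is None]


# Hypothetical image references pulled from a set of manifests.
images = [
    "111111111111.dkr.ecr.us-east-1.amazonaws.com/app:1.4.2",
    "docker.io/library/nginx:1.25",
]

print([to_recovery_image(image, "us-west-2") for image in images])
print(external_images(images))
```

Repointing the reference only helps if the image bytes have already been copied into the recovery region's repository, which is why restoring the built images, rather than rebuilding them, matters in the scenario Doug describes.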

Will (35:19.525)

I think that's been one of the problems that really highlighted the value for us whenever we used RPO. If you build your infrastructure with Terraform and then...

you know, 100 deployments later, because every day, every week, however often your development teams are deploying new features and updates to your application.

Terraform's not really aware of that because that happens outside of Terraform. But with RPO, every time that changes, it's replicating it over. And then whenever you do have to execute that failover scenario, you're already saving time because all that infrastructure is already built over in the DR environment. You just have to launch it. But then it's also been updated with every deployment so that the latest version of your application code is over there in addition to the infrastructure.

Doug Neumann (36:26.73)

Yeah, I think that in a world where virtual machines are a significant part of your workload, the state on those virtual machines...

is oftentimes very important. Sometimes that's business data, but oftentimes it's application code like you're describing here. But it's still important that you have that recovered in your recovery environment to be able to get back online and that you aren't necessarily relying on people to either by hand or through automation that may be dependent on things outside of your environment, recreating those things. So.

Will (37:04.613)

One of the things we've talked about that is interesting to me is we've talked about different types of events that would trigger a disaster scenario. And I think there's probably some time and effort we could all spend.

in each of our respective teams just talking about what are the most common things that could happen to our environment. Cause that's going to differ from team to team, but really talking about, okay, what, what's the likely failure scenario here and how are we prepared to deal with it?

And that takes a lot of time and effort. And I think it's been one that, in my experience, we don't spend very much time doing that because it feels non-productive. It's like, I just think as humans, a lot of us tend not to do long-term planning and this falls into that category. But there's a lot of different ways where you could be impacted by this and not see the different...

roadblocks that you're going to encounter until you do like a walkthrough of that scenario.

Doug Neumann (38:18.582)

Yeah, I think one way to look at this is, what are the most likely failure modes? The other is, what are the most catastrophic failure modes? Because fundamentally a lot of us work in businesses that are 100% dependent on the technology that we're operating. And if that technology had some form of catastrophic failure, that could put us out of business. So I think you have to ask both of those questions, Will, and I agree with you. It's difficult to find the time to step away from the tickets and everything else that's happening in the daily job to reflect on those things and to brainstorm as a team and then to choose to make it a priority. And ultimately, that is really one of the reasons why RPO exists is because

in the face of these things, people knowing that they need to invest to be resilient to them, they just also acknowledge that they have competing priorities and they don't have the time to go invest there. And so we come along with a turnkey solution that means you can turn up disaster recovery in 30 minutes and go back to all those other things and then sleep easier at night.

Will (39:29.829)

Yeah, and I think one of the things in hindsight that you'll discover as you progress through your career, I commonly hear people say, oh, my manager doesn't give me time to do that, or the business doesn't give me time to do that. And I think one of the ways of thinking about that is the first error there is that you're asking for time to do it. And...

the assumption from the business side of your company, regardless of what company you're working for, their assumption is that you're already doing that. Whenever they ask you to deliver a new feature, they're not saying, hey, stop working on disaster recovery and go deliver this feature. They're assuming that disaster recovery is one of those things that you're just already doing. To use another horrible metaphor,

example, like if you go in to have your appendix removed, you don't tell the doctor, hey, remove my appendix and when you're done be sure and, you know, stop all the bleeding and sew it back up and use antibiotics. You just assume that those things are going to happen as part of the process, and disaster recovery is one of those things that, from a business person's perspective, they just assume that you're taking care of. So by thinking that you need permission to do it, you're setting yourself up for more challenges.

Doug Neumann (41:01.93)

Yeah, totally agree. And I don't know, Jonathan, if you caught that episode of M*A*S*H where they take the appendix out and then they fail to suture. No, that episode doesn't exist.

Will (41:10.347)

Hahaha

Doug Neumann (41:16.779)

Anyways.

Will (41:24.557)

Oh, and by the way, Doug, editor, please cut this out. Whenever we have these long awkward pauses, the editor just cuts them out. So it sounds like we're just like on top of this and just witty and timely the whole time.

Jonathan (41:34.272)

We always sound smarter than we are.

Doug Neumann (41:36.411)

Fantastic, alright, I'm like, this podcast isn't going so well for me right now. We have these, I don't know what to talk to you about, so, anyway.

Jonathan (41:41.694)

Hehehe.

Will (41:44.199)

No, no, we have a fantastic editor who makes us sound much more professional than we actually are.

Doug Neumann (41:49.986)

Fantastic, yeah.

Jonathan (41:52.934)

What else should we ask you about or talk about? Any other topics that you think are relevant? What should me, a person who's too stupid to know what to ask, ask?

Doug Neumann (42:01.866)

Well, I mean, I think at a high level, it's really, do you understand your resilience strategy? And is it aligned with what your business needs? And I will say that DevOps practitioners often go to "high availability as my resilience strategy; disaster recovery, I don't need to worry about in the cloud." And the truth is you need both.

High availability can mitigate a lot of outage scenarios, a lot of disaster scenarios.

And if you have an environment, an organization that will allow you to invest in significant architectural changes or just different strategies upfront, that are oftentimes more challenging to engineer, take longer to build, then you might build yourself a ton of high availability into your application. You'll never build resilience to a cyber attack as a high availability.

Doug Neumann (43:05.344)

that particular scenario. You have to consider how am I going to get back online if the guy sitting in the cube next to me goes rogue.

you have to consider ransomware attacks. They happen all the time these days. And so that gets people typically opening their mind to there is more to what we need to do for resilience than just focus on a multi AZ strategy, which everyone's really excited to build. And it is a great way to achieve resilience. It just doesn't give you all of the resilience that you need to have. So that's where the DR side of the coin comes in. It's not going away. I think a lot of people sort of figure,

did in a data center. In the cloud, we don't need to do DR any longer. And the truth is, the cloud is just a bunch of data centers. And a lot of those same outage scenarios, you're susceptible to, regardless of where you're running your infrastructures.

Will (44:04.521)

For sure, not only are they still data centers, but now they're data centers that you don't control or have physical access to. So when there is an outage, you have even less ability to respond than you did back when you were racking your own servers.

Doug Neumann (44:21.479)

I don't want to go back to the racking servers world. I'm very excited, very happy that I don't have to do those things. But yeah, we can't just forget about the disciplines that we learned over the decades before we moved into the cloud.

Will (44:24.142)

Oh me neither.

Jonathan (44:39.944)

Yeah.

Will (44:41.565)

Yeah, I don't think we've ever made it more than like two or three episodes where at least one of us on the show has said, Oh, no, I don't want to go back to a physical data center. I'm happy to deal with the challenges of the cloud to keep from going back into that cold data center.

Jonathan (44:51.873)

Hehehehehehe

Doug Neumann (44:54.006)

Yeah.

Will (45:05.425)

Well, cool. Is there anything else we should talk about?

Doug Neumann (45:10.274)

We're gonna do picks, right? You told me I had to bring a pick.

Will (45:12.113)

We do? Yeah. Well, let's do picks then. Jonathan, you came prepared, right?

Jonathan (45:12.186)

We gotta do picks. Yeah.

Jonathan (45:20.49)

I did, yeah. I have a couple picks. They're both books. I've listened to both of them on Audible, but you can read them too, if you like, you know, using your eyes instead of your ears. The first one is called The Art of Business Value by Mark Schwartz.

Will (45:22.627)

Alright.

Jonathan (45:37.618)

And he wrote another book, I think he was the same author who wrote a book I read a while back, A Seat at the Table. Yeah, same author, which is about how IT can sort of have an impact on business. And so this book, The Art of Business Value, is basically trying to define the concept of business value, which we talk about a lot, I think, whether we're doing DevOps or agile or whatever, like, does that provide business value? You know, we have these questions. But as he points out in the book, like,

Do you know what business value is? I'll bet you don't. I bet nobody at your company knows. It's a pretty nebulous concept when you actually try to pin it down. You try to staple that to the wall, it's like jello, it just falls apart as soon as you try. So he does a pretty good job of explaining why it's difficult to explain and then helping to define it as well as we can. The most important thing isn't the answer, what's the definition though, it's thinking about it. So it's a good book to read, I recommend it.

Will (46:10.909)

Yeah.

Jonathan (46:36.538)

And then the second book I want to recommend today, I'm still in the middle of reading it. It's called The Art of Action, How Leaders Close the Gaps Between Plans, Actions and Results by Stephen Bungay. And I heard a podcast where he was interviewed. I think it was the No Nonsense Agile podcast where he was interviewed. And so this book was originally a business book, but it's gotten a lot of attention from software developers in the agile community.

And so I'm reading it because it's related to that, but he basically looks back to the 1800s in the Prussian military and some tactics they adopted and how to really succeed in warfare and how that is beneficial in business. Basically, the idea is you can't plan ahead because you don't know what's gonna happen. There's too much random chance. And in the case of war, you have an enemy who's literally trying to...

hamper your capabilities and disrupt your plans. Hopefully it's not that adversarial in most software development, but sometimes it is if you're trying to do antivirus or something like that, or any sort of security related stuff. But anyway, so those are my two picks for the week. The Art of Business Value by Mark Schwartz, and The Art of Action by Stephen Bungay, two arts of books for all the software people out there.

Will (47:56.457)

Right on. Doug, what do you got for a pick?

Doug Neumann (48:00.266)

Well, gosh, so I mean, first-time guest on the podcast, I didn't put in the forethought, now that I hear Jonathan's end of this. I think the pick, I'm going to go with an alcoholic pick here, if I can do that. Yeah. Not that I'm trying to be an alcoholic, but I have some friends that I hang out with too often, they're pushing me in that direction.

Will (48:06.345)

Hahaha

Will (48:16.206)

Nice, yeah.

Jonathan (48:17.586)

Please.

Doug Neumann (48:30.314)

I think that my pick is Amari today. And Amari is the plural of Amaro. And Amari are a class of aperitif from Italy.

You might be familiar with Campari or Aperol. These are sort of at the friendlier end of the Amari spectrum. But I've discovered these recently. A few months ago, I made a cocktail that called for one of these and I bought it. It was not a small investment, but it turns out they're just an incredible...

Will (48:50.837)

Okay. Oh yeah.

Doug Neumann (49:14.946)

complex class of liqueurs that you can make all sorts of incredible drinks from, or you can enjoy them neat, straight up. And so I am here, like, Thursday night I'm hosting my friends over, we're having an Amari night. I've got to figure out three different cocktails to serve them from three different so they can experience this range of everything from, you know, slightly bitter and fantastic

to utter cough syrup that is at the other end of the spectrum that still makes a fantastic beverage if it's mixed the right way. So.

Will (49:47.985)

Hahaha

Will (49:57.218)

So does NyQuil qualify as an Amaro? No. Ha ha ha.

Doug Neumann (49:59.942)

I don't know. Actually, when I first had Fernet-Branca, which is one of the Amari that we were going to have on Thursday night, I was like, I feel like this is going to knock me out. I'm going to have a great night's sleep and I'm going to wake up with no more nasal congestion. So it didn't turn out to have that effect, but there was a moment I had hope. So.

Jonathan (50:01.246)

Ha ha ha!

Will (50:13.798)

Hahaha!

Jonathan (50:14.108)

Hehehe

Jonathan (50:22.927)

Must be missing some ingredients.

Will (50:27.161)

Right on, that's cool. I just recently was introduced to those because we do happy hour with our friends every Friday night, and occasionally they have the distributors come in, and they came in with a peach flavored drink that was like that. And it was like, wow, that's actually.

actually pretty nice.

Doug Neumann (50:58.85)

They're delicious. Yeah.

Will (51:00.037)

Yeah, yeah, for sure. They go down really, really easy.

Will (51:08.329)

Cool, so my pick, I think, I want to say I picked this book last week, but it was just so good I'm going to pick it again. Developer Hegemony: The Future of Labor by Erik Dietrich. And the reason I like this book is because it talks about, a little bit about

how we got into this current working environment and how it doesn't really apply to software development, but there's centuries of this is how we've always done it, leading to that. And so it's good at rethinking, whether you work in DevOps or software engineering, it's good perspective on how to rethink the value that you bring to the business and...

how to just use that knowledge to advance your career. And he specifically talks about the fact that the way most of us advance in our career is through job hopping and then addresses why that is. But it was a pretty cool book. So that's my pick.

Jonathan (52:11.026)

I'll second that. That's a great book. He has other good books, too.

Will (52:13.113)

Right on. I haven't read his other books, but I really do like his writing style. He does a great job of just making the pages drip with sarcasm when he wants to.

Jonathan (52:25.282)

Yes, he has a great blog too. He's not as active as a blogger now as he used to be, but daedtech.com is a great blog for a lot of that same sort of wisdom.

Doug Neumann (52:35.658)

Well, I just want to point out that the intellectuals on this podcast both brought books and the lush on the podcast is

Will (52:41.906)

Hahaha

Jonathan (52:45.241)

If I had thought about liquor, I probably would have recommended a whiskey. My birthday party was last week. I'm a whopping 44 now. We had some great whiskeys, so maybe I'll save that for next week.

Doug Neumann (52:55.918)

That's a good one.

Will (52:57.757)

For sure, yeah, you bring the whiskey pick, I'll bring the cigar pick, and Doug, we'll have to have you back on so that we can just make this a drinking podcast. Ha ha ha, adventures in the dark. I mean, it's 8:30 in the morning here, that's cool, right? You're not gonna judge if I, right? Oh.

Jonathan (53:01.554)

There we go.

Jonathan (53:09.867)

Adventures in drunk DevOps.

Doug Neumann (53:10.948)

Fantastic. Adventures in something other than DevOps, unfortunately.

Jonathan (53:16.35)

Ha ha ha!

Doug Neumann (53:16.962)

Yeah.

Jonathan (53:22.558)

It's 5:30 somewhere. Oh, it happens to be 5:30 here. Oh, great.

Doug Neumann (53:26.187)

No. Hahaha.

Will (53:26.841)

Even better, we'll use Jonathan's time zone for the podcast.

Doug Neumann (53:31.266)

Classic.

Will (53:33.085)

Cool, so Doug, if people want to get more information from you or talk with you about this or just find out more about you, how can they do so?

Doug Neumann (53:41.066)

Yeah, so I am doug at rpo.io, and you can always just go old school and send me an email, but most of the community dialogue we're having these days is out on LinkedIn. So track me down on LinkedIn, send me a connection request, let me know that you heard this. I get a lot of unsolicited connection requests these days, and most people try to sell me something, so it, you know, certainly...

Will (53:49.057)

Nice.

Will (53:55.163)

Okay.

Will (54:04.694)

On LinkedIn? No!

Doug Neumann (54:10.934)

would be beneficial, I think, if you let me know you at least listened to this podcast. Probably a lot of people listening to podcasts are gonna try to sell me something now.

Will (54:19.953)

Hey, I listened to this podcast. Can you give me your credit card number? Yeah. Cool. Well, Doug, thanks for coming on the show. This was very enlightening. And I think it's one of those areas that we could all spend a lot more time thinking about, especially now knowing that disaster recovery doesn't have to be the big overhaul that I initially thought that it would have to be.

Jonathan (54:23.398)

Hehehehehe

Doug Neumann (54:23.53)

Right? Yeah.

Will (54:47.193)

So thanks for coming on the show and sharing that with us. And everyone who's listening, thanks for listening.

Jonathan (54:54.366)

Thanks, until next time.

Doug Neumann (54:54.762)

Yeah, thank you for having me guys. It's been great to be here. Alright.

Will (54:57.605)

Right on. Take care.
