The Evolution of Disaster Recovery Strategies in Modern Cloud Environments - DevOps 186

Sagi Brody is a seasoned technologist and CTO at Opti9. They share insights on disaster recovery strategies, the importance of effective documentation, and the challenges of managing resilience in the cloud. They discuss the need for standardized tools and processes, the impact of technological advancements on traditional strategies, and the increasing complexity of multi-region and multi-cloud setups. With a focus on practical experiences and industry trends, this episode provides valuable perspectives on the evolving landscape of DevOps and the essential skills needed to navigate it effectively. Tune in to gain valuable insights into managing resilience, disaster recovery, and the importance of clear and accessible documentation in the world of DevOps engineering.

Hosted by: Will Button
Special Guests: Sagi Brody

Transcript

 
WILL_BUTTON: What's going on everybody? I'm your host today, Will Button, for Adventures in DevOps. And before we get started, I do want to remind everyone, or tell you for the first time maybe, that we are now doing these shows live in addition to the podcast. So if you do want to catch it live, we're recording on Tuesdays at 9:30 Central Time. That's GMT minus six. And actually it's like 9:30-ish because we start at 9:30, but then usually have a little bit of a pre-chat to get our guests up to speed on how the process works. And then I click the go live button. So shortly after 9:30 Central Time, you can catch us live on Facebook, LinkedIn and YouTube. And speaking of our guests, today I have Sagi Brody, chief technology officer of Opti9, consultant, and former software developer who, according to his own words, still has code in production that probably shouldn't be. And I can definitely relate to that. Today, we're going to be talking about resilience and disaster recovery in the cloud, how it's relevant, why you still need it, and then dig into that. Sagi, welcome to the show.

SAGI_BRODY: Thank you. Great to be here with you, Will. I'm excited to speak to a technical audience, which is not always the case. So, you know, see if I get called out on anything, but it's great to be able to get as deep as I want to be without having to pull myself out.

WILL_BUTTON: Right on. That's always my biggest fear in doing talks and live shows like this. It's the part of the show that I call Stump the Chump, where you say something wrong and somebody just calls you out on it.

SAGI_BRODY: That pressure can be very useful if you harness it the right way. I used to force myself to volunteer to speak at conventions on highly technical topics that I knew nothing about. And I had like maybe two months. And so these topics that I was putting on the bottom of my list, that I knew I needed to learn but was just dragging my feet on, you know, now I have a time and day where I'm the expert and I'm gonna potentially get stumped. And so, you know, fear of embarrassment is a very great motivator.

WILL_BUTTON: Oh, for sure. I like the approach. That's bold. But that's definitely going to be effective. I like it. I may steal that from you. Yeah. Cool. So tell our viewers a little bit about your background and how you got to be the CTO of Opti9.

SAGI_BRODY: Sure. Yeah, so like probably many who are listening, I got into this industry as a teenager, just, you know, screwing around with computers and the internet and having fun. And that curiosity somehow turned into a job, which is great. So it was the late 90s. My co-founder and I, at a company called Webair, were kind of in the right place at the right time. And we just started hosting websites for our friends. So you can kind of think of it as a hosting company. We were working with technologies like FreeBSD and Apache, and the typical web stack. This was great because it was before Google and before things like PHP and before things like customer service or support. It was kind of like sink or swim. We just built everything ourselves and scaled up with our customers. We grew that business, pivoted towards enterprise maybe 10 years after that, and started to focus on management of private cloud deployments, management of public clouds, orchestration, and sort of owning the glue in between these hybrid cloud environments. So a lot of networking, which is always fun. And then I got into BCDR, so Disaster Recovery as a Service, backups. Networking was always our secret sauce, which is fun. Saying things like, it's great that you're copying your data somewhere, but how are you going to consume it? What does consumption look like? You know, these are networking problems. So I've always been a big network guy. Eventually we sold that business to private equity. I stayed on as the CTO, and we rebranded to Opti9 after we bought or merged with two other companies. And I'm still here, mostly in more of a chief technology product officer role, which is starting to become a thing now. You know, CTO is so vague, there's different personas. So really, my role is sort of product-focused, you know, customer-focused. What are we building for customers? How are we helping them? How are we working within the bounds of the third-party technologies that we use from an integration perspective? How do we push the envelope? And I'm also doing some consulting on the side and just trying to stay busy.

WILL_BUTTON: Right on. I think that's a cool perspective or a cool journey because for a lot of us we end up spending a few years at a company and then jump to another company. And so we end up going from company to company. I've done it myself just to, well, let's be honest, I've done it for salary increases, but also because of the opportunity to work on different technologies that I wanted to go deeper into. But I think that's a cool path and an unusual one in the fact that you've been with the same company even though there have been mergers and acquisitions along the way. So you've really built your skillset in, um, driving the company as it matures versus driving your skills, your skillset as you mature.

SAGI_BRODY: Yeah. You know, I would say we, we were a service provider and a service provider is interesting. It's very different from an enterprise environment. A lot of people don't realize the nuances and differences. It's funny, whenever a vendor used to call us and try to sell us something, we'd be like, all right, cool. Do you have multi-tenant capabilities? No, I'm like, okay, do you know what a service provider is? Like, realize like, and then we'd say, well, listen, if you can make a service provider happy, you can probably make anybody happy. So, we'll tell you what you need to build. But you're absolutely right. What was great for us was that our customers were sort of in multiple industries and verticals, trying to run different applications and solve different problems. I'd say the market is more sort of segmented now and mature now, but what's great about being a service provider is your customers are coming to you with the problems that need to be solved, with the use cases. As the industry changes and grows and as there's new shiny objects, they're coming back to you and pushing you and saying, hey, we heard about this really cool thing. We want to use it. Can we use it? And it's like, you know, it's like, yes, you of course, and then we'll figure it out on the back end or no. And it's like, you want to lose a customer. And listen, if there's value in what they're saying, and you've heard it more than once in the last four weeks, then you listen to it. And if you think someone else can benefit from it, you listen to it. Our customers have been pushing us always. And that's sort of driven innovation. And so we never had to. Okay, maybe you're not like going way outside of your, out of your target zone, but you're constantly pivoting. You're constantly trying to keep a leg up on your competitors. Because if your customers are asking you for that, then competitors are hearing the same thing. 
And so I think where we've done well is, you know, we've never owned our own IP, but we've owned our own glue. And that's empowered us to mix and match best of breed and just innovate and be at the forefront. So I agree with you. I think service providers are a great place to be. And, you know, listen, I was a founder, so maybe it's a little different, but within our environment over the years, a new problem to be solved would come in and maybe one employee would just jump on it and be like, hey, that's cool, I can do that. And as a smaller company, we'd be like, all right. A silly example is when NoSQL platforms got big and people wanted us to manage Mongo, and we had one gentleman who was like, yeah, I'd love to do that. It's like, all right, great. Next week, that guy is the Mongo expert. Everything that has Mongo in it goes to him. And so there's just a lot of opportunity for self-growth there, if you can recognize it and take it.

WILL_BUTTON: Yeah, for sure. And I think that's the key to longevity in this space is a desire to continually grow and learn new skills.

SAGI_BRODY: Yeah.

WILL_BUTTON: But there's also some old skills that we can't let go. One of those being disaster recovery and backups. And you mentioned it before we started recording. It's one of those that seems to have been pushed on the back burner over the last 10 years or so, um, but doing so has some definite, um, some definite impacts to your business. So talk to us a little bit about resilience and DR in cloud.

SAGI_BRODY: Yeah, I'd love to. So, you know, we have traditionally provided a disaster recovery as a service offering for, you know, I don't want to say legacy, but sort of non-cloud-native applications. So things that are not necessarily running on AWS or Azure. And so over the years, if you look at maybe a deployment running on VMs, on VMware or Hyper-V or KVM, we would basically provide the entire ecosystem needed so that your applications would continue to operate despite some sort of outage at the production site, or a cybersecurity event, or somebody fat-fingering a database. And so what that looks like is obviously replicating the data. But more important than that is understanding consumption. What does consumption look like? How are your users going to consume the application from the DR site as they did in production? And that's a big networking challenge to deal with. And then also dealing with dependencies, like what about all these shared services that they're relying upon, like authentication or networking, DHCP, IPAM, stuff like that. And so we take ownership of authoring the runbooks, not only for failover and failback, but what if it's just one application that you want to fail over? And what do you do with shared resources? You know, if you have a legacy database server, which is a weird thing to say, and it's hosting databases for 10 different applications and you want to fail over one, do you bring the database server with you? So, always interesting situations and challenges. And then, you know, when public cloud started getting popular, you know, I had a pretty pessimistic outlook on disaster recovery in general. And I think the entire industry was excited about the fact that, like, we won't have to deal with that anymore.
We have the ability now to just build applications that are inherently resilient, you know, from the bottom up, and we'll deploy them on the cloud. They'll be self-healing. And then we don't have to deal with this. And I think what's happened is that people have tried that. People have tried to build these applications that will run, let's say, in multiple AWS regions. And they realize the complexity involved in building the applications from the start with that thought in mind is just far beyond the bounds of what they want to deal with. And we see that even when you invest time and resources into that, it doesn't necessarily mean that it's going to work. Every time AWS East has an outage or goes down, how many very large, popular, household-name sites go down that are technology companies that we know are deploying in multiple regions? So why are they down? Because it's almost like it's this impossible thing to build. And it's not always their fault. The interdependence between their applications and even third-party SaaS or third-party PaaS raises the question: can they actually test this thing? Can they actually test their resilience plan without affecting production? So what I've seen is the industry going towards a middle ground. And some people don't even realize you can do this, but you can basically deploy an application in a single region, not have to build this whole resilience concept into your application from day one, and then employ traditional disaster recovery strategies to gain resilience for your app. So you deploy in one region, and maybe now we can use replication tools that are more cloud-native focused.
And then we can still take all of those things that we learned over the years from traditional disaster recovery: things like dependency mapping, building runbooks to deal with different situations, and building network strategies so that I can test at the DR site without poisoning my production data. If your production app is connected to the Salesforce API and you bring up your app in DR and you start playing with records, it's like, oops, we're modifying production data. So take all of these core disaster recovery strategies, add a modern data mover that knows how to replicate or rewrite resources, like the Terraform or CloudFormation they use, apply everything from traditional DR, and you can actually achieve resilience without having to go crazy on the development side.
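
The dependency-mapping question raised in this answer (if one application fails over, which shared services have to come with it, and in what order?) can be sketched in a few lines of Python. This is only an illustration under assumptions: the service names and the `DEPENDS_ON` inventory are hypothetical, and a real runbook would also cover networking, data replication state, and failback.

```python
from graphlib import TopologicalSorter

# Hypothetical inventory: each service maps to the services it depends on.
DEPENDS_ON = {
    "billing-app": {"shared-db", "auth"},
    "crm-app": {"shared-db", "auth"},
    "shared-db": {"dns"},
    "auth": {"dns"},
    "dns": set(),
}


def failover_closure(app, depends_on):
    """Return the app plus everything it transitively depends on."""
    needed, stack = set(), [app]
    while stack:
        node = stack.pop()
        if node not in needed:
            needed.add(node)
            stack.extend(depends_on.get(node, ()))
    return needed


def failover_plan(app, depends_on):
    """Order the closure so dependencies come up before their dependents,
    and list dependencies that are also used directly by services that
    are NOT part of this failover (the shared-resource problem)."""
    needed = failover_closure(app, depends_on)
    graph = {node: depends_on.get(node, set()) & needed for node in needed}
    order = list(TopologicalSorter(graph).static_order())
    shared = sorted(
        dep
        for dep in needed - {app}
        if any(dep in deps for other, deps in depends_on.items() if other not in needed)
    )
    return order, shared


order, shared = failover_plan("billing-app", DEPENDS_ON)
print("start order:", order)
print("shared with apps staying in production:", shared)
```

The `shared` list is the part that needs a human decision: each entry is a service, such as the shared database server, that another application still relies on in production, so moving it is not free.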

WILL_BUTTON: Yeah, one thing that you mentioned there a couple of times that I think is really key is testing. And every time I think about that, it reminds me of way back early in my career, decades ago, my boss asking our team, hey, are you guys ready for a disaster now? And we're like, oh yeah, we're all set. And he's like, okay, great, everybody show up on Saturday. And so we showed up that Saturday, and he had rented a conference room in a hotel, had some servers sitting there, and he had our backup tapes. He's like, great, restore everything. And we didn't even make it five minutes into the process before we realized, oh wait, we don't have the floppy disk to update our BIOS, or we don't have the boot disk to reinstall the operating system. And it was a really, really long and painful day, but the lessons have stuck for a couple decades now.

SAGI_BRODY: Yeah, it's great when people are sort of overly focused on data replication when it comes to disaster recovery. Or even worse, some people just think that their backup strategy is also their resilience or disaster recovery strategy. We won't get too much into that, but you have two separate goals with two separate strategies to employ. So yeah, you really need to pre-author the runbook. And what's interesting now, too, is that if you look at an event like a ransomware attack or a cybersecurity event, you know, the incident response plan or the disaster recovery runbook is not something that a single team would be dealing with. Right? Like, a DevOps team is responsible for the uptime and resilience of an application, and presumably they own all this orchestration for production to DR, or multi-region, and failover. That's great. But now bring in this security aspect, where the need to fail over was in relation to a security event. Now you have a completely new team. Maybe it's an internal SOC or security team, or an external MSSP. And now you have these two teams that, unfortunately, in many organizations don't speak that much. And now they need to be in lockstep as part of the incident. And if you think about a CTO or CIO at a higher level, they kind of become the quarterback between these teams during an incident. And it's not something that I think they even realized they were ever going to have to deal with. And so the incident response plans and the disaster recovery runbooks need to be inclusive of who owns what during a security incident. Can you even bring up the application at the DR site? Do you want to?

WILL_BUTTON: So how does a team that maybe they recognize that their DR, their resilience plan isn't where it should be? What are the first steps? Because to get this done, you need to devote time and resources and it has to be prioritized. Sometimes you have to prioritize it above day-to-day operations. I think specifically it comes down to what are you going to say no to so that you can, so that you do have the bandwidth to say yes to this? So what are some good early steps for people once they recognize that they're not where they should be?

SAGI_BRODY: Yeah, so I think it's a good question. I think the first thing that they need to do, and I think the market has matured a bit here and this is fairly obvious now, but the teams need to kind of sit down and figure out what they have an appetite to take ownership and responsibility for in this realm. And so if you look at a traditional DevOps, how this goes in general for a DevOps conversation is are we application developers? Are we SREs? Who is responsible for ongoing management, metrics collection, efficiency? And obviously, that's a, there's no right or wrong with any of these things. And a lot of it has to do with sort of the, uh, the DNA of the company and what they kind of want to be when they grow up and do they want their, you know, certain IT teams, you know, adding value to the business or managing infrastructure. Um, and so, you know, we'll see, I'll see smaller organizations that are like, you know, we're a small team, we own everything, so we're going to just internalize it and also see very large organizations that have, you know, an abundance of resources. And they basically make the decision that we don't want to be in the business of managing disaster recovery. We don't want to be responsible for it. We'd rather outsource it. And an interesting thing to think about here is the complexity of all of our applications and our deployments, they're not getting simpler, they're getting more complex. In fact, I think you can argue that part of the goal of DevOps these days, part of one of the things they should be striving for and maybe even a key metric to focus on is to what extent am I making my, you know, the deployment that I'm managing, to what extent am I making it simpler and less complex? And obviously the more complex, the harder it is to manage, to monitor, to scale, to secure, and to make, and to make resilient. So I think people need to acknowledge that. 
And when you have that conversation, one of the answers that comes out of that conversation could be, hey, we wanna make it simpler. How do we make it simpler? Why don't we outsource certain layers and certain responsibilities? And disaster recovery and resilience is an easy one to outsource. It's low-hanging fruit. Typically, it does not affect your production too much. If you can use sort of that middle ground strategy that I mentioned at the beginning. You don't have to modify your application, you know, much at all in order to be able to achieve resilience. So that would be my answer. The first thing you have to do is sit down and figure out, you know, what is your appetite to manage and own that internally?

WILL_BUTTON: Yeah, for sure. And I think that's a huge selling point. If you have a strategy where you don't have to modify your existing infrastructure or application a whole lot, that's always going to be a big selling point. Let's do this: take a step back and help me understand why moving to the cloud, or using cloud providers like AWS, is not a DR strategy in itself.

SAGI_BRODY: Yeah. Well, you know, as everybody knows, right, it's like going to Home Depot: they're giving you the right tools and you've got to make of them what you want. So you have to look at it on a per-platform basis. So if you look at something like S3, which obviously is being stored in multiple availability zones within a region, and even has its own built-in cross-region replication, you're probably good there if you wanted to build a disaster recovery strategy between, let's say, East and West. From an S3 perspective, it is fairly straightforward. You can kind of put a check next to that layer as far as your data being available at the DR site. Recovering from a cyber attack or a manipulation of the data, that's another story. But if the entire region goes down and you want your application back up and running within a set RTO, you can kind of put a check there. For other platforms, it's not always the case that that is done. Typically, it's not. And so there are snapshot capabilities that exist, but then there's this entire orchestration task that sits on top of all that. So you have all of your configurations and resources maybe at another site, but now your applications are not necessarily written to be able to reference those IDs at the DR site. So it's really a replication and orchestration strategy, right? And so what we'll do is we'll look at your various applications, and then we'll look at the AWS side, and we're doing this mostly for AWS today, in addition to the legacy environments which I mentioned before. For a public cloud, we'll look at the various services in use and we will employ underlying AWS technologies to ensure the data is up to date at the DR site.
And so maybe that is cross-region snapshots, or maybe that is AWS DRS, which works very well for certain platforms but can be expensive. So now we get into the application criticality question of how critical it is for each application to be up and running, and we match the right replication technologies to the cost and to the application criticality. Beyond that, we're using orchestration tools. One that we use is called Arpio, which will orchestrate some of this back and forth. And Arpio might be something that's great for a team that wants to internalize all this and just say, we've got a tool, let's use it. Where Opti9 comes in is, it's not just about the tool. It's, do you want to take ownership of the failover process and the failback process? Do you want to take ownership of testing, building the network integration strategy, building the automations into, let's say, DNS, maybe SD-WAN policies, and so on and so forth? So we kind of sit on top and own the entire process, soup to nuts, so that DevOps teams and IT teams can just wash their hands of it and focus on building applications.
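
The idea of matching replication technology to application criticality and cost can be sketched as a simple tiering rule. This is a rough illustration, not anyone's actual methodology: the RTO thresholds, the application names, and the `pick_replication` helper are all hypothetical, and real tiers would come out of business requirements and budget.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class App:
    name: str
    rto_hours: float  # target time to be back up after a region outage


def pick_replication(app):
    """Map an application's RTO target to a replication approach.

    The thresholds are assumptions for illustration only.
    """
    if app.rto_hours <= 1:
        # Most critical tier: continuous replication, the most expensive option.
        return "continuous replication (e.g. AWS DRS)"
    if app.rto_hours <= 24:
        # Middle tier: periodic snapshots copied to the DR region.
        return "cross-region snapshots"
    # Least critical tier: a recoverable copy with no fast failover.
    return "backup copy only"


apps = [App("checkout", 0.5), App("reporting", 12), App("archive", 72)]
plan = {app.name: pick_replication(app) for app in apps}

for name, strategy in plan.items():
    print(f"{name}: {strategy}")
```

Keeping the rule in one small function makes the cost conversation explicit: moving an application up a tier is a one-line change with a visible price attached.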

WILL_BUTTON: Right on, yeah.

WILL_BUTTON: I'm actually an Arpio customer, and it's a great tool. It's one of the few tools I've seen that just does what it says it's gonna do at an exceptional level. But just like you mentioned, that's only part of it. That handles the infrastructure. There's still the whole human aspect of verifying what you've replicated, doing a failover to it, testing it, and making sure it works. And that's another full-time job in itself.

SAGI_BRODY: It is. And the funny thing for us is, again, we have been a company that has been providing disaster recovery as a service for VMware platforms, physical servers, IBM iSeries, and Xen- and KVM-based applications. You can say we're a technology company, but it's really that glue that we're owning. But we have brought in best-of-breed data movers and replication tools to focus on specific platforms. And so we brought in Arpio, and it's like, hey, here's the best-of-breed tool for cloud-native AWS apps. But everything else that we're doing, all the value we're providing, and all the wrappers around the replication tool, they're all the same as what we were doing five, 10 years ago, which is actually pretty cool. If you can stay up with the tech and you can build a platform that can support multiple integrations in a modular way, you can stay relevant through all of these crazy cloud changes.

WILL_BUTTON: For sure. We had Doug from Arpio on the podcast a few weeks ago. I should do another episode with both you and him and just go into a deep dive on this whole subject.

SAGI_BRODY: So him and I, and I've known him for a while and I really am super bullish on their platform. I think it's amazing. Um, him and I are doing a webinar tomorrow actually, uh, about all this in detail.

WILL_BUTTON: Oh, right on. I will get that from you and make sure that's in our show notes when this episode goes live. Awesome. That'll be a cool talk. When it comes to DR in the cloud, you mentioned that providers like AWS have a lot of the tools built in. You just have to look at them on a case-by-case basis, see what those tools are, and make sure that they're enabled and that they're working properly for you. How often do you see the need for, or do you recommend, cross-provider DR strategies, like replicating our AWS environment in Azure or GCP? Because that brings with it an exponential increase in overhead as well as cost.

SAGI_BRODY: Yeah, that's a great question. I think that you kind of have to look at three buckets here in general. First you have high availability, the ability to achieve high availability, right? I think in order to build high availability for your application cross-region or cross-cloud, you're really not going to be able to get away from building your application with that intent from day one and having to apply so much more complexity to your application and to your CI/CD process. And really, the level of expertise that you need from your developers is, I think, on another level, right? If you're just starting the process of building an application now, and that is your goal, you can't say, oh, we'll just do that later. No, it has to be in the DNA of your application. This is also an interesting point when you start to think about integrations with third parties. You start to think about all of the third-party providers that you're going to utilize from an API perspective or from a data perspective. If you have this mandate to have resilience and high availability as part of your application, or security, and you build a framework or requirement around that, you need to have those conversations with those third parties before you start using them and not after. Because if they're the weakest link in the chain from that perspective, if they don't have the resiliency to provide you with the options you need, then you're stuck. I think too many companies have SaaS sprawl, or they just started using them. And then you might spend, I don't know, years building an HA application that works across clouds, but one of your vendors is not locked up and, boom, you've achieved nothing. So understand the differences between HA, backups, and traditional DR as applied here. And really, I would say, figure out where you want your vendor lock-in to be, right?
If you're okay with vendor lock-in with one cloud, that's fine. I don't think there's anything wrong with that, especially again, if you're building your application with forethought into that. I'm sure you've seen it many times, people that are like, I'm going to use AWS Mobile Regions, but I'm purposely not going to use any platform services. I'm going to run my own SQL instances and kind of go backwards in that way. Fine. if you're using RPO, as far as I'm aware, today it does not have any cross cloud replication capabilities, but let's say it did. Great, now your vendor lock-in is on that level. So I would say a lot of this is sort of risk aversion, risk mitigation. I think the likelihood of all of AWS going down and having a need for sort of cross cloud is hopefully very little to none. But I think a single region outage, as we've seen, is, you know, fairly, it's definitely an realm of possibility and happens. But I do think what you're saying makes sense from a backup perspective, right? Maybe we don't need, you know, an RTO of being able to failover from AWS to Azure within four hours or 24 hours. But if we're copying our data, if we're having a copy of our data there, and we understand what the path to to sort of bringing it back up looks like. I think that you're in better shape than most are today.

WILL_BUTTON: Yeah, I agree with that 100%. As a consultant, I've had multiple companies come to me over the years and say, hey, we need to implement DR. We can't trust AWS, or we don't want to trust AWS, so we want to use multiple cloud providers. And my approach with them has always been, and I'm not picking on AWS here, I don't think that's the weak point. And then we go through and look at their stack, and it always comes down to the fact that AWS hasn't been the weak point. They've chosen to use a managed database provider, and so all of their data is not even in their AWS environment, or they have all of these external dependencies like Salesforce or different things like that. And it's like, okay, you can replicate all of your infrastructure over to another provider, but this third-party dependency is still a single point of failure, and it's much more painful if that goes down. Which makes me think along those lines, since you work with a lot of companies in this, how willing are third-party vendors to talk about what their own internal DR and high availability strategy is? They all have their boilerplate, off-the-cuff answer that they have to provide.

SAGI_BRODY: And it's always gonna be pretty vague, and you're probably gonna have to go back two or three more times. And sometimes they'll just refer you to the SLA. And obviously their SLA credit mechanisms, like most, are gonna be just a joke, right? And so it is a risk. I will say on the compute side, I do think that Kubernetes has democratized the compute layer and has made it very easy to deploy your code where you want, when you want to. But you're right, it is the database layer and sort of the rest of the shared services layers. And it is kind of a hard pill to swallow because, again, if you want to manage and operate your own databases, that's fine. It'll be less expensive that way. You'll save money, you'll have more control, and you will be able to make good on this sort of cross-cloud resilience if you want to. But now the operational overhead has increased. And so part of what we've done, and what I've been dabbling in with some consulting, is just doing that dependency mapping, application mapping, and figuring out what we want to do. And by the way, just because you're using PaaS in production doesn't mean that you can't have a single database deployment in DR with some sort of snapshot or replication mechanism in place as a backup. And look, say it takes you two days to get all the kinks worked out post-event. Most people will say that's not the end of the world, and they will accept that as a solution. Because to be honest, a lot of folks are unfortunately looking to check the box on a DR strategy, or on having one in place for compliance. And having the DR strategy does not necessarily mean that you have a run book or that you have super low RTOs and RPOs. It just means that you sat down and wrote what you would do during an event, even if it hasn't been fully tested.
And so if that's what you're after, if that's what your goal is, because maybe you are not a fully technology-based platform as a business, as a revenue generation, oftentimes that isn't enough.
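
The dependency mapping mentioned above can be sketched in a few lines of Python. This is only an illustration of the idea, not Opti9's tooling: the application names, dependency names, and the `has_dr_fallback` flag are all hypothetical, and a real mapping would come from an inventory rather than a hard-coded dictionary.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Dependency:
    name: str
    has_dr_fallback: bool  # True if a documented fallback exists at the DR site


# Hypothetical application-to-dependency map; every name here is made up.
APP_DEPENDENCIES = {
    "checkout": [
        Dependency("managed-postgres", has_dr_fallback=False),
        Dependency("k8s-compute", has_dr_fallback=True),
    ],
    "crm-sync": [
        Dependency("crm-saas-api", has_dr_fallback=False),
        Dependency("k8s-compute", has_dr_fallback=True),
    ],
}


def weakest_links(app_deps):
    """Return {app: [dependency names]} for dependencies with no DR fallback."""
    report = {}
    for app, deps in app_deps.items():
        gaps = sorted(d.name for d in deps if not d.has_dr_fallback)
        if gaps:
            report[app] = gaps
    return report


if __name__ == "__main__":
    for app, gaps in weakest_links(APP_DEPENDENCIES).items():
        print(f"{app}: no DR fallback for {', '.join(gaps)}")
```

Even a list this simple makes the conversation concrete: the compute layer is portable, while the managed database and the SaaS API are the entries that need a vendor conversation or a snapshot-based fallback.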

WILL_BUTTON: Yeah, I think having the conversation about RTO, the recovery time objective, is really important. Because in my entire career, I've never worked for companies like Google. There's been one exception, where one of my employers was doing healthcare for trauma patients, so we had to move quickly there. But for most businesses, having that RTO conversation is very helpful, because while ideally you would like to say, oh yeah, we can fail over in two hours, that's cool, but it comes with a set of costs. And acknowledging the fact that it would be embarrassing to tell your customers we'll be up in two days, maybe that is the right strategy based on your business.

SAGI_BRODY: Yeah, you got to start somewhere, right? When I work with companies to build a disaster recovery strategy and actually roll it out, the first thing we'll ask is, what are the business goals you're trying to achieve? And some of the questions might be, are you only looking to protect against a full failure at the production site, where all the applications need to be failed over, or are there situations where you might need to fail over individual applications? And then there might be other questions like, do you want to fail over if there's one server that is hit by ransomware? And of course, everybody says yes to everything. Yes, we want all that. The problem is, the more situations, the more complexity. And ironically enough, the full failover event, where everything needs to come over at the same time, is actually much easier to build for and to achieve than all of the others, because typically you have this interdependence between applications, or they may be sitting behind the same firewall, in the same VPC, on the same network. And so if you can keep them all on the same IP addresses and keep references intact, then it is much easier. And so typically we'll deploy a phased approach. We'll say, let's be able to achieve that and prove that, show that it works, and then we'll peel back the rest of the layers of the onion and strive for more.

WILL_BUTTON: Yeah, it reminds me of an analogy from drag racing: speed costs money. How fast can you afford to go?

SAGI_BRODY: Yeah, yeah. And I really think it's interesting when you think about these things and you think about the burdens. If you're looking for complete HA, multi-region or even multi-cloud, think about the extra burden that you're putting on your DevOps or app dev teams, and what that translates into in terms of business impact. How much longer are your development cycles because of that? And what features are you not able to work on because of all this extra time put into the forethought of this high availability? That's why I like the middle-ground approach, where we have our developers focus on developing an application that runs on a single, let's say, AWS region. Head down, hands to keyboard, focus on building applications, which they probably have a lot of experience doing. This whole multi-region thing is typically fairly new to someone, and they're gonna go off on a tangent. So app devs, you focus on building an application that is resilient within a region. AWS makes it fairly straightforward to do that. And then maybe a separate team, or an SRE team, or a company like Opti9 comes in over the top and says, we are going to apply disaster recovery as a service to that single-region deployment and achieve resilience using tools like RPO and using proven strategies, and that way the app devs can stay highly focused. I think that's such a win-win, and honestly, I don't even know that there's a ton of developers out there that can even achieve HA with a high degree of success.

WILL_BUTTON: Yeah, I think one of the other benefits of that approach is discovering tribal knowledge, because in a lot of the scenarios I've been involved with, we do things and we take certain steps or actions because of this tribal knowledge that we happen to know. And in many cases, we don't even know that we're making decisions based on tribal knowledge. But when you bring in a third party like Opti9, they're coming at it from a fresh perspective without the tribal knowledge. And it works really well to expose that. And it's like, oh, okay, now we have this piece of information that has to be documented and formalized.

SAGI_BRODY: Absolutely. And like I said at the beginning, when I hear tribal knowledge, I hear complexity. I think this whole idea of managing complexity, managing complexity sprawl, fighting to reduce complexity, is not being pushed enough from an industry perspective. In fact, I think we have the opposite problem. I think we have a lot of folks out there, and at different times in my career I've definitely been guilty of this, where we have shiny object syndrome and we want to be exposed to all the latest and greatest tools. I think we're all curious people in this industry and we like playing with these things. I think part of it also is just maybe a little bit of fear, and ensuring that we have the latest. Shiny object syndrome is, I think, the complete opposite of: I want to keep my environment simple so that it's manageable, so that I can reduce the need for tribal knowledge. And this kind of goes into other soft skills, right? Like, if I want the person at 4 a.m. to be able to fix what I built, to what extent am I a good technical documenter? And to what extent do I take pride in that as a standalone skill that I'm good at, as a developer or an SRE or a DevOps person?

WILL_BUTTON: Yeah, agreed. And just speaking from personal experience, I'm not good at documenting. I'll write something that just seems to be as clear as it can be. And then usually me, six months later, looks at it and is like, who's the moron that wrote it? Oh, wait, never mind.

SAGI_BRODY: Yeah, I think we've all been there. I mean, I've managed a lot of technical teams, and it's always that last 10%. Let me see the documentation. How are we monitoring it? How do we know if it goes down? How are you backing it up? I mean, we want to build cool things, right? Then we just want to pass it on. But I do think that we, as DevOps engineers, need to start taking pride in skills that are outside of hands on the keyboard. Technical documentation, taking pride in being able to walk away, go on vacation, and people knowing what's going on by reading my documentation without calling me. I think also being a good troubleshooter, and this kind of goes back into the complexity and disaster recovery conversation, but to what extent is my troubleshooting skill set high? And I think, unfortunately, a lot of the soft skills don't have great KPI metrics that you can throw on a resume to show how well you do with those things. But I love honing the troubleshooting skill and being brought into a problem that I know nothing about and figuring it out quickly compared to maybe folks that wrote it or have been dealing with it. That's fun. It's a great little challenge.

WILL_BUTTON: Yeah, that's one thing I've advocated for for years now is my role as a DevOps engineer is to work myself out of a job, you know, to set everything up so that it runs and when it doesn't, it's clearly documented and what steps to do and someone new can come on board and get their app to production without having to rely on me and do so in a way that makes sure that they honor the constraints of the business. And if I can do that, then there's no reason for me to be at that company anymore. And I think that's my own personal metric for job success.

SAGI_BRODY: Yeah, I think actually, not to pull more shiny objects into the conversation, but GenAI, I think, has a huge potential to help in this area. In fact, I'm talking to some startups that are already starting to do this, where you will plug them into all of your internal documentation, and they will basically just give you a chatbot where you can ask questions. And having service provider experience, this is really interesting, because we're managing multiple customer deployments. Part of what Opti9 does is managed cloud ops, managing AWS deployments on behalf of our customers. We don't want every customer to be their own science project, but there is always going to be this balance of standardization and customization. And so we have very detailed documentation on each customer's deployment, and diagrams and all that, but it's very hard to scale that, especially for the person at 4 a.m. that gets the phone call that something is down. Having to sit and read through all the documentation and catch up, it's an impossible task when you need to spend hours catching up before you can even begin to troubleshoot. And this, I think, is what is really cool, which is where GenAI can help. If you have this LLM that's constantly looking at this data, you can have a bot where you say, hey, where's this customer's stuff deployed? When was the last time something was deployed? When was there a change? And you can just quickly get those answers. To me, as someone who's managed 24/7 teams, that's just super exciting. And that really helps us scale the NOC and the SOC organization.

WILL_BUTTON: Yeah, for sure. Cause context switching is huge. And that's where it seems to really raise its visibility of how painful and expensive context switching is. And I think you probably are very familiar with it from your experience at Opti9, when you switch from not only project to project, but customer to customer. And so you are working on one customer's environment that's built this way. And then the pager goes off and you have to switch to a completely different environment. And so how do you minimize that amount of time where you're just sitting there with a blank stare trying to figure out where to begin in this environment that could have infinite number of combinations?

SAGI_BRODY: Yeah. And I'd say now, based on the complexity we're talking about, it's almost impossible, to be honest. It really is. And having the tribal knowledge and the experience working on a specific customer's environment helps greatly. So what we do is we obviously try to have as many standardized tools as we can, standardized monitoring. I like looking at different monitoring strategies where we build monitoring, again, into the CI/CD workflow as far as what we're going to monitor. But what I'd like to do is to really have macro-level alerts go off at the same time as certain micro-level alerts go off. So if my application is down, say we're monitoring a specific query and we want to see that it's returning greater than 25 results from the customer perspective, if that goes down, I would like to see four or five different monitors that are monitoring specific layers of the backend, or specific API endpoints, also going down at the same time. So for the poor technician at 4 a.m., we're kind of spoon-feeding them: hey, there's a serious problem, but at the same time, we also notice these four things that are out of whack. And so instead of having to start from scratch, they can work backwards from the lowest-hanging fruit.
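
The macro/micro pairing described above could look something like the following Python sketch. It is a hedged illustration only: the `Alert` type, the alert names, and the five-minute window are assumptions made for the example, not details from the conversation or from any particular monitoring product.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Alert:
    name: str
    level: str  # "macro" (customer-facing check) or "micro" (backend layer or endpoint)
    fired_at: datetime


def correlate(alerts, window=timedelta(minutes=5)):
    """Map each macro alert to the micro alerts that fired within `window` of it."""
    micros = [a for a in alerts if a.level == "micro"]
    result = {}
    for macro in (a for a in alerts if a.level == "macro"):
        nearby = [m for m in micros if abs(m.fired_at - macro.fired_at) <= window]
        result[macro.name] = [m.name for m in sorted(nearby, key=lambda m: m.fired_at)]
    return result


# Hypothetical 4 a.m. incident; names and timestamps are invented for the example.
T0 = datetime(2024, 1, 1, 4, 0)
EXAMPLE_ALERTS = [
    Alert("search-returns-over-25-results", "macro", T0),
    Alert("db-replica-lag", "micro", T0 - timedelta(minutes=2)),
    Alert("search-api-5xx", "micro", T0 - timedelta(minutes=1)),
    Alert("batch-host-disk-usage", "micro", T0 - timedelta(hours=3)),
]

if __name__ == "__main__":
    for macro, related in correlate(EXAMPLE_ALERTS).items():
        print(f"{macro}: start with {related}")
```

The unrelated disk alert from three hours earlier is left out, so the technician sees the customer-facing failure next to the two backend signals that fired alongside it.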

WILL_BUTTON: Yeah, just giving them a series of breadcrumbs to follow.

SAGI_BRODY: Yeah. And again, I think that's a strategy. I mean, to me, is that a technical skill, or is that sort of a quasi non-technical strategy that you need to employ with this resilience or SRE hat, you know?

WILL_BUTTON: For sure.

SAGI_BRODY: Sort of DevOps, right there in the middle of DevOps, I think.

WILL_BUTTON: Yeah. One of the things I like to do is in all of my alerts, I like to include like hey, here's the alert. Obviously here's why it went off. And then here is a link to the application dashboard and the run book for that. Just to leave those breadcrumbs and help minimize that context-switching time.
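
As a rough sketch of that habit, an alert formatter can simply refuse to build a message that lacks the breadcrumbs. The function name, field labels, and example URLs below are hypothetical, chosen only to illustrate the idea.

```python
def build_alert_message(name, reason, dashboard_url, runbook_url):
    """Format an alert so the responder gets the breadcrumbs along with the page."""
    # Refuse to build an alert that would leave the on-call engineer without links.
    if not dashboard_url or not runbook_url:
        raise ValueError(f"alert {name!r} needs both a dashboard and a run book link")
    return "\n".join([
        f"ALERT: {name}",
        f"Why it fired: {reason}",
        f"Dashboard: {dashboard_url}",
        f"Run book: {runbook_url}",
    ])


if __name__ == "__main__":
    print(build_alert_message(
        "checkout-latency-high",                       # hypothetical alert name
        "p95 latency above threshold for 10 minutes",  # hypothetical reason
        "https://example.com/dashboards/checkout",     # placeholder URL
        "https://example.com/runbooks/checkout",       # placeholder URL
    ))
```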

SAGI_BRODY: Absolutely.

WILL_BUTTON: Documentation. You've mentioned that multiple times, and it's a pet peeve of mine because I don't like Confluence, I don't like Notion, I don't like GitHub READMEs. Pretty much I don't like any of the documentation tools. But you mentioned standardizing on tools. Do you have a preferred documentation tool?

SAGI_BRODY: I don't, I've used sort of all of the above. You know, I would say the answer, I don't think that there's one tool that's better than the other, right? And this is cliche, but right, it's more about the use. It's like talking about the best diet, right? It's the best, it's the one that you can do consistently over time.

WILL_BUTTON: Yeah.

SAGI_BRODY: I think as long as it's simple and you can build it into your workflow fairly easily, that is the best tool. I'll tell you one win related to that that I experienced years ago. It happened to be with Confluence, but I know the same sort of capability is available in almost all documentation tools now. Years ago, we used to use Visio to create diagrams, then we'd upload them into the documentation tool. And that whole process of bringing the output from one tool into the other, people don't want to do that. They'll end up just keeping the diagram in their own account, let's say they're using Lucidchart or Gliffy or something like that. So a big win for me was when Confluence started adding in these plugins where you can actually create the diagram without having to go out of the documentation system, and have the diagram embedded right into the document, instead of having to build it in a separate tool, then import and copy and all that. I think that was great, because now I'm authoring a document, I want to show a visual representation, I'm a big visual person, and I can just create the diagram right there without having to leave the page, and hit Save. And now the actual IP of that diagram is embedded into the document. It can never be pulled apart. Nobody can ever tell me, oh yeah, I never uploaded the latest version of the diagram into the document. So there's the whole concept of working the updating of documentation and diagrams into your workflow, and I think it's a really good example of how you can do that. Obviously, I think with Jira and GitHub you can do that, but I don't think that capability exists enough from more of an infrastructure, operations, SRE perspective.

WILL_BUTTON: Right on. When it comes to making sure things are up to date, whether that's documentation or run books or your failover strategy, what's the minimum frequency you would recommend for reviewing that?

SAGI_BRODY: Well, I'd say twice a year is probably the minimum. But then you also need to add hooks into your change control, right? Any time you deploy a new service, that should be a hook to whoever's responsible for resilience, maybe an outside vendor like Opti9. So if I'm sitting in the customer's seat, I'm going to add as many hooks as possible. And I'm going to say to my vendor, hey, we just changed this, we just changed that. Make sure our resilience still works. Now, on the flip side, if I'm in Opti9's seat, I might say, yeah, no problem, we've updated it, which we'll do in earnest, and we have to. But hey, we did what we had to do, and we've got to retest now. So you do have to find that balance. And it doesn't mean you can't update these things on an ongoing basis and then keep a list of what you want to ensure functions during the next test. I will say, though, one of the important things with testing is you don't want to just have the IT teams doing the application testing. You really need to have users testing. Maybe it's QA. Maybe you have internal staff that are using the system. You need people that can smell out a problem with the application, can smell out the fact that it's maybe a little bit more sluggish or that certain functionality doesn't work as well. And this is a big miss. A lot of IT teams try to internalize it because they want to just move past it.

WILL_BUTTON: For sure, yeah. Coming from an IT background, my overall objective is to avoid as many conversations with other humans as possible. But this is one of those areas where you just kind of can't do that. And I'm guilty of doing it too, of performing a failover, looking, yeah, all the health checks pass, no alarms, that must be good, and then moving on.

SAGI_BRODY: Yeah, and I like the idea of almost making product managers responsible for some of this. If resilience and high availability is a feature or component of the outward product, then I do think that they can be the liaison between the developers, third parties, or whoever's owning the resilience. You do need a quarterback there. And if there is a product management function, I think this is a great thing for them to own, to ensure continuity long-term.

WILL_BUTTON: Yeah, agreed. Like a seasoned product manager is just worth their weight in gold because they understand all of these different layers of complexity and interactions between the teams and just by job definition, they're really good at orchestrating and pulling in the right resources at the time that they're needed.

SAGI_BRODY: Yeah, and with third parties, right? If they get wind of using Salesforce, what they're going to want to do is potentially pull in whatever positive capabilities are being pulled through. Maybe they're pulling that into a product feature set. They also need to better understand what that means for the outward messaging on resilience, and whether they can still make good on that promise.

WILL_BUTTON: All right. Well, we are coming up on an hour here. Is there anything else that you feel like we should be covering when it comes to resilience, DR, and managing complexity?

SAGI_BRODY: So I'd say the most important thing, and this might be a little cliche these days, is just: make no assumptions. Make no assumptions on any of the platforms that you're using in regards to what built-in resilience or redundancy exists. And also keep in mind that high availability and resilience do not always equate to your ability to recover from specific types of events. If you're hit with a cyber attack and your data is corrupted in production systems, having a replica or having high availability, even with multiple regions, does not mean you can recover from that. There are other strategies that you need to employ. Obviously, how far back does your snapshot history go, your journal? And you'll need to have separate run books for that type of situation than for the high availability type of situation. So just understand those are completely separate. And again, make no assumptions.
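
The question of how far back the snapshot history goes can be made concrete with a small Python sketch. It assumes, purely for illustration, a list of snapshot timestamps and a known time at which corruption began; the dates are invented, and real backup tooling would supply these values.

```python
from datetime import datetime


def latest_clean_snapshot(snapshot_times, corrupted_at):
    """Return the newest snapshot taken before corruption began, or None if none exists."""
    clean = [t for t in snapshot_times if t < corrupted_at]
    return max(clean) if clean else None


# Hypothetical seven-day retention: one snapshot per day, January 4 through 10.
SNAPSHOTS = [datetime(2024, 1, day) for day in range(4, 11)]

if __name__ == "__main__":
    # Corruption noticed quickly: a clean restore point still exists.
    print(latest_clean_snapshot(SNAPSHOTS, datetime(2024, 1, 6, 12)))
    # Corruption began before the oldest retained snapshot: nothing clean is left.
    print(latest_clean_snapshot(SNAPSHOTS, datetime(2024, 1, 1, 12)))
```

The second case is the one a replica or a multi-region setup does not cover: every retained copy already contains the corrupted data, so retention length has to be chosen with detection time in mind.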

WILL_BUTTON: Yeah, it's almost like this would make a really good board game.

SAGI_BRODY: That would make a good board game.

WILL_BUTTON: Yeah.

SAGI_BRODY: We should do like a Jump to Conclusions.

WILL_BUTTON: Yeah. Well done on the Office Space reference.

SAGI_BRODY: Yeah. Nice to see that.

WILL_BUTTON: Well played. Cool. So if folks, if our listeners want to talk more about this or reach out to you directly with additional questions, what's the best way for them to do it?

SAGI_BRODY: Find me on LinkedIn. That's probably the platform that I'm most active on. Write to me on there. My name is unique, so I have no doubt that anyone who's listening to the show will be able to find me.

WILL_BUTTON: Your name is unique. Is that short for something?

SAGI_BRODY: Sagi is a Hebrew name. Like other Hebrew names, it can kind of get butchered in these parts, but there are much worse Hebrew names, so I don't have it that bad. But it is nice, because when I get cold calls, I immediately know that this person never spoke to me before. A built-in screening feature, yeah.

WILL_BUTTON: Awesome. Well, Sagi, thank you so much for joining me today. This has been a cool conversation. And I think it's one that we, we need to spend more time talking about because it often gets overlooked or assumed. Like you said, make no assumptions.

SAGI_BRODY: Absolutely. Cool. Well, thank you. Thank you. I mean, this has been fun. You're a great presenter and it's nice to talk to someone who's kind of also lived it and been through it as well.

WILL_BUTTON: Right? Us old guys have to group together and tell war stories once in a while. All right. Thanks again, Sagi. Thank you for listening, and we will see y'all next week.

 
