From the course: Tech Trends

System outages: Recovery and resilience

- How do we prevent our computer infrastructure from collapsing due to a software update or human error? And can we totally prevent this from happening? Regional and global service and software outages happen more often than we'd like, and when they do, we're reminded of how much our society depends on our sometimes fragile digital infrastructure. In July 2024, millions of computers went down worldwide, forcing banks and airlines and hospitals and other critical infrastructure to halt their operations as their computer systems were manually reset. The cause? An automatic update from CrowdStrike, a company providing critical enterprise security software. In 2022, an outage from Canadian cell phone provider Rogers affected millions of users across the country. The cause? An error introduced during routine maintenance. And the same thing happened with Meta in 2021. During routine maintenance, a misconfiguration of a system took most of their platforms offline, causing a global outage. And there have been several highly publicized near misses, including Log4j, Shellshock, Heartbleed, and the notorious Y2K bug, all narrowly averted thanks to global efforts from system administrators. Seeing these outages happen and seeing their widespread consequences, there are some questions worth asking about how to build more robust and resilient processes to prevent outages when we can, and how to get our systems back online as quickly as possible when they go down. To bring some clarity to this, I reached out to a handful of our LinkedIn Learning instructors for their insight. To start, I asked what service providers can do to prevent these outages from happening in the first place. Here's cybersecurity expert Tia Hopkins. - Some of the low-hanging fruit is to do rollouts in stages. You know, global rollouts all at once, if they go well, they go well. If they do not go well, it can be catastrophic, as we've seen. Have rollback plans in place just in case things don't go well. - Now, rollbacks aren't always an option, especially if the update is to a core system. So testing before rollout is equally essential. Cybersecurity expert and author Caroline Wong has more to say about this. - It's critically important for service providers to conduct security testing as well as quality assurance testing on their updates before those are pushed out to their customers. - Because these updates are often critical, there's a tension here between testing to make sure everything is perfect and rolling out updates to protect users. Tia explains. - The balance that has to be struck for a vendor of security software is that balance between, hey, we've got an update here that's going to protect our customer base, and the longer we delay the rollout of this update, the more our customers are unprotected from this thing. - In other words, providers can do everything in their power to get it right on their end, and even then there's still a chance of something going wrong. Which brings us to the next question: what can IT and security admins do to reduce the risk of their systems going down when something does go wrong during an external service update or routine maintenance? Strategic cybersecurity leader Mike Wylie starts out with a simple message. - The best thing IT admins can do to reduce the risk of their systems going down is to partner with a vendor they trust and that has a strong track record.
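
To make the staged rollout and rollback plan Tia describes concrete, here is a minimal sketch in Python. The helper functions (deploy_to, health_check, rollback) and the ring sizes are hypothetical stand-ins for whatever deployment tooling an organization actually uses; this illustrates the pattern, not any vendor's implementation.

```python
# Minimal sketch of a staged (ring-based) rollout with a rollback plan.
# deploy_to, health_check, and rollback are hypothetical stand-ins for
# real deployment tooling; the ring sizes are illustrative only.

RINGS = [
    ("canary", 0.01),   # ~1% of the fleet first
    ("early", 0.10),    # then a broader 10% slice
    ("global", 1.00),   # full rollout only if earlier rings stay healthy
]

def deploy_to(ring_name: str, fraction: float, version: str) -> None:
    print(f"Deploying {version} to {ring_name} ({fraction:.0%} of fleet)")

def health_check(ring_name: str) -> bool:
    # In practice: error rates, crash telemetry, boot success, etc.
    return True

def rollback(version: str) -> None:
    print(f"Rolling back {version} on all rings touched so far")

def staged_rollout(version: str) -> bool:
    for ring_name, fraction in RINGS:
        deploy_to(ring_name, fraction, version)
        if not health_check(ring_name):
            rollback(version)   # the rollback plan, exercised automatically
            return False        # stop before the problem goes global
    return True

if __name__ == "__main__":
    staged_rollout("sensor-update-2024.07")
```

The point of the structure is simply that a failed health check in an early ring halts the rollout before it ever reaches the full fleet, which is the catastrophe Tia warns about with all-at-once global pushes.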
- With that trusted vendor relationship in place, Caroline and Tia point out the necessity of understanding your own systems and maintaining robust policies and practices. - Any organization is likely to have a unique or specific setup with regard to their tech stack. They might have different versions of different software components, and it's important for IT admins to test this on a few systems before deploying across an entire production set of systems. - Having strong rollback plans in place is important. Having a sandbox environment where updates can be tested before they're rolled out globally, so staged rollouts, QA testing, you know, testing different types of machines with different configurations before updates are rolled out, and ensuring change management protocols are clear and understood. - Which is all well and good until something goes wrong. Here's Tia again. - This is a huge realization of how much we rely on technology. What will you do if your options for recovery do not include the technologies that you're used to leveraging? - That's a great question, and part of the answer, according to Caroline, is to pay close attention to the risk level of everything that happens in your systems. - It is critical for organizations to determine what their high-risk actions are. For example, a high-risk action is going to be when a software provider pushes a mandatory update. This needs to go through extra rigorous QA and security testing. - This raises a question I've heard often and even grappled with myself: if I know an automatic update may be risky, should I just opt out and stop it from taking place at all? Tia points out the complexities of this dilemma. - On the one hand, you could allow, you know, automatic updates and experience what we just experienced with this global outage. But on the other hand, you could disable automatic updates and neglect to install an update that would protect you from something and then have a breach. - Which brings us to today and why we're here talking about this. With so many different systems and stakeholders connected together in our information infrastructure, is it at all possible to build truly robust systems that are immune to these types of outages? Mike rounds us out with a reality check. - To build a fully resilient and redundant system, there would be extreme cost and complexity, and it would likely introduce new vulnerabilities that result in a worse outcome. - Did you catch that last part? Trying to build a perfectly robust system might introduce new vulnerabilities. I think this is the thread running through this entire conversation. Cybersecurity is not about perfection, it's about continuous improvement in a rapidly changing landscape. When we experience these massive system outages, it's usually from a novel cause. And whenever they happen, cybersecurity experts like Caroline and Mike and Tia dive in to figure out what happened, how to fix it, and most importantly, how to stop it from happening again. So here are my takeaways. Number one, testing at every stage of the process is essential. Number two, phased rollouts ensure errors can be caught before they go global. Number three, critical systems need rollbacks and redundancies for when things go wrong. And finally, number four, in place of chasing a utopian, perfectly robust system, we work together to find issues, solve them quickly, and learn from them. That's what we already do, and that's what gives us the resilience to recover and build back better when our digital infrastructure goes down.
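
The sandbox testing that Caroline and Tia describe can also be sketched briefly. The sketch below assumes a hypothetical gate that applies a vendor update to a small set of test machines with varied configurations and only approves it for a production rollout if every configuration passes; the machine names, configurations, and test hooks are illustrative, not any specific admin tool's API.

```python
# Minimal sketch of gating a high-risk vendor update through a sandbox
# of mixed configurations before approving a production rollout.
# Machine names, configurations, and checks are hypothetical examples.

from dataclasses import dataclass

@dataclass
class TestMachine:
    name: str
    os_version: str

# A small sandbox fleet that mirrors the variety of production configs.
SANDBOX = [
    TestMachine("sbx-win10", "Windows 10 22H2"),
    TestMachine("sbx-win11", "Windows 11 23H2"),
    TestMachine("sbx-server", "Windows Server 2022"),
]

def apply_update(machine: TestMachine, update_id: str) -> bool:
    # Stand-in for installing the vendor update on the sandbox machine.
    print(f"Applying {update_id} on {machine.name} ({machine.os_version})")
    return True

def passes_qa(machine: TestMachine) -> bool:
    # Stand-in for QA and security checks: boots cleanly, core apps run, etc.
    return True

def approve_for_production(update_id: str) -> bool:
    """Approve the update only if every sandbox configuration passes."""
    for machine in SANDBOX:
        if not (apply_update(machine, update_id) and passes_qa(machine)):
            print(f"Holding {update_id}: failed on {machine.name}")
            return False
    print(f"{update_id} approved for staged production rollout")
    return True

if __name__ == "__main__":
    approve_for_production("vendor-update-1234")
```

A gate like this does not resolve Tia's dilemma about delaying protective updates, but it narrows it: the update is delayed only as long as the sandbox pass takes, rather than being disabled outright.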
