Resilient Design Practices


Summary

Resilient design practices focus on creating systems, products, and experiences that can adapt, recover, and continue to function when unexpected problems or failures occur. Instead of aiming for perfection or simply avoiding breakdowns, these practices help ensure recovery and adaptability in the face of real-world challenges, from software bugs to climate disasters.

  • Plan for failure: Build in features and processes that allow quick detection, recovery, and continued operation when things go wrong instead of assuming everything will work perfectly.
  • Design for extremes: Consider unusual scenarios, such as poor connectivity, extreme weather, or user errors, and make sure your design helps people succeed even in tough conditions.
  • Embrace adaptability: Encourage frequent reflection and iteration so systems and teams can learn from setbacks and emerge stronger after disruptions.
  • Vitaly Friedman

    Practical insights for better UX • Running “Measure UX” and “Design Patterns For AI” • Founder of SmashingMag • Speaker • Loves writing, checklists and running workshops on UX. 🍣

    224,238 followers

    ☂️ Designing For Edge Cases and Exceptions. Practical design guidelines to prevent dead-ends, lock-outs and other UX failures ↓
    🚫 People are never edge cases; “average” users don’t exist.
    ✅ Exceptions will occur eventually; it’s just a matter of time.
    ✅ To prevent failure, we need to explore unhappy paths early.
    ✅ Design the full UI stack: blank, loading, partial, error, ideal states.
    ✅ Design defaults deliberately to prevent slips and mistakes.
    ✅ Start by designing the core flow, then scrutinize every part of it.
    ✅ Allow users to override validators, or add an option manually.
    ✅ Design for incompatibility: contradicting filters, prefs, settings.
    🚫 Avoid generic error messages: they are often the main blockers.
    ✅ Suggest presets, templates, starter kits for quick recovery.
    ✅ Design for extreme scales: extra long/short, wide/tall, offline/slow.
    ✅ Design irreversible actions, e.g. Delete, Forget, Cancel, Exit.
    ✅ Allow users to undo critical actions for some period of time.
    ✅ Design a recovery UX for delays, lock-outs, missing data.
    ✅ Accessibility is a reliable way to ensure design resilience.

    Good design paves happy paths for everyone, but also casts a wide safety net when things go sideways. I love to explore unhappy paths by setting up a dedicated design review to discover exceptions proactively. It can be helpful to also ask AI tooling to come up with alternate scenarios. Once we start discussing exceptions, we start thinking outside the box. We have to actively challenge the generic expectations, stereotypes and assumptions that we as designers typically embed in our work, often unconsciously. To me, that’s one of the most valuable assets of such discussions. And: whenever possible, flag any mention of “average users” in your design discussions. Such people don’t exist; the term is often merely an aggregated average of assumptions and hunches. Nothing stress-tests your UX better than testing it in realistic conditions, with realistic data sets, with real people.
    Useful resources:
    – How To Fix A Bad User Interface, by Scott Hurff https://lnkd.in/ecj6PGPU
    – How To Design Edge Cases, by Tanner Christensen https://lnkd.in/ecs3kr8z
    – How To Find Edge Cases In UX, by Edward Chechique https://lnkd.in/e2pfqqen
    – Just About Everyone Is an Edge Case, by Kevin Ferris https://lnkd.in/eDdUVHyj
    – Edge Cases In UX, by Krisztina Szerovay https://lnkd.in/eM2Xynba
    Recommended books:
    – Design For Real Life, by Sara Wachter-Boettcher, Eric Meyer
    – The End of Average, by Todd Rose
    – Think Like a UX Researcher, by David Travis, Philip Hodgson
    – Mismatch: How Inclusion Shapes Design, by Kat Holmes
    #ux #design

  • Shalini Goyal

    Executive Director @ JP Morgan | Ex-Amazon || Professor @ Zigurat || Speaker, Author || TechWomen100 Award Finalist

    116,272 followers

    Systems don’t fail because something went wrong; they fail because nothing was prepared to handle what went wrong. That’s why failure-handling patterns are a core part of system design. This visual breaks down 12 essential techniques engineers use to build resilient, fault-tolerant systems that stay reliable under real-world pressure:
    - Retry: Reattempt failed operations to handle temporary network or service glitches. Used in API calls, database queries, and distributed requests.
    - Circuit Breaker: Stops calls to unhealthy services to prevent cascading failures. Common in microservices communication.
    - Bulkhead: Isolates failures so one overloaded component doesn’t crash the entire system. Used with thread pools and microservice resource isolation.
    - Fallback: Provides a degraded or cached response when a dependency fails. Keeps the user experience smooth with static data or defaults.
    - Timeouts: Prevent waiting forever for slow or stuck services. Critical for APIs, databases, and distributed systems.
    - Dead Letter Queue (DLQ): Captures failed messages for later inspection or reprocessing. A staple in message queues and event-driven architectures.
    - Rate Limiting: Protects systems from abuse or overload by restricting excessive requests. Used widely in public APIs and authentication services.
    - Load Shedding: Drops non-critical traffic during peak load to keep core functions alive. Common in high-traffic or real-time systems.
    - Graceful Degradation: Reduces functionality instead of failing completely. Used in dashboards, e-commerce platforms, and streaming apps.
    - Redundancy: Duplicates critical components to eliminate single points of failure. Standard practice for databases, servers, and networks.
    - Health Checks: Detect unhealthy services and remove them from rotation. Used by load balancers and orchestration tools.
    - Failover: Automatically switches to a backup system when the primary one fails. Essential for multi-region deployments and database clusters.
Mastering these techniques is what separates systems that work in theory from systems that work in production. Which ones have you used in your architecture?
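One of the patterns above, the circuit breaker, fits in a few lines. The sketch below is illustrative only: the threshold, cooldown, and error handling are placeholder choices, and a production implementation would also distinguish error types and emit metrics.

```python
import time

class CircuitBreaker:
    """Stop calling an unhealthy dependency after repeated failures,
    then allow a single trial call once a cooldown has passed."""

    def __init__(self, failure_threshold=3, cooldown=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.cooldown = cooldown
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.cooldown:
                # Fail fast: don't touch the struggling dependency.
                raise RuntimeError("circuit open: failing fast")
            # Cooldown elapsed: half-open, let one trial call through.
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # open (or re-open) the circuit
            raise
        else:
            self.failures = 0       # success closes the circuit
            self.opened_at = None
            return result
```

The key property is that an open circuit converts slow, cascading failures into immediate, cheap ones, which is exactly the "prevent cascading failures" behavior the list describes.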

  • Akhila Kosaraju

    I help climate solutions accelerate adoption with design that wins pilots, partnerships & funding | Clients across startups and unicorns backed by U.S. Dep’t of Energy, YC, Accel | Brand, Websites and UX Design.

    23,382 followers

    I just watched a talk on Design for Climate Disaster and it made me question my assumptions about designing in climate. Most designers design for perfect conditions. We assume fast WiFi. Sunny days. Users who aren't panicking. But designing for climate resilience is the opposite of that.

    In her talk at Figma Config, Megan Metzger discusses her design work for Forerunner's disaster response platform. The features aren't flashy. They're functional:
    • Mobile-first design with high-contrast screens
    • Offline functionality that syncs when connectivity returns
    • Real-time FEMA calculations for immediate decisions

    The results: damage assessment time dropped from 3-4 hours to 45 minutes. Over 15,000 assessments were completed faster. This unlocked $2.4 billion in recovery funds sooner.

    Megan's approach: design for effectiveness over elegance. Her three crisis design principles:
    1. Trust comes from reliability under pressure. Your system must work with low battery. Weak internet. When everything else fails.
    2. The right tools make impossible tasks possible. Enable people to do hard things under difficult conditions.
    3. Clarity enables action. Clear design removes hesitation. Give users confidence to act decisively.

    Climate disasters aren't rare anymore. They're Tuesday. Every month brings new records. Heat domes. Atmospheric rivers. Category 6 hurricanes.

    The biggest climate companies are finally getting this:
    • Rivian designs trucks that maintain navigation during wildfire smoke. Not just daily commutes.
    • Sunrun designs solar systems that work during blackouts. Not just sunny days.
    • Climavision builds weather radar for extreme events. Not just forecasting.

    As more companies enter climate adaptation and disaster response, Megan's principles become survival requirements.
    The same principle applies to climate technology:
    • Solar panels that work during storms
    • EV charging that functions in extreme weather
    • Carbon tracking that doesn't glitch during peak usage

    As climate designers, we obsess over features. We should obsess over reliability. Your climate solution isn't just competing with other green tech. It's competing with the status quo when everything goes wrong. The fossil fuel system works reliably; that's why people stick with it. If your sustainable alternative fails during stress, you've lost more than a customer. You've lost trust in the entire climate movement.

    My takeaway: design for the worst day, not just the best day. Test your climate tech during power outages. During heatwaves. During floods. Because that's exactly when we need it to work.

    This also raises the question: how do we balance reliability with efficiency?
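The "offline functionality that syncs when connectivity returns" idea mentioned above boils down to buffering writes locally and draining them in order when the network comes back. This is a toy sketch under that assumption; `send` and `is_online` are stand-in callables, not anything from Forerunner's actual platform.

```python
from collections import deque

class OfflineQueue:
    """Buffer records locally while offline; flush them in order once
    connectivity returns. A sketch, not a real sync engine (no
    persistence, dedup, or conflict resolution)."""

    def __init__(self, send, is_online):
        self.send = send            # callable that uploads one record
        self.is_online = is_online  # callable reporting connectivity
        self.pending = deque()

    def record(self, item):
        """Always accept the record; try to sync opportunistically."""
        self.pending.append(item)
        self.flush()

    def flush(self):
        while self.pending and self.is_online():
            item = self.pending[0]
            try:
                self.send(item)
            except OSError:
                return              # network dropped mid-flush; keep item queued
            self.pending.popleft()  # remove only after a confirmed send
```

Note the ordering: an item leaves the queue only after `send` succeeds, so a connection drop mid-flush loses nothing, which is the property a field worker in a disaster zone actually needs.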

  • Elise Victor, PhD

    Writing and Research on Motivation, Identity, Responsibility, and the Modern Human Experience

    34,177 followers

    Stop trying to be break-proof. Build for bounce-back.

    We’re taught to design systems that are fail-safe. But resilience isn’t about never falling down. It’s about how fast we get back up, and what we learn in the process. Speed and adaptability matter most. When things break, there’s a window, a threshold.
    - If recovery begins quickly, momentum builds.
    - If it drags, damage compounds.

    This applies to both people and teams. When a setback occurs, the first 24-72 hours determine whether we stabilize or spiral.
    - In systems, it’s the critical recovery rate.
    - In leadership, it’s the response rhythm.

    5 qualities of highly resilient people and teams:
    (1) Design for recovery, not perfection. Ask this question: "If we fail tomorrow, how do we restore 80% of capacity in one day?" Create fallback plans, reroute paths, and playbooks for quick resets.
    (2) Move fast to regain momentum. Recovery is like a muscle; it strengthens through use. After a disruption, run short, visible wins to restore confidence and signal progress.
    (3) Know your thresholds. Systems collapse when they cross invisible lines. People do too. Identify your "too broken" point before you reach it, and build early warning signals around it.
    (4) Don’t bounce back. Bounce forward. True resilience isn’t returning to what was. It’s transforming into what’s next. Every crisis is data, and every disruption is feedback.
    (5) Build recovery habits. Don't be caught off guard. Reflect even on small setbacks. Ask "What was surprising?" and "What supported a fast recovery?"

    Resilience isn’t a trait. It’s a design choice. We can’t eliminate failure, but we can build systems and teams that learn, adapt, and come back stronger every time. The goal isn’t to be unbreakable. It’s to be unstoppable.

    ♻️ Share if this resonates. ➕ Follow Elise Victor, PhD for mindset and growth insights.

  • Benjamin Cane

    Distinguished Engineer @ American Express | Slaying Latency & Building Reliable Card Payment Platforms since 2011

    4,838 followers

    You may be building for availability, but are you building for resiliency? Many teams design for availability. Far fewer design for resiliency. A concept that took me a while to really grasp is that building highly available systems and building highly resilient systems are not the same thing. The difference is how the system reacts to failure.

    🚄 High Availability
    When you build for high availability, the goal is simple: ensure there is always another path. If something fails, traffic can be redirected somewhere else. For example, a service might run across multiple availability zones or regions. If one fails, traffic is routed to another. Detecting failures and redirecting traffic are core elements of building for high availability. Availability is about rerouting traffic when something fails.

    🚂 High Resiliency
    Building for resiliency is different. The solution to failure isn’t another path; it’s how the system handles the error. When a dependency fails, the decision becomes: Do we retry? Do we continue without that dependency? Do we degrade functionality? Do we stop processing altogether? Resiliency is about defining what happens when things go wrong. Sometimes you can continue processing. Sometimes you can defer work and fix it later. Resiliency is absorbing failure instead of avoiding it.

    🧩 A Simple Example
    When you design systems with resiliency in mind, you tend to treat dependencies differently. A simple example is configuration. Many systems use distributed configuration services so that runtime behavior can change without redeployment. But that configuration service then becomes a dependency. To avoid turning it into a hard dependency, many systems cache the configuration in memory. When updates occur, the system fetches the new configuration and switches only after it’s fully loaded into memory. If a configuration refresh fails, the system continues operating with the last known configuration. Transient failures don’t bring the system down. That’s resiliency.
    🧠 Final Thoughts
    When I talk about non-functional requirements, you’ll hear me say “highly available and resilient systems.” I separate them intentionally because the approaches are different. Availability ensures there is always another path. Resiliency ensures the system can continue operating when failures occur. Availability routes around failure. Resiliency survives failure. You need both.
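The configuration example in this post can be sketched in a few lines. This is a minimal illustration under the post's assumptions: `fetch` stands in for any distributed configuration service call, and a real implementation would add validation and background refresh scheduling.

```python
class ResilientConfig:
    """Serve configuration from an in-memory copy; if a refresh fails,
    keep running on the last known good configuration."""

    def __init__(self, fetch, initial):
        self.fetch = fetch            # callable returning a full config dict
        self.current = dict(initial)  # last known good configuration

    def refresh(self):
        """Fetch new configuration; swap it in only after a full,
        successful load. Returns True if the swap happened."""
        try:
            fresh = self.fetch()
        except Exception:
            return False              # transient failure: keep last known good
        self.current = dict(fresh)    # atomic swap of the whole snapshot
        return True

    def get(self, key, default=None):
        return self.current.get(key, default)
```

The design choice that makes this resilient is the swap-after-full-load: the service never runs on a half-applied configuration, and a failed refresh degrades to stale-but-consistent behavior instead of an outage.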

  • Outages should be viewed as indicators of stress within a business model rather than as simple glitches. Recent incidents, such as the Amazon Web Services (AWS) DNS failure and Vodafone’s UK outage, highlight a critical issue: many so-called "resilient" architectures may actually function as single points of failure, despite appearing to have multi-cloud alternatives. If an Industry 4.0 operation relies on only one cloud region, DNS path, or vendor control plane, true resilience is lacking; what looks like resilience may simply be good luck.

    Addressing this requires a shift towards designing systems that anticipate failure. Strategies may include prioritizing local-edge operational technology (OT) to maintain essential functions, employing active-active configurations across multiple regions and providers, ensuring diverse peering and identity paths, utilizing dual-carrier connectivity, and implementing private 5G networks for reliable control. Regulatory regimes such as DORA, NIS2, and UK Operational Resilience will likely demand concrete evidence of resilience rather than presentations. While achieving true resilience involves costs, unplanned downtime can result in significant financial losses and damaged customer trust.

    Recommended practices include conducting regular “Failure Day” exercises, mapping third-party dependencies down to the API level, and revising key performance indicators (KPIs) from uptime to fault tolerance. This approach can help ensure that, in the event of a disruption in a region like us-east-1, operational capabilities remain intact and financial performance is protected. At #BellLabsConsulting we have a full methodology to prevent events such as these and to respond faster when they happen.

  • Madhusudan Vishwanath

    Engineering Leader - Cloud Infrastructure, Networking and Security

    1,576 followers

    The recent AWS outage underscores two architectural fundamentals that often separate resilient systems from fragile ones:

    1️⃣ Enforcing Tiered Service Architecture
    Every large-scale cloud platform should maintain a strict service tiering model. Tier 0 services, the true backbone, must have minimal or no dependencies. These are the primitives everything else relies on (e.g., identity, networking, metadata). Tier 1 and higher services may depend on lower tiers, but never in reverse. When lower-tier services inadvertently depend on higher-tier components (for metrics, authentication, or orchestration), the system becomes entangled, and outages ripple far wider than they should. Clear separation of service tiers is the foundation of blast-radius containment.

    2️⃣ Multi-Region Deployment and Replication
    For customers, architectural resilience means assuming failure is inevitable. Services should be multi-region by design, not as an afterthought, with data and state replication, independent failover logic, and health-aware routing. High availability isn’t achieved by redundancy alone; it’s achieved by eliminating correlated failure domains, in both provider and customer architectures.

    Resilience is an outcome of disciplined design, not luck during an outage. #AWS #CloudArchitecture #Resilience #HighAvailability #DistributedSystems #SRE
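The tiering rule in point 1 (services depend only on equal-or-lower tiers, never in reverse) can be checked mechanically against a dependency graph. A toy validator, with entirely made-up service names:

```python
def tier_violations(tiers, deps):
    """tiers: service -> tier number (0 is the backbone).
    deps: service -> list of services it depends on.
    Returns (service, dependency) pairs where a service depends on
    something in a *higher* tier, i.e. the tiering rule is inverted."""
    return [
        (svc, dep)
        for svc, dep_list in deps.items()
        for dep in dep_list
        if tiers[dep] > tiers[svc]
    ]

# Hypothetical platform: identity and networking form Tier 0.
tiers = {"identity": 0, "networking": 0, "metrics": 1, "webapp": 2}
deps = {
    "webapp": ["identity", "metrics"],  # fine: depends on lower tiers only
    "identity": ["metrics"],            # violation: Tier 0 -> Tier 1
}
```

Running a check like this in CI is one way to keep the "entanglement" the post warns about from creeping in one convenient dependency at a time.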

  • Leon M.

    Where Cloud and AI Converge to Redefine Business Value

    16,946 followers

    Announcing a new role at Intellias as VP of Global Cloud Strategy on the same day Amazon Web Services (AWS) works through an outage feels like a direct message, and a reminder that provider uptime is only part of the story. Real resilience is a business strategy.

    It is easy to point at a cloud provider. The harder and more valuable work is looking inward and asking what we could have designed differently so customers feel a brief pause, not pain.

    Think utility power. Most of the time the lights come on without a thought. When they do not, outcomes depend on what you put in place: a fresh bulb, the right breaker, a UPS, a small generator, maybe solar plus batteries. Cloud is the same. Choices you make before the storm determine how you ride it out.

    What we control:
    (1) Resilience by design: retries with backoff, idempotency, timeouts, load shedding.
    (2) Blast-radius limits: cell-based architecture and per-Region isolation.
    (3) Right-sized redundancy: Multi-AZ as baseline; warm standby or active-active for critical journeys.
    (4) Data protection targets: clear RTO and RPO mapped to customer journeys.
    (5) Operational muscle: chaos and game days, runbooks, crisp communications plans.
    (6) Cost clarity: compare the price of resilience with the cost of downtime and decide explicitly.

    Resilience Menu (in increasing cost and complexity):
    (1) Hygiene and graceful degradation: health checks, feature flags, fallback content, read-only modes, rate limits, capacity buffers, synthetic monitoring.
    (2) Multi-AZ fundamentals: AZ-aware shards, queue-first patterns, dead-letter queues, warm pools, circuit breakers, bulkheads, structured timeouts and backoff.
    (3) Multi-Region warm standby: cross-Region backups, pilot light, async replication, prepared DNS or traffic-manager failover, rehearsed runbooks with target RTO/RPO.
    (4) Active-active multi-Region: global data strategies and conflict resolution, partition-tolerant stores, global service discovery, continuous chaos at scale, contractual SLOs.
    (5) Targeted multi-cloud (when concentration risk is unacceptable): selective diversification for control planes such as DNS, CDN, or identity.

    Outages will happen. The question is whether customers experience a slowdown or a well-practiced plan. In my new role, I am doubling down on making resilience intentional, measured, and worth the money.

    As Werner Vogels says, "Everything fails, all the time." Chaos is inevitable. Chaos engineering makes it intentional and survivable, turning resilience into a competitive edge: faster recovery, steadier customer experience, and the ability to ship when others stall. #cloudstrategy #resilience #aws #architecture #SRE #devops #businesscontinuity

  • Dr Fatemeh Rezazadeh

    Energy & Infrastructure Executive | Capital Strategy & Commercial Leadership | Board Advisor | Cross-Border M&A Transactions & Platform Growth

    3,908 followers

    There was enough power, but there wasn’t enough resilience.

    Last week’s Heathrow shutdown wasn’t just a power outage; it was an exposure. A transformer fire at the North Hyde substation took out electricity to the world’s second-busiest airport. The ripple effects were felt across global aviation, supply chains, and headlines.

    John Pettigrew, CEO of National Grid, says the other two substations serving Heathrow had enough capacity to keep the airport running. So why the closure? Because operational resilience isn’t just about capacity; it’s about design, systems, decision-making, and time. Heathrow’s CEO explained that they had to shut down thousands of systems and methodically reboot them to ensure safety. Backup generators existed, but only to cover critical safety systems, not full operations. Switching to alternate substations wasn’t instantaneous; reconfiguring and restoring took hours.

    This is a classic example of designed resilience vs. lived resilience. We often assume that having backup available is enough. But in complex systems (airports, hospitals, data centers) it’s how quickly and safely that backup can be activated that defines true resilience.

    Other major airports have made resilience a priority:
    - JFK, New York: 110 MW gas-fired CHP plant enabling full microgrid operation during outages.
    - Frankfurt Airport: Redundant grid feeds, on-site gas turbine generation, and UPS systems.
    - Amsterdam Schiphol: Integrated energy management system with diesel and battery backup for essential systems.
    - Changi Airport, Singapore: Multiple grid connections, standby diesel generation, and automated switchgear.
    - Incheon International, South Korea: Dual-feed substations, backup diesel generators, and smart grid control.

    These airports understand that resilience isn’t a luxury; it’s a license to operate. This is the future of energy for critical infrastructure:
    - Decentralized
    - Redundant
    - Fast-switching
    - Integrated with grid and on-site systems.
    If Heathrow, despite being served by three substations, could still go dark for nearly 24 hours, the question isn’t who to blame. It’s what to build differently. Are we designing our infrastructure for availability, or for agility? Are we investing in energy systems that can recover, or just survive? Let’s make sure this isn’t just a red flag but a redirection. #EnergyResilience #InfrastructureLeadership #FutureOfPower #CriticalInfrastructure #Heathrow #GridSecurity #Digitalisation #Electrification
