☂️ Designing For Edge Cases and Exceptions. Practical design guidelines to prevent dead-ends, lock-outs and other UX failures ↓

🚫 People are never edge cases; “average” users don’t exist.
✅ Exceptions will occur eventually, it’s just a matter of time.
✅ To prevent failure, we need to explore unhappy paths early.
✅ Design the full UI stack: blank, loading, partial, error, ideal states.
✅ Design defaults deliberately to prevent slips and mistakes.
✅ Start by designing the core flow, then scrutinize every part of it.
✅ Allow users to override validators, or add an option manually.
✅ Design for incompatibility: contradicting filters, prefs, settings.
🚫 Avoid generic error messages: they are often the main blockers.
✅ Suggest presets, templates, starter kits for quick recovery.
✅ Design extreme scales: extra long/short, wide/tall, offline/slow.
✅ Design irreversible actions, e.g. Delete, Forget, Cancel, Exit.
✅ Allow users to undo critical actions for some period of time.
✅ Design a recovery UX for delays, lock-outs, missing data.
✅ Accessibility is a reliable way to ensure design resilience.

Good design paves happy paths for everyone, but also casts a wide safety net when things go sideways. I love to explore unhappy paths by setting up a dedicated design review to discover exceptions proactively. It can be helpful to also ask AI tooling to come up with alternate scenarios.

Once we start discussing exceptions, we start thinking outside of the box. We have to actively challenge generic expectations, stereotypes and assumptions that we as designers typically embed in our work, often unconsciously. And to me, that’s one of the most valuable assets of such discussions.

And: whenever possible, flag any mentions of average users in your design discussions. Such people don’t exist, and often it’s merely an aggregated average of assumptions and hunches. Nothing stress tests your UX better than testing it in realistic conditions, with realistic data sets and real people.
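The "undo for some period of time" guideline can be sketched in code. This is only a minimal illustration, not a prescribed implementation: the `UndoWindow` class, its method names and the grace period are all hypothetical. The idea is that a destructive action is held as pending and only committed once the grace period has passed.

```python
import time
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class PendingAction:
    commit: Callable[[], None]  # performs the irreversible step
    deadline: float             # clock value after which undo is no longer allowed


class UndoWindow:
    """Hypothetical helper: hold destructive actions for a grace period."""

    def __init__(self, grace_seconds: float,
                 clock: Callable[[], float] = time.monotonic):
        self.grace_seconds = grace_seconds
        self.clock = clock  # injectable so the behaviour can be tested
        self.pending: Dict[str, PendingAction] = {}

    def request(self, action_id: str, commit: Callable[[], None]) -> None:
        """Register an action; nothing irreversible happens yet."""
        deadline = self.clock() + self.grace_seconds
        self.pending[action_id] = PendingAction(commit, deadline)

    def undo(self, action_id: str) -> bool:
        """Cancel a pending action; returns False if it is unknown or too late."""
        action = self.pending.get(action_id)
        if action is None or self.clock() >= action.deadline:
            return False
        del self.pending[action_id]
        return True

    def flush(self) -> int:
        """Commit every action whose grace period has expired."""
        now = self.clock()
        due = [key for key, a in self.pending.items() if now >= a.deadline]
        for key in due:
            self.pending.pop(key).commit()
        return len(due)
```

In a real product, `flush` would run on a timer or background job, and the UI would show an "Undo" prompt for as long as `undo` can still succeed.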
Useful resources:

How To Fix A Bad User Interface, by Scott Hurff
https://lnkd.in/ecj6PGPU

How To Design Edge Cases, by Tanner Christensen
https://lnkd.in/ecs3kr8z

How To Find Edge Cases In UX, by Edward Chechique
https://lnkd.in/e2pfqqen

Just About Everyone Is an Edge Case, by Kevin Ferris
https://lnkd.in/eDdUVHyj

Edge Cases In UX, by Krisztina Szerovay
https://lnkd.in/eM2Xynba

Recommended books:
– Design For Real Life, by Sara Wachter-Boettcher, Eric Meyer
– The End of Average, by Todd Rose
– Think Like a UX Researcher, by David Travis, Philip Hodgson
– Mismatch: How Inclusion Shapes Design, by Kat Holmes

#ux #design
Resilient Design Practices
Summary
Resilient design practices focus on building systems, products, and experiences that can withstand disruptions, adapt to unexpected situations, and recover quickly from failures. These approaches anticipate challenges and ensure reliability, especially when conditions are less than ideal.
- Plan for exceptions: Think ahead and design for scenarios where things might go wrong, from internet outages to user errors and extreme weather, so users aren't left stranded.
- Enable quick recovery: Build mechanisms such as fallback options, undo features, and backup systems that help people and technology bounce back swiftly after a setback.
- Test under stress: Evaluate how your design performs in real-world tough situations, like power outages or high traffic, and refine it to stay reliable when it matters most.
-
Systems don’t fail because something went wrong - they fail because nothing was prepared to handle what went wrong. That’s why failure-handling patterns are a core part of system design.

This visual breaks down 12 essential techniques engineers use to build resilient, fault-tolerant systems that stay reliable under real-world pressure:

- Retry: Reattempt failed operations to handle temporary network or service glitches. Used in API calls, database queries, and distributed requests.
- Circuit Breaker: Stops calls to unhealthy services to prevent cascading failures. Common in microservices communication.
- Bulkhead: Isolates failures so one overloaded component doesn’t crash the entire system. Used with thread pools and microservice resource isolation.
- Fallback: Provides a degraded or cached response when a dependency fails. Keeps the user experience smooth with static data or defaults.
- Timeouts: Prevent waiting forever for slow or stuck services. Critical for APIs, databases, and distributed systems.
- Dead Letter Queue (DLQ): Captures failed messages for later inspection or reprocessing. A staple in message queues and event-driven architectures.
- Rate Limiting: Protects systems from abuse or overload by restricting excessive requests. Used widely in public APIs and authentication services.
- Load Shedding: Drops non-critical traffic during peak load to keep core functions alive. Common in high-traffic or real-time systems.
- Graceful Degradation: Reduces functionality instead of failing completely. Used in dashboards, e-commerce platforms, and streaming apps.
- Redundancy: Duplicates critical components to eliminate single points of failure. Standard practice for databases, servers, and networks.
- Health Checks: Detect unhealthy services and remove them from rotation. Used by load balancers and orchestration tools.
- Failover: Automatically switches to a backup system when the primary one fails. Essential for multi-region deployments and database clusters.
Mastering these techniques is what separates systems that work in theory from systems that work in production. Which ones have you used in your architecture?
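To make two of these patterns concrete, here is a minimal sketch that combines a circuit breaker with a fallback. It is an illustration under assumptions, not a production library: the `CircuitBreaker` class, its thresholds and the injectable `clock` are all hypothetical names chosen for this example.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitBreaker:
    """Hypothetical sketch: stop calling a failing dependency for a while."""

    def __init__(self, failure_threshold: int = 3, reset_seconds: float = 30.0,
                 clock: Callable[[], float] = time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_seconds = reset_seconds
        self.clock = clock
        self.failures = 0
        self.opened_at: Optional[float] = None  # None means the circuit is closed

    def is_open(self) -> bool:
        if self.opened_at is None:
            return False
        # After the reset period, let one trial call through ("half-open").
        return self.clock() - self.opened_at < self.reset_seconds

    def call(self, operation: Callable[[], T], fallback: Callable[[], T]) -> T:
        if self.is_open():
            return fallback()  # short-circuit: do not touch the dependency
        try:
            result = operation()
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # open (or re-open) the circuit
            return fallback()
        # A success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

A caller would wrap each dependency call, for example `breaker.call(fetch_prices, lambda: cached_prices)`, where both names stand in for whatever the real system uses. Retries, timeouts and bulkheads would sit alongside this rather than replace it.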
-
Hi 👋

🚀 Resiliency Engineering: Why Top Tech Companies Never Fail Their Users

In today’s software landscape, failures are inevitable. What separates the giants like Netflix, Google, and Amazon from the rest is not that they avoid failures, but that they anticipate, measure, and recover from them.

⭐ What is Resiliency Engineering?
It’s the practice of designing systems that continue to operate correctly even when parts of the system fail, and can recover quickly.

🟢 Real-world Usage:
In microservices, if one service goes down, the rest keep running.
In cloud systems, even if an entire data center fails, uptime is preserved.
In e-commerce and fintech, payment failures or network issues are handled gracefully to ensure a seamless user experience.

🟠 Key Techniques & Tools:
Retry with Backoff
Circuit Breakers
Timeouts & Fallbacks
Bulkhead Isolation
Rate Limiting

🟣 Monitoring Resiliency:
Measure what matters:
Availability / Uptime
Error Rate
Latency / P95 / P99
MTTR (Mean Time To Recovery)
MTBF (Mean Time Between Failures)

🔵 Case Study:
Netflix uses Chaos Engineering with tools like Chaos Monkey to intentionally fail services and test system resilience. Result? 99.99% uptime for millions of users worldwide.

⭕ Practical Steps to Improve Resiliency:
🔸 Define SLOs & SLIs for every service
🔸 Implement retry, timeout, circuit breaker, and fallback mechanisms
🔸 Set up monitoring and observability (Prometheus, Grafana, OpenTelemetry)
🔸 Run Chaos Engineering experiments
🔸 Conduct blameless postmortems to learn and improve continuously

Resiliency isn’t optional. It’s a competitive advantage. The question is: How resilient is your system today?

#ResilienceEngineering #SRE #ChaosEngineering #Microservices #CloudNative #Reliability #Observability #SiteReliabilityEngineering #TechLeadership #HighAvailability
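The metrics listed above relate to each other in simple ways, which a short sketch can show. The function names are hypothetical and the formulas are the common textbook ones: steady-state availability as MTBF / (MTBF + MTTR), the downtime an uptime target allows, and a nearest-rank percentile for latency figures such as P95.

```python
import math
from typing import Sequence


def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability from mean time between failures and to recovery."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


def downtime_budget_minutes(slo: float, period_days: int = 365) -> float:
    """Minutes of downtime an availability target (e.g. 0.9999) allows per period."""
    return (1.0 - slo) * period_days * 24 * 60


def percentile(samples: Sequence[float], p: float) -> float:
    """Nearest-rank percentile, e.g. p=95 for P95 latency."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


if __name__ == "__main__":
    # A 99.99% yearly target leaves roughly 52.6 minutes of downtime.
    print(round(downtime_budget_minutes(0.9999), 1))
    # Failing every 999 hours on average and recovering in 1 hour gives 99.9%.
    print(availability(999, 1))
```

Monitoring tools usually compute these for you; the point of the sketch is only to show why cutting MTTR raises availability as surely as stretching MTBF does.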
-
I just watched a talk on Design for Climate Disaster, and it made me completely question my assumptions about designing for climate.

Most designers design for perfect conditions. We assume fast WiFi. Sunny days. Users who aren't panicking. But designing for climate resilience is the opposite of that.

In her talk at Figma Config, Megan Metzger talks about her design work for Forerunner's disaster response platform. The features aren't flashy. They're functional:
• Mobile-first design with high-contrast screens
• Offline functionality that syncs when connectivity returns
• Real-time FEMA calculations for immediate decisions

The results: Damage assessment time dropped from 3-4 hours to 45 minutes. Over 15,000 assessments completed faster. This unlocked $2.4 billion in recovery funds sooner.

Megan's approach: design for effectiveness over elegance. Her three crisis design principles:

1. Trust comes from reliability under pressure
Your system must work with low battery. Weak internet. When everything else fails.

2. The right tools make impossible tasks possible
Enable people to do hard things under difficult conditions.

3. Clarity enables action
Clear design removes hesitation. Give users confidence to act decisively.

Climate disasters aren't rare anymore. They're Tuesday. Every month brings new records. Heat domes. Atmospheric rivers. Category 6 hurricanes.

The biggest climate companies are finally getting this:
• Rivian designs trucks that maintain navigation during wildfire smoke. Not just daily commutes.
• Sunrun designs solar systems that work during blackouts. Not just sunny days.
• Climavision builds weather radar for extreme events. Not just forecasting.

As more companies enter climate adaptation and disaster response, Megan's principles become survival requirements.

The same principle applies to climate technology:
• Solar panels that work during storms
• EV charging that functions in extreme weather
• Carbon tracking that doesn't glitch during peak usage

As climate designers, we obsess over features. We should obsess over reliability. Your climate solution isn't just competing with other green tech. It's competing with the status quo when everything goes wrong. The fossil fuel system works reliably. That's why people stick with it. If your sustainable alternative fails during stress, you've lost more than a customer. You've lost trust in the entire climate movement.

My takeaway: design for the worst day, not just the best day. Test your climate tech during power outages. During heatwaves. During floods. Because that's exactly when we need it to work.

But this also raises the question: how do we balance reliability with efficiency?
-
Stop trying to be break-proof. Build for bounce back.

We’re taught to design systems that are fail-safe. But resilience isn’t about never falling down. It’s about how fast we get back up. It's what we learn in the process.

Speed & adaptability matter most. When things break, there’s a window, a threshold.
- If recovery begins quickly, momentum builds.
- If it drags, damage compounds.

This applies to both people and teams. When a setback occurs, the first 24-72 hours determine whether we stabilize or spiral.
- In systems, it’s the critical recovery rate.
- In leadership, it’s the response rhythm.

5 qualities of highly resilient people/teams:

(1) Design for recovery, not perfection.
Ask this question: "If we fail tomorrow, how do we restore 80% of capacity in one day?" Create fallback plans, reroute paths, and playbooks for quick resets.

(2) Move fast to regain momentum.
Recovery is like a muscle; it strengthens through use. After a disruption, run short, visible wins to restore confidence and signal progress.

(3) Know thresholds.
Systems collapse when they cross invisible lines. People do too. Identify your "too broken" point before you reach it, and build early warning signals around it.

(4) Don’t bounce back. Bounce forward.
True resilience isn’t returning to what was. It’s transforming into what’s next. Every crisis is data, and every disruption is feedback.

(5) Build recovery habits.
Don't be caught off guard. Reflect even for small setbacks. Ask "What was surprising?" and "What supported a fast recovery?"

Resilience isn’t a trait. It’s a design choice. We can’t eliminate failure, but we can build systems and teams that learn, adapt, and come back stronger every time. The goal isn’t to be unbreakable. It’s to be unstoppable.

♻️ Share if this resonates.
➕ Follow Elise Victor, PhD for mindset and growth insights.
-
You may be building for availability, but are you building for resiliency?

Many teams design for availability. Far fewer design for resiliency. A concept that took me a while to really grasp is that building highly available systems and highly resilient systems is not the same thing. The difference is how the system reacts to failure.

🚄 High Availability
When you build for high availability, the goal is simple: ensure there is always another path. If something fails, traffic can be redirected somewhere else. For example, a service might run across multiple availability zones or regions. If one fails, traffic is routed to another. Detecting failures and redirecting traffic are core elements of building for high availability. Availability is about rerouting traffic when something fails.

🚂 High Resiliency
Building for resiliency is different. The solution to failure isn’t another path; it’s how the system handles the error. When a dependency fails, the decision becomes:
Do we retry?
Do we continue without that dependency?
Do we degrade functionality?
Do we stop processing altogether?
Resiliency is about defining what happens when things go wrong. Sometimes you can continue processing. Sometimes you can defer work and fix it later. Resiliency is absorbing failure instead of avoiding it.

🧩 A Simple Example
When you design systems with resiliency in mind, you tend to treat dependencies differently. A simple example is configuration. Many systems use distributed configuration services so that runtime behavior can change without redeployment. But that configuration service then becomes a dependency. To avoid turning it into a hard dependency, many systems cache the configuration in memory. When updates occur, the system fetches the new configuration and switches only after it’s fully loaded into memory. If configuration refresh fails, the system continues operating with the last known configuration. Transient failures don’t bring the system down. That’s resiliency.

🧠 Final Thoughts
When I talk about non-functional requirements, you’ll hear me say: “Highly available and resilient systems”. I separate them intentionally because the approaches are different. Availability ensures there is always another path. Resiliency ensures the system can continue operating when failures occur.

Availability routes around failure. Resiliency survives failure. You need both.
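The configuration example above can be sketched in a few lines. This is a minimal illustration, assuming a hypothetical `CachedConfig` class and a caller-supplied `fetch` function that stands in for whatever configuration service is actually used. The new configuration is fully loaded before the switch, and a failed refresh keeps the last known configuration.

```python
from typing import Any, Callable, Dict, Optional


class CachedConfig:
    """Hypothetical sketch: in-memory config that survives refresh failures."""

    def __init__(self, fetch: Callable[[], Dict[str, Any]],
                 initial: Dict[str, Any]):
        self._fetch = fetch              # stand-in for the remote config call
        self._current = dict(initial)    # last known good configuration
        self.last_error: Optional[Exception] = None

    def refresh(self) -> bool:
        """Try to load new config; keep the old one if anything goes wrong."""
        try:
            candidate = dict(self._fetch())  # load completely before switching
        except Exception as exc:
            self.last_error = exc            # record it, but keep serving
            return False
        self._current = candidate            # switch in a single assignment
        self.last_error = None
        return True

    def get(self, key: str, default: Any = None) -> Any:
        return self._current.get(key, default)
```

A background task would call `refresh()` on a schedule and raise an alert if `last_error` stays set for too long, so a stale configuration is tolerated but not silently forgotten.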
-
Outages should be viewed as indicators of stress within a business model rather than simple glitches. Recent incidents, such as the Amazon Web Services (AWS) DNS failure and Vodafone’s UK outage, highlight a critical issue: many so-called "resilient" architectures may actually contain single points of failure, despite appearing to have multi-cloud alternatives. If an Industry 4.0 operation relies on only one cloud region, DNS path, or vendor control plane, it is not truly resilient; it is relying on luck. Addressing this requires a shift towards designing systems that anticipate failure. Strategies may include prioritizing local-edge operational technology (OT) to maintain essential functions, employing active-active configurations across multiple regions and providers, ensuring diverse peering and identity paths, utilizing dual-carrier connectivity, and implementing private 5G networks for reliable control. Regulators enforcing frameworks such as DORA, NIS2, and UK Operational Resilience will likely seek concrete evidence of resilience rather than presentations. While achieving true resilience involves costs, unplanned downtime can result in significant financial losses and damage customer trust. Recommended practices include conducting regular “Failure Day” exercises, mapping third-party dependencies down to the API level, and shifting key performance indicators (KPIs) from uptime to fault tolerance. This approach can help ensure that, when a region such as us-east-1 is disrupted, operational capabilities remain intact and financial performance is protected. At #BellLabsConsulting we have a full methodology to prevent events such as these and to respond faster when they happen.
-
The recent AWS outage underscores two architectural fundamentals that often separate resilient systems from fragile ones:

1️⃣ Enforcing Tiered Service Architecture
Every large-scale cloud platform should maintain a strict service tiering model. Tier 0 services, the true backbone, must have minimal or no dependencies. These are the primitives everything else relies on (e.g., identity, networking, metadata). Tier 1 and higher services may depend on lower tiers, but never in reverse. When lower-tier services inadvertently depend on higher-tier components (for metrics, authentication, or orchestration), the system becomes entangled, and outages ripple far wider than they should. Clear separation of service tiers is the foundation of blast-radius containment.

2️⃣ Multi-Region Deployment and Replication
For customers, architectural resilience means assuming failure is inevitable. Services should be multi-region by design, not by afterthought, with data and state replication, independent failover logic, and health-aware routing. High availability isn’t achieved by redundancy alone; it’s achieved by eliminating correlated failure domains, both in provider and customer architectures.

Resilience is an outcome of disciplined design, not luck during an outage.

#AWS #CloudArchitecture #Resilience #HighAvailability #DistributedSystems #SRE
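The tiering rule lends itself to an automated check. The sketch below is one possible way to do it, with hypothetical service names and a hypothetical `find_tier_violations` function: it flags any dependency where a service relies on something in a higher-numbered tier. Same-tier dependencies are allowed here, which is an assumption, since the post does not address them.

```python
from typing import Dict, List, Tuple


def find_tier_violations(
    tiers: Dict[str, int],
    dependencies: Dict[str, List[str]],
) -> List[Tuple[str, str]]:
    """Return (service, dependency) pairs where a lower tier depends on a higher one."""
    violations = []
    for service, deps in dependencies.items():
        for dep in deps:
            # Tier 0 is the backbone; depending "upwards" entangles the system.
            if tiers[dep] > tiers[service]:
                violations.append((service, dep))
    return sorted(violations)


if __name__ == "__main__":
    # Hypothetical example data, not taken from any real platform.
    tiers = {"identity": 0, "networking": 0, "storage": 1, "metrics": 2}
    dependencies = {
        "identity": ["networking", "metrics"],  # tier 0 -> tier 2: violation
        "storage": ["identity"],                # tier 1 -> tier 0: fine
        "metrics": ["storage", "identity"],     # tier 2 -> lower tiers: fine
    }
    print(find_tier_violations(tiers, dependencies))
```

Run against a real dependency inventory in CI, a check like this would catch an accidental upward dependency before it becomes part of an outage's blast radius.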
-
The Unsung Heroes of the Shoreline: Engineering Resilience with Tetrapods

When we think of coastal defense, we often imagine massive, flat sea walls. But engineering resilience often requires something much more sophisticated than just a bigger wall. Meet the tetrapod.

These four-legged, tetrahedrally shaped concrete structures are a masterclass in functional design. While a flat wall tries to absorb 100% of a wave's energy (often leading to structural failure or catastrophic erosion underneath), the tetrapod takes a different approach.

The Power of Cooperation and Dissipation:

Dissipation, Not Resistance: The tetrapod’s shape is specifically designed not to block the water. Instead, its geometry forces the incoming wave to split and flow around its limbs. This breaks the wave’s energy and dissipates its force through turbulence, protecting the coast behind it.

The Strength of Interlocking: Tetrapods are rarely used alone. They are designed to be placed in random interlocking groups. When waves hit, the structures slightly shift and lock tighter together, increasing the stability of the entire barrier. The chaos of their placement is actually their strength.

Adaptability: Unlike a fixed wall, tetrapod structures can be customized and repaired easily by adding or rearranging units to adapt to changing coastal conditions.

The Takeaway for Project Management and Leadership:

Don't fight force with force. When facing significant challenges (like market disruption or internal opposition), look for ways to dissipate that energy and redirect it, rather than trying to block it completely.

The whole is greater than the sum of its parts. Individual strength matters less than systemic interlocking. A team that self-organizes and "locks together" under pressure is far more resilient than a collection of strong individuals acting alone.

Chaos can contain stability. The most rigid systems are often the most brittle. Sometimes, a designed "randomness" allows for a flexibility that can withstand pressures a rigid structure cannot.
-
You can't 'think' your way to resilience. You have to design for recovery.

Most organizations want to treat burnout as a mindset problem (because it means the burden of 'fixing' it falls on the employee). They bring in folks like me to offer workshops, wellness challenges, and resilience trainings that, ironically, just add more to employees' to-do lists.

Early in my career as an organizational psychologist, I was asked to design a resilience intervention for a group of physicians. The brief was clear: "Help them handle stress better." But, when I looked closer, the real issue wasn't mindset; it was math. Their schedules had no recovery built in. No slack in the system, and no permission to pause. Asking them to "be more resilient" would have been like telling a sprinter to recover during the race.

So instead of creating more work disguised as wellness, we focused on scaffolding recovery into their existing routines:
↳ Collaborative debriefs to process emotional load during working hours.
↳ Getting scribes to reduce "pajama" time (time spent catching up after work).
↳ Accessible resources available anytime, so they could take advantage whenever they want.

That shift made the change stick. Because resilience usually doesn't come from pushing harder, it comes from designing smarter.

Yesterday, on a podcast with Elizabeth de Stadler 🎈 (who helps attorneys find balance), we talked about how this same trap shows up in law firms. I think I speak for Elizabeth when I say we both wish people understood that building systems that prevent harm in the first place is a much better strategy than trying to help people "cope better" (which runs the high risk of moral injury).

Toughness might get you through a sprint, but recovery is what helps you win the long game. So, here's a better question for all of us building high-performing systems and teams: How is recovery built into your week?
If you enjoy posts about building strong systems, finding joy, and creating a life full of agency, I will not let you down. Please follow me here: Michael Rucker, Ph.D.