Last week I had an interesting C++ debugging experience, which I'm sharing below in case it's useful to others. I was making an API (let's call it A()) safer by adding assertions on its precondition. In the past, A() would do some implementation-defined stuff when the operation didn't make sense, and it's dangerous to rely on such behavior. Since a violated precondition indicates a programmer error, I decided to make A() fail (as in crash the program) when its precondition is not met. This lets us catch such programmer errors earlier and more easily.

Unsurprisingly, this change broke tons of tests in presubmit checks. Great! All these breakages are bugs waiting to be fixed; the change just helped me discover them. For free.

So I looked at the crash stack traces to see who the callers of A() are, as the bugs are likely somewhere near the call sites. However, this didn't get me very far: the stack traces often don't reflect the actual call chains due to aggressive compiler optimizations (e.g. inlining). Often a stack trace shows that A() is called by Foo(), but I cannot find this call in Foo()'s body, because the actual call chain may be Foo() -> Bar() -> Baz() -> A() and the compiler has simply squashed it via inlining. This makes the debugging a lot harder. And since A() is extremely widely used, going through all of its callers wasn't practical either.

Easy, I thought. I'll just reduce the optimization level and disable function inlining when compiling the code. Unfortunately, this trick didn't work: the programs were so complex that disabling inlining caused the build machines to OOM. If there were just a handful of callers of A(), I could have added logging at those call sites to tell which one leads to the crash, but as said earlier, there are way too many callers for that to be practical.

Luckily, C++20 lets us solve this problem with O(1) effort.
I added an optional parameter to A() like this:

```cpp
void A(int some_param,
       std::source_location loc = std::source_location::current()) {
  if (precondition_is_not_met) {  // pseudocode condition
    LOG(FATAL) << loc.file_name() << ":" << loc.line()
               << ": A() called with broken precondition.";
  }
  // ...
}
```

Now, when we call A(), the call site's source location is automatically passed to A() and logged when the precondition is not met. The crash stack traces now tell me exactly where I should be looking. The bugs were quickly identified and fixed. Sweet.
Debugging Tips for Software Engineers
Summary
Debugging is the process of identifying, analyzing, and fixing issues in software to ensure it functions as intended. For software engineers, adopting structured strategies can simplify the process and increase accuracy, transforming challenges into learning opportunities.
- Reproduce the issue: Always start by recreating the bug in a controlled environment so you can clearly understand its behavior and scope.
- Trace and analyze: Follow the code logic and data flow step-by-step to pinpoint where the issue originates and why it occurs.
- Test and plan fixes: After identifying the problem, implement changes cautiously, test thoroughly, and add safeguards like logging or automated tests to prevent future recurrences.
Turns out "undefined" isn't a valid API key. Every 500 error I've caused has taught me more than a successful deployment ever did. Backend engineering isn't just about building systems. Sometimes, you break them, debug them, and learn. Here are 7 real mistakes that taught me more than any tutorial:

1. Forgot to set an environment variable: worked locally, blew up in prod.
✅ Don't assume defaults exist
✅ Fail fast if critical configs are missing
✅ Validate env vars on startup, not after the app crashes

2. Didn't handle a null or undefined field: classic edge-case blind spot.
✅ Validate input and response data
✅ Use null-safe access patterns
✅ Add tests for edge cases

3. Relied on a 3rd-party API without a fallback: guess who had a bad day when it went down?
✅ Use retries with backoff
✅ Add fallback responses
✅ Gracefully degrade non-critical features

4. Improper timeout config: hello, hanging requests and cascading failures.
✅ Set proper timeout values
✅ Handle timeout errors explicitly
✅ Monitor and tune under load

5. Race conditions in async code: everything's fine... until it's not under load.
✅ Avoid shared mutable state
✅ Use locks or atomic ops when needed
✅ Simulate load in test environments

6. Pushed a schema change without a data migration: and broke everything in 2 seconds.
✅ Pair schema changes with migrations
✅ Always test on staging with real data

7. Skipped input validation: the user sent a payload that wrecked my assumptions.
✅ Never trust client data
✅ Validate at the edge (API boundary)
✅ Enforce schemas and constraints

You don't become good by avoiding failure. You get there by surviving it. Failure isn't a detour; it is the curriculum. Any lesson to add?
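Mistake #1, "fail fast if critical configs are missing", can be sketched in a few lines of C++. Everything here is illustrative: RequireEnv is a hypothetical helper, and the variable name is whatever your service actually requires.

```cpp
#include <cstdio>
#include <cstdlib>
#include <string>

// Sketch: look up a required environment variable at startup and abort
// immediately with a clear message if it is missing or empty, instead of
// failing later with a confusing error deep inside a request handler.
std::string RequireEnv(const char* name) {
  const char* value = std::getenv(name);
  if (value == nullptr || value[0] == '\0') {
    std::fprintf(stderr, "fatal: required env var %s is not set\n", name);
    std::exit(EXIT_FAILURE);
  }
  return value;
}
```

Calling something like RequireEnv("API_KEY") (a hypothetical variable name) at the top of main() means a misconfigured deployment dies at startup with an actionable message, rather than sending "undefined" to an upstream API.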
Anyone can fix a bug. But the way you do it shows what kind of engineer you are. Here's a checklist mindset that's helped me:
✅ Try to reproduce the bug first
✅ Trace where in the codebase it's happening
✅ Backtrack the logic & data flow - understand the "why"
✅ Figure out what files or components need changes
✅ Plan how you'll verify that your fix actually works
✅ If you're stuck, ask questions early (not last!)
✅ Once fixed, check that it works end-to-end
✅ Write tests to catch it early in the future
✅ Follow through: share updates, close loops, and let people know it's taken care of. That's how you build trust.

You didn't just solve a bug. You solved it well.