Jacob Blankenship’s Post

Conditionals in code incurs a performance cost many software engineers I have worked with are not aware of. Each conditional represents a fork in the path of processing for a CPU. Modern CPUs use branch prediction to improve performance by assuming the correct answer. This is to improve performance and instruction throughput. CPUs have incredible prediction rates based on pattern recognition, but occasionally they guess wrong. The cost of being wrong involves "unwinding" the instructions executed. Multiple statements in a conditional or nested conditionals are particularly difficult. The good news is a technique called branchless programming solves part of this issue. Conditionals in many cases can be altered to calculate the answer directly. By removing the need to guess, the CPU no longer has to unwind. An example counting even numbers:

  • No alternative text description for this image

You can push the same idea further by using bitwise operations instead of % when working with powers of two, because the CPU already works in binary and & is cheaper than an integer division; for example, when checking for even numbers you do not actually need value % 2 == 0, you just need to test whether the least significant bit is zero, so a branchless loop can look like this: uint64_t count{ 0 }; for (uint64_t value{ 0 }; value < 10; ++value) { // branchless, using bitwise AND count += ((value & 1) == 0); } Here value & 1 extracts only the lowest bit, it is 0 for even numbers and 1 for odd numbers, and the comparison ((value & 1) == 0) yields 1 for evens and 0 for odds, so the CPU can accumulate the result without any conditional jump; you can generalize this trick to any 2^k divisor by using value & ((1ULL << k) - 1) instead of % (1ULL << k) and then turning the boolean expression into 0 or 1 and adding it directly, which keeps the code branchless and closer to what the hardware is actually doing under the hood.

This is a (wonderful) rabbit hole that was my gateway to data-parallel (SIMD / GPU / etc.) programming years ago. With predication / masking you can eliminate all runtime-dependent program control - switch / if / etc. in all situations - (non-inlined) function calls - even gotos - with any amount of nesting / combinatorics. And modern CPU (and of course GPU) hardware are ready to jump on it. You just can't really do it (practically anyway) with C/C++ and their ilk, which are not the mother tongue of all languages (contrary popular belief) - they are really *still* DSLs for PDP-11-esque scalar machines, and the metaphors don't line up. For instance, you need to pass an implicit "active mask" to all non-inlined functions in play. Strangely C++ makes a really good *backend* to produce virtually perfect conditional-free hardware exploitation code, but you'll need a different front end that emits this specialized C++. In this particular example, apart from not needing to really compute the value, to really see the power of eliding conditionals you'd want to do a per-"lane" count result and then do a reduction outside the loop.

Don’t write for the compiler. Write for the humans who read and maintain the code. I’m not a compiler writer but modern compilers can detect a lot of common logic flows and generate equivalent performing code. Also your Boolean in the assignment is equivalent to an “if”. The Boolean has to be evaluated and there are two outcomes it has to choose between. No real difference. But one is harder for the human to look at and easily reason about.

I agree with others in the comments that this example isn’t great for the concept being discussed and code readability is to be valued over trickiness that the compiler will clean up anyway, but to bring it around to actual branchless code: Clang has __builtin_unpredictable() to suggest to the compiler to produce a branchless sequence if possible. It was originally added by my team at Sony to support games processing arrays of in-game entities in a tight loop where previous entries in the array are not a prediction indicator for future entries as CPU branch prediction hardware assumes.

Fun fact: The actual assembly code output by the compiler for this is only branchless due to smart compiler optimizations -- i.e., it depends on tricks possible with most modern CPU instruction sets. That's because any expression involving a comparison operator logically does make a decision. That decision, however, doesn't always have to be encoded into an actual conditional branch instruction. Lucky us. However, if you write similar code using a floating point type (or anything else with a comparison operation), jump instructions magically appear in the assembly output. The lesson is, just because there are no explicit 'if' statements in the source code, doesn't mean you're doing branchless programming.

Have you done benchmarking on this code? The former communicates intent better than the latter does, it's more readable, and that can make the difference between a developer taking a few seconds to understand it and taking a few minutes. Unless this is the hottest part of an innermost loop, any performance gains you will get from it are probably going to be negligible. Micro-optimisations are rarely worth it except under specific circumstances. Also, don't forget that optimising compilers are are a thing and they're now pretty good at spotting patterns like this and optimising them out. There's a place for this kind of optimisation, but if you use it all over the place you'll get a much less readable codebase for questionable performance gains. Don't trade readability for performance unless you can demonstrate that there's a clear benefit, and that means profiling and benchmarking.

count += (value +1) & 1; Also, the ternary expression includes an implied branch and doesn’t eliminate a branch. a = test ? b : c; Is equivalent to if ( test ) a = b; else a = c; That aside, I agree with your point.

Kinda, it depends, not in this case. Compilers are excellent and trying to coerce a compiler to produce code that you think is faster tends to generate code that is slower. Also, focusing on trying to generate faster code everywhere is bad practice. Write code that you can modify and maintain easily, PROFILE YOUR CODE, find your hotspot, focus on those.

Are you sharing your journey to learn how to code on Linkedin ?

Did you just claim a ternary is not a conditional? I feel dumber having read your post.

See more comments

To view or add a comment, sign in

Explore content categories