Non-uniform access by indexing vs non-uniform access from a switch case or if-else statement

Question

This question is meant to be as general as possible, because I'm using the Slang shading language and may target multiple graphics APIs.

In OpenGL and Vulkan the equivalent of a Constant Buffer in DirectX is a Uniform Buffer. Let's just say you have a Constant Buffer or a Uniform Buffer declared:

struct ConstantBuffer
{
    uint values[2];
};

Now, it has been said that accesses to Constant Buffers and Uniform Buffers need to be dynamically uniform. Assuming that's true, what is the difference between:

uint result;
switch (non_uniform_condition)
{
case 0: result = ConstantBuffer.values[0]; break;
case 1: result = ConstantBuffer.values[1]; break;
}

and:

result = ConstantBuffer.values[non_uniform_condition];

I have spent a long time talking to an LLM, which makes a distinction between "memory divergence", which is what happens when you do values[non_uniform_condition], and "control-flow divergence", which is what happens when you do case 0: result = values[0]; case 1: result = values[1];. I honestly don't know if this is even a thing.

Furthermore, it's known that if you have an array of descriptors, like say an array of Texture2D[], and you want to access them in a non-uniform way, you need to use:

nonuniformEXT

or

NonUniformResourceIndex

However, what about in the following case:

switch(non_uniform_condition)
{
case 0: textures[0];
case 1: textures[1];
}

Do I still need to use nonuniformEXT here?

vandench · Accepted Answer · 2025-10-18 23:03:09Z

Do I still need to use nonuniformEXT here?

No.

Consider what this will do at the machine code level (without optimizations). If you pass a constant into the array accessor, then the instruction will use an immediate. If you pass a variable into the array accessor, then the instruction will use a register. These are two different instructions. They may have the same name, but they will have different encodings, and slightly different handlers inside the processor.

For any type of array access, using a constant should always be allowed. Using a variable will either be explicitly defined to require something like nonuniformEXT, or should be fine without such an extension.

As noted in your previous question however, the switch isn't inherently superior (and is almost certainly universally worse). Beyond being more painful to implement, older hardware will likely have to step through every branch, and newer hardware capable of jumping to the correct branch will also likely be capable of nonuniform access. There isn't much reason for nonuniform access to be slow, in the same way that ARM hasn't required aligned memory reads for well over a decade. It might make a small performance difference, but the real reason for the distinction is to support a broader range of hardware.
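To make "step through every branch" a bit more concrete, here is a toy Python model of a four-lane wave executing in lockstep. It is only a sketch under simplifying assumptions: the names `run_switch` and `run_indexed` are made up for illustration, the model skips cases that no lane takes, and real hardware differs in how it masks lanes and services loads.

```python
# Toy model of one wave (a group of lanes that execute in lockstep).
# All names here are hypothetical; this is not any real GPU's behaviour.

VALUES = [10, 20]  # stands in for ConstantBuffer.values


def run_switch(lane_conditions):
    """switch form: each case is a separate load with a constant index.

    The wave runs every case that at least one lane takes, with the
    other lanes masked off. Returns (per-lane results, passes executed).
    """
    results = [None] * len(lane_conditions)
    passes = 0
    for case in range(len(VALUES)):
        mask = [cond == case for cond in lane_conditions]
        if not any(mask):
            continue  # no lane takes this case, so the wave skips it
        passes += 1
        loaded = VALUES[case]  # one uniform load shared by all active lanes
        for lane, active in enumerate(mask):
            if active:
                results[lane] = loaded
    return results, passes


def run_indexed(lane_indices):
    """indexed form: one load instruction, each lane supplies its own index.

    Returns (per-lane results, number of distinct elements touched).
    """
    results = [VALUES[i] for i in lane_indices]
    return results, len(set(lane_indices))


lanes = [0, 1, 1, 0]  # a non-uniform condition across four lanes
switch_results, passes = run_switch(lanes)
indexed_results, distinct = run_indexed(lanes)
print(switch_results, passes)
print(indexed_results, distinct)
```

Both forms produce the same per-lane values. In this model the switch pays for the divergence as extra passes over code, each containing a uniform load, while the indexed form stays a single instruction whose lanes touch different elements.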

So array[non_uniform_idx] is fundamentally different from switch (non_uniform_idx) { case 0: array[0]; case 1: array[1]; }?

Yes. (Probably. Technically it could be encoded as functionally the same depending on the architecture, but Vulkan is intended to be architecture-agnostic.)

In the first case it's the same instruction executed by all threads, and in the second it's DIFFERENT instructions being executed by each thread (different parts of the code).

Pretty much.

So what the LLM was saying about the distinction between "memory divergence" and "control-flow divergence" was correct?

Probably.

I still don't understand what the fundamental difference deep down is though that makes this difference.

So, let's look at the x86 32-Bit CISC architecture. Being CISC it has many ways of encoding instructions, and they're variable length. We can examine the encoding differences of two instructions and see where they diverge.

0x0000000000000000:  8B 48 08    mov ecx, [eax + 8]
0x0000000000000003:  8B 0C B8    mov ecx, [eax + edi*4]

The first can be read as "load a 32-Bit value from the memory address stored in EAX, offset 8, into ECX".
The second can be read as "load a 32-Bit value from the memory address stored in EAX, plus an offset stored in EDI times 4, into ECX".

ECX would be the register holding the 32-Bit value queried from the array, EAX would be the register holding the base address of the array, EDI would be the non-uniform index (multiplied by 4 for a 32-Bit value size), and 8 would be a fixed, uniform byte offset (element 2 of an array of 32-Bit values).

The first instruction decodes as

0x8B - MOV r32, r/m32
0x48 - ModR/M Byte [EAX]+disp8
        - 0b01XX'XXXX - 8 Bit Displacement
        - 0bXX00'1XXX - Store into ECX
        - 0bXXXX'X000 - Base Register EAX
0x08 - Displacement of 8

The second instruction decodes as

0x8B - MOV r32, r/m32
0x0C - ModR/M Byte [--][--] (SIB follows)
        - 0b00XX'XXXX - No Displacement
        - 0bXX00'1XXX - Store into ECX
        - 0bXXXX'X100 - SIB follows
0xB8 - SIB [EDI*4]
        - 0b10XX'XXXX - Index Scale 4
        - 0bXX11'1XXX - Index Register EDI
        - 0bXXXX'X000 - Base Register EAX

Both of these decode to the MOV r32, r/m32 instruction, but x86 intentionally has a complex form of address generation.

The first decodes as base register with displacement, whereas the second decodes as base register with a scaled index and no displacement.
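The field breakdown above can be checked mechanically. The following is a minimal Python sketch that splits the ModR/M and SIB bytes into their bit fields and rebuilds the assembly text. The function names are hypothetical, and it only handles the two addressing forms discussed here, so it is not a general x86 decoder.

```python
# Split the ModR/M and SIB bytes from the two encodings above into fields.
# Register numbering follows the standard 32-bit x86 encoding.

REG32 = ["eax", "ecx", "edx", "ebx", "esp", "ebp", "esi", "edi"]


def split_2_3_3(byte):
    """Split a byte into its top 2 bits, middle 3 bits, and low 3 bits."""
    return (byte >> 6) & 0b11, (byte >> 3) & 0b111, byte & 0b111


def decode_mov_load(code):
    """Decode `8B /r` (mov r32, r/m32) for the two forms discussed here.

    Only handles mod=01 with a plain base register, and mod=00 with a SIB
    byte; anything else raises ValueError.
    """
    if code[0] != 0x8B:
        raise ValueError("not mov r32, r/m32")
    mod, reg, rm = split_2_3_3(code[1])
    dest = REG32[reg]
    if mod == 0b01 and rm != 0b100:
        # Base register + 8-bit displacement (a constant offset).
        # This sketch treats the displacement as non-negative.
        return f"mov {dest}, [{REG32[rm]} + {code[2]}]"
    if mod == 0b00 and rm == 0b100:
        # SIB byte follows: scale, index register, base register.
        scale_bits, index, base = split_2_3_3(code[2])
        if index == 0b100 or base == 0b101:
            raise ValueError("special SIB case not handled in this sketch")
        return f"mov {dest}, [{REG32[base]} + {REG32[index]}*{1 << scale_bits}]"
    raise ValueError("addressing form not handled in this sketch")


constant_index = decode_mov_load(bytes([0x8B, 0x48, 0x08]))
variable_index = decode_mov_load(bytes([0x8B, 0x0C, 0xB8]))
print(constant_index)
print(variable_index)
```

The two calls take different paths through the decoder even though the opcode byte is the same, which is the point being made: the constant offset lives in the instruction bytes, while the variable index names a register.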

It's worth noting however, that while we can encode this in the x86 32-Bit architecture, a full equivalent could not be formed with the x86 16-Bit architecture. The closest we can come is

0x0000000000000000:  8B 4F 08    mov cx, [bx + 8]
0x0000000000000003:  8B 09       mov cx, [bx + di]

The A register can no longer be used for memory addresses, and we can't use a scaled index.
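As a rough illustration of why, here is a small Python sketch of 16-bit ModR/M decoding. In 16-bit addressing the r/m field picks from a fixed table of eight base/index combinations, with no SIB byte. The function name is hypothetical and the sketch only covers the no-displacement and 8-bit-displacement forms.

```python
# 16-bit ModR/M addressing: the r/m field selects one of eight fixed
# base/index combinations. There is no SIB byte and no scale factor.

REG16 = ["ax", "cx", "dx", "bx", "sp", "bp", "si", "di"]
RM16 = ["bx + si", "bx + di", "bp + si", "bp + di", "si", "di", "bp", "bx"]


def decode_mov_load16(code):
    """Decode `8B /r` (mov r16, r/m16) with 16-bit addressing.

    Handles mod=00 (no displacement) and mod=01 (8-bit displacement,
    treated as non-negative in this sketch).
    """
    if code[0] != 0x8B:
        raise ValueError("not mov r16, r/m16")
    modrm = code[1]
    mod, reg, rm = (modrm >> 6) & 0b11, (modrm >> 3) & 0b111, modrm & 0b111
    if mod == 0b00 and rm != 0b110:
        # rm=110 with mod=00 means a bare 16-bit address, not [bp].
        return f"mov {REG16[reg]}, [{RM16[rm]}]"
    if mod == 0b01:
        return f"mov {REG16[reg]}, [{RM16[rm]} + {code[2]}]"
    raise ValueError("addressing form not handled in this sketch")


constant_index16 = decode_mov_load16(bytes([0x8B, 0x4F, 0x08]))
variable_index16 = decode_mov_load16(bytes([0x8B, 0x09]))
print(constant_index16)
print(variable_index16)
```

Note that `ax` never appears in the `RM16` table and none of its entries carries a multiplier, which matches the two limitations described above.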

Interestingly, your x64 CPU can and does use both instruction sets. When an x86 CPU first boots it actually starts in 16-Bit mode, and can only later through code be switched to running in 32-Bit or 64-Bit mode.

While the original x86 instruction set is certainly capable of non-uniform indexing, you can see there are differences in how the instructions are encoded, and it's not hard to imagine early programmable GPUs deciding that they don't need an instruction for handling non-uniform indexes.

You could reasonably ask, "What if every GPU actually does support non-uniform indexing?" If so, great: enable the extension and be confident it will work on any GPU with a Vulkan driver. Most likely you will not encounter any GPUs which lack this capability. You can search NonUniform at https://vulkan.gpuinfo.org/listfeaturescore12.php to see the device coverage for nonuniform indexing on GPUs with a Vulkan 1.2 driver. It currently sits at around 93%-94% (though there is an outlier in that shaderInputAttachmentArrayNonUniformIndexing only has 75% device coverage). A lot of devices which supposedly don't support it actually got support added in a later driver. For instance, any Apple Vulkan driver made before 2023-12-25 lacks support, but any driver made after has support.

So array[non_uniform_idx] is fundamentally different from switch (non_uniform_idx) { case 0: array[0]; case 1: array[1]; } ? In the first case it's the same instruction executed by all threads, and in the second it's DIFFERENT instructions being executed by each thread (different parts of the code). So what the LLM was saying about the distinction between "memory divergence" and "control-flow divergence" was correct? I still don't understand what the fundamental difference deep down is though that makes this difference. I mean I understand that one indexes into ...
... a descriptor array with a value, versus having a branch that points to different code instructions telling the code to execute either path1 or path2, each of which accesses a different descriptor. I understand at a surface level why these are different, but I don't really understand why, or what's actually happening. Is there some way of understanding this, or a simple way you can explain it? Or something I can read to understand it?
Also, this applies only to descriptors/resources, not buffers. That includes the ConstantBuffer on DirectX or the uniform buffer in Vulkan or OpenGL, which I've read needs dynamically uniform access, though I don't think that's the case anymore. I know Metal had that requirement but I think it dropped it.
