Do I still need to use nonuniformEXT here?
No.
Consider what this will do at the machine code level (without optimizations). If you pass a constant into the array accessor, the instruction will use an immediate. If you pass a variable into the array accessor, the instruction will use a register. These are two different instructions. They may share a name, but they have different encodings and slightly different handling inside the processor.
For any type of array access, using a constant should always be allowed. Using a variable will either be explicitly defined to require something like nonuniformEXT, or should be fine without such an extension.
As noted in your previous question, however, the switch isn't inherently superior (and is almost certainly universally worse). Beyond being more painful to implement, older hardware will likely have to step through every branch, and newer hardware capable of jumping to the correct branch will likely also be capable of nonuniform access. There isn't much reason for nonuniform access to be slow, in the same way that ARM hasn't required aligned memory reads for well over a decade. It might make a small performance difference, but the real reason for the distinction is to support a broader range of hardware.
So array[non_uniform_idx] is fundamentally different from switch (non_uniform_idx) { case 0: array[0]; case 1: array[1]; }?
Yes. (Probably. Technically, the two could be encoded in a functionally identical way on some architectures, but Vulkan is intended to be architecture agnostic.)
In the first case it's the same instruction executed by all threads, and in the second it's DIFFERENT instructions being executed by each thread (different parts of the code).
Pretty much.
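As a rough illustration, here is a toy Python model of a group of threads ("lanes") running in lockstep. It is only a sketch under stated assumptions: the function names indexed_load and switch_load are made up for this example, and the model assumes hardware that executes a switch by visiting each taken case in turn with the other lanes masked off. It does not describe any particular GPU.

```python
def indexed_load(array, indices):
    """Every lane runs the same load; only the address operand differs per lane."""
    issued = ["load array[idx]"]          # one instruction for the whole group
    results = [array[i] for i in indices]
    return results, issued


def switch_load(array, indices):
    """Each case is separate code; the group visits every case some lane needs."""
    issued = []
    results = [None] * len(indices)
    for case in sorted(set(indices)):     # one pass per distinct case taken
        issued.append(f"load array[{case}]")
        for lane, idx in enumerate(indices):
            if idx == case:               # lanes not on this case are masked off
                results[lane] = array[case]
    return results, issued


array = [10, 20, 30, 40]
indices = [2, 0, 2, 3]                    # per-lane, non uniform index

a, a_issued = indexed_load(array, indices)
b, b_issued = switch_load(array, indices)
print(a == b, len(a_issued), len(b_issued))
```

Both versions return the same values. In this model the indexed form issues one instruction with per-lane addresses, while the switch form issues one load per distinct case that any lane takes.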
So what the LLM was saying about the distinction between "memory divergence" and "control-flow divergence" was correct?
Probably.
I still don't understand what the fundamental difference deep down is though that makes this difference.
So, let's look at the x86 32-Bit CISC architecture. Being CISC, it has many ways of encoding instructions, and they're variable length. We can examine the encodings of two instructions and see where they diverge.
0x0000000000000000: 8B 48 08 mov ecx, [eax + 8]
0x0000000000000003: 8B 0C B8 mov ecx, [eax + edi*4]
The first can be read as "load a 32-Bit value from the memory address stored in EAX, offset 8, into ECX".
The second can be read as "load a 32-Bit value from the memory address stored in EAX, plus an offset stored in EDI times 4, into ECX".
Here ECX is the register receiving the 32-Bit value read from the array, EAX is the register holding the base address of the array, and EDI is the non uniform index (multiplied by 4 because each element is 4 bytes). The 8 is a fixed byte offset, which corresponds to a constant index of 2 into an array of 32-Bit values.
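The address arithmetic behind those two loads can be sketched in Python. Everything here is an assumption made for illustration: the flat memory buffer, the base address of 16, the array contents, and the load32 helper are all invented, with plain variables standing in for the registers.

```python
import struct

memory = bytearray(64)                    # pretend address space
eax = 16                                  # base address of the array (arbitrary)
struct.pack_into("<4I", memory, eax, 100, 200, 300, 400)


def load32(address):
    """Read one little-endian 32-bit value, like a 32-bit MOV from memory."""
    return struct.unpack_from("<I", memory, address)[0]


# mov ecx, [eax + 8]: the offset is baked into the instruction
ecx_fixed = load32(eax + 8)

# mov ecx, [eax + edi*4]: the offset comes from a register at run time
edi = 3
ecx_indexed = load32(eax + edi * 4)

print(ecx_fixed, ecx_indexed)
```

The first load always reads element 2 (byte offset 8), while the second reads whichever element EDI selects.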
The first instruction decodes as
0x8B - MOV r32, r/m32
0x48 - ModR/M Byte [EAX]+disp8
- 0b01XX'XXXX - 8 Bit Displacement
- 0bXX00'1XXX - Store into ECX
- 0bXXXX'X000 - Base Register EAX
0x08 - Displacement of 8
The second instruction decodes as
0x8B - MOV r32, r/m32
0x0C - ModR/M Byte [--][--] (SIB follows)
- 0b00XX'XXXX - No Displacement
- 0bXX00'1XXX - Store into ECX
- 0bXXXX'X100 - SIB follows
0xB8 - SIB [EDI*4]
- 0b10XX'XXXX - Index Scale 4
- 0bXX11'1XXX - Index Register EDI
- 0bXXXX'X000 - Base Register EAX
Both of these decode to the MOV r32, r/m32 instruction, but x86 intentionally has a complex form of address generation.
The first decodes as base register with displacement, whereas the second decodes as base register with a scaled index and no displacement.
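The bit fields above can be checked with a small decoder. This is a minimal sketch, not a real disassembler: decode_mov_load32 is a hypothetical helper that only handles opcode 0x8B with the mod=00 and mod=01 memory forms, which is just enough for the two instructions shown.

```python
REG32 = ["eax", "ecx", "edx", "ebx", "esp", "ebp", "esi", "edi"]


def decode_mov_load32(code: bytes) -> str:
    """Decode opcode 0x8B (MOV r32, r/m32) for mod=00 and mod=01 memory forms."""
    if code[0] != 0x8B:
        raise ValueError("only opcode 0x8B is handled")
    modrm = code[1]
    mod, reg, rm = modrm >> 6, (modrm >> 3) & 7, modrm & 7
    if mod > 1:
        raise ValueError("only mod=00 and mod=01 are handled")
    pos = 2
    if rm == 0b100:                       # r/m of 100 means a SIB byte follows
        sib = code[pos]
        pos += 1
        scale, index, base = 1 << (sib >> 6), (sib >> 3) & 7, sib & 7
        if mod == 0 and base == 0b101:
            raise ValueError("disp32-without-base form is not handled")
        parts = [REG32[base]]
        if index != 0b100:                # index of 100 means "no index"
            parts.append(f"{REG32[index]}*{scale}")
    else:
        if mod == 0 and rm == 0b101:
            raise ValueError("disp32-only form is not handled")
        parts = [REG32[rm]]
    expr = " + ".join(parts)
    if mod == 1:                          # one signed displacement byte
        disp = int.from_bytes(code[pos:pos + 1], "little", signed=True)
        if disp:
            expr += f" {'+' if disp > 0 else '-'} {abs(disp)}"
    return f"mov {REG32[reg]}, [{expr}]"


print(decode_mov_load32(bytes.fromhex("8B 48 08")))
print(decode_mov_load32(bytes.fromhex("8B 0C B8")))
```

The opcode byte is identical for both; the addressing form is chosen entirely by the ModR/M byte and, in the second case, the SIB byte that follows it.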
It's worth noting, however, that while we can encode this in the x86 32-Bit architecture, a full equivalent cannot be formed in the x86 16-Bit architecture. The closest we can come is
0x0000000000000000: 8B 4F 08 mov cx, [bx + 8]
0x0000000000000003: 8B 09 mov cx, [bx + di]
The A register (AX) can no longer be used as a base for memory addresses, and we can't use a scaled index.
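The reason is visible in the 16-Bit ModR/M table: the r/m field selects one of eight fixed base/index combinations, and there is no SIB byte. The sketch below hard-codes that table. As before, decode_mov_load16 is a hypothetical helper that only covers opcode 0x8B with mod=00 and mod=01.

```python
REG16 = ["ax", "cx", "dx", "bx", "sp", "bp", "si", "di"]

# The eight 16-bit addressing forms selected by the r/m field.
RM16 = ["bx + si", "bx + di", "bp + si", "bp + di", "si", "di", "bp", "bx"]


def decode_mov_load16(code: bytes) -> str:
    """Decode opcode 0x8B (MOV r16, r/m16) for mod=00 and mod=01 memory forms."""
    if code[0] != 0x8B:
        raise ValueError("only opcode 0x8B is handled")
    modrm = code[1]
    mod, reg, rm = modrm >> 6, (modrm >> 3) & 7, modrm & 7
    if mod > 1:
        raise ValueError("only mod=00 and mod=01 are handled")
    if mod == 0 and rm == 0b110:
        raise ValueError("direct-address (disp16 only) form is not handled")
    expr = RM16[rm]
    if mod == 1:                          # one signed displacement byte
        disp = int.from_bytes(code[2:3], "little", signed=True)
        if disp:
            expr += f" {'+' if disp > 0 else '-'} {abs(disp)}"
    return f"mov {REG16[reg]}, [{expr}]"


print(decode_mov_load16(bytes.fromhex("8B 4F 08")))
print(decode_mov_load16(bytes.fromhex("8B 09")))
```

None of the eight entries mention AX, and none carry a scale factor, so any multiplication of the index has to be done by a separate instruction beforehand.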
Interestingly, your x64 CPU can and does use both instruction sets. When an x86 CPU first boots, it actually starts in 16-Bit mode, and only later can code switch it to running in 32-Bit or 64-Bit mode.
While the original x86 instruction set is certainly capable of non uniform indexing, you can see there are differences in how the instructions are encoded, and it's not hard to imagine early programmable GPUs deciding that they don't need an instruction for handling non uniform indexes.
You could reasonably ask "What if every GPU actually does support non uniform indexing?" If so, great: enable the extension and be confident it will work on any GPU with a Vulkan driver. Most likely you will not encounter any GPUs which lack this capability. You can search NonUniform at https://vulkan.gpuinfo.org/listfeaturescore12.php to see the device coverage for nonuniform indexing on GPUs with a Vulkan 1.2 driver. It currently sits at around 93%-94% (though there is an outlier in that shaderInputAttachmentArrayNonUniformIndexing only has 75% device coverage). A lot of devices which supposedly don't support it actually got support added in a later driver. For instance, any Apple Vulkan driver made before 2023-12-25 lacks support, but any driver made after has support.