How to implement Ballot() with DX11


Today I will talk about a quick (slow) hack for getting some platform-specific code working cross-platform. This is really about correctness, not speed. I mean, otherwise the shader code wouldn't run on Windows anyway. This hack won't be necessary once we can switch over to DXC, Microsoft's new shader compiler with Shader Model 6.0, but for now, that's out of reach.

[Image from GPUOpen... though I'm pretty sure this isn't actually what it does.]

In order to explain what the function Ballot() actually does, I need to first take a little diversion to explain GPU hardware.


The way a (modern, discrete) GPU works is that it executes each instruction for WAVE_SIZE threads (64 on AMD and 32 on Nvidia) at a time, all in lockstep. An easy way to think about it is that a set of WAVE_SIZE threads all share one instruction pointer. A group of WAVE_SIZE threads all located on a single SIMD is called a wave (or warp, using NV terminology). The cool thing about this is that threads within the same wave can share data with each other. Here's a page on the wave intrinsics supported by Microsoft's new shader compiler.

You can do things like:
  • Broadcast read/write a value at a specific thread
  • Get the min/max value across all threads
  • Get a bitfield that represents some true/false predicate over all threads in a wave.
That last thing is what Ballot() does. I guess that's still kind of unclear, so I will go into more detail.
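
For reference, here is a rough sketch of how those three operations look with the Shader Model 6.0 wave intrinsics (WaveReadLaneAt, WaveActiveMin, WaveActiveBallot). It needs DXC and won't compile for DX11; the buffer g_out and the test values are made up for illustration.

```hlsl
// Requires Shader Model 6.0 (DXC, e.g. target cs_6_0). Not available through DX11.
RWStructuredBuffer<uint4> g_out : register(u0);   // hypothetical output buffer

[numthreads(64, 1, 1)]
void CSMain(uint3 dtid : SV_DispatchThreadID)
{
    uint value = dtid.x * 3u;                     // arbitrary per-thread value

    // Broadcast: every lane reads the value held by lane 0 of its wave.
    uint fromLane0 = WaveReadLaneAt(value, 0);

    // Reduction: the minimum of 'value' across the active lanes of the wave.
    uint smallest = WaveActiveMin(value);

    // Ballot: one bit per lane, packed into a uint4 (up to 128 lanes).
    uint4 oddLanes = WaveActiveBallot((dtid.x & 1u) != 0u);

    g_out[dtid.x] = uint4(fromLane0, smallest, oddLanes.x, oddLanes.y);
}
```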


Here's a quick code example to demonstrate the functionality of Ballot().
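
As a minimal sketch of the semantics (not of how the hardware does it), the function below builds the same mask on a single thread from an array of per-lane predicates. The name BallotReference and the fixed WAVE_SIZE of 32 are assumptions made up for illustration.

```hlsl
// Assumption for illustration: a 32-wide wave (Nvidia). A 64-wide wave needs 64 bits.
#define WAVE_SIZE 32

// Hypothetical single-threaded reference for the value Ballot() hands to every lane:
// bit i of the result is set exactly when lane i's predicate is true.
uint BallotReference(bool predicates[WAVE_SIZE])
{
    uint mask = 0;
    for (uint lane = 0; lane < WAVE_SIZE; ++lane)
    {
        mask |= (predicates[lane] ? 1u : 0u) << lane;
    }
    return mask;
}
```

So if only the even-numbered lanes pass true, every lane in the wave gets back 0x55555555.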

Is this actually useful? Yeah, you can use it to write out bitfields about the status of each thread, like maybe each thread does some kind of intersection test and you want to write out a collision mask to some buffer.

The bad part, like I mentioned earlier, is that this isn't supported on DX11, unless you implement it yourself.


Maybe we can use LDS?

Hey subheading text, that's a good idea! Groupshared memory (aka LDS) is a way in compute shaders to store some data that is local to a thread group. We could just use regular memory, but LDS access is really fast! Here's the basic plan:

  1. Allocate enough LDS for each wave in the threadgroup to have WAVE_SIZE bits. So that's a uint2 on AMD and just a uint on Nvidia. The maximum size of a threadgroup in DX11 is 1024, so the maximum number of uints you're going to need is 1024 / 32 = 32. This works out to the same number of bits on AMD: WAVE_SIZE is larger, so fewer waves fit in 1024 threads. So either 16 uint2s or 32 uints. I'll use 32 uints for both in the example, with an additional index calculation for AMD.
  2. Use atomics to write out our predicate to the right bit in LDS. InterlockedOr() is the one we want.
  3. Read that value from LDS!
Here's the code:
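
A minimal sketch of those three steps, under a few assumptions: WAVE_SIZE is defined by the application per vendor, the result is returned as a uint2 on both vendors, and the buffers, GROUP_SIZE, and the distance test in CSMain are hypothetical stand-ins. It also leans on lanes in a wave running in lockstep, which is hardware behavior rather than anything DX11 promises.

```hlsl
// Assumption: the application defines WAVE_SIZE per vendor (32 on Nvidia, 64 on AMD).
#ifndef WAVE_SIZE
#define WAVE_SIZE 64
#endif

#define GROUP_SIZE     256                // hypothetical group size (DX11 max is 1024)
#define UINTS_PER_WAVE (WAVE_SIZE / 32)   // 1 on Nvidia, 2 on AMD

// Step 1: one bit per thread. 1024 threads / 32 bits = 32 uints covers any group size.
groupshared uint g_ballotBits[1024 / 32];

// Returns the wave's mask: .x holds lanes 0-31, .y holds lanes 32-63
// (.y stays zero when WAVE_SIZE is 32).
uint2 Ballot(bool predicate, uint waveId, uint laneId)
{
    uint first = waveId * UINTS_PER_WAVE;   // first uint owned by this wave
    uint slot  = first + laneId / 32;       // the extra index step for 64-wide waves
    uint bit   = 1u << (laneId % 32);

    // Groupshared memory starts out undefined, so one lane per uint clears it first.
    if (laneId % 32 == 0)
    {
        g_ballotBits[slot] = 0;
    }

    // Step 2: each lane ORs in its own bit when its predicate holds.
    if (predicate)
    {
        InterlockedOr(g_ballotBits[slot], bit);
    }

    // Step 3: every lane reads back the combined mask.
    uint2 mask = uint2(g_ballotBits[first], 0);
#if WAVE_SIZE == 64
    mask.y = g_ballotBits[first + 1];
#endif
    return mask;
}

// Hypothetical usage: each thread runs a test, lane 0 of each wave writes the mask.
StructuredBuffer<float>   g_distances : register(t0);
RWStructuredBuffer<uint2> g_hitMasks  : register(u0);

[numthreads(GROUP_SIZE, 1, 1)]
void CSMain(uint gtid : SV_GroupIndex, uint3 groupId : SV_GroupID,
            uint3 dtid : SV_DispatchThreadID)
{
    uint waveId = gtid / WAVE_SIZE;
    uint laneId = gtid % WAVE_SIZE;

    bool hit   = g_distances[dtid.x] < 1.0f;   // stand-in for a real intersection test
    uint2 mask = Ballot(hit, waveId, laneId);

    if (laneId == 0)
    {
        uint wavesPerGroup = GROUP_SIZE / WAVE_SIZE;
        g_hitMasks[groupId.x * wavesPerGroup + waveId] = mask;
    }
}
```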

Normally, we would need some kind of synchronization going on, but because we're carefully picking our accesses to be exactly WAVE_SIZE, we don't need to synchronize. Everything occurring within a wave is implicitly synchronized.

And lastly, to calculate the parameters waveId and laneId, you just need the group thread id.

waveId = gtid / WAVE_SIZE;
laneId = gtid % WAVE_SIZE;
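
In a compute shader, the flattened group thread id is available through the SV_GroupIndex semantic. Here's a small sketch, assuming waves are filled in SV_GroupIndex order (typical on real hardware, but not something DX11 guarantees); the buffer g_ids is hypothetical and only there so the shader has an output.

```hlsl
#define WAVE_SIZE 32   // assumption: 32 on Nvidia, 64 on AMD

// Hypothetical output buffer: .x = waveId, .y = laneId for each thread in the group.
RWStructuredBuffer<uint2> g_ids : register(u0);

[numthreads(8, 8, 1)]
void CSMain(uint gtid : SV_GroupIndex)   // flattened index: y * 8 + x for this group shape
{
    uint waveId = gtid / WAVE_SIZE;      // which wave within the thread group
    uint laneId = gtid % WAVE_SIZE;      // which thread within that wave
    g_ids[gtid] = uint2(waveId, laneId);
}
```

With this 8x8 group, the thread at (3, 5) has gtid 43, which lands in wave 1 as lane 11.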

And that's all there really is to it. It's definitely not going to be as fast as actually using the right intrinsics - and maybe some of you GPU driver wizards out there can tell me what HLSL code will compile to the right instructions to do this. It's not like, extremely dreadfully slower though - if the atomic gets optimized out, and you don't have any pathological bank conflicts - I imagine this could be 5x-10x slower than the native instruction. Which is pretty good for a super hack.

Hopefully, all this garbage can be avoided some time in the near future, when we have access to wave intrinsics from DXC. That would be sweet.

Hit me up on the tweeto @pyromuffin with your most tantalizing uses for Ballot() on DX11!


(oh, and my boss said I should probably plug the fact that we're hiring junior and senior gfx engineers at Sony. Here's the listing, and hit me up if you want to chat about it. (double note, this blog is not affiliated with Sony in any way)).
