Khronos Releases Maximal Reconvergence and Quad Control Extensions for Vulkan and SPIR-V


The SPIR™ Working Group has developed two new SPIR-V extensions (and corresponding Vulkan® extensions) to provide shader authors with more guarantees about the execution model of shaders. These extensions formalize behavior many authors have previously taken for granted, so that they can now be relied upon across the ecosystem.

Maximal Reconvergence

To write portable, correct code, shader authors often need to understand which invocations are executing certain instructions concurrently. The most common cases for this requirement are when using subgroup operations or when an instruction requires dynamically uniform values (e.g. when accessing a buffer or texture resource without having enabled the non-uniform descriptor indexing feature). Shader authors rely on their intuition about which invocations should be executing together. Unfortunately, empirical testing shows that these behaviors are not portable across devices.

The SPIR Working Group introduces SPV_KHR_maximal_reconvergence to strengthen the guarantees provided by an implementation. This work is a successor to subgroup uniform control flow, released in June 2020, and provides stronger guarantees more in line with programmer intuition. Maximal reconvergence does not rely on any particular scope instance (e.g. subgroup) to be uniform in order to provide reconvergence guarantees. Instead, reconvergence happens wherever a shader author might reasonably expect it. Shader authors should be able to infer the state of divergence of invocations by examining the structure of the high-level language.

Let’s look at a simple example:
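
Consider a hedged GLSL-style sketch of the control flow under discussion (the condition and variable names are illustrative; Blocks A and B refer to the comments below):

uint x = ...;
if (non_uniform_cond1) {
  // Block A
  uint a = subgroupAdd(x);
  if (non_uniform_cond2) {
    // Divergence between Blocks A and B.
    x += 1u;
  }
  // Block B: do these instructions execute with the same set of
  // invocations as Block A?
  uint b = subgroupAdd(x);
}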

In unextended SPIR-V, it is not guaranteed that the instructions in Block B will be executed with the same concurrency as those in Block A, because SPIR-V only requires reconvergence if control flow was uniform at the beginning of a divergence. Since Block A is only reached after a non-uniform conditional branch, that precondition is unlikely to hold. With subgroup uniform control flow, the concurrency of instructions in Block B would be guaranteed to match that of Block A if, and only if, non_uniform_cond1 was subgroup uniform. Maximal reconvergence provides the strongest guarantee: implementations must ensure that the concurrency of instruction execution in Block B includes all of the invocations that executed Block A and reach Block B, regardless of the uniformity of control flow before Block A or between Blocks A and B.

This extension specifies a limited number of places where divergence (e.g. branches and discards), reconvergence (e.g. between loop iterations or after an if statement), and non-reconvergence (e.g. invocations in different loop iterations will not execute concurrently) can occur.
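
As a hedged illustration of those rules (the names here are illustrative, not taken from the extension text):

uint count = my_non_uniform_count;
uint total = 0u;
for (uint i = 0u; i < count; ++i) {
  // Non-reconvergence: invocations in different iterations of this loop
  // never execute its instructions concurrently.
  if (data[i] > threshold) {
    // Divergence: only some invocations take this branch.
    total += data[i];
  }
  // Reconvergence: invocations that entered this iteration together
  // reconverge here, after the if statement.
}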

Shader authors have previously simply assumed this behavior in some cases, and most of the time it has largely worked out - unspecified behavior does not mean it will not do what you want. However, failure cases have been hard to debug, as shaders are not obviously incorrect when this occurs. This extension formalizes the expected behavior of an implementation under a variety of these conditions, and that behavior is rigorously tested, so shader authors will be able to rely on their intuition about subgroup operations going forward.

We recognize that extension rollouts can take a significant amount of time throughout the ecosystem. In order to avoid needing to duplicate your shader corpus to use maximal reconvergence, we have implemented a transformation in the SPIRV-Tools optimizer that can add/remove the MaximallyReconvergesKHR execution mode to/from every entry point in a module for your convenience.
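
An invocation along these lines should do the job (a hedged sketch; check spirv-opt --help for the exact flag spelling in your SPIRV-Tools version):

# Add the MaximallyReconvergesKHR execution mode to every entry point.
spirv-opt --modify-maximal-reconvergence=add shader.spv -o shader.out.spv
# Or remove it again.
spirv-opt --modify-maximal-reconvergence=remove shader.spv -o shader.out.spv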

Example: Atomic Compaction

Let’s examine the example we used to explain subgroup uniform control flow again. Here is where we ended up:

// Free should be initialized to 0.
layout(set=0, binding=0) buffer BUFFER { uint free; uint data[]; } b;
void main() [[subgroup_uniform_control_flow]] {
  bool needs_space = false;
  // ... (subgroup converged by the end of this block)
  if (subgroupAny(needs_space)) {
    // Assume full subgroups weren't specified, so
    // calculate the actual subgroup size.
    uvec4 mask = subgroupBallot(needs_space);
    uint size = subgroupBallotBitCount(mask);
    uint base = 0;
    if (subgroupElect()) {
      // "free" tracks the next free slot for writes.
      // The first invocation in the subgroup allocates space
      // for each invocation in the subgroup that requires it.
      base = atomicAdd(b.free, size);
    }

    // Broadcast the base index to other invocations in the subgroup.
    base = subgroupBroadcastFirst(base);
    // Calculate the offset from "base" for each invocation.
    uint offset = subgroupBallotExclusiveBitCount(mask);

    if (needs_space) {
      // Write the data in the allocated slot for each invocation that
      // requested space.
      b.data[base + offset] = ...;
    }
  }
  // ...
}

When using only the subgroup uniform control flow feature, the example requires that the subgroup converge before the first if condition to guarantee correct behavior.

With maximal reconvergence, however, shader authors no longer have to ensure a fully converged subgroup, nor keep all the invocations in a subgroup executing in a uniform fashion. In fact, with maximal reconvergence, we can go back to the original code sample and implementations will ensure it “just works”.

// Free should be initialized to 0.
layout(set=0, binding=0) buffer BUFFER { uint free; uint data[]; } b;
void main() [[maximally_reconverges]] {
  bool needs_space = false;
  // ...
  if (needs_space) {
    // If the subgroup is non-uniform, don't rely on gl_SubgroupSize.
    uvec4 mask = subgroupBallot(needs_space);
    uint size = subgroupBallotBitCount(mask);
    uint base = 0;
    if (subgroupElect()) {
      // "free" tracks the next free slot for writes.
      // The first invocation in the subgroup allocates space
      // for each invocation in the subgroup that requires it.
      base = atomicAdd(b.free, size);
    }

    // Broadcast the base index to other invocations in the subgroup.
    base = subgroupBroadcastFirst(base);
    // Calculate the offset from "base" for each invocation.
    uint offset = subgroupBallotExclusiveBitCount(mask);

    // Write the data in the allocated slot for each invocation that
    // requested space.
    b.data[base + offset] = ...;
  }
  // ...
}

We’ve come full circle and can now guarantee the behavior of the original code across all implementations that enable the extension!

Example: Peeling Loop

Using a peeling loop in a high-level language is a common technique that developers rely on. They structure their code to produce subgroup-uniform values, but, frustratingly, drivers can make valid transformations that break subgroup uniformity.

Let’s take a look at a simple peeling loop example:

uint my_value = ...;
for (;;) {
  uint value = subgroupBroadcastFirst(my_value);
  if (my_value == value) {
    // Perform subgroup operations under the assumption
    // that only invocations with matching values will participate.
    uint out_value = subgroupExclusiveAdd(...);
    break;
  }
}

This example looks straightforward in GLSL. Since the shader author uses a subgroup broadcast to filter which invocations participate in the subgroup addition, there must be no problem here, right? In reality, implementations are free to move the code from inside the if statement to outside the loop, and that transformation is common in many compilers (e.g. LLVM). At that point, all the invocations will have reconverged and out_value will no longer be calculated correctly. Depending on the implementation, this might be observed as a bug in the program. This behavior exists in the wild because it has never been required to work any other way and, until now, has been untested. Developers have had to work around it in creative ways.
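
For illustration, here is a hedged sketch of what such a hoisting transformation could produce (not the output of any particular compiler):

uint my_value = ...;
uint value;
for (;;) {
  value = subgroupBroadcastFirst(my_value);
  if (my_value == value) {
    break;
  }
}
// Hoisted out of the loop: the exclusive add now executes after all
// invocations have reconverged, not just those with matching values,
// so out_value is no longer calculated as the author intended.
uint out_value = subgroupExclusiveAdd(...);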

With the extension, implementations must perform the subgroup addition such that the result is calculated only including the correct invocations. The extension is strict about when invocations can and cannot converge based on the shader source code.

Caveats

There are a few corner cases where convergence remains unintuitive, which developers should look out for:

  • Invocations with different selector values in a switch statement are not guaranteed to converge, even if they take the same code path, until after the switch statement concludes; this affects default labels and fall-through cases (see the sketch after this list).
  • In cases where all remaining invocations in a quad are helpers, implementations may terminate the entire quad; maximal reconvergence cannot be used to require that they remain live.
  • Ray tracing repack instructions cannot be used with maximal reconvergence; handling the complexity of shuffling invocations between groups is left for future work.
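
For example, in this hedged sketch (the selector and variable names are illustrative), both cases share a body via fall-through, yet their invocations are not guaranteed to converge inside it:

switch (non_uniform_selector) {
  case 0: // falls through to case 1
  case 1:
    // Invocations that arrive via case 0 and via case 1 take the same
    // code path, but are not guaranteed to be converged with each other
    // until after the switch statement concludes.
    result = subgroupAdd(contribution);
    break;
  default:
    break;
}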

Quad Control

When sampling from an image, it is desirable to sample the mip-level(s) of an image that most closely match the sampling intervals used by the shader. While applications can specify this interval directly, the sampling intervals, or derivatives, are typically calculated automatically based on the sample positions of neighboring fragments (the quad).

Implicit derivative calculation is both convenient for developers and efficient in hardware, but it relies on neighboring fragments all executing the same sampling instruction - if an expected neighbor does not sample the image, the mip-level selection is undefined. Compounding the problem, the specific neighbors used to calculate the derivatives were previously undefined - implementations could, for instance, calculate the derivatives once per primitive to optimize fragment shaders. The only way to mitigate this was to ensure sampling was always performed in uniform control flow, so all invocations of a fragment shader had to read data even if the values would later be discarded, adding significant bandwidth to fragment shading.

Notably, because calculating implicit derivatives with missing neighbors only results in undefined level selection rather than undefined behavior, the problem can often go unnoticed, manifesting as poor level-of-detail selection and potentially high cache pressure.

The quad control extension provides two key pieces of functionality to alleviate this problem: it guarantees that derivative calculations are performed by the current invocation and its direct neighbors in a quad via the QuadDerivativesKHR execution mode, and it provides functions to determine whether any invocation in that quad requires a sample. Code that previously had to choose between correctness and performance can now have both, by sampling exactly and only when necessary for correct results. For example, code that previously looked like this (correct but slow):

// Samples on every invocation of the fragment shader
vec4 color = texture(mySampledImage, uv);

if (dynamic_condition) {
  outColor = color;
}

or conversely this (incorrect and potentially also slow):

if (dynamic_condition) {
  // Undefined mip-level selection
  outColor = texture(mySampledImage, uv);
}

or maybe using explicit derivatives (correct but awkward and somewhat slow):

// Explicit derivative calculation to avoid undefined mip-level selection
vec2 grad_x = dFdx(uv);
vec2 grad_y = dFdy(uv);

if (dynamic_condition) {
  outColor = textureGrad(mySampledImage, uv, grad_x, grad_y);
}

can all now be rewritten to this correct and fast option:

if (quadAny(dynamic_condition)) {
  // Samples only if the quad needs the sample
  vec4 color = texture(mySampledImage, uv);

  if (dynamic_condition) {
    outColor = color;
  }
}

Interactions

Converged Quads

Making good use of shader quad control leans heavily on maximal reconvergence too, which is why it's in the same blog post - without maximal reconvergence, it's entirely possible that two invocations in the same quad are not converged even when using the quad any/all functions, resulting in undefined behavior despite a developer's best efforts. All implementations supporting VK_KHR_shader_quad_control must support VK_KHR_shader_maximal_reconvergence as well to ensure this functionality is available seamlessly for developers.

Helper Invocations

Helper invocations in fragment shaders have typically been a source of confusing behavior; they execute behind the scenes to enable implicit derivative calculation when rasterization would otherwise not provide an invocation, but they cannot write results, and they can be terminated automatically once the operations accessing them have ended. Historically, implementations have differed in whether or not helper invocations participate in subgroup operations, and even when they did, implementations were free to terminate those invocations while subgroup operations remained to be executed. The MaximallyReconvergesKHR execution mode introduced by SPV_KHR_maximal_reconvergence requires helper invocations to participate in subgroup operations when present, and the RequireFullQuadsKHR execution mode in SPV_KHR_quad_control further requires that helper invocations be spawned even if no explicit quad operations are used, enabling better portability of shaders across implementations.

Helper invocations can still be terminated implicitly if all invocations in a quad are demoted to helpers, or explicitly terminated by the shader, but termination can no longer happen without such explicit terminations or demotions.

High Level Language Support

HLSL

Using this functionality in HLSL is intended to be tied directly to HLSL Shader Model 6.7; use of QuadAny and QuadAll will be mapped to the new quad operations in SPIR-V and will implicitly require the MaximallyReconvergesKHR, QuadDerivativesKHR, and RequireFullQuadsKHR execution modes. Specifying the WaveOpsIncludeHelperLanes attribute in an HLSL shader will also require the MaximallyReconvergesKHR execution mode, providing the guarantees that attribute expects. Additional controls for the new execution modes are also being considered for future proposals.

It will also be possible to enable the MaximallyReconvergesKHR execution mode directly in older shader models.

GLSL

This functionality is exposed directly in GLSL via two new GLSL extensions: GL_EXT_maximal_reconvergence and GLSL_EXT_shader_quad, with the direct mappings spelled out in those two extensions.
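
As a hedged sketch (the exact directives and qualifiers are defined by the extension specifications), opting in from GLSL is expected to look like this:

#version 460
#extension GL_EXT_maximal_reconvergence : require
// Assumption: enabling the extension applies the MaximallyReconvergesKHR
// execution mode to the shader's entry point; see the extension text.
void main() {
  // ...
}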

Conclusion

Maximal reconvergence and quad control represent a foundational step forward in reliable, intuitive shader behavior across implementations. This portability is generally achievable at no (or very low) performance cost.