Fragment Shader Interlock Performance (α Compositing)

Dec 29, 2020 OpenGL Vulkan ROVs Fragment Shader Interlock

TL;DR: In the tests I performed, using ordered fragment shader interlock for Multi-Layer Alpha Blending (MLAB) on NVIDIA hardware was 4% faster than using spinlocks. Furthermore, fragment shader interlock and ROVs can guarantee memory access ordering, while spinlocks can't. Using per-pixel linked lists for alpha compositing was significantly slower than MLAB with fragment shader interlock and has an unbounded memory requirement.

Introduction

Over the last decades, the ways in which GPUs can be used have evolved radically. In the olden days™, GPUs had specialized units for vertex and fragment processing. For multiple reasons, one being that different applications utilize these different types of units to different degrees and can thus run into a bottleneck, a shift happened towards unified units that perform both vertex and fragment processing. For a more in-depth view on the historical design of GPUs and the transition to the more modern architecture we know today, please let me refer you to Chajdas [2018]¹.

"This model – some ALUs bundled together, with some extra memory to allow communicating between sets of them – was exposed at the same time as a 'compute shader'."
– Chajdas [2018]¹

DirectX 11 and OpenGL 4.3 were the first versions of the respective graphics APIs that introduced functionality that was previously only available to compute APIs like NVIDIA's CUDA. Not only were developers now able to use compute shaders, but also scattered writes (in addition to gathered reads) through UAVs (DirectX), and SSBOs and image load/store operations (OpenGL). While gathered reads were already possible in previous versions of these APIs, scattered writes offer the possibility to write in a random-access style to GPU memory.

Previously, OpenGL enforced a coherent memory model². In one draw call, the application was not allowed to read from a resource that was written to in the same call, or else undefined behavior would be the result. Furthermore, writes to framebuffers could also not be scattered.

"Under normal circumstances, a coherent memory model is rigidly enforced by OpenGL. In general, if you write to an object, any command you issue later will see the new value. Because OpenGL behaves 'as if' all operations happened in a specific sequence, it is up to the OpenGL implementation to make sure that subsequent writes have occurred when you issue a read."
– https://www.khronos.org/opengl/wiki/Memory_Model

In the new memory model, when reading from and writing to UAVs or SSBOs, or when using image load/store operations, the application developer is now responsible for ensuring that no undefined behavior arises from race conditions when multiple shader invocations read from and write to the same memory locations. For this, there are different possibilities in unextended OpenGL.

In compute shaders, we can use execution barriers as synchronization primitives, which are, however, not available in vertex and fragment shaders.
We can use atomic operations to make sure no race conditions occur.
We can use spinlocks to make sure that no two shader invocations execute a critical section accessing the same memory locations at the same time (an explanation of spinlocks follows later in this post).
And finally, one could also opt to just use a formulation with multiple passes, and add memory barriers between these passes if necessary.

(As a side note, in OpenGL, adding the coherent qualifier to buffers/images and using memory barriers for flushing caches may be necessary, but we will just take correct handling of this by the application as given in the rest of the post.)
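To make the option of using atomic operations a bit more concrete, here is a minimal sketch of a fragment shader that counts the fragments falling into each viewport pixel with atomicAdd. The buffer name, the binding point and the viewportW uniform are assumptions made up for this illustration, and the host code is expected to clear the buffer to zero before the draw call.

```glsl
#version 450 core

// Hypothetical buffer; name and binding point are assumptions for this sketch.
layout(std430, binding = 0) coherent buffer FragmentCountBuffer {
    uint fragmentCounts[]; // one counter per viewport pixel, cleared to 0 by the host
};

uniform int viewportW; // viewport width in pixels (assumed to be set by the host)

layout(location = 0) out vec4 fragColor;

void main() {
    uint pixelIndex = uint(gl_FragCoord.x) + uint(viewportW) * uint(gl_FragCoord.y);

    // atomicAdd returns the value stored before the addition, so concurrent
    // invocations hitting the same pixel each observe a distinct value.
    uint previousCount = atomicAdd(fragmentCounts[pixelIndex], 1u);

    // Visualize how many fragments arrived before this one (arbitrary scale).
    fragColor = vec4(vec3(float(previousCount) / 16.0), 1.0);
}
```

Atomics are enough here because each invocation only needs a single indivisible read-modify-write on one value. As soon as several values per pixel have to be updated together, a critical section is needed instead.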

Order-Independent Transparency (OIT)

An area of computer graphics that has seen rapid development due to the advances in graphics hardware is Order-Independent Transparency (OIT). When using a blending technique like alpha compositing, we need to blend the generated fragments belonging to each viewport pixel either in front-to-back or back-to-front order. Sorting the geometry is not always feasible, and we might need to rely on sorting individual triangles, which is of course not very performant.
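Why the order matters can be seen from the compositing recurrence itself. The following is the standard "over" operator in its back-to-front and front-to-back form, followed by a small worked example; the symbols C (accumulated color), A (accumulated opacity), c_i and α_i (color and opacity of fragment i) are my own notation for this sketch.

```latex
% Back-to-front: blend fragment i over the color accumulated so far.
\[ C \leftarrow \alpha_i c_i + (1 - \alpha_i)\, C \]

% Front-to-back: equivalent form with accumulated opacity A (start with C = 0, A = 0).
\[ C \leftarrow C + (1 - A)\, \alpha_i c_i, \qquad A \leftarrow A + (1 - A)\, \alpha_i \]

% Worked example over a black background, both fragments with \alpha = 0.5:
% red c_r = (1, 0, 0), blue c_b = (0, 0, 1).
\[ \text{red in front of blue:}\quad 0.5\, c_r + 0.5\, (0.5\, c_b) = (0.5,\ 0,\ 0.25) \]
\[ \text{blue in front of red:}\quad 0.5\, c_b + 0.5\, (0.5\, c_r) = (0.25,\ 0,\ 0.5) \]
```

The two results differ, so blending the same two fragments in the wrong order gives a visibly different pixel color.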

Algorithms like Depth Peeling³ tried to solve this by utilizing multiple geometry passes and resolving one further layer of the image in each pass. Compared to newer approaches, this is potentially considerably slower, as we need to render the scene geometry as many times as the depth complexity of the scene. The depth complexity is defined as the maximum number of (transparent) fragments falling into one viewport pixel.

A short side note: Like Depth Peeling, Weighted Blended Order-Independent Transparency (WBOIT)⁴ also relies only on the classic memory model. However, I will ignore approaches like WBOIT here: while they might be sufficient for some simple graphical effects in video games, they rely on far-reaching approximation assumptions to simplify rendering.

With the advent of scattered writes in modern graphics APIs, new types of more performant techniques became feasible. One of these is per-pixel linked lists (sometimes also called per-pixel fragment lists), invented by Yang et al. [2010]⁵, a technique based on the concept of the A-buffer⁶. Yang et al. opted to use atomic operations to gather per-pixel linked lists of all fragments falling into one viewport pixel in a first geometry pass, and then to compute the final per-pixel color in a resolve pass where the per-pixel lists are sorted and the entries are then blended in the correct order. While per-pixel linked lists are much more performant than depth peeling, a major disadvantage is the unbounded memory requirement: we need to reserve a buffer that can hold all fragments generated in the geometry gather pass.
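The gather pass of such a per-pixel linked list could look roughly like the sketch below. It is not the code from Yang et al. or from my repository; all buffer names, binding points and uniforms are assumptions for illustration. The host is assumed to reset the atomic counter to 0 and to clear startOffset to 0xFFFFFFFF (the list terminator) before the geometry pass.

```glsl
#version 450 core

// Hypothetical node layout for this sketch.
struct LinkedListFragmentNode {
    uint color;  // RGBA8 color packed into 32 bits
    float depth; // fragment depth used for sorting in the resolve pass
    uint next;   // index of the next node, 0xFFFFFFFFu terminates the list
};

layout(std430, binding = 0) coherent buffer FragmentBuffer {
    LinkedListFragmentNode fragmentBuffer[]; // storage for the nodes of all pixels
};
layout(std430, binding = 1) coherent buffer StartOffsetBuffer {
    uint startOffset[]; // per-pixel index of the list head
};
layout(binding = 0, offset = 0) uniform atomic_uint fragCounter; // next free node

uniform int viewportW;                // viewport width in pixels
uniform uint fragmentBufferCapacity;  // number of nodes that fit into fragmentBuffer

layout(location = 0) in vec4 fragmentColor;

void main() {
    uint pixelIndex = uint(gl_FragCoord.x) + uint(viewportW) * uint(gl_FragCoord.y);

    // Reserve a node; atomicCounterIncrement returns the value before the increment.
    uint insertIndex = atomicCounterIncrement(fragCounter);

    // If the buffer is full, the fragment is silently dropped.
    if (insertIndex < fragmentBufferCapacity) {
        LinkedListFragmentNode node;
        node.color = packUnorm4x8(fragmentColor);
        node.depth = gl_FragCoord.z;
        // Make the new node the list head and link it to the previous head.
        node.next = atomicExchange(startOffset[pixelIndex], insertIndex);
        fragmentBuffer[insertIndex] = node;
    }
}
```

The capacity check is where the unbounded memory requirement shows up in practice: the buffer size has to be chosen up front, and any fragment beyond it is lost.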

In recent years, multiple techniques were invented that tried to solve the problem of the unbounded memory requirement and increase the rendering performance by using approximate methods. One example is Multi-Layer Alpha Blending (MLAB)⁷, which was invented by Salvi and Vaidyanathan in 2014 and made use of Intel's pixel synchronization GPU feature. MLAB stores the first k layers of fragments, and merges the furthest fragments if more than k fragments arrive during the geometry pass. Pixel synchronization also made its way into DirectX as rasterizer order views (ROVs) and into the vendor-neutral OpenGL and Vulkan extension fragment shader interlock. For an explanation of fragment shader interlock, please let me refer to the excerpt of the OpenGL extension specification below.
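The insert-and-merge step of MLAB can be sketched as follows. This is my own minimal formulation under stated assumptions, not the reference implementation of Salvi and Vaidyanathan: nodes are kept sorted front to back, colors are stored premultiplied by alpha, and empty slots are assumed to hold a very large depth, zero color and a transmittance of 1. The names MlabNode, NUM_LAYERS and insertAndMerge are hypothetical.

```glsl
#version 450 core

const int NUM_LAYERS = 8; // k, the number of explicitly stored layers (assumption)

struct MlabNode {
    vec3 premulColor;    // color already multiplied by alpha
    float transmittance; // 1 - alpha
    float depth;
};

// Inserts a fragment into the sorted node array and merges the two farthest
// nodes, so that the array always holds exactly NUM_LAYERS entries.
void insertAndMerge(inout MlabNode nodes[NUM_LAYERS], vec4 color, float depth) {
    MlabNode incoming = MlabNode(color.rgb * color.a, 1.0 - color.a, depth);

    // Single insertion pass: once the incoming node is closer than nodes[i],
    // it takes that slot and the displaced node is carried further back.
    for (int i = 0; i < NUM_LAYERS; i++) {
        if (incoming.depth < nodes[i].depth) {
            MlabNode displaced = nodes[i];
            nodes[i] = incoming;
            incoming = displaced;
        }
    }

    // 'incoming' now holds the farthest of the k + 1 nodes. Fold it into the
    // last stored node by compositing it behind that node.
    MlabNode last = nodes[NUM_LAYERS - 1];
    last.premulColor += last.transmittance * incoming.premulColor;
    last.transmittance *= incoming.transmittance;
    nodes[NUM_LAYERS - 1] = last;
}
```

With the sentinel values for empty slots, the merge is a no-op as long as fewer than k fragments have arrived, since the carried-out node then contributes zero color and a transmittance of 1. Because the merge is lossy, the result depends on the order in which fragments arrive, which is what makes the ordering guarantees discussed below relevant.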

"In unextended OpenGL 4.5, applications may produce a large number of fragment shader invocations that perform loads and stores to memory using image uniforms, atomic counter uniforms, buffer variables, or pointers. The order in which loads and stores to common addresses are performed by different fragment shader invocations is largely undefined. [...]
This extension provides new GLSL built-in functions beginInvocationInterlockARB() and endInvocationInterlockARB() that delimit a critical section of fragment shader code. For pairs of shader invocations with 'overlapping' coverage in a given pixel, the OpenGL implementation will guarantee that the critical section of the fragment shader will be executed for only one fragment at a time." [emphasis added]
– Grajewski [2015]⁸

Furthermore, application developers can specify whether they want to use ordered or unordered fragment shader interlock. Unordered fragment shader interlock just behaves like a per-pixel critical section, while ordered fragment shader interlock also ensures that the memory accesses performed in the critical section happen in the same order in which the geometry is submitted (e.g., the buffer location of the primitives within a draw call).
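In GLSL, the choice between the two modes is made with an input layout qualifier. Below is a minimal sketch of programmable blending with GL_ARB_fragment_shader_interlock; the image name, its binding point and its format are assumptions for illustration.

```glsl
#version 450 core
#extension GL_ARB_fragment_shader_interlock : require

// Ordered: critical sections of overlapping fragments run in primitive order.
// Use pixel_interlock_unordered for a plain per-pixel critical section.
layout(pixel_interlock_ordered) in;

// Hypothetical color target accessed with image load/store.
layout(binding = 0, rgba32f) coherent uniform image2D colorImage;

layout(location = 0) in vec4 fragmentColor;

void main() {
    ivec2 pixel = ivec2(gl_FragCoord.xy);

    // Per the extension specification, begin/end must be called from main()
    // and must not be placed inside flow control.
    beginInvocationInterlockARB();

    // Read-modify-write that would race without the critical section.
    vec4 dst = imageLoad(colorImage, pixel);
    vec3 rgb = fragmentColor.rgb * fragmentColor.a + dst.rgb * (1.0 - fragmentColor.a);
    float a = fragmentColor.a + dst.a * (1.0 - fragmentColor.a);
    imageStore(colorImage, pixel, vec4(rgb, a));

    endInvocationInterlockARB();
}
```

Since the blend above is not commutative, only the ordered variant gives results that match regular fixed-function blending with the same submission order.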

Defining per-pixel critical sections has a major advantage. Unlike per-pixel fragment lists, which rely only on atomic operations for synchronization, we can now directly operate on per-pixel data structures in the geometry pass. This makes approaches like MLAB possible, which need to operate on these data structures already during the geometry pass. Emulators are another major use case of fragment shader interlock, as they might need to implement custom blending functions not supported by the hardware the emulator is running on.

The OIT technique Moment-Based Order-Independent Transparency (MBOIT) by Münstermann et al. [2018]⁹ can also make use of ROVs/fragment shader interlock. MBOIT computes a set of stochastic moments of the absorbance along the view ray of each viewport pixel in a first geometry pass, and then blends all fragments using these moments in a second geometry pass. To compute the moments, it uses the classic memory model and additive blending with a framebuffer storing the moments as single-precision floating point values. However, it can also use less precision in order to save a lot of memory bandwidth. The authors of MBOIT proposed using a quantization scheme to compress the moments to 16 bits per moment. For this, however, ROVs/fragment shader interlock become necessary, as a read-modify-write operation must be used when updating the quantized moments. These operations wouldn't be possible with the classic memory model, or at least not without additional blending modes that would need to be supported by the hardware.

Performance of Fragment Shader Interlock

Moment-Based Order-Independent Transparency (MBOIT)

So ROVs/fragment shader interlock are obviously a really handy feature. But what about the performance? This is unfortunately a question that can't be answered in general, as it depends on the hardware, the drivers and the implemented algorithms.

First of all, let's look at the performance of MBOIT. In their paper, the authors of MBOIT found that not only was using quantized moments together with ROVs much faster than using single-precision moments without ROVs, but even when using no quantization, using ROVs for updating the single-precision moments was faster than writing to a framebuffer using the classic memory model.

"Surprisingly, the resulting frame times on our test hardware are actually shorter than those obtained with hardware-accelerated additive blending. We do not have a conclusive explanation for this phenomenon but note that a low overhead from using rasterizer ordered views is expected since the shader program is very short."
– Münstermann et al. [2018]⁹

Multi-Layer Alpha Blending (MLAB)

Similarly, I also wanted to test the performance implications of using fragment shader interlock for MLAB. Unlike MBOIT, where the updates of the stochastic moments are additive and thus (up to rounding differences when using quantization) fully commutative, the visual results of MLAB may vary depending on the order in which fragments are processed. Thus, in order to get coherence across frames, we need to use ordered fragment shader interlock. However, I also wanted to test whether there were any performance differences between ordered and unordered fragment shader interlock, and whether spinlocks, which can be used as a replacement for unordered fragment shader interlock (NOT ordered fragment shader interlock), perform better or worse than fragment shader interlock.

Below, you can find the spinlock code I used. The code is based on the work by Brüll [2018]¹⁰.

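// Skip helper invocations: their atomic operations do not modify memory,
// so they must not take part in the locking.
if (!gl_HelperInvocation) {
    bool keepWaiting = true;
    while (keepWaiting) {
        // Try to acquire the per-pixel lock (0 = free, 1 = taken).
        if (atomicCompSwap(spinlockViewportBuffer[pixelIndex], 0, 1) == 0) {
            gatherFragmentMLAB(fragmentColor, fragmentDepth);
            // Make the writes of the critical section visible before releasing the lock.
            memoryBarrier();
            atomicExchange(spinlockViewportBuffer[pixelIndex], 0);
            keepWaiting = false;
        }
    }
}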

I tested the performance when using no synchronization, spinlocks, unordered fragment shader interlock and ordered fragment shader interlock for a camera path around an aneurysm line data set by Byrne et al. [2013]¹¹. Below you can see images of the data set and the averaged frame times for the tested camera path. The tests were run at a resolution of 1920x1080 pixels on an NVIDIA RTX 2070 SUPER on Ubuntu 20.04 using OpenGL. In order to optimize the MLAB code, all SSBO buffer accesses use locality-preserving tiled access patterns (see https://github.com/chrismile/LineVis/blob/master/Data/Shaders/Utils/TiledAddress.glsl). For the tests, MLAB with 8 explicitly stored fragment layers was used.
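To illustrate what a locality-preserving tiled access pattern means, here is a minimal sketch of a tiled pixel index. It is not a copy of the linked TiledAddress.glsl; the tile size of 2x8 pixels, the function name and the viewportW uniform are assumptions chosen for this example.

```glsl
#version 450 core

// Hypothetical tile dimensions in pixels.
const uint TILE_W = 2u;
const uint TILE_H = 8u;

uniform uint viewportW; // viewport width in pixels (assumed to be set by the host)

// Maps a pixel position to a buffer index such that all pixels of one tile
// are stored next to each other in memory.
uint tiledPixelIndex(uint x, uint y) {
    uint tilesPerRow = (viewportW + TILE_W - 1u) / TILE_W; // round up
    uint tileIndex = (x / TILE_W) + (y / TILE_H) * tilesPerRow;
    uint indexInTile = (x % TILE_W) + (y % TILE_H) * TILE_W;
    return tileIndex * (TILE_W * TILE_H) + indexInTile;
}
```

With this mapping, the buffer has to be allocated for the viewport size rounded up to full tiles in both dimensions. The idea is that fragment shader invocations that run together tend to cover small screen-space blocks, so their per-pixel data ends up close together in memory instead of being spread over several rows.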

The complete code is available on GitHub: https://github.com/chrismile/LineVis

Figure 1: Opaque aneurysm streamlines. Vorticity is mapped to color.
Figure 2: Transparent aneurysm streamlines. Vorticity is mapped to color and opacity.

MLAB with spinlocks has a frame time that is about 4.3% higher than MLAB with ordered fragment shader interlock. Generally, only ordered fragment shader interlock gives fully correct results, but potential temporal incoherence caused by unordered fragment shader interlock or spinlocks was not perceptible in my tests. On the other hand, using no synchronization at all led to lots of flickering and artifacts.

In case you were wondering how Per-Pixel Linked Lists performed: The frame time when using the optimized sorting code based on priority queues tested by Kern et al. [2020]¹² is about 3.57 times higher than using MLAB with fragment shader interlock. The code can be found here: https://github.com/chrismile/LineVis/blob/master/Data/Shaders/Renderers/PPLL/LinkedListSort.glsl

Kern et al. [2020]¹² tested the performance of many OIT algorithms for rendering line data sets from scientific visualization. In their work, Per-Pixel Linked Lists proved to be considerably slower than MLAB and MBOIT using fragment shader interlock. For some test cases, the frame time of Per-Pixel Linked Lists was more than five times as high, while MLAB and MBOIT still provided results visually very close to the ground-truth. Depth peeling was so slow it was even excluded from their graphs.

Summary and Implications

When fragment shader interlock is not available on some hardware, falling back to other techniques could potentially cost us some performance and introduce temporal incoherence.

As we have seen above, using spinlocks unfortunately increases the frame time on NVIDIA hardware, and we still don't get the same ordering guarantees as from ordered fragment shader interlock and ROVs. While this might not matter for the order-independent moment updates of MBOIT, it is a no-go criterion for emulators and can potentially lead to some temporal incoherence for MLAB (even though it may be hardly noticeable). At the same time, Per-Pixel Linked Lists are considerably slower and take an unbounded amount of memory.

I would love to also test whether fragment shader interlock comes out as the performance winner on AMD hardware as well, but unfortunately AMD does not support the fragment shader interlock extension for OpenGL and Vulkan at the moment. On DirectX, AMD supports ROVs. I'm not very inclined to invest the time to port my program to Direct3D right now, but that could change in the future.

Videos

Synchronization

No Synchronization

1. Matthäus Chajdas. 2018. Introduction to compute shaders. https://anteru.net/blog/2018/intro-to-compute-shaders/
2. Khronos OpenGL Wiki. Memory Model. https://www.khronos.org/opengl/wiki/Memory_Model
3. Cass Everitt. 2001. Order-Independent Transparency. http://developer.download.nvidia.com/assets/gamedev/docs/OrderIndependentTransparency.pdf
4. Morgan McGuire and Louis Bavoil. 2013. Weighted Blended Order-Independent Transparency. Journal of Computer Graphics Techniques (JCGT) 2, 2 (18 December 2013), 122–141. http://jcgt.org/published/0002/02/09/
5. Jason C. Yang, Justin Hensley, Holger Grün, and Nicolas Thibieroz. 2010. Real-time Concurrent Linked List Construction on the GPU. In Proceedings of the 21st Eurographics Conference on Rendering (EGSR’10). Eurographics Association, Aire-la-Ville, Switzerland, 1297–1304. https://doi.org/10.1111/j.1467-8659.2010.01725.x
6. Loren Carpenter. 1984. The A-buffer, an Antialiased Hidden Surface Method. In Computer Graphics. 103–108.
7. Marco Salvi and Karthik Vaidyanathan. 2014. Multi-layer Alpha Blending. In Proceedings of the 18th Meeting of the ACM SIGGRAPH Symposium on Interactive 3D Graphics and Games (I3D ’14). ACM, New York, NY, USA, 151–158. https://doi.org/10.1145/2556700.2556705
8. Slawomir Grajewski. 2015. GL_ARB_fragment_shader_interlock. https://www.khronos.org/registry/OpenGL/extensions/ARB/ARB_fragment_shader_interlock.txt
9. Cedrick Münstermann, Stefan Krumpen, Reinhard Klein, and Christoph Peters. 2018. Moment-Based Order-Independent Transparency. Proc. ACM Comput. Graph. Interact. Tech. 1, 1, Article 7 (July 2018), 20 pages. https://doi.org/10.1145/3203206
10. Felix Brüll. 2018. Order-Independent Transparency Acceleration. DOI: 10.13140/RG.2.2.17568.84485. https://www2.in.tu-clausthal.de/~cgstore/theses/ba_felix_br%C3%BCll_2018.pdf
11. Greg Byrne, Fernando Mut, and J Cebral. 2013. Quantifying the Large-Scale Hemodynamics of Intracranial Aneurysms. AJNR. American journal of neuroradiology 35 (08 2013). https://doi.org/10.3174/ajnr.A3678
12. M. Kern, C. Neuhauser, T. Maack, M. Han, W. Usher, and R. Westermann. 2020. A Comparison of Rendering Techniques for 3D Line Sets with Transparency. IEEE Transactions on Visualization and Computer Graphics (2020), 1–1.