Accelerating Occlusion Rendering on a GPU via Ray Classification

Vasco Costa (INESC-ID/Instituto Superior Técnico, University of Lisbon, Lisbon, Portugal), João Madeiras Pereira (INESC-ID/Instituto Superior Técnico, University of Lisbon, Lisbon, Portugal) and Joaquim A. Jorge (INESC-ID/Instituto Superior Técnico, University of Lisbon, Lisbon, Portugal)
DOI: 10.4018/IJCICG.2015070101

Accurately rendering occlusions is required when ray tracing objects to achieve more realistic scenes. Indeed, soft phenomena such as shadows and ambient occlusion can be rendered with stochastic ray tracing techniques. However, computing randomized, incoherent ray-object intersections can be inefficient. This is problematic in Graphics Processing Unit (GPU) applications, where thread divergence can significantly lower throughput. The authors show how this issue can be mitigated using classification techniques that sort rays according to their spatial characteristics. Still, classifying the rays used to compute occlusion terms requires sorting millions of rays. This cost is offset by savings in rendering time, which result from a more coherent ray distribution. The authors survey and test different ray classification techniques to identify the most effective one. The best results were achieved when sorting rays with a compress-sort-decompress approach using 32-bit hash keys.


Accurately rendering occlusions is required when ray tracing objects to achieve more realistic scenes, as illustrated in Figure 1 and Figure 2. However, stochastic ray tracing of soft shading phenomena such as shadows and ambient occlusion, pioneered by Cook et al. (1984), requires multisampled sets of randomized rays. These secondary rays do not readily allow for optimization techniques that take advantage of locality of reference, as is the case with primary rays.

Figure 1.

House, Town, Voyager, Mini Cooper scenes rendered using shadows (top) and ambient occlusion (bottom)

Figure 2.

Venus, Lamborghini, Bubs scenes rendered with shadows (top) and ambient occlusion (bottom)

This brings about problems in highly pipelined architectures, such as modern GPUs, which exact a large cost for incoherent secondary-ray memory accesses. In this work we examine the performance of different ray classification techniques for computing soft shadows and ambient occlusion via multi-sample ray-casting techniques.

When determining the shadow term for a given point on a surface, we need to cast N random shadow-feeler samples towards the light sources, which are represented geometrically by an area rather than a point. The ambient occlusion term, for a given point on a surface, can be computed by casting random ray samples across the hemisphere centered on that point around the surface normal. These random ray samples can be jittered to reduce artifacts (Cook 1986) with the aid of a two-dimensional quasi-random number sequence (e.g. Sobol, Halton).
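As a concrete illustration, the hemisphere sampling described above can be sketched as follows. This is a minimal CPU-side sketch in Python, not the authors' GPU implementation: it assumes a local shading frame with the surface normal along +Z, and uses a Halton sequence (bases 2 and 3) as the quasi-random generator.

```python
import math

def halton(i, base):
    """Radical-inverse (van der Corput) sequence value in the given base."""
    f, r = 1.0, 0.0
    while i > 0:
        f /= base
        r += f * (i % base)
        i //= base
    return r

def cosine_hemisphere_sample(u1, u2):
    """Map a 2D quasi-random point (u1, u2) to a cosine-weighted
    direction on the hemisphere around +Z (the surface normal in
    a local shading frame)."""
    r = math.sqrt(u1)
    phi = 2.0 * math.pi * u2
    return (r * math.cos(phi),
            r * math.sin(phi),
            math.sqrt(max(0.0, 1.0 - u1)))

# N ambient-occlusion sample directions for one shading point.
N = 16
samples = [cosine_hemisphere_sample(halton(i + 1, 2), halton(i + 1, 3))
           for i in range(N)]
```

Each direction would then be transformed into world space and a shadow ray cast along it; the occlusion term is the fraction of samples that reach the light (or the environment) unblocked.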

These calculations can be quite costly. For example, to compute either the shadow or ambient occlusion terms for a 1024x1024 image with 16 samples per pixel, we need to fire over 16 million rays. To achieve high-performance rendering, this compute-intensive task requires high-performance GPUs. However, performance gains are severely hindered because the GPU utilization rate suffers from thread divergence. This arises when different sections of code execute on separate GPU cores, causing some to stall, partially negating the benefits of an architecture conceived to execute the same code over many different data elements. Indeed, not all rays have the same length, and most traverse different regions of space, giving rise to different code sections being executed across processing cores simultaneously. To mitigate this, we sort rays using different classification techniques.

Ray classification techniques work by reordering rays to maximize their spatial coherence. This ensures low thread divergence during the rendering pass and thus improves parallel performance. This is of particular relevance to GPU architectures with in-order designs and simple branch prediction hardware that are ill-suited to executing different code sections simultaneously. A simple and expedient way to achieve this classification is to bundle together rays with the same origin. However, this does not guarantee that those rays will not diverge significantly further along their path. Hence, we need to take these different directions into account when we perform ray classification. As seen above, computing occlusion terms for a megapixel image with 16 samples per pixel requires classifying and sorting 16+ million rays. Such processing can take tens or hundreds of milliseconds, even on a GPU platform, depending on the sorting algorithm used and on the desired classification accuracy.
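One way to realize such a classification is to quantize each ray's origin (relative to the scene bounds) and direction into a short integer key, then sort rays by that key so that coherent rays land next to each other. The sketch below is illustrative rather than the paper's exact scheme: the function name `ray_hash32` and the bit split (4 bits per component, 24 bits total, fitting in a 32-bit key) are our assumptions.

```python
def ray_hash32(origin, direction, scene_min, scene_extent,
               bits_pos=4, bits_dir=4):
    """Quantize a ray into an integer key: 3*bits_pos bits encode the
    origin cell within the scene bounding box, and 3*bits_dir bits
    encode the (quantized) direction. With the defaults this yields a
    24-bit key, well within a 32-bit word."""
    key = 0
    for o, mn, ext in zip(origin, scene_min, scene_extent):
        c = min(max((o - mn) / ext, 0.0), 0.999999)    # origin -> [0, 1)
        key = (key << bits_pos) | int(c * (1 << bits_pos))
    for dc in direction:
        c = min(max((dc + 1.0) * 0.5, 0.0), 0.999999)  # dir [-1, 1] -> [0, 1)
        key = (key << bits_dir) | int(c * (1 << bits_dir))
    return key
```

Rays with nearby origins and similar directions hash to the same key, so sorting the ray list by key bundles spatially coherent rays together; coarser bit widths trade classification accuracy for sorting speed.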

Classifying 6D rays, with three single-precision floating-point components each for origin and direction, requires sorting 192-bit keys. However, bundling rays does not require an accurate sort. Hence it is possible to perform this task at lower precision, using shorter keys (e.g. 32 bits) produced by a hashing scheme. This speeds up the sorting process considerably. Further increases in sorting speed can be achieved via a compress-sort-decompress scheme that treats rays with the same hash value as a single unit. This decreases the size of the list to be sorted, further increasing performance.
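The compress-sort-decompress idea can be sketched as follows. This hash-bucket version is a deliberate simplification (a GPU implementation would instead use parallel primitives such as prefix-sum compaction and radix sort), and the function name is ours, not the paper's.

```python
from collections import defaultdict

def compress_sort_decompress(keys):
    """Group ray indices by hash key (compress), sort only the much
    shorter list of distinct keys, then expand the groups back into a
    full reordering of the ray list (decompress)."""
    buckets = defaultdict(list)
    for i, k in enumerate(keys):
        buckets[k].append(i)        # compress: one bucket per distinct key
    order = []
    for k in sorted(buckets):       # sort distinct keys only
        order.extend(buckets[k])    # decompress: emit grouped ray indices
    return order

# Example: five rays whose hash keys collide pairwise.
print(compress_sort_decompress([3, 1, 3, 2, 1]))  # -> [1, 4, 3, 0, 2]
```

Because many of the millions of occlusion rays share a hash value, the list of distinct keys is far shorter than the ray list itself, which is where the speedup over a direct per-ray sort comes from.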
