When is a compute shader more efficient than a pixel shader for image filtering?


Image filtering operations such as blurs, SSAO, bloom and so forth are usually done using pixel shaders and "gather" operations, where each pixel shader invocation issues a number of texture fetches to access the neighboring pixel values, and computes a single pixel's worth of the result. This approach has a theoretical inefficiency in that many redundant fetches are done: nearby shader invocations will re-fetch many of the same texels.
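To make the gather-style approach concrete, here is a minimal sketch of a hypothetical 5-tap horizontal box blur as a fragment shader (all identifiers and the uniform layout are illustrative, not from any particular engine):

```glsl
#version 430
uniform sampler2D uSource;
uniform vec2 uTexelSize;   // 1.0 / texture dimensions
in vec2 vUV;
out vec4 fragColor;

void main()
{
    // Each invocation fetches all 5 taps itself; a neighboring
    // invocation one pixel to the right re-fetches 4 of the same texels.
    vec4 sum = vec4(0.0);
    for (int i = -2; i <= 2; ++i)
        sum += texture(uSource, vUV + vec2(float(i), 0.0) * uTexelSize);
    fragColor = sum / 5.0;
}
```

The redundancy grows with the kernel radius: for a radius-r 1D kernel, each texel is fetched roughly 2r + 1 times across neighboring invocations, modulo what the texture cache absorbs.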

Another way to do it is with compute shaders. These have the potential advantage of being able to share a small amount of memory across a group of shader invocations. For instance, you could have each invocation fetch one texel and store it in shared memory, then calculate the results from there. This might or might not be faster.

The question is under what circumstances (if ever) is the compute-shader method actually faster than the pixel-shader method? Does it depend on the size of the kernel, what kind of filtering operation it is, etc.? Clearly the answer will vary from one model of GPU to another, but I'm interested in hearing if there are any general trends.

Nathan Reed

Posted 2015-08-05T09:30:30.980

Reputation: 15 036

I think the answer is "always" if the compute shader is done properly. This is not trivial to achieve. A compute shader is also a better match than a pixel shader conceptually for image processing algorithms. A pixel shader however provides less leeway with which to write poorly performing filters. – bernie – 2017-11-20T18:32:48.447

@bernie Can you clarify what's needed for the compute shader to be "done properly"? Maybe write an answer? Always good to get more perspectives on the subject. :) – Nathan Reed – 2017-11-20T18:36:11.990

Now look at what you made me do! :) – bernie – 2017-11-20T21:01:12.340

In addition to sharing work across threads, ability to use async compute is one big reason to use compute shaders. – JarkkoL – 2017-11-25T17:57:45.973

Answers


An architectural advantage of compute shaders for image processing is that they skip the ROP step. It's very likely that writes from pixel shaders go through all the regular blending hardware even if you don't use it. Generally speaking compute shaders go through a different (and often more direct) path to memory, so you may avoid a bottleneck that you would otherwise have. I've heard of fairly sizable performance wins attributed to this.

An architectural disadvantage of compute shaders is that the GPU no longer knows which work items retire to which pixels. If you are using the pixel shading pipeline, the GPU has the opportunity to pack work into a warp/wavefront that writes to an area of the render target that is contiguous in memory (which may be Z-order tiled or something like that for performance reasons). If you are using a compute pipeline, the GPU may no longer kick off work in optimal batches, leading to more bandwidth use.

You may be able to turn that altered warp/wavefront packing into an advantage again, though, if you know that your particular operation has a substructure that you can exploit by packing related work into the same thread group. Like you said, you could in theory give the sampling hardware a break by sampling one value per lane and putting the result in groupshared memory for other lanes to access without sampling. Whether this is a win depends on how expensive your groupshared memory is: if it's cheaper than the lowest-level texture cache, then this may be a win, but there's no guarantee of that. GPUs already deal pretty well with highly local texture fetches (by necessity).

If you have intermediate stages in the operation where you want to share results, it may make more sense to use groupshared memory (since you can't fall back on the texture sampling hardware without having actually written your intermediate result out to memory). Unfortunately you also can't depend on having results from any other thread group, so the second stage would have to limit itself to only what is available in the same tile. I think the canonical example here is computing the average luminance of the screen for auto-exposure. I could also imagine combining texture upsampling with some other operation (since upsampling, unlike downsampling and blurs, doesn't depend on any values outside a given tile).
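The average-luminance example can be sketched as a per-tile reduction in groupshared memory. This is only an illustrative GLSL sketch (work-group size, bindings, and all identifiers are my assumptions; it also assumes the image dimensions are multiples of 16):

```glsl
#version 430
layout(local_size_x = 16, local_size_y = 16) in;
layout(binding = 0) uniform sampler2D uScene;
layout(binding = 0, r32f) uniform writeonly image2D uTileAverages;

shared float sLum[256];

void main()
{
    ivec2 coord = ivec2(gl_GlobalInvocationID.xy);
    vec3 c = texelFetch(uScene, coord, 0).rgb;
    uint idx = gl_LocalInvocationIndex;
    sLum[idx] = dot(c, vec3(0.2126, 0.7152, 0.0722));

    // Parallel reduction within the tile; each step halves the number
    // of active lanes and needs a barrier between steps.
    for (uint stride = 128u; stride > 0u; stride >>= 1u)
    {
        barrier();
        if (idx < stride)
            sLum[idx] += sLum[idx + stride];
    }

    if (idx == 0u)
        imageStore(uTileAverages, ivec2(gl_WorkGroupID.xy),
                   vec4(sLum[0] / 256.0));
}
```

Because no thread group can see another group's result, `uTileAverages` holds one value per tile; a second, much smaller dispatch (or a repeated reduction) collapses it to a single screen-wide average.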

John Calsbeek

Posted 2015-08-05T09:30:30.980

Reputation: 2 927

I seriously doubt the ROP adds any performance overhead if blending is disabled. – GroverManheim – 2016-05-20T03:33:03.143

@GroverManheim Depends on the architecture! The output merger/ROP step also has to deal with ordering guarantees even if blending is disabled. With a full-screen triangle there aren't any actual ordering hazards, but the hardware may not know that. There might be special fast paths in hardware, but knowing for certain that you qualify for them… – John Calsbeek – 2016-05-20T06:05:40.787


John has already written a great answer so consider this answer an extension of his.

I'm currently working a lot with compute shaders for different algorithms. In general, I've found that compute shaders can be much faster than their equivalent pixel shader or transform feedback based alternatives.

Once you wrap your head around how compute shaders work, they also make a lot more sense in many cases. Using pixel shaders to filter an image requires setting up a framebuffer, sending vertices, using multiple shader stages, etc. Why should any of that be required to filter an image? In my opinion, being used to rendering full-screen quads for image processing is the only "valid" reason to continue using them. I'm convinced that a newcomer to the computer graphics field would find compute shaders a much more natural fit for image processing than rendering to textures.

Your question refers to image filtering in particular, so I won't elaborate too much on other topics. In some of our tests, just setting up a transform feedback or switching framebuffer objects to render to a texture could incur a performance cost of around 0.2 ms. Keep in mind that this excludes any rendering! In one case we ported the exact same algorithm to compute shaders and saw a noticeable performance increase.

When using compute shaders, more of the silicon on the GPU can be used to do the actual work. All these additional steps are required when using the pixel shader route:

  • Vertex assembly (reading the vertex attributes, vertex divisors, type conversion, expanding them to vec4, etc.)
  • The vertex shader needs to be scheduled no matter how minimal it is
  • The rasterizer has to compute a list of pixels to shade and interpolate the vertex outputs (probably only texture coords for image processing)
  • All the different states (depth test, alpha test, scissor, blending) have to be set and managed

You could argue that all the previously mentioned performance advantages could be negated by a smart driver. You would be right. Such a driver could identify that you're rendering a full-screen quad without depth testing, etc., and configure a "fast path" that skips all the useless work done to support pixel shaders. I wouldn't be surprised if some drivers do this to accelerate the post-processing passes in some AAA games for their specific GPUs. You can of course forget about any such treatment if you're not working on an AAA game.

What the driver can't do however is find better parallelism opportunities offered by the compute shader pipeline. Take the classic example of a gaussian filter. Using compute shaders, you can do something like this (separating the filter or not):

  1. For each work group, divide the sampling of the source image across the work group size and store the results to group shared memory.
  2. Compute the filter output using the sample results stored in shared memory.
  3. Write to the output texture.
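Those three steps could be sketched in GLSL roughly as follows. This is a hypothetical horizontal pass of a separated filter; the work-group size, `uWeights`, and all other identifiers are illustrative assumptions, not a reference implementation:

```glsl
#version 430
layout(local_size_x = 64) in;
layout(binding = 0) uniform sampler2D uSource;
layout(binding = 0, rgba8) uniform writeonly image2D uDest;

const int RADIUS = 4;
uniform float uWeights[2 * RADIUS + 1];  // normalized Gaussian weights

// 64 output pixels plus a RADIUS-wide apron on each side.
shared vec4 sTile[64 + 2 * RADIUS];

void main()
{
    ivec2 base = ivec2(gl_WorkGroupID.x * 64u, gl_GlobalInvocationID.y);
    int lane = int(gl_LocalInvocationID.x);
    int maxX = textureSize(uSource, 0).x - 1;

    // Step 1: cooperatively load the tile; each texel is fetched once.
    for (int i = lane; i < 64 + 2 * RADIUS; i += 64)
        sTile[i] = texelFetch(uSource,
            ivec2(clamp(base.x + i - RADIUS, 0, maxX), base.y), 0);
    barrier();  // all loads must finish before anyone reads sTile

    // Steps 2 and 3: filter from shared memory and write out.
    vec4 sum = vec4(0.0);
    for (int i = -RADIUS; i <= RADIUS; ++i)
        sum += uWeights[i + RADIUS] * sTile[lane + RADIUS + i];
    imageStore(uDest, ivec2(base.x + lane, base.y), sum);
}
```

Note the `barrier()` between the load and the filter loop: without it, a lane could read tile entries its neighbors haven't written yet.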

Step 1 is the key here. In the pixel shader version, the source image is sampled multiple times per pixel. In the compute shader version, each source texel is read only once inside a work group. Texture reads usually use a tile-based cache, but this cache is still much slower than shared memory.

The gaussian filter is one of the simpler examples. Other filtering algorithms offer other opportunities to share intermediary results inside work groups using shared memory.

There is however a catch. Compute shaders require explicit memory barriers to synchronize their output. There are also fewer safeguards to protect against errant memory accesses. For programmers with good parallel programming knowledge, compute shaders offer much more flexibility. This flexibility however means that it is also easier to treat compute shaders like ordinary C++ code and write slow or incorrect code.


bernie

Posted 2015-08-05T09:30:30.980

Reputation: 331


I stumbled on this blog: Compute Shader Optimizations for AMD

Given the tricks that can be done in compute shaders (and that are specific to compute shaders), I was curious whether parallel reduction in a compute shader was faster than in a pixel shader. I emailed the author, Wolfgang Engel, to ask if he had tried a pixel shader version. He replied that he had, and that back when he wrote the blog post the compute shader version was substantially faster than the pixel shader version. He also added that today the differences are even bigger. So apparently there are cases where using a compute shader can be a great advantage.

maxest

Posted 2015-08-05T09:30:30.980

Reputation: 26