Why is this conditional in my fragment shader so slow?

16

2

I have set up some FPS-measuring code in WebGL (based on this SO answer) and have discovered some oddities with the performance of my fragment shader. The code just renders a single quad (or rather two triangles) over a 1024x1024 canvas, so all the magic happens in the fragment shader.

Consider this simple shader (GLSL; the vertex shader is just a pass-through):

// some definitions

void main() {
    float seed = uSeed;
    float x = vPos.x;
    float y = vPos.y;

    float value = 1.0;

    // Nothing to see here...

    gl_FragColor = vec4(value, value, value, 1.0);
}

So this just renders a white canvas. It averages around 30 fps on my machine.

Now let's ramp up the number crunching and compute each fragment based on a few octaves of position-dependent noise:

void main() {
    float seed = uSeed;
    float x = vPos.x;
    float y = vPos.y;

    float value = 1.0;

    float noise;
    for (int j = 0; j < 10; ++j)
    {
        noise = 0.0;
        for (int i = 4; i > 0; i--)
        {
            float oct = pow(2.0, float(i));
            noise += snoise(vec2(mod(seed, 13.0) + x*oct, mod(seed*seed, 11.0) + y*oct)) / oct * 4.0;
        }
    }

    value = noise / 2.0 + 0.5;

    gl_FragColor = vec4(value, value, value, 1.0);
}

If you want to run the above code, I've been using this implementation of snoise.

This brings down the fps to something like 7. That makes sense.

Now the weird part... let's compute only one of every 16 fragments as noise and leave the others white, by wrapping the noise computation in the following conditional:

if (int(mod(x*512.0, 4.0)) == 0 && int(mod(y*512.0, 4.0)) == 0) {
    // same noise computation
}

You'd expect this to be much faster, but it's still only 7 fps.

For one more test, let's instead filter the pixels with the following conditional:

if (x > 0.5 && y > 0.5) {
    // same noise computation
}

This gives the exact same number of noise pixels as before, but now we're back up to almost 30 fps.

What is going on here? Shouldn't the two ways of selecting a 16th of the pixels cost the exact same number of cycles? And why is the slower one as slow as rendering *all* pixels as noise?

Bonus question: What can I do about this? Is there any way to work around the horrible performance if I actually do want to speckle my canvas with only a few expensive fragments?

(Just to be sure, I have confirmed that the actual modulo computation does not affect the frame rate at all, by rendering every 16th pixel black instead of white.)

Martin Ender

Posted 2015-08-15T21:28:31.733

Reputation: 1 164

Answers

18

Fragments get grouped into small tiles (how big depends on the hardware) and are processed together in a single SIMD pipeline (structure-of-arrays-style SIMD).

This group of threads (the name depends on the vendor: NVIDIA calls them warps, AMD calls them wavefronts) executes each operation for all of its fragments in lockstep. That means that if even one fragment in the group needs a computation done, all fragments in the group compute it, and the ones that don't need the result throw it away.

Only if all fragments in a group follow the same path through the shader can the untaken branch be skipped entirely.

This means that your first method, computing every 16th pixel, is the worst case for branching: the expensive pixels are spread out so that every group contains at least one of them, and therefore every group ends up running the noise computation.

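A rough way to see why the scattered conditional doesn't help is to model the lockstep execution directly. In this sketch (JavaScript, to match the WebGL setting), the 8x4 tile size and the exact selection predicates are illustrative assumptions; real warp/wavefront shapes vary by hardware:

```javascript
// Model SIMD divergence: a tile of fragments executes in lockstep, so the
// whole tile pays for the expensive branch if ANY fragment in it takes it.
// Tile size 8x4 is an assumption; actual warp/wavefront shapes differ.
const W = 1024, H = 1024, TW = 8, TH = 4;

// Fraction of tiles that contain at least one pixel taking the branch.
function expensiveTileFraction(takesBranch) {
  let expensive = 0, total = 0;
  for (let ty = 0; ty < H; ty += TH) {
    for (let tx = 0; tx < W; tx += TW) {
      total++;
      scan:
      for (let y = ty; y < ty + TH; y++) {
        for (let x = tx; x < tx + TW; x++) {
          if (takesBranch(x, y)) { expensive++; break scan; }
        }
      }
    }
  }
  return expensive / total;
}

// "Every 16th pixel" condition: the expensive pixels are scattered so that
// every tile contains at least one of them.
const scattered = expensiveTileFraction((x, y) => x % 4 === 0 && y % 4 === 0);
// Quadrant condition: only tiles inside one quadrant are expensive.
const quadrant = expensiveTileFraction((x, y) => x >= 512 && y >= 512);

console.log(scattered, quadrant); // scattered = 1, quadrant = 0.25
```

Under this model the scattered condition makes every warp run the noise loop (fraction 1), while the quadrant condition leaves three quarters of the warps on the cheap path, which lines up with the observed frame rates.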
If you still want to compute the noise at a lower effective resolution, render it to a smaller texture and then upscale it.
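Some rough bookkeeping shows why the small render target wins (the 256x256 size is an assumed example matching the 1-in-16 ratio):

```javascript
// Rendering the noise into a small texture does the same number of expensive
// fragment invocations as "every 16th pixel" at full resolution, but with no
// divergence: every warp of the small render is fully occupied.
const fullRes = 1024 * 1024;   // fragments in the full-size canvas
const sparse  = fullRes / 16;  // expensive fragments with the mod-condition
const lowRes  = 256 * 256;     // fragments when rendering to a 256x256 texture

console.log(sparse === lowRes); // prints true: same amount of noise work,
// but at full resolution every warp still takes the expensive path, while
// the small render launches 1/16 as many warps, all doing useful work.
```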

ratchet freak

Posted 2015-08-15T21:28:31.733

Reputation: 4 695

5 – Rendering to a smaller texture and upsampling is a good way to do it. But if for some reason you really need to write to every 16th pixel of the large texture, using a compute shader with one invocation for every 16th pixel plus image load/store to scatter the writes into the render target could be a good option. – Nathan Reed – 2015-08-16T00:28:39.233