Shader flow control performance

I’ve been googling this problem for a while and I don’t know what is true anymore.

Let’s say I have a fragment shader like this:

//pseudocode
norm = genNormal(tex0);
tang = genTangent(norm);
bnorm = genBinormal(norm);
color = mixTextures(tex1, tex2, tex3, tex4, tex5, tex6, tex7, tex8, tex9);
norm = mixNormalMaps(norm, tang, bnorm, tex9, tex10, tex11, tex12, tex13, tex14, tex15, tex16, tex17, tex18);
color = doLight(gl_lightThing, norm, color);
gl_FragColor = mix(color, fog_color, fog_factor);

The shader does a lot of texture lookups and some math, and at the end mixes it all with a constant fog color based on a ‘fog_factor’ calculated in the vertex shader.

Now, if fog_factor is 1.0, all that math and all those texture lookups contribute nothing. Would it make sense to rewrite the shader like this:

//pseudocode
if(fog_factor > 0.996)
{
    gl_FragColor = fog_color;
}
else
{
    norm = genNormal(tex0);
    tang = genTangent(norm);
    bnorm = genBinormal(norm);
    color = mixTextures(tex1, tex2, tex3, tex4, tex5, tex6, tex7, tex8, tex9);
    norm = mixNormalMaps(norm, tang, bnorm, tex9, tex10, tex11, tex12, tex13, tex14, tex15, tex16, tex17, tex18);
    color = doLight(gl_lightThing, norm, color);
    gl_FragColor = mix(color, fog_color, fog_factor);
}

Let’s say the top 1/3 of the screen has a fog_factor of 1.0. Will the GPU ‘cores’ for (some of?) these pixels execute both branches, or just the ‘lite’ version?

I hope I’m understanding you right, but no core will run both branches under any circumstances. What happens is that, depending on the make and caliber of the GPU, it breaks the task down into batches composed of a number of threads that run simultaneously (each thread running on its own core). The entire batch is constrained by how long the “longest” thread takes to finish, so if there are 100 threads in a batch, the batch can only render its results once the final thread in the batch finishes its computations.

For an extremely simplified example, let’s say your GPU has 1024 cores, so it works in batches composed of 1024 threads each (not really how it works, but just for this example), which comes out to 32x32-pixel chunks per batch. Only batches that contain zero foggy pixels gain any advantage from the “lite” branch; any batch containing one or more pixels with fog will run no faster than the “longer” branch, since all the threads processing fog-free pixels sit idle until the ones with fog finish. Only then does the entire batch render its results.

If the scene really is as clear-cut as having fog only in the top 1/3, then yes, you would gain an advantage by including such a branch in your shader. But if the fog is going to be scattered and chaotic, I’m not sure you would gain much, if anything, because all it would take is one pixel per batch with even a tiny amount of fog and the whole batch would run at the rate of the slower calculation.
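
A toy sketch of the distinction (my own example, not your actual shader): whether a branch can split a batch at all depends on whether its condition can differ between pixels.

//hedged sketch: two kinds of branch condition
uniform bool fog_on;        //one value for the whole draw call
varying float fog_factor;   //interpolated, can differ per pixel

void main()
{
    vec4 color = vec4(0.0);
    if(fog_on)
    {
        //uniform condition: every pixel in every batch agrees,
        //so each batch only ever runs one path
        color = vec4(1.0);
    }
    if(fog_factor > 0.996)
    {
        //per-pixel condition: pixels inside one batch can disagree,
        //and a mixed batch pays for both paths
        color = vec4(0.5);
    }
    gl_FragColor = color;
}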

Researching how compute shaders work might help clarify this, since there the programmer has explicit control over the batch size (aka workgroup, aka wavefront) and the mechanism is explained in greater detail.
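
For instance, here is a minimal GLSL compute shader sketch (the image binding and the 8x8 size are just assumptions for illustration); the layout declaration is where you pick the batch size yourself:

#version 430
layout(local_size_x = 8, local_size_y = 8) in;          //8x8 = 64 threads per workgroup
layout(rgba8, binding = 0) uniform image2D out_image;   //hypothetical output image
void main()
{
    ivec2 pixel = ivec2(gl_GlobalInvocationID.xy);      //the pixel this thread handles
    imageStore(out_image, pixel, vec4(0.0, 0.0, 0.0, 1.0));
}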

Thanks, that helps a lot.
Do you by chance have any links to docs on how big the chunks are in a real-world scenario?

Generally the GPU will try to run the largest batches possible, so a rough way to picture a batch of pixels is a square whose side length is the square root of the number of cores in the GPU. Keep in mind that most shader stages work on vertices, so this only applies to the fragment shader. So in your situation the GPU is working against making the branching efficient: by looking for the largest batch size, it increases the chance of including a pixel with fog.
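
If your driver exposes the GL_KHR_shader_subgroup_basic extension, you can even read the hardware batch size from inside a shader. A hedged sketch (the output variable name is made up; typical values are 32 on NVIDIA and 32 or 64 on AMD):

#version 450
#extension GL_KHR_shader_subgroup_basic : require
out vec4 frag_color;    //hypothetical output
void main()
{
    //gl_SubgroupSize = number of threads running in lockstep
    frag_color = vec4(float(gl_SubgroupSize) / 64.0, 0.0, 0.0, 1.0);
}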

But another complication is that modern GPUs are fairly adept at filling idle core cycles with other tasks, so they may be able to mask any inefficiencies introduced by branching in your shader. For a fairly dense but thorough overview of the problem of branching in SIMD (single instruction, multiple data) architectures, see:

gdiamos.net/papers/caches-paper.pdf

Even with the (limited) branching optimizations of modern GPUs, it is still likely to yield better results to compile two different shaders, one with fog and one without, and let your fogless models render first and your fogged models second. (Panda does this type of sorting automatically.)
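
One common way to get the two variants out of a single source, sketched below with made-up uniform names, is to compile the same file twice, once with FOG defined (e.g. by prepending “#define FOG” after the #version line):

//hedged sketch: one source, two compiled variants
#version 120
uniform vec4 fog_color;
varying float fog_factor;   //computed in the vertex shader
void main()
{
    vec4 color = vec4(1.0); //texturing + lighting would go here, as before
#ifdef FOG
    gl_FragColor = mix(color, fog_color, fog_factor);
#else
    gl_FragColor = color;   //the fogless variant skips the mix entirely
#endif
}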

I had this in mind for a terrain shader; there is only one model, the terrain plane.
For normal models I could just use the LOD system, even with the same mesh but a simpler shader.
Anyway, the fog part is trivial (mix in a uniform color), and I’d say the fog is 5-10% of the shader, so I’m not making a version with and without fog; I have a ‘fog only’ shader and a ‘fog+generate_normals+lights+texture_splatting+normal_maps+gloss_maps+god_knows_what_else’ shader. One runs at 60 fps, the other at 600 fps (hand-waved numbers).

So if I have 1/3 of the screen running a 600 fps shader branch and 2/3 running a 60 fps branch, I should get a final framerate of 1/3 × 600 + 2/3 × 60 = 240 fps :smiley:

I know it’s not how things work, but if there is some performance to gain then it’s worth it.
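
(If you sum frame times instead: 2/3 × 1/60 s + 1/3 × 1/600 s ≈ 11.7 ms, so more like 85 fps, and that still ignores the batching issue above.)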