Cartoon shader improvements

And now for the reply:

Np :slight_smile:

Sometimes limitations like this can be worked around. For example, many older GPUs (mine included) do not support variable-length for loops (the Cg compiler wants to unroll loops, and cannot do so if the end condition depends on a variable).

If you have a shader generator, and the loop’s end condition is a variable only because it depends on a configuration parameter (which stays constant while the shader is running), you can make the shader generator hardcode the value from its configuration when it writes the shader source. If you’re coding your application in Python, Shader.make() (from pandac.PandaModules) comes in useful for compiling shaders generated at runtime; see CommonFilters.py for usage examples. Of course, doing this adds another layer of logic.
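For example, here is a minimal sketch of that idea in Python, assuming the old-style pandac imports; the template, the NUM_SAMPLES parameter, the tex_0 input and the blur-style loop are made-up placeholders for illustration, not code from Panda:

from pandac.PandaModules import Shader

# Hypothetical configuration parameter; constant while the shader runs.
NUM_SAMPLES = 8

# The Cg source is generated with the loop bound baked in as a literal, so the
# compiler can unroll the loop even on older profiles that reject a variable
# end condition.
SHADER_TEMPLATE = """//Cg
void vshader(float4 vtx_position : POSITION,
             float2 vtx_texcoord0 : TEXCOORD0,
             uniform float4x4 mat_modelproj,
             out float4 l_position : POSITION,
             out float2 l_texcoord0 : TEXCOORD0)
{
    l_position = mul(mat_modelproj, vtx_position);
    l_texcoord0 = vtx_texcoord0;
}

void fshader(float2 l_texcoord0 : TEXCOORD0,
             uniform sampler2D tex_0 : TEXUNIT0,
             out float4 o_color : COLOR)
{
    float4 accum = float4(0.0, 0.0, 0.0, 0.0);
    // The loop bound below is a literal by the time the Cg compiler sees it.
    for (int i = 0; i < %(numsamples)d; i++) {
        accum += tex2D(tex_0, l_texcoord0 + float2(i * 0.001, 0.0));
    }
    o_color = accum / float(%(numsamples)d);
}
"""

# Substitute the configuration value into the source, then compile at runtime.
shader_text = SHADER_TEMPLATE % {"numsamples": NUM_SAMPLES}
my_shader = Shader.make(shader_text)
# nodepath.setShader(my_shader)  # then apply it as usual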

Also, keep in mind that error messages from Cg can sometimes be misleading. I ran into the variable-length for loop problem while trying to figure out why Panda’s SSAO wouldn’t run on my GPU. I was debugging entirely the wrong thing until rdb stepped in, said he had seen a similar situation before, and pointed out that the problem was likely in the variable-length loop, not in the array it indexes into (even though the error message suggested that the problem was in the array indexing).

(SSAO is fixed in 1.9.0, using the approach mentioned above.)

Just after saying that, I did some testing this evening and found that the following two alternatives run at the same speed on my GPU:

if(samples > CUTOFF)
  o_color = lerp(o_color, k_targetcolor, (samples - CUTOFF) / (NUMSAMPLES - CUTOFF));

vs. the branch-free alternative

float f = step(CUTOFF, samples);
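// f is 1.0 when samples >= CUTOFF and 0.0 otherwise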
o_color = (1.0 - f)*o_color
        + f*lerp(o_color, k_targetcolor, (samples - CUTOFF) / (NUMSAMPLES - CUTOFF));

but on the other hand, this was the only if statement in the shader. When I later complicated this to

if(samples1 > CUTOFF)
  o_color = lerp(o_color, k_targetcolor, (samples1 - CUTOFF) / (NUMSAMPLES - CUTOFF));
else if(samples2 > CUTOFF)
  o_color = lerp(o_color, k_targetcolor, (samples2 - CUTOFF) / (NUMSAMPLES - CUTOFF));

vs. the branch-free equivalent

float f1 = step(CUTOFF, samples1);
float f2 = step(CUTOFF, samples2);
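// f1 takes priority over f2, mirroring the if/else if above;
// if neither test passes, o_color is left unchanged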
o_color = (1.0 - max(f1,f2))*o_color
        + f1*lerp(o_color, k_targetcolor, (samples1 - CUTOFF) / (NUMSAMPLES - CUTOFF))
        + (1.0 - f1)*f2*lerp(o_color, k_targetcolor, (samples2 - CUTOFF) / (NUMSAMPLES - CUTOFF));

the alternatives still ran at the same speed. Of course, this test is hardly conclusive; the texture lookups in the supersampler are probably taking so much time that a single if statement (or two) has a negligible effect on the total time taken by this particular shader. But that’s also a useful piece of information: branching is not always a total performance killer.

I also observed that the Cg compiler, at least as invoked by Panda, seems to optimize the code. That is of course the sensible thing to do; what is not clear a priori is whether any given compiler has an optimizer at all and, if so, what kinds of optimizations it applies.

The optimizer seems pretty advanced: it appears to perform some kind of dependency analysis and omits code that does not affect the output. (I noticed this while attempting a rudimentary kind of manual profiling of the shader, disabling parts of it to see what was taking the most time.)

Namely, even if the shader code analyzes both the normal and depth textures, there is absolutely no speed impact as long as it does not use the result (filling o_color with a constant value instead). The expected performance hit from the texture lookups appears immediately once the result of the calculation is used in computing o_color. I also tried disabling the if statement and the lerp, using just samples / NUMSAMPLES to set the red component of o_color and constant values for the other components; the result was the same.
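To make that test concrete, here is a hedged sketch of the kind of variants one can compare; the sampling loop, tex_0 and the hardcoded 16 (standing in for NUMSAMPLES) are simplified placeholders, not the actual shader. Both variants perform the same texture lookups in the source, but only the second makes o_color depend on them, so only the second pays for them:

# Two tails dropped into an otherwise identical, hypothetical fragment shader.
FSHADER_TEMPLATE = """//Cg
void fshader(float2 l_texcoord0 : TEXCOORD0,
             uniform sampler2D tex_0 : TEXUNIT0,
             out float4 o_color : COLOR)
{
    float samples = 0.0;
    // These lookups are what the optimizer may discard.
    for (int i = 0; i < 16; i++) {
        samples += tex2D(tex_0, l_texcoord0 + float2(i * 0.001, 0.0)).r;
    }
%(tail)s
}
"""

# Variant 1: the output ignores the lookups, so they are optimized away and
# there is no speed impact at all.
constant_tail = "    o_color = float4(1.0, 0.0, 0.0, 1.0);"

# Variant 2: the output depends on the lookups (red channel only); the full
# cost of the lookups appears immediately.
dependent_tail = "    o_color = float4(samples / 16.0, 0.0, 0.0, 1.0);"

variant_constant = FSHADER_TEMPLATE % {"tail": constant_tail}
variant_dependent = FSHADER_TEMPLATE % {"tail": dependent_tail}

Each variant would then be compiled and applied in the same way as in the earlier sketch, and the frame rates compared.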

In conclusion, it might be good to test your particular case using both the nested-if and branch-free approaches, if the branch-free version is not too complicated to write.