The New OpenGL Features in Panda3D 1.9

We’ve been working hard over the past few months to update the OpenGL renderer and bring support for the latest and greatest features that OpenGL has to offer. We’ve not been very good at updating the blog, though, so we decided to make a post highlighting some of those features and how they are implemented in Panda3D. These features will be part of the upcoming Panda3D 1.9.0 release, which should come out within the following month, assuming that everything goes according to plan.

sRGB support (linear pipeline)

Virtually all lighting and blending calculations are written under the assumption that they happen in a linear space, meaning that multiplying a color value by x results in a color value that is x times as bright. However, what is often overlooked by game developers is the fact that the average monitor isn’t linear. CRT monitors have a gamma of around 2.2, meaning that the output luminosity is proportional to the input voltage raised to the power of 2.2. This means that a pixel value of 0.5 brightness isn’t actually half as bright as one of 1.0 brightness, but only around 0.22 times as bright! To compensate for this, content is produced in the sRGB color space, which has a built-in gamma correction of around 1/2.2. Modern monitors and digital cameras are calibrated to use that standard, so that no gamma correction is typically needed to display images in an image viewer or in the browser.

However, this presents an issue for 3D engines that perform lighting and blending calculations. Because both the input and output color spaces are non-linear, a light that is supposed to attenuate colors to 0.5x brightness will actually make them show up at around 0.22x brightness, more than twice as dark as they should be! This results in dark areas appearing too dark, and the transition between dark and bright areas will look very unnatural. To have a proper linear lighting pipeline, we have to correct for the gamma on both the input and the output side: we have to convert our input textures to linear space, and we have to convert the rendered output back to sRGB space.

It’s very easy for developers to overlook or dismiss this issue as unimportant, because it doesn’t really affect unlit textures; they look roughly the same because the two wrongs cancel each other out. Developers simply tweak the lighting values to compensate for the incorrect light ramps until it looks acceptable. However, until you properly address gamma correction, your lighting will look wrong: the transition between light and dark will look unnatural, and people may see banding artifacts around specular highlights. This matters all the more when using techniques like physically based rendering, where it is important that the lights behave as they would in real life.

The screenshots below show a scene rendered in Panda3D using physically-based rendering with and without gamma correction. Note how the left image looks far too dark whereas the right image has a far more natural-looking balance of lighting. Click to enlarge.

[Screenshots: ColorSpace-Uncorrected and ColorSpace-Corrected]

Fortunately, we now have support for a range of hardware features that can correct for all of these issues automatically, for free. There are two parts to this: sRGB framebuffers and sRGB textures. Enabling the former tells OpenGL that the framebuffer is in the sRGB color space; this lets us do all of the lighting calculations in linear space, with the result being gamma-adjusted before it is displayed on the monitor. However, just doing that would cause your textures to look way too bright, since they are already gamma-corrected! Therefore, you can set your textures to the sRGB format to indicate to OpenGL that they are in the sRGB color space, and that they should therefore automatically be converted to linear space when they are sampled.
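
To illustrate, here is a minimal sketch of what enabling both halves might look like from Python, assuming the framebuffer-srgb configuration variable and the FSrgbAlpha texture format ("diffuse.png" is a hypothetical texture):

    from panda3d.core import loadPrcFileData, Texture
    from direct.showbase.ShowBase import ShowBase

    # Ask for an sRGB-capable framebuffer before the window opens, so the
    # rendered output is converted back to sRGB for the monitor.
    loadPrcFileData("", "framebuffer-srgb true")

    base = ShowBase()

    # Mark a color texture as sRGB so it is linearized when sampled.
    tex = loader.loadTexture("diffuse.png")
    tex.setFormat(Texture.FSrgbAlpha)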

The nice thing is that all of these operations are virtually free, because they are nowadays implemented in hardware. These features have existed for a long time, and you can rely on the vast majority of modern graphics hardware to correctly implement sRGB support. We’ve added support to the Direct3D 9 renderer as well, and even to our software renderer! However, keep in mind that you can’t always rely on a monitor being calibrated to 2.2 gamma, so it is always best to offer a settings screen that allows the user to calibrate the application’s gamma. We’ve added a special post-processing filter that applies an additional gamma correction to help with that.

To read more about color spaces in the upcoming Panda3D version, check out this manual page, although the details are still subject to change.

Tessellation shaders

Tessellation shaders are a way to let the GPU take a base mesh and subdivide it into more densely tessellated geometry. It is possible to use a simple base mesh and subdivide it to immensely detailed proportions with a tessellation shader, unburdened by the narrow bandwidth between the GPU and the CPU. Their programmable nature allows for continuous and highly dynamic level of detail without popping artifacts.

To enable tessellation, you have to provide two new types of shader: the tessellation control shader, which determines how finely to subdivide each patch, and the tessellation evaluation shader, which specifies what to do with the resulting tessellated vertices. (They are respectively called a “hull shader” and a “domain shader” in Direct3D parlance.) They are used with a new type of primitive, GeomPatches, which can contain any number of vertices. There are helpful methods for automatically converting existing geometry into patches.
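
As a rough sketch of the Python side, assuming hypothetical shader filenames and a previously loaded NodePath called model (Shader.load takes the two new stages as extra parameters, and makePatchesInPlace is one of the conversion helpers):

    from panda3d.core import Shader

    shader = Shader.load(Shader.SL_GLSL,
                         vertex="terrain.vert",
                         tess_control="terrain.tesc",
                         tess_evaluation="terrain.tese",
                         fragment="terrain.frag")
    model.setShader(shader)

    # Convert the model's existing triangles into GeomPatches, since the
    # tessellation stages only accept patch primitives.
    geom_node = model.find("**/+GeomNode").node()
    for i in range(geom_node.getNumGeoms()):
        geom_node.modifyGeom(i).makePatchesInPlace()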

One immediately obvious application is LOD-based terrain or water rendering. Only a small number of patches needs to be uploaded to the GPU, after which a tessellation control shader can subdivide the patches by an amount that is calculated based on the distance to the camera. In the tessellation evaluation shader, the desired height can be calculated either based on a height map texture or based on procedural algorithms, or a combination thereof.

Another application is displacement mapping, where an existing mesh is subdivided on the GPU and a displacement map is used to displace the vertices of the subdivided mesh. This allows for showing highly detailed meshes with dynamic level of detail even when the actual base mesh is very low-poly. Panda3D exposes methods that can be used for converting existing triangle geometry to patches to make it easier to apply this technique. Alternatively, this can be done by way of a new primitive type supported by the egg loader.

Both methods are demonstrated in the screenshots below. Click to enlarge.

[Screenshots: Displacement Mapping using Tessellation Shaders and Tessellated Terrain]

Thanks to David Rose for implementing this feature! Support for tessellation shaders is available in the development builds.

Compute shaders

Besides the new tessellation shaders mentioned earlier, Panda3D supports vertex shaders, geometry shaders, and fragment shaders. All of these shaders are designed to perform a very particular task in the rendering pipeline, and as such work on a specific set of data, such as the vertices in a geometry mesh or pixels in a framebuffer.

However, because each shader is designed to do a very specific task, it can be difficult to write shaders to do things that the graphics card manufacturer didn’t plan for. Sometimes one might want to implement a fancy ray tracing algorithm or an erosion simulation, or simply make a small modification to a texture on the GPU. These things may require code to be invoked on the GPU at will, and be able to operate on something other than the vertices in a mesh or fragments in a framebuffer.

Enter compute shaders: a type of shader program that is general purpose and can perform a wide variety of tasks on the video card. Somewhat comparable to OpenCL programs, they can be invoked at any point during the rendering process, operating on a completely user-defined set of inputs. Their flexibility allows them to perform a lot of the tasks one might be used to implementing on the CPU or via a render-to-texture buffer. Compute shaders are particularly interesting for parallelizable tasks like physics simulations, global illumination computation, and tiled rendering; but also for simpler tasks like generating a procedural texture or otherwise modifying the contents of a texture on the GPU.

It is worth mentioning that a lot of these tasks require another GLSL feature that we’ve added: ARB_image_load_store. This means that it’s not only possible to sample textures in a shader, but also to perform direct read and write operations on a texture image. This is particularly useful for compute shaders, but the feature can be used in any type of shader. That means that you can now write to textures from a regular fragment or vertex shader as well.
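
Putting the two together, dispatching a compute shader that writes to a texture might look roughly like this from Python (a sketch; "generate.glsl" is a hypothetical shader that writes its result with imageStore):

    from panda3d.core import Shader, Texture, ComputeNode

    # Texture the compute shader will write to via image load/store.
    tex = Texture("output")
    tex.setup2dTexture(512, 512, Texture.TFloat, Texture.FRgba32)

    # A ComputeNode is inserted into the scene graph; when it is "drawn",
    # the given number of work groups is dispatched.
    node = ComputeNode("generator")
    node.addDispatch(512 // 16, 512 // 16, 1)

    np = render.attachNewNode(node)
    np.setShader(Shader.loadCompute(Shader.SL_GLSL, "generate.glsl"))
    np.setShaderInput("destTex", tex)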

If you’d like to get into the details, you can read this manual page. Or, if you’re feeling adventurous, you can always check out a development build of Panda3D and try it out for yourself.

In the following screenshots, compute shaders have been used to implement voxel cone tracing: a type of real-time ray tracing algorithm that provides global illumination and soft reflections. Click here to see a video of it in action in Panda3D.

[Screenshots: global-illumination and sponza-reflection]
The scene on the left is based on a model by Matthew Wong and the scene on the right makes use of the famous Sponza model by Crytek.

Render-to-texture features

We’ve made various enhancements to the render-to-texture functionality by making it possible to render directly to texture arrays, cube maps, and 3D textures. Previously, separate rendering passes were needed to render to each layer, but you can now bind the entire texture and render to it in one go. Using a geometry shader, one selects the layer of the texture to render to, which can be combined with geometry instancing to duplicate the geometry onto the different layers, depending on the desired effect.

One technique where this is immediately useful is cube map rendering (such as in point light shadows), where the geometry can be instanced across all six cube map faces in one rendering pass. This improves performance tremendously by eliminating the need to issue the rendering calls to the GPU six times.
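
For instance, binding an entire cube map for layered rendering might look like this sketch (the geometry shader that routes each triangle to a face by writing gl_Layer is not shown):

    from panda3d.core import Texture, GraphicsOutput

    cubemap = Texture("envmap")
    cubemap.setupCubeMap(512, Texture.TUnsignedByte, Texture.FRgba8)

    buf = base.win.makeTextureBuffer("envmap-buffer", 512, 512)
    buf.clearRenderTextures()
    # Bind all six faces at once as a layered framebuffer attachment.
    buf.addRenderTexture(cubemap, GraphicsOutput.RTMBindLayered,
                         GraphicsOutput.RTPColor)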

In the screenshots below, however, layered rendering is used to render the various components of an atmospheric scattering model into different layers of a 3D texture. Click to enlarge.

[Screenshots: atmospheric-scattering1 and atmospheric-scattering3]

It is now also possible to use viewport arrays to render to different parts of a texture at once, though support for this is still experimental. As with layered render-to-texture, the viewport to render into is selected by writing to a special output variable in the geometry shader. This makes it possible to render into various parts of a texture atlas or render to different areas of the screen within the same rendering pass.

In the screenshots below, it is used to render the shadow maps of different cameras into one big shadow atlas, allowing many shadow-casting lights to be rendered in one pass. This approach has two advantages: parts of the atlas can be rendered to on demand, which effectively limits the number of shadow casters that update within one frame, and different lights can use shadow maps of different resolutions.

[Screenshots: boxes-shadows and panda-shadows]

Stereo FBOs

We’ve long supported stereoscopic rendering, and Panda3D could already take advantage of specialized stereo hardware. Now, as part of the development toward Oculus Rift support, we’ve made it possible to create a buffer on any hardware that will automatically render both left and right views when it is associated with a stereoscopic camera. This makes it possible to create post-processing effects in your stereoscopic application without having to create two separate buffers.

With the multiview texture support introduced in Panda3D 1.8, a single texture object can contain both left and right views, and Panda3D automatically knows which view to use. This makes enabling stereo rendering in an application that uses post-processing filters very straightforward, without needing to set up two separate buffers or textures for each view, and it can be enabled or disabled at the flick of a switch.
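
Creating such a buffer might look like the following sketch (assuming the new stereo flag on FrameBufferProperties; the resulting texture is a multiview texture containing both eyes):

    from panda3d.core import FrameBufferProperties, Texture

    # Request a stereo buffer; on hardware without quad-buffered stereo,
    # Panda3D renders the left and right views itself.
    fbprops = FrameBufferProperties()
    fbprops.setStereo(True)

    tex = Texture("stereo-scene")
    buf = base.win.makeTextureBuffer("stereo-buffer", 512, 512,
                                     tex, False, fbprops)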

Debugging and profiling features

We now take advantage of timer query support with the new GPU profiling feature added to PStats: Panda3D can now ask the driver to measure how much time the draw operations actually take to complete, rather than the CPU time it takes to issue the commands. This feature is instrumental in finding performance bottlenecks in the rendering pipeline by letting you know exactly which parts of the process take the longest.

The reason this feature is important is that PStats currently only displays the time it takes for the OpenGL drawing functions to finish. Most OpenGL functions, however, only cause the commands to be queued up to be sent to the GPU later, and return almost immediately. This makes the performance statistics very misleading, and makes it very difficult to track down bottlenecks in the draw thread. By inserting timer queries into the command stream, and asking for the results thereof a few frames later, we know how much time the commands actually take without significantly delaying the rendering pipeline.

It is also possible to measure the command latency, which is the time that it takes for the GPU to catch up with the CPU and process the draw commands that the CPU has issued.
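
Enabling it is a matter of configuration (a sketch assuming the pstats-gpu-timing variable used by the development builds):

    from panda3d.core import loadPrcFileData

    # Insert timer queries into the GL command stream and report the
    # measured GPU times to PStats a few frames later.
    loadPrcFileData("", "pstats-gpu-timing true")
    # Connect to a running PStats server on startup, as usual.
    loadPrcFileData("", "want-pstats true")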

GPU Timing with PStats

This feature is available in current development builds. The documentation about this feature is forthcoming.

Other features

This blog post is by no means a comprehensive listing of all the features that will be available in the new version, but here are a few other features we thought were worth mentioning:

Comparison of cube mapping with and without seamless mode enabled

  • We’ve added support for seamless cube maps, to eliminate the seams that can appear on the edges of cube maps, especially at lower mipmap levels. This is enabled by default, assuming that the driver indicates support.
  • We also added support for the new KHR_debug and ARB_debug_output extensions, which give much more fine-grained and detailed debugging information when something goes wrong.
  • We’ve added a range of performance improvements, and we’ve got even more planned!
  • We’ve made it even easier to write GLSL shaders by adding more shader inputs that Panda3D can provide. This includes not only built-in inputs containing render attributes, but also a greater range of custom input types, such as integer data types and matrix arrays.
  • Line segments are now supported in a single draw call through use of a primitive restart index (also known as a strip-cut index), improving line drawing performance.
  • It is now possible to access integer vertex data using ivec and uvec data types, and create textures with an integer format (particularly useful for atomic access from compute shaders).
  • In the fixed-function pipeline, the diffuse lighting component is now calculated separately from the specular lighting component. This means that the specular highlight will no longer be tinted by the diffuse color. This is usually more desirable and better matches up with Direct3D behavior. The old effect can still be obtained by multiplying the diffuse color into the specular color, or by disabling a configuration flag.

Special thanks go to Tobias Springer, who has been using some of the new features in his project, for graciously contributing the screenshots for some of the listed features. His results with Panda3D look amazing!


Buffer protocol support

I’d like to talk for a moment about the new buffer protocol support in the latest development version of Panda3D. It’s not a particularly exciting feature, but it can be an important one, especially if you use Panda3D together with other libraries like NumPy or if you need to do a lot of low-level operations on texture or geometry data from Python. However, most use cases will not require this functionality.

The Python buffer protocol is a way for Python applications to get a direct hook into C/C++ memory, arrays in particular. Panda3D classes that support it provide a pointer into the underlying memory to the Python interpreter along with a description of how the data is laid out in memory. This description is necessary for Python to know how to access and copy the information.

Starting with Python 2.7, you can use the built-in memoryview type to access the memory underlying an object exposing the buffer interface. You can then manipulate the data by converting it to a list or array.array object, creating sub-views and operating on those, writing or reading parts of it to a file, or even just modifying the memory directly as if it were a regular Python list.

Right now, the only Panda3D classes that expose the buffer interface are GeomVertexArrayData and PointerToArray (the latter of which is used for most array storage purposes in Panda3D, including textures), but more classes can easily be added on request. Conversely, Panda allows taking an existing buffer object (such as from array.array or a NumPy array) as source data for a texture or a vertex data array.

When copying data to other libraries such as NumPy, this can help cut down on unnecessary copy operations. Previously, you would call a method like get_data(), which creates a C++ string (one copy operation already), which is then wrapped into a Python string (another copy). Finally, you would pass this string to NumPy, which would perform at least one more copy operation to get it into its own representation. But since NumPy also supports the buffer protocol, you can now copy the contents of a texture or of a vertex data array straight into a NumPy array without any unnecessary copy operations.
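
For example, getting a texture's RAM image into NumPy can now be done without intermediate copies (a sketch; modifyRamImage returns a PointerToArray, which exposes the buffer interface):

    import numpy as np

    tex = loader.loadTexture("image.png")
    # This creates a NumPy view directly over Panda3D's memory, with no
    # intermediate copies; call .copy() if you need an independent array.
    data = np.frombuffer(tex.modifyRamImage(), dtype=np.uint8)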

One other interesting use case for this feature is the fast and efficient manipulation of vertex data and texture data from Python. Instead of having to create a GeomVertexRewriter or a PNMImage to modify the respective data, you can now create a memoryview to iterate over the data directly, easily copy subsets around, or page them out to disk. In my own use case, which involved a lot of geometry generation and manipulation, this flexibility allowed me to dramatically decrease the time spent generating geometry and flattening it. Direct access to the memory also allowed me to quickly page chunks of geometry data out to disk and back into memory when necessary.
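
As a sketch of what that direct access can look like, assuming a hypothetical Geom whose first vertex column is three packed floats:

    import struct

    vdata = geom.modifyVertexData()
    # GeomVertexArrayData exposes the buffer interface directly.
    view = memoryview(vdata.modifyArray(0))

    # Overwrite the position of the first vertex in place.
    view[0:12] = struct.pack("<fff", 1.0, 2.0, 3.0)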

The buffer protocol provides a lot of the flexibility of low-level memory access without exposing you to all of the intricacies of C/C++ memory management. In particular, the data is reference counted, so you don’t need to worry about deleting it. You should in fact be able to keep memoryviews around for non-immediate consumption, but keep in mind that you may still need to tell Panda3D when you’ve modified the data later (for instance with an additional explicit call to modify_ram_image()).

This feature will be available in the 1.9.0 release of the Panda3D SDK.


Triple your frame rate?

Historically, Panda has always run single-core. And even though the Panda3D codebase has been written to provide true multithreaded, multi-processor support when it is compiled in, by default we’ve provided a version of Panda built with the so-called “simple threads” model which enforces a single-core processing mode, even on a multi-core machine. But all that is changing.

Beginning with the upcoming Panda3D version 1.8, we’ll start distributing Panda with true threads enabled in the build, which enables you to take advantage of true parallelization on any modern, multi-core machine. Of course, if you want to use threading directly, you will have to deal with the coding complexity issues, like deadlocks and race conditions, that always come along with this sort of thing. And the Python interpreter is still fundamentally single-core, so any truly parallel code must be written in C++.

But, more excitingly, we’re also enabling an optional new feature within the Panda3D engine itself, to make the rendering (which is all C++ code) run entirely on a sub-thread, allowing your Python code to run fully parallel with the rendering process, possibly doubling your frame rate. But it goes even further than that. You can potentially divide the entire frame onto three different cores, achieving unprecedented parallelization and a theoretical 3x performance improvement (although, realistically, 1.5x to 2x is more likely). And all of this happens with no special coding effort on your part as the application developer; you only have to turn it on.

How does it work?

To use this feature successfully, you will need to understand something about how it works. First, consider Panda’s normal, single-threaded render pipeline. The time spent processing each frame can be subdivided into three separate phases, called “App”, “Cull”, and “Draw”:

app, cull, draw

In Panda’s nomenclature, “App” is any time spent in the application itself, i.e. your program. This is your main loop, including any Python code (or C++ code) you write to control your particular game’s logic. It also includes any Panda-based calculations that must be performed synchronously with this application code; for instance, the collision traversal is usually considered to be part of App.

“Cull” and “Draw” are the two phases of Panda’s main rendering engine. Once your application code finishes executing for the frame, then Cull takes over. The name “Cull” implies view-frustum culling, and this is part of it; but it is also much more. This phase includes all processing of the scene graph needed to identify the objects that are going to be rendered this frame and their current state, and all processing needed to place them into an ordered list for drawing. Cull typically also includes the time to compute character animations. The output of Cull is a sorted list of objects and their associated states to be sent to the graphics card.

“Draw” is the final phase of the rendering process, which is nothing more than walking through the list of objects output by Cull, and sending them one at a time to the graphics card. Draw is designed to be as lightweight as possible on the CPU; the idea is to keep the graphics command pipe filled with as many rendering commands as it will hold. Draw is the only phase of the process during which graphics commands are actually being issued.

You can see the actual time spent within these three phases if you inspect your program’s execution via the PStats tool. Every application is different, of course, but in many moderately complex applications, the time spent in each of these three phases is similar to the others, so that the three phases roughly divide the total frame time into thirds.

Now that we have the frame time divided into three more-or-less equal pieces, the threaded pipeline code can take effect, by splitting each phase into a different thread, so that it can run (potentially) on a different CPU, like this:

app, cull, draw on separate threads

Note that App remains on the first, or main thread; we have only moved Cull and Draw onto separate threads. This is important, because it means that all of your application code can continue to be single-threaded (and therefore much easier and faster to develop). Of course, there’s also nothing preventing you from using additional threads in App if you wish (and if you have enough additional CPU’s to make it worthwhile).

If separating the phases onto different threads were all that we did, we wouldn’t have accomplished anything useful, because each phase must still wait for the previous phase to complete before it can proceed. It’s impossible to run Cull to figure out what things are going to be rendered before the App phase has finished arranging the scene graph properly. Similarly, it’s impossible to run Draw until the Cull phase has finished processing the scene graph and constructing the list of objects.

However, once App has finished processing frame 1, there’s no reason for that thread to sit around waiting for the rest of the frame to be finished drawing. It can go right ahead and start working on frame 2, at the same time that the Cull thread starts processing frame 1. And then by the time Cull has finished processing frame 1, it can start working on culling frame 2 (which App has also just finished with). Putting it all in graphical form, the frame time now looks like this:

The fully staged render pipeline

So, we see that we can now crank out frames up to three times faster than in the original, single-threaded case. Each frame now takes the same amount of time, total, as the longest of the original three phases. (Thus, the theoretical maximum speedup of 3x can only be achieved in practice if all three phases are exactly equal in length.)

It’s worth pointing out that the only thing we have improved here is frame *throughput*: the total number of frames per second that the system can render. This approach does nothing to improve frame *latency*, or the total time that elapses between the time some change happens in the game, and the time it appears onscreen. This might be one reason to avoid this approach, if latency is more important than throughput. However, we’re still talking about a total latency that’s usually less than 100ms or so, which is faster than human response time anyway; and most applications (including games) can tolerate a small amount of latency like this in exchange for a smooth, fast frame rate.

In order for all of this to work, Panda has to do some clever tricks behind the scenes. The most important trick is that there need to be three different copies of the scene graph in different states of modification. As your App process is moving nodes around for frame 3, for instance, Cull is still analyzing frame 2, and must be able to analyze the scene graph *before* anything in App started mucking around to make frame 3. So there needs to be a complete copy of the scene graph saved as of the end of App’s frame 2. Panda does a pretty good job of doing this efficiently, relying on the fact that most things are the same from one frame to the next; but still there is some overhead to all this, so the total performance gain is always somewhat less than the theoretical 3x speedup. In particular, if the application is already running fast (60fps or above), then the gain from parallelization is likely to be dwarfed by the additional overhead requirements. And, of course, if your application is very one-sided, such that almost all of its time is spent in App (or, conversely, almost all of its time is spent in Draw), then you will not see much benefit from this trick.

Also, note that it is no longer possible for anything in App to contact the graphics card directly; while App is running, the graphics card is being sent the drawing commands from two frames ago, and you can’t reliably interrupt this without taking a big performance hit. So this means that OpenGL callbacks and the like have to be sensitive to the threaded nature of the graphics pipeline. (This is why Panda’s interface to the graphics window requires an indirect call: base.win.requestProperties(), rather than base.win.setProperties(). It’s necessary because the property-change request must be handled by the draw thread.)

Early adopters are invited to try this new feature out today, before we formally release 1.8. It’s already available in the current buildbot release; to turn it on, see the new manual page on the subject. Let us know your feedback! There are still likely to be kinks to work out, so we’d love to know how well it works for you.
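
For reference, turning it on comes down to a single configuration variable (the Cull/Draw threading model described on that manual page):

    from panda3d.core import loadPrcFileData

    # Run App, Cull, and Draw as separate pipeline stages on separate
    # threads; this must be set before the window is opened.
    loadPrcFileData("", "threading-model Cull/Draw")

    from direct.showbase.ShowBase import ShowBase
    base = ShowBase()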


Panda3D and Cython

This is about how to speed up your Python code, and has no direct impact on Panda3D’s performance. For most projects, the vast majority of the execution time is spent inside Panda3D’s C++ or on the GPU, so no amount of tuning your Python code will help. For the other cases, where you do need to speed up your Python code, Cython can help. This is mainly addressed to people who prefer programming in Python, but know at least a little about C. I will not discuss how to do optimizations within Python, though if this article is relevant to you, you really should look into that as well.

Cython is an interesting programming language. It uses an extended version of Python’s syntax to allow things like statically typed variables and direct calls into C++ libraries. Cython compiles this code to C or C++, which then compiles as a Python extension module that you can import and use just like a regular Python module. There are several benefits to this, but in our context the main one is speed. Properly written Cython code can be as fast as C code, which in particular cases can be as much as 1000 times faster than nearly identical Python code. Generally you won’t see 1000x speed increases, but the gain can be quite large. This does mean the modules only work on the platform they were compiled for, so you will need to compile alternate versions for different platforms.

By default, Cython compiles to C, but the new 0.13 version supports C++. This is more useful here, as you probably use at least one C++ library: Panda3D. I decided to try this out, and after stumbling on a few simple issues, I got it to work, and I don’t even know C++.

Before I get to the details, I’ll outline why you might want to use Cython rather than porting performance bottlenecks to C++ by hand. The main benefit is in the process, as well as the required skill set. If you have a large base of Python code for a project, and you decide some of it needs to be much faster, you have a few options. The common approach seems to be to learn C++, port the code, and learn how to interface with it from Python. With Cython, you can just add a few type definitions on variables where you need the performance increase, and compile it, which gives you a Python module that works just like the one you had. If you need to speed up the code that interfaces with Panda3D, you can swap the Python API calls for C++ ones. Using Cython allows you to put effort into speeding up just the parts of the code you need to work on, and to do so without having to change very much. This is vastly different from ditching all the code and reimplementing it in another language. It also requires you to learn a pretty minimal amount of new material, and you get to keep the niceties of the Python syntax that many Python coders have come to appreciate.

There are still major reasons to actually code in C++ when working with Panda, but as someone who does not do any coding in C++, I won’t talk about it much. If you want to directly extend or contribute to Panda3D, want to avoid redundantly specifying your imports from header files (Cython will require you to re-specify the parts of the API you are using rather than just using the header files shipped with Panda), or simply prefer C++, then C++ may be a better option. I mainly see Cython as a convenient option when you end up needing to speed up parts of a Python code base; however, it is practical to undertake large projects from the beginning in Cython.

Cython does have some downsides as well. It is still in rather early development, which means you will encounter bugs in its translator as well as in the produced code. It also lacks support for a few Python features, such as most uses of generators. Generally I haven’t had much trouble with these issues, but your experience may differ.

Cython does offer an interesting side benefit as well: because it allows you to optionally statically type variables, it can detect more errors at compile time than Python can.

To get started, first you need an install of Cython 0.13 (or probably any newer version). If you have a Cython install, you can check the version with the -V flag. You can pick up the source from the Cython site, and install it by running “python setup.py install” from the Cython source directory. You will also need to have a compiler. The Cython site should help you get everything set up if you need additional guidance.

Then you should try out a sample to make sure you have everything you need, and that it’s all working. There is a nice C++ sample for Cython on the Cython Wiki. (This worked for me on Mac, and on Windows using MinGW or MSVC as a compiler).

As for working with Panda3D, there are a few things I discovered:

  • There are significant performance gains to be had by just compiling your existing Python modules with Cython. With a little additional work adding statically typed variables, you can get larger performance gains without even moving over to Panda’s C++ API (which means you don’t need to worry about linking against Panda3D, which can be an issue).
  • Panda3D already has Python bindings with nice memory management, so I recommend instantiating all the objects using the Python API, and only switching to the C++ one as needed.
  • You can use the ‘this’ property on Panda3D’s Python objects to get a pointer to the underlying C++ object.
  • On Mac, you need to make sure libpanda (and in some cases possibly other libraries as well) is loaded before importing your extension module if you use any of Panda3D’s libraries.
  • On Windows, you need to specify the libraries you need when compiling (in my case, just libpanda).
  • The C++ classes and Python classes tend to have the same names. To resolve this, you can use “from x import y as z” when importing the Python ones, or you can just import panda3d.core and use the full name of the classes (like panda3d.core.Geom). There may be a way to rename the C++ classes on import too.
  • If using the Panda3D C++ API on Windows, you will need to use the MSVC compiler. You can get Microsoft Visual Studio 2008 Express Edition for free which includes the needed compiler.

Using this technique I got a 10x performance increase on my code for updating the vertex positions in my Geom. It avoided having to create Python objects for all of the vertices and pass them through the Python API, which translates them back into C++ objects. It was just a matter of moving one call in the inner loop over to the other API. This, however, was done in already optimized Cython code that was simply loading vertex positions stored in a block of memory into the Geom; most use cases would likely see less of a benefit. Overall though, I gained a lot of performance both from the change over to Cython, and from the change over to the C++ API. These changes only required relatively small modifications to the speed-critical portions of my existing Python code.

I made a rather minimal example of using a Panda3D C++ API call from Cython. Place the setup.py and the testc.pyx files in the same directory, and from that directory, run setup.py with the Python installation you use with Panda3D. If everything is properly configured, this should compile the example Cython module, testc.pyx, to a Python extension module and run it. If it works, it will print out a few lines ending with “done”. It is likely that you may need to tweak the paths in setup.py. If you are not on Mac or Windows, you will get an error indicating where you need to enter your compiler settings (mostly just the paths to Panda3D’s libraries).
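
To give an idea of what's involved, a minimal setup.py along these lines might look like the following sketch; the include and library paths are placeholders that you will need to adjust for your own Panda3D install:

    from distutils.core import setup
    from distutils.extension import Extension
    from Cython.Distutils import build_ext

    ext = Extension(
        "testc",
        ["testc.pyx"],
        language="c++",                          # compile via C++, not C
        include_dirs=["/usr/include/panda3d"],   # adjust to your install
        library_dirs=["/usr/lib/panda3d"],       # adjust to your install
        libraries=["panda"],                     # i.e. libpanda
    )

    setup(name="testc",
          cmdclass={"build_ext": build_ext},
          ext_modules=[ext])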

I would like to thank Lisandro Dalcin from the Cython-Users mailing list who helped me get this working on Windows.


Pandai Library: A Quick Review of panda3d.ai

In 2009, the Entertainment Technology Center (ETC) at Carnegie Mellon University launched a graduate student project to add to Panda3D a collection of artificial intelligence behaviors like seek, flock, and evade, along with 2D pathfinding. This blog post is a reminder that this work is available as part of Panda3D 1.7.0, and receives ongoing attention from the ETC team. The timeline for this work has been as follows:

  • Summer 2009: Collect feedback from the Panda3D community via Pandai forum post on requirements for an AI library. This forum post remains active today to address additions to the Pandai code base.
  • August – December 2009: Starting from Craig Reynolds’ published work on flocking behavior and the A* algorithm for two-dimensional pathfinding between points, develop the C++ code based on community feedback, prototyping with Building Virtual Worlds graduate student work at the ETC.
  • December 2009: At the insistence of ETC faculty advisors Ruth Comley and myself, the Pandai student team created a number of demonstrations and a detailed Pandai ETC Project web site documenting the project work. The demonstrations show off capabilities such as wander, pursue, and evade in a fish demo. The project web site includes further descriptions of the project team, motivations for the work, and downloadable content, including the art assets (like the fish) and code (the fish demo) needed to run the demonstrations. See the Pandai ETC Project download page.

    Pandai demo: fish pursue hook until one is caught, then evade it

  • January 2010: Thanks to rdb, the Pandai library was published as part of the Panda3D 1.7.0 release. One change of note regarding the downloadable examples from the ETC Pandai Project web site: rather than “from libpandaai import *”, the Python code should use “from panda3d.ai import *”. With this minor edit, you will be able to download and run the fish demo and others using Panda3D 1.7.0 (a minimal usage sketch follows this list).
  • July 2010: Ongoing collection of feedback from the Pandai forum post led to the release of a Blender meshgen tool for pathfinding. A link to this tool has been added to the Pandai ETC Project download page.
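
As a quick taste of the library, a minimal seek behavior looks roughly like this (a sketch: seekerNP and targetNP stand in for your own NodePaths, and the tuning values are arbitrary):

    from panda3d.ai import AIWorld, AICharacter

    aiWorld = AIWorld(render)

    # Arguments: name, NodePath, mass, movement force, maximum force.
    aiChar = AICharacter("seeker", seekerNP, 100, 0.05, 5)
    aiWorld.addAiChar(aiChar)
    aiChar.getAiBehaviors().seek(targetNP)

    # The AI world must be updated every frame.
    def updateAI(task):
        aiWorld.update()
        return task.cont

    taskMgr.add(updateAI, "AIUpdate")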

The Pandai ETC team is responsible for version 1.0 of the Pandai library, and remains active in its support. Your comments are welcome here regarding the Pandai effort and the shared code and examples. For help requests, continue the thread within the Pandai forum post.


Porting to Java

This was an April Fool’s Joke. The information in it is not meant to be taken seriously. Click the post title if you want to see it.


Panda SE Project

During the past few months, several students at Carnegie Mellon University’s Entertainment Technology Center (ETC) have been working on improving the egging process as well as incrementally improving the shader system.  Just take a look at their smiling faces!

Panda SE Team Photo

From Left: Wei-Feng Huang, Federico Perazzi, Shuying Feng (Panda), Deepak Chandrasekaran, Andrew Gartner

Those of you who have been with Panda3D a long time will know that there have been ETC Panda3D projects in the past. Some of them had limited success due to an oversized project scope. This project will instead focus on making complete feature sets rather than half-implemented pieces. It will also focus on documentation, both within the code and in the manual, to make sure that you, the Panda3D community, will be able to take this work and build on top of it.

With that said, this project will primarily focus on two things:

  1. The shader inputs
  2. The egging/model exporting process

Shader Inputs

If you’ve taken a look at the source code of Panda3D’s shader system and have had any experience in professional game engine development, you’ll notice that it’s a system that isn’t fully implemented. In fact, the first shader system was an ETC student project, and it has since been improved through other ETC projects and by the Panda3D community. The shader inputs work continues this effort in a structured manner.

Shading languages have supported arrays and arrays of vectors as inputs for quite some time. However, Panda3D has never supported passing them. There have been some hacks in the past where arrays are passed as textures, but this is not ideal for performance and it ruins texture caching schemes. After this project completes, users will be able to pass arrays of vectors and matrices directly into the shader.
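
Once this lands, passing an array of vectors might look something like the following sketch (PTA_LVecBase4f is the pointer-to-array type used on the Python side; the model and the input name are hypothetical):

    from panda3d.core import PTA_LVecBase4f, UnalignedLVecBase4f

    # Build an array of light positions to pass as a single shader input.
    lightPositions = PTA_LVecBase4f()
    for x, y, z in [(0, 0, 10), (5, 0, 10), (-5, 0, 10)]:
        lightPositions.pushBack(UnalignedLVecBase4f(x, y, z, 1))

    model.setShaderInput("light_positions", lightPositions)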

Screenshot of multiple lights demo

This may not seem that exciting at first, but it lays the groundwork for many more things. If you’re new to computer graphics: a complete shader inputs system allows for the following, just to name a few.

  • Hardware accelerated actors/characters
  • Shader based instancing with dynamic texture and animation support (crowds)
  • Shader based vegetation system (fast trees and grass)
  • A real deferred shading system
  • A real light manager system for shader based lights

A Real Egging Pipeline

Up until now, there have been several attempts at user interfaces to maya2egg, dae2egg, and so on. Most of them are just simple front-ends to the command-line tools. This new user interface is much more than that: it is an artist-friendly build system. Just check out some of the features:

  • Simple mode for when you don’t want a build system
  • Support for multiple maya versions
  • Support for egg tools such as egg-opt-char and egg-palettize
  • A batching system that automatically detects whether a file has been changed to allow for minimal rebuilds
  • Support for all tools to be built into the batch system
  • Save/Load batch scripts

Like shader inputs, this lays the groundwork for much future work. For any game engine to be professional quality, it needs a set of robust artist tools, such as node-based shader generators and artist-friendly level editors.

Screenshot of WIP Egging GUI


ABI Backward Compatibility

Hey C++ developers of Panda3D,

I’ve just checked in a fix to the codebase that should give minor releases a backward-compatible ABI. This means that if you link something against the Panda3D 1.8.0 libraries, you’ll still be able to use it with the libraries of any Panda3D 1.8.X release. This rule was created so that C++ users are able to use the web plugin functionality.

To the people working directly on the Panda3D codebase: do not merge anything onto the release branch (e.g. panda3d_1_7_branch) that is not backward ABI compatible. You can merge in new symbols, but you cannot merge altered or removed symbols. This rule does not apply to the trunk; you can do whatever you want there, as there are no ABI rules on the trunk. Of course, these rules don’t affect Python code; just exposed C/C++ symbols.

But you really don’t need to worry about any of this unless you actually want to merge things onto the release branch, and this is usually done by the release maintainer anyway.

As for linking to libraries on non-Windows systems: libraries like libpanda.so / libpanda.dylib will now symlink to libpanda.so.1.7 / libpanda.1.7.dylib. This ensures that if you link to libpanda, it will link against the 1.7 version of the library and won’t conflict with libraries of any other series. This allows you to have multiple series of Panda3D installed at the same time and run different games that are linked against different series of Panda.

The latest buildbot releases should already abide by these rules. I’m going to put up a buildbot script soon that regularly builds the release branch and alerts when the ABI compatibility is broken. (E-mail me if you want to be put on the buildbot e-mail notify list.) The next release, 1.7.1, will start being ABI compatible.

Have fun!


Pointer Textures

With the introduction of Panda3D 1.7.0, a very powerful but also very dangerous feature is now available to the public. This feature has been used internally at Walt Disney Imagineering for some time, but now everyone can use it. We call these Pointer Textures. They allow a user within Python to give a long int to Panda3D, and Panda will cast this to a pointer and, without question, upload the data at this pointer to the graphics card!

You can see why this is dangerous.  This bypasses any checks or copies within Panda and is a direct gate to the graphics card.  This means that if you do something wrong, your application will crash.  No asserts, no error messages, it will just flat out crash.  That should be about the extent of it, but I’d be lying if I said I didn’t cause a few BSODs here and there with this technique.   Now that we have this little disclaimer out of the way, what is it good for?

For starters, it’s really fast. The current MovieTexture implementation actually makes three copies of the same data within memory: once from the decoder to system memory, once to Panda-safe memory, and then once more to the graphics card. This certainly works and the performance is quite reasonable. Where this fails is when you have large amounts of video data, like 1080p content. That’s a lot of data to be unnecessarily moving around. For these applications, we use Pointer Textures.

We have an external Python module that decodes movies using DirectShow, and through pointer textures we can display them in Panda as textures. This has side benefits as well. Since DirectShow is multithreaded and a Python module is not locked to the main Python thread, DirectShow can decode the movies using different cores. This makes it a truly multi-threaded application. On a machine with 4 cores, I’ve managed to squeeze two simultaneous 720p videos through without problems.

Since pointer textures allow you direct access to graphics card memory, it also means that you can load anything as long as you have the pointer to the data and the correct format.  Using this we’ve implemented Image Based Lighting via HDR textures.

Another added benefit of using pointer textures is that you don’t have to modify Panda to get special features. For instance, if you wanted to load an image format that is not currently supported by the texture loader, you would previously have had to set up a Panda build environment, get the third-party libs, and then modify the texture loader. Now you don’t have to. Just compile a Python module via SWIG or the Python C API and make a function that returns the pointer as a long integer.

This goes far beyond static images or movies as well.  Since it’s just data at a pointer, this method also works with special devices.  Using this, we have connected Panda to specialized cameras, webcams, TV Tuner cards and even capture cards.  The possibilities are endless.

If you want to start using this, here is a quick example that just loads an HDR .pfm texture and displays it. It only works on Windows with Panda3D 1.7.0. I’ve included the source code for the Python module as well, in case you feel like compiling that too. Simply run PfmTexture.py and you should see the demo.


Hardware Geometry Instancing

Recently, I have been working on a little Infinite Terrain demo that explains how to take full advantage of Panda3D’s terrain capabilities. In the process I have been adding various features to Panda3D that would make it easier for me, including many improvements to the Shader Generator. (On a sidenote, I didn’t end up needing any shaders for the terrain.)
Last Monday, I added trees to the terrain. As this is an infinite terrain demo, I needed to add quite a number of trees. But I found that (even with Panda3D’s flattening capabilities) my GPU quickly let me down after a few thousand trees. So, I realized I needed to add geometry instancing support to Panda3D.
And so I did. It turned out to be quite trivial to implement and only took me an hour or two. The results are quite pleasing! I have managed to render over 100000 trees (and a huge terrain) at the same time at a reasonable framerate! Here’s a screenshot of what it looks like right now:

[Screenshot: instanced trees on the infinite terrain]

Of course, that WIP scene could use a lot of improvement, but you get my point. And with some proper culling and LOD, I could push the number of trees even higher.

But doesn’t Panda3D already support instancing?

Currently, Panda3D supports instancing of animated models. That is entirely unrelated to geometry instancing. The existing instancing system only exists to improve performance if you have a lot of animated models, by reducing the number of vertex displacement calculations that are done by Panda3D’s animation system. Geometry instancing, on the other hand, exists to greatly reduce the amount of data that is passed to the video card. Whether the model is animated or not is irrelevant to the new instancing system.

How does it work?

Before yesterday, if you wanted to create multiple instances of a model, you’d either load the model multiple times or use copyTo. I’ve seen that many people use instanceTo in this case, but that will have no positive effect on performance for static models. The geometry will still be passed many times to the GPU, which is a slow process.
With the new system, you keep just a single copy of the model and call setInstanceCount(n) on it. This means that it will still be passed to the GPU only once, but it will be rendered n times.

You might be wondering, how do I give each instance different parameters or a different position? Well, that can be done in the shader. You can access the instance ID in the shader and calculate the position, color, etc. based on that, or simply pick a different model transform matrix from an array of matrices that you pass to the shader. This allows you to do basically anything. You can send a single sphere to the GPU, pass a 3D texture with a displacement map in each layer, and set the instance count to 1024. That will result in 1024 unique rocks using just one batch call and a very limited amount of uploaded geometry.
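
In practice, the Python side might look like this sketch ("tree" and "instance.sha" are hypothetical assets; the shader is responsible for offsetting each instance based on its ID):

    from panda3d.core import OmniBoundingVolume

    tree = loader.loadModel("tree")
    tree.reparentTo(render)

    # Upload the geometry once; render it 500 times in a single batch.
    tree.setInstanceCount(500)

    # Culling only considers the original copy, so expand the bounds to
    # keep the off-origin instances from being culled away.
    tree.node().setBounds(OmniBoundingVolume())
    tree.node().setFinal(True)

    # The shader positions each instance, e.g. by indexing an array of
    # offsets with the instance ID.
    tree.setShader(loader.loadShader("instance.sha"))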

This leaves the question of whether this will actually work without use of shaders. The answer is no, I’m afraid. There is an OpenGL extension that allows you to use geometry instancing with the fixed-function pipeline, but very few video cards support it. Because it would be quite complicated to implement that, I decided not to do it.

Note that this is only supported in OpenGL so far. Maybe someone will add support for this to Panda’s DirectX side someday.