Threading -- How to squeeze more power out of panda

Postby zhao » Thu Jul 26, 2012 11:08 pm

A small summary on threading: how to squeeze more power out of Panda3D

This is a small primer on how to squeeze more CPU cycles out of a Panda3D program
that is driven from Python. There is a lot of information scattered across
the forum on how to set up threading and task chains, but very little
on the idiosyncrasies of Python threading, which are important to understand if you really
want to squeeze more CPU cycles out of Panda3D.

First, let's talk about OS threading in C/C++, which is the most basic level of threading.
OS threads allow you to run two blocks of C/C++ code in parallel on multiple CPU cores.
If your code is entirely C/C++, you can generally schedule the workload however you like.
This includes running the same block of C++ code at the same time on two different CPU cores.

<All C code, 2 Threads >
         Core 1              Core 2             Time
    Thread1 -- C Code    Thread2 -- C Code       |
    Thread1 -- C Code    Thread2 -- C Code       |
    Thread1 -- C Code    Thread2 -- C Code       V


Python (more precisely CPython, the standard interpreter) is essentially a C program.
However, it's a special C program in which only one thread can execute Python code at a
time, no matter how many threads are used to run it. For example,
if I have two C/C++ threads trying to run Python code, whenever one thread is using
the Python interpreter, the other thread may not access it.
This is the infamous Python GIL (Global Interpreter Lock), and the reason why Python is usually
thought of as single-threaded.

Thus, if your entire codebase is pure Python, there is no advantage in running
multiple threads. In fact, your code can be ~2x as slow because of thread contention
issues.

<All python code, 2 Threads>
                                                Time
          Core 1         Core 2                   |
    Thread1 - Python                              |
                      Thread2 Python              |
    Thread1 - Python                              |
                      Thread2 Python              V

* Thread 2 wresting the Python interpreter away from Thread 1 is very costly,
because Python's handling of GIL handoffs between threads is inefficient.
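
A rough way to see this for yourself (a sketch, not a benchmark): the snippet below runs the same pure-Python countdown twice in sequence and then in two threads. The function names and the loop size are made up for illustration, and it is written for Python 3 (it uses time.perf_counter). Do not expect the threaded run to be faster; how much slower it is, if at all, depends on the interpreter version and the platform.

```python
import threading
import time


def count_down(n):
    """Pure-Python, CPU-bound loop; it never gives up the GIL voluntarily."""
    while n > 0:
        n -= 1
    return n


def run_serial(n):
    """Run the countdown twice, one after the other, and return the elapsed seconds."""
    start = time.perf_counter()
    count_down(n)
    count_down(n)
    return time.perf_counter() - start


def run_threaded(n):
    """Run the countdown in two threads at once and return the elapsed seconds."""
    threads = [threading.Thread(target=count_down, args=(n,)) for _ in range(2)]
    start = time.perf_counter()
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return time.perf_counter() - start


if __name__ == "__main__":
    n = 5000000  # arbitrary size, large enough to take a measurable amount of time
    print("serial:   %.3f s" % run_serial(n))
    print("threaded: %.3f s" % run_threaded(n))
```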

However, there is a special circumstance where threading under Python is beneficial:
when there is a lot of C/C++ code interlaced with the Python code. When a thread
running Python code hits a Python function that calls into a C/C++ block, that thread has the
option of relinquishing control of the Python interpreter (giving up the GIL)
while it's running the C/C++ code, allowing another thread to run some Python code.
This effectively lets the Python interpreter get 'twice' as much done as you would
expect. Note that not all C/C++ code is written to give up the GIL; it must
be explicitly programmed to do so. (Doing this is trivial, though, e.g. with the
Py_BEGIN_ALLOW_THREADS / Py_END_ALLOW_THREADS macros of the CPython C API.)

          Core 1            Core 2
    Thread1 - Python                            Time
              C/C++      Thread2 Python           |
    Thread1   Python                              |
              C/C++      Thread2 Python           V


    In this last scenario, Core 1 is always fully occupied, while Core 2 can
maintain ~50% occupancy.
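
To illustrate the interleaving in the diagram above, here is a small sketch in which time.sleep() stands in for a C/C++ call that releases the GIL (time.sleep does release it; the 0.2 s duration and the helper names are arbitrary choices for this example). Each thread does a little Python work and then "enters C"; the recorded intervals show that the two C sections run at the same time.

```python
import threading
import time


def python_then_c(name, log, c_seconds=0.2):
    """Do a little Python work, then make a C-level call that releases the GIL."""
    total = sum(i * i for i in range(10000))  # Python bytecode: runs under the GIL
    start = time.perf_counter()
    time.sleep(c_seconds)                     # stand-in for a GIL-releasing C/C++ call
    log[name] = (start, time.perf_counter())  # when the "C section" started and ended
    return total


def run_pair():
    """Run two such threads and return the time interval of each C section."""
    log = {}
    threads = [threading.Thread(target=python_then_c, args=(name, log))
               for name in ("t1", "t2")]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return log


def intervals_overlap(a, b):
    """True if the (start, end) intervals a and b share some time."""
    return a[0] < b[1] and b[0] < a[1]


if __name__ == "__main__":
    log = run_pair()
    print("C sections overlapped:", intervals_overlap(log["t1"], log["t2"]))
```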

In Panda, the actual rendering is all done in C/C++. Usually, you use Python
to set up the scene, and then (behind the scenes) graphicsEngine.renderFrame(),
a C++ function, is called to do the actual rendering. If you have a lot of shaders and models in
your scene, your FPS may be very low, with most of the time spent in C++ uploading
data to the GPU or waiting for the GPU to render. This time is a great opportunity
to run some game/AI code in a second Python thread (using threading or a task chain with
numThreads > 0).

        Core 1                                Core 2
   Thread1  - Python 1ms (setup scene)
              C/C++ 16ms (renderFrame)      Thread2 Python ( AI/Code ) 
--Frame 0 End------------------------------------------ 17ms/Frame
   Thread1  - Python 1ms (setup scene)
              C/C++ 16ms (renderFrame)      Thread2 Python ( AI/Code )
--Frame 1 End---------------------------------------------------------------

   If you don't use a second thread to run your AI code, and instead run it as a normal task,
the time distribution looks like this:

   Thread1  - Python 1ms (setup scene)
              C/C++ 16ms (renderFrame)
              Python 16ms ( AI/Code)
--Frame 0 End------------------------------------------ 33ms/Frame


Thus, using a second thread could effectively double your game speed, even if you
write all of your game code in Python. The situation depicted above is ideal; in my
own case, the gain is closer to 50%, i.e., without threads the program uses 25% CPU, while a
secondary thread boosts CPU utilization to 40%.
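
As a sketch of the task chain route mentioned above: the helper below takes a task manager (in a ShowBase application that would be base.taskMgr), creates a chain with one thread, and puts an AI task on it. setupTaskChain, its numThreads parameter, and the taskChain argument of add are Panda3D's; the helper name start_ai_chain, the chain name "ai", and the ai_step callback are invented for this example.

```python
def start_ai_chain(task_mgr, ai_step, chain_name="ai"):
    """Run ai_step repeatedly on its own task chain, backed by one thread.

    task_mgr is expected to be Panda3D's task manager (e.g. base.taskMgr);
    ai_step is a hypothetical zero-argument function holding your game/AI code.
    """
    # numThreads=1 asks for one dedicated thread for this chain; the default
    # chain (which drives rendering) is left untouched.
    task_mgr.setupTaskChain(chain_name, numThreads=1)

    def ai_task(task):
        ai_step()
        return task.cont  # ask the task manager to call this task again

    task_mgr.add(ai_task, "ai-task", taskChain=chain_name)
    return ai_task
```

In a running application this would be called once, e.g. start_ai_chain(base.taskMgr, my_ai_step), where my_ai_step is your own function. Whether the chain's thread is a real OS thread depends on how your Panda3D build was compiled.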

If you take this dual-threading path, there are some additional caveats to worry about.
As I hinted earlier, two threads competing for control of the Python interpreter is very
costly, and you may end up with a situation like this:
   Thread1  - Python 0.1ms (setup scene)
                                          Thread2 Python ( AI/Code )
              Python 0.1ms (setup scene)
                                          Thread2 Python ( AI/Code )
                      ....
          Each switch of control over the Python interpreter between threads may cost 0.1ms
                      ....
               C/C++ 16ms (renderFrame)

--Frame 0 End------------------------------------------ 27ms/Frame


Ideally, we would schedule Thread2's Python code to run only after Thread1's Python
code has completely finished. Python can be made to switch threads in at least two ways:

a) when it hits a C/C++ block that gives up the GIL, i.e., during rendering
b) after a set number of Python opcodes (related to the number of statements), as set by
sys.setcheckinterval(5000)

The default setting of setcheckinterval is, I believe, 100, which leads to a lot of ping-ponging
between threads trying to acquire the interpreter, and a lot of time lost doing so.
Setting setcheckinterval to a very high number, e.g., 50000, essentially tells the interpreter,
"do not try to give up control of Python until you hit a C/C++ block that gives up the GIL."

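A hedged sketch of setting this in code: sys.setcheckinterval (counted in opcodes) is the Python 2 API described above. Python 3.2 and later replaced it with the time-based sys.setswitchinterval (default 0.005 s), and setcheckinterval was removed in Python 3.9, so the helper below uses whichever one exists. The helper name and the two default values are only illustrative assumptions; tune them by measuring your own frame times.

```python
import sys


def reduce_thread_switching(py2_check_interval=5000, py3_switch_seconds=0.05):
    """Ask the interpreter to force GIL handoffs less often.

    Returns the name of the sys function that was used.
    """
    if hasattr(sys, "setswitchinterval"):
        # Python 3.2+: the interval is a duration in seconds.
        sys.setswitchinterval(py3_switch_seconds)
        return "setswitchinterval"
    # Python 2: the interval is a number of bytecode instructions.
    sys.setcheckinterval(py2_check_interval)
    return "setcheckinterval"
```
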
If you naively set this too high, however, the AI/code thread may not relinquish
control back to the rendering thread fast enough, and thus slow down the rendering.
Alternatively, one could sprinkle some dummy C/C++ blocks that give up the GIL into the AI code
to judiciously force a switch.
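
One way to get such a "dummy block" without writing any C, sketched below: time.sleep(0) is a C call that releases the GIL for a moment, so calling it every so often gives the render thread a chance to take over. The function name, the squared-sum "work", and the yield_every value are placeholders for this example; the right spacing has to be found by measuring.

```python
import time


def ai_step_with_yields(items, yield_every=1000):
    """Process items in pure Python, offering the GIL to other threads periodically."""
    total = 0
    for i, item in enumerate(items, 1):
        total += item * item      # stand-in for real AI work
        if i % yield_every == 0:
            time.sleep(0)         # C call that briefly drops the GIL
    return total
```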

In all likelihood, however, your AI code will also rely on some C/C++ for things
like pathfinding, etc., where opportunities naturally exist for the GIL to be given up.

How many threads should one use? It depends on how many pieces of code give up
the GIL. You should have as many threads as code modules that give up the GIL, plus one. In the
example above, you should only use two threads, as only the renderer gives up the GIL.
If you have a separate module which, say, does some facial recognition and image processing
in C, that would be a candidate to be put into its own thread. Physics should also be in its own
thread. (Does the current Bullet/ODE implementation release the GIL? I'll have to check the source
code!)
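
Besides reading the source, a rough empirical probe is possible (a sketch only, noisy by nature, so treat the result as a hint): run a pure-Python counter in a background thread and see whether it advances while the call under test is running. If the call holds the GIL throughout, the counter should barely move; if it releases the GIL, the counter should advance a lot. ticks_during is a made-up helper name, and time.sleep is used below only as a call that is known to release the GIL.

```python
import threading
import time


def ticks_during(call):
    """Return how far a pure-Python counter thread advanced while call() ran."""
    count = [0]
    stop = threading.Event()

    def counter():
        # Pure-Python loop: it can only make progress while it holds the GIL.
        while not stop.is_set():
            count[0] += 1

    worker = threading.Thread(target=counter)
    worker.start()
    time.sleep(0.05)          # give the counter thread time to get going
    before = count[0]
    call()                    # the call being probed
    after = count[0]
    stop.set()
    worker.join()
    return after - before


if __name__ == "__main__":
    print("ticks while sleeping:", ticks_during(lambda: time.sleep(0.2)))
```

To probe a physics step, you would pass a lambda wrapping that call and compare its tick count against a pure-Python call of similar duration.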

Well, anyway, those are all of my thoughts on this subject. In the end, Python is a great
language for debugging, prototyping, meta-programming, etc. Hopefully, this overview of threading
will be useful to people who want to squeeze more CPU cycles out of their Python code.
zhao
 
Posts: 224
Joined: Tue Nov 10, 2009 5:32 pm

Postby Nemesis#13 » Fri Jul 27, 2012 1:31 pm

Very interesting article. I hope you add some practical approaches, i.e. code snippets, or possibly even some stats.

Keep up the good work!
Nemesis#13
 
Posts: 1041
Joined: Mon Aug 04, 2008 8:09 pm
Location: Germany

Postby enn0x » Fri Aug 03, 2012 2:59 am

"Does the current bullet/ode implementation release the GIL? I'll have to check the source
code!"

A little hint: you would have to check the generated code. All of the Python/C++ interaction code is generated during the build process by interrogate and interrogate_module.
enn0x
 
Posts: 1267
Joined: Wed Nov 08, 2006 1:39 am
Location: Germany, Munich

