Like a good artistic visualization, you never tire of looking at how the Kinect works behind the scenes. Pretty colors and dancing skeletons are captivating, of course, but the hard science behind how the Kinect on Xbox 360 recognizes human poses in real time from a single depth image is equally fascinating.
To quench that thirst, Microsoft Research recently released an 8-page paper titled “Real-Time Human Pose Recognition in Parts from Single Depth Images”, to be presented at the IEEE Conference on Computer Vision and Pattern Recognition in June. The paper reveals a lot of the interesting science and data behind the algorithms of Kinect.
Whilst acknowledging previous work in the field, one area the research team focused on improving was per-frame initialization: the system works without a lengthy “set-up” phase for the user.
The success of this work is of course a key selling point of Kinect – anyone can hop in to play at any time.
As part of its development, the team collected a database of around 500,000 frames of motion-capture data of people simulating poses from entertainment scenarios such as driving, dancing, kicking, running and navigating menus.
From that, they distilled the dataset down to roughly 100,000 distinct poses, from which the system was trained to estimate body parts. As an indication of just how computationally intensive the development process really was, “training 3 (decision) trees to depth 20 from 1 million images takes about a day on a 1000 core cluster”.
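To get a feel for what those decision trees actually compute, here is a minimal sketch of the paper’s core idea: each tree node compares the depth at two offsets around a pixel (offsets scaled by the pixel’s own depth, so the feature is distance-invariant), and the leaves hold a distribution over body-part labels. The tree below is hand-built and its offsets, threshold and labels are entirely made up for illustration – the real trees are learned from the training images described above.

```python
import numpy as np

def depth_feature(depth, px, py, u, v):
    """Depth-comparison feature: sample the depth image at two offsets
    around (px, py), normalized by the depth at (px, py) so that a person
    farther from the camera (spanning fewer pixels) gives the same
    response. Out-of-bounds probes return a large 'background' constant."""
    d = depth[py, px]
    def probe(offset):
        ox, oy = int(px + offset[0] / d), int(py + offset[1] / d)
        h, w = depth.shape
        if 0 <= oy < h and 0 <= ox < w:
            return depth[oy, ox]
        return 1e6  # treat off-image probes as far-away background
    return probe(u) - probe(v)

# A single hand-built (hypothetical) tree: each internal node stores a
# feature (a pair of offsets) and a threshold; each leaf stores a
# distribution over body-part labels.
tree = {
    "feature": ((5.0, 0.0), (-5.0, 0.0)),
    "threshold": 0.1,
    "left": {"leaf": {"torso": 0.8, "arm": 0.2}},
    "right": {"leaf": {"head": 0.7, "arm": 0.3}},
}

def classify(depth, px, py, node):
    """Walk the tree for one pixel and return its body-part distribution."""
    while "leaf" not in node:
        u, v = node["feature"]
        response = depth_feature(depth, px, py, u, v)
        node = node["left"] if response < node["threshold"] else node["right"]
    return node["leaf"]

# Toy depth map: a flat background plane at 2 m with a nearer square
# "person" at 1 m.
depth = np.full((20, 20), 2.0)
depth[5:15, 5:15] = 1.0
print(classify(depth, 10, 10, tree))  # pixel on the right edge of the blob
```

In the real system, hundreds of thousands of candidate features and thresholds are scored during training, three such trees vote per pixel, and the per-pixel label distributions are then pooled into joint proposals – but the per-pixel inference at runtime is just this kind of cheap tree walk, which is what makes 200 frames per second feasible.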
For one, I’m glad they spent days with 1000-core clusters so that Kinect can recognize me kicking and jumping at 200 frames per second.