CHAPTER 1

Getting Started

In this chapter, we explain what makes Kinect special and how Microsoft got to the point of providing a Kinect for Windows SDK—something that Microsoft apparently did not envision when it released what was thought of as a new kind of “controller-free” controller for the Xbox. We take you through the steps involved in installing the Kinect for Windows SDK, plugging in your Kinect sensor, and verifying that everything is working the way it should in order to start programming for Kinect. We then navigate through the samples provided with the SDK and describe their significance in demonstrating how to program for the Kinect.

The Kinect Creation Story

The history of Kinect begins long before the device itself was conceived. Kinect has roots in decades of thinking and dreaming about user interfaces based upon gesture and voice. The hit 2002 movie The Minority Report added fuel to the fire with its futuristic depiction of a spatial user interface. Rivalry between competing gaming consoles brought the Kinect technology into our living rooms. It was the hacker ethic of unlocking anything intended to be sealed, however, that eventually opened up the Kinect to developers.


Bill Buxton has been talking over the past few years about something he calls the Long Nose of Innovation. A play on Chris Anderson's notion of the Long Tail, the Long Nose describes the decades of incubation time required to produce a “revolutionary” new technology apparently out of nowhere. The classic example is the invention and refinement of a device central to the GUI revolution: the mouse.

The first mouse prototype was built by Douglas Engelbart and Bill English, then at the Stanford Research Institute, in 1963. They even gave the device its murine name. Bill English developed the concept further when he took it to Xerox PARC in 1973. With Jack Hawley, he added the famous mouse ball to the design of the mouse. During this same time period, Telefunken in Germany was independently developing its own rollerball mouse device called the Telefunken Rollkugel. By 1982, the first commercial mouse began to find its way to the market. Logitech began selling one for $299. It was somewhere in this period that Steve Jobs visited Xerox PARC and saw the mouse working with a WIMP interface (windows, icons, menus, pointers). Some time after that, Jobs invited Bill Gates to see the mouse-based GUI he was working on. Apple released the Lisa in 1983 with a mouse, and then equipped the Macintosh with the mouse in 1984. Microsoft announced its Windows OS shortly after the release of the Lisa and began selling Windows 1.0 in 1985. It was not until 1995, with the release of Microsoft's Windows 95 operating system, that the mouse became ubiquitous. The Long Nose describes the 30-year span required for devices like the mouse to go from invention to ubiquity.

A similar 30-year Long Nose can be sketched out for Kinect. Starting in the late 70s, about halfway into the mouse's development trajectory, Chris Schmandt at the MIT Architecture Machine Group started a research project called Put-That-There, based on an idea by Richard Bolt, which combined voice and gesture recognition as input vectors for a graphical interface. The Put-That-There installation lived in a sixteen-foot by eleven-foot room with a large projection screen against one wall. The user sat in a vinyl chair about eight feet in front of the screen and had a magnetic cube hidden up one wrist for spatial input as well as a head-mounted microphone. With these inputs, and some rudimentary speech parsing logic around pronouns like “that” and “there,” the user could create and move basic shapes around the screen. Bolt suggests in his 1980 paper describing the project, “Put-That-There: Voice and Gesture at the Graphics Interface,” that eventually the head-mounted microphone should be replaced with a directional mic. Subsequent versions of Put-That-There allowed users to guide ships through the Caribbean and place colonial buildings on a map of Boston.

Another MIT Media Lab research project from 1993 by David Koons, Kristinn Thórisson, and Carlton Sparrell (again directed by Bolt) called The Iconic System refined the Put-That-There concept to work with speech and gesture as well as a third input modality: eye-tracking. Also, instead of projecting input onto a two-dimensional space, the graphical interface was a computer-generated three-dimensional space. In place of the magnetic cubes used for Put-That-There, the Iconic System included special gloves to facilitate gesture tracking.

Towards the late 90s, Mark Lucente developed an advanced user interface for IBM Research called DreamSpace, which ran on a variety of platforms including Windows NT. It even implemented the Put-That-There syntax of Chris Schmandt's 1979 project. Unlike any of its predecessors, however, DreamSpace did not use wands or gloves for gesture recognition. Instead, it used a vision system. Moreover, Lucente envisioned DreamSpace not only for specialized scenarios but also as a viable alternative to standard mouse and keyboard inputs for everyday computing. Lucente helped to popularize speech and gesture recognition by demonstrating DreamSpace at tradeshows between 1997 and 1999.

In 1999 John Underkoffler, also with the MIT Media Lab and a coauthor with Mark Lucente on a paper a few years earlier on holography, was invited to work on a new Steven Spielberg project called The Minority Report. Underkoffler eventually became the Science and Technology Advisor on the film and, with Alex McDowell, the film's Production Designer, put together the user interface Tom Cruise uses in the movie. Some of the design concepts from The Minority Report UI eventually ended up in another project Underkoffler worked on called G-Speak.

Perhaps Underkoffler's most fascinating design contribution to the film was a suggestion he made to Spielberg to have Cruise accidentally put his virtual desktop into disarray when he turns and reaches out to shake Colin Farrell's hand. It is a scene that captures the jarring acknowledgment that even “smart” computer interfaces are ultimately still reliant on conventions and that these conventions are easily undermined by the uncanny facticity of real life.

The Minority Report was released in 2002. The film visuals immediately seeped into the collective unconscious, hanging in the zeitgeist like a promissory note. A mild discontent over the prevalence of the mouse in our daily lives began to be felt, and the press as well as popular attention began to turn toward what we came to call the Natural User Interface (NUI). Microsoft began working on its innovative multitouch platform Surface in 2003, began showing it in 2007, and eventually released it in 2008. Apple unveiled the iPhone in 2007. The iPad began selling in 2010. As each NUI technology came to market, it was accompanied by comparisons to The Minority Report.

The Minority Report

So much ink has been spilled about the obvious influence of The Minority Report on the development of Kinect that at one point I insisted to my co-author that we should try to avoid ever using the words “minority” and “report” together on the same page. In this endeavor I have failed miserably and concede that avoiding mention of The Minority Report when discussing Kinect is virtually impossible.

One of the more peculiar responses to the movie was the movie critic Roger Ebert's opinion that it offered an “optimistic preview” of the future. The Minority Report, based loosely on a short story by Philip K. Dick, depicts a future in which police surveillance is pervasive to the point of predicting crimes before they happen and incarcerating those who have not yet committed the crimes. It includes massively pervasive marketing in which retinal scans are used in public places to target advertisements to pedestrians based on demographic data collected on them and stored in the cloud. Genetic experimentation results in monstrously carnivorous plants, robot spiders that roam the streets, a thriving black market in body parts that allows people to change their identities and—perhaps the most jarring future prediction of all—policemen wearing rocket packs.

Perhaps what Ebert responded to was the notion that the world of The Minority Report was a believable future, extrapolated from our world, demonstrating that through technology our world can actually change and not merely be more of the same. Even if it introduces new problems, science fiction reinforces the idea that technology can help us leave our current problems behind. In the 1958 book, The Human Condition, the author and philosopher Hannah Arendt characterizes the role of science fiction in society by saying, “…science has realized and affirmed what men anticipated in dreams that were neither wild nor idle … buried in the highly non-respectable literature of science fiction (to which, unfortunately, nobody yet has paid the attention it deserves as a vehicle of mass sentiments and mass desires).” While we may not all be craving rocket packs, we do all at least have the aspiration that technology will significantly change our lives.

What is peculiar about The Minority Report and, before that, science fiction series like the Star Trek franchise, is that they do not always merely predict the future but can even shape that future. When I first walked through automatic sliding doors at a local convenience store, I knew this was based on the sliding doors on the USS Enterprise. When I held my first flip phone in my hands, I knew it was based on Captain Kirk's communicator and, moreover, would never have been designed this way had Star Trek never aired on television.

If The Minority Report drove the design and adoption of the gesture recognition system on Kinect, Star Trek can be said to have driven the speech recognition capabilities of Kinect. In interviews with Microsoft employees and executives, there are repeated references to the desire to make Kinect work like the Star Trek computer or the Star Trek holodeck. There is a sense in those interviews that if the speech recognition portion of the device was not solved (and occasionally there were discussions about dropping the feature as it fell behind schedule), the Kinect sensor would not have been the future device everyone wanted.

Microsoft's Secret Project

In the gaming world, Nintendo threw down the gauntlet at the 2005 Tokyo Game Show conference with the unveiling of the Wii console. The console was accompanied by a new gaming device called the Wii Remote. Like the magnetic cubes from the original Put-That-There project, the Wii Remote can detect movement along three axes. Additionally, the remote contains an optical sensor that detects where it is pointing. It is also battery powered, eliminating long cords to the console common to other platforms.

Following the release of the Wii in 2006, Peter Moore, then head of Microsoft's Xbox division, demanded work start on a competitive Wii killer. It was also around this time that Alex Kipman, head of an incubation team inside the Xbox division, met the founders of PrimeSense at the 2006 Electronic Entertainment Expo. Microsoft created two competing teams to come up with the intended Wii killer: one working with the PrimeSense technology and the other working with technology developed by a company called 3DV. Though the original goal was to unveil something at E3 2007, neither team seemed to have anything sufficiently polished in time for the exposition. Things were thrown a bit more off track in 2007 when Peter Moore announced that he was leaving Microsoft to go work for Electronic Arts.

It is clear that by the summer of 2007 the secret work being done inside the Xbox team was gaining momentum internally at Microsoft. At the D: All Things Digital conference that year, Bill Gates was interviewed side-by-side with Steve Jobs. During that interview, in response to a question about Microsoft Surface and whether multitouch would become mainstream, Gates began talking about vision recognition as the step beyond multitouch:

Gates: Software is doing vision. And so, imagine a game machine where you just can pick up the bat and swing it or pick up the tennis racket and swing it.

Interviewer: We have one of those. That's Wii.

Gates: No. No. That's not it. You can't pick up your tennis racket and swing it. You can't sit there with your friends and do those natural things. That's a 3-D positional device. This is video recognition. This is a camera seeing what's going on. In a meeting, when you are on a conference, you don't know who's speaking when it's audio only … the camera will be ubiquitous … software can do vision, it can do it very, very inexpensively … and that means this stuff becomes pervasive. You don't just talk about it being in a laptop device. You talk about it being a part of the meeting room or the living room …

Amazingly the interviewer, Walt Mossberg, cut Gates off during his fugue about the future of technology and turned the conversation back to what was most important in 2007: laptops! Nevertheless, Gates revealed in this interview that Microsoft was already thinking of the new technology being developed in the Xbox team as something more than merely a gaming device. It was already thought of as a device for the office as well.

Following Moore's departure, Don Mattrick took up the reins, guiding the Xbox team. In 2008, he revived the secret video recognition project around the PrimeSense technology. While 3DV's technology apparently never made it into the final Kinect, Microsoft bought the company in 2009 for $35 million. This was apparently done in order to defend against potential patent disputes around Kinect. Alex Kipman, a manager with Microsoft since 2001, was made General Manager of Incubation and put in charge of creating the new Project Natal device to include depth recognition, motion tracking, facial recognition, and speech recognition.

Note: What's in a name? Microsoft has traditionally, if not consistently, given city names to large projects as their code names. Alex Kipman dubbed the secret Xbox project Natal, after his hometown in Brazil.

The reference device created by PrimeSense included an RGB camera, an infrared sensor, and an infrared light source. Microsoft licensed PrimeSense's reference design and PS1080 chip design, which processed depth data at 30 frames per second. Importantly, it processed depth data in an innovative way that drastically cut the price of depth recognition compared to the prevailing method at the time, called “time of flight,” a technique that tracks the time it takes for a beam of light to leave and then return to the sensor. The PrimeSense solution was to project a pattern of infrared dots across the room and use the size of the dots and the spacing between them to form a 320 x 240 pixel depth map analyzed by the PS1080 chip. The chip also automatically aligned the information for the RGB camera and the infrared camera, providing RGBD data to higher systems.
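The time-of-flight technique mentioned above comes down to one relation: the light travels to the object and back, so the distance is half the round-trip time multiplied by the speed of light. The following Python sketch only illustrates that arithmetic. The function name is hypothetical, and nothing here reflects how the PS1080 chip works, since it relies on the projected dot pattern rather than on timing.

```python
SPEED_OF_LIGHT_M_PER_S = 299_792_458  # exact, by definition of the meter


def time_of_flight_distance(round_trip_seconds):
    """Return the distance in meters to an object, given a light pulse's round-trip time."""
    if round_trip_seconds < 0:
        raise ValueError("round-trip time cannot be negative")
    # The pulse covers the distance twice (out and back), hence the division by 2.
    return SPEED_OF_LIGHT_M_PER_S * round_trip_seconds / 2


# An object 3 meters away returns the pulse after about 20 nanoseconds.
round_trip = 2 * 3.0 / SPEED_OF_LIGHT_M_PER_S
print(f"{round_trip * 1e9:.1f} ns -> {time_of_flight_distance(round_trip):.2f} m")
```

The time scales involved are tiny: telling apart two surfaces one centimeter apart means resolving a round-trip difference of roughly 67 picoseconds, which gives a sense of why a cheaper alternative to precise timing hardware was attractive.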

Microsoft added a four-piece microphone array to this basic structure, effectively providing a directional microphone for speech recognition that would be effective in a large room. Microsoft already had years of experience with speech recognition, which has been available on its operating systems since Windows XP.

Kudo Tsunoda, recently hired away from Electronic Arts, was also brought onto the project, leading his own incubation team, to create prototype games for the new device. He and Kipman had a deadline of August 18, 2008, to show a group of Microsoft executives what Project Natal could do. Tsunoda's team came up with 70 prototypes, some of which were shown to the execs. The project got the green light and the real work began. They were given a launch date for Project Natal: Christmas of 2010.

Microsoft Research

While the hardware problem was mostly solved thanks to PrimeSense—all that remained was to give the device a smaller form factor—the software challenges seemed insurmountable. First, a responsive motion recognition system had to be created based on the RGB and Depth data streams coming from the device. Next, serious scrubbing had to be performed in order to make the audio feed workable with the underlying speech platform. The Project Natal team turned to Microsoft Research (MSR) to help solve these problems.

MSR is a multibillion dollar annual investment by Microsoft. The various MSR locations are typically dedicated to pure research in computer science and engineering rather than to trying to come up with new products for their parent. It must have seemed strange, then, when the Xbox team approached various branches of Microsoft Research to not only help them come up with a product but to do so according to the rhythms of a very short product cycle.

In late 2008, the Project Natal team contacted Jamie Shotton at the MSR office in Cambridge, England, to help with their motion-tracking problem. The motion tracking solution Kipman's team came up with had several problems. First, it relied on the player getting into an initial T-shaped pose to allow the motion capture software to discover him. Next, it would occasionally lose the player during motion, obligating the player to reinitialize the system by once again assuming the T position. Finally, the motion tracking software would only work with the particular body type it was designed for—that of Microsoft executives.

On the other hand, the depth data provided by the sensor already solved several major problems for motion tracking. The depth data allows easy filtering of any pixels that are not the player. Extraneous information such as the color and texture of the player's clothes is also filtered out by the depth camera data. What is left is basically a player blob represented in pixel positions, as shown in Figure 1-1. The depth camera data, additionally, provides information about the height and width of the player in meters.


Figure 1-1. The Player blob
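To make the idea of filtering out non-player pixels concrete, here is a minimal Python sketch that keeps only the depth readings falling inside a chosen distance band. This is a deliberately simple stand-in and not the segmentation the Kinect software actually performs, which the text does not describe. The function name, the frame values, and the band limits are all invented for illustration.

```python
def player_mask(depth_mm, near_mm, far_mm):
    """Return a 2-D list of booleans: True where a reading lies within [near_mm, far_mm]."""
    return [[near_mm <= d <= far_mm for d in row] for row in depth_mm]


# A made-up 4 x 6 depth frame in millimeters: a wall at about 3.5 m with a
# "player" standing about 2 m from the sensor. 0 stands for "no reading".
frame = [
    [3500, 3500, 2010, 2000, 3500, 3500],
    [3500, 1990, 2000, 2005, 1995, 3500],
    [3500, 3500, 2000, 2010, 3500,    0],
    [3500, 3500, 1980, 2020, 3500, 3500],
]

mask = player_mask(frame, near_mm=1800, far_mm=2200)
for row in mask:
    print("".join("#" if is_player else "." for is_player in row))
```

Running the sketch prints a rough silhouette of the "player" against an empty background. Color and texture never enter the computation, which is why clothing has no effect on the resulting blob.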

The challenge for Shotton was to turn this outline of a person into something that could be tracked. The problem, as he saw it, was to break up the player blob provided by the depth stream into recognizable body parts. From these body parts, joints can be identified, and from these joints, a skeleton can be reconstructed. Working with Andrew Fitzgibbon and Andrew Blake, Shotton arrived at an algorithm that could distinguish 31 body parts (see Figure 1-2). Out of these parts, the version of Kinect demonstrated at E3 in 2009 could produce 48 joints (the Kinect SDK, by contrast, exposes 20 joints).


Figure 1-2. Player parts

To get around the initial T-pose required of the player for calibration, Shotton decided to appeal to the power of computer learning. With lots and lots of data, the image recognition software could be trained to break up the player blob into usable body parts. Teams were sent out to videotape people in their homes performing basic physical motions. Additional data was collected in a Hollywood motion capture studio of people dancing, running, and performing acrobatics. All of this video was then passed through a distributed computation engine called Dryad that had been developed by another branch of Microsoft Research in Mountain View, California, in order to begin generating a decision tree classifier that could map any given pixel of Kinect's RGBD stream onto one of the 31 body parts. This was done for 12 different body types and repeatedly tweaked to improve the decision software's ability to identify a person without an initial pose, without breaks in recognition, and for different kinds of people.

This took care of The Minority Report aspect of Kinect. To handle the Star Trek portion, Alex Kipman turned to Ivan Tashev of the Microsoft Research group based in Redmond. Tashev and his team had worked on the microphone array implementation on Windows Vista. Just as being able to filter out non-player pixels is a large part of the skeletal recognition solution, filtering out background noise on a microphone array situated much closer to a stereo system than it is to the speaker was the biggest part of making speech recognition work on Kinect. Using a combination of patented technologies (provided to us for free in the Kinect for Windows SDK), Tashev's team came up with innovative noise suppression and echo cancellation tricks that improved the audio processing pipeline many times over the standard that was available at the time.

Based on this audio scrubbing, a distributed computer learning program of a thousand computers spent a week building an acoustical model for Kinect based on various American regional accents and the peculiar acoustic properties of the Kinect microphone array. This model became the basis of the TellMe feature included with the Xbox as well as the Kinect for Windows Runtime Language Pack used with the Kinect for Windows SDK. Cutting things very close, the acoustical model was not completed until September 26, 2010. Shortly after, on November 4, the Kinect sensor was released.

The Race to Hack Kinect

The release of the Kinect sensor was met with mixed reviews. Gaming sites generally acknowledged that the technology was cool but felt that players would quickly grow tired of the gameplay. This did not slow down Kinect sales, however. The device sold an average of 133 thousand units a day for the first 60 days after the launch (roughly 8 million units in all), breaking the sales records of both the iPhone and the iPad and setting a new Guinness world record. It wasn't that the gaming review sites were wrong about the novelty factor of Kinect; it was just that people wanted Kinect anyway, whether they played with it every day or only for a few hours. It was a piece of the future they could have in their living rooms.

The excitement in the consumer market was matched by the excitement in the computer hacking community. The hacking story starts with Johnny Chung Lee, the man who originally hacked a Wii Remote to implement finger tracking and was later hired onto the Project Natal team to work on gesture recognition. Frustrated by the failure of internal efforts at Microsoft to publish a public driver, Lee approached AdaFruit, a vendor of open-source electronic kits, to host a contest to hack Kinect. The contest, announced on the day of the Kinect launch, was built around an interesting hardware feature of the Kinect sensor: it uses a standard USB connector to talk to the Xbox. This same USB connector can be plugged into the USB port of any PC or laptop. The first person to successfully create a driver for the device and write an application converting the data streams from the sensor into video and depth displays would win the $1,000 bounty that Lee had put up for the contest.

On the same day, Microsoft made the following statement in response to the AdaFruit contest: “Microsoft does not condone the modification of its products … With Kinect, Microsoft built in numerous hardware and software safeguards designed to reduce the chances of product tampering. Microsoft will continue to make advances in these types of safeguards and work closely with law enforcement and product safety groups to keep Kinect tamper-resistant.” Lee and AdaFruit responded by raising the bounty to $2,000.

By November 6, Joshua Blake, Seth Sandler, Kyle Machulis, and others had created the OpenKinect mailing list to help coordinate efforts around the contest. Their notion was that the driver problem was solvable but that the longevity of the Kinect hacking effort for the PC would involve sharing information and building tools around the technology. They were already looking beyond the AdaFruit contest and imagining what would come after. In a November 7 post to the list, they even proposed sharing the bounty with the OpenKinect community, if someone on the list won the contest, in order to look past the money and toward what could be done with the Kinect technology. Their mailing list would go on to be the home of the Kinect hacking community for the next year.

Simultaneously on November 6, a hacker known as AlexP was able to control Kinect's motors and read its accelerometer data. The AdaFruit bounty was raised to $3,000. On Monday, November 8, AlexP posted a video showing that he could pull both RGB and depth data streams from the Kinect sensor and display them. He could not collect the prize, however, because of concerns about open sourcing his code. On the 8th, Microsoft also clarified its previous position in a way that appeared to allow the ongoing efforts to hack Kinect as long as it wasn't called “hacking”:

Kinect for Xbox 360 has not been hacked—in any way—as the software and hardware that are part of Kinect for Xbox 360 have not been modified. What has happened is someone has created drivers that allow other devices to interface with the Kinect for Xbox 360. The creation of these drivers, and the use of Kinect for Xbox 360 with other devices, is unsupported. We strongly encourage customers to use Kinect for Xbox 360 with their Xbox 360 to get the best experience possible.

On November 9, AdaFruit finally received a USB analyzer, the Beagle 480, in the mail and set to work publishing USB data dumps coming from the Kinect sensor. The OpenKinect community, calling themselves “Team Tiger,” began working on this data over an IRC channel and had made significant progress by Wednesday morning before going to sleep. At the same time, however, Hector Martin, a computer science major in Bilbao, Spain, had just purchased a Kinect and had begun going through the AdaFruit data. Within a few hours he had written the driver and application to display RGB and depth video. The AdaFruit prize had been claimed in only seven days.

Martin became a contributor to the OpenKinect group and a new library, libfreenect, became the basis of the community's hacking efforts. Joshua Blake announced Martin's contribution to the OpenKinect mailing list in the following post:

I got ahold of Hector on IRC just after he posted the video and talked to him about this group. He said he'd be happy to join us (and in fact has already subscribed). After he sleeps to recover, we'll talk some more about integrating his work and our work.

This is when the real fun started. Throughout November, people started to post videos on the Internet showing what they could do with Kinect. Kinect-based artistic displays, augmented reality experiences, and robotics experiments started showing up on YouTube. Sites like KinectHacks.net sprang up to track all the things people were building with Kinect. By November 20, someone had posted a video of a light saber simulator using Kinect—another movie aspiration checked off. Microsoft, meanwhile, was not idle. The company watched with excitement as hundreds of Kinect hacks made their way to the web.

On December 10, PrimeSense announced the release of its own open source drivers for Kinect along with libraries for working with the data. This provided improvements to the skeleton tracking algorithms over what was then possible with libfreenect, and projects that required integration of RGB and depth data began migrating over to the OpenNI technology stack that PrimeSense had made available. Without the key Microsoft Research technologies, however, skeleton tracking with OpenNI still required the awkward T-pose to initialize skeleton recognition.

On June 17, 2011, Microsoft finally released the Kinect SDK beta to the public under a non-commercial license after demonstrating it for several weeks at events like MIX. As promised, it included the skeleton recognition algorithms that make an initial pose unnecessary as well as the AEC technology and acoustic models required to make the Kinect speech recognition system work in a large room. Every developer now had access to the same tools Microsoft used internally for developing Kinect applications for the computer.

The Kinect for Windows SDK

The Kinect for Windows SDK is the set of libraries that allows us to program applications on a variety of Microsoft development platforms using the Kinect sensor as input. With it, we can program WPF applications, WinForms applications, XNA applications and, with a little work, even browser-based applications running on the Windows operating system—though, oddly enough, we cannot create Xbox games with the Kinect for Windows SDK. Developers can use the SDK with the Xbox Kinect Sensor. In order to use Kinect's near mode capabilities, however, we require the official Kinect for Windows hardware. Additionally, the Kinect for Windows sensor is required for commercial deployments.

Understanding the Hardware

The Kinect for Windows SDK takes advantage of and is dependent upon the specialized components included in all planned versions of the Kinect device. In order to understand the capabilities of the SDK, it is important to first understand the hardware it talks to. The glossy black case for the Kinect components includes a head as well as a base, as shown in Figure 1-3. The head is 12 inches by 2.5 inches by 1.5 inches. The attachment between the base and the head is motorized. The case hides an infrared projector, two cameras, four microphones, and a fan.


Figure 1-3. The Kinect case

I do not recommend ever removing the Kinect case. In order to show the internal components, however, I have removed the case, as shown in Figure 1-4. On the front of Kinect, from left to right respectively when facing Kinect, you will find the sensors and light source that are used to capture RGB and depth data. To the far left is the infrared light source. Next to this is the LED ready indicator. Next is the color camera used to collect RGB data, and finally, on the right (toward the center of the Kinect head), is the infrared camera used to capture depth data. The color camera supports a maximum resolution of 1280 x 960 while the depth camera supports a maximum resolution of 640 x 480.


Figure 1-4. The Kinect components

On the underside of Kinect is the microphone array. The microphone array is composed of four different microphones. One is located to the left of the infrared light source. The other three are evenly spaced to the right of the depth camera.

If you bought a Kinect sensor without an Xbox bundle, the Kinect comes with a Y-cable, which extends the USB connector wire on Kinect as well as providing additional power to Kinect. The USB extender is required because the male connector that comes off of Kinect is not a standard USB connector. The additional power is required to run the motors on the Kinect.

If you buy a new Xbox bundled with Kinect, you will likely not have a Y-cable included with your purchase. This is because the newer Xbox consoles have a proprietary female USB connector that works with Kinect as is and does not require additional power for the Kinect servos. This is a problem—and a source of enormous confusion—if you intend to use Kinect for PC development with the Kinect SDK. You will need to purchase the Y-cable separately if you did not get it with your Kinect. It is typically marketed as a Kinect AC Adapter or Kinect Power Source. Software built using the Kinect SDK will not work without it.

A final piece of interesting Kinect hardware, sold by Nyko rather than by Microsoft, is called the Kinect Zoom. The base Kinect hardware performs depth recognition between 0.8 and 4 meters. The Kinect Zoom is a set of lenses that fit over Kinect, allowing the Kinect sensor to be used in rooms smaller than the standard dimensions Microsoft recommends. It is particularly appealing for users of the Kinect SDK who might want to use it for specialized functionality such as custom finger tracking logic or productivity tool implementations involving a person sitting down in front of Kinect. From experimentation, it actually turns out to not be very good for playing games, perhaps due to the quality of the lenses.

Kinect for Windows SDK Hardware and Software Requirements

Unlike other Kinect libraries, the Kinect for Windows SDK, as its name suggests, only runs on Windows operating systems. Specifically, it runs on x86 and x64 versions of Windows 7. It has been shown to also work on early versions of Windows 8. Because Kinect was designed for Xbox hardware, it requires roughly similar hardware on a PC to run effectively.

Hardware Requirements

  • Computer with a dual-core, 2.66-GHz or faster processor
  • Windows 7–compatible graphics card that supports Microsoft DirectX 9.0c capabilities
  • 2 GB of RAM (4 GB of RAM recommended)
  • Kinect for Xbox 360 sensor
  • Kinect USB power adapter

Software Requirements

Use the free Visual Studio 2010 Express or other VS 2010 editions to program against the Kinect for Windows SDK. You will also need to have the DirectX 9.0c runtime installed. Later versions of DirectX are not backwards compatible. You will also, of course, want to download and install the latest version of the Kinect for Windows SDK. The Kinect SDK installer will install the Kinect drivers, the Microsoft.Kinect assembly, and code samples.

To take full advantage of the audio capabilities of Kinect, you will also need additional Microsoft speech recognition software: the Speech Platform API, the Speech Platform SDK, and the Kinect for Windows Runtime Language Pack. Fortunately, the install for the SDK automatically installs these additional components for you. Should you ever accidentally uninstall these speech components, however, it is important to be aware that the other Kinect features, such as depth processing and skeleton tracking, are fully functional even without the speech components.

Step-By-Step Installation

Before installing the Kinect for Windows SDK:

  1. Verify that your Kinect device is not plugged into the computer you are installing to.
  2. Verify that Visual Studio is closed during the installation process.

If you have other Kinect drivers on your computer such as those provided by PrimeSense, you should consider removing these. They will not run side-by-side with the SDK and the Kinect drivers provided by Microsoft will not interoperate with other Kinect libraries such as OpenNI or libfreenect. It is possible to install and uninstall the SDK on top of other Kinect platforms and switch back and forth by repeatedly uninstalling and reinstalling the SDK. However, this has also been known to cause inconsistencies, as the wrong driver can occasionally be loaded when performing this procedure. If you plan to go back and forth between different Kinect stacks, installing on separate machines is the safest path.

To uninstall other drivers, including previous versions of those provided with the SDK, go to Programs and Features in the Control Panel, select the name of the driver you wish to remove, and click Uninstall.

Download the appropriate installation msi (x86 or x64) for your computer. If you are uncertain whether your version of Windows is 32-bit or 64-bit, you can right-click the Computer icon on your desktop and go to Properties in order to find out. You can also access your system information by going to the Control Panel and selecting System. Your operating system architecture will be listed next to the title System type. If your OS is 64-bit, you should install the x64 version. Otherwise, install the x86 version of the msi.
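The same check can also be made in code. The following is only a minimal sketch, assuming .NET 4 or later (where Environment.Is64BitOperatingSystem is available). The InstallerPicker class and its method names are hypothetical and are not part of the Kinect SDK.

```csharp
using System;

// Hypothetical helper, not part of the Kinect SDK.
public static class InstallerPicker
{
    // Maps the OS architecture to the installer flavor named in the text.
    public static string PickInstaller(bool is64BitOs)
    {
        return is64BitOs ? "x64" : "x86";
    }

    // Environment.Is64BitOperatingSystem requires .NET 4 or later.
    public static string PickInstallerForThisMachine()
    {
        return PickInstaller(Environment.Is64BitOperatingSystem);
    }
}
```

Calling InstallerPicker.PickInstallerForThisMachine() from a console application returns the installer flavor that matches the operating system it runs on.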

Run the installer once it is successfully downloaded to your machine. Follow the Setup wizard prompts until installation of the SDK is complete. Make sure that Kinect's extra power supply is also plugged into a power source. You can now plug your Kinect device into a USB port on your computer. On first connecting the Kinect to your PC, Windows will recognize the device and begin loading the Kinect drivers. You may see a message on your Windows taskbar indicating that this is occurring. When the drivers have finished loading, the LED light on your Kinect will turn a solid green.

You may want to verify that the drivers installed successfully. This is typically a troubleshooting procedure in case you encounter any problems as you run the SDK samples or begin working through the code in this book. In order to verify that the drivers are installed correctly, open the Control Panel and select Device Manager. As Figure 1-5 shows, the Microsoft Kinect node in Device Manager should list three items if the drivers were correctly installed: the Microsoft Kinect Audio Array Control, Microsoft Kinect Camera, and Microsoft Kinect Security Control.


Figure 1-5. Kinect drivers

You will also want to verify that Kinect's microphone array was correctly recognized during installation. To do so, go to the Control Panel and then the Device Manager again. As Figure 1-6 shows, the listing for Kinect USB Audio should be present under the Sound, video and game controllers node.


Figure 1-6. Microphone array

If you find that any of the four devices mentioned above do not appear in Device Manager, you should uninstall the SDK and attempt to install it again. The most common problems seem to occur around having the Kinect device accidentally plugged into the PC during install or forgetting to plug in the Kinect adapter when connecting the Kinect to the PC for the first time. You may also find that other USB devices, such as a webcam, stop working once Kinect starts working. This occurs because Kinect may conflict with other USB devices connected to the same host controller. You can work around this by trying other USB ports. A PC or laptop typically has one host controller for the ports on the front or side of the computer and another host controller at the back. Also use different USB host controllers if you attempt to daisy chain multiple Kinect devices for the same application.

To work with speech recognition, install the Microsoft Speech Platform Server Runtime (x86), the Speech Platform SDK (x86), and the Kinect for Windows Language Pack. These installs should occur in the order listed. While the first two components are not specific to Kinect and can be used for general speech recognition development, the Kinect language pack contains the acoustic models specific to the Kinect. For Kinect development, the Kinect language pack cannot be replaced with another language pack and the Kinect language pack will not be useful to you when developing speech recognition applications without Kinect.

Elements of a Kinect Visual Studio Project

If you are already familiar with the development experience using Visual Studio, then the basic steps for implementing a Kinect application should seem fairly straightforward. You simply have to:

  1. Create a new project.
  2. Reference the Microsoft.Kinect.dll.
  3. Declare the appropriate Kinect namespace.

The main hurdle in programming for Kinect is getting used to the idea that windows, the main UI container of .NET programs, are not used for input as they are in typical applications. Instead, windows are used to display information only while all input is derived from the Kinect sensor. A second hurdle is getting used to the notion that input from Kinect is continuous and constantly changing. A Kinect program does not wait for a discrete event such as a button press. Instead, it repeatedly processes information from the RGB, depth, and skeleton streams and rearranges the UI container appropriately.

The Kinect SDK supports three kinds of managed applications (applications that use C# or Visual Basic rather than C++): Console applications, WPF applications, and Windows Forms applications. Console applications are actually the easiest to get started with, as they do not create the expectation that we must interact with UI elements like buttons, dropdowns, or checkboxes.

To create a new Kinect application, open Visual Studio and select File > New > Project. A dialog window will appear offering you a choice of project templates. Under Visual C# > Windows, select Console Application and either accept the default name for the project or create your own project name.

You will now want to add a reference to the Kinect assembly you installed in the steps above. In the Visual Studio Solution Explorer pane, right-click on the References folder, as shown in Figure 1-7. Select Add Reference. A new dialog window will appear listing various assemblies you can add to your project. Find the Microsoft.Kinect assembly and add it to your project.


Figure 1-7. Add a reference to the Kinect library

At the top of the Program.cs file for your application, add the namespace declaration for the Microsoft.Kinect namespace. This namespace encapsulates all of the Kinect functionality for both NUI and audio.

using Microsoft.Kinect;

Three additional steps are standard for Kinect applications that take advantage of the data from the cameras. The KinectSensor object must be instantiated, initialized, and then started. To build an extremely trivial application to display the bitstream flowing from the depth camera, we will instantiate a new KinectSensor object according to the example in Listing 1-1. In this case, we assume there is only one camera in the KinectSensors array. We initialize the sensor by enabling the data streams we wish to use. Enabling data streams we do not intend to use would cause unnecessary performance overhead. Next we add an event handler for the DepthFrameReady event, and then create a loop that waits until the space bar is pressed before ending the application. As a final step, just before the application exits, we follow good practice and disable the depth stream reader.

Listing 1-1. Instantiate and Initialize the Runtime

static void Main(string[] args)
{
    // instantiate the sensor instance
    KinectSensor sensor = KinectSensor.KinectSensors[0];

    // initialize the cameras
    sensor.DepthStream.Enable();
    sensor.DepthFrameReady += sensor_DepthFrameReady;

    // make it look like The Matrix
    Console.ForegroundColor = ConsoleColor.Green;

    // start the data streaming
    sensor.Start();
    while (Console.ReadKey().Key != ConsoleKey.Spacebar) { }

    // stop the sensor, which ends the depth stream, before exiting
    sensor.Stop();
}

The heart of any Kinect app is not the code above, which is primarily boilerplate, but rather what we choose to do with the data passed by the DepthFrameReady event. All of the cool Kinect applications you have seen on the Internet use the data from the DepthFrameReady, ColorFrameReady, and SkeletonFrameReady events to accomplish the remarkable effects that have brought you to this book. In Listing 1-2, we will finish off the application by simply writing the image bits from the depth camera to the console window to see something similar to what the early Kinect hackers saw and got excited about back in November of 2010.

Listing 1-2. First Peek At the Kinect Depth Stream Data

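static void sensor_DepthFrameReady(object sender, DepthImageFrameReadyEventArgs e)
{
    using (var depthFrame = e.OpenDepthImageFrame())
    {
        if (depthFrame == null) return;   // no frame available this time
        short[] bits = new short[depthFrame.PixelDataLength];
        depthFrame.CopyPixelDataTo(bits); // copy the raw depth values
        foreach (var bit in bits) Console.Write(bit);
    }
}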

As you wave your arms in front of the Kinect sensor, you will experience the first oddity of developing with Kinect. You will repeatedly have to push your chair away from the Kinect sensor as you test your applications. If you do this in an open space with co-workers, you will receive strange looks. I highly recommend programming for Kinect in a private, secluded space to avoid these strange looks. In my experience, people generally view a software developer wildly swinging his arms with concern and, more often, suspicion.

The Kinect SDK Sample Applications

The Kinect for Windows SDK installs several reference applications and samples. These applications provide a starting point for working with the SDK. They are written in a combination of C# and C++ and serve the sometimes contrary objectives of showing in a clear way how to use the Kinect SDK and presenting best practices for programming with the SDK. While this book does not delve into the details of programming in C++, it is still useful to examine these examples if only to remind ourselves that the Kinect SDK is based on a C++ library that was originally written for game developers working in C++. The C# classes are often merely wrappers for these underlying libraries and, at times, expose leaky abstractions that make sense only when we consider their C++ underpinnings.

A word should be said about the difference between sample applications and reference applications. The code for this book is sample code. It demonstrates in the easiest way possible how to perform given tasks related to the data received from the Kinect sensor. It should rarely be used as is in your own applications. The code in reference applications, on the other hand, has the additional burden of showing the best way to organize code to make it robust and to embody good architectural principles. One of the greatest myths in the software industry is perhaps the implicit belief that good architecture is also readable and, consequently, easily maintainable. This is often not the case. Good architecture can often be an end in itself. Most of the code provided with the Kinect SDK embodies good architecture and should be studied with this in mind. The code provided with this book, on the other hand, is typically written to illustrate concepts in the most straightforward way possible. You should study both code samples as well as reference code to become an effective Kinect developer. In the following sections, we will introduce you to some of these samples and highlight parts of the code worth familiarizing yourself with.

Kinect Explorer

Kinect Explorer is a WPF project written in C#. It demonstrates the basic programming model for retrieving the color, depth, and skeleton streams and displaying them in a window—more or less the original criteria set for the AdaFruit Kinect hacking contest. Figure 1-8 shows the UI for the reference application. The video and depth streams are each used to populate and update a different image control in real time while the skeleton stream is used to create a skeletal overlay on these images. Besides the depth stream, video stream, and skeleton, the application also provides a running update of the frames per second processed by the depth stream. While the goal is 30 fps, this will tend to vary depending on the specifications of your computer.


Figure 1-8. Kinect Explorer reference application

The sample exposes some key concepts for working with the different data streams. The DepthFrameReady event handler, for instance, takes each image provided sequentially by the depth stream and parses it in order to distinguish player pixels from background pixels. Each image is broken down into a byte array. Each byte is then inspected to determine if it is associated with a player image or not. If it does belong to a player, the pixel is replaced with a flat color. If not, it is gray scaled. The bytes are then recast to a bitmap object and set as the source for an image control in the UI. Then the process begins again for the next image in the depth stream. One would expect that individually inspecting every byte in this stream would take a remarkably long time but, as the fps indicator shows, in fact it does not. This is actually the prevailing technique for manipulating both the color and depth streams. We will go into greater detail concerning the depth and color streams in Chapter 2 and Chapter 3 of this book.
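To make the per-pixel technique concrete, here is a minimal sketch of the player-versus-background pass. It is not the Kinect Explorer code. It assumes that each 16-bit depth value carries a player index in its low 3 bits and the distance in millimeters in the remaining bits, that the player index is only populated when skeleton tracking is enabled, and that the output is a 4-bytes-per-pixel BGR array of the kind WPF can display. The DepthColorizer class, its constants, and the choice of green as the flat color are hypothetical.

```csharp
using System;

// Hypothetical helper; it does not reference the Kinect assemblies.
public static class DepthColorizer
{
    // Assumption: the low 3 bits of each depth value carry the player index.
    private const int PlayerIndexBits = 3;
    private const int PlayerIndexMask = (1 << PlayerIndexBits) - 1;
    // Assumption: anything beyond about 4 meters is treated as the far limit.
    private const int MaxDepthMillimeters = 4000;

    // Converts raw depth values into a BGR32 byte array (4 bytes per pixel).
    public static byte[] ToBgr32(short[] rawDepth)
    {
        var pixels = new byte[rawDepth.Length * 4];
        for (int i = 0; i < rawDepth.Length; i++)
        {
            int raw = (ushort)rawDepth[i];       // avoid sign extension
            int player = raw & PlayerIndexMask;  // 0 means "no player"
            int depth = raw >> PlayerIndexBits;  // distance in millimeters
            int offset = i * 4;
            if (player != 0)
            {
                // Player pixel: replace it with a flat color (green here).
                pixels[offset] = 0;       // blue
                pixels[offset + 1] = 255; // green
                pixels[offset + 2] = 0;   // red
            }
            else
            {
                // Background pixel: gray scale, nearer is brighter.
                int clamped = Math.Min(depth, MaxDepthMillimeters);
                byte gray = (byte)(255 - (clamped * 255 / MaxDepthMillimeters));
                pixels[offset] = gray;
                pixels[offset + 1] = gray;
                pixels[offset + 2] = gray;
            }
            // The fourth byte of each pixel stays 0; BGR32 ignores it.
        }
        return pixels;
    }
}
```

The loop touches every pixel exactly once and does only integer arithmetic, which is why a pass like this can keep up with the depth stream.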

Kinect Explorer is particularly interesting because it demonstrates how to break up the different capabilities of the Kinect sensor into reusable components. Instead of a central controlling process, each of the distinct viewer controls for color, depth, skeleton, and audio independently controls its own access to its respective data stream. This distributed structure allows the various Kinect capabilities to be added independently and ad hoc to any application.

Beyond this interesting modular design, there are three specific pieces of functionality in Kinect Explorer that should be included in any Kinect application. The first is the way Kinect Explorer implements sensor discovery. As Listing 1-3 shows, the technique implemented in the reference application waits for Kinect sensors to be connected to a USB port on the computer. It defers any initialization of the streams until a Kinect has been connected, and it is able to support multiple Kinects. This code effectively acts as a gatekeeper that prevents any problems that might occur when there is a disruption in the data streams caused by tripping over a wire or even simply forgetting to plug in the Kinect sensor.

Listing 1-3. Kinect Sensor Discovery

private void KinectStart()
{
    //listen to any status change for Kinects.
    KinectSensor.KinectSensors.StatusChanged += Kinects_StatusChanged;

    //show status for each sensor that is found now.
    foreach (KinectSensor kinect in KinectSensor.KinectSensors)
    {
        ShowStatus(kinect, kinect.Status);
    }
}

A second noteworthy feature of Kinect Explorer is the way it manages the motor that controls the sensor's angle of elevation. In early efforts to program with Kinect prior to the arrival of the SDK, it was uncommon to use software to raise and lower the angle of the Kinect head. In order to place Kinect cameras correctly while programming, developers would manually lift and lower the angle of the Kinect head. This typically produced a loud and slightly frightening click but was considered a necessary evil as developers experimented with Kinect. Unfortunately, Kinect's internal motors were not built to handle this kind of stress. The rather sophisticated code provided with Kinect Explorer demonstrates how to perform this necessary task in a more genteel manner.
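One way to treat the motor gently is to clamp every requested angle to the supported range and to refuse requests that arrive too soon after the previous one. The sketch below illustrates that idea only. It is not the code from Kinect Explorer, the ElevationThrottle class is hypothetical, and the limits passed to it are assumptions that real code should read from the sensor rather than hard-code.

```csharp
using System;

// Hypothetical helper that decides whether a tilt request should be sent.
public class ElevationThrottle
{
    private readonly int minAngle;
    private readonly int maxAngle;
    private readonly TimeSpan minInterval;
    private DateTime? lastChange;   // null until the first accepted request

    public ElevationThrottle(int minAngle, int maxAngle, TimeSpan minInterval)
    {
        this.minAngle = minAngle;
        this.maxAngle = maxAngle;
        this.minInterval = minInterval;
    }

    // Returns true and the clamped angle when enough time has passed since
    // the last accepted request; otherwise returns false so the caller can
    // leave the motor alone.
    public bool TryRequest(int requestedAngle, DateTime now, out int angleToSet)
    {
        angleToSet = Math.Max(minAngle, Math.Min(maxAngle, requestedAngle));
        if (lastChange.HasValue && now - lastChange.Value < minInterval)
        {
            return false;
        }
        lastChange = now;
        return true;
    }
}
```

A caller would assign the sensor's elevation angle only when TryRequest returns true, for example with an assumed range of -27 to 27 degrees and a minimum interval of one second.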

The final piece of functionality deserving of careful study is the way skeletons from the skeleton stream are selected. The SDK only tracks full skeletons for two players at a time. By default, it uses a complicated set of rules to determine which players should be tracked in this way. However, the SDK also allows this default set of rules to be overridden by the Kinect developer. Kinect Explorer demonstrates how to override the basic rules and also provides several alternative algorithms for determining which players should receive full skeleton tracking, for instance by closest players and by most physically active players.
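The "closest players" rule can be sketched without any Kinect types at all. The example below is an illustration under stated assumptions, not the Kinect Explorer implementation. PlayerCandidate is a hypothetical stand-in for a skeleton that carries only a tracking id and a distance from the sensor, and SkeletonChooser is a hypothetical helper. In a real application the resulting ids would then be handed to the skeleton stream (the SDK exposes AppChoosesSkeletons and ChooseSkeletons for this purpose); that wiring is omitted here.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical, simplified stand-in for a skeleton: an id plus its distance.
public struct PlayerCandidate
{
    public int TrackingId;
    public double DistanceMeters;
}

public static class SkeletonChooser
{
    // Returns the tracking ids of the nearest players, at most maxTracked of them.
    public static int[] ChooseClosest(IEnumerable<PlayerCandidate> players, int maxTracked)
    {
        return players
            .OrderBy(p => p.DistanceMeters)
            .Take(maxTracked)
            .Select(p => p.TrackingId)
            .ToArray();
    }
}
```

Swapping the OrderBy key for a measure of recent movement would give a "most active players" rule with the same overall shape.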

Shape Game

The Shape Game reference app, also a WPF application written in C#, is an ambitious project that ties together skeleton tracking, speech recognition, and basic physics simulation. It also supports up to two players at the same time. The Shape Game introduces the concept of a game loop. Though not dealt with explicitly in this book, game loops are a central concept in game development that you will want to become familiar with in order to present shapes constantly falling from the top of the screen. In Shape Game, the game loop is a C# while loop running in the GameThread method, as shown in Listing 1-4. The GameThread method tweaks the rate of the game loop to achieve the optimal frame rate. On every iteration of the while loop, the HandleGameTimer method is called to move shapes down the screen, add new shapes, and detect collisions between the skeleton hand joints and the falling shapes.

Listing 1-4. A Basic Game Loop

private void GameThread()
{
    runningGameThread = true;
    predNextFrame = DateTime.Now;
    actualFrameTime = 1000.0 / targetFramerate;

    while (runningGameThread)
    {
        . . .

        // run one game tick on the UI thread
        Dispatcher.Invoke(DispatcherPriority.Send,
            new Action<int>(HandleGameTimer), 0);
    }
}
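The part of a game loop that keeps the frame rate steady can be isolated into a small helper. The sketch below shows one simple pacing scheme: sleep until the next scheduled frame, and if the loop has fallen behind, restart the schedule from the current time. It is an illustration only and is not the Shape Game code. The FramePacer class and its members are hypothetical.

```csharp
using System;

// Hypothetical helper illustrating fixed-rate frame pacing.
public class FramePacer
{
    private readonly double frameMilliseconds;
    private DateTime nextFrame;

    public FramePacer(double targetFramerate, DateTime start)
    {
        frameMilliseconds = 1000.0 / targetFramerate;
        nextFrame = start;
    }

    // Returns how long to wait before dispatching the next frame,
    // then schedules the frame after it.
    public TimeSpan NextDelay(DateTime now)
    {
        TimeSpan delay = nextFrame > now ? nextFrame - now : TimeSpan.Zero;
        // If we have fallen behind, restart the schedule from "now"
        // instead of trying to catch up with a burst of frames.
        DateTime baseline = nextFrame > now ? nextFrame : now;
        nextFrame = baseline.AddMilliseconds(frameMilliseconds);
        return delay;
    }
}
```

Inside a while loop, a caller would sleep for pacer.NextDelay(DateTime.Now) and then dispatch one game tick to the UI thread.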

The result is the game interface shown in Figure 1-9. While the Shape Game sample uses primitive shapes for game components such as lines and ellipses for the skeleton, it is also fairly easy to replace these shapes with images in order to create a more engaging experience.


Figure 1-9. Shape Game

The Shape Game also integrates speech recognition into the gameplay. The logic for the speech recognition is contained in the project's Recognizer class. It recognizes phrases of up to five words with approximately 15 possible word choices for each word, potentially supporting a grammar of roughly 760,000 five-word phrases (15 to the fifth power is 759,375). The combination of gesture and speech recognition provides a way to experiment with mixed-modal gameplay with Kinect, something not widely used in Kinect games for the Xbox but around which there is considerable excitement. This book delves into the speech recognition capabilities of Kinect in Chapter 7.

Note: The skeleton tracking in the Shape Game sample provided with the Kinect for Windows SDK highlights a common problem with straightforward rendering of joint coordinates. When a particular body joint falls outside of the camera's view, the joint behavior becomes erratic. This is most noticeable with the legs. A best practice is to create default positions and movements for in-game avatars. The default positions should only be overridden when the skeletal data for particular joints is valid.
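The best practice in the note above comes down to a single decision per joint. The following sketch shows that decision using simplified, hypothetical types. JointState is modeled loosely on the tracking states the SDK reports for each joint, while Point3 and AvatarPose are illustrative names that do not come from the SDK.

```csharp
using System;

// Hypothetical, simplified types; they do not come from the Kinect SDK.
public enum JointState { NotTracked, Inferred, Tracked }

public struct Point3
{
    public double X, Y, Z;
    public Point3(double x, double y, double z) { X = x; Y = y; Z = z; }
}

public static class AvatarPose
{
    // Uses the sensor's position only when the joint is fully tracked;
    // otherwise keeps the avatar's default position for that joint.
    public static Point3 Resolve(JointState state, Point3 sensed, Point3 fallback)
    {
        return state == JointState.Tracked ? sensed : fallback;
    }
}
```

Treating inferred joints the same as untracked ones is a deliberately conservative choice here; an application could instead blend inferred positions with the default pose.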

Record Audio

The RecordAudio sample is the C# version of some of the features demonstrated in AudioCaptureRaw, MFAudioFilter, and MicArrayEchoCancellation. It is a C# console application that records and saves the raw audio from Kinect as a wav file. It also applies the source localization functionality shown in MicArrayEchoCancellation to indicate the source of the audio with respect to the Kinect sensor in radians. It introduces an important concept for working with wav data called the WAVEFORMATEX struct. This is a structure native to C++ that has been reimplemented as a C# struct in RecordAudio, as shown in Listing 1-5. It contains all the information, and only the information, required to define a wav audio file. There are also multiple C# implementations of it all over the web since it seems to be reinvented every time someone needs to work with wav files in managed code.

Listing 1-5. The WAVEFORMATEX Struct

struct WAVEFORMATEX
{
    public ushort wFormatTag;
    public ushort nChannels;
    public uint nSamplesPerSec;
    public uint nAvgBytesPerSec;
    public ushort nBlockAlign;
    public ushort wBitsPerSample;
    public ushort cbSize;
}
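To see how these fields end up in a file, the sketch below writes the canonical 44-byte header of an uncompressed PCM wav file. It is a minimal illustration, not the RecordAudio code. The WavHeader class is hypothetical, and cbSize is left out because the plain 16-byte PCM format chunk used here does not include it.

```csharp
using System;
using System.IO;
using System.Text;

// Hypothetical helper that writes a canonical 44-byte PCM WAV header.
public static class WavHeader
{
    public static byte[] Build(ushort channels, uint samplesPerSec,
                               ushort bitsPerSample, uint dataLength)
    {
        ushort blockAlign = (ushort)(channels * bitsPerSample / 8);
        uint avgBytesPerSec = samplesPerSec * blockAlign;

        using (var stream = new MemoryStream())
        using (var writer = new BinaryWriter(stream))
        {
            writer.Write(Encoding.ASCII.GetBytes("RIFF"));
            writer.Write(36u + dataLength);   // bytes that follow this field
            writer.Write(Encoding.ASCII.GetBytes("WAVE"));
            writer.Write(Encoding.ASCII.GetBytes("fmt "));
            writer.Write(16u);                // size of the PCM format chunk
            writer.Write((ushort)1);          // wFormatTag: 1 = PCM
            writer.Write(channels);           // nChannels
            writer.Write(samplesPerSec);      // nSamplesPerSec
            writer.Write(avgBytesPerSec);     // nAvgBytesPerSec
            writer.Write(blockAlign);         // nBlockAlign
            writer.Write(bitsPerSample);      // wBitsPerSample
            writer.Write(Encoding.ASCII.GetBytes("data"));
            writer.Write(dataLength);         // bytes of audio that follow
            writer.Flush();
            return stream.ToArray();
        }
    }
}
```

Assuming the audio is 16 kHz, 16-bit, mono PCM, a call such as WavHeader.Build(1, 16000, 16, byteCount) produces a header that can be written to a file ahead of the raw samples.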

Speech Sample

The Speech sample application demonstrates how to use Kinect with the speech recognition engine provided in the Microsoft.Speech assembly. Speech is a console application written in C#. Whereas the MFAudioFilter sample used a WMA file as its sink, the Speech application uses the speech recognition engine as a sink in its audio processing pipeline.

The sample is fairly straightforward, demonstrating the concepts of Grammar objects and Choices objects, as shown in Listing 1-6, that have been a part of speech recognition programming since Windows XP. These objects are constructed to create custom lexicons of words and phrases that the application is configured to recognize. In the case of the Speech sample, this includes only three words: red, green, and blue.

Listing 1-6. Grammars and Choices

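// the three words the application should recognize
var colors = new Choices();
colors.Add("red");
colors.Add("green");
colors.Add("blue");

// ri is the RecognizerInfo for the Kinect recognizer
var gb = new GrammarBuilder();
gb.Culture = ri.Culture;
gb.Append(colors);

var g = new Grammar(gb);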

The sample also introduces some widely used boilerplate code that uses C# LINQ syntax to instantiate the speech recognition engine, as illustrated in Listing 1-7. Instantiating the speech recognition engine requires first finding the right recognizer. The code effectively loops through all the recognizers installed on the computer until it finds one that is flagged as a Kinect recognizer and whose culture is en-US. In this case, we use a LINQ expression to perform the loop. If the correct recognizer is found, it is then used to instantiate the speech recognition engine. If it is not found, the speech recognition engine cannot be used.

Listing 1-7. Finding the Kinect Recognizer

private static RecognizerInfo GetKinectRecognizer()
{
    // matches a recognizer flagged for Kinect with the en-US culture
    Func<RecognizerInfo, bool> matchingFunc = r =>
    {
        string value;
        r.AdditionalInfo.TryGetValue("Kinect", out value);
        return "True".Equals(value, StringComparison.InvariantCultureIgnoreCase)
            && "en-US".Equals(r.Culture.Name,
                StringComparison.InvariantCultureIgnoreCase);
    };
    return SpeechRecognitionEngine.InstalledRecognizers()
        .Where(matchingFunc).FirstOrDefault();
}

Although simple, the Speech sample is a good starting point for exploring the Microsoft.Speech API. A productive way to use the sample is to begin adding additional word choices to the limited three-word target set. Then try to create the TellMe style functionality on the Xbox by ignoring any phrase that does not begin with the word “Xbox.” Then try to create a grammar that includes complex grammatical structures that include verbs, subjects and objects as the Shape Game SDK sample does.
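The keyword exercise suggested above can be started with a small filter applied to whatever text the recognizer returns. The sketch below is one possible approach under stated assumptions, not how the Xbox implements it. The CommandFilter class is hypothetical, and it assumes the recognized phrase is available as a plain string.

```csharp
using System;

// Hypothetical helper; the recognized text would come from the speech engine.
public static class CommandFilter
{
    private const string Keyword = "xbox";

    // Returns true and the remainder of the phrase when it starts with the
    // keyword as a whole word; otherwise returns false.
    public static bool TryGetCommand(string recognizedText, out string command)
    {
        command = null;
        if (string.IsNullOrWhiteSpace(recognizedText)) return false;

        string[] words = recognizedText.Trim().Split(
            new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);
        if (!words[0].Equals(Keyword, StringComparison.OrdinalIgnoreCase))
            return false;

        command = string.Join(" ", words, 1, words.Length - 1);
        return true;
    }
}
```

An alternative design is to make the keyword the first element of the grammar itself, so that phrases without it are never recognized in the first place.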

This is, after all, the chief utility of the sample applications provided with the Kinect for Windows SDK. They provide code blocks that you can copy directly into your own code. They also offer a way to begin learning how to get things done with Kinect without necessarily understanding all of the concepts behind the Kinect API right away. I encourage you to play with this code as soon as possible. When you hit a wall, return to this book to learn more about why the Kinect API works the way it does and how to get further in implementing the specific scenarios you are interested in.


Summary

In this chapter, you learned about the surprisingly long history of gesture tracking as a distinct mode of natural user interface. You also learned about the central role Alex Kipman played in bringing Kinect technology to the Xbox and how Microsoft Research, Microsoft's research and development group, was used to bring Kinect to market. You found out how online communities like OpenKinect added momentum toward popularizing Kinect development beyond Xbox gaming, opening up a new trend in Kinect development on the PC. You learned how to install and start programming for the Microsoft Kinect for Windows SDK. Finally, you learned about the various pieces installed with the Kinect for Windows SDK and how to use them as a springboard for your own programming aspirations.

