I've been working with the Microsoft Kinect for Xbox 360 on my PC for a few months now, and overall I find it fantastic! However, one thing that has continued to bug me is the seemingly poor quality of rendered Depth Frame images. There is a lot of noise in a Depth Frame, with missing bits and a pretty serious flickering issue. The Kinect's frame rate isn't bad, maxing out at around 30 fps, but the random noise in the data draws your eye to every refresh. In this article, I am going to show you my solution to this problem: smoothing Depth Frames in real time as they come from the Kinect and are rendered to the screen. This is accomplished through two methods that can be combined: pixel filtering and a weighted moving average.
Some Information on the Kinect
By now, I would assume that everyone has at least heard of the Kinect and understands the basic premise. It's a specialized sensor built by Microsoft that is capable of recognizing and tracking humans in 3D space. How is it able to do that? While it's true that the Kinect has two cameras in it, it does not accomplish 3D sensing through stereo optics. A technology called Light Coding makes the 3D sensing possible.
On the Kinect, there is an Infrared (IR) Projector, a Color (RGB) Camera, and an Infrared (IR) Sensor. For purposes of 3D sensing, the IR Projector emits a grid of IR light in front of it. This light reflects off objects in its path back to the IR Sensor. The pattern received by the IR Sensor is then decoded in the Kinect to determine the depth information, which is sent via USB to another device for further processing. This depth information is incredibly useful in computer vision applications. As a part of the Kinect Beta SDK, this depth information is used to determine joint locations on the human body, thereby allowing developers like us to come up with all sorts of useful applications and functionality.
Important Setup Information
Before you download either the demo application source or the demo application executable, you need to prepare your development environment. To use this application, you need to have the Kinect Beta 2 SDK installed on your machine: http://www.microsoft.com/en-us/kinectforwindows/download/.
At the time of this posting, the commercial SDK has not been released. Please be sure to only use the Kinect Beta 2 SDK for this article's downloads. Also, the SDK installs with a couple of demo applications; please be sure that these run on your machine before you download the files for this article.
Before I dive into the solution, let me better express the problem. Below is a screenshot of raw depth data rendered to an image for reference. Objects that are closer to the Kinect are lighter in color and objects that are further away are darker.
What you're looking at is an image of me sitting at my desk. I'm sitting in the middle; there is a bookcase to the left and a fake Christmas tree to the right. As you can already tell, even without the flickering of a video feed, the quality is pretty low. The maximum resolution you can get for depth data from the Kinect is 320x240, and even at that resolution the quality is poor. The noise in the data manifests itself as white spots continuously popping in and out of the picture. Some of the noise comes from the IR light being scattered by the object it hits, and some comes from shadows cast by objects closer to the Kinect. I wear glasses and often have noise where my glasses should be due to the IR light scattering.
Another limitation of the depth data is its range: the Kinect can only see out to about 8 meters. Do you see that giant white square behind me in the picture? That's not an object close to the Kinect; the room I'm in actually extends about another meter beyond that white square. This is how the Kinect handles objects it can't resolve with depth sensing: it returns a depth of Zero.
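To make the rendering described above concrete, here is a minimal sketch of how a depth array can be mapped to grayscale: nearer pixels come out lighter, farther pixels darker, and the unreadable Zero pixels end up pure white, which is exactly the "white noise" this article sets out to filter. This is an illustration in Python with NumPy rather than the article's C#, and the 8000 mm maximum is simply the 8-meter range mentioned above.

```python
import numpy as np

def depth_to_grayscale(depth_mm, max_depth_mm=8000):
    """Map raw depth (millimetres) to 8-bit grayscale intensities.

    Nearer objects come out lighter; unreadable pixels
    (depth == 0) come out pure white (255), matching the white
    square and white noise visible in the article's screenshot.
    """
    d = np.clip(depth_mm, 0, max_depth_mm).astype(np.float32)
    intensity = 255.0 * (1.0 - d / max_depth_mm)
    return intensity.astype(np.uint8)

# A toy 2x2 depth frame: one unreadable pixel, one at max range.
frame = np.array([[0, 8000], [2000, 4000]], dtype=np.int16)
print(depth_to_grayscale(frame))  # [[255, 0], [191, 127]]
```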
As I mentioned briefly, the solution I have developed uses two different methods of smoothing the depth data: pixel filtering and a weighted moving average. The two methods can be used separately or in series to produce a smoothed output. While the solution doesn't completely remove all noise, it does make an appreciable difference. Neither method degrades the frame rate, and both are capable of producing real-time results for output to a screen or recording.
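Before diving into each method, the idea behind the second one can be sketched briefly: a weighted moving average keeps a queue of the most recent N depth frames and blends them, weighting newer frames more heavily so that flicker is damped without smearing motion too badly. The sketch below is in Python with NumPy rather than the article's C#, and the class name, queue length, and linear weights are illustrative choices, not the article's implementation.

```python
from collections import deque
import numpy as np

class DepthAverager:
    """Weighted moving average over the last max_frames depth frames.

    Newer frames get higher weights, so random flicker is damped
    while recent motion still dominates the blended output.
    """
    def __init__(self, max_frames=4):
        self.frames = deque(maxlen=max_frames)

    def smooth(self, depth_frame):
        self.frames.append(np.asarray(depth_frame, dtype=np.float64))
        # Weights 1, 2, ..., n: the most recent frame counts most.
        weights = np.arange(1, len(self.frames) + 1)
        stack = np.stack(list(self.frames))
        blended = (stack * weights[:, None, None]).sum(axis=0) / weights.sum()
        return blended.astype(np.int16)

avg = DepthAverager(max_frames=2)
avg.smooth(np.full((2, 2), 1000, dtype=np.int16))
# Second frame gets weight 2, first gets weight 1:
# (1000*1 + 4000*2) / 3 = 3000 for every pixel.
print(avg.smooth(np.full((2, 2), 4000, dtype=np.int16)))
```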
Pixel Filtering
The first step in the pixel filtering process is to transform the depth data from the Kinect into something that is a bit easier to process.
private short[] CreateDepthArray(ImageFrame image)
{
    short[] returnArray = new short[image.Image.Width * image.Image.Height];
    byte[] depthFrame = image.Image.Bits;

    // Process each row in parallel
    Parallel.For(0, 240, depthImageRowIndex =>
    {
        // Process each pixel in the row; each depth value spans two bytes
        for (int depthImageColumnIndex = 0; depthImageColumnIndex < 640; depthImageColumnIndex += 2)
        {
            var depthIndex = depthImageColumnIndex + (depthImageRowIndex * 640);
            var index = depthIndex / 2;
            returnArray[index] =
                CalculateDistanceFromDepth(depthFrame[depthIndex], depthFrame[depthIndex + 1]);
        }
    });

    return returnArray;
}
This method creates a simple short[] into which a depth value for each pixel is placed. The depth value is calculated from the byte[] of the ImageFrame that is sent every time the Kinect pushes a new frame. For each pixel, the byte[] of the ImageFrame holds two values.
private short CalculateDistanceFromDepth(byte first, byte second)
{
    // Please note that this would be different if you use Depth
    // and User tracking rather than just Depth.
    return (short)(first | second << 8);
}
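The two bytes arrive little-endian: the first byte is the low-order half of the 16-bit depth value and the second is the high-order half. The same bit arithmetic can be illustrated as follows (in Python; the byte values are arbitrary examples, not real sensor data):

```python
def distance_from_depth(first, second):
    # Mirrors the C# expression `first | second << 8`:
    # 'first' is the low byte, 'second' the high byte of the
    # 16-bit depth value (pure depth stream, no player index).
    return first | (second << 8)

# 0x2C | (0x09 << 8) -> 0x092C -> 2348, i.e. a depth of 2348 mm
print(distance_from_depth(0x2C, 0x09))  # 2348
```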
Now that we have an array that is a bit easier to process, we can begin applying the actual filter to it. We scan through the entire array, pixel by pixel, looking for Zero values; these are the values the Kinect couldn't process properly. We want to remove as many of these as realistically possible without degrading performance or discarding other features of the data (more on that later).
(Translator's note: when developing with OpenNI, the conversion step above can be skipped, as DepthGenerator already provides the depth map as a short array.)
When we find a Zero value in the array, it is considered a candidate for filtering, and we take a closer look at its neighboring pixels. The filter effectively defines two "bands" around the candidate pixel, and it searches those bands for non-Zero values. The filter builds a frequency distribution of those values and notes how many were found in each band. It then compares those counts against an arbitrary threshold for each band to determine whether the candidate should be filtered. If the threshold for either band is broken, the statistical mode of all the non-Zero values is assigned to the candidate; otherwise it is left alone.
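This decision logic can be seen on a single candidate pixel in a compact worked example. The sketch below is in Python rather than the article's C#, the function name and thresholds are illustrative, and the 5x5 neighborhood is hand-made test data: the inner band is the 3x3 ring around the center, the outer band is the 5x5 border.

```python
from collections import Counter

def filter_candidate(neigh, inner_threshold=2, outer_threshold=5):
    """Decide a replacement depth for a zero candidate pixel.

    neigh is a 5x5 list-of-lists of depths with the candidate
    (known to be 0) at the center. If either band holds enough
    non-zero neighbors, return the statistical mode of all
    non-zero depths found; otherwise leave the candidate at 0.
    """
    inner, outer, counts = 0, 0, Counter()
    for yi in range(-2, 3):
        for xi in range(-2, 3):
            if xi == 0 and yi == 0:
                continue  # skip the candidate itself
            depth = neigh[yi + 2][xi + 2]
            if depth != 0:
                counts[depth] += 1
                if -2 < yi < 2 and -2 < xi < 2:
                    inner += 1  # 3x3 ring around the center
                else:
                    outer += 1  # 5x5 border
    if inner >= inner_threshold or outer >= outer_threshold:
        return counts.most_common(1)[0][0]  # statistical mode
    return 0

# Four non-zero inner-band neighbors; 1200 occurs most often,
# so the zero candidate is replaced by 1200.
neigh = [[0,    0,    0,    0,    0],
         [0, 1200, 1200,    0,    0],
         [0, 1200,    0, 1250,    0],
         [0,    0,    0,    0,    0],
         [0,    0,    0,    0,    0]]
print(filter_candidate(neigh))  # 1200
```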
The biggest consideration for this method is ensuring that the filter's bands actually surround the candidate pixel as it would be displayed in the rendered image, rather than just taking values next to each other in the depth array. The code to apply this filter is as follows:
short[] smoothDepthArray = new short[depthArray.Length];

// We will be using these numbers for constraints on indexes
int widthBound = width - 1;
int heightBound = height - 1;

// We process each row in parallel
Parallel.For(0, 240, depthArrayRowIndex =>
{
    // Process each pixel in the row
    for (int depthArrayColumnIndex = 0; depthArrayColumnIndex < 320; depthArrayColumnIndex++)
    {
        var depthIndex = depthArrayColumnIndex + (depthArrayRowIndex * 320);

        // We are only concerned with eliminating 'white' noise from the data.
        // We consider any pixel with a depth of 0 as a possible candidate for filtering.
        if (depthArray[depthIndex] == 0)
        {
            // From the depth index, we can determine the X and Y coordinates that the
            // index will appear at in the image. We use this to help define our filter matrix.
            int x = depthIndex % 320;
            int y = (depthIndex - x) / 320;

            // The filter collection is used to count the frequency of each depth value in
            // the filter matrix. This is used later to determine the statistical mode for
            // possible assignment to the candidate.
            short[,] filterCollection = new short[24, 2];

            // The inner and outer band counts are compared against the threshold values
            // set in the UI to identify a positive filter result.
            int innerBandCount = 0;
            int outerBandCount = 0;

            // The following loops will loop through a 5 x 5 matrix of pixels surrounding the
            // candidate pixel. This defines 2 distinct 'bands' around the candidate pixel.
            // If any of the pixels in this matrix are non-0, we will accumulate them and count
            // how many non-0 pixels are in each band. If the number of non-0 pixels breaks the
            // threshold in either band, then the statistical mode of all non-0 pixels in the
            // matrix is applied to the candidate pixel.
            for (int yi = -2; yi < 3; yi++)
            {
                for (int xi = -2; xi < 3; xi++)
                {
                    // yi and xi are modifiers that will be subtracted from and added to the
                    // candidate pixel's x and y coordinates that we calculated earlier. From
                    // the resulting coordinates, we can calculate the index to be addressed
                    // for processing.

                    // We do not want to consider the candidate pixel (xi = 0, yi = 0) in our
                    // process at this point. We already know that it's 0.
                    if (xi != 0 || yi != 0)
                    {
                        // We then create our modified coordinates for each pass
                        var xSearch = x + xi;
                        var ySearch = y + yi;

                        // While the modified coordinates may in fact calculate out to an actual
                        // index, it might not be the one we want. Be sure to check that the
                        // modified coordinates fall within our image bounds.
                        if (xSearch >= 0 && xSearch <= widthBound &&
                            ySearch >= 0 && ySearch <= heightBound)
                        {
                            var index = xSearch + (ySearch * width);

                            // We only want to look for non-0 values
                            if (depthArray[index] != 0)
                            {
                                // We want to count the frequency of each depth
                                for (int i = 0; i < 24; i++)
                                {
                                    if (filterCollection[i, 0] == depthArray[index])
                                    {
                                        // When the depth is already in the filter collection,
                                        // we will just increment the frequency.
                                        filterCollection[i, 1]++;
                                        break;
                                    }
                                    else if (filterCollection[i, 0] == 0)
                                    {
                                        // When we encounter a 0 depth in the filter collection,
                                        // this means we have reached the end of values already
                                        // counted. We will then add the new depth and start its
                                        // frequency at 1.
                                        filterCollection[i, 0] = depthArray[index];
                                        filterCollection[i, 1]++;
                                        break;
                                    }
                                }

                                // We will then determine which band the non-0 pixel was found
                                // in, and increment the band counters.
                                if (yi != 2 && yi != -2 && xi != 2 && xi != -2)
                                    innerBandCount++;
                                else
                                    outerBandCount++;
                            }
                        }
                    }
                }
            }

            // Once we have determined our inner and outer band non-0 counts, and accumulated
            // all of those values, we can compare them against the thresholds to determine
            // whether the candidate pixel will be changed to the statistical mode of the
            // non-0 surrounding pixels.
            if (innerBandCount >= innerBandThreshold || outerBandCount >= outerBandThreshold)
            {
                short frequency = 0;
                short depth = 0;

                // This loop will determine the statistical mode of the surrounding pixels
                // for assignment to the candidate.
                for (int i = 0; i < 24; i++)
                {
                    // This means we have reached the end of our frequency distribution and
                    // can break out of the loop to save time.
                    if (filterCollection[i, 0] == 0)
                        break;
                    if (filterCollection[i, 1] > frequency)
                    {
                        depth = filterCollection[i, 0];
                        frequency = filterCollection[i, 1];
                    }
                }

                smoothDepthArray[depthIndex] = depth;
            }
        }
        else
        {
            // If the pixel is not zero, we will keep the original depth.
            smoothDepthArray[depthIndex] = depthArray[depthIndex];
        }
    }
});
To be continued.