Human Detection with HOG Features: Finding People in Images and Videos

1. Introduction

Any given object class has huge intra-class variation.



Firstly, the image formation process suppresses 3-D depth information and creates dependencies on viewpoint such that even a small change in the object's position or orientation w.r.t. the camera center may change its appearance considerably. A related issue is the large variation in scales under which an object can be viewed. An object detector must handle the issues of viewpoint and scale changes and provide invariance to them.


Secondly, most natural object classes have large within-class variations. For example, for humans both appearance and pose change considerably between images, and differences in clothing create further changes. A robust detector must try to achieve independence of these variations.


Thirdly, background clutter is common and varies from image to image. Examples are images taken in natural settings, outdoor scenes in cities and indoor environments. The detector must be capable of distinguishing the object class from complex background regions.


Fourthly, object color and general illumination vary considerably.


Finally, partial occlusions create further difficulties because only part of the object is visible for processing.


2. State of the Art

2.1 Image Features

2.1.1 Sparse Local Representations

Sparse representations are based on local descriptors of relevant local image regions. The regions can be selected using either key point detectors or parts detectors.


Point Detectors (point features)

The hypothesis is that key point detectors select stable and more reliable image regions, which are especially informative about local image content. The overall detector performance thus depends on the reliability, accuracy and repeatability with which these key points can be found for the given object class and the informativeness of the points chosen.

One advantage of sparse key point based approaches is the compactness of the representation: there are far fewer key point descriptors than image pixels, so the later stages of the classification process are sped up. However, note that most key point detectors are designed to fire repeatedly on particular objects and may have limitations when generalizing to object classes or categories, i.e. they may not be repeatable for general object classes.


3. Overview of Detection Methodology and Results


3.1 Overall Architecture

The first stage of learning is the creation of the training data. The positive training examples are fixed resolution image windows containing the centered object, and the negative examples are similar windows that are usually randomly subsampled and cropped from a set of images not containing any instances of the object.
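The negative sampling described above can be sketched as follows. This is only an illustration: the function name, the `(x, y, w, h)` window layout, the seeding, and the 64x128 window size used in the usage note are assumptions, not details given in these notes.

```python
import random

def sample_negative_windows(img_w, img_h, win_w, win_h, count, seed=0):
    """Return `count` random (x, y, w, h) crops that fit inside the image.

    Hypothetical helper: the image is assumed to contain no object instance.
    """
    if img_w < win_w or img_h < win_h:
        return []  # image too small to hold a single window
    rng = random.Random(seed)  # seeded so the sampling is reproducible
    return [
        (rng.randint(0, img_w - win_w), rng.randint(0, img_h - win_h), win_w, win_h)
        for _ in range(count)
    ]
```

For example, `sample_negative_windows(320, 240, 64, 128, 10)` yields ten window rectangles, each lying fully inside a 320x240 person-free image.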


Three properties of the linear SVM make it valuable for comparative testing work: it converges reliably and repeatably during training; it handles large data sets gracefully; and it has good robustness towards different choices of feature sets and parameters.



As the linear SVM works directly in the input feature space, it ensures that the feature set is as linearly separable as possible, so improvements in performance imply an improved encoding.



3.2 Overview of Feature Sets

3.2.1 Static HOG Descriptors

The hypothesis is that local object appearance and shape can often be characterized rather well by the distribution of local intensity gradient or edge directions, even without precise knowledge of the corresponding gradient or edge positions.

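The method rests on one assumption: within an image, the appearance and shape of a local object can be described well by the distribution of gradient or edge directions. (In essence: the descriptor is a statistic of gradients, and gradients occur mainly at edges.) (http://blog.csdn.net/zouxy09/article/details/7929348)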


The first stage applies an optional global image normalisation (equalization) that is designed to reduce the influence of illumination effects, e.g. gamma compression.



The second stage computes first order image gradients. These capture contour, silhouette and some texture information, while providing further resistance to illumination variations.



The third stage aims to produce an encoding that is sensitive to local image content while remaining resistant to small changes in pose or appearance.



The fourth stage computes normalisation, which takes local groups of cells and contrast normalises their overall responses before passing them to the next stage. Normalisation introduces better invariance to illumination, shadowing, and edge contrast.



The final step collects the HOG descriptors from all blocks of a dense overlapping grid of blocks covering the detection window into a combined feature vector for use in the window classifier.


4. Histogram of Oriented Gradients Based Encoding of Images

4.1.1 Static HOG Descriptors

All of the variants share the same basic processing chain described in section 3.2.1, i.e. they are all computed on a dense grid of uniformly spaced cells, they capture local shape information by encoding image gradient orientations in histograms, they achieve a small amount of spatial invariance by locally pooling these histograms over spatial image regions, and they employ overlapping local contrast normalisation for improved illumination invariance.



Rectangular HOG (R-HOG)

R-HOGs are similar to SIFT descriptors but are used quite differently. SIFTs are computed at a sparse set of scale-invariant key points, rotated to align their dominant orientations and used individually, whereas R-HOGs are computed in dense grids at a single scale without dominant orientation alignment. The grid position of the block implicitly encodes spatial position relative to the detection window in the final code vector. SIFTs are optimized for sparse wide baseline matching, R-HOGs for dense robust coding of spatial form.



4.1.2 Circular HOG (C-HOG)

In circular HOG (C-HOG) block descriptors, the cells are arranged in grids of log-polar shape. The input image is covered by a dense rectangular grid of centers. At each center, we divide the local image patch into a number of angular and radial bins. The angular bins are uniformly distributed over the circle. The radial bins are computed over log scales, resulting in increasing bin size with increasing distance from the center.




The motivation for the log-polar grid is that it allows fine coding of nearby structure to be combined with coarser coding of wider context.



The C-HOG layout has four spatial parameters: the number of angular and radial bins; the radius of the central bin in pixels; and the expansion factor for subsequent radii.
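The four layout parameters can be turned into a cell lookup roughly as follows. This is a minimal sketch under stated assumptions: the function names are hypothetical, each radius is taken to be the previous one multiplied by the expansion factor, and the central bin is split into angular sectors like the outer rings, which may differ from the variants studied in the thesis.

```python
import math

def chog_radii(n_radial, central_radius, expansion):
    """Outer radius of each radial bin: r0, r0*e, r0*e^2, ... (assumed rule)."""
    return [central_radius * expansion ** k for k in range(n_radial)]

def chog_cell(dx, dy, n_angular, radii):
    """Map an offset from the block center to (radial_bin, angular_bin).

    Returns None when the offset falls outside the outermost ring.
    """
    r = math.hypot(dx, dy)
    for radial_bin, outer in enumerate(radii):
        if r <= outer:
            break
    else:
        return None
    # Angular bins are uniform over the full circle.
    angle = math.atan2(dy, dx) % (2 * math.pi)
    angular_bin = int(angle / (2 * math.pi) * n_angular) % n_angular
    return radial_bin, angular_bin
```

With `radii = chog_radii(3, 4, 2)` the rings end at 4, 8 and 16 pixels, so bins grow with distance from the center as the log-polar layout intends.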


4.1.3 Bar HOG

Bar HOG descriptors are computed similarly to the gradient HOG ones, but use oriented second derivative (bar) filters instead of first derivatives.

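In short, Bar HOG is computed in much the same way as HOG; the difference is that the former uses second derivatives while the latter uses first derivatives.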

4.1.4 Centre-Surround HOG


4.2 Other Descriptors

       Generalized Haar wavelets; shape contexts; PCA-SIFT (still to be read)

4.3 Implementation and Performance Study

4.3.1 Gamma/Color Normalisation

When available, color information always helps, e.g. for the person detector RGB and LAB color spaces give comparable results, while restricting to grayscale reduces performance.



Our experience is that square root gamma compression gives better performance for man-made object classes such as bicycles, motorbikes, cars, buses, and also people (whose patterned clothing results in sudden contrast changes). For animals involving a lot of within-class color variation such as cats, dogs, and horses, unnormalised RGB turns out to be better, while for cows and sheep square root compression continues to yield better performance.


Gamma compression: man-made object classes, including people

Unnormalised RGB: animals

Root compression: cows, sheep, etc.
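Square root gamma compression itself is a per-pixel operation. A minimal sketch, assuming each color channel is stored as a list of rows of non-negative intensities (the function names and data layout are illustrative only):

```python
import math

def sqrt_gamma_compress(channel):
    """Apply square-root gamma compression to one color channel.

    `channel` is a list of rows of non-negative intensities (assumed layout).
    """
    return [[math.sqrt(v) for v in row] for row in channel]

def compress_image(channels):
    """Compress every color channel independently."""
    return [sqrt_gamma_compress(c) for c in channels]
```

Taking the square root compresses bright values more than dark ones, which is why it softens the sudden contrast changes mentioned above.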


4.3.2 Gradient Computation

We computed image gradients using optional Gaussian smoothing followed by one of several discrete derivative masks and tested how performance varies. For color images (RGB or LAB space), we computed separate gradients for each color channel and took the one with the largest norm as the pixel's gradient vector.



Overall detector performance is sensitive to the way in which gradients are computed, and the simplest scheme of centered 1-D [-1 0 1] masks at σ = 0 (no Gaussian smoothing) works best. The use of any form of smoothing or of larger masks of any type seems to decrease the performance. The most likely reason for this is that fine details are important: images are essentially edge based, and smoothing decreases the edge contrast and hence the signal. A useful corollary is that the optimal image gradients can be computed very quickly and simply.


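When computing gradients, the simplest method is the most effective: the centered 1-D operator [-1 0 1] with σ = 0 (no smoothing). Using a larger mask or applying smoothing degrades performance, probably because fine image detail is suppressed.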
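The gradient scheme above is simple enough to sketch directly. The following is an illustration, not the original implementation: the function names are hypothetical, image borders are handled by clamping, and the mask is applied so that the result is `I(x+1) - I(x-1)`; both of the latter are assumptions.

```python
import math

def channel_gradient(channel, y, x):
    """Centered [-1 0 1] differences at (y, x); borders are clamped (assumption)."""
    h, w = len(channel), len(channel[0])
    gx = channel[y][min(x + 1, w - 1)] - channel[y][max(x - 1, 0)]
    gy = channel[min(y + 1, h - 1)][x] - channel[max(y - 1, 0)][x]
    return gx, gy

def image_gradient(channels):
    """Per pixel, keep the gradient of the color channel with the largest norm."""
    h, w = len(channels[0]), len(channels[0][0])
    grad = [[(0.0, 0.0)] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            candidates = [channel_gradient(c, y, x) for c in channels]
            grad[y][x] = max(candidates, key=lambda g: math.hypot(g[0], g[1]))
    return grad
```

No smoothing step appears anywhere, which matches the finding that unsmoothed gradients work best.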


4.3.3 Spatial/Orientation Binning

Each pixel contributes a weighted vote for orientation based on the orientation of the gradient element centered on it.



Fine orientation coding turns out to be essential for good performance for all object classes, whereas spatial binning can be rather coarse.
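A weighted orientation vote for a single cell might look like the sketch below. The details are assumptions rather than values stated in these notes: votes are weighted by gradient magnitude, orientations are unsigned (0 to 180 degrees), there are 9 bins, and each vote is split linearly between the two nearest bin centers. Spatial interpolation between cells is left out for brevity.

```python
import math

def cell_histogram(gradients, n_bins=9):
    """Magnitude-weighted orientation histogram over unsigned angles [0, 180).

    `gradients` is an iterable of (gx, gy) pairs for the pixels of one cell.
    """
    hist = [0.0] * n_bins
    bin_width = 180.0 / n_bins
    for gx, gy in gradients:
        magnitude = math.hypot(gx, gy)
        angle = math.degrees(math.atan2(gy, gx)) % 180.0
        # Position relative to bin centers, so a vote is shared by two neighbors.
        pos = angle / bin_width - 0.5
        lower = math.floor(pos)
        frac = pos - lower
        hist[lower % n_bins] += magnitude * (1.0 - frac)
        hist[(lower + 1) % n_bins] += magnitude * frac
    return hist
```

A vertical gradient such as `(0, 2)` sits exactly on the center of bin 4 and puts its whole magnitude there, while a gradient between two centers is shared by both.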



4.3.4 Block Normalisation Schemes and Descriptor Overlap

       A number of different normalisation schemes were evaluated. Most of them are based on grouping cells into larger spatial blocks and contrast normalizing each block separately. In fact, the blocks are typically overlapped so that each scalar cell response contributes several components to the final descriptor vector, each normalized with respect to a different block. This may seem redundant, but good normalisation is critical and including overlap significantly improves the performance.
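Overlapping block normalisation can be sketched as follows, here with the L2-Hys scheme (L2 normalisation, clipping, then renormalisation). The 2x2-cell block, the one-cell stride, the clipping value 0.2 and the function names are assumptions chosen for illustration.

```python
import math

def l2_hys(vector, clip=0.2, eps=1e-6):
    """L2-normalise, clip components at `clip`, then L2-normalise again."""
    def l2(v):
        norm = math.sqrt(sum(x * x for x in v) + eps * eps)
        return [x / norm for x in v]
    return l2([min(x, clip) for x in l2(vector)])

def block_descriptors(cell_hists, block=2, stride=1):
    """Normalise overlapping block x block groups of cell histograms.

    `cell_hists[r][c]` is the histogram (list of floats) of cell (r, c).
    With stride < block, each cell appears in several blocks.
    """
    rows, cols = len(cell_hists), len(cell_hists[0])
    descriptor = []
    for r in range(0, rows - block + 1, stride):
        for c in range(0, cols - block + 1, stride):
            flat = [v for dr in range(block) for dc in range(block)
                    for v in cell_hists[r + dr][c + dc]]
            descriptor.extend(l2_hys(flat))
    return descriptor
```

On a 3x3 grid of cells this produces four overlapping blocks, so the center cell contributes to the descriptor four times, each time normalised against a different neighborhood.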


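a. Apply gamma normalisation to each color channel of the input image.

b. For each color channel, convolve with the mask [-1 0 1] along the x and y directions to obtain the gradient. Each pixel's orientation and magnitude are taken from the channel whose gradient is largest.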





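        (1) Divide the window region centered on the current point into cells;

        (2) Apply a Gaussian window with sigma = [...] to the image gradients within the block;

        (3) Create a [...] spatial and orientation histogram;

        (4) For each pixel in the block, use trilinear interpolation to vote its gradient magnitude into the histogram.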



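        (1) Divide the image region centered on the current point into a log-polar circular layout; split the block into cells by creating [...] angular and [...] radial bins.

        (2) Create a [...]-bin orientation histogram for each cell;

        (3) For each pixel in the block, use trilinear interpolation in log-polar-orientation space to vote into the cell histograms.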


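(a) Normalise each block independently using L2-Hys or L1-Sqrt; if R2-HOG is used, normalise each 3-D histogram independently;

(b) Concatenate the HOGs of all blocks into one high-dimensional descriptor.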


Input: normalised positive samples of fixed resolution (width [...] and height [...]); negative samples.



First phase learning:

       (a) Compute the descriptor of each positive image;

       (b) Train an SVM classifier.

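Generate hard negative examples by scanning the negative images at multiple scales:

       (a) Define the start scale as [...] and compute the end scale [...], where [...] and [...] are the image width and height;

       (b) Compute the number of scan scales [...], where [...] is the fixed scale step;

       (c) For each scale [...]:

              (1) Rescale the image using bilinear interpolation;

              (2) Run the encoding algorithm, scanning the image with stride [...];

              (3) Add every window whose result is [...] (a hard example) to a list.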


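       (a) Estimate the number of hard examples that can be held in RAM;

       (b) If the number of hard examples exceeds this limit, subsample them;

       (c) Learn the final SVM classifier from the positive samples, the initial negative samples, and the hard examples.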
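The scale loop and the hard-example collection can be sketched as below. Several specifics are missing from the notes above, so the values used here are assumptions and not the original settings: a start scale of 1, an end scale at which the detection window still just fits inside the image, a geometric scale step of 1.2, and a score threshold of 0 for deciding that a negative window is a hard example. All names are hypothetical.

```python
import math
import random

def scan_scales(img_w, img_h, win_w, win_h, start=1.0, ratio=1.2):
    """Geometric list of scales from `start` up to the largest scale at which
    the detection window still fits inside the image (assumed end rule)."""
    end = min(img_w / win_w, img_h / win_h)
    if end < start:
        return []
    n = int(math.floor(math.log(end / start) / math.log(ratio))) + 1
    return [start * ratio ** i for i in range(n)]

def hard_negatives(windows, score, threshold=0.0, capacity=None, seed=0):
    """Keep negative windows the first-phase classifier scores as positive.

    `score` is any callable returning the classifier response for a window.
    If more than `capacity` examples are found, subsample them uniformly.
    """
    hard = [w for w in windows if score(w) > threshold]
    if capacity is not None and len(hard) > capacity:
        hard = random.Random(seed).sample(hard, capacity)
    return hard
```

The `capacity` argument stands in for the memory estimate: when too many hard examples are found, a uniform subsample is retained before retraining.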


Overall, we studied the influence of various descriptor parameters and concluded that fine-scale gradients, fine orientation binning, relatively coarse spatial binning, and high-quality local contrast normalisation in overlapping descriptor blocks are all important for good performance.


5 Multi-Scale Object Localization

5.1 Binary Classifier for Object Localization

Object detection and localization based on scanning detection windows requires multiple overlapping detections to be merged. Our solution is based on the following two hypotheses:

         1. If the detector is robust, it should give a strong positive (though not maximum) response even if the detection window is slightly off-center or off-scale on the object.

         2. A reliable detector will not fire with the same frequency and confidence for non-object image windows.


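         In other words:

         1. If the detection algorithm is robust, the response should remain strong even when the detection window is slightly off the object's center or differs slightly from the object's size;

         2. Responses to non-object windows should differ in frequency and confidence.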

         An ideal fusion method would incorporate the following characteristics:

         1. The higher the peak detection score, the higher the probability for the image region to be a true positive.

         2. The more overlapping detections there are in the neighborhood of an image region, the higher the probability for the image region to be a true positive.

         3. Nearby overlapping detections should be fused together, but overlaps occurring at very different scales or positions should not be fused.



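         2. The more densely strong responses occur, the more likely the detection window is a positive sample;

         3. Overlapping responses should be fused together, unless they differ greatly in scale or position, in which case they should not be fused.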

         We represent detections using kernel density estimation (KDE) in 3-D position and scale space. The bandwidth of the smoothing kernel defines the local neighborhood. The kernel width should be chosen to meet several criteria. It should not be less than the spatial and scale stride at which window classifiers are run, nor less than the natural spatial and scale width of the intrinsic classifier response (the former should obviously be chosen to be less than the latter). Also, it should not be wider than the object itself, so that nearby objects are not confused.

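         We represent the detection results with kernel density estimation (KDE) in the 3-D position-scale space. The width of the smoothing kernel defines what counts as a neighbor. The kernel width must satisfy several conditions: it must be no smaller than the step size in space and scale, and no smaller than the natural width of the classifier response in space and scale.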
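One common way to find the modes of such a kernel density estimate is mean shift, and the sketch below uses it as an illustrative choice. It is a simplification under stated assumptions: detections are tuples `(x, y, log_scale, score)`, only positive-score detections vote, the kernel is an axis-aligned Gaussian with fixed bandwidths (the example values are arbitrary), and two converged points closer than half a bandwidth in every dimension are treated as the same mode.

```python
import math

def mean_shift_modes(detections, bandwidth=(8.0, 16.0, 0.3), iters=50, tol=1e-3):
    """Fuse detections (x, y, log_scale, score) by weighted Gaussian mean shift.

    Returns one (x, y, log_scale) mode per group of detections that converge
    to the same point of the kernel density estimate.
    """
    dets = [d for d in detections if d[3] > 0]  # only positive scores vote

    def shift(p):
        num, den = [0.0, 0.0, 0.0], 0.0
        for x, y, s, w in dets:
            d2 = sum(((a - b) / h) ** 2 for a, b, h in zip(p, (x, y, s), bandwidth))
            k = w * math.exp(-0.5 * d2)  # score-weighted Gaussian kernel
            den += k
            for i, v in enumerate((x, y, s)):
                num[i] += k * v
        return tuple(n / den for n in num)

    modes = []
    for x, y, s, _ in dets:
        p = (x, y, s)
        for _ in range(iters):
            q = shift(p)
            moved = max(abs(a - b) for a, b in zip(p, q))
            p = q
            if moved < tol:
                break
        is_new = not any(
            all(abs(a - b) < h / 2 for a, b, h in zip(p, m, bandwidth)) for m in modes
        )
        if is_new:
            modes.append(p)
    return modes
```

Two nearby detections therefore collapse into one fused detection, while a detection far away in position or scale keeps its own mode, matching the three characteristics listed above.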



6 Oriented Histograms of Flow and Appearance for Detecting People in Videos

6.1 Formation of Motion Compensation

         The goal of this chapter is to exploit motion cues to improve our person detector's performance for films and videos. Detecting people in video sequences introduces new problems. Besides the challenges already mentioned for static images, such as variations in pose, appearance, clothing, illumination and background clutter, the detector has to handle the motion of the subject, the camera and the independent objects in the background. The main challenge is thus to find a set of features that characterize human motion well, while remaining resistant to camera and background motion.







