The Science of Pattern Recognition: Development and Current State (5. Challenges)

5 Challenges

 

A lot of research effort is needed before the two novel and far-reaching paradigms are ready for practical applications. This section therefore focuses on several challenges that arise naturally in the current context and summarizes them for the design of automatic pattern recognition procedures. A number of fundamental problems, related to the various approaches, have already been identified in the previous sections, and some will return here on a more technical level. Many of the points raised in this section have been discussed more extensively in [17]. We will emphasize those that have only been touched upon or are not treated in the standard books [15, 71, 76] or in the review by Jain et al. [45]. The issues to be described are just a selection of the many that are not yet entirely understood. Some of them may be solved in the future by the development of novel procedures or by gaining additional understanding. Others may remain an issue of concern to be dealt with in each application separately. We will describe them systematically, following the line of advancement of a pattern recognition system; see also Fig. 1:

Representation and background knowledge. This is the way in which individual real world objects and phenomena are numerically described or encoded such that they can be related to each other in some meaningful mathematical framework. This framework has to allow the generalization to take place.

 

Design set. This is the set of objects available or selected to develop the recognition system.

Adaptation. This is usually a transformation of the representation such that it becomes more suitable for the generalization step.

Generalization. This is the step in which objects of the design set are related such that classes of objects can be distinguished and new objects can be accurately classified.

Evaluation. This is an estimate of the performance of a developed recognition system.

 

5.1 Representation and Background Knowledge

 

The problem of representation is a core issue for pattern recognition [18, 20]. Representation encodes real world objects by some numerical description, handled by computers in such a way that the individual object representations can be interrelated. Based on that, a generalization is later achieved, establishing descriptions of, or discriminations between, classes of objects. Originally, the issue of representation was almost neglected, as it was reduced to the demand of having discriminative features provided by some expert. Statistical learning is often believed to start in a given feature vector space. Indeed, many books on pattern recognition disregard the topic of representation, simply by assuming that objects are somehow already represented [4, 62]. A systematic study of representation [20, 56] is not easy, as it is application- or domain-dependent (where the word domain refers to the nature or character of problems and the resulting type of data). For instance, the representations of a time signal, an image of an isolated 2D object, an image of a set of objects on some background, a 3D object reconstruction or the collected set of outcomes of a medical examination are entirely different observations that need individual approaches to find good representations. In any case, if the starting point of a pattern recognition problem is not well defined, this cannot be remedied later in the process of learning. It is, therefore, of crucial importance to study the representation issues seriously. Some of them are discussed below.

 

The use of vector spaces. Traditionally, objects are represented by vectors in a feature vector space. This representation makes it feasible to perform some generalization (with respect to this linear space), e.g. by estimating density functions for classes of objects. The object structure is, however, lost in such a description. If objects contain an inherent, identifiable structure or organization, then relations between their elements, like relations between neighboring pixels in an image, are entirely neglected. This also holds for spatial properties encoded by Fourier coefficients or wavelet weights. These original structures may be partially rediscovered by deriving statistics over a set of vectors representing objects, but they are not included in the representation itself. One may wonder whether the representation of objects as vectors in a space is too simplified to reflect the nature of objects in a proper way. Perhaps objects might be better represented by convex bodies, curves or other structures in a metric vector space. The generalization over sets of vectors, however, is heavily studied and mathematically well developed. How to generalize over a set of other structures is still an open question.

 

The essential problem of the use of vector spaces for object representation was originally pointed out by Goldfarb [30, 33]. He prefers a structural representation in which the original object organization (connectedness of building structural elements) is preserved. However, as a generalization procedure for structural representations does not exist yet, Goldfarb starts from the evolving transformation systems [29] to develop a novel system [31]. As already indicated in Sec. 4.3, we see this as a possible direction for a future breakthrough.

 

Compactness. An important, but seldom explicitly identified, property of representations is compactness [1]. In order to consider classes that are bounded in their domains, the representation should be constrained: objects that are similar in reality should be close in their representations (where the closeness is captured by an appropriate relation, possibly a proximity measure). If this demand is not satisfied, objects may be described capriciously and, as a result, no generalization is possible. This compactness assumption puts some restriction on the possible probability density functions used to describe classes in a representation vector space. It thereby also narrows the set of possible classification problems. A formal description of the probability distribution of this set may be of interest for estimating the expected performance of classification procedures for an arbitrary problem.

 

In Sec. 3, we pointed out that the lack of a formal restriction of pattern recognition problems to those with a compact representation was the basis of pessimistic results like the No-Free-Lunch Theorem [81] and the classification error bounds resulting from the VC complexity measure [72, 73]. One of the main challenges for pattern recognition is to find a formal description of compactness that can be used in error estimators that average over the set of possible pattern recognition problems.

 

Representation types. There exist numerous ways in which representations can be derived. The basic ‘numerical’ types are now distinguished as:

 

• Features. Objects are described by characteristic attributes. If these attributes are continuous, the representation is usually compact in the corresponding feature vector space. Nominal, categorical or ordinal attributes may cause problems. Since a description by features is a reduction of objects to vectors, different objects may have identical representations, which may lead to class overlap.

 

• Pixels or other samples. A complete representation of an object may be approximated by its sampling. For images, these are pixels; for time signals, these are time samples; and for spectra, these are wavelengths. A pixel representation is a specific, boundary case of a feature representation, as it describes the object properties at each point of observation.

 

• Probability models. Object characteristics may be reflected by some probabilistic model. Such models may be based on expert knowledge or trained from examples. Mixtures of knowledge and probability estimates are difficult, especially for large models.

 

• Dissimilarities, similarities or proximities. Instead of an absolute description by features, objects are relatively described by their dissimilarities to a collection of specified objects. These may be carefully optimized prototypes or representatives for the problem, but also random subsets may work well [56]. The dissimilarities may be derived from raw data, such as images, spectra or time samples, from original feature representations or from structural representations such as strings or relational graphs. If the dissimilarity measure is nonnegative and zero only for two identical objects, always belonging to the same class, the class overlap may be avoided by dissimilarity representations.

 

• Conceptual representations. Objects may be related to classes in various ways, e.g. by a set of classifiers, each based on a different representation, training set or model. The combined set of these initial classifications or clusterings constitutes a new representation [56]. This is used in the area of combining clusterings [24, 25] or combining classifiers [49].
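The dissimilarity representation in the list above can be illustrated with a minimal sketch. It assumes toy 2D points as objects, the Euclidean distance as the dissimilarity measure and a random subset as the set of specified objects; the names `euclidean`, `dissimilarity_representation`, `objects` and `prototypes` are hypothetical and chosen for illustration only.

```python
import math
import random

def euclidean(a, b):
    """Euclidean distance: nonnegative, and zero only for identical points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def dissimilarity_representation(objects, prototypes, dissimilarity=euclidean):
    """Describe each object by its dissimilarities to every prototype."""
    return [[dissimilarity(obj, p) for p in prototypes] for obj in objects]

# Toy data (assumption): four objects given as 2D points.
objects = [(0.0, 0.0), (1.0, 0.0), (0.0, 3.0), (4.0, 3.0)]

# A random subset of the objects serves as the set of prototypes.
rng = random.Random(0)
prototypes = rng.sample(objects, 2)

# Each row is the new, relative description of one object.
D = dissimilarity_representation(objects, prototypes)
for obj, row in zip(objects, D):
    print(obj, [round(d, 3) for d in row])
```

Any other measure, for instance an edit distance on strings, could be passed as `dissimilarity` without changing the rest of the sketch.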

 

In the structural approaches, objects are represented in qualitative ways. The most important are strings or sequences, graphs and their collections and hierarchical representations in the form of ontological trees or semantic nets.

 

Vectorial object descriptions and proximity representations provide a good way for generalization in some appropriately determined spaces. It is, however, difficult to integrate them with the detailed prior or background knowledge that one has on the problem. On the other hand, probabilistic models and, especially, structural models are well suited for such an integration. The latter, however, constitute a weak basis for training general classification schemes. Usually, they are limited to assigning objects to the class model that fits best, e.g. by the nearest neighbor rule. Other statistical learning techniques can be applied to them if an appropriate proximity measure is given or a vectorial representation space is found by graph embeddings [79].

 

It is a challenge to find representations that constitute a good basis for modeling object structure and that can also be used for generalizing from examples. The next step is not only to find representations based on background knowledge or given by the expert, but also to learn or optimize them from examples.

 

5.2 Design Set

 

A pattern recognition problem is not only specified by a representation, but also by the set of examples given for training and evaluating a classifier in various stages. The selection of this set and its usage strongly influence the overall performance of the final system. We will discuss some related issues.

 

Multiple use of the training set. The entire design set or its parts are used in several stages during the development of a recognition system. Usually, one starts with some exploration, which may lead to the removal of wrongly represented or erroneously labeled objects. After gaining some insights into the problem, the analyst may select a classification procedure based on the observations. Next, the set of objects may go through some preprocessing and normalization. Additionally, the representation has to be optimized, e.g. by feature/object selection or extraction. Then, a series of classifiers has to be trained and the best ones need to be selected or combined. An overall evaluation may result in a re-iteration of some steps and different choices.

 

In this complete process the same objects may be used a number of times for estimation, training, validation, selection and evaluation. Usually, an expected error estimate is obtained by a cross-validation or hold-out method [32, 77]. It is well known that the multiple use of objects should be avoided, as it biases the results and decisions. Re-using objects, however, is almost unavoidable in practice. A general theory that predicts how much a training set is ‘worn-out’ by its repetitive use, and that suggests corrections to diminish such effects, does not exist yet.
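As an illustration of the cross-validation estimate mentioned above, the following is a minimal sketch rather than a recommended procedure. It assumes one-dimensional toy data and a nearest-mean classifier as a stand-in for any trainable classifier; all function names and data are hypothetical. Every object is tested exactly once, by a classifier trained without it.

```python
import random

def fit_nearest_mean(xs, ys):
    """Train a nearest-mean classifier: one mean per class label."""
    sums, counts = {}, {}
    for x, y in zip(xs, ys):
        sums[y] = sums.get(y, 0.0) + x
        counts[y] = counts.get(y, 0) + 1
    return {label: sums[label] / counts[label] for label in sums}

def predict_nearest_mean(means, x):
    """Assign x to the class with the closest mean."""
    return min(means, key=lambda label: abs(x - means[label]))

def cross_validation_error(xs, ys, k=5, seed=0):
    """Estimate the error by k-fold cross-validation."""
    indices = list(range(len(xs)))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    errors = 0
    for fold in folds:
        held_out = set(fold)
        train = [i for i in indices if i not in held_out]
        means = fit_nearest_mean([xs[i] for i in train], [ys[i] for i in train])
        errors += sum(predict_nearest_mean(means, xs[i]) != ys[i] for i in fold)
    return errors / len(xs)

# Toy data (assumption): two well separated classes on a line.
xs = [0.0, 0.2, 0.4, 0.6, 0.8, 5.0, 5.2, 5.4, 5.6, 5.8]
ys = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
print(cross_validation_error(xs, ys, k=5))
```

If the same estimate is afterwards used to choose among several classifiers, the chosen value is optimistically biased, which is the re-use problem described in the text.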

 

Representativeness of the training set. Training sets should be representative of the objects to be classified by the final system. It is common to take a randomly selected subset of the latter for training. Intuitively, it seems to be useless to collect many objects represented in the regions where classes do not overlap. On the contrary, in the proximity of the decision boundary, a higher sampling rate seems to be advantageous. This depends on the complexity of the decision function and the expected class overlap, and is, of course, inherently related to the chosen procedure.

 

Another problem is posed by unstable, unknown or undetermined class distributions. Examples are the impossibility of characterizing the class of non-faces in the face detection problem, or, in machine diagnostics, the probability distribution of all casual events if the machine is used for undetermined production purposes. A training set that is representative of the class distributions cannot be found in such cases. An alternative may be to sample the domain of the classes such that all possible object occurrences are approximately covered. This means that for any object that could be encountered in practice there exists a sufficiently similar object in the training set, defined in relation to the specified class differences. Moreover, as class density estimates cannot be derived for such a training set, class posterior probabilities cannot be estimated. For this reason, such a type of domain based sampling is only appropriate for non-overlapping classes. In particular, this problem is of interest for non-overlapping (dis)similarity based representations [18].

 

Consequently, we wonder whether it is possible to use a more general type of sampling than the classical i.i.d. sampling, namely domain sampling. If so, the open questions concern the verification of dense samplings and the types of new classifiers that are explicitly built on such domains.
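One possible reading of the coverage requirement behind domain sampling is sketched here: every object that may be encountered should have a training object within some distance. This is an illustrative assumption, not a verification method from the text; the names `coverage_radius`, `is_covered` and `epsilon` are hypothetical, and the Euclidean distance stands in for a problem-specific dissimilarity.

```python
import math

def coverage_radius(training_set, probes):
    """Largest distance from any probe object to its nearest training object."""
    return max(min(math.dist(p, t) for t in training_set) for p in probes)

def is_covered(training_set, probes, epsilon):
    """True if every probe has a training object within distance epsilon."""
    return coverage_radius(training_set, probes) <= epsilon

# Toy domain (assumption): training objects on the corners of the unit square.
training_set = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
probes = [(0.5, 0.5), (0.0, 0.0), (0.9, 0.1)]

print(coverage_radius(training_set, probes))
print(is_covered(training_set, probes, epsilon=0.8))
```

In practice the probes are themselves only a sample of the domain, so such a check can suggest, but not prove, that a sampling is dense enough.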

 

5.3 Adaptation

 

Once a recognition problem has been formulated by a set of example objects in a convenient representation, the generalization over this set may be considered, finally leading to a recognition system. The selection of a proper generalization procedure may not be evident, or several disagreements may exist between the realized and preferred procedures. This occurs, e.g., when the chosen representation needs a non-linear classifier and only linear decision functions are computationally feasible, or when the space dimensionality is high with respect to the size of the training set, or when the representation cannot be perfectly embedded in a Euclidean space, while most classifiers demand that. For reasons like these, various adaptations of the representation may be considered. When class differences are explicitly preserved or emphasized, such an adaptation may be considered as a part of the generalization procedure. Some adaptation issues that are less connected to classification are discussed below.

 

Problem complexity. In order to determine which classification procedures might be beneficial for a given problem, Ho and Basu [43] proposed to investigate its complexity. This is an ill-defined concept. Some of its aspects include data organization, sampling, irreducibility (or redundancy) and the interplay between the local and global character of the representation and/or of the classifier. Perhaps several other attributes are needed to define complexity such that it can be used to indicate a suitable pattern recognition solution to a given problem; see also [2].

 

Selection or combining. Representations may be complex, e.g. if objects are represented by a large number of features or if they are related to a large set of prototypes. A collection of classifiers can be designed to make use of this fact and later combined. Additionally, a number of representations may be considered simultaneously. In all these situations, the question arises which should be preferred: a selection from the various sources of information or some type of combination. A selection may be random or based on a systematic search, for which many strategies and criteria are possible [49]. Combinations may sometimes be fixed, e.g. by taking an average, or be of a parameterized type, such as a weighted linear combination, as in principal component analysis; see also [12, 56, 59].
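The difference between a fixed and a parameterized combination can be made concrete with a small sketch. It assumes that each classifier outputs class posteriors as a dictionary; the function names, the posteriors and the weights below are hypothetical. The averaging rule needs no training, while the weights of the linear combination would have to be estimated from data.

```python
def average_combiner(posteriors):
    """Fixed rule: average the class posteriors of several classifiers."""
    n = len(posteriors)
    return {c: sum(p[c] for p in posteriors) / n for c in posteriors[0]}

def weighted_combiner(posteriors, weights):
    """Parameterized rule: weighted linear combination of the posteriors."""
    total = sum(weights)
    return {c: sum(w * p[c] for w, p in zip(weights, posteriors)) / total
            for c in posteriors[0]}

def decide(combined):
    """Assign the class with the largest combined support."""
    return max(combined, key=combined.get)

# Outputs of three hypothetical classifiers for a single object.
outputs = [{"a": 0.6, "b": 0.4}, {"a": 0.4, "b": 0.6}, {"a": 0.7, "b": 0.3}]

print(decide(average_combiner(outputs)))
# Weights chosen by hand here; in practice they are estimated, which is
# where the risk of overtraining enters.
print(decide(weighted_combiner(outputs, weights=[1.0, 5.0, 1.0])))
```

With these numbers the two rules disagree: the average favors class "a", while the weighted rule follows the heavily weighted second classifier and chooses "b".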

 

The choice favoring either a selection or a combining procedure may also be dictated by economic arguments, or by minimizing the amount of necessary measurements or computation. If this is unimportant, the decision has to be made on the basis of accuracy. Selection neglects some information, while combination tries to use everything. The latter, however, may suffer from overtraining, as weights or other parameters have to be estimated and may be adapted to the noise or irrelevant details in the data. The sparse solutions offered by support vector machines [67] and sparse linear programming approaches [28, 35] constitute a compromise. How to optimize them efficiently is still an open question.

 

Nonlinear transformations and kernels. If a representation demands or allows for a complicated, nonlinear solution, a way to proceed is to transform the representation appropriately such that linear aspects are emphasized. A simple (e.g. linear) classifier may then perform well. The use of kernels, see Sec. 3, is a general possibility. In some applications, indefinite kernels are proposed as being consistent with the background knowledge. They may result in non-Euclidean dissimilarity representations, which are challenging to handle; see [57] for a discussion.
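A small sketch can show how a nonlinear transformation makes a linear classifier sufficient. The example is an assumption chosen for illustration: an XOR-like problem in 2D and the homogeneous quadratic kernel with its explicit feature map; `phi`, `quadratic_kernel` and `linear_decision` are hypothetical names.

```python
import math

def phi(x):
    """Explicit feature map of the homogeneous quadratic kernel in 2D."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2.0) * x1 * x2, x2 * x2)

def quadratic_kernel(x, y):
    """k(x, y) = (x . y)^2, which equals the inner product phi(x) . phi(y)."""
    return (x[0] * y[0] + x[1] * y[1]) ** 2

def linear_decision(z):
    """A linear function in the transformed space; only the cross term is used."""
    w = (0.0, 1.0, 0.0)
    return 1 if sum(wi * zi for wi, zi in zip(w, z)) > 0 else -1

# XOR-like problem: the label is the sign of x1 * x2, which no linear
# function of (x1, x2) alone can reproduce.
xor_data = [((1.0, 1.0), 1), ((-1.0, -1.0), 1), ((1.0, -1.0), -1), ((-1.0, 1.0), -1)]

for x, label in xor_data:
    print(x, label, linear_decision(phi(x)))
```

The kernel computes the same inner product without forming `phi` explicitly. This kernel is positive semidefinite; the indefinite kernels mentioned in the text do not correspond to such a Euclidean feature space.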

 

5.4 Generalization

 

The generalization over sets of vectors leading to class descriptions or discriminants was extensively studied in pattern recognition in the 1960s and 1970s. Many classifiers were designed, based on the assumption of normal distributions, kernels or potential functions, nearest neighbor rules, multi-layer perceptrons, and so on [15, 45, 62, 76]. These types of studies were later extended by the fields of multivariate statistics, artificial neural networks and machine learning. However, in the pattern recognition community, there is still a high interest in the classification problem, especially in relation to practical questions concerning issues of combining classifiers, novelty detection or the handling of ill-sampled classes.

 

Handling multiple solutions. Classifier selection or classifier combination. Almost any sufficiently complicated pattern recognition problem can be solved in multiple ways. Various choices can be made for the representation, the adaptation and the classification. Such solutions usually do not only differ in the total classification performance; they may also make different errors. Some type of combining classifiers will thereby be advantageous [49]. It is to be expected that in the future most pattern recognition systems for real world problems will consist of a set of classifiers. In spite of the fact that this area is heavily studied, a general approach on how to select, train and combine solutions is still not available. As training sets have to be used for optimizing several subsystems, the problem of how to design complex systems is strongly related to the above issue of multiple use of the training set.

 

Classifier typology. Any classification procedure has its own explicit or built-in assumptions with respect to the inherent characteristics of the data and the class distributions. This implies that a procedure will lead to relatively good performance if a problem fulfils its exact assumptions. Consequently, any classification approach has its own problem for which it is the best. In some cases such a problem might be far from practical application. The construction of such problems may reveal what the typical characteristics of a particular procedure are. Moreover, when new proposals are to be evaluated, it may be demanded that some examples of the corresponding typical classification problem are published, making clear what the area of application may be; see [19].

 

Generalization principles. The two basic generalization principles, see Section 4, are probabilistic inference, using the Bayes rule [63], and the minimum description length principle, which determines the simplest model in agreement with the observations (based on Occam’s razor) [37]. These two principles are essentially different. The first one is sensitive to multiple copies of an existing object in the training set, while the second one is not. Consequently, the latter is not based on densities, but just on object differences or distances. An important issue is to find in which situations each of these principles should be recommended and whether the choice should be made in the beginning, in the selection of the design set and the way of building a representation, or should be postponed until a later stage.
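The sensitivity to multiple copies can be demonstrated with a toy sketch. It makes two assumptions that go beyond the text: the probabilistic principle is represented by the Bayes rule with Parzen (Gaussian kernel) density estimates and priors taken from class frequencies, and the distance-based side is represented by the nearest neighbor rule, used here only as a simple stand-in for a rule that depends on distances and not on densities (it is not an implementation of minimum description length). All names and data are hypothetical.

```python
import math

def parzen_bayes(train, x, h=1.0):
    """Bayes rule with Parzen estimates. The score of a class is the sum of
    Gaussian kernels over its training objects, which is proportional to
    (class prior) * (class density estimate) at x."""
    scores = {}
    for value, label in train:
        kernel = math.exp(-((x - value) ** 2) / (2.0 * h * h))
        scores[label] = scores.get(label, 0.0) + kernel
    return max(scores, key=scores.get)

def nearest_neighbor(train, x):
    """Distance-based rule: copy the label of the closest training object."""
    return min(train, key=lambda item: abs(x - item[0]))[1]

# One object per class, then the same set with the class "A" object tripled.
train = [(0.0, "A"), (2.0, "B")]
train_with_copies = [(0.0, "A"), (0.0, "A"), (0.0, "A"), (2.0, "B")]
x = 1.2

print(parzen_bayes(train, x), parzen_bayes(train_with_copies, x))
print(nearest_neighbor(train, x), nearest_neighbor(train_with_copies, x))
```

The density-based decision for `x` changes once the copies are added, because they raise the estimated prior of class "A"; the nearest neighbor decision is unaffected.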

 

The use of unlabeled objects and active learning. The above mentioned principles are examples of statistical inductive learning, where a classifier is induced based on the design set and is later applied to unknown objects. The disadvantage of such an approach is that a decision function is in fact designed for all possible representations, whether valid or not. Transductive learning, see Section 4.3, is an appealing alternative, as it determines the class membership only for the objects in question, while relying on the collected design set or a suitable subset of it [73]. The use of unlabeled objects, not just the one to be classified, is a general principle that may be applied in many situations. It may improve a classifier based on just a labeled training set. If this is understood properly, the classification of an entire test set may yield better results than the classification of individuals.

 

Classification or class detection. Two-class problems constitute the traditional basic line in pattern recognition, which reduces to finding a discriminant or a binary decision function. Multi-class problems can be formulated as a series of two-class problems. This can be done in various ways, none of which is entirely satisfactory. An entirely different approach is the description of individual classes by so-called one-class classifiers [69, 70]. In this way the focus is on class description instead of class separation. This brings us to the issue of the structure of a class.
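One of the ways to formulate a multi-class problem as a series of two-class problems is the one-against-rest scheme, sketched below under simplifying assumptions: each two-class discriminant compares the distance to the mean of the class with the distance to the mean of all remaining objects. The discriminant, the function names and the toy data are hypothetical and serve only to show the decomposition.

```python
import math

def mean(points):
    """Coordinate-wise mean of a list of equally long tuples."""
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(len(points[0])))

def train_one_vs_rest(data):
    """data maps a class label to its points. For each class, store the
    pair (mean of the class, mean of all other objects)."""
    models = {}
    for label, points in data.items():
        rest = [p for other, pts in data.items() if other != label for p in pts]
        models[label] = (mean(points), mean(rest))
    return models

def classify(models, x):
    """Each two-class discriminant is positive on the side of its own class;
    the object is assigned to the class whose discriminant is largest."""
    def discriminant(label):
        own, rest = models[label]
        return math.dist(x, rest) - math.dist(x, own)
    return max(models, key=discriminant)

# Toy data (assumption): three classes placed around a triangle.
data = {
    "a": [(0.0, 0.0), (1.0, 0.0)],
    "b": [(4.0, 0.0), (5.0, 0.0)],
    "c": [(2.0, 4.0), (3.0, 4.0)],
}
models = train_one_vs_rest(data)
print(classify(models, (0.2, 0.1)), classify(models, (4.8, 0.2)), classify(models, (2.5, 3.5)))
```

Taking the largest discriminant resolves regions claimed by several or by none of the two-class decisions, but the values of separately trained discriminants are not necessarily comparable, which is one reason such schemes are not entirely satisfactory.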

 

Traditionally, classes are defined by a distribution in the representation space. However, the better such a representation, the higher its dimensionality and the more difficult it is to estimate a probability density function. Moreover, as we have seen above, it is questionable for some applications whether such a distribution exists. A class is then a part of a possibly non-linear manifold in a high-dimensional space. It has a structure instead of a density distribution. It is a challenge to use this approach for building entire pattern recognition systems.

 

5.5 Evaluation

 

Two questions are always apparent in the development of recognition systems. The first refers to the overall performance of a particular system once it is trained, and sometimes has a definite answer. The second question is more open and asks which recognition procedures are good in general.

 

Recognition system performance. Suitable criteria should be used to evaluate the overall performance of the entire system. Different measures with different characteristics can be applied; usually, however, only a single criterion is used. The basic ones are the average accuracy computed over all validation objects or the accuracy determined by the worst-case scenario. In the first case, we again assume that the set of objects to be recognized is well defined (in terms of distributions). Then, it can be sampled and the accuracy of the entire system is estimated based on the evaluation set. In this case, however, we neglect the issue that after having used this evaluation set together with the training set, a better system could have been found. A more interesting point is how to judge the performance of a system if the distribution of objects is ill-defined or if a domain based classification system is used as discussed above. Now, the largest mistake that is made becomes a crucial factor for this type of judgement. One needs to be careful, however, as this may refer to an unimportant outlier (resulting e.g. from invalid measurements).

 

Practice shows that a single criterion, like the final accuracy, is insufficient to judge the overall performance of the whole system. As a result, multiple performance measures should be taken into account, possibly at each stage. These measures should not only reflect the correctness of the system, but also its flexibility to cope with unusual situations in which, e.g., specific examples should be rejected or misclassification costs should be incorporated.
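Reporting several measures at once, including rejection and misclassification costs, might look like the following minimal sketch. The reject threshold, the cost values, the class names and the function names are assumptions made for illustration and are not prescribed by the text.

```python
REJECT = None

def decide_with_reject(posterior, threshold):
    """Return the most probable class, or REJECT if its support is too low."""
    label = max(posterior, key=posterior.get)
    return label if posterior[label] >= threshold else REJECT

def evaluate(true_labels, decisions, cost, reject_cost):
    """Report accuracy on accepted objects, reject rate and average cost."""
    n = len(true_labels)
    accepted = [(t, d) for t, d in zip(true_labels, decisions) if d is not REJECT]
    n_reject = n - len(accepted)
    accuracy = (sum(t == d for t, d in accepted) / len(accepted)
                if accepted else float("nan"))
    total_cost = sum(cost[(t, d)] for t, d in accepted if t != d)
    total_cost += reject_cost * n_reject
    return {"accuracy_on_accepted": accuracy,
            "reject_rate": n_reject / n,
            "average_cost": total_cost / n}

# Hypothetical classifier outputs for four machines and their true states.
posteriors = [{"ok": 0.9, "fault": 0.1}, {"ok": 0.55, "fault": 0.45},
              {"ok": 0.2, "fault": 0.8}, {"ok": 0.7, "fault": 0.3}]
true_labels = ["ok", "fault", "fault", "fault"]

# Missing a fault is assumed to be ten times as costly as a false alarm.
cost = {("fault", "ok"): 10, ("ok", "fault"): 1}
decisions = [decide_with_reject(p, threshold=0.6) for p in posteriors]
report = evaluate(true_labels, decisions, cost, reject_cost=2)
print(report)
```

The three numbers can rank systems differently: a system with a higher accuracy on accepted objects may still have a higher average cost if its remaining errors are the expensive ones.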

 

Prior probability of problems. As argued above, any procedure has a problem for which it performs well. So, we may wonder how large the class of such problems is. We cannot state that any classifier is better than any other classifier, unless the distribution of problems to which these classifiers will be applied is defined. Such distributions have hardly been studied. At most, classifiers are compared over a collection of benchmark problems. Such sets are usually defined ad hoc and just serve as an illustration. The collection of problems to which a classification procedure will be applied is not defined. As argued in Section 3, it may be as large as all problems with a compact representation, but preferably not larger.

 
