Keywords: data preprocessing, normalization, scaling, standardization
In the overall knowledge discovery process, before data mining itself, data preprocessing plays a crucial role. One of the first steps concerns the normalization of the data. This step is very important when dealing with parameters of different units and scales. For example, some data mining techniques use the Euclidean distance. Therefore, all parameters should have the same scale for a fair comparison between them.
Two methods are commonly used for rescaling data. The first is normalization, which scales all numeric variables to the range [0,1]. One possible formula is given below:
x_new = (x - x_min) / (x_max - x_min)
On the other hand, you can use standardization on your data set. It transforms the data to have zero mean and unit variance, for example using the equation below:
x_new = (x - μ) / σ, where μ is the mean and σ the standard deviation of the variable.
Both of these techniques have their drawbacks. If you have outliers in your data set, normalizing your data will certainly squeeze the “normal” data into a very small interval. And generally, most data sets have outliers. When using standardization, you make an assumption that your data have been generated with a Gaussian law (with a certain mean and standard deviation). This may not be the case in reality.
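To make the outlier drawback concrete, here is a minimal sketch of the two rescalings described above. The function names and the toy data are illustrative assumptions, not code from the post, and the population standard deviation is used as one possible convention.

```python
from statistics import mean, pstdev


def min_max_normalize(values):
    """Rescale linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("all values are identical, so the range is zero")
    return [(v - lo) / (hi - lo) for v in values]


def standardize(values):
    """Subtract the mean and divide by the population standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        raise ValueError("all values are identical, so the deviation is zero")
    return [(v - mu) / sigma for v in values]


# Hypothetical data: five "normal" points plus one outlier.
data = [1.0, 2.0, 3.0, 4.0, 5.0]
with_outlier = data + [1000.0]

normalized = min_max_normalize(with_outlier)
standardized = standardize(with_outlier)

# After normalization the five "normal" points all land very close to 0,
# because the outlier alone defines the top of the [0,1] range.
print(normalized)
print(standardized)
```

With the outlier present, the five ordinary points occupy less than one percent of the [0,1] interval, which is the squeezing effect mentioned above.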
So my question is what do you usually use when mining your data and why?
Note: Thanks to Benny Raphael for fruitful discussions on this topic.
Comments
Comments on Standardization vs. normalization
Sometimes perhaps we can take logarithms of input data when they contain order-of-magnitude larger and smaller values. However, since logarithms are defined for positive values only, we need to take care when the input data may contain zero and negative values.
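A small sketch of the log idea from this comment, including one possible way to cope with zero and negative inputs. The signed log1p variant is an assumption on my part (one common workaround), not something the commenter proposed, and the function names are hypothetical.

```python
import math


def log10_positive(values):
    """Plain log10; only valid when every value is strictly positive."""
    if any(v <= 0 for v in values):
        raise ValueError("log10 is undefined for zero or negative values")
    return [math.log10(v) for v in values]


def signed_log1p(values):
    """sign(v) * log(1 + |v|): defined for zero and negative values too."""
    return [math.copysign(math.log1p(abs(v)), v) for v in values]


# Values spanning several orders of magnitude become comparable in scale.
print(log10_positive([1e6, 1e9, 1e12]))

# Zero maps to zero, and negative values keep their sign.
print(signed_log1p([-1000.0, 0.0, 1000.0]))
```

Note that the signed variant is not a true logarithm: it compresses large magnitudes in the same spirit but behaves almost linearly near zero.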
You did very good work on your blog!
A few points come to mind:
1. Monotonic scaling of the data (assuming that distinct values are not collapsed) will have no effect on the most common logical learning algorithms (tree- and rule-induction algorithms).
2. There are robust alternatives, such as: subtract the median and divide by the IQR, or scale linearly so that the 5th and 95th percentiles meet some standard range.
3. Outliers (and, technically, high-leverage points) present an interesting challenge. One possibility is to Winsorize the data after scaling it.
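The median/IQR alternative from point 2 can be sketched as follows. This is only an illustration: the function name is hypothetical, and the "inclusive" quartile method of Python's statistics module is one of several quartile conventions.

```python
from statistics import median, quantiles


def robust_scale(values):
    """Subtract the median and divide by the interquartile range (IQR)."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    if iqr == 0:
        raise ValueError("IQR is zero, so the data cannot be scaled this way")
    med = median(values)
    return [(v - med) / iqr for v in values]


# Hypothetical data: five ordinary points plus one outlier.
values = [1.0, 2.0, 3.0, 4.0, 5.0, 1000.0]
scaled = robust_scale(values)

# The ordinary points keep a usable spread; only the outlier is far away.
print(scaled)
```

Because the median and quartiles barely move when a single extreme value is added, the ordinary points are not squeezed together the way they are under min-max scaling.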
Thanks for your comment, fay. I agree with you on taking the log. I used to work with data in the range 10^6 to 10^12, for example. And thanks for the remark.
Will, your suggestions seem very interesting. I don’t know the “winsorize” technique, but it seems it could be used in addition to normalization.
For readers who are not aware of this technique: “Winsorizing” data simply means clamping the extreme values.
This is similar to trimming the data, except that instead of discarding data, values greater than the specified upper limit are replaced with the upper limit, and those below the lower limit are replaced with the lower limit.
Often, the specified range is indicated in terms of percentiles of the original distribution (like the 5th and 95th percentiles).
This process is sometimes used to make conventional measures more robust, as in the Winsorized variance.
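The clamping described above might look like this in code. It is a minimal sketch: the function name and default percentiles are assumptions for illustration, and the percentile estimates depend on the interpolation method chosen.

```python
from statistics import quantiles


def winsorize(values, lower_pct=5, upper_pct=95):
    """Clamp values to the given lower and upper percentiles of the data."""
    # quantiles(..., n=100) returns 99 cut points; cuts[k - 1] is the k-th percentile.
    cuts = quantiles(values, n=100, method="inclusive")
    lo, hi = cuts[lower_pct - 1], cuts[upper_pct - 1]
    # Nothing is discarded: extreme values are replaced by the limits.
    return [min(max(v, lo), hi) for v in values]


# Hypothetical data: the integers 1..100 as floats.
values = [float(v) for v in range(1, 101)]
clamped = winsorize(values)

print(min(clamped), max(clamped))
```

Unlike trimming, the output has the same length as the input, so it can be fed to downstream steps that expect one value per record.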
Will, can you tell me how I can scale linearly so that the 5th and 95th percentiles meet some standard range?
Can this be done with both negative and positive values?
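One way to read the percentile suggestion is as an ordinary linear map whose two anchor points are the 5th and 95th percentiles instead of the minimum and maximum. The sketch below assumes a target range of [0,1]; the function name and parameters are hypothetical. Since the map is affine, negative and positive values are handled the same way.

```python
from statistics import quantiles


def percentile_scale(values, lower_pct=5, upper_pct=95, new_min=0.0, new_max=1.0):
    """Linearly map the lower/upper percentiles onto new_min/new_max."""
    cuts = quantiles(values, n=100, method="inclusive")
    p_lo, p_hi = cuts[lower_pct - 1], cuts[upper_pct - 1]
    if p_hi == p_lo:
        raise ValueError("the two percentiles coincide, so no scale can be defined")
    scale = (new_max - new_min) / (p_hi - p_lo)
    return [new_min + (v - p_lo) * scale for v in values]


# Hypothetical data with both negative and positive values: -50..50.
values = [float(v) for v in range(-50, 51)]
scaled = percentile_scale(values)

# Points beyond the chosen percentiles fall slightly outside [0,1].
print(scaled[0], scaled[50], scaled[-1])
```

Values outside the two percentiles end up outside the target range; they can be left there or clamped afterwards, depending on the application.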
Another question:
If I want to compute an index where not only the units and scales are different, but also the input metrics into the index have different interpretations – specifically, one metric is better if the values are higher and another one is better if the values are lower, how can I compute an index that represents all numbers concisely and meaningfully?
Let’s say I have Expenses ($), Profits ($) and Turnover (%). Expenses and Turnover are better if lower, but Profits are better if higher.
If comparing two companies on these metrics, and I want to compute one index to show the “best” performing company on these parameters, how can I do this?
Sorry, not strictly data-mining relevant, but thought someone here might have an answer!
I tried using z-scores and normalizing, but it doesn't work because of the different high-low interpretations.
Eventually I used a reverse rank for Expenses and Turnover so that all have the same order. However, rank does not show the quantity difference between the two companies, just their ranks!
This is a great blog; thanks to all for the helpful comments.
First, you can normalize/standardize your data. Or, on the contrary, you can maybe decide to manually assign weights to each of these metrics.
You can, for example, use an objective function. Let's say you want to maximize a function of the Expenses, Profits and Turnover. In the objective function, give a negative weight to Expenses and Turnover and a positive one to Profits. I don’t know if this will work for your problem, but that would be my first guess.
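A rough sketch of such an objective function follows. Combining z-scores with signed weights is my own assumption about how the two suggestions could fit together; the company figures, unit weights and function names are all hypothetical.

```python
from statistics import mean, pstdev


def z_scores(values):
    """Standardize one metric across companies (population standard deviation)."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        raise ValueError("metric is constant across companies")
    return [(v - mu) / sigma for v in values]


def composite_index(metrics, weights):
    """Weighted sum of standardized metrics; a negative weight means lower is better."""
    n = len(next(iter(metrics.values())))
    scores = [0.0] * n
    for name, values in metrics.items():
        for i, z in enumerate(z_scores(values)):
            scores[i] += weights[name] * z
    return scores


# Hypothetical figures for three companies A, B and C (in that order).
metrics = {
    "expenses": [100.0, 200.0, 150.0],  # lower is better
    "profits": [50.0, 10.0, 30.0],      # higher is better
    "turnover": [5.0, 20.0, 10.0],      # lower is better
}
weights = {"expenses": -1.0, "profits": 1.0, "turnover": -1.0}

scores = composite_index(metrics, weights)
print(scores)  # a higher score means a better overall position
```

Unlike a rank-based index, the score keeps the size of the differences between companies. With only two companies, however, every z-score is either +1 or -1, so the index carries no more information than a rank in that case.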
Hi Sandro,
Interesting article, and interesting comments that followed. I am also dealing with the analysis of mass spectrometry data. This kind of data suffers from significant variation due to instrumental errors and limitations (even if the same sample is analyzed). At present I am using a log transformation, which gives satisfactory results. But I am still wary of possible false positives, as the data range from 10^3 to 10^6. So what do you suggest as the best method for normalizing this kind of data?
I am also confused by the two terms you mentioned, 'standardization' and 'normalization', and which to use for mass spectrometry data analysis. I have only found research articles mentioning normalization for this kind of data, not standardization. However, when I searched the internet, both techniques were referred to in similar terms.
What are your views on this question?