Data Mining 3: Discretization, Distance, Similarity

Discretization
Discretization is the process of converting a continuous attribute into an ordinal attribute
(Regression methods can handle continuous data directly, but some decision tree algorithms, such as ID3, require categorical attributes)

A potentially infinite number of values are mapped into a small number of categories
Discretization is commonly used in classification
Many classification algorithms work best if both the independent and dependent variables have only a few values
We give an illustration of the usefulness of discretization using the Iris data set
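
As a minimal sketch of two common unsupervised discretization schemes, equal-width and equal-frequency binning, the following may help. The function names and the sample values are hypothetical; the values are illustrative and are not taken from the Iris data set.

```python
def equal_width_bins(values, k):
    """Assign each value to one of k equal-width intervals (labels 0..k-1)."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    if width == 0:
        return [0 for _ in values]
    # The maximum value would otherwise land in bin k, so clamp it to k-1.
    return [min(int((v - lo) / width), k - 1) for v in values]


def equal_frequency_bins(values, k):
    """Assign each value to one of k bins holding roughly the same number of values."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    labels = [0] * len(values)
    for rank, i in enumerate(order):
        labels[i] = rank * k // len(values)
    return labels


# Hypothetical petal lengths in cm (illustrative only)
petal_lengths = [1.4, 1.5, 1.3, 4.7, 4.5, 5.1, 6.0, 5.9, 1.6]
print(equal_width_bins(petal_lengths, 3))
print(equal_frequency_bins(petal_lengths, 3))
```

Equal-width bins can end up empty when the data are clustered, whereas equal-frequency bins always hold about the same number of values.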

Binarization
Binarization maps a continuous or categorical attribute into one or more binary variables
Typically used for association analysis
Often a continuous attribute is first converted to a categorical attribute, which is then converted to a set of binary attributes

Association analysis needs asymmetric binary attributes
Examples: eye color and height measured as {low, medium, high}
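
The two-step conversion (continuous to categorical, then categorical to binary) might look like the sketch below. The thresholds, the function names `to_level` and `one_hot`, and the sample heights are all assumptions made for illustration.

```python
def to_level(height_cm, low_max=160.0, medium_max=175.0):
    """Discretize a height in cm into low / medium / high (hypothetical thresholds)."""
    if height_cm <= low_max:
        return "low"
    if height_cm <= medium_max:
        return "medium"
    return "high"


def one_hot(value, categories):
    """Return one 0/1 flag per category; the flag of the matching category is 1."""
    return [1 if value == c else 0 for c in categories]


LEVELS = ["low", "medium", "high"]
heights_cm = [152.0, 181.5, 168.0]  # hypothetical sample
for h in heights_cm:
    level = to_level(h)
    print(h, level, one_hot(level, LEVELS))
```

Using one binary attribute per category keeps each attribute asymmetric: only the presence (1) of a category carries information.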

Attribute Transformation (ensures that no attribute dominates the others because of differences in range or variance)
An attribute transform is a function that maps the entire set of values of a given attribute to a new set of replacement values such that each old value can be identified with one of the new values

Simple functions: x^k, log(x), e^x, |x|
Normalization: Refers to various techniques to adjust to differences among attributes in terms of frequency of occurrence, mean, variance, magnitude
In statistics, standardization refers to subtracting the mean and dividing by the standard deviation
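
A small sketch of standardization (z-scores) follows. It assumes the population standard deviation is used; min-max scaling is included only as one other common normalization choice and is not named in these notes.

```python
import statistics


def standardize(values):
    """Subtract the mean and divide by the (population) standard deviation."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)  # assumes std > 0
    return [(v - mean) / std for v in values]


def min_max(values):
    """Rescale values linearly to the range [0, 1] (assumes max > min)."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


sample = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical attribute values
print(standardize(sample))
print(min_max(sample))
```

After standardization the attribute has mean 0 and standard deviation 1, so attributes measured in different units become comparable.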

Similarity and Dissimilarity Measures

Similarity measure
- Numerical measure of how alike two data objects are.
- Is higher when objects are more alike.
- Often falls in the range [0,1]
Proximity refers to a similarity or dissimilarity

Similarity/Dissimilarity for Simple Attributes
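
As a rough sketch, the following shows commonly used textbook definitions for the proximity of two objects described by a single attribute. These definitions are an assumption and may differ in detail from what the notes intended: for a nominal attribute, d = 0 if p = q and 1 otherwise; for an ordinal attribute with n levels mapped to ranks 0..n-1, d = |p - q| / (n - 1); for an interval or ratio attribute, d = |p - q|.

```python
def nominal_dissimilarity(p, q):
    """0 if the two values are equal, 1 otherwise."""
    return 0 if p == q else 1


def ordinal_dissimilarity(p, q, n):
    """p and q are ranks in 0..n-1; the result lies in [0, 1]."""
    return abs(p - q) / (n - 1)


def interval_dissimilarity(p, q):
    """Absolute difference of two interval or ratio values."""
    return abs(p - q)


def similarity_from_bounded(d):
    """For dissimilarities in [0, 1] (nominal, ordinal): s = 1 - d."""
    return 1 - d


def similarity_from_unbounded(d):
    """One possible choice for unbounded dissimilarities: s = 1 / (1 + d)."""
    return 1 / (1 + d)


# Hypothetical ordinal scale: low=0, medium=1, high=2
print(ordinal_dissimilarity(0, 2, 3))
print(similarity_from_unbounded(interval_dissimilarity(10.0, 13.0)))
```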


Euclidean Distance


Standardization is necessary if the scales of the attributes differ.
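
For reference, the usual definition is d(p, q) = sqrt( sum over k of (p_k - q_k)^2 ), where p_k and q_k are the k-th attributes of p and q. The sketch below implements it and uses hypothetical (height, income) objects to show why differing scales matter.

```python
import math


def euclidean(p, q):
    """Euclidean distance between two equal-length numeric vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


# Hypothetical objects described by (height in m, income in dollars)
a = (1.60, 30000.0)
b = (1.95, 30100.0)
c = (1.61, 31000.0)

# Without standardization the income attribute dominates both distances,
# so a large height difference (a vs b) barely registers.
print(euclidean(a, b))
print(euclidean(a, c))
```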

Minkowski Distance
Minkowski Distance is a generalization of Euclidean Distance
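
The standard definition is d(p, q) = ( sum over k of |p_k - q_k|^r )^(1/r), where r is a parameter. With r = 1 this is the city block (Manhattan) distance, with r = 2 the Euclidean distance, and as r goes to infinity it becomes the supremum (maximum) distance. A minimal sketch, with a hypothetical function name:

```python
import math


def minkowski(p, q, r):
    """Minkowski distance of order r; r = math.inf gives the supremum distance."""
    diffs = [abs(a - b) for a, b in zip(p, q)]
    if r == math.inf:
        return max(diffs)
    return sum(d ** r for d in diffs) ** (1 / r)


p, q = (0, 2), (3, 6)  # illustrative points
for r in (1, 2, math.inf):
    print(r, minkowski(p, q, r))
```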


Mahalanobis Distance
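
The Mahalanobis distance is commonly defined as d(p, q) = sqrt( (p - q)^T S^-1 (p - q) ), where S is the covariance matrix of the data; some texts omit the square root. The sketch below is restricted to two dimensions so that the matrix inverse can be written out by hand. The function name and the covariance matrices are assumptions for illustration.

```python
import math


def mahalanobis_2d(p, q, cov):
    """Mahalanobis distance in 2-D; cov is a 2x2 covariance matrix [[a, b], [c, d]]."""
    (a, b), (c, d) = cov
    det = a * d - b * c  # assumes the covariance matrix is invertible
    inv = [[d / det, -b / det], [-c / det, a / det]]
    dx, dy = p[0] - q[0], p[1] - q[1]
    # Quadratic form (p - q)^T * inv * (p - q)
    quad = dx * (inv[0][0] * dx + inv[0][1] * dy) + dy * (inv[1][0] * dx + inv[1][1] * dy)
    return math.sqrt(quad)


# With the identity covariance the result equals the Euclidean distance.
print(mahalanobis_2d((0, 0), (3, 4), [[1, 0], [0, 1]]))
# A larger variance along x shrinks distances measured along x.
print(mahalanobis_2d((0, 0), (2, 0), [[4, 0], [0, 1]]))
```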


Common Properties of a Distance

Distances, such as the Euclidean distance, have some well known properties.

d(p, q) >= 0 for all p and q and d(p, q) = 0 only if p = q. (Positive definiteness)
d(p, q) = d(q, p) for all p and q. (Symmetry)
d(p, r) <= d(p, q) + d(q, r) for all points p, q, and r. (Triangle Inequality)
where d(p, q) is the distance (dissimilarity) between points (data objects), p and q.

A distance that satisfies these properties is a metric
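
The three properties can be spot-checked on a finite set of points, as in the sketch below. Such a check can only refute the properties, not prove them in general. The helper name is hypothetical, and squared Euclidean distance is used here only as a familiar example of a dissimilarity that fails the triangle inequality.

```python
import itertools
import math


def violated_property(points, d, tol=1e-12):
    """Return the name of the first property violated on `points`, or None."""
    for p, q in itertools.product(points, repeat=2):
        if d(p, q) < 0 or (d(p, q) == 0) != (p == q):
            return "positive definiteness"
        if abs(d(p, q) - d(q, p)) > tol:
            return "symmetry"
    for p, q, r in itertools.product(points, repeat=3):
        if d(p, r) > d(p, q) + d(q, r) + tol:
            return "triangle inequality"
    return None


def euclidean(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def squared_euclidean(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


points = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (1.0, 1.0)]  # illustrative points
print(violated_property(points, euclidean))
print(violated_property(points, squared_euclidean))
```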

Common Properties of a Similarity

Similarities also have some well-known properties.

s(p, q) = 1 (or maximum similarity) only if p = q.
s(p, q) = s(q, p) for all p and q. (Symmetry)
where s(p, q) is the similarity between points (data objects), p and q.

Similarity Between Binary Vectors

A common situation is that the objects p and q have only binary attributes
Compute similarities using the following quantities
F01 = the number of attributes where p is 0 and q is 1
F10 = the number of attributes where p is 1 and q is 0
F00 = the number of attributes where p is 0 and q is 0
F11 = the number of attributes where p is 1 and q is 1
Simple Matching and Jaccard Coefficients
SMC = number of matches / number of attributes = (F11 + F00) / (F01 + F10 + F11 + F00)
J = number of 11 matches / number of attributes that are not both zero = F11 / (F01 + F10 + F11)
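
A short sketch of both coefficients, assuming p and q are equal-length lists of 0s and 1s. The function names and the example vectors are illustrative; returning 0.0 when every attribute is 0 in both objects is a convention chosen here, not something the notes specify.

```python
def binary_counts(p, q):
    """Count the four possible (p, q) value combinations over all attributes."""
    pairs = list(zip(p, q))
    return {
        "f01": pairs.count((0, 1)),
        "f10": pairs.count((1, 0)),
        "f00": pairs.count((0, 0)),
        "f11": pairs.count((1, 1)),
    }


def smc(p, q):
    """Simple matching coefficient: matches / number of attributes."""
    c = binary_counts(p, q)
    return (c["f11"] + c["f00"]) / len(p)


def jaccard(p, q):
    """Jaccard coefficient: 11 matches / attributes that are not both zero."""
    c = binary_counts(p, q)
    denom = c["f01"] + c["f10"] + c["f11"]
    return c["f11"] / denom if denom else 0.0  # convention for all-zero vectors


p = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
q = [0, 0, 0, 0, 0, 0, 1, 0, 0, 1]
print(binary_counts(p, q))
print(smc(p, q), jaccard(p, q))
```

For sparse data SMC is dominated by the 0-0 matches, which is why the Jaccard coefficient ignores them.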

Cosine Similarity
If d1 and d2 are two document vectors, then
cos(d1, d2) = (d1 . d2) / (||d1|| ||d2||),
where . denotes the vector dot product and ||d|| is the length (Euclidean norm) of vector d
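
A minimal sketch of cosine similarity, assuming the documents are represented as equal-length term-count vectors. The two vectors below are illustrative.

```python
import math


def cosine_similarity(d1, d2):
    """Dot product of d1 and d2 divided by the product of their Euclidean norms."""
    dot = sum(a * b for a, b in zip(d1, d2))
    norm1 = math.sqrt(sum(a * a for a in d1))
    norm2 = math.sqrt(sum(b * b for b in d2))
    return dot / (norm1 * norm2)  # assumes neither vector is all zeros


d1 = [3, 2, 0, 5, 0, 0, 0, 2, 0, 0]
d2 = [1, 0, 0, 0, 0, 0, 0, 1, 0, 2]
# dot = 5, ||d1|| = sqrt(42), ||d2|| = sqrt(6)
print(round(cosine_similarity(d1, d2), 4))
```

Because the result depends only on the angle between the vectors, scaling a document's counts by a constant leaves the similarity unchanged.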

Extended Jaccard Coefficient (Tanimoto)
Variation of Jaccard for continuous or count attributes

Reduces to Jaccard for binary attributes
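
The extended Jaccard coefficient is usually written as T(p, q) = (p . q) / (||p||^2 + ||q||^2 - p . q). The sketch below implements that formula; the function name and the example vectors are assumptions. For 0/1 vectors, p . q equals F11 and the denominator equals F01 + F10 + F11, which gives the Jaccard coefficient.

```python
def tanimoto(p, q):
    """Extended Jaccard (Tanimoto) coefficient for numeric vectors."""
    dot = sum(a * b for a, b in zip(p, q))
    norm_p_sq = sum(a * a for a in p)
    norm_q_sq = sum(b * b for b in q)
    return dot / (norm_p_sq + norm_q_sq - dot)  # assumes p and q are not both all zeros


# Binary vectors: F11 = 2, F10 = 1, F01 = 1, so Jaccard = 2 / 4
print(tanimoto([1, 0, 1, 1], [1, 1, 0, 1]))
# Count vectors (illustrative)
print(tanimoto([3, 0, 2], [1, 1, 2]))
```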


Correlation

Correlation measures the linear relationship between objects
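To compute correlation, we standardize the data objects p and q and then take the dot product of the standardized vectors, divided by the number of attributes n (or n - 1 if the sample standard deviation is used)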


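Drawback: correlation captures only linear relationships; two objects with a strong non-linear relationship can still have a correlation of 0 (they are linearly uncorrelated, not independent)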
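
A sketch of the computation, assuming the population standard deviation is used (so the dot product is divided by n). The example vectors are illustrative: q is an exact function of p (q_k = p_k^2), yet the correlation comes out as zero up to floating-point rounding.

```python
import statistics


def correlation(p, q):
    """Pearson correlation: standardize both vectors, then average the products."""
    n = len(p)
    mean_p, mean_q = statistics.mean(p), statistics.mean(q)
    std_p, std_q = statistics.pstdev(p), statistics.pstdev(q)  # assumes both > 0
    zp = [(a - mean_p) / std_p for a in p]
    zq = [(b - mean_q) / std_q for b in q]
    return sum(a * b for a, b in zip(zp, zq)) / n


p = [-3, -2, -1, 0, 1, 2, 3]
q_linear = [2 * a + 1 for a in p]  # perfectly linear in p
q_square = [a * a for a in p]      # perfectly determined by p, but not linearly

print(correlation(p, q_linear))
print(correlation(p, q_square))
```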

Density

Measures the degree to which data objects are close to each other in a specified area
The notion of density is closely related to that of proximity
Concept of density is typically used for clustering and anomaly detection
Examples:
Euclidean density: number of points per unit volume
Probability density: Estimate what the distribution of the data looks like
Graph-based density: connectivity

Euclidean Density: Center-Based
Euclidean density is the number of points within a specified radius of the point
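
A minimal sketch of center-based density. Whether the center point itself is counted and whether the boundary is included (<= radius) are assumptions made here. `math.dist` requires Python 3.8 or later.

```python
import math


def center_based_density(center, points, radius):
    """Number of points whose Euclidean distance to `center` is at most `radius`."""
    return sum(1 for x in points if math.dist(center, x) <= radius)


# Illustrative 2-D points
points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (3.0, 3.0)]
print(center_based_density((0.0, 0.0), points, 1.0))
print(center_based_density((3.0, 3.0), points, 1.0))
```

The result depends strongly on the radius: too small and every point has density 1, too large and every point has density equal to the size of the data set.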


Distance in High-Dimensional Space

