Data Mining 3: Discretization, Distance, Similarity

Discretization
Discretization is the process of converting a continuous attribute into an ordinal attribute.
(Regression methods can handle continuous data directly, whereas some decision tree algorithms, such as ID3, require categorical attributes.)

  • A potentially infinite number of values are mapped into a small number of categories
  • Discretization is commonly used in classification
  • Many classification algorithms work best if both the independent and dependent variables have only a few values
  • The Iris data set (for example, its petal length and petal width attributes) is often used to illustrate the usefulness of discretization
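
The two most common unsupervised discretization schemes, equal-width and equal-frequency binning, can be sketched as follows. This is a minimal illustration, not a reference implementation: the function names and the sample values are hypothetical (the values are not taken from the Iris data set), and ties in equal-frequency binning may be split across bins.

```python
def equal_width_bins(values, k):
    """Assign each value to one of k equal-width intervals (bin index 0..k-1)."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    if width == 0:
        return [0 for _ in values]
    # The maximum value would fall into bin k, so clamp it into the last bin.
    return [min(int((v - lo) / width), k - 1) for v in values]


def equal_frequency_bins(values, k):
    """Assign bin indices so that each bin holds roughly the same number of values."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, i in enumerate(order):
        bins[i] = rank * k // len(values)
    return bins


# Hypothetical measurements in cm, chosen so that the two schemes disagree.
petal_lengths = [1.0, 1.2, 1.4, 1.5, 1.6, 4.0, 4.5, 5.0, 7.0]
labels = ["low", "medium", "high"]

print([labels[b] for b in equal_width_bins(petal_lengths, 3)])
print([labels[b] for b in equal_frequency_bins(petal_lengths, 3)])
```

Equal-width binning is sensitive to outliers (most values can end up in one bin), while equal-frequency binning balances the bin sizes at the cost of uneven interval widths.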

Binarization
Binarization maps a continuous or categorical attribute into one or more binary variables.
It is typically used for association analysis.
A continuous attribute is often first converted to a categorical attribute, which is then converted to a set of binary attributes.

  • Association analysis needs asymmetric binary attributes, where only the presence of a value (a 1) is considered important
  • Examples: eye color, and height measured as {low, medium, high}
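
As a rough sketch of the second step (categorical to binary), each category can be given its own 0/1 attribute. The helper name `one_hot` and the sample data are assumptions made for illustration.

```python
def one_hot(values, categories):
    """Map each categorical value to a list of 0/1 flags, one flag per category."""
    return [[1 if v == c else 0 for c in categories] for v in values]


# Hypothetical height attribute that has already been discretized.
heights = ["low", "high", "medium", "low"]
levels = ["low", "medium", "high"]

for h, row in zip(heights, one_hot(heights, levels)):
    print(h, row)
```

Each row contains exactly one 1, so every resulting attribute is an asymmetric binary attribute of the kind association analysis expects.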

Attribute Transformation
An attribute transform is a function that maps the entire set of values of a given attribute to a new set of replacement values such that each old value can be identified with one of the new values.
One common goal is to ensure that no attribute dominates the others merely because of its range or variance.

  • Simple functions: x^k, log(x), e^x, |x|
  • Normalization: refers to various techniques that adjust for differences among attributes in frequency of occurrence, mean, variance, or range
  • In statistics, standardization refers to subtracting the mean and dividing by the standard deviation
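
A minimal sketch of two such transforms, standardization (z-score) and min-max scaling, is given here. The function names and the income values are hypothetical, the population standard deviation is used, and the attribute is assumed not to be constant.

```python
import math


def standardize(values):
    """Subtract the mean and divide by the (population) standard deviation."""
    mean = sum(values) / len(values)
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    return [(v - mean) / std for v in values]


def min_max(values):
    """Rescale values linearly onto the range [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical incomes: a large-range attribute that would dominate a distance.
incomes = [20000, 40000, 60000, 80000]
print([round(z, 3) for z in standardize(incomes)])
print([round(m, 3) for m in min_max(incomes)])
```

After standardization the attribute has mean 0 and standard deviation 1, so it contributes to a distance on the same footing as other standardized attributes.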

Similarity and Dissimilarity Measures

  • Similarity measure
    • Numerical measure of how alike two data objects are.
    • Is higher when objects are more alike.
    • Often falls in the range [0,1]
  • Proximity refers to a similarity or dissimilarity

Similarity/Dissimilarity for Simple Attributes
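For a single attribute, a common textbook convention depends on the attribute type. The sketch below assumes that convention: 0/1 mismatch for nominal values, normalized rank difference for ordinal values, and absolute difference for interval or ratio values. The function names are hypothetical.

```python
def nominal_dissimilarity(p, q):
    """0 if the two values are equal, 1 otherwise."""
    return 0 if p == q else 1


def ordinal_dissimilarity(p, q, levels):
    """Rank difference divided by (n - 1), where levels lists the n values in order."""
    n = len(levels)
    return abs(levels.index(p) - levels.index(q)) / (n - 1)


def interval_dissimilarity(p, q):
    """Absolute difference between two interval or ratio values."""
    return abs(p - q)


levels = ["low", "medium", "high"]
print(nominal_dissimilarity("blue", "brown"))           # 1
print(ordinal_dissimilarity("low", "medium", levels))   # 0.5
print(interval_dissimilarity(3.5, 1.0))                 # 2.5
```

For nominal and ordinal attributes the dissimilarity lies in [0, 1], so a similarity can be taken as s = 1 - d.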


Euclidean Distance


Standardization is necessary if the attributes are measured on different scales.
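
For two points p and q with n attributes, the Euclidean distance is the square root of the sum over all attributes k of (p_k - q_k)^2. A minimal sketch follows; the function name and the four sample points are illustrative assumptions.

```python
import math


def euclidean(p, q):
    """Square root of the sum of squared attribute differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


# Hypothetical 2-D points.
points = {"p1": (0, 2), "p2": (2, 0), "p3": (3, 1), "p4": (5, 1)}

for name, point in points.items():
    row = [round(euclidean(point, other), 3) for other in points.values()]
    print(name, row)
```

The printed rows form a symmetric distance matrix with zeros on the diagonal.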

Minkowski Distance
Minkowski Distance is a generalization of Euclidean Distance
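The Minkowski distance with parameter r is (sum over k of |p_k - q_k|^r)^(1/r). With r = 1 it is the city block (Manhattan, L1) distance, with r = 2 the Euclidean (L2) distance, and as r goes to infinity it approaches the supremum (L-max) distance, the largest single attribute difference. A small sketch, with a hypothetical function name:

```python
import math


def minkowski(p, q, r):
    """Minkowski distance; r = math.inf gives the supremum distance."""
    diffs = [abs(a - b) for a, b in zip(p, q)]
    if r == math.inf:
        return max(diffs)
    return sum(d ** r for d in diffs) ** (1 / r)


p, q = (0, 2), (2, 0)
print(minkowski(p, q, 1))         # city block distance
print(minkowski(p, q, 2))         # Euclidean distance
print(minkowski(p, q, math.inf))  # supremum distance
```

Note that r is a parameter of the distance, not the number of attributes n.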


Mahalanobis Distance
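The Mahalanobis distance takes the covariance of the data into account. The sketch below assumes the standard definition, sqrt((p - q) Σ^-1 (p - q)^T), where Σ is the covariance matrix of the data; some texts use the squared form without the square root. To stay within the standard library, the hypothetical helper is restricted to two attributes so that the 2x2 inverse can be written out by hand, and the covariance matrix and points are made-up values.

```python
import math


def mahalanobis_2d(p, q, cov):
    """Mahalanobis distance between 2-D points for a 2x2 covariance matrix."""
    (a, b), (c, d) = cov
    det = a * d - b * c
    # Closed-form inverse of a 2x2 matrix.
    inv = [[d / det, -b / det], [-c / det, a / det]]
    dx, dy = p[0] - q[0], p[1] - q[1]
    squared = (dx * (inv[0][0] * dx + inv[0][1] * dy)
               + dy * (inv[1][0] * dx + inv[1][1] * dy))
    return math.sqrt(squared)


# Hypothetical covariance matrix with positively correlated attributes.
cov = [[0.3, 0.2], [0.2, 0.3]]
A, B, C = (0.5, 0.5), (0.0, 1.0), (1.5, 1.5)

print(round(mahalanobis_2d(A, B, cov), 3))  # across the direction of correlation
print(round(mahalanobis_2d(A, C, cov), 3))  # along the direction of correlation
```

Although C is twice as far from A as B is in Euclidean terms, its Mahalanobis distance is smaller, because the displacement from A to C follows the direction in which the data vary most.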


Common Properties of a Distance

  • Distances, such as the Euclidean distance, have some well-known properties.
  1. d(p, q) >= 0 for all p and q, and d(p, q) = 0 if and only if p = q. (Positive definiteness)
  2. d(p, q) = d(q, p) for all p and q. (Symmetry)
  3. d(p, r) <= d(p, q) + d(q, r) for all points p, q, and r. (Triangle Inequality)
    where d(p, q) is the distance (dissimilarity) between points (data objects), p and q.
  • A distance that satisfies these properties is a metric
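
The three properties can be spot-checked on a finite sample of points. The sketch below uses a hypothetical helper, `is_metric_on`; passing the check on a sample does not prove that a function is a metric, but a single failure proves that it is not.

```python
import itertools
import math


def is_metric_on(points, d, tol=1e-12):
    """Check positive definiteness, symmetry and the triangle inequality on a sample."""
    for p, q in itertools.product(points, repeat=2):
        if d(p, q) < 0:
            return False
        if (d(p, q) == 0) != (p == q):
            return False
        if abs(d(p, q) - d(q, p)) > tol:
            return False
    for p, q, r in itertools.product(points, repeat=3):
        if d(p, r) > d(p, q) + d(q, r) + tol:
            return False
    return True


def euclidean(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def squared_euclidean(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


points = [(0, 0), (1, 0), (2, 0)]
print(is_metric_on(points, euclidean))          # True
print(is_metric_on(points, squared_euclidean))  # False
```

Squared Euclidean distance fails the triangle inequality here: the distance from (0, 0) to (2, 0) is 4, which exceeds 1 + 1 via (1, 0).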

Common Properties of a Similarity

  • Similarities also have some well-known properties.
  1. s(p, q) = 1 (or maximum similarity) only if p = q.
  2. s(p, q) = s(q, p) for all p and q. (Symmetry)
    where s(p, q) is the similarity between points (data objects), p and q.

Similarity Between Binary Vectors

  • A common situation is that objects p and q have only binary attributes
  • Compute similarities using the following quantities
    F01 = the number of attributes where p was 0 and q was 1
    F10 = the number of attributes where p was 1 and q was 0
    F00 = the number of attributes where p was 0 and q was 0
    F11 = the number of attributes where p was 1 and q was 1
  • Simple Matching and Jaccard Coefficients
    SMC = number of matches / number of attributes
        = (F11 + F00) / (F01 + F10 + F11 + F00)
    J = number of 11 matches / number of attributes that are not both zero
      = F11 / (F01 + F10 + F11)
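
A short sketch of both coefficients, computed from the four counts. The function names and the two sample vectors are illustrative assumptions; the Jaccard coefficient is defined as 0 here when neither vector contains a 1.

```python
def binary_counts(p, q):
    """Return (F01, F10, F00, F11) for two equal-length 0/1 vectors."""
    f01 = sum(1 for a, b in zip(p, q) if a == 0 and b == 1)
    f10 = sum(1 for a, b in zip(p, q) if a == 1 and b == 0)
    f00 = sum(1 for a, b in zip(p, q) if a == 0 and b == 0)
    f11 = sum(1 for a, b in zip(p, q) if a == 1 and b == 1)
    return f01, f10, f00, f11


def smc(p, q):
    f01, f10, f00, f11 = binary_counts(p, q)
    return (f11 + f00) / (f01 + f10 + f00 + f11)


def jaccard(p, q):
    f01, f10, _, f11 = binary_counts(p, q)
    denom = f01 + f10 + f11
    return f11 / denom if denom else 0.0


p = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
q = [0, 0, 0, 0, 0, 0, 1, 0, 0, 1]
print(smc(p, q))      # 0.7
print(jaccard(p, q))  # 0.0
```

The two vectors share no 1s, yet SMC is high because of the many 0-0 matches. This is why the Jaccard coefficient is preferred for sparse, asymmetric binary data.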

Cosine Similarity
If d1 and d2 are two document vectors, then
cos(d1, d2) = (d1 · d2) / (||d1|| ||d2||),
where d1 · d2 is the vector dot product and ||d|| is the length (Euclidean norm) of vector d.
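
A minimal sketch of the computation on two term-count vectors. The function name and the vectors are made up for illustration, and neither vector is assumed to be all zeros.

```python
import math


def cosine_similarity(d1, d2):
    """Dot product divided by the product of the Euclidean norms."""
    dot = sum(a * b for a, b in zip(d1, d2))
    norm1 = math.sqrt(sum(a * a for a in d1))
    norm2 = math.sqrt(sum(b * b for b in d2))
    return dot / (norm1 * norm2)


# Hypothetical term-frequency vectors for two documents.
d1 = [3, 2, 0, 5, 0, 0, 0, 2, 0, 0]
d2 = [1, 0, 0, 0, 0, 0, 0, 1, 0, 2]
print(round(cosine_similarity(d1, d2), 4))  # 0.315
```

Here d1 · d2 = 5, ||d1|| = sqrt(42) and ||d2|| = sqrt(6). Because the measure depends only on the angle between the vectors, doubling every count in a document leaves the similarity unchanged.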

Extended Jaccard Coefficient (Tanimoto)
Variation of Jaccard for continuous or count attributes

  • Reduces to Jaccard for binary attributes
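
The sketch below assumes the usual definition of the extended Jaccard (Tanimoto) coefficient, T(p, q) = (p · q) / (||p||^2 + ||q||^2 - p · q). The function name and the sample vectors are hypothetical.

```python
def tanimoto(p, q):
    """Extended Jaccard coefficient for count or continuous vectors."""
    dot = sum(a * b for a, b in zip(p, q))
    norm_p_sq = sum(a * a for a in p)
    norm_q_sq = sum(b * b for b in q)
    return dot / (norm_p_sq + norm_q_sq - dot)


# On binary vectors the result equals the Jaccard coefficient F11 / (F01 + F10 + F11).
print(tanimoto([1, 1, 0, 0], [1, 0, 1, 0]))

# On count vectors it also reflects differences in magnitude.
print(tanimoto([3, 2, 0, 5], [1, 0, 0, 2]))
```

For binary vectors, p · q counts the 1-1 matches and each squared norm counts the 1s in that vector, so the denominator is F01 + F10 + F11.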


Correlation

  • Correlation measures the linear relationship between objects
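  • To compute correlation, we standardize the data objects p and q (subtract the mean and divide by the standard deviation), take the dot product of the standardized vectors, and divide by n - 1 when the sample standard deviation is used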


Drawback: correlation captures only linear relationships, not non-linear ones. A correlation of 0 means there is no linear relationship; it does not mean the objects are independent.
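
A small sketch of Pearson correlation computed by standardizing and taking a dot product, followed by an example of the drawback. The function name and the data are illustrative, the sample standard deviation (n - 1) is used, and neither vector is assumed to be constant.

```python
import math


def pearson(p, q):
    """Standardize both vectors, then take their dot product divided by n - 1."""
    n = len(p)
    mean_p, mean_q = sum(p) / n, sum(q) / n
    std_p = math.sqrt(sum((x - mean_p) ** 2 for x in p) / (n - 1))
    std_q = math.sqrt(sum((y - mean_q) ** 2 for y in q) / (n - 1))
    p_std = [(x - mean_p) / std_p for x in p]
    q_std = [(y - mean_q) / std_q for y in q]
    return sum(a * b for a, b in zip(p_std, q_std)) / (n - 1)


x = [-3, -2, -1, 0, 1, 2, 3]
linear = [2 * v + 1 for v in x]   # perfectly linear in x
squared = [v * v for v in x]      # fully determined by x, but not linearly

print(round(pearson(x, linear), 6))   # 1.0
print(round(pearson(x, squared), 6))  # 0.0
```

The second result is 0 even though y = x^2 is completely determined by x, which is exactly the limitation described above.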

Density

  • Measures the degree to which data objects are close to each other in a specified area
  • The notion of density is closely related to that of proximity
  • Concept of density is typically used for clustering and anomaly detection
  • Examples:
    • Euclidean density: number of points per unit volume
    • Probability density: an estimate of what the distribution of the data looks like
    • Graph-based density: connectivity

Euclidean Density: Center-Based
The center-based Euclidean density of a point is the number of points within a specified radius of that point.
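
A minimal sketch of this count. The function name and the points are hypothetical, and the point itself is included in the count when it belongs to the data set (conventions differ on this).

```python
import math


def center_based_density(point, points, radius):
    """Number of points lying within the given radius of point (inclusive)."""
    return sum(1 for other in points if math.dist(point, other) <= radius)


# Hypothetical data: a cluster of three points and a pair further away.
points = [(0, 0), (1, 0), (0, 1), (5, 5), (6, 5)]
print(center_based_density((0, 0), points, 1.5))  # 3
print(center_based_density((5, 5), points, 1.5))  # 2
```

The result depends strongly on the radius: a very small radius gives every point a density of 1, while a very large one gives every point a density equal to the size of the data set.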


Distance in High-Dimensional Space
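A well-known effect in high-dimensional spaces is that, for randomly generated points, the nearest and farthest neighbors of a query point end up at nearly the same distance, so distance-based notions of proximity and density become less meaningful. The sketch below is one assumed way to observe this: it draws uniform random points and reports the relative contrast (max - min) / min of the distances to a random query point. The function name, sample size, and seed are arbitrary choices.

```python
import math
import random


def relative_contrast(dim, n_points=200, seed=0):
    """(max distance - min distance) / min distance from a random query point."""
    rng = random.Random(seed)
    points = [[rng.random() for _ in range(dim)] for _ in range(n_points)]
    query = [rng.random() for _ in range(dim)]
    dists = [math.dist(query, p) for p in points]
    return (max(dists) - min(dists)) / min(dists)


for dim in (2, 10, 100, 1000):
    print(dim, round(relative_contrast(dim), 3))
```

The printed contrast shrinks as the number of dimensions grows, which is one face of the curse of dimensionality.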

