Data Mining 4: Data Exploration

Data exploration is a preliminary investigation of the data to better understand its characteristics.
Key motivations of data exploration include:

  • Helping to select the right tool for preprocessing or analysis
  • Making use of humans’ abilities to recognize patterns
    • People can recognize patterns not captured by data analysis tools

Data exploration is related to the area of Exploratory Data Analysis (EDA).

Techniques Used In Data Exploration

  • In EDA, as originally defined by Tukey
    • The focus was on visualization
    • Clustering and anomaly detection were viewed as exploratory techniques
    • In data mining, clustering and anomaly detection are major areas of interest, and not thought of as just exploratory
  • In our discussion of data exploration, we focus on
    • Summary statistics
    • Visualization
    • Online Analytical Processing (OLAP)

Summary Statistics
Summary statistics are numbers that summarize properties of the data

  • Summarized properties include frequency, location, and spread (for example, the mean measures location and the standard deviation measures spread)
  • Most summary statistics can be calculated in a single pass through the data
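As a rough illustration of the single-pass claim, the sketch below computes the count, mean, and sample variance while reading each value only once. It uses Welford's online update, which is one well-known way to do this and is not taken from the original notes; the function name and the sample data are made up for the example.

```python
import math

def single_pass_summary(values):
    """Return (count, mean, sample variance), reading each value exactly once."""
    count = 0
    mean = 0.0
    m2 = 0.0  # running sum of squared deviations from the current mean
    for x in values:
        count += 1
        delta = x - mean
        mean += delta / count
        m2 += delta * (x - mean)
    # Sample variance divides by (count - 1); undefined for fewer than two values,
    # so 0.0 is returned in that case as a simplifying assumption.
    variance = m2 / (count - 1) if count > 1 else 0.0
    return count, mean, variance

count, mean, variance = single_pass_summary([2, 4, 4, 4, 5, 5, 7, 9])
std_dev = math.sqrt(variance)
```

Statistics such as the median or other percentiles do not fit this pattern as easily, because an exact answer generally needs the values in sorted order.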

Frequency and Mode

  • The frequency of an attribute value is the percentage of time the value occurs in the data set
    • For example, given the attribute ‘sex’ and a representative population of people, the value ‘female’ occurs about 50% of the time.
  • The mode of an attribute is the most frequent attribute value
  • The notions of frequency and mode are typically used with categorical data
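A minimal sketch of both notions for a categorical attribute, using the standard library's `collections.Counter`. The attribute values below are invented for illustration.

```python
from collections import Counter

def frequencies(values):
    """Map each attribute value to the fraction of objects that have it."""
    counts = Counter(values)
    total = len(values)
    return {value: count / total for value, count in counts.items()}

def mode(values):
    """Return the most frequent value (if several tie, one of them is returned)."""
    return Counter(values).most_common(1)[0][0]

eye_color = ["brown", "blue", "brown", "green", "brown", "blue"]
freq = frequencies(eye_color)
most_common_value = mode(eye_color)
```

Here `freq["brown"]` is 0.5, that is, 'brown' occurs 50% of the time, and it is also the mode.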

Percentiles
For continuous data, the notion of a percentile is more useful.
Given an ordinal or continuous attribute x and a number p between 0 and 100, the p-th percentile is a value x_p of x such that p% of the observed values of x are less than x_p.
For instance, the 50th percentile is the value x_50 such that 50% of all values of x are less than x_50.
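Several conventions exist for computing percentiles on a finite sample. The sketch below assumes the nearest-rank convention (take the value at position ceil(p/100 · n) in sorted order), which is one simple choice and not necessarily the one intended in the notes; libraries often interpolate between neighbouring values instead.

```python
import math

def percentile_nearest_rank(values, p):
    """Return the p-th percentile of values using the nearest-rank convention."""
    if not values:
        raise ValueError("values must not be empty")
    if not 0 <= p <= 100:
        raise ValueError("p must be between 0 and 100")
    ordered = sorted(values)
    # Rank is 1-based; p = 0 is mapped to the smallest value.
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]

data = [15, 20, 35, 40, 50]
p30 = percentile_nearest_rank(data, 30)
p50 = percentile_nearest_rank(data, 50)
p100 = percentile_nearest_rank(data, 100)
```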

Measures of Location: Mean and Median
The mean is the most common measure of the location of a set of points.
However, the mean is very sensitive to outliers.
Thus, the median or a trimmed mean is also commonly used.
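The effect of an outlier on these three measures can be seen in a small example. The data are invented, and `trimmed_mean` is a helper written for this sketch that drops a given fraction of the values from each end before averaging.

```python
import statistics

def trimmed_mean(values, trim_fraction):
    """Mean after dropping trim_fraction of the sorted values from each end."""
    if not 0 <= trim_fraction < 0.5:
        raise ValueError("trim_fraction must be in [0, 0.5)")
    ordered = sorted(values)
    k = int(len(ordered) * trim_fraction)  # number of values dropped per end
    kept = ordered[k:len(ordered) - k]
    return statistics.mean(kept)

# Nine typical values and one extreme outlier.
incomes = [30, 32, 35, 36, 38, 40, 41, 43, 45, 1000]

mean_income = statistics.mean(incomes)        # pulled far upward by the outlier
median_income = statistics.median(incomes)    # middle of the sorted values
trimmed_income = trimmed_mean(incomes, 0.1)   # drops 30 and 1000, then averages
```

The mean is 134, far from any typical value, while the median (39) and the 10% trimmed mean (38.75) stay close to the bulk of the data.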

Geometric Mean
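The geometric mean indicates the central tendency or typical value of a set of positive numbers using the product of their values: it is the n-th root of the product of n values.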
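A minimal sketch of the geometric mean, computed through logarithms so that the product of many values does not overflow. The function below is written for illustration; the standard library's `statistics.geometric_mean` (Python 3.8+) is used only as a cross-check.

```python
import math
import statistics

def geometric_mean(values):
    """n-th root of the product of n positive values, computed via logarithms."""
    if any(v <= 0 for v in values):
        raise ValueError("geometric mean requires positive values")
    return math.exp(sum(math.log(v) for v in values) / len(values))

gm_pair = geometric_mean([2, 8])        # square root of 16
gm_triple = geometric_mean([1, 3, 9])   # cube root of 27
```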


Harmonic Mean
The harmonic mean is appropriate when an average of rates is desired.


The harmonic mean of precision and recall, known as the F-score (or F-measure), is often used as an aggregate performance score when evaluating algorithms and systems. It gives equal weight to precision and recall.
For positive values, arithmetic mean >= geometric mean >= harmonic mean, with equality only when all values are equal.
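The sketch below illustrates the harmonic mean on a rate example, the F-score as the harmonic mean of precision and recall, and the ordering of the three means. The numbers are invented for illustration, and `f_score` implements only the equally weighted (F1) form.

```python
import statistics

def harmonic_mean(values):
    """n divided by the sum of reciprocals; values must be positive."""
    if any(v <= 0 for v in values):
        raise ValueError("harmonic mean requires positive values")
    return len(values) / sum(1 / v for v in values)

def f_score(precision, recall):
    """Harmonic mean of precision and recall (equal weight to both)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Average speed over two legs of equal distance driven at 40 and 60 km/h.
average_speed = harmonic_mean([40, 60])

f1 = f_score(precision=0.5, recall=1.0)

# The three means of the same positive sample, for comparison.
sample = [1, 4, 16]
arithmetic = statistics.mean(sample)
geometric = statistics.geometric_mean(sample)
harmonic = harmonic_mean(sample)
```

The average speed comes out at about 48 km/h rather than the arithmetic mean of 50, because more time is spent on the slower leg.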

Measures of Spread: Range and Variance

  • Range is the difference between the max and min
  • The variance or standard deviation s_x is the most common measure of the spread of a set of points.
  • Because the variance is also sensitive to outliers, more robust measures of spread, such as the interquartile range or the median absolute deviation, are often used.
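To make the sensitivity to outliers concrete, the sketch below compares the standard deviation with two robust alternatives on a small invented data set, with and without one extreme value. The interquartile range and median absolute deviation are shown as common examples of robust measures; the quartiles come from `statistics.quantiles` with its default method, and other quartile conventions give slightly different numbers.

```python
import statistics

def value_range(values):
    """Difference between the largest and smallest value."""
    return max(values) - min(values)

def interquartile_range(values):
    """Distance between the 75th and 25th percentiles."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    return q3 - q1

def median_absolute_deviation(values):
    """Median of the absolute deviations from the median."""
    med = statistics.median(values)
    return statistics.median(abs(v - med) for v in values)

clean = [2, 4, 4, 4, 5, 5, 7, 9]
with_outlier = clean + [100]

for name, data in (("clean", clean), ("with outlier", with_outlier)):
    print(name,
          "range:", value_range(data),
          "std:", round(statistics.stdev(data), 2),
          "IQR:", interquartile_range(data),
          "MAD:", median_absolute_deviation(data))
```

Adding the single value 100 inflates the range and the standard deviation by more than a factor of ten, while the interquartile range and the median absolute deviation change only slightly.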

Coefficient of Variation (CV)
The CV is the ratio of the standard deviation to the mean. A larger CV indicates more variation relative to the mean; it can be used in clustering.
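A minimal sketch of the coefficient of variation as standard deviation divided by mean. It assumes the sample standard deviation; the heights are invented and are given in two units to show that the CV does not depend on the unit of measurement.

```python
import math
import statistics

def coefficient_of_variation(values):
    """Sample standard deviation divided by the mean (mean must be non-zero)."""
    mean = statistics.mean(values)
    if mean == 0:
        raise ValueError("coefficient of variation is undefined for a zero mean")
    return statistics.stdev(values) / mean

heights_cm = [160, 170, 180]
heights_m = [1.6, 1.7, 1.8]

cv_cm = coefficient_of_variation(heights_cm)
cv_m = coefficient_of_variation(heights_m)
```

Both calls return roughly 0.059, whereas the standard deviations themselves differ by a factor of 100.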

Representation

  • Is the mapping of information to a visual format
  • Data objects, their attributes, and the relationships among data objects are translated into graphical elements such as points, lines, shapes, and colors.
  • Example
    • Objects are often represented as points
    • Their attribute values can be represented as the position of the points or the characteristics of the points, e.g., color, size, and shape
    • If position is used, then the relationships of points, i.e., whether they form groups or a point is an outlier, are easily perceived.

Arrangement

  • Is the placement of visual elements within a display
  • Can make a large difference in how easy it is to understand the data


Selection

  • Is the elimination or the de-emphasis of certain objects and attributes
  • Selection may involve choosing a subset of attributes
  • Selection may also involve choosing a subset of objects

Visualization Techniques: Histograms
Histogram

  • Usually shows the distribution of values of a single variable
  • Divide the values into bins and show a bar plot of the number of objects in each bin.
  • The height of each bar indicates the number of objects
  • Shape of histogram depends on the number of bins
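The binning step can be sketched without a plotting library. The example below assumes equal-width bins and invented data, and prints the bars as text; it shows how the same values produce differently shaped histograms with 2 and with 4 bins.

```python
def histogram_counts(values, num_bins):
    """Count how many values fall into each of num_bins equal-width bins."""
    low, high = min(values), max(values)
    width = (high - low) / num_bins
    counts = [0] * num_bins
    for v in values:
        if width == 0:
            index = 0  # all values identical: everything goes in the first bin
        else:
            # The maximum value is placed in the last bin instead of a new one.
            index = min(int((v - low) / width), num_bins - 1)
        counts[index] += 1
    return counts

def print_histogram(counts):
    """Print one text bar per bin; the bar length is the number of objects."""
    for i, count in enumerate(counts):
        print(f"bin {i}: {'#' * count}")

data = [1, 2, 2, 3, 3, 3, 4, 4, 5, 9]
counts_2 = histogram_counts(data, 2)
counts_4 = histogram_counts(data, 4)
print_histogram(counts_2)
print_histogram(counts_4)
```

With 2 bins almost everything lands in the first bar; with 4 bins the peak around 3 and 4 and the isolated value 9 become visible.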

Visualization Techniques: Box Plots
Box Plots

  • Invented by J. Tukey
  • Another way of displaying the distribution of data
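  • A box plot typically shows the median (50th percentile) as a line inside a box whose edges are the 25th and 75th percentiles, whiskers extending toward the extremes, and outliers marked as individual points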

Example of Box Plots
Box plots can be used to compare attributes
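The numbers behind a box plot can be computed directly. The sketch below assumes one common convention that the notes do not specify: whiskers reach the most extreme values within 1.5 times the interquartile range of the box, and anything beyond is listed as an outlier. Quartiles come from `statistics.quantiles` with its default method, and the data are invented.

```python
import statistics

def box_plot_summary(values):
    """Return the quartiles, whisker ends, and outliers used to draw a box plot."""
    q1, median, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    inside = [v for v in values if lower_fence <= v <= upper_fence]
    outliers = [v for v in values if v < lower_fence or v > upper_fence]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "whisker_low": min(inside),
        "whisker_high": max(inside),
        "outliers": outliers,
    }

summary = box_plot_summary([2, 4, 4, 4, 5, 5, 7, 9, 100])
```

Computing one such summary per attribute and drawing the boxes side by side is what makes box plots convenient for comparing attributes.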


Visualization Techniques: Scatter Plots
Scatter plots

  • Attribute values determine the position
  • Two-dimensional scatter plots are most common, but three-dimensional scatter plots are also possible
  • Often additional attributes can be displayed by using the size, shape, and color of the markers that represent the objects
  • Arrays of scatter plots are useful because they can compactly summarize the relationships of several pairs of attributes

Visualization Techniques: Contour Plots
Contour plots

  • Useful when a continuous attribute is measured on a spatial grid
  • They partition the plane into regions of similar values
  • The contour lines that form the boundaries of these regions connect points with equal values
  • The most common example is contour maps of elevation
  • Can also display temperature, rainfall, air pressure, etc.
    • Sea Surface Temperature (SST) is a typical example
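The idea of partitioning a grid into regions of similar values can be sketched by assigning each grid cell to the band between two contour levels. This is only the labelling step, not the drawing of contour lines; the elevation grid and the levels are invented for illustration.

```python
import bisect

def contour_bands(grid, levels):
    """Label each grid cell with the index of the band its value falls in.

    Band 0 is below the first level, band 1 is between the first and second
    level, and so on. A value equal to a level is placed in the upper band.
    """
    ordered_levels = sorted(levels)
    return [[bisect.bisect_right(ordered_levels, value) for value in row]
            for row in grid]

elevation = [
    [0, 5, 10],
    [5, 15, 25],
    [10, 25, 40],
]
bands = contour_bands(elevation, levels=[10, 20, 30])
```

Cells that share a band index belong to the same region, and the contour lines of a plot would run along the boundaries between neighbouring cells with different indices.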
