We are drowning in data, but starving for knowledge. (John Naisbitt, 1982)
Data mining draws ideas from machine learning, statistics, and database systems.
Methods
| Descriptive methods = unsupervised | Predictive methods = supervised |
|---|---|
| no target attribute | with a target (class) attribute |
| Clustering, Association Rule Mining, Text Mining, Anomaly Detection, Sequential Pattern Mining | Classification, Regression, Text Mining, Time Series Prediction |
None of the data mining steps actually requires a computer. But computers provide scalability, and they can help avoid human bias.
Basic process:
Apply data mining method -> Evaluate resulting model / patterns -> Iterate:
– Experiment with different parameter settings
– Experiment with alternative methods
– Improve preprocessing and feature generation
– Combine different methods
Intra-cluster distances are minimized: Data points in one cluster are similar to one another.
Inter-cluster distances are maximized: Data points in separate clusters are different from each other.
Application area: Market segmentation, Document Clustering
Types:
**Clustering algorithms:** Partitional, Hierarchical, Density-based Algorithms
**Proximity (similarity, or dissimilarity) measures:** Euclidean Distance, Cosine Similarity, Domain-specific Similarity Measures
Application area: Product Grouping, Social Network Analysis, Grouping Search Engine Results, Image Recognition
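A minimal sketch of partitional clustering with K-Means, assuming scikit-learn as the library (the notes do not prescribe a specific implementation):

```python
# K-Means sketch using scikit-learn (one possible library choice).
import numpy as np
from sklearn.cluster import KMeans

# Two obvious groups of 2-D points
X = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
              [8.0, 8.0], [8.2, 7.9], [7.8, 8.1]])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(kmeans.labels_)           # cluster assignment per point
print(kmeans.cluster_centers_)  # mean of each cluster
```

Note that the number of clusters `k` has to be chosen up front, which is one of the weaknesses discussed below.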
Weakness 1: Initial Seeds
Results can vary significantly depending on the initial choice of seeds (number and position)
Remedy: run K-Means multiple times with different random seeds and keep the result with the lowest sum of squared errors
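A common remedy is to restart K-Means several times with different random seeds and keep the run with the lowest sum of squared errors (SSE). Sketched here via scikit-learn's `n_init` parameter, which is my own choice of implementation, not something the notes name:

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.array([[0, 0], [0, 1], [1, 0],
              [10, 10], [10, 11], [11, 10],
              [20, 0], [20, 1], [21, 0]], dtype=float)

# Single random initialization: result depends on the seed.
single = KMeans(n_clusters=3, n_init=1, random_state=4).fit(X)

# Remedy: 20 restarts, keep the clustering with the lowest SSE (inertia).
multi = KMeans(n_clusters=3, n_init=20, random_state=4).fit(X)

print(single.inertia_, multi.inertia_)
```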
Weakness 2: Outlier Handling
Remedy: remove outliers beforehand, or use K-Medoids, which is less sensitive to outliers
Evaluation
maximize Cohesion & Separation
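Cohesion and separation can be summarized in a single number, for example the silhouette coefficient; shown here with scikit-learn as one possible implementation:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

X = np.array([[1, 1], [1.1, 0.9], [0.9, 1.2],
              [9, 9], [9.1, 8.9], [8.9, 9.2]])

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# Silhouette lies in [-1, 1]: high cohesion and high separation
# push it towards 1.
score = silhouette_score(X, labels)
print(round(score, 3))
```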
Summary
Advantages: Simple; Efficient time complexity: O(tkn) [n: number of data points, k: number of clusters, t: number of iterations]
Disadvantages: Must pick the number of clusters beforehand; All items are forced into a cluster; Sensitive to outliers; Sensitive to initial seeds
K-Medoids is a K-Means variation that uses the medoid of each cluster instead of the mean.
Medoids are the most central existing data points in each cluster.
K-Medoids is more robust against outliers, as the medoid is not affected by extreme values.
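A medoid can be found by brute force: pick the point whose summed distance to all other points in the cluster is smallest. A plain-NumPy sketch (dedicated K-Medoids implementations exist, e.g. in third-party packages, which the notes do not mention):

```python
import numpy as np

def medoid(points):
    """Return the point with minimal total distance to all other points."""
    dists = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    return points[dists.sum(axis=1).argmin()]

cluster = np.array([[0.0, 0.0], [1.0, 0.0], [2.0, 0.0],
                    [3.0, 0.0], [50.0, 0.0]])  # one extreme outlier

# The mean (11.2, 0) lies far outside the dense region;
# the medoid is always an actual data point.
print(medoid(cluster))
```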
DBSCAN: Density-Based Spatial Clustering of Applications with Noise
density-based algorithm
Density = number of points within a specified radius (Eps)
DBSCAN Algorithm: Eliminate noise points -> Perform clustering on the remaining points
Advantages: Resistant to Noise + Can handle clusters of different shapes and sizes
Disadvantages: Varying densities + High-dimensional data
Determining Eps and MinPts?
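A common heuristic (not spelled out in the notes) is to sort each point's distance to its k-th nearest neighbor and pick Eps at the "knee" of that curve, with MinPts set to k+1. A DBSCAN sketch, assuming scikit-learn:

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.neighbors import NearestNeighbors

X = np.vstack([
    np.random.RandomState(0).normal(0, 0.3, (30, 2)),  # dense blob
    np.random.RandomState(1).normal(5, 0.3, (30, 2)),  # second blob
    [[20.0, 20.0]],                                    # isolated noise point
])

# k-distance values: sort them and look for the "knee" to choose Eps.
k = 4
knn = NearestNeighbors(n_neighbors=k + 1).fit(X)
k_dist = np.sort(knn.kneighbors(X)[0][:, -1])

db = DBSCAN(eps=1.0, min_samples=k + 1).fit(X)
print(sorted(set(db.labels_.tolist())))  # label -1 marks noise
```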
Produces a set of nested clusters organized as a hierarchical tree. Can be visualized as a Dendrogram. (A tree like diagram that records the sequences of merges or splits. The y-axis displays the former distance between merged clusters.)
Advantages: We do not have to assume any particular number of clusters. May be used to look for meaningful taxonomies.
Steps:
Starting Situation: Start with clusters of individual points and a proximity matrix
Intermediate Situation: After some merging steps, we have a number of clusters.
How to Define Inter-Cluster Similarity? Single Link (MIN), Complete Link (MAX), Group Average, Distance Between Centroids
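In SciPy (one possible implementation, my own choice), the inter-cluster similarity definition is the `method` parameter of `linkage`, and the resulting matrix records the merge sequence that a dendrogram would display:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
              [5.0, 5.0], [5.1, 5.2], [5.2, 5.1]])

# method = 'single' (MIN), 'complete' (MAX), 'average', 'centroid', 'ward'
Z = linkage(X, method='average')

# Cut the tree into two flat clusters.
labels = fcluster(Z, t=2, criterion='maxclust')
print(labels)
```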
Limitations: merging decisions are final and cannot be undone; does not scale well (O(n²) space, typically O(n³) time)
Single Attributes: Similarity [0,1] and Dissimilarity [0, upper limit varies]
Many Attributes:
Euclidean Distance
Caution
We are easily comparing apples and oranges: changing the units of measurement changes the clustering result.
Recommendation: use normalization before clustering (generally, for all data mining algorithms involving distances)
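A sketch of why normalization matters: rescaling one attribute (metres vs. centimetres) changes the Euclidean distances, while z-score standardization removes the unit dependence. Shown with scikit-learn's `StandardScaler` as one possible choice:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Attribute 0: weight in kg; attribute 1: height in metres vs. centimetres
height_m  = np.array([[70.0, 1.70], [80.0, 1.80]])
height_cm = np.array([[70.0, 170.0], [80.0, 180.0]])

def euclid(a, b):
    return float(np.linalg.norm(a - b))

# Same two people, different units -> very different distances
print(euclid(height_m[0], height_m[1]))    # weight dominates
print(euclid(height_cm[0], height_cm[1]))  # height dominates

# After z-score normalization, both versions give identical distances
d1 = euclid(*StandardScaler().fit_transform(height_m))
d2 = euclid(*StandardScaler().fit_transform(height_cm))
print(d1, d2)
```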
Similarity of Binary Attributes
Common situation is that objects, p and q, have only binary attributes
1. Symmetric Binary Attributes -> hobbies, favorite bands, favorite movies
A binary attribute is symmetric if both of its states (0 and 1) have
equal importance, and carry the same weights
Similarity measure: Simple Matching Coefficient
2. Asymmetric Binary Attributes -> (dis-)agreement with political statements, recommendation for voting
Asymmetric: one of the states is more important or more valuable than the other. By convention, state 1 represents the more important state; 1 is typically the rare or infrequent state. Example: Shopping Basket, Word/Document Vector
Similarity measure: Jaccard Coefficient
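Both coefficients count the four match/mismatch frequencies between two binary vectors; for asymmetric attributes the 0/0 matches are ignored (the Jaccard coefficient). A plain-Python sketch:

```python
def smc(p, q):
    """Simple Matching Coefficient: matching attributes / all attributes."""
    matches = sum(a == b for a, b in zip(p, q))
    return matches / len(p)

def jaccard(p, q):
    """Jaccard coefficient: ignore attributes where both objects are 0."""
    m11 = sum(a == 1 and b == 1 for a, b in zip(p, q))
    nonzero = sum(a == 1 or b == 1 for a, b in zip(p, q))
    return m11 / nonzero if nonzero else 0.0

# Two shopping baskets over a 10-item catalogue
p = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
q = [0, 0, 0, 0, 0, 0, 1, 0, 0, 1]
print(smc(p, q))      # the many shared 0s inflate the similarity
print(jaccard(p, q))  # no item was bought by both customers
```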
Given a set of records, each of which contains some number of items from a given collection, produce dependency rules that predict the occurrence of an item based on the occurrences of other items.
Application area: Marketing and Sales Promotion, Content-based recommendation, Customer loyalty programs
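The core quantities behind such rules are support and confidence; a plain-Python sketch on a toy basket data set (item names invented for illustration):

```python
baskets = [
    {"bread", "milk"},
    {"bread", "diapers", "beer", "eggs"},
    {"milk", "diapers", "beer", "cola"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "cola"},
]

def support(itemset):
    """Fraction of baskets containing all items of the itemset."""
    return sum(itemset <= b for b in baskets) / len(baskets)

def confidence(lhs, rhs):
    """Of the baskets containing lhs, how many also contain rhs?"""
    return support(lhs | rhs) / support(lhs)

# Rule {diapers} -> {beer}
print(support({"diapers", "beer"}))       # 3 of 5 baskets
print(confidence({"diapers"}, {"beer"}))  # 3 of the 4 diaper baskets
```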