Data Mining ch1

What is Big Data?
“Big data is high-volume, high-velocity and high-variety information assets that demand cost-effective, innovative forms of information processing for enhanced insight and decision making.” — Gartner

“Big data refers to datasets whose size is beyond the ability of typical database software tools to capture, store, manage, and analyze.” — McKinsey & Company

[Figure 1]

Data mining
People have been analysing and investigating data for centuries.

Statistics
Mean, Variance, Correlation, Distribution …

Today, data are often far beyond human comprehension.
Diversity, Volume, Dimensionality

Definition
Data Mining is the process of automatically extracting interesting and useful hidden patterns from usually massive, incomplete and noisy data.

Not a fully automatic process
Human intervention is often unavoidable.
Domain Knowledge
Data Collection and Pre-processing

Synonym: Knowledge Discovery

[Figure 2]

Data Integration & Analysis

[Figure 3]

Process of Data Mining

[Figure 4]

DM Techniques - Classification
“Classification is a procedure in which individual items are placed into groups based on quantitative information on one or more characteristics (referred to as variables) and based on a training set of previously labeled items.”

Given a training set {(x1, y1), …, (xn, yn)}, produce a classifier (function) that maps any previously unseen object x to its class label y.

Algorithms
Decision Trees
K-Nearest Neighbours
Neural Networks
Support Vector Machines
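
As a rough illustration of the training-set formulation above, here is a minimal K-Nearest Neighbours sketch. The function name knn_predict, the default k=3, the use of Euclidean distance and the toy data are all assumptions made for this example, not something the notes specify.

```python
import math
from collections import Counter


def knn_predict(training_set, x, k=3):
    """Predict the class label of x by majority vote among its k nearest neighbours.

    training_set is a list of (features, label) pairs, i.e. {(x1, y1), ..., (xn, yn)}.
    """
    # Sort the labeled items by Euclidean distance to the query and keep the k closest.
    neighbours = sorted(training_set, key=lambda item: math.dist(item[0], x))[:k]
    # Majority vote over the labels of those k items.
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]


# Hypothetical toy data: two well-separated groups in 2-D.
training_set = [
    ((1.0, 1.0), "A"), ((1.2, 0.8), "A"), ((0.9, 1.1), "A"),
    ((5.0, 5.0), "B"), ((5.2, 4.9), "B"), ((4.8, 5.1), "B"),
]

print(knn_predict(training_set, (1.1, 1.0)))
print(knn_predict(training_set, (5.1, 5.0)))
```

There is no separate training step here: the "classifier" is simply the stored training set plus the voting rule, which is why K-Nearest Neighbours is often a reasonable first baseline.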

Applications
Churn Prediction
Medical Diagnosis

Classification Boundaries

[Figure 5]

Overfitting – Classification

[Figure 6]

Confusion Matrix

[Figure 7]

TPR=TP/(TP+FN)

TNR=TN/(TN+FP)

Accuracy=(TP+TN)/(P+N), where P=TP+FN and N=TN+FP
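
The three rates above can be computed directly from true and predicted labels. The following is a small sketch; the helper names confusion_counts and rates, the 0/1 label encoding and the sample labels are assumptions for illustration.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count TP, FN, TN and FP for a binary problem."""
    tp = fn = tn = fp = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == positive:
            if pred == positive:
                tp += 1
            else:
                fn += 1
        else:
            if pred == positive:
                fp += 1
            else:
                tn += 1
    return tp, fn, tn, fp


def rates(tp, fn, tn, fp):
    """Return (TPR, TNR, accuracy); assumes at least one positive and one negative item."""
    tpr = tp / (tp + fn)                        # true positive rate
    tnr = tn / (tn + fp)                        # true negative rate
    accuracy = (tp + tn) / (tp + fn + tn + fp)  # denominator is the total item count
    return tpr, tnr, accuracy


# Hypothetical labels: 4 positives and 6 negatives.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]

print(confusion_counts(y_true, y_pred))
print(rates(*confusion_counts(y_true, y_pred)))
```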

Receiver Operating Characteristic


[Figure 8]

DM Techniques - Clustering
“Clustering is the assignment of a set of observations into subsets (called clusters) so that observations in the same cluster are similar in some sense.”

Distance Metrics
Euclidean Distance
Manhattan Distance
Mahalanobis Distance
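
The three metrics listed above can be written in a few lines each. This is a sketch using only the standard library; the Mahalanobis version assumes the inverse covariance matrix has already been computed and is passed in as a list of rows (the parameter name inv_cov is made up for this example).

```python
import math


def euclidean(x, y):
    """Straight-line distance: sqrt(sum((xi - yi)^2))."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def manhattan(x, y):
    """City-block distance: sum(|xi - yi|)."""
    return sum(abs(a - b) for a, b in zip(x, y))


def mahalanobis(x, y, inv_cov):
    """sqrt(d^T * inv_cov * d) with d = x - y.

    inv_cov is the inverse covariance matrix (assumed precomputed).
    """
    d = [a - b for a, b in zip(x, y)]
    n = len(d)
    quad = sum(d[i] * inv_cov[i][j] * d[j] for i in range(n) for j in range(n))
    return math.sqrt(quad)


identity = [[1.0, 0.0], [0.0, 1.0]]
print(euclidean((0, 0), (3, 4)))
print(manhattan((0, 0), (3, 4)))
print(mahalanobis((0, 0), (3, 4), identity))
```

With the identity matrix as inverse covariance, the Mahalanobis distance reduces to the Euclidean distance; with a non-trivial covariance it rescales each direction by how much the data vary along it.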

Algorithms
K-Means
Sequential Leader
Affinity Propagation
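
To make the first algorithm in this list concrete, here is a minimal K-Means sketch. Several choices are assumptions made for the example rather than part of the notes: the first k points serve as initial centroids (real implementations usually randomize this), Euclidean distance is used, and an empty cluster keeps its previous centroid.

```python
import math


def k_means(points, k, max_iter=100):
    """Alternate between assigning points to the nearest centroid and recomputing centroids."""
    # Assumption: deterministic initialization with the first k points.
    centroids = [tuple(p) for p in points[:k]]
    assignment = None
    for _ in range(max_iter):
        # Assignment step: index of the closest centroid for every point.
        new_assignment = [
            min(range(k), key=lambda c: math.dist(p, centroids[c])) for p in points
        ]
        if new_assignment == assignment:
            break  # converged: no point changed cluster
        assignment = new_assignment
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # an empty cluster keeps its old centroid
                centroids[c] = tuple(sum(v) / len(members) for v in zip(*members))
    return centroids, assignment


# Hypothetical toy data with two obvious groups.
points = [(1.0, 1.0), (5.0, 5.0), (1.2, 0.8), (5.2, 4.8), (0.8, 1.2), (4.8, 5.2)]
centroids, assignment = k_means(points, k=2)
print(centroids)
print(assignment)
```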

Applications
Market Research
Image Segmentation
Social Network Analysis

[Figure 9]

Hierarchical Clustering

[Figure 10]

DM Techniques – Association Rule

[Figure 11]

DM Techniques – Regression

[Figure 12]
[Figure 13]

Overfitting – Regression

[Figure 14]

Data Preprocessing
Real data are often surprisingly dirty.
A Major Challenge for Data Mining

Typical Issues
Missing Attribute Values
Different Coding/Naming Schemes
Infeasible Values
Inconsistent Data
Outliers

Data Quality
Accuracy
Completeness
Consistency
Interpretability
Credibility
Timeliness

[Figure 15]

Data Cleaning
Fill in missing values.
Correct inconsistent data.
Identify outliers and noisy data.
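
Two of these cleaning steps can be sketched with the standard library. Mean imputation and the 1.5 × IQR rule are only one common choice each, picked here as assumptions; the function names and sample values are likewise made up for illustration.

```python
import statistics


def fill_missing_with_mean(values):
    """Replace None entries with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    mean = statistics.mean(observed)
    return [mean if v is None else v for v in values]


def flag_outliers_iqr(values, factor=1.5):
    """Return the values outside [Q1 - factor*IQR, Q3 + factor*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - factor * iqr, q3 + factor * iqr
    return [v for v in values if v < low or v > high]


# Hypothetical "age" column with one missing entry and one suspicious value.
print(fill_missing_with_mean([1.0, None, 3.0]))
print(flag_outliers_iqr([23, 24, 25, 26, 27, 120]))
```

Order matters in practice: an outlier that is still present when the mean is computed will distort the imputed value, so it is often worth flagging outliers before filling gaps.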

Data Integration
Combine data from different sources.

Data Transformation
Normalization
Aggregation
Type Conversion
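
For the normalization item, two widely used variants are min-max scaling and z-score standardization. The sketch below covers only those two (not aggregation or type conversion); the function names are assumptions, and both functions assume the input is not constant.

```python
import statistics


def min_max_normalize(values):
    """Rescale values linearly to the range [0, 1]; assumes max(values) > min(values)."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


def z_score_normalize(values):
    """Shift to zero mean and scale to unit (population) standard deviation."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)  # assumes std > 0
    return [(v - mean) / std for v in values]


print(min_max_normalize([10, 20, 30]))
print(z_score_normalize([10, 20, 30]))
```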

Data Reduction
Feature Selection
Sampling
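
Both reduction ideas can be illustrated briefly. In this sketch, sampling is simple random sampling without replacement, and feature selection is a plain variance filter that drops columns that never change. Both are assumptions chosen for simplicity; the notes do not say which methods are meant.

```python
import random
import statistics


def sample_rows(rows, n, seed=0):
    """Draw n rows without replacement; a fixed seed keeps the result reproducible."""
    rng = random.Random(seed)
    return rng.sample(rows, n)


def select_features_by_variance(rows, threshold=0.0):
    """Return the indices of columns whose population variance exceeds threshold."""
    columns = list(zip(*rows))
    return [i for i, col in enumerate(columns) if statistics.pvariance(col) > threshold]


# Hypothetical data: the middle column is constant and carries no information.
rows = [(1, 0, 5), (2, 0, 5), (3, 0, 6), (4, 0, 7)]
print(sample_rows(rows, 2))
print(select_features_by_variance(rows))
```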

Privacy Protection
Data: A Double-Edged Sword
People can benefit greatly from data analysis.
The consequence of information leakage can be catastrophic.

People may be reluctant to give sensitive information due to privacy concerns.
Drug use, taxes, sexuality …

How can we find out the percentage of people with a certain attribute?
The interviewer should not know the true answer of each respondent.

Randomized Response
Used in structured survey research.
Can maintain the confidentiality of respondents.
Two questions are presented:
Q1: I have the attribute A.
Q2: I do not have the attribute A.

The respondent uses a random device to:
Answer Q1 with probability p.
Answer Q2 with probability 1-p.
The interviewer does not know which question was answered.
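
The scheme above still lets the interviewer estimate the overall percentage. If π is the true proportion with attribute A, a truthful respondent answers "yes" with probability λ = pπ + (1 - p)(1 - π), so π = (λ + p - 1) / (2p - 1) whenever p ≠ 0.5. The sketch below applies that formula and checks it with a simulation; the function names, the seed and the parameter values are assumptions for illustration.

```python
import random


def estimate_proportion(yes_fraction, p):
    """Recover the proportion with attribute A from the observed fraction of "yes" answers."""
    if p == 0.5:
        raise ValueError("p must differ from 0.5, otherwise the answers carry no information")
    return (yes_fraction + p - 1) / (2 * p - 1)


def simulate_survey(true_proportion, p, n, seed=0):
    """Simulate n truthful respondents and return the fraction who answer "yes"."""
    rng = random.Random(seed)
    yes = 0
    for _ in range(n):
        has_attribute = rng.random() < true_proportion
        answers_q1 = rng.random() < p
        # Q1 ("I have A") gets "yes" from holders of A; Q2 ("I do not have A") from the rest.
        said_yes = has_attribute if answers_q1 else not has_attribute
        yes += said_yes
    return yes / n


observed = simulate_survey(true_proportion=0.2, p=0.7, n=100_000)
print(estimate_proportion(observed, p=0.7))
```

The closer p is to 0.5, the better each respondent is protected, but the noisier the estimate becomes, because the denominator 2p - 1 shrinks toward zero.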

[Figure 16]

Cloud Computing

[Figure 17]
[Figure 18]

Why bother with so many different algorithms?

No algorithm is always superior to others.

No parameter setting is optimal over all problems.

Look for the best match between problem and algorithm.
Experience
Trial and Error

Factors to consider:
Applicability
Computational Complexity
Interpretability

Always start with the simple algorithms.

Grouping

[Figure 19]
