Note: This outline is compiled from the lectures of the graduate course "Data Mining" at the School of Computer Science, University of Electronic Science and Technology of China (电子科技大学). The course is taught in English by Prof. 邵俊明 (Junming Shao).
Chapter 1
Definition of data mining
Data mining consists of applying data analysis and discovery algorithms that, under acceptable computational efficiency limitations, produce a particular enumeration of patterns over the data
Key factors:
Data storage
Data availability
Computation power
Applications:
Target marketing: Find clusters of “model” customers who share the same characteristics: interest, income level, spending habits, etc.
Cross-market analysis: Find associations/correlations between product sales, and predict based on such associations.
Resource planning: summarize and compare the resources and spending
Fraud detection
Tasks:
Association rule mining
Cluster analysis
Classification
Outlier detection
Directions (the four Vs):
Volume (Scale of Data)
Velocity (Data Stream)
Variety (different formats of data, different sources)
Veracity (uncertainty, missing values)
Chapter 2
Nearest Neighbor
Predict the class label of a test instance by a majority vote among its k nearest neighbors (a minimal sketch follows).
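A minimal from-scratch k-NN sketch, assuming only numpy; the toy dataset and the choice k=3 are illustrative:

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_test, k=3):
    # Euclidean distance from the test instance to every training instance
    dists = np.linalg.norm(X_train - x_test, axis=1)
    # Indices of the k nearest neighbors
    nearest = np.argsort(dists)[:k]
    # Majority vote over the neighbors' class labels
    return Counter(y_train[nearest]).most_common(1)[0][0]

# Toy 2-D data with two classes (illustrative)
X_train = np.array([[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]])
y_train = np.array([0, 0, 0, 1, 1, 1])
print(knn_predict(X_train, y_train, np.array([0.5, 0.5])))  # -> 0
```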
SVM Kernel
Ensemble learning
bagging: random forest
boosting: adaboost
stacking
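A short sketch of the three ensemble strategies using scikit-learn (the estimator classes are real sklearn APIs; the synthetic dataset and hyperparameters are illustrative assumptions):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (RandomForestClassifier, AdaBoostClassifier,
                              StackingClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

models = {
    "bagging (random forest)": RandomForestClassifier(random_state=0),
    "boosting (AdaBoost)": AdaBoostClassifier(random_state=0),
    "stacking": StackingClassifier(
        estimators=[("rf", RandomForestClassifier(random_state=0)),
                    ("ada", AdaBoostClassifier(random_state=0))],
        final_estimator=LogisticRegression()),  # meta-learner on base outputs
}
for name, clf in models.items():
    print(name, clf.fit(X_tr, y_tr).score(X_te, y_te))
```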
Chapter 3
Why do we need Hashing?
Challenge in big data applications:
Curse of dimensionality
Storage cost
Query speed
Examples:
Information retrieval
Storage cost
Fast nearest neighbor search
Three steps for finding similar documents (a shingling sketch follows the list):
shingling
Min hashing
Locality-sensitive hashing
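A minimal sketch of step 1, shingling; the shingle length k=5 and hashing shingles to integers are illustrative choices:

```python
def shingles(doc, k=5):
    # Set of all length-k character substrings (k-shingles),
    # hashed to integers to save space
    return {hash(doc[i:i + k]) for i in range(len(doc) - k + 1)}

s1 = shingles("the quick brown fox")
s2 = shingles("the quick brown dog")
# Jaccard similarity of the two shingle sets
print(len(s1 & s2) / len(s1 | s2))
```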
Min-hashing
Compute signatures of columns = small summaries of columns.
Examine pairs of signatures to find similar signatures.
(Optional) check that columns with similar signatures are really similar.
Use several (e.g., 100) independent hash functions to create a signature.
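A from-scratch sketch of min-hash signatures; the universal hash family h(x) = (a·x + b) mod p and the toy shingle sets are illustrative assumptions:

```python
import random

def minhash_signature(shingle_set, num_hashes=100, p=2**61 - 1):
    rng = random.Random(42)
    # Each (a, b) pair defines one hash function h(x) = (a*x + b) mod p
    params = [(rng.randrange(1, p), rng.randrange(p)) for _ in range(num_hashes)]
    # Signature = minimum hash value over the set, under each function
    return [min((a * x + b) % p for x in shingle_set) for a, b in params]

sig1 = minhash_signature({1, 2, 3, 4})
sig2 = minhash_signature({1, 2, 3, 5})
# Fraction of agreeing signature positions estimates Jaccard similarity
print(sum(a == b for a, b in zip(sig1, sig2)) / len(sig1))
```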
Locality-sensitive hashing
General idea: Use a function f(x, y) that tells whether or not x and y are a candidate pair: a pair of elements whose similarity must be evaluated.
For minhash matrices: Hash columns to many buckets, and make elements of the same bucket candidate pairs.
LSH for min-hash signatures
Matrix M is the matrix of signatures; divide M into b bands of r rows each.
For each band, hash its portion of each column to a hash table with k buckets.
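A minimal banding sketch; bucketing each band slice with Python's built-in hash of the tuple is an illustrative choice, and b=20, r=5 (signature length 100) are assumed parameters:

```python
from collections import defaultdict
from itertools import combinations

def lsh_candidates(signatures, b=20, r=5):
    # signatures: {doc_id: signature list of length b*r}
    candidates = set()
    for band in range(b):
        buckets = defaultdict(list)
        for doc_id, sig in signatures.items():
            # Hash this band's r-row slice of the signature to a bucket
            key = hash(tuple(sig[band * r:(band + 1) * r]))
            buckets[key].append(doc_id)
        # Docs sharing a bucket in any band become candidate pairs
        for docs in buckets.values():
            candidates.update(combinations(sorted(docs), 2))
    return candidates

sigs = {"d1": list(range(100)), "d2": list(range(100)), "d3": [7] * 100}
print(lsh_candidates(sigs))  # {('d1', 'd2')}
```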
Tradeoff
Pick the number of minhashes, the number of bands b, and the number of rows r per band to balance false positives and false negatives: two columns with similarity s become a candidate pair with probability 1 − (1 − s^r)^b.
Learn to hash
PCA hashing: The basic idea is to rotate the data so as to minimize quantization loss.
Spectral hashing
Chapter 4
Definition of sampling
Given a distribution p(x), we want to draw samples that represent p(x).
Inverse transform sampling
Drawback: Usually, the inverse CDF is hard to obtain in closed form.
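A sketch for a case where the inverse CDF is known in closed form: for Exponential(λ), F⁻¹(u) = −ln(1 − u)/λ (the rate λ=2 is illustrative):

```python
import math, random

def sample_exponential(lam, n=5):
    # Inverse transform: u ~ Uniform(0,1), then x = F^{-1}(u) = -ln(1-u)/lam
    return [-math.log(1.0 - random.random()) / lam for _ in range(n)]

print(sample_exponential(2.0))
```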
Rejection sampling
Adaptive rejection sampling: applicable only if p(x) is log-concave.
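A minimal rejection-sampling sketch; the target (standard normal), the proposal Uniform(−5, 5), and the envelope constant M=4 are illustrative assumptions (M·q(x) = 0.4 dominates the normal density everywhere):

```python
import math, random

def p(x):  # target density: standard normal
    return math.exp(-x * x / 2) / math.sqrt(2 * math.pi)

def rejection_sample(n, M=4.0):
    samples = []
    while len(samples) < n:
        x = random.uniform(-5, 5)      # proposal q(x) = 1/10 on [-5, 5]
        u = random.random()
        # Accept x with probability p(x) / (M * q(x))
        if u < p(x) / (M * 0.1):
            samples.append(x)
    return samples

print(rejection_sample(5))
```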
Importance sampling: sample from a proposal q(x) and weight each sample by w(x) = p(x)/q(x) to estimate expectations under p(x) (see the sketch below).
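A sketch of importance sampling: estimate E_p[f(X)] for target p = N(0,1) with proposal q = N(1,1); both densities share the same normalizer, which cancels in the weight (the choices of p, q, and f are illustrative):

```python
import math, random

def importance_estimate(f, n=100000):
    p = lambda x: math.exp(-x * x / 2)          # target, up to a constant
    q = lambda x: math.exp(-(x - 1) ** 2 / 2)   # proposal, same constant
    total = 0.0
    for _ in range(n):
        x = random.gauss(1, 1)                  # sample from q
        total += f(x) * p(x) / q(x)             # weight by p(x)/q(x)
    return total / n

print(importance_estimate(lambda x: x * x))  # E_p[X^2] = 1
```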
Markov chain Monte Carlo(MCMC)
Detailed balance condition: π(i)P(i, j) = π(j)P(j, i)
Acceptance ratio
Drawback: the acceptance ratio can be very small, so the chain mixes slowly.
Metropolis–Hastings (MH) Sampling
Based on MCMC; rewrites the acceptance ratio as α(i→j) = min(1, π(j)Q(j→i) / (π(i)Q(i→j))), where Q is the proposal distribution.
But the acceptance ratio still isn't 100%.
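A minimal MH sketch targeting a standard normal with a Gaussian random-walk proposal; the step size is an illustrative assumption, and since the proposal is symmetric the Hastings correction Q(j→i)/Q(i→j) cancels:

```python
import math, random

def mh_sample(n, step=1.0):
    log_p = lambda x: -x * x / 2           # log target, up to a constant
    x, chain = 0.0, []
    for _ in range(n):
        x_new = x + random.gauss(0, step)  # symmetric random-walk proposal
        # Accept with probability min(1, p(x_new) / p(x))
        if math.log(random.random()) < log_p(x_new) - log_p(x):
            x = x_new
        chain.append(x)
    return chain

chain = mh_sample(10000)
print(sum(chain) / len(chain))  # should be near 0
```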
Gibbs sampling (based on MCMC)
Idea: Gibbs sampling further raises the acceptance ratio to exactly 100% by sampling each variable from its full conditional distribution.
Other features of Gibbs:
- Does not need the joint p(x) explicitly; only the full conditional distributions p(x_i | x_{-i}) are required.
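A Gibbs sketch for a bivariate normal with correlation ρ, where each full conditional is a 1-D normal; the value ρ=0.8 is illustrative:

```python
import math, random

def gibbs_bivariate_normal(n, rho=0.8):
    # For (X, Y) ~ N(0, [[1, rho], [rho, 1]]):
    # X | Y=y ~ N(rho*y, 1-rho^2), and symmetrically for Y | X=x
    x = y = 0.0
    chain = []
    sd = math.sqrt(1 - rho * rho)
    for _ in range(n):
        x = random.gauss(rho * y, sd)  # sample from p(x | y)
        y = random.gauss(rho * x, sd)  # sample from p(y | x)
        chain.append((x, y))
    return chain

print(gibbs_bivariate_normal(3))
```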
Sampling on data stream
Bernoulli Sampling
- Reservoir Sampling: does not need to know the stream size (see the sketch below).
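A sketch of reservoir sampling (Algorithm R): it maintains a uniform sample of size k over a stream of unknown length:

```python
import random

def reservoir_sample(stream, k):
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)       # fill the reservoir first
        else:
            j = random.randrange(i + 1)  # item i kept with prob k/(i+1)
            if j < k:
                reservoir[j] = item      # evict a uniformly chosen slot
    return reservoir

print(reservoir_sample(range(10**6), 5))
```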
Chapter 5
What is a data stream?
A data stream is a massive sequence of data objects with some unique features: it arrives one object at a time; it is potentially unbounded; it is subject to concept drift.
Challenges:
Single Pass Handling
Memory Limitation
Low Time Complexity
Concept Drift
What is concept drift?
The underlying probability distribution of the data changes over time.
Concept drift detection
Distribution-based detector
Error-rate based detector
Data stream classification
Typical algorithm
VFDT (Very Fast Decision Tree): a decision-tree learning system based on the Hoeffding tree algorithm, which uses the Hoeffding bound to decide when enough examples have been seen to split a node (see the sketch below)
CVFDT (Concept-adapting Very Fast Decision Tree learner)
- https://blog.csdn.net/tanhy21/article/details/53363508
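A sketch of the Hoeffding bound ε = sqrt(R² ln(1/δ) / (2n)) that VFDT uses to justify splits; the values of R, δ, and n below are illustrative:

```python
import math

def hoeffding_bound(R, delta, n):
    # With probability 1 - delta, the observed mean of a random variable
    # with range R over n samples is within eps of its true mean
    return math.sqrt(R * R * math.log(1 / delta) / (2 * n))

# Split when the gain gap between the two best attributes exceeds eps
print(hoeffding_bound(R=1.0, delta=1e-7, n=5000))
```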
Data stream clustering
Online phase: Summarize the data into memory-efficient data structures
Offline phase: Use a clustering algorithm to find the data partition (k-means, decision tree)
Framework
Chapter 6
Key node identification
Centrality
K-shell Decomposition
PageRank (see the sketch below)
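A minimal power-iteration PageRank sketch; the damping factor 0.85, iteration count, and toy adjacency list are illustrative assumptions (dangling nodes are ignored for brevity):

```python
def pagerank(graph, d=0.85, iters=50):
    # graph: {node: list of out-neighbors}
    n = len(graph)
    pr = {v: 1.0 / n for v in graph}
    for _ in range(iters):
        new = {v: (1 - d) / n for v in graph}
        for v, outs in graph.items():
            for u in outs:
                # v distributes its rank evenly among its out-links
                new[u] += d * pr[v] / len(outs)
        pr = new
    return pr

print(pagerank({"a": ["b", "c"], "b": ["c"], "c": ["a"]}))
```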
Community detection (graph clustering)
- Minimum cut: find a graph partition such that the number of edges between the two sets is minimized.
But minimum cut often returns a highly imbalanced partition (e.g., cutting off a single node).
- Normalized cut & ratio cut: RatioCut(A, B) = cut(A, B)/|A| + cut(A, B)/|B|; Ncut(A, B) = cut(A, B)/vol(A) + cut(A, B)/vol(B). Both penalize small clusters and prefer a balanced partition.
- Modularity: Q = (1/(2m)) Σ_{ij} [A_ij − (k_i k_j)/(2m)] δ(c_i, c_j); higher Q means intra-community edges are denser than expected at random.
Random walk
Multi-level clustering
Dynamic community detection: a new viewpoint for community detection; the basic idea is to simulate the change of edge distances.
View network as dynamical system (Dynamic vs. Static)
Simulate the distance dynamics based on different interaction patterns (Distance dynamics vs. Node dynamics)
All edge distances will converge, and the community structure is intuitively identified.
Chapter 7
What is Hadoop?
Hadoop is a software framework for distributed processing of large datasets across large clusters of computers.
Hadoop is an open-source implementation of Google's MapReduce.
Hadoop is based on a simple programming model called MapReduce.
Hadoop is based on a simple data model: any data will fit.
Core
Filesystems and I/O:
- Abstraction APIs
- RPC / Persistence
Avro
Cross-language serialization:
- RPC / persistence
- ~ Google ProtoBuf / FB Thrift
MapReduce
Distributed execution (batch)
- Programming model
- Scalability / fault-tolerance
HDFS
Distributed storage (read-opt.)
- Replication / scalability
- ~ Google filesystem (GFS)
ZooKeeper
Coordination service
- Locking / configuration
- ~ Google Chubby
HBase
Column-oriented, sparse store
- Batch & random access
- Row key is unique for a row
- ~ Google BigTable
Pig
Data flow language
- Procedural SQL-inspired lang.
- Execution environment
Hive
Distributed data warehouse
- SQL-like query language
- Data mgmt / query execution
Limitation of MapReduce
Inefficient for multi-pass algorithms
No efficient primitives for data sharing
State between steps goes to the distributed file system
Slow due to replication & disk storage
Spark
Apache Spark is a fast and general-purpose cluster computing system. It also supports a rich set of higher-level tools, including Spark SQL for SQL and structured data processing, MLlib for machine learning, GraphX for graph processing, and Spark Streaming for stream processing.
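A classic PySpark word-count sketch; it assumes pyspark is installed and "input.txt" is a hypothetical input path:

```python
from pyspark import SparkContext

sc = SparkContext("local", "WordCount")
counts = (sc.textFile("input.txt")               # hypothetical input path
            .flatMap(lambda line: line.split())  # split lines into words
            .map(lambda w: (w, 1))               # emit (word, 1) pairs
            .reduceByKey(lambda a, b: a + b))    # sum counts per word
print(counts.take(10))
sc.stop()
```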