Data Mining Knowledge Outline

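Note: This outline is compiled from the lectures of the graduate degree course "Data Mining" at the School of Computer Science, University of Electronic Science and Technology of China. The course is taught in English by Junming Shao (邵俊明).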

Chapter 1

Definition of data mining

Data mining consists of applying data analysis and discovery algorithms that, under acceptable computational efficiency limitations, produce a particular enumeration of patterns over the data.

Key factors:

  • Data storage

  • Data availability

  • Computation power

Application:

  • Target marketing: Find clusters of “model” customers who share the same characteristics: interest, income level, spending habits, etc.

  • Cross-market analysis: Find associations/correlations between product sales, and predict based on such associations.

  • Resource planning: Summarize and compare the resources and spending.

  • Fraud detection

Task:

  • Association rule mining

  • Cluster analysis

  • Classification

  • Outlier detection

Direction:

  • Volume (Scale of Data)

  • Velocity (Data Stream)

  • Variety (Different data formats, different sources)

  • Veracity (Uncertainty, missing values)

Chapter 2

Nearest Neighbor

Predict the class label of a test instance by a majority vote among its nearest neighbors.
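
The majority-vote rule can be sketched in a few lines. This is only an illustration: the Euclidean distance, the function name knn_predict, and the toy training set are assumptions, not taken from the lecture.

```python
import math
from collections import Counter

def knn_predict(train, query, k=3):
    """Predict the label of `query` by majority vote among its k nearest training points.

    `train` is a list of (feature_vector, label) pairs; distance is Euclidean (an assumption).
    """
    # Sort training instances by distance to the query and keep the k closest.
    neighbors = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    # Majority vote over the neighbors' labels.
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

# Hypothetical toy data: three points of class "A", two of class "B".
train = [((1.0, 1.0), "A"), ((1.2, 0.8), "A"), ((0.9, 1.1), "A"),
         ((5.0, 5.0), "B"), ((5.2, 4.9), "B")]
print(knn_predict(train, (1.1, 1.0), k=3))
```

Sorting all training points costs O(n log n) per query, which is acceptable for a sketch but is exactly the cost that the hashing methods of Chapter 3 try to avoid.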

SVM Kernel

Ensemble learning

  1. Bagging (e.g., random forest)

  2. Boosting (e.g., AdaBoost)

  3. Stacking

Chapter 3

Why do we need Hashing?

Challenge in big data applications:

  • Curse of dimensionality

  • Storage cost

  • Query speed

Examples:

  • Information retrieval

  • Storage cost

  • Fast nearest neighbor search

Three steps for finding similar documents:

  • Shingling

  • Min-hashing

  • Locality-sensitive hashing
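
As a rough illustration of the first step, the sketch below turns documents into sets of character k-shingles and compares them with Jaccard similarity, the measure that min-hashing later approximates. The function names and the choice of character (rather than word) shingles are assumptions.

```python
def shingles(text, k=3):
    """Return the set of character k-shingles of `text`."""
    return {text[i:i + k] for i in range(len(text) - k + 1)}

def jaccard(a, b):
    """Jaccard similarity |a ∩ b| / |a ∪ b| of two sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

s1 = shingles("abcab", k=2)   # {"ab", "bc", "ca"}
s2 = shingles("abcd", k=2)    # {"ab", "bc", "cd"}
print(jaccard(s1, s2))        # 2 shared shingles out of 4 distinct ones
```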

Min-hashing

  1. Compute signatures of columns = small summaries of columns.

  2. Examine pairs of signatures to find similar signatures.

  3. (Optional) check that columns with similar signatures are really similar.

Use several (e.g., 100) independent hash functions to create a signature.
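
A minimal sketch of building such a signature, assuming hash functions of the form h(x) = (a*x + b) mod p with random a and b (a common choice, not necessarily the one used in the lecture). The names make_hash_params, minhash_signature, and estimate_similarity are hypothetical.

```python
import random
import zlib

PRIME = 2_147_483_647  # Mersenne prime 2^31 - 1, used as the hash modulus

def make_hash_params(num_hashes, seed=0):
    """Draw (a, b) pairs for hash functions h(x) = (a*x + b) mod PRIME."""
    rng = random.Random(seed)
    return [(rng.randrange(1, PRIME), rng.randrange(0, PRIME)) for _ in range(num_hashes)]

def minhash_signature(items, params):
    """For each hash function, keep the minimum hash value over the (non-empty) set."""
    ids = [zlib.crc32(item.encode("utf-8")) for item in items]  # stable integer ids
    return [min((a * x + b) % PRIME for x in ids) for a, b in params]

def estimate_similarity(sig1, sig2):
    """Fraction of signature positions that agree; this estimates Jaccard similarity."""
    return sum(x == y for x, y in zip(sig1, sig2)) / len(sig1)

params = make_hash_params(100)
doc_a = {"ab", "bc", "ca", "de"}
doc_b = {"ab", "bc", "ca", "xy"}   # true Jaccard similarity with doc_a is 3/5
sig_a = minhash_signature(doc_a, params)
sig_b = minhash_signature(doc_b, params)
print(estimate_similarity(sig_a, sig_b))
```

The estimate is noisy: with 100 hash functions it typically lands within a few hundredths of the true similarity, and more hash functions tighten it at the cost of longer signatures.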

Locality-sensitive hashing

  • General idea: Use a function f(x,y) that tells whether or not x and y are a candidate pair, i.e., a pair of elements whose similarity must be evaluated.

  • For minhash matrices: Hash columns to many buckets, and make elements of the same bucket candidate pairs.

LSH for min-hash signatures

Matrix M is the matrix of signatures.

For each band, hash its portion of each column to a hash table with k buckets.
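
The banding step might look like the following sketch. It uses each band's slice of the signature directly as a dictionary key, which behaves like a hash table with no accidental collisions; a real implementation with k buckets would hash that slice instead. The function name and the toy signatures are made up for illustration.

```python
from collections import defaultdict
from itertools import combinations

def lsh_candidate_pairs(signatures, bands, rows):
    """Return pairs of column ids whose signatures agree on at least one band.

    `signatures` maps a column id to a signature list of length bands * rows.
    """
    candidates = set()
    for band in range(bands):
        buckets = defaultdict(list)
        for col_id, sig in signatures.items():
            # The band's slice of the signature acts as the bucket key.
            key = tuple(sig[band * rows:(band + 1) * rows])
            buckets[key].append(col_id)
        for cols in buckets.values():
            # Every pair of columns sharing a bucket becomes a candidate pair.
            candidates.update(combinations(sorted(cols), 2))
    return candidates

signatures = {
    "d1": [1, 2, 3, 4, 5, 6],
    "d2": [1, 2, 9, 9, 5, 6],   # agrees with d1 on the first and third band
    "d3": [7, 7, 7, 7, 7, 7],
}
print(lsh_candidate_pairs(signatures, bands=3, rows=2))
```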

Tradeoff

Pick the number of minhashes, the number of bands, and the number of rows per band to balance false positives/negatives.
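
To make the tradeoff concrete: if two columns have similarity s, and the signature is split into b bands of r rows, the probability that they become a candidate pair is 1 - (1 - s^r)^b. The sketch below evaluates this S-curve; the parameter values b = 20 and r = 5 are only an example.

```python
def candidate_probability(s, bands, rows):
    """Probability that two columns with similarity s share at least one band."""
    return 1.0 - (1.0 - s ** rows) ** bands

for s in (0.2, 0.5, 0.8):
    print(s, candidate_probability(s, bands=20, rows=5))
```

More rows per band push the curve to the right (fewer false positives), while more bands push it to the left (fewer false negatives).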

Learn to hash

  • PCA hashing: The basic idea is to rotate the data to minimize quantization loss.

  • Spectral hashing

Chapter 4

Definition of sampling

Given a distribution p(x), we want to draw samples that represent p(x).

Inverse transform sampling

Drawback: the inverse of the cumulative distribution function is usually hard to obtain.
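
When the inverse is available in closed form, the method is very short. The sketch below uses the exponential distribution as an assumed example: its CDF is F(x) = 1 - exp(-rate * x), so x = -ln(1 - u) / rate for u drawn uniformly from [0, 1).

```python
import math
import random

def sample_exponential(rate, rng):
    """Draw one sample from Exp(rate) by inverting its CDF F(x) = 1 - exp(-rate * x)."""
    u = rng.random()                    # u ~ Uniform[0, 1)
    return -math.log(1.0 - u) / rate    # x = F^{-1}(u)

rng = random.Random(0)
samples = [sample_exponential(2.0, rng) for _ in range(100_000)]
print(sum(samples) / len(samples))      # should be close to 1 / rate = 0.5
```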

Rejection sampling

Adaptive rejection sampling: applicable only if p(x) is log-concave
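
A minimal sketch of plain (non-adaptive) rejection sampling, under assumed choices: the target is the Beta(2, 2) density p(x) = 6x(1 - x), the proposal q is Uniform(0, 1), and the envelope constant is 1.5, the maximum of p.

```python
import random

def target_pdf(x):
    """Beta(2, 2) density on [0, 1]; its maximum is 1.5 at x = 0.5."""
    return 6.0 * x * (1.0 - x)

def rejection_sample(n, rng, bound=1.5):
    """Draw n samples from target_pdf using a Uniform(0, 1) proposal q(x) = 1.

    `bound` must satisfy target_pdf(x) <= bound * q(x) for all x.
    """
    samples = []
    while len(samples) < n:
        x = rng.random()                  # propose x ~ q
        u = rng.random()                  # u ~ Uniform[0, 1)
        if u * bound <= target_pdf(x):    # accept with probability p(x) / (bound * q(x))
            samples.append(x)
    return samples

rng = random.Random(0)
xs = rejection_sample(50_000, rng)
print(sum(xs) / len(xs))                  # the mean of Beta(2, 2) is 0.5
```

On average only 1 / bound of the proposals are accepted, so a loose envelope wastes most of the work.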

Importance sampling

Markov chain Monte Carlo (MCMC)

Detailed balance condition: π(i)Pij = π(j)Pji

Acceptance ratio

Drawback: the acceptance ratio can be very small, so most proposed moves are rejected.

Metropolis–Hastings (MH) Sampling

Based on MCMC, with the acceptance ratio rewritten so that proposals are accepted more often.

But the acceptance ratio is still not 100%.
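
A small sketch of a Metropolis-Hastings sampler, assuming a symmetric Gaussian random-walk proposal so that the acceptance probability reduces to min(1, p(x_new) / p(x_old)). The target (a standard normal, known only up to a constant), the step size, and the burn-in length are illustrative assumptions.

```python
import math
import random

def metropolis_hastings(log_target, n_samples, rng, step=1.0, x0=0.0):
    """Random-walk Metropolis-Hastings with a symmetric Gaussian proposal.

    Only an unnormalized log density is needed, because the normalizing
    constant cancels in the acceptance ratio.
    """
    x = x0
    samples = []
    for _ in range(n_samples):
        x_new = x + rng.gauss(0.0, step)
        log_alpha = log_target(x_new) - log_target(x)
        if math.log(1.0 - rng.random()) < log_alpha:   # accept with prob min(1, ratio)
            x = x_new
        samples.append(x)                              # on rejection, the old state repeats
    return samples

def log_std_normal(x):
    return -0.5 * x * x   # log of exp(-x^2 / 2), normalizing constant omitted

rng = random.Random(0)
chain = metropolis_hastings(log_std_normal, 50_000, rng)
kept = chain[5_000:]      # discard burn-in
mean = sum(kept) / len(kept)
var = sum((v - mean) ** 2 for v in kept) / len(kept)
print(mean, var)          # should be near 0 and 1
```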

Gibbs sampling (based on MCMC)

Idea: Gibbs sampling goes one step further and makes the acceptance ratio 100%.

Other features of Gibbs sampling:

  • Does not need p(x) itself; only the conditional distribution of each variable given the others is required
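
As an illustration, the sketch below runs a Gibbs sampler on an assumed target: a standard bivariate normal with correlation rho, whose conditionals are x | y ~ N(rho * y, 1 - rho^2) and symmetrically for y. The function name and the burn-in length are hypothetical.

```python
import math
import random

def gibbs_bivariate_normal(rho, n_samples, rng, burn_in=1_000):
    """Gibbs sampler for a standard bivariate normal with correlation rho.

    Each step draws one coordinate from its exact conditional distribution,
    so every draw is accepted.
    """
    x, y = 0.0, 0.0
    cond_std = math.sqrt(1.0 - rho * rho)
    samples = []
    for i in range(n_samples + burn_in):
        x = rng.gauss(rho * y, cond_std)   # x | y ~ N(rho * y, 1 - rho^2)
        y = rng.gauss(rho * x, cond_std)   # y | x ~ N(rho * x, 1 - rho^2)
        if i >= burn_in:
            samples.append((x, y))
    return samples

rng = random.Random(0)
pts = gibbs_bivariate_normal(0.8, 50_000, rng)
corr_est = sum(x * y for x, y in pts) / len(pts)   # E[xy] equals rho for unit variances
print(corr_est)
```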

Sampling on data stream

  • Bernoulli Sampling

  • Reservoir Sampling: does not need to know the stream size in advance
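
Reservoir sampling can be sketched as follows (this is the classic "Algorithm R"; the function name is an assumption). The first k items fill the reservoir, and the item at zero-based position i then replaces a random slot with probability k / (i + 1).

```python
import random

def reservoir_sample(stream, k, rng):
    """Keep a uniform random sample of k items from a stream of unknown length."""
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)      # fill the reservoir with the first k items
        else:
            j = rng.randint(0, i)       # uniform over 0..i inclusive
            if j < k:
                reservoir[j] = item     # happens with probability k / (i + 1)
    return reservoir

rng = random.Random(0)
print(reservoir_sample(iter(range(1_000)), k=5, rng=rng))
```

The stream is consumed in a single pass and only k items are ever held in memory, which matches the constraints on data streams discussed in Chapter 5.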

Chapter 5

What is data stream?

A data stream is a massive sequence of data objects with some unique features: objects arrive one by one, the stream is potentially unbounded, and the underlying concept may drift.

Challenges:

  • Single Pass Handling

  • Memory Limitation

  • Low Time Complexity

  • Concept Drift

What is concept drift?

The underlying probability distribution of the data changes over time.

Concept drift detection

  • Distribution-based detector

  • Error-rate based detector
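
The notes do not name a specific detector, so the following is only a sketch of one error-rate based approach, in the style of the Drift Detection Method (DDM): track the running error rate p and its standard deviation s, remember the lowest p + s seen, and signal drift when p + s rises above p_min + 3 * s_min. The class name, the 30-sample warm-up, and the synthetic outcome stream are assumptions.

```python
import math

class ErrorRateDriftDetector:
    """DDM-style detector: flags drift when the running error rate rises
    well above the lowest level observed so far."""

    def __init__(self, min_samples=30):
        self.min_samples = min_samples
        self.reset()

    def reset(self):
        self.n = 0
        self.errors = 0
        self.best = math.inf   # smallest p + s seen so far
        self.p_min = 0.0
        self.s_min = 0.0

    def update(self, is_error):
        """Feed one prediction outcome; return True if drift is signalled."""
        self.n += 1
        self.errors += int(is_error)
        p = self.errors / self.n
        s = math.sqrt(p * (1.0 - p) / self.n)
        if self.n < self.min_samples:
            return False       # too few samples for a stable estimate
        if p + s < self.best:
            self.best, self.p_min, self.s_min = p + s, p, s
        return p + s > self.p_min + 3.0 * self.s_min

# Synthetic stream: 10% errors for 200 predictions, then every prediction is wrong.
outcomes = [(i % 10 == 9) for i in range(200)] + [True] * 100
detector = ErrorRateDriftDetector()
detected = None
for i, is_error in enumerate(outcomes):
    if detector.update(is_error):
        detected = i
        break
print(detected)
```

After a drift signal, a caller would typically call reset() and retrain or replace the model.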

Data stream classification

Typical algorithm

  • VFDT (Very Fast Decision Tree): A decision-tree learning system based on the Hoeffding tree algorithm

  • CVFDT (Concept-adapting Very Fast Decision Tree learner)
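
The Hoeffding tree idea rests on the Hoeffding bound: after n observations of a variable with range R, the observed mean is within epsilon = sqrt(R^2 * ln(1/delta) / (2n)) of the true mean with probability 1 - delta. A rough sketch of the resulting split test follows; the function names are assumptions, and the tie-breaking rule used by VFDT is omitted.

```python
import math

def hoeffding_bound(value_range, delta, n):
    """Epsilon such that, with probability 1 - delta, the true mean of a variable
    with range `value_range` lies within epsilon of the mean of n observations."""
    return math.sqrt(value_range ** 2 * math.log(1.0 / delta) / (2.0 * n))

def should_split(best_gain, second_gain, value_range, delta, n):
    """Split a leaf once the gap between the two best attributes exceeds epsilon."""
    return (best_gain - second_gain) > hoeffding_bound(value_range, delta, n)

# Illustrative numbers: the gain gap is 0.15, the range is 1, delta is 1e-7.
print(should_split(0.30, 0.15, value_range=1.0, delta=1e-7, n=100))    # too few examples
print(should_split(0.30, 0.15, value_range=1.0, delta=1e-7, n=1000))   # enough examples
```

Because epsilon shrinks as n grows, a leaf only needs to see enough examples to be confident about the best attribute, which is what allows the tree to be built in a single pass.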

VFDT

CVFDT

  • https://blog.csdn.net/tanhy21/article/details/53363508

Data stream clustering

  • Online phase: Summarize the data into memory-efficient data structures

  • Offline phase: Use a clustering algorithm to find the data partition (k-means, decision tree)
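
One memory-efficient summary that could serve the online phase is a cluster feature vector (count, linear sum, squared sum), as used in BIRCH and CluStream; whether the lecture uses exactly this structure is an assumption. The sketch below shows that the centroid and radius of the absorbed points can be recovered without storing the points themselves.

```python
import math

class MicroCluster:
    """Cluster feature (n, linear sum, squared sum) that summarizes points in O(d) memory."""

    def __init__(self, dim):
        self.n = 0
        self.linear_sum = [0.0] * dim
        self.squared_sum = [0.0] * dim

    def insert(self, point):
        """Absorb one point by updating the three summary statistics."""
        self.n += 1
        for i, v in enumerate(point):
            self.linear_sum[i] += v
            self.squared_sum[i] += v * v

    def centroid(self):
        return [s / self.n for s in self.linear_sum]

    def radius(self):
        """Root-mean-square deviation of the absorbed points from the centroid."""
        var = sum(sq / self.n - (ls / self.n) ** 2
                  for ls, sq in zip(self.linear_sum, self.squared_sum))
        return math.sqrt(max(var, 0.0))

mc = MicroCluster(dim=2)
for p in [(1.0, 1.0), (3.0, 1.0), (1.0, 3.0), (3.0, 3.0)]:
    mc.insert(p)
print(mc.centroid(), mc.radius())
```

In the offline phase, the centroids of many such micro-clusters would be handed to a conventional clustering algorithm in place of the raw stream.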

Framework

Chapter 6

Key node identification

  • Centrality

  • K-shell Decomposition

  • PageRank

PageRank
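
A compact power-iteration sketch of PageRank. The damping factor 0.85, the handling of dangling nodes, the function name, and the four-node graph are all assumptions made for illustration.

```python
def pagerank(graph, damping=0.85, tol=1e-10, max_iter=200):
    """Power-iteration PageRank. `graph` maps each node to a list of out-neighbors;
    every neighbor must itself be a key of `graph`."""
    nodes = list(graph)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(max_iter):
        # Rank held by dangling nodes (no out-links) is spread uniformly.
        dangling = sum(rank[v] for v in nodes if not graph[v])
        new_rank = {v: (1.0 - damping) / n + damping * dangling / n for v in nodes}
        for v in nodes:
            for w in graph[v]:
                # Each node passes its rank in equal shares along its out-links.
                new_rank[w] += damping * rank[v] / len(graph[v])
        converged = sum(abs(new_rank[v] - rank[v]) for v in nodes) < tol
        rank = new_rank
        if converged:
            break
    return rank

graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
ranks = pagerank(graph)
print(max(ranks, key=ranks.get))   # the node with the highest rank
```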

Community detection (graph clustering)

  • Minimum cut: find a graph partition such that the number of edges between the two sets is minimized.

    However, minimum cut tends to return an imbalanced partition.

  • Normalized cut & ratio cut: both prefer a balanced partition.

  • Modularity

  • Random walk

  • Multi-level clustering

  • Dynamic community detection: a new viewpoint for community detection. The basic idea is to simulate the change of edge distances.

    • View network as dynamical system (Dynamic vs. Static)

    • Simulate the distance dynamics based on different interaction patterns (Distance dynamics vs. Node dynamics)

    • All edge distances will converge, and the community structure is intuitively identified.
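
The cut-based criteria and modularity listed above can be compared on a toy graph. The sketch assumes the usual definitions: ratio cut divides the cut size by the number of nodes on each side, normalized cut divides it by the sum of degrees on each side, and modularity is Q = sum over communities of [e_c / m - (d_c / 2m)^2]. The graph (two triangles joined by a bridge, plus a pendant node 7) is made up for illustration.

```python
def cut_size(edges, part_a):
    """Number of edges with exactly one endpoint in part_a."""
    return sum((u in part_a) != (v in part_a) for u, v in edges)

def ratio_cut(edges, part_a, part_b):
    c = cut_size(edges, part_a)
    return c / len(part_a) + c / len(part_b)

def normalized_cut(edges, part_a, part_b):
    c = cut_size(edges, part_a)
    vol_a = sum((u in part_a) + (v in part_a) for u, v in edges)   # sum of degrees in A
    vol_b = sum((u in part_b) + (v in part_b) for u, v in edges)   # sum of degrees in B
    return c / vol_a + c / vol_b

def modularity(edges, communities):
    """Q = sum over communities of (internal edges / m) - (total degree / 2m)^2."""
    m = len(edges)
    q = 0.0
    for comm in communities:
        inside = sum(u in comm and v in comm for u, v in edges)
        degree = sum((u in comm) + (v in comm) for u, v in edges)
        q += inside / m - (degree / (2.0 * m)) ** 2
    return q

edges = [(1, 2), (2, 3), (1, 3), (4, 5), (5, 6), (4, 6), (3, 4), (6, 7)]
balanced = ({1, 2, 3}, {4, 5, 6, 7})
pendant = ({7}, {1, 2, 3, 4, 5, 6})

for name, (a, b) in (("balanced", balanced), ("pendant", pendant)):
    print(name, cut_size(edges, a), ratio_cut(edges, a, b), normalized_cut(edges, a, b))
print(modularity(edges, balanced))
```

Both partitions cut exactly one edge, so minimum cut cannot tell them apart, while ratio cut and normalized cut give the balanced partition the lower (better) score.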

Chapter 7

What is Hadoop?

  • Hadoop is a software framework for distributed processing of large datasets across large clusters of computers.

  • Hadoop is an open-source implementation of Google's MapReduce

  • Hadoop is based on a simple programming model called MapReduce

  • Hadoop is based on a simple data model; any data will fit
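
The MapReduce programming model can be illustrated with the classic word-count example. The sketch below simulates the map, shuffle, and reduce phases in a single Python process; it does not use the Hadoop API, and the function names are hypothetical.

```python
from collections import defaultdict

def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one input record."""
    return [(word.lower(), 1) for word in document.split()]

def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: combine the values collected for one key."""
    return key, sum(values)

documents = ["big data big clusters", "data mining"]
pairs = [pair for doc in documents for pair in map_phase(doc)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
print(counts)
```

In a real cluster the map and reduce calls run in parallel on different machines, and the framework performs the shuffle over the network.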

Core
Filesystems and I/O:

  • Abstraction APIs
  • RPC / Persistence

Avro
Cross-language serialization:

  • RPC / persistence
  • ~ Google ProtoBuf / FB Thrift

MapReduce
Distributed execution (batch)

  • Programming model
  • Scalability / fault-tolerance

HDFS
Distributed storage (read-optimized)

  • Replication / scalability
  • ~ Google filesystem (GFS)

ZooKeeper
Coordination service

  • Locking / configuration
  • ~ Google Chubby

HBase
Column-oriented, sparse store

  • Batch & random access
  • ~ Google BigTable

Pig
Data flow language

  • Procedural SQL-inspired lang.
  • Execution environment

Hive
Distributed data warehouse

  • SQL-like query language
  • Data mgmt / query execution

Limitation of MapReduce

  • Inefficient for multi-pass algorithms

  • No efficient primitives for data sharing

    • State between steps goes to distributed file system

    • Slow due to replication & disk storage

Spark

Apache Spark is a fast and general-purpose cluster computing system. It also supports a rich set of higher-level tools including Spark SQL for SQL and structured data processing, MLlib for machine learning, GraphX for graph processing, and Spark Streaming for streaming processing.

In HBase, the row key is unique for each row.
