Machine & deep learning study notes

Machine learning

  1. AI: Artificial Intelligence: machine learning handles the learning part of AI

  2. Pattern Recognition: recognizing and classifying objects based on their features

  3. Data Mining: analyzing data to extract new information and knowledge

Pattern recognition tasks: Sequence labeling: assign a label to each input item; for example, if the input is a word, determine its part of speech from the context

Parsing: analyze the grammatical structure of natural-language input

Classification: the categories are known; assign each input to one of them

Clustering: create the categories from the features of the inputs

Approach: Multiple features: place the decision boundary where the probability of error is minimized.

Bayesian Decision Theory

Design classifiers to recommend decisions that minimize some total expected "risk".

posterior = likelihood × prior / evidence:  P(w_j | x) = P(x | w_j) · P(w_j) / P(x)

Decision using the prior alone

P(w1) > P(w2) ===> decide w1; otherwise decide w2
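As a concrete illustration (a minimal sketch with made-up numbers, not values from these notes), the posterior can be computed from the prior and the likelihood and the resulting decision compared with the prior-only rule:

```python
import numpy as np

# Two-class Bayes decision sketch; priors and likelihoods are illustrative.
priors = np.array([0.6, 0.4])           # P(w1), P(w2)
likelihoods = np.array([0.2, 0.5])      # P(x | w1), P(x | w2) for one observed x

evidence = np.sum(likelihoods * priors)        # P(x)
posteriors = likelihoods * priors / evidence   # P(w_j | x)

decision_prior = np.argmax(priors)          # decide from the prior alone
decision_posterior = np.argmax(posteriors)  # decide from the posterior
print(posteriors, decision_prior, decision_posterior)
```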

general theory

  1. use more than one feature

  2. allow more than two categories

  3. Allow actions other than classifying the input to one of the possible categories (e.g., rejection).

  4. Employ a more general error function (i.e., “risk” function) by associating a “cost” (“loss” function) with each error (i.e., wrong action)

loss function

Conditional risk: R(α_i | x) = Σ_j λ(α_i | w_j) · P(w_j | x)

Find the action α_i that minimizes R(α_i | x).

Zero-one loss function: λ(α_i | w_j) = 0 when i = j and 1 otherwise, which gives R(α_i | x) = 1 − P(w_i | x).
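A minimal sketch of this, assuming a hypothetical three-class problem: compute the conditional risk of each action from a loss matrix and the posteriors, then pick the action with minimum risk (with zero-one loss this reduces to 1 − P(w_i | x)):

```python
import numpy as np

posteriors = np.array([0.5, 0.3, 0.2])   # P(w_j | x), illustrative values
loss = np.ones((3, 3)) - np.eye(3)       # zero-one loss: 0 when i == j, else 1

risks = loss @ posteriors                # R(a_i | x); equals 1 - P(w_i | x) here
best_action = np.argmin(risks)           # choose the action with minimum risk
print(risks, best_action)
```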

discriminant function

A useful way of representing pattern classifiers: a set of discriminant functions g_i(x); assign x to the class whose discriminant function is largest.

Decision rules divide the feature space into decision regions R1, R2, …, Rc, separated by decision boundaries.

Naïve Bayes

The Naïve Bayes classifier assumes that all features are conditionally independent given the class.

In a high-dimensional feature space it is difficult to estimate the joint probability, so Naïve Bayes assumes that all features are conditionally independent given the class.

Naïve Bayes: the conditional independence assumption

– Training is very easy and fast; it only requires considering each attribute in each class separately

– Testing is straightforward; it only requires looking up tables or calculating conditional probabilities with the estimated distributions (see the sketch after this list)

• A popular generative model

– Performance is competitive with most state-of-the-art classifiers, even when the independence assumption is violated

– Many successful applications, e.g., spam mail filtering

– A good candidate as a base learner in ensemble learning
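A minimal Naïve Bayes sketch, assuming scikit-learn and the iris data (neither is prescribed by these notes); GaussianNB estimates a per-class mean and variance for each feature separately, which is why training is so fast:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

clf = GaussianNB().fit(X_train, y_train)   # per-class, per-feature statistics only
print("test accuracy:", clf.score(X_test, y_test))
```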

ROC curve

Receiver Operating Characteristic (ROC) Curve

ROC curves help us evaluate system performance at different thresholds and can be used to compare tests/procedures.

An ROC curve is a graphical plot that illustrates the performance of a binary classifier system as its discrimination threshold is varied; in machine learning it is used to assess classifiers and compare tests.

AUC

Area under ROC curve (AUC)

Overall measure of test performance

Two tests can be compared based on the difference between their (estimated) AUCs
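A small sketch of both ideas, assuming scikit-learn and hypothetical labels and scores: roc_curve returns one (FPR, TPR) point per threshold, and roc_auc_score gives the area under that curve:

```python
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

y_true = np.array([0, 0, 1, 1, 0, 1])                  # hypothetical labels
y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.7])    # classifier scores

fpr, tpr, thresholds = roc_curve(y_true, y_score)      # ROC points as the threshold varies
auc = roc_auc_score(y_true, y_score)                   # overall measure of performance
print(fpr, tpr, auc)
```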

Decision boundary

A decision boundary is a hypersurface that partitions the underlying vector space into two sets, one for each class.

KNN

The k-nearest neighbors algorithm is a non-parametric method used for classification and regression.

Pros and cons:

Cons: 1. Slower prediction with larger datasets.

  2. Many irrelevant features may lead to problems.

  3. Results tend to be poor in high-dimensional cases.

  4. Gives no insight into the patterns underlying the data.

  5. All k nearest neighbors have the same influence on the prediction.

  6. Closer nearest neighbors should (perhaps) have a higher influence on the prediction.

Pros: 1. Easy to use.

  2. Good compatibility when a suitable distance measure is chosen.

  3. On complex non-linear problems it can model better than basic linear models.

  4. A good data set gives good predictive accuracy.

  5. No learning algorithm is needed: we do not explicitly learn a model from the training data; the data is the model.

Practical notes: 1. Consider sample weights and feature weights.

2. Choosing an odd number for K is a good idea (it avoids ties between two classes).

Avoiding overfitting: choose a suitable K based on the scale of your data set (a minimal sketch follows below).
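A minimal k-NN sketch, assuming scikit-learn and the iris data; the odd k and the feature scaling follow the practical notes above, and fit only stores the training data (the data is the model):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))  # odd K
knn.fit(X_train, y_train)                      # no explicit learning: the data is the model
print("test accuracy:", knn.score(X_test, y_test))
```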

Covariance matrix

Variance: How much a random variable varies around the expected value

Covariance measures the strength of the linear relationship between two random variables.

For an N-dimensional random variable, the matrix whose elements are the covariances between every pair of components is called the covariance matrix.

Covariance becomes more positive for each pair of values which differ from their mean in the same direction.

Covariance becomes more negative with each pair of values which differ from their mean in opposite directions.

Application examples: iris data, eigenfaces
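For example, a minimal NumPy sketch on the iris data (rowvar=False because rows are samples and columns are features):

```python
import numpy as np
from sklearn.datasets import load_iris

X, _ = load_iris(return_X_y=True)
cov = np.cov(X, rowvar=False)    # 4 x 4 matrix: element (i, j) = cov(feature_i, feature_j)
print(cov.shape)
print(np.diag(cov))              # the diagonal holds the variances of the individual features
```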

PCA

Principal component analysis (PCA) uses the variance in the data as the structure preservation criterion.

PCA tries to preserve as much of the original variance of the data as possible when projecting to a lower-dimensional space.

Principal component analysis is a useful method for dimensionality reduction (projecting from a high-dimensional space to a lower-dimensional space).

In data preprocessing, PCA treats the direction of largest variance in the raw data as the main feature; that direction is given by the eigenvector of the covariance matrix with the largest eigenvalue.

pros:

PCA is a non-parametric method, which means the output for the same raw input is always the same.

cons:

PCA is a linear dimensionality-reduction method; it cannot handle non-linear structure in the raw data.

Sometimes projecting onto the maximum-variance direction makes two groups from the same dataset overlap, so a projection direction that keeps them separated should be chosen instead.

The main feature lies in the direction of largest variance, in other words the direction of the eigenvector with the largest eigenvalue.

The basis matrix is composed of the eigenvectors, and we can diagonalize the covariance matrix to obtain it.
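A minimal NumPy sketch of this procedure (the data here is random and only illustrative): center the data, diagonalize the covariance matrix, take the eigenvectors with the largest eigenvalues as the basis matrix, and project:

```python
import numpy as np

def pca(X, n_components):
    X_centered = X - X.mean(axis=0)
    cov = np.cov(X_centered, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)     # eigendecomposition of the symmetric covariance
    order = np.argsort(eigvals)[::-1]          # sort directions by decreasing variance
    basis = eigvecs[:, order[:n_components]]   # basis matrix of the top eigenvectors
    return X_centered @ basis                  # projection to the lower-dimensional space

X = np.random.RandomState(0).randn(100, 5)
print(pca(X, 2).shape)                         # (100, 2)
```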

SOM

A type of ANN that is trained using unsupervised learning to produce a low-dimensional, discrete representation of the input space.

A clustering (vector quantization) method that combines topology preservation with good data visualization possibilities.

The SOM is a lattice of competitive neural units used for clustering (vector quantization) and topology preservation.

topology preservation means that input patterns close in the input space are mapped to units close on the SOM lattice and units close on the SOM are close in the input space.

The training algorithm of the SOM is based on unsupervised learning, which can be either iterative or batch based.

The SOM can be used for data visualization, clustering (or classification), estimation and a variety of other purposes.
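A minimal sketch of iterative (sequential) SOM training in NumPy; the lattice size, learning rate, and neighborhood width are illustrative assumptions, not values from these notes:

```python
import numpy as np

rng = np.random.RandomState(0)
X = rng.rand(500, 2)                                              # input patterns
grid = np.array([(i, j) for i in range(10) for j in range(10)])   # 10 x 10 lattice coordinates
W = rng.rand(100, 2)                                              # one prototype per unit

for t, x in enumerate(X):
    lr = 0.5 * np.exp(-t / 500)                                   # decaying learning rate
    sigma = 3.0 * np.exp(-t / 500)                                # decaying neighborhood radius
    bmu = np.argmin(np.linalg.norm(W - x, axis=1))                # best-matching unit
    d = np.linalg.norm(grid - grid[bmu], axis=1)                  # lattice distance to the BMU
    h = np.exp(-(d ** 2) / (2 * sigma ** 2))                      # neighborhood function
    W += lr * h[:, None] * (x - W)                                # pull BMU and neighbors toward x
```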

batch training

the gradient is computed for the entire input set and the map is updated toward the estimated optimum for the set.

Batch learning is the opposite of sequential learning: the weights of the neurons are updated only after all inputs have been seen. It gives a more accurate estimate of the gradient and converges to a local minimum faster.

U-matrix

The U-matrix shows the distances between neighboring units of the SOM; high values on the U-matrix mean large distances and indicate cluster borders, thus visualizing the cluster structure of the map.

ANN

Computational models inspired by the human brain:

-Massively parallel, distributed system, made up of simple processing units (neurons)

-Synaptic connection strengths among neurons are used to store the acquired knowledge.

-Knowledge is acquired by the network from its environment through a learning process

Many types of models (supervised, unsupervised) for different tasks (classification, regression, clustering, visualization)

supervised

trained to produce desired outputs in response to sample inputs

Unsupervised

trained by letting the network adjust itself to inputs to find relationships (e.g. clusters) within data. Requires only data, no labeled samples

delta rule

Delta learning rule is a gradient descent learning rule for updating the weights of the inputs to artificial neurons in a single-layer neural network.

Differentiate the error to get its derivatives with respect to the weights ===> update the weights
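A minimal delta-rule sketch for a single linear neuron trained by gradient descent on squared error (NumPy; the data and learning rate are illustrative):

```python
import numpy as np

rng = np.random.RandomState(0)
X = rng.randn(200, 3)                                        # inputs
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.randn(200)    # targets
w, lr = np.zeros(3), 0.01

for epoch in range(100):
    error = X @ w - y                    # neuron output minus target
    grad = X.T @ error / len(X)          # derivative of the mean squared error w.r.t. w
    w -= lr * grad                       # delta rule: step against the gradient
print(w)
```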

Batch vs. Sequential Learning

batch

*More accurate estimate of gradient

*Converges to local minimum faster

sequential

*Simpler to program

*May escape from local minima (change the order of presentation)

Both ways need many epochs, i.e. passes through the whole dataset

Activation functions

sigmoid activation function is used in the hidden units, and sigmoid or linear activation functions are used in the output units

Sigmoid

Tanh

Rectified Linear Unit (ReLU)
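The three functions above as NumPy one-liners (a quick reference, not tied to any particular framework):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))   # squashes to (0, 1)

def tanh(x):
    return np.tanh(x)                 # squashes to (-1, 1)

def relu(x):
    return np.maximum(0.0, x)         # zero for negative inputs, identity otherwise

x = np.linspace(-3, 3, 7)
print(sigmoid(x), tanh(x), relu(x))
```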

ANN summary

• Perceptron and linear regression optimize the same target function

• In both cases we compute the gradient (vector of partial derivatives)

• In the case of linear regression, we set the gradient to zero and solve for vector w. As the solution we have a closed formula for w such that the target function obtains the global minimum.

• In the case of perceptron, we iteratively go in the direction of the minimum by going in the direction of minus the gradient.

MLP

The complexity of the MLP can vary from a simple parametric model to a complex nonlinear regression model by varying the number of layers and the number of units in each layer.

Understand model complexity

Model complexity reflects how many candidate models we can choose from in a given training task. If we use methods such as weight decay to reduce model complexity, we are better at avoiding over-fitting.

Because training is a non-linear optimization problem, it usually stops at a local minimum.

Backpropagation

In general, backpropagation uses the error at the output layer to correct the weights of the neurons in the MLP, layer by layer, back towards the inputs. It uses the chain rule so that each gradient calculation involves only local terms.

Backpropagation is used to calculate the gradient needed to update the weights of the network. It is commonly used in DNNs.
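A minimal sketch of backpropagation for a one-hidden-layer MLP trained on squared error (NumPy; layer sizes, data, and learning rate are illustrative assumptions):

```python
import numpy as np

rng = np.random.RandomState(0)
X, y = rng.randn(64, 4), rng.randn(64, 1)             # toy regression data
W1, W2 = 0.1 * rng.randn(4, 8), 0.1 * rng.randn(8, 1)
lr = 0.05

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

for epoch in range(200):
    # Forward pass
    h = sigmoid(X @ W1)                               # hidden activations
    y_hat = h @ W2                                    # linear output unit
    # Backward pass: chain rule, from the output layer back towards the inputs
    d_out = (y_hat - y) / len(X)                      # dLoss/dy_hat for mean squared error
    dW2 = h.T @ d_out
    d_hidden = (d_out @ W2.T) * h * (1 - h)           # error propagated through the sigmoid
    dW1 = X.T @ d_hidden
    W1 -= lr * dW1
    W2 -= lr * dW2
```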

Practical considerations

-Preprocessing of training data

-Initialization of weights

-Choosing the learning rate

-Batch or on-line learning

-Choosing the transfer functions

Normalize data

Raw data may use different units (for example, kilometers vs. meters) and have different ranges. If we do not normalize the data, the training result is poor or even meaningless.

Normalization makes sure each feature has a fair chance of influencing the model.
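A small z-score normalization sketch (illustrative numbers; the key point is that the statistics come from the training set only and are reused for new data):

```python
import numpy as np

X_train = np.array([[1000.0, 0.5], [2000.0, 0.7], [1500.0, 0.2]])  # e.g. meters vs. ratios
mean, std = X_train.mean(axis=0), X_train.std(axis=0)

X_train_norm = (X_train - mean) / std                  # each feature: zero mean, unit variance
X_new_norm = (np.array([[1200.0, 0.4]]) - mean) / std  # same statistics applied to new data
print(X_train_norm)
print(X_new_norm)
```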

Overfitting

Overfitting means that the model works well on the training data but performs poorly on data it has not seen before.

Ways to address overfitting:

More data (data augmentation);

Cross-validation;

Weight decay;

Dropout;

Early stopping (a regularization technique: the idea is not to give the network time to overfit; we save the parameters whenever performance on the validation set improves, keep training, and stop at the point where performance starts to decline; see the sketch after this list);

Bayesian methods;

Noise;

In CNNs, pooling also helps avoid overfitting.
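A minimal early-stopping sketch on a toy linear model (NumPy; the data, learning rate, and patience value are illustrative): keep the parameters from the best validation epoch and stop once validation loss has not improved for a while:

```python
import numpy as np

rng = np.random.RandomState(0)
X = rng.randn(200, 5)
y = X @ rng.randn(5) + 0.1 * rng.randn(200)
X_train, y_train, X_val, y_val = X[:150], y[:150], X[150:], y[150:]

w, lr, patience = np.zeros(5), 0.01, 10
best_loss, best_w, bad_epochs = np.inf, w.copy(), 0

for epoch in range(1000):
    w -= lr * X_train.T @ (X_train @ w - y_train) / len(X_train)  # one gradient step
    val_loss = np.mean((X_val @ w - y_val) ** 2)
    if val_loss < best_loss:
        best_loss, best_w, bad_epochs = val_loss, w.copy(), 0     # save the best parameters
    else:
        bad_epochs += 1
        if bad_epochs >= patience:                                # performance started to decline
            break
print(epoch, best_loss)
```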

Underfitting

Add new features (feature augmentation);

Try non-linear models, for example a DNN.

Vector norm

A vector norm maps a vector to a non-negative scalar that measures the "length" of the vector.
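For example (NumPy):

```python
import numpy as np

v = np.array([3.0, -4.0])
print(np.linalg.norm(v, 1))       # L1 norm: 7.0
print(np.linalg.norm(v, 2))       # L2 (Euclidean) norm: 5.0
print(np.linalg.norm(v, np.inf))  # max norm: 4.0
```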

Feature Extraction

either from the raw data or from other features ==> derived features

Feature Selection

The so-called curse of dimensionality: once the feature dimensionality passes a certain point, classifier performance decreases as the dimensionality increases further (and higher dimensionality also increases training time). The degradation is usually caused by irrelevant and redundant features among the high-dimensional features, so the main purpose of feature selection is to remove irrelevant and redundant features:

3 basic approaches

Filter methods

Wrapper methods

Embedded methods

Feature Generation

features: numerical values passed to the classifier.

Given the raw input data:

How to best describe this data with numerical features?

What values can we extract or generate from the raw input?

Problem specific solutions.

Common/general approaches: PCA.

Our focus: generating features from images.

Domain specific: DFT, DCT/DST, convolution.

Important in any pattern recognition task.

DFT

A computer can only deal with discrete, finite-length signals, so after sampling we obtain a discrete signal and use the DFT to transform a discrete, finite-length signal in the time domain into a discrete, finite-length signal in the frequency domain. This lets the computer work in both the time domain and the frequency domain.

DFT stands for Discrete Fourier Transform. The DFT transforms a discrete, finite-length signal (or image) in the time or spatial domain into a discrete, finite-length signal (image) in the frequency domain. This lets the computer work in both domains, and it is sometimes more efficient to analyze data in the frequency domain.
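A minimal DFT sketch with NumPy's FFT (the sampling rate and signal are illustrative): a sampled time-domain signal is transformed to the frequency domain, analyzed there, and transformed back:

```python
import numpy as np

fs = 100                                            # sampling rate in Hz
t = np.arange(0, 1, 1 / fs)                         # 1 second of samples
x = np.sin(2 * np.pi * 5 * t) + 0.5 * np.sin(2 * np.pi * 20 * t)

X = np.fft.fft(x)                                   # discrete frequency-domain representation
freqs = np.fft.fftfreq(len(x), d=1 / fs)            # frequency of each DFT bin
x_back = np.fft.ifft(X).real                        # inverse DFT recovers the signal
print(freqs[np.argmax(np.abs(X[:len(x) // 2]))])    # dominant frequency (5 Hz)
```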

Template Matching

Assume a set of reference patterns are available.

Match an image to one of these patterns.

Typical use cases: Speech recognition. Robotic vision. Motion estimation in video coding. Image database retrieval.

Measure the similarity of two patterns (reference and test).

For 1D signals: edit distance. See the algorithms course.

For images: cross-correlation and deformable templates.

Another approach: PCA ===> project into lower dimensions.

If the rotated images are correlated ======> their projections can match.
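A minimal template-matching sketch using normalized cross-correlation (plain NumPy on a small random image; the patch location is an assumption of the example): slide the template over the image, score each position, and take the highest score as the best match:

```python
import numpy as np

rng = np.random.RandomState(0)
image = rng.rand(20, 20)
template = image[5:9, 7:11].copy()        # a reference patch we expect to find at (5, 7)

th, tw = template.shape
t_zero = template - template.mean()
scores = np.zeros((image.shape[0] - th + 1, image.shape[1] - tw + 1))
for i in range(scores.shape[0]):
    for j in range(scores.shape[1]):
        patch = image[i:i + th, j:j + tw]
        p_zero = patch - patch.mean()
        # normalized cross-correlation between the image patch and the template
        scores[i, j] = np.sum(p_zero * t_zero) / (np.linalg.norm(p_zero) * np.linalg.norm(t_zero) + 1e-12)

print(np.unravel_index(np.argmax(scores), scores.shape))   # best match: (5, 7)
```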

Deep learning

Deep learning is a subset of machine learning; it uses multiple processing layers to extract features from lower layers to higher layers, forming a hierarchical architecture.

Why deep learning

To address the challenge of generalizing to new examples when working with high-dimensional data.

Traditional machine learning: hand-crafted features

Deep learning can learn features directly from data without the need for manual feature extraction.

Why GPUs good for deep learning.

1) GPUs have many more resources and faster bandwidth to memory.

2) Deep learning computations fit well with the GPU architecture.

DNN

A DNN can have hundreds of hidden layers to extract better feature representations.

The term "deep" refers to the number of layers in the network: the more layers, the deeper the network.

Popular DNN models

Supervised learning

each sample is a pair consisting of an input x and a desired target label y.

Goal: predict the target label of unseen inputs

Unsupervised learning

Challenge: in a lot of real-world use cases, even small-scale data collection can be extremely expensive or sometimes near-impossible (e.g. in medical imaging).

Autoencoder

  1. Encoder compresses the input into a hidden representation.

  2. Decoder reconstructs the input from the hidden representation

Goal of training: minimize the difference between the input vector x and the reconstruction vector z, which is called the reconstruction error (the loss function).
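A minimal autoencoder sketch, assuming Keras and MNIST-sized inputs (the framework and layer sizes are not prescribed by these notes): the encoder compresses x into a hidden code, the decoder reconstructs it, and training minimizes the reconstruction error with the input itself as the target:

```python
import tensorflow as tf

inputs = tf.keras.Input(shape=(784,))
code = tf.keras.layers.Dense(32, activation="relu")(inputs)        # encoder: hidden representation
outputs = tf.keras.layers.Dense(784, activation="sigmoid")(code)   # decoder: reconstruction z
autoencoder = tf.keras.Model(inputs, outputs)
autoencoder.compile(optimizer="adam", loss="mse")                  # reconstruction error as the loss

(x_train, _), _ = tf.keras.datasets.mnist.load_data()
x_train = x_train.reshape(-1, 784).astype("float32") / 255.0
autoencoder.fit(x_train, x_train, epochs=5, batch_size=256)        # target is the input itself
```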

SAE

Stacked/deep AutoEncoder (SAE) is constructed by extending the encoder and decoder of autoencoder with multiple hidden layers.

SDAE

Stacked Denoising AutoEncoder

Idea: adding noise to the input forces the autoencoder to learn more robust features.

Phase 1: unsupervised pre-training

Phase 2: supervised fine-tuning
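The denoising idea in one step (a sketch with an illustrative noise level and stand-in data): corrupt the input but keep the clean input as the reconstruction target:

```python
import numpy as np

x_clean = np.random.rand(1000, 784).astype("float32")                      # stand-in for clean inputs
x_noisy = np.clip(x_clean + 0.3 * np.random.randn(*x_clean.shape), 0, 1)   # Gaussian corruption
# Denoising training: fit(x_noisy, x_clean, ...) so the network learns to
# reconstruct the clean input from its corrupted version.
```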

Improve Performance

DATA

Get more data

Invent more data: generate new data by creating randomly modified versions of existing data ======> data augmentation or data generation

Rescale the data: to the bounds based on the activation functions.

Transform the data: scaling / attribute decompositions / attribute aggregations

Feature selection

Hyper-parameters Tuning

Activation Functions

Optimization algorithm

Loss function

Weight Initialization

Batches and epochs: try a batch size equal to the training data size (batch learning) / try a batch size of one (online learning).

Network topology: larger networks have a greater representational capability / more layers offer more opportunity for hierarchical re-composition of abstract features learned from the data

Learning rate: try an adaptive learning rate

DNN training

Three common ways for DNN training

1.Purely supervised

-Initialize parameters randomly

-Train in supervised mode (typically with standard backpropagation and a stochastic gradient descent algorithm, with mean squared error as the loss function)

-Used in most practical systems for speech and image

2.Unsupervised, layerwise+ supervised classifier on top

-Train each layer unsupervised, one after the other

-Train a supervised classifier on top, keeping the other layers fixed

-Good when very few labeled samples are available

3.Unsupervised, layerwise+ global supervised fine-tuning

-Train each layer unsupervised, one after the other

-Add a classifier layer, and retrain the whole thing supervised

-Good when label set is poor

CNN

CNNs take advantage of the fact that the input consists of images. Ordinary NNs do not scale well to image data.

The convolution is applied using a convolution filter to produce a feature (activation) map.

Pooling layer

Max pooling

the most common downsampling operation. Max pooling is done by applying a max filter to (usually) non-overlapping subregions of the initial representation.

Average pooling

consider the average output of a rectangular neighborhood (possibly weighted by the distance from the central pixel).

Max pooling extracts the most important features, such as edges, whereas average pooling extracts features more smoothly.

FC layer

At the end of the network is a fully connected (FC) layer. This layer takes an input volume (whatever the output of the preceding conv, ReLU, or pooling layer is) and outputs an N-dimensional vector, where N is the number of classes.
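A small CNN sketch in Keras (the architecture and class count are illustrative assumptions): convolution plus ReLU produces feature maps, max pooling downsamples them, and the final FC layer outputs one score per class:

```python
import tensorflow as tf

inputs = tf.keras.Input(shape=(28, 28, 1))
x = tf.keras.layers.Conv2D(32, (3, 3), activation="relu")(inputs)   # feature (activation) maps
x = tf.keras.layers.MaxPooling2D((2, 2))(x)                         # max-pooling downsampling
x = tf.keras.layers.Conv2D(64, (3, 3), activation="relu")(x)
x = tf.keras.layers.MaxPooling2D((2, 2))(x)
x = tf.keras.layers.Flatten()(x)
outputs = tf.keras.layers.Dense(10, activation="softmax")(x)        # FC layer: N = 10 classes
model = tf.keras.Model(inputs, outputs)
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])
model.summary()
```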

CNN application

Classification (is it a cat or a dog); detection (is there a cat in the image); segmentation (which pixels form the cat's outline, separated from the background)

RNN

Recurrent neural networks process sequences of information whose elements depend on what came before, and make predictions from them.

Other

Deep learning:

Improve performance:

Data: 1. more data. 2. bigger model. 3. more computation.

Hyper-parameter tuning: 1. optimize the algorithm. 2. pay attention to the loss function. 3. weight initialization

Three common ways for DNN training:

Purely supervised: Used in most practical systems for speech and image.

Unsupervised, layer wise + supervised classifier on top:

Good when very few labeled samples are available

Unsupervised, layer wise + global supervised fine-tuning.

Good when label set is poor.
