statistical machine learning 01: introduction

Contents

statistical machine learning
- 1.1. the object of learning: data
- 1.2. main types of machine learning
- 1.3. machine learning steps
- 1.4. importance of machine learning
supervised learning
- 2.1. basic concepts
- 2.2. formalization of the problem
three factors of machine learning
- 3.1. model
- 3.2. strategy
- 3.3. algorithm

statistical machine learning

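1.1 the object of learning: data

Statistical machine learning starts from data: it extracts features from the data, abstracts a model of the data, discovers the knowledge in the data, and then returns to the data for analysis and prediction.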

1.2 main types of machine learning

- supervised learning
- unsupervised learning
- semi-supervised learning
- reinforcement learning

1.3 machine learning steps

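- obtain a finite training data set
- determine the hypothesis space that contains all candidate models (model hypothesis space)
- determine the criterion for model selection (model strategy)
- implement the algorithm that solves for the optimal model (model algorithm)
- select the best model using the learning method
- use the learned best model to predict or analyze new data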
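The six steps above can be walked through on a toy problem. The following is only a minimal sketch under stated assumptions: the data values, the one-parameter model family `f_w(x) = w * x`, the grid of candidate `w` values, the mean squared error criterion, and the exhaustive search are all illustrative choices that do not come from these notes.

```python
# Step 1: a finite training set of (input, output) pairs (made-up numbers).
train = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2), (4.0, 7.8)]

# Step 2: hypothesis space = all models f_w(x) = w * x with w on a coarse grid.
hypothesis_space = [w / 10 for w in range(0, 41)]  # w = 0.0, 0.1, ..., 4.0


def predict(w, x):
    """The model f_w applied to one input x."""
    return w * x


# Step 3: strategy = prefer the model with the smallest mean squared error.
def mean_squared_error(w, data):
    return sum((y - predict(w, x)) ** 2 for x, y in data) / len(data)


# Steps 4 and 5: algorithm = exhaustive search over the grid; it returns
# the candidate w that the strategy ranks best.
def fit(data, candidates):
    return min(candidates, key=lambda w: mean_squared_error(w, data))


best_w = fit(train, hypothesis_space)

# Step 6: use the learned model on a new input.
new_prediction = predict(best_w, 5.0)
print(best_w, new_prediction)
```

With a hypothesis space this small, trying every candidate is feasible; for larger spaces the search is usually replaced by a numerical optimization method.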

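1.4 importance of machine learning

- an effective means of processing massive amounts of data
- an effective means of making computers intelligent
- an important part of the development of computer science

Application areas: artificial intelligence, pattern recognition, data mining, NLP, image recognition, information retrieval, bioinformatics, ...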

supervised-learning

Supervised learning learns a model that can produce a predicted output for any given input.

2.1 basic concept

feature space
output space

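Each concrete input is an instance, usually represented by a feature vector. The space in which all feature vectors live is called the feature space, and each dimension of the feature space corresponds to one feature.

Inputs and outputs are regarded as values of random variables defined on the input (feature) space and the output space.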

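(1) The feature vector of an input instance is written:

$$
x = (x^{(1)}, x^{(2)}, \dots, x^{(i)}, \dots, x^{(n)})^T
$$

$x^{(i)}$ denotes the $i$-th feature of $x$. This is different from $x_i$, which denotes the $i$-th of several inputs.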

Supervised learning learns a model from a training data set and then uses it to make predictions on test data. The training data consists of pairs of an input (feature vector) and an output.

The training set is written:

$$
T = \{ (x_1, y_1), (x_2, y_2), \dots , (x_N, y_N) \}
$$

The test data also consists of input-output pairs. An input-output pair is called a sample, or sample point.
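To make the notation concrete, here is one possible way to hold a training set in code. It is a sketch only: the type aliases `FeatureVector` and `Sample`, the numbers, and the integer labels are hypothetical and chosen purely for illustration.

```python
from typing import List, Tuple

FeatureVector = Tuple[float, ...]   # x = (x^(1), ..., x^(n))
Sample = Tuple[FeatureVector, int]  # one sample point (x_i, y_i)

# Hypothetical training set T with N = 3 samples and n = 2 features each.
T: List[Sample] = [
    ((5.1, 3.5), 0),
    ((6.2, 2.9), 1),
    ((4.7, 3.2), 0),
]

N = len(T)        # number of samples
n = len(T[0][0])  # dimension of the feature space

x_2, y_2 = T[1]          # the second sample point (x_2, y_2)
second_feature = x_2[1]  # x_2^(2): the 2nd feature of the 2nd input
print(N, n, second_feature, y_2)
```

Note the two kinds of index: the position in `T` plays the role of the subscript in $x_i$, and the position inside the tuple plays the role of the superscript in $x^{(i)}$.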

classification、regression、tagging

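- a prediction problem whose output variable takes a finite number of discrete values is called a classification problem
- a prediction problem whose input and output are both continuous variables is called a regression problem
- a prediction problem whose input and output are both sequences of variables is called a tagging problem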

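(2) Joint probability distribution

Supervised learning assumes that the input and output random variables X and Y follow a joint probability distribution P(X, Y).

joint distribution: more_info

During learning, P(X, Y) is assumed to exist, but its specific definition is unknown to the learning system. The training data and the test data are regarded as generated independently and identically distributed (i.i.d.) according to P(X, Y). This is the basic assumption that supervised learning makes about the data.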
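The i.i.d. assumption can be illustrated by simulation. In the sketch below the table `JOINT` is a made-up joint distribution over two binary variables, and `draw_iid` is a hypothetical helper. A real learning system never sees such a table, only the samples drawn from it.

```python
import random

# Hypothetical joint distribution P(X, Y) with X and Y both in {0, 1}.
JOINT = {
    (0, 0): 0.4,
    (0, 1): 0.1,
    (1, 0): 0.1,
    (1, 1): 0.4,
}


def draw_iid(joint, size, rng):
    """Draw `size` independent (x, y) pairs, each from the same distribution."""
    pairs = list(joint.keys())
    weights = list(joint.values())
    return rng.choices(pairs, weights=weights, k=size)


rng = random.Random(0)  # fixed seed so the sketch is repeatable
train = draw_iid(JOINT, 1000, rng)
test = draw_iid(JOINT, 200, rng)

# Empirical frequency of (1, 1) in the training data; it approximates P(1, 1).
freq_11 = sum(1 for pair in train if pair == (1, 1)) / len(train)
print(freq_11)
```

Because training and test data come from the same distribution, regularities found in `train` can be expected to carry over to `test`.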

hypothesis space 假设空间

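In supervised learning, the model belongs to a set of mappings from the input space to the output space. This set is the hypothesis space.

(Is it the set of y? No: its elements are the mappings themselves, not output values.)

A supervised learning model can be probabilistic or non-probabilistic. It is represented by a conditional probability distribution $P(Y|X)$ or by a decision function $Y = f(X)$. For a specific input, the predicted output is written $P(y|x)$ or $y = f(x)$.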
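The difference between the two kinds of model shows up in what a prediction returns. The sketch below uses two hypothetical models on a single real-valued feature: a threshold rule as the decision function and a logistic curve as the conditional distribution. Neither is prescribed by these notes.

```python
import math


def f(x):
    """Non-probabilistic model: a decision function y = f(x) (threshold rule)."""
    return 1 if x >= 0.0 else 0


def p_y_given_x(x):
    """Probabilistic model: returns {y: P(y | x)} for y in {0, 1}."""
    p1 = 1.0 / (1.0 + math.exp(-x))
    return {0: 1.0 - p1, 1: p1}


def predict_from_distribution(x):
    """Turn P(y | x) into a single prediction by taking the most probable y."""
    dist = p_y_given_x(x)
    return max(dist, key=dist.get)


x = 2.0
hard_prediction = f(x)            # a single output value
soft_prediction = p_y_given_x(x)  # a probability for every possible output
print(hard_prediction, soft_prediction, predict_from_distribution(x))
```

A conditional distribution can always be reduced to a decision in this way, whereas a bare decision function carries no information about how confident the prediction is.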

2.2 formalization of problem

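The training set:

$$
T = \{ (x_1, y_1), (x_2, y_2), \dots , (x_N, y_N) \}
$$

$(x_i, y_i)$ is called a sample point (sample).

$x_i \in \chi \subseteq R^n$ is the observed input.

$y_i \in \mathcal{Y}$ is the observed output, where $\mathcal{Y}$ is the output space.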

three factors of ml

method = model + strategy + algorithm

3.1 model

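In supervised learning, the model is the conditional probability distribution or the decision function to be learned.

The hypothesis space can be defined as a set of decision functions:

$$
F = \{ f \mid Y = f(X) \}
$$

or, as a family of functions determined by a parameter vector $\theta$:

$$
F = \{ f \mid Y = f_\theta(X), \theta \in R^n \}
$$

The hypothesis space can also be defined as a set of conditional probability distributions:

$$
F = \{ P \mid P(Y|X) \}
$$

or, as a family of distributions determined by a parameter vector $\theta$:

$$
F = \{ P \mid P_\theta(Y|X), \theta \in R^n \}
$$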
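A parameterized hypothesis space can be pictured as a factory that turns each parameter vector into one model. The sketch below assumes a linear family, which is only one possible choice. The name `make_linear_model` is hypothetical.

```python
from typing import Callable, Sequence

Model = Callable[[Sequence[float]], float]


def make_linear_model(theta: Sequence[float]) -> Model:
    """Return the member f_theta of the family: f_theta(x) = sum_j theta_j * x_j."""
    def f_theta(x: Sequence[float]) -> float:
        return sum(t * xj for t, xj in zip(theta, x))
    return f_theta


# Two different parameter vectors pick out two different models in F.
f_a = make_linear_model([1.0, 0.0])
f_b = make_linear_model([0.5, 2.0])

x = [2.0, 3.0]
y_a = f_a(x)  # 1.0 * 2.0 + 0.0 * 3.0
y_b = f_b(x)  # 0.5 * 2.0 + 2.0 * 3.0
print(y_a, y_b)
```

Seen this way, choosing a model from the hypothesis space is the same as choosing a value of the parameter vector.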

3.2 strategy

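The goal of machine learning is to select the best model from the hypothesis space.

loss function

In supervised learning, a model $f$ is selected from the hypothesis space F as the decision function. For a given input $X$, the predicted output $f(X)$ may or may not agree with the true value $Y$. A loss function measures the size of this prediction error and is written $L(Y, f(X))$.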

0-1 loss function

$$
L(Y, f(X)) = \left\{
\begin{array}{ll}
1, & Y \ne f(X) \\
0, & Y = f(X)
\end{array}
\right.
$$

quadratic loss function
absolute loss function
logarithmic loss function / log-likelihood loss function
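The notes give a formula only for the 0-1 loss. For reference, the sketch below implements all four losses using their usual textbook definitions: quadratic loss $(Y - f(X))^2$, absolute loss $|Y - f(X)|$, and logarithmic loss $-\log P(Y|X)$. These formulas and the function names are supplied here as assumptions and are not taken from the text above.

```python
import math


def zero_one_loss(y, y_pred):
    """1 if the prediction is wrong, 0 if it is right."""
    return 0 if y == y_pred else 1


def quadratic_loss(y, y_pred):
    """Squared difference between true value and prediction."""
    return (y - y_pred) ** 2


def absolute_loss(y, y_pred):
    """Absolute difference between true value and prediction."""
    return abs(y - y_pred)


def log_loss(p_true):
    """p_true is P(y | x), the probability the model assigns to the true y."""
    return -math.log(p_true)


print(zero_one_loss(1, 0), quadratic_loss(3.0, 1.0),
      absolute_loss(3.0, 1.0), log_loss(0.5))
```

The first three take a point prediction, while the logarithmic loss takes a probability, so it applies to models expressed as conditional distributions.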

3.3 algorithm

The algorithm is the concrete method for learning the model: given the training data set and following the learning strategy, it selects the best model from the hypothesis space.
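As one example of such a concrete method, the sketch below uses plain gradient descent to minimize the mean quadratic loss of a one-parameter model `f_w(x) = w * x`. The notes do not name a specific algorithm, so gradient descent, the learning rate, the step count, and the data are all assumptions made for illustration.

```python
def fit_gradient_descent(data, lr=0.01, steps=2000):
    """Minimize the mean squared loss of f_w(x) = w * x over w."""
    w = 0.0
    n = len(data)
    for _ in range(steps):
        # Gradient of (1/n) * sum (y - w*x)^2 with respect to w.
        grad = (-2.0 / n) * sum(x * (y - w * x) for x, y in data)
        w -= lr * grad
    return w


data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]  # made-up data with y = 2x
w_hat = fit_gradient_descent(data)
print(w_hat)
```

Here the model family fixes what can be learned, the loss fixes what "best" means, and the loop is the part that actually finds it.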
