Algorithm intuition and understanding the models
Many machine learning algorithms make assumptions about the linear separability of the input data.
The perceptron model even requires the training data to be perfectly linearly separable in order to converge.
Streamlining workflows with pipelines
By "iterative" we mean that there exists a recurrence relation of the form $\theta^{(t+1)} = g\big(\theta^{(t)}\big)$, i.e. the next estimate is computed from the current one.
For example, the E-M algorithm gives the following recurrence for the parameters of a GMM (Gaussian mixture model):
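A hedged sketch of what these updates standardly look like, written with assumed notation ($\pi_k, \mu_k, \Sigma_k$ for the mixture weights, means and covariances of component $k$, and $\gamma_{ik}$ for the responsibility of component $k$ for sample $x_i$):

E-step:

$$\gamma_{ik}^{(t)} = \frac{\pi_k^{(t)}\,\mathcal{N}\!\big(x_i \mid \mu_k^{(t)}, \Sigma_k^{(t)}\big)}{\sum_{j=1}^{K} \pi_j^{(t)}\,\mathcal{N}\!\big(x_i \mid \mu_j^{(t)}, \Sigma_j^{(t)}\big)}$$

M-step:

$$\pi_k^{(t+1)} = \frac{1}{N}\sum_{i=1}^{N}\gamma_{ik}^{(t)},\qquad \mu_k^{(t+1)} = \frac{\sum_{i=1}^{N}\gamma_{ik}^{(t)}\,x_i}{\sum_{i=1}^{N}\gamma_{ik}^{(t)}},\qquad \Sigma_k^{(t+1)} = \frac{\sum_{i=1}^{N}\gamma_{ik}^{(t)}\,\big(x_i-\mu_k^{(t+1)}\big)\big(x_i-\mu_k^{(t+1)}\big)^{\top}}{\sum_{i=1}^{N}\gamma_{ik}^{(t)}}$$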
Here $z$ denotes the latent variables introduced by the algorithm.
The latent variables we introduce must not change the marginal distribution $P(X)$.
That is:

$$P(X) = \sum_{z} P(X, z) = \sum_{z} P(X \mid z)\,P(z)$$
To take the expectation of a function, we must specify the distribution it is taken over; different expectations are taken with respect to different distributions.
That is (taking a continuous random variable as an example):

$$\mathbb{E}\big[g(X)\big] = \int g(x)\,f(x)\,dx$$
where $f(x)$ is its probability density function.
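For instance, in the E-M setting above, the E-step expectation is taken with respect to the posterior distribution of the latent variables under the current parameter estimate $\theta^{(t)}$ (the notation here is assumed, not quoted), giving the usual Q-function:

$$Q\big(\theta \mid \theta^{(t)}\big) = \mathbb{E}_{z \sim P(z \mid X,\, \theta^{(t)})}\big[\log P(X, z \mid \theta)\big]$$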
Unsupervised methods such as PCA and K-means are characterized by not using label information.
Linear Discriminant Analysis can be used as a technique for feature extraction to increase computational efficiency and reduce the degree of over-fitting due to the curse of dimensionality in non-regularized models.
The general concept behind LDA is very similar to PCA, but whereas PCA attempts to find the orthogonal component axes of maximum variance in a dataset, the goal in LDA is to find the feature subspace that optimizes class separability.
Both LDA and PCA are linear transformation techniques that can be used to reduce the number of dimensions in a dataset; PCA is an unsupervised algorithm, whereas LDA is supervised.
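A minimal sketch of using LDA as a supervised feature extractor with scikit-learn (the Wine dataset, the two-component projection, and the logistic regression classifier are illustrative assumptions, not prescribed above):

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression

# Illustrative labeled dataset; any classification dataset works the same way.
X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=1)

# Standardize, then project onto 2 discriminant axes (supervised: uses y).
sc = StandardScaler().fit(X_train)
lda = LinearDiscriminantAnalysis(n_components=2)
X_train_lda = lda.fit_transform(sc.transform(X_train), y_train)
X_test_lda = lda.transform(sc.transform(X_test))

# A simple classifier trained on the reduced feature space.
clf = LogisticRegression().fit(X_train_lda, y_train)
print('Test accuracy:', clf.score(X_test_lda, y_test))
```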
Using Kernel PCA for nonlinear mappings
Many machine learning algorithms make assumptions about the linear separability of the input data.
We can tackle nonlinear problems by projecting them onto a new feature space of higher dimensionality where the classes become linearly separable.
Kernel PCA: perform a nonlinear mapping that transforms the data onto a higher-dimensional space;
standard PCA: project the data back onto a lower-dimensional space where the samples can be separated by a linear classifier (under the condition that the samples can be separated by density in the input space).
However, one downside of this approach is that it is computationally very expensive, and this is where we use the kernel trick.
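A minimal sketch of this idea with scikit-learn's KernelPCA; the half-moons toy data, the RBF kernel, and gamma=15 are illustrative assumptions:

```python
from sklearn.datasets import make_moons
from sklearn.decomposition import PCA, KernelPCA

# Toy dataset that is not linearly separable in the original input space.
X, y = make_moons(n_samples=200, noise=0.05, random_state=1)

# Standard (linear) PCA for comparison: the classes stay entangled.
X_pca = PCA(n_components=2).fit_transform(X)

# Kernel PCA with an RBF kernel: the kernel trick avoids computing the
# high-dimensional mapping explicitly.
X_kpca = KernelPCA(n_components=2, kernel='rbf', gamma=15).fit_transform(X)

# After the kernel mapping, the two half-moon classes become (approximately)
# separable along the first component.
print(X_kpca[:5])
```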
In machine learning, we have two types of parameters: those that are learned from the training data, for example, the weights in logistic regression, and the parameters of a learning algorithm that are optimized separately. The latter are the tuning parameters, also called hyper-parameters.
tuning one of its hyper-parameters
A powerful hyperparameter optimization technique called grid search can further help to improve the performance of a model by finding the optimal combination of hyperparameter values.
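A minimal sketch of grid search with scikit-learn's GridSearchCV; the SVM estimator, the dataset, and the particular parameter grid are illustrative assumptions:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=1)

# Pipeline so that scaling is refit inside each cross-validation fold.
pipe = make_pipeline(StandardScaler(), SVC(random_state=1))

# Candidate hyperparameter values (illustrative grid).
param_grid = [
    {'svc__C': [0.1, 1.0, 10.0, 100.0], 'svc__kernel': ['linear']},
    {'svc__C': [0.1, 1.0, 10.0, 100.0],
     'svc__gamma': [0.001, 0.01, 0.1], 'svc__kernel': ['rbf']},
]

# Exhaustively evaluate every combination with 10-fold cross-validation.
gs = GridSearchCV(pipe, param_grid, scoring='accuracy', cv=10, n_jobs=-1)
gs.fit(X_train, y_train)
print(gs.best_score_, gs.best_params_)
print('Test accuracy:', gs.best_estimator_.score(X_test, y_test))
```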
Many learning algorithms (PCA, LogisticRegression, ...) require input features on the same scale for optimal performance.
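A minimal sketch of putting features on the same scale with scikit-learn's StandardScaler (the toy arrays are made up for illustration):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two features on very different scales (illustrative data).
X_train = np.array([[1.0, 200.0], [2.0, 300.0], [3.0, 400.0]])
X_test = np.array([[1.5, 250.0]])

# Fit the scaler on the training data only, then apply the same
# transformation to the test data to avoid information leakage.
sc = StandardScaler()
X_train_std = sc.fit_transform(X_train)
X_test_std = sc.transform(X_test)
print(X_train_std)
print(X_test_std)
```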
A model can suffer from underfitting (high bias) if it is too simple,
or it can overfit the training data (high variance) if it is too complex for the underlying training data.
To find an acceptable bias-variance trade-off, we need to evaluate our model carefully.
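One way to carry out this evaluation, sketched here with scikit-learn's learning_curve (the estimator and the train-size grid are illustrative assumptions), is to compare training and cross-validation accuracy as the training set grows: a large gap suggests high variance, while low scores on both suggest high bias.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# Training vs. cross-validation accuracy at increasing training-set sizes.
train_sizes, train_scores, valid_scores = learning_curve(
    pipe, X, y, train_sizes=np.linspace(0.1, 1.0, 5), cv=10)

print('train:', train_scores.mean(axis=1))
print('valid:', valid_scores.mean(axis=1))
```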
useful cross-validation techniques:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)
which can help us to obtain reliable estimates of the model’s generalization error, that is, how well the model performs on unseen data.
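Beyond a single holdout split, k-fold cross-validation gives a more reliable estimate; a minimal sketch with scikit-learn's cross_val_score (the estimator, dataset, and cv=10 are illustrative assumptions):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# Stratified 10-fold cross-validation; each fold serves once as validation data.
scores = cross_val_score(pipe, X, y, cv=10, scoring='accuracy')
print('CV accuracy: %.3f +/- %.3f' % (scores.mean(), scores.std()))
```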
One of the key steps in building a machine learning model is to estimate its performance on (new) data that the model hasn’t seen before.