Andrew Ng's Deep Learning course: Neural Networks and Deep Learning
Top 10 Machine Learning Algorithms
from sklearn.datasets import load_iris # iris dataset
from sklearn.datasets import load_boston # Boston housing dataset
from sklearn.datasets import load_digits # handwritten digits dataset
from sklearn.datasets import load_wine # wine dataset
from sklearn.datasets import load_breast_cancer # breast cancer dataset
from sklearn.datasets import load_diabetes # diabetes dataset
from sklearn.model_selection import train_test_split # train/test split
from sklearn.model_selection import cross_validate # cross-validation
from sklearn.model_selection import KFold # K-fold cross-validation
from sklearn.preprocessing import OneHotEncoder # one-hot encoding
from sklearn.preprocessing import LabelEncoder # label encoding
from sklearn.preprocessing import MinMaxScaler # min-max scaling
from sklearn.preprocessing import StandardScaler # standardization
from sklearn.preprocessing import normalize # normalization (norm='l1'/'l2')
from sklearn.linear_model import SGDClassifier # SGD classifier
from sklearn import svm # SVM classifier
from sklearn.tree import DecisionTreeClassifier # decision tree classification
from sklearn.naive_bayes import GaussianNB # naive Bayes (Gaussian)
from sklearn.naive_bayes import BernoulliNB # naive Bayes (Bernoulli)
from sklearn.neighbors import KNeighborsClassifier # KNN classifier
from sklearn.neighbors import KNeighborsRegressor # KNN regression
from sklearn.tree import DecisionTreeRegressor # decision tree regression
from sklearn.svm import LinearSVR # SVM regression
from sklearn.linear_model import LinearRegression # linear regression
from sklearn.cluster import KMeans # K-means clustering
from sklearn.cluster import AgglomerativeClustering # hierarchical clustering
from sklearn.decomposition import PCA # PCA dimensionality reduction
from sklearn.manifold import LocallyLinearEmbedding # LLE dimensionality reduction
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis # LDA dimensionality reduction
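The utilities above combine into a typical preprocessing pipeline. As a minimal sketch (choices like `test_size=0.3` and using the iris data are my own, not from the notes), split the data first and fit the scaler only on the training split:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Load the iris dataset: 150 samples, 4 features, 3 classes
iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=0)

# Standardize features to zero mean and unit variance,
# fitting the scaler only on the training split to avoid leakage
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)
```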
Linear regression models the relationship between the independent and dependent variables by fitting a best-fit straight line.
#Import Library
#Import other necessary libraries like pandas, numpy...
from sklearn import linear_model
#Load Train and Test datasets
#Identify feature and response variable(s) and values must be numeric and numpy arrays
x_train=input_variables_values_training_datasets
y_train=target_variables_values_training_datasets
x_test=input_variables_values_test_datasets
# Create linear regression object
linear = linear_model.LinearRegression()
# Train the model using the training sets and check score
linear.fit(x_train, y_train)
linear.score(x_train, y_train)
#Equation coefficient and Intercept
print('Coefficient: \n', linear.coef_)
print('Intercept: \n', linear.intercept_)
#Predict Output
predicted = linear.predict(x_test)
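The template above uses placeholder variable names, so it cannot run as-is. Here is a runnable version on synthetic data (the data is made up purely for illustration, generated from y = 3x + 2 plus noise):

```python
import numpy as np
from sklearn import linear_model

# Synthetic training data: y = 3x + 2 plus a little Gaussian noise
rng = np.random.RandomState(0)
x_train = rng.rand(50, 1) * 10
y_train = 3 * x_train.ravel() + 2 + rng.randn(50) * 0.1
x_test = np.array([[0.0], [5.0]])

# Create and train the linear regression object
linear = linear_model.LinearRegression()
linear.fit(x_train, y_train)

# The fitted parameters should recover the true slope and intercept
print('Coefficient:', linear.coef_)      # close to [3.]
print('Intercept:', linear.intercept_)   # close to 2.
predicted = linear.predict(x_test)
```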
Logistic regression estimates the probability of an event by fitting the data to a logistic function. Because it estimates a probability, its output always lies between 0 and 1.
#Import Library
from sklearn.linear_model import LogisticRegression
#Assumed you have, X (predictor) and Y (target) for training data set and x_test(predictor) of test_dataset
# Create logistic regression object
model = LogisticRegression()
# Train the model using the training sets and check score
model.fit(X, y)
model.score(X, y)
#Equation coefficient and Intercept
print('Coefficient: \n', model.coef_)
print('Intercept: \n', model.intercept_)
#Predict Output
predicted = model.predict(x_test)
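To see the 0-to-1 probability output described above, `predict_proba` can be called on a fitted model. A small sketch on the iris data (the split parameters here are my own choice):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, random_state=0)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# One probability per class; each row sums to 1 and every entry is in [0, 1]
proba = model.predict_proba(X_test)
```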
This supervised learning algorithm is most often used for classification problems.
#Import Library
#Import other necessary libraries like pandas, numpy...
from sklearn import tree
#Assumed you have, X (predictor) and Y (target) for training data set and x_test(predictor) of test_dataset
# Create tree object
model = tree.DecisionTreeClassifier(criterion='gini')
# For classification you can set the split criterion to 'gini' or 'entropy' (information gain); the default is 'gini'
# model = tree.DecisionTreeRegressor() for regression
# Train the model using the training sets and check score
model.fit(X, y)
model.score(X, y)
#Predict Output
predicted= model.predict(x_test)
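The `criterion` switch mentioned above can be compared directly. A sketch on iris (the depth limit is my own addition; without it both trees would memorize the training set and the comparison would be trivial):

```python
from sklearn.datasets import load_iris
from sklearn import tree

iris = load_iris()
X, y = iris.data, iris.target

# Same depth limit, two different split criteria
gini_tree = tree.DecisionTreeClassifier(criterion='gini', max_depth=3, random_state=0)
entropy_tree = tree.DecisionTreeClassifier(criterion='entropy', max_depth=3, random_state=0)

gini_tree.fit(X, y)
entropy_tree.fit(X, y)

# Training-set accuracy for each criterion (an optimistic estimate)
gini_score = gini_tree.score(X, y)
entropy_score = entropy_tree.score(X, y)
```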
SVM finds a line (hyperplane) that separates two groups of data so that the distance from the closest points of each group to that line is simultaneously maximized.
#Import Library
from sklearn import svm
#Assumed you have, X (predictor) and Y (target) for training data set and x_test(predictor) of test_dataset
# Create SVM classification object
model = svm.SVC()
# There are various options associated with it; this is the simple classification case. See the SVC documentation for more detail.
# Train the model using the training sets and check score
model.fit(X, y)
model.score(X, y)
#Predict Output
predicted= model.predict(x_test)
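Note that the class name is `SVC` (uppercase). A runnable version of the template on the iris data (kernel and C chosen here just for illustration):

```python
from sklearn import svm
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, random_state=0)

# SVC is the classification variant; the kernel defaults to RBF,
# here we use a linear kernel for simplicity
model = svm.SVC(kernel='linear', C=1.0)
model.fit(X_train, y_train)

train_score = model.score(X_train, y_train)
predicted = model.predict(X_test)
```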
A naive Bayes classifier assumes that each feature of a class is independent of its other features. Bayes' theorem gives a way to compute the posterior probability P(c|x) from P(c), P(x), and P(x|c).
#Import Library
from sklearn.naive_bayes import GaussianNB
#Assumed you have, X (predictor) and Y (target) for training data set and x_test(predictor) of test_dataset
# Create Gaussian Naive Bayes object
model = GaussianNB()
# There are variants for other distributions, e.g. Bernoulli Naive Bayes for binary features
# Train the model using the training sets and check score
model.fit(X, y)
#Predict Output
predicted= model.predict(x_test)
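The posterior formula quoted above, P(c|x) = P(x|c)·P(c)/P(x), can be checked with a tiny numeric example (all the numbers are made up for illustration, not from any dataset):

```python
# Made-up illustration: probability an email is spam (c) given that
# it contains the word "offer" (x).
p_c = 0.2            # prior P(spam)
p_x_given_c = 0.6    # likelihood P("offer" | spam)
p_x_given_not_c = 0.05

# Total probability: P(x) = P(x|c)P(c) + P(x|not c)P(not c)
p_x = p_x_given_c * p_c + p_x_given_not_c * (1 - p_c)

# Bayes' theorem: posterior P(c|x) = P(x|c)P(c) / P(x)
p_c_given_x = p_x_given_c * p_c / p_x
print(round(p_c_given_x, 3))  # 0.75
```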
KNN can be used for both classification and regression, but it is mostly used for classification. A new case is assigned to the most common class among its K nearest neighbors, according to a distance function.
These distance functions can be Euclidean, Manhattan, Minkowski, or Hamming distance. The first three are used for continuous variables; the fourth is used for categorical variables.
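The four distance functions just listed can be computed directly in plain Python (the sample vectors and the choice p=3 for Minkowski are arbitrary examples):

```python
a = [1.0, 2.0, 3.0]
b = [4.0, 6.0, 3.0]

# Euclidean distance: square root of the summed squared differences
euclidean = sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

# Manhattan distance: summed absolute differences
manhattan = sum(abs(x - y) for x, y in zip(a, b))

# Minkowski distance generalizes both (p=2 is Euclidean, p=1 is Manhattan)
p = 3
minkowski = sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1 / p)

# Hamming distance: number of positions that differ (for categorical data)
s1, s2 = "karolin", "kathrin"
hamming = sum(c1 != c2 for c1, c2 in zip(s1, s2))
```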
#Import Library
from sklearn.neighbors import KNeighborsClassifier
#Assumed you have, X (predictor) and Y (target) for training data set and x_test(predictor) of test_dataset
# Create KNeighbors classifier object
model = KNeighborsClassifier(n_neighbors=6)
# default value for n_neighbors is 5
# Train the model using the training sets and check score
model.fit(X, y)
#Predict Output
predicted= model.predict(x_test)
K-means is an unsupervised learning algorithm for clustering: it assigns each data point to one of a fixed number of clusters.
How the K-means algorithm forms clusters:
#Import Library
from sklearn.cluster import KMeans
#Assumed you have, X (attributes) for training data set and x_test(attributes) of test_dataset
# Create KMeans object
k_means = KMeans(n_clusters=3, random_state=0)
# Train the model using the training sets and check score
k_means.fit(X)
#Predict Output
predicted = k_means.predict(x_test)
A random forest is a collection of decision trees. To classify a new object by its attributes, each tree produces a classification, which counts as that tree "voting" for the class. The forest chooses the class with the most votes across all its trees.
#Import Library
from sklearn.ensemble import RandomForestClassifier
#Assumed you have, X (predictor) and Y (target) for training data set and x_test(predictor) of test_dataset
# Create Random Forest object
model= RandomForestClassifier()
# Train the model using the training sets and check score
model.fit(X, y)
#Predict Output
predicted= model.predict(x_test)
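The voting mechanism described above can be observed by querying the individual trees of a fitted forest through its `estimators_` attribute (the iris data and `n_estimators=10` are just illustrative choices):

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

iris = load_iris()
X, y = iris.data, iris.target

model = RandomForestClassifier(n_estimators=10, random_state=0)
model.fit(X, y)

# Ask every tree in the forest for its vote on one sample
sample = X[:1]
votes = [int(t.predict(sample)[0]) for t in model.estimators_]

# The forest's own prediction agrees with the majority of the votes
forest_pred = int(model.predict(sample)[0])
```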
Dimensionality reduction cuts the number of variables to analyze while losing as little as possible of the information in the original variables, so that the collected data can still be analyzed comprehensively.
The main dimensionality reduction algorithms are:
Singular Value Decomposition (SVD)
Principal Component Analysis (PCA)
Factor Analysis (FA)
Independent Component Analysis (ICA)
#Import Library
from sklearn import decomposition
#Assumed you have training and test data set as train and test
# Create PCA object
pca = decomposition.PCA(n_components=k) # default value of k is min(n_samples, n_features)
# For Factor analysis
#fa= decomposition.FactorAnalysis()
# Reduced the dimension of training dataset using PCA
train_reduced = pca.fit_transform(train)
#Reduced the dimension of test dataset
test_reduced = pca.transform(test)
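To gauge how much of the original information survives the reduction, `explained_variance_ratio_` reports the share of total variance each kept component carries. A sketch on the iris data (keeping 2 of the 4 dimensions is my own choice):

```python
from sklearn.datasets import load_iris
from sklearn import decomposition

iris = load_iris()

# Keep 2 of the 4 original dimensions
pca = decomposition.PCA(n_components=2)
reduced = pca.fit_transform(iris.data)

# Fraction of the total variance carried by each kept component;
# for iris the first two components retain well over 95%
ratios = pca.explained_variance_ratio_
total_kept = float(ratios.sum())
```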
When we need to process a lot of data to make a prediction with high predictive power, we use boosting algorithms such as GBM and AdaBoost. Boosting is an ensemble learning technique: it combines the predictions of many base estimators to improve the reliability of any single estimator.
#Import Library
from sklearn.ensemble import GradientBoostingClassifier
#Assumed you have, X (predictor) and Y (target) for training data set and x_test(predictor) of test_dataset
# Create Gradient Boosting Classifier object
model= GradientBoostingClassifier(n_estimators=100, learning_rate=1.0, max_depth=1, random_state=0)
# Train the model using the training sets and check score
model.fit(X, y)
#Predict Output
predicted= model.predict(x_test)
The trees that make up a random forest are classification or regression trees built in parallel, and the result is a majority vote; GBDT consists of regression trees that can only be built sequentially, and the result is a weighted sum.
Random forests improve performance by reducing model variance; GBDT improves performance by reducing model bias.
Iris dataset (Iris.txt)
The data describes iris flowers: the dataset contains 150 samples split into 3 classes of 50 samples each, and every sample has 4 attributes: sepal length, sepal width, petal length, and petal width.
The three classes are: setosa, versicolor, virginica.
from sklearn.datasets import load_iris # load the dataset before inspecting it
iris = load_iris()
print(iris.target) # print the ground-truth labels
print(len(iris.target)) # 150 samples
print(iris.data.shape) # each sample has 4 features
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2
2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
2 2]
150
(150, 4)
Code on GitHub
Classification result as a scatter plot of sepal length vs. sepal width:
Classification result as a scatter plot of petal length vs. petal width:
The basic idea of K-means: start with K randomly chosen cluster centers and assign each sample to its nearest center. Then recompute each cluster's centroid as the mean of its members, giving new centers. Iterate until the centers move less than some given threshold.
The K-means clustering algorithm has three main steps: choose initial centers, assign samples to their nearest center, and update each center to the mean of its cluster.
Because the cluster centers are initialized randomly, each run can produce a different clustering.
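The iteration just described can be sketched from scratch (the two-blob toy data and all parameter values below are made up for illustration):

```python
import numpy as np

def kmeans(X, k, n_iter=100, tol=1e-4, seed=0):
    rng = np.random.RandomState(seed)
    # Step 1: pick k random data points as the initial centers
    centers = X[rng.choice(len(X), k, replace=False)]
    for _ in range(n_iter):
        # Step 2: assign every point to its nearest center
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: move each center to the mean of its assigned points
        new_centers = np.array([X[labels == c].mean(axis=0) for c in range(k)])
        # Stop once the centers have essentially stopped moving
        if np.linalg.norm(new_centers - centers) < tol:
            break
        centers = new_centers
    return labels, centers

# Two well-separated synthetic blobs
rng = np.random.RandomState(1)
X = np.vstack([rng.randn(20, 2), rng.randn(20, 2) + 5])
labels, centers = kmeans(X, k=2)
```

Because the initial centers are drawn at random, a different seed can yield a different labeling of the same clusters, which matches the note above about runs differing.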
This week was mostly spent playing with code; I did not get much further in the book. The focus was the practical application of these classification algorithms. I thought the library-call templates collected here would be enough, but once I started coding I found some implementations still had problems, mainly because I am not yet fluent with data handling and calling sklearn. I have pushed a few implemented algorithms to GitHub and will fill in the rest of the top-10 algorithms later.
I finished the first part of Andrew Ng's course in spare moments but have not summarized it yet. Next week I hope to finish the optimization algorithms in Improving Deep Neural Networks and then write up both together.