For learning purposes only.
Reference: https://github.com/Avik-Jain/100-Days-Of-ML-Code
Use a single feature to predict the outcome.
# coding:utf-8
'''
Simple linear regression
'''
import matplotlib.pyplot as plt
import pandas as pd
dataset = pd.read_csv('~/git/100-Days-Of-ML-Code/datasets/studentscores.csv')
X = dataset.iloc[:, :1].values  # the single feature (first column)
Y = dataset.iloc[:, 1].values   # the target (second column)
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=1 / 4, random_state=0)
from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor = regressor.fit(X_train, Y_train)
Y_pred = regressor.predict(X_test)
# Visualize the regression line on the training set
plt.scatter(X_train, Y_train, color='red')
plt.plot(X_train, regressor.predict(X_train), color='blue')
plt.show()
# Visualize the predictions on the test set
plt.scatter(X_test, Y_test, color='red')
plt.plot(X_test, Y_pred, color='blue')
plt.show()
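As a quick numeric check, we can inspect the fitted slope, intercept, and test-set R². A minimal sketch reusing the regressor fitted above (coef_, intercept_, and score are standard LinearRegression attributes):
print('slope:', regressor.coef_[0])        # change in target per unit of the feature
print('intercept:', regressor.intercept_)  # predicted value at feature = 0
print('test R^2:', regressor.score(X_test, Y_test))  # fraction of variance explained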
Logistic regression tackles a different class of problem: classification, where the goal is to predict which category an object belongs to. The outcome is discrete, between 0 and 1.
It uses the logistic function, the sigmoid.
Logistic regression gives discrete results; linear regression gives continuous results.
Learn how the loss function is computed, and how gradient descent is used to reduce the loss during training.
https://github.com/Avik-Jain/100-Days-Of-ML-Code/blob/master/Code/Day%206%20Logistic%20Regression.md
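To make these notes concrete, here is a minimal sketch (illustrative, not from the repo) of the sigmoid and the binary cross-entropy loss that gradient descent minimizes:
import numpy as np

def sigmoid(z):
    # squashes any real number into (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

def binary_cross_entropy(y_true, y_prob):
    # average negative log-likelihood of the true labels
    eps = 1e-12  # avoid log(0)
    y_prob = np.clip(y_prob, eps, 1 - eps)
    return -np.mean(y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob))

print(sigmoid(0.0))  # 0.5, the decision boundary
print(binary_cross_entropy(np.array([1, 0]), np.array([0.9, 0.1])))  # confident, correct -> low loss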
'''
Logistic regression implementation
'''
import pandas as pd
dataset = pd.read_csv('/Users/huihui/git/100-Days-Of-ML-Code/datasets/Social_Network_Ads.csv')
X = dataset.iloc[:, [2, 3]].values  # features: Age, EstimatedSalary
y = dataset.iloc[:, 4].values       # label: Purchased (0/1)
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
from sklearn.linear_model import LogisticRegression
classifier = LogisticRegression()
classifier.fit(X_train, y_train)
y_pred = classifier.predict(X_test)
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)  # rows = true classes, columns = predicted classes
print(cm)
Finding a good value of k is not easy.
A small k means the results are noisy;
a large k makes the computation expensive.
It depends on the individual case; it is best to run over the plausible k values and then decide for yourself.
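A minimal sketch of that advice, reusing the scaled X_train/X_test/y_train/y_test split from the logistic regression example above (the candidate k values are placeholders):
from sklearn.neighbors import KNeighborsClassifier

for k in [1, 3, 5, 7, 9, 11]:
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_train, y_train)
    print(k, knn.score(X_test, y_test))  # test accuracy for each candidate k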
Study this:
https://towardsdatascience.com/logistic-regression-detailed-overview-46c4da4303bc
A brief introduction to what an SVM is
and how it can be used to solve classification problems.
Dig deeper into SVM; implement the k-nearest neighbors algorithm.
Implement the KNN algorithm to complete a classification task.
SVM can solve both classification and regression problems, but it is mostly used for classification tasks.
In this algorithm, we plot every data point as a point in N-dimensional space, where N is the number of features.
Note:
some data is linearly separable, some is not.
Implement SVM with scikit-learn:
# coding:utf-8
# 2019/10/10 15:03
# huihui
# ref:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
dataset = pd.read_csv('~/git/100-Days-Of-ML-Code/datasets/Social_Network_Ads.csv')
X = dataset.iloc[:, [2, 3]].values
y = dataset.iloc[:, 4].values
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)  # reuse the training-set scaler; don't re-fit on test data
from sklearn.svm import SVC
classifier = SVC(kernel='linear', random_state=0)
classifier.fit(X_train, y_train)
y_pred = classifier.predict(X_test)
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)
print(cm)
# Visualize the decision regions on the training set
from matplotlib.colors import ListedColormap
X_set, y_set = X_train, y_train
X1, X2 = np.meshgrid(np.arange(start=X_set[:, 0].min() - 1, stop=X_set[:, 0].max() + 1, step=0.01),
                     np.arange(start=X_set[:, 1].min() - 1, stop=X_set[:, 1].max() + 1, step=0.01))
plt.contourf(X1, X2, classifier.predict(np.array([X1.ravel(), X2.ravel()]).T).reshape(X1.shape),
             alpha=0.75, cmap=ListedColormap(('red', 'green')))
plt.xlim(X1.min(), X1.max())
plt.ylim(X2.min(), X2.max())
for i, j in enumerate(np.unique(y_set)):
    plt.scatter(X_set[y_set == j, 0], X_set[y_set == j, 1],
                c=ListedColormap(('red', 'green'))(i), label=j)
plt.title('SVM (Training set)')
plt.xlabel('Age')
plt.ylabel('Estimated Salary')
plt.legend()
plt.show()
Learn about the different types of naive Bayes classifiers.
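A minimal sketch of the Gaussian variant, assuming the scaled Social_Network_Ads split from the examples above (scikit-learn also ships MultinomialNB and BernoulliNB):
from sklearn.naive_bayes import GaussianNB

nb_classifier = GaussianNB()
nb_classifier.fit(X_train, y_train)         # reuses the scaled split from above
print(nb_classifier.score(X_test, y_test))  # mean accuracy on the test set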
https://bloomberg.github.io/foml/#home
Also started the lectures by Bloomberg. The first one in the playlist was Black Box Machine Learning. It gives a complete overview of prediction functions, feature extraction, learning algorithms, performance evaluation, cross-validation, sample bias, nonstationarity, overfitting, and hyperparameter tuning.
Implement the SVM algorithm with scikit-learn, adding a kernel that maps the data points into a higher-dimensional space.
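A minimal sketch of that idea: replace kernel='linear' with an RBF kernel in the SVC example above, so the decision boundary can curve in the original feature space:
from sklearn.svm import SVC

rbf_classifier = SVC(kernel='rbf', random_state=0)  # implicit mapping to a higher-dimensional space
rbf_classifier.fit(X_train, y_train)                # assumes the scaled split from above
print(rbf_classifier.score(X_test, y_test))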
Completed the whole of Week 1 and Week 2 in a single day. Learned logistic regression as a neural network.
[omitted]
Lecture 2 of 18 of Caltech’s Machine Learning Course - CS 156 by Professor Yaser Abu-Mostafa. Learned about Hoeffding Inequality.
ID3
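ID3 grows a decision tree by choosing, at each node, the split with the highest information gain, i.e., the largest drop in entropy. A minimal, illustrative sketch of the entropy computation:
import numpy as np

def entropy(labels):
    # Shannon entropy (in bits) of a 1-D array of class labels
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

print(entropy(np.array([0, 0, 1, 1])))  # 1.0 bit: a maximally mixed node
print(entropy(np.array([0, 0, 0, 0])))  # 0 bits: a pure node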
Lecture 3 of the Bloomberg ML course introduced some of the core concepts: input space, action space, outcome space, prediction functions, loss functions, and hypothesis spaces.
# coding:utf-8
# 2019/10/10 15:16
# huihui
# ref:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
dataset = pd.read_csv('~/git/100-Days-Of-ML-Code/datasets/Social_Network_Ads.csv')
X = dataset.iloc[:, [2, 3]].values
y = dataset.iloc[:, 4].values
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
from sklearn.tree import DecisionTreeClassifier
classifier = DecisionTreeClassifier(criterion='entropy', random_state=0)
classifier.fit(X_train, y_train)
y_pred = classifier.predict(X_test)
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)
print(cm)
# Visualize the decision regions on the training set
from matplotlib.colors import ListedColormap
X_set, y_set = X_train, y_train
X1, X2 = np.meshgrid(np.arange(start=X_set[:, 0].min() - 1, stop=X_set[:, 0].max() + 1, step=0.01),
                     np.arange(start=X_set[:, 1].min() - 1, stop=X_set[:, 1].max() + 1, step=0.01))
plt.contourf(X1, X2, classifier.predict(np.array([X1.ravel(), X2.ravel()]).T).reshape(X1.shape),
             alpha=0.75, cmap=ListedColormap(('red', 'green')))
plt.xlim(X1.min(), X1.max())
plt.ylim(X2.min(), X2.max())
for i, j in enumerate(np.unique(y_set)):
    plt.scatter(X_set[y_set == j, 0], X_set[y_set == j, 1],
                c=ListedColormap(('red', 'green'))(i), label=j)
plt.title('Decision Tree Classification (Training set)')
plt.xlabel('Age')
plt.ylabel('Estimated Salary')
plt.legend()
plt.show()
# Visualize the decision regions on the test set
from matplotlib.colors import ListedColormap
X_set, y_set = X_test, y_test
X1, X2 = np.meshgrid(np.arange(start=X_set[:, 0].min() - 1, stop=X_set[:, 0].max() + 1, step=0.01),
                     np.arange(start=X_set[:, 1].min() - 1, stop=X_set[:, 1].max() + 1, step=0.01))
plt.contourf(X1, X2, classifier.predict(np.array([X1.ravel(), X2.ravel()]).T).reshape(X1.shape),
             alpha=0.75, cmap=ListedColormap(('red', 'green')))
plt.xlim(X1.min(), X1.max())
plt.ylim(X2.min(), X2.max())
for i, j in enumerate(np.unique(y_set)):
    plt.scatter(X_set[y_set == j, 0], X_set[y_set == j, 1],
                c=ListedColormap(('red', 'green'))(i), label=j)
plt.title('Decision Tree Classification (Test set)')
plt.xlabel('Age')
plt.ylabel('Estimated Salary')
plt.legend()
plt.show()
Random forest is a supervised ensemble-learning model used for both classification and regression.
A random forest builds multiple decision trees and merges them together to obtain a more accurate and stable prediction.
https://github.com/Avik-Jain/100-Days-Of-ML-Code/blob/master/Code/Day%2034%20Random_Forest.md
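A minimal scikit-learn sketch, following the same pattern as the classifiers above (n_estimators, the number of trees, is a placeholder value):
from sklearn.ensemble import RandomForestClassifier

forest = RandomForestClassifier(n_estimators=10, criterion='entropy', random_state=0)
forest.fit(X_train, y_train)         # assumes the scaled split from above
print(forest.score(X_test, y_test))  # mean accuracy on the test set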
https://www.youtube.com/watch?v=aircAruvnKk&t=7s
A very good way to build intuition about neural networks.
Explains the relevant concepts through a handwritten-digit-recognition example.
https://www.youtube.com/watch?v=IHZwWFHWa-w
Explains the concept of gradient descent in a humorous way.
Highly recommended; a must-watch.
https://www.youtube.com/watch?v=Ilg3gGewQ5U
Explains partial derivatives and backpropagation.
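To pin down the gradient-descent idea from these videos, a tiny illustrative example: minimize f(w) = (w - 3)^2 by repeatedly stepping against its derivative f'(w) = 2(w - 3):
w = 0.0
learning_rate = 0.1
for step in range(50):
    grad = 2 * (w - 3)          # derivative of (w - 3)^2 at the current w
    w -= learning_rate * grad   # step downhill
print(w)  # converges toward 3, the minimizer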
https://www.youtube.com/watch?v=tIeHLnjs5U8
https://www.youtube.com/watch?v=wQ8BIBpya2k&t=19s&index=2&list=PLQVvvaa0QuDfhTox0AjmQ6tvTgMBZBEXN
https://www.youtube.com/watch?v=j-3vuBynnOE&list=PLQVvvaa0QuDfhTox0AjmQ6tvTgMBZBEXN&index=2
Deep learning fundamentals.
https://www.youtube.com/watch?v=WvoLTXIjBYU&list=PLQVvvaa0QuDfhTox0AjmQ6tvTgMBZBEXN&index=3
https://www.youtube.com/watch?v=BqgTU7_cBnk&list=PLQVvvaa0QuDfhTox0AjmQ6tvTgMBZBEXN&index=4
Think about unsupervised learning; study clustering.
https://github.com/Avik-Jain/100-Days-Of-ML-Code/blob/master
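A minimal k-means sketch for the clustering study, on synthetic data so it runs standalone (all names and values here are illustrative):
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.RandomState(0)
points = np.vstack([rng.randn(50, 2),            # blob around (0, 0)
                    rng.randn(50, 2) + [5, 5]])  # blob around (5, 5)
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)
print(kmeans.cluster_centers_)  # roughly the two blob centers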
https://github.com/jakevdp/PythonDataScienceHandbook
Introduction to NumPy. Covered topics like data types, NumPy arrays, and computations on NumPy arrays.
Aggregations, comparisons, and broadcasting (see the sketch after the notebook links below).
Link to Notebook:
Aggregations: Min, Max, and Everything In Between
Computation on Arrays: Broadcasting
Comparisons, Masks, and Boolean Logic
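A few lines capturing the gist of those notebooks (aggregation, broadcasting, boolean masking); a minimal illustrative sketch:
import numpy as np

a = np.arange(12).reshape(3, 4)
print(a.sum(axis=0), a.max())          # aggregations: column sums, global max
print(a + np.array([10, 20, 30, 40]))  # broadcasting: row vector added to every row
print(a[a % 2 == 0])                   # boolean mask: select the even entries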
Fancy Indexing, Sorting Arrays, Structured Data
Link to Notebook:
Fancy Indexing
Sorting Arrays
Structured Data: NumPy’s Structured Arrays
Data Manipulation with Pandas
Covered various topics like Pandas Objects, Data Indexing and Selection, Operating on Data, Handling Missing Data, Hierarchical Indexing, Concat and Append.
Link To the Notebooks:
Data Manipulation with Pandas
Introducing Pandas Objects
Data Indexing and Selection
Operating on Data in Pandas
Handling Missing Data
Hierarchical Indexing
Combining Datasets: Concat and Append
Chapter 3: Completed the following topics: Merge and Join, Aggregation and Grouping, and Pivot Tables (see the sketch after the notebook links below).
Combining Datasets: Merge and Join
Aggregation and Grouping
Pivot Tables
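A tiny illustrative sketch of grouping and pivoting (the DataFrame and its column names are made up):
import pandas as pd

df = pd.DataFrame({'city': ['A', 'A', 'B', 'B'],
                   'year': [2018, 2019, 2018, 2019],
                   'sales': [10, 12, 20, 24]})
print(df.groupby('city')['sales'].sum())                             # aggregation and grouping
print(df.pivot_table(values='sales', index='city', columns='year'))  # pivot table (mean by default)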
Chapter 3: Vectorized String Operations, Working with Time Series
Links to Notebooks:
Vectorized String Operations
Working with Time Series
High-Performance Pandas: eval() and query()
Visualization with Matplotlib
Learned about Simple Line Plots, Simple Scatter Plots, and Density and Contour Plots (see the sketch after the notebook links below).
Links to Notebooks:
Visualization with Matplotlib
Simple Line Plots
Simple Scatter Plots
Visualizing Errors
Density and Contour Plots
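A compact, illustrative sketch touching each of those plot types:
import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(0, 10, 100)
plt.plot(x, np.sin(x))                 # simple line plot
plt.scatter(x[::10], np.sin(x[::10]))  # simple scatter plot on sampled points
plt.show()

X, Y = np.meshgrid(x, x)
plt.contour(X, Y, np.sin(X) * np.cos(Y))  # contour plot of a 2-D function
plt.show()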
Visualization with Matplotlib
Learned about histograms, customizing plot legends and colorbars, and building multiple subplots.
Links to Notebooks:
Histograms, Binnings, and Density
Customizing Plot Legends
Customizing Colorbars
Multiple Subplots
Text and Annotation
Three-dimensional plotting
Links to Notebooks:
Three-Dimensional Plotting in Matplotlib
Study hierarchical clustering.
Animated figures.
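A minimal hierarchical-clustering sketch with SciPy on synthetic data ('ward' is one common linkage choice); the dendrogram is the static version of those animated merge sequences:
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage

rng = np.random.RandomState(0)
points = np.vstack([rng.randn(10, 2),            # one small blob
                    rng.randn(10, 2) + [4, 4]])  # a second, well-separated blob
Z = linkage(points, method='ward')  # bottom-up merge tree over the points
dendrogram(Z)                       # visualize the merge order and distances
plt.show()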