100-Days-Of-ML-Code Notes

For learning purposes only.
Reference: https://github.com/Avik-Jain/100-Days-Of-ML-Code


day1 Data Preprocessing

  • Import the necessary libraries
  • Import the dataset
  • Handle missing data
  • Encode categorical data
  • Split the dataset into a training set and a test set
  • Feature scaling (see the sketch after this list)
    Most machine learning algorithms use the Euclidean distance between two data points in their computations, so features on very different scales must be rescaled to keep any single feature from dominating the distance.

day2 Simple Linear Regression

Predict an outcome using a single feature.

# coding:utf-8
'''
Simple linear regression
'''
import matplotlib.pyplot as plt
import pandas as pd

# studentscores.csv: hours studied vs. exam score
dataset = pd.read_csv('~/git/100-Days-Of-ML-Code/datasets/studentscores.csv')
X = dataset.iloc[:, :1].values
Y = dataset.iloc[:, 1].values

from sklearn.model_selection import train_test_split

# Hold out a quarter of the data as the test set
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=1 / 4, random_state=0)

from sklearn.linear_model import LinearRegression

# Fit the regression line to the training set
regressor = LinearRegression()
regressor = regressor.fit(X_train, Y_train)

# Predict the test-set results
Y_pred = regressor.predict(X_test)

# Visualize the fitted line against the training and test points
plt.scatter(X_train, Y_train, color='red')
plt.plot(X_train, regressor.predict(X_train), color='blue')
plt.scatter(X_test, Y_test, color='red')
plt.plot(X_test, regressor.predict(X_test), color='blue')

plt.show()

day3 Multiple Linear Regression

day4 Logistic Regression

Logistic regression tackles a different kind of problem: classification, where the goal is to predict which category an observation belongs to. The output is discrete: a probability between 0 and 1, produced by the sigmoid (logistic) function.
Logistic regression yields discrete class predictions, while linear regression yields continuous outputs.
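
As a quick illustration, the sigmoid squashes any real-valued score into (0, 1), which is then read as a class probability (a minimal sketch):

import numpy as np

def sigmoid(z):
    # Maps any real number into the interval (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

print(sigmoid(0))   # 0.5: right on the decision boundary
print(sigmoid(4))   # ~0.98: strongly positive class
print(sigmoid(-4))  # ~0.02: strongly negative class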

day5 Logistic Regression

Learned how the loss function is computed, and how gradient descent is used during training to reduce the loss.
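
A minimal NumPy sketch of one way to do this: batch gradient descent on the cross-entropy loss. The toy data and the learning rate are assumptions chosen purely for illustration.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy AND-gate data, for illustration only: 4 samples, 2 features
X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
y = np.array([0., 0., 0., 1.])

w = np.zeros(X.shape[1])
b = 0.0
lr = 0.5  # learning rate, an arbitrary choice here

for _ in range(1000):
    p = sigmoid(X @ w + b)           # predicted probabilities
    grad_w = X.T @ (p - y) / len(y)  # gradient of the cross-entropy loss w.r.t. w
    grad_b = np.mean(p - y)
    w -= lr * grad_w                 # step against the gradient to reduce the loss
    b -= lr * grad_b

print(w, b, sigmoid(X @ w + b).round(2))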

day6 Implementing Logistic Regression

https://github.com/Avik-Jain/100-Days-Of-ML-Code/blob/master/Code/Day%206%20Logistic%20Regression.md


'''
Implementing logistic regression
'''

import pandas as pd

dataset = pd.read_csv('/Users/huihui/git/100-Days-Of-ML-Code/datasets/Social_Network_Ads.csv')

# Features: Age and Estimated Salary (columns 2 and 3); label: Purchased (column 4)
X = dataset.iloc[:, [2, 3]].values
y = dataset.iloc[:, 4].values

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

from sklearn.preprocessing import StandardScaler

# Standardize the features; fit the scaler on the training set only,
# then apply the same transform to the test set
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)

from sklearn.linear_model import LogisticRegression

classifier = LogisticRegression()
classifier.fit(X_train, y_train)

y_pred = classifier.predict(X_test)

from sklearn.metrics import confusion_matrix

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_test, y_pred)
print(cm)

day7 K-Nearest Neighbors (KNN)

Finding a good value of k is not easy.
A small k makes the result sensitive to noise;
a large k makes the computation expensive.
It depends on the case at hand; in practice, the best approach is to try the plausible values of k and decide for yourself, as in the sketch below.
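
One practical way to "try the plausible values of k" is cross-validation. A minimal sketch; the iris dataset and the candidate range 1-20 are stand-ins for illustration.

from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)  # toy stand-in dataset

# Score each candidate k with 5-fold cross-validation and keep the best
scores = {}
for k in range(1, 21):
    clf = KNeighborsClassifier(n_neighbors=k)
    scores[k] = cross_val_score(clf, X, y, cv=5).mean()

best_k = max(scores, key=scores.get)
print(best_k, round(scores[best_k], 3))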

day8 The Math Behind Logistic Regression

Study here:
https://towardsdatascience.com/logistic-regression-detailed-overview-46c4da4303bc

day9 Support Vector Machines (SVM)

A brief introduction to what an SVM is
and how it is used to solve classification problems.

day10 SVM and KNN

Dug deeper into SVM and implemented the K-nearest neighbors algorithm.

day11 Implementing K-Nearest Neighbors

Implemented the KNN algorithm for a classification task; a sketch follows.
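
A sketch of the day-11 task with scikit-learn, following the same pipeline as the SVM example below. n_neighbors=5 with the Minkowski metric and p=2 (i.e. Euclidean distance) are scikit-learn's defaults, written out explicitly here.

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import confusion_matrix

dataset = pd.read_csv('~/git/100-Days-Of-ML-Code/datasets/Social_Network_Ads.csv')
X = dataset.iloc[:, [2, 3]].values
y = dataset.iloc[:, 4].values

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

# KNN is distance-based, so feature scaling matters here too
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)

# p=2 with the Minkowski metric is the Euclidean distance
classifier = KNeighborsClassifier(n_neighbors=5, metric='minkowski', p=2)
classifier.fit(X_train, y_train)

y_pred = classifier.predict(X_test)
print(confusion_matrix(y_test, y_pred))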

day12 Support Vector Machines

SVM can solve both classification and regression problems, but it is mostly used for classification tasks.
In this algorithm, each data point is plotted as a point in N-dimensional space, where N is the number of features.

  • How does it classify?
    It finds a hyperplane that separates the classes.
    In other words, the algorithm outputs an optimal hyperplane that classifies new samples.
  • What is the optimal hyperplane?
    The one that keeps the maximum margin to every class.
    In other words, the hyperplane whose distance to the nearest element of each class is the largest.

Note:
data may be linearly separable or not. Key terms (see the sketch after this list):

  • kernel
  • gamma
  • regularization
  • margin
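
A sketch tying these four terms to scikit-learn's SVC on a toy dataset (the values of C and gamma are arbitrary illustrations): kernel chooses the shape of the decision boundary, gamma controls how far a single sample's influence reaches under the RBF kernel, and C is the regularization knob that trades a wider margin against misclassified training points.

from sklearn.datasets import make_moons
from sklearn.svm import SVC

X, y = make_moons(noise=0.2, random_state=0)  # a linearly non-separable toy set

# kernel='rbf' handles the non-linear boundary;
# smaller C -> softer, wider margin (more regularization);
# larger gamma -> more local influence per sample, wigglier boundary
clf = SVC(kernel='rbf', C=1.0, gamma=0.5)
clf.fit(X, y)
print(clf.score(X, y))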

day13 Naive Bayes Classifier

Started implementing SVM with scikit-learn.

day14 Implementing SVM

# coding:utf-8
# 2019/10/10 15:03
# huihui
# ref:


import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

dataset = pd.read_csv('~/git/100-Days-Of-ML-Code/datasets/Social_Network_Ads.csv')
X = dataset.iloc[:, [2, 3]].values
y = dataset.iloc[:, 4].values

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

from sklearn.preprocessing import StandardScaler

# Standardize the features; fit the scaler on the training set only,
# then apply the same transform to the test set
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)

from sklearn.svm import SVC

# Linear-kernel SVM
classifier = SVC(kernel='linear', random_state=0)
classifier.fit(X_train, y_train)

y_pred = classifier.predict(X_test)

from sklearn.metrics import confusion_matrix

cm = confusion_matrix(y_test, y_pred)

from matplotlib.colors import ListedColormap

# Visualize the decision boundary: predict on a fine grid over the
# feature plane, then color each region by the predicted class
X_set, y_set = X_train, y_train
X1, X2 = np.meshgrid(np.arange(start=X_set[:, 0].min() - 1, stop=X_set[:, 0].max() + 1, step=0.01),
                     np.arange(start=X_set[:, 1].min() - 1, stop=X_set[:, 1].max() + 1, step=0.01))
plt.contourf(X1, X2, classifier.predict(np.array([X1.ravel(), X2.ravel()]).T).reshape(X1.shape),
             alpha=0.75, cmap=ListedColormap(('red', 'green')))
plt.xlim(X1.min(), X1.max())
plt.ylim(X2.min(), X2.max())
# Overlay the training points, colored by their true class
for i, j in enumerate(np.unique(y_set)):
    plt.scatter(X_set[y_set == j, 0], X_set[y_set == j, 1],
                c=ListedColormap(('red', 'green'))(i), label=j)
plt.title('SVM (Training set)')
plt.xlabel('Age')
plt.ylabel('Estimated Salary')
plt.legend()
plt.show()

day15 Naive Bayes Classifier and Black Box Machine Learning

Learned about the different types of naive Bayes classifiers.
https://bloomberg.github.io/foml/#home
Also started the Bloomberg lectures; the first one in the playlist was Black Box Machine Learning. It gives an overview of prediction functions, feature extraction, learning algorithms, performance evaluation, cross-validation, sample bias, nonstationarity, overfitting, and hyperparameter tuning.
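
scikit-learn exposes the common variants directly: GaussianNB for continuous features, MultinomialNB for counts, BernoulliNB for binary features. A minimal sketch using the iris data (a stand-in with continuous features, hence GaussianNB):

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)  # continuous features -> GaussianNB
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = GaussianNB()
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))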

day16 Implementing SVM with the Kernel Trick

Implemented the SVM algorithm with scikit-learn, adding a kernel to map the data points into a higher-dimensional space.
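
Relative to the day-14 linear SVM above, the kernel trick is a one-argument change. A minimal sketch on toy data where no straight line can separate the classes:

from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Concentric circles: impossible to separate with a straight line
X, y = make_circles(noise=0.1, factor=0.4, random_state=0)

print(SVC(kernel='linear').fit(X, y).score(X, y))  # near-chance accuracy
print(SVC(kernel='rbf').fit(X, y).score(X, y))     # near-perfect accuracy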

day17 Starting Deep Learning

Completed the whole of Week 1 and Week 2 in a single day. Learned about logistic regression as a neural network.

day18 Deep Learning

day21 Web Scraping

[omitted]

day22 Is Learning Feasible?

Lecture 2 of 18 of Caltech's Machine Learning Course, CS 156 by Professor Yaser Abu-Mostafa. Learned about the Hoeffding Inequality.

day23 Decision Trees

ID3: grows the tree greedily, splitting at each node on the feature with the highest information gain (an entropy-based criterion).

day24 Introduction to Statistical Learning Theory

Lecture 3 of the Bloomberg ML course introduced some of the core concepts, like input space, action space, outcome space, prediction functions, loss functions, and hypothesis spaces.

day25 Implementing Decision Trees

# coding:utf-8
# 2019/10/10 15:16
# huihui
# ref:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

dataset = pd.read_csv('~/git/100-Days-Of-ML-Code/datasets/Social_Network_Ads.csv')
X = dataset.iloc[:, [2, 3]].values
y = dataset.iloc[:, 4].values

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

from sklearn.preprocessing import StandardScaler

# Standardize the features; fit the scaler on the training set only
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)

from sklearn.tree import DecisionTreeClassifier

# The entropy criterion splits on the feature with the highest information gain
classifier = DecisionTreeClassifier(criterion='entropy', random_state=0)
classifier.fit(X_train, y_train)

y_pred = classifier.predict(X_test)

from sklearn.metrics import confusion_matrix

cm = confusion_matrix(y_test, y_pred)

from matplotlib.colors import ListedColormap

# Decision regions on the training set: predict over a fine grid
# covering the feature plane and color each region by class
X_set, y_set = X_train, y_train
X1, X2 = np.meshgrid(np.arange(start=X_set[:, 0].min() - 1, stop=X_set[:, 0].max() + 1, step=0.01),
                     np.arange(start=X_set[:, 1].min() - 1, stop=X_set[:, 1].max() + 1, step=0.01))
plt.contourf(X1, X2, classifier.predict(np.array([X1.ravel(), X2.ravel()]).T).reshape(X1.shape),
             alpha=0.75, cmap=ListedColormap(('red', 'green')))
plt.xlim(X1.min(), X1.max())
plt.ylim(X2.min(), X2.max())
for i, j in enumerate(np.unique(y_set)):
    plt.scatter(X_set[y_set == j, 0], X_set[y_set == j, 1],
                c=ListedColormap(('red', 'green'))(i), label=j)
plt.title('Decision Tree Classification (Training set)')
plt.xlabel('Age')
plt.ylabel('Estimated Salary')
plt.legend()
plt.show()

from matplotlib.colors import ListedColormap

# The same plot on the test set
X_set, y_set = X_test, y_test
X1, X2 = np.meshgrid(np.arange(start=X_set[:, 0].min() - 1, stop=X_set[:, 0].max() + 1, step=0.01),
                     np.arange(start=X_set[:, 1].min() - 1, stop=X_set[:, 1].max() + 1, step=0.01))
plt.contourf(X1, X2, classifier.predict(np.array([X1.ravel(), X2.ravel()]).T).reshape(X1.shape),
             alpha=0.75, cmap=ListedColormap(('red', 'green')))
plt.xlim(X1.min(), X1.max())
plt.ylim(X2.min(), X2.max())
for i, j in enumerate(np.unique(y_set)):
    plt.scatter(X_set[y_set == j, 0], X_set[y_set == j, 1],
                c=ListedColormap(('red', 'green'))(i), label=j)
plt.title('Decision Tree Classification (Test set)')
plt.xlabel('Age')
plt.ylabel('Estimated Salary')
plt.legend()
plt.show()

day30 Calculus

day33 Random Forest

Random forest is a supervised ensemble-learning model used for both classification and regression.
A random forest builds multiple decision trees and merges them together to obtain a more accurate and stable prediction.

  • Two steps (see the sketch after this list):
  1. Randomly build a forest
  2. Make predictions
  • The difference between a random forest and a decision tree:
    in a random forest, the search for the root node and for the feature-split nodes is randomized.
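
A minimal sketch with scikit-learn on the same dataset as the earlier days; n_estimators=10 and the entropy criterion are illustrative choices.

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix

dataset = pd.read_csv('~/git/100-Days-Of-ML-Code/datasets/Social_Network_Ads.csv')
X = dataset.iloc[:, [2, 3]].values
y = dataset.iloc[:, 4].values

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

# Each of the 10 trees sees a bootstrap sample and random feature subsets;
# the forest prediction is the majority vote of the trees.
# Tree ensembles do not require feature scaling.
classifier = RandomForestClassifier(n_estimators=10, criterion='entropy', random_state=0)
classifier.fit(X_train, y_train)

print(confusion_matrix(y_test, classifier.predict(X_test)))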

day34 Implementing Random Forest

https://github.com/Avik-Jain/100-Days-Of-ML-Code/blob/master/Code/Day%2034%20Random_Forest.md

day35 What Is a Neural Network?

https://www.youtube.com/watch?v=aircAruvnKk&t=7s
A very good introduction to neural networks.
Explains the relevant concepts through the example of handwritten digit recognition.

day36 Gradient Descent: How Do Neural Networks Learn?

https://www.youtube.com/watch?v=IHZwWFHWa-w
Explains the concept of gradient descent in a humorous way.
Highly recommended.

day37 What Is Backpropagation Doing?

https://www.youtube.com/watch?v=Ilg3gGewQ5U
Explains partial derivatives and backpropagation.

day38 Backpropagation Calculus

https://www.youtube.com/watch?v=tIeHLnjs5U8

day39 Deep Learning: Python, TensorFlow and Keras Tutorial

https://www.youtube.com/watch?v=wQ8BIBpya2k&t=19s&index=2&list=PLQVvvaa0QuDfhTox0AjmQ6tvTgMBZBEXN

day40 Loading Your Own Data

https://www.youtube.com/watch?v=j-3vuBynnOE&list=PLQVvvaa0QuDfhTox0AjmQ6tvTgMBZBEXN&index=2
Deep learning basics.

day41 Convolutional Neural Networks

https://www.youtube.com/watch?v=WvoLTXIjBYU&list=PLQVvvaa0QuDfhTox0AjmQ6tvTgMBZBEXN&index=3

day42 Analyzing Models with TensorBoard

https://www.youtube.com/watch?v=BqgTU7_cBnk&list=PLQVvvaa0QuDfhTox0AjmQ6tvTgMBZBEXN&index=4

day43 K-Means Clustering

Moved on to unsupervised learning and studied clustering; a sketch follows.
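
A minimal sketch with scikit-learn; the blob data and k=3 are assumptions for illustration, and in practice k is often chosen with the elbow method on inertia_.

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)  # toy data, labels unused

# n_init=10 restarts with different centroid seeds and keeps the best result
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)
print(kmeans.cluster_centers_)
print(kmeans.inertia_)  # within-cluster sum of squares, used by the elbow method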

day44 Implementing K-Means Clustering

https://github.com/Avik-Jain/100-Days-Of-ML-Code/blob/master

day45 numpy-1

https://github.com/jakevdp/PythonDataScienceHandbook
Introduction to NumPy. Covered topics like data types, NumPy arrays, and computations on NumPy arrays.

  • Studied:

    Introduction to NumPy

    Understanding Data Types in Python

    The Basics of NumPy Arrays

    Computation on NumPy Arrays: Universal Functions

day46 numpy-2

Aggregations, Comparisons, and Broadcasting
Links to Notebooks:

Aggregations: Min, Max, and Everything In Between

Computation on Arrays: Broadcasting

Comparisons, Masks, and Boolean Logic

day47 numpy-3

Fancy Indexing, Sorting Arrays, Structured Data

Links to Notebooks:

Fancy Indexing

Sorting Arrays

Structured Data: NumPy’s Structured Arrays

day48 pandas-1

Data Manipulation with Pandas

Covered various topics like Pandas Objects, Data Indexing and Selection, Operating on Data, Handling Missing Data, Hierarchical Indexing, and Concat and Append.

Links to the Notebooks:

Data Manipulation with Pandas

Introducing Pandas Objects

Data Indexing and Selection

Operating on Data in Pandas

Handling Missing Data

Hierarchical Indexing

Combining Datasets: Concat and Append

day49 pandas-2

Chapter 3: Completed the following topics: Merge and Join, Aggregation and Grouping, and Pivot Tables.

Combining Datasets: Merge and Join

Aggregation and Grouping

Pivot Tables

day50 pandas-3

Chapter 3: Vectorized String Operations, Working with Time Series

Links to Notebooks:

Vectorized String Operations

Working with Time Series

High-Performance Pandas: eval() and query()

day51 matplotlib-1

Visualization with Matplotlib
Learned about Simple Line Plots, Simple Scatter Plots, Visualizing Errors, and Density and Contour Plots.

Links to Notebooks:

Visualization with Matplotlib

Simple Line Plots

Simple Scatter Plots

Visualizing Errors

Density and Contour Plots

day52 matplotlib-2

Visualization with Matplotlib
Learned about Histograms, customizing plot legends and colorbars, and building Multiple Subplots.
Links to Notebooks:


Histograms, Binnings, and Density

Customizing Plot Legends

Customizing Colorbars

Multiple Subplots

Text and Annotation

day53 matplotlib-3

Three-dimensional plotting.
Links to Notebooks:
Three-Dimensional Plotting in Matplotlib

day54 Hierarchical Clustering

Studied hierarchical clustering, with animated illustrations; a sketch follows.
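
A minimal sketch with SciPy of agglomerative clustering with Ward linkage; the toy data is an assumption for illustration, and dendrogram draws the tree of merges that the animations illustrate.

import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=30, centers=3, random_state=0)  # toy data

# Agglomerative clustering: each point starts as its own cluster,
# and Ward linkage repeatedly merges the pair that least increases variance
Z = linkage(X, method='ward')

dendrogram(Z)  # cutting the tree at a chosen height yields the clusters
plt.show()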
