Machine Learning Basics
1. Datasets
2. Feature engineering
3. Types of learning
4. Models
5. Loss functions
6. Optimization
7. Overfitting
8. Underfitting
Datasets
A dataset (also called a data set or data collection) is a collection made up of data.
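Feature engineering
1. Feature requirements
2. Feature design
3. Feature processing (feature preprocessing, feature selection, feature dimensionality reduction)
4. Feature validation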
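Feature preprocessing
Feature preprocessing covers:
1. Dimensionless scaling (removing units)
2. Information extraction
3. Converting information to numeric data
4. Missing-value imputation
5. Balancing information utilization
Dimensionless scaling
1. Standardization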
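import numpy as np
from sklearn.preprocessing import StandardScaler

# Two features on very different scales, stacked as the columns of a 7x2 matrix.
x = np.arange(7).reshape(7, 1)
y = np.array([2, 10, 35, 100, 45, 20, 5]).reshape(7, 1)
x_data = np.hstack((x, y))
print(x_data)

# Manual z-score. Without axis=0, np.mean and np.std are taken over the whole
# matrix, so both columns share one mean and one standard deviation.
xx = (x_data - np.mean(x_data)) / np.std(x_data)
print(xx)

# StandardScaler standardizes each column separately (mean 0, std 1 per feature),
# which is why its result differs from the manual one.
scaler = StandardScaler()
xx = scaler.fit_transform(x_data)
print(xx)
"""
When to use standardization: the processed data end up on a common scale and still follow a normal distribution.
1. The features have different scales or units.
2. The data are (approximately) normally distributed.
"""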
Output:
[[ 0 2]
[ 1 10]
[ 2 35]
[ 3 100]
[ 4 45]
[ 5 20]
[ 6 5]]
[[-0.64175426 -0.56625376]
[-0.60400401 -0.26425176]
[-0.56625376 0.67950451]
[-0.52850351 3.13327081]
[-0.49075326 1.05700702]
[-0.45300301 0.11325075]
[-0.41525276 -0.45300301]]
[[-1.5 -0.91367316]
[-1. -0.66162539]
[-0.5 0.12602388]
[ 0. 2.173912 ]
[ 0.5 0.44108359]
[ 1. -0.34656568]
[ 1.5 -0.81915524]]
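The manual formula in the script calls np.mean and np.std without an axis, so it standardizes against the whole matrix. A minimal sketch of the per-column version follows, assuming the same 7x2 data. The names col_mean, col_std and manual are introduced here for illustration only.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

x_data = np.array([[0, 2], [1, 10], [2, 35], [3, 100],
                   [4, 45], [5, 20], [6, 5]], dtype=float)

# axis=0 gives one mean and one standard deviation per column (feature).
col_mean = x_data.mean(axis=0)
col_std = x_data.std(axis=0)  # population std (ddof=0), as StandardScaler uses
manual = (x_data - col_mean) / col_std

sklearn_result = StandardScaler().fit_transform(x_data)
print(np.allclose(manual, sklearn_result))  # True
```

With axis=0 the manual result agrees with StandardScaler, and each column ends up with mean 0 and standard deviation 1.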
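2. Normalization
import numpy as np
from sklearn.preprocessing import Normalizer

x = np.arange(7).reshape(7, 1)
y = np.array([2, 10, 35, 60, 100, 200, 250]).reshape(7, 1)
x_data = np.hstack((x, y))

# Mean normalization over the whole matrix: subtract the global mean and divide
# by the global range. The result lies in [-1, 1], not [0, 1].
xx = (x_data - np.mean(x_data)) / (np.max(x_data) - np.min(x_data))
print(x_data)
print(xx)

# Note: Normalizer rescales each row (sample) to unit L2 norm. It is not a
# min-max scaler; sklearn.preprocessing.MinMaxScaler maps each column to [0, 1].
normalizer = Normalizer()
xx = normalizer.fit_transform(x_data)
print(xx)
"""
When to use normalization: the processed data end up on the same order of magnitude, scaled into [0, 1].
1. The features have different scales or units.
2. The data are not normally distributed and vary roughly linearly.
"""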
Output:
[[ 0 2]
[ 1 10]
[ 2 35]
[ 3 60]
[ 4 100]
[ 5 200]
[ 6 250]]
[[-0.19371429 -0.18571429]
[-0.18971429 -0.15371429]
[-0.18571429 -0.05371429]
[-0.18171429 0.04628571]
[-0.17771429 0.20628571]
[-0.17371429 0.60628571]
[-0.16971429 0.80628571]]
[[0. 1. ]
[0.09950372 0.99503719]
[0.05704979 0.99837133]
[0.04993762 0.99875234]
[0.03996804 0.99920096]
[0.02499219 0.99968765]
[0.02399309 0.99971212]]
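The notes describe scaling each feature into [0, 1], which is min-max scaling. A minimal sketch with the same data follows, using MinMaxScaler from scikit-learn next to the equivalent per-column formula. The names col_min, col_max, manual and scaled are illustrative.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

x_data = np.array([[0, 2], [1, 10], [2, 35], [3, 60],
                   [4, 100], [5, 200], [6, 250]], dtype=float)

# Per-column min-max formula: (value - column min) / (column max - column min).
col_min = x_data.min(axis=0)
col_max = x_data.max(axis=0)
manual = (x_data - col_min) / (col_max - col_min)

# MinMaxScaler applies the same per-column mapping (default range is [0, 1]).
scaled = MinMaxScaler().fit_transform(x_data)
print(scaled)
print(np.allclose(manual, scaled))  # True
```

Each column's smallest value maps to 0 and its largest to 1, independently of the other column.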
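Converting information to numeric data
1. Feature binarization
import numpy as np
from sklearn.preprocessing import Binarizer

x = np.array([20, 35, 40, 75, 60, 55, 50]).reshape(-1, 1)
# Values strictly greater than the threshold become 1; all others (including 50) become 0.
scaler = Binarizer(threshold=50)
xx = scaler.fit_transform(x)
print(xx)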
Output:
[[0]
[0]
[0]
[1]
[1]
[1]
[0]]
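Binarization is a single comparison against the threshold. The sketch below reproduces it with plain NumPy and checks it against Binarizer. The names threshold, manual and sk are illustrative.

```python
import numpy as np
from sklearn.preprocessing import Binarizer

x = np.array([20, 35, 40, 75, 60, 55, 50]).reshape(-1, 1)
threshold = 50

# Strict comparison: a value equal to the threshold maps to 0.
manual = (x > threshold).astype(int)
sk = Binarizer(threshold=threshold).fit_transform(x)

print(manual.ravel())               # [0 0 0 1 1 1 0]
print(np.array_equal(manual, sk))   # True
```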
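2. One-hot encoding
"""
One-hot encoding represents each category as a vector with a single 1 and 0 everywhere else.
Every encoded category lies at the same distance from the origin, so all categories are treated equally and no artificial order is implied.
"""
import numpy as np
from sklearn.preprocessing import OneHotEncoder

y = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9]).reshape(-1, 1)
# The dense-output keyword was renamed from `sparse` to `sparse_output` in
# scikit-learn 1.2, so convert the default sparse result explicitly instead.
scaler = OneHotEncoder()
yy = scaler.fit_transform(y).toarray()
print(yy)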
Output:
[[1. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[0. 1. 0. 0. 0. 0. 0. 0. 0. 0.]
[0. 0. 1. 0. 0. 0. 0. 0. 0. 0.]
[0. 0. 0. 1. 0. 0. 0. 0. 0. 0.]
[0. 0. 0. 0. 1. 0. 0. 0. 0. 0.]
[0. 0. 0. 0. 0. 1. 0. 0. 0. 0.]
[0. 0. 0. 0. 0. 0. 1. 0. 0. 0.]
[0. 0. 0. 0. 0. 0. 0. 1. 0. 0.]
[0. 0. 0. 0. 0. 0. 0. 0. 1. 0.]
[0. 0. 0. 0. 0. 0. 0. 0. 0. 1.]]
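The "same distance from the origin" property can be checked directly. The sketch below builds one-hot vectors with NumPy only, by indexing rows of an identity matrix. The label values and num_classes are made-up illustration data, not taken from the notes.

```python
import numpy as np

labels = np.array([3, 0, 2, 3])
num_classes = 4  # assumed number of categories for this sketch

# Row i of the identity matrix is the one-hot vector for class i.
one_hot = np.eye(num_classes)[labels]
print(one_hot)

# Every encoded vector has length 1, i.e. the same distance from the origin.
norms = np.linalg.norm(one_hot, axis=1)
print(norms)

# argmax recovers the original integer labels.
decoded = one_hot.argmax(axis=1)
print(decoded)
```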
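3. Missing-data imputation
import numpy as np
from sklearn.impute import SimpleImputer

x = np.array([[1, 2, 3, 4],
              [1, np.nan, 5, 6],
              [7, 2, np.nan, 11],
              [np.nan, 25, 25, 16]])
"""
Imputation strategies (the strategy parameter):
1. "mean": fill with the mean of the non-missing values in the same feature column.
2. "median": fill with the median of the non-missing values in the column (the middle value after sorting).
3. "most_frequent": fill with the most frequent value in the column. If several values are tied, the smallest one is used.
4. "constant": fill with fill_value, which defaults to 0 for numeric data.
"""
# Each assignment overwrites the previous one, so only the "constant" result is printed.
xx = SimpleImputer(strategy="mean").fit_transform(x)
xx = SimpleImputer(strategy="median").fit_transform(x)
xx = SimpleImputer(strategy="most_frequent").fit_transform(x)
xx = SimpleImputer(strategy="constant").fit_transform(x)
print(xx)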
Output:
[[ 1. 2. 3. 4.]
[ 1. 0. 5. 6.]
[ 7. 2. 0. 11.]
[ 0. 25. 25. 16.]]
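The script prints only the result of the last strategy. To compare all four side by side, a sketch like the following keeps each result under its strategy name. The results dictionary is an illustrative helper.

```python
import numpy as np
from sklearn.impute import SimpleImputer

x = np.array([[1, 2, 3, 4],
              [1, np.nan, 5, 6],
              [7, 2, np.nan, 11],
              [np.nan, 25, 25, 16]])

# Fit one imputer per strategy and keep every result instead of overwriting it.
results = {}
for strategy in ("mean", "median", "most_frequent", "constant"):
    results[strategy] = SimpleImputer(strategy=strategy).fit_transform(x)
    print(strategy)
    print(results[strategy])

# The "mean" fill values equal the per-column means computed while ignoring NaN.
print(np.nanmean(x, axis=0))
```

In the third column the values 3, 5 and 25 each occur once, so "most_frequent" falls back to the smallest of them.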
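Feature selection
1. Variance threshold method
import numpy as np
from sklearn.feature_selection import VarianceThreshold

x = np.array([
    [78, 23, 12, 34, 98],
    [23, 22, 13, 56, 71],
    [10, 21, 14, 31, 60],
    [5, 29, 26, 30, 40]])
# Print the (population) variance of every feature column.
for i in range(x.shape[1]):
    print("Variance of column {}: {}".format(i, np.var(x[:, i])))
# Keep only the columns whose variance is greater than 100.
feature = VarianceThreshold(threshold=100)
xx = feature.fit_transform(x)
print(xx)
print(feature.variances_)
"""
Variance threshold method: the more spread out the values in a feature column, the larger its variance and the more distinctive the feature.
1. Feature selection reduces the amount of feature data left after preprocessing, which makes training more efficient.
2. With fewer features, the remaining ones stand out more, which can improve model accuracy.
"""
Output:
Variance of column 0: 843.5
Variance of column 1: 9.6875
Variance of column 2: 32.1875
Variance of column 3: 113.1875
Variance of column 4: 438.6875
[[78 34 98]
[23 56 71]
[10 31 60]
[ 5 30 40]]
[843.5 9.6875 32.1875 113.1875 438.6875]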
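The selection step itself is a boolean mask over the column variances. The sketch below builds that mask by hand and compares it with the mask reported by get_support. The names threshold, variances, mask, manual and selector are illustrative.

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

x = np.array([
    [78, 23, 12, 34, 98],
    [23, 22, 13, 56, 71],
    [10, 21, 14, 31, 60],
    [5, 29, 26, 30, 40]])
threshold = 100

# Manual version: keep the columns whose variance exceeds the threshold.
variances = x.var(axis=0)
mask = variances > threshold
manual = x[:, mask]

selector = VarianceThreshold(threshold=threshold).fit(x)
print(selector.get_support())               # boolean mask of the kept columns
print(selector.get_support(indices=True))   # [0 3 4]
print(np.array_equal(manual, selector.transform(x)))  # True
```

get_support is useful in practice because it tells you which original columns survived, which the transformed array alone does not.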
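2. Correlation coefficient method
"""
Correlation coefficient method: measures how strongly each feature is related to the target (the result). The stronger the relationship, the more informative the feature.
"""
import numpy as np
from sklearn.feature_selection import SelectKBest

x = np.array([
    [78, 23, 12, 34, 98],
    [23, 22, 13, 56, 71],
    [10, 21, 14, 31, 60],
    [5, 29, 26, 30, 40]])
y = np.array([1, 1, 1, 0])
# With no score_func given, SelectKBest scores features with f_classif
# (the ANOVA F-test), not with a Pearson correlation coefficient.
k = SelectKBest(k=3)  # keep the 3 highest-scoring features
xx = k.fit_transform(x, y)
print(k.pvalues_)  # p-value of each feature
print(k.scores_)   # F-score of each feature
print(xx)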
Output:
[0.5229015 0.02614832 0.00779739 0.5794261 0.24884702]
[ 0.58940905 36.75 126.75 0.42978638 2.5895855 ]
[[23 12 98]
[22 13 71]
[21 14 60]
[29 26 40]]
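To see an actual correlation coefficient for each feature, one option is to compute the Pearson correlation with np.corrcoef and rank the columns by its absolute value. This is a sketch of that idea, not what SelectKBest does internally; the names r, top3, f_scores and p_values are illustrative. For a two-class target like this one, the F-score is a monotone function of r squared, so both rankings pick the same columns.

```python
import numpy as np
from sklearn.feature_selection import f_classif

x = np.array([
    [78, 23, 12, 34, 98],
    [23, 22, 13, 56, 71],
    [10, 21, 14, 31, 60],
    [5, 29, 26, 30, 40]])
y = np.array([1, 1, 1, 0])

# Pearson correlation between each feature column and the target.
r = np.array([np.corrcoef(x[:, j], y)[0, 1] for j in range(x.shape[1])])
print(r)

# Indices of the 3 features with the largest absolute correlation.
top3 = np.sort(np.argsort(-np.abs(r))[:3])
print(top3)  # [1 2 4]

# F-scores from f_classif, the default score function of SelectKBest.
f_scores, p_values = f_classif(x, y)
print(f_scores)
```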