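Dataset: feature values + target values
Target values are categories: classification problem
Target values are continuous: regression problem
Supervised learning: the dataset has target values
Unsupervised learning: the dataset has no target values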
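As a small illustration of the two supervised cases, the sketch below loads two datasets bundled with scikit-learn and inspects their targets. Using `load_diabetes` as the regression example is an assumption made here for illustration; the notes themselves only use iris.

```python
import numpy as np
from sklearn.datasets import load_iris, load_diabetes

iris = load_iris()
diabetes = load_diabetes()  # assumed regression example, not from the notes

# Classification: the target takes a small number of discrete class labels.
iris_classes = np.unique(iris.target)
print("iris target classes:", iris_classes)

# Regression: the target is a continuous measurement with many distinct values.
diabetes_values = np.unique(diabetes.target)
print("diabetes target dtype:", diabetes.target.dtype)
print("distinct diabetes target values:", len(diabetes_values))
```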
conda install numpy scipy
pip install scikit-learn
Small built-in datasets live in sklearn.datasets. Usage: datasets.load_*(), where * stands for the dataset name.
from sklearn.datasets import load_iris


def datasets_demo():
    # Load the bundled iris dataset (returned as a dict-like Bunch)
    iris = load_iris()
    print("Dataset keys:", list(iris.keys()))
    print("Dataset:", iris['data'])
    print("Dataset description:", iris['DESCR'])
    print("Feature names:", iris['feature_names'])
    print("Data shape:", iris['data'].shape)
    print("Target names:", iris['target_names'])


if __name__ == '__main__':
    datasets_demo()
Dataset keys: ['data', 'target', 'frame', 'target_names', 'DESCR', 'feature_names', 'filename', 'data_module']
Dataset: [[5.1 3.5 1.4 0.2]
[4.9 3. 1.4 0.2]
[4.7 3.2 1.3 0.2]
[4.6 3.1 1.5 0.2]
[5. 3.6 1.4 0.2]
[5.4 3.9 1.7 0.4]
[4.6 3.4 1.4 0.3]
[5. 3.4 1.5 0.2]
[4.4 2.9 1.4 0.2]
[4.9 3.1 1.5 0.1]
[5.4 3.7 1.5 0.2]
[4.8 3.4 1.6 0.2]
[4.8 3. 1.4 0.1]
[4.3 3. 1.1 0.1]
[5.8 4. 1.2 0.2]
[5.7 4.4 1.5 0.4]
[5.4 3.9 1.3 0.4]
[5.1 3.5 1.4 0.3]
[5.7 3.8 1.7 0.3]
[5.1 3.8 1.5 0.3]
[5.4 3.4 1.7 0.2]
[5.1 3.7 1.5 0.4]
[4.6 3.6 1. 0.2]
[5.1 3.3 1.7 0.5]
[4.8 3.4 1.9 0.2]
[5. 3. 1.6 0.2]
[5. 3.4 1.6 0.4]
[5.2 3.5 1.5 0.2]
[5.2 3.4 1.4 0.2]
[4.7 3.2 1.6 0.2]
[4.8 3.1 1.6 0.2]
[5.4 3.4 1.5 0.4]
[5.2 4.1 1.5 0.1]
[5.5 4.2 1.4 0.2]
[4.9 3.1 1.5 0.2]
[5. 3.2 1.2 0.2]
[5.5 3.5 1.3 0.2]
[4.9 3.6 1.4 0.1]
[4.4 3. 1.3 0.2]
[5.1 3.4 1.5 0.2]
[5. 3.5 1.3 0.3]
[4.5 2.3 1.3 0.3]
[4.4 3.2 1.3 0.2]
[5. 3.5 1.6 0.6]
[5.1 3.8 1.9 0.4]
[4.8 3. 1.4 0.3]
[5.1 3.8 1.6 0.2]
[4.6 3.2 1.4 0.2]
[5.3 3.7 1.5 0.2]
[5. 3.3 1.4 0.2]
[7. 3.2 4.7 1.4]
[6.4 3.2 4.5 1.5]
[6.9 3.1 4.9 1.5]
[5.5 2.3 4. 1.3]
[6.5 2.8 4.6 1.5]
[5.7 2.8 4.5 1.3]
[6.3 3.3 4.7 1.6]
[4.9 2.4 3.3 1. ]
[6.6 2.9 4.6 1.3]
[5.2 2.7 3.9 1.4]
[5. 2. 3.5 1. ]
[5.9 3. 4.2 1.5]
[6. 2.2 4. 1. ]
[6.1 2.9 4.7 1.4]
[5.6 2.9 3.6 1.3]
[6.7 3.1 4.4 1.4]
[5.6 3. 4.5 1.5]
[5.8 2.7 4.1 1. ]
[6.2 2.2 4.5 1.5]
[5.6 2.5 3.9 1.1]
[5.9 3.2 4.8 1.8]
[6.1 2.8 4. 1.3]
[6.3 2.5 4.9 1.5]
[6.1 2.8 4.7 1.2]
[6.4 2.9 4.3 1.3]
[6.6 3. 4.4 1.4]
[6.8 2.8 4.8 1.4]
[6.7 3. 5. 1.7]
[6. 2.9 4.5 1.5]
[5.7 2.6 3.5 1. ]
[5.5 2.4 3.8 1.1]
[5.5 2.4 3.7 1. ]
[5.8 2.7 3.9 1.2]
[6. 2.7 5.1 1.6]
[5.4 3. 4.5 1.5]
[6. 3.4 4.5 1.6]
[6.7 3.1 4.7 1.5]
[6.3 2.3 4.4 1.3]
[5.6 3. 4.1 1.3]
[5.5 2.5 4. 1.3]
[5.5 2.6 4.4 1.2]
[6.1 3. 4.6 1.4]
[5.8 2.6 4. 1.2]
[5. 2.3 3.3 1. ]
[5.6 2.7 4.2 1.3]
[5.7 3. 4.2 1.2]
[5.7 2.9 4.2 1.3]
[6.2 2.9 4.3 1.3]
[5.1 2.5 3. 1.1]
[5.7 2.8 4.1 1.3]
[6.3 3.3 6. 2.5]
[5.8 2.7 5.1 1.9]
[7.1 3. 5.9 2.1]
[6.3 2.9 5.6 1.8]
[6.5 3. 5.8 2.2]
[7.6 3. 6.6 2.1]
[4.9 2.5 4.5 1.7]
[7.3 2.9 6.3 1.8]
[6.7 2.5 5.8 1.8]
[7.2 3.6 6.1 2.5]
[6.5 3.2 5.1 2. ]
[6.4 2.7 5.3 1.9]
[6.8 3. 5.5 2.1]
[5.7 2.5 5. 2. ]
[5.8 2.8 5.1 2.4]
[6.4 3.2 5.3 2.3]
[6.5 3. 5.5 1.8]
[7.7 3.8 6.7 2.2]
[7.7 2.6 6.9 2.3]
[6. 2.2 5. 1.5]
[6.9 3.2 5.7 2.3]
[5.6 2.8 4.9 2. ]
[7.7 2.8 6.7 2. ]
[6.3 2.7 4.9 1.8]
[6.7 3.3 5.7 2.1]
[7.2 3.2 6. 1.8]
[6.2 2.8 4.8 1.8]
[6.1 3. 4.9 1.8]
[6.4 2.8 5.6 2.1]
[7.2 3. 5.8 1.6]
[7.4 2.8 6.1 1.9]
[7.9 3.8 6.4 2. ]
[6.4 2.8 5.6 2.2]
[6.3 2.8 5.1 1.5]
[6.1 2.6 5.6 1.4]
[7.7 3. 6.1 2.3]
[6.3 3.4 5.6 2.4]
[6.4 3.1 5.5 1.8]
[6. 3. 4.8 1.8]
[6.9 3.1 5.4 2.1]
[6.7 3.1 5.6 2.4]
[6.9 3.1 5.1 2.3]
[5.8 2.7 5.1 1.9]
[6.8 3.2 5.9 2.3]
[6.7 3.3 5.7 2.5]
[6.7 3. 5.2 2.3]
[6.3 2.5 5. 1.9]
[6.5 3. 5.2 2. ]
[6.2 3.4 5.4 2.3]
[5.9 3. 5.1 1.8]]
Iris plants dataset
**Data Set Characteristics:**
:Number of Instances: 150 (50 in each of three classes)
:Number of Attributes: 4 numeric, predictive attributes and the class
:Attribute Information:
- sepal length in cm
- sepal width in cm
- petal length in cm
- petal width in cm
- class:
- Iris-Setosa
- Iris-Versicolour
- Iris-Virginica
:Summary Statistics:
============== ==== ==== ======= ===== ====================
Min Max Mean SD Class Correlation
============== ==== ==== ======= ===== ====================
sepal length: 4.3 7.9 5.84 0.83 0.7826
sepal width: 2.0 4.4 3.05 0.43 -0.4194
petal length: 1.0 6.9 3.76 1.76 0.9490 (high!)
petal width: 0.1 2.5 1.20 0.76 0.9565 (high!)
============== ==== ==== ======= ===== ====================
:Missing Attribute Values: None
:Class Distribution: 33.3% for each of 3 classes.
:Creator: R.A. Fisher
:Donor: Michael Marshall (MARSHALL%PLU@io.arc.nasa.gov)
:Date: July, 1988
The famous Iris database, first used by Sir R.A. Fisher. The dataset is taken
from Fisher's paper. Note that it's the same as in R, but not as in the UCI
Machine Learning Repository, which has two wrong data points.
This is perhaps the best known database to be found in the
pattern recognition literature. Fisher's paper is a classic in the field and
is referenced frequently to this day. (See Duda & Hart, for example.) The
data set contains 3 classes of 50 instances each, where each class refers to a
type of iris plant. One class is linearly separable from the other 2; the
latter are NOT linearly separable from each other.
.. topic:: References
Feature names: ['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']
Data shape: (150, 4)
Target names: ['setosa' 'versicolor' 'virginica']
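When only the feature matrix and the target vector are needed, the loader can return them directly. A minimal sketch, assuming the `return_X_y` option of `load_iris` is acceptable in place of the dictionary-style access used above:

```python
from sklearn.datasets import load_iris

# return_X_y=True skips the Bunch and returns (features, target) directly.
X, y = load_iris(return_X_y=True)
print(X.shape, y.shape)
```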
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split


def split_datasets():
    iris = load_iris()
    # Returns: training features, test features, training targets, test targets
    x_train, x_test, y_train, y_test = train_test_split(iris['data'], iris['target'], test_size=0.2)
    print("Training set features:", x_train, x_train.shape)  # (120, 4)


if __name__ == '__main__':
    split_datasets()
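The split is random, so two runs generally produce different subsets. The sketch below shows one way to make it reproducible and to keep the class balance in both subsets. The seed value 22 and the use of `stratify` are assumptions added here, not part of the original notes.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# random_state fixes the shuffle so the split is reproducible (22 is arbitrary);
# stratify=y keeps the class proportions equal in both subsets.
x_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=22, stratify=y
)

print(x_train.shape, x_test.shape)
print("test-set class counts:", np.bincount(y_test))
```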
from sklearn.feature_extraction import DictVectorizer


def dict_extraction():
    data = [{'city': '北京', 'temperature': 36}, {'city': '上海', 'temperature': 37}, {'city': '深圳', 'temperature': 35}]
    # Instantiate a transformer; sparse=True (the default) returns a sparse matrix
    # transfer = DictVectorizer(sparse=True)
    transfer = DictVectorizer(sparse=False)
    data_new = transfer.fit_transform(data)
    print(data_new)


if __name__ == '__main__':
    dict_extraction()
(0, 1) 1.0
(0, 3) 36.0
(1, 0) 1.0
(1, 3) 37.0
(2, 2) 1.0
(2, 3) 35.0
[[ 0. 1. 0. 36.]
[ 1. 0. 0. 37.]
[ 0. 0. 1. 35.]]
Any feature that carries categorical information is converted to a one-hot encoding.
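The same one-hot idea can be sketched on a plain categorical column with `OneHotEncoder`. The English city names and the repeated first row are hypothetical data chosen for illustration.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical single categorical column (one row per sample).
cities = np.array([["Beijing"], ["Shanghai"], ["Shenzhen"], ["Beijing"]])

encoder = OneHotEncoder()  # returns a sparse matrix by default
encoded = encoder.fit_transform(cities).toarray()

# One output column per category; each row has exactly one 1.
print(encoder.categories_[0])
print(encoded)
```

Each category gets its own column, so no artificial ordering is imposed on the cities the way an integer code would.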
from sklearn.feature_extraction.text import CountVectorizer


def text_extraction():
    data = ["life is too short,i like python", "python is very efficient for machine learning "]
    transfer = CountVectorizer()
    data_new = transfer.fit_transform(data)
    name = transfer.get_feature_names_out()
    print(name)
    # Convert the sparse matrix to a dense two-dimensional array
    print(data_new.toarray())


if __name__ == '__main__':
    text_extraction()
['efficient' 'for' 'is' 'learning' 'life' 'like' 'machine' 'python'
'short' 'too' 'very']
[[0 0 1 0 1 1 0 1 1 1 0]
[1 1 1 1 0 0 1 1 0 0 1]]
pip install jieba
from sklearn.feature_extraction.text import CountVectorizer
import jieba


def chinese_text_extraction():
    datas = ["生活苦短", "我爱爬虫"]
    # Segment each sentence with jieba and join the tokens with spaces
    new_datas = [" ".join(list(jieba.cut(data))) for data in datas]
    transfer = CountVectorizer()
    data_final = transfer.fit_transform(new_datas)
    print(transfer.get_feature_names_out())
    print(data_final.toarray())


if __name__ == '__main__':
    chinese_text_extraction()
['爬虫' '生活' '苦短']
[[0 1 1]
[1 0 0]]
from sklearn.feature_extraction.text import TfidfVectorizer
import jieba


def tfidf_text():
    datas = ["中国的“动力电池白名单”政策被认为是推动中国新能源车及动力电池产业高速发展的重要政策手段", "中国“白名单”政策从2015年5月实施,到2019年6月废止,共执行四年时间"]
    # Segment each sentence with jieba and join the tokens with spaces
    new_datas = [" ".join(list(jieba.cut(data))) for data in datas]
    # Exclude the listed stop words from the vocabulary
    transfer = TfidfVectorizer(stop_words=['认为', '重要'])
    data_final = transfer.fit_transform(new_datas)
    print(transfer.get_feature_names_out())
    print(data_final.toarray())


if __name__ == '__main__':
    tfidf_text()
['2015' '2019' '中国' '产业' '动力电池' '发展' '四年' '实施' '废止' '手段' '执行' '推动' '政策'
'新能源' '时间' '白名单' '车及' '高速']
[[0. 0. 0.3607931 0.25354106 0.50708212 0.25354106
0. 0. 0. 0.25354106 0. 0.25354106
0.3607931 0.25354106 0. 0.18039655 0.25354106 0.25354106]
[0.34261985 0.34261985 0.24377685 0. 0. 0.
0.34261985 0.34261985 0.34261985 0. 0.34261985 0.
0.24377685 0. 0.34261985 0.24377685 0. 0. ]]
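The weights above follow from the default settings of `TfidfVectorizer`: raw term counts, a smoothed inverse document frequency idf = ln((1 + n) / (1 + df)) + 1, and L2 normalisation of each row. The sketch below recomputes them by hand on two short English documents, which are hypothetical and chosen only to keep the example self-contained.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Hypothetical toy corpus.
docs = ["apple banana apple", "banana cherry"]

# Term frequency: raw counts per document.
counts = CountVectorizer().fit_transform(docs).toarray().astype(float)

# Smoothed inverse document frequency (the library default, smooth_idf=True).
n_docs = counts.shape[0]
df = (counts > 0).sum(axis=0)
idf = np.log((1 + n_docs) / (1 + df)) + 1

# Weight the counts, then scale each row to unit Euclidean length.
weighted = counts * idf
manual = weighted / np.linalg.norm(weighted, axis=1, keepdims=True)

reference = TfidfVectorizer().fit_transform(docs).toarray()
print(manual)
```

Words that occur in every document get the minimum idf of 1, so rarer words such as "cherry" receive relatively more weight.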
from sklearn.preprocessing import MinMaxScaler
import pandas as pd


def normalization():
    # 1. Load the data
    filepath = 'dating.txt'
    datas = pd.read_csv(filepath)
    # Keep all rows and the first three columns (indices 0 to 2; the stop index 3 is excluded)
    datas = datas.iloc[:, 0:3]
    # 2. Instantiate a transformer; feature_range defaults to (0, 1)
    transfer = MinMaxScaler()
    # 3. Call fit_transform
    data_new = transfer.fit_transform(datas)
    print(data_new)
[[0.44832535 0.39805139 0.56233353]
[0.15873259 0.34195467 0.98724416]
[0.28542943 0.06892523 0.47449629]
[0.29115949 0.50910294 0.51079493]
[0.52711097 0.43665451 0.4290048 ]
[0.47940793 0.3768091 0.78571804]]
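Min-max normalisation maps each column to [0, 1] with x' = (x - min) / (max - min). The sketch below checks that formula against `MinMaxScaler` on a small hypothetical matrix, since dating.txt itself is not reproduced here.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical feature matrix standing in for the three selected columns.
X = np.array([[40920.0, 8.3, 0.95],
              [14488.0, 7.2, 1.67],
              [26052.0, 1.4, 0.81],
              [75136.0, 13.1, 0.43]])

# x' = (x - min) / (max - min), computed column by column.
col_min, col_max = X.min(axis=0), X.max(axis=0)
manual = (X - col_min) / (col_max - col_min)

reference = MinMaxScaler().fit_transform(X)
print(manual)
```

Because the result depends only on the column minimum and maximum, a single extreme value can compress all the other values into a narrow range.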
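from sklearn.preprocessing import StandardScaler
import pandas as pd


def standardization():
    # 1. Load the data
    filepath = 'dating.txt'
    datas = pd.read_csv(filepath)
    # Keep all rows and the first three columns (indices 0 to 2; the stop index 3 is excluded)
    datas = datas.iloc[:, 0:3]
    # 2. Instantiate a transformer (StandardScaler has no feature_range;
    #    it rescales each column to zero mean and unit variance)
    transfer = StandardScaler()
    # 3. Call fit_transform
    data_new = transfer.fit_transform(datas)
    print(data_new)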
[[ 0.33193158 0.41660188 0.24523407]
[-0.87247784 0.13992897 1.69385734]
[-0.34554872 -1.20667094 -0.05422437]
[-0.32171752 0.96431572 0.06952649]
[ 0.65959911 0.60699509 -0.20931587]
[ 0.46120328 0.31183342 1.00680598]]
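Standardisation rescales each column with x' = (x - mean) / std, where std is the population standard deviation. The sketch below verifies this against `StandardScaler` on a small hypothetical matrix, since dating.txt is not reproduced here.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical feature matrix standing in for the three selected columns.
X = np.array([[40920.0, 8.3, 0.95],
              [14488.0, 7.2, 1.67],
              [26052.0, 1.4, 0.81],
              [75136.0, 13.1, 0.43]])

# x' = (x - mean) / std, column by column; np.std defaults to ddof=0 (population std).
manual = (X - X.mean(axis=0)) / X.std(axis=0)

reference = StandardScaler().fit_transform(X)
print(manual)
```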
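from sklearn.feature_selection import VarianceThreshold
import pandas as pd


def low_variance():
    # 1. Load factor_returns.csv
    filepath = 'factor_returns.csv'
    datas = pd.read_csv(filepath)
    # Drop the first column and the last two columns
    datas = datas.iloc[:, 1:-2]
    print(datas.shape)
    # 2. Instantiate a transformer. The default threshold is 0, which removes only
    #    columns whose values are all identical; here columns whose variance is not above 5 are removed
    transform = VarianceThreshold(threshold=5)
    # 3. Call fit_transform
    data_new = transform.fit_transform(datas)
    print(data_new.shape)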
(2318, 9)
(2318, 8)
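To see which columns a variance filter removes, the fitted selector exposes a boolean mask through `get_support()`. The sketch below uses a small hypothetical matrix with two constant columns and the default threshold of 0.

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

# Hypothetical data: columns 0 and 3 are constant, columns 1 and 2 vary.
X = np.array([[0, 2, 0, 3],
              [0, 1, 4, 3],
              [0, 1, 1, 3]])

selector = VarianceThreshold()  # default threshold=0.0
X_kept = selector.fit_transform(X)

print(np.var(X, axis=0))       # per-column variance
print(selector.get_support())  # True for the columns that are kept
print(X_kept.shape)
```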
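The Pearson correlation coefficient is a statistic that measures how closely two variables are linearly related.
The coefficient r lies between -1 and +1, i.e. -1 ≤ r ≤ 1: r > 0 indicates a positive correlation, r < 0 a negative one, and the closer |r| is to 1 the stronger the linear relationship.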
conda install scipy
from scipy.stats import pearsonr
import pandas as pd


def correlation():
    filepath = 'factor_returns.csv'
    datas = pd.read_csv(filepath)
    datas = datas.iloc[:, 1:-2]
    print(datas)
    # Compute the correlation coefficient between pe_ratio and pb_ratio
    corr = pearsonr(datas['pe_ratio'], datas['pb_ratio'])
    print(corr)
pe_ratio pb_ratio ... revenue total_expense
0 5.9572 1.1818 ... 2.070140e+10 1.088254e+10
1 7.0289 1.5880 ... 2.930837e+10 2.378348e+10
2 -262.7461 7.0003 ... 1.167983e+07 1.203008e+07
3 16.4760 3.7146 ... 9.189387e+09 7.935543e+09
4 12.5878 2.5616 ... 8.951453e+09 7.091398e+09
... ... ... ... ... ...
2313 25.0848 4.2323 ... 1.148170e+10 1.041419e+10
2314 59.4849 1.6392 ... 1.731713e+09 1.089783e+09
2315 39.5523 4.0052 ... 1.789082e+10 1.749295e+10
2316 52.5408 2.4646 ... 6.465392e+09 6.009007e+09
2317 14.2203 1.4103 ... 4.509872e+10 4.132842e+10
[2318 rows x 9 columns]
(-0.004389322779936285, 0.8327205496564927)
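The first value returned by `pearsonr` is the coefficient r and the second is the p-value. A minimal sketch of how r follows from its definition, using synthetic data because factor_returns.csv is not reproduced here (the arrays, seed and noise level are all assumptions):

```python
import numpy as np
from scipy.stats import pearsonr

# Hypothetical data: y is roughly linear in x with some noise.
rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = 2.0 * x + rng.normal(scale=0.5, size=200)

# r = sum((x - mean_x)(y - mean_y)) / sqrt(sum((x - mean_x)^2) * sum((y - mean_y)^2))
dx, dy = x - x.mean(), y - y.mean()
r_manual = (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

# Library result: the statistic and its p-value.
r_scipy, p_value = pearsonr(x, y)
print(r_manual, r_scipy, p_value)
```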
A scatter plot shows at a glance how strongly the features revenue and total_expense are correlated.
import matplotlib.pyplot as plt
import pandas as pd
from scipy.stats import pearsonr


def scatter_diagram():
    filepath = 'factor_returns.csv'
    datas = pd.read_csv(filepath)
    datas = datas.iloc[:, 1:-2]
    plt.figure(figsize=(20, 8), dpi=100)
    plt.title('the correlation of revenue and total_expense')
    plt.scatter(x=datas['revenue'], y=datas['total_expense'])
    plt.show()
    corr2 = pearsonr(datas['revenue'], datas['total_expense'])
    print(corr2)
    # (0.99584504131361, 0.0)
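Checking feature pairs one at a time does not scale to many columns. One possible approach, sketched below, is to compute the whole correlation matrix with pandas and list the pairs above a cut-off. The column names, the synthetic data and the 0.9 cut-off are all hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical features: b is almost a copy of a, c is independent noise.
rng = np.random.default_rng(1)
a = rng.normal(size=300)
frame = pd.DataFrame({
    "a": a,
    "b": a + rng.normal(scale=0.05, size=300),
    "c": rng.normal(size=300),
})

corr_matrix = frame.corr()  # pairwise Pearson coefficients

# List each pair once (upper triangle) whose |r| exceeds an assumed cut-off of 0.9.
threshold = 0.9
columns = list(corr_matrix.columns)
high_pairs = [
    (columns[i], columns[j])
    for i in range(len(columns))
    for j in range(i + 1, len(columns))
    if abs(corr_matrix.iloc[i, j]) > threshold
]
print(high_pairs)
```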
from sklearn.decomposition import PCA
import numpy as np


def pca():
    datas = np.array([[1, 3, 8, 9], [5, 7, 4, 9], [5, 6, 9, 2]])
    # Reduce the four features to three components
    transfer = PCA(n_components=3)
    data_new = transfer.fit_transform(datas)
    print(data_new)
[[-2.64575131e+00 -3.46410162e+00 1.79236038e-16]
[-2.64575131e+00 3.46410162e+00 1.79236038e-16]
[ 5.29150262e+00 1.63168795e-15 1.79236038e-16]]
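The third column of the output is effectively zero: three samples, once centred, span at most two dimensions, so the last component carries no variance. The sketch below inspects `explained_variance_ratio_` and shows the alternative of passing a float to `n_components` to keep a share of the variance. The 0.95 value is an assumption chosen for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA

datas = np.array([[1, 3, 8, 9], [5, 7, 4, 9], [5, 6, 9, 2]])

# Share of the total variance captured by each of the three components.
full = PCA(n_components=3).fit(datas)
print(full.explained_variance_ratio_)

# A float in (0, 1) keeps just enough components to retain that share of variance.
reduced = PCA(n_components=0.95).fit_transform(datas)
print(reduced.shape)
```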