In short, the machine learning workflow is: analyze data --> derive a model --> use the model to analyze new data
scikit-learn
from sklearn.datasets import load_iris
# Load the iris dataset
.. _iris_dataset:
Iris plants dataset
**Data Set Characteristics:**
:Number of Instances: 150 (50 in each of three classes)
:Number of Attributes: 4 numeric, predictive attributes and the class
:Attribute Information:
- sepal length in cm
- sepal width in cm
- petal length in cm
- petal width in cm
- class:
- Iris-Setosa
- Iris-Versicolour
- Iris-Virginica
:Summary Statistics:
============== ==== ==== ======= ===== ====================
                Min  Max   Mean    SD   Class Correlation
============== ==== ==== ======= ===== ====================
sepal length:   4.3  7.9   5.84   0.83    0.7826
sepal width:    2.0  4.4   3.05   0.43   -0.4194
petal length:   1.0  6.9   3.76   1.76    0.9490  (high!)
petal width:    0.1  2.5   1.20   0.76    0.9565  (high!)
============== ==== ==== ======= ===== ====================
:Missing Attribute Values: None
:Class Distribution: 33.3% for each of 3 classes.
:Creator: R.A. Fisher
:Donor: Michael Marshall (MARSHALL%PLU@io.arc.nasa.gov)
:Date: July, 1988
The famous Iris database, first used by Sir R.A. Fisher. The dataset is taken
from Fisher's paper. Note that it's the same as in R, but not as in the UCI
Machine Learning Repository, which has two wrong data points.
This is perhaps the best known database to be found in the
pattern recognition literature. Fisher's paper is a classic in the field and
is referenced frequently to this day. (See Duda & Hart, for example.) The
data set contains 3 classes of 50 instances each, where each class refers to a
type of iris plant. One class is linearly separable from the other 2; the
latter are NOT linearly separable from each other.
.. topic:: References
- Fisher, R.A. "The use of multiple measurements in taxonomic problems"
Annual Eugenics, 7, Part II, 179-188 (1936); also in "Contributions to
Mathematical Statistics" (John Wiley, NY, 1950).
- Duda, R.O., & Hart, P.E. (1973) Pattern Classification and Scene Analysis.
(Q327.D83) John Wiley & Sons. ISBN 0-471-22361-1. See page 218.
- Dasarathy, B.V. (1980) "Nosing Around the Neighborhood: A New System
Structure and Classification Rule for Recognition in Partially Exposed
Environments". IEEE Transactions on Pattern Analysis and Machine
Intelligence, Vol. PAMI-2, No. 1, 67-71.
- Gates, G.W. (1972) "The Reduced Nearest Neighbor Rule". IEEE Transactions
on Information Theory, May 1972, 431-433.
- See also: 1988 MLC Proceedings, 54-64. Cheeseman et al.'s AUTOCLASS II
conceptual clustering system finds 3 classes in the data.
- Many, many more ...
from sklearn.model_selection import train_test_split
[[5.8 2.7 5.1 1.9]
[6.9 3.1 4.9 1.5]
[7.7 3.8 6.7 2.2]
[6.4 3.2 4.5 1.5]
[5.1 3.3 1.7 0.5]]
[2 2 0 2 2 2 0 0 0 1 1 1 0 0 2 2 0 0 2 0 2 2 1 2 1 2 2 2 1 2]
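The two printed arrays above look like the first training samples and the test labels after a split. The call that produced them is not shown, so the following is only a minimal sketch; `test_size=0.2` is inferred from the 30 test labels, and `random_state=22` is an arbitrary assumption.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris = load_iris()

# test_size and random_state are assumed values, not taken from the notes
x_train, x_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=22
)

print(x_train[:5])                  # first five training samples, four features each
print(y_test)                       # class labels of the held-out samples
print(x_train.shape, x_test.shape)  # 120 training rows, 30 test rows
```

A different `random_state` gives a different but equally valid split, so the printed rows will generally not match the ones above.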
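1. Dictionary feature extraction (in effect, encoding categorical values as one-hot vectors)
One-hot encoding converts a categorical variable into a form that machine learning algorithms can use easily, so that no category is implicitly ranked above another.
What does a sparse matrix do? It stores only the non-zero values together with their positions, which saves memory and speeds up loading. Pass sparse=False to get a dense array instead of a sparse matrix, or call toarray() on the returned sparse matrix to convert it to an ordinary array.
Use cases: (1) the dataset contains categorical variables: convert it to a list (or iterator) of dicts and transform it with DictVectorizer; (2) the dataset already consists of dicts.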
from sklearn import feature_extraction
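# Instantiate the transformer class
# Call fit_transform to vectorize the iterable of dicts
# Note: get_feature_names() is deprecated (and was removed in scikit-learn 1.2); use get_feature_names_out() instead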
[[ 0. 1. 0. 100.]
[ 0. 0. 1. 200.]
[ 1. 0. 0. 300.]]
['city=dalian' 'city=mianyang' 'city=shanghai' 'num']
from sklearn.feature_extraction import text
data = ['i love china too', 'he loves china too ha ha ']
# Instantiate the transformer class
transfer = text.CountVectorizer()  # this constructor has no sparse parameter
# Call fit_transform to do the conversion
[[1 0 0 1 0 1]
[1 2 1 0 1 1]]
['china' 'ha' 'he' 'love' 'loves' 'too']
import jieba
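def words_cut(text):
    # jieba.cut(text) returns a generator; str.join accepts any iterable, so no list() conversion is needed
    # Note: the tokens are joined with single spaces
    return ' '.join(jieba.cut(text))
# Segment the Chinese sentences into words
new_data = [words_cut(text) for text in data]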
['我 爱 中国 , 中国 是 一个 美丽 的 国家', '我 来自 四川 , 四川 有 大熊猫 , 他们 很 可爱']
# Instantiate the transformer object
# Call fit_transform to do the conversion
[[1 2 0 0 0 1 0 0 1]
[0 0 1 1 2 0 1 1 0]]
['一个' '中国' '他们' '可爱' '四川' '国家' '大熊猫' '来自' '美丽']
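# Instantiate the transformer object
# Call fit_transform to do the conversion
# Textbook TF-IDF of the word '一个' in the first string:
# tf = 1/10 = 0.1, idf = lg(2/1) ≈ 0.301, tfidf = 0.1 * 0.301 ≈ 0.0301
# (TfidfVectorizer instead uses raw counts, a smoothed natural-log idf and L2 row normalization, so the printed value is 0.378 rather than 0.0301)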
[[0.37796447 0.75592895 0. 0. 0. 0.37796447
0. 0. 0.37796447]
[0. 0. 0.35355339 0.35355339 0.70710678 0.
0.35355339 0.35355339 0. ]]
['一个' '中国' '他们' '可爱' '四川' '国家' '大熊猫' '来自' '美丽']
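The printed values can be reproduced by hand. With its default settings, `TfidfVectorizer` multiplies the raw term count by `idf = ln((1 + n) / (1 + df)) + 1` and then scales each row to unit L2 length. The sketch below checks the first row; `manual` is a name introduced for illustration.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    '我 爱 中国 , 中国 是 一个 美丽 的 国家',
    '我 来自 四川 , 四川 有 大熊猫 , 他们 很 可爱',
]
transfer = TfidfVectorizer()
tfidf = transfer.fit_transform(docs).toarray()

# Row 0 by hand: raw counts times idf, then L2 normalization
counts = np.array([1, 2, 0, 0, 0, 1, 0, 0, 1], dtype=float)
n_docs, df = 2, 1                           # every vocabulary word occurs in exactly one document
idf = np.log((1 + n_docs) / (1 + df)) + 1   # smoothed idf, identical for all words here
manual = counts * idf
manual /= np.linalg.norm(manual)

print(tfidf[0])
print(manual)
```

Because every word has the same idf in this tiny corpus, the idf cancels during normalization and the first entry reduces to 1/sqrt(7) ≈ 0.378.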
import sklearn.preprocessing
# "scaler" means a tool that rescales values
# Instantiate the transformer class
transfer = sklearn.preprocessing.MinMaxScaler(feature_range=(0, 1))  # rescale each feature to the [0, 1] range
[[1. 1. 0. 0. 0. 1. 0. 0. 1.]
[0. 0. 1. 1. 1. 0. 1. 1. 0.]]
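The printed result matches min-max scaling applied to the word-count matrix from the Chinese text example. That input is an inference from the output, so the following is a sketch under that assumption, with the per-column formula written out for comparison.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Assumed input: the word-count matrix printed earlier in the notes
counts = np.array([[1, 2, 0, 0, 0, 1, 0, 0, 1],
                   [0, 0, 1, 1, 2, 0, 1, 1, 0]])

transfer = MinMaxScaler(feature_range=(0, 1))
scaled = transfer.fit_transform(counts)

# Same thing by hand: (x - min) / (max - min), computed per column
col_min = counts.min(axis=0)
col_max = counts.max(axis=0)
manual = (counts - col_min) / (col_max - col_min)

print(scaled)
```

Note that `MinMaxScaler` does not accept SciPy sparse input, so a sparse count matrix has to be converted with `toarray()` first.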
import sklearn.preprocessing
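# "scaler" means a tool that rescales values
# Instantiate the transformer class
transfer = sklearn.preprocessing.StandardScaler()  # standardize each feature to zero mean and unit variance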
[[ 1. 1. -1. -1. -1. 1. -1. -1. 1.]
[-1. -1. 1. 1. 1. -1. 1. 1. -1.]]
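Standardization subtracts the column mean and divides by the column standard deviation. The sketch below assumes the same word-count matrix as input (inferred from the output, not shown in the notes) and compares the transformer with the formula.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Assumed input: the word-count matrix printed earlier in the notes
counts = np.array([[1, 2, 0, 0, 0, 1, 0, 0, 1],
                   [0, 0, 1, 1, 2, 0, 1, 1, 0]], dtype=float)

transfer = StandardScaler()
standardized = transfer.fit_transform(counts)

# By hand: (x - mean) / std, using the population standard deviation (ddof=0)
manual = (counts - counts.mean(axis=0)) / counts.std(axis=0)

print(standardized)
```

With only two samples per column, every value sits exactly one standard deviation from the mean, which is why the output contains only 1 and -1.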
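Note: threshold /ˈθreʃhəʊld/ n. a cutoff value
(2) Correlation coefficient method: when two features are highly correlated, either drop one of them, or combine them with weights into a new feature and drop the two originals, or apply principal component analysis.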
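As a rough illustration of the first option (dropping one feature of a highly correlated pair), the sketch below uses NumPy on synthetic data. The cutoff `limit = 0.9` and all variable names are hypothetical choices, not taken from the notes.

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.normal(size=200)
b = 2 * a + rng.normal(scale=0.1, size=200)   # almost a linear function of a
c = rng.normal(size=200)                      # unrelated to a and b
X = np.column_stack([a, b, c])

corr = np.corrcoef(X, rowvar=False)           # 3 x 3 Pearson correlation matrix
limit = 0.9                                   # hypothetical cutoff for "highly correlated"

to_drop = set()
for i in range(corr.shape[0]):
    for j in range(i + 1, corr.shape[1]):
        if i not in to_drop and abs(corr[i, j]) > limit:
            to_drop.add(j)                    # keep feature i, drop its near-duplicate j

X_reduced = np.delete(X, sorted(to_drop), axis=1)
print(np.round(corr, 3))
print(sorted(to_drop), X_reduced.shape)
```

For a single pair, `scipy.stats.pearsonr` returns the same coefficient together with a p-value.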
import sklearn.feature_selection
from sklearn.datasets import load_diabetes
# Instantiate the transformer class, using variance-threshold selection to reduce dimensionality
# Call fit_transform
[[ 0.03807591 0.05068012 0.06169621 ... -0.00259226 0.01990842
[-0.00188202 -0.04464164 -0.05147406 ... -0.03949338 -0.06832974
[ 0.08529891 0.05068012 0.04445121 ... -0.00259226 0.00286377
[ 0.04170844 0.05068012 -0.01590626 ... -0.01107952 -0.04687948
[-0.04547248 -0.04464164 0.03906215 ... 0.02655962 0.04452837
[-0.04547248 -0.04464164 -0.0730303 ... -0.03949338 -0.00421986
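The threshold used in the notes is not shown, so the sketch below demonstrates variance-threshold selection on a small hand-made array where the effect is easy to see. The value `threshold=0.1` is an assumption chosen for this toy data.

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

X = np.array([[0.0, 2.0, 0.0, 3.0],
              [0.0, 1.0, 4.0, 3.0],
              [0.0, 1.0, 1.0, 3.0]])

transfer = VarianceThreshold(threshold=0.1)   # keep features whose variance exceeds 0.1
X_new = transfer.fit_transform(X)
kept = transfer.get_support()                 # boolean mask of the retained columns

print(transfer.variances_)
print(kept)
print(X_new)
```

The first and last columns are constant (variance 0), so they are removed, while the two middle columns survive.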
import sklearn.decomposition
import numpy as np
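# Instantiate the transformer class
# Call fit_transform to do the conversion
print('Feature values after PCA decomposition:\n', new_data)
# The data had 4 features; after PCA only 2 remain, and the number of samples is unchanged
Feature values after PCA decomposition: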
[[-3.13587302e-16 3.82970843e+00]
[-5.74456265e+00 -1.91485422e+00]
[ 5.74456265e+00 -1.91485422e+00]]
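The input array for this step is not shown in the notes. The sketch below uses a 3 x 4 array that is assumed here for illustration; the signs of the projected values can differ between scikit-learn versions, so compare magnitudes rather than signs.

```python
import numpy as np
from sklearn.decomposition import PCA

# Assumed input: 3 samples with 4 features each (not taken from the notes)
data = np.array([[2, 8, 4, 5],
                 [6, 3, 0, 8],
                 [5, 4, 9, 1]], dtype=float)

transfer = PCA(n_components=2)        # an integer keeps exactly that many components
new_data = transfer.fit_transform(data)

print('Feature values after PCA decomposition:\n', new_data)
print(transfer.explained_variance_ratio_)
```

Three centered samples span at most two dimensions, so two components already capture all of the variance in this toy input.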
# Use pandas to read the files and to merge the tables
import pandas as pd
 | order_id | user_id | eval_set | order_number | order_dow | order_hour_of_day | days_since_prior_order |
0 | 2539329 | 1 | prior | 1 | 2 | 8 | NaN |
1 | 2398795 | 1 | prior | 2 | 3 | 7 | 15.0 |
2 | 473747 | 1 | prior | 3 | 3 | 12 | 21.0 |
 | product_id | product_name | aisle_id | department_id |
0 | 1 | Chocolate Sandwich Cookies | 61 | 19 |
1 | 2 | All-Seasons Salt | 104 | 13 |
2 | 3 | Robust Golden Unsweetened Oolong Tea | 94 | 7 |
 | order_id | product_id | add_to_cart_order | reordered |
0 | 2 | 33120 | 1 | 1 |
1 | 2 | 28985 | 2 | 1 |
2 | 2 | 9327 | 3 | 0 |
 | aisle_id | aisle |
0 | 1 | prepared soups salads |
1 | 2 | specialty cheeses |
2 | 3 | energy granola bars |
 | aisle_id | aisle | product_id | product_name | department_id |
0 | 1 | prepared soups salads | 209 | Italian Pasta Salad | 20 |
1 | 1 | prepared soups salads | 554 | Turkey Chili | 20 |
2 | 1 | prepared soups salads | 886 | Whole Grain Salad with Roasted Pecans & Mango ... | 20 |
3 | 1 | prepared soups salads | 1600 | Mediterranean Orzo Salad | 20 |
4 | 1 | prepared soups salads | 2539 | Original Potato Salad | 20 |
5 | 1 | prepared soups salads | 2941 | Broccoli Salad | 20 |
6 | 1 | prepared soups salads | 3991 | Moms Macaroni Salad | 20 |
7 | 1 | prepared soups salads | 4112 | Chopped Salad Bowl Italian Salad with Salami &... | 20 |
8 | 1 | prepared soups salads | 4369 | American Potato Salad | 20 |
9 | 1 | prepared soups salads | 4977 | Mushroom Barley Soup | 20 |
10 | 1 | prepared soups salads | 5351 | Smoked Whitefish Salad | 20 |
11 | 1 | prepared soups salads | 5653 | Chicken Curry Salad | 20 |
12 | 1 | prepared soups salads | 6778 | Soup, Golden Quinoa and Kale | 20 |
13 | 1 | prepared soups salads | 8121 | Split Pea Soup | 20 |
14 | 1 | prepared soups salads | 8382 | Organic Tomato Bisque | 20 |
15 | 1 | prepared soups salads | 8946 | Organic Spinach Pow Salad | 20 |
16 | 1 | prepared soups salads | 9431 | San Francisco Potato Salad | 20 |
17 | 1 | prepared soups salads | 10059 | Quinoa Salad Pistachio Citrus | 20 |
18 | 1 | prepared soups salads | 10288 | Broccoli with Almond Soup | 20 |
19 | 1 | prepared soups salads | 10617 | Butternut Squash Cumin Soup | 20 |
 | aisle_id | aisle | product_id | product_name | department_id | order_id | add_to_cart_order | reordered |
0 | 1 | prepared soups salads | 209 | Italian Pasta Salad | 20 | 94246 | 5 | 0 |
1 | 1 | prepared soups salads | 209 | Italian Pasta Salad | 20 | 192465 | 2 | 1 |
2 | 1 | prepared soups salads | 209 | Italian Pasta Salad | 20 | 195206 | 18 | 1 |
3 | 1 | prepared soups salads | 209 | Italian Pasta Salad | 20 | 227717 | 1 | 1 |
4 | 1 | prepared soups salads | 209 | Italian Pasta Salad | 20 | 260072 | 13 | 0 |
5 | 1 | prepared soups salads | 209 | Italian Pasta Salad | 20 | 289399 | 4 | 1 |
6 | 1 | prepared soups salads | 209 | Italian Pasta Salad | 20 | 340960 | 7 | 1 |
7 | 1 | prepared soups salads | 209 | Italian Pasta Salad | 20 | 344099 | 10 | 0 |
8 | 1 | prepared soups salads | 209 | Italian Pasta Salad | 20 | 379434 | 7 | 0 |
9 | 1 | prepared soups salads | 209 | Italian Pasta Salad | 20 | 472683 | 4 | 0 |
10 | 1 | prepared soups salads | 209 | Italian Pasta Salad | 20 | 473054 | 5 | 1 |
11 | 1 | prepared soups salads | 209 | Italian Pasta Salad | 20 | 520382 | 11 | 0 |
12 | 1 | prepared soups salads | 209 | Italian Pasta Salad | 20 | 600934 | 3 | 1 |
13 | 1 | prepared soups salads | 209 | Italian Pasta Salad | 20 | 632958 | 5 | 1 |
14 | 1 | prepared soups salads | 209 | Italian Pasta Salad | 20 | 650024 | 14 | 1 |
15 | 1 | prepared soups salads | 209 | Italian Pasta Salad | 20 | 657646 | 4 | 1 |
16 | 1 | prepared soups salads | 209 | Italian Pasta Salad | 20 | 686260 | 1 | 0 |
17 | 1 | prepared soups salads | 209 | Italian Pasta Salad | 20 | 789744 | 3 | 0 |
18 | 1 | prepared soups salads | 209 | Italian Pasta Salad | 20 | 836624 | 2 | 0 |
19 | 1 | prepared soups salads | 209 | Italian Pasta Salad | 20 | 909037 | 2 | 0 |
 | aisle_id | aisle | product_id | product_name | department_id | order_id | add_to_cart_order | reordered | user_id | eval_set | order_number | order_dow | order_hour_of_day | days_since_prior_order |
0 | 1 | prepared soups salads | 209 | Italian Pasta Salad | 20 | 94246 | 5 | 0 | 114082 | prior | 26 | 0 | 20 | 1.0 |
1 | 1 | prepared soups salads | 22853 | Pesto Pasta Salad | 20 | 94246 | 4 | 0 | 114082 | prior | 26 | 0 | 20 | 1.0 |
2 | 4 | instant foods | 12087 | Chicken Flavor Ramen Noodle Soup | 9 | 94246 | 15 | 0 | 114082 | prior | 26 | 0 | 20 | 1.0 |
3 | 4 | instant foods | 47570 | Original Flavor Macaroni & Cheese Dinner | 9 | 94246 | 14 | 1 | 114082 | prior | 26 | 0 | 20 | 1.0 |
4 | 13 | prepared meals | 10089 | Dolmas | 20 | 94246 | 25 | 0 | 114082 | prior | 26 | 0 | 20 | 1.0 |
5 | 13 | prepared meals | 19687 | Butternut Squash With Cranberries | 20 | 94246 | 6 | 0 | 114082 | prior | 26 | 0 | 20 | 1.0 |
6 | 24 | fresh fruits | 13176 | Bag of Organic Bananas | 4 | 94246 | 24 | 1 | 114082 | prior | 26 | 0 | 20 | 1.0 |
7 | 24 | fresh fruits | 14159 | Seedless Watermelon | 4 | 94246 | 1 | 1 | 114082 | prior | 26 | 0 | 20 | 1.0 |
8 | 24 | fresh fruits | 36082 | Organic Mango | 4 | 94246 | 11 | 0 | 114082 | prior | 26 | 0 | 20 | 1.0 |
9 | 51 | preserved dips spreads | 19415 | Roasted Tomato Salsa Serrano-Tomatillo | 13 | 94246 | 26 | 0 | 114082 | prior | 26 | 0 | 20 | 1.0 |
10 | 61 | cookies cakes | 46373 | Donuts, Powdered | 19 | 94246 | 21 | 0 | 114082 | prior | 26 | 0 | 20 | 1.0 |
11 | 63 | grains rice dried goods | 26313 | Pad Thai Noodles | 9 | 94246 | 2 | 1 | 114082 | prior | 26 | 0 | 20 | 1.0 |
12 | 66 | asian foods | 41481 | Kung Pao Noodles | 6 | 94246 | 3 | 1 | 114082 | prior | 26 | 0 | 20 | 1.0 |
13 | 77 | soft drinks | 19125 | Extra Ginger Brew Jamaican Style Ginger Beer | 7 | 94246 | 27 | 0 | 114082 | prior | 26 | 0 | 20 | 1.0 |
14 | 81 | canned jarred vegetables | 14962 | Hearts of Palm | 15 | 94246 | 12 | 0 | 114082 | prior | 26 | 0 | 20 | 1.0 |
15 | 83 | fresh vegetables | 9839 | Organic Broccoli | 4 | 94246 | 18 | 1 | 114082 | prior | 26 | 0 | 20 | 1.0 |
16 | 83 | fresh vegetables | 29139 | Red Creamer Potato | 4 | 94246 | 20 | 1 | 114082 | prior | 26 | 0 | 20 | 1.0 |
17 | 83 | fresh vegetables | 30027 | Organic Chard Green | 4 | 94246 | 17 | 0 | 114082 | prior | 26 | 0 | 20 | 1.0 |
18 | 83 | fresh vegetables | 41690 | California Cauliflower | 4 | 94246 | 19 | 1 | 114082 | prior | 26 | 0 | 20 | 1.0 |
19 | 83 | fresh vegetables | 46979 | Asparagus | 4 | 94246 | 23 | 1 | 114082 | prior | 26 | 0 | 20 | 1.0 |
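The three merged tables above can be produced by chaining `pd.merge` on the shared key columns. The sketch below uses tiny in-memory frames instead of the CSV files, whose names are not given in the notes; the second order's `user_id` is made up for illustration.

```python
import pandas as pd

aisles = pd.DataFrame({
    'aisle_id': [1, 4],
    'aisle': ['prepared soups salads', 'instant foods'],
})
products = pd.DataFrame({
    'product_id': [209, 12087],
    'product_name': ['Italian Pasta Salad', 'Chicken Flavor Ramen Noodle Soup'],
    'aisle_id': [1, 4],
    'department_id': [20, 9],
})
order_products = pd.DataFrame({
    'order_id': [94246, 94246, 192465],
    'product_id': [209, 12087, 209],
    'add_to_cart_order': [5, 15, 2],
    'reordered': [0, 0, 1],
})
orders = pd.DataFrame({
    'order_id': [94246, 192465],
    'user_id': [114082, 1001],     # 1001 is a hypothetical user id
})

tab1 = pd.merge(aisles, products, on='aisle_id')        # aisle + product info
tab2 = pd.merge(tab1, order_products, on='product_id')  # one row per ordered item
tab3 = pd.merge(tab2, orders, on='order_id')            # attach the user behind each order
print(tab3)
```

`pd.merge` defaults to an inner join, so rows without a match in the other table are dropped.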
# Build a cross-tabulation of user_id against aisle
aisle | air fresheners candles | asian foods | baby accessories | baby bath body care | baby food formula | bakery desserts | baking ingredients | baking supplies decor | beauty | beers coolers | ... | spreads | tea | tofu meat alternatives | tortillas flat bread | trail mix snack mix | trash bags liners | vitamins supplements | water seltzer sparkling water | white wines | yogurt |
user_id | |||||||||||||||||||||
1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
2 | 0 | 3 | 0 | 0 | 0 | 0 | 2 | 0 | 0 | 0 | ... | 3 | 1 | 1 | 0 | 0 | 0 | 0 | 2 | 0 | 42 |
3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 4 | 1 | 0 | 0 | 0 | 0 | 0 | 2 | 0 | 0 |
4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 |
5 | 0 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 3 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
206205 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 5 |
206206 | 0 | 4 | 0 | 0 | 0 | 0 | 4 | 1 | 0 | 0 | ... | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 |
206207 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | ... | 3 | 4 | 0 | 2 | 1 | 0 | 0 | 11 | 0 | 15 |
206208 | 0 | 3 | 0 | 0 | 3 | 0 | 4 | 0 | 0 | 0 | ... | 5 | 0 | 0 | 7 | 0 | 0 | 0 | 0 | 0 | 33 |
206209 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 3 |
206209 rows × 134 columns
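A cross-tabulation like the one above counts how many purchased items each user has in each aisle. A minimal sketch on a made-up five-row frame (the values are illustrative only):

```python
import pandas as pd

# Hypothetical merged rows: one row per purchased item
tab3 = pd.DataFrame({
    'user_id': [1, 1, 2, 2, 2],
    'aisle': ['yogurt', 'spreads', 'yogurt', 'yogurt', 'tea'],
})

# Rows are users, columns are aisles, cells are purchase counts
table = pd.crosstab(tab3['user_id'], tab3['aisle'])
print(table)
```

User and aisle combinations that never occur are filled with 0, which is why the real table is mostly zeros.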
# Dimensionality reduction succeeded: from 134 features down to 44
(206209, 44)
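A reduction from 134 to 44 columns is the kind of result obtained when PCA is asked to keep a fraction of the variance rather than a fixed number of components. The setting used in the notes is not shown; the sketch below assumes `n_components=0.95` and runs on synthetic data.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
latent = rng.normal(size=(500, 3))            # 3 underlying factors
mixing = rng.normal(size=(3, 20))
X = latent @ mixing + 0.01 * rng.normal(size=(500, 20))   # 20 correlated features

# A float in (0, 1) keeps the fewest components whose explained variance reaches that fraction
transfer = PCA(n_components=0.95)             # 0.95 is an assumed value
X_new = transfer.fit_transform(X)
kept_variance = transfer.explained_variance_ratio_.sum()

print(X.shape, '->', X_new.shape)
print(kept_variance)
```

The number of retained components depends on the data, so the same setting can yield very different output widths on different tables.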