① Supervised learning:
Classification: features → target value is a category
Regression: target value is a continuous number
② Unsupervised learning: no target value
Public dataset sources: Kaggle, UCI, scikit-learn
① Load small built-in datasets: sklearn.datasets.load_*()
② Fetch large datasets: sklearn.datasets.fetch_*(data_home=None, subset='train')  # subset selects 'train' / 'test' / 'all'
The returned Bunch object can be accessed two ways: ① dict-style: bunch["key"] ② attribute-style: bunch.key
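A minimal sketch of both access styles on the built-in iris dataset:

```python
from sklearn.datasets import load_iris

iris = load_iris()            # returns a Bunch object
print(iris["feature_names"])  # dict-style access
print(iris.target_names)      # attribute-style access
print(iris.data.shape)        # feature matrix: 150 samples, 4 features
```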
1. Split the data into a training set (e.g. 80%) and a test set (e.g. 20%)
2. sklearn.model_selection.train_test_split(arrays, *options)
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

def datasets_demo():
    iris = load_iris()
    # test_size=0.2 → 80% training, 20% test
    feature_train, feature_test, label_train, label_test = train_test_split(
        iris.data, iris.target, test_size=0.2)
    return feature_train, feature_test, label_train, label_test
Feature engineering determines the upper limit of machine learning.
Feature extraction API: sklearn.feature_extraction
sklearn.feature_extraction.DictVectorizer(sparse=True, …)  # dictionary feature extraction
from sklearn.feature_extraction import DictVectorizer

def dict_demo():
    data = [{"city": "北京", "temperature": 100},
            {"city": "上海", "temperature": 60}]
    transfer = DictVectorizer()              # instantiate
    data_new = transfer.fit_transform(data)  # returns a sparse matrix by default (essentially one-hot encoding of "city")
    print(data_new)
sklearn.feature_extraction.text.CountVectorizer(stop_words=[])  # returns a word-count (term-frequency) matrix
For Chinese text, segment the sentence first with the jieba tokenizer:
import jieba
def cut_ch_demo(text):
    # jieba.lcut returns a token list; join with spaces so CountVectorizer can split on them
    result = " ".join(jieba.lcut(text))
    return result
sklearn.feature_extraction.text.TfidfVectorizer(stop_words=[])
# returns a TF-IDF weight matrix: how important a word is to one document within the corpus
Feature preprocessing API: sklearn.preprocessing
Normalization: sklearn.preprocessing.MinMaxScaler(feature_range=(0, 1), …)
Standardization: sklearn.preprocessing.StandardScaler()  # no feature_range parameter; rescales each feature to mean 0, standard deviation 1
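A short sketch of both scalers on hypothetical numeric data:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

data = np.array([[90, 2, 10],
                 [60, 4, 15],
                 [75, 3, 13]], dtype=float)

# MinMaxScaler maps each column into feature_range (default (0, 1))
normalized = MinMaxScaler(feature_range=(0, 1)).fit_transform(data)
print(normalized)

# StandardScaler rescales each column to mean 0, standard deviation 1
scaled = StandardScaler().fit_transform(data)
print(scaled.mean(axis=0))  # approximately 0 per column
```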
Feature selection: reduce the number of features so that the remaining features are uncorrelated with each other.
Method 1: low-variance feature filtering (drop features whose variance is below a threshold)
Module: sklearn.feature_selection
sklearn.feature_selection.VarianceThreshold(threshold=0.0)  # removes features with variance ≤ threshold
from sklearn.feature_selection import VarianceThreshold

def variance_demo():
    data = [[0, 2, 0], [0, 1, 4], [0, 1, 1]]  # toy data: the first column is constant
    transfer = VarianceThreshold(threshold=0.0)
    return transfer.fit_transform(data)       # the zero-variance column is removed
Method 2: Pearson correlation coefficient
r(X,Y) = \frac{Cov(X,Y)}{\sqrt{Var(X)\,Var(Y)}}
The closer |r| is to 0, the weaker the correlation; the closer |r| is to 1, the stronger.
Module: scipy.stats

from scipy.stats import pearsonr
r = pearsonr(datax, datay)
# returns (correlation coefficient, p-value, i.e. significance level)
When two features are highly correlated, keep only one of them, or combine them with a weighted sum.
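A runnable pearsonr sketch on hypothetical data (y is a noisy linear function of x):

```python
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = 2 * x + rng.normal(scale=0.1, size=100)  # y is almost linear in x

r, p_value = pearsonr(x, y)
print(r, p_value)  # |r| near 1: strong correlation; small p-value: significant
```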
Dimensionality reduction: compress the number of dimensions while losing as little information as possible.
PCA finds a line (direction); projecting the data onto it via a matrix multiplication yields the principal components.
Module: sklearn.decomposition.PCA(n_components=None)  # int: number of components to keep; float in (0, 1): fraction of variance to keep
import pandas as pd
from sklearn.decomposition import PCA
order_products = pd.read_csv("./source/order_products__prior.csv")
product = pd.read_csv("./source/products.csv")
aisles = pd.read_csv("./source/aisles.csv")
orders = pd.read_csv("./source/orders.csv")
tab1 = pd.merge(aisles, product, on="aisle_id")
tab2 = pd.merge(tab1, order_products, on="product_id")
tab3 = pd.merge(tab2, orders, on="order_id")
tab4 = pd.crosstab(tab3["user_id"], tab3["aisle"])
print(tab4.shape)
# instantiate: n_components=0.95 keeps enough components to explain 95% of the variance
transfer = PCA(n_components=0.95)
# fit and transform the user × aisle cross table
data_new = transfer.fit_transform(tab4)
print(data_new)
print(data_new.shape)
The four CSV files used in this dimensionality-reduction example are on Baidu Netdisk:
Link: https://pan.baidu.com/s/1k9QKTdwazpcOuTV1sucQfg
Extraction code: yd2k