Data Processing
- Analyze the distribution of the data and spot outliers/noise (boxplot, quantiles).
- pandas toolkit: watch data types (datetime columns read in as strings) and categorical columns stored as numbers (no ordinal relationship).
- Missing values: check the missing ratio, and whether the missing column is numeric or categorical.
- Time series: trend analysis (displot for continuous values, countplot/value_counts for categorical), mainly single-dimension analysis plus correlated dimensions (corr, heatmap).
- Modeling on business data: the most effective features are usually statistical features (how to aggregate, which categorical columns can serve as groupby keys, which numeric columns can be aggregated). Pay special attention to confidence: when the total count is small, statistics are unstable, and ratio features are more stable than absolute values; see the groupby sketch after this list.
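A minimal sketch of groupby statistical features, on a hypothetical transactions table whose column names ('user_id', 'amount') are assumptions:
import pandas as pd
# hypothetical data; in practice this is your business table
df = pd.DataFrame({'user_id': [1, 1, 2, 2, 2],
                   'amount': [10.0, 30.0, 5.0, 7.0, 9.0]})
# aggregate the numeric column per categorical key
stats = df.groupby('user_id')['amount'].agg(['count', 'mean', 'max', 'min'])
# ratio feature: each row's share within its group (more stable than raw sums when counts are small)
df['amount_ratio'] = df['amount'] / df.groupby('user_id')['amount'].transform('sum')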
Feature Engineering
Numeric
Scaling (normalization)
Discretization / binning (equal-width pd.cut, equal-frequency pd.qcut) (non-linearity / speed / feature crosses / robustness)
Statistics (max, min, quantile)
Arithmetic combinations
Amplitude transforms (some models make distributional assumptions about the input; LR assumes continuous features are roughly normal, so log1p/expm1 can be used)
Supervised binning (fit a decision tree so it learns how to split the continuous value, then take the tree's internal nodes out as combined features); sklearn's DecisionTree has an apply function that can be used for supervised binning, as in the sketch below
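A minimal sketch of tree-based supervised binning on toy data; apply() returns the index of the leaf each sample lands in, which serves as its bin id:
import numpy as np
from sklearn.tree import DecisionTreeClassifier
X = np.random.rand(200, 1)                       # one continuous feature (toy data)
y = (X[:, 0] > 0.5).astype(int)                  # toy label
tree = DecisionTreeClassifier(max_leaf_nodes=4)  # at most 4 bins
tree.fit(X, y)
bins = tree.apply(X)                             # leaf index = bin id for each sample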
Categorical
one-hot encoding
label encoding
binary encoding
category encoding (transforms that use a Bayesian prior, e.g. smoothed target encoding); see the sketch below
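A minimal sketch of label encoding plus a smoothed target encoding; smoothing toward the global mean is the "Bayesian prior" idea (toy data, hand-picked smoothing weight):
import pandas as pd
from sklearn.preprocessing import LabelEncoder
df = pd.DataFrame({'city': ['a', 'b', 'a', 'c', 'b', 'a'],
                   'label': [1, 0, 1, 0, 1, 0]})
# label encoding: map each category to an integer id
df['city_le'] = LabelEncoder().fit_transform(df['city'])
# target encoding with additive smoothing toward the global mean (the prior)
prior = df['label'].mean()
agg = df.groupby('city')['label'].agg(['mean', 'count'])
w = 5  # smoothing weight (assumption; tune in practice)
df['city_te'] = df['city'].map((agg['count'] * agg['mean'] + w * prior) / (agg['count'] + w))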
Temporal
Time point / span (day of week, hour of day)
Time grouping (weekday, weekend, public holiday, …)
Time intervals (time elapsed up to now, …)
When building statistical features together with numeric columns, pick different time windows; see the rolling sketch below
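A minimal sketch of time-window statistics with pandas rolling, on hypothetical daily sales data:
import numpy as np
import pandas as pd
idx = pd.date_range('2018-01-01', periods=30, freq='D')
sales = pd.Series(np.random.rand(30), index=idx)
# the same statistic over different trailing windows
feat_7d_mean = sales.rolling('7D').mean()
feat_14d_max = sales.rolling('14D').max()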
Text
Bag-of-words
tf-idf
lda
word2vec / word embeddings; see the gensim sketch below
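A minimal word2vec sketch with gensim (assuming gensim 4.x, where the dimension argument is vector_size):
from gensim.models import Word2Vec
corpus = [['this', 'is', 'a', 'good', 'class'],
          ['students', 'are', 'very', 'good']]  # tokenized sentences (toy data)
model = Word2Vec(corpus, vector_size=50, window=3, min_count=1)
vec = model.wv['good']  # embedding vector for one token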
Feature Selection
Filter: screen features with statistical tests such as the chi-square test
Wrapper
Embedded
Use tree models to judge feature importance, and run experiments to screen features
Missing-value handling
pandas fillna
age = df_train['Age'].fillna(value=df_train['Age'].mean())
df_train.loc[:,'Age']=age
sklearn SimpleImputer
import numpy as np
from sklearn.impute import SimpleImputer  # replaces the removed sklearn.preprocessing.Imputer
imp = SimpleImputer(missing_values=np.nan, strategy='mean')
age = imp.fit_transform(df_train[['Age']].values)
df_train.loc[:,'Age'] = age
numpy+apply
import numpy as np
log_age = df_train['Age'].apply(lambda x:np.log(x))
df_train.loc[:,'log_age'] = log_age
# operate on several columns at once
import pandas as pd
import numpy as np
df = pd.DataFrame({'a': np.random.randn(6),
                   'b': ['foo', 'bar'] * 3,
                   'c': np.random.randn(6)})
def my_test(a, b):
    return a + b
df['Value'] = df.apply(lambda row: my_test(row['a'], row['c']), axis=1)
Min-max scaling
from sklearn.preprocessing import MinMaxScaler
mm_scaler = MinMaxScaler()
fare_trans = mm_scaler.fit_transform(df_train[['Fare']])
Standardization
from sklearn.preprocessing import StandardScaler
std_scaler = StandardScaler()
fare_std_trans = std_scaler.fit_transform(df_train[['Fare']])
max, min, quantile
# max and min
max_age = df_train['Age'].max()
min_age = df_train['Age'].min()
# quantiles
age_quarter_1 = df_train['Age'].quantile(0.25)
age_quarter_3 = df_train['Age'].quantile(0.75)
Arithmetic combinations
df_train.loc[:,'family_size'] = df_train['SibSp']+df_train['Parch']+1
Polynomial and cross features
from sklearn.preprocessing import PolynomialFeatures
poly = PolynomialFeatures(degree=2)
poly_fea = poly.fit_transform(df_train[['SibSp','Parch']])
After discretization, polynomial features over the bins can to some extent be read as cross features.
Discretization and binning
# equal-width split
df_train.loc[:, 'fare_cut'] = pd.cut(df_train['Fare'], 5)
# equal-frequency split
df_train.loc[:, 'fare_qcut'] = pd.qcut(df_train['Fare'], 5)
One-hot encoding
# pandas get_dummies
embarked_oht = pd.get_dummies(df_train[['Embarked']])
# get_dummies also works on the binned column (sklearn's OneHotEncoder is an alternative)
fare_qcut_oht = pd.get_dummies(df_train[['fare_qcut']])
Dates
pandas to_datetime
car_sales = pd.read_csv('car_data.csv')
car_sales.loc[:,'date'] = pd.to_datetime(car_sales['date_t'])
# extract the month
car_sales.loc[:,'month'] = car_sales['date'].dt.month
# day of month
car_sales.loc[:,'dom'] = car_sales['date'].dt.day
# day of year
car_sales.loc[:,'doy'] = car_sales['date'].dt.dayofyear
# day of week
car_sales.loc[:,'dow'] = car_sales['date'].dt.dayofweek
Bag-of-words
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer()
corpus = [
    'This is a very good class',
    'students are very very very good',
    'This is the third sentence',
    'Is this the last doc',
]
X = vectorizer.fit_transform(corpus)
vectorizer.get_feature_names_out()  # get_feature_names() in older sklearn
# inspect the result
X.toarray()
vec = CountVectorizer(ngram_range=(1,3))
X_ngram = vec.fit_transform(corpus)
vec.get_feature_names_out()  # get_feature_names() in older sklearn
TF-IDF
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf_vec = TfidfVectorizer()
tfidf_X = tfidf_vec.fit_transform(corpus)
tfidf_vec.get_feature_names_out()  # get_feature_names() in older sklearn
Combination features
Implement by hand; a minimal sketch follows
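A minimal sketch of a hand-made combination feature, concatenating two categorical columns (the Titanic-style column names Pclass and Sex are assumptions):
df_train['pclass_sex'] = df_train['Pclass'].astype(str) + '_' + df_train['Sex']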
Feature selection: filter
Variance method: if a feature's values barely differ across samples, it usually contributes little to telling samples apart, so drop features whose variance falls below a threshold. Note: the variance method suits discrete features; continuous features should be discretized before use.
from sklearn.datasets import load_iris
iris = load_iris()
print("iris feature names\n", iris.feature_names)
print("iris feature matrix\n", iris.data)
# feature selection -- variance threshold
from sklearn.feature_selection import VarianceThreshold
vt = VarianceThreshold(threshold=1)  # threshold is the variance cutoff, default 0
X_vt = vt.fit_transform(iris.data)   # returns the selected features
print("features kept by the variance method\n", X_vt)
Chi-square method: use the chi-square statistic as the feature score; the larger the statistic, the stronger the dependence on the target.
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.datasets import load_iris
iris = load_iris()
X, y = iris.data, iris.target
X_new = SelectKBest(chi2, k=2).fit_transform(X, y)  # pass chi2 explicitly; the default score_func is f_classif
Pearson correlation method
# load data
from sklearn.datasets import load_iris
irisdata = load_iris()
# feature selection (Pearson correlation)
from sklearn.feature_selection import SelectKBest  # keeps the top-k features, drops the rest
from scipy.stats import pearsonr  # computes the Pearson correlation
import numpy as np

def pearson_score(X, y):
    # score_func for SelectKBest: given the feature matrix and target vector,
    # return (scores, p-values), one entry per feature
    scores, pvalues = zip(*(pearsonr(col, y) for col in X.T))
    return np.array(scores), np.array(pvalues)

skb = SelectKBest(pearson_score, k=3)  # k = number of features to keep
X_skb = skb.fit_transform(irisdata.data, irisdata.target)
Mutual information (correlation)
Mutual information can capture many kinds of dependence between a feature and the target, but is relatively expensive to compute; see the sketch below
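A minimal sketch using sklearn's mutual-information scorer inside SelectKBest:
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.datasets import load_iris
iris = load_iris()
# score each feature by its mutual information with the target, keep the top 2
X_mi = SelectKBest(mutual_info_classif, k=2).fit_transform(iris.data, iris.target)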
LVW (Las Vegas Wrapper)
A typical wrapper method: it searches feature subsets with a random strategy under the Las Vegas framework, using the final classifier's error as the evaluation criterion for each subset. In each round it randomly draws a feature subset a from the feature set A and estimates the learner's error on a by cross-validation; if the error is smaller than the best seen so far, or comparable but with fewer features in a, then a is kept. A sketch follows.
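A minimal LVW-style sketch (random subset search scored by cross-validated error), with logistic regression as the learner and a hand-picked iteration budget (both are assumptions):
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

iris = load_iris()
X, y = iris.data, iris.target
rng = np.random.RandomState(0)
best_err, best_subset = np.inf, np.arange(X.shape[1])
for _ in range(50):  # iteration budget (assumption)
    mask = rng.rand(X.shape[1]) < 0.5   # randomly draw a feature subset
    if not mask.any():
        continue
    subset = np.flatnonzero(mask)
    err = 1 - cross_val_score(LogisticRegression(max_iter=1000), X[:, subset], y, cv=5).mean()
    # keep the subset if it lowers the error, or ties it with fewer features
    if err < best_err or (err == best_err and len(subset) < len(best_subset)):
        best_err, best_subset = err, subset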
RFE (recursive feature elimination)
Train on the current features to get weights; eliminate the lowest-weighted feature(s) to form a new set; repeat until the stopping condition is met
from sklearn.feature_selection import RFE
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier()
rfe = RFE(estimator=rf, n_features_to_select=2)
X_rfe = rfe.fit_transform(X,y)
Feature selection: embedded
In filter and wrapper methods, feature selection is clearly separated from learner training; embedded selection merges the two into one process, so features are selected automatically while the learner is trained. Main approaches: regularized regression and tree-based feature selection:
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression
# feature selection with L1-penalized logistic regression as the base model
# (the liblinear solver is needed for the L1 penalty in newer sklearn)
SelectFromModel(LogisticRegression(penalty="l1", C=0.1, solver="liblinear")).fit_transform(iris.data, iris.target)
# tree-based feature selection
from sklearn.ensemble import GradientBoostingClassifier
# GBDT as the base model
SelectFromModel(GradientBoostingClassifier()).fit_transform(iris.data, iris.target)
Cross-validation in sklearn
from sklearn.model_selection import cross_val_score
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
iris = load_iris()
logreg = LogisticRegression()
scores = cross_val_score(logreg, iris.data, iris.target)
print("cross-validation scores: ", scores)
GridSearchCV = grid search (produces candidate hyperparameters) + cross-validation (the evaluation scheme)
First define the candidate-parameter dictionary
param_grid = {'C': [0.001, 0.01, 0.1, 1, 10, 100],
              'gamma': [0.001, 0.01, 0.1, 1, 10, 100]}
Build the model
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC
grid_search = GridSearchCV(SVC(), param_grid, cv=5)
Train
grid_search.fit(X_train, y_train)  # X_train, y_train assumed to come from an earlier train_test_split
Get the best parameters and score
grid_search.best_params_
grid_search.best_score_
Get the best model
grid_search.best_estimator_
K-fold splitting
from sklearn.model_selection import KFold
from sklearn.model_selection import StratifiedKFold
# standard K-fold
kfold = KFold(n_splits=3, shuffle=True, random_state=0)
cross_val_score(logreg, iris.data, iris.target, cv=kfold)
# stratified K-fold
kfold = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
cross_val_score(logreg, iris.data, iris.target, cv=kfold)
Leave-one-out cross-validation
from sklearn.model_selection import LeaveOneOut
loo = LeaveOneOut()
scores = cross_val_score(logreg, iris.data, iris.target, cv=loo)
print("number of cv iterations: ", len(scores))
print("mean accuracy: ", scores.mean())
Shuffle-split cross-validation
from sklearn.model_selection import ShuffleSplit
shuffle_split = ShuffleSplit(test_size=.5, train_size=.5, n_splits=10)
cross_val_score(logreg, iris.data, iris.target, cv=shuffle_split)
mlxtend (for model ensembling; the examples below use sklearn's built-in ensembles)
Voting-based model fusion
from sklearn import model_selection
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.ensemble import VotingClassifier
# assumes `array` is a NumPy array with 8 feature columns and the label in the 9th column (e.g. loaded from a CSV)
X = array[:, 0:8]
Y = array[:, 8]
kfold = model_selection.KFold(n_splits=5, shuffle=True, random_state=2018)  # random_state requires shuffle=True
# create the voter's sub-models
estimators = []
model_1 = LogisticRegression()
estimators.append(('logistic', model_1))
model_2 = DecisionTreeClassifier()
estimators.append(('dt', model_2))
model_3 = SVC()
estimators.append(('svm', model_3))
# build the voting ensemble
ensemble = VotingClassifier(estimators)
result = model_selection.cross_val_score(ensemble, X, Y, cv=kfold)
Bagging
from sklearn.ensemble import BaggingClassifier
dt = DecisionTreeClassifier()
num = 100
kfold = model_selection.KFold(n_splits=5, shuffle=True, random_state=2018)
model = BaggingClassifier(estimator=dt, n_estimators=num, random_state=2018)  # named base_estimator in older sklearn
result = model_selection.cross_val_score(model, X, Y, cv=kfold)
print(result.mean())
RandomForest
from sklearn.ensemble import RandomForestClassifier
num_trees = 100
max_feature_num = 5
kfold = model_selection.KFold(n_splits=5, shuffle=True, random_state=2018)
model = RandomForestClassifier(n_estimators=num_trees, max_features=max_feature_num)
result = model_selection.cross_val_score(model, X, Y, cv=kfold)
print(result.mean())
AdaBoost
from sklearn.ensemble import AdaBoostClassifier
num_trees = 25
kfold = model_selection.KFold(n_splits=5, shuffle=True, random_state=2018)
model = AdaBoostClassifier(n_estimators=num_trees, random_state=2018)
result = model_selection.cross_val_score(model, X, Y, cv=kfold)
print(result.mean())