Input:
# Decision tree, random forest, gradient-boosted decision tree
import pandas as pd
titanic=pd.read_csv('http://biostat.mc.vanderbilt.edu/wiki/pub/Main/DataSets/titanic.txt')
# head() shows the first 5 rows by default and tail() the last 5; head(n)/tail(n) show the first/last n rows
titanic.head()
titanic.info()
#### Feature selection: pick the columns to use, then inspect them
X=titanic[['pclass','age','sex']].copy()  # copy() avoids SettingWithCopyWarning when imputing age below
y=titanic['survived']
X.info()
# The age column has only 633 non-null values and needs imputing; the mean or the median are common choices that bias the model least.
X['age']=X['age'].fillna(X['age'].mean())  # assign back rather than using inplace=True
X.info()
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.25,random_state=33)
# DictVectorizer turns non-numeric features into numbers and leaves numeric ones unchanged; e.g. sex is categorical, so it becomes one-hot columns for male/female
from sklearn.feature_extraction import DictVectorizer
dv=DictVectorizer(sparse=False)
X_train=dv.fit_transform(X_train.to_dict(orient='records'))
X_test=dv.transform(X_test.to_dict(orient='records'))  # transform only: reuse the feature mapping fitted on the training set
# Single decision tree
from sklearn.tree import DecisionTreeClassifier
dtc=DecisionTreeClassifier()
dtc.fit(X_train,y_train)
y_predict=dtc.predict(X_test)
print('Accuracy of DecisionTreeClassifier:',dtc.score(X_test,y_test))
from sklearn.metrics import classification_report
print(classification_report(y_test,y_predict,target_names=['died','survived']))
# Random forest
from sklearn.ensemble import RandomForestClassifier
rfc=RandomForestClassifier()
rfc.fit(X_train,y_train)
y_predict=rfc.predict(X_test)
print('Accuracy of RandomForestClassifier:',rfc.score(X_test,y_test))
print(classification_report(y_test,y_predict,target_names=['died','survived']))
# Gradient-boosted decision tree
from sklearn.ensemble import GradientBoostingClassifier
gbc=GradientBoostingClassifier()
gbc.fit(X_train,y_train)
y_predict=gbc.predict(X_test)
print('Accuracy of GradientBoostingClassifier:',gbc.score(X_test,y_test))
print(classification_report(y_test,y_predict,target_names=['died','survived']))
Output:
RangeIndex: 1313 entries, 0 to 1312
Data columns (total 11 columns):
row.names 1313 non-null int64
pclass 1313 non-null object
survived 1313 non-null int64
name 1313 non-null object
age 633 non-null float64
embarked 821 non-null object
home.dest 754 non-null object
room 77 non-null object
ticket 69 non-null object
boat 347 non-null object
sex 1313 non-null object
dtypes: float64(1), int64(2), object(8)
memory usage: 113.0+ KB
Accuracy of DecisionTreeClassifier: 0.7811550151975684
precision recall f1-score support
died 0.78 0.91 0.84 202
survived 0.80 0.58 0.67 127
accuracy 0.78 329
macro avg 0.79 0.74 0.75 329
weighted avg 0.78 0.78 0.77 329
D:\Anaconda3\lib\site-packages\sklearn\ensemble\forest.py:245: FutureWarning: The default value of n_estimators will change from 10 in version 0.20 to 100 in 0.22.
"10 in version 0.20 to 100 in 0.22.", FutureWarning)
Accuracy of RandomForestClassifier: 0.7781155015197568
precision recall f1-score support
died 0.77 0.91 0.83 202
survived 0.80 0.57 0.66 127
accuracy 0.78 329
macro avg 0.78 0.74 0.75 329
weighted avg 0.78 0.78 0.77 329
Accuracy of GradientBoostingClassifier: 0.790273556231003
precision recall f1-score support
died 0.78 0.92 0.84 202
survived 0.82 0.58 0.68 127
accuracy 0.79 329
macro avg 0.80 0.75 0.76 329
weighted avg 0.80 0.79 0.78 329
head() shows the first 5 rows by default and tail() the last 5; head(n)/tail(n) show the first/last n rows.
info() summarizes the DataFrame: column dtypes and non-null counts.
fillna() fills missing values.
mean() returns the average, ignoring NaN.
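A minimal sketch of the mean-imputation step described above, on a tiny synthetic column (not the Titanic data):

```python
import pandas as pd

# A column with one missing value.
df = pd.DataFrame({'age': [20.0, None, 40.0]})

# mean() skips NaN: (20 + 40) / 2 = 30.0
mean_age = df['age'].mean()

# Assign the filled column back instead of using inplace=True,
# which can trigger SettingWithCopyWarning on a DataFrame slice.
df['age'] = df['age'].fillna(mean_age)
```

After this, the column is [20.0, 30.0, 40.0], and the non-null count reported by info() matches the row count.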
DictVectorizer converts a list of dicts mapping feature names to feature values into vectors: used with scikit-learn estimators, it produces a NumPy array or a SciPy sparse matrix.
Parameters:
dtype: callable, optional, default float. The type of feature values; passed as the dtype argument to the NumPy array or SciPy sparse matrix constructor.
separator: string, optional, default "=". The separator used when building one-hot feature names: it joins the key and value of an input dict entry, and the resulting string becomes a column name of the feature matrix.
sparse: boolean, optional, default True. Whether transform produces a SciPy sparse matrix. Internally DictVectorizer always builds a sparse matrix; if sparse is False, the result is converted to a numpy.ndarray.
sort: boolean, optional, default True. Whether feature_names_ and vocabulary_ are sorted when fitting.
pandas' to_dict converts DataFrame data and supports six conversion types via the orient parameter: 'dict', 'list', 'series', 'split', 'records', 'index'. The program above uses 'records'. For details see:
https://www.jb51.net/article/141481.htm
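A small sketch of the orient='records' conversion that feeds DictVectorizer above (synthetic rows):

```python
import pandas as pd

df = pd.DataFrame({'pclass': ['1st', '3rd'],
                   'sex': ['female', 'male']})

# orient='records' yields one dict per row: exactly the
# list-of-mappings format that DictVectorizer expects.
records = df.to_dict(orient='records')
```

Here `records` is `[{'pclass': '1st', 'sex': 'female'}, {'pclass': '3rd', 'sex': 'male'}]`.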
Imports for each classification algorithm: single decision tree, random forest, gradient-boosted decision tree.
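The three-way comparison above can be sketched end to end on a synthetic problem (make_classification and the random_state values are illustrative choices, not the Titanic setup):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data in place of the Titanic features.
X, y = make_classification(n_samples=300, n_features=5, random_state=33)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=33)

# Fit each model on the same split and record its test accuracy.
scores = {}
for name, clf in [('decision tree', DecisionTreeClassifier(random_state=33)),
                  ('random forest', RandomForestClassifier(random_state=33)),
                  ('gradient boosting', GradientBoostingClassifier(random_state=33))]:
    clf.fit(X_train, y_train)
    scores[name] = clf.score(X_test, y_test)
```

The ensembles typically match or beat the single tree, mirroring the pattern in the output above, though the exact numbers depend on the data and the seed.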