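A binary classification problem: graduate admissions at UCLA (University of California, Los Angeles). Data source: http://www.ats.ucla.edu/stat/data/binary.csv
The dataset has four columns (three features plus the outcome, i.e. whether the applicant was admitted):
gpa: grade point average
gre: GRE score
rank: prestige of the applicant's undergraduate school
admit: the binary target variable, indicating whether the applicant was eventually admitted
Preprocess the data with Python pandas: summary statistics, dummy variables, and normalization
Classify the processed data with several methods: sklearn's LogisticRegression, RandomForestClassifier and KNeighborsClassifier, plus xgboost
Tune parameters automatically with GridSearchCV
Compare against the accuracy obtained when no dummy-variable encoding is applied.
The recommended development environment is Anaconda (Python 2.7) + PyCharm.
Download the CSV file from the link above and save it as UCLA_dataset.csv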
import pandas as pd

def get_data():
    f_path = "./dataset/UCLA_dataset.csv"
    df = pd.read_csv(f_path)
    print(df.head())  # print the first five rows
The output is:
admit gre gpa rank
0 0 380 3.61 3
1 1 660 3.67 3
2 1 800 4.00 1
3 1 640 3.19 4
4 0 520 2.93 4
Because rank is also the name of a pandas DataFrame method (df.rank), the column is renamed to prestige:
    df.rename(columns={"rank": "prestige"}, inplace=True)
    print(df.head())
    return df
The first five rows now look like this:
admit gre gpa prestige
0 0 380 3.61 3
1 1 660 3.67 3
2 1 800 4.00 1
3 1 640 3.19 4
4 0 520 2.93 4
def statistic_data(df):
    print(df.describe())
The output is:
admit gre gpa prestige
count 400.000000 400.000000 400.000000 400.00000
mean 0.317500 587.700000 3.389900 2.48500
std 0.466087 115.516536 0.380567 0.94446
min 0.000000 220.000000 2.260000 1.00000
25% 0.000000 520.000000 3.130000 2.00000
50% 0.000000 580.000000 3.395000 2.00000
75% 1.000000 660.000000 3.670000 3.00000
max 1.000000 800.000000 4.000000 4.00000
    # cross-tabulate admission outcome against school prestige
    print(pd.crosstab(df['admit'], df['prestige'], rownames=['admit']))
The output is:
prestige 1 2 3 4
admit
0 28 97 93 55
1 33 54 28 12
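The better the school's prestige (1 is best), the higher the admission rate: 33 of 61 applicants (about 54%) from prestige-1 schools were admitted, against 12 of 67 (about 18%) from prestige-4 schools.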
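The admission rate per prestige level can be checked directly from the crosstab counts. The sketch below rebuilds the table by hand from the output above rather than loading the CSV; the names counts and admit_rate are illustrative and not part of the original code.

```python
import pandas as pd

# Counts copied from the crosstab output above (rows: admit, columns: prestige).
counts = pd.DataFrame(
    {1: [28, 33], 2: [97, 54], 3: [93, 28], 4: [55, 12]},
    index=pd.Index([0, 1], name="admit"),
)
counts.columns.name = "prestige"

# Share of admitted applicants within each prestige level.
admit_rate = counts.loc[1] / counts.sum(axis=0)
print(admit_rate.round(3))
```

On the real DataFrame, pd.crosstab(df['admit'], df['prestige'], normalize='columns') should give the same proportions in one call, assuming a pandas version that supports the normalize argument.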
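    # applicants who were not admitted
    no_admits = df[df.admit == 0]
    print(no_admits.shape)
    # GRE summary statistics for applicants who were not admitted
    print(no_admits.gre.describe())
    # applicants who were admitted
    admits = df[df.admit == 1]
    print(admits.shape)
    # GRE summary statistics for admitted applicants
    print(admits.gre.describe())
The output is:
# 273 applicants were not admitted; their mean GRE score is 573.19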
(273, 4)
count 273.000000
mean 573.186813
std 115.830243
min 220.000000
25% 500.000000
50% 580.000000
75% 660.000000
max 800.000000
Name: gre, dtype: float64
# 127 applicants were admitted; their mean GRE score is 618.90
(127, 4)
count 127.000000
mean 618.897638
std 108.884884
min 300.000000
25% 540.000000
50% 620.000000
75% 680.000000
max 800.000000
Name: gre, dtype: float64
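Admitted applicants have a higher mean GRE score (618.90 versus 573.19), so a higher GRE score goes with a better chance of admission. The histograms are plotted next: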
import matplotlib.pyplot as plt  # these imports belong at the top of the script
import seaborn as sns
sns.set()

    # (continuing statistic_data) GRE histograms for the two classes, sharing the x axis
    f, (ax1, ax2) = plt.subplots(2, 1, sharex=True)
    f.suptitle('gre by class')
    bins = 20
    ax1.hist(no_admits.gre, bins=bins)
    ax1.set_title('no_admits')
    ax2.hist(admits.gre, bins=bins)
    ax2.set_title('admits')
    plt.xlabel('gre')
    plt.ylabel('Number of applicants')
    plt.show()
    # histograms of every column
    df.hist()
    plt.show()
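Dummy variables (also called indicator variables) represent the effect of categorical, non-quantitative factors. Econometric models often have to account for qualitative attributes such as occupation, education level, or season, whose magnitude is hard to measure directly. All one can record is "Yes (D=1)" or "No (D=0)", or a degree or grade. To reflect these attributes and improve model accuracy they have to be "quantified", which is done by constructing artificial 0/1 variables.
pandas provides a range of tools for handling categorical variables. We can use get_dummies to turn the prestige column into dummy variables.
In this example prestige has four levels: 1, 2, 3 and 4 (1 being the most prestigious), so it is better treated as a categorical variable. Calling get_dummies produces a four-column DataFrame, one column per level.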
def process_df(df):
    dummy_ranks = pd.get_dummies(df['prestige'], prefix='prestige')
    print(dummy_ranks.tail())
The output is:
prestige_1 prestige_2 prestige_3 prestige_4
395 0.0 1.0 0.0 0.0
396 0.0 0.0 1.0 0.0
397 0.0 1.0 0.0 0.0
398 0.0 1.0 0.0 0.0
399 0.0 0.0 1.0 0.0
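Only k-1 of the k dummy columns should enter the model; the dropped column (here prestige_1) serves as the baseline. The final data is built as follows:
    cols_to_keep = ['admit', 'gre', 'gpa']
    # .loc replaces the deprecated .ix; keep prestige_2 onward, prestige_1 is the baseline
    data = df[cols_to_keep].join(dummy_ranks.loc[:, 'prestige_2':])
    print(data.tail())
    return data
The output is: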
admit gre gpa prestige_2 prestige_3 prestige_4
395 0 620 4.00 1.0 0.0 0.0
396 0 560 3.04 0.0 1.0 0.0
397 0 460 2.63 1.0 0.0 0.0
398 0 700 3.65 1.0 0.0 0.0
399 0 600 3.89 0.0 1.0 0.0
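As an alternative to slicing off prestige_1 by hand, get_dummies can drop the baseline level itself through its drop_first argument. The sketch below uses a four-row toy frame with the post's column names; the values are illustrative, and drop_first is a suggested variation, not what the original code does.

```python
import pandas as pd

# Toy frame with the same column names as the post (values are illustrative).
df = pd.DataFrame({
    "admit": [0, 1, 1, 0],
    "gre": [380, 660, 800, 520],
    "gpa": [3.61, 3.67, 4.00, 2.93],
    "prestige": [3, 1, 2, 4],
})

# drop_first=True drops the first level (prestige_1), which becomes the baseline.
data = pd.get_dummies(df, columns=["prestige"], prefix="prestige", drop_first=True)
print(data.columns.tolist())
```

A row whose prestige is 1 then has zeros (or False, depending on the pandas version) in all three remaining dummy columns.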
Next, consider normalizing the gre and gpa columns (their scales differ only moderately, by a factor of about 100). In practice, however, normalization did not improve accuracy for any classifier other than KNeighborsClassifier:
def process_df_norm(df):
    dummy_ranks = pd.get_dummies(df['prestige'], prefix='prestige')
    print(dummy_ranks.tail())
    cols_to_keep = ['admit', 'gre', 'gpa']
    df = df[cols_to_keep].join(dummy_ranks.loc[:, 'prestige_2':])
    cols_norm = ['gre', 'gpa']
    # min-max scaling of each column to the range [0, 1]
    f_norm = lambda x: (x - x.min()) / (x.max() - x.min())
    df[cols_norm] = df[cols_norm].apply(f_norm)
    print(df.tail())
    return df
The output is:
admit gre gpa prestige_2 prestige_3 prestige_4
395 0 0.689655 1.000000 1.0 0.0 0.0
396 0 0.586207 0.448276 0.0 1.0 0.0
397 0 0.413793 0.212644 1.0 0.0 0.0
398 0 0.827586 0.798851 1.0 0.0 0.0
399 0 0.655172 0.936782 0.0 1.0 0.0
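The normalized values can be verified by hand with the min-max formula (x - min) / (max - min), using the minima and maxima from the describe() output (gre: 220 to 800, gpa: 2.26 to 4.00). In the sketch below the first two rows are hypothetical rows that merely carry those column extremes; the last two reuse the gre and gpa values of rows 395 and 396.

```python
import pandas as pd

# Rows 0-1: hypothetical rows holding the column minima and maxima from describe().
# Rows 2-3: the gre/gpa values of rows 395 and 396 shown earlier in the post.
sample = pd.DataFrame({
    "gre": [220, 800, 620, 560],
    "gpa": [2.26, 4.00, 4.00, 3.04],
})

# Min-max scaling: (x - min) / (max - min), applied column by column.
scaled = (sample - sample.min()) / (sample.max() - sample.min())
print(scaled.round(6))
```

For example, a GRE score of 620 maps to (620 - 220) / (800 - 220) = 400 / 580, which agrees with the 0.689655 printed for row 395.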
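from sklearn.model_selection import train_test_split

def split_train_test(df):
    # every column except the first (admit) is a feature; .iloc replaces the deprecated .ix
    X = df.iloc[:, 1:]
    Y = df.admit
    # hold out 20% of the rows as the test set
    X_train, X_test, Y_train, Y_test = train_test_split(X, Y,
                                                        test_size=0.2, random_state=40)
    return X_train, X_test, Y_train, Y_test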
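With test_size=0.2 the 400 rows split into 320 training rows and 80 test rows, which is why the confusion matrices further down sum to 80. The sketch below shows this on placeholder data with the post's class balance (127 positives out of 400). The stratify argument is an optional addition that is not used in the original code; it keeps the class ratio roughly equal in both parts.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in data with the same size and class balance as the post's dataset:
# 400 rows, 127 of them labelled 1. The feature values are random placeholders.
rng = np.random.RandomState(0)
X = rng.rand(400, 5)
y = np.array([1] * 127 + [0] * 273)

# stratify=y is an assumption of this sketch, not part of the original split.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=40, stratify=y)

print(X_train.shape, X_test.shape)
print(y_train.mean(), y_test.mean())
```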
from sklearn.metrics import confusion_matrix, accuracy_score

def print_cm_accuracy(Y_true, Y_pred):
    # rows of the confusion matrix are true classes, columns are predicted classes
    cnf_matrix = confusion_matrix(Y_true, Y_pred)
    print(cnf_matrix)
    accuracy_percent = accuracy_score(Y_true, Y_pred)
    print("accuracy is: %s%%" % (accuracy_percent * 100))
from sklearn.linear_model import LogisticRegression

def lr_fit_test(X_train, X_test, Y_train, Y_test):
    lr = LogisticRegression()
    # lr = LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
    #                         intercept_scaling=1, penalty='l2', random_state=None, tol=0.0001)
    lr.fit(X_train, Y_train)
    Y_pred = lr.predict(X_test)
    print_cm_accuracy(Y_test, Y_pred)
The output is:
[[53 0]
[23 4]]
accuracy is: 71.25%
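The accuracy figure hides how the errors are distributed, so it helps to read the confusion matrix itself. The sketch below recomputes a few quantities from the matrix printed above; the variable names are illustrative.

```python
import numpy as np

# Confusion matrix copied from the output above:
# rows are true classes (0, 1), columns are predicted classes (0, 1).
cm = np.array([[53, 0],
               [23, 4]])

tn, fp, fn, tp = cm.ravel()
total = float(cm.sum())

accuracy = (tn + tp) / total          # correct predictions over all 80 test rows
recall_admit = tp / float(tp + fn)    # share of admitted applicants the model finds
majority_baseline = (tn + fp) / total # accuracy of always predicting "not admitted"

print(accuracy, recall_admit, majority_baseline)
```

The model identifies only 4 of the 27 admitted applicants, and always predicting "not admitted" would already reach 53/80 = 66.25% on this test set, so 71.25% is a modest gain over that baseline.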
from sklearn.ensemble import RandomForestClassifier

def rfc_fit_test(X_train, X_test, Y_train, Y_test):
    rf = RandomForestClassifier(n_jobs=4)  # train the trees on 4 cores
    rf.fit(X_train, Y_train)
    Y_pred = rf.predict(X_test)
    print_cm_accuracy(Y_test, Y_pred)
Output:
[[46 7]
[17 10]]
accuracy is: 70.0%
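On the same data, the random forest is slightly less accurate than logistic regression (70.0% versus 71.25%).
Accuracy after normalizing the data is lower still: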
[[48 5]
[20 7]]
accuracy is: 68.75%
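On an 80-row test set each example is worth 1.25 percentage points, so the gaps between 71.25%, 70.0% and 68.75% each correspond to a single test example. Cross-validation is one way to get a steadier comparison. The sketch below is not part of the original experiments: it runs on synthetic data from make_classification, and the fixed random_state, n_estimators and max_iter values are assumptions chosen only to make it reproducible.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the admissions data (400 rows, 5 features).
X, y = make_classification(n_samples=400, n_features=5, n_informative=3,
                           n_redundant=0, random_state=0)

models = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "random forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

# 5-fold cross-validated accuracy for each model.
results = {}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
    results[name] = scores
    print("%s: %.3f (+/- %.3f)" % (name, scores.mean(), scores.std()))
```

Applied to the real data, the same loop would take the feature columns and the admit column in place of X and y.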
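The following compares K-nearest-neighbors accuracy for different values of K, with and without normalization:
from sklearn.neighbors import KNeighborsClassifier

def knn_fit_test(X_train, X_test, Y_train, Y_test):
    # compare how accuracy changes with k
    ks = [1, 3, 5, 7]
    for k in ks:
        print('KNeighborsClassifier k is %s' % k)
        knn = KNeighborsClassifier(n_neighbors=k)
        knn.fit(X_train, Y_train)
        Y_pred = knn.predict(X_test)
        print_cm_accuracy(Y_test, Y_pred)
Without normalization: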
KNeighborsClassifier k is 1
[[41 12]
[15 12]]
accuracy is: 66.25%
KNeighborsClassifier k is 3
[[44 9]
[17 10]]
accuracy is: 67.5%
KNeighborsClassifier k is 5
[[47 6]
[17 10]]
accuracy is: 71.25%
KNeighborsClassifier k is 7
... (remaining output omitted)
With normalization:
KNeighborsClassifier k is 1
[[41 12]
[15 12]]
accuracy is: 66.25%
KNeighborsClassifier k is 3
[[44 9]
[14 13]]
accuracy is: 71.25%
KNeighborsClassifier k is 5
[[44 9]
[15 12]]
accuracy is: 70.0%
K can of course also be chosen by grid search with GridSearchCV:
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import classification_report
from sklearn.neighbors import KNeighborsClassifier

def knn_fit_test(X_train, X_test, Y_train, Y_test):
    paras = {'n_neighbors': [1, 3, 5, 7]}
    knn = KNeighborsClassifier()
    clf = GridSearchCV(knn, paras, n_jobs=-1)
    clf.fit(X_train, Y_train)
    print('The parameters of the best model are: ')
    print(clf.best_params_)
    # cv_results_ replaces grid_scores_, which was removed in scikit-learn 0.20
    results = clf.cv_results_
    for mean_score, std_score, params in zip(results['mean_test_score'],
                                             results['std_test_score'],
                                             results['params']):
        print("%0.3f (+/-%0.03f) for %r"
              % (mean_score, std_score * 2, params))
    Y_pred = clf.predict(X_test)
    print(classification_report(Y_test, Y_pred))
The output is:
0.616 (+/-0.060) for {'n_neighbors': 1}
0.656 (+/-0.049) for {'n_neighbors': 3}
0.669 (+/-0.055) for {'n_neighbors': 5}
0.666 (+/-0.063) for {'n_neighbors': 7}
precision recall f1-score support
0 0.75 0.89 0.81 53
1 0.65 0.41 0.50 27
avg / total 0.71 0.72 0.71 80
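When normalization and grid search are combined, one option is to put the scaler and the classifier into a single Pipeline, so that the scaling is refitted on the training part of every cross-validation fold instead of on the whole dataset. This is a sketch on synthetic data, not the original setup: the step names scale and knn, the choice of MinMaxScaler and cv=5 are all assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for the admissions features and labels.
X, y = make_classification(n_samples=400, n_features=5, n_informative=3,
                           n_redundant=0, random_state=0)

# Scaling happens inside the pipeline, so each CV fold fits its own scaler.
pipe = Pipeline([
    ("scale", MinMaxScaler()),
    ("knn", KNeighborsClassifier()),
])

# Parameters of a pipeline step are addressed as <step name>__<parameter>.
param_grid = {"knn__n_neighbors": [1, 3, 5, 7]}
search = GridSearchCV(pipe, param_grid, cv=5, scoring="accuracy")
search.fit(X, y)

print(search.best_params_)
print(search.best_score_)
```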
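As seen earlier, applicants from more prestigious schools are admitted more often, and the prestige levels are already ordered, so it is worth using the raw prestige values directly instead of dummy variables. In these experiments that gave higher accuracy for K-nearest neighbors and for every other method tried.
The xgboost prediction code is given below: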
def xgb_fit_test(X_train, X_test, Y_train, Y_test):
    import os
    # Windows only: put the MinGW runtime on PATH before importing xgboost
    mingw_path = 'C:/Program Files/mingw-w64/x86_64-6.2.0-posix-seh-rt_v5-rev1/mingw64/bin'
    os.environ['PATH'] = mingw_path + ';' + os.environ['PATH']
    from xgboost import XGBClassifier
    model = XGBClassifier()
    model.fit(X_train, Y_train)
    Y_pred = model.predict(X_test)
    print_cm_accuracy(Y_test, Y_pred)
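Note: this example does not come from Kaggle; it is filed under the kaggle label only to keep posts grouped together.
Please credit the source when reposting.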
http://blog.yhat.com/posts/logistic-regression-python-rodeo.html
http://www.powerxing.com/logistic-regression-in-python/