Analyzing fund data with sklearn <1>
Fetching fund data with a Python crawler <2>
Data preprocessing: cleaning and generating sample data <3>
Training on the sample data with sklearn <4>
Making predictions with the model and improving it <5>
With the sample data in hand, we can train it with the various classification algorithms in sklearn. Since this is a simple binary classification problem with very few feature variables, there is no need to consider things like multi-class classification or dimensionality reduction. I also tried regression algorithms to see how well next month's growth rate could be predicted:
the MSE scores of LinearRegression, Ridge, and Lasso decrease in that order, so the fit improves slightly, but the error on the actual test data was far too large. The samples are probably not linear, and it did not seem worth moving on to polynomial regression either, so the regression results had no reference value and were dropped.
import pandas as pd
import numpy as np
Allrecords = pd.read_csv('2017-12_data.csv')
Allrecords.index = Allrecords.iloc[:, 0]
# Map the 0 class label to -1
Allrecords['calsstype'] = Allrecords['calsstype'].replace([0], [-1])
data = Allrecords
data_train = data.iloc[:int(0.8*len(data)), :]  # first 80% as training data
data_test = data.iloc[int(0.8*len(data)):, :]   # last 20% as test data
Xtrain = data_train.iloc[:, 1:6]
Ytrain_lr = data_train.iloc[:, 9]  # regression target: next month's growth rate
Ytest_lr = data_test.iloc[:, 9]
Xtest = data_test.iloc[:, 1:6]
# LinearRegression
from sklearn.linear_model import LinearRegression
linreg = LinearRegression()
linreg.fit(Xtrain, Ytrain_lr)
y_pred_lr = linreg.predict(Xtest)
from sklearn import metrics
# Compute MSE with scikit-learn
print("MSE:", metrics.mean_squared_error(Ytest_lr, y_pred_lr))
# Compute RMSE with scikit-learn
print("RMSE:", np.sqrt(metrics.mean_squared_error(Ytest_lr, y_pred_lr)))
newindex = data_test.index
y_pred_lr = pd.DataFrame(y_pred_lr, index=newindex)
# Ridge
from sklearn.linear_model import Ridge
ridge = Ridge(alpha=1)
ridge.fit(Xtrain, Ytrain_lr)
print(ridge.intercept_)
print(ridge.coef_)
# Use scikit-learn to select the Ridge regularization hyperparameter alpha
from sklearn.linear_model import RidgeCV
ridgecv = RidgeCV(alphas=[0.01, 0.1, 0.5, 1, 3, 5, 7, 10, 20, 100])
ridgecv.fit(Xtrain, Ytrain_lr)
print(ridgecv.alpha_)
y_pred_ridge = ridgecv.predict(Xtest)
# Compute MSE with scikit-learn
print("MSE:", metrics.mean_squared_error(Ytest_lr, y_pred_ridge))
# Compute RMSE with scikit-learn
print("RMSE:", np.sqrt(metrics.mean_squared_error(Ytest_lr, y_pred_ridge)))
# Lasso
from sklearn.linear_model import Lasso
lasso = Lasso(alpha=1)
lasso.fit(Xtrain, Ytrain_lr)
print(lasso.intercept_)
print(lasso.coef_)
# Use scikit-learn to select the Lasso regularization hyperparameter alpha
from sklearn.linear_model import LassoCV
lassocv = LassoCV(alphas=[0.01, 0.1, 0.5, 1, 3, 5, 7, 10, 20, 100])
lassocv.fit(Xtrain, Ytrain_lr)
print(lassocv.alpha_)
y_pred_lasso = lassocv.predict(Xtest)
# Compute MSE with scikit-learn
print("MSE:", metrics.mean_squared_error(Ytest_lr, y_pred_lasso))
# Compute RMSE with scikit-learn
print("RMSE:", np.sqrt(metrics.mean_squared_error(Ytest_lr, y_pred_lasso)))
Next come the classification algorithms, but first a note on model evaluation. Looking at the generated sample Y values shows that the classes are unevenly distributed: positive examples account for only about 10% of the samples. In that situation accuracy alone is not enough — predicting everything as the negative class already scores 90% accuracy — so recall, R = TP/(TP+FN), is the more informative metric.
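To make that concrete, here is a toy check (the numbers are made up purely for illustration): an all-negative prediction over a 10%-positive sample scores high on accuracy but zero on recall.
from sklearn.metrics import accuracy_score, recall_score
y_true = [1]*10 + [-1]*90   # 10% positive, as in our sample
y_allneg = [-1]*100         # a "classifier" that always predicts negative
print(accuracy_score(y_true, y_allneg))             # 0.9
print(recall_score(y_true, y_allneg, pos_label=1))  # 0.0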
There are many classification algorithms, and starting from the simple ones I tried them all:
LogisticRegression, DecisionTreeClassifier, KNeighborsClassifier (KNN), and GaussianNB (naive Bayes). Their accuracy was acceptable, but recall was very low — they predicted almost everything as the negative class — so they can be ruled out (a quick comparison sketch follows below). Moving on to more complex algorithms, things improved: accuracy above 90% and recall above 50%. Four algorithms made the cut: SVM and the ensemble methods RandomForestClassifier, AdaBoostClassifier, and GradientBoostingClassifier.
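For reference, a minimal sketch of how the four simple classifiers can be compared; it reuses the Xtrain/Ytrain/Xtest/Ytest splits built in the code below, and the default hyperparameters are an assumption on my part:
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score, recall_score
for clf in [LogisticRegression(), DecisionTreeClassifier(),
            KNeighborsClassifier(), GaussianNB()]:
    clf.fit(Xtrain, Ytrain)
    y_pred = clf.predict(Xtest)
    # Report both metrics: accuracy looks fine, recall exposes the problem
    print(type(clf).__name__,
          accuracy_score(Ytest, y_pred),
          recall_score(Ytest, y_pred, pos_label=1))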
The full code is as follows:
import pandas as pd
Allrecords = pd.read_csv('2017-12_data.csv')
Allrecords.index = Allrecords.iloc[:, 0]
# Map the 0 class label to -1
Allrecords['calsstype'] = Allrecords['calsstype'].replace([0], [-1])
data = Allrecords
data_train = data.iloc[:int(0.8*len(data)), :]  # first 80% as training data
data_test = data.iloc[int(0.8*len(data)):, :]   # last 20% as test data
Ytrain = data_train.iloc[:, 8]  # classification target
Xtrain = data_train.iloc[:, 1:6]
Ytest = data_test.iloc[:, 8]
Xtest = data_test.iloc[:, 1:6]
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV  # sklearn.grid_search was removed in newer versions
from sklearn.tree import DecisionTreeClassifier
rf1 = RandomForestClassifier(n_estimators=50, max_depth=14,
                             oob_score=True, random_state=10)
rf1.fit(Xtrain, Ytrain)
print(rf1.oob_score_)  # out-of-bag score as a quick generalization estimate
y_predprob = rf1.predict_proba(Xtrain)[:, 1]  # predicted probability of the positive class
y_testpred = rf1.predict(Xtest)
newindex = data_test.index
y_testpred = pd.DataFrame(y_testpred, index=newindex)
# SVM
from sklearn.svm import SVC
grid = GridSearchCV(SVC(), param_grid={"C": [0.1, 1, 10], "gamma": [1, 0.1, 0.01]}, cv=4)
grid.fit(Xtrain, Ytrain)
y_pred_svc = grid.predict(Xtest)
y_pred_svc = pd.DataFrame(y_pred_svc, index=newindex)
# AdaBoost with a shallow decision tree as the base estimator
from sklearn.ensemble import AdaBoostClassifier
bdt = AdaBoostClassifier(DecisionTreeClassifier(max_depth=2, min_samples_split=20, min_samples_leaf=5),
                         algorithm="SAMME",
                         n_estimators=200, learning_rate=0.8)
bdt.fit(Xtrain, Ytrain)
y_pred_ada = bdt.predict(Xtest)
y_pred_ada = pd.DataFrame(y_pred_ada, index=newindex)
# GBDT
from sklearn.ensemble import GradientBoostingClassifier
gbm0 = GradientBoostingClassifier(random_state=10)  # default hyperparameters
gbm0.fit(Xtrain, Ytrain)
y_pred_gbdt = gbm0.predict(Xtest)
y_pred_gbdt = pd.DataFrame(y_pred_gbdt, index=newindex)
# Collect the four models' predictions next to the test data and export
result = pd.concat([data_test, y_testpred, y_pred_svc, y_pred_ada, y_pred_gbdt], axis=1)
outputfile = 'output201712.xls'
result.to_excel(outputfile)
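The accuracy and recall figures cited above can also be computed directly rather than read off the exported spreadsheet; a minimal sketch, reusing the prediction DataFrames built above (the dict keys are just labels of my own choosing):
from sklearn.metrics import accuracy_score, recall_score
preds = {'RandomForest': y_testpred, 'SVM': y_pred_svc,
         'AdaBoost': y_pred_ada, 'GBDT': y_pred_gbdt}
for name, y_pred in preds.items():
    print(name,
          accuracy_score(Ytest, y_pred.iloc[:, 0]),
          recall_score(Ytest, y_pred.iloc[:, 0], pos_label=1))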
This yields the test results of the four classification algorithms on the sample data:
As the results show, the simpler classification algorithms perform poorly, while the latter four work reasonably well.
Besides accuracy and recall, it is also worth computing the average return of the funds predicted as positive. For example, a prediction that comes up positive while the sample is labeled negative, but whose actual growth rate is 4.6%, is in fact still a good outcome; even 1% is positive and carries no risk of loss. So the average return over the predicted positives is another important evaluation metric.
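A minimal sketch of that metric, assuming (as in the regression code above) that column 9 of the test data holds the next-month growth rate, and reusing the preds dict from the previous sketch:
# Average actual next-month growth rate over the funds each model flagged as positive
actual_growth = data_test.iloc[:, 9]
for name, y_pred in preds.items():
    positives = actual_growth[y_pred.iloc[:, 0] == 1]
    print(name, positives.mean())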