The teacher handed out a dataset and asked for a regression, judged by R². The catch: it is classification-style data being used for regression... He also said that only an R² above 0.07 earns you the right to discuss hyperparameter tuning with him.
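As a reminder, the R² that sklearn's r2_score reports is R² = 1 - SS_res / SS_tot, so it can go negative when the model is worse than just predicting the mean (which matters for the training log later). A minimal sketch with toy arrays (the values are illustrative only):

import numpy as np
from sklearn.metrics import r2_score

y_true = np.array([0.01, -0.02, 0.03, 0.00])
y_pred = np.array([0.00, -0.01, 0.02, 0.01])
ss_res = np.sum((y_true - y_pred) ** 2)            # residual sum of squares
ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)   # total sum of squares
print(1 - ss_res / ss_tot)       # same value as r2_score
print(r2_score(y_true, y_pred))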
Data download link: https://pan.baidu.com/s/1slUKBrn  Password: kyn4
With sklearn, both multiple linear regression and a 500-tree random forest top out around R² = 0.06.
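That baseline looked roughly like the sketch below. This is my reconstruction rather than the exact script: the chronological 70/30 split mirrors the TF code later in this post, and apart from n_estimators=500 all hyperparameters are sklearn defaults.

from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score

def baseline_r2(X, y, train_ratio=0.7):
    # Chronological split, same as the TF code below
    n = int(len(X) * train_ratio)
    X_tr, X_te, y_tr, y_te = X[:n], X[n:], y[:n], y[n:]
    lin = LinearRegression().fit(X_tr, y_tr)
    rf = RandomForestRegressor(n_estimators=500, n_jobs=-1).fit(X_tr, y_tr)
    return (r2_score(y_te, lin.predict(X_te)),
            r2_score(y_te, rf.predict(X_te)))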
Since I am currently learning TensorFlow, I wanted to reproduce roughly the same result with TF.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import tensorflow as tf
from sklearn import preprocessing
from sklearn.decomposition import PCA  # needed for the PCA step below
from sklearn.metrics import r2_score
ts = pd.read_csv('file:///E:/data/HFT_XY_unselected.csv').astype('float32')
ts = ts.dropna()
ts.drop('Unnamed: 0', axis=1, inplace=True)  # drop the leftover index column
feature_columns = ['X' + str(i) for i in range(1, 333)]
X = ts[feature_columns].copy()
# target
y = ts.realY.copy()
# Preprocessing: PCA dimensionality reduction
pca = PCA(n_components=50)  # 'mle' would pick the number of components automatically
reduce_X = pca.fit_transform(X)
# cumulative explained variance ratio of the kept components
print(sum(pca.explained_variance_ratio_))
PCA_X = reduce_X.astype('float32')
From the PCA score you can see that the 50 components explain roughly 90% of the variance, which I consider acceptable.
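If you would rather pick the number of components from a variance target instead of hard-coding 50, sklearn supports that directly; a small sketch (the 0.90 threshold is my own illustrative choice):

import numpy as np
from sklearn.decomposition import PCA

# Option 1: a float n_components keeps just enough components for that ratio
pca90 = PCA(n_components=0.90).fit(X)
print(pca90.n_components_)

# Option 2: inspect the cumulative curve yourself
cum_var = np.cumsum(PCA().fit(X).explained_variance_ratio_)
k = int(np.searchsorted(cum_var, 0.90)) + 1   # smallest k reaching 90%
print(k, cum_var[k - 1])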
# Train/test split (chronological, 70/30)
train_size = int(PCA_X.shape[0] * 0.7)
X_train = PCA_X[:train_size].reshape(-1, 50)
y_train = y.values[:train_size].reshape(-1, 1)
X_test = PCA_X[train_size:].reshape(-1, 50)
y_test = y.values[train_size:].reshape(-1, 1)
# Standardization (fit on the training set only)
scaler = preprocessing.StandardScaler().fit(X_train)
print(scaler.mean_, scaler.scale_)
x_data_standard = scaler.transform(X_train)
n_samples = X_train.shape[0]  # number of training rows
# Learning rate (2 is unusually large for Adam, but it is what was used here)
learning_rate = 2
x = tf.placeholder(tf.float32, shape=[None, 50])  # input: 50 PCA features
y_ = tf.placeholder(tf.float32, shape=[None, 1])  # target
keep_prob = tf.placeholder(tf.float32)            # dropout keep probability
# Model parameters
# Hidden layer 1: 50 -> 200
W1 = tf.Variable(tf.truncated_normal([50, 200], stddev=0.1), name='W1')
b1 = tf.Variable(tf.constant(0.1, shape=[200]), name='b1')
L1 = tf.nn.tanh(tf.matmul(x, W1) + b1, name='L1')
L1_drop = tf.nn.dropout(L1, keep_prob)
# Hidden layer 2: 200 -> 50
W2 = tf.Variable(tf.truncated_normal([200, 50], stddev=0.1), name='W2')
b2 = tf.Variable(tf.constant(0.1, shape=[50]), name='b2')
L2 = tf.nn.tanh(tf.matmul(L1_drop, W2) + b2, name='L2')
L2_drop = tf.nn.dropout(L2, keep_prob)
# Output layer: 50 -> 1 (linear, since this is regression)
W3 = tf.Variable(tf.truncated_normal([50, 1], stddev=0.1), name='W3')
b3 = tf.Variable(tf.constant(0.1, shape=[1]), name='b3')
# Model prediction
pred = tf.matmul(L2_drop, W3) + b3
# Loss: halved mean squared error
cost = tf.reduce_sum(tf.pow(pred - y_, 2)) / (2 * n_samples)
# Adam optimizer
optimizer = tf.train.AdamOptimizer(learning_rate).minimize(cost)
with tf.Session() as sess:
    # Initialize variables
    sess.run(tf.global_variables_initializer())
    # Training loop (full-batch)
    for step in range(500):
        sess.run(optimizer, feed_dict={x: x_data_standard, y_: y_train, keep_prob: 0.8})
        if step % 10 == 0:
            # Evaluate on the test set; dropout is disabled at evaluation time
            prediction_value = sess.run(pred, feed_dict={x: scaler.transform(X_test), keep_prob: 1.0})
            score = r2_score(y_test, prediction_value)
            print('R2 score %i:' % step, score)
R2 score 0: -30244.3268412
R2 score 10: -1340.94961339
R2 score 20: -596.268456144
R2 score 30: -194.559941395
R2 score 40: -39.0541979543
R2 score 50: -3.91464581928
R2 score 60: -2.91056697861
R2 score 70: -2.67173714001
R2 score 80: -1.0624426114
R2 score 90: -0.165357421399
R2 score 100: -0.0385794292803
R2 score 110: -0.0354491289759
R2 score 120: -0.0131951483968
R2 score 130: 0.00014439312723
R2 score 140: 0.00127453319753
R2 score 150: 0.00167685360921
R2 score 160: 0.00177724882822
R2 score 170: 0.00193877464792
R2 score 180: 0.00233761433903
R2 score 190: 0.00232340264801
R2 score 200: 0.00220291985519
After repeated training runs the R² can get close to about 0.05. The network design here is fairly poor: training is slow and the result is weak.
This post is mainly a record of a fully connected network, as a contrast with the single-feature and multi-feature TF linear regression in the previous post.