Notes on my first deep learning project as a complete beginner.
The task is to predict house sale prices from information about each house, such as the number of bedrooms, living area, location, nearby schools, and the seller's summary. The data consist of houses sold in California in 2020, with the houses in the test dataset sold after those in the training dataset. Likewise, the houses on the private leaderboard were sold after those on the public leaderboard.
The task is the same kind as the introductory competition demonstrated in the course: house price prediction, which is a regression problem.
# Load the data
import pandas as pd

train_data = pd.read_csv('train.csv')
test_data = pd.read_csv('test.csv')
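Looking at train.csv and test.csv, the training set has 47.3k rows and the test set has 31.1k, both larger than anything demonstrated in the course. At this scale, one-hot encoding every text label blows up memory, so the data have to be inspected and each kind of feature handled separately.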
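To make the memory problem concrete, here is a rough, library-free sketch that counts how many indicator columns one-hot encoding would create and how large a dense float32 matrix of that width would be. The toy columns and the helper names `one_hot_width` and `dense_bytes` are made up for illustration; the 78k rows and 30,000 distinct values are only the approximate figures mentioned in this post, not measured values.

```python
from collections import Counter

# Hypothetical toy columns standing in for text features in train.csv.
columns = {
    "Type": ["SingleFamily", "Condo", "SingleFamily", "Townhouse"],
    "Address": ["540 Pine Ln", "1727 W 67th St", "28093 Pine Ave", "12 Oak St"],
}

def one_hot_width(values):
    """Number of indicator columns one-hot encoding would create."""
    return len(Counter(values))

def dense_bytes(n_rows, n_cols, bytes_per_value=4):
    """Memory of a dense float32 matrix with the given shape."""
    return n_rows * n_cols * bytes_per_value

widths = {name: one_hot_width(vals) for name, vals in columns.items()}
print(widths)

# Illustrative scale only: about 78k rows (train + test) and one column
# with 30,000 distinct values, stored densely as float32.
print(dense_bytes(78_000, 30_000) / 1e9, "GB")
```

A column like Address, where nearly every row is unique, gets almost one indicator column per row, which is why it cannot be one-hot encoded directly.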
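After inspection, the features fall into several categories, which are handled in turn.
The course already gives the method for handling numeric data: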
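# Extract the numeric features and standardize them
# (all_features is the merged feature table built in the concat step further down)
numeric_features = all_features.dtypes[all_features.dtypes != 'object'].index
all_features[numeric_features] = all_features[numeric_features].apply(
    lambda x: (x - x.mean()) / (x.std()))
# After standardization every column has mean 0, so filling missing values with 0 imputes the mean
all_features[numeric_features] = all_features[numeric_features].fillna(0)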
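As a small illustration of why filling with 0 after standardization amounts to mean imputation, here is a library-free sketch on one toy column. The helper `standardize_fill` and the `bedrooms` values are hypothetical; it uses the sample standard deviation, which is what pandas' `std()` computes by default.

```python
import statistics

def standardize_fill(values):
    """Standardize a column, ignoring None, then replace None with 0."""
    present = [v for v in values if v is not None]
    mean = statistics.mean(present)
    std = statistics.stdev(present)  # sample std (ddof=1)
    return [0.0 if v is None else (v - mean) / std for v in values]

# Toy column with one missing value.
bedrooms = [2.0, 3.0, None, 4.0, 3.0]
scaled = standardize_fill(bedrooms)
print(scaled)
```

The missing entry ends up at 0.0, exactly where a house with the average number of bedrooms would land.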
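Looking at the text features, some can be grouped easily, while others would need NLP methods to extract keywords before they can be grouped.
For example, the Address feature contains tens of thousands of distinct values such as 540 Pine Ln, 1727 W 67th St, and 28093 Pine Ave, which cannot be grouped in any simple way. As a beginner I found this kind of data hard to handle, and after several attempts I kept running into out-of-memory errors. For now I simply drop such features and will come back to them once I know more advanced methods. Summary is a similar feature.
All features, except the ones dropped above, are merged into all_features.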
# Drop some columns (Id, Address, Summary; the training slice also skips the Sold Price label)
all_features = pd.concat((train_data.iloc[:, 4:], test_data.iloc[:, 3:]))
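Some other features have a few values that occur very often, plus a long tail of descriptions that are hard to group. Flooring, for example, falls mainly into three groups: [null], wood, and other, and the "other" group alone contains more than thirty thousand distinct values. For such features I use a coarse grouping: keep the eight most frequent values, add one category for missing values and one "Other" category for everything else, which gives roughly ten categories per feature, and then one-hot encode them. This keeps the dimensionality down.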
# Handle discrete values / text labels
import collections

text_features = all_features.dtypes[all_features.dtypes == 'object'].index
for feature in text_features:
    # Count how often each value occurs
    word_counts = collections.Counter(all_features[feature])
    # Keep the 8 most frequent values as category labels
    type_label = [label for label, _ in word_counts.most_common(8)]
    # Missing values get their own 'NAN' category
    all_features[feature] = all_features[feature].fillna('NAN')
    type_label.append('NAN')
    # Map everything outside the chosen labels to 'Other'
    all_features[feature] = [i if i in type_label else 'Other' for i in all_features[feature]]
# One-hot encode; dtype=float keeps the indicator columns numeric
# (recent pandas versions would otherwise produce bool columns)
all_features = pd.get_dummies(all_features, dummy_na=True, dtype=float)
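The bucketing idea can be seen in isolation on a toy column. This is a simplified sketch, not the exact code above: the helper `bucket_top_k` and the `flooring` values are hypothetical, and unlike the loop above it excludes missing values before counting, so a missing value can never take one of the top-k slots.

```python
from collections import Counter

def bucket_top_k(values, k=8, missing="NAN", other="Other"):
    """Keep the k most frequent labels; map None to `missing`, the rest to `other`."""
    counts = Counter(v for v in values if v is not None)
    keep = {label for label, _ in counts.most_common(k)}
    return [missing if v is None else (v if v in keep else other) for v in values]

# Toy Flooring column: two frequent values, a missing value, and a long tail.
flooring = ["Wood", "Tile", "Wood", None, "Carpet", "Wood, Tile", "Tile", "Wood"]
bucketed = bucket_top_k(flooring, k=2)
print(bucketed)
print(sorted(set(bucketed)))
```

Whatever the original cardinality, the column ends up with at most k + 2 distinct values, so its one-hot encoding has a small, fixed width.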
Next, all the data are converted to tensors so they can be fed to the neural network.
import torch

n_train = train_data.shape[0]
train_features = torch.tensor(all_features[:n_train].values, dtype=torch.float32)
test_features = torch.tensor(all_features[n_train:].values, dtype=torch.float32)
# Reshape the labels (column 2, 'Sold Price') to (n, 1) so they match the network output
train_labels = torch.tensor(train_data.iloc[:, 2].values.reshape(-1, 1), dtype=torch.float32)
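To start, I trained the linear model given in the course.
The training loss is mean squared error, but models are compared by log RMSE. The network is a single linear layer.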
from torch import nn

loss = nn.MSELoss()
in_features = train_features.shape[1]

def get_net():
    net = nn.Sequential(nn.Linear(in_features, 1))
    return net

def log_rmse(net, features, labels):
    # To keep the logarithm stable, clamp predictions below 1 up to 1
    clipped_preds = torch.clamp(net(features), 1, float('inf'))
    rmse = torch.sqrt(loss(torch.log(clipped_preds),
                           torch.log(labels)))
    return rmse.item()
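To see why log RMSE is used for comparison, here is a small library-free sketch of the same metric on plain Python lists. The function mirrors the clamping above but is a simplified stand-in, and the prices are made-up numbers: both predictions are 10% too high, yet their absolute errors differ by a factor of ten.

```python
import math

def log_rmse(preds, labels):
    """RMSE between log-predictions and log-labels, clamping preds to >= 1."""
    clipped = [max(p, 1.0) for p in preds]
    sq = [(math.log(p) - math.log(y)) ** 2 for p, y in zip(clipped, labels)]
    return math.sqrt(sum(sq) / len(sq))

# A cheap house and an expensive house, each overestimated by 10%.
cheap = log_rmse([110_000.0], [100_000.0])
pricey = log_rmse([1_100_000.0], [1_000_000.0])
print(cheap, pricey)
```

Both values come out as log(1.1), so the metric measures relative error and expensive houses do not dominate the score.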
The training function uses minibatch gradient descent with the Adam optimizer.
from d2l import torch as d2l

def train(net, train_features, train_labels, test_features, test_labels,
          num_epochs, learning_rate, weight_decay, batch_size):
    train_ls, test_ls = [], []
    train_iter = d2l.load_array((train_features, train_labels), batch_size)
    # Adam optimizer
    optimizer = torch.optim.Adam(net.parameters(),
                                 lr=learning_rate,
                                 weight_decay=weight_decay)
    for epoch in range(num_epochs):
        for X, y in train_iter:
            optimizer.zero_grad()
            l = loss(net(X), y)
            l.backward()
            optimizer.step()
        # Record the log RMSE once per epoch
        train_ls.append(log_rmse(net, train_features, train_labels))
        if test_labels is not None:
            test_ls.append(log_rmse(net, test_features, test_labels))
    return train_ls, test_ls
get_k_fold_data returns the training and validation slices for the i-th fold; k_fold trains k times and returns the average errors.
def get_k_fold_data(k, i, X, y):
    assert k > 1
    fold_size = X.shape[0] // k
    X_train, y_train = None, None
    for j in range(k):
        idx = slice(j * fold_size, (j + 1) * fold_size)
        X_part, y_part = X[idx, :], y[idx]
        if j == i:
            # Fold i is held out for validation
            X_valid, y_valid = X_part, y_part
        elif X_train is None:
            X_train, y_train = X_part, y_part
        else:
            X_train = torch.cat([X_train, X_part], 0)
            y_train = torch.cat([y_train, y_part], 0)
    return X_train, y_train, X_valid, y_valid

def k_fold(k, X_train, y_train, num_epochs, learning_rate, weight_decay,
           batch_size):
    train_l_sum, valid_l_sum = 0, 0
    for i in range(k):
        data = get_k_fold_data(k, i, X_train, y_train)
        net = get_net()
        train_ls, valid_ls = train(net, *data, num_epochs, learning_rate,
                                   weight_decay, batch_size)
        train_l_sum += train_ls[-1]
        valid_l_sum += valid_ls[-1]
        print(f'fold {i + 1}, train log rmse {float(train_ls[-1]):f}, '
              f'valid log rmse {float(valid_ls[-1]):f}')
    return train_l_sum / k, valid_l_sum / k
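The slicing logic of get_k_fold_data is easier to follow on plain indices. The sketch below uses a hypothetical helper, `k_fold_indices`, that applies the same contiguous, equal-size fold scheme to row numbers instead of tensors. One detail it makes visible: because the fold size is `n // k`, any leftover rows at the end are never used.

```python
def k_fold_indices(n, k, i):
    """Index lists (train, valid) for fold i, using contiguous equal-size folds."""
    assert k > 1 and 0 <= i < k
    fold_size = n // k
    train, valid = [], []
    for j in range(k):
        part = list(range(j * fold_size, (j + 1) * fold_size))
        if j == i:
            valid = part          # fold i is held out for validation
        else:
            train.extend(part)    # every other fold goes to training
    return train, valid

# 11 rows, 5 folds of size 2: row 10 falls outside every fold.
train_idx, valid_idx = k_fold_indices(n=11, k=5, i=1)
print(valid_idx)
print(train_idx)
```

With 11 rows and k = 5, fold 1 validates on rows 2 and 3 and trains on the other eight rows in folds; the last row appears in neither list.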
Tune the hyperparameters to find a reasonably low error.
k, num_epochs, lr, weight_decay, batch_size = 5, 100, 5, 0, 64
train_l, valid_l = k_fold(k, train_features, train_labels, num_epochs, lr,
                          weight_decay, batch_size)
print(f'{k}-fold validation: avg train log rmse: {float(train_l):f}, '
      f'avg valid log rmse: {float(valid_l):f}')
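I only tuned the hyperparameters lightly and the overall preprocessing is fairly crude, so the resulting error is fairly large. The function that follows retrains on the full training set and writes the submission file.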
import numpy as np

def train_and_pred(train_features, test_features, train_labels, test_data,
                   num_epochs, lr, weight_decay, batch_size):
    net = get_net()
    train_ls, _ = train(net, train_features, train_labels, None, None,
                        num_epochs, lr, weight_decay, batch_size)
    d2l.plot(np.arange(1, num_epochs + 1), [train_ls], xlabel='epoch',
             ylabel='log rmse', xlim=[1, num_epochs], yscale='log')
    print(f'train log rmse: {float(train_ls[-1]):f}')
    # Apply the network to the test set
    preds = net(test_features).detach().numpy()
    # Reformat the predictions for export to Kaggle
    test_data['Sold Price'] = pd.Series(preds.reshape(1, -1)[0])
    submission = pd.concat([test_data['Id'], test_data['Sold Price']], axis=1)
    submission.to_csv('submission.csv', index=False)

# Train on all training data with the hyperparameters chosen above and write submission.csv
train_and_pred(train_features, test_features, train_labels, test_data,
               num_epochs, lr, weight_decay, batch_size)
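As for the final score: because the competition has already closed, I can only tell that the submission would rank somewhere in the 140s. That is a fairly poor ranking, but as a beginner I am satisfied with it.
All in all, Mu Li's course is excellent and I learned a great deal from it. Anyone who wants to get started with deep learning should give it a try.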
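Related links:
Kaggle competition: California House Prices | Kaggle
Mu Li's tutorial for this chapter: 4.10. Predicting House Prices on Kaggle (实战Kaggle比赛:预测房价), Dive into Deep Learning (动手学深度学习) 2.0.0-beta0 documentation (d2l.ai)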