PyTorch: the Kaggle Titanic competition with a multilayer perceptron

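Yesterday I finished the softmax regression material and got the overall design flow straight, so I wanted to practice on a simple competition. The multilayer perceptron here is just softmax regression with a hidden layer added. Mu Li's (李沐) videos are really good, but I was a bit lost when I first started with PyTorch. If you find them hard to follow, I suggest first watching this teacher's videos on Bilibili:
《PyTorch深度学习实践》完结合集 (PyTorch Deep Learning Practice, complete collection)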

1. References:

Kaggle实战入门(一)之泰坦尼克号 (Kaggle Hands-On Introduction, Part 1: Titanic)
Mu Li (李沐) video: Kaggle house price prediction

2. Code

import torch
import pandas as pd
from d2l import torch as d2l
from torch import nn
# Read the data
test_data = pd.read_csv(r'C:\Users\Administrator\Desktop\DeepLearningInfo\titanic\test.csv')
train_data = pd.read_csv(r'C:\Users\Administrator\Desktop\DeepLearningInfo\titanic\train.csv')
# Extract the target column
train_lables = train_data['Survived']

# Remove the target column from the training features
train_data = train_data.drop('Survived', axis=1)
# print(train_data.info())
# print(train_data.columns)

# Concatenate the training and test sets so missing values and encodings are handled the same way
features = pd.concat((train_data, test_data), axis=0)
# Drop the ticket number, cabin, and passenger ID, which seem unlikely to matter much
features = features.drop(['Ticket', 'Cabin', 'PassengerId'], axis=1)

# A passenger's title reflects social status: extract it from Name into a new Title column,
# merge the rare titles into a few groups, then drop the original Name column
features['Title'] = features['Name'].apply(lambda x: x.split(',')[1].split('.')[0].strip())
# Assign the result back instead of using inplace=True on a column, which newer pandas may not propagate
features['Title'] = features['Title'].replace(['Capt', 'Col', 'Major', 'Dr', 'Rev'], 'Officer')
features['Title'] = features['Title'].replace(['Don', 'Sir', 'the Countess', 'Dona', 'Lady'], 'Royalty')
features['Title'] = features['Title'].replace(['Mme', 'Ms', 'Mrs'], 'Mrs')
features['Title'] = features['Title'].replace(['Mlle', 'Miss'], 'Miss')
features['Title'] = features['Title'].replace(['Master', 'Jonkheer'], 'Master')
# 'Mr' is left unchanged
features = features.drop('Name', axis=1)
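To see what the title extraction does, here is a minimal standalone sketch in plain Python. The sample strings follow the dataset's "Surname, Title. Given names" format; `extract_title` and `TITLE_GROUPS` are hypothetical names used only for this illustration, with the grouping copied from the replacements above.

```python
def extract_title(name: str) -> str:
    """Return the text between the first comma and the following period."""
    return name.split(',')[1].split('.')[0].strip()

# Hypothetical lookup table mirroring the grouping used in this post
TITLE_GROUPS = {
    'Capt': 'Officer', 'Col': 'Officer', 'Major': 'Officer', 'Dr': 'Officer', 'Rev': 'Officer',
    'Don': 'Royalty', 'Sir': 'Royalty', 'the Countess': 'Royalty', 'Dona': 'Royalty', 'Lady': 'Royalty',
    'Mme': 'Mrs', 'Ms': 'Mrs',
    'Mlle': 'Miss',
    'Jonkheer': 'Master',
}

samples = [
    "Braund, Mr. Owen Harris",
    "Cumings, Mrs. John Bradley (Florence Briggs Thayer)",
    "Heikkinen, Miss. Laina",
    "Rothes, the Countess. of (Lucy Noel Martha Dyer-Edwards)",
]

titles = [extract_title(n) for n in samples]
# Titles without an entry in the table (Mr, Mrs, Miss, Master) stay as they are
grouped = [TITLE_GROUPS.get(t, t) for t in titles]

print(titles)   # ['Mr', 'Mrs', 'Miss', 'the Countess']
print(grouped)  # ['Mr', 'Mrs', 'Miss', 'Royalty']
```

With pandas, the same table could be applied through `Series.replace`, which also accepts a dict.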

# Age has many missing values: predict them with a random forest
from sklearn.ensemble import RandomForestRegressor
ages = features[['Age', 'Pclass', 'Sex', 'Title']]
ages = pd.get_dummies(ages, dtype=float)  # float dummies keep .values purely numeric
known_ages = ages[ages.Age.notnull()].values
unknown_ages = ages[ages.Age.isnull()].values
y = known_ages[:, 0]   # target: Age
X = known_ages[:, 1:]  # inputs: Pclass plus the one-hot Sex and Title columns
rfr = RandomForestRegressor(random_state=60, n_estimators=100, n_jobs=-1)
rfr.fit(X, y)
pre_ages = rfr.predict(unknown_ages[:, 1:])
features.loc[features.Age.isnull(), 'Age'] = pre_ages
# print(features['Age'])

# Fill missing Embarked values with 'S', the most common port
features['Embarked'] = features['Embarked'].fillna('S')

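# Standardize the numeric (non-object) columns
numeric_cols = features.dtypes[features.dtypes != 'object'].index
# z-score: subtract the mean first, then divide by the standard deviation
features[numeric_cols] = features[numeric_cols].apply(lambda x: (x - x.mean()) / x.std())
# After standardization the mean is 0, so filling the remaining gaps with 0 imputes the mean
features[numeric_cols] = features[numeric_cols].fillna(0)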
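Standardization here means the z-score, (x - mean) / std. The parentheses matter, because division binds tighter than subtraction. The sketch below checks this on a few illustrative numbers using only the standard library; `standardize` is a hypothetical helper, and `statistics.stdev` is the sample standard deviation (n - 1), the same convention as pandas' default `Series.std()`.

```python
import statistics

def standardize(values):
    """Return z-scores: (v - mean) / sample standard deviation."""
    mean = statistics.mean(values)
    std = statistics.stdev(values)
    return [(v - mean) / std for v in values]

ages = [22.0, 38.0, 26.0, 35.0]  # illustrative values, not taken from the dataset

z = standardize(ages)

# Without the parentheses, only a constant (mean / std) is subtracted,
# so the spread of the data does not change at all
mean, std = statistics.mean(ages), statistics.stdev(ages)
shifted_only = [v - mean / std for v in ages]

print(round(statistics.mean(z), 6), round(statistics.stdev(z), 6))
print(round(statistics.stdev(shifted_only), 6), round(std, 6))
```

The first line should show a mean of about 0 and a standard deviation of about 1; the second shows that the unparenthesized version keeps the original spread.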

# One-hot encode the categorical columns (float dummies so the frame converts cleanly to a tensor)
features = pd.get_dummies(features, dummy_na=False, dtype=float)

# Convert to tensors
train_num = train_data.shape[0]
test_num = test_data.shape[0]
train_features = torch.tensor(features[:train_num].values.astype('float32'), dtype=torch.float32)
train_lables = torch.tensor(train_lables.values, dtype=torch.long).reshape(-1)
test_features = torch.tensor(features[train_num:].values.astype('float32'), dtype=torch.float32)
# The test set has no labels; these are placeholders so load_array gets a tensor of matching length
test_lables = train_lables[:test_features.shape[0]]

# Mini-batch data loaders
train_iter = d2l.load_array((train_features, train_lables), batch_size=64, is_train=True)
test_iter = d2l.load_array((test_features, test_lables), batch_size=64, is_train=False)
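`d2l.load_array` is a thin wrapper around PyTorch's own data utilities. If you would rather not depend on d2l for this step, something along these lines should behave the same way. This is a sketch: the random tensors are stand-ins with the shapes used in this post (891 training rows, 16 features), not the real data.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

def load_array(data_arrays, batch_size, is_train=True):
    """Wrap tensors in a DataLoader; shuffle only when training."""
    dataset = TensorDataset(*data_arrays)
    return DataLoader(dataset, batch_size=batch_size, shuffle=is_train)

# Stand-in data with the same shapes as train_features / train_lables
X = torch.randn(891, 16)
y = torch.randint(0, 2, (891,))

train_iter = load_array((X, y), batch_size=64)
xb, yb = next(iter(train_iter))
print(tuple(xb.shape), tuple(yb.shape))  # (64, 16) (64,)
print(len(train_iter))                   # 14 batches: 13 full ones plus a final batch of 59
```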

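Adding a hidden layer lowers the loss a little: about 0.41 without it and about 0.39 with it. The loss is still fairly high, but this is as far as my current skills go.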
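For comparison, the two variants might look like the sketch below. The softmax-regression baseline is an assumption about what "without the hidden layer" looked like, since that code is not shown in this post; `count_params` is a hypothetical helper for counting trainable parameters.

```python
import torch
from torch import nn

# Assumed baseline: softmax regression is a single linear layer
# (the softmax itself is applied inside nn.CrossEntropyLoss)
softmax_reg = nn.Linear(16, 2)

# The multilayer perceptron used in this post
mlp = nn.Sequential(nn.Linear(16, 8), nn.ReLU(), nn.Linear(8, 4), nn.ReLU(), nn.Linear(4, 2))

def count_params(model):
    """Total number of trainable scalars in the model."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

x = torch.randn(5, 16)  # a dummy batch of 5 passengers
print(tuple(softmax_reg(x).shape), tuple(mlp(x).shape))  # (5, 2) (5, 2)
print(count_params(softmax_reg), count_params(mlp))      # 34 182
```

Both models map 16 features to 2 logits, so the rest of the pipeline (loss, optimizer, prediction) can stay the same when switching between them.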

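# Build the model: a multilayer perceptron (16 input features, 2 output classes)
net = nn.Sequential(nn.Linear(16, 8), nn.ReLU(), nn.Linear(8, 4), nn.ReLU(), nn.Linear(4, 2))
def init_weight(m):
    # apply() passes module instances, so test the type with isinstance
    if isinstance(m, nn.Linear):
        nn.init.normal_(m.weight, std=0.01)
net.apply(init_weight)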
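One detail worth checking in the initializer: `net.apply(fn)` calls `fn` on every submodule instance, so the type test has to use `isinstance(m, nn.Linear)` (or `type(m) == nn.Linear`). Comparing an instance with the class itself never matches, and the initializer would then silently do nothing. A small probe (the list names are just for illustration) shows the difference:

```python
from torch import nn

net = nn.Sequential(nn.Linear(16, 8), nn.ReLU(), nn.Linear(8, 4), nn.ReLU(), nn.Linear(4, 2))

matched_eq, matched_isinstance = [], []

def probe(m):
    if m == nn.Linear:              # instance compared with the class: never true
        matched_eq.append(m)
    if isinstance(m, nn.Linear):    # proper type check
        matched_isinstance.append(m)

net.apply(probe)
print(len(matched_eq), len(matched_isinstance))  # 0 3

def init_weight(m):
    if isinstance(m, nn.Linear):
        nn.init.normal_(m.weight, std=0.01)

net.apply(init_weight)
# The weights should now be tightly concentrated around 0
first_layer_std = net[0].weight.std().item()
print(first_layer_std)
```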

# Loss function and optimizer
loss = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(net.parameters(), lr=0.001)
# optimizer = torch.optim.Adam(net.parameters(), lr=0.001, weight_decay=1)

# Training
num_epochs = 10000
for epoch in range(num_epochs):
    for X, y in train_iter:
        optimizer.zero_grad()
        l = loss(net(X), y)
        l.backward()
        optimizer.step()
    # Report the loss on the full training set; no gradients are needed here
    with torch.no_grad():
        l = loss(net(train_features), train_lables)
    print(f'epoch:{epoch}, loss:{l.item():.4f}')
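The training loss alone does not say how many passengers are classified correctly, and this Kaggle competition is scored on accuracy. A helper along the following lines could be called on the training set, or on a held-out slice of it, after each epoch. `accuracy` is a hypothetical function, and the tiny hand-built model only serves to check it.

```python
import torch
from torch import nn

def accuracy(net, features, labels):
    """Fraction of rows whose highest-scoring class equals the label."""
    net.eval()
    with torch.no_grad():
        preds = net(features).argmax(dim=1)
    return (preds == labels).float().mean().item()

# Hand-built check: logits are [-x, x], so class 1 is predicted exactly when x > 0
toy = nn.Linear(1, 2)
with torch.no_grad():
    toy.weight.copy_(torch.tensor([[-1.0], [1.0]]))
    toy.bias.zero_()

toy_features = torch.tensor([[-2.0], [3.0], [1.0]])
toy_labels = torch.tensor([0, 1, 0])  # the last label is deliberately "wrong"

acc = accuracy(toy, toy_features, toy_labels)
print(round(acc, 4))  # 0.6667
```

In the script above, the call would be `accuracy(net, train_features, train_lables)`.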

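# Prediction (no gradients are needed at inference time)
with torch.no_grad():
    y = net(test_features)

def softmax(x):
    x_exp = torch.exp(x)
    partition = x_exp.sum(axis=1, keepdim=True)
    return x_exp / partition

# Convert the logits to class probabilities
y = softmax(y)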

# Take the index of the larger probability as the predicted class
Survived = y.argmax(dim=1).tolist()
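Two notes on this step, sketched below with made-up logits. PyTorch ships a numerically stable `torch.softmax`, so the hand-written version is not strictly needed. And because softmax preserves the ordering within each row, taking `argmax` of the raw logits gives the same classes as taking it of the probabilities.

```python
import torch

# Illustrative logits for three passengers, two classes each
logits = torch.tensor([[2.0, -1.0],
                       [0.3, 0.9],
                       [-5.0, -4.0]])

probs = torch.softmax(logits, dim=1)     # each row sums to 1

pred_from_probs = probs.argmax(dim=1)
pred_from_logits = logits.argmax(dim=1)  # same result, without computing the softmax

print(pred_from_probs.tolist())   # [0, 1, 1]
print(pred_from_logits.tolist())  # [0, 1, 1]
```

The probabilities are still useful if you want to inspect how confident the model is, but they are not required for producing the submission labels.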

# Save to a CSV file
survived = pd.DataFrame(columns=['Survived'], data=Survived)
submission = pd.concat((test_data['PassengerId'], survived), axis=1)
submission.to_csv(r'C:\Users\Administrator\Desktop\DeepLearningInfo\titanic\submission.csv', index=False)

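Summary: there is still a lot about the data processing that I don't understand. Most of it comes from forums or Kaggle, and I'll need to work through it gradually. I now have a clear picture of the overall softmax regression workflow, which I'm happy about. If anything in the code is wrong, please point it out. Thanks!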
