Titanic: Machine Learning from Disaster - Logistic Regression

Logistic regression constrains the output of a linear regression to values between 0 and 1.
One good way to think of logistic regression is that it takes the output of a linear regression and maps it to a probability between 0 and 1 using the logistic (sigmoid) function.
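
As a minimal sketch of that mapping, the snippet below passes a few made-up linear outputs through the logistic function. The input values are assumptions chosen for illustration and do not come from the Titanic data.

```python
import math

def sigmoid(z):
    """Map a real number z to a value strictly between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical linear-regression outputs (not taken from the Titanic data)
linear_outputs = [-4.0, -1.0, 0.0, 1.0, 4.0]
probabilities = [sigmoid(z) for z in linear_outputs]

for z, p in zip(linear_outputs, probabilities):
    print(f"{z:+.1f} -> {p:.3f}")
```

Large negative inputs land near 0, large positive inputs land near 1, and an input of 0 maps to exactly 0.5.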

from sklearn.linear_model import LogisticRegression
# sklearn.cross_validation was removed in scikit-learn 0.20; cross_val_score now lives in sklearn.model_selection
from sklearn.model_selection import cross_val_score

# Initialize our algorithm
alg = LogisticRegression(random_state=1)
# Compute the accuracy score for all the cross validation folds. (much simpler than what we did before!)
scores = cross_val_score(alg, titanic[predictors], titanic["Survived"], cv=3)
# Take the mean of the scores (because we have one for each fold)
print(scores.mean())

Output: 0.787878787879
  • This is better than the linear regression result from the previous post, but still not good enough.
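
To make the cross-validation step less of a black box, here is a rough sketch of what cross_val_score computes for a classifier with cv=3: split the data into three stratified folds, train on two, score accuracy on the third, and repeat. The data is a synthetic stand-in generated with make_classification, an assumption made so the snippet runs without the Titanic files.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in for titanic[predictors] and titanic["Survived"]
X, y = make_classification(n_samples=300, n_features=5, random_state=1)

# Three stratified folds, the splitter scikit-learn uses for classifiers
folds = StratifiedKFold(n_splits=3)

manual_scores = []
for train_idx, test_idx in folds.split(X, y):
    alg = LogisticRegression(random_state=1)
    alg.fit(X[train_idx], y[train_idx])                        # train on two folds
    manual_scores.append(alg.score(X[test_idx], y[test_idx]))  # accuracy on the held-out fold
manual_scores = np.array(manual_scores)

# The library call, using the same folds, should give the same per-fold scores
library_scores = cross_val_score(LogisticRegression(random_state=1), X, y, cv=folds)

print(manual_scores.mean(), library_scores.mean())
```

Each fold is held out exactly once, so the mean is an estimate of accuracy on data the model did not see during training.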

Processing the test set

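  • Everything above built the model on the train data. Next we need to make predictions on the test data, and only then can we submit the results for evaluation.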
import pandas

titanic_test = pandas.read_csv("titanic_test.csv")
# Fill missing ages with the median age of the training set
titanic_test["Age"] = titanic_test["Age"].fillna(titanic["Age"].median())
# Fare is only missing in the test set, so use the test set's own median
titanic_test["Fare"] = titanic_test["Fare"].fillna(titanic_test["Fare"].median())
# Encode Sex as numbers: male = 0, female = 1
titanic_test.loc[titanic_test["Sex"] == "male", "Sex"] = 0
titanic_test.loc[titanic_test["Sex"] == "female", "Sex"] = 1
# Fill missing ports of embarkation with "S", then encode: S = 0, C = 1, Q = 2
titanic_test["Embarked"] = titanic_test["Embarked"].fillna("S")

titanic_test.loc[titanic_test["Embarked"] == "S", "Embarked"] = 0
titanic_test.loc[titanic_test["Embarked"] == "C", "Embarked"] = 1
titanic_test.loc[titanic_test["Embarked"] == "Q", "Embarked"] = 2

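The test data is handled slightly differently from the training data:

  • Missing "Age" values in the test data are filled with the median from the training data.
  • "Fare" has no missing values in the training data but does in the test data, so it is filled with the median of the test data.
  • The other non-numeric attributes are converted the same way as in the training data.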
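
The difference between the two fill rules is easy to see on a toy example. The frames below are made up for illustration (they are not real passenger records), and the .map call is just a compact alternative to the .loc assignments used above.

```python
import pandas

# Tiny made-up frames standing in for the real train and test sets
train = pandas.DataFrame({
    "Age": [22.0, 38.0, 26.0, None],
    "Fare": [7.25, 71.28, 7.92, 8.05],
})
test = pandas.DataFrame({
    "Age": [34.0, None, 62.0],
    "Fare": [7.83, None, 9.69],
    "Sex": ["male", "female", "male"],
})

# Age: fill with the median from the TRAINING data (missing values are ignored)
train_age_median = train["Age"].median()
test["Age"] = test["Age"].fillna(train_age_median)

# Fare: fill with the median of the TEST data itself
test["Fare"] = test["Fare"].fillna(test["Fare"].median())

# Non-numeric columns: same encoding as for the training data
test["Sex"] = test["Sex"].map({"male": 0, "female": 1})

print(test)
```

Here the missing test age becomes 26.0, the median of the three known training ages, while the missing fare becomes the midpoint of the two known test fares.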

Generating a submission file

First, we have to train an algorithm on the training data. Then, we make predictions on the test set. Finally, we’ll generate a csv file with the predictions and passenger ids.

# Initialize the algorithm class
alg = LogisticRegression(random_state=1)

# Train the algorithm using all the training data
alg.fit(titanic[predictors], titanic["Survived"])

# Make predictions using the test set.
predictions = alg.predict(titanic_test[predictors])

# Create a new dataframe with only the columns Kaggle wants from the dataset.
submission = pandas.DataFrame({
        "PassengerId": titanic_test["PassengerId"],
        "Survived": predictions
    })
submission.to_csv("kaggle.csv", index=False)
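
Before uploading, it can be worth sanity-checking the file. The helper below is a hypothetical sketch, not part of the original walkthrough: check_submission and its expected_rows parameter are names invented here, and the checks only assume the two-column format built above.

```python
import io
import pandas

def check_submission(csv_text, expected_rows):
    """Return a list of problems found in a submission CSV (empty list if none)."""
    frame = pandas.read_csv(io.StringIO(csv_text))
    if list(frame.columns) != ["PassengerId", "Survived"]:
        return ["unexpected columns: %s" % list(frame.columns)]
    problems = []
    if len(frame) != expected_rows:
        problems.append("expected %d rows, found %d" % (expected_rows, len(frame)))
    if frame["PassengerId"].duplicated().any():
        problems.append("duplicate PassengerId values")
    if not frame["Survived"].isin([0, 1]).all():
        problems.append("Survived must contain only 0 or 1")
    return problems

# Made-up file contents for illustration
good = "PassengerId,Survived\n892,0\n893,1\n894,0\n"
bad = "PassengerId,Survived\n892,0\n892,2\n"

print(check_submission(good, expected_rows=3))
print(check_submission(bad, expected_rows=3))
```

With the real file, the call would look like check_submission(open("kaggle.csv").read(), expected_rows=len(titanic_test)).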

Submitting this file to Kaggle gives a score of about 0.75.
