Kaggle Competition Workflow

Understanding the Problem

Work out which features exist and what each one means (read the feature descriptions alongside head()).
Run describe() on the continuous variable to be predicted to get an intuitive feel for it.
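A minimal sketch of these two checks (assuming the House Prices training file is available as train.csv):

import pandas as pd

# Assumption: train.csv is the Kaggle House Prices training file
df_train = pd.read_csv('train.csv')
print(df_train.head())                    # what do the features look like?
print(df_train['SalePrice'].describe())   # summary statistics of the target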

Understanding the Data

**First,** guided by intuition, plot the numerical features and the categorical features separately and look at how each relates to the label.
For numerical features, draw scatter plots against the label to gauge how important each feature is, e.g.
[Figure 1: scatter plot of GrLivArea vs. SalePrice]

var = 'GrLivArea'
data = pd.concat([df_train['SalePrice'], df_train[var]], axis=1)
data.plot.scatter(x=var, y='SalePrice', ylim=(0,800000));

For categorical features, look at how the label changes across categories, e.g.
[Figure 2: box plot of SalePrice by OverallQual]

import matplotlib.pyplot as plt
import seaborn as sns

var = 'OverallQual'
data = pd.concat([df_train['SalePrice'], df_train[var]], axis=1)
f, ax = plt.subplots(figsize=(8, 6))
fig = sns.boxplot(x=var, y="SalePrice", data=data)
fig.axis(ymin=0, ymax=800000);

Next, examine the relationships between the variables with a correlation heatmap:
[Figure 3: correlation heatmap of all numerical features]

corrmat = df_train.corr()
f, ax = plt.subplots(figsize=(12, 9))
sns.heatmap(corrmat, vmax=.8, square=True);

Pick out the more important features and analyze them again with a heatmap restricted to those features:
[Figure 4: heatmap of the 10 features most correlated with SalePrice]

import numpy as np

k = 10  # number of variables for the heatmap
cols = corrmat.nlargest(k, 'SalePrice')['SalePrice'].index
cm = np.corrcoef(df_train[cols].values.T)
sns.set(font_scale=1.25)
hm = sns.heatmap(cm, cbar=True, annot=True, square=True, fmt='.2f',
                 annot_kws={'size': 10}, yticklabels=cols.values, xticklabels=cols.values)
plt.show()

Draw pairwise scatter plots for the features of particular interest; note that the resulting grid is symmetric, so each pairing appears twice, mirrored across the diagonal.
[Figure 5: pair plot of SalePrice and the selected features]

sns.set()
cols = ['SalePrice', 'OverallQual', 'GrLivArea', 'GarageCars', 'TotalBsmtSF', 'FullBath', 'YearBuilt']
sns.pairplot(df_train[cols], height=2.5)  # 'height' was named 'size' in seaborn < 0.9
plt.show()

Handling Missing Values

Get a basic picture of the missing values: see which features have them and how much is missing.
[Figure 6: table of missing-value totals and percentages per feature]

total = df_train.isnull().sum().sort_values(ascending=False)
percent = (df_train.isnull().sum()/df_train.isnull().count()).sort_values(ascending=False)
missing_data = pd.concat([total, percent], axis=1, keys=['Total', 'Percent'])
missing_data.head(20)

When more than fifteen percent of a feature is missing, don't try to fill it in; simply drop the feature column, e.g. 'PoolQC', 'MiscFeature' and 'FireplaceQu' (debatable).
When a feature with missing values is redundant with other features, don't impute it either, e.g. the 'GarageX' and 'BsmtX' columns, on the grounds that other features already carry the same information. (So the feature-importance picture from the earlier data analysis drives how each feature is handled.)
And when a feature is missing only a single value, delete that sample instead of imputing the feature.

# dealing with missing data
df_train = df_train.drop(missing_data[missing_data['Total'] > 1].index, axis=1)
df_train = df_train.drop(df_train.loc[df_train['Electrical'].isnull()].index)
df_train.isnull().sum().max()  # just checking that there's no missing data left...

Handling Outliers

Univariate analysis: standardize the label and display its ten lowest and ten highest values, to check whether the extremes show an obvious outlier pattern.
[Figure 7: ten lowest and ten highest standardized SalePrice values]

# standardizing data
from sklearn.preprocessing import StandardScaler

saleprice_scaled = StandardScaler().fit_transform(df_train['SalePrice'].values[:, np.newaxis])
low_range = saleprice_scaled[saleprice_scaled[:, 0].argsort()][:10]
high_range = saleprice_scaled[saleprice_scaled[:, 0].argsort()][-10:]
print('outer range (low) of the distribution:')
print(low_range)
print('\nouter range (high) of the distribution:')
print(high_range)

Bivariate analysis: the two points in Figure 1 are clearly outliers and can simply be deleted.

# deleting points
df_train.sort_values(by='GrLivArea', ascending=False)[:2]  # identify the two extreme points
df_train = df_train.drop(df_train[df_train['Id'] == 1299].index)
df_train = df_train.drop(df_train[df_train['Id'] == 524].index)

Core Analysis

Assumptions the data should satisfy: (1) normality; (2) homoscedasticity; (3) linear relationships among most features; (4) absence of correlated errors.
1. Examine the distribution of the data
[Figure 8: histogram and normal probability plot of SalePrice]

# histogram and normal probability plot
from scipy import stats
from scipy.stats import norm

sns.distplot(df_train['SalePrice'], fit=norm)  # distplot is deprecated in newer seaborn
fig = plt.figure()
res = stats.probplot(df_train['SalePrice'], plot=plt)
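To put numbers on what the plot shows, a quick check of the skewness and kurtosis:

print("Skewness: %f" % df_train['SalePrice'].skew())
print("Kurtosis: %f" % df_train['SalePrice'].kurt())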

The plot shows strong skewness (positive, i.e. a long right tail) together with a sharp peak. A log transform can handle this, bringing the distribution much closer to normal (this is just this author's technique; whether it generalizes is debatable):
[Figure 9: histogram and normal probability plot after the log transform]

#applying log transformation
df_train['SalePrice'] = np.log(df_train['SalePrice'])
#transformed histogram and normal probability plot
sns.distplot(df_train['SalePrice'], fit=norm);
fig = plt.figure()
res = stats.probplot(df_train['SalePrice'], plot=plt)

This transformation also goes a long way toward fixing the homoscedasticity problem: the scatter plots no longer fan out into a cone shape.
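One way to verify this visually (a sketch; it assumes GrLivArea is log-transformed in the same way as SalePrice):

# Assumption: GrLivArea is log-transformed to mirror the SalePrice treatment
df_train['GrLivArea'] = np.log(df_train['GrLivArea'])
plt.scatter(df_train['GrLivArea'], df_train['SalePrice'])  # should no longer look conic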

Encoding Variables
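The original leaves this section empty; presumably it refers to mapping categorical variables to numeric ones, as the Titanic example does later with dummy variables. A one-line sketch of that reading:

# Assumption: this section intended one-hot encoding of the categorical columns
df_train = pd.get_dummies(df_train)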

Training the Model

# Create the data set (note: from here on, df is the Titanic data with target 'Survived')
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X = df.drop(columns='Survived')
y = df['Survived']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.2, random_state=1)

# Fit logistic regression
logreg = LogisticRegression()
logreg.fit(X_train, y_train)

Evaluating Performance


k-fold cross-validation

# Model performance
from sklearn.model_selection import cross_val_score

scores = cross_val_score(logreg, X_train, y_train, cv=10)
print('CV accuracy: %.3f +/- %.3f' % (np.mean(scores), np.std(scores)))

Output: CV accuracy: 0.786 +/- 0.026


Evaluating with learning curves
# Plot learning curves
title = "Learning Curves (Logistic Regression)"
cv = 10
plot_learning_curve(logreg, title, X_train, y_train, ylim=(0.7, 1.01), cv=cv, n_jobs=1)
# plot_learning_curve is a helper defined earlier in the original notebook

[Figure 11: learning curves for logistic regression]
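The helper itself is not shown in the original; here is a minimal sketch built on sklearn.model_selection.learning_curve (the signature matches the call above; everything else is an assumption):

import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import learning_curve

def plot_learning_curve(estimator, title, X, y, ylim=None, cv=None, n_jobs=1):
    # Compute train/validation scores for increasing training-set sizes
    train_sizes, train_scores, test_scores = learning_curve(
        estimator, X, y, cv=cv, n_jobs=n_jobs, train_sizes=np.linspace(0.1, 1.0, 5))
    plt.figure()
    plt.title(title)
    if ylim is not None:
        plt.ylim(*ylim)
    plt.xlabel('Training examples')
    plt.ylabel('Score')
    plt.plot(train_sizes, train_scores.mean(axis=1), 'o-', label='Training score')
    plt.plot(train_sizes, test_scores.mean(axis=1), 'o-', label='Cross-validation score')
    plt.legend(loc='best')
    plt.grid()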

Model validation curve (hyperparameter selection)
# Plot validation curve
title = 'Validation Curve (Logistic Regression)'
param_name = 'C'
param_range = [0.001, 0.01, 0.1, 1.0, 10.0, 100.0]
cv = 10
plot_validation_curve(estimator=logreg, title=title, X=X_train, y=y_train, param_name=param_name,
                      ylim=(0.5, 1.01), param_range=param_range, cv=cv)
# plot_validation_curve is a helper defined earlier in the original notebook
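This helper is not shown either; a sketch based on sklearn.model_selection.validation_curve, matching the call above:

from sklearn.model_selection import validation_curve

def plot_validation_curve(estimator, title, X, y, param_name, param_range, ylim=None, cv=None):
    # Score the model across the hyperparameter range
    train_scores, test_scores = validation_curve(
        estimator, X, y, param_name=param_name, param_range=param_range, cv=cv)
    plt.figure()
    plt.title(title)
    if ylim is not None:
        plt.ylim(*ylim)
    plt.xlabel(param_name)
    plt.ylabel('Score')
    plt.semilogx(param_range, train_scores.mean(axis=1), 'o-', label='Training score')
    plt.semilogx(param_range, test_scores.mean(axis=1), 'o-', label='Cross-validation score')
    plt.legend(loc='best')
    plt.grid()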

Optimization

The analysis reveals bias, which motivates optimizing the learning process.
First, optimize the data itself (e.g. represent age in bins; see the pd.cut sketch below the figure).
For example, use the other attributes to impute the Age feature.
In the figure below, the colored bars show the age distribution and the black lines the spread.
[Figure 12: bar plot of mean Age per Title (black lines show the spread)]
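A sketch of the binning idea with pd.cut (the bin edges and labels here are illustrative assumptions, not the author's choices):

# Assumption: bin edges and labels chosen purely for illustration
df['AgeBin'] = pd.cut(df['Age'], bins=[0, 12, 18, 35, 60, 120],
                      labels=['child', 'teen', 'young adult', 'adult', 'senior'])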

Examining the relationship between two variables also lets you observe how they relate to the output variable.
# Bar plot of mean age per title
plt.figure(figsize=(15, 5))
sns.barplot(x=df['Title'], y=df_raw['Age'])
Impute the samples with missing Age values (by Title):
# Means per title
df_raw['Title'] = df['Title']  # To simplify data handling
means = df_raw.groupby('Title')['Age'].mean()

# Transform means into a dictionary for future mapping
map_means = means.to_dict()

# Impute ages based on titles
idx_nan_age = df.loc[np.isnan(df['Age'])].index
df.loc[idx_nan_age, 'Age'] = df['Title'].loc[idx_nan_age].map(map_means)

Examine one feature's effect on the other features:
[Figure 13: per-group means of the other features, grouped by Embarked]

# Compare with other variables
df.groupby('Embarked').mean(numeric_only=True)  # numeric_only avoids errors on non-numeric columns

Second, optimize the features.
Convert all non-numeric data to categorical, then create dummy variables:
[Figure 14: df.head() after dummy encoding]

# Transform object into categorical
df['Embarked'] = pd.Categorical(df['Embarked'])
df['Pclass'] = pd.Categorical(df['Pclass'])

# Transform categorical features into dummy variables
df = pd.get_dummies(df, drop_first=True)
df.head()

Use a Box-Cox transformation on the non-normal variables:

# Apply Box-Cox transformation (it requires strictly positive input, hence the +1 on Fare)
from scipy.stats import boxcox

X_train_transformed = X_train.copy()
X_train_transformed['Fare'] = boxcox(X_train_transformed['Fare'] + 1)[0]
X_test_transformed = X_test.copy()
X_test_transformed['Fare'] = boxcox(X_test_transformed['Fare'] + 1)[0]

Create more features through polynomial combinations.
First, rescale the existing features:

# Rescale data
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
X_train_transformed_scaled = scaler.fit_transform(X_train_transformed)
X_test_transformed_scaled = scaler.transform(X_test_transformed)

Then generate the new features with polynomial expansion:

# Get polynomial features
from sklearn.preprocessing import PolynomialFeatures

poly = PolynomialFeatures(degree=2).fit(X_train_transformed_scaled)  # fit on the same scaled data that gets transformed
X_train_poly = poly.transform(X_train_transformed_scaled)
X_test_poly = poly.transform(X_test_transformed_scaled)

Finally, select the features:

# Select features using chi-squared test
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2

## Get score using original model
logreg = LogisticRegression(C=1)
logreg.fit(X_train, y_train)
scores = cross_val_score(logreg, X_train, y_train, cv=10)
print('CV accuracy (original): %.3f +/- %.3f' % (np.mean(scores), np.std(scores)))
highest_score = np.mean(scores)
std = np.std(scores)  # initialize so the tie-breaking branch below never sees an undefined name

## Get score using models with feature selection
for i in range(1, X_train_poly.shape[1] + 1):
    # Select i features
    select = SelectKBest(score_func=chi2, k=i)
    select.fit(X_train_poly, y_train)
    X_train_poly_selected = select.transform(X_train_poly)

    # Model with i features selected
    logreg.fit(X_train_poly_selected, y_train)
    scores = cross_val_score(logreg, X_train_poly_selected, y_train, cv=10)
    print('CV accuracy (number of features = %i): %.3f +/- %.3f' % (i,
                                                                    np.mean(scores),
                                                                    np.std(scores)))

    # Save results if best score
    if np.mean(scores) > highest_score:
        highest_score = np.mean(scores)
        std = np.std(scores)
        k_features_highest_score = i
    elif np.mean(scores) == highest_score:
        if np.std(scores) < std:
            highest_score = np.mean(scores)
            std = np.std(scores)
            k_features_highest_score = i

# Print the number of features
print('Number of features when highest score: %i' % k_features_highest_score)

Third, optimize the algorithm itself.
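The original stops here. As one hedged illustration of what this step might look like, a grid search over the regularization strength C with GridSearchCV (the grid values are an assumption, mirroring the validation-curve range above):

from sklearn.model_selection import GridSearchCV

# Re-select the best feature subset found above, then tune C on it
select = SelectKBest(score_func=chi2, k=k_features_highest_score)
select.fit(X_train_poly, y_train)
X_train_poly_selected = select.transform(X_train_poly)

param_grid = {'C': [0.001, 0.01, 0.1, 1.0, 10.0, 100.0]}  # assumption: illustrative grid
grid = GridSearchCV(LogisticRegression(), param_grid, cv=10)
grid.fit(X_train_poly_selected, y_train)
print('Best C: %s, CV accuracy: %.3f' % (grid.best_params_['C'], grid.best_score_))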
