Figure out which features exist and what each one means (read the feature descriptions together with `head()`).
Run `describe()` on the continuous variable to be predicted, to build an intuitive feel for it.
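A minimal sketch of these first steps (the file name 'train.csv' and the use of the Kaggle House Prices data are assumptions):
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

df_train = pd.read_csv('train.csv') # hypothetical path to the training data
df_train.head() # preview the features, read together with the data description
df_train['SalePrice'].describe() # summary statistics of the continuous target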
**First,** guided by intuition, plot the numerical features and the categorical features separately to see how each relates to the label.
For numerical features, scatter plots of feature vs. label give a feel for how important each feature is. For example:
#scatter plot GrLivArea/SalePrice
var = 'GrLivArea'
data = pd.concat([df_train['SalePrice'], df_train[var]], axis=1)
data.plot.scatter(x=var, y='SalePrice', ylim=(0,800000));
For categorical features, a box plot works better; for example, 'OverallQual':
#box plot OverallQual/SalePrice
var = 'OverallQual'
data = pd.concat([df_train['SalePrice'], df_train[var]], axis=1)
f, ax = plt.subplots(figsize=(8, 6))
fig = sns.boxplot(x=var, y="SalePrice", data=data)
fig.axis(ymin=0, ymax=800000);
A more systematic view comes from the correlation matrix, drawn as a heatmap:
#correlation matrix heatmap
corrmat = df_train.corr(numeric_only=True) # numeric_only avoids errors on newer pandas
f, ax = plt.subplots(figsize=(12, 9))
sns.heatmap(corrmat, vmax=.8, square=True);
Then zoom in on the k variables most correlated with 'SalePrice':
#'SalePrice' correlation matrix (zoomed)
k = 10 #number of variables for heatmap
cols = corrmat.nlargest(k, 'SalePrice')['SalePrice'].index
cm = np.corrcoef(df_train[cols].values.T)
sns.set(font_scale=1.25)
hm = sns.heatmap(cm, cbar=True, annot=True, square=True, fmt='.2f', annot_kws={'size': 10}, yticklabels=cols.values, xticklabels=cols.values)
plt.show()
For the features of most interest, draw a scatter-plot matrix; several of the panels show clearly symmetric patterns.
sns.set()
cols = ['SalePrice', 'OverallQual', 'GrLivArea', 'GarageCars', 'TotalBsmtSF', 'FullBath', 'YearBuilt']
sns.pairplot(df_train[cols], height=2.5) # 'size' was renamed to 'height' in newer seaborn
plt.show();
Next, quantify the missing data:
#missing data
total = df_train.isnull().sum().sort_values(ascending=False)
percent = (df_train.isnull().sum()/df_train.isnull().count()).sort_values(ascending=False)
missing_data = pd.concat([total, percent], axis=1, keys=['Total', 'Percent'])
missing_data.head(20)
When more than 15% of a feature's values are missing, don't try to impute; just drop the whole column, e.g. 'PoolQC', 'MiscFeature' and 'FireplaceQu' (debatable).
For the other features with missing values, such as the 'GarageX' and 'BsmtX' groups, we also skip imputation, on the grounds that their information is already captured by other features. (So the decision of how to treat each feature still comes back to the impressions of feature importance formed during the earlier analysis.)
Finally, when a feature is missing in only a single sample (here 'Electrical'), it is simpler to drop that sample than to impute the feature.
#dealing with missing data
df_train = df_train.drop((missing_data[missing_data['Total'] > 1]).index, axis=1)
df_train = df_train.drop(df_train.loc[df_train['Electrical'].isnull()].index)
df_train.isnull().sum().max()
#just checking that there's no missing data missing...
Univariate analysis: standardize the label and look at the ten lowest and ten highest values, to check whether the extremes drift clearly away from the rest of the distribution.
#standardizing data
from sklearn.preprocessing import StandardScaler
saleprice_scaled = StandardScaler().fit_transform(df_train['SalePrice'].values[:, np.newaxis])
low_range = saleprice_scaled[saleprice_scaled[:,0].argsort()][:10]
high_range = saleprice_scaled[saleprice_scaled[:,0].argsort()][-10:]
print('outer range (low) of the distribution:')
print(low_range)
print('\nouter range (high) of the distribution:')
print(high_range)
Bivariate analysis: in the first scatter plot (GrLivArea vs. SalePrice), the two points with the largest GrLivArea clearly fall off the trend, so we delete them.
#deleting points
df_train.sort_values(by = 'GrLivArea', ascending = False)[:2] # find the Ids of the two outliers
df_train = df_train.drop(df_train[df_train['Id'] == 1299].index)
df_train = df_train.drop(df_train[df_train['Id'] == 524].index)
Assumptions the data should satisfy: (1) normality; (2) homoscedasticity; (3) linearity, i.e. most features relate to the label roughly linearly; (4) absence of correlated errors.
1. Inspect the distribution of the data:
#histogram and normal probability plot
from scipy import stats
from scipy.stats import norm
sns.distplot(df_train['SalePrice'], fit=norm);
fig = plt.figure()
res = stats.probplot(df_train['SalePrice'], plot=plt)
The plots show that the skewness is severe (positive, i.e. a long right tail) and that there is marked peakedness as well. In this case a log transform can be applied to pull the variable toward a normal distribution (this is just this author's trick; whether it generalizes is open to question):
#applying log transformation
df_train['SalePrice'] = np.log(df_train['SalePrice'])
#transformed histogram and normal probability plot
sns.distplot(df_train['SalePrice'], fit=norm);
fig = plt.figure()
res = stats.probplot(df_train['SalePrice'], plot=plt)
This transformation also largely resolves the heteroscedasticity problem: the resulting scatter plots no longer have the cone shape.
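One way to verify this visually is to redraw the scatter plot after the transformation; a minimal sketch, assuming 'GrLivArea' on the x-axis as in the earlier plot:
# After the log transform, the cone (fan) shape should be gone
plt.scatter(df_train['GrLivArea'], df_train['SalePrice'])
plt.xlabel('GrLivArea')
plt.ylabel('log(SalePrice)')
plt.show()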
The remaining notes come from a second example, the Titanic dataset, where the label is 'Survived'.
# Create data set to train data imputation methods
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.linear_model import LogisticRegression
X = df.drop('Survived', axis=1) # all columns except the label
y = df['Survived']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.2, random_state=1)
# Fit logistic regression
logreg = LogisticRegression()
logreg.fit(X_train, y_train)
K-fold cross-validation:
# Model performance
scores = cross_val_score(logreg, X_train, y_train, cv=10)
print('CV accuracy: %.3f +/- %.3f' % (np.mean(scores), np.std(scores)))
The result: CV accuracy: 0.786 +/- 0.026.
Evaluation with learning curves:
# Plot learning curves
title = "Learning Curves (Logistic Regression)"
cv = 10
plot_learning_curve(logreg, title, X_train, y_train, ylim=(0.7, 1.01), cv=cv, n_jobs=1);
# plot_learning_curve was defined earlier in the notebook
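The notes never show `plot_learning_curve` itself; here is a minimal sketch of what such a helper could look like, built on sklearn.model_selection.learning_curve (the author's actual implementation may differ):
from sklearn.model_selection import learning_curve

def plot_learning_curve(estimator, title, X, y, ylim=None, cv=None, n_jobs=1):
    # Compute train/validation scores at increasing training-set sizes
    train_sizes, train_scores, test_scores = learning_curve(
        estimator, X, y, cv=cv, n_jobs=n_jobs,
        train_sizes=np.linspace(0.1, 1.0, 5))
    plt.figure()
    plt.title(title)
    if ylim is not None:
        plt.ylim(*ylim)
    plt.xlabel('Training examples')
    plt.ylabel('Score')
    plt.plot(train_sizes, train_scores.mean(axis=1), 'o-', label='Training score')
    plt.plot(train_sizes, test_scores.mean(axis=1), 'o-', label='Cross-validation score')
    plt.legend(loc='best')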
Model validation curve (for hyperparameter selection):
# Plot validation curve
title = 'Validation Curve (Logistic Regression)'
param_name = 'C'
param_range = [0.001, 0.01, 0.1, 1.0, 10.0, 100.0]
cv = 10
plot_validation_curve(estimator=logreg, title=title, X=X_train, y=y_train, param_name=param_name,
ylim=(0.5, 1.01), param_range=param_range);
# plot_validation_curve was defined earlier in the notebook
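Likewise, a sketch of what `plot_validation_curve` might look like on top of sklearn.model_selection.validation_curve (again an assumption about the original helper):
from sklearn.model_selection import validation_curve

def plot_validation_curve(estimator, title, X, y, param_name, param_range, ylim=None, cv=None):
    # Score the model while varying a single hyperparameter
    train_scores, test_scores = validation_curve(
        estimator, X, y, param_name=param_name, param_range=param_range, cv=cv)
    plt.figure()
    plt.title(title)
    if ylim is not None:
        plt.ylim(*ylim)
    plt.xlabel(param_name)
    plt.ylabel('Score')
    plt.semilogx(param_range, train_scores.mean(axis=1), 'o-', label='Training score')
    plt.semilogx(param_range, test_scores.mean(axis=1), 'o-', label='Cross-validation score')
    plt.legend(loc='best')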
The analysis shows that noticeable bias remains, which motivates several optimizations of the learning setup.
First, improve the data itself, e.g. by representing age in bins (a sketch follows).
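For the binning idea, a minimal sketch with pd.cut (the bin edges and the 'AgeBin' name are arbitrary assumptions):
# Hypothetical example: discretize Age into coarse ranges
df['AgeBin'] = pd.cut(df['Age'], bins=[0, 12, 18, 35, 60, 100],
                      labels=['child', 'teen', 'young adult', 'adult', 'senior'])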
The Age feature, in particular, can be imputed from other attributes.
In the bar plot below, the colored bars show the age distribution per title and the black lines are the error bars.
Examining the relationship between two variables this way also reveals how they jointly relate to the output variable.
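The 'Title' column used in the plot below is assumed to have been extracted from the passengers' names earlier in the notebook; a typical sketch of that step:
# Hypothetical extraction of the honorific from names like 'Braund, Mr. Owen Harris'
df['Title'] = df_raw['Name'].str.extract(r' ([A-Za-z]+)\.', expand=False)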
# Plot bar plot (titles, age and sex)
plt.figure(figsize=(15,5))
sns.barplot(x=df['Title'], y=df_raw['Age']);
Imputing the missing Age values (by Title):
# Means per title
df_raw['Title'] = df['Title'] # To simplify data handling
means = df_raw.groupby('Title')['Age'].mean()
# Transform means into a dictionary for future mapping
map_means = means.to_dict()
# Impute ages based on titles
idx_nan_age = df.loc[np.isnan(df['Age'])].index
df.loc[idx_nan_age, 'Age'] = df.loc[idx_nan_age, 'Title'].map(map_means)
# Compare with other variables
df.groupby(['Embarked']).mean()
Second, improve the features.
Convert all non-numeric data to categorical, then expand the categoricals into dummy variables:
# Transform object into categorical
df['Embarked'] = pd.Categorical(df['Embarked'])
df['Pclass'] = pd.Categorical(df['Pclass'])
# Transform categorical features into dummy variables
df = pd.get_dummies(df, drop_first=True)
df.head()
Use the Box-Cox transformation to normalize skewed variables such as 'Fare':
# Apply Box-Cox transformation
from scipy.stats import boxcox
X_train_transformed = X_train.copy()
X_train_transformed['Fare'] = boxcox(X_train_transformed['Fare'] + 1)[0]
X_test_transformed = X_test.copy()
# Note: strictly, the lambda fitted on the training set should be reused on the test set
X_test_transformed['Fare'] = boxcox(X_test_transformed['Fare'] + 1)[0]
Create more features through polynomial combinations.
First, rescale the existing features:
# Rescale data
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
X_train_transformed_scaled = scaler.fit_transform(X_train_transformed)
X_test_transformed_scaled = scaler.transform(X_test_transformed)
Then generate the new polynomial features:
# Get polynomial features
from sklearn.preprocessing import PolynomialFeatures
poly = PolynomialFeatures(degree=2).fit(X_train_transformed_scaled) # fit on the same scaled data that is transformed below
X_train_poly = poly.transform(X_train_transformed_scaled)
X_test_poly = poly.transform(X_test_transformed_scaled)
Finally, select among the features:
# Select features using chi-squared test
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2
## Get score using original model
logreg = LogisticRegression(C=1)
logreg.fit(X_train, y_train)
scores = cross_val_score(logreg, X_train, y_train, cv=10)
print('CV accuracy (original): %.3f +/- %.3f' % (np.mean(scores), np.std(scores)))
highest_score = np.mean(scores)
## Get score using models with feature selection
for i in range(1, X_train_poly.shape[1]+1, 1):
    # Select i features
    select = SelectKBest(score_func=chi2, k=i)
    select.fit(X_train_poly, y_train)
    X_train_poly_selected = select.transform(X_train_poly)
    # Model with i features selected
    logreg.fit(X_train_poly_selected, y_train)
    scores = cross_val_score(logreg, X_train_poly_selected, y_train, cv=10)
    print('CV accuracy (number of features = %i): %.3f +/- %.3f' % (i,
                                                                    np.mean(scores),
                                                                    np.std(scores)))
    # Save results if best score
    if np.mean(scores) > highest_score:
        highest_score = np.mean(scores)
        std = np.std(scores)
        k_features_highest_score = i
    elif np.mean(scores) == highest_score:
        if np.std(scores) < std:
            highest_score = np.mean(scores)
            std = np.std(scores)
            k_features_highest_score = i
# Print the number of features
print('Number of features when highest score: %i' % k_features_highest_score)
Third, optimize the algorithm itself.
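The notes stop here. As one illustrative possibility (not necessarily this author's approach), the regularization strength C of the logistic regression could be tuned with a grid search:
# Hypothetical sketch: tune C with GridSearchCV on the training features
from sklearn.model_selection import GridSearchCV
param_grid = {'C': [0.001, 0.01, 0.1, 1.0, 10.0, 100.0]}
grid = GridSearchCV(LogisticRegression(), param_grid, cv=10)
grid.fit(X_train, y_train)
print(grid.best_params_, grid.best_score_)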