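Original notebook: https://www.kaggle.com/mjbahmani/a-comprehensive-ml-workflow-with-python/notebook
The original notebook uses the Titanic dataset to walk through common machine learning algorithms. The task is to predict, from the available variables, whether a passenger survived (It is your job to predict if a passenger survived the sinking of the Titanic or not.). The dataset contains 12 variables: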
- Age: age
- Sibsp: family relations
  a. siblings
  b. spouse (husband or wife)
- Parch: family relations
  a. parents
  b. children
  c. children travelling with a nanny (some children travelled only with a nanny, therefore parch=0 for them)
- Pclass: a proxy for socio-economic status (SES)
- Embarked (nominal datatype, i.e. a categorical variable with no natural order)
- Name (nominal datatype)
- Sex
- Ticket (has no impact on the outcome variable, so it is excluded from the analysis)
- Cabin: cabin number
- Fare: ticket fare
- PassengerID: passenger ID (has no impact on the outcome variable, so it is excluded from the analysis)
- Survival: whether the passenger survived
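Format of the original dataset (Dataset 1)
After a series of processing steps, the dataset becomes (Dataset 2)
As you can see, every variable in Dataset 2 is now categorical (discrete).
The code used in this step: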
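import pandas as pd  # df_train and df_test are assumed to be loaded already

X = df_train.iloc[:, :-1].values  # features: every column except the last
y = df_train.iloc[:, -1].values   # target: the last column
X
y

def simplify_ages(df):
    """Bin Age into labeled groups; missing ages fall into 'Unknown'."""
    df.Age = df.Age.fillna(-0.5)
    bins = (-1, 0, 5, 12, 18, 25, 35, 60, 120)
    group_names = ['Unknown', 'Baby', 'Child', 'Teenager', 'Student', 'Young Adult', 'Adult', 'Senior']
    categories = pd.cut(df.Age, bins, labels=group_names)
    df.Age = categories
    return df

def simplify_cabins(df):
    """Keep only the first letter of Cabin; missing cabins become 'N'."""
    df.Cabin = df.Cabin.fillna('N')
    df.Cabin = df.Cabin.apply(lambda x: x[0])
    return df

def simplify_fares(df):
    """Bin Fare into rough quartile groups; missing (and zero) fares fall into 'Unknown'."""
    df.Fare = df.Fare.fillna(-0.5)
    bins = (-1, 0, 8, 15, 31, 1000)
    group_names = ['Unknown', '1_quartile', '2_quartile', '3_quartile', '4_quartile']
    categories = pd.cut(df.Fare, bins, labels=group_names)
    df.Fare = categories
    return df

def format_name(df):
    """Take the first and second space-separated tokens of Name, e.g. 'Braund,' and 'Mr.'."""
    df['Lname'] = df.Name.apply(lambda x: x.split(' ')[0])
    df['NamePrefix'] = df.Name.apply(lambda x: x.split(' ')[1])
    return df

def drop_features(df):
    """Drop the columns that are not used further."""
    return df.drop(['Ticket', 'Name', 'Embarked'], axis=1)

def transform_features(df):
    """Apply all simplification steps in order."""
    df = simplify_ages(df)
    df = simplify_cabins(df)
    df = simplify_fares(df)
    df = format_name(df)
    df = drop_features(df)
    return df

df_train = transform_features(df_train)
df_test = transform_features(df_test)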
New functions encountered in this step:
pd.cut()
df.Name.apply(lambda x: x.split(' ')[0])
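To see what these two calls do in isolation, here is a minimal sketch on a made-up two-row frame (the rows are illustrative, not read from the Titanic files). It reuses the age bins and labels from `simplify_ages` and the same `split(' ')` lambdas as `format_name`.

```python
import pandas as pd

# Toy data, made up for illustration only.
df = pd.DataFrame({
    "Name": ["Braund, Mr. Owen Harris", "Heikkinen, Miss. Laina"],
    "Age": [22.0, None],
})

# pd.cut assigns each value to an interval (left, right] and returns
# the label attached to that interval.
bins = (-1, 0, 5, 12, 18, 25, 35, 60, 120)
labels = ["Unknown", "Baby", "Child", "Teenager", "Student",
          "Young Adult", "Adult", "Senior"]
age_groups = pd.cut(df["Age"].fillna(-0.5), bins, labels=labels)
print(age_groups.tolist())  # ['Student', 'Unknown']

# Series.apply calls the function once per element. Splitting on a space
# keeps punctuation, so the first token retains its trailing comma.
last_names = df["Name"].apply(lambda x: x.split(" ")[0])
prefixes = df["Name"].apply(lambda x: x.split(" ")[1])
print(last_names.tolist())  # ['Braund,', 'Heikkinen,']
print(prefixes.tolist())    # ['Mr.', 'Miss.']
```

The missing age is first filled with -0.5, which lands in the (-1, 0] bin and therefore gets the 'Unknown' label.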
How Dataset 1 is turned into Dataset 2 is not today's focus; the focus is how to carry out the rest of the analysis starting from Dataset 2.
Feature Encoding
Original: In machine learning projects, one important part is feature engineering. It is very common to see categorical features in a dataset. However, our machine learning algorithm can only read numerical values. It is essential to encode categorical features into numerical values.
In short: categorical variables have to be converted into numbers before most algorithms can use them.
How the original notebook handles it:
1. Encode labels with value between 0 and n_classes-1.
2. LabelEncoder can be used to normalize labels.
3. It can also be used to transform non-numerical labels (as long as they are hashable and comparable) to numerical labels.
I do not fully understand these three sentences yet, so let's go straight to what is done in practice.
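The three sentences above can be made concrete with a small sketch of scikit-learn's `LabelEncoder` on a made-up list of labels (the list is illustrative, not taken from the dataset): each distinct label is mapped to an integer from 0 to n_classes-1, following the sorted order of the labels.

```python
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
le.fit(["male", "female", "female", "male"])

# The distinct labels, sorted; the position of a label is its code.
print(le.classes_.tolist())  # ['female', 'male']

# Non-numerical labels become integers in the range 0 .. n_classes-1.
codes = le.transform(["male", "female", "female"])
print(codes.tolist())  # [1, 0, 0]

# The mapping can be reversed.
decoded = le.inverse_transform([0, 1])
print(decoded.tolist())  # ['female', 'male']
```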
Dataset before encoding (Dataset 2)
Dataset after encoding (Dataset 3)
The code used for the conversion:
import pandas as pd
from sklearn import preprocessing

def encode_features(df_train, df_test):
    """Label-encode the categorical columns, fitting each encoder on train and test together."""
    features = ['Fare', 'Cabin', 'Age', 'Sex', 'Lname', 'NamePrefix']
    df_combined = pd.concat([df_train[features], df_test[features]])
    for feature in features:
        le = preprocessing.LabelEncoder()
        le = le.fit(df_combined[feature])
        df_train[feature] = le.transform(df_train[feature])
        df_test[feature] = le.transform(df_test[feature])
    return df_train, df_test

df_train, df_test = encode_features(df_train, df_test)
df_train.head()
df_test.head()
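The notebook does not say why each encoder is fitted on the concatenated train and test columns. One plausible reason, sketched below on made-up cabin letters, is that `LabelEncoder.transform` raises a `ValueError` for any label it did not see during `fit`, so a value that occurs only in the test set would break an encoder fitted on the training set alone.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Made-up cabin letters; 'T' appears only in the test part.
train = pd.Series(["C", "N", "E"])
test = pd.Series(["N", "T"])

# Fitted on train only: the unseen label 'T' cannot be transformed.
le = LabelEncoder().fit(train)
try:
    le.transform(test)
    unseen_failed = False
except ValueError:
    unseen_failed = True
print(unseen_failed)  # True

# Fitted on train and test together: every label has a code.
le_all = LabelEncoder().fit(pd.concat([train, test]))
test_codes = le_all.transform(test).tolist()
print(test_codes)  # [2, 3]
```

With the combined fit the sorted classes are C, E, N, T, which is why 'N' maps to 2 and 'T' to 3.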
### New functions encountered
pd.concat()
help(pd.concat)
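As a quick illustration of `pd.concat`, here is a minimal sketch on two made-up frames (not the Titanic data). By default the rows are stacked and each frame keeps its own index; `ignore_index=True` renumbers the result.

```python
import pandas as pd

# Two small frames with the same columns, made up for illustration.
a = pd.DataFrame({"Sex": ["male", "female"], "Cabin": ["N", "C"]})
b = pd.DataFrame({"Sex": ["female"], "Cabin": ["E"]})

# Default: stack rows (axis=0) and keep the original index values.
stacked = pd.concat([a, b])
print(stacked.shape)           # (3, 2)
print(stacked.index.tolist())  # [0, 1, 0]

# ignore_index=True gives the result a fresh 0..n-1 index.
renumbered = pd.concat([a, b], ignore_index=True)
print(renumbered.index.tolist())  # [0, 1, 2]
```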
https://www.jianshu.com/p/2e97f2bd75f8
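A small part of this article also covers how to handle categorical variables; find time to read it together with the original it is based on!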