Handling Categorical Variables in Mathematical Modeling: Notes, Part 2

Original source: https://www.kaggle.com/mjbahmani/a-comprehensive-ml-workflow-with-python/notebook

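The original notebook uses the Titanic dataset to walk through commonly used machine learning algorithms. The task for this dataset is to predict, from the related variables, whether a passenger survived ("It is your job to predict if a passenger survived the sinking of the Titanic or not."). The dataset contains 12 variables in total: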


[Figure 1: image from the original notebook, https://www.kaggle.com/mjbahmani/a-comprehensive-ml-workflow-with-python/notebook]
  • Age: passenger age
  • SibSp: family relations
    a. Siblings (brothers and sisters)
    b. Spouse (husband or wife)
  • Parch: family relations
    a. Parents
    b. Children
    Note: some children travelled only with a nanny, therefore parch=0 for them
  • Pclass: a proxy for socio-economic status (SES)
  • Embarked: port of embarkation (nominal datatype; I was not sure what this meant at first, it refers to categories that have no natural order)
  • Name (nominal datatype)
  • Sex
  • Ticket: ticket number (has no impact on the outcome variable, so it is excluded from the analysis)
  • Cabin: cabin number
  • Fare: ticket fare
  • PassengerId: passenger ID (has no impact on the outcome variable, so it is excluded from the analysis)
  • Survived: whether the passenger survived
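The term "nominal datatype" in the list above can be illustrated with a small sketch. It assumes pandas is available and uses hand-typed toy values (the three port codes S, C, Q and the three class numbers), not rows loaded from the real file. A nominal variable has categories without order, while an ordinal one carries an order that comparisons can use.

```python
import pandas as pd

# Nominal: port codes are just names, no port is "greater" than another.
embarked = pd.Categorical(["S", "C", "Q", "S"],
                          categories=["C", "Q", "S"], ordered=False)

# Ordinal: the class numbers 1, 2, 3 have a meaningful order.
pclass = pd.Categorical([3, 1, 2, 3], categories=[1, 2, 3], ordered=True)

print(embarked.ordered)          # False
print(pclass.ordered)            # True
print(embarked.codes.tolist())   # [2, 0, 1, 2]  integer code of each value
print(pclass.max())              # 3  (only defined because pclass is ordered)
```

The integer codes printed for `embarked` are only identifiers; their numeric size says nothing about the ports themselves.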

Format of the original dataset (Dataset 1)


[Figure 2: Dataset 1]

After a series of processing steps, the dataset becomes (Dataset 2)


[Figure 3: Dataset 2]

As you can see, every variable in Dataset 2 has become a discrete (categorical) variable.
The code used in this step:

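import pandas as pd

# Assumes df_train already holds the Titanic training set as a DataFrame.
# In the raw train.csv the last column is Embarked, not the label, so select Survived by name.
X = df_train.drop(columns=['Survived']).values
y = df_train['Survived'].values
X  # notebook cell output: feature matrix
y  # notebook cell output: labels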

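def simplify_ages(df):
    # Mark missing ages with -0.5 so they fall into the 'Unknown' bin (-1, 0].
    df.Age = df.Age.fillna(-0.5)
    bins = (-1,0,5,12,18,25,35,60,120)
    group_names = ['Unknown','Baby','Child','Teenager','Student','Young Adult','Adult','Senior']
    # pd.cut maps each age to the label of the interval it falls in.
    categories = pd.cut(df.Age,bins,labels=group_names)
    df.Age = categories
    return df

def simplify_cabins(df):
    # Missing cabins become 'N'; otherwise keep only the first character of the cabin string.
    df.Cabin = df.Cabin.fillna('N')
    df.Cabin = df.Cabin.apply(lambda x: x[0])
    return df

def simplify_fares(df):
    # Same idea as simplify_ages: missing fares go to 'Unknown', the rest into four fare bands.
    df.Fare = df.Fare.fillna(-0.5)
    bins = (-1,0,8,15,31,1000)
    group_names = ['Unknown','1_quartile','2_quartile','3_quartile','4_quartile']
    categories = pd.cut(df.Fare,bins,labels=group_names)
    df.Fare = categories
    return df

def format_name(df):
    # Split Name on spaces: token 0 goes to Lname, token 1 goes to NamePrefix.
    df['Lname'] = df.Name.apply(lambda x: x.split(' ')[0])
    df['NamePrefix'] = df.Name.apply(lambda x: x.split(' ')[1])
    return df

def drop_features(df):
    # Drop the columns that are not used in the rest of the analysis.
    return df.drop(['Ticket','Name','Embarked'],axis=1)

def transform_features(df):
    # Apply all simplification steps in order.
    df = simplify_ages(df)
    df = simplify_cabins(df)
    df = simplify_fares(df)
    df = format_name(df)
    df = drop_features(df)
    return df

# Apply the same transformation to the training and test sets.
df_train = transform_features(df_train)
df_test = transform_features(df_test)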

New functions encountered in this step

pd.cut()
df.Name.apply(lambda x: x.split(' ')[0])
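To see what these two functions do, here is a minimal sketch on a toy DataFrame. The three rows are invented for illustration and only imitate the "Surname, Title. Given name" layout; the bins and labels are copied from `simplify_ages` above.

```python
import pandas as pd

# Toy data, invented for illustration (not rows from the real dataset).
toy = pd.DataFrame({
    "Name": ["Smith, Mr. John", "Doe, Miss. Jane", "Roe, Master. Tim"],
    "Age": [22.0, None, 4.0],
})

# pd.cut assigns each value to the interval it falls in: (-1, 0], (0, 5], ...
ages = toy["Age"].fillna(-0.5)          # missing age -> -0.5 -> 'Unknown' bin
bins = (-1, 0, 5, 12, 18, 25, 35, 60, 120)
labels = ['Unknown', 'Baby', 'Child', 'Teenager', 'Student',
          'Young Adult', 'Adult', 'Senior']
age_groups = pd.cut(ages, bins, labels=labels)
print(age_groups.tolist())              # ['Student', 'Unknown', 'Baby']

# apply runs the lambda on every element; split(' ') breaks the name on spaces.
lname = toy["Name"].apply(lambda x: x.split(' ')[0])
prefix = toy["Name"].apply(lambda x: x.split(' ')[1])
print(lname.tolist())                   # ['Smith,', 'Doe,', 'Roe,']
print(prefix.tolist())                  # ['Mr.', 'Miss.', 'Master.']
```

Note that the first token keeps its trailing comma, because the split is on spaces only.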

How Dataset 1 is reshaped into Dataset 2 is not the focus today; the focus is how to carry out the subsequent analysis starting from Dataset 2.

Feature Encoding

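Original text: In machine learning projects, one important part is feature engineering. It is very common to see categorical features in a dataset. However, our machine learning algorithm can only read numerical values. It is essential to encode categorical features into numerical values.
In short: feature engineering is a very important part of a machine learning project. Datasets often contain categorical variables, but commonly used machine learning algorithms only accept numerical input, so converting categorical variables into numerical ones is an essential step.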

How the original handles it

1、Encode labels with value between 0 and n_classes-1
2、LabelEncoder can be used to normalize labels.
3、It can also be used to transform non-numerical labels (as long as they are hashable and comparable) to numerical labels.
I do not fully understand these three sentences yet, so let's look directly at what is done in practice.
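The three sentences above can be made concrete with a small sketch. It assumes scikit-learn is installed and uses hand-typed toy labels rather than the Titanic columns. `LabelEncoder` sorts the distinct labels and numbers them from 0 to n_classes-1; this works both for numbers (the "normalize labels" case) and for strings.

```python
from sklearn.preprocessing import LabelEncoder

# Non-numerical labels: the distinct values are sorted, then numbered 0..n_classes-1.
le = LabelEncoder()
le.fit(["male", "female", "female", "male"])
print(list(le.classes_))                    # ['female', 'male']
codes = le.transform(["male", "female"])
print(codes.tolist())                       # [1, 0]
back = le.inverse_transform([0, 1])
print(list(back))                           # ['female', 'male']

# "Normalizing" numerical labels: arbitrary values 1, 2, 6 become 0, 1, 2.
le_num = LabelEncoder()
le_num.fit([1, 2, 2, 6])
num_codes = le_num.transform([1, 1, 2, 6])
print(num_codes.tolist())                   # [0, 0, 1, 2]
```

The codes are identifiers assigned in sorted order; they do not mean that one category is larger than another.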
Dataset before encoding (Dataset 2)


[Figure 4: Dataset 2]

Dataset after encoding (Dataset 3)


[Figure 5: Dataset 3]

Code used for the conversion

import pandas as pd
from sklearn import preprocessing


def encode_features(df_train,df_test):
    # Columns that hold categorical values after transform_features.
    features = ['Fare','Cabin','Age','Sex','Lname','NamePrefix']
    # Stack train and test so each encoder sees every label from both sets.
    df_combined = pd.concat([df_train[features],df_test[features]])

    for feature in features:
        # One encoder per column: fit on the combined values, then encode each set.
        le = preprocessing.LabelEncoder()
        le = le.fit(df_combined[feature])
        df_train[feature] = le.transform(df_train[feature])
        df_test[feature] = le.transform(df_test[feature])

    return df_train, df_test

df_train, df_test = encode_features(df_train,df_test)
df_train.head()
df_test.head()
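One likely reason the encoder is fitted on the combined data is that `LabelEncoder.transform` raises a `ValueError` for any label it did not see during `fit`. The sketch below shows this on two toy frames; the cabin letters are made up for illustration.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy frames, invented for illustration: 'T' appears only in the test set.
train = pd.DataFrame({"Cabin": ["C", "N", "E"]})
test = pd.DataFrame({"Cabin": ["N", "T"]})

# Fitting on train alone: the unseen label 'T' cannot be transformed.
le_train_only = LabelEncoder().fit(train["Cabin"])
try:
    le_train_only.transform(test["Cabin"])
    unseen_error = None
except ValueError as exc:
    unseen_error = exc
print(type(unseen_error).__name__)      # ValueError

# Fitting on train and test together: every label gets a code.
combined = pd.concat([train["Cabin"], test["Cabin"]])
le_all = LabelEncoder().fit(combined)
test_codes = le_all.transform(test["Cabin"])
print(list(le_all.classes_))            # ['C', 'E', 'N', 'T']
print(test_codes.tolist())              # [2, 3]
```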

### Newly encountered functions
pd.concat()
help(pd.concat)
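As a quick complement to `help(pd.concat)`, here is a minimal sketch of its default behaviour on two toy frames (values invented for illustration). By default it stacks rows and keeps each frame's original index; `ignore_index=True` renumbers the rows instead.

```python
import pandas as pd

# Two toy frames with the same columns, invented for illustration.
a = pd.DataFrame({"Sex": ["male", "female"], "Cabin": ["N", "C"]})
b = pd.DataFrame({"Sex": ["female"], "Cabin": ["E"]})

# Default (axis=0): rows of b are appended below rows of a, original indexes kept.
stacked = pd.concat([a, b])
print(stacked.shape)                    # (3, 2)
print(stacked.index.tolist())           # [0, 1, 0]

# ignore_index=True builds a fresh 0..n-1 index.
renumbered = pd.concat([a, b], ignore_index=True)
print(renumbered.index.tolist())        # [0, 1, 2]
```

For fitting a `LabelEncoder`, the duplicated index labels in `stacked` do not matter, since only the column values are used.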

https://www.jianshu.com/p/2e97f2bd75f8
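A small part of this article also deals with handling categorical variables; I should find time to read it and the original it is based on.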
