Hands-On Data Preprocessing in Python (Handling Missing Values and Custom Functions)

This walkthrough preprocesses the Titanic passenger data. Start by reading in titanic_train.csv (message me privately if you need the file).

import numpy as np
import pandas as pd
titanic_survival = pd.read_csv("titanic_train.csv")
titanic_survival.head()
Out:
PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Parch	Ticket	Fare	Cabin	Embarked
0	1	0	3	Braund, Mr. Owen Harris	male	22.0	1	0	A/5 21171	7.2500	NaN	S
1	2	1	1	Cumings, Mrs. John Bradley (Florence Briggs Th...	female	38.0	1	0	PC 17599	71.2833	C85	C
2	3	1	3	Heikkinen, Miss. Laina	female	26.0	0	0	STON/O2. 3101282	7.9250	NaN	S
3	4	1	1	Futrelle, Mrs. Jacques Heath (Lily May Peel)	female	35.0	1	0	113803	53.1000	C123	S
4	5	0	3	Allen, Mr. William Henry	male	35.0	0	0	373450	8.0500	NaN	S

I. Handling Missing Values

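The pandas library represents missing values with NaN, which stands for "not a number".
The pandas.isnull() function takes a pandas column (a Series) and returns a Series of True/False values marking which entries are missing. We can use that mask to select the missing entries and print how many there are.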

age = titanic_survival["Age"]
age_is_null = pd.isnull(age)
age_null_true = age[age_is_null]
age_null_count = len(age_null_true)
print(age_null_count)
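
The same count can also be obtained in one step, because True counts as 1 when a boolean Series is summed. A minimal sketch, using a small made-up Series in place of the Age column (the values are illustrative and not taken from titanic_train.csv):

```python
import numpy as np
import pandas as pd

# Toy stand-in for the Age column; the values are made up for illustration.
ages = pd.Series([22.0, np.nan, 26.0, np.nan, 35.0])

# Long form, as in the text: build a boolean mask, filter, then count.
mask = pd.isnull(ages)
count_long = len(ages[mask])

# Short form: summing a boolean Series counts the True entries.
count_short = int(ages.isnull().sum())

print(count_long, count_short)
```

Both forms give the same number; the short form is simply less to type.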

If we compute the mean by hand as shown here, mean_age comes out as nan, because any arithmetic that involves a NaN value yields NaN.

mean_age = sum(titanic_survival["Age"]) / len(titanic_survival["Age"])
print (mean_age)

Filtering out missing values

Before computing the mean we have to filter out the missing values; otherwise the printed mean is NaN.

good_ages = titanic_survival["Age"][~age_is_null]
correct_mean_age = sum(good_ages) / len(good_ages)
print (correct_mean_age)

Because missing data is so common, many pandas methods skip missing values automatically.

correct_mean_age = titanic_survival["Age"].mean()
print (correct_mean_age)
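
To see the difference side by side, here is a small sketch on a made-up three-value Series (not the real Age column). It relies on the skipna parameter of Series.mean(), which defaults to True:

```python
import math

import numpy as np
import pandas as pd

# Made-up ages with one missing value, for illustration only.
ages = pd.Series([20.0, np.nan, 40.0])

# Python's built-in sum() lets the NaN propagate into the result.
manual_mean = sum(ages) / len(ages)

# Series.mean() skips NaN by default, so this is (20 + 40) / 2.
default_mean = ages.mean()

# With skipna=False the NaN propagates again.
strict_mean = ages.mean(skipna=False)

print(manual_mean, default_mean, strict_mean)
print(math.isnan(manual_mean), math.isnan(strict_mean))
```

Note that the default mean divides by the number of non-missing values, not by the length of the column.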

Hands-on practice

1. Average fare for each passenger class

passenger_classes = [1, 2, 3]
fares_by_class = {}
for this_class in passenger_classes:
    pclass_rows = titanic_survival[titanic_survival["Pclass"] == this_class]
    pclass_fares = pclass_rows["Fare"]
    fare_for_class = pclass_fares.mean()
    fares_by_class[this_class] = fare_for_class
print (fares_by_class)
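
The loop above can also be written with groupby, which does the splitting and averaging in one call. A minimal sketch on a made-up five-row table (the fares are invented, not taken from titanic_train.csv):

```python
import pandas as pd

# Tiny made-up table with the two columns the loop uses.
df = pd.DataFrame({
    "Pclass": [1, 1, 2, 3, 3],
    "Fare": [80.0, 60.0, 20.0, 8.0, 7.0],
})

# Loop version, mirroring the code in the text.
fares_by_class = {}
for this_class in [1, 2, 3]:
    pclass_rows = df[df["Pclass"] == this_class]
    fares_by_class[this_class] = pclass_rows["Fare"].mean()

# groupby version: one group per distinct Pclass value, then the mean of Fare.
fares_grouped = df.groupby("Pclass")["Fare"].mean()

print(fares_by_class)
print(fares_grouped.to_dict())
```

The groupby version does not need the list of classes to be known in advance.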

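2. Mean survival rate for each passenger class

pivot_table summarizes the relationship between columns: index tells the method which column to group by, values names the column(s) to compute on, and aggfunc specifies the calculation to perform.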

# Passing the name "mean" avoids the deprecation warning newer pandas versions raise for np.mean
passenger_survival = titanic_survival.pivot_table(index="Pclass", values="Survived", aggfunc="mean")
print (passenger_survival)
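
Because Survived is coded as 0 or 1, its mean within a group is the fraction of that group that survived. The sketch below illustrates this on a made-up eight-row table (the data is invented for the example):

```python
import pandas as pd

# Made-up data: both first-class passengers survive, one of two in second
# class, and one of four in third class.
df = pd.DataFrame({
    "Pclass":   [1, 1, 2, 2, 3, 3, 3, 3],
    "Survived": [1, 1, 1, 0, 0, 0, 1, 0],
})

# index: column to group by; values: column to aggregate; aggfunc: how.
survival = df.pivot_table(index="Pclass", values="Survived", aggfunc="mean")
print(survival)

# The result is a DataFrame indexed by Pclass, so rates can be read with loc.
rate_first = survival.loc[1, "Survived"]
rate_second = survival.loc[2, "Survived"]
rate_third = survival.loc[3, "Survived"]
print(rate_first, rate_second, rate_third)
```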

3. Average age for each passenger class

aggfunc defaults to the mean, so it can be omitted here.

passenger_age = titanic_survival.pivot_table(index="Pclass", values="Age")
print(passenger_age)

4. Relating one variable to two others

Here we total the fares and count the survivors for each port of embarkation.

port_stats = titanic_survival.pivot_table(index="Embarked", values=["Fare","Survived"], aggfunc="sum")
print(port_stats)

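dropna() removes missing data. Specifying axis=1 (or axis='columns') drops every column that contains a null value. With axis=0 and subset, it drops every row that has a null value in any of the listed columns.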

drop_na_columns = titanic_survival.dropna(axis=1)
new_titanic_survival = titanic_survival.dropna(axis=0,subset=["Age", "Sex"])
print (new_titanic_survival)
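
The effect of each dropna() variant is easiest to see on a tiny table. The following sketch uses a made-up three-row DataFrame (not the real data) in which Age and Cabin both contain missing values:

```python
import numpy as np
import pandas as pd

# Made-up rows: Age is missing in row 1, Cabin is missing in rows 0 and 2.
df = pd.DataFrame({
    "Age":   [22.0, np.nan, 26.0],
    "Sex":   ["male", "female", "female"],
    "Cabin": [np.nan, "C85", np.nan],
})

# axis=1: drop every column that contains at least one NaN.
no_null_columns = df.dropna(axis=1)

# axis=0 without subset: drop every row that contains at least one NaN.
no_null_rows = df.dropna(axis=0)

# axis=0 with subset: only NaN in Age or Sex causes a row to be dropped.
age_sex_rows = df.dropna(axis=0, subset=["Age", "Sex"])

print(list(no_null_columns.columns))
print(len(no_null_rows))
print(list(age_sex_rows.index))
```

Note that dropna() returns a new DataFrame and keeps the original index labels of the surviving rows.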

5. Locating a specific row

# loc selects by index label: the row labeled 83 and the row labeled 766
row_index_83_age = titanic_survival.loc[83,"Age"]
row_index_766_pclass = titanic_survival.loc[766,"Pclass"]
print (row_index_83_age)
print (row_index_766_pclass)

II. Custom Functions

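# Sort by age, oldest first; each row keeps its original index label
new_titanic_survival = titanic_survival.sort_values("Age",ascending=False)
print (new_titanic_survival[0:10])
# Renumber the rows 0, 1, 2, ... and discard the old labels
titanic_reindexed = new_titanic_survival.reset_index(drop=True)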
print('------------------------')
print(titanic_reindexed.iloc[0:10])
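
After sorting, index labels and row positions no longer agree, which is why reset_index matters when rows are looked up with loc. A small sketch on a made-up three-row table (names and ages are invented):

```python
import pandas as pd

# Made-up rows; the default index labels are 0, 1, 2.
df = pd.DataFrame({"Name": ["a", "b", "c"], "Age": [22.0, 38.0, 26.0]})

# Sorting reorders the rows but each row keeps its original label.
sorted_df = df.sort_values("Age", ascending=False)
labels_after_sort = list(sorted_df.index)

# iloc selects by position: the first row is now the oldest passenger.
first_by_position = sorted_df.iloc[0]["Name"]

# loc selects by label: label 0 still points at the original first row.
by_label = sorted_df.loc[0, "Name"]

# After reset_index(drop=True), label 0 and position 0 agree again.
reindexed = sorted_df.reset_index(drop=True)
first_after_reset = reindexed.loc[0, "Name"]

print(labels_after_sort, first_by_position, by_label, first_after_reset)
```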

1. Returning the 100th item of a column

def hundredth_row(column):
    # Extract the 100th item (position 99, because positions start at 0)
    hundredth_item = column.iloc[99]
    return hundredth_item

Return the 100th item of every column

# Store the result under a new name so the function hundredth_row is not overwritten
hundredth_items = titanic_survival.apply(hundredth_row)
print (hundredth_items)

2. Counting the missing values in each column

def null_count(column):
    # Boolean mask that is True where the value is missing
    column_null = pd.isnull(column)
    null = column[column_null]
    return len(null)

column_null_count = titanic_survival.apply(null_count)
print (column_null_count)
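
For this particular task the custom function is not strictly needed: DataFrame.isnull() followed by sum() gives the same per-column counts. A sketch on a made-up table, where null_count is a helper written for this example:

```python
import numpy as np
import pandas as pd

# Made-up table: Age has 2 missing values, Cabin has 3, Fare has none.
df = pd.DataFrame({
    "Age":   [22.0, np.nan, 26.0, np.nan],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Fare":  [7.25, 71.28, 7.92, 8.05],
})

def null_count(column):
    # Helper for this example: keep only the missing entries and count them.
    return len(column[pd.isnull(column)])

# apply() calls the function once per column and collects the results.
via_apply = df.apply(null_count)

# Vectorized form: a boolean table, summed column by column.
via_sum = df.isnull().sum()

print(via_apply.to_dict())
print(via_sum.to_dict())
```

The apply() version is still worth knowing, since it works for any per-column calculation, not only counting.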

By passing axis=1, we can use DataFrame.apply() to iterate over rows instead of columns.

def which_class(row):
    pclass = row['Pclass']
    if pd.isnull(pclass):
        return "Unknown"
    elif pclass == 1:
        return "First Class"
    elif pclass == 2:
        return "Second Class"
    elif pclass == 3:
        return "Third Class"

classes = titanic_survival.apply(which_class, axis=1)
print (classes)
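
When the row function only translates one column's values, a dictionary lookup with Series.map() is a possible alternative to the if/elif chain. A sketch on a made-up Series; the name class_names is introduced here for illustration:

```python
import numpy as np
import pandas as pd

# Made-up class values, including one missing entry.
pclass = pd.Series([1, 3, 2, np.nan])

# Lookup table from class number to label (a name chosen for this example).
class_names = {1: "First Class", 2: "Second Class", 3: "Third Class"}

# map() looks each value up in the dict; values without a match become NaN,
# which fillna() then replaces with "Unknown".
classes = pclass.map(class_names).fillna("Unknown")

print(classes.tolist())
```

Unlike the row-wise function, this works on a single column and never touches the rest of the row.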

3. Discretizing continuous values

For example, label passengers aged 18 or older as adult and passengers under 18 as minor.

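def is_minor(row):
    # A missing age compares as False, so passengers of unknown age are
    # not flagged as minors; generate_age_label below handles them explicitly
    return row["Age"] < 18

minors = titanic_survival.apply(is_minor, axis=1)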

def generate_age_label(row):
    age = row["Age"]
    if pd.isnull(age):
        return "unknown"
    elif age < 18:
        return "minor"
    else:
        return "adult"

age_labels = titanic_survival.apply(generate_age_label, axis=1)
print (age_labels)
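
pandas also has a built-in binning function, pd.cut(), which can produce the same labels without a row-wise function. The sketch below is one possible way to set it up, on made-up ages; the bin edges are chosen to match the rule above (under 18 is minor, 18 and over is adult):

```python
import numpy as np
import pandas as pd

# Made-up ages, including the boundary value 18 and one missing value.
ages = pd.Series([5.0, 17.9, 18.0, 40.0, np.nan])

# right=False makes the bins [0, 18) and [18, inf), so exactly 18 is "adult".
binned = pd.cut(ages, bins=[0, 18, np.inf], right=False,
                labels=["minor", "adult"])

# pd.cut leaves missing ages as NaN; convert to plain objects and label them.
age_labels = binned.astype(object).fillna("unknown")

print(age_labels.tolist())
```

With more than two age bands, only the bins and labels lists need to grow.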

4. Mean survival rate by age group (adult or minor)

titanic_survival['age_labels'] = age_labels
age_group_survival = titanic_survival.pivot_table(index="age_labels", values="Survived")
print (age_group_survival)
