Data preprocessing with the Titanic passenger data. First, read in the titanic_train.csv
file (message me privately if you need a copy).
import numpy as np
import pandas as pd
titanic_survival = pd.read_csv("titanic_train.csv")
titanic_survival.head()
Out:
   PassengerId  Survived  Pclass  Name                                                Sex     Age   SibSp  Parch  Ticket            Fare     Cabin  Embarked
0  1            0         3       Braund, Mr. Owen Harris                             male    22.0  1      0      A/5 21171         7.2500   NaN    S
1  2            1         1       Cumings, Mrs. John Bradley (Florence Briggs Th...   female  38.0  1      0      PC 17599          71.2833  C85    C
2  3            1         3       Heikkinen, Miss. Laina                              female  26.0  0      0      STON/O2. 3101282  7.9250   NaN    S
3  4            1         1       Futrelle, Mrs. Jacques Heath (Lily May Peel)        female  35.0  1      0      113803            53.1000  C123   S
4  5            0         3       Allen, Mr. William Henry                            male    35.0  0      0      373450            8.0500   NaN    S
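The pandas library uses NaN, which stands for "not a number", to represent missing values.
We can use the pandas.isnull() function, which takes a pandas column and returns a Series of True/False values marking which entries are missing. The code below uses it to count and print the number of missing ages.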
age = titanic_survival["Age"]
age_is_null = pd.isnull(age)
age_null_true = age[age_is_null]
age_null_count = len(age_null_true)
print(age_null_count)
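As a self-contained sketch of the same count, the snippet below uses a small made-up Series (the values are illustrative and not taken from titanic_train.csv). Since True counts as 1 when summed, calling sum() on the boolean mask is a shorter way to get the number of missing values.

```python
import numpy as np
import pandas as pd

# Made-up ages for illustration; two of the five are missing.
toy_age = pd.Series([22.0, np.nan, 26.0, np.nan, 35.0])

toy_is_null = pd.isnull(toy_age)             # boolean Series, True where the value is NaN
null_count_len = len(toy_age[toy_is_null])   # filter with the mask, then take the length
null_count_sum = int(toy_is_null.sum())      # True counts as 1 when summed

print(null_count_len, null_count_sum)
```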
The calculation below leaves mean_age as nan, because any arithmetic that involves a NaN value produces NaN.
mean_age = sum(titanic_survival["Age"]) / len(titanic_survival["Age"])
print (mean_age)
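The propagation can be reproduced without the dataset. This is a minimal sketch with made-up numbers in a plain Python list, where a single NaN is enough to turn the whole sum, and therefore the mean, into NaN.

```python
import math

# Illustrative values only; the middle entry is missing.
values = [22.0, float("nan"), 26.0]

total = sum(values)               # NaN propagates through every addition
naive_mean = total / len(values)  # dividing NaN still gives NaN

print(naive_mean)
print(math.isnan(naive_mean))
```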
Before computing the mean we have to filter out the missing values; otherwise the printed mean is nan.
good_ages = titanic_survival["Age"][~age_is_null]  # keep only the non-missing ages
correct_mean_age = sum(good_ages) / len(good_ages)
print (correct_mean_age)
Because missing data is so common, many pandas methods filter out missing values automatically.
correct_mean_age = titanic_survival["Age"].mean()
print (correct_mean_age)
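For Series.mean() this automatic filtering is controlled by the skipna parameter, which defaults to True. The sketch below uses made-up values to show both settings side by side.

```python
import numpy as np
import pandas as pd

toy_age = pd.Series([20.0, np.nan, 40.0])   # made-up values, one missing

mean_skipping_nan = toy_age.mean()            # skipna=True by default, NaN is ignored
mean_with_nan = toy_age.mean(skipna=False)    # NaN is kept, so the result is NaN

print(mean_skipping_nan)
print(mean_with_nan)
```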
# Mean fare for each passenger class
passenger_classes = [1, 2, 3]
fares_by_class = {}
for this_class in passenger_classes:
    pclass_rows = titanic_survival[titanic_survival["Pclass"] == this_class]
    pclass_fares = pclass_rows["Fare"]
    fare_for_class = pclass_fares.mean()
    fares_by_class[this_class] = fare_for_class
print(fares_by_class)
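The same per-class mean can also be computed without an explicit loop. This is an alternative sketch using groupby on a made-up DataFrame (the rows are invented; only the column names follow the dataset), not the approach used in the loop above.

```python
import pandas as pd

# Made-up rows; column names follow the dataset.
toy = pd.DataFrame({
    "Pclass": [1, 1, 2, 3, 3],
    "Fare": [80.0, 60.0, 20.0, 8.0, 6.0],
})

# Group rows by class, take the mean fare of each group, convert to a dict.
toy_fares_by_class = toy.groupby("Pclass")["Fare"].mean().to_dict()
print(toy_fares_by_class)
```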
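pivot_table summarizes the relationship between columns. index tells the method which column to group by, values is the column (or columns) the calculation runs on, and aggfunc specifies the calculation to perform.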
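# Survival rate per passenger class (recent pandas versions prefer the string "mean" over np.mean)
passenger_survival = titanic_survival.pivot_table(index="Pclass", values="Survived", aggfunc="mean")
print(passenger_survival)
# aggfunc defaults to the mean, so this is the mean age per class
passenger_age = titanic_survival.pivot_table(index="Pclass", values="Age")
print(passenger_age)
# Total fare and number of survivors per port of embarkation
port_stats = titanic_survival.pivot_table(index="Embarked", values=["Fare", "Survived"], aggfunc="sum")
print(port_stats)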
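Calling dropna() with axis=1 or axis='columns' drops every column that contains a null value. With axis=0 it drops rows instead, and subset restricts the check to the listed columns.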
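# Drop every column that has at least one missing value
drop_na_columns = titanic_survival.dropna(axis=1)
# Drop every row where Age or Sex is missing
new_titanic_survival = titanic_survival.dropna(axis=0, subset=["Age", "Sex"])
print(new_titanic_survival)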
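The difference between the two calls is easier to see on a tiny made-up DataFrame. In this sketch the values are invented; note that subset makes dropna ignore missing values in the other columns.

```python
import numpy as np
import pandas as pd

# Made-up rows: Age and Cabin have gaps, Sex is complete.
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0],
    "Sex": ["male", "female", "female"],
    "Cabin": [np.nan, "C85", np.nan],
})

no_null_columns = toy.dropna(axis=1)                  # only "Sex" has no missing values
rows_with_age = toy.dropna(axis=0, subset=["Age"])    # row 1 is dropped; Cabin gaps are ignored

print(list(no_null_columns.columns))
print(list(rows_with_age.index))
```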
# Look up single values by row label and column name
row_index_83_age = titanic_survival.loc[83, "Age"]
row_index_766_pclass = titanic_survival.loc[766, "Pclass"]
print(row_index_83_age)
print(row_index_766_pclass)
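# Sort by age, oldest first; the original index labels stay attached to their rows
new_titanic_survival = titanic_survival.sort_values("Age", ascending=False)
print(new_titanic_survival[0:10])
# Replace the old labels with 0..n-1 in the new order
titanic_reindexed = new_titanic_survival.reset_index(drop=True)
print('------------------------')
print(titanic_reindexed.iloc[0:10])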
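Why reset_index is useful after sorting is easier to see on a tiny example. The sketch below uses made-up ages: after sort_values, loc still looks rows up by their original labels, while iloc goes by position.

```python
import pandas as pd

toy = pd.DataFrame({"Age": [22.0, 80.0, 35.0]})   # made-up ages

by_age = toy.sort_values("Age", ascending=False)  # labels travel with the rows
reindexed = by_age.reset_index(drop=True)         # labels become 0..n-1 again

print(list(by_age.index))            # original labels, now in sorted order
print(by_age.loc[0, "Age"])          # label 0 is still the first original row
print(by_age.iloc[0]["Age"])         # position 0 is the oldest passenger
print(reindexed.loc[0, "Age"])       # after the reset, label 0 is the oldest
```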
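def hundredth_row(column):
    # Extract the 100th item (position 99)
    hundredth_item = column.iloc[99]
    return hundredth_item

# Return the 100th item from each column.
# The result gets its own name so that it does not overwrite the function.
hundredth_items = titanic_survival.apply(hundredth_row)
print(hundredth_items)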
def null_count(column):
    # Count the missing values in one column
    column_null = pd.isnull(column)
    null = column[column_null]
    return len(null)

column_null_count = titanic_survival.apply(null_count)
print(column_null_count)
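As a sketch on made-up data, the per-column count obtained through apply() can be compared with the shorter isnull().sum() idiom. The helper name count_nulls is hypothetical and the values are invented.

```python
import numpy as np
import pandas as pd

# Made-up rows with a different number of gaps in each column.
toy = pd.DataFrame({
    "Age": [22.0, np.nan, np.nan],
    "Cabin": [np.nan, "C85", "C123"],
    "Sex": ["male", "female", "female"],
})

def count_nulls(column):  # hypothetical helper name
    # apply() passes one column (a Series) at a time
    return len(column[pd.isnull(column)])

via_apply = toy.apply(count_nulls)   # one count per column
via_builtin = toy.isnull().sum()     # same counts, computed column-wise by pandas

print(via_apply.to_dict())
print(via_builtin.to_dict())
```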
By passing the axis=1 argument, we can use the DataFrame.apply() method to iterate over rows instead of columns.
def which_class(row):
    pclass = row['Pclass']
    if pd.isnull(pclass):
        return "Unknown"
    elif pclass == 1:
        return "First Class"
    elif pclass == 2:
        return "Second Class"
    elif pclass == 3:
        return "Third Class"

classes = titanic_survival.apply(which_class, axis=1)
print(classes)
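When the row function only reads a single column, a column-wise lookup gives the same labels. The following is an alternative sketch on made-up data using Series.map with a dictionary, not the row-wise approach shown above; the dictionary name class_names is an assumption for illustration.

```python
import pandas as pd

toy = pd.DataFrame({"Pclass": [1, 3, 2]})   # made-up class values

# Hypothetical lookup table from class number to label.
class_names = {1: "First Class", 2: "Second Class", 3: "Third Class"}

# map() looks up each value; anything not in the dict becomes NaN,
# which fillna() then turns into "Unknown".
toy_classes = toy["Pclass"].map(class_names).fillna("Unknown")
print(toy_classes.tolist())
```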
For example, we can label passengers aged 18 or over as adult and passengers under 18 as minor.
def is_minor(row):
    # A missing age compares as False, so it is not counted as a minor
    return row["Age"] < 18

minors = titanic_survival.apply(is_minor, axis=1)
def generate_age_label(row):
    age = row["Age"]
    if pd.isnull(age):
        return "unknown"
    elif age < 18:
        return "minor"
    else:
        return "adult"

age_labels = titanic_survival.apply(generate_age_label, axis=1)
print(age_labels)
titanic_survival['age_labels'] = age_labels
age_group_survival = titanic_survival.pivot_table(index="age_labels", values="Survived")
print(age_group_survival)
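The whole labelling and pivoting step can be checked end to end on a handful of made-up rows. This is a minimal sketch: the data is invented and the helper name label_age is hypothetical. It applies the function to the Age column directly rather than row by row.

```python
import numpy as np
import pandas as pd

# Made-up rows: two minors, two adults, one passenger with no recorded age.
toy = pd.DataFrame({
    "Age": [10.0, 40.0, np.nan, 17.0, 30.0],
    "Survived": [1, 0, 0, 1, 1],
})

def label_age(age):  # hypothetical helper name
    if pd.isnull(age):
        return "unknown"
    return "minor" if age < 18 else "adult"

toy["age_labels"] = toy["Age"].apply(label_age)

# Default aggfunc is the mean, so this is the survival rate per age group.
rates = toy.pivot_table(index="age_labels", values="Survived")
print(rates)
```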