pclass – the cabin class of the passenger. 1 was the best cabin class, followed by 2, then 3.
name – the name of the passenger.
sex – the gender of the passenger.
age – the age of the passenger.
boat – the lifeboat the passenger got into.
body – the body number of the passenger.
Python has a special value, None, that represents "no value".
pandas has a special value, NaN ("not a number"), that marks a missing value.
import pandas as pd
titanic_survival = pd.read_csv("titanic_survival.csv")
age_null = pd.isnull(titanic_survival["age"])
age_null_true = age_null[age_null == True]
age_null_count = len(age_null_true)
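A minimal sketch of the same counting idea on made-up data. The small `ages` Series is hypothetical and is not taken from titanic_survival.csv. It shows that pd.isnull flags both None and NaN, and that summing the boolean result gives the count directly.

```python
import numpy as np
import pandas as pd

# Hypothetical ages; None and np.nan both end up as missing values.
ages = pd.Series([29.0, None, 2.0, np.nan, 30.0])

age_null = pd.isnull(ages)                 # boolean Series, True where the value is missing
count_by_filter = len(age_null[age_null])  # same approach as above: keep only the True entries
count_by_sum = int(age_null.sum())         # True counts as 1, so the sum is the count

print(count_by_filter, count_by_sum)
```

Both counts agree; the sum form just avoids building the intermediate filtered Series.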
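Now that we know how many values are missing from the age column, the next step is to decide how to handle them. Here we fill them in with the mean of the non-missing values.
- Be careful when computing that mean: summing the whole column does not work, because the result is NaN. Any arithmetic that involves NaN yields NaN, not 0.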
import pandas as pd
# Incorrect: the age column contains NaN, so sum() returns NaN and mean_age is NaN too.
mean_age = sum(titanic_survival["age"]) / len(titanic_survival["age"])
import pandas as pd
# Correct: keep only the non-missing ages before averaging.
age_null = pd.isnull(titanic_survival["age"])
good_ages = titanic_survival["age"][age_null == False]
correct_mean_age = sum(good_ages) / len(good_ages)
import pandas as pd
# Series.mean() skips missing values by default, so no manual filtering is needed.
correct_mean_age = titanic_survival["age"].mean()
correct_mean_fare = titanic_survival["fare"].mean()
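The three ways of computing the mean can be compared on a tiny example. This is only a sketch: the `ages` Series is made up, and the `fillna` call at the end is one possible way to do the fill mentioned earlier, not necessarily the one these notes had in mind.

```python
import numpy as np
import pandas as pd

# Hypothetical ages with one missing value.
ages = pd.Series([20.0, np.nan, 40.0])

naive_mean = sum(ages) / len(ages)   # NaN: the built-in sum propagates NaN
skipna_mean = ages.mean()            # pandas skips NaN by default
filled = ages.fillna(skipna_mean)    # replace each missing age with that mean

print(naive_mean, skipna_mean)
print(filled.tolist())
```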
passenger_classes = [1, 2, 3]
# Placeholder version: every class maps to None.
fares_by_class = {}
for pclass in passenger_classes:
    fare_for_class = None
    fares_by_class[pclass] = fare_for_class
# Completed version: select the fares for each class and average them.
fares_by_class = {}
for pclass in passenger_classes:
    pclass_fares = titanic_survival["fare"][titanic_survival["pclass"] == pclass]
    fare_for_class = pclass_fares.mean()
    fares_by_class[pclass] = fare_for_class
import pandas as pd
import numpy as np
passenger_survival = titanic_survival.pivot_table(index="pclass", values="survived", aggfunc=np.mean)
# First class passengers had a much higher survival chance
print(passenger_survival)
'''
pclass
1    0.619195
2    0.429603
3    0.255289
Name: survived, dtype: float64
'''
passenger_age = titanic_survival.pivot_table(index="pclass", values="age", aggfunc=np.mean)
import numpy as np
# This will compute the mean survival chance and the mean age for each passenger class
passenger_survival = titanic_survival.pivot_table(index="pclass", values=["age", "survived"], aggfunc=np.mean)
print(passenger_survival)
'''
              age  survived
pclass
1       39.159918  0.619195
2       29.506705  0.429603
3       24.816367  0.255289
'''
port_stats = titanic_survival.pivot_table(index="embarked", values=["age", "survived", "fare"], aggfunc=np.mean)
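To see that pivot_table does the same work as the per-class loop, here is a small sketch on invented data. The `toy` DataFrame and its numbers are hypothetical stand-ins for titanic_survival. The string "mean" is used for aggfunc, which pandas also accepts in place of np.mean.

```python
import pandas as pd

# Hypothetical stand-in for titanic_survival with only the columns needed here.
toy = pd.DataFrame({
    "pclass": [1, 1, 2, 2, 3, 3],
    "survived": [1, 1, 1, 0, 0, 0],
    "fare": [80.0, 120.0, 20.0, 30.0, 8.0, 12.0],
})

# Loop version: filter by class, then average.
fares_by_class = {}
for pclass in [1, 2, 3]:
    fares_by_class[pclass] = toy["fare"][toy["pclass"] == pclass].mean()

# pivot_table version: one row per pclass, one column per value, mean as the aggregate.
by_class = toy.pivot_table(index="pclass", values=["fare", "survived"], aggfunc="mean")

print(fares_by_class)
print(by_class)
```

The "fare" column of `by_class` holds the same numbers as `fares_by_class`, without an explicit loop.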
import pandas as pd
# Drop all rows that have missing values
new_titanic_survival = titanic_survival.dropna()
# It looks like we have an empty dataframe now.
# This is because every row has at least one missing value.
print(new_titanic_survival)
'''
Empty DataFrame
Columns: [pclass, survived, name, sex, age, sibsp, parch, ticket, fare, cabin, embarked, boat, body, home.dest]
Index: []
'''
# We can also use the axis argument to drop columns that have missing values
new_titanic_survival = titanic_survival.dropna(axis=1)
print(new_titanic_survival)
'''
Empty DataFrame
Columns: []
Index: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, ...]
'''
# We can use the subset argument to only drop rows if certain columns have missing values.
# This drops all rows where "age" or "sex" is missing.
new_titanic_survival = titanic_survival.dropna(subset=["age", "sex"])
# This drops all rows where "age", "body", or "home.dest" is missing.
new_titanic_survival = titanic_survival.dropna(subset=["age", "body", "home.dest"])
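The three dropna variants are easier to compare on a table small enough to check by eye. This is a sketch with an invented `toy` DataFrame; the column names mirror the dataset but the values are made up.

```python
import numpy as np
import pandas as pd

# Hypothetical data: "age" and "body" have gaps, "sex" is complete.
toy = pd.DataFrame({
    "age": [29.0, np.nan, 30.0, 25.0],
    "sex": ["female", "male", "male", "female"],
    "body": [np.nan, np.nan, 135.0, np.nan],
})

all_rows = toy.dropna()                # keeps only rows with no missing value at all
all_cols = toy.dropna(axis=1)          # keeps only columns with no missing value
age_only = toy.dropna(subset=["age"])  # only "age" decides whether a row is dropped

print(len(all_rows), list(all_cols.columns), len(age_only))
```

Note that `age_only` keeps its original row labels, which is why the index has a gap after the drop.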
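After dropping rows with missing values from a DataFrame, the remaining rows keep their original index labels, so the index may have gaps. Calling reset_index renumbers the rows starting from 0. Its drop argument only controls what happens to the old index: with drop=False (the default) the old index is kept as a new column, and with drop=True it is discarded. If the original labels are still needed, the index can simply be left as it is.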
# The indexes are the original numbers from titanic_survival
new_titanic_survival = titanic_survival.dropna(subset=["body"])
print(new_titanic_survival)
# Reset the index to an integer sequence, starting at 0.
# The drop keyword argument specifies whether or not to make a dataframe column with the index values.
# If True, the old index is discarded; if False, it is kept as a new column.
# We'll almost always want to set it to True.
new_titanic_survival = new_titanic_survival.reset_index(drop=True)
# Now we have indexes starting from 0!
print(new_titanic_survival)
new_titanic_survival = titanic_survival.dropna(subset=["age", "boat"])
titanic_reindexed = new_titanic_survival.reset_index(drop=True)
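A short sketch of what the drop argument changes, using a made-up single-column `toy` DataFrame rather than the real dataset:

```python
import numpy as np
import pandas as pd

# Hypothetical data with missing ages at positions 1 and 3.
toy = pd.DataFrame({"age": [29.0, np.nan, 30.0, np.nan, 25.0]})

kept = toy.dropna(subset=["age"])            # index is still 0, 2, 4
renumbered = kept.reset_index(drop=True)     # index becomes 0, 1, 2; old labels are discarded
with_column = kept.reset_index(drop=False)   # old labels are saved in a column named "index"

print(list(kept.index), list(renumbered.index), list(with_column.columns))
```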
import pandas as pd
def not_null_count(column):
    column_null = pd.isnull(column)
    not_null = column[column_null == False]
    return len(not_null)  # number of non-missing elements
# By default (axis=0), apply() passes each column to the function, not each row.
column_not_null_count = titanic_survival.apply(not_null_count)
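# Determine whether each passenger is a minor (under 18)
def is_minor(row):
    # A missing age (NaN) compares as False, so it is reported as "not a minor".
    if row["age"] < 18:
        return True
    else:
        return False
minors = titanic_survival.apply(is_minor, axis=1)  # axis=1 applies the function to each row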
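The difference between the two apply calls comes down to what the function receives. A minimal sketch on an invented `toy` DataFrame, with shortened versions of the two functions:

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing value in each column.
toy = pd.DataFrame({
    "age": [29.0, np.nan, 2.0],
    "fare": [211.3, 151.6, np.nan],
})

def not_null_count(column):
    # Receives one column (a Series) at a time when axis=0.
    return len(column[pd.isnull(column) == False])

def is_minor(row):
    # Receives one row (a Series indexed by column name) when axis=1.
    return row["age"] < 18

per_column = toy.apply(not_null_count)   # one result per column
per_row = toy.apply(is_minor, axis=1)    # one result per row

print(per_column)
print(per_row)
```

The row with the missing age comes out as False, which is the reason a separate "unknown" label is useful.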
import pandas as pd
# Assign a label based on age
def generate_age_label(row):
    age = row["age"]
    if pd.isnull(age):
        return "unknown"
    elif age < 18:
        return "minor"
    else:
        return "adult"
age_labels = titanic_survival.apply(generate_age_label, axis=1)
import numpy as np
age_group_survival = titanic_survival.pivot_table(index=age_labels, values=["survived"], aggfunc=np.mean)
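Putting the labeling and the pivot together on a hand-checkable example. The `toy` DataFrame is invented for illustration; passing the labels Series itself as the index mirrors the call above, and aggfunc="mean" stands in for np.mean.

```python
import numpy as np
import pandas as pd

# Hypothetical passengers: two adults, two minors, two with unknown age.
toy = pd.DataFrame({
    "age": [30.0, 5.0, np.nan, 40.0, 10.0, np.nan],
    "survived": [0, 1, 0, 1, 1, 1],
})

def generate_age_label(row):
    age = row["age"]
    if pd.isnull(age):
        return "unknown"
    elif age < 18:
        return "minor"
    else:
        return "adult"

age_labels = toy.apply(generate_age_label, axis=1)

# The labels are not a column of toy; pivot_table groups rows by the Series directly.
age_group_survival = toy.pivot_table(index=age_labels, values=["survived"], aggfunc="mean")

print(age_group_survival)
```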