Pandas: NaN, Pivot Tables, dropna, and reset_index

  • The data used in this article describes the passengers on the Titanic. It includes the following attributes:

pclass – the cabin class of the passenger. 1 was the best cabin class, followed by 2, then 3.
name – the name of the passenger.
sex – the gender of the passenger.
age – the age of the passenger.
boat – the lifeboat the passenger got into.
body – the body number of the passenger.

Finding The Missing Data

Python has a special value, None, that indicates "no value".
pandas marks missing values with NaN ("not a number"), a special floating-point value.
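To make the relationship between None and NaN concrete, here is a minimal sketch on a made-up Series (the values are illustrative and not taken from the Titanic file). It assumes a reasonably recent pandas and NumPy.

```python
import numpy as np
import pandas as pd

# Made-up values; None and np.nan both end up as NaN in a float Series.
values = pd.Series([1.0, None, np.nan])

print(values.dtype)       # the Series is stored as floats
print(pd.isnull(values))  # the last two entries are reported as missing
print(np.nan == np.nan)   # NaN never compares equal to itself
print(pd.isnull(None))    # pd.isnull also recognises a bare None
```

Because NaN is not equal to itself, a test such as `value == np.nan` never matches; `pd.isnull` is the reliable check.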

  • The code below counts the missing values in the age column. The isnull function checks each element and returns True where the element is NaN.
import pandas as pd
titanic_survival = pd.read_csv("titanic_survival.csv")
age_null = pd.isnull(titanic_survival["age"])
age_null_true = age_null[age_null == True]
age_null_count = len(age_null_true)
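The same count can be obtained more directly by summing the boolean mask, since True counts as 1. The sketch below uses a small made-up Series in place of the real age column.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the "age" column; the values are made up for illustration.
ages = pd.Series([29.0, np.nan, 2.0, np.nan, 30.0], name="age")

age_null = pd.isnull(ages)            # boolean Series, True where the value is missing
age_null_count = int(age_null.sum())  # True counts as 1, so the sum is the number of NaNs

print(age_null_count)
```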

What's The Big Deal With Missing Data?

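Now that we know the age column has many missing values, the next step is to deal with them, for example by filling them in with the mean of the non-missing values.
- Note, however, that the mean cannot be computed over the whole column: the result is NaN, because any arithmetic involving NaN yields NaN.

# Naive version: the NaN values propagate, so mean_age is NaN
mean_age = sum(titanic_survival["age"]) / len(titanic_survival["age"])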
  • The corrected version filters out the missing values first:
import pandas as pd
age_null = pd.isnull(titanic_survival["age"])
good_ages = titanic_survival["age"][age_null == False]
correct_mean_age = sum(good_ages) / len(good_ages)
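The difference between the two computations can be checked on a tiny made-up Series. This is only a sketch; the values are illustrative.

```python
import math

import numpy as np
import pandas as pd

# Made-up ages with one missing value.
ages = pd.Series([20.0, np.nan, 40.0])

# Built-in sum adds the NaN along with everything else, so the total is NaN.
naive_mean = sum(ages) / len(ages)

# Filtering the missing values first gives the mean of the known ages.
good_ages = ages[pd.isnull(ages) == False]
correct_mean = sum(good_ages) / len(good_ages)

print(math.isnan(naive_mean), correct_mean)
```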

Easier Ways To Do Math

  • pandas already provides this: the built-in .mean() method skips missing values when computing the average.
import pandas as pd
correct_mean_age = titanic_survival["age"].mean()
correct_mean_fare = titanic_survival["fare"].mean()
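A short sketch of how `.mean()` treats missing values, again on made-up data. The `skipna` parameter and `fillna` are standard pandas; using `fillna` here is one possible way to carry out the "fill with the mean" idea mentioned earlier, not necessarily the approach the original course used.

```python
import math

import numpy as np
import pandas as pd

ages = pd.Series([20.0, np.nan, 40.0])

mean_skipping_nan = ages.mean()               # NaN is ignored by default (skipna=True)
mean_with_nan = ages.mean(skipna=False)       # NaN is included, so the result is NaN
filled_ages = ages.fillna(mean_skipping_nan)  # replace each missing age with the mean

print(mean_skipping_nan, mean_with_nan)
print(filled_ages.tolist())
```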

Computing Summary Statistics

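  • The pclass column divides the passengers into classes 1, 2, and 3. The goal here is to compute the average fare for each class.
  • The expression titanic_survival["fare"][titanic_survival["pclass"] == pclass] selects the fare values of the rows in titanic_survival whose class is pclass.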
passenger_classes = [1, 2, 3]
fares_by_class = {}
for pclass in passenger_classes:
    # Select the fares of the passengers in this class, then average them
    pclass_fares = titanic_survival["fare"][titanic_survival["pclass"] == pclass]
    fare_for_class = pclass_fares.mean()
    fares_by_class[pclass] = fare_for_class
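The loop above can also be written with `groupby`, which the original text does not use; it is shown here only as an alternative sketch. The toy DataFrame and its fares are made up.

```python
import pandas as pd

# Made-up fares for a handful of passengers.
toy = pd.DataFrame({
    "pclass": [1, 1, 2, 3, 3],
    "fare": [100.0, 80.0, 20.0, 8.0, 6.0],
})

# Group the rows by class, average the fares in each group,
# and convert the result to a plain dict keyed by class.
fares_by_class = toy.groupby("pclass")["fare"].mean().to_dict()

print(fares_by_class)
```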

Making Pivot Tables

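  • The pivot_table function computes the survival rate for each passenger class. pivot_table is an aggregation function: it groups the rows by index, takes the survived column as the values to aggregate, and aggregates them with the mean.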
import pandas as pd
import numpy as np
passenger_survival = titanic_survival.pivot_table(index="pclass", values="survived", aggfunc=np.mean)

# First class passengers had a much higher survival chance
print(passenger_survival)
'''
pclass
1         0.619195
2         0.429603
3         0.255289
Name: survived, dtype: float64
'''

# The same call with values="age" gives the mean age for each class
passenger_age = titanic_survival.pivot_table(index="pclass", values="age", aggfunc=np.mean)
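To see what pivot_table does with index, values, and aggfunc, here is a sketch on a made-up table small enough to check by hand. It passes the string "mean" as aggfunc, which recent pandas versions prefer over np.mean, and compares the result with the equivalent groupby call.

```python
import pandas as pd

# Made-up data: 2 first-class, 2 second-class, 4 third-class passengers.
toy = pd.DataFrame({
    "pclass":   [1, 1, 2, 2, 3, 3, 3, 3],
    "survived": [1, 1, 1, 0, 0, 0, 1, 0],
})

# Group by pclass and average the 0/1 survived flag, which gives a survival rate.
survival = toy.pivot_table(index="pclass", values="survived", aggfunc="mean")

# The same numbers via groupby, for comparison.
survival_groupby = toy.groupby("pclass")["survived"].mean()

print(survival)
```

Averaging a 0/1 column yields the fraction of ones, which is why the mean of survived can be read as a survival rate.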

More Complex Pivot Tables

import numpy as np

# This will compute the mean survival chance and the mean age for each passenger class
passenger_survival = titanic_survival.pivot_table(index="pclass", values=["age", "survived"], aggfunc=np.mean)
print(passenger_survival)
'''
              age  survived
pclass
1       39.159918  0.619195
2       29.506705  0.429603
3       24.816367  0.255289
'''

# Group by port of embarkation instead, averaging age, survival, and fare
port_stats = titanic_survival.pivot_table(index="embarked", values=["age", "survived", "fare"], aggfunc=np.mean)
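When values is a list, each listed column is aggregated separately, and missing entries are skipped column by column. A minimal sketch with made-up numbers:

```python
import numpy as np
import pandas as pd

# Made-up data; one first-class age is missing.
toy = pd.DataFrame({
    "pclass":   [1, 1, 3, 3],
    "age":      [40.0, np.nan, 20.0, 30.0],
    "survived": [1, 1, 0, 1],
})

stats = toy.pivot_table(index="pclass", values=["age", "survived"], aggfunc="mean")

# Class 1: mean age uses only the known age (40.0); survival rate uses both rows.
print(stats)
```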

Drop Missing Values

import pandas as pd

# Drop all rows that have missing values
new_titanic_survival = titanic_survival.dropna()

# It looks like we have an empty dataframe now.
# This is because every row has at least one missing value.
print(new_titanic_survival)
'''
Empty DataFrame
Columns: [pclass, survived, name, sex, age, sibsp, parch, ticket, fare, cabin, embarked, boat, body, home.dest]
Index: []
'''

# We can also use the axis argument to drop columns that have missing values
new_titanic_survival = titanic_survival.dropna(axis=1)
print(new_titanic_survival)
'''
Empty DataFrame
Columns: []
Index: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, ...]
'''
# We can use the subset argument to only drop rows if certain columns have missing values.
# This drops all rows where "age" or "sex" is missing.
new_titanic_survival = titanic_survival.dropna(subset=["age", "sex"])
# This drops all rows where "age", "body", or "home.dest" is missing.
new_titanic_survival = titanic_survival.dropna(subset=["age", "body", "home.dest"])
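The three dropna variants are easier to compare on a table where not every row has a gap. The sketch below uses a made-up three-row DataFrame; the names and values are illustrative.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "name": ["A", "B", "C"],
    "age":  [29.0, np.nan, 2.0],
    "body": [np.nan, np.nan, 135.0],
})

complete_rows = toy.dropna()                 # keep rows with no missing value at all
complete_cols = toy.dropna(axis=1)           # keep columns with no missing value at all
known_age     = toy.dropna(subset=["age"])   # only look at "age" when deciding

print(complete_rows.index.tolist())
print(complete_cols.columns.tolist())
print(known_age.index.tolist())
```

Note that dropna returns a new DataFrame each time and leaves toy unchanged.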

Reindex Rows

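After rows with missing values are dropped, the remaining rows keep their original index labels, so the index has gaps. reset_index renumbers the rows from 0. With drop=True the old index is discarded; with drop=False (the default) it is kept as a new column. Skip this step if you need to keep the original row labels.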

# The indexes are the original numbers from titanic_survival
new_titanic_survival = titanic_survival.dropna(subset=["body"])
print(new_titanic_survival)

# Reset the index to an integer sequence, starting at 0.
# The drop keyword argument specifies whether or not to make a dataframe column with the index values.
# If True, it won't, if False, it will.
# We'll almost always want to set it to True.
new_titanic_survival = new_titanic_survival.reset_index(drop=True)
# Now we have indexes starting from 0!
print(new_titanic_survival)
new_titanic_survival = titanic_survival.dropna(subset=["age", "boat"])
titanic_reindexed = new_titanic_survival.reset_index(drop=True)
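The effect of the drop argument can be seen on a small made-up DataFrame. This is a sketch; the column names mirror the Titanic data but the values are invented.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "name": ["A", "B", "C", "D"],
    "body": [np.nan, 135.0, np.nan, 22.0],
})

kept = toy.dropna(subset=["body"])           # rows 1 and 3 survive, labels unchanged
renumbered = kept.reset_index(drop=True)     # labels become 0 and 1, old labels discarded
with_old = kept.reset_index(drop=False)      # old labels are saved in a column named "index"

print(kept.index.tolist())
print(renumbered.index.tolist())
print(with_old.columns.tolist())
```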

apply

import pandas as pd

def not_null_count(column):
    column_null = pd.isnull(column)
    not_null = column[column_null == False]
    return len(not_null)  # number of non-null elements

# With the default axis=0, apply calls the function once per column
column_not_null_count = titanic_survival.apply(not_null_count)

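# Determine whether each passenger is a minor (under 18)
# A missing age compares as False, so it is reported as not a minor
def is_minor(row):
    if row["age"] < 18:
        return True
    else:
        return False
minors = titanic_survival.apply(is_minor, axis=1)  # axis=1 applies the function to each row

# Label each passenger by age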
def generate_age_label(row):
    age = row["age"]
    if pd.isnull(age):
        return "unknown"
    elif age < 18:
        return "minor"
    else:
        return "adult"

age_labels = titanic_survival.apply(generate_age_label, axis=1)
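The two axis settings of apply can be compared side by side on a made-up DataFrame. The sketch also shows `DataFrame.count()`, a built-in that gives the same per-column result as the not_null_count helper, and the vectorized comparison that matches is_minor.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "age":  [5.0, np.nan, 40.0],
    "fare": [10.0, 20.0, np.nan],
})

def not_null_count(values):
    # Number of entries that are not missing
    return int(values.notnull().sum())

per_column = toy.apply(not_null_count)        # axis=0 (default): one call per column
per_row = toy.apply(not_null_count, axis=1)   # axis=1: one call per row

builtin_count = toy.count()                   # same per-column counts without apply
minors = toy["age"] < 18                      # NaN < 18 is False, so unknown ages are not minors

print(per_column.to_dict(), per_row.tolist(), minors.tolist())
```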

Computing Survival Percentage By Age Group

  • The most sensible approach is the pivot_table function introduced earlier, with the age labels as the index:
import numpy as np
age_group_survival = titanic_survival.pivot_table(index=age_labels, values=["survived"], aggfunc=np.mean)
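The whole chain, from labelling to pivoting, can be sketched end to end on made-up data. One difference from the call above is an assumption of this sketch: the labels are stored in a hypothetical age_label column and the pivot is done on that column name, rather than passing the label Series directly.

```python
import numpy as np
import pandas as pd

# Made-up passengers: two minors, two adults, one unknown age.
toy = pd.DataFrame({
    "age":      [5.0, 10.0, 30.0, 40.0, np.nan],
    "survived": [1, 1, 1, 0, 0],
})

def generate_age_label(row):
    age = row["age"]
    if pd.isnull(age):
        return "unknown"
    elif age < 18:
        return "minor"
    else:
        return "adult"

# "age_label" is a hypothetical column name used only in this sketch.
toy["age_label"] = toy.apply(generate_age_label, axis=1)

age_group_survival = toy.pivot_table(index="age_label", values="survived", aggfunc="mean")
print(age_group_survival)
```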
