pandas Data Analysis Module (Part 2)

I. Background

pandas.isnull(series) returns a boolean Series marking which elements are missing.
It is equivalent to series.isnull().
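
As a minimal sketch of the equivalence above, the snippet below calls both forms on a small Series. The sample values are made up for illustration and are not taken from titanic_train.csv.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data (not from titanic_train.csv)
age = pd.Series([22.0, np.nan, 38.0, np.nan, 35.0])

mask_func = pd.isnull(age)   # top-level function form
mask_method = age.isnull()   # Series method form

print(mask_method.tolist())
print(mask_func.equals(mask_method))  # both forms produce the same mask
```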

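Both DataFrame and Series can be indexed with a boolean Series; only the entries where the mask is True are kept.
Comparing a boolean Series with True/False (for example mask == False) yields another boolean Series, which can be used to filter data.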
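
The filtering described above could look roughly like this. The DataFrame is a made-up stand-in for the Titanic data, and the names df, known_ages and known_rows are only illustrative.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the Titanic data
df = pd.DataFrame({
    "Name": ["A", "B", "C", "D"],
    "Age": [22.0, np.nan, 38.0, np.nan],
})

age_is_null = df["Age"].isnull()

# Count missing values by selecting them with the mask
missing_count = len(df["Age"][age_is_null])

# Keep only the non-missing ages (comparison with False inverts the mask)
known_ages = df["Age"][age_is_null == False]

# The same mask filters whole rows of the DataFrame; ~ is the usual inversion
known_rows = df[~age_is_null]

print(missing_count)
print(known_rows)
```

Writing ~age_is_null is the more common spelling of age_is_null == False; both give the same mask.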

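Any arithmetic involving NaN produces NaN, so when a sum or mean of a column is computed by hand (for example with Python's built-in sum()), the missing values must be filtered out first.
The pandas .mean() method skips missing values automatically before averaging.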
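
A small sketch of the difference, using made-up values: the hand-written average is poisoned by a single NaN, while filtering first or calling .mean() gives the expected result.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data
age = pd.Series([20.0, np.nan, 40.0])

# Built-in sum() propagates NaN, so the naive mean is NaN
naive_mean = sum(age) / len(age)

# Filter out missing values first, then average by hand
filtered = age[age.isnull() == False]
manual_mean = sum(filtered) / len(filtered)

# Series.mean() skips missing values on its own
auto_mean = age.mean()

print(naive_mean, manual_mean, auto_mean)
```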

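Use DataFrame.pivot_table() to group and aggregate:
my_dataframe.pivot_table(index="column1", values="column2" or ["column2", "column3", ...], aggfunc=np.<aggregation function>)
Note: index names the column to group by; values names the column(s) to aggregate; aggfunc names the NumPy function used for the aggregation.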
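
The call pattern above can be sketched on a tiny made-up table. One deliberate difference from the notes: the aggregation is passed as a string ("mean", "sum") rather than np.mean / np.sum, since recent pandas versions emit a deprecation warning for the NumPy callables; treat that substitution as an assumption about your pandas version.

```python
import pandas as pd

# Hypothetical stand-in for the Titanic data
df = pd.DataFrame({
    "Pclass": [1, 1, 2, 3, 3],
    "Fare": [80.0, 60.0, 20.0, 8.0, 10.0],
    "Survived": [1, 0, 1, 0, 1],
})

# Group by Pclass and average Fare within each group
fares_by_class = df.pivot_table(index="Pclass", values="Fare", aggfunc="mean")

# Aggregate several columns at once with the same function
stats_by_class = df.pivot_table(index="Pclass", values=["Fare", "Survived"], aggfunc="sum")

print(fares_by_class)
print(stats_by_class)
```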

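my_dataframe.dropna(axis=1) or my_dataframe.dropna(axis='columns') drops every column that contains a missing value
my_dataframe.dropna(axis=0, subset=['column1','column2',...]) drops every row that has a missing value in any of the listed columns
Note: both return a new DataFrame; the original DataFrame is left unchanged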
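
A minimal sketch of both dropna() forms on made-up data, also checking that the original DataFrame keeps its shape:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the Titanic data
df = pd.DataFrame({
    "Name": ["A", "B", "C"],
    "Age": [22.0, np.nan, 38.0],
    "Cabin": [np.nan, "C85", np.nan],
})

# Drop every column that has at least one missing value
no_null_columns = df.dropna(axis=1)

# Drop only the rows where "Age" is missing; "Cabin" is not considered
rows_with_age = df.dropna(axis=0, subset=["Age"])

print(no_null_columns.shape, rows_with_age.shape, df.shape)
```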

my_dataframe.loc[row label, column label] returns the value at that position
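
To make the label-based lookup concrete, the sketch below uses a made-up DataFrame whose index labels (10, 20, 30) deliberately differ from the row positions, and contrasts .loc with the position-based .iloc.

```python
import pandas as pd

# Hypothetical data with non-default index labels
df = pd.DataFrame(
    {"Age": [22.0, 38.0, 26.0], "Pclass": [3, 1, 3]},
    index=[10, 20, 30],
)

# .loc looks up by row label and column label
age_at_label_20 = df.loc[20, "Age"]

# .iloc looks up by integer position instead (row 0, column 1)
pclass_at_pos_0 = df.iloc[0, 1]

print(age_at_label_20, pclass_at_pos_0)
```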

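my_dataframe.sort_values('column', inplace=False, ascending=True)  # defaults shown: returns a new DataFrame sorted in ascending order
my_dataframe.reset_index(drop=True)  # resets the index after sorting; drop=True discards the old index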
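
A short sketch of why reset_index() is useful after sorting: sort_values() keeps the original index labels attached to each row, so they end up out of order. The data is made up for illustration.

```python
import pandas as pd

# Hypothetical stand-in for the Titanic data
df = pd.DataFrame({"Name": ["A", "B", "C"], "Age": [22.0, 80.0, 38.0]})

# Returns a new DataFrame sorted by Age, oldest first; df itself is unchanged
by_age_desc = df.sort_values("Age", ascending=False)
print(by_age_desc.index.tolist())   # original labels, now shuffled

# Renumber the rows 0..n-1 and discard the old labels
reindexed = by_age_desc.reset_index(drop=True)
print(reindexed)
```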

II. Code examples (environment: Python 2.7)

import pandas as pd
import numpy as np

titanic_survival = pd.read_csv('titanic_train.csv')


# =============================================================================
# age = titanic_survival["Age"]
# # age_is_null = pd.isnull(age)
# age_is_null = age.isnull()
# 
# # Count the missing values using the boolean Series
# age_null_true = age[age_is_null]
# print(len(age_null_true))
# 
# # Filter out the missing values
# age_null_false = age[age_is_null == False]
# print(age_null_false)
# print(titanic_survival.loc[age_is_null == False])
# =============================================================================


# =============================================================================
# # Compute the mean age
# age_null = titanic_survival["Age"].isnull()
# age = titanic_survival["Age"][age_null == False]
# mean_age = sum(age) / len(age)
# print(mean_age)
# 
# mean_age = titanic_survival["Age"].mean()
# print(mean_age)
# =============================================================================


# =============================================================================
# # Passenger classes
# passenger_classes = [1, 2, 3]
# # Mean fare per passenger class: group by class, then aggregate with the mean of each group
# fares_by_class = {}
# for this_class in passenger_classes:
#     pclass_rows = titanic_survival[titanic_survival["Pclass"] == this_class]
#     pclass_fares = pclass_rows["Fare"]
#     fare_for_class = pclass_fares.mean()
#     fares_by_class[this_class] = fare_for_class
# print(fares_by_class)
# 
# # Build the same result as a pivot table with DataFrame.pivot_table()
# fares_by_class = titanic_survival.pivot_table(index="Pclass", values="Fare", aggfunc=np.mean)
# # The return type is a DataFrame
# print(fares_by_class)
# 
# # Apply the same aggregation to several columns
# port_stats = titanic_survival.pivot_table(index="Embarked", values=["Survived","Fare"], aggfunc=np.sum)
# print(port_stats)
# =============================================================================


# =============================================================================
# drop_na_columns = titanic_survival.dropna(axis=1)
# print(drop_na_columns.shape)
# print(titanic_survival.shape)
# # Returns a new DataFrame; the original DataFrame is not modified
# new_titanic_survival = titanic_survival.dropna(axis=0, subset=["Age","Sex"])
# print(new_titanic_survival.shape)
# print(titanic_survival.shape)
# =============================================================================


# =============================================================================
# print(titanic_survival.loc[0,"Age"])
# print(titanic_survival.loc[100,"Pclass"])
# =============================================================================


# =============================================================================
# # sort_values() returns a new DataFrame by default; with inplace=True it sorts in place
# new_titanic_survival = titanic_survival.sort_values("Age", ascending=False)
# print(new_titanic_survival.iloc[0:3])
# reindex_titanic = new_titanic_survival.reset_index(drop=True)
# print(reindex_titanic[0:3])
# =============================================================================


def hundredth_row(column):
    hundredth_item = column.iloc[99]
    return hundredth_item

# Apply hundredth_row() to every column of the DataFrame.
# The result gets its own name so the function is not overwritten.
hundredth_values = titanic_survival.apply(hundredth_row)
print(hundredth_values)
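
For readers without the CSV at hand, the apply() pattern above can be tried on a small made-up DataFrame. The helper second_row is hypothetical and mirrors hundredth_row with a smaller position.

```python
import pandas as pd

# Hypothetical stand-in for the Titanic data
df = pd.DataFrame({"Age": [22.0, 38.0, 26.0], "Fare": [7.25, 71.28, 7.92]})

def second_row(column):
    # Each column arrives as a Series; return its item at position 1
    return column.iloc[1]

# By default apply() calls the function once per column
second_values = df.apply(second_row)
print(second_values)
```

The result is a Series indexed by column name, with one value per column.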

CSV file download link (Baidu Netdisk): https://pan.baidu.com/s/1jAOOXCobDSeBZc3h3qGybQ
