Pandas 1.x Cookbook (2nd Edition), Chapter 02: Essential DataFrame Operations

Chapter 01: Pandas Foundations
Chapter 02: Essential DataFrame Operations
Chapter 03: Creating and Persisting DataFrames
Chapter 04: Beginning Data Analysis
Chapter 05: Exploratory Data Analysis
Chapter 06: Selecting Subsets of Data
Chapter 07: Filtering Rows
Chapter 08: Index Alignment


2.1 Selecting Multiple DataFrame Columns

Pass a list of column names to select multiple columns from a DataFrame:

>>> import pandas as pd
>>> import numpy as np
>>> movies = pd.read_csv("data/movie.csv")
>>> movie_actor_director = movies[
...     [
...         "actor_1_name",
...         "actor_2_name",
...         "actor_3_name",
...         "director_name",
...     ]
... ]
>>> movie_actor_director.head()
  actor_1_name actor_2_name actor_3_name director_name
0  CCH Pounder  Joel Dav...    Wes Studi  James Ca...
1  Johnny Depp  Orlando ...  Jack Dav...  Gore Ver...
2  Christop...  Rory Kin...  Stephani...   Sam Mendes
3    Tom Hardy  Christia...  Joseph G...  Christop...
4  Doug Walker   Rob Walker          NaN  Doug Walker
# Selecting a single column with a list returns a different type
# than selecting it by name.
>>> type(movies[["director_name"]])
<class 'pandas.core.frame.DataFrame'>
>>> type(movies["director_name"])
<class 'pandas.core.series.Series'>

You can also use .loc to select columns:

>>> type(movies.loc[:, ["director_name"]])
<class 'pandas.core.frame.DataFrame'>
>>> type(movies.loc[:, "director_name"])
<class 'pandas.core.series.Series'>

Storing the column names in a list beforehand improves readability:

>>> cols = [
...     "actor_1_name",
...     "actor_2_name",
...     "actor_3_name",
...     "director_name",
... ]
>>> movie_actor_director = movies[cols]

Omitting the list raises a KeyError:

>>> movies[
...     "actor_1_name",
...     "actor_2_name",
...     "actor_3_name",
...     "director_name",
... ]
Traceback (most recent call last):
  ...
KeyError: ('actor_1_name', 'actor_2_name', 'actor_3_name', 'director_name')
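A minimal sketch (using a made-up two-column frame, not the movie data) of why the list matters: without the inner brackets, Python packs the names into a single tuple, which pandas looks up as one nonexistent column label:

```python
import pandas as pd

df = pd.DataFrame({"a": [1], "b": [2]})  # hypothetical stand-in frame

try:
    df["a", "b"]  # a single tuple key, not two columns
except KeyError as e:
    print(e)      # ('a', 'b')

print(df[["a", "b"]].shape)  # (1, 2): the list selects both columns
```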

2.2 Selecting Columns with Methods

After shortening the column names, count how many columns there are of each data type:

>>> movies = pd.read_csv("data/movie.csv")
>>> def shorten(col):
...     return (
...         str(col)
...         .replace("facebook_likes", "fb")
...         .replace("_for_reviews", "")
...     )
>>> movies = movies.rename(columns=shorten)
>>> movies.dtypes.value_counts()
float64    13
int64       3
object     12
dtype: int64

Use the .select_dtypes method to select only the integer columns:

>>> movies.select_dtypes(include="int").head()
   num_voted_users  cast_total_fb  movie_fb
0           886204           4834     33000
1           471220          48350         0
2           275868          11700     85000
3          1144337         106759    164000
4                8            143         0

Select all of the numeric columns:

>>> movies.select_dtypes(include="number").head()
   num_critics  duration  ...  aspect_ratio  movie_fb
0        723.0     178.0  ...         1.78      33000
1        302.0     169.0  ...         2.35          0
2        602.0     148.0  ...         2.35      85000
3        813.0     164.0  ...         2.35     164000
4          NaN       NaN  ...          NaN          0

Select the integer and string columns:

>>> movies.select_dtypes(include=["int", "object"]).head()
   color      director_name  ...  content_rating movie_fb
0  Color      James Cameron  ...       PG-13    33000
1  Color     Gore Verbinski  ...       PG-13        0
2  Color         Sam Mendes  ...       PG-13    85000
3  Color  Christopher Nolan  ...       PG-13   164000
4    NaN        Doug Walker  ...         NaN        0

Exclude the floating-point columns:

>>> movies.select_dtypes(exclude="float").head()
   color director_name  ... content_rating movie_fb
0  Color  James Ca...   ...        PG-13      33000
1  Color  Gore Ver...   ...        PG-13          0
2  Color   Sam Mendes   ...        PG-13      85000
3  Color  Christop...   ...        PG-13     164000
4    NaN  Doug Walker   ...          NaN          0

Use the .filter method to select all columns whose names contain fb:

>>> movies.filter(like="fb").head()
   director_fb  actor_3_fb  ...  actor_2_fb  movie_fb
0          0.0       855.0  ...       936.0     33000
1        563.0      1000.0  ...      5000.0         0
2          0.0       161.0  ...       393.0     85000
3      22000.0     23000.0  ...     23000.0    164000
4        131.0         NaN  ...        12.0         0

The items parameter selects multiple columns by exact name:

>>> cols = [
...     "actor_1_name",
...     "actor_2_name",
...     "actor_3_name",
...     "director_name",
... ]
>>> movies.filter(items=cols).head()
      actor_1_name  ...      director_name
0      CCH Pounder  ...      James Cameron
1      Johnny Depp  ...     Gore Verbinski
2  Christoph Waltz  ...         Sam Mendes
3        Tom Hardy  ...  Christopher Nolan
4      Doug Walker  ...        Doug Walker

The regex parameter matches column names against a regular expression; the following selects every column whose name contains a digit:

>>> movies.filter(regex=r"\d").head()
   actor_3_fb actor_2_name  ...  actor_3_name actor_2_fb
0       855.0  Joel Dav...  ...    Wes Studi       936.0
1      1000.0  Orlando ...  ...  Jack Dav...      5000.0
2       161.0  Rory Kin...  ...  Stephani...       393.0
3     23000.0  Christia...  ...  Joseph G...     23000.0
4         NaN   Rob Walker  ...          NaN        12.0

2.3 Ordering Column Names

Guidelines for ordering columns:

  • Classify each column as categorical or continuous;
  • Group the columns by those two types;
  • Place the categorical columns before the continuous ones.

Here is an example. First read the data and shorten the column names:

>>> movies = pd.read_csv("data/movie.csv")
>>> def shorten(col):
...     return col.replace("facebook_likes", "fb").replace(
...         "_for_reviews", ""
...     )
>>> movies = movies.rename(columns=shorten)

Inspect the column names below and group them into categorical and continuous lists:

>>> movies.columns
Index(['color', 'director_name', 'num_critic', 'duration', 'director_fb',
       'actor_3_fb', 'actor_2_name', 'actor_1_fb', 'gross', 'genres',
       'actor_1_name', 'movie_title', 'num_voted_users', 'cast_total_fb',
       'actor_3_name', 'facenumber_in_poster', 'plot_keywords',
       'movie_imdb_link', 'num_user', 'language', 'country', 'content_rating',
       'budget', 'title_year', 'actor_2_fb', 'imdb_score', 'aspect_ratio',
       'movie_fb'],
      dtype='object')
>>> cat_core = [
...     "movie_title",
...     "title_year",
...     "content_rating",
...     "genres",
... ]
>>> cat_people = [
...     "director_name",
...     "actor_1_name",
...     "actor_2_name",
...     "actor_3_name",
... ]
>>> cat_other = [
...     "color",
...     "country",
...     "language",
...     "plot_keywords",
...     "movie_imdb_link",
... ]
>>> cont_fb = [
...     "director_fb",
...     "actor_1_fb",
...     "actor_2_fb",
...     "actor_3_fb",
...     "cast_total_fb",
...     "movie_fb",
... ]
>>> cont_finance = ["budget", "gross"]
>>> cont_num_reviews = [
...     "num_voted_users",
...     "num_user",
...     "num_critic",
... ]
>>> cont_other = [
...     "imdb_score",
...     "duration",
...     "aspect_ratio",
...     "facenumber_in_poster",
... ]

Concatenate all the lists into the final column order, and confirm that no column was missed:

>>> new_col_order = (
...     cat_core
...     + cat_people
...     + cat_other
...     + cont_fb
...     + cont_finance
...     + cont_num_reviews
...     + cont_other
... )
>>> set(movies.columns) == set(new_col_order)
True

Pass the new column order to movies to get the reordered DataFrame:

>>> movies[new_col_order].head()
   movie_title  title_year  ... aspect_ratio facenumber_in_poster
0       Avatar      2009.0  ...         1.78          0.0
1  Pirates ...      2007.0  ...         2.35          0.0
2      Spectre      2015.0  ...         2.35          1.0
3  The Dark...      2012.0  ...         2.35          0.0
4  Star War...         NaN  ...          NaN          0.0

2.4 Summarizing a DataFrame

Inspect the dataset's attributes: shape, size, and ndim.

>>> movies = pd.read_csv("data/movie.csv")
>>> movies.shape
(4916, 28)
>>> movies.size
137648
>>> movies.ndim
2

The .count method tallies the non-missing values in each column:

>>> movies.count()
color                      4897
director_name              4814
num_critic_for_reviews     4867
duration                   4901
director_facebook_likes    4814
                           ... 
title_year                 4810
actor_2_facebook_likes     4903
imdb_score                 4916
aspect_ratio               4590
movie_facebook_likes       4916
Length: 28, dtype: int64

The .min, .max, .mean, .median, and .std methods return statistics for the numeric columns:

>>> movies.min()
num_critic_for_reviews        1.00
duration                      7.00
director_facebook_likes       0.00
actor_3_facebook_likes        0.00
actor_1_facebook_likes        0.00
                            ...   
title_year                 1916.00
actor_2_facebook_likes        0.00
imdb_score                    1.60
aspect_ratio                  1.18
movie_facebook_likes          0.00
Length: 16, dtype: float64

The .describe method is very powerful: it returns descriptive statistics and quantiles in one call. To fit more information on screen, transpose the result with .T:

>>> movies.describe().T
               count         mean  ...       75%       max
num_criti...  4867.0   137.988905  ...    191.00     813.0
duration      4901.0   107.090798  ...    118.00     511.0
director_...  4814.0   691.014541  ...    189.75   23000.0
actor_3_f...  4893.0   631.276313  ...    633.00   23000.0
actor_1_f...  4909.0  6494.488491  ...  11000.00  640000.0
...              ...          ...  ...       ...       ...
title_year    4810.0  2002.447609  ...   2011.00    2016.0
actor_2_f...  4903.0  1621.923516  ...    912.00  137000.0
imdb_score    4916.0     6.437429  ...      7.20       9.5
aspect_ratio  4590.0     2.222349  ...      2.35      16.0
movie_fac...  4916.0  7348.294142  ...   2000.00  349000.0

The percentiles parameter of .describe returns arbitrary quantiles:

>>> movies.describe(percentiles=[0.01, 0.3, 0.99]).T
               count         mean  ...       99%       max
num_criti...  4867.0   137.988905  ...    546.68     813.0
duration      4901.0   107.090798  ...    189.00     511.0
director_...  4814.0   691.014541  ...  16000.00   23000.0
actor_3_f...  4893.0   631.276313  ...  11000.00   23000.0
actor_1_f...  4909.0  6494.488491  ...  44920.00  640000.0
...              ...          ...  ...       ...       ...
title_year    4810.0  2002.447609  ...   2016.00    2016.0
actor_2_f...  4903.0  1621.923516  ...  17000.00  137000.0
imdb_score    4916.0     6.437429  ...      8.50       9.5
aspect_ratio  4590.0     2.222349  ...      4.00      16.0
movie_fac...  4916.0  7348.294142  ...  93850.00  349000.0

If you set the skipna parameter to False in these statistical methods, any column containing a missing value returns NaN:

>>> movies.min(skipna=False)
num_critic_for_reviews     NaN
duration                   NaN
director_facebook_likes    NaN
actor_3_facebook_likes     NaN
actor_1_facebook_likes     NaN
                          ... 
title_year                 NaN
actor_2_facebook_likes     NaN
imdb_score                 1.6
aspect_ratio               NaN
movie_facebook_likes       0.0
Length: 16, dtype: float64

2.5 Chaining DataFrame Methods

Use the .isnull method to test whether each value is missing:

>>> movies = pd.read_csv("data/movie.csv")
>>> def shorten(col):
...     return col.replace("facebook_likes", "fb").replace(
...         "_for_reviews", ""
...     )
>>> movies = movies.rename(columns=shorten)
>>> movies.isnull().head()
   color  director_name  ...  aspect_ratio  movie_fb
0  False        False    ...        False      False
1  False        False    ...        False      False
2  False        False    ...        False      False
3  False        False    ...        False      False
4   True        False    ...         True      False

The .sum method adds up the True/False values, counting True as 1 and False as 0, which reveals how many missing values each column has:

>>> (movies.isnull().sum().head())
color             19
director_name    102
num_critic        49
duration          15
director_fb      102
dtype: int64

Going one step further, chaining a second .sum gives the grand total of missing values:

>>> movies.isnull().sum().sum()
2654

If you only want to know whether any value is missing, .any is more convenient:

>>> movies.isnull().any().any()
True

How it works

The .isnull method converts the original DataFrame into a same-sized DataFrame of booleans:

>>> movies.isnull().dtypes.value_counts()
bool    28
dtype: int64

There's more

If object columns contain missing values, aggregation methods (.min, .max, .sum) return an empty result:

>>> movies[["color", "movie_title", "color"]].max()
Series([], dtype: float64)

To get a non-empty result, fill in the missing values first:

>>> movies.select_dtypes(["object"]).fillna("").max()
color                            Color
director_name            Étienne Faure
actor_2_name             Zubaida Sahar
genres                         Western
actor_1_name             Óscar Jaenada
                          ...         
plot_keywords      zombie|zombie spoof
movie_imdb_link    http://www.imdb....
language                          Zulu
country                   West Germany
content_rating                       X
Length: 12, dtype: object

For readability, chained methods are usually wrapped in parentheses, which makes it easy to comment out or debug each step:

>>> (movies.select_dtypes(["object"]).fillna("").max())
color                            Color
director_name            Étienne Faure
actor_2_name             Zubaida Sahar
genres                         Western
actor_1_name             Óscar Jaenada
                          ...         
plot_keywords      zombie|zombie spoof
movie_imdb_link    http://www.imdb....
language                          Zulu
country                   West Germany
content_rating                       X
Length: 12, dtype: object

2.6 DataFrame Operations

A DataFrame's columns may be numeric or object; adding 5 directly raises a TypeError:

>>> colleges = pd.read_csv("data/college.csv")
>>> colleges + 5
Traceback (most recent call last):
  ...
TypeError: can only concatenate str (not "int") to str

Use .filter to select the columns whose names begin with 'UGDS_'; these hold the fraction of undergraduates by race:

>>> colleges = pd.read_csv(
...     "data/college.csv", index_col="INSTNM"
... )
>>> college_ugds = colleges.filter(like="UGDS_")
>>> college_ugds.head()
              UGDS_WHITE  UGDS_BLACK  ...  UGDS_NRA  UGDS_UNKN
INSTNM                                ...                     
Alabama A...      0.0333      0.9353  ...    0.0059     0.0138
Universit...      0.5922      0.2600  ...    0.0179     0.0100
Amridge U...      0.2990      0.4192  ...    0.0000     0.2715
Universit...      0.6988      0.1255  ...    0.0332     0.0350
Alabama S...      0.0158      0.9208  ...    0.0243     0.0137

Pandas uses banker's rounding ("round half to even": a 5 followed by more digits rounds up, while a bare 5 rounds to the nearest even digit). Watch what happens to UGDS_BLACK:

>>> name = "Northwest-Shoals Community College"
>>> college_ugds.loc[name]
UGDS_WHITE    0.7912
UGDS_BLACK    0.1250
UGDS_HISP     0.0339
UGDS_ASIAN    0.0036
UGDS_AIAN     0.0088
UGDS_NHPI     0.0006
UGDS_2MOR     0.0012
UGDS_NRA      0.0033
UGDS_UNKN     0.0324
Name: Northwest-Shoals Community College, dtype: float64
>>> college_ugds.loc[name].round(2)
UGDS_WHITE    0.79
UGDS_BLACK    0.12
UGDS_HISP     0.03
UGDS_ASIAN    0.00
UGDS_AIAN     0.01
UGDS_NHPI     0.00
UGDS_2MOR     0.00
UGDS_NRA      0.00
UGDS_UNKN     0.03
Name: Northwest-Shoals Community College, dtype: float64
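As a side demonstration (not from the book), Python's built-in round follows the same round-half-to-even rule, which is why 0.125 above becomes 0.12:

```python
# Bare halves round to the nearest even digit...
print(round(0.5), round(1.5), round(2.5))  # 0 2 2

# ...so 0.125 (exactly representable in binary as 1/8) rounds down to 0.12
print(round(0.125, 2))  # 0.12
```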

Now add 0.0001 before rounding and see the change:

>>> (college_ugds.loc[name] + 0.0001).round(2)
UGDS_WHITE    0.79
UGDS_BLACK    0.13
UGDS_HISP     0.03
UGDS_ASIAN    0.00
UGDS_AIAN     0.01
UGDS_NHPI     0.00
UGDS_2MOR     0.00
UGDS_NRA      0.00
UGDS_UNKN     0.03
Name: Northwest-Shoals Community College, dtype: float64

Before starting our rounding adventure, add 0.00501 to every value in college_ugds:

>>> college_ugds + 0.00501
              UGDS_WHITE  UGDS_BLACK  ...  UGDS_NRA  UGDS_UNKN
INSTNM                                ...                     
Alabama A...     0.03831     0.94031  ...   0.01091    0.01881
Universit...     0.59721     0.26501  ...   0.02291    0.01501
Amridge U...     0.30401     0.42421  ...   0.00501    0.27651
Universit...     0.70381     0.13051  ...   0.03821    0.04001
Alabama S...     0.02081     0.92581  ...   0.02931    0.01871
...                  ...         ...  ...       ...        ...
SAE Insti...         NaN         NaN  ...       NaN        NaN
Rasmussen...         NaN         NaN  ...       NaN        NaN

Use floor division // to round the values down to whole percentages:

>>> (college_ugds + 0.00501) // 0.01
              UGDS_WHITE  UGDS_BLACK  ...  UGDS_NRA  UGDS_UNKN
INSTNM                                ...                     
Alabama A...         3.0        94.0  ...       1.0        1.0
Universit...        59.0        26.0  ...       2.0        1.0
Amridge U...        30.0        42.0  ...       0.0       27.0
Universit...        70.0        13.0  ...       3.0        4.0
Alabama S...         2.0        92.0  ...       2.0        1.0
...                  ...         ...  ...       ...        ...
SAE Insti...         NaN         NaN  ...       NaN        NaN
Rasmussen...         NaN         NaN  ...       NaN        NaN
National ...         NaN         NaN  ...       NaN        NaN

Dividing by 100 brings the values back to proportions:

>>> college_ugds_op_round = (
...     (college_ugds + 0.00501) // 0.01 / 100
... )
>>> college_ugds_op_round.head()
              UGDS_WHITE  UGDS_BLACK  ...  UGDS_NRA  UGDS_UNKN
INSTNM                                ...                     
Alabama A...        0.03        0.94  ...      0.01       0.01
Universit...        0.59        0.26  ...      0.02       0.01
Amridge U...        0.30        0.42  ...      0.00       0.27
Universit...        0.70        0.13  ...      0.03       0.04
Alabama S...        0.02        0.92  ...      0.02       0.01

Now use the .round method; because of banker's rounding, add 0.00001 first:

>>> college_ugds_round = (college_ugds + 0.00001).round(2)

Check that the two results are equal:

>>> college_ugds_op_round.equals(college_ugds_round)
True

How it works

Floating-point arithmetic introduces representation error:

>>> 0.045 + 0.005
0.049999999999999996

After adding 0.00001, the sum becomes:

>>> 0.045 + 0.005 + 0.00001
0.05001
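As a side note (not from the book), the standard-library decimal module avoids this binary representation error entirely:

```python
from decimal import Decimal

# Binary floats cannot store 0.045 or 0.005 exactly, so the sum drifts:
print(0.045 + 0.005)                        # 0.049999999999999996

# Decimal arithmetic on the literal strings is exact:
print(Decimal("0.045") + Decimal("0.005"))  # 0.050
```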

There's more

Like Series, DataFrames have method equivalents of the operators:

>>> college2 = (
...     college_ugds.add(0.00501).floordiv(0.01).div(100)
... )
>>> college2.equals(college_ugds_op_round)
True 

2.7 Comparing Missing Values

Pandas uses np.nan to represent missing values. This object behaves uniquely; it is not even equal to itself:

>>> np.nan == np.nan
False
>>> None == None
True
>>> np.nan > 5
False
>>> 5 > np.nan
False
>>> np.nan != 5
True

First load the data:

>>> college = pd.read_csv(
...     "data/college.csv", index_col="INSTNM"
... )
>>> college_ugds = college.filter(like="UGDS_")

The == operator compares every element of the DataFrame:

>>> college_ugds == 0.0019
              UGDS_WHITE  UGDS_BLACK  ...  UGDS_NRA  UGDS_UNKN
INSTNM                                ...                     
Alabama A...       False       False  ...     False      False
Universit...       False       False  ...     False      False

However, if the DataFrame contains missing values, == misbehaves:

>>> college_self_compare = college_ugds == college_ugds
>>> college_self_compare.head()
              UGDS_WHITE  UGDS_BLACK  ...  UGDS_NRA  UGDS_UNKN
INSTNM                                ...
Alabama A...        True        True  ...      True       True
Universit...        True        True  ...      True       True
Amridge U...        True        True  ...      True       True
Universit...        True        True  ...      True       True
Alabama S...        True        True  ...      True       True

At first glance everything looks fine, but the .all method exposes the problem:

>>> college_self_compare.all()
UGDS_WHITE    False
UGDS_BLACK    False
UGDS_HISP     False
UGDS_ASIAN    False
UGDS_AIAN     False
UGDS_NHPI     False
UGDS_2MOR     False
UGDS_NRA      False
UGDS_UNKN     False
dtype: bool

This is because missing values never compare equal to each other. If you test for missing values with == np.nan, as below, you get 0:

>>> (college_ugds == np.nan).sum()
UGDS_WHITE    0
UGDS_BLACK    0
UGDS_HISP     0
UGDS_ASIAN    0
UGDS_AIAN     0
UGDS_NHPI     0
UGDS_2MOR     0
UGDS_NRA      0
UGDS_UNKN     0
dtype: int64

The correct way to count missing values is .isna:

>>> college_ugds.isna().sum()
UGDS_WHITE    661
UGDS_BLACK    661
UGDS_HISP     661
UGDS_ASIAN    661
UGDS_AIAN     661
UGDS_NHPI     661
UGDS_2MOR     661
UGDS_NRA      661
UGDS_UNKN     661
dtype: int64

The correct way to compare two DataFrames is the .equals method, which treats missing values in the same position as equal:

>>> college_ugds.equals(college_ugds)
True

There's more

The .eq method is equivalent to ==:

>>> college_ugds.eq(0.0019)  # same as college_ugds == .0019
              UGDS_WHITE  UGDS_BLACK  ...  UGDS_NRA  UGDS_UNKN
INSTNM                                ...                     
Alabama A...       False       False  ...     False      False
Universit...       False       False  ...     False      False
Amridge U...       False       False  ...     False      False
Universit...       False       False  ...     False      False
Alabama S...       False       False  ...     False      False
...                  ...         ...  ...       ...        ...
SAE Insti...       False       False  ...     False      False
Rasmussen...       False       False  ...     False      False
National ...       False       False  ...     False      False
Bay Area ...       False       False  ...     False      False
Excel Lea...       False       False  ...     False      False

The pandas.testing module provides assert_frame_equal, which tests whether two DataFrames are the same: it raises an AssertionError when they differ and returns None when they match:

>>> from pandas.testing import assert_frame_equal
>>> assert_frame_equal(college_ugds, college_ugds) is None
True

2.8 Transposing the Direction of a DataFrame Operation

Many DataFrame methods take an axis parameter that controls the direction of the operation. axis can be 'index' (or 0) or 'columns' (or 1); the string form is clearer and recommended.
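The two directions can be illustrated on a tiny made-up frame:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})

print(df.sum(axis="index"))    # down each column: a -> 3, b -> 7
print(df.sum(axis="columns"))  # across each row:  0 -> 4, 1 -> 6
```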

Read and filter the data:

>>> college = pd.read_csv(
...     "data/college.csv", index_col="INSTNM"
... )
>>> college_ugds = college.filter(like="UGDS_")
>>> college_ugds.head()
              UGDS_WHITE  UGDS_BLACK  ...  UGDS_NRA  UGDS_UNKN
INSTNM                                ...                     
Alabama A...      0.0333      0.9353  ...    0.0059     0.0138
Universit...      0.5922      0.2600  ...    0.0179     0.0100
Amridge U...      0.2990      0.4192  ...    0.0000     0.2715
Universit...      0.6988      0.1255  ...    0.0332     0.0350

The .count method defaults to axis=0; the call below is equivalent to college_ugds.count(axis=0) or college_ugds.count(axis='index'):

>>> college_ugds.count()
UGDS_WHITE    6874
UGDS_BLACK    6874
UGDS_HISP     6874
UGDS_ASIAN    6874
UGDS_AIAN     6874
UGDS_NHPI     6874
UGDS_2MOR     6874
UGDS_NRA      6874
UGDS_UNKN     6874
dtype: int64

Changing axis to columns counts the non-missing values in each row:

>>> college_ugds.count(axis="columns").head()
INSTNM
Alabama A & M University               9
University of Alabama at Birmingham    9
Amridge University                     9
University of Alabama in Huntsville    9
Alabama State University               9
dtype: int64

Sum each row to check whether the fractions add up to 100 percent:

>>> college_ugds.sum(axis="columns").head()
INSTNM
Alabama A & M University               1.0000
University of Alabama at Birmingham    0.9999
Amridge University                     1.0000
University of Alabama in Huntsville    1.0000
Alabama State University               1.0000
dtype: float64

Compute the median of each column:

>>> college_ugds.median(axis="index")
UGDS_WHITE    0.55570
UGDS_BLACK    0.10005
UGDS_HISP     0.07140
UGDS_ASIAN    0.01290
UGDS_AIAN     0.00260
UGDS_NHPI     0.00000
UGDS_2MOR     0.01750
UGDS_NRA      0.00000
UGDS_UNKN     0.01430
dtype: float64

There's more

Using cumsum along the columns accumulates the race percentages, offering another view of the white and black proportions:

>>> college_ugds_cumsum = college_ugds.cumsum(axis=1)
>>> college_ugds_cumsum.head()
              UGDS_WHITE  UGDS_BLACK  ...  UGDS_NRA  UGDS_UNKN
INSTNM                                ...                     
Alabama A...      0.0333      0.9686  ...    0.9862     1.0000
Universit...      0.5922      0.8522  ...    0.9899     0.9999
Amridge U...      0.2990      0.7182  ...    0.7285     1.0000
Universit...      0.6988      0.8243  ...    0.9650     1.0000
Alabama S...      0.0158      0.9366  ...    0.9863     1.0000
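A self-contained sketch of the same idea, using a made-up stand-in for college_ugds (the proportions are chosen as exact binary fractions so the row totals land on 1.0 exactly):

```python
import pandas as pd

# Hypothetical proportions; each row sums to 1.0
df = pd.DataFrame(
    {
        "UGDS_WHITE": [0.5, 0.25],
        "UGDS_BLACK": [0.25, 0.5],
        "UGDS_HISP": [0.25, 0.25],
    }
)

# Accumulate across each row: the last column reaches the row total
print(df.cumsum(axis="columns"))
```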

2.9 Determining College Campus Diversity

The US News top 10 schools by diversity index are as follows:

>>> pd.read_csv(
...     "data/college_diversity.csv", index_col="School"
... )
                                                   Diversity Index
School
Rutgers University--Newark  Newark, NJ                        0.76
Andrews University  Berrien Springs, MI                       0.74
Stanford University  Stanford, CA                             0.74
University of Houston  Houston, TX                            0.74
University of Nevada--Las Vegas  Las Vegas, NV                0.74
University of San Francisco  San Francisco, CA                0.74
San Francisco State University  San Francisco, CA             0.73
University of Illinois--Chicago  Chicago, IL                  0.73
New Jersey Institute of Technology  Newark, NJ                0.72
Texas Woman's University  Denton, TX                          0.72

For our own analysis, first read the data:

>>> college = pd.read_csv(
...     "data/college.csv", index_col="INSTNM"
... )
>>> college_ugds = college.filter(like="UGDS_")

Count the missing values in each row and sort in descending order:

>>> (
...     college_ugds.isnull()
...     .sum(axis="columns")
...     .sort_values(ascending=False)
...     .head()
... )
INSTNM
Excel Learning Center-San Antonio South         9
Philadelphia College of Osteopathic Medicine    9
Assemblies of God Theological Seminary          9
Episcopal Divinity School                       9
Phillips Graduate Institute                     9
dtype: int64

Use .dropna to drop the rows in which all nine columns are missing:

>>> college_ugds = college_ugds.dropna(how="all")
>>> college_ugds.isnull().sum()
UGDS_WHITE    0
UGDS_BLACK    0
UGDS_HISP     0
UGDS_ASIAN    0
UGDS_AIAN     0
UGDS_NHPI     0
UGDS_2MOR     0
UGDS_NRA      0
UGDS_UNKN     0
dtype: int64

Test whether each race's proportion is at least 15%:

>>> college_ugds.ge(0.15)
              UGDS_WHITE  UGDS_BLACK  ...  UGDS_NRA  UGDS_UNKN
INSTNM                                ...                     
Alabama A...       False        True  ...     False      False
Universit...        True        True  ...     False      False
Amridge U...        True        True  ...     False       True
Universit...        True       False  ...     False      False
Alabama S...       False        True  ...     False      False
...                  ...         ...  ...       ...        ...
Hollywood...        True        True  ...     False      False
Hollywood...       False        True  ...     False      False
Coachella...        True       False  ...     False      False
Dewey Uni...       False       False  ...     False      False
Coastal P...        True        True  ...     False      False

Use .sum to count, for each school, how many races are at or above 15%:

>>> diversity_metric = college_ugds.ge(0.15).sum(
...     axis="columns"
... )
>>> diversity_metric.head()
INSTNM
Alabama A & M University               1
University of Alabama at Birmingham    2
Amridge University                     3
University of Alabama in Huntsville    1
Alabama State University               1
dtype: int64

Use .value_counts to see how this series is distributed:

>>> diversity_metric.value_counts()
1    3042
2    2884
3     876
4      63
0       7
5       2
dtype: int64

Surprisingly, two schools have five races at 15% or more. Sort diversity_metric to find them:

>>> diversity_metric.sort_values(ascending=False).head()
INSTNM
Regency Beauty Institute-Austin          5
Central Texas Beauty College-Temple      5
Sullivan and Cogliano Training Center    4
Ambria College of Nursing                4
Berkeley College-New York                4
dtype: int64

Look at the two top-ranked schools:

>>> college_ugds.loc[
...     [
...         "Regency Beauty Institute-Austin",
...         "Central Texas Beauty College-Temple",
...     ]
... ]
              UGDS_WHITE  UGDS_BLACK  ...  UGDS_NRA  UGDS_UNKN
INSTNM                                ...
Regency B...      0.1867      0.2133  ...       0.0     0.2667
Central T...      0.1616      0.2323  ...       0.0     0.1515

Now see how US News's top five schools score on this metric:

>>> us_news_top = [
...     "Rutgers University-Newark",
...     "Andrews University",
...     "Stanford University",
...     "University of Houston",
...     "University of Nevada-Las Vegas",
... ]
>>> diversity_metric.loc[us_news_top]
INSTNM
Rutgers University-Newark         4
Andrews University                3
Stanford University               3
University of Houston             3
University of Nevada-Las Vegas    3
dtype: int64

There's more

Look at the ten least diverse schools:

>>> (
...     college_ugds.max(axis=1)
...     .sort_values(ascending=False)
...     .head(10)
... )
INSTNM
Dewey University-Manati                               1.0
Yeshiva and Kollel Harbotzas Torah                    1.0
Mr Leon's School of Hair Design-Lewiston              1.0
Dewey University-Bayamon                              1.0
Shepherds Theological Seminary                        1.0
Yeshiva Gedolah Kesser Torah                          1.0
Monteclaro Escuela de Hoteleria y Artes Culinarias    1.0
Yeshiva Shaar Hatorah                                 1.0
Bais Medrash Elyon                                    1.0
Yeshiva of Nitra Rabbinical College                   1.0
dtype: float64

Is there any school where every race exceeds 1%?

>>> (college_ugds > 0.01).all(axis=1).any()
True

