Pandas 1.x Cookbook, 2nd Edition · Chapter 4: Beginning Data Analysis

Chapter 1: Pandas Foundations
Chapter 2: Essential DataFrame Operations
Chapter 3: Creating and Persisting DataFrames
Chapter 4: Beginning Data Analysis
Chapter 5: Exploratory Data Analysis
Chapter 6: Selecting Subsets of Data
Chapter 7: Filtering Rows
Chapter 8: Index Alignment


4.1 The Data Analysis Routine

Exploratory Data Analysis (EDA) typically does not involve building models; it focuses on summarizing a dataset's characteristics and visualizing them.

Read in the data and inspect a random row with the .sample method:

>>> import pandas as pd
>>> import numpy as np
>>> college = pd.read_csv("data/college.csv")
>>> college.sample(random_state=42)
           INSTNM         CITY  ... MD_EARN_WNE_P10  GRAD_DEBT_MDN_SUPP
3649  Career P...  San Antonio  ...        20700           14977

Check the dimensions of the dataset:

>>> college.shape
(7535, 27)

The .info method reports each column's data type, non-null count, and the memory usage:

>>> college.info()

RangeIndex: 7535 entries, 0 to 7534
Data columns (total 27 columns):
 #   Column              Non-Null Count  Dtype
---  ------              --------------  -----
 0   INSTNM              7535 non-null   object
 1   CITY                7535 non-null   object
 2   STABBR              7535 non-null   object
 3   HBCU                7164 non-null   float64
 4   MENONLY             7164 non-null   float64
 5   WOMENONLY           7164 non-null   float64
 6   RELAFFIL            7535 non-null   int64
 7   SATVRMID            1185 non-null   float64
 8   SATMTMID            1196 non-null   float64
 9   DISTANCEONLY        7164 non-null   float64
 10  UGDS                6874 non-null   float64
 11  UGDS_WHITE          6874 non-null   float64
 12  UGDS_BLACK          6874 non-null   float64
 13  UGDS_HISP           6874 non-null   float64
 14  UGDS_ASIAN          6874 non-null   float64
 15  UGDS_AIAN           6874 non-null   float64
 16  UGDS_NHPI           6874 non-null   float64
 17  UGDS_2MOR           6874 non-null   float64
 18  UGDS_NRA            6874 non-null   float64
 19  UGDS_UNKN           6874 non-null   float64
 20  PPTUG_EF            6853 non-null   float64
 21  CURROPER            7535 non-null   int64
 22  PCTPELL             6849 non-null   float64
 23  PCTFLOAN            6849 non-null   float64
 24  UG25ABV             6718 non-null   float64
 25  MD_EARN_WNE_P10     6413 non-null   object
 26  GRAD_DEBT_MDN_SUPP  7503 non-null   object
dtypes: float64(20), int64(2), object(5)
memory usage: 1.6+ MB
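The non-null counts that .info prints can also be computed directly, which is handy when you want them as data rather than as a printout. A minimal sketch on a toy frame (the column names only echo the college data; the values are made up):

```python
import numpy as np
import pandas as pd

# Toy frame: HBCU has one missing value, CITY has none.
df = pd.DataFrame({"HBCU": [0.0, np.nan, 1.0], "CITY": ["a", "b", "c"]})

print(df.count())        # non-null values per column, as .info reports
print(df.isna().mean())  # fraction of missing values per column
```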

Summarize the numeric columns, transposing so more of the output fits on screen:

>>> college.describe(include=[np.number]).T
            count        mean  ...         75%    max
HBCU       7164.0    0.014238  ...    0.000000    1.0
MENONLY    7164.0    0.009213  ...    0.000000    1.0
WOMENONLY  7164.0    0.005304  ...    0.000000    1.0
RELAFFIL   7535.0    0.190975  ...    0.000000    1.0
SATVRMID   1185.0  522.819409  ...  555.000000  765.0
...           ...         ...  ...         ...    ...
PPTUG_EF   6853.0    0.226639  ...    0.376900    1.0
CURROPER   7535.0    0.923291  ...    1.000000    1.0
PCTPELL    6849.0    0.530643  ...    0.712900    1.0
PCTFLOAN   6849.0    0.522211  ...    0.745000    1.0
UG25ABV    6718.0    0.410021  ...    0.572275    1.0

Summarize the object (string) columns:

>>> college.describe(include=[np.object]).T
             count unique          top  freq
INSTNM        7535   7535  Academy ...     1
CITY          7535   2514     New York    87
STABBR        7535     59           CA   773
MD_EARN_W...  6413    598  PrivacyS...   822
GRAD_DEBT...  7503   2038  PrivacyS...  1510

There's more…

Specify quantiles via the percentiles parameter of .describe:

>>> college.describe(
...     include=[np.number],
...     percentiles=[
...         0.01,
...         0.05,
...         0.10,
...         0.25,
...         0.5,
...         0.75,
...         0.9,
...         0.95,
...         0.99,
...     ],
... ).T
            count        mean  ...         99%    max
HBCU       7164.0    0.014238  ...    1.000000    1.0
MENONLY    7164.0    0.009213  ...    0.000000    1.0
WOMENONLY  7164.0    0.005304  ...    0.000000    1.0
RELAFFIL   7535.0    0.190975  ...    1.000000    1.0
SATVRMID   1185.0  522.819409  ...  730.000000  765.0
...           ...         ...  ...         ...    ...
PPTUG_EF   6853.0    0.226639  ...    0.946724    1.0
CURROPER   7535.0    0.923291  ...    1.000000    1.0
PCTPELL    6849.0    0.530643  ...    0.993908    1.0
PCTFLOAN   6849.0    0.522211  ...    0.986368    1.0
UG25ABV    6718.0    0.410021  ...    0.917383    1.0

4.2 Data Dictionaries

A data dictionary is a table of metadata and notes about the data; its main purpose is to explain the meaning of the column names.

The file college_data_dictionary.csv holds the data dictionary for the college dataset:

>>> pd.read_csv("data/college_data_dictionary.csv")
    column_name  description
0        INSTNM  Institut...
1          CITY  City Loc...
2        STABBR  State Ab...
3          HBCU  Historic...
4       MENONLY  0/1 Men ...
..          ...          ...
22      PCTPELL  Percent ...
23     PCTFLOAN  Percent ...
24      UG25ABV  Percent ...
25  MD_EARN_...  Median E...
26  GRAD_DEB...  Median d...

4.3 Reducing Memory by Changing Data Types

Select a handful of columns with different data types so the memory usage is easy to observe:

>>> college = pd.read_csv("data/college.csv")
>>> different_cols = [
...     "RELAFFIL",
...     "SATMTMID",
...     "CURROPER",
...     "INSTNM",
...     "STABBR",
... ]
>>> col2 = college.loc[:, different_cols]
>>> col2.head()
   RELAFFIL  SATMTMID  ...       INSTNM STABBR
0         0     420.0  ...  Alabama ...     AL
1         0     565.0  ...  Universi...     AL
2         1       NaN  ...  Amridge ...     AL
3         0     590.0  ...  Universi...     AL
4         0     430.0  ...  Alabama ...     AL

Check the data type of each column:

>>> col2.dtypes
RELAFFIL      int64
SATMTMID    float64
CURROPER      int64
INSTNM       object
STABBR       object
dtype: object

Check how much memory each column uses:

>>> original_mem = col2.memory_usage(deep=True)
>>> original_mem
Index          128
RELAFFIL     60280
SATMTMID     60280
CURROPER     60280
INSTNM      660240
STABBR      444565
dtype: int64

RELAFFIL holds only 0/1 values, so 64 bits are unnecessary; convert it to 8 bits with the .astype method:

>>> col2["RELAFFIL"] = col2["RELAFFIL"].astype(np.int8)

Check the data types again:

>>> col2.dtypes
RELAFFIL       int8
SATMTMID    float64
CURROPER      int64
INSTNM       object
STABBR       object
dtype: object 

Check the memory usage again; the RELAFFIL column now takes far less memory:

>>> col2.memory_usage(deep=True)
Index          128
RELAFFIL      7535
SATMTMID     60280
CURROPER     60280
INSTNM      660240
STABBR      444565
dtype: int64

To save even more memory, object columns with low cardinality (cardinality is the number of unique values) can be converted to category:

>>> col2.select_dtypes(include=["object"]).nunique()
INSTNM    7535
STABBR      59
dtype: int64

STABBR has low cardinality, with unique values amounting to less than one percent of the rows, so convert it:

>>> col2["STABBR"] = col2["STABBR"].astype("category")
>>> col2.dtypes
RELAFFIL        int8
SATMTMID     float64
CURROPER       int64
INSTNM        object
STABBR      category
dtype: object

Check the memory usage once more:

>>> new_mem = col2.memory_usage(deep=True)
>>> new_mem
Index          128
RELAFFIL      7535
SATMTMID     60280
CURROPER     60280
INSTNM      660699
STABBR       13576
dtype: int64   

Finally, compare memory usage before and after the optimization:

>>> new_mem / original_mem
Index       1.000000
RELAFFIL    0.125000
SATMTMID    1.000000
CURROPER    1.000000
INSTNM      1.000695
STABBR      0.030538
dtype: float64
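The conversions above can be automated. Below is a hedged sketch of a helper that downcasts numeric columns with pd.to_numeric and converts low-cardinality strings to category; the name shrink, the 50% cardinality cutoff, and the toy frame are all made-up choices, not anything from the book:

```python
import numpy as np
import pandas as pd

def shrink(df, max_card_ratio=0.5):
    """Downcast numeric columns; make low-cardinality object columns categorical."""
    out = df.copy()
    for col, dtype in out.dtypes.items():
        if pd.api.types.is_integer_dtype(dtype):
            out[col] = pd.to_numeric(out[col], downcast="integer")
        elif pd.api.types.is_float_dtype(dtype):
            out[col] = pd.to_numeric(out[col], downcast="float")
        elif dtype == object and out[col].nunique() / len(out) < max_card_ratio:
            out[col] = out[col].astype("category")
    return out

# Toy frame mimicking three of the college columns.
df = pd.DataFrame({
    "RELAFFIL": np.array([0, 1, 0, 1, 0, 1], dtype="int64"),
    "SATMTMID": [420.0, 565.0, np.nan, 590.0, 430.0, 510.0],
    "STABBR": ["AL", "AL", "AL", "CA", "CA", "CA"],
})
small = shrink(df)
print(small.dtypes)
```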

There's more…

Change a single value in a column and watch how the memory usage responds:

>>> college.loc[0, "CURROPER"] = 10000000
>>> college.loc[0, "INSTNM"] = (
...     college.loc[0, "INSTNM"] + "a"
... )
>>> college[["CURROPER", "INSTNM"]].memory_usage(deep=True)
Index           80
CURROPER     60280
INSTNM      660804
dtype: int64

If even one value in an integer column is missing, the column is stored as float, and casting it to an integer type fails:

>>> college["MENONLY"].dtype
dtype('float64')
>>> college["MENONLY"].astype(np.int8)
Traceback (most recent call last):
  ...
ValueError: Cannot convert non-finite values (NA or inf) to integer
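One way around this, available since pandas 0.24, is the nullable integer extension dtype, spelled with a capital letter ("Int8" rather than np.int8), which stores missing values as pd.NA. A small sketch on a synthetic stand-in for MENONLY:

```python
import numpy as np
import pandas as pd

# A float column that is really 0/1 flags plus one missing value.
s = pd.Series([0.0, 1.0, np.nan, 0.0])

# np.int8 would raise on the NaN; the nullable "Int8" dtype keeps it as <NA>.
nullable = s.astype("Int8")
print(nullable.dtype)
print(nullable.isna().sum())
```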

The following four calls produce the same result:

college.describe(include=['int64', 'float64']).T
college.describe(include=[np.int64, np.float64]).T
college.describe(include=['int', 'float']).T
college.describe(include=['number']).T

Data type strings can also be passed to the .astype method:

>>> college.assign(
...     MENONLY=college["MENONLY"].astype("float16"),
...     RELAFFIL=college["RELAFFIL"].astype("int8"),
... )
           INSTNM         CITY  ... MD_EARN_WNE_P10  GRAD_DEBT_MDN_SUPP
0     Alabama ...       Normal  ...        30300           33888       
1     Universi...   Birmingham  ...        39700         21941.5       
2     Amridge ...   Montgomery  ...        40100           23370       
3     Universi...   Huntsville  ...        45500           24097       
4     Alabama ...   Montgomery  ...        26600         33118.5       
...           ...          ...  ...          ...             ...       
7530  SAE Inst...   Emeryville  ...          NaN            9500       
7531  Rasmusse...  Overland...  ...          NaN           21163       
7532  National...  Highland...  ...          NaN            6333       
7533  Bay Area...     San Jose  ...          NaN     PrivacyS...       
7534  Excel Le...  San Antonio  ...          NaN           12125

Compare the memory usage of the RangeIndex and Int64Index index types:

>>> college.index = pd.Int64Index(college.index)
>>> college.index.memory_usage()  # was 80 with a RangeIndex
60280

4.4 Selecting the Smallest of the Largest

Use .nlargest and .nsmallest to find, among the 100 top-rated movies, the five with the smallest budgets:

>>> movie = pd.read_csv("data/movie.csv")
>>> movie2 = movie[["movie_title", "imdb_score", "budget"]]
>>> movie2.head()
   movie_title  imdb_score       budget
0       Avatar         7.9  237000000.0
1  Pirates ...         7.1  300000000.0
2      Spectre         6.8  245000000.0
3  The Dark...         8.5  250000000.0
4  Star War...         7.1          NaN

Select the 100 movies with the highest imdb_score:

>>> movie2.nlargest(100, "imdb_score").head()
                   movie_title  imdb_score      budget
2725          Towering Inferno         9.5         NaN
1920  The Shawshank Redemption         9.3  25000000.0
3402             The Godfather         9.2   6000000.0
2779                   Dekalog         9.1         NaN
4312      Kickboxer: Vengeance         9.1  17000000.0

From those top 100, select the five with the lowest budget:

>>> (
...     movie2.nlargest(100, "imdb_score").nsmallest(
...         5, "budget"
...     )
... )
               movie_title  imdb_score    budget
4804        Butterfly Girl         8.7  180000.0
4801    Children of Heaven         8.5  180000.0
4706          12 Angry Men         8.9  350000.0
4550          A Separation         8.4  500000.0
4636  The Other Dream Team         8.4  500000.0

4.5 Selecting the Largest of Each Group by Sorting

In this recipe, we try to find the highest-rated movie of each year.

Read the movie dataset and keep only three of its columns: movie_title, title_year, and imdb_score:

>>> movie = pd.read_csv("data/movie.csv")
>>> movie[["movie_title", "title_year", "imdb_score"]]
                                     movie_title  ...
0                                         Avatar  ...
1       Pirates of the Caribbean: At World's End  ...
2                                        Spectre  ...
3                          The Dark Knight Rises  ...
4     Star Wars: Episode VII - The Force Awakens  ...
...                                          ...  ...
4911                     Signed Sealed Delivered  ...
4912                               The Following  ...
4913                        A Plague So Pleasant  ...
4914                            Shanghai Calling  ...
4915                           My Date with Drew  ...

Sort by title_year in ascending order:

>>> (
...     movie[
...         ["movie_title", "title_year", "imdb_score"]
...     ].sort_values("title_year", ascending=True)
... )
                                           movie_title  ...
4695  Intolerance: Love's Struggle Throughout the Ages  ...
4833                    Over the Hill to the Poorhouse  ...
4767                                    The Big Parade  ...
2694                                        Metropolis  ...
4697                               The Broadway Melody  ...
...                                                ...  ...
4683                                            Heroes  ...
4688                                       Home Movies  ...
4704                                        Revolution  ...
4752                                      Happy Valley  ...
4912                                     The Following  ...

Sort on two columns at once:

>>> (
...     movie[
...          ["movie_title", "title_year", "imdb_score"]
...     ].sort_values(
...         ["title_year", "imdb_score"], ascending=False
...     )
... )
                      movie_title  title_year  imdb_score
4312         Kickboxer: Vengeance      2016.0         9.1
4277  A Beginner's Guide to Snuff      2016.0         8.7
3798                      Airlift      2016.0         8.5
27     Captain America: Civil War      2016.0         8.2
98            Godzilla Resurgence      2016.0         8.2
...                           ...         ...         ...
1391                    Rush Hour         NaN         5.8
4031                     Creature         NaN         5.0
2165              Meet the Browns         NaN         3.5
3246   The Bold and the Beautiful         NaN         3.5
2119                 The Bachelor         NaN         2.9

Use the .drop_duplicates method to keep only the first row for each year:

>>> (
...     movie[["movie_title", "title_year", "imdb_score"]]
...     .sort_values(
...         ["title_year", "imdb_score"], ascending=False
...     )
...     .drop_duplicates(subset="title_year")
... )
      movie_title  title_year  imdb_score
4312  Kickboxe...      2016.0         9.1
3745  Running ...      2015.0         8.6
4369  Queen of...      2014.0         8.7
3935  Batman: ...      2013.0         8.4
3     The Dark...      2012.0         8.5
...           ...         ...         ...
2694   Metropolis      1927.0         8.3
4767  The Big ...      1925.0         8.3
4833  Over the...      1920.0         4.8
4695  Intolera...      1916.0         8.0
2725  Towering...         NaN         9.5

There's more…

The .groupby method can accomplish the same goal:

>>> (
...     movie[["movie_title", "title_year", "imdb_score"]]
...     .groupby("title_year", as_index=False)
...     .apply(
...         lambda df: df.sort_values(
...             "imdb_score", ascending=False
...         ).head(1)
...     )
...     .droplevel(0)
...     .sort_values("title_year", ascending=False)
... )
         movie_title  title_year  imdb_score
90 4312  Kickboxe...      2016.0         9.1
89 3745  Running ...      2015.0         8.6
88 4369  Queen of...      2014.0         8.7
87 3935  Batman: ...      2013.0         8.4
86 3     The Dark...      2012.0         8.5
...              ...         ...         ...
4  4555  Pandora'...      1929.0         8.0
3  2694   Metropolis      1927.0         8.3
2  4767  The Big ...      1925.0         8.3
1  4833  Over the...      1920.0         4.8
0  4695  Intolera...      1916.0         8.0
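Yet another alternative, assuming every year has at least one non-missing score, uses .idxmax to grab the row label of the best movie per year. A sketch on toy rows (not the real movie file):

```python
import pandas as pd

# Made-up movies: two per year, one clear winner in each.
movie = pd.DataFrame({
    "movie_title": ["A", "B", "C", "D"],
    "title_year": [2015.0, 2015.0, 2016.0, 2016.0],
    "imdb_score": [8.0, 8.6, 9.1, 8.2],
})

# Row label of the maximum score within each year, then select those rows.
best_idx = movie.groupby("title_year")["imdb_score"].idxmax()
best = movie.loc[best_idx]
print(best)
```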

The ascending parameter accepts a list of booleans, so each column can be sorted in its own direction:

>>> (
...     movie[
...         [
...             "movie_title",
...             "title_year",
...             "content_rating",
...             "budget",
...         ]
...     ]
...     .sort_values(
...         ["title_year", "content_rating", "budget"],
...         ascending=[False, False, True],
...     )
...     .drop_duplicates(
...         subset=["title_year", "content_rating"]
...     )
... )
      movie_title  title_year content_rating      budget
4026    Compadres      2016.0            R     3000000.0
4658  Fight to...      2016.0        PG-13      150000.0
4661   Rodeo Girl      2016.0           PG      500000.0
3252  The Wailing      2016.0    Not Rated           NaN
4659  Alleluia...      2016.0          NaN      500000.0
...           ...         ...          ...           ...
2558   Lilyhammer         NaN        TV-MA    34000000.0
807   Sabrina,...         NaN         TV-G     3000000.0
848   Stargate...         NaN        TV-14     1400000.0
2436       Carlos         NaN    Not Rated           NaN
2119  The Bach...         NaN          NaN     3000000.0

4.6 Replicating nlargest with sort_values

Here we replicate the result of recipe 4.4 with .sort_values. First, repeat the original .nlargest/.nsmallest chain:

>>> movie = pd.read_csv("data/movie.csv")
>>> (
...     movie[["movie_title", "imdb_score", "budget"]]
...     .nlargest(100, "imdb_score")
...     .nsmallest(5, "budget")
... )
               movie_title  imdb_score    budget
4804        Butterfly Girl         8.7  180000.0
4801    Children of Heaven         8.5  180000.0
4706          12 Angry Men         8.9  350000.0
4550          A Separation         8.4  500000.0
4636  The Other Dream Team         8.4  500000.0

Use .sort_values to pull the 100 movies with the highest imdb_score:

>>> (
...     movie[["movie_title", "imdb_score", "budget"]]
...     .sort_values("imdb_score", ascending=False)
...     .head(100)
... )
      movie_title  imdb_score      budget
2725  Towering...         9.5         NaN
1920  The Shaw...         9.3  25000000.0
3402  The Godf...         9.2   6000000.0
2779      Dekalog         9.1         NaN
4312  Kickboxe...         9.1  17000000.0
...           ...         ...         ...
3799  Anne of ...         8.4         NaN
3777  Requiem ...         8.4   4500000.0
3935  Batman: ...         8.4   3500000.0
4636  The Othe...         8.4    500000.0
2455       Aliens         8.4  18500000.0

Chaining onto the previous step, extract the five lowest-budget movies among them:

>>> (
...     movie[["movie_title", "imdb_score", "budget"]]
...     .sort_values("imdb_score", ascending=False)
...     .head(100)
...     .sort_values("budget")
...     .head(5)
... )
                    movie_title  imdb_score    budget
4815  A Charlie Brown Christmas         8.4  150000.0
4801         Children of Heaven         8.5  180000.0
4804             Butterfly Girl         8.7  180000.0
4706               12 Angry Men         8.9  350000.0
4636       The Other Dream Team         8.4  500000.0

There's more…

Because many movies are tied at a score of 8.4, and .nlargest and .sort_values break ties slightly differently, the two top-100 DataFrames are not identical. Passing kind='mergesort' to .sort_values makes the two results match.

>>> (
...     movie[["movie_title", "imdb_score", "budget"]]
...     .nlargest(100, "imdb_score")
...     .tail()
... )
                movie_title  imdb_score     budget
4023                 Oldboy         8.4  3000000.0
4163  To Kill a Mockingbird         8.4  2000000.0
4395         Reservoir Dogs         8.4  1200000.0
4550           A Separation         8.4   500000.0
4636   The Other Dream Team         8.4   500000.0
>>> (
...     movie[["movie_title", "imdb_score", "budget"]]
...     .sort_values("imdb_score", ascending=False)
...     .head(100)
...     .tail()
... )
      movie_title  imdb_score      budget
3799  Anne of ...         8.4         NaN
3777  Requiem ...         8.4   4500000.0
3935  Batman: ...         8.4   3500000.0
4636  The Othe...         8.4    500000.0
2455       Aliens         8.4  18500000.0

4.7 Calculating a Trailing Stop Order Price

A stop order is a common term in stock trading. Suppose you buy a stock at $100 per share with a 90% trailing stop; the stop order price is then $90.

If the stock rises to $120, the stop price becomes $108. If it then falls to $110, the stop price stays at $108. The stop price rises only when the stock rises.
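That arithmetic is simply a cumulative maximum times 0.9. A self-contained toy version with made-up prices (no network access, unlike the pandas-datareader recipe that follows):

```python
import pandas as pd

# Made-up closing prices: buy at 100, rally to 120, dip to 110, rally to 130.
close = pd.Series([100.0, 120.0, 110.0, 130.0])

stop = close.cummax() * 0.9  # the stop price only ever ratchets upward
print(stop.tolist())
```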

This recipe requires the third-party library pandas-datareader; install it with conda install pandas-datareader or pip install pandas-datareader. You may also need to install requests_cache.

This example uses Tesla (TSLA) stock and assumes a purchase on the first trading day of 2017:

>>> import datetime
>>> import pandas_datareader.data as web
>>> import requests_cache
>>> session = requests_cache.CachedSession(
...     cache_name="cache",
...     backend="sqlite",
...     expire_after=datetime.timedelta(days=90),
... )
>>> tsla = web.DataReader(
...     "tsla",
...     data_source="yahoo",
...     start="2017-1-1",
...     session=session,
... )
>>> tsla.head(8)
                  High         Low  ...    Volume   Adj Close
Date                                ...
2017-01-03  220.330002  210.960007  ...   5923300  216.990005
2017-01-04  228.000000  214.309998  ...  11213500  226.990005
2017-01-05  227.479996  221.949997  ...   5911700  226.750000
2017-01-06  230.309998  225.449997  ...   5527900  229.009995
2017-01-09  231.919998  228.000000  ...   3979500  231.279999
2017-01-10  232.000000  226.889999  ...   3660000  229.869995
2017-01-11  229.979996  226.679993  ...   3650800  229.729996
2017-01-12  230.699997  225.580002  ...   3790200  229.589996

For simplicity, we use each trading day's closing price:

>>> tsla_close = tsla["Close"]

Use the .cummax method to track the running maximum of the closing price:

>>> tsla_cummax = tsla_close.cummax()
>>> tsla_cummax.head()
Date
2017-01-03    216.990005
2017-01-04    226.990005
2017-01-05    226.990005
2017-01-06    229.009995
2017-01-09    231.279999
Name: Close, dtype: float64

Multiply by 0.9 to get the stop order price:

>>> (tsla["Close"].cummax().mul(0.9).head())
Date
2017-01-03    195.291005
2017-01-04    204.291005
2017-01-05    204.291005
2017-01-06    206.108995
2017-01-09    208.151999
Name: Close, dtype: float64

There's more…

cummax相对,.cummin可以追踪最低值:

weight.cummin() * 1.05
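Here weight stands for a hypothetical Series of recorded body weights (the book's weight-tracking example). A runnable toy version:

```python
import pandas as pd

# Made-up weekly weigh-ins; the threshold sits 5% above the lowest weight so far.
weight = pd.Series([90.0, 88.0, 89.0, 86.0])
threshold = weight.cummin() * 1.05
print(threshold.round(2).tolist())
```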
