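Chapter 01 Pandas Basics
Chapter 02 Essential DataFrame Operations
Chapter 03 Creating and Persisting DataFrames
Chapter 04 Beginning Data Analysis
Chapter 05 Exploratory Data Analysis
Chapter 06 Selecting Subsets of Data
Chapter 07 Filtering Rows
Chapter 08 Index Alignment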
Download this book: https://www.jianshu.com/p/62524f4c240e
1.1 Importing Pandas and NumPy
>>> import pandas as pd
>>> import numpy as np
1.2 The Pandas DataFrame
Use the read_csv() function to read data from disk into an in-memory DataFrame object.
All of the data can be downloaded from GitHub: download link
>>> movies = pd.read_csv("data/movie.csv")
>>> movies
color director_name ... aspect_ratio movie_facebook_likes
0 Color James Cameron ... 1.78 33000
1 Color Gore Verbinski ... 2.35 0
2 Color Sam Mendes ... 2.35 85000
3 Color Christopher Nolan ... 2.35 164000
4 NaN Doug Walker ... NaN 0
... ... ... ... ... ...
4911 Color Scott Smith ... NaN 84
4912 Color NaN ... 16.00 32000
4913 Color Benjamin Roberds ... NaN 16
4914 Color Daniel Hsia ... 2.35 660
4915 Color Jon Gunn ... 1.85 456
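In the output above, the index is axis 0 and the columns are axis 1.
Pandas uses NaN (not a number) to represent missing values.
movies.head(n) returns the first n rows, and movies.tail(n) returns the last n rows.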
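To make the behaviour of .head and .tail concrete without needing the movie data set, here is a minimal sketch on a hand-built DataFrame. The frame df and its column names are made up for illustration only.

```python
import pandas as pd

# A tiny, made-up DataFrame (not the movie data set)
df = pd.DataFrame(
    {
        "title": ["A", "B", "C", "D", "E", "F"],
        "score": [7.9, 7.1, 6.8, 8.5, 7.1, 6.6],
    }
)

first_two = df.head(2)     # rows labelled 0 and 1
last_two = df.tail(2)      # rows labelled 4 and 5
default_head = df.head()   # n defaults to 5 when omitted

print(first_two)
print(last_two)
print(df.shape)            # (rows along axis 0, columns along axis 1)
```

Both methods return a new DataFrame and leave df untouched, so they are safe to call at any point while exploring.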
1.3 DataFrame attributes
Extract the columns, the index and the data from a DataFrame:
>>> movies = pd.read_csv("data/movie.csv")
>>> columns = movies.columns
>>> index = movies.index
>>> data = movies.to_numpy()
Display the columns, the index and the data:
>>> columns
Index(['color', 'director_name', 'num_critic_for_reviews', 'duration',
'director_facebook_likes', 'actor_3_facebook_likes', 'actor_2_name',
'actor_1_facebook_likes', 'gross', 'genres', 'actor_1_name',
'movie_title', 'num_voted_users', 'cast_total_facebook_likes',
'actor_3_name', 'facenumber_in_poster', 'plot_keywords',
'movie_imdb_link', 'num_user_for_reviews', 'language', 'country',
'content_rating', 'budget', 'title_year', 'actor_2_facebook_likes',
'imdb_score', 'aspect_ratio', 'movie_facebook_likes'], dtype='object')
>>> index
RangeIndex(start=0, stop=4916, step=1)
>>> data
array([['Color', 'James Cameron', 723.0, ..., 7.9, 1.78, 33000],
['Color', 'Gore Verbinski', 302.0, ..., 7.1, 2.35, 0],
['Color', 'Sam Mendes', 602.0, ..., 6.8, 2.35, 85000],
...,
['Color', 'Benjamin Roberds', 13.0, ..., 6.3, nan, 16],
['Color', 'Daniel Hsia', 14.0, ..., 6.3, 2.35, 660],
['Color', 'Jon Gunn', 43.0, ..., 6.6, 1.85, 456]], dtype=object)
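The Python types of the index, the columns and the data:
>>> type(index)
<class 'pandas.core.indexes.range.RangeIndex'>
>>> type(columns)
<class 'pandas.core.indexes.base.Index'>
>>> type(data)
<class 'numpy.ndarray'>
The index and the columns are both Index objects (RangeIndex is a subclass of Index); they are sometimes called the row index and the column index: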
>>> issubclass(pd.RangeIndex, pd.Index)
True
>>> issubclass(columns.__class__, pd.Index)
True
The .values attribute (or the .to_numpy() method) converts the index, the columns or the data to an ndarray, NumPy's n-dimensional array:
>>> index.to_numpy()
array([ 0, 1, 2, ..., 4913, 4914, 4915], dtype=int64)
>>> columns.to_numpy()
array(['color', 'director_name', 'num_critic_for_reviews', 'duration',
'director_facebook_likes', 'actor_3_facebook_likes',
'actor_2_name', 'actor_1_facebook_likes', 'gross', 'genres',
'actor_1_name', 'movie_title', 'num_voted_users',
'cast_total_facebook_likes', 'actor_3_name',
'facenumber_in_poster', 'plot_keywords', 'movie_imdb_link',
'num_user_for_reviews', 'language', 'country', 'content_rating',
'budget', 'title_year', 'actor_2_facebook_likes', 'imdb_score',
'aspect_ratio', 'movie_facebook_likes'], dtype=object)
1.4 Understanding data types
Broadly speaking, data can be divided into continuous data and discrete, categorical data.
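- float - the NumPy floating-point type; supports missing values;
- int - the NumPy integer type; does not support missing values;
- Int64 - the Pandas integer type; supports missing values;
- object - the NumPy type used to store strings and mixed types;
- category - the Pandas categorical type; supports missing values;
- bool - the NumPy Boolean type; does not support missing values (None becomes False, np.nan becomes True);
- boolean - the Pandas Boolean type; supports missing values;
- datetime64[ns] - the NumPy date type; supports missing values (NaT).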
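The difference between the NumPy-backed and the Pandas nullable types in the list above can be seen with a few tiny Series. This is only a sketch on made-up values; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

# NumPy integers cannot hold a missing value, so Pandas falls back to float64
as_float = pd.Series([1, 2, None])
# The Pandas nullable integer type keeps the integers and stores the gap as <NA>
as_nullable_int = pd.Series([1, 2, None], dtype="Int64")
# The Pandas nullable Boolean type does the same for True/False data
as_nullable_bool = pd.Series([True, False, None], dtype="boolean")

print(as_float.dtype)          # float64
print(as_nullable_int.dtype)   # Int64
print(as_nullable_bool.dtype)  # boolean

# Plain Python truthiness explains the bool conversions mentioned in the list
print(bool(None), bool(np.nan))  # False True
```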
The .dtypes attribute shows each column name together with its data type:
>>> movies = pd.read_csv("data/movie.csv")
>>> movies.dtypes
color object
director_name object
num_critic_for_reviews float64
duration float64
director_facebook_likes float64
...
title_year float64
actor_2_facebook_likes float64
imdb_score float64
aspect_ratio float64
movie_facebook_likes int64
Length: 28, dtype: object
Use the .value_counts method to count how many columns there are of each data type:
>>> movies.dtypes.value_counts()
float64 13
int64 3
object 12
dtype: int64
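Use the .info method to view the data types together with the non-null counts and memory usage:
>>> movies.info()
<class 'pandas.core.frame.DataFrame'>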
RangeIndex: 4916 entries, 0 to 4915
Data columns (total 28 columns):
color 4897 non-null object
director_name 4814 non-null object
num_critic_for_reviews 4867 non-null float64
duration 4901 non-null float64
director_facebook_likes 4814 non-null float64
actor_3_facebook_likes 4893 non-null float64
actor_2_name 4903 non-null object
actor_1_facebook_likes 4909 non-null float64
gross 4054 non-null float64
genres 4916 non-null object
actor_1_name 4909 non-null object
movie_title 4916 non-null object
num_voted_users 4916 non-null int64
cast_total_facebook_likes 4916 non-null int64
actor_3_name 4893 non-null object
facenumber_in_poster 4903 non-null float64
plot_keywords 4764 non-null object
movie_imdb_link 4916 non-null object
num_user_for_reviews 4895 non-null float64
language 4904 non-null object
country 4911 non-null object
content_rating 4616 non-null object
budget 4432 non-null float64
title_year 4810 non-null float64
actor_2_facebook_likes 4903 non-null float64
imdb_score 4916 non-null float64
aspect_ratio 4590 non-null float64
movie_facebook_likes 4916 non-null int64
dtypes: float64(13), int64(3), object(12)
memory usage: 1.1+ MB
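By default Pandas stores numeric types using 64 bits, which is why int64 and float64 appear above.
(Note: starting straight away with the DataFrame, the most widely used object, is a distinctive feature of this book; conventionally one would begin with the Series.)
A column of object type may hold any Python object, and it may also contain missing values. A Pandas Series that holds strings together with missing values has dtype O: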
>>> pd.Series(["Paul", np.nan, "George"]).dtype
dtype('O')
1.5 Selecting a column
Select a column by passing its name to the index operator:
>>> movies = pd.read_csv("data/movie.csv")
>>> movies["director_name"]
0 James Cameron
1 Gore Verbinski
2 Sam Mendes
3 Christopher Nolan
4 Doug Walker
...
4911 Scott Smith
4912 NaN
4913 Benjamin Roberds
4914 Daniel Hsia
4915 Jon Gunn
Name: director_name, Length: 4916, dtype: object
Select a column with attribute access:
>>> movies.director_name
0 James Cameron
1 Gore Verbinski
2 Sam Mendes
3 Christopher Nolan
4 Doug Walker
...
4911 Scott Smith
4912 NaN
4913 Benjamin Roberds
4914 Daniel Hsia
4915 Jon Gunn
Name: director_name, Length: 4916, dtype: object
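Select a column with .loc and .iloc; the former takes the column name, the latter the integer position:
# ":" selects every row, from the first to the last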
>>> movies.loc[:, "director_name"]
0 James Cameron
1 Gore Verbinski
2 Sam Mendes
3 Christopher Nolan
4 Doug Walker
...
4911 Scott Smith
4912 NaN
4913 Benjamin Roberds
4914 Daniel Hsia
4915 Jon Gunn
Name: director_name, Length: 4916, dtype: object
>>> movies.iloc[:, 1]
0 James Cameron
1 Gore Verbinski
2 Sam Mendes
3 Christopher Nolan
4 Doug Walker
...
4911 Scott Smith
4912 NaN
4913 Benjamin Roberds
4914 Daniel Hsia
4915 Jon Gunn
Name: director_name, Length: 4916, dtype: object
Inspect the column's attributes:
>>> movies["director_name"].index
RangeIndex(start=0, stop=4916, step=1)
>>> movies["director_name"].dtype
dtype('O')
>>> movies["director_name"].size
4916
>>> movies["director_name"].name
'director_name'
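Confirm that the result is a Series object:
>>> type(movies["director_name"])
<class 'pandas.core.series.Series'>
Every column of a DataFrame can be pulled out and worked with as a Series.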
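The four selection styles shown in this section are interchangeable for an ordinary column name, with one caveat for attribute access. The sketch below uses a small made-up DataFrame (the column named count is chosen deliberately to collide with a DataFrame method).

```python
import pandas as pd

# Made-up data; "count" deliberately clashes with the DataFrame.count method
df = pd.DataFrame({"director_name": ["X", "Y"], "count": [1, 2]})

by_bracket = df["director_name"]
by_attribute = df.director_name
by_loc = df.loc[:, "director_name"]
by_iloc = df.iloc[:, 0]

# All four expressions select the same Series
print(by_bracket.equals(by_attribute))
print(by_bracket.equals(by_loc))
print(by_bracket.equals(by_iloc))

# Attribute access finds the method first, not the column of the same name
print(callable(df.count))   # True: this is the method
print(df["count"].tolist()) # the index operator still reaches the column
```

For that reason the index operator (or .loc) is the safer habit in scripts, while attribute access is mostly a convenience at the prompt.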
1.6 Calling Series methods
Use dir() to list the attributes and methods of pd.Series and pd.DataFrame:
>>> s_attr_methods = set(dir(pd.Series))
>>> len(s_attr_methods)
471
>>> df_attr_methods = set(dir(pd.DataFrame))
>>> len(df_attr_methods)
458
>>> len(s_attr_methods & df_attr_methods)
400
First, pull out two columns:
>>> movies = pd.read_csv("data/movie.csv")
>>> director = movies["director_name"]
>>> fb_likes = movies["actor_1_facebook_likes"]
>>> director.dtype
dtype('O')
>>> fb_likes.dtype
dtype('float64')
Besides using the .head method to list the first 5 rows of a Series, you can use .sample to look at a random selection of rows:
>>> director.head()
0 James Cameron
1 Gore Verbinski
2 Sam Mendes
3 Christopher Nolan
4 Doug Walker
Name: director_name, dtype: object
>>> director.sample(n=5, random_state=42)
2347 Brian Percival
4687 Lucio Fulci
691 Phillip Noyce
3911 Sam Peckinpah
2488 Rowdy Herrington
Name: director_name, dtype: object
>>> fb_likes.head()
0 1000.0
1 40000.0
2 11000.0
3 27000.0
4 131.0
Name: actor_1_facebook_likes, dtype: float64
The data type of a Series determines which methods are most useful. For example, the most commonly used method for an object Series is .value_counts:
>>> director.value_counts()
Steven Spielberg 26
Woody Allen 22
Clint Eastwood 20
Martin Scorsese 20
Ridley Scott 16
..
Eric England 1
Moustapha Akkad 1
Jay Oliva 1
Scott Speer 1
Leon Ford 1
Name: director_name, Length: 2397, dtype: int64
Numeric data can also use .value_counts:
>>> fb_likes.value_counts()
1000.0 436
11000.0 206
2000.0 189
3000.0 150
12000.0 131
...
362.0 1
216.0 1
859.0 1
225.0 1
334.0 1
Name: actor_1_facebook_likes, Length: 877, dtype: int64
Use .size, .shape or len() to get the number of elements; .unique() returns the unique values:
>>> director.size
4916
>>> director.shape
(4916,)
>>> len(director)
4916
>>> director.unique()
array(['James Cameron', 'Gore Verbinski', 'Sam Mendes', ...,
'Scott Smith', 'Benjamin Roberds', 'Daniel Hsia'], dtype=object)
.count() returns the number of non-missing values:
>>> director.count()
4814
>>> fb_likes.count()
4909
The methods .min, .max, .mean, .median and .std return summary statistics:
>>> fb_likes.min()
0.0
>>> fb_likes.max()
640000.0
>>> fb_likes.mean()
6494.488490527602
>>> fb_likes.median()
982.0
>>> fb_likes.std()
15106.986883848309
.describe also returns summary statistics:
>>> fb_likes.describe()
count 4909.000000
mean 6494.488491
std 15106.986884
min 0.000000
25% 607.000000
50% 982.000000
75% 11000.000000
max 640000.000000
Name: actor_1_facebook_likes, dtype: float64
>>> director.describe()
count 4814
unique 2397
top Steven Spielberg
freq 26
Name: director_name, dtype: object
The .quantile() method returns quantiles:
>>> fb_likes.quantile(0.2)
510.0
>>> fb_likes.quantile(
... [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
... )
0.1 240.0
0.2 510.0
0.3 694.0
0.4 854.0
0.5 982.0
0.6 1000.0
0.7 8000.0
0.8 13000.0
0.9 18000.0
Name: actor_1_facebook_likes, dtype: float64
.isna() marks which values are missing:
>>> director.isna()
0 False
1 False
2 False
3 False
4 False
...
4911 False
4912 True
4913 False
4914 False
4915 False
Name: director_name, Length: 4916, dtype: bool
.fillna() fills in missing values:
>>> fb_likes_filled = fb_likes.fillna(0)
>>> fb_likes_filled.count()
4916
.dropna() drops missing values:
>>> fb_likes_dropped = fb_likes.dropna()
>>> fb_likes_dropped.size
4909
For the .value_counts() method, setting the normalize parameter to True returns relative frequencies:
>>> director.value_counts(normalize=True)
Steven Spielberg 0.005401
Woody Allen 0.004570
Clint Eastwood 0.004155
Martin Scorsese 0.004155
Ridley Scott 0.003324
...
Eric England 0.000208
Moustapha Akkad 0.000208
Jay Oliva 0.000208
Scott Speer 0.000208
Leon Ford 0.000208
Name: director_name, Length: 2397, dtype: float64
Another way to check whether a Series contains any missing values is the .hasnans attribute:
>>> director.hasnans
True
The .notna() method marks which values are not missing:
>>> director.notna()
0 True
1 True
2 True
3 True
4 True
...
4911 True
4912 False
4913 True
4914 True
4915 True
Name: director_name, Length: 4916, dtype: bool
.isnull() does the same thing as .isna(). Because Pandas uses NaN to represent missing values, the latter is easier to remember.
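The missing-value methods in this section fit together in a simple way, which the following sketch checks on a small made-up Series (the values are illustrative, not taken from the movie data).

```python
import numpy as np
import pandas as pd

s = pd.Series(["Paul", np.nan, "George", None])

n_total = s.size             # counts every element, missing or not
n_present = s.count()        # counts only the non-missing values
n_missing = s.isna().sum()   # True counts as 1 when summed

print(n_total, n_present, n_missing)
print(s.hasnans)                      # True, at least one value is missing
print(s.isna().equals(s.isnull()))    # the two methods give identical results
print(s.notna().equals(~s.isna()))    # .notna is the element-wise negation
print(s.dropna().size)                # same as n_present
print(s.fillna("unknown").count())    # same as n_total
```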
1.7 Series operations
Load the imdb_score column:
>>> movies = pd.read_csv("data/movie.csv")
>>> imdb_score = movies["imdb_score"]
>>> imdb_score
0 7.9
1 7.1
2 6.8
3 8.5
4 7.1
...
4911 7.7
4912 7.5
4913 6.3
4914 6.3
4915 6.6
Name: imdb_score, Length: 4916, dtype: float64
Addition, subtraction, multiplication, division and exponentiation can be applied directly to a column:
>>> imdb_score + 1
0 8.9
1 8.1
2 7.8
3 9.5
4 8.1
...
4911 8.7
4912 8.5
4913 7.3
4914 7.3
4915 7.6
Name: imdb_score, Length: 4916, dtype: float64
// and % return the integer part and the remainder of a division, respectively:
>>> imdb_score // 7
0 1.0
1 1.0
2 0.0
3 1.0
4 1.0
...
4911 1.0
4912 1.0
4913 0.0
4914 0.0
4915 0.0
Name: imdb_score, Length: 4916, dtype: float64
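As a quick check of how // and % relate, the sketch below uses a small integer Series (made-up values) and recombines the two results.

```python
import pandas as pd

s = pd.Series([7, 8, 15, 20])

quotient = s // 7    # integer part of the division: 1, 1, 2, 2
remainder = s % 7    # what is left over: 0, 1, 1, 6

# Recombining the two parts recovers the original values
rebuilt = quotient * 7 + remainder
print(quotient.tolist(), remainder.tolist())
print(rebuilt.equals(s))
```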
The six comparison operators >, <, >=, <=, == and != return Boolean values:
>>> imdb_score > 7
0 True
1 True
2 False
3 True
4 True
...
4911 True
4912 True
4913 False
4914 False
4915 False
Name: imdb_score, Length: 4916, dtype: bool
>>> director = movies["director_name"]
>>> director == "James Cameron"
0 True
1 False
2 False
3 False
4 False
...
4911 False
4912 False
4913 False
4914 False
4915 False
Name: director_name, Length: 4916, dtype: bool
The .add() method is equivalent to +, and .gt() is equivalent to >:
>>> imdb_score.add(1) # imdb_score + 1
0 8.9
1 8.1
2 7.8
3 9.5
4 8.1
...
4911 8.7
4912 8.5
4913 7.3
4914 7.3
4915 7.6
Name: imdb_score, Length: 4916, dtype: float64
>>> imdb_score.gt(7) # imdb_score > 7
0 True
1 True
2 False
3 True
4 True
...
4911 True
4912 True
4913 False
4914 False
4915 False
Name: imdb_score, Length: 4916, dtype: bool
One reason to use the methods is that they accept parameters. For example, the .sub method takes a fill_value parameter:
>>> money = pd.Series([100, 20, None])
>>> money - 15
0 85.0
1 5.0
2 NaN
dtype: float64
>>> money.sub(15, fill_value=0)
0 85.0
1 5.0
2 -15.0
dtype: float64
The arithmetic methods are .add, .sub, .mul, .div, .floordiv, .mod and .pow.
The comparison methods are .lt, .gt, .le, .ge, .eq and .ne.
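The fill_value parameter is also useful when two Series are combined and their labels do not line up. The following is a minimal sketch with two made-up Series, a and b, whose indexes only partly overlap.

```python
import pandas as pd

a = pd.Series([1, 2], index=["x", "y"])
b = pd.Series([10, 20], index=["y", "z"])

plain = a + b                     # labels present in only one Series give NaN
filled = a.add(b, fill_value=0)   # the absent side is treated as 0 instead

print(plain)
print(filled)
```

With the operator, only the shared label y gets a value; with .add and fill_value=0, the labels x and z keep the value from whichever Series has them.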
1.8 Chaining methods
Methods can be called one after another in a chain:
>>> movies = pd.read_csv("data/movie.csv")
>>> fb_likes = movies["actor_1_facebook_likes"]
>>> director = movies["director_name"]
>>> director.value_counts().head(3)
Steven Spielberg 26
Woody Allen 22
Clint Eastwood 20
Name: director_name, dtype: int64
Count the number of missing values:
>>> fb_likes.isna().sum()
7
>>> fb_likes.dtype
dtype('float64')
>>> (fb_likes.fillna(0).astype(int).head())
0 1000
1 40000
2 11000
3 27000
4 131
Name: actor_1_facebook_likes, dtype: int64
.pipe() can be used to inspect an intermediate value in the middle of a chain:
>>> def debug_ser(ser):
... print("BEFORE")
... print(ser)
... print("AFTER")
... return ser
>>> (fb_likes.fillna(0).pipe(debug_ser).astype(int).head())
BEFORE
0 1000.0
1 40000.0
2 11000.0
3 27000.0
4 131.0
...
4911 637.0
4912 841.0
4913 0.0
4914 946.0
4915 86.0
Name: actor_1_facebook_likes, Length: 4916, dtype: float64
AFTER
0 1000
1 40000
2 11000
3 27000
4 131
Name: actor_1_facebook_likes, dtype: int64
.pipe can also be combined with a global variable to store an intermediate value:
>>> intermediate = None
>>> def get_intermediate(ser):
... global intermediate
... intermediate = ser
... return ser
>>> res = (
... fb_likes.fillna(0)
... .pipe(get_intermediate)
... .astype(int)
... .head()
... )
>>> intermediate
0 1000.0
1 40000.0
2 11000.0
3 27000.0
4 131.0
...
4911 637.0
4912 841.0
4913 0.0
4914 946.0
4915 86.0
Name: actor_1_facebook_likes, Length: 4916, dtype: float64
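.pipe also forwards any extra positional and keyword arguments to the function it calls, which lets a parameterised helper sit inside a chain. The sketch below uses made-up values, and cap_at is a hypothetical helper written only for this illustration.

```python
import pandas as pd

def cap_at(ser, upper):
    """Hypothetical helper: limit every value to at most `upper`."""
    return ser.clip(upper=upper)

likes = pd.Series([1000.0, 40000.0, None, 131.0])

result = (
    likes.fillna(0)
    .pipe(cap_at, upper=10000)   # same as cap_at(likes.fillna(0), upper=10000)
    .astype(int)
)
print(result.tolist())
```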
1.9 Renaming columns
>>> movies = pd.read_csv("data/movie.csv")
First define a dictionary that maps the old column names to the new ones:
>>> col_map = {
... "director_name": "director",
... "num_critic_for_reviews": "critic_reviews",
... }
Pass the dictionary to the rename method:
>>> movies.rename(columns=col_map).head()
color director ... aspect_ratio movie_facebook_likes
0 Color James Cameron ... 1.78 33000
1 Color Gore Verbinski ... 2.35 0
2 Color Sam Mendes ... 2.35 85000
3 Color Christopher Nolan ... 2.35 164000
4 NaN Doug Walker ... NaN 0
Rename the row index labels as well:
>>> idx_map = {
... "Avatar": "Ratava",
... "Spectre": "Ertceps",
... "Pirates of the Caribbean: At World's End": "POC",
... }
>>> col_map = {
... "aspect_ratio": "aspect",
... "movie_facebook_likes": "fblikes",
... }
>>> (
... movies.set_index("movie_title")
... .rename(index=idx_map, columns=col_map)
... .head(3)
... )
color director_name ... aspect fblikes
movie_title ...
Ratava Color James Cameron ... 1.78 33000
POC Color Gore Verbinski ... 2.35 0
Ertceps Color Sam Mendes ... 2.35 85000
Another way to rename the row and column labels is to assign directly to the .index and .columns attributes:
>>> movies = pd.read_csv(
... "data/movie.csv", index_col="movie_title"
... )
>>> ids = movies.index.to_list()
>>> columns = movies.columns.to_list()
# rename the row and column labels with list assignments
>>> ids[0] = "Ratava"
>>> ids[1] = "POC"
>>> ids[2] = "Ertceps"
>>> columns[1] = "director"
>>> columns[-2] = "aspect"
>>> columns[-1] = "fblikes"
>>> movies.index = ids
>>> movies.columns = columns
>>> movies.head(3)
color director ... aspect fblikes
Ratava Color James Cameron ... 1.78 33000
POC Color Gore Verbinski ... 2.35 0
Ertceps Color Sam Mendes ... 2.35 85000
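Yet another approach is to pass a function to the .rename method. The example below strips leading and trailing whitespace from each column name, converts it to lowercase, and replaces the remaining spaces with underscores: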
>>> def to_clean(val):
... return val.strip().lower().replace(" ", "_")
>>> movies.rename(columns=to_clean).head(3)
color director ... aspect fblikes
Ratava Color James Cameron ... 1.78 33000
POC Color Gore Verbinski ... 2.35 0
Ertceps Color Sam Mendes ... 2.35 85000
Rename the columns with a list comprehension:
>>> cols = [
... col.strip().lower().replace(" ", "_")
... for col in movies.columns
... ]
>>> movies.columns = cols
>>> movies.head(3)
color director ... aspect fblikes
Ratava Color James Cameron ... 1.78 33000
POC Color Gore Verbinski ... 2.35 0
Ertceps Color Sam Mendes ... 2.35 85000
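One practical difference between the approaches in this section is that .rename returns a new DataFrame, whereas assigning to .columns changes the existing one. The following sketch shows this on a made-up frame whose column names are deliberately untidy.

```python
import pandas as pd

# Made-up frame with untidy column names
df = pd.DataFrame({" Director Name": ["X"], "IMDB Score": [7.0]})

renamed = df.rename(columns=lambda c: c.strip().lower().replace(" ", "_"))
print(list(renamed.columns))   # cleaned names on the new object
print(list(df.columns))        # the original is unchanged

# Assigning to .columns, by contrast, modifies df itself
df.columns = ["director", "score"]
print(list(df.columns))
```

Because .rename does not touch the original, it fits naturally into a method chain; attribute assignment needs its own statement.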
1.10 Creating and deleting columns
The simplest way to create a column is by assignment:
>>> movies = pd.read_csv("data/movie.csv")
>>> movies["has_seen"] = 0
Use the .assign method to create the column instead:
>>> movies = pd.read_csv("data/movie.csv")
>>> idx_map = {
... "Avatar": "Ratava",
... "Spectre": "Ertceps",
... "Pirates of the Caribbean: At World's End": "POC",
... }
>>> col_map = {
... "aspect_ratio": "aspect",
... "movie_facebook_likes": "fblikes",
... }
>>> (
... movies.rename(
... index=idx_map, columns=col_map
... ).assign(has_seen=0)
... )
color director_name ... fblikes has_seen
0 Color James Cameron ... 33000 0
1 Color Gore Verbinski ... 0 0
2 Color Sam Mendes ... 85000 0
3 Color Christopher Nolan ... 164000 0
4 NaN Doug Walker ... 0 0
... ... ... ... ... ...
4911 Color Scott Smith ... 84 0
4912 Color NaN ... 32000 0
4913 Color Benjamin Roberds ... 16 0
4914 Color Daniel Hsia ... 660 0
4915 Color Jon Gunn ... 456 0
A new column can also be computed from several existing columns.
The simplest approach is to operate on the columns directly:
>>> total = (
... movies["actor_1_facebook_likes"]
... + movies["actor_2_facebook_likes"]
... + movies["actor_3_facebook_likes"]
... + movies["director_facebook_likes"]
... )
>>> total.head(5)
0 2791.0
1 46563.0
2 11554.0
3 95000.0
4 NaN
dtype: float64
The second approach uses the .sum method:
>>> cols = [
... "actor_1_facebook_likes",
... "actor_2_facebook_likes",
... "actor_3_facebook_likes",
... "director_facebook_likes",
... ]
>>> sum_col = movies.loc[:, cols].sum(axis="columns")
>>> sum_col.head(5)
0 2791.0
1 46563.0
2 11554.0
3 95000.0
4 274.0
dtype: float64
>>> movies.assign(total_likes=sum_col).head(5)
color director_name ... movie_facebook_likes total_likes
0 Color James Cameron ... 33000 2791.0
1 Color Gore Verbinski ... 0 46563.0
2 Color Sam Mendes ... 85000 11554.0
3 Color Christopher Nolan ... 164000 95000.0
4 NaN Doug Walker ... 0 274.0
Another approach is to pass a function to the .assign method:
>>> def sum_likes(df):
... return df[
... [
... c
... for c in df.columns
... if "like" in c
... and ("actor" in c or "director" in c)
... ]
... ].sum(axis=1)
>>> movies.assign(total_likes=sum_likes).head(5)
color director_name ... movie_facebook_likes total_likes
0 Color James Cameron ... 33000 2791.0
1 Color Gore Verbinski ... 0 46563.0
2 Color Sam Mendes ... 85000 11554.0
3 Color Christopher Nolan ... 164000 95000.0
4 NaN Doug Walker ... 0 274.0
If any of the columns has a missing value, adding them with + gives NaN for that row, whereas the .sum method treats NaN as 0:
>>> (
... movies.assign(total_likes=sum_col)["total_likes"]
... .isna()
... .sum()
... )
0
>>> (
... movies.assign(total_likes=total)["total_likes"]
... .isna()
... .sum()
... )
122
# After filling the missing values, the count becomes 0.
>>> (
... movies.assign(total_likes=total.fillna(0))[
... "total_likes"
... ]
... .isna()
... .sum()
... )
0
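The different treatment of NaN by + and .sum can be reproduced, and tuned, on a small made-up frame. The sketch below also shows the skipna and min_count parameters of .sum, which control how missing values are handled.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1.0, np.nan, np.nan], "b": [2.0, 5.0, np.nan]})

with_plus = df["a"] + df["b"]                       # NaN if either side is missing
with_sum = df.sum(axis="columns")                   # missing values are skipped
strict_sum = df.sum(axis="columns", skipna=False)   # behaves like +
needs_one = df.sum(axis="columns", min_count=1)     # NaN only if every value is missing

print(with_plus.tolist())
print(with_sum.tolist())
print(strict_sum.tolist())
print(needs_one.tolist())
```

The last variant is worth considering when a row with no data at all should stay missing rather than silently become 0.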
The movies DataFrame has a column named cast_total_facebook_likes. We now want to compare cast_total_facebook_likes with the total_likes column we just created:
>>> def cast_like_gt_actor(df):
... return (
... df["cast_total_facebook_likes"]
... >= df["total_likes"]
... )
>>> df2 = movies.assign(
... total_likes=total,
... is_cast_likes_more=cast_like_gt_actor,
... )
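The call above works because, within a single .assign call, a callable passed for a later keyword can see the columns created by earlier keywords. Here is a minimal sketch of that behaviour on a made-up frame; the column names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({"cast": [10, 5], "actor": [4, 3], "director": [2, 3]})

out = df.assign(
    total=lambda d: d["actor"] + d["director"],
    # this callable already sees the "total" column created just above
    cast_ge_total=lambda d: d["cast"] >= d["total"],
)
print(out)
print("total" in df.columns)   # False: .assign returned a new DataFrame
```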
Use the .all method to check whether is_cast_likes_more is True for every row:
>>> df2["is_cast_likes_more"].all()
False
At least one row has total_likes greater than cast_total_facebook_likes. This may be because the director's Facebook likes are not part of the cast total. So first drop the total_likes column:
>>> df2 = df2.drop(columns="total_likes")
Then build a new sum that contains only the actor likes:
>>> actor_sum = movies[
... [
... c
... for c in movies.columns
... if "actor_" in c and "_likes" in c
... ]
... ].sum(axis="columns")
>>> actor_sum.head(5)
0 2791.0
1 46000.0
2 11554.0
3 73000.0
4 143.0
dtype: float64
Check again whether cast_total_facebook_likes is greater than or equal to actor_sum:
>>> movies["cast_total_facebook_likes"] >= actor_sum
0 True
1 True
2 True
3 True
4 True
...
4911 True
4912 True
4913 True
4914 True
4915 True
Length: 4916, dtype: bool
>>> movies["cast_total_facebook_likes"].ge(actor_sum)
0 True
1 True
2 True
3 True
4 True
...
4911 True
4912 True
4913 True
4914 True
4915 True
Length: 4916, dtype: bool
>>> movies["cast_total_facebook_likes"].ge(actor_sum).all()
True
Finally, compute actor_sum as a percentage of cast_total_facebook_likes:
>>> pct_like = actor_sum.div(
... movies["cast_total_facebook_likes"]
... ).mul(100)
Check that the values in pct_like lie between 0 and 100:
>>> pct_like.describe()
count 4883.000000
mean 83.327889
std 14.056578
min 30.076696
25% 73.528368
50% 86.928884
75% 95.477440
max 100.000000
dtype: float64
Create a Series that uses movie_title as its index:
>>> pd.Series(
... pct_like.to_numpy(), index=movies["movie_title"]
... ).head()
movie_title
Avatar 57.736864
Pirates of the Caribbean: At World's End 95.139607
Spectre 98.752137
The Dark Knight Rises 68.378310
Star Wars: Episode VII - The Force Awakens 100.000000
dtype: float64
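Use .insert to insert a column at a given position. The .insert method modifies the DataFrame in place and does not return a new object.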
>>> profit_index = movies.columns.get_loc("gross") + 1
>>> profit_index
9
>>> movies.insert(
... loc=profit_index,
... column="profit",
... value=movies["gross"] - movies["budget"],
... )
The del statement can also delete a column; it too works in place and does not return a new object.
>>> del movies["director_name"]
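To see which of these operations change the DataFrame in place and which return a new one, here is a short sketch on a made-up frame (the values and the title column are invented for the example).

```python
import pandas as pd

df = pd.DataFrame({"gross": [100, 50], "budget": [60, 70], "title": ["A", "B"]})

position = df.columns.get_loc("gross") + 1
returned = df.insert(
    loc=position, column="profit", value=df["gross"] - df["budget"]
)
print(returned)           # None: df itself was changed
print(list(df.columns))   # profit now sits right after gross

without_title = df.drop(columns="title")   # returns a new DataFrame
del df["budget"]                           # removes the column from df in place
print(list(without_title.columns))
print(list(df.columns))
```

Since .insert and del mutate the object and return nothing useful, they cannot be used inside a method chain; .assign and .drop can.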