参考文献:pandas cookbook
这一系列的文章,既不能算原创,也不能说就是翻译,为了方便我还是选择点【原创】,参考书就是参考文献那条。
movies=pd.read_csv("movie.csv")
# 方法一,直接创建。
# one way to create a new column : do an index assignment
# note that this will not return a new DataFrame but mutate the existing DataFrame 改动了原数据帧
movies["has_seen"]=0
# assign zero for every value
# by default, new columns are append to the end
movies.head()
Out[69]:
color director_name ... movie_facebook_likes has_seen
0 Color James Cameron ... 33000 0
1 Color Gore Verbinski ... 0 0
2 Color Sam Mendes ... 85000 0
3 Color Christopher Nolan ... 164000 0
4 NaN Doug Walker ... 0 0
[5 rows x 29 columns]
# 方法二
# use .assign method to create a new column
# in this way, it will return a new dataframe with the new column
# it uses the parameter names as the column names
movies.assign(has_seen2=0)
Out[74]:
color director_name ... has_seen has_seen2
0 Color James Cameron ... 0 0
1 Color Gore Verbinski ... 0 0
2 Color Sam Mendes ... 0 0
3 Color Christopher Nolan ... 0 0
4 NaN Doug Walker ... 0 0
... ... ... ... ...
4911 Color Scott Smith ... 0 0
4912 Color NaN ... 0 0
4913 Color Benjamin Roberds ... 0 0
4914 Color Daniel Hsia ... 0 0
4915 Color Jon Gunn ... 0 0
[4916 rows x 30 columns]
# 一个小栗子:点赞数据的汇总。注意使用加号和sum方法的区别:前者会保持缺失值,后者会将缺失值化为0再加和。
# add up all actor and director Facebook like columns
# and assign them to the `total_like column
# method 1
total = movies["actor_1_facebook_likes"]+movies["actor_2_facebook_likes"]\
+movies["actor_3_facebook_likes"]+movies["director_facebook_likes"]
# movies["total_facebook_likes"]=total
# method 2
# .sum ignore the missing values
# but operator + , the result had the missing numbers
cols=["actor_1_facebook_likes","actor_2_facebook_likes","actor_3_facebook_likes","director_facebook_likes"]
sum_col=movies.loc[:,cols].sum(axis="columns")
movies.assign(total_likes=sum_col).head()
Out[85]:
color director_name ... has_seen total_likes
0 Color James Cameron ... 0 2791.0
1 Color Gore Verbinski ... 0 46563.0
2 Color Sam Mendes ... 0 11554.0
3 Color Christopher Nolan ... 0 95000.0
4 NaN Doug Walker ... 0 274.0
[5 rows x 30 columns]
# method 3 :用assign方法调用函数
# pass in a function as the value of the parameter in the call to the .assign method
# and the function accepts a DataFrame as input and should return a Series
def sum_like(df):
return df[[c for c in df.columns if "like" in c and ("actor" in c or "director" in c) ]].sum(axis="columns")
movies.assign(total_likes=sum_like).head(5)
Out[90]:
color director_name ... has_seen total_likes
0 Color James Cameron ... 0 2791.0
1 Color Gore Verbinski ... 0 46563.0
2 Color Sam Mendes ... 0 11554.0
3 Color Christopher Nolan ... 0 95000.0
4 NaN Doug Walker ... 0 274.0
[5 rows x 30 columns]
# with .sum method it converts Nan to zero
# 使用.assign方法同时插入两列
def cast_like_gt_actor(df):
return (df["cast_total_facebook_likes"]>=df["total_likes"])
df2=movies.assign(total_likes=sum_like,is_cast_likes_more=cast_like_gt_actor)
insert a new column into a specific location in a DataFrame with .insert
method ;
.insert
method takes the integer position of the new column as the first argument, the name of the new column as its second argument, and the values as its third ;
# get_loc index method to find the integer location of the column name
profit_index = movies.columns.get_loc("gross")+1
profit_index
Out[109]: 9
# .insert method modifies the calling DataFrame in-place, so there won't be an assignment statement
# 直接在原数据帧上插入了一列,修改了原数据帧
movies.insert(loc=profit_index,column="profit",value=movies["gross"]-movies["budget"])
# 看吧 gross下面就是有个profit (*^_^*)
movies.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4916 entries, 0 to 4915
Data columns (total 30 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 color 4897 non-null object
1 director_name 4814 non-null object
2 num_critic_for_reviews 4867 non-null float64
3 duration 4901 non-null float64
4 director_facebook_likes 4814 non-null float64
5 actor_3_facebook_likes 4893 non-null float64
6 actor_2_name 4903 non-null object
7 actor_1_facebook_likes 4909 non-null float64
8 gross 4054 non-null float64
9 profit 3789 non-null float64
10 genres 4916 non-null object
11 actor_1_name 4909 non-null object
12 movie_title 4916 non-null object
13 num_voted_users 4916 non-null int64
14 cast_total_facebook_likes 4916 non-null int64
15 actor_3_name 4893 non-null object
16 facenumber_in_poster 4903 non-null float64
17 plot_keywords 4764 non-null object
18 movie_imdb_link 4916 non-null object
19 num_user_for_reviews 4895 non-null float64
20 language 4904 non-null object
21 country 4911 non-null object
22 content_rating 4616 non-null object
23 budget 4432 non-null float64
24 title_year 4810 non-null float64
25 actor_2_facebook_likes 4903 non-null float64
26 imdb_score 4916 non-null float64
27 aspect_ratio 4590 non-null float64
28 movie_facebook_likes 4916 non-null int64
29 has_seen 4916 non-null int64
dtypes: float64(14), int64(4), object(12)
memory usage: 1.1+ MB
# delete the 'total_likes' column
df2.drop(columns="total_likes")
Out[96]:
color director_name ... has_seen is_cast_likes_more
0 Color James Cameron ... 0 True
1 Color Gore Verbinski ... 0 True
2 Color Sam Mendes ... 0 True
3 Color Christopher Nolan ... 0 True
4 NaN Doug Walker ... 0 False
... ... ... ... ...
4911 Color Scott Smith ... 0 True
4912 Color NaN ... 0 True
4913 Color Benjamin Roberds ... 0 True
4914 Color Daniel Hsia ... 0 True
4915 Color Jon Gunn ... 0 True
[4916 rows x 30 columns]
actor_sum=movies[[c for c in movies.columns if "actor_" in c and "_likes" in c]].sum(axis="columns")
actor_sum.head()
Out[98]:
0 2791.0
1 46000.0
2 11554.0
3 73000.0
4 143.0
dtype: float64
# Series 之间比较大小的不同方法,方法1
movies["cast_total_facebook_likes"]>=actor_sum
Out[99]:
0 True
1 True
2 True
3 True
4 True
...
4911 True
4912 True
4913 True
4914 True
4915 True
Length: 4916, dtype: bool
# Series 之间比较大小的不同方法,方法2
movies["cast_total_facebook_likes"].ge(actor_sum)
Out[100]:
0 True
1 True
2 True
3 True
4 True
...
4911 True
4912 True
4913 True
4914 True
4915 True
Length: 4916, dtype: bool
# 计算百分比,注意除数在前
# calculate the percentage of the 'cast_total_facebook_likes' that come from 'actor_sum'
pct_like=actor_sum.div(movies["cast_total_facebook_likes"]).mul(100)
pct_like.describe()
Out[103]:
count 4883.000000
mean 83.327889
std 14.056578
min 30.076696
25% 73.528368
50% 86.928884
75% 95.477440
max 100.000000
dtype: float64
# create a Series using the 'movie_title' column as the index
# 看下面失败的案例(第一个结果)可知,to_numpy()是不能省略的,我猜这是因为pct_like是一个Series,有自己的index。
pd.Series(pct_like,index=movies["movie_title"]).head()
Out[105]:
movie_title
Avatar NaN
Pirates of the Caribbean: At World's End NaN
Spectre NaN
The Dark Knight Rises NaN
Star Wars: Episode VII - The Force Awakens NaN
dtype: float64
pd.Series(pct_like.to_numpy(),index=movies["movie_title"]).head()
Out[106]:
movie_title
Avatar 57.736864
Pirates of the Caribbean: At World's End 95.139607
Spectre 98.752137
The Dark Knight Rises 68.378310
Star Wars: Episode VII - The Force Awakens 100.000000
dtype: float64
# 看吧 pct_like是一个Series,有自己的index (*^_^*)
pct_like
Out[107]:
0 57.736864
1 95.139607
2 98.752137
3 68.378310
4 100.000000
...
4911 62.417871
4912 100.000000
4913 NaN
4914 90.276614
4915 76.687117
Length: 4916, dtype: float64
# not return a new DataFrame
del movies["director_name"]
# 看看 `director_name`列真的没有了
movies.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4916 entries, 0 to 4915
Data columns (total 29 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 color 4897 non-null object
1 num_critic_for_reviews 4867 non-null float64
2 duration 4901 non-null float64
3 director_facebook_likes 4814 non-null float64
4 actor_3_facebook_likes 4893 non-null float64
5 actor_2_name 4903 non-null object
6 actor_1_facebook_likes 4909 non-null float64
7 gross 4054 non-null float64
8 profit 3789 non-null float64
9 genres 4916 non-null object
10 actor_1_name 4909 non-null object
11 movie_title 4916 non-null object
12 num_voted_users 4916 non-null int64
13 cast_total_facebook_likes 4916 non-null int64
14 actor_3_name 4893 non-null object
15 facenumber_in_poster 4903 non-null float64
16 plot_keywords 4764 non-null object
17 movie_imdb_link 4916 non-null object
18 num_user_for_reviews 4895 non-null float64
19 language 4904 non-null object
20 country 4911 non-null object
21 content_rating 4616 non-null object
22 budget 4432 non-null float64
23 title_year 4810 non-null float64
24 actor_2_facebook_likes 4903 non-null float64
25 imdb_score 4916 non-null float64
26 aspect_ratio 4590 non-null float64
27 movie_facebook_likes 4916 non-null int64
28 has_seen 4916 non-null int64
dtypes: float64(14), int64(4), object(11)
memory usage: 1.1+ MB