
Reference: pandas cookbook

1. create new columns by using the .assign method

1.1 Insert a column without caring about its position

# Method 1: create the column directly.
# One way to create a new column is index assignment.
# Note that this does not return a new DataFrame; it mutates the existing DataFrame.
# Here every row is assigned the value zero.
# By default, new columns are appended to the end.
   color      director_name  ...  movie_facebook_likes  has_seen
0  Color      James Cameron  ...                 33000         0
1  Color     Gore Verbinski  ...                     0         0
2  Color         Sam Mendes  ...                 85000         0
3  Color  Christopher Nolan  ...                164000         0
4    NaN        Doug Walker  ...                     0         0
[5 rows x 29 columns]
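
The statement that produced the output above is not shown in these notes. A minimal sketch of index assignment, assuming a tiny hypothetical stand-in for the movies DataFrame (the three rows below are illustrative, not the real dataset), might look like this:

```python
import pandas as pd

# Hypothetical three-row stand-in for the movies DataFrame.
movies = pd.DataFrame({
    "director_name": ["James Cameron", "Gore Verbinski", "Sam Mendes"],
    "movie_facebook_likes": [33000, 0, 85000],
})

# Index assignment broadcasts the scalar 0 to every row and
# mutates `movies` in place; the new column lands at the end.
movies["has_seen"] = 0

print(movies.columns.tolist())
```

Because the assignment works in place, there is nothing to capture on the left-hand side other than the column itself.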

# Method 2
# Use the .assign method to create a new column.
# It returns a new DataFrame that includes the new column and leaves the original unchanged.
# The parameter names are used as the column names.
      color      director_name  ...  has_seen  has_seen2
0     Color      James Cameron  ...         0          0
1     Color     Gore Verbinski  ...         0          0
2     Color         Sam Mendes  ...         0          0
3     Color  Christopher Nolan  ...         0          0
4       NaN        Doug Walker  ...         0          0
     ...                ...  ...       ...        ...
4911  Color        Scott Smith  ...         0          0
4912  Color                NaN  ...         0          0
4913  Color   Benjamin Roberds  ...         0          0
4914  Color        Daniel Hsia  ...         0          0
4915  Color           Jon Gunn  ...         0          0
[4916 rows x 30 columns]
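
As a rough illustration of the .assign behaviour described above, the sketch below uses a small hypothetical DataFrame (the rows and the variable name `with_flag` are assumptions made for this example):

```python
import pandas as pd

# Hypothetical two-row stand-in for the movies DataFrame.
movies = pd.DataFrame({
    "director_name": ["James Cameron", "Gore Verbinski"],
    "has_seen": [0, 0],
})

# The keyword name (has_seen2) becomes the column name.
# .assign returns a new DataFrame; `movies` itself is not changed.
with_flag = movies.assign(has_seen2=0)

print("has_seen2" in movies.columns)   # the original has no new column
print(with_flag.columns.tolist())
```

To keep the new column, the result has to be bound to a name, for example by reassigning it to `movies`.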

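# A small example: summing up the like counts. Note the difference between the + operator and the .sum method: the former keeps missing values, while the latter skips them, so they count as 0.
# Add up all actor and director Facebook like columns
# and assign the result to the `total_likes` column.
# method 1
total = (movies["actor_1_facebook_likes"] + movies["actor_2_facebook_likes"]
         + movies["actor_3_facebook_likes"] + movies["director_facebook_likes"])
# movies["total_facebook_likes"]=total
# method 2
# .sum ignores the missing values,
# whereas with the + operator the result is missing wherever any operand is missing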
   color      director_name  ...  has_seen  total_likes
0  Color      James Cameron  ...         0       2791.0
1  Color     Gore Verbinski  ...         0      46563.0
2  Color         Sam Mendes  ...         0      11554.0
3  Color  Christopher Nolan  ...         0      95000.0
4    NaN        Doug Walker  ...         0        274.0
[5 rows x 30 columns]
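
The difference between the + operator and .sum can be seen on a two-row example. This is only a sketch: the `likes` frame and its values are made up, and the `.add(..., fill_value=0)` variant is an extra option not used in the notes.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing value.
likes = pd.DataFrame({
    "actor_1_facebook_likes": [1000.0, np.nan],
    "actor_2_facebook_likes": [936.0, 12.0],
})

# The + operator propagates NaN: a missing operand makes the whole row NaN.
plus_total = likes["actor_1_facebook_likes"] + likes["actor_2_facebook_likes"]

# .sum(axis="columns") skips NaN by default (skipna=True).
sum_total = likes.sum(axis="columns")

# .add with fill_value=0 substitutes 0 when only one side is missing.
add_total = likes["actor_1_facebook_likes"].add(
    likes["actor_2_facebook_likes"], fill_value=0
)

print(plus_total.isna().tolist())
print(sum_total.tolist())
```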
# method 3: pass a function to the .assign method
# Pass a function as the value of a keyword argument in the call to .assign;
# the function accepts the DataFrame as input and should return a Series.
def sum_like(df):
    return df[[c for c in df.columns if "like" in c and ("actor" in c or "director" in c) ]].sum(axis="columns")
   color      director_name  ...  has_seen  total_likes
0  Color      James Cameron  ...         0       2791.0
1  Color     Gore Verbinski  ...         0      46563.0
2  Color         Sam Mendes  ...         0      11554.0
3  Color  Christopher Nolan  ...         0      95000.0
4    NaN        Doug Walker  ...         0        274.0
[5 rows x 30 columns]
# .sum skips NaN by default (skipna=True), which has the same effect as treating missing values as zero
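
The call that passes sum_like to .assign is not shown above. A minimal sketch of how it could be wired up, assuming a small hypothetical DataFrame (the list comprehension is split onto two lines for readability but does the same filtering):

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in; movie_facebook_likes must NOT be included in the sum.
movies = pd.DataFrame({
    "actor_1_facebook_likes": [1000.0, np.nan],
    "director_facebook_likes": [0.0, 563.0],
    "movie_facebook_likes": [33000, 0],
})

def sum_like(df):
    # Keep only the actor/director "like" columns, then sum across each row.
    cols = [c for c in df.columns
            if "like" in c and ("actor" in c or "director" in c)]
    return df[cols].sum(axis="columns")

# .assign calls sum_like(movies) and stores the returned Series as total_likes.
result = movies.assign(total_likes=sum_like)

print(result["total_likes"].tolist())
```

Passing the function itself (without parentheses) lets .assign evaluate it against whatever DataFrame it is called on, which is convenient in method chains.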

# Insert two columns at once with the .assign method
def cast_like_gt_actor(df):
    return (df["cast_total_facebook_likes"]>=df["total_likes"])
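
A sketch of adding both columns in a single .assign call follows. It relies on the documented behaviour that later keyword arguments may refer to columns created by earlier ones in the same call; the data, the simplified sum_like, and the column name is_cast_likes_more (taken from the output later in these notes) are assumptions for illustration.

```python
import pandas as pd

# Hypothetical stand-in for the movies DataFrame.
movies = pd.DataFrame({
    "actor_1_facebook_likes": [1000, 40000],
    "director_facebook_likes": [0, 563],
    "cast_total_facebook_likes": [4834, 30000],
})

def sum_like(df):
    # Simplified version: sum two known like columns row by row.
    return df[["actor_1_facebook_likes", "director_facebook_likes"]].sum(axis="columns")

def cast_like_gt_actor(df):
    # Sees total_likes because it was created earlier in the same .assign call.
    return df["cast_total_facebook_likes"] >= df["total_likes"]

result = movies.assign(
    total_likes=sum_like,
    is_cast_likes_more=cast_like_gt_actor,
)

print(result[["total_likes", "is_cast_likes_more"]])
```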

1.2 Insert a column at a specific position

Insert a new column at a specific location in a DataFrame with the .insert method.
The .insert method takes the integer position of the new column as its first argument, the name of the new column as its second argument, and the values as its third.

# use the .get_loc index method to find the integer location of a column name
profit_index = movies.columns.get_loc("gross")+1
Out[109]: 9
# The .insert method modifies the calling DataFrame in place and returns None, so there is no assignment statement.
# See below: profit now sits directly after gross.
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4916 entries, 0 to 4915
Data columns (total 30 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   color                      4897 non-null   object 
 1   director_name              4814 non-null   object 
 2   num_critic_for_reviews     4867 non-null   float64
 3   duration                   4901 non-null   float64
 4   director_facebook_likes    4814 non-null   float64
 5   actor_3_facebook_likes     4893 non-null   float64
 6   actor_2_name               4903 non-null   object 
 7   actor_1_facebook_likes     4909 non-null   float64
 8   gross                      4054 non-null   float64
 9   profit                     3789 non-null   float64
 10  genres                     4916 non-null   object 
 11  actor_1_name               4909 non-null   object 
 12  movie_title                4916 non-null   object 
 13  num_voted_users            4916 non-null   int64  
 14  cast_total_facebook_likes  4916 non-null   int64  
 15  actor_3_name               4893 non-null   object 
 16  facenumber_in_poster       4903 non-null   float64
 17  plot_keywords              4764 non-null   object 
 18  movie_imdb_link            4916 non-null   object 
 19  num_user_for_reviews       4895 non-null   float64
 20  language                   4904 non-null   object 
 21  country                    4911 non-null   object 
 22  content_rating             4616 non-null   object 
 23  budget                     4432 non-null   float64
 24  title_year                 4810 non-null   float64
 25  actor_2_facebook_likes     4903 non-null   float64
 26  imdb_score                 4916 non-null   float64
 27  aspect_ratio               4590 non-null   float64
 28  movie_facebook_likes       4916 non-null   int64  
 29  has_seen                   4916 non-null   int64  
dtypes: float64(14), int64(4), object(12)
memory usage: 1.1+ MB
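
The .insert call itself does not appear in the notes. A minimal sketch, assuming profit is computed as gross minus budget (an assumption; the notes do not state the formula) and using a tiny hypothetical DataFrame:

```python
import pandas as pd

# Hypothetical stand-in with made-up numbers.
movies = pd.DataFrame({
    "gross": [760.0, 309.0],
    "genres": ["Action", "Fantasy"],
    "budget": [237.0, 300.0],
})

# Integer position immediately after the gross column.
profit_index = movies.columns.get_loc("gross") + 1

# Arguments: position, new column name, values. Works in place and returns None.
movies.insert(loc=profit_index, column="profit",
              value=movies["gross"] - movies["budget"])

print(movies.columns.tolist())
```

Calling .insert a second time with the same column name raises an error by default, so this cell is not safe to re-run without recreating the DataFrame.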

2. delete columns with the .drop method

2.1 .drop()

# delete the 'total_likes' column
      color      director_name  ...  has_seen  is_cast_likes_more
0     Color      James Cameron  ...         0                True
1     Color     Gore Verbinski  ...         0                True
2     Color         Sam Mendes  ...         0                True
3     Color  Christopher Nolan  ...         0                True
4       NaN        Doug Walker  ...         0               False
     ...                ...  ...       ...                 ...
4911  Color        Scott Smith  ...         0                True
4912  Color                NaN  ...         0                True
4913  Color   Benjamin Roberds  ...         0                True
4914  Color        Daniel Hsia  ...         0                True
4915  Color           Jon Gunn  ...         0                True
[4916 rows x 30 columns]
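
The .drop call is missing from the notes. One way it could be written, shown on a small hypothetical DataFrame (the name `dropped` is an assumption for this example):

```python
import pandas as pd

# Hypothetical stand-in for the movies DataFrame.
movies = pd.DataFrame({
    "color": ["Color", "Color"],
    "has_seen": [0, 0],
    "total_likes": [2791.0, 46563.0],
})

# .drop returns a new DataFrame without the column; `movies` is unchanged.
dropped = movies.drop(columns="total_likes")

print(dropped.columns.tolist())
print("total_likes" in movies.columns)
```

A list of names can be passed to `columns` to remove several columns in one call.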
actor_sum=movies[[c for c in movies.columns if "actor_" in c and "_likes" in c]].sum(axis="columns")
0     2791.0
1    46000.0
2    11554.0
3    73000.0
4      143.0
dtype: float64
# Different ways to compare two Series: method 1
0       True
1       True
2       True
3       True
4       True
4911    True
4912    True
4913    True
4914    True
4915    True
Length: 4916, dtype: bool
# Different ways to compare two Series: method 2
0       True
1       True
2       True
3       True
4       True
4911    True
4912    True
4913    True
4914    True
4915    True
Length: 4916, dtype: bool
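
The two comparison statements are not shown. A plausible pair, assuming the two methods are the >= operator and the equivalent .ge method (an assumption, since the notes only show the outputs), with made-up values:

```python
import pandas as pd

# Hypothetical values standing in for the two Series being compared.
cast_total = pd.Series([4834, 48350, 11700])
actor_sum = pd.Series([2791.0, 46000.0, 11554.0])

# Method 1: comparison operator.
by_operator = cast_total >= actor_sum

# Method 2: the equivalent method, convenient in method chains.
by_method = cast_total.ge(actor_sum)

print(by_operator.equals(by_method))
# .all() collapses the boolean Series to a single True/False.
print(by_method.all())
```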

# Calculate the percentage of 'cast_total_facebook_likes' that comes from 'actor_sum'.
# Note that the numerator (actor_sum) comes first in the division.
count    4883.000000
mean       83.327889
std        14.056578
min        30.076696
25%        73.528368
50%        86.928884
75%        95.477440
max       100.000000
dtype: float64
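
A sketch of the percentage calculation, assuming it is done with the .div and .mul methods (the notes do not show the statement, and the values below are illustrative):

```python
import pandas as pd

# Hypothetical values for two movies.
actor_sum = pd.Series([2791.0, 46000.0])
cast_total = pd.Series([4834, 48350])

# The calling Series is the numerator; the argument to .div is the denominator.
pct_like = actor_sum.div(cast_total).mul(100)

print(pct_like.round(2).tolist())
```

Swapping the two Series would give the ratio of cast likes to actor likes instead, which is why the order matters.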

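# Create a Series using the 'movie_title' column as the index.
# The failed attempt (the first result below) shows that to_numpy() cannot be omitted: pct_like is a Series with its own integer index, so pandas aligns it against the new title labels, finds no matches, and fills every value with NaN.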
Avatar                                       NaN
Pirates of the Caribbean: At World's End     NaN
Spectre                                      NaN
The Dark Knight Rises                        NaN
Star Wars: Episode VII - The Force Awakens   NaN
dtype: float64

Avatar                                         57.736864
Pirates of the Caribbean: At World's End       95.139607
Spectre                                        98.752137
The Dark Knight Rises                          68.378310
Star Wars: Episode VII - The Force Awakens    100.000000
dtype: float64

# See: pct_like is a Series with its own integer index.
0        57.736864
1        95.139607
2        98.752137
3        68.378310
4       100.000000
4911     62.417871
4912    100.000000
4913           NaN
4914     90.276614
4915     76.687117
Length: 4916, dtype: float64
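
The index-alignment behaviour behind the all-NaN result can be reproduced on two rows. This is a sketch with made-up values; the names `titles`, `misaligned`, and `aligned` are assumptions for the example.

```python
import pandas as pd

# pct_like carries the default integer index 0, 1.
pct_like = pd.Series([57.7, 95.1])
titles = pd.Series(["Avatar", "Spectre"])

# Passing a Series as data makes pandas look up the new labels in the
# old index; no integer label equals a title, so every value is NaN.
misaligned = pd.Series(pct_like, index=titles.to_numpy())

# to_numpy() strips the old index, so values are placed by position.
aligned = pd.Series(pct_like.to_numpy(), index=titles.to_numpy())

print(misaligned.isna().all())
print(aligned.to_dict())
```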


2.2 del

# del does not return a new DataFrame; it removes the column in place
del movies["director_name"]
# Check that the `director_name` column is really gone
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4916 entries, 0 to 4915
Data columns (total 29 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   color                      4897 non-null   object 
 1   num_critic_for_reviews     4867 non-null   float64
 2   duration                   4901 non-null   float64
 3   director_facebook_likes    4814 non-null   float64
 4   actor_3_facebook_likes     4893 non-null   float64
 5   actor_2_name               4903 non-null   object 
 6   actor_1_facebook_likes     4909 non-null   float64
 7   gross                      4054 non-null   float64
 8   profit                     3789 non-null   float64
 9   genres                     4916 non-null   object 
 10  actor_1_name               4909 non-null   object 
 11  movie_title                4916 non-null   object 
 12  num_voted_users            4916 non-null   int64  
 13  cast_total_facebook_likes  4916 non-null   int64  
 14  actor_3_name               4893 non-null   object 
 15  facenumber_in_poster       4903 non-null   float64
 16  plot_keywords              4764 non-null   object 
 17  movie_imdb_link            4916 non-null   object 
 18  num_user_for_reviews       4895 non-null   float64
 19  language                   4904 non-null   object 
 20  country                    4911 non-null   object 
 21  content_rating             4616 non-null   object 
 22  budget                     4432 non-null   float64
 23  title_year                 4810 non-null   float64
 24  actor_2_facebook_likes     4903 non-null   float64
 25  imdb_score                 4916 non-null   float64
 26  aspect_ratio               4590 non-null   float64
 27  movie_facebook_likes       4916 non-null   int64  
 28  has_seen                   4916 non-null   int64  
dtypes: float64(14), int64(4), object(11)
memory usage: 1.1+ MB
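
To contrast del with .drop, here is a short sketch on a hypothetical DataFrame (the rows and the name `without_flag` are assumptions for illustration):

```python
import pandas as pd

# Hypothetical stand-in for the movies DataFrame.
movies = pd.DataFrame({
    "color": ["Color", "Color"],
    "director_name": ["James Cameron", "Sam Mendes"],
    "has_seen": [0, 0],
})

# del is a statement: it removes the column in place and yields no value.
del movies["director_name"]

# .drop leaves `movies` alone and hands back a new DataFrame.
without_flag = movies.drop(columns="has_seen")

print(movies.columns.tolist())
print(without_flag.columns.tolist())
```

Since del cannot be chained and only handles one column per statement, .drop tends to fit better inside longer method chains.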
