The code in this article is based on Python 3.6.5 and pandas 0.23.0.
pandas is a module built on top of NumPy that provides a rich set of data-preprocessing interfaces.
Before using pandas, first import the module at the top of your code:
import pandas as pd
Now suppose we have a food_info.csv file in the following format:
NDB_No,Shrt_Desc,Water_(g),Energ_Kcal,Protein_(g),Lipid_Tot_(g),Ash_(g),Carbohydrt_(g),Fiber_TD_(g),Sugar_Tot_(g),Calcium_(mg),Iron_(mg),Magnesium_(mg),Phosphorus_(mg),Potassium_(mg),Sodium_(mg),Zinc_(mg),Copper_(mg),Manganese_(mg),Selenium_(mcg),Vit_C_(mg),Thiamin_(mg),Riboflavin_(mg),Niacin_(mg),Vit_B6_(mg),Vit_B12_(mcg),Vit_A_IU,Vit_A_RAE,Vit_E_(mg),Vit_D_mcg,Vit_D_IU,Vit_K_(mcg),FA_Sat_(g),FA_Mono_(g),FA_Poly_(g),Cholestrl_(mg)
01001,BUTTER WITH SALT,15.87,717,0.85,81.11,2.11,0.06,0,0.06,24,0.02,2,24,24,643,0.09,0,0,1,0,0.005,0.034,0.042,0.003,0.17,2499,684,2.32,1.5,60,7,51.368,21.021,3.043,215
The first line is the header and the second is one data record; the file contains many rows, but only one is shown here to save space. Values in a CSV file are generally separated by commas, and the file can be opened as a table with tools such as Excel.
food_info = pd.read_csv("food_info.csv")
print(type(food_info))  # <class 'pandas.core.frame.DataFrame'>
print(food_info.dtypes)
# NDB_No int64
# Shrt_Desc object
# Water_(g) float64
# Energ_Kcal int64
# Protein_(g) float64
# Lipid_Tot_(g) float64
# Ash_(g) float64
# Carbohydrt_(g) float64
# Fiber_TD_(g) float64
# Sugar_Tot_(g) float64
# Calcium_(mg) float64
# Iron_(mg) float64
# Magnesium_(mg) float64
# Phosphorus_(mg) float64
# Potassium_(mg) float64
# Sodium_(mg) float64
# Zinc_(mg) float64
# Copper_(mg) float64
# Manganese_(mg) float64
# Selenium_(mcg) float64
# Vit_C_(mg) float64
# Thiamin_(mg) float64
# Riboflavin_(mg) float64
# Niacin_(mg) float64
# Vit_B6_(mg) float64
# Vit_B12_(mcg) float64
# Vit_A_IU float64
# Vit_A_RAE float64
# Vit_E_(mg) float64
# Vit_D_mcg float64
# Vit_D_IU float64
# Vit_K_(mcg) float64
# FA_Sat_(g) float64
# FA_Mono_(g) float64
# FA_Poly_(g) float64
# Cholestrl_(mg) float64
# dtype: object
The code above reads food_info.csv into a DataFrame object. A DataFrame can be thought of as a matrix-like structure.
Looking at the concrete type of each column, most are int64 or float64, but there is also an object column: pandas stores strings as the object dtype, which here applies to the Shrt_Desc column.
The common data types are summarized below:
| Type | Description |
|---|---|
| object | for string values |
| int | for integer values |
| float | for float values |
| datetime | for time values |
| bool | for Boolean values |
If you are not yet familiar with the read_csv() method, you can view its documentation with:
help(pd.read_csv)
We can use the DataFrame instance method head() to check that the data was read in correctly:
print(food_info.head())
# NDB_No Shrt_Desc ... FA_Poly_(g) Cholestrl_(mg)
# 0 1001 BUTTER WITH SALT ... 3.043 215.0
# 1 1002 BUTTER WHIPPED WITH SALT ... 3.012 219.0
# 2 1003 BUTTER OIL ANHYDROUS ... 3.694 256.0
# 3 1004 CHEESE BLUE ... 0.800 75.0
# 4 1005 CHEESE BRICK ... 0.784 94.0
#
# [5 rows x 36 columns]
head() takes a parameter n (default 5) that controls how many rows are displayed; you can set it yourself.
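For example, a quick sketch with a custom row count:
print(food_info.head(10))  # show the first 10 rows instead of the default 5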
Similarly, the tail() method shows the last few rows of the data:
print(food_info.tail())
# NDB_No ... Cholestrl_(mg)
# 8613 83110 ... 95.0
# 8614 90240 ... 41.0
# 8615 90480 ... 0.0
# 8616 90560 ... 50.0
# 8617 93600 ... 50.0
#
# [5 rows x 36 columns]
The columns attribute returns the table header (the column names) of the CSV:
print(food_info.columns)
# Index(['NDB_No', 'Shrt_Desc', 'Water_(g)', 'Energ_Kcal', 'Protein_(g)',
# 'Lipid_Tot_(g)', 'Ash_(g)', 'Carbohydrt_(g)', 'Fiber_TD_(g)',
# 'Sugar_Tot_(g)', 'Calcium_(mg)', 'Iron_(mg)', 'Magnesium_(mg)',
# 'Phosphorus_(mg)', 'Potassium_(mg)', 'Sodium_(mg)', 'Zinc_(mg)',
# 'Copper_(mg)', 'Manganese_(mg)', 'Selenium_(mcg)', 'Vit_C_(mg)',
# 'Thiamin_(mg)', 'Riboflavin_(mg)', 'Niacin_(mg)', 'Vit_B6_(mg)',
# 'Vit_B12_(mcg)', 'Vit_A_IU', 'Vit_A_RAE', 'Vit_E_(mg)', 'Vit_D_mcg',
# 'Vit_D_IU', 'Vit_K_(mcg)', 'FA_Sat_(g)', 'FA_Mono_(g)', 'FA_Poly_(g)',
# 'Cholestrl_(mg)'],
# dtype='object')
Since, as mentioned earlier, a DataFrame can be viewed as a matrix, the shape attribute gives us its dimensions:
print(food_info.shape) # (8618, 36)
The row count does not include the header row: the data contains 8618 samples, each with 36 features.
To get a specific row of the DataFrame, use the loc indexer with the row label in square brackets. For example:
print(food_info.loc[0])
# NDB_No 1001
# Shrt_Desc BUTTER WITH SALT
# Water_(g) 15.87
# Energ_Kcal 717
# Protein_(g) 0.85
# Lipid_Tot_(g) 81.11
# Ash_(g) 2.11
# Carbohydrt_(g) 0.06
# Fiber_TD_(g) 0
# Sugar_Tot_(g) 0.06
# Calcium_(mg) 24
# Iron_(mg) 0.02
# Magnesium_(mg) 2
# Phosphorus_(mg) 24
# Potassium_(mg) 24
# Sodium_(mg) 643
# Zinc_(mg) 0.09
# Copper_(mg) 0
# Manganese_(mg) 0
# Selenium_(mcg) 1
# Vit_C_(mg) 0
# Thiamin_(mg) 0.005
# Riboflavin_(mg) 0.034
# Niacin_(mg) 0.042
# Vit_B6_(mg) 0.003
# Vit_B12_(mcg) 0.17
# Vit_A_IU 2499
# Vit_A_RAE 684
# Vit_E_(mg) 2.32
# Vit_D_mcg 1.5
# Vit_D_IU 60
# Vit_K_(mcg) 7
# FA_Sat_(g) 51.368
# FA_Mono_(g) 21.021
# FA_Poly_(g) 3.043
# Cholestrl_(mg) 215
# Name: 0, dtype: object
Of course, slicing also works for selecting multiple rows (note that loc slices include both endpoints):
print(food_info.loc[3:6])  # rows 3, 4, 5, and 6
print(food_info.loc[[2, 5, 10]])  # rows 2, 5, and 10
To get a column, use its name, for example:
print(food_info["NDB_No"])
# 0 1001
# 1 1002
# 2 1003
# 3 1004
# ...
# 8615 90480
# 8616 90560
# 8617 93600
# Name: NDB_No, Length: 8618, dtype: int64
Selecting multiple columns:
columns = ["NDB_No", "Cholestrl_(mg)"]
print(food_info[columns])
Next, let's look at a simple data-processing example:
col_names = food_info.columns.tolist()
print(col_names)
# ['NDB_No', 'Shrt_Desc', 'Water_(g)', 'Energ_Kcal', 'Protein_(g)', 'Lipid_Tot_(g)', 'Ash_(g)',
# 'Carbohydrt_(g)', 'Fiber_TD_(g)', 'Sugar_Tot_(g)', 'Calcium_(mg)', 'Iron_(mg)', 'Magnesium_(mg)',
# 'Phosphorus_(mg)', 'Potassium_(mg)', 'Sodium_(mg)', 'Zinc_(mg)', 'Copper_(mg)', 'Manganese_(mg)',
# 'Selenium_(mcg)', 'Vit_C_(mg)', 'Thiamin_(mg)', 'Riboflavin_(mg)', 'Niacin_(mg)', 'Vit_B6_(mg)',
# 'Vit_B12_(mcg)', 'Vit_A_IU', 'Vit_A_RAE', 'Vit_E_(mg)', 'Vit_D_mcg', 'Vit_D_IU', 'Vit_K_(mcg)',
# 'FA_Sat_(g)', 'FA_Mono_(g)', 'FA_Poly_(g)', 'Cholestrl_(mg)']
gram_columns = []
for c in col_names:
    if c.endswith("(g)"):
        gram_columns.append(c)
gram_df = food_info[gram_columns]
print(gram_df.head(3))
# Water_(g) Protein_(g) ... FA_Mono_(g) FA_Poly_(g)
# 0 15.87 0.85 ... 21.021 3.043
# 1 15.87 0.85 ... 23.426 3.012
# 2 0.24 0.28 ... 28.732 3.694
#
# [3 rows x 10 columns]
The code above extracts the columns of food_info.csv whose unit is grams. It first gets the column names via the columns attribute and converts them to a list, then filters that list for elements ending in "(g)", and finally uses the filtered list to select those columns. This is a very common data-processing pattern in pandas.
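As an aside, the same filter can be written as a one-line list comprehension (an equivalent sketch using only the names already defined above):
gram_columns = [c for c in food_info.columns if c.endswith("(g)")]
gram_df = food_info[gram_columns]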
Next, let's look at some arithmetic operations on the data with pandas:
div_1000 = food_info["Iron_(mg)"] / 1000
add_100 = food_info["Iron_(mg)"] + 100
sub_100 = food_info["Iron_(mg)"] - 100
mult_2 = food_info["Iron_(mg)"] * 2
The code above applies the same arithmetic operation to every value in the "Iron_(mg)" column, much like NumPy's broadcasting mechanism; in other words, it is a vectorized operation.
The next snippet shows element-wise multiplication of two columns, and how rescaling a column and assigning the result under a new column name adds that column to the DataFrame:
print(food_info.shape) # (8618, 36)
water_energy = food_info["Water_(g)"] * food_info["Energ_Kcal"]
iron_grams = food_info["Iron_(mg)"] / 1000
food_info["Iron_(g)"] = iron_grams
print(food_info.shape) # (8618, 37)
We can also compute the maximum, minimum, mean, sum, and so on of a column:
max_calories = food_info["Energ_Kcal"].max()
min_calories = food_info["Energ_Kcal"].min()
mean_calories = food_info["Energ_Kcal"].mean()
sum_calories = food_info["Energ_Kcal"].sum()
print(max_calories, min_calories, mean_calories, sum_calories)
# 902 0 226.43861684845672 1951448
With these methods, normalizing data becomes straightforward.
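For instance, here is a minimal min-max normalization sketch using the statistics computed above (the column choice and the name normalized_calories are just for illustration):
normalized_calories = (food_info["Energ_Kcal"] - min_calories) / (max_calories - min_calories)
print(normalized_calories.max())  # 1.0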
We can also sort the data:
food_info.sort_values("Sodium_(mg)", inplace=True)
food_info.sort_values("Sodium_(mg)", inplace=True, ascending=False)
The first argument of sort_values() is the column to sort by; inplace=True sorts the original DataFrame in place; ascending=True sorts in ascending order (the second call above sorts descending). sort_values() has other parameters that we will not cover here; set them as needed.
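One parameter worth knowing is na_position, which controls where NaN values end up after sorting. A small sketch ("first" puts them at the top; the default is "last"):
food_info.sort_values("Sodium_(mg)", ascending=False, na_position="first", inplace=True)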
Next, let's walk through pandas' common preprocessing methods on a real example. The dataset, titanic_train.csv, comes from Kaggle's Titanic survival-prediction competition.
First, read the data:
import pandas as pd
import numpy as np
titanic_survival = pd.read_csv("titanic_train.csv")
print(titanic_survival.head())
# PassengerId Survived Pclass ... Fare Cabin Embarked
# 0 1 0 3 ... 7.2500 NaN S
# 1 2 1 1 ... 71.2833 C85 C
# 2 3 1 3 ... 7.9250 NaN S
# 3 4 1 1 ... 53.1000 C123 S
# 4 5 0 3 ... 8.0500 NaN S
We won't go into the meaning of each field here; it is easy to look up online.
Get the "Age" column:
age = titanic_survival["Age"]
print(age.loc[0:9])
# 0 22.0
# 1 38.0
# 2 26.0
# 3 35.0
# 4 35.0
# 5 NaN
# 6 54.0
# 7 2.0
# 8 27.0
# 9 14.0
# Name: Age, dtype: float64
The code above shows the first ten values of "Age", and we notice that some of them are NaN. In pandas, NaN (Not a Number) denotes a missing value. The pandas.isnull() function tests each value in the data: it returns True for NaN and False otherwise.
age_is_null = pd.isnull(age)
print(age_is_null)
# 0 False
# 1 False
# 2 False
# 3 False
# 4 False
# 5 True
# ...
# 887 False
# 888 True
# 889 False
# 890 False
Now we can separate the NaN entries from the data; before going further, let's check that what we isolate is actually correct.
age_null_true = age[age_is_null]
print(age_null_true)
# 5 NaN
# 17 NaN
# 19 NaN
# ..
# 868 NaN
# 878 NaN
# 888 NaN
# Name: Age, Length: 177, dtype: float64
The output also shows that the "Age" column contains 177 NaN values.
If we compute the mean of "Age" without removing the NaNs first, the result is itself NaN, because any arithmetic involving NaN can only yield NaN:
mean_age = sum(titanic_survival["Age"]) / len(titanic_survival["Age"])
print(mean_age) # nan
Of course, we can obtain the correct mean by excluding the NaN values:
good_ages = titanic_survival["Age"][age_is_null == False]
correct_mean_age = sum(good_ages) / len(good_ages)
print(correct_mean_age) # 29.69911764705882
The code above uses the condition age_is_null == False to select the non-NaN ages, then computes their mean.
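As a small aside, there are equivalent, slightly more idiomatic ways to build the same filter (both are standard pandas calls):
good_ages = age[age.notnull()]  # boolean mask without the explicit == False comparison
good_ages = age.dropna()        # or drop the NaN values directly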
In fact, we can simply call pandas' mean() method: many of pandas' built-in methods handle missing values automatically (mean() skips NaN by default).
correct_mean_age = titanic_survival["Age"].mean()
print(correct_mean_age) # 29.69911764705882
We can also explicitly call the dropna() method to remove rows or columns containing NaN values:
drop_na_columns = titanic_survival.dropna(axis=1)
new_titanic_survival = titanic_survival.dropna(axis=0, subset=["Age", "Sex"])
titanic_survival itself is a matrix of 891 rows and 12 columns. drop_na_columns removes every column containing NA values (NaN, NaT), leaving 891 rows and 9 columns. Note the how parameter here: how="any" drops a row or column if it contains any NA value, while how="all" drops it only if all of its values are NA. new_titanic_survival drops the rows that have NA in the "Age" or "Sex" column; that is, the subset parameter restricts which columns are checked when dropping rows. After this, new_titanic_survival has 714 rows and 12 columns.
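To make the how parameter concrete, here is a small sketch (no row in this dataset is entirely NA, since PassengerId is always present, so the shape should be unchanged):
drop_all_na_rows = titanic_survival.dropna(axis=0, how="all")
print(drop_all_na_rows.shape)  # (891, 12)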
Besides deleting NaN values, we can also fill them in:
# Fill every missing value with 0
fill_na_columns = titanic_survival.fillna(0)
# Fill missing Age values with 20 and missing Sex values with "male"
values = {'Age': 20, 'Sex': 'male'}
other_fill_na_columns = titanic_survival.fillna(value=values)
Different filling strategies can be selected via the parameters; see the documentation for details.
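For example, fillna() also supports propagation-based strategies. This sketch forward-fills each missing value with the last valid observation above it (whether that is sensible depends on the data):
ffilled = titanic_survival.fillna(method="ffill")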
Next, consider the following example:
passenger_classes = [1, 2, 3]  # passenger cabin classes
fares_by_class = {}
for this_class in passenger_classes:
    pclass_rows = titanic_survival[titanic_survival["Pclass"] == this_class]
    pclass_fares = pclass_rows["Fare"]
    fare_for_class = pclass_fares.mean()
    fares_by_class[this_class] = fare_for_class
print(fares_by_class)
# {1: 84.15468749999992, 2: 20.66218315217391, 3: 13.675550101832997}
The code above computes the mean fare for each cabin class: it first selects the rows belonging to one class, then takes the mean of their "Fare" column and records the result. This is rather verbose; with pandas' pivot_table method it takes a single line:
passenger_survival = titanic_survival.pivot_table(index="Pclass", values="Fare", aggfunc=np.mean)
print(passenger_survival)
# Fare
# Pclass
# 1 84.154687
# 2 20.662183
# 3 13.675550
In pivot_table(), the index parameter specifies which column to group by, values specifies the column(s) to aggregate after grouping, and aggfunc is the aggregation to apply (the default is mean).
We can also process multiple columns at once, as shown below:
port_stats = titanic_survival.pivot_table(index="Embarked", values=["Fare","Survived"], aggfunc=np.sum)
print(port_stats)
# Fare Survived
# Embarked
# C 10072.2962 93
# Q 1022.2543 30
# S 17439.3988 217
Next, let's do some sorting:
new_titanic_survival = titanic_survival.sort_values("Age", ascending=False)
print(new_titanic_survival[0:4])
# PassengerId Survived Pclass ... Fare Cabin Embarked
# 630 631 1 1 ... 30.0000 A23 S
# 851 852 0 3 ... 7.7750 NaN S
# 493 494 0 1 ... 49.5042 NaN C
# 96 97 0 1 ... 34.6542 A5 C
#
# [4 rows x 12 columns]
titanic_reindexed = new_titanic_survival.reset_index(drop=True)
print(titanic_reindexed.iloc[0:4])
# PassengerId Survived Pclass ... Fare Cabin Embarked
# 0 631 1 1 ... 30.0000 A23 S
# 1 852 0 3 ... 7.7750 NaN S
# 2 494 0 1 ... 49.5042 NaN C
# 3 97 0 1 ... 34.6542 A5 C
#
# [4 rows x 12 columns]
new_titanic_survival is simply the result of sorting by "Age"; notice that each row still carries its original index. We can re-index it as titanic_reindexed does: reset_index() turns the old index into a new column and generates a fresh index, and setting drop=True discards the old index instead of keeping it as a column.
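As a sketch of the contrast (titanic_with_old_index is just an illustrative name), with the default drop=False the old index is preserved as a column, named "index" for a default unnamed index:
titanic_with_old_index = new_titanic_survival.reset_index()
print(titanic_with_old_index.columns[0])  # index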
Next, let's see how to process the data with a custom function:
def hundredth_row(column):
    hundredth_item = column.iloc[99]
    return hundredth_item
# Return the hundredth item from each column
h_row = titanic_survival.apply(hundredth_row)
print(h_row)
# PassengerId 100
# Survived 0
# Pclass 2
# Name Kantor, Mr. Sinai
# Sex male
# Age 34
# SibSp 1
# Parch 0
# Ticket 244367
# Fare 26
# Cabin NaN
# Embarked S
# dtype: object
The hundredth_row() function we defined returns the 100th entry (position 99) of a column. Calling the DataFrame's apply() method with hundredth_row as its argument runs our custom operation on every column.
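The same operation can be written more compactly with a lambda (an equivalent sketch):
h_row = titanic_survival.apply(lambda column: column.iloc[99])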
We can also set the axis parameter to 1 to process rows instead, as shown below:
def which_class(row):
    pclass = row['Pclass']
    if pd.isnull(pclass):
        return "Unknown"
    elif pclass == 1:
        return "First Class"
    elif pclass == 2:
        return "Second Class"
    elif pclass == 3:
        return "Third Class"
classes = titanic_survival.apply(which_class, axis=1)
print(classes)
# 0 Third Class
# 1 First Class
# 2 Third Class
# ...
# 888 Third Class
# 889 First Class
# 890 Third Class
# Length: 891, dtype: object
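As an aside, since this function only translates discrete values, a dictionary plus Series.map is an equivalent, more compact sketch (class_names is an illustrative name; the dictionary mirrors the branches above):
class_names = {1: "First Class", 2: "Second Class", 3: "Third Class"}
classes = titanic_survival["Pclass"].map(class_names).fillna("Unknown")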
Next, we process the age column in a similar way:
def generate_age_label(row):
    age = row["Age"]
    if pd.isnull(age):
        return "unknown"
    elif age < 18:
        return "minor"
    else:
        return "adult"
age_labels = titanic_survival.apply(generate_age_label, axis=1)
print(age_labels)
# 0 adult
# 1 adult
# 2 adult
# ...
# 888 unknown
# 889 adult
# 890 adult
# Length: 891, dtype: object
Then we use the result to create a new column in the DataFrame and combine it with pivot_table() to get an easy-to-read summary:
titanic_survival['age_labels'] = age_labels
age_group_survival = titanic_survival.pivot_table(index="age_labels", values="Survived")
print(age_group_survival)
# Survived
# age_labels
# adult 0.381032
# minor 0.539823
# unknown 0.293785
From these results it is easy to see that minors had a noticeably higher survival rate in the Titanic disaster.
Series is one of pandas' fundamental data structures: a Series is a collection of values, and the DataFrame introduced earlier is a collection of Series objects. Below we illustrate it with fandango_score_comparison.csv, a movie-rating dataset. Its first few records look like this:
FILM,RottenTomatoes,RottenTomatoes_User,Metacritic,Metacritic_User,IMDB,Fandango_Stars,Fandango_Ratingvalue,RT_norm,RT_user_norm,Metacritic_norm,Metacritic_user_nom,IMDB_norm,RT_norm_round,RT_user_norm_round,Metacritic_norm_round,Metacritic_user_norm_round,IMDB_norm_round,Metacritic_user_vote_count,IMDB_user_vote_count,Fandango_votes,Fandango_Difference
Avengers: Age of Ultron (2015),74,86,66,7.1,7.8,5,4.5,3.7,4.3,3.3,3.55,3.9,3.5,4.5,3.5,3.5,4,1330,271107,14846,0.5
Cinderella (2015),85,80,67,7.5,7.1,5,4.5,4.25,4,3.35,3.75,3.55,4.5,4,3.5,4,3.5,249,65709,12640,0.5
Ant-Man (2015),80,90,64,8.1,7.8,5,4.5,4,4.5,3.2,4.05,3.9,4,4.5,3,4,4,627,103660,12055,0.5
As you can see, it contains movie titles, ratings from several review sites, and some derived statistics.
First we read the CSV and pull out two columns, the film titles and the Rotten Tomatoes scores:
fandango = pd.read_csv('fandango_score_comparison.csv')
series_film = fandango['FILM']
print(series_film[0:5])
# 0 Avengers: Age of Ultron (2015)
# 1 Cinderella (2015)
# 2 Ant-Man (2015)
# 3 Do You Believe? (2015)
# 4 Hot Tub Time Machine 2 (2015)
# Name: FILM, dtype: object
series_rt = fandango['RottenTomatoes']
print(series_rt[0:5])
# 0 74
# 1 85
# 2 80
# 3 18
# 4 14
# Name: RottenTomatoes, dtype: int64
Both series_film and series_rt here are Series objects. Now let's print some type information:
film_names = series_film.values
print(type(film_names))  # <class 'numpy.ndarray'>
As you can see, the values of a Series have type numpy.ndarray; in other words, a Series is a wrapper around a numpy.ndarray.
We can also use series_rt.values and series_film.values to build a Series of our own:
rt_scores = series_rt.values
series_custom = pd.Series(rt_scores, index=film_names)
print(series_custom)
# Avengers: Age of Ultron (2015) 74
# Cinderella (2015) 85
# Ant-Man (2015) 80
# Do You Believe? (2015) 18
# Hot Tub Time Machine 2 (2015) 14
series_custom uses the film titles film_names as its index and the Rotten Tomatoes scores rt_scores as its values.
We can likewise use slicing to access this data:
fiveten = series_custom[5:10]
print(fiveten)
# The Water Diviner (2015) 63
# Irrational Man (2015) 42
# Top Five (2014) 86
# Shaun the Sheep Movie (2015) 99
# Love & Mercy (2015) 89
# dtype: int64
We can also sort a Series by its index:
original_index = series_custom.index.tolist()
sorted_index = sorted(original_index)
sorted_by_index = series_custom.reindex(sorted_index)
Here the film titles are sorted lexicographically and reindex() rebuilds the Series in that order.
Of course, we can also call the interfaces Series provides to sort by index or by values directly:
sc_i = series_custom.sort_index()
print(sc_i[0:3])
# '71 (2015) 97
# 5 Flights Up (2015) 52
# A Little Chaos (2015) 40
# dtype: int64
sc_v = series_custom.sort_values()
print(sc_v[0:3])
# Paul Blart: Mall Cop 2 (2015) 5
# Hitman: Agent 47 (2015) 7
# Hot Pursuit (2015) 8
# dtype: int64
Because a Series is backed by an ndarray, we can also apply NumPy functions to it:
import numpy as np
np_add = np.add(series_custom, series_custom)
np_sin = np.sin(series_custom)
np_max = np.max(series_custom)
The code above uses NumPy's add() to add the values of series_custom to themselves; in fact, we can also add Series objects directly:
rt_critics = pd.Series(fandango['RottenTomatoes'].values, index=fandango['FILM'])
rt_users = pd.Series(fandango['RottenTomatoes_User'].values, index=fandango['FILM'])
rt_sum = (rt_critics + rt_users)
Note that when two Series are added, pandas aligns them by index: any label that appears in only one of the two Series yields NaN in the result, so ideally their indexes should match.
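A tiny sketch of this alignment behavior, using made-up data (s1 and s2 are illustrative names):
s1 = pd.Series([1, 2], index=["a", "b"])
s2 = pd.Series([10, 20], index=["b", "c"])
print(s1 + s2)
# a     NaN
# b    12.0
# c     NaN
# dtype: float64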
The operations above are all column-based; how do we operate on rows?
fandango_films = fandango.set_index('FILM', drop=False)
print(fandango_films.loc['Kumiko, The Treasure Hunter (2015)'])
# FILM Kumiko, The Treasure Hunter (2015)
# RottenTomatoes 87
# RottenTomatoes_User 63
# Metacritic 68
# Metacritic_User 6.4
# IMDB 6.7
# Fandango_Stars 3.5
# Fandango_Ratingvalue 3.5
# RT_norm 4.35
# RT_user_norm 3.15
# Metacritic_norm 3.4
# Metacritic_user_nom 3.2
# IMDB_norm 3.35
# RT_norm_round 4.5
# RT_user_norm_round 3
# Metacritic_norm_round 3.5
# Metacritic_user_norm_round 3
# IMDB_norm_round 3.5
# Metacritic_user_vote_count 19
# IMDB_user_vote_count 5289
# Fandango_votes 41
# Fandango_Difference 0
# Name: Kumiko, The Treasure Hunter (2015), dtype: object
In the code above we first set the "FILM" column as the index (drop=False keeps it as a regular column as well), then fetch a row by film title. Let's check the type of what we get:
print(type(fandango_films.loc['Kumiko, The Treasure Hunter (2015)']))
# <class 'pandas.core.series.Series'>
A single row comes back as a Series; when we fetch multiple rows:
movies = ['Kumiko, The Treasure Hunter (2015)', 'Do You Believe? (2015)', 'Ant-Man (2015)']
print(type(fandango_films.loc[movies]))
# <class 'pandas.core.frame.DataFrame'>
this time a DataFrame is returned, which is worth keeping in mind.
Finally, like a DataFrame, a Series supports apply() for running a custom function; note that for a Series the function is applied to each individual element, as shown below:
def my_filter(value):
    # apply() on a Series calls this function on each element
    if value > 50:
        return True
    else:
        return False
print(type(rt_users))
# <class 'pandas.core.series.Series'>
print(rt_users[0:5])
# FILM
# Avengers: Age of Ultron (2015) 86
# Cinderella (2015) 80
# Ant-Man (2015) 90
# Do You Believe? (2015) 84
# Hot Tub Time Machine 2 (2015) 28
# dtype: int64
filtered = rt_users.apply(my_filter)
print(filtered[0:5])
# FILM
# Avengers: Age of Ultron (2015) True
# Cinderella (2015) True
# Ant-Man (2015) True
# Do You Believe? (2015) True
# Hot Tub Time Machine 2 (2015) False
# dtype: bool
Since a DataFrame is essentially a further wrapper around Series, the two share many behaviors, which we won't enumerate one by one here.
Finally, thank you for patiently reading to the end. If anything in this article is wrong, please point it out in the comments; I will correct it promptly so it doesn't mislead anyone else.