Original link: https://www.kaggle.com/residentmario/summary-functions-and-maps
import pandas as pd
reviews = pd.read_csv('.../winemag-data-130k-v2.csv',index_col=0)
# Build a DataFrame
a=pd.DataFrame({'Bob': ['I liked it.', 'It was awful.'],
'Sue': ['Pretty good.', 'Bland.']},
index=['Product A', 'Product B'])
print("the dataframe is\n",a)
# Build a Series
b=pd.Series([30, 35, 40],
index=['2015 Sales', '2016 Sales', '2017 Sales'],
name='Product A')
print("the series is\n",b)
the dataframe is
Bob Sue
Product A I liked it. Pretty good.
Product B It was awful. Bland.
the series is
2015 Sales 30
2016 Sales 35
2017 Sales 40
Name: Product A, dtype: int64
animals = pd.DataFrame({'Cows': [12, 20], 'Goats': [22, 19]}, index=['Year 1', 'Year 2'])
animals.to_csv("C:/Users/Administrator/Desktop/wine-reviews/cows_and_goats.csv")
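To check that a DataFrame saved with to_csv() can be read back unchanged, a round trip like the one below can help. This is only a sketch: it writes to a temporary directory (an assumption made here, in place of the desktop path above) and reads the file back with index_col=0 so the row labels are restored as the index.

```python
import os
import tempfile

import pandas as pd

animals = pd.DataFrame({"Cows": [12, 20], "Goats": [22, 19]},
                       index=["Year 1", "Year 2"])

# Write to a throwaway directory (hypothetical location, for illustration only)
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "cows_and_goats.csv")
    animals.to_csv(path)
    # index_col=0 turns the first CSV column back into the row index
    restored = pd.read_csv(path, index_col=0)

print(restored)
```

Without index_col=0 the row labels would come back as an ordinary column and pandas would add a fresh 0, 1, ... index.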
# Two ways to access the country column of a DataFrame:
print("(1)the countries in reviews:\n",reviews.country)
print("(2)the countries in reviews:\n",reviews['country'])
(1)the countries in reviews:
0 Italy
1 Portugal
2 US
3 US
4 US
…
129966 Germany
129967 US
129968 France
129969 France
129970 France
Name: country, Length: 129971, dtype: object
(2)the countries in reviews:
0 Italy
1 Portugal
2 US
3 US
4 US
…
129966 Germany
129967 US
129968 France
129969 France
129970 France
Name: country, Length: 129971, dtype: object
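# pandas has its own accessor operators, loc and iloc
# Select the first row of the DataFrame. iloc takes [row, column], so the row position comes first
print("the first row:\n",reviews.iloc[0])
# Use iloc to get a column
print("the first column:\n",reviews.iloc[:, 0])
# Print the first three rows of the first column: positions 0, 1, 2
print("the first 3 row:\n",reviews.iloc[:3, 0])
# Positions [1:3, 0]: rows 1 and 2 of the first column
print("the 1-3 row:\n",reviews.iloc[1:3, 0])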
print("the 1-3 row:\n",reviews.iloc[1:3, 0])
the first row:
country Italy
description Aromas include tropical fruit, broom, brimston…
designation Vulkà Bianco
points 87
price NaN
province Sicily & Sardinia
region_1 Etna
region_2 NaN
taster_name Kerin O’Keefe
taster_twitter_handle @kerinokeefe
title Nicosia 2013 Vulkà Bianco (Etna)
variety White Blend
winery Nicosia
Name: 0, dtype: object
the first column:
0 Italy
1 Portugal
2 US
3 US
4 US
…
129966 Germany
129967 US
129968 France
129969 France
129970 France
Name: country, Length: 129971, dtype: object
the first 3 row:
0 Italy
1 Portugal
2 US
Name: country, dtype: object
the 1-3 row:
1 Portugal
2 US
Name: country, dtype: object
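[Note] iloc uses the Python stdlib indexing scheme, where the first element of the range is included and the last one is excluded, so iloc[0:10] selects entries 0, ..., 9. loc indexes inclusively, so loc[0:10] selects entries 0, ..., 10.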
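The difference is easy to see on a small example. The sketch below uses a made-up five-row frame (toy is a hypothetical name, not part of the wine data) with the default integer index, so the same slice 0:3 can be read either as positions or as labels.

```python
import pandas as pd

# Hypothetical toy frame with the default integer index 0..4
toy = pd.DataFrame({"points": [87, 87, 90, 91, 95]})

by_position = toy.iloc[0:3]  # positions 0, 1, 2 (end excluded)
by_label = toy.loc[0:3]      # labels 0, 1, 2, 3 (end included)

print(len(by_position), len(by_label))
```

The inclusive end matters most when the index holds strings, where "one past the last label" has no natural meaning.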
# loc uses the information in the index (the labels) to do its work; columns are usually picked by name in the second position
print("find data of column 'taster_name', 'taster_twitter_handle', 'points':\n",reviews.loc[:, ['taster_name', 'taster_twitter_handle', 'points']])
# Select the records with index labels 1, 2, 3, 5 and 8
print("find data of row [1,2,3,5,8]:\n",reviews.loc[[1,2,3,5,8]])
# loc can also select a label range, here from 'Apples' to 'Potatoes' inclusive: df.loc['Apples':'Potatoes']
find data of column 'taster_name', 'taster_twitter_handle', 'points':
taster_name taster_twitter_handle points
0 Kerin O’Keefe @kerinokeefe 87
1 Roger Voss @vossroger 87
2 Paul Gregutt @paulgwine 87
3 Alexander Peartree NaN 87
4 Paul Gregutt @paulgwine 87
… … … …
129966 Anna Lee C. Iijima NaN 90
129967 Paul Gregutt @paulgwine 90
129968 Roger Voss @vossroger 90
129969 Roger Voss @vossroger 90
129970 Roger Voss @vossroger 90
[129971 rows x 3 columns]
find data of row [1,2,3,5,8]:
country … winery
1 Portugal … Quinta dos Avidagos
2 US … Rainstorm
3 US … St. Julian
5 Spain … Tandem
8 Germany … Heinz Eifel
[5 rows x 13 columns]
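The label-range form df.loc['Apples':'Potatoes'] mentioned in the comments above is not run against the wine data, so here is a small sketch of it. The sales Series and its labels are invented for illustration.

```python
import pandas as pd

# Hypothetical Series indexed by product name
sales = pd.Series([10, 20, 30, 40],
                  index=["Apples", "Bananas", "Potatoes", "Tomatoes"],
                  name="units")

# Label slicing with loc includes both endpoints
subset = sales.loc["Apples":"Potatoes"]
print(subset)
```

Every label from 'Apples' through 'Potatoes' is returned, including 'Potatoes' itself.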
# Set the title column as the index (set_index returns a new DataFrame; reviews itself is unchanged)
reviews.set_index("title")
# Check whether each wine's country is 'Italy' (returns a boolean Series)
reviews.country == 'Italy'
# Conditional selection
loc_italy_find=reviews.loc[reviews.country == 'Italy']
loc_italy_find_and_90=reviews.loc[(reviews.country == 'Italy') & (reviews.points >= 90)]
loc_italy_find_or_90=reviews.loc[(reviews.country == 'Italy') | (reviews.points >= 90)]
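# isin selects rows whose value is "in" a list of values, so one column can be filtered against two or more values at once.
isin_italy_and_france=reviews.loc[reviews.country.isin(['Italy', 'France'])]
# isnull and notnull select rows where a value is (or is not) missing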
notnull_price=reviews.loc[reviews.price.notnull()]
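Since the filters above are not printed, a small sketch may make their results easier to check by hand. The four-row wines frame is made up for illustration; the selection expressions mirror the ones used on reviews.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the wine data
wines = pd.DataFrame({
    "country": ["Italy", "France", "Italy", "US"],
    "points": [92, 88, 85, 95],
    "price": [30.0, np.nan, 12.0, 60.0],
})

is_italy = wines.country == "Italy"  # boolean Series, one value per row

italy_and_90 = wines.loc[is_italy & (wines.points >= 90)]  # both conditions hold
italy_or_90 = wines.loc[is_italy | (wines.points >= 90)]   # at least one holds
italy_or_france = wines.loc[wines.country.isin(["Italy", "France"])]
priced = wines.loc[wines.price.notnull()]                  # drops the NaN price row

print(len(italy_and_90), len(italy_or_90), len(italy_or_france), len(priced))
```

The parentheses around each comparison are required, because & and | bind more tightly than == and >= in Python.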
# Assign a constant value to a column
reviews['critic'] = 'everyone'
print("the create:\n",reviews.critic)
# Or assign an iterable of values:
reviews['index_backwards'] = range(len(reviews), 0, -1)
print("the change:\n",reviews.index_backwards)
the create:
0 everyone
1 everyone
2 everyone
3 everyone
4 everyone
…
129966 everyone
129967 everyone
129968 everyone
129969 everyone
129970 everyone
Name: critic, Length: 129971, dtype: object
the change:
0 129971
1 129970
2 129969
3 129968
4 129967
…
129966 5
129967 4
129968 3
129969 2
129970 1
Name: index_backwards, Length: 129971, dtype: int32
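# pandas provides many simple "summary functions" (not an official name) that restructure the data in some useful way.
# For example, the describe() method generates a high-level summary of the attributes of a column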
points_describe=reviews.points.describe()
'''
count 129971.000000
mean 88.447138
...
75% 91.000000
max 100.000000
Name: points, Length: 8, dtype: float64
'''
# Look at the mean
points_mean=reviews.points.mean()
# Look at the unique values
taster_name_unique=reviews.taster_name.unique()
# Look at the unique values and how often each one occurs
taster_name_value_counts=reviews.taster_name.value_counts()
Roger Voss 25514
Michael Schachner 15134
…
Fiona Adams 27
Christina Pickard 6
Name: taster_name, Length: 19, dtype: int64
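The summary functions are easier to verify on data small enough to count by hand. The sketch below uses two invented Series (the names and scores are not taken from the wine data) and includes one missing taster name to show how it is treated.

```python
import pandas as pd

# Hypothetical sample data, small enough to check by hand
tasters = pd.Series(["Roger", "Paul", "Roger", None, "Roger"], name="taster_name")
points = pd.Series([87, 90, 88, 91, 94], name="points")

summary = points.describe()      # count, mean, std, min, quartiles, max
unique_names = tasters.unique()  # distinct values, missing value included
counts = tasters.value_counts()  # frequencies, most common first, missing excluded

print(summary["mean"])
print(counts)
```

Note that value_counts() leaves out missing values by default, while unique() reports them as one of the distinct values.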
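# A map is a term borrowed from mathematics for a function that takes one set of values and "maps" them to another set of values.
# For example, suppose we want to remean the wine scores to 0, that is, shift them so that their mean is 0. We can do this as follows: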
review_points_mean = reviews.points.mean()
points_map=reviews.points.map(lambda p: p - review_points_mean)
print("the change map:\n",points_map)
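# The function passed to map() should expect a single value from the Series (a points value in the example above) and return a transformed version of that value. map() returns a new Series in which every value has been transformed by your function.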
the change map:
0 -1.447138
1 -1.447138
2 -1.447138
3 -1.447138
4 -1.447138
…
129966 1.552862
129967 1.552862
129968 1.552862
129969 1.552862
129970 1.552862
Name: points, Length: 129971, dtype: float64
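The same remeaning step on three invented scores shows both halves of the description: each value is passed to the function on its own, and the Series that map() is called on is left untouched.

```python
import pandas as pd

# Hypothetical scores with an easy mean of 90
points = pd.Series([87, 90, 93], name="points")
mean = points.mean()

# The lambda receives one score at a time and returns the shifted score
centered = points.map(lambda p: p - mean)

print(centered.tolist())
print(points.tolist())  # the original Series is unchanged
```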
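# apply() is the equivalent method if we want to transform a whole DataFrame by calling a custom method on each row.
def remean_points(row):
    row.points = row.points - review_points_mean
    return row
# axis sets the direction in which the function is applied: axis=0 (the default, also written axis='index') passes each column to the function, and axis=1 (also written axis='columns') passes each row. If we had called reviews.apply() with axis='index', we would need to pass a function that transforms each column instead of each row.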
reviews.apply(remean_points, axis='columns')
print("the change points:\n",reviews.points)
the change points:
0 87
1 87
2 87
3 87
4 87
…
129966 90
129967 90
129968 90
129969 90
129970 90
Name: points, Length: 129971, dtype: int64
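Note that map() and apply() return new, transformed Series and DataFrames, respectively. They do not modify the original data they are called on, which is why reviews.points above still prints the original scores.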
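A small sketch of that behaviour follows. The toy frame and the remean helper are made up for illustration, and the helper copies each row before changing it (an assumption added here so the example does not depend on whether pandas hands the function a copy or a view).

```python
import pandas as pd

# Hypothetical two-row frame
toy = pd.DataFrame({"country": ["Italy", "US"], "points": [87, 91]})
toy_mean = toy.points.mean()  # 89.0

def remean(row):
    row = row.copy()  # work on a copy of the row
    row["points"] = row["points"] - toy_mean
    return row

# apply() builds and returns a new DataFrame; keep it by assigning the result
result = toy.apply(remean, axis="columns")

print(result.points.tolist())
print(toy.points.tolist())  # still the original scores
```

To keep the transformed values, assign the returned object to a name or back to a column.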
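# To change the original data, assign the result back to the column.
# pandas provides many common mapping operations as built-ins. For example, here is a faster way of remeaning our points column: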
print("change directly:\n",reviews.points - review_points_mean)
change directly:
0 -1.447138
1 -1.447138
2 -1.447138
3 -1.447138
4 -1.447138
…
129966 1.552862
129967 1.552862
129968 1.552862
129969 1.552862
129970 1.552862
Name: points, Length: 129971, dtype: float64
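In this code, we perform an operation between many values on the left-hand side (everything in the Series) and a single value on the right-hand side (the mean). pandas looks at this expression and figures out that we mean to subtract that mean from every value in the dataset.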
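As a quick check that the built-in operator and map() agree, the sketch below computes the remeaned scores both ways on three invented values and compares the results element by element.

```python
import pandas as pd

# Hypothetical scores
points = pd.Series([80, 90, 100], name="points")
mean = points.mean()

vectorized = points - mean                # the scalar is applied to every element
mapped = points.map(lambda p: p - mean)   # the same arithmetic, one call per element

same = bool((vectorized == mapped).all())
print(vectorized.tolist(), same)
```

The operator form avoids calling a Python function once per value, which is where the speed advantage comes from.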
A simple way to combine the country and region information in the dataset is the following:
print("reviews.country - reviews.region_1:\n",(reviews.country + " - " + reviews.region_1))
reviews.country - reviews.region_1:
0 Italy - Etna
1 NaN
2 US - Willamette Valley
3 US - Lake Michigan Shore
4 US - Willamette Valley
…
129966 NaN
129967 US - Oregon
129968 France - Alsace
129969 France - Alsace
129970 France - Alsace
Length: 129971, dtype: object
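The NaN entries in this output come from rows where region_1 is missing: combining a string with a missing value yields a missing value. The sketch below reproduces that on three invented rows, and then shows one possible way to keep those rows by filling the gap first. The "Unknown" placeholder is an assumption made for illustration, not something the original notebook does.

```python
import numpy as np
import pandas as pd

# Hypothetical columns; the second region is missing
country = pd.Series(["Italy", "Portugal", "US"])
region = pd.Series(["Etna", np.nan, "Oregon"])

combined = country + " - " + region                   # NaN wherever region is NaN
filled = country + " - " + region.fillna("Unknown")   # placeholder instead of NaN

print(combined.tolist())
print(filled.tolist())
```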
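'''Create a variable bargain_wine holding the title of the wine with the highest points-to-price ratio in the dataset.'''
# Note the use of idxmax here: it returns the index label of the largest value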
bargain_idx = (reviews.points / reviews.price).idxmax()
bargain_wine = reviews.loc[bargain_idx, 'title']
print("the bargain_wine:",bargain_wine)
the bargain_wine: Bandit NV Merlot (California)
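The idxmax() step can be followed by hand on a few invented rows. In this sketch one price is missing, which makes the ratio NaN for that row; idxmax() skips missing values by default, so the row cannot win.

```python
import numpy as np
import pandas as pd

# Hypothetical wines; the last one has no price
toy = pd.DataFrame({
    "title": ["Wine A", "Wine B", "Wine C"],
    "points": [90, 85, 88],
    "price": [45.0, 10.0, np.nan],
})

ratio = toy.points / toy.price          # 2.0, 8.5, NaN
best_idx = ratio.idxmax()               # index label of the largest ratio
best_title = toy.loc[best_idx, "title"]

print(best_idx, best_title)
```

Because idxmax() returns a label and not a position, the lookup uses loc and not iloc.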
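Count how many times each of the two words "tropical" and "fruity" appears in the description column of the dataset (each description counts at most once per word).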
n_trop = reviews.description.map(lambda desc: "tropical" in desc).sum()
n_fruity = reviews.description.map(lambda desc: "fruity" in desc).sum()
descriptor_counts = pd.Series([n_trop, n_fruity], index=['tropical', 'fruity'])
print(points_describe)
count 129971.000000
mean 88.447138
std 3.039730
min 80.000000
25% 86.000000
50% 88.000000
75% 91.000000
max 100.000000
Name: points, dtype: float64
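The tropical/fruity count above is built but never printed, so here is the same approach on three invented descriptions. The lambda returns True or False for each description, and sum() counts the True values.

```python
import pandas as pd

# Hypothetical descriptions
descriptions = pd.Series([
    "Aromas include tropical fruit",
    "Ripe and fruity, with tropical notes",
    "Dry and restrained",
])

# "word" in text is a substring test, so "fruit" alone does not match "fruity"
n_trop = descriptions.map(lambda desc: "tropical" in desc).sum()
n_fruity = descriptions.map(lambda desc: "fruity" in desc).sum()

descriptor_counts = pd.Series([n_trop, n_fruity], index=["tropical", "fruity"])
print(descriptor_counts)
```

This counts descriptions that contain the word, not total occurrences: a description that mentions "tropical" twice still adds only one.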
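Build a star rating system for scores between 80 and 100: a score of 95 or higher counts as 3 stars, a score of at least 85 but less than 95 counts as 2 stars, and any other score counts as 1 star. In addition, any wine from Canada automatically gets 3 stars, regardless of points.
def stars(row):
    if row.country == 'Canada':
        return 3
    elif row.points >= 95:
        return 3
    elif row.points >= 85:
        return 2
    else:
        return 1
star_ratings = reviews.apply(stars, axis='columns')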
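To check each branch of the rating rule, the function can be run on a handful of rows chosen to hit every case. The toy frame below is invented for that purpose; the stars function is repeated only so the snippet runs on its own.

```python
import pandas as pd

def stars(row):
    if row.country == "Canada":
        return 3
    elif row.points >= 95:
        return 3
    elif row.points >= 85:
        return 2
    else:
        return 1

# Hypothetical rows: Canadian override, 95+, 85 to 94, below 85
toy = pd.DataFrame({
    "country": ["Canada", "US", "France", "Italy"],
    "points": [82, 96, 90, 80],
})

ratings = toy.apply(stars, axis="columns")
print(ratings.tolist())
```

The Canadian wine scores only 82 points but still gets 3 stars, because the country check comes first.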
Reference article: https://blog.csdn.net/liuhehe123/article/details/85786200