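The datasets used in the exercises below are collected here; download them and work from local copies where possible, because the links inside the individual exercises may no longer open.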
https://github.com/justmarkham/pandas-videos/tree/master/data
GroupBy can be summarized as Split-Apply-Combine.
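To make the three steps concrete, here is a minimal sketch that performs split, apply and combine by hand and compares the result with the `groupby` one-liner. The toy DataFrame is hypothetical and is not taken from `drinks.csv`.

```python
import pandas as pd

# Hypothetical toy data, not taken from drinks.csv
df = pd.DataFrame({
    'continent': ['EU', 'EU', 'AS', 'AS', 'AF'],
    'beer_servings': [100, 200, 10, 30, 50],
})

# Split: one sub-frame per distinct key
pieces = {key: part for key, part in df.groupby('continent')}

# Apply: compute a statistic on each piece independently
means = {key: part['beer_servings'].mean() for key, part in pieces.items()}

# Combine: assemble the per-group results into a single Series
manual = pd.Series(means, name='beer_servings').sort_index()
manual.index.name = 'continent'

# The one-liner that performs all three steps
auto = df.groupby('continent')['beer_servings'].mean()

print(manual)
print(auto)
```

Both Series should hold the same per-continent means; the manual version only exists to show what `groupby(...).mean()` does internally in spirit.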
Code:
import pandas as pd
Code:
drinks = pd.read_csv('drinks.csv', sep=',')
drinks
Output:
| | country | beer_servings | spirit_servings | wine_servings | total_litres_of_pure_alcohol | continent |
---|---|---|---|---|---|---|
0 | Afghanistan | 0 | 0 | 0 | 0.0 | AS |
1 | Albania | 89 | 132 | 54 | 4.9 | EU |
2 | Algeria | 25 | 0 | 14 | 0.7 | AF |
3 | Andorra | 245 | 138 | 312 | 12.4 | EU |
4 | Angola | 217 | 57 | 45 | 5.9 | AF |
5 | Antigua & Barbuda | 102 | 128 | 45 | 4.9 | NaN |
6 | Argentina | 193 | 25 | 221 | 8.3 | SA |
7 | Armenia | 21 | 179 | 11 | 3.8 | EU |
8 | Australia | 261 | 72 | 212 | 10.4 | OC |
9 | Austria | 279 | 75 | 191 | 9.7 | EU |
10 | Azerbaijan | 21 | 46 | 5 | 1.3 | EU |
11 | Bahamas | 122 | 176 | 51 | 6.3 | NaN |
12 | Bahrain | 42 | 63 | 7 | 2.0 | AS |
13 | Bangladesh | 0 | 0 | 0 | 0.0 | AS |
14 | Barbados | 143 | 173 | 36 | 6.3 | NaN |
15 | Belarus | 142 | 373 | 42 | 14.4 | EU |
16 | Belgium | 295 | 84 | 212 | 10.5 | EU |
17 | Belize | 263 | 114 | 8 | 6.8 | NaN |
18 | Benin | 34 | 4 | 13 | 1.1 | AF |
19 | Bhutan | 23 | 0 | 0 | 0.4 | AS |
20 | Bolivia | 167 | 41 | 8 | 3.8 | SA |
21 | Bosnia-Herzegovina | 76 | 173 | 8 | 4.6 | EU |
22 | Botswana | 173 | 35 | 35 | 5.4 | AF |
23 | Brazil | 245 | 145 | 16 | 7.2 | SA |
24 | Brunei | 31 | 2 | 1 | 0.6 | AS |
25 | Bulgaria | 231 | 252 | 94 | 10.3 | EU |
26 | Burkina Faso | 25 | 7 | 7 | 4.3 | AF |
27 | Burundi | 88 | 0 | 0 | 6.3 | AF |
28 | Cote d'Ivoire | 37 | 1 | 7 | 4.0 | AF |
29 | Cabo Verde | 144 | 56 | 16 | 4.0 | AF |
... | ... | ... | ... | ... | ... | ... |
163 | Suriname | 128 | 178 | 7 | 5.6 | SA |
164 | Swaziland | 90 | 2 | 2 | 4.7 | AF |
165 | Sweden | 152 | 60 | 186 | 7.2 | EU |
166 | Switzerland | 185 | 100 | 280 | 10.2 | EU |
167 | Syria | 5 | 35 | 16 | 1.0 | AS |
168 | Tajikistan | 2 | 15 | 0 | 0.3 | AS |
169 | Thailand | 99 | 258 | 1 | 6.4 | AS |
170 | Macedonia | 106 | 27 | 86 | 3.9 | EU |
171 | Timor-Leste | 1 | 1 | 4 | 0.1 | AS |
172 | Togo | 36 | 2 | 19 | 1.3 | AF |
173 | Tonga | 36 | 21 | 5 | 1.1 | OC |
174 | Trinidad & Tobago | 197 | 156 | 7 | 6.4 | NaN |
175 | Tunisia | 51 | 3 | 20 | 1.3 | AF |
176 | Turkey | 51 | 22 | 7 | 1.4 | AS |
177 | Turkmenistan | 19 | 71 | 32 | 2.2 | AS |
178 | Tuvalu | 6 | 41 | 9 | 1.0 | OC |
179 | Uganda | 45 | 9 | 0 | 8.3 | AF |
180 | Ukraine | 206 | 237 | 45 | 8.9 | EU |
181 | United Arab Emirates | 16 | 135 | 5 | 2.8 | AS |
182 | United Kingdom | 219 | 126 | 195 | 10.4 | EU |
183 | Tanzania | 36 | 6 | 1 | 5.7 | AF |
184 | USA | 249 | 158 | 84 | 8.7 | NaN |
185 | Uruguay | 115 | 35 | 220 | 6.6 | SA |
186 | Uzbekistan | 25 | 101 | 8 | 2.4 | AS |
187 | Vanuatu | 21 | 18 | 11 | 0.9 | OC |
188 | Venezuela | 333 | 100 | 3 | 7.7 | SA |
189 | Vietnam | 111 | 2 | 1 | 2.0 | AS |
190 | Yemen | 6 | 0 | 0 | 0.1 | AS |
191 | Zambia | 32 | 19 | 4 | 2.5 | AF |
192 | Zimbabwe | 64 | 18 | 4 | 4.7 | AF |
193 rows × 6 columns
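The `NaN` values in the `continent` column above (Antigua & Barbuda, Bahamas, USA and others) are probably North American countries whose code `NA` was read as a missing value, because `read_csv` treats the string `NA` as missing by default. The sketch below reproduces the effect on a hypothetical three-row sample and shows one way to keep the code as text; whether that is appropriate for the real file is an assumption to check against the data.

```python
import io
import pandas as pd

# Hypothetical three-row sample laid out like drinks.csv
csv_text = (
    "country,beer_servings,continent\n"
    "USA,249,NA\n"
    "France,127,EU\n"
    "Japan,77,AS\n"
)

# Default parsing: the string 'NA' becomes a missing value
default = pd.read_csv(io.StringIO(csv_text))

# keep_default_na=False disables the built-in list of missing-value markers
kept = pd.read_csv(io.StringIO(csv_text), keep_default_na=False)

print(default['continent'].isna().sum())  # rows read as missing
print(kept['continent'].tolist())         # 'NA' survives as a string
print(kept.groupby('continent').size())   # 'NA' now forms its own group
```

Because `groupby` drops rows whose key is missing by default, countries read as `NaN` do not appear in any group unless the code is preserved this way or `dropna=False` is passed to `groupby`.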
Code:
drinks.groupby('continent').beer_servings.mean()
Output:
continent
AF 61.471698
AS 37.045455
EU 193.777778
OC 89.687500
SA 175.083333
Name: beer_servings, dtype: float64
Code:
drinks.groupby('continent').wine_servings.describe()
Output:
continent | count | mean | std | min | 25% | 50% | 75% | max |
---|---|---|---|---|---|---|---|---|
AF | 53.0 | 16.264151 | 38.846419 | 0.0 | 1.0 | 2.0 | 13.00 | 233.0 |
AS | 44.0 | 9.068182 | 21.667034 | 0.0 | 0.0 | 1.0 | 8.00 | 123.0 |
EU | 45.0 | 142.222222 | 97.421738 | 0.0 | 59.0 | 128.0 | 195.00 | 370.0 |
OC | 16.0 | 35.625000 | 64.555790 | 0.0 | 1.0 | 8.5 | 23.25 | 212.0 |
SA | 12.0 | 62.416667 | 88.620189 | 1.0 | 3.0 | 12.0 | 98.50 | 221.0 |
Code:
drinks.groupby('continent').mean(numeric_only=True)
Output:
continent | beer_servings | spirit_servings | wine_servings | total_litres_of_pure_alcohol |
---|---|---|---|---|
AF | 61.471698 | 16.339623 | 16.264151 | 3.007547 |
AS | 37.045455 | 60.840909 | 9.068182 | 2.170455 |
EU | 193.777778 | 132.555556 | 142.222222 | 8.617778 |
OC | 89.687500 | 58.437500 | 35.625000 | 3.381250 |
SA | 175.083333 | 114.750000 | 62.416667 | 6.308333 |
Code:
drinks.groupby('continent').median(numeric_only=True)
Output:
continent | beer_servings | spirit_servings | wine_servings | total_litres_of_pure_alcohol |
---|---|---|---|---|
AF | 32.0 | 3.0 | 2.0 | 2.30 |
AS | 17.5 | 16.0 | 1.0 | 1.20 |
EU | 219.0 | 122.0 | 128.0 | 10.00 |
OC | 52.5 | 37.0 | 8.5 | 1.75 |
SA | 162.5 | 108.5 | 12.0 | 6.85 |
Code:
drinks.groupby('continent').spirit_servings.agg(['mean', 'min', 'max'])
# agg applies one or more aggregation functions to the grouped data; called on the whole GroupBy it aggregates every non-key column.
# To aggregate only some columns, or to apply different aggregations to different columns, pass a dict:
# spirit_servings_info = {'spirit_servings': ['min', 'mean', 'max']}
# print(drinks.groupby('continent').agg(spirit_servings_info))
Output:
continent | mean | min | max |
---|---|---|---|
AF | 16.339623 | 0 | 152 |
AS | 60.840909 | 0 | 326 |
EU | 132.555556 | 0 | 373 |
OC | 58.437500 | 0 | 254 |
SA | 114.750000 | 25 | 302 |
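The dict form of `agg` mentioned in the comments can be compared with pandas' named aggregation, which gives each result column an explicit name. The sketch below uses a hypothetical toy frame; the output column names (`spirit_mean`, `spirit_max`, `wine_total`) are made up for illustration.

```python
import pandas as pd

# Hypothetical toy data with the same column names as drinks.csv
df = pd.DataFrame({
    'continent': ['EU', 'EU', 'AS', 'AS'],
    'spirit_servings': [100, 200, 0, 40],
    'wine_servings': [50, 150, 5, 15],
})

# Dict form: column name -> list of aggregations (result has two-level columns)
by_dict = df.groupby('continent').agg({'spirit_servings': ['min', 'mean', 'max']})

# Named aggregation: output column = (input column, function)
named = df.groupby('continent').agg(
    spirit_mean=('spirit_servings', 'mean'),
    spirit_max=('spirit_servings', 'max'),
    wine_total=('wine_servings', 'sum'),
)

print(by_dict)
print(named)
```

The dict form is convenient when several statistics of one column are wanted; named aggregation avoids the two-level column header when different columns need different statistics.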
Special thanks to: https://github.com/justmarkham for sharing the dataset and materials.
Code:
import pandas as pd
Code:
users = pd.read_table('u.user', sep='|', index_col = 'user_id')
users.head()
Output:
user_id | age | gender | occupation | zip_code |
---|---|---|---|---|
1 | 24 | M | technician | 85711 |
2 | 53 | F | other | 94043 |
3 | 23 | M | writer | 32067 |
4 | 24 | M | technician | 43537 |
5 | 33 | F | other | 15213 |
Code:
users.groupby('occupation').age.mean()
Output:
occupation
administrator 38.746835
artist 31.392857
doctor 43.571429
educator 42.010526
engineer 36.388060
entertainment 29.222222
executive 38.718750
healthcare 41.562500
homemaker 32.571429
lawyer 36.750000
librarian 40.000000
marketing 37.615385
none 26.555556
other 34.523810
programmer 33.121212
retired 63.071429
salesman 35.666667
scientist 35.548387
student 22.081633
technician 33.148148
writer 36.311111
Name: age, dtype: float64
Code:
# Map gender to a number so it can be summed: 'M' -> 1, 'F' -> 0
def gender_to_numeric(x):
    if x == 'M':
        return 1
    if x == 'F':
        return 0

users['gender_n'] = users['gender'].apply(gender_to_numeric)
# Percentage of male users in each occupation, highest first
a = users.groupby('occupation').gender_n.sum() / users.occupation.value_counts() * 100
a.sort_values(ascending = False)
Output:
doctor 100.000000
engineer 97.014925
technician 96.296296
retired 92.857143
programmer 90.909091
executive 90.625000
scientist 90.322581
entertainment 88.888889
lawyer 83.333333
salesman 75.000000
educator 72.631579
student 69.387755
other 65.714286
marketing 61.538462
writer 57.777778
none 55.555556
administrator 54.430380
artist 53.571429
librarian 43.137255
healthcare 31.250000
homemaker 14.285714
dtype: float64
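The same male ratio can also be obtained without a helper column, since the mean of a boolean Series is the fraction of `True` values. This is a minimal sketch on a hypothetical six-row frame (`toy_users`), not the `u.user` data.

```python
import pandas as pd

# Hypothetical toy data with the same column names as u.user
toy_users = pd.DataFrame({
    'occupation': ['doctor', 'doctor', 'artist', 'artist', 'artist', 'artist'],
    'gender': ['M', 'M', 'F', 'M', 'F', 'F'],
})

# True where the user is male
is_male = toy_users['gender'].eq('M')

# Mean of booleans per occupation = share of males; scale to a percentage
male_pct = is_male.groupby(toy_users['occupation']).mean().mul(100)
male_pct = male_pct.sort_values(ascending=False)

print(male_pct)
```

Grouping a Series by another Series of the same length works because pandas aligns them on the index.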
Code:
users.groupby('occupation').age.agg(['min', 'max'])
Output:
occupation | min | max |
---|---|---|
administrator | 21 | 70 |
artist | 19 | 48 |
doctor | 28 | 64 |
educator | 23 | 63 |
engineer | 22 | 70 |
entertainment | 15 | 50 |
executive | 22 | 69 |
healthcare | 22 | 62 |
homemaker | 20 | 50 |
lawyer | 21 | 53 |
librarian | 23 | 69 |
marketing | 24 | 55 |
none | 11 | 55 |
other | 13 | 64 |
programmer | 20 | 63 |
retired | 51 | 73 |
salesman | 18 | 66 |
scientist | 23 | 55 |
student | 7 | 42 |
technician | 21 | 55 |
writer | 18 | 60 |
Code:
users.groupby(['occupation', 'gender']).mean(numeric_only=True)
Output:
occupation | gender | age | gender_n |
---|---|---|---|
administrator | F | 40.638889 | 0.0 |
administrator | M | 37.162791 | 1.0 |
artist | F | 30.307692 | 0.0 |
artist | M | 32.333333 | 1.0 |
doctor | M | 43.571429 | 1.0 |
educator | F | 39.115385 | 0.0 |
educator | M | 43.101449 | 1.0 |
engineer | F | 29.500000 | 0.0 |
engineer | M | 36.600000 | 1.0 |
entertainment | F | 31.000000 | 0.0 |
entertainment | M | 29.000000 | 1.0 |
executive | F | 44.000000 | 0.0 |
executive | M | 38.172414 | 1.0 |
healthcare | F | 39.818182 | 0.0 |
healthcare | M | 45.400000 | 1.0 |
homemaker | F | 34.166667 | 0.0 |
homemaker | M | 23.000000 | 1.0 |
lawyer | F | 39.500000 | 0.0 |
lawyer | M | 36.200000 | 1.0 |
librarian | F | 40.000000 | 0.0 |
librarian | M | 40.000000 | 1.0 |
marketing | F | 37.200000 | 0.0 |
marketing | M | 37.875000 | 1.0 |
none | F | 36.500000 | 0.0 |
none | M | 18.600000 | 1.0 |
other | F | 35.472222 | 0.0 |
other | M | 34.028986 | 1.0 |
programmer | F | 32.166667 | 0.0 |
programmer | M | 33.216667 | 1.0 |
retired | F | 70.000000 | 0.0 |
retired | M | 62.538462 | 1.0 |
salesman | F | 27.000000 | 0.0 |
salesman | M | 38.555556 | 1.0 |
scientist | F | 28.333333 | 0.0 |
scientist | M | 36.321429 | 1.0 |
student | F | 20.750000 | 0.0 |
student | M | 22.669118 | 1.0 |
technician | F | 38.000000 | 0.0 |
technician | M | 32.961538 | 1.0 |
writer | F | 37.631579 | 0.0 |
writer | M | 35.346154 | 1.0 |
Code:
# a = users.groupby('occupation').gender_n.sum() / users.occupation.value_counts() * 100
# print(a.sort_values(ascending = False))
# b = 100 - a
# print(b.sort_values(ascending=True))
gender_ocup = users.groupby(['occupation', 'gender']).agg({'gender': 'count'})  # number of women and men in each occupation
occup_count = users.groupby(['occupation']).agg('count')  # total number of users in each occupation
occup_gender = gender_ocup.div(occup_count, level = "occupation") * 100  # share of each gender within each occupation; returns a DataFrame
occup_gender.loc[:, 'gender']  # show the gender column
Output:
occupation gender
administrator F 45.569620
M 54.430380
artist F 46.428571
M 53.571429
doctor M 100.000000
educator F 27.368421
M 72.631579
engineer F 2.985075
M 97.014925
entertainment F 11.111111
M 88.888889
executive F 9.375000
M 90.625000
healthcare F 68.750000
M 31.250000
homemaker F 85.714286
M 14.285714
lawyer F 16.666667
M 83.333333
librarian F 56.862745
M 43.137255
marketing F 38.461538
M 61.538462
none F 44.444444
M 55.555556
other F 34.285714
M 65.714286
programmer F 9.090909
M 90.909091
retired F 7.142857
M 92.857143
salesman F 25.000000
M 75.000000
scientist F 9.677419
M 90.322581
student F 30.612245
M 69.387755
technician F 3.703704
M 96.296296
writer F 42.222222
M 57.777778
Name: gender, dtype: float64
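As a cross-check on the division-by-level approach, the per-occupation gender percentages can also be sketched with `pd.crosstab` or with `value_counts(normalize=True)` on the grouped column. The frame below (`toy_users`) is hypothetical and only mimics the column names of `u.user`.

```python
import pandas as pd

# Hypothetical toy data with the same column names as u.user
toy_users = pd.DataFrame({
    'occupation': ['doctor', 'doctor', 'artist', 'artist', 'artist', 'artist'],
    'gender': ['M', 'M', 'F', 'M', 'F', 'F'],
})

# Wide layout: one row per occupation, one column per gender, rows sum to 100
pct_wide = pd.crosstab(
    toy_users['occupation'], toy_users['gender'], normalize='index'
) * 100

# Long layout: (occupation, gender) MultiIndex, similar to the output above
pct_long = (
    toy_users.groupby('occupation')['gender']
    .value_counts(normalize=True)
    .mul(100)
)

print(pct_wide)
print(pct_long)
```

One difference worth noting: `crosstab` fills combinations that never occur with 0, so an all-male occupation gets an explicit `F` column of 0, whereas the grouped output above simply has no `F` row for `doctor`.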
Special thanks to: http://chrisalbon.com/ for sharing the dataset and materials.
Code:
import pandas as pd
Code:
raw_data = {'regiment': ['Nighthawks', 'Nighthawks', 'Nighthawks', 'Nighthawks', 'Dragoons', 'Dragoons', 'Dragoons', 'Dragoons', 'Scouts', 'Scouts', 'Scouts', 'Scouts'],
            'company': ['1st', '1st', '2nd', '2nd', '1st', '1st', '2nd', '2nd','1st', '1st', '2nd', '2nd'],
            'name': ['Miller', 'Jacobson', 'Ali', 'Milner', 'Cooze', 'Jacon', 'Ryaner', 'Sone', 'Sloan', 'Piger', 'Riani', 'Ali'],
            'preTestScore': [4, 24, 31, 2, 3, 4, 24, 31, 2, 3, 2, 3],
            'postTestScore': [25, 94, 57, 62, 70, 25, 94, 57, 62, 70, 62, 70]}
Code:
regiment = pd.DataFrame(raw_data, columns = list(raw_data.keys()))
regiment
Output:
| | regiment | company | name | preTestScore | postTestScore |
---|---|---|---|---|---|
0 | Nighthawks | 1st | Miller | 4 | 25 |
1 | Nighthawks | 1st | Jacobson | 24 | 94 |
2 | Nighthawks | 2nd | Ali | 31 | 57 |
3 | Nighthawks | 2nd | Milner | 2 | 62 |
4 | Dragoons | 1st | Cooze | 3 | 70 |
5 | Dragoons | 1st | Jacon | 4 | 25 |
6 | Dragoons | 2nd | Ryaner | 24 | 94 |
7 | Dragoons | 2nd | Sone | 31 | 57 |
8 | Scouts | 1st | Sloan | 2 | 62 |
9 | Scouts | 1st | Piger | 3 | 70 |
10 | Scouts | 2nd | Riani | 2 | 62 |
11 | Scouts | 2nd | Ali | 3 | 70 |
Code:
regiment[regiment['regiment'] == 'Nighthawks'].groupby('regiment').mean(numeric_only=True)
# regiment[regiment['regiment'] == 'Nighthawks'].mean(numeric_only=True)
Output:
regiment | preTestScore | postTestScore |
---|---|---|
Nighthawks | 15.25 | 59.5 |
Code:
regiment.groupby('company').describe()
Output:
postTestScore:
company | count | mean | std | min | 25% | 50% | 75% | max |
---|---|---|---|---|---|---|---|---|
1st | 6.0 | 57.666667 | 27.485754 | 25.0 | 34.25 | 66.0 | 70.0 | 94.0 |
2nd | 6.0 | 67.000000 | 14.057027 | 57.0 | 58.25 | 62.0 | 68.0 | 94.0 |
preTestScore:
company | count | mean | std | min | 25% | 50% | 75% | max |
---|---|---|---|---|---|---|---|---|
1st | 6.0 | 6.666667 | 8.524475 | 2.0 | 3.00 | 3.5 | 4.00 | 24.0 |
2nd | 6.0 | 15.500000 | 14.652645 | 2.0 | 2.25 | 13.5 | 29.25 | 31.0 |
Code:
regiment.groupby('company').preTestScore.mean()
Output:
company
1st 6.666667
2nd 15.500000
Name: preTestScore, dtype: float64
Code:
regiment.groupby(['regiment', 'company']).preTestScore.mean()
Output:
regiment company
Dragoons 1st 3.5
2nd 27.5
Nighthawks 1st 14.0
2nd 16.5
Scouts 1st 2.5
2nd 2.5
Name: preTestScore, dtype: float64
Code:
'''
stack() and unstack()
stack: pivots columns into rows.
unstack: pivots rows into columns.
With a MultiIndex, both operate on the innermost level by default.
'''
regiment.groupby(['regiment', 'company']).preTestScore.mean().unstack()
Output:
regiment / company | 1st | 2nd |
---|---|---|
Dragoons | 3.5 | 27.5 |
Nighthawks | 14.0 | 16.5 |
Scouts | 2.5 | 2.5 |
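The relationship between `stack()` and `unstack()` can be sketched as a round trip on a small MultiIndex Series. The frame below is hypothetical (regiments `A` and `B`, a single `score` column) and only illustrates which index level moves.

```python
import pandas as pd

# Hypothetical toy data: two regiments, two companies each
toy = pd.DataFrame({
    'regiment': ['A', 'A', 'B', 'B'],
    'company': ['1st', '2nd', '1st', '2nd'],
    'score': [1.0, 2.0, 3.0, 4.0],
})

# Long form: a Series indexed by (regiment, company)
long_form = toy.groupby(['regiment', 'company'])['score'].mean()

# unstack() moves the innermost level (company) into the columns
wide = long_form.unstack()

# A different level can be chosen explicitly
wide_by_regiment = long_form.unstack(level='regiment')

# stack() moves the columns back into the innermost index level
back = wide.stack()

print(wide)
print(wide_by_regiment)
print(back)
```

Here `back` holds the same values as `long_form`, which is why the two methods are usually described as inverses of each other.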
Code:
regiment.groupby(['regiment', 'company']).mean(numeric_only=True)
Output:
regiment | company | preTestScore | postTestScore |
---|---|---|---|
Dragoons | 1st | 3.5 | 47.5 |
Dragoons | 2nd | 27.5 | 75.5 |
Nighthawks | 1st | 14.0 | 59.5 |
Nighthawks | 2nd | 16.5 | 59.5 |
Scouts | 1st | 2.5 | 66.0 |
Scouts | 2nd | 2.5 | 66.0 |
Code:
regiment.groupby(['regiment', 'company']).size()
Output:
regiment company
Dragoons 1st 2
2nd 2
Nighthawks 1st 2
2nd 2
Scouts 1st 2
2nd 2
dtype: int64
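`size()` is easy to confuse with `count()`. A short sketch of the difference, on a hypothetical frame with one missing score: `size()` counts rows per group, while `count()` counts only non-null values in the selected column.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing score in company '1st'
toy = pd.DataFrame({
    'company': ['1st', '1st', '2nd', '2nd'],
    'preTestScore': [4.0, np.nan, 31.0, 2.0],
})

# Rows per group, missing values included
sizes = toy.groupby('company').size()

# Non-null values per group in one column
counts = toy.groupby('company')['preTestScore'].count()

print(sizes)
print(counts)
```

With complete data, as in the regiment example, the two give the same numbers.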
Code:
# Iterate over the groups: each step yields the group key and its sub-DataFrame
for name, group in regiment.groupby('regiment'):
    print(name)
    print(group)
Output:
Dragoons
regiment company name preTestScore postTestScore
4 Dragoons 1st Cooze 3 70
5 Dragoons 1st Jacon 4 25
6 Dragoons 2nd Ryaner 24 94
7 Dragoons 2nd Sone 31 57
Nighthawks
regiment company name preTestScore postTestScore
0 Nighthawks 1st Miller 4 25
1 Nighthawks 1st Jacobson 24 94
2 Nighthawks 2nd Ali 31 57
3 Nighthawks 2nd Milner 2 62
Scouts
regiment company name preTestScore postTestScore
8 Scouts 1st Sloan 2 62
9 Scouts 1st Piger 3 70
10 Scouts 2nd Riani 2 62
11 Scouts 2nd Ali 3 70
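When only one group is needed, looping over all of them is unnecessary. The sketch below, on a hypothetical three-row frame, shows `get_group` and `ngroups` next to the iteration pattern used above.

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({
    'regiment': ['Scouts', 'Dragoons', 'Scouts'],
    'name': ['Sloan', 'Cooze', 'Piger'],
})

grouped = toy.groupby('regiment')

# Number of distinct groups
print(grouped.ngroups)

# Fetch a single group by its key
scouts = grouped.get_group('Scouts')
print(scouts)

# Iteration yields (key, sub-frame) pairs, sorted by key by default
keys = [key for key, part in grouped]
print(keys)
```

The rows inside each group keep their original index labels, which matches the output printed by the loop above.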
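That is all for today's pandas exercises; keep practicing! The parts of the exercises that were written in English have been left untranslated this time, so it is worth getting used to reading English. Good luck with your studies. If you have questions, raise them in the comments, and everyone is welcome to learn and improve together.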