Python Data Analysis: Introductory pandas Exercises (Part 5)

Python Data Analysis Basics

  • Preparation
  • Exercise 1-GroupBy
      • Introduction:
      • Step 1. Import the necessary libraries
      • Step 2. Import the dataset from this [address](https://raw.githubusercontent.com/justmarkham/DAT8/master/data/drinks.csv).
      • Step 3. Assign it to a variable called drinks.
      • Step 4. Which continent drinks more beer on average?
      • Step 5. For each continent print the statistics for wine consumption.
      • Step 6. Print the mean alcohol consumption per continent for every column
      • Step 7. Print the median alcohol consumption per continent for every column
      • Step 8. Print the mean, min and max values for spirit consumption.
        • This time output a DataFrame
  • Exercise 2-Occupation
      • Introduction:
      • Step 1. Import the necessary libraries
      • Step 2. Import the dataset from this [address](https://raw.githubusercontent.com/justmarkham/DAT8/master/data/u.user).
      • Step 3. Assign it to a variable called users.
      • Step 4. Discover what is the mean age per occupation
      • Step 5. Discover the Male ratio per occupation and sort it from the most to the least
      • Step 6. For each occupation, calculate the minimum and maximum ages
      • Step 7. For each combination of occupation and gender, calculate the mean age
      • Step 8. For each occupation present the percentage of women and men
  • Exercise 3-Regiment
      • Introduction:
      • Step 1. Import the necessary libraries
      • Step 2. Create the DataFrame with the following values:
      • Step 3. Assign it to a variable called regiment.
        • Don't forget to name each column
      • Step 4. What is the mean preTestScore from the regiment Nighthawks?
      • Step 5. Present general statistics by company
      • Step 6. What is the mean of each company's preTestScore?
      • Step 7. Present the mean preTestScores grouped by regiment and company
      • Step 8. Present the mean preTestScores grouped by regiment and company without hierarchical indexing
      • Step 9. Group the entire dataframe by regiment and company
      • Step 10. What is the number of observations in each regiment and company
      • Step 11. Iterate over a group and print the name and the whole data from the regiment
  • Conclusion

Preparation

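The datasets for the exercises are available at the link below. Download them and work from local copies if you can, because the links given in the exercises may not open.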
https://github.com/justmarkham/pandas-videos/tree/master/data

Exercise 1-GroupBy

Introduction:

GroupBy can be summarized as Split-Apply-Combine.
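To make the three stages concrete, here is a minimal sketch that performs them by hand and then with groupby. The toy frame and its numbers are made up for illustration and are not taken from the drinks dataset.

```python
import pandas as pd

# Made-up toy data (an assumption for illustration only).
toy = pd.DataFrame({
    'continent': ['EU', 'EU', 'AS', 'AS', 'AS'],
    'beer_servings': [100, 200, 10, 20, 30],
})

# Split: one sub-frame per distinct key.
pieces = {key: part for key, part in toy.groupby('continent')}

# Apply: compute a statistic on each piece independently.
means = {key: part['beer_servings'].mean() for key, part in pieces.items()}

# Combine: gather the per-group results into a single Series.
manual = pd.Series(means).sort_index()

# groupby does the same three stages in one expression.
grouped = toy.groupby('continent')['beer_servings'].mean()

print(manual)
print(grouped)
```

Both Series hold the same per-continent means; the one-line groupby form is what the exercises below use.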

Step 1. Import the necessary libraries

Code:

import pandas as pd

Step 2. Import the dataset from this address: https://raw.githubusercontent.com/justmarkham/DAT8/master/data/drinks.csv

Step 3. Assign it to a variable called drinks.

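Code:

# ',' is already the default separator; pandas 2.0+ no longer accepts sep as a positional argument
# Note: the continent code 'NA' (North America) is parsed as NaN by default
drinks = pd.read_csv('drinks.csv')
drinks

Output: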

country beer_servings spirit_servings wine_servings total_litres_of_pure_alcohol continent
0 Afghanistan 0 0 0 0.0 AS
1 Albania 89 132 54 4.9 EU
2 Algeria 25 0 14 0.7 AF
3 Andorra 245 138 312 12.4 EU
4 Angola 217 57 45 5.9 AF
5 Antigua & Barbuda 102 128 45 4.9 NaN
6 Argentina 193 25 221 8.3 SA
7 Armenia 21 179 11 3.8 EU
8 Australia 261 72 212 10.4 OC
9 Austria 279 75 191 9.7 EU
10 Azerbaijan 21 46 5 1.3 EU
11 Bahamas 122 176 51 6.3 NaN
12 Bahrain 42 63 7 2.0 AS
13 Bangladesh 0 0 0 0.0 AS
14 Barbados 143 173 36 6.3 NaN
15 Belarus 142 373 42 14.4 EU
16 Belgium 295 84 212 10.5 EU
17 Belize 263 114 8 6.8 NaN
18 Benin 34 4 13 1.1 AF
19 Bhutan 23 0 0 0.4 AS
20 Bolivia 167 41 8 3.8 SA
21 Bosnia-Herzegovina 76 173 8 4.6 EU
22 Botswana 173 35 35 5.4 AF
23 Brazil 245 145 16 7.2 SA
24 Brunei 31 2 1 0.6 AS
25 Bulgaria 231 252 94 10.3 EU
26 Burkina Faso 25 7 7 4.3 AF
27 Burundi 88 0 0 6.3 AF
28 Cote d'Ivoire 37 1 7 4.0 AF
29 Cabo Verde 144 56 16 4.0 AF
... ... ... ... ... ... ...
163 Suriname 128 178 7 5.6 SA
164 Swaziland 90 2 2 4.7 AF
165 Sweden 152 60 186 7.2 EU
166 Switzerland 185 100 280 10.2 EU
167 Syria 5 35 16 1.0 AS
168 Tajikistan 2 15 0 0.3 AS
169 Thailand 99 258 1 6.4 AS
170 Macedonia 106 27 86 3.9 EU
171 Timor-Leste 1 1 4 0.1 AS
172 Togo 36 2 19 1.3 AF
173 Tonga 36 21 5 1.1 OC
174 Trinidad & Tobago 197 156 7 6.4 NaN
175 Tunisia 51 3 20 1.3 AF
176 Turkey 51 22 7 1.4 AS
177 Turkmenistan 19 71 32 2.2 AS
178 Tuvalu 6 41 9 1.0 OC
179 Uganda 45 9 0 8.3 AF
180 Ukraine 206 237 45 8.9 EU
181 United Arab Emirates 16 135 5 2.8 AS
182 United Kingdom 219 126 195 10.4 EU
183 Tanzania 36 6 1 5.7 AF
184 USA 249 158 84 8.7 NaN
185 Uruguay 115 35 220 6.6 SA
186 Uzbekistan 25 101 8 2.4 AS
187 Vanuatu 21 18 11 0.9 OC
188 Venezuela 333 100 3 7.7 SA
189 Vietnam 111 2 1 2.0 AS
190 Yemen 6 0 0 0.1 AS
191 Zambia 32 19 4 2.5 AF
192 Zimbabwe 64 18 4 4.7 AF

193 rows × 6 columns
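The NaN entries in the continent column above come from the continent code 'NA' (North America), which read_csv treats as a missing-value marker by default. That is also why no NA group appears in the grouped results below. A minimal sketch of how to keep the code as text, using a made-up three-row sample in place of the real file:

```python
import io
import pandas as pd

# Made-up sample (hypothetical countries and values), same layout idea as drinks.csv.
sample_csv = io.StringIO(
    "country,beer_servings,continent\n"
    "Country A,240,NA\n"
    "Country B,127,EU\n"
    "Country C,77,AS\n"
)

# Default parsing: the string 'NA' becomes NaN.
default_parse = pd.read_csv(sample_csv)

# Rewind the buffer and switch off the default NA strings so 'NA' survives as text.
sample_csv.seek(0)
kept = pd.read_csv(sample_csv, keep_default_na=False)

print(default_parse['continent'].isna().sum())
print(kept['continent'].tolist())
```

With keep_default_na=False, no string is treated as missing unless it is listed in na_values, so on a real file with genuinely empty cells it may be worth passing na_values=[''] as well. The outputs in this post were produced with the default parsing.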

Step 4. Which continent drinks more beer on average?

Code:

drinks.groupby('continent').beer_servings.mean()

Output:

continent
AF     61.471698
AS     37.045455
EU    193.777778
OC     89.687500
SA    175.083333
Name: beer_servings, dtype: float64

Step 5. For each continent print the statistics for wine consumption.

Code:

drinks.groupby('continent').wine_servings.describe()

Output:

           count        mean        std  min   25%    50%     75%    max
continent
AF          53.0   16.264151  38.846419  0.0   1.0    2.0   13.00  233.0
AS          44.0    9.068182  21.667034  0.0   0.0    1.0    8.00  123.0
EU          45.0  142.222222  97.421738  0.0  59.0  128.0  195.00  370.0
OC          16.0   35.625000  64.555790  0.0   1.0    8.5   23.25  212.0
SA          12.0   62.416667  88.620189  1.0   3.0   12.0   98.50  221.0

Step 6. Print the mean alcohol consumption per continent for every column

Code:

# numeric_only=True skips the text column country (pandas 2.0+ raises a TypeError without it)
drinks.groupby('continent').mean(numeric_only=True)

Output:

           beer_servings  spirit_servings  wine_servings  total_litres_of_pure_alcohol
continent
AF             61.471698        16.339623      16.264151                      3.007547
AS             37.045455        60.840909       9.068182                      2.170455
EU            193.777778       132.555556     142.222222                      8.617778
OC             89.687500        58.437500      35.625000                      3.381250
SA            175.083333       114.750000      62.416667                      6.308333

Step 7. Print the median alcohol consumption per continent for every column

Code:

# numeric_only=True skips the text column country (pandas 2.0+ raises a TypeError without it)
drinks.groupby('continent').median(numeric_only=True)

Output:

           beer_servings  spirit_servings  wine_servings  total_litres_of_pure_alcohol
continent
AF                  32.0              3.0            2.0                          2.30
AS                  17.5             16.0            1.0                          1.20
EU                 219.0            122.0          128.0                         10.00
OC                  52.5             37.0            8.5                          1.75
SA                 162.5            108.5           12.0                          6.85

Step 8. Print the mean, min and max values for spirit consumption.

This time output a DataFrame

Code:

drinks.groupby('continent').spirit_servings.agg(['mean', 'min', 'max'])
# agg applies one or more aggregation functions to the grouped data;
# called on the whole grouped frame, it aggregates every non-grouping column.
# To aggregate only some columns, or to use different functions per column, pass a dict:
# spirit_servings_info = {'spirit_servings': ['min', 'mean', 'max']}
# print(drinks.groupby('continent').agg(spirit_servings_info))

Output:

                 mean  min  max
continent
AF          16.339623    0  152
AS          60.840909    0  326
EU         132.555556    0  373
OC          58.437500    0  254
SA         114.750000   25  302
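The dict form mentioned in the comments, and the related named-aggregation form, can be sketched as follows. The toy frame is made up, and the output column names (spirit_mean, spirit_max, wine_total) are hypothetical names chosen for this example.

```python
import pandas as pd

# Made-up toy data (an assumption for illustration only).
toy = pd.DataFrame({
    'continent': ['EU', 'EU', 'AS', 'AS'],
    'spirit_servings': [100, 300, 0, 50],
    'wine_servings': [10, 30, 5, 15],
})

# Dict form: column name -> list of functions; the result has two-level columns.
by_dict = toy.groupby('continent').agg({'spirit_servings': ['mean', 'min', 'max']})

# Named aggregation: output column = (input column, function); the result has flat columns.
by_name = toy.groupby('continent').agg(
    spirit_mean=('spirit_servings', 'mean'),
    spirit_max=('spirit_servings', 'max'),
    wine_total=('wine_servings', 'sum'),
)

print(by_dict)
print(by_name)
```

Named aggregation is convenient when different columns need different statistics and flat, readable column names are preferred.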

Exercise 2-Occupation

Introduction:

Special thanks to: https://github.com/justmarkham for sharing the dataset and materials.

Step 1. Import the necessary libraries

Code:

import pandas as pd

Step 2. Import the dataset from this address: https://raw.githubusercontent.com/justmarkham/DAT8/master/data/u.user

Step 3. Assign it to a variable called users.

Code:

# The file is pipe-delimited; use user_id as the index
users = pd.read_table('u.user', sep='|', index_col='user_id')
users.head()

Output:

         age gender  occupation zip_code
user_id
1         24      M  technician    85711
2         53      F       other    94043
3         23      M      writer    32067
4         24      M  technician    43537
5         33      F       other    15213

Step 4. Discover what is the mean age per occupation

Code:

users.groupby('occupation').age.mean()

Output:

occupation
administrator    38.746835
artist           31.392857
doctor           43.571429
educator         42.010526
engineer         36.388060
entertainment    29.222222
executive        38.718750
healthcare       41.562500
homemaker        32.571429
lawyer           36.750000
librarian        40.000000
marketing        37.615385
none             26.555556
other            34.523810
programmer       33.121212
retired          63.071429
salesman         35.666667
scientist        35.548387
student          22.081633
technician       33.148148
writer           36.311111
Name: age, dtype: float64

Step 5. Discover the Male ratio per occupation and sort it from the most to the least

Code:

# Map gender to a number: 1 for male, 0 for female
def gender_to_numeric(x):
    if x == 'M':
        return 1
    if x == 'F':
        return 0

users['gender_n'] = users['gender'].apply(gender_to_numeric)

# Number of men per occupation divided by the occupation's head count, as a percentage
a = users.groupby('occupation').gender_n.sum() / users.occupation.value_counts() * 100
a.sort_values(ascending=False)

Output:

doctor           100.000000
engineer          97.014925
technician        96.296296
retired           92.857143
programmer        90.909091
executive         90.625000
scientist         90.322581
entertainment     88.888889
lawyer            83.333333
salesman          75.000000
educator          72.631579
student           69.387755
other             65.714286
marketing         61.538462
writer            57.777778
none              55.555556
administrator     54.430380
artist            53.571429
librarian         43.137255
healthcare        31.250000
homemaker         14.285714
dtype: float64
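The same ratio can also be computed without a helper column, since the mean of a boolean Series is the share of True values. A minimal sketch on a made-up sample that only borrows the column names of the users table:

```python
import pandas as pd

# Made-up sample (an assumption for illustration only).
toy_users = pd.DataFrame({
    'occupation': ['writer', 'writer', 'doctor',
                   'student', 'student', 'student', 'student'],
    'gender': ['M', 'F', 'M', 'M', 'M', 'M', 'F'],
})

# True counts as 1 and False as 0, so the group mean is the male share.
male_pct = (
    toy_users['gender'].eq('M')
    .groupby(toy_users['occupation'])
    .mean()
    .mul(100)
    .sort_values(ascending=False)
)

print(male_pct)
```

This avoids adding gender_n to the frame, which also keeps later whole-frame aggregations free of the extra column.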

Step 6. For each occupation, calculate the minimum and maximum ages

Code:

users.groupby('occupation').age.agg(['min', 'max'])

Output:

               min  max
occupation
administrator   21   70
artist          19   48
doctor          28   64
educator        23   63
engineer        22   70
entertainment   15   50
executive       22   69
healthcare      22   62
homemaker       20   50
lawyer          21   53
librarian       23   69
marketing       24   55
none            11   55
other           13   64
programmer      20   63
retired         51   73
salesman        18   66
scientist       23   55
student          7   42
technician      21   55
writer          18   60

Step 7. For each combination of occupation and gender, calculate the mean age

Code:

# numeric_only=True skips the text column zip_code (pandas 2.0+ raises a TypeError without it)
users.groupby(['occupation', 'gender']).mean(numeric_only=True)

Output:

                            age  gender_n
occupation    gender
administrator F       40.638889       0.0
              M       37.162791       1.0
artist        F       30.307692       0.0
              M       32.333333       1.0
doctor        M       43.571429       1.0
educator      F       39.115385       0.0
              M       43.101449       1.0
engineer      F       29.500000       0.0
              M       36.600000       1.0
entertainment F       31.000000       0.0
              M       29.000000       1.0
executive     F       44.000000       0.0
              M       38.172414       1.0
healthcare    F       39.818182       0.0
              M       45.400000       1.0
homemaker     F       34.166667       0.0
              M       23.000000       1.0
lawyer        F       39.500000       0.0
              M       36.200000       1.0
librarian     F       40.000000       0.0
              M       40.000000       1.0
marketing     F       37.200000       0.0
              M       37.875000       1.0
none          F       36.500000       0.0
              M       18.600000       1.0
other         F       35.472222       0.0
              M       34.028986       1.0
programmer    F       32.166667       0.0
              M       33.216667       1.0
retired       F       70.000000       0.0
              M       62.538462       1.0
salesman      F       27.000000       0.0
              M       38.555556       1.0
scientist     F       28.333333       0.0
              M       36.321429       1.0
student       F       20.750000       0.0
              M       22.669118       1.0
technician    F       38.000000       0.0
              M       32.961538       1.0
writer        F       37.631579       0.0
              M       35.346154       1.0

Step 8. For each occupation present the percentage of women and men

Code:

# Alternative that reuses the ratio from Step 5:
# a = users.groupby('occupation').gender_n.sum() / users.occupation.value_counts() * 100
# print(a.sort_values(ascending=False))
# b = 100 - a
# print(b.sort_values(ascending=True))
gender_ocup = users.groupby(['occupation', 'gender']).agg({'gender': 'count'})  # head count per occupation and gender
occup_count = users.groupby(['occupation']).agg('count')                        # head count per occupation
occup_gender = gender_ocup.div(occup_count, level="occupation") * 100           # share of each gender within its occupation (a DataFrame)
occup_gender.loc[:, 'gender']   # show the gender column

Output:

occupation     gender
administrator  F          45.569620
               M          54.430380
artist         F          46.428571
               M          53.571429
doctor         M         100.000000
educator       F          27.368421
               M          72.631579
engineer       F           2.985075
               M          97.014925
entertainment  F          11.111111
               M          88.888889
executive      F           9.375000
               M          90.625000
healthcare     F          68.750000
               M          31.250000
homemaker      F          85.714286
               M          14.285714
lawyer         F          16.666667
               M          83.333333
librarian      F          56.862745
               M          43.137255
marketing      F          38.461538
               M          61.538462
none           F          44.444444
               M          55.555556
other          F          34.285714
               M          65.714286
programmer     F           9.090909
               M          90.909091
retired        F           7.142857
               M          92.857143
salesman       F          25.000000
               M          75.000000
scientist      F           9.677419
               M          90.322581
student        F          30.612245
               M          69.387755
technician     F           3.703704
               M          96.296296
writer         F          42.222222
               M          57.777778
Name: gender, dtype: float64
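A shorter route to the same kind of table is a row-normalized cross-tabulation. The sketch below uses a made-up sample that only borrows the column names of the users table; it is an alternative, not the approach used above.

```python
import pandas as pd

# Made-up sample (an assumption for illustration only).
toy_users = pd.DataFrame({
    'occupation': ['writer', 'writer', 'doctor',
                   'student', 'student', 'student', 'student'],
    'gender': ['M', 'F', 'M', 'M', 'M', 'M', 'F'],
})

# normalize='index' divides each row by its total, so every occupation sums to 100.
pct = pd.crosstab(toy_users['occupation'], toy_users['gender'], normalize='index') * 100

print(pct)
```

Unlike the groupby result above, which simply omits a missing combination such as female doctors, crosstab returns a wide table and fills that cell with 0.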

Exercise 3-Regiment

Introduction:

Special thanks to: http://chrisalbon.com/ for sharing the dataset and materials.

Step 1. Import the necessary libraries

Code:

import pandas as pd

Step 2. Create the DataFrame with the following values:

Code:

raw_data = {'regiment': ['Nighthawks', 'Nighthawks', 'Nighthawks', 'Nighthawks', 'Dragoons', 'Dragoons', 'Dragoons', 'Dragoons', 'Scouts', 'Scouts', 'Scouts', 'Scouts'], 
        'company': ['1st', '1st', '2nd', '2nd', '1st', '1st', '2nd', '2nd','1st', '1st', '2nd', '2nd'], 
        'name': ['Miller', 'Jacobson', 'Ali', 'Milner', 'Cooze', 'Jacon', 'Ryaner', 'Sone', 'Sloan', 'Piger', 'Riani', 'Ali'], 
        'preTestScore': [4, 24, 31, 2, 3, 4, 24, 31, 2, 3, 2, 3],
        'postTestScore': [25, 94, 57, 62, 70, 25, 94, 57, 62, 70, 62, 70]}

Step 3. Assign it to a variable called regiment.

Don’t forget to name each column

Code:

# Use the dict keys as column names, in their original order
regiment = pd.DataFrame(raw_data, columns=list(raw_data.keys()))
regiment

Output:

      regiment  company      name  preTestScore  postTestScore
0   Nighthawks      1st    Miller             4             25
1   Nighthawks      1st  Jacobson            24             94
2   Nighthawks      2nd       Ali            31             57
3   Nighthawks      2nd    Milner             2             62
4     Dragoons      1st     Cooze             3             70
5     Dragoons      1st     Jacon             4             25
6     Dragoons      2nd    Ryaner            24             94
7     Dragoons      2nd      Sone            31             57
8       Scouts      1st     Sloan             2             62
9       Scouts      1st     Piger             3             70
10      Scouts      2nd     Riani             2             62
11      Scouts      2nd       Ali             3             70

Step 4. What is the mean preTestScore from the regiment Nighthawks?

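Code:

# numeric_only=True skips the text columns company and name (pandas 2.0+ raises a TypeError without it)
regiment[regiment['regiment'] == 'Nighthawks'].groupby('regiment').mean(numeric_only=True)
# regiment[regiment['regiment'] == 'Nighthawks'].mean(numeric_only=True)

Output:

            preTestScore  postTestScore
regiment
Nighthawks         15.25           59.5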
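Since the question asks for a single number, a boolean mask with .loc is enough and no groupby is needed. A minimal sketch that rebuilds only the two columns involved, with the same values as raw_data:

```python
import pandas as pd

# Same values as raw_data above, trimmed to the two columns used here.
regiment = pd.DataFrame({
    'regiment': ['Nighthawks'] * 4 + ['Dragoons'] * 4 + ['Scouts'] * 4,
    'preTestScore': [4, 24, 31, 2, 3, 4, 24, 31, 2, 3, 2, 3],
})

# Select the Nighthawks rows and the preTestScore column, then average.
mask = regiment['regiment'] == 'Nighthawks'
nighthawks_mean = regiment.loc[mask, 'preTestScore'].mean()

print(nighthawks_mean)  # 15.25
```

The result is a plain float rather than a one-row DataFrame.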

Step 5. Present general statistics by company

Code:

regiment.groupby('company').describe()

Output:

         postTestScore                                               preTestScore
         count       mean        std   min    25%   50%   75%   max  count       mean        std  min   25%   50%    75%   max
company
1st        6.0  57.666667  27.485754  25.0  34.25  66.0  70.0  94.0    6.0   6.666667   8.524475  2.0  3.00   3.5   4.00  24.0
2nd        6.0  67.000000  14.057027  57.0  58.25  62.0  68.0  94.0    6.0  15.500000  14.652645  2.0  2.25  13.5  29.25  31.0

Step 6. What is the mean of each company's preTestScore?

Code:

regiment.groupby('company').preTestScore.mean()

Output:

company
1st     6.666667
2nd    15.500000
Name: preTestScore, dtype: float64

Step 7. Present the mean preTestScores grouped by regiment and company

Code:

regiment.groupby(['regiment', 'company']).preTestScore.mean()

Output:

regiment    company
Dragoons    1st         3.5
            2nd        27.5
Nighthawks  1st        14.0
            2nd        16.5
Scouts      1st         2.5
            2nd         2.5
Name: preTestScore, dtype: float64

Step 8. Present the mean preTestScores grouped by regiment and company without hierarchical indexing

Code:

# stack() and unstack()
# stack: pivots columns into rows.
# unstack: pivots rows into columns.
# With a multi-level index, both act on the innermost level by default.
regiment.groupby(['regiment', 'company']).preTestScore.mean().unstack()

Output:

company      1st   2nd
regiment
Dragoons     3.5  27.5
Nighthawks  14.0  16.5
Scouts       2.5   2.5
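unstack gives a wide table. If a long table with ordinary columns is preferred, reset_index or as_index=False also removes the hierarchical index. A minimal sketch, rebuilding only the three columns involved with the same values as raw_data:

```python
import pandas as pd

# Same values as raw_data above, trimmed to the columns used here.
regiment = pd.DataFrame({
    'regiment': ['Nighthawks'] * 4 + ['Dragoons'] * 4 + ['Scouts'] * 4,
    'company': ['1st', '1st', '2nd', '2nd'] * 3,
    'preTestScore': [4, 24, 31, 2, 3, 4, 24, 31, 2, 3, 2, 3],
})

means = regiment.groupby(['regiment', 'company'])['preTestScore'].mean()

# reset_index turns both index levels back into ordinary columns.
flat_long = means.reset_index()

# as_index=False produces the same flat layout directly from groupby.
flat_direct = regiment.groupby(['regiment', 'company'], as_index=False)['preTestScore'].mean()

print(flat_long)
```

Which form is better depends on what comes next: the wide unstack table is easier to read, while the long form is usually easier to merge or plot.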

Step 9. Group the entire dataframe by regiment and company

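Code:

# numeric_only=True skips the text column name (pandas 2.0+ raises a TypeError without it)
regiment.groupby(['regiment', 'company']).mean(numeric_only=True)

Output:

                    preTestScore  postTestScore
regiment   company
Dragoons   1st               3.5           47.5
           2nd              27.5           75.5
Nighthawks 1st              14.0           59.5
           2nd              16.5           59.5
Scouts     1st               2.5           66.0
           2nd               2.5           66.0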

Step 10. What is the number of observations in each regiment and company

Code:

regiment.groupby(['regiment', 'company']).size()

Output:

regiment    company
Dragoons    1st        2
            2nd        2
Nighthawks  1st        2
            2nd        2
Scouts      1st        2
            2nd        2
dtype: int64
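size() counts rows per group, whereas count() counts non-null values per column. The two agree on the regiment data because it has no missing values. A minimal sketch on a made-up frame with one missing score shows the difference:

```python
import numpy as np
import pandas as pd

# Made-up toy data with one missing score (an assumption for illustration only).
toy = pd.DataFrame({
    'company': ['1st', '1st', '2nd'],
    'preTestScore': [4.0, np.nan, 31.0],
})

sizes = toy.groupby('company').size()                    # rows per group, NaN included
counts = toy.groupby('company')['preTestScore'].count()  # non-null values per group

print(sizes)
print(counts)
```

For "number of observations", size() is usually the intended call.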

Step 11. Iterate over a group and print the name and the whole data from the regiment

Code:

# Each iteration yields the group key and the sub-DataFrame for that key
for name, group in regiment.groupby('regiment'):
    print(name)
    print(group)

Output:

Dragoons
   regiment company    name  preTestScore  postTestScore
4  Dragoons     1st   Cooze             3             70
5  Dragoons     1st   Jacon             4             25
6  Dragoons     2nd  Ryaner            24             94
7  Dragoons     2nd    Sone            31             57
Nighthawks
     regiment company      name  preTestScore  postTestScore
0  Nighthawks     1st    Miller             4             25
1  Nighthawks     1st  Jacobson            24             94
2  Nighthawks     2nd       Ali            31             57
3  Nighthawks     2nd    Milner             2             62
Scouts
   regiment company   name  preTestScore  postTestScore
8    Scouts     1st  Sloan             2             62
9    Scouts     1st  Piger             3             70
10   Scouts     2nd  Riani             2             62
11   Scouts     2nd    Ali             3             70
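When only one group is needed, get_group fetches it without a loop. A minimal sketch, rebuilding only the two columns involved with the same values as raw_data:

```python
import pandas as pd

# Same values as raw_data above, trimmed to the columns used here.
regiment = pd.DataFrame({
    'regiment': ['Nighthawks'] * 4 + ['Dragoons'] * 4 + ['Scouts'] * 4,
    'name': ['Miller', 'Jacobson', 'Ali', 'Milner', 'Cooze', 'Jacon',
             'Ryaner', 'Sone', 'Sloan', 'Piger', 'Riani', 'Ali'],
})

groups = regiment.groupby('regiment')

# Number of distinct groups.
print(groups.ngroups)

# Rows belonging to a single key, with their original index labels.
scouts = groups.get_group('Scouts')
print(scouts)
```

The returned frame keeps the original row labels (8 to 11 here), just as in the loop output above.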

Conclusion

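That's all for today's pandas exercises. Keep practicing! The exercise prompts were left in English this time, so get used to reading them in English. If you have questions, raise them in the comments; everyone is welcome to learn and improve together.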
