Python Data Analysis: pandas Beginner Exercises (Part 4)

Python Data Analysis Basics

  • Preparation
  • Exercise 1 - Filtering and Sorting Data
      • Step 1. Import the necessary libraries
      • Step 2. Import the dataset from this [address](https://raw.githubusercontent.com/justmarkham/DAT8/master/data/chipotle.tsv).
      • Step 3. Assign it to a variable called chipo.
      • Step 4. How many products cost more than $10.00?
      • Step 5. What is the price of each item?
          • print a data frame with only two columns item_name and item_price
      • Step 6. Sort by the name of the item
      • Step 7. What was the quantity of the most expensive item ordered?
      • Step 8. How many times was a Veggie Salad Bowl ordered?
      • Step 9. How many times did people order more than one Canned Soda?
  • Exercise 2 - Filtering and Sorting Data
      • Step 1. Import the necessary libraries
      • Step 2. Import the dataset from this [address](https://raw.githubusercontent.com/guipsamora/pandas_exercises/master/02_Filtering_%26_Sorting/Euro12/Euro_2012_stats_TEAM.csv).
      • Step 3. Assign it to a variable called euro12.
      • Step 4. Select only the Goal column.
      • Step 5. How many teams participated in Euro 2012?
      • Step 6. What is the number of columns in the dataset?
      • Step 7. View only the columns Team, Yellow Cards and Red Cards and assign them to a dataframe called discipline
      • Step 8. Sort the teams by Red Cards, then by Yellow Cards
      • Step 9. Calculate the mean Yellow Cards given per Team
      • Step 10. Filter teams that scored more than 6 goals
      • Step 11. Select the teams that start with G
      • Step 12. Select the first 7 columns
      • Step 13. Select all columns except the last 3.
      • Step 14. Present only the Shooting Accuracy from England, Italy and Russia
  • Fictional Army - Filtering and Sorting
      • Introduction:
      • Step 1. Import the necessary libraries
      • Step 2. This is the data given as a dictionary
      • Step 3. Create a dataframe and assign it to a variable called army.
        • Don't forget to include the column names in the order presented in the dictionary ('regiment', 'company', 'deaths'...) so that the column order is consistent with the solutions. If omitted, older pandas versions may order the columns alphabetically.
      • Step 4. Set the 'origin' column as the index of the dataframe
      • Step 5. Print only the column veterans
      • Step 6. Print the columns 'veterans' and 'deaths'
      • Step 7. Print the name of all the columns.
      • Step 8. Select the 'deaths', 'size' and 'deserters' columns from Maine and Alaska
      • Step 9. Select the rows 3 to 7 and the columns 3 to 6
      • Step 10. Select every row after the fourth row
      • Step 11. Select every row up to the 4th row
      • Step 12. Select the 3rd column up to the 7th column
      • Step 13. Select rows where df.deaths is greater than 50
      • Step 14. Select rows where df.deaths is greater than 500 or less than 50
      • Step 15. Select all the regiments not named "Dragoons"
      • Step 16. Select the rows called Texas and Arizona
      • Step 17. Select the third cell in the row named Arizona
      • Step 18. Select the third cell down in the column named deaths
  • Summary
  • Closing remarks

Preparation

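The URLs given in the exercises below may not be reachable. The datasets at the following address have been tested and work; if they stop working, look for the datasets elsewhere online. https://github.com/daacheng/PythonBasic/tree/master/dataset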

Exercise 1 - Filtering and Sorting Data

Step 1. Import the necessary libraries

Code:

import pandas as pd

Step 2. Import the dataset from this address.

The dataset at this address may not be reachable; you may need a proxy to access it.

Step 3. Assign it to a variable called chipo.

Code:

chipo = pd.read_csv('chipotle.csv', sep=',')

Step 4. How many products cost more than $10.00?

Code:

# The task: count the products whose unit price is above $10
# Clean the item_price column and convert it to float
# ([1:-1] drops the first and last character; this assumes values shaped like "$2.39 ")
prices = [float(value[1:-1]) for value in chipo.item_price]
# Reassign the column with the cleaned prices
chipo.item_price = prices
# Drop duplicate (item_name, quantity) combinations.
# drop_duplicates(subset=None, keep="first", inplace=False):
#   subset:  column label(s) used to identify duplicates; all columns by default.
#   keep:    'first' keeps the first occurrence, 'last' keeps the last, False drops every duplicated row.
#   inplace: if True, modify the DataFrame in place and return None; otherwise return a new DataFrame.
chipo_filtered = chipo.drop_duplicates(['item_name', 'quantity'])
# Keep only the rows where quantity equals 1
chipo_one_prod = chipo_filtered[chipo_filtered.quantity == 1]
# nunique() returns the number of distinct values in the column
chipo_one_prod[chipo_one_prod['item_price']>10].item_name.nunique()

Output:

12
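
The list comprehension above works, but the same cleanup can also be written with pandas string methods. Here is a minimal sketch on a small made-up sample (the three rows and the name `sample` are assumptions, not taken from the chipotle file), assuming the raw prices look like "$2.39 " with a trailing space:

```python
import pandas as pd

# Hypothetical sample mimicking raw item_price strings ("$3.39 " with a trailing space).
sample = pd.DataFrame({
    "item_name": ["Izze", "Chicken Bowl", "Steak Burrito"],
    "quantity": [1, 1, 1],
    "item_price": ["$3.39 ", "$10.98 ", "$11.75 "],
})

# Strip surrounding whitespace, remove the dollar sign, then convert to float.
sample["item_price"] = (
    sample["item_price"]
    .str.strip()
    .str.replace("$", "", regex=False)
    .astype(float)
)

# Count the distinct products priced above $10.
expensive = sample[sample["item_price"] > 10]
n_expensive = expensive["item_name"].nunique()
print(n_expensive)
```

If your copy of the file stores prices without the trailing space, `str.strip()` simply does nothing, whereas slicing with `[1:-1]` would cut off the last digit.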

Step 5. What is the price of each item?

print a data frame with only two columns item_name and item_price

Code:

# Show the unit price of each item, printing only item_name and item_price
# delete the duplicates in item_name and quantity
chipo_filtered = chipo.drop_duplicates(['item_name','quantity'])
# chipo[(chipo['item_name'] == 'Chicken Bowl') & (chipo['quantity'] == 1)]

# select only the products with quantity equals to 1
chipo_one_prod = chipo_filtered[chipo_filtered.quantity == 1]

# select only the item_name and item_price columns
price_per_item = chipo_one_prod[['item_name', 'item_price']]
print(price_per_item)
# sort the values from the most to less expensive
# price_per_item.sort_values(by = "item_price", ascending = False).head(20)

Output:

                                  item_name  item_price
0              Chips and Fresh Tomato Salsa        2.39
1                                      Izze        3.39
2                          Nantucket Nectar        3.39
3     Chips and Tomatillo-Green Chili Salsa        2.39
5                              Chicken Bowl       10.98
6                             Side of Chips        1.69
7                             Steak Burrito       11.75
8                          Steak Soft Tacos        9.25
10                      Chips and Guacamole        4.45
11                     Chicken Crispy Tacos        8.75
12                       Chicken Soft Tacos        8.75
16                          Chicken Burrito        8.49
21                         Barbacoa Burrito        8.99
27                         Carnitas Burrito        8.99
28                              Canned Soda        1.09
33                            Carnitas Bowl        8.99
34                            Bottled Water        1.09
38    Chips and Tomatillo Green Chili Salsa        2.95
39                            Barbacoa Bowl       11.75
40                                    Chips        2.15
44                       Chicken Salad Bowl        8.75
54                               Steak Bowl        8.99
56                      Barbacoa Soft Tacos        9.25
57                           Veggie Burrito       11.25
62                              Veggie Bowl       11.25
92                       Steak Crispy Tacos        9.25
111     Chips and Tomatillo Red Chili Salsa        2.95
168                   Barbacoa Crispy Tacos       11.75
186                       Veggie Salad Bowl       11.25
191      Chips and Roasted Chili-Corn Salsa        2.39
233      Chips and Roasted Chili Corn Salsa        2.95
237                     Carnitas Soft Tacos        9.25
250                           Chicken Salad       10.98
263                       Canned Soft Drink        1.25
298                       6 Pack Soft Drink        6.49
300     Chips and Tomatillo-Red Chili Salsa        2.39
510                                 Burrito        7.40
520                            Crispy Tacos        7.40
554                   Carnitas Crispy Tacos        9.25
606                        Steak Salad Bowl       11.89
664                             Steak Salad        8.99
673                                    Bowl        7.40
674       Chips and Mild Fresh Tomato Salsa        3.00
738                       Veggie Soft Tacos       11.25
1132                    Carnitas Salad Bowl       11.89
1229                    Barbacoa Salad Bowl       11.89
1414                                  Salad        7.40
1653                    Veggie Crispy Tacos        8.49
1694                           Veggie Salad        8.49
3750                         Carnitas Salad        8.99

Step 6. Sort by the name of the item

Code:

chipo.sort_values(by='item_name')
# chipo.item_name.sort_values()

Output:

Unnamed: 0 order_id quantity item_name choice_description item_price
3389 3389 1360 2 6 Pack Soft Drink [Diet Coke] 12.98
341 341 148 1 6 Pack Soft Drink [Diet Coke] 6.49
1849 1849 749 1 6 Pack Soft Drink [Coke] 6.49
1860 1860 754 1 6 Pack Soft Drink [Diet Coke] 6.49
2713 2713 1076 1 6 Pack Soft Drink [Coke] 6.49
3422 3422 1373 1 6 Pack Soft Drink [Coke] 6.49
553 553 230 1 6 Pack Soft Drink [Diet Coke] 6.49
1916 1916 774 1 6 Pack Soft Drink [Diet Coke] 6.49
1922 1922 776 1 6 Pack Soft Drink [Coke] 6.49
1937 1937 784 1 6 Pack Soft Drink [Diet Coke] 6.49
3836 3836 1537 1 6 Pack Soft Drink [Coke] 6.49
298 298 129 1 6 Pack Soft Drink [Sprite] 6.49
1976 1976 798 1 6 Pack Soft Drink [Diet Coke] 6.49
1167 1167 481 1 6 Pack Soft Drink [Coke] 6.49
3875 3875 1554 1 6 Pack Soft Drink [Diet Coke] 6.49
1124 1124 465 1 6 Pack Soft Drink [Coke] 6.49
3886 3886 1558 1 6 Pack Soft Drink [Diet Coke] 6.49
2108 2108 849 1 6 Pack Soft Drink [Coke] 6.49
3010 3010 1196 1 6 Pack Soft Drink [Diet Coke] 6.49
4535 4535 1803 1 6 Pack Soft Drink [Lemonade] 6.49
4169 4169 1664 1 6 Pack Soft Drink [Diet Coke] 6.49
4174 4174 1666 1 6 Pack Soft Drink [Coke] 6.49
4527 4527 1800 1 6 Pack Soft Drink [Diet Coke] 6.49
4522 4522 1798 1 6 Pack Soft Drink [Diet Coke] 6.49
3806 3806 1525 1 6 Pack Soft Drink [Sprite] 6.49
2389 2389 949 1 6 Pack Soft Drink [Coke] 6.49
3132 3132 1248 1 6 Pack Soft Drink [Diet Coke] 6.49
3141 3141 1253 1 6 Pack Soft Drink [Lemonade] 6.49
639 639 264 1 6 Pack Soft Drink [Diet Coke] 6.49
1026 1026 422 1 6 Pack Soft Drink [Sprite] 6.49
... ... ... ... ... ... ...
2996 2996 1192 1 Veggie Salad [Roasted Chili Corn Salsa (Medium), [Black Bea... 8.49
3163 3163 1263 1 Veggie Salad [[Fresh Tomato Salsa (Mild), Roasted Chili Cor... 8.49
4084 4084 1635 1 Veggie Salad [[Fresh Tomato Salsa (Mild), Roasted Chili Cor... 8.49
1694 1694 686 1 Veggie Salad [[Fresh Tomato Salsa (Mild), Roasted Chili Cor... 8.49
2756 2756 1094 1 Veggie Salad [[Tomatillo-Green Chili Salsa (Medium), Roaste... 8.49
4201 4201 1677 1 Veggie Salad Bowl [Fresh Tomato Salsa, [Fajita Vegetables, Black... 11.25
1884 1884 760 1 Veggie Salad Bowl [Fresh Tomato Salsa, [Fajita Vegetables, Rice,... 11.25
455 455 195 1 Veggie Salad Bowl [Fresh Tomato Salsa, [Fajita Vegetables, Rice,... 11.25
3223 3223 1289 1 Veggie Salad Bowl [Tomatillo Red Chili Salsa, [Fajita Vegetables... 11.25
2223 2223 896 1 Veggie Salad Bowl [Roasted Chili Corn Salsa, Fajita Vegetables] 8.75
2269 2269 913 1 Veggie Salad Bowl [Fresh Tomato Salsa, [Fajita Vegetables, Rice,... 8.75
4541 4541 1805 1 Veggie Salad Bowl [Tomatillo Green Chili Salsa, [Fajita Vegetabl... 8.75
3293 3293 1321 1 Veggie Salad Bowl [Fresh Tomato Salsa, [Rice, Black Beans, Chees... 8.75
186 186 83 1 Veggie Salad Bowl [Fresh Tomato Salsa, [Fajita Vegetables, Rice,... 11.25
960 960 394 1 Veggie Salad Bowl [Fresh Tomato Salsa, [Fajita Vegetables, Lettu... 8.75
1316 1316 536 1 Veggie Salad Bowl [Fresh Tomato Salsa, [Fajita Vegetables, Rice,... 8.75
2156 2156 869 1 Veggie Salad Bowl [Tomatillo Red Chili Salsa, [Fajita Vegetables... 11.25
4261 4261 1700 1 Veggie Salad Bowl [Fresh Tomato Salsa, [Fajita Vegetables, Rice,... 11.25
295 295 128 1 Veggie Salad Bowl [Fresh Tomato Salsa, [Fajita Vegetables, Lettu... 11.25
4573 4573 1818 1 Veggie Salad Bowl [Fresh Tomato Salsa, [Fajita Vegetables, Pinto... 8.75
2683 2683 1066 1 Veggie Salad Bowl [Roasted Chili Corn Salsa, [Fajita Vegetables,... 8.75
496 496 207 1 Veggie Salad Bowl [Fresh Tomato Salsa, [Rice, Lettuce, Guacamole... 11.25
4109 4109 1646 1 Veggie Salad Bowl [Tomatillo Red Chili Salsa, [Fajita Vegetables... 11.25
738 738 304 1 Veggie Soft Tacos [Tomatillo Red Chili Salsa, [Fajita Vegetables... 11.25
3889 3889 1559 2 Veggie Soft Tacos [Fresh Tomato Salsa (Mild), [Black Beans, Rice... 16.98
2384 2384 948 1 Veggie Soft Tacos [Roasted Chili Corn Salsa, [Fajita Vegetables,... 8.75
781 781 322 1 Veggie Soft Tacos [Fresh Tomato Salsa, [Black Beans, Cheese, Sou... 8.75
2851 2851 1132 1 Veggie Soft Tacos [Roasted Chili Corn Salsa (Medium), [Black Bea... 8.49
1699 1699 688 1 Veggie Soft Tacos [Fresh Tomato Salsa, [Fajita Vegetables, Rice,... 11.25
1395 1395 567 1 Veggie Soft Tacos [Fresh Tomato Salsa (Mild), [Pinto Beans, Rice... 8.49

4622 rows × 6 columns

Step 7. What was the quantity of the most expensive item ordered?

Code:

chipo.sort_values(by='item_price', ascending=False).head(1)

Output:

      Unnamed: 0  order_id  quantity                     item_name choice_description  item_price
3598        3598      1443        15  Chips and Fresh Tomato Salsa                NaN       44.25
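
Sorting the whole frame just to read one row is fine for a dataset this small. As an alternative sketch, `idxmax()` returns the index label of the largest value, which can then be passed to `loc` to read the quantity directly. The tiny `orders` frame below is made up for illustration:

```python
import pandas as pd

# Hypothetical orders; only the column names follow the chipotle dataset.
orders = pd.DataFrame({
    "item_name": ["Chips and Fresh Tomato Salsa", "Chicken Bowl", "Canned Soda"],
    "quantity": [15, 1, 2],
    "item_price": [44.25, 10.98, 2.18],
})

# Index label of the row with the highest item_price.
top_label = orders["item_price"].idxmax()

# Read a single cell: the quantity on that row.
top_quantity = orders.loc[top_label, "quantity"]
print(top_label, top_quantity)
```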

Step 8. How many times was a Veggie Salad Bowl ordered?

Code:

chipo_salad = chipo[chipo.item_name == 'Veggie Salad Bowl']
len(chipo_salad)
# or: chipo_salad.shape[0]

Output:

18

Step 9. How many times did people order more than one Canned Soda?

Code:

# chipo[(chipo['item_name'] == 'Chicken Bowl') & (chipo['quantity'] == 1)]
chipo_soda = chipo[(chipo.item_name == 'Canned Soda') & (chipo.quantity>1)]
len(chipo_soda)
# or: print(chipo_soda.shape[0])

Output:

20
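
The filter in Step 9 combines two boolean masks with `&`. Each condition needs its own parentheses because `&` binds more tightly than `==` and `>`. A small sketch on made-up rows (the `orders` frame is an assumption for illustration) shows the pattern, plus the shortcut of summing the mask to count matches:

```python
import pandas as pd

# Hypothetical orders; column names follow the chipotle dataset.
orders = pd.DataFrame({
    "item_name": ["Canned Soda", "Canned Soda", "Izze", "Canned Soda"],
    "quantity": [1, 2, 3, 4],
})

# Each comparison yields a boolean Series; & combines them row by row.
mask = (orders["item_name"] == "Canned Soda") & (orders["quantity"] > 1)

n_rows = len(orders[mask])   # count by filtering first
n_true = int(mask.sum())     # count directly: True counts as 1
print(n_rows, n_true)
```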

Exercise 2 - Filtering and Sorting Data

This time we are going to pull data directly from the internet.

Step 1. Import the necessary libraries

Code:

import pandas as pd

Step 2. Import the dataset from this address.

Step 3. Assign it to a variable called euro12.

Code:

euro12 = pd.read_csv('Euro_2012_stats_TEAM.csv', sep=',')
euro12

Output:

Team Goals Shots on target Shots off target Shooting Accuracy % Goals-to-shots Total shots (inc. Blocked) Hit Woodwork Penalty goals Penalties not scored ... Saves made Saves-to-shots ratio Fouls Won Fouls Conceded Offsides Yellow Cards Red Cards Subs on Subs off Players Used
0 Croatia 4 13 12 51.9% 16.0% 32 0 0 0 ... 13 81.3% 41 62 2 9 0 9 9 16
1 Czech Republic 4 13 18 41.9% 12.9% 39 0 0 0 ... 9 60.1% 53 73 8 7 0 11 11 19
2 Denmark 4 10 10 50.0% 20.0% 27 1 0 0 ... 10 66.7% 25 38 8 4 0 7 7 15
3 England 5 11 18 50.0% 17.2% 40 0 0 0 ... 22 88.1% 43 45 6 5 0 11 11 16
4 France 3 22 24 37.9% 6.5% 65 1 0 0 ... 6 54.6% 36 51 5 6 0 11 11 19
5 Germany 10 32 32 47.8% 15.6% 80 2 1 0 ... 10 62.6% 63 49 12 4 0 15 15 17
6 Greece 5 8 18 30.7% 19.2% 32 1 1 1 ... 13 65.1% 67 48 12 9 1 12 12 20
7 Italy 6 34 45 43.0% 7.5% 110 2 0 0 ... 20 74.1% 101 89 16 16 0 18 18 19
8 Netherlands 2 12 36 25.0% 4.1% 60 2 0 0 ... 12 70.6% 35 30 3 5 0 7 7 15
9 Poland 2 15 23 39.4% 5.2% 48 0 0 0 ... 6 66.7% 48 56 3 7 1 7 7 17
10 Portugal 6 22 42 34.3% 9.3% 82 6 0 0 ... 10 71.5% 73 90 10 12 0 14 14 16
11 Republic of Ireland 1 7 12 36.8% 5.2% 28 0 0 0 ... 17 65.4% 43 51 11 6 1 10 10 17
12 Russia 5 9 31 22.5% 12.5% 59 2 0 0 ... 10 77.0% 34 43 4 6 0 7 7 16
13 Spain 12 42 33 55.9% 16.0% 100 0 1 0 ... 15 93.8% 102 83 19 11 0 17 17 18
14 Sweden 5 17 19 47.2% 13.8% 39 3 0 0 ... 8 61.6% 35 51 7 7 0 9 9 18
15 Ukraine 2 7 26 21.2% 6.0% 38 0 0 0 ... 13 76.5% 48 31 4 5 0 9 9 18

16 rows × 35 columns

Step 4. Select only the Goal column.

Code:

euro12.Goals
# or: euro12['Goals']

Output:

0      4
1      4
2      4
3      5
4      3
5     10
6      5
7      6
8      2
9      2
10     6
11     1
12     5
13    12
14     5
15     2
Name: Goals, dtype: int64

Step 5. How many teams participated in Euro 2012?

Code:

euro12.shape[0]
# or: len(euro12.Team)

Output:

16

Step 6. What is the number of columns in the dataset?

Code:

# euro12.columns.shape[0]
euro12.info()

Output:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 16 entries, 0 to 15
Data columns (total 35 columns):
Team                          16 non-null object
Goals                         16 non-null int64
Shots on target               16 non-null int64
Shots off target              16 non-null int64
Shooting Accuracy             16 non-null object
% Goals-to-shots              16 non-null object
Total shots (inc. Blocked)    16 non-null int64
Hit Woodwork                  16 non-null int64
Penalty goals                 16 non-null int64
Penalties not scored          16 non-null int64
Headed goals                  16 non-null int64
Passes                        16 non-null int64
Passes completed              16 non-null int64
Passing Accuracy              16 non-null object
Touches                       16 non-null int64
Crosses                       16 non-null int64
Dribbles                      16 non-null int64
Corners Taken                 16 non-null int64
Tackles                       16 non-null int64
Clearances                    16 non-null int64
Interceptions                 16 non-null int64
Clearances off line           15 non-null float64
Clean Sheets                  16 non-null int64
Blocks                        16 non-null int64
Goals conceded                16 non-null int64
Saves made                    16 non-null int64
Saves-to-shots ratio          16 non-null object
Fouls Won                     16 non-null int64
Fouls Conceded                16 non-null int64
Offsides                      16 non-null int64
Yellow Cards                  16 non-null int64
Red Cards                     16 non-null int64
Subs on                       16 non-null int64
Subs off                      16 non-null int64
Players Used                  16 non-null int64
dtypes: float64(1), int64(29), object(5)
memory usage: 4.5+ KB
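
`info()` prints the column count as part of a larger summary. When only the number is needed as a value, `shape` or `len(df.columns)` returns it directly. A minimal sketch on a made-up three-column frame (`df` is an assumption for illustration):

```python
import pandas as pd

# Hypothetical frame with three columns and two rows.
df = pd.DataFrame({
    "Team": ["Spain", "Italy"],
    "Goals": [12, 6],
    "Red Cards": [0, 0],
})

# shape is a (rows, columns) tuple.
n_rows, n_cols = df.shape

# Equivalent way to count the columns.
n_cols_alt = len(df.columns)
print(n_rows, n_cols, n_cols_alt)
```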

Step 7. View only the columns Team, Yellow Cards and Red Cards and assign them to a dataframe called discipline

Code:

discipline = euro12[['Team', 'Yellow Cards', 'Red Cards']]
discipline

Output:

Team Yellow Cards Red Cards
0 Croatia 9 0
1 Czech Republic 7 0
2 Denmark 4 0
3 England 5 0
4 France 6 0
5 Germany 4 0
6 Greece 9 1
7 Italy 16 0
8 Netherlands 5 0
9 Poland 7 1
10 Portugal 12 0
11 Republic of Ireland 6 1
12 Russia 6 0
13 Spain 11 0
14 Sweden 7 0
15 Ukraine 5 0

Step 8. Sort the teams by Red Cards, then by Yellow Cards

Code:

# Sort the teams by red cards first, then by yellow cards (both descending)
discipline.sort_values(['Red Cards', 'Yellow Cards'], ascending=False)

Output:

Team Yellow Cards Red Cards
6 Greece 9 1
9 Poland 7 1
11 Republic of Ireland 6 1
7 Italy 16 0
10 Portugal 12 0
13 Spain 11 0
0 Croatia 9 0
1 Czech Republic 7 0
14 Sweden 7 0
4 France 6 0
12 Russia 6 0
3 England 5 0
8 Netherlands 5 0
15 Ukraine 5 0
2 Denmark 4 0
5 Germany 4 0
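
Passing a single `ascending=False` applies the same direction to both sort keys. If the keys should go in different directions, `ascending` also accepts one boolean per key. The sketch below uses four made-up rows (the `cards` frame is an assumption) to sort by most red cards first and, within ties, fewest yellow cards first:

```python
import pandas as pd

# Hypothetical subset; column names follow the Euro 2012 dataset.
cards = pd.DataFrame({
    "Team": ["Greece", "Italy", "Poland", "Denmark"],
    "Yellow Cards": [9, 16, 7, 4],
    "Red Cards": [1, 0, 1, 0],
})

# One boolean per sort key: Red Cards descending, Yellow Cards ascending.
ranked = cards.sort_values(["Red Cards", "Yellow Cards"], ascending=[False, True])
order = ranked["Team"].tolist()
print(order)
```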

Step 9. Calculate the mean Yellow Cards given per Team

Code:

# Mean number of yellow cards received per team, rounded to the nearest integer
round(discipline['Yellow Cards'].mean())

Output:

7

Step 10. Filter teams that scored more than 6 goals

Code:

# Keep the teams that scored more than 6 goals
euro12[euro12['Goals']>6]
# euro12[euro12.Goals>6]

Output:

Team Goals Shots on target Shots off target Shooting Accuracy % Goals-to-shots Total shots (inc. Blocked) Hit Woodwork Penalty goals Penalties not scored ... Saves made Saves-to-shots ratio Fouls Won Fouls Conceded Offsides Yellow Cards Red Cards Subs on Subs off Players Used
5 Germany 10 32 32 47.8% 15.6% 80 2 1 0 ... 10 62.6% 63 49 12 4 0 15 15 17
13 Spain 12 42 33 55.9% 16.0% 100 0 1 0 ... 15 93.8% 102 83 19 11 0 17 17 18

2 rows × 35 columns

Step 11. Select the teams that start with G

Code:

# Select the teams whose name starts with G
euro12[euro12.Team.str.startswith('G')]

Output:

Team Goals Shots on target Shots off target Shooting Accuracy % Goals-to-shots Total shots (inc. Blocked) Hit Woodwork Penalty goals Penalties not scored ... Saves made Saves-to-shots ratio Fouls Won Fouls Conceded Offsides Yellow Cards Red Cards Subs on Subs off Players Used
5 Germany 10 32 32 47.8% 15.6% 80 2 1 0 ... 10 62.6% 63 49 12 4 0 15 15 17
6 Greece 5 8 18 30.7% 19.2% 32 1 1 1 ... 13 65.1% 67 48 12 9 1 12 12 20

2 rows × 35 columns

Step 12. Select the first 7 columns

Code:

# Select the first seven columns
euro12.iloc[:, 0:7]

Output:

Team Goals Shots on target Shots off target Shooting Accuracy % Goals-to-shots Total shots (inc. Blocked)
0 Croatia 4 13 12 51.9% 16.0% 32
1 Czech Republic 4 13 18 41.9% 12.9% 39
2 Denmark 4 10 10 50.0% 20.0% 27
3 England 5 11 18 50.0% 17.2% 40
4 France 3 22 24 37.9% 6.5% 65
5 Germany 10 32 32 47.8% 15.6% 80
6 Greece 5 8 18 30.7% 19.2% 32
7 Italy 6 34 45 43.0% 7.5% 110
8 Netherlands 2 12 36 25.0% 4.1% 60
9 Poland 2 15 23 39.4% 5.2% 48
10 Portugal 6 22 42 34.3% 9.3% 82
11 Republic of Ireland 1 7 12 36.8% 5.2% 28
12 Russia 5 9 31 22.5% 12.5% 59
13 Spain 12 42 33 55.9% 16.0% 100
14 Sweden 5 17 19 47.2% 13.8% 39
15 Ukraine 2 7 26 21.2% 6.0% 38

Step 13. Select all columns except the last 3.

Code:

# Select all columns except the last three
euro12.iloc[:, 0:-3]

Output:

Team Goals Shots on target Shots off target Shooting Accuracy % Goals-to-shots Total shots (inc. Blocked) Hit Woodwork Penalty goals Penalties not scored ... Clean Sheets Blocks Goals conceded Saves made Saves-to-shots ratio Fouls Won Fouls Conceded Offsides Yellow Cards Red Cards
0 Croatia 4 13 12 51.9% 16.0% 32 0 0 0 ... 0 10 3 13 81.3% 41 62 2 9 0
1 Czech Republic 4 13 18 41.9% 12.9% 39 0 0 0 ... 1 10 6 9 60.1% 53 73 8 7 0
2 Denmark 4 10 10 50.0% 20.0% 27 1 0 0 ... 1 10 5 10 66.7% 25 38 8 4 0
3 England 5 11 18 50.0% 17.2% 40 0 0 0 ... 2 29 3 22 88.1% 43 45 6 5 0
4 France 3 22 24 37.9% 6.5% 65 1 0 0 ... 1 7 5 6 54.6% 36 51 5 6 0
5 Germany 10 32 32 47.8% 15.6% 80 2 1 0 ... 1 11 6 10 62.6% 63 49 12 4 0
6 Greece 5 8 18 30.7% 19.2% 32 1 1 1 ... 1 23 7 13 65.1% 67 48 12 9 1
7 Italy 6 34 45 43.0% 7.5% 110 2 0 0 ... 2 18 7 20 74.1% 101 89 16 16 0
8 Netherlands 2 12 36 25.0% 4.1% 60 2 0 0 ... 0 9 5 12 70.6% 35 30 3 5 0
9 Poland 2 15 23 39.4% 5.2% 48 0 0 0 ... 0 8 3 6 66.7% 48 56 3 7 1
10 Portugal 6 22 42 34.3% 9.3% 82 6 0 0 ... 2 11 4 10 71.5% 73 90 10 12 0
11 Republic of Ireland 1 7 12 36.8% 5.2% 28 0 0 0 ... 0 23 9 17 65.4% 43 51 11 6 1
12 Russia 5 9 31 22.5% 12.5% 59 2 0 0 ... 0 8 3 10 77.0% 34 43 4 6 0
13 Spain 12 42 33 55.9% 16.0% 100 0 1 0 ... 5 8 1 15 93.8% 102 83 19 11 0
14 Sweden 5 17 19 47.2% 13.8% 39 3 0 0 ... 1 12 5 8 61.6% 35 51 7 7 0
15 Ukraine 2 7 26 21.2% 6.0% 38 0 0 0 ... 0 4 4 13 76.5% 48 31 4 5 0

16 rows × 32 columns

Step 14. Present only the Shooting Accuracy from England, Italy and Russia

Code:

# Show only the Shooting Accuracy of England, Italy and Russia
euro12.loc[euro12.Team.isin(['England', 'Italy', 'Russia']), ['Team', 'Shooting Accuracy']]

Output:

       Team Shooting Accuracy
3   England             50.0%
7     Italy             43.0%
12   Russia             22.5%
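
Note that Shooting Accuracy is stored as text (the `info()` output in Step 6 lists it as `object`), so values such as "50.0%" cannot be compared numerically as they are. A minimal sketch of one way to convert them, on a made-up four-row frame (`teams` and the helper column `accuracy_pct` are assumptions for illustration):

```python
import pandas as pd

# Hypothetical subset; the percentage strings mimic the Euro 2012 column.
teams = pd.DataFrame({
    "Team": ["England", "Italy", "Russia", "Spain"],
    "Shooting Accuracy": ["50.0%", "43.0%", "22.5%", "55.9%"],
})

# Same selection pattern as in Step 14; copy() so the new column is added safely.
subset = teams.loc[
    teams["Team"].isin(["England", "Italy", "Russia"]),
    ["Team", "Shooting Accuracy"],
].copy()

# Drop the trailing percent sign and convert to float.
subset["accuracy_pct"] = subset["Shooting Accuracy"].str.rstrip("%").astype(float)

# With numeric values, comparisons such as "most accurate" become possible.
best = subset.loc[subset["accuracy_pct"].idxmax(), "Team"]
print(best)
```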

Fictional Army - Filtering and Sorting

Introduction:

This exercise was inspired by this page

Step 1. Import the necessary libraries

Code:

import pandas as pd

Step 2. This is the data given as a dictionary

Code:

# Create an example dataframe about a fictional army
raw_data = {'regiment': ['Nighthawks', 'Nighthawks', 'Nighthawks', 'Nighthawks', 'Dragoons', 'Dragoons', 'Dragoons', 'Dragoons', 'Scouts', 'Scouts', 'Scouts', 'Scouts'],
            'company': ['1st', '1st', '2nd', '2nd', '1st', '1st', '2nd', '2nd','1st', '1st', '2nd', '2nd'],
            'deaths': [523, 52, 25, 616, 43, 234, 523, 62, 62, 73, 37, 35],
            'battles': [5, 42, 2, 2, 4, 7, 8, 3, 4, 7, 8, 9],
            'size': [1045, 957, 1099, 1400, 1592, 1006, 987, 849, 973, 1005, 1099, 1523],
            'veterans': [1, 5, 62, 26, 73, 37, 949, 48, 48, 435, 63, 345],
            'readiness': [1, 2, 3, 3, 2, 1, 2, 3, 2, 1, 2, 3],
            'armored': [1, 0, 1, 1, 0, 1, 0, 1, 0, 0, 1, 1],
            'deserters': [4, 24, 31, 2, 3, 4, 24, 31, 2, 3, 2, 3],
            'origin': ['Arizona', 'California', 'Texas', 'Florida', 'Maine', 'Iowa', 'Alaska', 'Washington', 'Oregon', 'Wyoming', 'Louisana', 'Georgia']}

Step 3. Create a dataframe and assign it to a variable called army.

Don’t forget to include the column names in the order presented in the dictionary (‘regiment’, ‘company’, ‘deaths’…) so that the column order is consistent with the solutions. If omitted, older pandas versions (before 0.23) order the columns alphabetically, while newer versions keep the dictionary's insertion order.

Code:

army = pd.DataFrame(raw_data, columns = ['regiment', 'company', 'deaths', 'battles', 'size', 'veterans', 'readiness', 'armored', 'deserters', 'origin'])
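
The `columns` argument does more than guard against reordering: it also lets you pick and arrange the columns explicitly. Below is a small sketch with a cut-down, made-up dictionary (`data`, `df_default` and `df_custom` are illustrative names). It assumes a recent pandas on Python 3.7+, where a DataFrame built from a dict keeps the dict's insertion order by default:

```python
import pandas as pd

# Hypothetical, shortened version of the army dictionary.
data = {
    "regiment": ["Nighthawks", "Dragoons"],
    "deaths": [523, 43],
    "origin": ["Arizona", "Maine"],
}

# Without columns=, recent pandas keeps the dict's insertion order.
df_default = pd.DataFrame(data)

# With columns=, the listed order is used regardless of the dict order.
df_custom = pd.DataFrame(data, columns=["origin", "regiment", "deaths"])

print(list(df_default.columns))
print(list(df_custom.columns))
```

Spelling out `columns` keeps the result the same across pandas versions, which is why the exercise asks for it.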

Step 4. Set the ‘origin’ column as the index of the dataframe

Code:

army = army.set_index('origin')
army

Output:

regiment company deaths battles size veterans readiness armored deserters
origin
Arizona Nighthawks 1st 523 5 1045 1 1 1 4
California Nighthawks 1st 52 42 957 5 2 0 24
Texas Nighthawks 2nd 25 2 1099 62 3 1 31
Florida Nighthawks 2nd 616 2 1400 26 3 1 2
Maine Dragoons 1st 43 4 1592 73 2 0 3
Iowa Dragoons 1st 234 7 1006 37 1 1 4
Alaska Dragoons 2nd 523 8 987 949 2 0 24
Washington Dragoons 2nd 62 3 849 48 3 1 31
Oregon Scouts 1st 62 4 973 48 2 0 2
Wyoming Scouts 1st 73 7 1005 435 1 0 3
Louisana Scouts 2nd 37 8 1099 63 2 1 2
Georgia Scouts 2nd 35 9 1523 345 3 1 3

Step 5. Print only the column veterans

Code:

army.veterans
# army['veterans']

Output:

origin
Arizona         1
California      5
Texas          62
Florida        26
Maine          73
Iowa           37
Alaska        949
Washington     48
Oregon         48
Wyoming       435
Louisana       63
Georgia       345
Name: veterans, dtype: int64

Step 6. Print the columns ‘veterans’ and ‘deaths’

Code:

army[['veterans', 'deaths']]

Output:

veterans deaths
origin
Arizona 1 523
California 5 52
Texas 62 25
Florida 26 616
Maine 73 43
Iowa 37 234
Alaska 949 523
Washington 48 62
Oregon 48 62
Wyoming 435 73
Louisana 63 37
Georgia 345 35

Step 7. Print the name of all the columns.

Code:

army.columns

Output:

Index(['regiment', 'company', 'deaths', 'battles', 'size', 'veterans',
       'readiness', 'armored', 'deserters'],
      dtype='object')

Step 8. Select the ‘deaths’, ‘size’ and ‘deserters’ columns from Maine and Alaska

Code:

army.loc[['Maine', 'Alaska'], ['deaths', 'size', 'deserters']]

Output:

        deaths  size  deserters
origin
Maine       43  1592          3
Alaska     523   987         24

Step 9. Select the rows 3 to 7 and the columns 3 to 6

Code:

army.iloc[3:7, 3:6]

Output:

         battles  size  veterans
origin
Florida        2  1400        26
Maine          4  1592        73
Iowa           7  1006        37
Alaska         8   987       949
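
The slices in these steps are easier to read once you keep in mind that `iloc` slices are position-based and exclude the end position, while `loc` slices are label-based and include the end label. A minimal sketch on a made-up four-row frame (`df` is an assumption for illustration):

```python
import pandas as pd

# Hypothetical frame indexed by state name, like army after set_index('origin').
df = pd.DataFrame(
    {"deaths": [523, 52, 25, 616], "size": [1045, 957, 1099, 1400]},
    index=["Arizona", "California", "Texas", "Florida"],
)

# iloc: positions 1 and 2 only; position 3 is excluded.
by_position = df.iloc[1:3]

# loc: from 'California' through 'Florida', end label included.
by_label = df.loc["California":"Florida"]

print(by_position.index.tolist())
print(by_label.index.tolist())
```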

Step 10. Select every row after the fourth row

Code:

army.iloc[3:]

Output:

regiment company deaths battles size veterans readiness armored deserters
origin
Florida Nighthawks 2nd 616 2 1400 26 3 1 2
Maine Dragoons 1st 43 4 1592 73 2 0 3
Iowa Dragoons 1st 234 7 1006 37 1 1 4
Alaska Dragoons 2nd 523 8 987 949 2 0 24
Washington Dragoons 2nd 62 3 849 48 3 1 31
Oregon Scouts 1st 62 4 973 48 2 0 2
Wyoming Scouts 1st 73 7 1005 435 1 0 3
Louisana Scouts 2nd 37 8 1099 63 2 1 2
Georgia Scouts 2nd 35 9 1523 345 3 1 3

Step 11. Select every row up to the 4th row

Code:

# Select every row before the 4th row (that is, the first three rows)
army.iloc[:3]

Output:

regiment company deaths battles size veterans readiness armored deserters
origin
Arizona Nighthawks 1st 523 5 1045 1 1 1 4
California Nighthawks 1st 52 42 957 5 2 0 24
Texas Nighthawks 2nd 25 2 1099 62 3 1 31

Step 12. Select the 3rd column up to the 7th column

Code:

army.iloc[: , 4:7]

Output:

size veterans readiness
origin
Arizona 1045 1 1
California 957 5 2
Texas 1099 62 3
Florida 1400 26 3
Maine 1592 73 2
Iowa 1006 37 1
Alaska 987 949 2
Washington 849 48 3
Oregon 973 48 2
Wyoming 1005 435 1
Louisana 1099 63 2
Georgia 1523 345 3

Step 13. Select rows where df.deaths is greater than 50

Code:

army[army['deaths']>50]

Output:

regiment company deaths battles size veterans readiness armored deserters
origin
Arizona Nighthawks 1st 523 5 1045 1 1 1 4
California Nighthawks 1st 52 42 957 5 2 0 24
Florida Nighthawks 2nd 616 2 1400 26 3 1 2
Iowa Dragoons 1st 234 7 1006 37 1 1 4
Alaska Dragoons 2nd 523 8 987 949 2 0 24
Washington Dragoons 2nd 62 3 849 48 3 1 31
Oregon Scouts 1st 62 4 973 48 2 0 2
Wyoming Scouts 1st 73 7 1005 435 1 0 3

Step 14. Select rows where df.deaths is greater than 500 or less than 50

Code:

army[(army['deaths']>500) | (army['deaths']<50)]

Output:

regiment company deaths battles size veterans readiness armored deserters
origin
Arizona Nighthawks 1st 523 5 1045 1 1 1 4
Texas Nighthawks 2nd 25 2 1099 62 3 1 31
Florida Nighthawks 2nd 616 2 1400 26 3 1 2
Maine Dragoons 1st 43 4 1592 73 2 0 3
Alaska Dragoons 2nd 523 8 987 949 2 0 24
Louisana Scouts 2nd 37 8 1099 63 2 1 2
Georgia Scouts 2nd 35 9 1523 345 3 1 3

Step 15. Select all the regiments not named “Dragoons”

Code:

army[(army.regiment != 'Dragoons')]

Output:

regiment company deaths battles size veterans readiness armored deserters
origin
Arizona Nighthawks 1st 523 5 1045 1 1 1 4
California Nighthawks 1st 52 42 957 5 2 0 24
Texas Nighthawks 2nd 25 2 1099 62 3 1 31
Florida Nighthawks 2nd 616 2 1400 26 3 1 2
Oregon Scouts 1st 62 4 973 48 2 0 2
Wyoming Scouts 1st 73 7 1005 435 1 0 3
Louisana Scouts 2nd 37 8 1099 63 2 1 2
Georgia Scouts 2nd 35 9 1523 345 3 1 3

Step 16. Select the rows called Texas and Arizona

Code:

army.loc[['Texas', 'Arizona']]

Output:

regiment company deaths battles size veterans readiness armored deserters
origin
Texas Nighthawks 2nd 25 2 1099 62 3 1 31
Arizona Nighthawks 1st 523 5 1045 1 1 1 4

Step 17. Select the third cell in the row named Arizona

Code:

army.loc[['Arizona'], ['deaths']]
# OR    army.iloc[[0], army.columns.get_loc('deaths')]

Output:

         deaths
origin
Arizona     523

Step 18. Select the third cell down in the column named deaths

Code:

# army.loc['Texas', 'deaths']
# OR   army.deaths.iloc[2]   (army.deaths[2] treats 2 as a position, which recent pandas versions deprecate)
# OR
army.iloc[[2], army.columns.get_loc('deaths')]

Output:

origin
Texas    25
Name: deaths, dtype: int64
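
The output above is a one-element Series rather than a bare number, because the row was requested as a list (`[2]`). If a plain scalar is what you want, passing an integer instead of a list, or using `at`, should do it. A small sketch on a made-up three-row frame (`df` is an assumption for illustration):

```python
import pandas as pd

# Hypothetical frame indexed by state name, like the army dataframe.
df = pd.DataFrame(
    {"regiment": ["Nighthawks", "Nighthawks", "Nighthawks"], "deaths": [523, 52, 25]},
    index=["Arizona", "California", "Texas"],
)

col = df.columns.get_loc("deaths")   # integer position of the 'deaths' column

as_series = df.iloc[[2], col]        # list for the row -> one-element Series
as_scalar = df.iloc[2, col]          # integer for the row -> scalar
by_label = df.at["Texas", "deaths"]  # label-based scalar access

print(type(as_series).__name__, as_scalar, by_label)
```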

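Summary

  1. TSV and CSV files
    TSV: tab-separated values.
    CSV: comma-separated values (the more common of the two).

  2. Differences between TSV and CSV:
    1) As the names suggest, TSV uses the tab character ('\t') as the field separator, while CSV uses the ASCII comma (',').
    2) The standard TSV format registered with IANA does not allow tab characters inside field values.

  3. Usage of the read_csv() and read_table() functions: both parse delimited text; read_csv() defaults to sep=',' while read_table() defaults to sep='\t'.

  4. These exercises use iloc and loc heavily; if you have forgotten how they work, see the earlier post in this series, "pandas真入门(2)".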
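
To make points 1 to 3 concrete, here is a minimal sketch that parses the same two made-up records from a TSV string and from a CSV string. The sample text and variable names are assumptions for illustration; `io.StringIO` stands in for a file on disk:

```python
import io

import pandas as pd

# Hypothetical data: the second item name contains a comma.
tsv_text = "item_name\titem_price\nIzze\t3.39\nChips, large\t2.15\n"
# In CSV, a field containing a comma must be quoted.
csv_text = 'item_name,item_price\nIzze,3.39\n"Chips, large",2.15\n'

# read_csv with an explicit tab separator reads TSV.
from_tsv = pd.read_csv(io.StringIO(tsv_text), sep="\t")

# read_table uses the tab separator by default.
from_table = pd.read_table(io.StringIO(tsv_text))

# read_csv with its default comma separator reads CSV.
from_csv = pd.read_csv(io.StringIO(csv_text))

print(from_tsv.equals(from_table), from_tsv.equals(from_csv))
```

This is also why the chipotle.tsv address from Exercise 1 would need `sep='\t'`, while the local chipotle.csv copy used above is read with `sep=','`.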

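Closing remarks

That's all for today's exercises. Data analysis takes a lot of practice; this is the foundation, and things only get harder later once business logic is involved, so make sure the basics are solid. My follower count also jumped quite a bit today, which caught me off guard, but thank you all for following; let's keep learning and improving together. Haha! I hope everyone makes good use of the summer break and keeps studying!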
