【练习一】 现有一份关于美剧《权力的游戏》剧本的数据集,请解决以下问题:
(a)在所有的数据中,一共出现了多少人物?
import pandas as pd
import numpy as np
df = pd.read_csv('C:/Users/PuLinYue/Desktop/joyful-pandas/data/Game_of_Thrones_Script.csv')
df.head()
Release Date | Season | Episode | Episode Title | Name | Sentence | |
---|---|---|---|---|---|---|
0 | 2011/4/17 | Season 1 | Episode 1 | Winter is Coming | waymar royce | What do you expect? They're savages. One lot s... |
1 | 2011/4/17 | Season 1 | Episode 1 | Winter is Coming | will | I've never seen wildlings do a thing like this... |
2 | 2011/4/17 | Season 1 | Episode 1 | Winter is Coming | waymar royce | How close did you get? |
3 | 2011/4/17 | Season 1 | Episode 1 | Winter is Coming | will | Close as any man would. |
4 | 2011/4/17 | Season 1 | Episode 1 | Winter is Coming | gared | We should head back to the wall. |
df.describe()
Release Date | Season | Episode | Episode Title | Name | Sentence | |
---|---|---|---|---|---|---|
count | 23911 | 23911 | 23911 | 23911 | 23911 | 23911 |
unique | 73 | 8 | 10 | 73 | 564 | 22300 |
top | 2017/8/13 | Season 2 | Episode 5 | Eastwatch | tyrion lannister | No. |
freq | 505 | 3914 | 3083 | 505 | 1760 | 103 |
df['Name'].nunique() #显示Name有多少个唯一值
564
(b)以单元格计数(即简单把一个单元格视作一句),谁说了最多的话?
df['Name'].value_counts()
tyrion lannister 1760
jon snow 1133
daenerys targaryen 1048
cersei lannister 1005
jaime lannister 945
...
janos slunt 1
steward of house stark 1
archmaester 1
night watch stable boy 1
bryndel 1
Name: Name, Length: 564, dtype: int64
df['Name'].value_counts().index[0]
'tyrion lannister'
(c)以单词计数,谁说了最多的单词?
#apply(lambda x:len(x.split())) apply函数看每个句子里有多少个单词
df_words = df.assign(Words=df['Sentence'].apply(lambda x:len(x.split()))).sort_values(by='Name')
df_words.head()
Release Date | Season | Episode | Episode Title | Name | Sentence | Words | |
---|---|---|---|---|---|---|---|
276 | 2011/4/17 | Season 1 | Episode 1 | Winter is Coming | a voice | It's Maester Luwin, my lord. | 5 |
3012 | 2011/6/19 | Season 1 | Episode 10 | Fire and Blood | addam marbrand | ls it true about Stannis and Renly? | 7 |
3017 | 2011/6/19 | Season 1 | Episode 10 | Fire and Blood | addam marbrand | Kevan Lannister | 2 |
13610 | 2014/6/8 | Season 4 | Episode 9 | The Watchers on the Wall | aemon | And what is it that couldn't wait until mornin... | 10 |
13614 | 2014/6/8 | Season 4 | Episode 9 | The Watchers on the Wall | aemon | Oh, no need. I know my way around this library... | 48 |
df.assign(Words=df['Sentence'].apply(lambda x:len(x.split())))
Release Date | Season | Episode | Episode Title | Name | Sentence | Words | |
---|---|---|---|---|---|---|---|
0 | 2011/4/17 | Season 1 | Episode 1 | Winter is Coming | waymar royce | What do you expect? They're savages. One lot s... | 25 |
1 | 2011/4/17 | Season 1 | Episode 1 | Winter is Coming | will | I've never seen wildlings do a thing like this... | 21 |
2 | 2011/4/17 | Season 1 | Episode 1 | Winter is Coming | waymar royce | How close did you get? | 5 |
3 | 2011/4/17 | Season 1 | Episode 1 | Winter is Coming | will | Close as any man would. | 5 |
4 | 2011/4/17 | Season 1 | Episode 1 | Winter is Coming | gared | We should head back to the wall. | 7 |
... | ... | ... | ... | ... | ... | ... | ... |
23906 | 2019/5/19 | Season 8 | Episode 6 | The Iron Throne | brienne | I think we can all agree that ships take prece... | 12 |
23907 | 2019/5/19 | Season 8 | Episode 6 | The Iron Throne | bronn | I think that's a very presumptuous statement. | 7 |
23908 | 2019/5/19 | Season 8 | Episode 6 | The Iron Throne | tyrion lannister | I once brought a jackass and a honeycomb into ... | 11 |
23909 | 2019/5/19 | Season 8 | Episode 6 | The Iron Throne | man | The Queen in the North! | 5 |
23910 | 2019/5/19 | Season 8 | Episode 6 | The Iron Throne | all | The Queen in the North! The Queen in the North... | 25 |
23911 rows × 7 columns
#基本思路:先对人进行排序
#以will为例。向下统计,若遇到will则将对应的words数加上去,若不是则跳过。
#L_count[-1] 第一个元素特殊处理
L_count = []
N_words = list(zip(df_words['Name'],df_words['Words']))
for i in N_words:
if i == N_words[0]:
L_count.append(i[1])
last = i[0]
else:
L_count.append(L_count[-1]+i[1] if i[0]==last else i[1])
last = i[0]
df_words['Count']=L_count
df_words['Name'][df_words['Count'].idxmax()]
'tyrion lannister'
【练习二】现有一份关于科比的投篮数据集,请解决如下问题:
(a)哪种action_type和combined_shot_type的组合是最多的?
df_1 = pd.read_csv('C:/Users/PuLinYue/Desktop/joyful-pandas/data/Kobe_data.csv',index_col='shot_id')
df_1.head()
action_type | combined_shot_type | game_event_id | game_id | lat | loc_x | loc_y | lon | minutes_remaining | period | ... | shot_made_flag | shot_type | shot_zone_area | shot_zone_basic | shot_zone_range | team_id | team_name | game_date | matchup | opponent | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
shot_id | |||||||||||||||||||||
1 | Jump Shot | Jump Shot | 10 | 20000012 | 33.9723 | 167 | 72 | -118.1028 | 10 | 1 | ... | NaN | 2PT Field Goal | Right Side(R) | Mid-Range | 16-24 ft. | 1610612747 | Los Angeles Lakers | 2000/10/31 | LAL @ POR | POR |
2 | Jump Shot | Jump Shot | 12 | 20000012 | 34.0443 | -157 | 0 | -118.4268 | 10 | 1 | ... | 0.0 | 2PT Field Goal | Left Side(L) | Mid-Range | 8-16 ft. | 1610612747 | Los Angeles Lakers | 2000/10/31 | LAL @ POR | POR |
3 | Jump Shot | Jump Shot | 35 | 20000012 | 33.9093 | -101 | 135 | -118.3708 | 7 | 1 | ... | 1.0 | 2PT Field Goal | Left Side Center(LC) | Mid-Range | 16-24 ft. | 1610612747 | Los Angeles Lakers | 2000/10/31 | LAL @ POR | POR |
4 | Jump Shot | Jump Shot | 43 | 20000012 | 33.8693 | 138 | 175 | -118.1318 | 6 | 1 | ... | 0.0 | 2PT Field Goal | Right Side Center(RC) | Mid-Range | 16-24 ft. | 1610612747 | Los Angeles Lakers | 2000/10/31 | LAL @ POR | POR |
5 | Driving Dunk Shot | Dunk | 155 | 20000012 | 34.0443 | 0 | 0 | -118.2698 | 6 | 2 | ... | 1.0 | 2PT Field Goal | Center(C) | Restricted Area | Less Than 8 ft. | 1610612747 | Los Angeles Lakers | 2000/10/31 | LAL @ POR | POR |
5 rows × 24 columns
gamegroup = list(zip(df_1['action_type'],df_1['combined_shot_type']))
gamegroup
[('Jump Shot', 'Jump Shot'),
('Jump Shot', 'Jump Shot'),
...]
pd.Series(gamegroup).value_counts().index[0]
('Jump Shot', 'Jump Shot')
(b)在所有被记录的game_id中,遭遇到最多的opponent是一个支?
df_1.iloc[:,[3,-1]]#取最后三列
game_id | opponent | |
---|---|---|
shot_id | ||
1 | 20000012 | POR |
2 | 20000012 | POR |
3 | 20000012 | POR |
4 | 20000012 | POR |
5 | 20000012 | POR |
... | ... | ... |
30693 | 49900088 | IND |
30694 | 49900088 | IND |
30695 | 49900088 | IND |
30696 | 49900088 | IND |
30697 | 49900088 | IND |
30697 rows × 2 columns
gamegroup_1 = list(zip(df_1['game_id'],df_1['opponent']))
pd.Series(list(list(zip(*(pd.Series(gamegroup_1).unique()).tolist()))[1])).value_counts().index[0]
'SAS'
zip(df_1['game_id'],df_1['opponent'])
list(zip(df_1['game_id'],df_1['opponent']))
[(20000012, 'POR'),
(20000012, 'POR'),
...]
#unique函数:去重。不可用在list上,需要用在pd.series上。
a = pd.Series(list(zip(df_1['game_id'],df_1['opponent']))).unique()
a
array([(20000012, 'POR'), (20000019, 'UTA'), (20000047, 'VAN'), ...,
(49900086, 'IND'), (49900087, 'IND'), (49900088, 'IND')],
dtype=object)
a = pd.Series(list(zip(df_1['game_id'],df_1['opponent']))).unique().tolist()
a #无重复元素
[(20000012, 'POR'),
(20000019, 'UTA'),
(20000047, 'VAN'),
(20000049, 'LAC'),
(20000058, 'HOU'),
...]
#解包——所有的第一个元素拿出来,第二元素拿出来。此时没有重复元素。
#取出两个包:第一个元素的集合、第二个元素的集合。
n = list(zip(*a))
n
pd.Series(n) #变成序列
0 (20000012, 20000019, 20000047, 20000049, 20000...
1 (POR, UTA, VAN, LAC, HOU, SAS, HOU, DEN, SAC, ...
dtype: object