Pandas基础1.1|Python学习笔记

【练习一】 现有一份关于美剧《权力的游戏》剧本的数据集,请解决以下问题:
(a)在所有的数据中,一共出现了多少人物?

import pandas as pd
import numpy as np
df = pd.read_csv('C:/Users/PuLinYue/Desktop/joyful-pandas/data/Game_of_Thrones_Script.csv')
df.head()
Release Date Season Episode Episode Title Name Sentence
0 2011/4/17 Season 1 Episode 1 Winter is Coming waymar royce What do you expect? They're savages. One lot s...
1 2011/4/17 Season 1 Episode 1 Winter is Coming will I've never seen wildlings do a thing like this...
2 2011/4/17 Season 1 Episode 1 Winter is Coming waymar royce How close did you get?
3 2011/4/17 Season 1 Episode 1 Winter is Coming will Close as any man would.
4 2011/4/17 Season 1 Episode 1 Winter is Coming gared We should head back to the wall.
df.describe() 
Release Date Season Episode Episode Title Name Sentence
count 23911 23911 23911 23911 23911 23911
unique 73 8 10 73 564 22300
top 2017/8/13 Season 2 Episode 5 Eastwatch tyrion lannister No.
freq 505 3914 3083 505 1760 103
df['Name'].nunique() #显示Name有多少个唯一值
564

(b)以单元格计数(即简单把一个单元格视作一句),谁说了最多的话?

df['Name'].value_counts()
tyrion lannister          1760
jon snow                  1133
daenerys targaryen        1048
cersei lannister          1005
jaime lannister            945
                          ... 
janos slunt                  1
steward of house stark       1
archmaester                  1
night watch stable boy       1
bryndel                      1
Name: Name, Length: 564, dtype: int64
df['Name'].value_counts().index[0]
'tyrion lannister'

(c)以单词计数,谁说了最多的单词?

#apply(lambda x:len(x.split())) apply函数看每个句子里有多少个单词
df_words = df.assign(Words=df['Sentence'].apply(lambda x:len(x.split()))).sort_values(by='Name')
df_words.head()
Release Date Season Episode Episode Title Name Sentence Words
276 2011/4/17 Season 1 Episode 1 Winter is Coming a voice It's Maester Luwin, my lord. 5
3012 2011/6/19 Season 1 Episode 10 Fire and Blood addam marbrand ls it true about Stannis and Renly? 7
3017 2011/6/19 Season 1 Episode 10 Fire and Blood addam marbrand Kevan Lannister 2
13610 2014/6/8 Season 4 Episode 9 The Watchers on the Wall aemon And what is it that couldn't wait until mornin... 10
13614 2014/6/8 Season 4 Episode 9 The Watchers on the Wall aemon Oh, no need. I know my way around this library... 48
df.assign(Words=df['Sentence'].apply(lambda x:len(x.split())))
Release Date Season Episode Episode Title Name Sentence Words
0 2011/4/17 Season 1 Episode 1 Winter is Coming waymar royce What do you expect? They're savages. One lot s... 25
1 2011/4/17 Season 1 Episode 1 Winter is Coming will I've never seen wildlings do a thing like this... 21
2 2011/4/17 Season 1 Episode 1 Winter is Coming waymar royce How close did you get? 5
3 2011/4/17 Season 1 Episode 1 Winter is Coming will Close as any man would. 5
4 2011/4/17 Season 1 Episode 1 Winter is Coming gared We should head back to the wall. 7
... ... ... ... ... ... ... ...
23906 2019/5/19 Season 8 Episode 6 The Iron Throne brienne I think we can all agree that ships take prece... 12
23907 2019/5/19 Season 8 Episode 6 The Iron Throne bronn I think that's a very presumptuous statement. 7
23908 2019/5/19 Season 8 Episode 6 The Iron Throne tyrion lannister I once brought a jackass and a honeycomb into ... 11
23909 2019/5/19 Season 8 Episode 6 The Iron Throne man The Queen in the North! 5
23910 2019/5/19 Season 8 Episode 6 The Iron Throne all The Queen in the North! The Queen in the North... 25

23911 rows × 7 columns

#基本思路:先对人进行排序
#以will为例。向下统计,若遇到will则将对应的words数加上去,若不是则跳过。
#L_count[-1] 第一个元素特殊处理
L_count = []
N_words = list(zip(df_words['Name'],df_words['Words']))
for i in N_words:
    if i == N_words[0]:
        L_count.append(i[1])
        last = i[0]
    else:
        L_count.append(L_count[-1]+i[1] if i[0]==last else i[1])
        last = i[0]
df_words['Count']=L_count
df_words['Name'][df_words['Count'].idxmax()]
'tyrion lannister'

【练习二】现有一份关于科比的投篮数据集,请解决如下问题:
(a)哪种action_type和combined_shot_type的组合是最多的?

df_1 = pd.read_csv('C:/Users/PuLinYue/Desktop/joyful-pandas/data/Kobe_data.csv',index_col='shot_id')
df_1.head()
action_type combined_shot_type game_event_id game_id lat loc_x loc_y lon minutes_remaining period ... shot_made_flag shot_type shot_zone_area shot_zone_basic shot_zone_range team_id team_name game_date matchup opponent
shot_id
1 Jump Shot Jump Shot 10 20000012 33.9723 167 72 -118.1028 10 1 ... NaN 2PT Field Goal Right Side(R) Mid-Range 16-24 ft. 1610612747 Los Angeles Lakers 2000/10/31 LAL @ POR POR
2 Jump Shot Jump Shot 12 20000012 34.0443 -157 0 -118.4268 10 1 ... 0.0 2PT Field Goal Left Side(L) Mid-Range 8-16 ft. 1610612747 Los Angeles Lakers 2000/10/31 LAL @ POR POR
3 Jump Shot Jump Shot 35 20000012 33.9093 -101 135 -118.3708 7 1 ... 1.0 2PT Field Goal Left Side Center(LC) Mid-Range 16-24 ft. 1610612747 Los Angeles Lakers 2000/10/31 LAL @ POR POR
4 Jump Shot Jump Shot 43 20000012 33.8693 138 175 -118.1318 6 1 ... 0.0 2PT Field Goal Right Side Center(RC) Mid-Range 16-24 ft. 1610612747 Los Angeles Lakers 2000/10/31 LAL @ POR POR
5 Driving Dunk Shot Dunk 155 20000012 34.0443 0 0 -118.2698 6 2 ... 1.0 2PT Field Goal Center(C) Restricted Area Less Than 8 ft. 1610612747 Los Angeles Lakers 2000/10/31 LAL @ POR POR

5 rows × 24 columns

gamegroup = list(zip(df_1['action_type'],df_1['combined_shot_type']))
gamegroup
[('Jump Shot', 'Jump Shot'),
 ('Jump Shot', 'Jump Shot'),
...]
pd.Series(gamegroup).value_counts().index[0]
('Jump Shot', 'Jump Shot')

(b)在所有被记录的game_id中,遭遇到最多的opponent是一个支?

df_1.iloc[:,[3,-1]]#取最后三列
game_id opponent
shot_id
1 20000012 POR
2 20000012 POR
3 20000012 POR
4 20000012 POR
5 20000012 POR
... ... ...
30693 49900088 IND
30694 49900088 IND
30695 49900088 IND
30696 49900088 IND
30697 49900088 IND

30697 rows × 2 columns

gamegroup_1 = list(zip(df_1['game_id'],df_1['opponent']))
pd.Series(list(list(zip(*(pd.Series(gamegroup_1).unique()).tolist()))[1])).value_counts().index[0]
'SAS'
zip(df_1['game_id'],df_1['opponent'])
list(zip(df_1['game_id'],df_1['opponent']))
[(20000012, 'POR'),
 (20000012, 'POR'),
 ...]
#unique函数:去重。不可用在list上,需要用在pd.series上。
a = pd.Series(list(zip(df_1['game_id'],df_1['opponent']))).unique()
a
array([(20000012, 'POR'), (20000019, 'UTA'), (20000047, 'VAN'), ...,
       (49900086, 'IND'), (49900087, 'IND'), (49900088, 'IND')],
      dtype=object)
a = pd.Series(list(zip(df_1['game_id'],df_1['opponent']))).unique().tolist()
a #无重复元素
[(20000012, 'POR'),
 (20000019, 'UTA'),
 (20000047, 'VAN'),
 (20000049, 'LAC'),
 (20000058, 'HOU'),
 ...]
#解包——所有的第一个元素拿出来,第二元素拿出来。此时没有重复元素。
#取出两个包:第一个元素的集合、第二个元素的集合。
n = list(zip(*a))
n
pd.Series(n) #变成序列
0    (20000012, 20000019, 20000047, 20000049, 20000...
1    (POR, UTA, VAN, LAC, HOU, SAS, HOU, DEN, SAC, ...
dtype: object

你可能感兴趣的:(Python学习笔记)