唐宇迪《python数据分析与机器学习实战》学习笔记
32探索性数据分析-足球赛事数据集
原始数据:链接,提取码:yypl
数据包含球员和裁判的信息,2012-2013年的比赛数据,总共设计球员2053名,裁判3147名,特征列表如下:
1.1 数据及模块导入
#from __future__ import absolute_import, division, print_function
import matplotlib as mpl
from matplotlib import pyplot as plt
from matplotlib.pyplot import GridSpec
import seaborn as sns
import numpy as np
import pandas as pd
import os, sys
from tqdm import tqdm
import warnings
warnings.filterwarnings('ignore')
sns.set_context("poster", font_scale=1.3)
#import missingno as msno
#import pandas_profiling
from sklearn.datasets import make_blobs
import time
df = pd.read_csv("redcard.csv.gz", compression='gzip')
print(df.shape)
df.head()
(146028, 28)
1.2 简单的统计:(count统计非空值个数)
df.describe().T
1.3查看数据类型:
(机器学习建模时只认识‘float’和‘int’型,其他类型需要映射转换一下,这里做探索分析就不用了)
df.dtypes
playerShort object
player object
club object
leagueCountry object
birthday object
height float64
weight float64
position object
games int64
victories int64
ties int64
defeats int64
goals int64
yellowCards int64
yellowReds int64
redCards int64
photoID object
rater1 float64
rater2 float64
refNum int64
refCountry int64
Alpha_3 object
meanIAT float64
nIAT float64
seIAT float64
meanExp float64
nExp float64
seExp float64
dtype: object
1.4 查看并提取列名
all_columns = df.columns.tolist()
all_columns
[‘playerShort’,
‘player’,
‘club’,
‘leagueCountry’,
‘birthday’,
‘height’,
‘weight’,
‘position’,
‘games’,
‘victories’,
‘ties’,
‘defeats’,
‘goals’,
‘yellowCards’,
‘yellowReds’,
‘redCards’,
‘photoID’,
‘rater1’,
‘rater2’,
‘refNum’,
‘refCountry’,
‘Alpha_3’,
‘meanIAT’,
‘nIAT’,
‘seIAT’,
‘meanExp’,
‘nExp’,
‘seExp’]
思考问题,加入一个运动员出现多次,计算时相当于其权重加强,所以可以用groupby解决这个问题:
print(df['height'].mean())
print(np.mean(df.groupby('playerShort').height.mean()))
181.93593798236887
181.74372848007872
数据通常具有多特征高纬度,分析时统计指标不同,因此可以将其分为几个小的数据集单项分析。例如:单看球员、裁判,看球员-裁判关系,单看国家…
2.1.1数据切分
#切分出球员数据,及其相关特征数据
player_index = 'playerShort' #球员ID
player_cols = [#'player', #球员名字丢弃,因为有对应ID了
'birthday',
'height',
'weight',
'position',
'photoID',
'rater1',
'rater2',
]
2.1.2检测及去重
#数据检测,避免复制错误、重名等影响
all_cols_unique_players = df.groupby('playerShort').agg({col:'nunique' for col in player_cols})
all_cols_unique_players.head() #为1就是出现一次,为2就是重复了,这里数据干净
all_cols_unique_players[all_cols_unique_players > 1].dropna().head()#去重
def get_subgroup(dataframe, g_index, g_columns):
"""Helper function that creates a sub-table from the columns and runs a quick uniqueness test."""
g = dataframe.groupby(g_index).agg({col:'nunique' for col in g_columns})
if g[g > 1].dropna().shape[0] != 0:
print("Warning: you probably assumed this had all unique values but it doesn't.")
return dataframe.groupby(g_index).agg({col:'max' for col in g_columns})
函数调用
players = get_subgroup(df, player_index, player_cols)
players.head()
def save_subgroup(dataframe, g_index, subgroup_name, prefix='raw_'):
save_subgroup_filename = "".join([prefix, subgroup_name, ".csv.gz"])
dataframe.to_csv(save_subgroup_filename, compression='gzip', encoding='UTF-8')
test_df = pd.read_csv(save_subgroup_filename, compression='gzip', index_col=g_index, encoding='UTF-8')
# Test that we recover what we send in
if dataframe.equals(test_df):
print("Test-passed: we recover the equivalent subgroup dataframe.")
else:
print("Warning -- equivalence test!!! Double-check.")
save_subgroup(players, player_index, "players")
Test-passed: we recover the equivalent subgroup dataframe. 储存成功
根据上面一套操作思路还可以切割其他数据:
club_index = 'club'
club_cols = ['leagueCountry']
clubs = get_subgroup(df, club_index, club_cols)#检测去重
print(clubs.head())
print('-------------')
print(clubs['leagueCountry'].value_counts())#查看所属国家
save_subgroup(clubs, club_index, "clubs")#保存数据
club leagueCountry
England 48
Spain 27
France 22
Germany 21
Name: leagueCountry, dtype: int64
Test-passed: we recover the equivalent subgroup dataframe.
referee_index = 'refNum'
referee_cols = ['refCountry']
referees = get_subgroup(df, referee_index, referee_cols)#检测去重
referees.head()
print(referees.refCountry.nunique())#查看裁判国家数
print(referees.shape)#查看形状
save_subgroup(referees, referee_index, "referees")#保存数据
refNum refCountry
1 1
2 2
3 3
4 4
5 5
161
(3147, 1)
Test-passed: we recover the equivalent subgroup dataframe.
country_index = 'refCountry'
country_cols = ['Alpha_3', 'meanIAT', 'nIAT', 'seIAT', 'meanExp', 'nExp', 'seExp']
countries = get_subgroup(df, country_index, country_cols)
countries = countries.rename(columns = {'Alpha_3' : 'countryName'})
print(countries.shape)
save_subgroup(countries, country_index, 'countries')
(161, 7)
dyad_index = ['refNum', 'playerShort']
dyad_cols = ['games',
'victories',
'ties',
'defeats',
'goals',
'yellowCards',
'yellowReds',
'redCards',
]
dyads = get_subgroup(df, g_index=dyad_index, g_columns=dyad_cols)
dyads.head()
print(dyads.shape)
dyads[dyads.redCards > 1].head(10)#查看超过1次红牌的球员
save_subgroup(dyads, dyad_index, "dyads")#保存
Test-passed: we recover the equivalent subgroup dataframe.
3.1数据读取
players = pd.read_csv("raw_players.csv.gz", compression='gzip',index_col=0)
players.shape
players.head()
这里用missingno直观的展示出缺失值位置:
详细用法点击链接
安装,cmd中:
pip install missingno
3.2缺失值分布图( missingno.matrix)
import missingno as msno
msno.matrix(players, figsize = (16, 6))
msno.bar(players.sample(500))#取样500个
3.4缺失值热度图( missingno.heatmap)
直观表达缺失值间的相关性
msno.heatmap(players.sample(500),
figsize=(16, 7),)
photoID与rate1、rate2相关性极强。
3.5缺失值数量查询
print("All players:", len(players))
print("rater1 nulls:", len(players[(players.rater1.isnull())]))
print("rater2 nulls:", len(players[players.rater2.isnull()]))
print("Both nulls:", len(players[(players.rater1.isnull()) & (players.rater2.isnull())]))
All players: 2053
rater1 nulls: 468
rater2 nulls: 468
Both nulls: 468
3.6缺失值处理
缺失值多的时候直接移除
其他方式查询,统计不为缺失的值,再减去
players = players[players.rater1.notnull()]
print(players.shape[0])
print(len(players)-players.shape[0])
1585
468
msno.matrix(players.sample(500),#上面只留不缺失的值
figsize=(16, 7),
width_ratios=(15, 1))#这里rate1没有缺失的了
4.1 .1评分相关性展示
这里用crosstab将rate1和rate2按类别分组,统计各个分组中值的频数:
pd.crosstab(players.rater1, players.rater2)
fig, ax = plt.subplots(figsize=(12, 8)) #sns画图命令更简洁
sns.heatmap(pd.crosstab(players.rater1, players.rater2), cmap='Blues', annot=True, fmt='d', ax=ax)
ax.set_title("Correlation between Rater 1 and Rater 2\n")
fig.tight_layout()
4.1.2 评分取均值
到底用rate1还是rate2?不如干脆取均值
players['skintone'] = players[['rater1', 'rater2']].mean(axis=1)
players.head()
sns.set_style("dark")#风格设置
sns.distplot(players.skintone, kde=False);
MIDSIZE = (12, 8)
fig, ax = plt.subplots(figsize=MIDSIZE)
#位置值统计
players.position.value_counts(dropna=False, ascending=True).plot(kind='barh', ax=ax)
ax.set_ylabel("Position")
ax.set_xlabel("Counts")
fig.tight_layout()
这里想探索位置与被罚的关系,但是如图位置太多了,有左前锋、右前锋,都是前锋,可以合并
提取位置
position_types = players.position.unique()
position_types
array([‘Center Back’, ‘Attacking Midfielder’, ‘Right Midfielder’, ‘Center Midfielder’, ‘Goalkeeper’, ‘Defensive Midfielder’,‘Left Fullback’, nan, ‘Left Midfielder’, ‘Right Fullback’,‘Center Forward’, ‘Left Winger’, ‘Right Winger’], dtype=object)
合并为4类
defense = ['Center Back','Defensive Midfielder', 'Left Fullback', 'Right Fullback', ]
midfield = ['Right Midfielder', 'Center Midfielder', 'Left Midfielder',]
forward = ['Attacking Midfielder', 'Left Winger', 'Right Winger', 'Center Forward']
keeper = 'Goalkeeper'
#df.isin(['b','c']) 判断后者是否在前者中
players.loc[players['position'].isin(defense), 'position_agg'] = "Defense"
players.loc[players['position'].isin(midfield), 'position_agg'] = "Midfield"
players.loc[players['position'].isin(forward), 'position_agg'] = "Forward"
players.loc[players['position'].eq(keeper), 'position_agg'] = "Keeper"
展示合并结果
MIDSIZE = (12, 8)
fig, ax = plt.subplots(figsize=MIDSIZE)
players['position_agg'].value_counts(dropna=False, ascending=True).plot(kind='barh', ax=ax)
ax.set_ylabel("position_agg")
ax.set_xlabel("Counts")
fig.tight_layout()
from pandas.plotting import scatter_matrix
fig, ax = plt.subplots(figsize=(10, 10))
scatter_matrix(players[['height', 'weight', 'skintone']], alpha=0.2, diagonal='hist', ax=ax);
如图展示,可以观察发现(体重、身高)和肤色间基本无相关性。身高体重间有正相关性。
fig, ax = plt.subplots(figsize=MIDSIZE)
sns.regplot('weight', 'height', data=players, ax=ax,
scatter_kws={"color":"darkred",'s':15,'alpha':1}) #颜色、大小、透明度设置
ax.set_ylabel("Height [cm]")
ax.set_xlabel("Weight [kg]")
fig.tight_layout()
都是离散值,具体连续值在这里分析时无太大用,但是可以按照等级划分。
#身高划分
height_categories = ["vlow_height",
"low_height",
"mid_height",
"high_height",
"vhigh_height",
]
players['heightclass'] = pd.qcut(players['height'],
len(height_categories),
height_categories)
#pd.qcut(切分数据,份数,类别名称)
#体重划分
weight_categories = ["vlow_weight",
"low_weight",
"mid_weight",
"high_weight",
"vhigh_weight",
]
players['weightclass'] = pd.qcut(players['weight'],
len(weight_categories),
weight_categories)
players.head()
print (players['skintone'])
pd.qcut(players['skintone'], 3)
players['skintoneclass'] = pd.qcut(players['skintone'], 3)#切为3份
players
0 0.125
1 0.125
2 0.250
3 0.000
4 0.250
…
2047 0.250
2048 0.000
2050 0.875
2051 0.000
2052 0.000
Name: skintone, Length: 1585, dtype: float64
安装:
pip install pandas_profiling
Github示例:https://github.com/pandas-profiling/pandas-profiling
生成分析表:
import pandas_profiling
pandas_profiling.ProfileReport(players)
通过pandas_profiling生成一大堆可视化分析,这里截图部分展示:
面展示的球员生日并无大用,但是可以通过其知道年龄。
players.birthday.head()
playerShort
aaron-hughes 08.11.1979
aaron-hunt 04.09.1986
aaron-lennon 16.04.1987
aaron-ramsey 26.12.1990
abdelhamid-el-kaoutari 17.03.1990
Name: birthday, dtype: object
players['birth_date'] = pd.to_datetime(players.birthday, format='%d.%m.%Y')
players['age_years'] = ((pd.to_datetime("2013-01-01") - players['birth_date']).dt.days)/365.25 #闰年多的一天均分
players['age_years']
playerShort
aaron-hughes 33.149897
aaron-hunt 26.327173
aaron-lennon 25.713895
aaron-ramsey 22.017796
abdelhamid-el-kaoutari 22.795346
…
zoltan-gera 33.697467
zoltan-stieber 24.210815
zoumana-camara 33.749487
zubikarai 28.848734
zurutuza 26.455852
Name: age_years, Length: 1585, dtype: float64
players.head()
players_cleaned_variables = [#'birthday',
'height',
'weight',
# 'position',
# 'photoID',
# 'rater1',
# 'rater2',
'skintone',
'position_agg',
'weightclass',
'heightclass',
'skintoneclass',
# 'birth_date',
'age_years']
pandas_profiling.ProfileReport(players[players_cleaned_variables])
结果内容和五类似,这里选择指定特征展示,就不一一放图了
保存
players[players_cleaned_variables].to_csv("cleaned_players.csv.gz", compression='gzip')
clean_players= pd.read_csv("cleaned_players.csv.gz", compression='gzip')
players = pd.read_csv("raw_players.csv.gz", compression='gzip')
referees = pd.read_csv("raw_referees.csv.gz", compression='gzip')
#通过球员新的子表取过滤其他子表
agg_dyads = pd.read_csv("raw_dyads.csv.gz", compression='gzip', index_col=[0, 1])
print(agg_dyads.shape)
agg_dyads.head(10)
(146028, 8)
判断逻辑
踢球总场次 = 胜场+平局+败局
all(agg_dyads['games'] == agg_dyads.victories + agg_dyads.ties + agg_dyads.defeats)
True
数据增加
agg_dyads['totalRedCards'] = agg_dyads['yellowReds'] + agg_dyads['redCards']
agg_dyads.rename(columns={'redCards': 'strictRedCards'}, inplace=True)
agg_dyads.head()
clean_players.head()
agg_dyads.head()
agg_dyads.reset_index().head()
agg_dyads.reset_index().set_index('playerShort').head()
player_dyad = (clean_players.merge(agg_dyads.reset_index().set_index('playerShort'),
left_index = True,
right_index = True))
player_dyad.head()
过滤 通过clean_players过滤agg_dyads得到clean_dyads
clean_dyads = (agg_dyads.reset_index()[agg_dyads.reset_index()
.playerShort
.isin(set(clean_players.index))
]).set_index(['refNum', 'playerShort'])
clean_dyads.head()
clean_dyads.shape,agg_dyads.shape,player_dyad.shape
((124621, 9), (146028, 9), (124621, 18))
提取出要用的特征
j = 0
out = [0 for _ in range(sum(clean_dyads['games']))]
for index, row in clean_dyads.reset_index().iterrows():
n = row['games']
d = row['totalRedCards']
ref = row['refNum']
player = row['playerShort']
for _ in range(n):
row['totalRedCards'] = 1 if (d - _) > 0 else 0
rowlist = list([ref, player, row['totalRedCards']])
out[j] = rowlist
j += 1
tidy_dyads = pd.DataFrame(out, columns=['refNum','playerShort','redcard']).set_index(['refNum', 'playerShort'])
print(tidy_dyads.redcard.sum())
print(clean_dyads.games.sum())
print(tidy_dyads.shape)
tidy_dyads.head()
tidy_dyads.redcard.sum()
# 3092
clean_referees = referees.reset_index()[referees.reset_index().refNum.isin(tidy_dyads.reset_index().refNum.unique())].set_index('refNum')
clean_referees.shape, referees.shape
# ((2978, 1), (3147, 1))
clean_countries = countries.reset_index()[countries.reset_index().refCountry.isin(clean_referees.refCountry.unique())].set_index('refCountry')
clean_countries.shape, countries.shape
# ((160, 7), (161, 7))
tidy_dyads.to_csv('cleaned_dyads.csv.gz', compression = 'gzip')
tidy_dyads.head()
(tidy_dyads.groupby(level=0)
.sum()
.sort_values('redcard', ascending=False)
.rename(columns={'redcard':'total redcards given'})).head()
tidy_dyads.groupby(level=1).sum().sort_values(ascending=False, by='redcard').head()
player_ref_game = (tidy_dyads.reset_index()
.set_index('playerShort')
.merge(clean_players,
left_index=True,
right_index=True)
)
player_ref_game.head()
bootstrap = pd.concat([player_ref_game.sample(replace=True,
n=10000).groupby('skintone').mean()
for _ in range(100)])
player_ref_game.sample(replace=True,n=10000).groupby('skintone').mean()
fig, ax = plt.subplots(figsize=(12,8))
ax = sns.regplot(bootstrap.index.values,
y='redcard',
data=bootstrap,
lowess=True,
scatter_kws={"color":"darkred",'s':15,'alpha':1},
x_jitter=(0.125 / 4.0))
ax.set_xlabel("Skintone");