2018 World Cup: can match data predict the Man of the Match? Machine learning interpretability through feature importance


Basic statistics for the 2018 men's World Cup (128 team-level records, one per team for each of the 64 matches)

Full analysis notebook: https://github.com/adi0229/ML-DL/blob/master/fifa2018.ipynb

The dataset contains the following columns:

Index(['Date', 'Team', 'Opponent', 'Goal Scored', 'Ball Possession %',
       'Attempts', 'On-Target', 'Off-Target', 'Blocked', 'Corners', 'Offsides',
       'Free Kicks', 'Saves', 'Pass Accuracy %', 'Passes',
       'Distance Covered (Kms)', 'Fouls Committed', 'Yellow Card',
       'Yellow & Red', 'Red', 'Man of the Match', '1st Goal', 'Round', 'PSO',
       'Goals in PSO', 'Own goals', 'Own goal Time'],
      dtype='object')
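
The columns listed above mix text fields with numeric ones, and the training code further down keeps only columns whose dtype is np.int64. As a rough illustration of what that filter keeps and drops, here is a sketch on a toy frame. The values are made up, only the column names are borrowed from the listing, and storing '1st Goal' as a float column with a missing value is an assumption for the sake of the example.

```python
import numpy as np
import pandas as pd

# Toy frame with made-up values; only the column names come from the listing above.
toy = pd.DataFrame({
    "Team": ["A", "B", "C"],
    "Goal Scored": np.array([5, 0, 1], dtype=np.int64),
    "Corners": np.array([6, 2, 4], dtype=np.int64),
    "1st Goal": [12.0, np.nan, 89.0],  # assumed float because of the missing value
    "Man of the Match": ["Yes", "No", "No"],
})

# Same style of filter as the post: keep integer (np.int64) columns only.
int_only = [c for c in toy.columns if toy[c].dtype in [np.int64]]

# A broader alternative: keep every numeric column, floats included.
all_numeric = toy.select_dtypes(include="number").columns.tolist()

print(int_only)
print(all_numeric)
```

With an int64-only filter, any column stored as float is silently left out of the feature set, so it is worth checking `data.dtypes` before relying on the selection.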

Random forest classifier (baseline) and feature importance

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier


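# `path` (the directory containing the CSV) is not defined in this post; set it before running
data = pd.read_csv(path + 'FIFA_2018_Statistics.csv')

# Binary target: True when the team's player was named Man of the Match
y = (data['Man of the Match'] == "Yes")

# Feature engineering: keep only integer-typed (np.int64) columns as training features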

feature_names = [i for i in data.columns if data[i].dtype in [np.int64]]

X = data[feature_names]

train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=1)

rf = RandomForestClassifier(random_state=0).fit(train_X, train_y)

from sklearn.metrics import accuracy_score

# Accuracy on the held-out validation split (y_true first, then y_pred)
predictions = rf.predict(val_X)
print("accuracy_score: " + str(accuracy_score(val_y, predictions)))
# Output: accuracy_score: 0.59375

import eli5
from eli5.sklearn import PermutationImportance

# Permutation importance, computed on the validation split
perm = PermutationImportance(rf, random_state=1).fit(val_X, val_y)
eli5.show_weights(perm, feature_names = val_X.columns.tolist())
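
To make the idea behind permutation importance concrete, here is a minimal hand-rolled sketch. It is not eli5's implementation. It shows only the general technique: shuffle one column of the validation data at a time and measure how much the accuracy drops. The data is synthetic, and the function name `permutation_importance_manual` is made up for this example.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic data: column 0 determines the label, column 1 is pure noise.
rng = np.random.default_rng(0)
n = 400
signal = rng.normal(size=n)
noise = rng.normal(size=n)
X = np.column_stack([signal, noise])
y = (signal > 0).astype(int)

X_tr, X_val, y_tr, y_val = train_test_split(X, y, random_state=0)
model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_tr, y_tr)


def permutation_importance_manual(model, X_val, y_val, n_repeats=5, seed=0):
    """Mean drop in validation accuracy after shuffling each column in turn."""
    shuffle_rng = np.random.default_rng(seed)
    base = accuracy_score(y_val, model.predict(X_val))
    drops = np.zeros(X_val.shape[1])
    for j in range(X_val.shape[1]):
        for _ in range(n_repeats):
            X_perm = X_val.copy()
            # Break the link between column j and the target; leave other columns intact.
            X_perm[:, j] = shuffle_rng.permutation(X_perm[:, j])
            drops[j] += base - accuracy_score(y_val, model.predict(X_perm))
    return drops / n_repeats


importances = permutation_importance_manual(model, X_val, y_val)
print(importances)
```

Because the model is never refitted, the score reflects how much the already-trained model relies on each column for its validation predictions, which is why the choice of validation split matters.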

Random forest classifier (tuned) and the change in feature importance

# Same model, with the number of trees set to 500
rf = RandomForestClassifier(random_state=0, n_estimators=500).fit(train_X, train_y)

predictions = rf.predict(val_X)
print("accuracy_score: " + str(accuracy_score(val_y, predictions)))
# Output: accuracy_score: 0.71875

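Analysis: after the random forest's validation accuracy rose from about 59% to about 72%:

  • Saves, pass accuracy, and on-target shots gained importance

  • Corners and total distance covered lost importance

    This is in line with common football tactical sense.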
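
For a sense of scale, both reported accuracies are exact multiples of 1/32. That is what a 32-row validation split would produce, i.e. a quarter of 128 rows under the default `test_size` of `train_test_split`. The figure of 32 is inferred from those numbers, not stated in the post, so treat the arithmetic below as a back-of-the-envelope sketch.

```python
import math

n_val = 32              # assumed validation size: 25% of 128 rows
acc_baseline = 0.59375  # reported baseline accuracy
acc_tuned = 0.71875     # reported accuracy with 500 trees

# Convert accuracies back into counts of correctly classified rows.
correct_baseline = acc_baseline * n_val
correct_tuned = acc_tuned * n_val
extra_correct = correct_tuned - correct_baseline

# Binomial standard error of an accuracy estimated from n_val rows.
std_err = math.sqrt(acc_tuned * (1 - acc_tuned) / n_val)

print(correct_baseline, correct_tuned, extra_correct)
print(round(std_err, 3))
```

Under that assumption, the improvement amounts to a handful of additional correct predictions, and the standard error of a single accuracy estimate is around 0.08. The shifts in feature importance are therefore best read as suggestive rather than conclusive.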

XGBoost classifier (tuned) and feature importance

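# XGBRFClassifier is XGBoost's random-forest-style estimator (not the boosted XGBClassifier)
from xgboost import XGBRFClassifier

xgb = XGBRFClassifier(verbosity=1,  # newer XGBoost releases use `verbosity` in place of `silent`
                      scale_pos_weight=1,
                      learning_rate=0.01,
                      colsample_bytree=0.4,
                      subsample=0.8,
                      n_estimators=1000,
                      reg_alpha=0.3,
                      max_depth=4,
                      gamma=10).fit(train_X, train_y)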

predictions = xgb.predict(val_X)
print("accuracy_score: " + str(accuracy_score(val_y, predictions)))
# Output: accuracy_score: 0.71875

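For the XGBoost model, permutation importance singles out goals scored as the only important feature.
Blunt, but arguably closer to football common sense: a team that scores more is more likely to win, and the winning side is more likely to supply the Man of the Match. The other statistics matter much less.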

perm_xgb = PermutationImportance(xgb, random_state=1).fit(val_X, val_y)
eli5.show_weights(perm_xgb, feature_names = val_X.columns.tolist())
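
One way to sanity-check a claim like "goals scored is the only feature that matters" is a single-feature baseline: fit a one-split decision tree (a stump) on each column alone and compare validation accuracy. The sketch below uses synthetic data in which the label depends on a goals-like column by construction. It illustrates the check itself and says nothing about the real dataset. The helper name `single_feature_accuracy` is made up for this example.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-ins: the label depends on `goals` by construction, not on `corners`.
rng = np.random.default_rng(1)
n = 2000
goals = rng.integers(0, 5, size=n)      # values 0..4
corners = rng.integers(0, 10, size=n)   # values 0..9, unrelated to the label
prob_yes = 0.1 + 0.2 * goals            # probability of a positive label, 0.1 to 0.9
y = rng.random(n) < prob_yes


def single_feature_accuracy(column, y):
    """Validation accuracy of a depth-1 tree trained on one column only."""
    X = column.reshape(-1, 1)
    X_tr, X_val, y_tr, y_val = train_test_split(X, y, random_state=0)
    stump = DecisionTreeClassifier(max_depth=1, random_state=0).fit(X_tr, y_tr)
    return stump.score(X_val, y_val)


acc_goals = single_feature_accuracy(goals, y)
acc_corners = single_feature_accuracy(corners, y)
print(acc_goals, acc_corners)
```

If a stump on one column gets close to the full model's accuracy, that supports reading the importance ranking as "this feature carries most of the signal".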
