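2018 Men's FIFA World Cup: basic match statistics (128 rows, one per team for each of the 64 matches)
Full analysis notebook: https://github.com/adi0229/ML-DL/blob/master/fifa2018.ipynb
The dataset contains the following columns: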
Index(['Date', 'Team', 'Opponent', 'Goal Scored', 'Ball Possession %',
'Attempts', 'On-Target', 'Off-Target', 'Blocked', 'Corners', 'Offsides',
'Free Kicks', 'Saves', 'Pass Accuracy %', 'Passes',
'Distance Covered (Kms)', 'Fouls Committed', 'Yellow Card',
'Yellow & Red', 'Red', 'Man of the Match', '1st Goal', 'Round', 'PSO',
'Goals in PSO', 'Own goals', 'Own goal Time'],
dtype='object')
Random forest classifier (baseline) and feature importance
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
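# `path` is the directory that holds the CSV; it is not defined in this excerpt
data = pd.read_csv(path + 'FIFA_2018_Statistics.csv')
# Target: True if the team's player was named Man of the Match
y = (data['Man of the Match'] == "Yes")
# Feature engineering: use the integer-valued numeric columns as training features
feature_names = [i for i in data.columns if data[i].dtype in [np.int64]]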
X = data[feature_names]
train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=1)
rf = RandomForestClassifier(random_state=0).fit(train_X, train_y)
from sklearn.metrics import accuracy_score
predictions = rf.predict(val_X)
print("accuracy_score: " + str(accuracy_score(val_y, predictions)))
accuracy_score: 0.59375
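The dtype-based feature selection used above can be sketched on a toy frame. The values below are made up for illustration and only the column names are borrowed from the dataset. The sketch shows that the `int64` rule silently drops text columns and any numeric column that contains missing values (pandas loads those as `float64`), which is likely why columns such as '1st Goal' do not end up among the training features.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the match table; all values are invented for illustration.
toy = pd.DataFrame({
    "Team": ["A", "B", "C"],                             # object column: skipped
    "Goal Scored": np.array([2, 0, 1], dtype=np.int64),  # kept
    "Attempts": np.array([13, 6, 9], dtype=np.int64),    # kept
    "1st Goal": [12.0, np.nan, 50.0],                    # NaN forces float64: skipped
})

# Same rule as in the post: keep only columns whose dtype is int64.
int_features = [c for c in toy.columns if toy[c].dtype == np.int64]

# Equivalent selection using pandas' own dtype filter.
int_features_alt = toy.select_dtypes(include=["int64"]).columns.tolist()

print(int_features)  # ['Goal Scored', 'Attempts']
```

If float-valued statistics should also be used, `select_dtypes(include="number")` would keep them, at the cost of having to handle the missing values first.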
import eli5
from eli5.sklearn import PermutationImportance
# Permutation importance on the validation set, rendered by eli5 as a weight table
perm = PermutationImportance(rf, random_state=1).fit(val_X, val_y)
eli5.show_weights(perm, feature_names=val_X.columns.tolist())
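For readers without eli5, the same idea (shuffle one column of the validation set at a time and measure how much accuracy drops) can be sketched with scikit-learn's own `permutation_importance`. This is a swapped-in function, not the call used in the post, and the data here is synthetic: one column (`signal`) fully determines the label and two columns are pure noise.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic data (an assumption for illustration, not the World Cup table).
rng = np.random.default_rng(0)
n = 400
signal = rng.normal(size=n)        # feature that determines the label
noise = rng.normal(size=(n, 2))    # two unrelated features
X = np.column_stack([signal, noise])
y = signal > 0

train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=1)
model = RandomForestClassifier(n_estimators=100, random_state=0).fit(train_X, train_y)

# Shuffle each column on the validation set and record the mean accuracy drop.
result = permutation_importance(model, val_X, val_y, n_repeats=10, random_state=1)
for name, mean, std in zip(["signal", "noise_1", "noise_2"],
                           result.importances_mean, result.importances_std):
    print(f"{name}: {mean:.3f} +/- {std:.3f}")
```

A feature whose shuffling leaves accuracy unchanged gets a weight near zero, and small negative weights can occur by chance on a small validation set.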
Random forest classifier (tuned) and the change in feature importance
# Tuning: raise the number of trees to 500
rf = RandomForestClassifier(random_state=0, n_estimators=500).fit(train_X, train_y)
predictions = rf.predict(val_X)
print("accuracy_score: " + str(accuracy_score(val_y, predictions)))
accuracy_score: 0.71875
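Analysis: after the random forest's validation accuracy rose from about 59% to about 72%:
- The importance of saves, pass accuracy, and shots on target increased.
- The importance of corners and total distance covered decreased.
This is consistent with common football tactical intuition.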
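The two accuracy figures can be turned back into match counts with a little arithmetic. The sketch below assumes the 128 rows stated in the title and the default `test_size=0.25` of `train_test_split`, which gives a validation set of 32 rows. The standard error uses the usual normal approximation for a proportion and is only a rough guide at this sample size.

```python
import math

n_rows = 128                       # rows in the dataset, per the title
n_val = math.ceil(0.25 * n_rows)   # default test_size=0.25 -> 32 validation rows

correct_counts = []
for acc in (0.59375, 0.71875):
    correct = round(acc * n_val)
    correct_counts.append(correct)
    # Normal-approximation standard error of a proportion: sqrt(p * (1 - p) / n)
    se = math.sqrt(acc * (1 - acc) / n_val)
    print(f"accuracy={acc}: {correct}/{n_val} correct, standard error ~ {se:.3f}")
```

The scores correspond to 19 and 23 correct predictions out of 32, so the improvement amounts to four validation rows, which is worth keeping in mind when reading the importance changes.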
XGBoost classifier (tuned) and feature importance
from xgboost import XGBRFClassifier
# Note: XGBRFClassifier is XGBoost's random-forest-style estimator,
# not the gradient-boosted XGBClassifier.
xgb = XGBRFClassifier(verbosity=1,  # replaces the deprecated silent=False
                      scale_pos_weight=1,
                      learning_rate=0.01,
                      colsample_bytree=0.4,
                      subsample=0.8,
                      n_estimators=1000,
                      reg_alpha=0.3,
                      max_depth=4,
                      gamma=10).fit(train_X, train_y)
predictions = xgb.predict(val_X)
print("accuracy_score: " + str(accuracy_score(val_y, predictions)))
accuracy_score: 0.71875
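The XGBoost model found goals scored ('Goal Scored') to be the only important feature.
This is blunt, but it also fits football common sense better: a team that scores more is more likely to win, and the Man of the Match usually comes from the winning side. The other statistics matter comparatively little.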
# Permutation importance of the XGBoost model on the validation set
perm_xgb = PermutationImportance(xgb, random_state=1).fit(val_X, val_y)
eli5.show_weights(perm_xgb, feature_names=val_X.columns.tolist())
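One way to probe a "single important feature" finding is to train on that column alone and compare against training on all columns. The sketch below is a hypothetical check on synthetic data, not a result from the World Cup table: `goals` stands in for 'Goal Scored', the `other` columns are unrelated noise, and the label is constructed to depend on goals only. A scikit-learn random forest is used in place of XGBoost to keep the example dependency-light.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data (an assumption for illustration only).
rng = np.random.default_rng(0)
n = 400
goals = rng.poisson(1.3, size=n)                     # hypothetical "Goal Scored"
other = rng.normal(size=(n, 3))                      # hypothetical unrelated statistics
y = (goals + rng.normal(scale=0.5, size=n)) > 1.5    # label driven by goals only

X_all = np.column_stack([goals, other])
X_goals = goals.reshape(-1, 1)

# 5-fold cross-validated accuracy with all columns vs. the goals column alone.
model = RandomForestClassifier(n_estimators=100, random_state=0)
acc_all = cross_val_score(model, X_all, y, cv=5, scoring="accuracy").mean()
acc_goals = cross_val_score(model, X_goals, y, cv=5, scoring="accuracy").mean()
print(f"all features: {acc_all:.3f}, goals only: {acc_goals:.3f}")
```

If the single-column model scores about as well as the full model on the real data, that would support the reading that the remaining statistics add little for this target.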