绝地求生(Player unknown’s Battlegrounds),俗称吃鸡,是一款战术竞技型射击类沙盒游戏。 这款游戏是一款大逃杀类型的游戏,每一局游戏将有最多100名玩家参与,他们将被投放在绝地岛(battlegrounds)上,在游戏的开始时所有人都一无所有。玩家需要在岛上收集各种资源,在不断缩小的安全区域内对抗其他玩家,让自己生存到最后。
该游戏拥有很高的自由度,玩家可以体验飞机跳伞、开越野车、丛林射击、抢夺战利品等玩法,小心四周埋伏的敌人,尽可能成为最后1个存活的人。
数据基本处理
机器学习基本算法的使用
本项目中,将为您提供大量匿名的《绝地求生》游戏统计数据。 其格式为每行包含一个玩家的游戏后统计数据,列为数据的特征值。 数据来自所有类型的比赛:单排,双排,四排;不保证每场比赛有100名人员,每组最多4名成员。
文件说明:
train_V2.csv - 训练集
test_V2.csv - 测试集
数据集中字段解释:
Id [用户id]
Player’s Id
groupId [所处小队id]
ID to identify a group within a match. If the same group of players plays in different matches, they will have a different groupId each time.
matchId [该场比赛id]
ID to identify match. There are no matches that are in both the training and testing set.
assists [助攻数]
Number of enemy players this player damaged that were killed by teammates.
boosts [使用能量,道具数量]
Number of boost items used.
damageDealt [总伤害]
Total damage dealt. Note: Self inflicted damage is subtracted.
DBNOs [击倒敌人数量]
Number of enemy players knocked.
headshotKills [爆头数]
Number of enemy players killed with headshots.
heals [使用治疗药品数量]
Number of healing items used.
killPlace [本厂比赛杀敌排行]
Ranking in match of number of enemy players killed.
killPoints [Elo杀敌排名]
Kills-based external ranking of player. (Think of this as an Elo ranking where only kills matter.) If there is a value other than -1 in rankPoints, then any 0 in killPoints should be treated as a “None”.
kills [杀敌数]
Number of enemy players killed.
killStreaks [连续杀敌数]
Max number of enemy players killed in a short amount of time.
longestKill [最远杀敌距离]
Longest distance between player and player killed at time of death. This may be misleading, as downing a player and driving away may lead to a large longestKill stat.
matchDuration [比赛时长]
Duration of match in seconds.
matchType [比赛类型(小组人数)]
String identifying the game mode that the data comes from. The standard modes are “solo”, “duo”, “squad”, “solo-fpp”, “duo-fpp”, and “squad-fpp”; other modes are from events or custom matches.
maxPlace [本局最差名次]
Worst placement we have data for in the match. This may not match with numGroups, as sometimes the data skips over placements.
numGroups [小组数量]
Number of groups we have data for in the match.
rankPoints [Elo排名]
Elo-like ranking of player. This ranking is inconsistent and is being deprecated in the API’s next version, so use with caution. Value of -1 takes place of “None”.
revives [救活队员的次数]
Number of times this player revived teammates.
rideDistance [驾车距离]
Total distance traveled in vehicles measured in meters.
roadKills [驾车杀敌数]
Number of kills while in a vehicle.
swimDistance [游泳距离]
Total distance traveled by swimming measured in meters.
teamKills [杀死队友的次数]
Number of times this player killed a teammate.
vehicleDestroys [毁坏机动车的数量]
Number of vehicles destroyed.
walkDistance [步行距离]
Total distance traveled on foot measured in meters.
weaponsAcquired [收集武器的数量]
Number of weapons picked up.
winPoints [胜率Elo排名]
Win-based external ranking of player. (Think of this as an Elo ranking where only winning matters.) If there is a value other than -1 in rankPoints, then any 0 in winPoints should be treated as a “None”.
winPlacePerc [百分比排名]
The target of prediction. This is a percentile winning placement, where 1 corresponds to 1st place, and 0 corresponds to last place in the match. It is calculated off of maxPlace, not numGroups, so it is possible to have missing chunks in a match.
你必须创建一个模型,根据他们的最终统计数据预测玩家的排名,从1(第一名)到0(最后一名)。
最后结果通过平均绝对误差(MAE)进行评估,即通过预测的winPlacePerc和真实的winPlacePerc之间的平均绝对误差
就是绝对误差的平均值
能更好地反映预测值误差的实际情况
(,ℎ)=1∑=1|ℎ(())−()|
api:
sklearn.metrics.mean_absolute_error
在接下来的分析中,我们将分析数据集,检测异常值。
然后我们通过随机森林模型对其训练,并对对该模型进行了优化。
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
导入数据,且查看数据的基本信息
train = pd.read_csv("./data/train_V2.csv")
train.describe()
assists boosts damageDealt DBNOs headshotKills heals killPlace killPoints kills killStreaks ... revives rideDistance roadKills swimDistance teamKills vehicleDestroys walkDistance weaponsAcquired winPoints winPlacePerc
count 4.446966e+06 4.446966e+06 4.446966e+06 4.446966e+06 4.446966e+06 4.446966e+06 4.446966e+06 4.446966e+06 4.446966e+06 4.446966e+06 ... 4.446966e+06 4.446966e+06 4.446966e+06 4.446966e+06 4.446966e+06 4.446966e+06 4.446966e+06 4.446966e+06 4.446966e+06 4.446965e+06
mean 2.338149e-01 1.106908e+00 1.307171e+02 6.578755e-01 2.268196e-01 1.370147e+00 4.759935e+01 5.050060e+02 9.247833e-01 5.439551e-01 ... 1.646590e-01 6.061157e+02 3.496091e-03 4.509322e+00 2.386841e-02 7.918208e-03 1.154218e+03 3.660488e+00 6.064601e+02 4.728216e-01
std 5.885731e-01 1.715794e+00 1.707806e+02 1.145743e+00 6.021553e-01 2.679982e+00 2.746294e+01 6.275049e+02 1.558445e+00 7.109721e-01 ... 4.721671e-01 1.498344e+03 7.337297e-02 3.050220e+01 1.673935e-01 9.261157e-02 1.183497e+03 2.456544e+00 7.397004e+02 3.074050e-01
min 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 1.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 ... 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00
25% 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 2.400000e+01 0.000000e+00 0.000000e+00 0.000000e+00 ... 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 1.551000e+02 2.000000e+00 0.000000e+00 2.000000e-01
50% 0.000000e+00 0.000000e+00 8.424000e+01 0.000000e+00 0.000000e+00 0.000000e+00 4.700000e+01 0.000000e+00 0.000000e+00 0.000000e+00 ... 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 6.856000e+02 3.000000e+00 0.000000e+00 4.583000e-01
75% 0.000000e+00 2.000000e+00 1.860000e+02 1.000000e+00 0.000000e+00 2.000000e+00 7.100000e+01 1.172000e+03 1.000000e+00 1.000000e+00 ... 0.000000e+00 1.909750e-01 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 1.976000e+03 5.000000e+00 1.495000e+03 7.407000e-01
max 2.200000e+01 3.300000e+01 6.616000e+03 5.300000e+01 6.400000e+01 8.000000e+01 1.010000e+02 2.170000e+03 7.200000e+01 2.000000e+01 ... 3.900000e+01 4.071000e+04 1.800000e+01 3.823000e+03 1.200000e+01 5.000000e+00 2.578000e+04 2.360000e+02 2.013000e+03 1.000000e+00
8 rows × 25 columns
train.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4446966 entries, 0 to 4446965
Data columns (total 29 columns):
Id object
groupId object
matchId object
assists int64
boosts int64
damageDealt float64
DBNOs int64
headshotKills int64
heals int64
killPlace int64
killPoints int64
kills int64
killStreaks int64
longestKill float64
matchDuration int64
matchType object
maxPlace int64
numGroups int64
rankPoints int64
revives int64
rideDistance float64
roadKills int64
swimDistance float64
teamKills int64
vehicleDestroys int64
walkDistance float64
weaponsAcquired int64
winPoints int64
winPlacePerc float64
dtypes: float64(6), int64(19), object(4)
memory usage: 983.9+ MB
可以看到数据一共有4446966条,
train.shape
(4446966, 29)
查看目标值,我们发现有一条样本,比较特殊,其“winplaceperc”的值为NaN,也就是目标值是缺失值,
因为只有一个玩家是这样,直接进行删除处理。
# 查看缺失值
train[train['winPlacePerc'].isnull()]
Id groupId matchId assists boosts damageDealt DBNOs headshotKills heals killPlace ... revives rideDistance roadKills swimDistance teamKills vehicleDestroys walkDistance weaponsAcquired winPoints winPlacePerc
2744604 f70c74418bb064 12dfbede33f92b 224a123c53e008 0 0 0.0 0 0 0 1 ... 0 0.0 0 0.0 0 0 0.0 0 0 NaN
1 rows × 29 columns
# 删除缺失值
train.drop(2744604, inplace=True)
train.shape
(4446965, 29)
4.2.2 特征数据规范化处理
4.2.2.1 查看每场比赛参加的人数
处理完缺失值之后,我们看一下每场参加的人数会有多少呢,是每次都会匹配100个人,才开始游戏吗?
# 显示每场比赛参加人数
# transform的作用类似实现了一个一对多的映射功能,把统计数量映射到对应的每个样本上
count = train.groupby('matchId')['matchId'].transform('count')
count
0 96
1 91
2 98
3 91
4 97
..
4446961 94
4446962 93
4446963 98
4446964 94
4446965 98
Name: matchId, Length: 4446965, dtype: int64
train['playersJoined'] = count
count.count()
4446965
train.head()
Id groupId matchId assists boosts damageDealt DBNOs headshotKills heals killPlace ... rideDistance roadKills swimDistance teamKills vehicleDestroys walkDistance weaponsAcquired winPoints winPlacePerc playersJoined
0 7f96b2f878858a 4d4b580de459be a10357fd1a4a91 0 0 0.00 0 0 0 60 ... 0.0000 0 0.00 0 0 244.80 1 1466 0.4444 96
1 eef90569b9d03c 684d5656442f9e aeb375fc57110c 0 0 91.47 0 0 0 57 ... 0.0045 0 11.04 0 0 1434.00 5 0 0.6400 91
2 1eaf90ac73de72 6a4a42c3245a74 110163d8bb94ae 1 0 68.00 0 0 0 47 ... 0.0000 0 0.00 0 0 161.80 2 0 0.7755 98
3 4616d365dd2853 a930a9c79cd721 f1f1f4ef412d7e 0 0 32.90 0 0 0 75 ... 0.0000 0 0.00 0 0 202.70 3 0 0.1667 91
4 315c96c26c9aac de04010b3458dd 6dc8ff871e21e6 0 0 100.00 0 0 0 45 ... 0.0000 0 0.00 0 0 49.75 2 0 0.1875 97
5 rows × 30 columns
# 通过每场参加人数进行,按值升序排列
train["playersJoined"].sort_values().head()
1206365 2
2109739 2
3956552 5
3620228 5
696000 5
Name: playersJoined, dtype: int64
通过结果发现,最少的一局,竟然只有两个人,wtf!!!!
# 通过绘制图像,查看每局开始人数
# 通过seaborn下的countplot方法,可以直接绘制统计过数量之后的直方图
plt.figure(figsize=(20,10))
sns.countplot(train['playersJoined'])
plt.title('playersJoined')
plt.grid()
plt.show()
通过观察,发现一局游戏少于75个玩家,就开始的还是比较少
同时大部分游戏都是在接近100人的时候才开始
限制每局开始人数大于等于75,再进行绘制。
猜想:把这些数据在后期加入数据处理,应该会得到的结果更加准确一些
# 再次绘制每局参加人数的直方图
plt.figure(figsize=(20,10))
sns.countplot(train[train['playersJoined']>=75]['playersJoined'])
plt.title('playersJoined')
plt.grid()
plt.show()
现在我们统计了“每局玩家数量”,那么我们就可以通过“每局玩家数量”来进一步考证其它特征,同时对其规范化设置
试想:一局只有70个玩家的杀敌数,和一局有100个玩家的杀敌数,应该是不可以同时比较的
可以考虑的特征值包括
1.kills(杀敌数)
2.damageDealt(总伤害)
3.maxPlace(本局最差名次)
4.matchDuration(比赛时长)
# 对部分特征值进行规范化
train['killsNorm'] = train['kills']*((100-train['playersJoined'])/100 + 1)
train['damageDealtNorm'] = train['damageDealt']*((100-train['playersJoined'])/100 + 1)
train['maxPlaceNorm'] = train['maxPlace']*((100-train['playersJoined'])/100 + 1)
train['matchDurationNorm'] = train['matchDuration']*((100-train['playersJoined'])/100 + 1)
# 比较经过规范化的特征值和原始特征值的值
to_show = ['Id', 'kills','killsNorm','damageDealt', 'damageDealtNorm', 'maxPlace', 'maxPlaceNorm', 'matchDuration', 'matchDurationNorm']
train[to_show][0:11]
Id kills killsNorm damageDealt damageDealtNorm maxPlace maxPlaceNorm matchDuration matchDurationNorm
0 7f96b2f878858a 0 0.00 0.000 0.00000 28 29.12 1306 1358.24
1 eef90569b9d03c 0 0.00 91.470 99.70230 26 28.34 1777 1936.93
2 1eaf90ac73de72 0 0.00 68.000 69.36000 50 51.00 1318 1344.36
3 4616d365dd2853 0 0.00 32.900 35.86100 31 33.79 1436 1565.24
4 315c96c26c9aac 1 1.03 100.000 103.00000 97 99.91 1424 1466.72
5 ff79c12f326506 1 1.05 100.000 105.00000 28 29.40 1395 1464.75
6 95959be0e21ca3 0 0.00 0.000 0.00000 28 28.84 1316 1355.48
7 311b84c6ff4390 0 0.00 8.538 8.87952 96 99.84 1967 2045.68
8 1a68204ccf9891 0 0.00 51.600 53.14800 28 28.84 1375 1416.25
9 e5bb5a43587253 0 0.00 37.270 38.38810 29 29.87 1930 1987.90
10 2b574d43972813 0 0.00 28.380 28.66380 29 29.29 1811 1829.11
此处我们把特征:heals(使用治疗药品数量)和boosts(能量、道具使用数量)合并成一个新的变量,命名:”healsandboosts“, 这是一个探索性过程,最后结果不一定有用,如果没有实际用处,最后再把它删除。
# 创建新变量“healsandboosts”
train['healsandboosts'] = train['heals'] + train['boosts']
train[["heals", "boosts", "healsandboosts"]].tail()
heals boosts healsandboosts
4446961 0 0 0
4446962 0 1 1
4446963 0 0 0
4446964 2 4 6
4446965 1 2 3
异常数据处理:
一些行中的数据统计出来的结果非常反常规,那么这些玩家肯定有问题,为了训练模型的准确性,我们会把这些异常数据剔除
通过以下操作,识别出玩家在游戏中有击杀数,但是全局没有移动;
这类型玩家肯定是存在异常情况(挂**),我们把这些玩家删除。
# 创建新变量,统计玩家移动距离
train['totalDistance'] = train['rideDistance'] + train['walkDistance'] + train['swimDistance']
# 创建新变量,统计玩家是否在游戏中,有击杀,但是没有移动,如果是返回True, 否则返回false
train['killsWithoutMoving'] = ((train['kills'] > 0) & (train['totalDistance'] == 0))
train["killsWithoutMoving"].head()
0 False
1 False
2 False
3 False
4 False
Name: killsWithoutMoving, dtype: bool
train["killsWithoutMoving"].describe()
count 4446965
unique 2
top False
freq 4445430
Name: killsWithoutMoving, dtype: object
# 检查是否存在有击杀但是没有移动的数据
train[train['killsWithoutMoving'] == True].shape
(1535, 37)
train[train['killsWithoutMoving'] == True].head()
Id groupId matchId assists boosts damageDealt DBNOs headshotKills heals killPlace ... winPoints winPlacePerc playersJoined killsNorm damageDealtNorm maxPlaceNorm matchDurationNorm healsandboosts totalDistance killsWithoutMoving
1824 b538d514ef2476 0eb2ce2f43f9d6 35e7d750e442e2 0 0 593.0 0 0 3 18 ... 0 0.8571 58 8.52 842.060 21.30 842.06 3 0.0 True
6673 6d3a61da07b7cb 2d8119b1544f87 904cecf36217df 2 0 346.6 0 0 6 33 ... 0 0.6000 42 4.74 547.628 17.38 2834.52 6 0.0 True
11892 550398a8f33db7 c3fd0e2abab0af db6f6d1f0d4904 2 0 1750.0 0 4 5 3 ... 0 0.8947 21 35.80 3132.500 35.80 1607.42 5 0.0 True
14631 58d690ee461e9d ea5b6630b33d67 dbf34301df5e53 0 0 157.8 0 0 0 69 ... 1500 0.0000 73 1.27 200.406 24.13 1014.73 0 0.0 True
15591 49b61fc963d632 0f5c5f19d9cc21 904cecf36217df 0 0 100.0 0 1 0 37 ... 0 0.3000 42 1.58 158.000 17.38 2834.52 0 0.0 True
5 rows × 37 columns
# 删除这些数据
train.drop(train[train['killsWithoutMoving'] == True].index, inplace=True)
4.2.4.2 异常值处理:删除驾车杀敌数异常的数据
# 查看载具杀敌数超过十个的玩家
train[train['roadKills'] > 10]
Id groupId matchId assists boosts damageDealt DBNOs headshotKills heals killPlace ... winPoints winPlacePerc playersJoined killsNorm damageDealtNorm maxPlaceNorm matchDurationNorm healsandboosts totalDistance killsWithoutMoving
2733926 c3e444f7d1289f 489dd6d1f2b3bb 4797482205aaa4 0 0 1246.0 0 0 0 1 ... 1371 0.4286 92 15.12 1345.68 99.36 1572.48 0 1282.302 False
2767999 34193085975338 bd7d50fa305700 a22354d036b3d6 0 0 1102.0 0 0 0 1 ... 1533 0.4713 88 12.32 1234.24 98.56 2179.52 0 4934.600 False
2890740 a3438934e3e535 1081c315a80d14 fe744430ac0070 0 8 2074.0 0 1 11 1 ... 1568 1.0000 38 32.40 3359.88 61.56 3191.40 19 5876.000 False
3524413 9d9d044f81de72 8be97e1ba792e3 859e2c2db5b125 0 3 1866.0 0 5 7 1 ... 1606 0.9398 84 20.88 2164.56 97.44 2233.00 10 7853.000 False
4 rows × 37 columns
# 删除这些数据
train.drop(train[train['roadKills'] > 10].index, inplace=True)
train.shape
(4445426, 37)
4.2.4.3 异常值处理:删除玩家在一局中杀敌数超过30人的数据
# 首先绘制玩家杀敌数的条形图
plt.figure(figsize=(10,4))
sns.countplot(data=train, x=train['kills']).set_title('Kills')
plt.show()
train[train['kills'] > 30].shape
(95, 37)
train[train['kills'] > 30].head()
Id groupId matchId assists boosts damageDealt DBNOs headshotKills heals killPlace ... winPoints winPlacePerc playersJoined killsNorm damageDealtNorm maxPlaceNorm matchDurationNorm healsandboosts totalDistance killsWithoutMoving
57978 9d8253e21ccbbd ef7135ed856cd8 37f05e2a01015f 9 0 3725.0 0 7 0 2 ... 1500 0.8571 16 64.40 6854.00 14.72 3308.32 0 48.82 False
87793 45f76442384931 b3627758941d34 37f05e2a01015f 8 0 3087.0 0 8 27 3 ... 1500 1.0000 16 57.04 5680.08 14.72 3308.32 27 780.70 False
156599 746aa7eabf7c86 5723e7d8250da3 f900de1ec39fa5 21 0 5479.0 0 12 7 4 ... 0 0.7000 11 90.72 10355.31 20.79 3398.22 7 23.71 False
160254 15622257cb44e2 1a513eeecfe724 db413c7c48292c 1 0 4033.0 0 40 0 1 ... 1500 1.0000 62 57.96 5565.54 11.04 1164.72 0 718.30 False
180189 1355613d43e2d0 f863cd38c61dbf 39c442628f5df5 5 0 3171.0 0 6 15 1 ... 0 1.0000 11 66.15 5993.19 17.01 3394.44 15 71.51 False
5 rows × 37 columns
# 异常数据删除
train.drop(train[train['kills'] > 30].index, inplace=True)
4.2.4.4 异常值处理:删除爆头率异常数据
如果一个玩家的击杀爆头率过高,也说明其有问题
# 创建变量爆头率
train['headshot_rate'] = train['headshotKills'] / train['kills']
train['headshot_rate'] = train['headshot_rate'].fillna(0)
train["headshot_rate"].tail()
4446961 0.0
4446962 0.0
4446963 0.0
4446964 0.5
4446965 0.0
Name: headshot_rate, dtype: float64
# 绘制爆头率图像
plt.figure(figsize=(12,4))
sns.distplot(train['headshot_rate'], bins=10)
plt.show()
train[(train['headshot_rate'] == 1) & (train['kills'] > 9)].shape
(24, 38)
train[(train['headshot_rate'] == 1) & (train['kills'] > 9)].head()
Id groupId matchId assists boosts damageDealt DBNOs headshotKills heals killPlace ... winPlacePerc playersJoined killsNorm damageDealtNorm maxPlaceNorm matchDurationNorm healsandboosts totalDistance killsWithoutMoving headshot_rate
281570 ab9d7168570927 add05ebde0214c e016a873339c7b 2 3 1212.0 8 10 0 1 ... 0.8462 93 10.70 1296.84 28.89 1522.61 3 2939.0 False 1.0
346124 044d18fc42fc75 fc1dbc2df6a887 628107d4c41084 3 5 1620.0 13 11 3 1 ... 1.0000 96 11.44 1684.80 28.08 1796.08 8 8142.0 False 1.0
871244 e668a25f5488e3 5ba8feabfb2a23 f6e6581e03ba4f 0 4 1365.0 9 13 0 1 ... 1.0000 98 13.26 1392.30 27.54 1280.10 4 2105.0 False 1.0
908815 566d8218b705aa a9b056478d71b2 3a41552d553583 2 5 1535.0 10 10 3 1 ... 0.9630 95 10.50 1611.75 29.40 1929.90 8 7948.0 False 1.0
963463 1bd6fd288df4f0 90584ffa22fe15 ba2de992ec7bb8 2 6 1355.0 12 10 2 1 ... 1.0000 96 10.40 1409.20 28.08 1473.68 8 3476.0 False 1.0
5 rows × 38 columns
train.drop(train[(train['headshot_rate'] == 1) & (train['kills'] > 9)].index, inplace=True)
4.2.4.5 异常值处理:删除最远杀敌距离异常数据
# 绘制图像
plt.figure(figsize=(12,4))
sns.distplot(train['longestKill'], bins=10)
plt.show()
# 找出最远杀敌距离大于等于1km的玩家
train[train['longestKill'] >= 1000].shape
(20, 38)
train[train['longestKill'] >= 1000]["longestKill"].head()
202281 1000.0
240005 1004.0
324313 1026.0
656553 1000.0
803632 1075.0
Name: longestKill, dtype: float64
train.drop(train[train['longestKill'] >= 1000].index, inplace=True)
train.shape
(4445287, 38)
4.2.4.6 异常值处理:删除关于运动距离的异常值
# 距离整体描述
train[['walkDistance', 'rideDistance', 'swimDistance', 'totalDistance']].describe()
walkDistance rideDistance swimDistance totalDistance
count 4.445287e+06 4.445287e+06 4.445287e+06 4.445287e+06
mean 1.154619e+03 6.063215e+02 4.510898e+00 1.765451e+03
std 1.183508e+03 1.498562e+03 3.050738e+01 2.183248e+03
min 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00
25% 1.554000e+02 0.000000e+00 0.000000e+00 1.584000e+02
50% 6.863000e+02 0.000000e+00 0.000000e+00 7.892500e+02
75% 1.977000e+03 2.566000e-01 0.000000e+00 2.729000e+03
max 2.578000e+04 4.071000e+04 3.823000e+03 4.127010e+04
a)行走距离处理
plt.figure(figsize=(12,4))
sns.distplot(train['walkDistance'], bins=10)
plt.show()
train[train['walkDistance'] >= 10000].shape
(219, 38)
train[train['walkDistance'] >= 10000].head()
Id groupId matchId assists boosts damageDealt DBNOs headshotKills heals killPlace ... winPlacePerc playersJoined killsNorm damageDealtNorm maxPlaceNorm matchDurationNorm healsandboosts totalDistance killsWithoutMoving headshot_rate
23026 8a6562381dd83f 23e638cd6eaf77 b0a804a610e9b0 0 1 0.00 0 0 0 44 ... 0.8163 99 0.00 0.0000 99.99 1925.06 1 13540.3032 False 0.0
34344 5a591ecc957393 6717370b51c247 a15d93e7165b05 0 3 23.22 0 0 1 34 ... 0.9474 65 0.00 31.3470 27.00 2668.95 4 10070.9073 False 0.0
49312 582685f487f0b4 338112cd12f1e7 d0afbf5c3a6dc9 0 4 117.20 1 0 1 24 ... 0.9130 94 1.06 124.2320 49.82 2323.52 5 12446.7588 False 0.0
68590 8c0d9dd0b4463c c963553dc937e9 926681ea721a47 0 1 32.34 0 0 1 46 ... 0.8333 96 0.00 33.6336 50.96 1909.44 2 12483.6200 False 0.0
94400 d441bebd01db61 7e179b3366adb8 923b57b8b834cc 1 1 73.08 0 0 3 27 ... 0.8194 73 0.00 92.8116 92.71 2293.62 4 11490.6300 False 0.0
5 rows × 38 columns
train.drop(train[train['walkDistance'] >= 10000].index, inplace=True)
b)载具行驶距离处理
plt.figure(figsize=(12,4))
sns.distplot(train['rideDistance'], bins=10)
plt.show()
train[train['rideDistance'] >= 20000].shape
(150, 38)
train[train['rideDistance'] >= 20000].head()
Id groupId matchId assists boosts damageDealt DBNOs headshotKills heals killPlace ... winPlacePerc playersJoined killsNorm damageDealtNorm maxPlaceNorm matchDurationNorm healsandboosts totalDistance killsWithoutMoving headshot_rate
28588 6260f7c49dc16f b24589f02eedd7 6ebea3b4f55b4a 0 0 99.2 0 0 1 30 ... 0.6421 96 1.04 103.168 99.84 1969.76 1 26306.6 False 0.000000
63015 adb7dae4d0c10a 8ede98a241f30a 8b36eac66378e4 0 0 0.0 0 0 0 55 ... 0.5376 94 0.00 0.000 99.64 2004.46 0 22065.4 False 0.000000
70507 ca6fa339064d67 f7bb2e30c3461f 3bfd8d66edbeff 0 0 100.0 0 0 0 26 ... 0.8878 99 1.01 101.000 99.99 1947.28 0 28917.5 False 0.000000
72763 198e5894e68ff4 ccf47c82abb11f d92bf8e696b61d 0 0 0.0 0 0 0 46 ... 0.7917 97 0.00 0.000 99.91 1861.21 0 21197.2 False 0.000000
95276 c3fabfce7589ae 15529e25aa4a74 d055504340e5f4 0 7 778.2 0 1 2 2 ... 0.9785 94 7.42 824.892 99.64 1986.44 9 26733.2 False 0.142857
5 rows × 38 columns
train.drop(train[train['rideDistance'] >= 20000].index, inplace=True)
c)游泳距离处理
plt.figure(figsize=(12,4))
sns.distplot(train['swimDistance'], bins=10)
plt.show()
train[train['swimDistance'] >= 2000].shape
(12, 38)
train[train['swimDistance'] >= 2000][["swimDistance"]]
swimDistance
177973 2295.0
274258 2148.0
1005337 2718.0
1195818 2668.0
1227362 3823.0
1889163 2484.0
2065940 3514.0
2327586 2387.0
2784855 2206.0
3359439 2338.0
3513522 2124.0
4132225 2382.0
train.drop(train[train['swimDistance'] >= 2000].index, inplace=True)
4.2.4.7 异常值处理:武器收集异常值处理
plt.figure(figsize=(12,4))
sns.distplot(train['weaponsAcquired'], bins=100)
plt.show()
train[train['weaponsAcquired'] >= 80].shape
(19, 38)
train[train['weaponsAcquired'] >= 80][['weaponsAcquired']].head()
weaponsAcquired
233643 128
588387 80
1437471 102
1449293 95
1592744 94
train.drop(train[train['weaponsAcquired'] >= 80].index, inplace=True)
4.2.4.8 异常值处理:删除使用治疗药品数量异常值
plt.figure(figsize=(12,4))
sns.distplot(train['heals'], bins=10)
plt.show()
train[train['heals'] >= 40].shape
(135, 38)
train[train['heals'] >= 40][["heals"]].head()
heals
18405 47
54463 43
126439 52
259351 42
268747 48
train.drop(train[train['heals'] >= 40].index, inplace=True)
train.shape
(4444752, 38)
4.2.5 类别型数据处理
4.2.5.1 比赛类型one-hot处理
# 关于比赛类型,共有16种方式
train['matchType'].unique()
array(['squad-fpp', 'duo', 'solo-fpp', 'squad', 'duo-fpp', 'solo',
'normal-squad-fpp', 'crashfpp', 'flaretpp', 'normal-solo-fpp',
'flarefpp', 'normal-duo-fpp', 'normal-duo', 'normal-squad',
'crashtpp', 'normal-solo'], dtype=object)
# 对matchType进行one_hot编码
# 通过在后面添加的方式,实现,赋值并不是替换
train = pd.get_dummies(train, columns=['matchType'])
train.head()
Id assists boosts damageDealt DBNOs headshotKills heals killPlace killPoints kills ... matchType_normal-solo matchType_normal-solo-fpp matchType_normal-squad matchType_normal-squad-fpp matchType_solo matchType_solo-fpp matchType_squad matchType_squad-fpp groupId_cat matchId_cat
0 7f96b2f878858a 0 0 0.00 0 0 0 60 1241 0 ... 0 0 0 0 0 0 0 1 613591 30085
1 eef90569b9d03c 0 0 91.47 0 0 0 57 0 0 ... 0 0 0 0 0 0 0 1 827580 32751
2 1eaf90ac73de72 1 0 68.00 0 0 0 47 0 0 ... 0 0 0 0 0 0 0 0 843271 3143
3 4616d365dd2853 0 0 32.90 0 0 0 75 0 0 ... 0 0 0 0 0 0 0 1 1340070 45260
4 315c96c26c9aac 0 0 100.00 0 0 0 45 0 1 ... 0 0 0 0 0 1 0 0 1757334 20531
5 rows × 53 columns
train.shape
(4444752, 53)
# 通过正则匹配查看具体内容
matchType_encoding = train.filter(regex='matchType')
matchType_encoding.head()
matchType_crashfpp matchType_crashtpp matchType_duo matchType_duo-fpp matchType_flarefpp matchType_flaretpp matchType_normal-duo matchType_normal-duo-fpp matchType_normal-solo matchType_normal-solo-fpp matchType_normal-squad matchType_normal-squad-fpp matchType_solo matchType_solo-fpp matchType_squad matchType_squad-fpp
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1
1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1
2 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0
3 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1
4 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0
4.2.5.2 对groupId,matchId等数据进行处理
关于groupId,matchId这类型数据,也是类别型数据。但是它们的数据量特别多,如果你使用one-hot编码,无异于自杀。
在这儿我们把它们变成用数字统计的类别型数据依旧不影响我们正常使用。
# 把groupId 和 match Id 转换成类别类型 categorical types
# 就是把一堆不怎么好识别的内容转换成数字
# 转换group_id
train["groupId"].head()
0 4d4b580de459be
1 684d5656442f9e
2 6a4a42c3245a74
3 a930a9c79cd721
4 de04010b3458dd
Name: groupId, dtype: object
train['groupId'] = train['groupId'].astype('category')
train["groupId"].head()
0 4d4b580de459be
1 684d5656442f9e
2 6a4a42c3245a74
3 a930a9c79cd721
4 de04010b3458dd
Name: groupId, dtype: category
Categories (2026153, object): [00000c08b5be36, 00000d1cbbc340, 000025a09dd1d7, 000038ec4dff53, ..., fffff305a0133d, fffff32bc7eab9, fffff7edfc4050, fffff98178ef52]
train["groupId_cat"] = train["groupId"].cat.codes
train["groupId_cat"].head()
0 613591
1 827580
2 843271
3 1340070
4 1757334
Name: groupId_cat, dtype: int32
# 转换match_id
train['matchId'] = train['matchId'].astype('category')
train['matchId_cat'] = train['matchId'].cat.codes
# 删除之前列
train.drop(['groupId', 'matchId'], axis=1, inplace=True)
# 查看新产生列
train[['groupId_cat', 'matchId_cat']].head()
groupId_cat matchId_cat
0 613591 30085
1 827580 32751
2 843271 3143
3 1340070 45260
4 1757334 20531
train.head()
Id assists boosts damageDealt DBNOs headshotKills heals killPlace killPoints kills ... matchType_normal-solo matchType_normal-solo-fpp matchType_normal-squad matchType_normal-squad-fpp matchType_solo matchType_solo-fpp matchType_squad matchType_squad-fpp groupId_cat matchId_cat
0 7f96b2f878858a 0 0 0.00 0 0 0 60 1241 0 ... 0 0 0 0 0 0 0 1 613591 30085
1 eef90569b9d03c 0 0 91.47 0 0 0 57 0 0 ... 0 0 0 0 0 0 0 1 827580 32751
2 1eaf90ac73de72 1 0 68.00 0 0 0 47 0 0 ... 0 0 0 0 0 0 0 0 843271 3143
3 4616d365dd2853 0 0 32.90 0 0 0 75 0 0 ... 0 0 0 0 0 0 0 1 1340070 45260
4 315c96c26c9aac 0 0 100.00 0 0 0 45 0 1 ... 0 0 0 0 0 1 0 0 1757334 20531
5 rows × 53 columns
4.2.6 数据截取
4.2.6.1 取部分数据进行使用(1000000)
# 取前100万条数据,进行训练
sample = 1000000
df_sample = train.sample(sample)
df_sample.shape
(1000000, 53)
4.2.7 确定特征值和目标值
# 确定特征值和目标值
df = df_sample.drop(["winPlacePerc", "Id"], axis=1) #all columns except target
y = df_sample['winPlacePerc'] # Only target variable
df.head()
assists boosts damageDealt DBNOs headshotKills heals killPlace killPoints kills killStreaks ... matchType_normal-solo matchType_normal-solo-fpp matchType_normal-squad matchType_normal-squad-fpp matchType_solo matchType_solo-fpp matchType_squad matchType_squad-fpp groupId_cat matchId_cat
2324052 0 1 120.20 0 0 0 67 1337 0 0 ... 0 0 0 0 0 0 0 1 339395 43113
533207 0 0 32.93 0 0 0 44 1100 0 0 ... 0 0 0 0 0 0 0 1 914206 13399
325801 0 2 161.10 0 0 3 52 0 0 0 ... 0 0 0 0 0 0 0 1 1119774 45981
478373 0 0 63.94 0 0 0 56 0 0 0 ... 0 0 0 0 0 0 0 0 1932650 44393
1021200 0 0 0.00 0 0 0 89 0 0 0 ... 0 0 0 0 0 0 0 0 1706611 44723
5 rows × 51 columns
y.head()
2324052 0.4074
533207 0.6923
325801 0.8000
478373 0.4894
1021200 0.0816
Name: winPlacePerc, dtype: float64
print(df.shape, y.shape)
(1000000, 51) (1000000,)
4.2.8 分割训练集和验证集
# 自定义函数,分割训练集和验证集
def split_vals(a, n : int):
# ps: n:int 是一种新的定义函数方式,告诉你这个n,传入应该是int类型,但不是强制的
return a[:n].copy(), a[n:].copy()
val_perc = 0.12 # % to use for validation set
n_valid = int(val_perc * sample)
n_trn = len(df)-n_valid
# 分割数据集
raw_train, raw_valid = split_vals(df_sample, n_trn)
X_train, X_valid = split_vals(df, n_trn)
y_train, y_valid = split_vals(y, n_trn)
# 检查数据集维度
print('Sample train shape: ', X_train.shape,
'\nSample target shape: ', y_train.shape,
'\nSample validation shape: ', X_valid.shape)
Sample train shape: (880000, 51)
Sample target shape: (880000,)
Sample validation shape: (120000, 51)
4.3 机器学习(模型训练)和评估
# 导入需要训练和评估api
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
4.3.1 初步使用随机森林进行模型训练
# 模型训练
m1 = RandomForestRegressor(n_estimators=40, min_samples_leaf=3, max_features='sqrt', n_jobs=-1)
# n_jobs=-1 表示训练的时候,并行数和cpu的核数一样,如果传入具体的值,表示用几个核去跑
m1.fit(X_train, y_train)
RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
max_features='sqrt', max_leaf_nodes=None,
min_impurity_decrease=0.0, min_impurity_split=None,
min_samples_leaf=3, min_samples_split=2,
min_weight_fraction_leaf=0.0, n_estimators=40, n_jobs=-1,
oob_score=False, random_state=None, verbose=0, warm_start=False)
y_pre = m1.predict(X_valid)
m1.score(X_valid, y_valid)
0.9209745520056316
mean_absolute_error(y_true=y_valid, y_pred=y_pre)
0.06134694645458625
经过第一次计算,得出准确率为:0.92, mae=0.06
4.3.2 再次使用随机森林,进行模型训练
减少特征值,提高模型训练效率
# 查看特征值在当前模型中的重要程度
m1.feature_importances_
array([2.78018429e-03, 7.09256632e-02, 1.29069372e-02, 2.66242891e-03,
3.22533248e-03, 4.03021778e-02, 1.96128120e-01, 1.93034113e-03,
6.83785312e-03, 7.15347972e-03, 1.41163125e-02, 9.47316429e-03,
6.60520545e-03, 7.54587098e-03, 3.77642011e-03, 2.36387992e-03,
1.79286139e-02, 2.71369501e-05, 1.37579746e-03, 1.15328556e-04,
2.24318554e-05, 2.84514166e-01, 4.87906687e-02, 2.25145604e-03,
6.34793081e-03, 1.17778514e-02, 9.38100747e-03, 6.92348737e-03,
1.26945601e-02, 3.77781917e-02, 1.57621968e-01, 0.00000000e+00,
1.22717096e-03, 6.59880420e-05, 7.40618079e-07, 2.08886376e-04,
4.90619346e-04, 4.92684576e-08, 2.07956404e-06, 2.76274752e-07,
9.83807884e-05, 0.00000000e+00, 9.89167941e-06, 1.01437578e-06,
2.77453616e-04, 2.41390349e-04, 1.06886660e-03, 1.08671409e-03,
9.27714940e-04, 4.00129494e-03, 4.00750039e-03])
imp_df = pd.DataFrame({"cols":df.columns, "imp":m1.feature_importances_})
imp_df.head()
cols imp
0 assists 0.002780
1 boosts 0.070926
2 damageDealt 0.012907
3 DBNOs 0.002662
4 headshotKills 0.003225
imp_df = imp_df.sort_values("imp", ascending=False)
imp_df.head()
cols imp
21 walkDistance 0.284514
6 killPlace 0.196128
30 totalDistance 0.157622
1 boosts 0.070926
22 weaponsAcquired 0.048791
# Plot a feature importance graph for the 20 most important features
# 绘制特征重要性程度图,仅展示排名前二十的特征
plot_fea = imp_df[:20].plot('cols', 'imp', figsize=(14,6), legend=False, kind = 'barh')
plot_fea
<matplotlib.axes._subplots.AxesSubplot at 0x1713427b8>
# 保留比较重要的特征
to_keep = imp_df[imp_df.imp>0.005].cols
print('Significant features: ', len(to_keep))
to_keep
Significant features: 20
21 walkDistance
6 killPlace
30 totalDistance
1 boosts
22 weaponsAcquired
5 heals
29 healsandboosts
16 rideDistance
10 longestKill
2 damageDealt
28 matchDurationNorm
25 killsNorm
11 matchDuration
26 damageDealtNorm
13 numGroups
9 killStreaks
27 maxPlaceNorm
8 kills
12 maxPlace
24 playersJoined
Name: cols, dtype: object
# 由这些比较重要的特征值,生成新的df
df[to_keep].head()
walkDistance killPlace totalDistance boosts weaponsAcquired heals healsandboosts rideDistance longestKill damageDealt matchDurationNorm killsNorm matchDuration damageDealtNorm numGroups killStreaks maxPlaceNorm kills maxPlace playersJoined
2324052 1192.0000 67 1754.9000 1 6 0 1 562.9 0.0 120.20 1965.60 0.0 1872 126.2100 28 0 29.40 0 28 95
533207 3105.0000 44 3105.0000 0 7 0 0 0.0 0.0 32.93 1857.42 0.0 1821 33.5886 27 0 27.54 0 27 98
325801 4036.0000 52 4342.9000 2 2 3 5 306.9 0.0 161.10 1820.00 0.0 1820 161.1000 31 0 31.00 0 31 100
478373 611.0000 56 611.0000 0 5 0 0 0.0 0.0 63.94 1501.50 0.0 1430 67.1370 47 0 50.40 0 48 95
1021200 0.3298 89 0.3298 0 1 0 0 0.0 0.0 0.00 1505.86 0.0 1462 0.0000 46 0 51.50 0 50 97
df.head()
assists boosts damageDealt DBNOs headshotKills heals killPlace killPoints kills killStreaks ... matchType_normal-solo matchType_normal-solo-fpp matchType_normal-squad matchType_normal-squad-fpp matchType_solo matchType_solo-fpp matchType_squad matchType_squad-fpp groupId_cat matchId_cat
2324052 0 1 120.20 0 0 0 67 1337 0 0 ... 0 0 0 0 0 0 0 1 339395 43113
533207 0 0 32.93 0 0 0 44 1100 0 0 ... 0 0 0 0 0 0 0 1 914206 13399
325801 0 2 161.10 0 0 3 52 0 0 0 ... 0 0 0 0 0 0 0 1 1119774 45981
478373 0 0 63.94 0 0 0 56 0 0 0 ... 0 0 0 0 0 0 0 0 1932650 44393
1021200 0 0 0.00 0 0 0 89 0 0 0 ... 0 0 0 0 0 0 0 0 1706611 44723
5 rows × 51 columns
# 重新制定训练集和测试集
df_keep = df[to_keep]
X_train, X_valid = split_vals(df_keep, n_trn)
# 模型训练
m2 = RandomForestRegressor(n_estimators=40, min_samples_leaf=3, max_features='sqrt',
n_jobs=-1)
# n_jobs=-1 表示训练的时候,并行数和cpu的核数一样,如果传入具体的值,表示用几个核去跑
m2.fit(X_train, y_train)
RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
max_features='sqrt', max_leaf_nodes=None,
min_impurity_decrease=0.0, min_impurity_split=None,
min_samples_leaf=3, min_samples_split=2,
min_weight_fraction_leaf=0.0, n_estimators=40, n_jobs=-1,
oob_score=False, random_state=None, verbose=0, warm_start=False)
# 模型评分
y_pre = m2.predict(X_valid)
m2.score(X_valid, y_valid)
0.9247615702679183
# mae评估
mean_absolute_error(y_true=y_valid, y_pred=y_pre)
0.05956897962889757
print(m2.score)
<bound method RegressorMixin.score of RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
max_features='sqrt', max_leaf_nodes=None,
min_impurity_decrease=0.0, min_impurity_split=None,
min_samples_leaf=3, min_samples_split=2,
min_weight_fraction_leaf=0.0, n_estimators=40, n_jobs=-1,
oob_score=False, random_state=None, verbose=0, warm_start=False)>