NBA players with Python
(Note: all opinions are my own.)
Introduction
This article explores the correlation between NBA player salaries and actual on-court performance, using stats from the 2019/2020 regular season. It also illustrates how to do web scraping and simple data analysis in Python.
The aim is to identify above-average performers with below-average pay (in per-minute terms) who are about to become free agents in the upcoming off-season, as these players might represent sound opportunities to add quality rotation players to a team (from a front-office perspective).
Data Sources: Basketball Reference, HoopsHype
For this article, I am also going to show you how to scrape and parse NBA salary data from HoopsHype.
You will also use in-game statistics from Basketball Reference. The relevant Basketball Reference dataset (2019/2020 NBA Season Player Totals) can be exported in CSV format.
If you also want to learn how to scrape data from Basketball Reference in order to automate data ingestion, check out my other NBA-focused article.
Step 1: Scrape player salary data from HoopsHype
HoopsHype contains player payroll data up to the 2024/25 season (for contracts already signed). The first thing you want to do is extract that dataset into a pandas DataFrame.
Let’s see how you can do that using Python.
1. Import relevant packages: As a first step, import the relevant Python packages and define the url variable, which points to the online data source. Here, you are going to use the requests, BeautifulSoup, pandas, matplotlib and numpy packages.
2. Extract the table element from the web page: Next, fetch the content at that url with the requests.get method. Save the response in the r variable, then extract only its text content via the .text attribute.
Once done, you can parse it into a BeautifulSoup object and locate the dataset by pointing to the html table tag via the soup.find method.
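The original code for these first two steps is not reproduced here, so what follows is only a minimal sketch of how they might look. The url value is my assumption of where the HoopsHype salary page lives, the helper names fetch_html and find_salary_table are hypothetical, and sample_html is an invented stand-in page so that the parsing part can be tried without a network connection.

```python
import requests
from bs4 import BeautifulSoup

# Assumed address of the HoopsHype player salary page; check it before use.
url = "https://hoopshype.com/salaries/players/"


def fetch_html(page_url):
    """Download a page and return its HTML as text."""
    r = requests.get(page_url, timeout=30)
    r.raise_for_status()  # stop early on HTTP errors
    return r.text


def find_salary_table(html):
    """Parse the HTML and return the first <table> element (None if absent)."""
    soup = BeautifulSoup(html, "html.parser")
    return soup.find("table")


# Invented stand-in page, so the parsing step can be tried offline.
sample_html = """
<html><body><table>
  <tr><td>Player</td><td>2019/20</td></tr>
  <tr><td>Player A</td><td>$1,000,000</td></tr>
</table></body></html>
"""

salary_table = find_salary_table(sample_html)
n_cells = len(salary_table.find_all("td"))
print(n_cells)
```

With network access, the live equivalent would be salary_table = find_salary_table(fetch_html(url)), keeping in mind that the page layout may differ from what this article describes.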
3. Extract table rows from the table element: The html table element is composed of table data cells, identified by the td html tag. You can get the total number of data cells by calling len on salary_table.find_all(“td”), which is a list of all data cells in the table.
By exploring the list of all td data cells (salary_table.find_all(“td”)), you notice that the cell at index 10 is the first data cell of the table’s first column, the one containing the player names. The cells at indexes 11 through 15 then hold the salary data for the 2019/20–2024/25 seasons. All indexes <10 contain the table headers.
Knowing this, you can use list comprehensions to loop over the full data cell list and capture every 8th element (the step size needed to reach the next relevant td value for each column), starting from each column’s starting index. After this runs, you will have stored the player names and the season salary data in separate lists (player_names and column1 to column6).
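To make the stride idea concrete, here is a small sketch that applies it to an invented, flattened 3-column table. The helper name column_values and the sample cells are hypothetical; for the table described in this article you would use a step of 8, with a start index of 10 for the names and the following indexes for the salary columns.

```python
def column_values(cells, start, step):
    """Return every `step`-th entry of `cells`, beginning at index `start`."""
    return [cells[i] for i in range(start, len(cells), step)]


# Stand-in for [td.text.strip() for td in salary_table.find_all("td")]:
# a flattened 3-column table whose first 3 cells are the headers.
cells = [
    "Player", "2019/20", "2020/21",
    "Player A", "$1,000,000", "$1,100,000",
    "Player B", "$2,000,000", "$0",
]

# Each column starts one cell after the previous one and repeats every 3 cells.
player_names = column_values(cells, start=3, step=3)
column1 = column_values(cells, start=4, step=3)
column2 = column_values(cells, start=5, step=3)
print(player_names, column1, column2)
```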
4. Store data in a pandas DataFrame: Now that all table data is stored in the relevant lists, you can set them as values in a dictionary whose keys are the desired table headers, and convert it to a pandas DataFrame with the pd.DataFrame constructor.
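As a rough illustration of this step, the sketch below builds a DataFrame from a dictionary of lists. The player names and salary strings are invented, and only two of the six season columns are shown; the header labels are my own choice.

```python
import pandas as pd

# Invented sample lists standing in for the scraped columns.
player_names = ["Player A", "Player B"]
column1 = ["$1,000,000", "$2,000,000"]
column2 = ["$1,100,000", "$0"]

# Dictionary keys become the column headers of the DataFrame.
salary_df = pd.DataFrame({
    "player_names": player_names,
    "2019/20": column1,
    "2020/21": column2,
})
print(salary_df.head())
```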
Exploring the resulting dataset, you will see that it contains the player names and the expected salary columns.
Step 2: Data cleaning and joining
1. Data cleaning on the salary data: The first data cleaning step is removing all dollar signs (“$”) and commas (“,”) from the salary columns. You can do that with the .replace method.
After this, all salaries can be converted to a numeric data type with the pd.to_numeric function, applied to every column except the only non-numeric one, player_names.
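One possible way to write this cleaning step is sketched below. Note that I use the column-level .str.replace here, which is a small variation on the .replace call mentioned above, and that the sample rows are invented.

```python
import pandas as pd

# Invented sample of the scraped salary table (all values still strings).
salary_df = pd.DataFrame({
    "player_names": ["Player A", "Player B"],
    "2019/20": ["$1,000,000", "$2,000,000"],
    "2020/21": ["$1,100,000", "$0"],
})

# Every column except the player names holds salary strings.
salary_columns = [c for c in salary_df.columns if c != "player_names"]
for col in salary_columns:
    cleaned = salary_df[col].str.replace("$", "", regex=False)
    cleaned = cleaned.str.replace(",", "", regex=False)
    salary_df[col] = pd.to_numeric(cleaned)  # strings -> integers

print(salary_df.dtypes)
```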
2. Importing and cleaning the 2019/20 stats dataset
You now need to import the per-player total statistics for the relevant season. You can do so via the pd.read_csv function.
You notice that the first column contains the player name attached to what seems to be a name identifier used by the data source. Let’s remove that and keep only the full name.
To do so, you can loop over the Player column, get the index position of the backslash character (“\”), which separates the full name from the identifier, and use that position to slice each name’s string, keeping only the characters from the start of the string up to that position.
Each cleaned name then overwrites the original one, resulting in a clean Player column. You then merge the salary and stats datasets on each player’s full name and store the result in a separate object (complete_df), which now contains the full stats for each player plus their current and projected salary data for future seasons.
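The import, name cleaning and merge could look roughly like the sketch below. The CSV content is an invented two-row stand-in read from memory rather than from the exported file, the identifiers after the backslash are made up, and I split on the backslash instead of slicing by index, which gives the same result.

```python
import io

import pandas as pd

# Invented stand-in for the exported file; in practice use pd.read_csv(<path>).
csv_text = (
    "Player,MP,PTS\n"
    "Player A\\playera01,2000,1000\n"
    "Player B\\playerb01,500,150\n"
)
stats_df = pd.read_csv(io.StringIO(csv_text))

# Keep only the part of each name before the backslash.
stats_df["Player"] = [name.split("\\")[0] for name in stats_df["Player"]]

# Invented cleaned salary table to merge with.
salary_df = pd.DataFrame({
    "player_names": ["Player A", "Player B"],
    "2019/20": [10_000_000, 2_000_000],
})

# Inner join: only players present in both tables are kept.
complete_df = salary_df.merge(
    stats_df, left_on="player_names", right_on="Player", how="inner"
)
print(complete_df)
```

Because this is an inner join on names, any player whose name is spelled differently in the two sources would silently drop out, so it is worth comparing row counts before and after the merge.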
Step 3: Data exploration and analysis
Now that you have paired up the 2019/20 season stats with the salary data, you can start exploring and analyzing the data.
Here I go back to my initial question and aim to complete a simple, preliminary analysis that identifies above-average performers with below-average pay (in per-minute terms) who are about to become free agents in the upcoming off-season.
In terms of metrics, I will approximate a player’s on-court performance via the Approximate Value (AV) metric, developed by Dean Oliver.
I will express pay in per-minute-played terms. Let’s start by computing the two metrics from our dataset.
1. Compute AV and per-minute salary
AV’s formula is available at the above link; after replicating it in Python, I also calculate per-minute salary by simply dividing each player’s salary for the year by the total minutes he actually played.
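As a sketch only: the version of the AV formula I am assuming here is Credits = PTS + TRB + AST + STL + BLK - missed field goals - missed free throws - TOV, with AV = Credits^(3/4) / 21. Please treat that formula as an assumption and check it against the linked source. The stat column names follow the Basketball Reference export headers, while the AV and salary_per_minute column names, the clipping safeguard and the sample row are my own.

```python
import pandas as pd

# Invented one-row sample of the merged dataset.
complete_df = pd.DataFrame({
    "Player": ["Player A"],
    "PTS": [1000], "TRB": [500], "AST": [200], "STL": [50], "BLK": [50],
    "FG": [400], "FGA": [800], "FT": [150], "FTA": [200], "TOV": [100],
    "MP": [2000],
    "2019/20": [10_000_000],
})

# Assumed AV formula: positive contributions minus misses and turnovers.
credits = (
    complete_df["PTS"] + complete_df["TRB"] + complete_df["AST"]
    + complete_df["STL"] + complete_df["BLK"]
    - (complete_df["FGA"] - complete_df["FG"])  # missed field goals
    - (complete_df["FTA"] - complete_df["FT"])  # missed free throws
    - complete_df["TOV"]
)

# Clip at zero (my own safeguard) so a negative total cannot yield NaN.
complete_df["AV"] = credits.clip(lower=0) ** 0.75 / 21

# Season salary divided by total minutes played.
complete_df["salary_per_minute"] = complete_df["2019/20"] / complete_df["MP"]
print(complete_df[["Player", "AV", "salary_per_minute"]])
```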
2. Plot correlations and treat outliers
Exploring the relationship between a player’s value and their pay per minute, you are immediately struck by the skewness of the distribution. In fact, most players fall within the $0–$14K per-minute band, with outliers on the right-hand side of the distribution.
Exploring the distribution of the per-minute salary column further with the .describe method, you can appreciate how right-skewed the data is: 75% of the data points are less than or equal to 14K, while the mean sits at ~17K (median at 5K).
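To show what this check looks like in code, here is a small sketch on invented per-minute salaries (these are not the article's data). A mean well above the median is the usual sign of a right-skewed distribution.

```python
import pandas as pd

# Invented per-minute salaries with one extreme value on the right.
salary_per_minute = pd.Series(
    [1_000, 2_000, 4_000, 5_000, 6_000, 9_000, 14_000, 95_000],
    name="salary_per_minute",
)

# count, mean, std, min, quartiles and max in one call.
summary = salary_per_minute.describe()
print(summary)

# The single large value pulls the mean above the median.
is_right_skewed = summary["mean"] > summary["50%"]
print(is_right_skewed)
```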
This indicates the probable presence of outliers; in this context, players whose per-minute salary is inflated by the very low number of games they played during the season (usually due to injuries, which cut seasons short).
Removing these helps in getting a better sense of the data. Armed with this knowledge, you can set a per-minute salary threshold at which to “cut” the data, giving a clearer view of the distribution for players with enough games played.
In this example, I will be using the 75th percentile ($14K per minute), but feel free to play around with it.
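A minimal sketch of such a cut, assuming the per-minute salary lives in a column called salary_per_minute (a hypothetical name) and using invented values:

```python
import pandas as pd

# Invented sample with one clear outlier.
complete_df = pd.DataFrame({
    "Player": list("ABCDEFGH"),
    "salary_per_minute": [1_000, 2_000, 4_000, 5_000, 6_000, 9_000, 14_000, 95_000],
})

# 75th percentile of the per-minute salary, used as the cut-off.
threshold = complete_df["salary_per_minute"].quantile(0.75)

# Keep only the rows at or below the threshold.
trimmed_df = complete_df[complete_df["salary_per_minute"] <= threshold]
print(threshold, len(trimmed_df))
```

Changing the 0.75 argument is the easiest way to experiment with a stricter or looser cut.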
Plotting the data now gives a different picture, one that makes the correlation between per-minute salary and performance easier to see.
3. High-level player clustering
The overall correlation seems to be somewhat positive. Let’s now identify clusters of players at the high end of performance and the low end of $ per minute played.
Using mean values, I can split the data above into quadrants, which makes it possible to visually grasp each data point’s position relative to the mean on both the x and y axes.
To do so, I use the plt.axvline and plt.axhline functions to draw vertical and horizontal lines that mark the quadrants relative to the averages.
In the resulting graph, players with above-average AV and below-average $/minute are located in the upper-left quadrant.
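The quadrant plot can be sketched as follows, assuming per-minute salary on the x axis and AV on the y axis. The four data points and the column names are invented, and the Agg backend line is only there so the snippet also runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Invented sample: per-minute salary (x) against AV (y).
df = pd.DataFrame({
    "salary_per_minute": [1000, 3000, 6000, 9000],
    "AV": [9.0, 4.0, 11.0, 5.0],
})
mean_salary = df["salary_per_minute"].mean()
mean_av = df["AV"].mean()

fig, ax = plt.subplots()
ax.scatter(df["salary_per_minute"], df["AV"])
plt.axvline(mean_salary, color="grey", linestyle="--")  # vertical mean line
plt.axhline(mean_av, color="grey", linestyle="--")      # horizontal mean line
ax.set_xlabel("Salary per minute played ($)")
ax.set_ylabel("Approximate Value (AV)")

# Upper-left quadrant: cheaper than average and better than average.
upper_left = df[(df["salary_per_minute"] < mean_salary) & (df["AV"] > mean_av)]
print(upper_left)
```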
Looking towards the upcoming season, these players represent a potential opportunity, from an NBA front-office perspective, to invest in proven quality players with an undemanding salary relative to the rest of the league.
In essence, these are players who have delivered a high return on investment (ROI) to the teams that had them on their payroll.
Let’s now conduct some final analysis to nail down a list of potential free-agent candidates, keeping in mind that not all of these players will be free agents next season and thus available to be signed by other teams.
4. Spotting the final free-agent list of high-ROI players
To come up with a final list of potential high-ROI players, let’s first slice the data to remove any outliers. I will name the result relevant_dataset.
Then I store the average values of both the performance and salary metrics, and filter the data to keep only above-average performers with below-average pay in per-minute terms.
I will name this final dataset good_performers.
Finally, as I want to consider only upcoming free agents, I filter the data to keep players whose salary for next year is equal to 0, as this means that they have yet to sign a contract.
I then sort those players by the AV metric (in descending order) and get back my final list of about 20 top candidates.
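Put together, the filtering described above might look like the sketch below. The five players and their numbers are invented, the column names AV, salary_per_minute and 2020/21 are assumptions about how the dataset is labelled, and free_agents is a hypothetical name for the final list.

```python
import pandas as pd

# Invented sample, assumed to be already trimmed of outliers.
relevant_dataset = pd.DataFrame({
    "Player": ["Player A", "Player B", "Player C", "Player D", "Player E"],
    "AV": [12.0, 10.5, 4.0, 11.0, 13.0],
    "salary_per_minute": [2_000, 3_000, 1_000, 9_000, 3_500],
    "2020/21": [0, 5_000_000, 0, 0, 0],  # next season's salary
})

mean_av = relevant_dataset["AV"].mean()
mean_salary = relevant_dataset["salary_per_minute"].mean()

# Above-average performance at below-average per-minute pay.
good_performers = relevant_dataset[
    (relevant_dataset["AV"] > mean_av)
    & (relevant_dataset["salary_per_minute"] < mean_salary)
]

# No salary on the books for next season, best AV first, top 20 at most.
free_agents = (
    good_performers[good_performers["2020/21"] == 0]
    .sort_values("AV", ascending=False)
    .head(20)
)
print(free_agents[["Player", "AV", "salary_per_minute"]])
```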
5. Considerations
Going down the list, you can see that all of these players serve as quality rotation players for their respective teams.
The best on this list, Harrell, has just won the Sixth Man of the Year Award, and his AV confirms his great production for the Clippers this past season. Given his current contract, you now know that he has also been a bargain for the Clips.
Going into next season, definitely expect these players to be at the center of contract negotiations, either with their current teams or with others wanting to secure their services and spend potentially high-ROI money on a quality role player.
Next steps
AV and salary are by no means exhaustive metrics, and this correlation exercise should not be taken as a complete piece of analysis, but I hope to have inspired you to dig deeper into these trends and take other metrics into consideration as you work towards answering your own questions and testing your hypotheses.
Thanks for reading!
Translated from: https://medium.com/@edo.romani1/linking-nba-salary-to-performance-sample-player-analysis-with-python-2c568455b306