Starting a new series: one small web-scraping exercise per week. For higher-quality datasets I'll also do a bit of analysis along the way, brushing up on Python and statistical methods at the same time.
As you can see, the page lists NBA player rankings and their stat lines. This is clearly a static page, so a plain GET request plus tag searching is enough. Straight to the code:
import requests
from bs4 import BeautifulSoup
import pandas as pd
url2 = 'https://nba.hupu.com/stats/players'
# Request headers copied from a browser session; the cookie is tied to that
# visit and may need refreshing, but it makes the request look like a browser
header = {
"user-agent" : "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/105.0.0.0 Safari/537.36",
"cookie" : "csrfToken=x8OTXySKbz8VDMB5feAjvY5a; new_nba=1; new_nba.sig=slSAJI6uejKCajd3mHOP1-Lssar98CC05plbblZ8sJo; Hm_lvt_6158ac1596b0de37381ffd343b3df24c=1662727863; sajssdk_2015_cross_new_user=1; sensorsdata2015jssdkcross=%7B%22distinct_id%22%3A%22183224dbb31cfe-0a770b96a673e28-26021c51-1821369-183224dbb3212ef%22%2C%22first_id%22%3A%22%22%2C%22props%22%3A%7B%22%24latest_traffic_source_type%22%3A%22%E7%9B%B4%E6%8E%A5%E6%B5%81%E9%87%8F%22%2C%22%24latest_search_keyword%22%3A%22%E6%9C%AA%E5%8F%96%E5%88%B0%E5%80%BC_%E7%9B%B4%E6%8E%A5%E6%89%93%E5%BC%80%22%2C%22%24latest_referrer%22%3A%22%22%7D%2C%22identities%22%3A%22eyIkaWRlbnRpdHlfY29va2llX2lkIjoiMTgzMjI0ZGJiMzFjZmUtMGE3NzBiOTZhNjczZTI4LTI2MDIxYzUxLTE4MjEzNjktMTgzMjI0ZGJiMzIxMmVmIn0%3D%22%2C%22history_login_id%22%3A%7B%22name%22%3A%22%22%2C%22value%22%3A%22%22%7D%2C%22%24device_id%22%3A%22183224dbb31cfe-0a770b96a673e28-26021c51-1821369-183224dbb3212ef%22%7D; Hm_lvt_4fac77ceccb0cd4ad5ef1be46d740615=1662727871; Hm_lvt_b241fb65ecc2ccf4e7e3b9601c7a50de=1662727871; Hm_lvt_a3d34dd67fa1fb34b2b430bbaaa2a5bf=1662727871; Hm_lpvt_6158ac1596b0de37381ffd343b3df24c=1662727883; Hm_lpvt_4fac77ceccb0cd4ad5ef1be46d740615=1662727886; Hm_lpvt_b241fb65ecc2ccf4e7e3b9601c7a50de=1662727886; Hm_lpvt_a3d34dd67fa1fb34b2b430bbaaa2a5bf=1662727886"
}
response = requests.get(url2, headers=header)
# Let requests guess the encoding from the page content so the Chinese text decodes correctly
response.encoding = response.apparent_encoding
print(response.text)
soup = BeautifulSoup(response.text, 'html.parser')
# prettify() only returns a string, so print it to actually inspect the HTML
print(soup.prettify())
tbody = soup.find("tbody")
player_info = tbody.find_all("tr")
player_info[0]
col_name = []
# The first <tr> holds the header cells; collect their text as column names
for td in player_info[0].find_all('td'):
    #print(td.text)
    col_name.append(td.text)
print(col_name)
df = pd.DataFrame(columns=col_name, index=["0"])
df
First comes the header row. It lives in td tags just like the player data, so we extract it first and use it as the DataFrame's columns.
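As an aside, the same header extraction can be written more compactly with a list comprehension; this is equivalent to the loop above:

# Equivalent one-liner for collecting the header texts
col_name = [td.text for td in player_info[0].find_all('td')]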
Players = []
for i in range(1, len(player_info)):
    players = []  # one list of cell texts per player
    for td in player_info[i].find_all('td'):
        #print(td.text)
        players.append(td.text)
    Players.append(players)
df = pd.DataFrame(Players, columns=col_name)
df.set_index(['排名'], inplace=True)
df
With 排名 (rank) set as the index, we get the players' detailed stats with little effort; nothing difficult overall. But one issue is worth thinking about: judging by the DataFrame alone it looks as if we already have the data, yet the data is clearly not ready for analysis. The reason: everything we extracted from the tags is a str, so no numerical computation is possible. Merely storing the data like this is of little practical use, so let's improve the code.
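You can verify this directly: because every cell was appended as td.text, pandas stores each column with the generic object dtype.

# Every column reports 'object', i.e. plain strings, not numbers
print(df.dtypes)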
Players = []
for i in range(1, len(player_info)):
    players = []
    for td in player_info[i].find_all('td'):
        #print(td.text)
        if "%" in td.text:
            # Percentage: strip the sign and store as a fraction
            players.append(float(td.text.replace("%", "")) / 100)
        else:
            try:
                players.append(float(td.text))
            except ValueError:
                # Not numeric (name, team, ...): keep the raw string
                players.append(td.text)
    Players.append(players)
    #print(players)
df = pd.DataFrame(Players, columns=col_name)
df.set_index(['排名'], inplace=True)
df
The idea is to use try/except to sort the values by type: percentages have the % sign stripped and are divided by 100; of what remains, anything float() accepts becomes numeric, and anything that raises must be a string or mixed value, which we simply leave untouched.
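As an alternative, pandas can also do the conversion after scraping everything as text. A minimal sketch with pd.to_numeric, using '得分' (points) as an assumed column name from the Hupu table; note that errors='coerce' turns non-numeric strings into NaN, so it should only be applied to columns that ought to be numeric:

# Convert one assumed numeric column in place; non-numeric entries become NaN
df['得分'] = pd.to_numeric(df['得分'], errors='coerce')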
At first glance the two results look identical, but running descriptive statistics reveals the difference:
df.describe()
When describe() is called with no arguments it only summarizes numeric columns, and here every numeric column shows up in the output, so the data types are now correct.
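If you also want the string columns (player name, team) in the summary, describe() takes an include argument:

# Summarize all columns; object columns report count/unique/top/freq instead
df.describe(include='all')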