A week before the May Day holiday, a senior I look up to dropped a web-scraping project on me. For someone who had only just started out, the task felt enormous, and after a week of self-imposed confinement, poring over code day and night, I finally ground it out.
The site's home page alone was complicated enough. I spotted a product listing right on the home page and assumed that was the whole job, but that was only the home page: he wanted product data scraped from every product-listing page. When I heard that I felt like I was sitting on pins and needles... scraping every listing page is a lot harder than just the home page.
So I set the URL as a global variable to make the later fetching easier:
htUrl = '(URL withheld)'
Import the packages the crawler needs:
from selenium import webdriver
from lxml import etree
import requests
import csv
from selenium.webdriver.common.by import By
First, to keep the site's anti-scraping checks off my back, I added a headers disguise; and because this is a dynamically rendered site, I used a Selenium driver to fetch the page source:
def spider(url):
    # headers only matter for the requests fallback below; Selenium drives a real browser itself
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.114 Safari/537.36'}
    driver = webdriver.Chrome()
    driver.get(url)
    response = driver.page_source
    # response = requests.get(url, headers=headers)
    # response.encoding = 'utf-8'
    return response
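(A side note for anyone copying this: I grab page_source right after driver.get(), so on a slow connection the list may not have rendered yet. A rough sketch like the one below waits for it explicitly and can also run Chrome headless; it assumes the product list sits under the #gform element, as my XPaths suggest, so treat the selector as an assumption about this particular site.)

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

def spider_wait(url):
    options = Options()
    options.add_argument('--headless')              # optional: run Chrome without a window
    driver = webdriver.Chrome(options=options)
    driver.get(url)
    # wait up to 10 seconds for the product list to show up before reading the HTML
    WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.XPATH, '//*[@id="gform"]//ul/li')))  # assumed selector
    source = driver.page_source
    driver.quit()
    return source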
For locating the product fields I went with XPath, for the obvious reason that it's convenient. There are a lot of products per page, and writing one function per product would be painful, so I used precise XPath expressions inside a for loop. It works reasonably well:
def getEverypro(source):
    html_element = etree.HTML(source)
    proItemList = html_element.xpath('//*[@id="gform"]/div/div[4]/div[1]/ul')
    print(proItemList)
    proList = []
    for eachpro in proItemList:
        for i in ['1', '2', '3', '4', '5', '6', '7', '8', '9', '10', '11', '12']:
            proDict = {}
            name = eachpro.xpath('//*[@id="gform"]/div[@class="htzszy_layout ptb40"]/div[@class="htzszy_listmode mt16"]/div[@class="htzszy_prolist"]/ul/li['+i+']/div[@class="proname"]/text()')
            proDict['name'] = name
            company1 = eachpro.xpath('//*[@id="gform"]/div[@class="htzszy_layout ptb40"]/div[@class="htzszy_listmode mt16"]/div[@class="htzszy_prolist"]/ul/li['+i+']/div[@class="htzszy_proinfo clearfix"]/table/tbody/tr[1]/td[1]/text()')
            proDict['company'] = company1
            didian1 = eachpro.xpath('//*[@id="gform"]/div[@class="htzszy_layout ptb40"]/div[@class="htzszy_listmode mt16"]/div[@class="htzszy_prolist"]/ul/li['+i+']/div[@class="htzszy_proinfo clearfix"]/table/tbody/tr[1]/td[2]/text()')
            proDict['didian'] = didian1
            price1 = eachpro.xpath('//*[@id="gform"]/div[@class="htzszy_layout ptb40"]/div[@class="htzszy_listmode mt16"]/div[@class="htzszy_prolist"]/ul/li['+i+']/div[@class="htzszy_proinfo clearfix"]/table/tbody/tr[2]/td[1]/p[@class="nprice "]/span[@class="price"]/span/text()')
            proDict['price'] = price1
            proList.append(proDict)
    return proList
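(Looking back, hard-coding li[1] through li[12] breaks if the last page has fewer than 12 products. A possible alternative, sketched below using the same class names that appear in my XPaths above, so the selectors are assumptions rather than guarantees, is to iterate over the <li> nodes directly with relative paths.)

from lxml import etree

def getEverypro_relative(source):
    html_element = etree.HTML(source)
    proList = []
    # grab every product <li> in one go, however many this page actually has
    for li in html_element.xpath('//div[@class="htzszy_prolist"]/ul/li'):
        def first(path):
            hit = li.xpath(path)        # relative XPath, rooted at this <li>
            return hit[0].strip() if hit else ''
        proList.append({
            'name':    first('./div[@class="proname"]/text()'),
            'company': first('.//table/tbody/tr[1]/td[1]/text()'),
            'didian':  first('.//table/tbody/tr[1]/td[2]/text()'),
            'price':   first('.//span[@class="price"]/span/text()'),
        })
    return proList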
Next comes writing the data out to storage; not much to say here:
def writeData(proList):
    with open('数据.csv', 'w', encoding='utf-8', newline='') as f:
        writer = csv.DictWriter(f, fieldnames=['name', 'company', 'didian', 'price'])
        writer.writeheader()
        for each in proList:
            writer.writerow(each)
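(Two small caveats: 'w' mode overwrites the file on every run, and Excel on Windows sometimes shows plain UTF-8 Chinese as garbled text. If either of those bites you, a variant roughly like this, with utf-8-sig and append mode, is one way around it; the helper name is mine, not part of the original script.)

import csv
import os

def writeData_append(proList):
    # utf-8-sig adds a BOM so Excel recognises the encoding; 'a' appends page after page
    new_file = not os.path.exists('数据.csv') or os.path.getsize('数据.csv') == 0
    with open('数据.csv', 'a', encoding='utf-8-sig', newline='') as f:
        writer = csv.DictWriter(f, fieldnames=['name', 'company', 'didian', 'price'])
        if new_file:                    # write the header only once
            writer.writeheader()
        for each in proList:
            writer.writerow(each)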
So what was the problem that stumped me for days? I couldn't get the crawler to move on to the next page by itself after finishing the first one. I had originally written a driver-based click function to advance pages, but it would only ever reach page 2 before errors started. I checked carefully whether the URL changes on a new page, and of course it doesn't. Then I noticed a little page-jump input box at the bottom of the site. Where there's a will there's a way: after the first page's data is fetched and saved, the click function types the next page number into that box, jumps, and then the fetch-and-write steps repeat. Finding the XPath of that jump box also took some effort:
def click(url, pageno):
    driver = webdriver.Chrome()
    driver.get(url)
    # type the target page number into the jump box at the bottom of the page
    w = driver.find_element(By.XPATH, '//*[@id="pageno"]')
    w.send_keys(str(pageno))
    # press the jump button next to the box
    next_page = driver.find_element(By.XPATH, '//*[@id="gform"]/div/div[5]/div/div/ul/div[2]/span[4]/a')
    next_page.click()
    page = driver.page_source
    return page
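(One thing that still isn't ideal: click() launches a brand-new Chrome for every page. A rough sketch of an alternative that keeps a single driver alive for the whole run, reusing webdriver, By and getEverypro from above and the same jump-box XPaths, so those selectors remain assumptions about this site, would look something like this.)

import time

def crawl_all(url, last_page):
    driver = webdriver.Chrome()                  # one browser for the whole run
    driver.get(url)
    proList = getEverypro(driver.page_source)    # page 1
    for pageno in range(2, last_page + 1):
        box = driver.find_element(By.XPATH, '//*[@id="pageno"]')
        box.clear()                              # wipe whatever number is already in the box
        box.send_keys(str(pageno))
        driver.find_element(By.XPATH, '//*[@id="gform"]/div/div[5]/div/div/ul/div[2]/span[4]/a').click()
        time.sleep(2)                            # crude wait for the new page to render
        proList += getEverypro(driver.page_source)
    driver.quit()
    return proList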
Through this crawler I not only learned the overall shape of a scraping project, I also picked up some new commands, like clicking elements and sending keystrokes. Most important of all was XPath locating, which I ran into here for the first time. As my profile says, I'm a beginner, haha. I'd welcome any pointers from the experts on what I could improve. Here is my full code:
# Imports
from selenium import webdriver
from lxml import etree
import requests
import csv
from selenium.webdriver.common.by import By
# Target URL
htUrl = '(URL withheld)'
# Fetch the page source
def spider(url):
    # headers only matter for the requests fallback below; Selenium drives a real browser itself
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.114 Safari/537.36'}
    driver = webdriver.Chrome()
    driver.get(url)
    response = driver.page_source
    # response = requests.get(url, headers=headers)
    # response.encoding = 'utf-8'
    return response
# Scrape one page of product data
def getEverypro(source):
    html_element = etree.HTML(source)
    proItemList = html_element.xpath('//*[@id="gform"]/div/div[4]/div[1]/ul')
    print(proItemList)
    proList = []
    for eachpro in proItemList:
        for i in ['1', '2', '3', '4', '5', '6', '7', '8', '9', '10', '11', '12']:
            proDict = {}
            name = eachpro.xpath('//*[@id="gform"]/div[@class="htzszy_layout ptb40"]/div[@class="htzszy_listmode mt16"]/div[@class="htzszy_prolist"]/ul/li['+i+']/div[@class="proname"]/text()')
            proDict['name'] = name
            company1 = eachpro.xpath('//*[@id="gform"]/div[@class="htzszy_layout ptb40"]/div[@class="htzszy_listmode mt16"]/div[@class="htzszy_prolist"]/ul/li['+i+']/div[@class="htzszy_proinfo clearfix"]/table/tbody/tr[1]/td[1]/text()')
            proDict['company'] = company1
            didian1 = eachpro.xpath('//*[@id="gform"]/div[@class="htzszy_layout ptb40"]/div[@class="htzszy_listmode mt16"]/div[@class="htzszy_prolist"]/ul/li['+i+']/div[@class="htzszy_proinfo clearfix"]/table/tbody/tr[1]/td[2]/text()')
            proDict['didian'] = didian1
            price1 = eachpro.xpath('//*[@id="gform"]/div[@class="htzszy_layout ptb40"]/div[@class="htzszy_listmode mt16"]/div[@class="htzszy_prolist"]/ul/li['+i+']/div[@class="htzszy_proinfo clearfix"]/table/tbody/tr[2]/td[1]/p[@class="nprice "]/span[@class="price"]/span/text()')
            proDict['price'] = price1
            proList.append(proDict)
    return proList
# Jump to a given page and return its source
def click(url, pageno):
    driver = webdriver.Chrome()
    driver.get(url)
    # type the target page number into the jump box at the bottom of the page
    w = driver.find_element(By.XPATH, '//*[@id="pageno"]')
    w.send_keys(str(pageno))
    # press the jump button next to the box
    next_page = driver.find_element(By.XPATH, '//*[@id="gform"]/div/div[5]/div/div/ul/div[2]/span[4]/a')
    next_page.click()
    page = driver.page_source
    return page
# Save the data
def writeData(proList):
    with open('数据.csv', 'w', encoding='utf-8', newline='') as f:
        writer = csv.DictWriter(f, fieldnames=['name', 'company', 'didian', 'price'])
        writer.writeheader()
        for each in proList:
            writer.writerow(each)
# Run
if __name__ == '__main__':
    proList = []
    r = spider(htUrl)
    proList += getEverypro(r)
    # choose which extra pages to fetch (page 1 is always scraped above)
    for i in range(2, 4):
        y = click(htUrl, i)
        proList += getEverypro(y)
    writeData(proList)
Happy May Day holiday, everyone!!!