Contents
1. Purpose of the Code
2. Preparation
3. Code
4. Pitfalls Encountered
    1. Incomplete loading
    2. Locating elements
5. Results and Summary
1. Purpose of the Code

The main goal is to learn how to drive a browser with Selenium.
2. Preparation

Browser: Chrome
Driver: chromedriver (in the same directory as python.exe)
Libraries used: lxml, selenium
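A quick sanity check for this setup, as a minimal sketch: it assumes Selenium 3.x (the API the script below uses) and that chromedriver is on the PATH, which placing it next to python.exe achieves. The explicit-path variant and its location are illustrative, not from the original post.

from selenium import webdriver

# chromedriver is picked up from PATH (e.g. next to python.exe)
browser = webdriver.Chrome()
# or pass an explicit path (Selenium 3.x signature; path is hypothetical):
# browser = webdriver.Chrome(executable_path=r"C:\tools\chromedriver.exe")
browser.get("https://www.jd.com/")
print(browser.title)
browser.quit()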
3. Code

import time
from selenium import webdriver
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from lxml import etree

browser = webdriver.Chrome()
browser.get("https://www.baidu.com")
wait = WebDriverWait(browser, 50)

def search():
    browser.get('https://www.jd.com/')
    try:
        # presence_of_all_elements_located returns a list of elements
        input_box = wait.until(
            EC.presence_of_all_elements_located((By.CSS_SELECTOR, "#key"))
        )
        submit = wait.until(
            EC.element_to_be_clickable((By.CSS_SELECTOR, "#search > div > div.form > button"))
        )
        input_box[0].send_keys('python')
        submit.click()
        # total page count shown at the bottom of the result page
        total = wait.until(
            EC.presence_of_all_elements_located(
                (By.CSS_SELECTOR, '#J_bottomPage > span.p-skip > em:nth-child(1) > b')
            )
        )
        html = browser.page_source
        parse_html(html)
        return total[0].text
    except TimeoutException:
        # the wait raises TimeoutException, not the builtin TimeoutError; retry
        return search()

def next_page(page_number):
    try:
        # scroll to the bottom so the remaining thirty items lazy-load
        browser.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(10)
        # click the "next page" button
        button = wait.until(
            EC.element_to_be_clickable((By.CSS_SELECTOR, '#J_bottomPage > span.p-num > a.pn-next > em'))
        )
        button.click()
        # wait until all sixty items of the new page are present
        wait.until(
            EC.presence_of_all_elements_located((By.CSS_SELECTOR, "#J_goodsList > ul > li:nth-child(60)"))
        )
        # confirm the pager now highlights the expected page number
        wait.until(
            EC.text_to_be_present_in_element((By.CSS_SELECTOR, "#J_bottomPage > span.p-num > a.curr"), str(page_number))
        )
        html = browser.page_source
        parse_html(html)
    except TimeoutException:
        return next_page(page_number)

def parse_html(html):
    html = etree.HTML(html)
    items = html.xpath('//li[@class="gl-item"]')
    for i in range(len(items)):
        # lazy-loaded images keep their URL in data-lazy-img until it is set to "done"
        img = html.xpath('//div[@class="p-img"]//img')[i]
        if img.get('data-lazy-img') != "done":
            print("img:", img.get('data-lazy-img'))
        else:
            print("img:", img.get('src'))
        print("title:", html.xpath('//div[@class="p-name"]//em')[i].xpath('string(.)'))
        print("price:", html.xpath('//div[@class="p-price"]//i')[i].text)
        print("commit:", html.xpath('//div[@class="p-commit"]//a')[i].text)
        print("+" * 71)  # separator between items

def main():
    print("Page", 1, ":")
    total = int(search())
    for i in range(2, total + 1):
        time.sleep(3)
        print("Page", i, ":")
        next_page(i)

if __name__ == "__main__":
    main()
4. Pitfalls Encountered

1. Incomplete loading

JD shows sixty products per page, but only the first thirty are loaded; the remaining thirty load only when you scroll down. So my original plan of just waiting for the page-number element to appear did not work, and I added a step that scrolls to the bottom of the page (see the sketch below).
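A minimal sketch of that scroll-then-wait step. The selector targets the same #J_goodsList structure the script above uses; waiting explicitly for the sixtieth list item is an alternative to the fixed time.sleep(10), not what the original script does, and the helper name is mine.

from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

def load_full_page(browser, timeout=50):
    # scroll to the bottom so the lazy loader fetches the last thirty items
    browser.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    # wait until the 60th item exists instead of sleeping a fixed time
    WebDriverWait(browser, timeout).until(
        EC.presence_of_element_located(
            (By.CSS_SELECTOR, "#J_goodsList > ul > li:nth-child(60)")
        )
    )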
2. Locating elements

My knowledge of web pages is still shallow, and I don't really understand how they work. When parsing the page, the title, price, and comment count were easy to find, but the image gave me trouble: in the element inspector the image URL sits in the src attribute, yet get("src") returned None for most items. Inspecting the returned page_source, I found that the URL usually lives in data-lazy-img, and only sometimes in src. So when the value of data-lazy-img is not "done" I print data-lazy-img, and otherwise I print src; after that change, the extraction printed the image URL for essentially every item. (Note: the img element sits inside an element with class="p-img".)
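That branch, isolated as a small helper for a single img element parsed with lxml; the attribute names are the ones observed above, and the helper name is mine.

def extract_img(img):
    lazy = img.get('data-lazy-img')
    # before the lazy loader runs, the real URL is in data-lazy-img;
    # once it has run, data-lazy-img is set to "done" and src holds the URL
    if lazy is not None and lazy != "done":
        return lazy
    return img.get('src')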
5. Results and Summary

I now have a basic understanding of Selenium, but my grasp of how web pages work is still not solid, and I can't yet locate elements with much confidence. More practice is needed.