pip install selenium
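Selenium controls each browser through a matching driver program:
Chrome driver: ChromeDriver
Firefox driver: geckodriver
IE driver: IEDriverServer
Edge driver: MicrosoftWebDriver (legacy Edge; Chromium-based Edge uses msedgedriver)
Opera driver: OperaDriver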
As in the previous chapter, the example scrapes movie titles from the Douban Top 250 movie list.
Import the package:
from selenium import webdriver
url = 'https://movie.douban.com/top250'
browser = webdriver.Chrome()
browser.get(url)
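# Alternatively, start Chrome with options instead of the plain webdriver.Chrome() call
opts = webdriver.ChromeOptions()
opts.add_argument('--headless')  # run without a visible browser window
opts.add_argument('--no-sandbox')  # works around the "DevToolsActivePort file doesn't exist" error
opts.add_argument('blink-settings=imagesEnabled=false')  # do not load images
browser = webdriver.Chrome(options=opts)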
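The option handling above can be wrapped in a small helper so the switches are easy to toggle. The following is only a sketch: `chrome_arguments` and `open_browser` are hypothetical names, not part of Selenium, and the switches are exactly the three used above.

```python
def chrome_arguments(headless=True, load_images=False):
    """Return the Chrome switches from the snippet above as a list of strings."""
    args = []
    if headless:
        args.append("--headless")  # no visible browser window
    args.append("--no-sandbox")  # avoids the DevToolsActivePort error
    if not load_images:
        args.append("blink-settings=imagesEnabled=false")  # skip images
    return args


def open_browser(url, headless=True, load_images=False):
    """Start Chrome with the chosen switches and load url (hypothetical helper)."""
    # Imported here so chrome_arguments() stays usable without Selenium installed.
    from selenium import webdriver

    opts = webdriver.ChromeOptions()
    for arg in chrome_arguments(headless, load_images):
        opts.add_argument(arg)
    browser = webdriver.Chrome(options=opts)
    browser.get(url)
    return browser


# Usage (needs Chrome and Selenium installed):
#     browser = open_browser("https://movie.douban.com/top250")
#     try:
#         print(browser.title)
#     finally:
#         browser.quit()  # always release the browser process
```

Wrapping the work in `try`/`finally` with `browser.quit()` keeps a headless Chrome process from lingering if the scraping code raises.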
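from selenium.webdriver.common.by import By
selector = '#content > div > div.article > ol > li:nth-child(1) > div > div.info > div.hd > a > span:nth-child(1)'
# find_element_by_css_selector() was removed in Selenium 4.3; use find_element() with By
movie_name = browser.find_element(By.CSS_SELECTOR, selector).text
print(movie_name)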
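Selenium offers many ways to locate elements; changing element in a method name to the plural elements locates every matching element instead of the first one. The find_element_by_* helpers listed here belong to Selenium 3 and were removed in Selenium 4.3; current releases use find_element(By.<STRATEGY>, value), with By imported from selenium.webdriver.common.by:
- XPath: find_element_by_xpath()
- id: find_element_by_id()
- name: find_element_by_name()
- class: find_element_by_class_name()
- tag: find_element_by_tag_name()
- link text: find_element_by_link_text()
- partial link text: find_element_by_partial_link_text()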
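For code written against the older method names, the correspondence to the Selenium 4 style can be sketched as a lookup table. The strings below are the values of the `By` constants in `selenium.webdriver.common.by`; the helper name `modern_call` is made up for this illustration.

```python
import re

# Suffix of the legacy method name -> string value of the matching By constant
STRATEGIES = {
    "id": "id",                                # By.ID
    "name": "name",                            # By.NAME
    "xpath": "xpath",                          # By.XPATH
    "css_selector": "css selector",            # By.CSS_SELECTOR
    "class_name": "class name",                # By.CLASS_NAME
    "tag_name": "tag name",                    # By.TAG_NAME
    "link_text": "link text",                  # By.LINK_TEXT
    "partial_link_text": "partial link text",  # By.PARTIAL_LINK_TEXT
}

_LEGACY_NAME = re.compile(r"^(find_elements?)_by_(\w+)$")


def modern_call(legacy_name):
    """Map e.g. 'find_elements_by_xpath' to ('find_elements', 'xpath')."""
    match = _LEGACY_NAME.match(legacy_name)
    if match is None or match.group(2) not in STRATEGIES:
        raise ValueError(f"not a known legacy locator method: {legacy_name!r}")
    return match.group(1), STRATEGIES[match.group(2)]


if __name__ == "__main__":
    print(modern_call("find_element_by_class_name"))
```

With a live driver, the returned pair could be applied as `getattr(browser, method)(strategy, value)`, which is equivalent to calling `browser.find_element(By.CLASS_NAME, value)` directly.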
# The selectors for the first and second movie differ only in li:nth-child(n)
Xiao = '#content > div > div.article > ol > li:nth-child(1) > div > div.info > div.hd > a > span:nth-child(1)'
Wang = '#content > div > div.article > ol > li:nth-child(2) > div > div.info > div.hd > a > span:nth-child(1)'
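from selenium.webdriver.common.by import By
# Dropping :nth-child(n) from li makes the selector match every movie on the page
selector = '#content > div > div.article > ol > li > div > div.info > div.hd > a > span:nth-child(1)'
movie_names = browser.find_elements(By.CSS_SELECTOR, selector)
for movie_name in movie_names:
    print(movie_name.text)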
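The step from a single-movie selector to the all-movies selector can be made explicit with a small builder. This is a sketch only; `title_selector` and `SELECTOR_TEMPLATE` are hypothetical names, and the template is simply the selector used above with the `li` part left open.

```python
SELECTOR_TEMPLATE = (
    "#content > div > div.article > ol > {li} > div > div.info "
    "> div.hd > a > span:nth-child(1)"
)


def title_selector(rank=None):
    """Build the title selector for one list position, or for all when rank is None."""
    if rank is None:
        li = "li"  # no :nth-child() filter, so every <li> matches
    else:
        if rank < 1:
            raise ValueError("rank is 1-based, as in CSS :nth-child()")
        li = f"li:nth-child({rank})"
    return SELECTOR_TEMPLATE.format(li=li)


if __name__ == "__main__":
    print(title_selector(2))  # selector for the second movie on the page
    print(title_selector())   # selector matching every movie on the page
```

`title_selector(1)` reproduces the selector used for the first movie, and `title_selector()` reproduces the generalized one passed to the plural lookup.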
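from selenium import webdriver
from selenium.webdriver.common.by import By

url = 'https://movie.douban.com/top250'
# Uncomment the next three lines (and remove the plain Chrome() call) to run headless
# opts = webdriver.ChromeOptions()
# opts.add_argument('--headless')
# browser = webdriver.Chrome(options=opts)
browser = webdriver.Chrome()
browser.get(url)
selector = '#content > div > div.article > ol > li > div > div.info > div.hd > a > span:nth-child(1)'
# find_elements() returns a list of every element matching the selector
movie_names = browser.find_elements(By.CSS_SELECTOR, selector)
for movie_name in movie_names:
    print(movie_name.text)
browser.quit()  # close the browser and end the driver session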
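If the titles are needed as data instead of printed output, the lookup loop can be moved into a function. This is a minimal sketch: `collect_titles` is a hypothetical helper that assumes only that `browser` provides Selenium 4's `find_elements(by, value)` method, so it can also be exercised with a stand-in object.

```python
def collect_titles(browser, selector):
    """Return the stripped, non-empty text of every element matching selector."""
    # "css selector" is the string value of By.CSS_SELECTOR in Selenium 4.
    elements = browser.find_elements("css selector", selector)
    titles = []
    for element in elements:
        text = element.text.strip()
        if text:  # skip elements that render no visible text
            titles.append(text)
    return titles


# Usage with a real driver (needs Chrome and Selenium installed):
#     from selenium import webdriver
#     browser = webdriver.Chrome()
#     browser.get("https://movie.douban.com/top250")
#     try:
#         titles = collect_titles(browser, "div.hd > a > span:nth-child(1)")
#     finally:
#         browser.quit()
```

The shorter selector in the usage comment is an assumption for illustration; the full selector from the script works the same way.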