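PhantomJS cannot fetch the content of an https URL

PhantomJS is a headless browser that is widely used for crawling, because it can fetch data that a page loads dynamically.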

goods_url = "https://xueqiu.com/u/5832323914"
xpath0 = "(//div[@id='app']/div[contains(@class,'container')]/div[@class='profiles__main']/div[@class='profiles__timeline__bd']/article[@class='timeline__item']/div[@class='timeline__item__main']/div[@class='timeline__item__bd']/div[@class='timeline__item__content']/div[contains(@class,'content')]/div)[2]"
from selenium import webdriver
browser = webdriver.PhantomJS(executable_path = './phantomjs',service_args=['--ssl-protocol=any'])
browser.get(goods_url)
res = browser.find_element_by_xpath(xpath0)  # look up the content
print(res.text)
print(browser.page_source)  # the variable is named browser, not driver
browser.quit()

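Today I ran into a strange problem: the Python crawler code above works for many URLs, but not for the URL shown above. res = browser.find_element_by_xpath(xpath0) cannot find the content, and print(browser.page_source) does not show the expected page content either.

After a good deal of searching on Baidu and Google, I finally found a solution.

The code above works for some URLs but fails for certain https URLs.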

2. Solution: PhantomJS settings

URLs served over plain http cause no trouble. Some https URLs fail, apparently because the SSL checks do not pass, so extra settings are needed.
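
As a minimal sketch of that idea, the helper below returns SSL-related PhantomJS command-line flags only when the target URL uses https. The function name ssl_service_args is hypothetical and is not part of Selenium or PhantomJS; the two flags are the ones that appear in this post, and the returned list is meant to be passed as service_args to webdriver.PhantomJS.

```python
from urllib.parse import urlparse


def ssl_service_args(url):
    """Return PhantomJS command-line flags suited to the URL's scheme.

    Hypothetical helper for illustration; not a Selenium or PhantomJS API.
    """
    if urlparse(url).scheme != "https":
        # Plain http pages need no SSL-related flags.
        return []
    return [
        "--ignore-ssl-errors=true",  # do not abort on certificate errors
        "--ssl-protocol=any",        # let PhantomJS negotiate any protocol version
    ]


print(ssl_service_args("https://xueqiu.com/u/5832323914"))
print(ssl_service_args("http://example.com/"))
```

Passing both flags for every URL would also work; the scheme check only makes explicit that plain http pages do not need them.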

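Reference: https://www.cnblogs.com/fly-kaka/p/6656196.html

I removed the redundant settings from that post.

(Many posts online suggest setting service_args=['--ignore-ssl-errors=true']. In my tests, however, the problem was desired_capabilities=cap (that is, the headers), and more precisely cap["phantomjs.page.customHeaders.User-Agent"] = ua. Different sites may behave differently, since this is a kind of anti-crawler mechanism, so setting all of them is the safer option.)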

# Settings
from selenium import webdriver

# The original post does not define ua; this User-Agent string is only an example value.
ua = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36"
# Copy the defaults so the shared DesiredCapabilities.PHANTOMJS dict is not modified.
cap = webdriver.DesiredCapabilities.PHANTOMJS.copy()
cap["phantomjs.page.settings.resourceTimeout"] = 2000  # milliseconds
cap["phantomjs.page.settings.loadImages"] = True
cap["phantomjs.page.settings.disk-cache"] = True
cap["phantomjs.page.settings.userAgent"] = ua
cap["phantomjs.page.customHeaders.User-Agent"] = ua
cap["phantomjs.page.customHeaders.Referer"] = "http://tj.ac.10086.cn/login/"
# executable_path gives the location of phantomjs.exe; it can be omitted if phantomjs is on PATH.
# Create the browser once, after cap is complete, so that no stray PhantomJS process is left running.
browser = webdriver.PhantomJS(desired_capabilities=cap, service_args=['--ignore-ssl-errors=true'])

# Start using it
goods_url = "https://xueqiu.com/u/5832323914"
xpath0 = "(//div[@id='app']/div[contains(@class,'container')]/div[@class='profiles__main']/div[@class='profiles__timeline__bd']/article[@class='timeline__item']/div[@class='timeline__item__main']/div[@class='timeline__item__bd']/div[@class='timeline__item__content']/div[contains(@class,'content')]/div)[2]"
browser.get(goods_url)
res = browser.find_element_by_xpath(xpath0)  # look up the content
print(res.text)
print(browser.page_source)
browser.quit()
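
Because the setting that matters can differ from site to site, one way to narrow it down is to try the settings one at a time. The sketch below only builds the candidate configurations and never starts a browser; candidate_configs and the labels are hypothetical names, and the User-Agent value in the usage loop is a placeholder. Each (label, service_args, caps) triple could then be passed to webdriver.PhantomJS(desired_capabilities=caps, service_args=service_args) to see which one makes the XPath lookup succeed.

```python
def candidate_configs(ua, base_caps=None):
    """Return (label, service_args, caps) triples, each varying one setting.

    Hypothetical helper for illustration. base_caps stands in for a copy of
    webdriver.DesiredCapabilities.PHANTOMJS; a small default is assumed here
    so the function runs without Selenium installed.
    """
    base = dict(base_caps or {"browserName": "phantomjs"})
    ssl_flag = "--ignore-ssl-errors=true"
    ua_settings = {
        "phantomjs.page.settings.userAgent": ua,
        "phantomjs.page.customHeaders.User-Agent": ua,
    }
    return [
        ("baseline", [], dict(base)),
        ("ssl-flag-only", [ssl_flag], dict(base)),
        ("user-agent-only", [], {**base, **ua_settings}),
        ("ssl-flag-and-user-agent", [ssl_flag], {**base, **ua_settings}),
    ]


# Show which settings each candidate would apply.
for label, args, caps in candidate_configs("ExampleBrowser/1.0 (placeholder)"):
    print(label, args, sorted(caps))
```

Every triple gets its own list and dict, so changing one candidate while testing does not affect the others.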
