1. The requests module
The get() method of the requests module is one of the most commonly used ways to fetch a page.
First, install the requests module: pip install requests
Next, see the following example:
import requests
import urllib3

def test():
    url = 'https://www.toutiao.com/'
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.117 Safari/537.36',
        'Host': 'www.toutiao.com',
        'Accept': 'image/webp,image/apng,image/*,*/*;q=0.8',
        'Connection': 'keep-alive',
        'Accept-Encoding': 'gzip, deflate, br',
        'Accept-Language': 'zh-CN,zh;q=0.9',
        # 'Cookie': ''
    }
    proxies = {"http": "http://182.96.186.173:3276",
               "https": "http://182.96.186.173:3276"}
    try:
        urllib3.disable_warnings()  # suppress the InsecureRequestWarning caused by verify=False
        body = requests.get(url, headers=headers, proxies=proxies,
                            timeout=10, verify=False).text
        print(body)
    except requests.RequestException as e:
        print('something is wrong!', e)

if __name__ == "__main__":
    test()
Here, url is the address of the page being requested, headers is the request header, proxies specifies the proxy IP, timeout sets the response timeout, verify=False skips SSL certificate verification, and urllib3.disable_warnings() suppresses the warning that verify=False would otherwise trigger. For details, see:
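As a side note, you can check exactly what requests will send, without touching the network, by building a PreparedRequest. A minimal sketch (the User-Agent string here is just a placeholder, not from the original article):

```python
import requests

# Build and prepare a request locally; nothing is sent over the network.
req = requests.Request('GET', 'https://www.toutiao.com/',
                       headers={'User-Agent': 'my-crawler/0.1'})
prepared = req.prepare()

print(prepared.method)                    # GET
print(prepared.url)
print(prepared.headers['User-Agent'])
```

This is handy for debugging header or URL problems before pointing the crawler at a real site.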
Source repository: (link)
Chinese API documentation: (link)
2. The urllib module
Python 3 has no urllib2 module; its functionality was merged into urllib. The urllib module ships with Python 3 and needs no installation.
See the following example:
from urllib import request
from fake_useragent import UserAgent

def test2():
    url = 'https://www.toutiao.com/'
    proxies = {"http": "http://182.96.186.173:3276"}
    ua = UserAgent()
    try:
        proxy_handler = request.ProxyHandler(proxies)
        opener = request.build_opener(proxy_handler)
        opener.addheaders = [('User-Agent', ua.random)]  # random User-Agent on each run
        request.install_opener(opener)
        body = request.urlopen(url, timeout=5)
        response = body.read().decode('utf-8')  # note: the original read from an undefined 'soup'
        print(response)
    except Exception as e:
        print('something is wrong!', e)

if __name__ == '__main__':
    test2()
The fake_useragent package supplies a different random User-Agent string each time; for details, see: (link)
For details on urllib itself, see the urllib module API documentation.
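Instead of installing a global opener, urllib also lets you attach headers per request via a Request object. A minimal sketch, built but not sent over the network (the User-Agent value here is a placeholder):

```python
from urllib import request

# Build a Request with a custom header; this performs no network I/O.
req = request.Request('https://www.toutiao.com/',
                      headers={'User-Agent': 'Mozilla/5.0'})

print(req.full_url)
# urllib normalizes header keys with str.capitalize(), so query with 'User-agent'
print(req.get_header('User-agent'))
```

To actually fetch the page you would pass req to request.urlopen(req, timeout=5), which keeps the header scoped to this one request.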
3. Combining selenium with PhantomJS
For sites that load content dynamically via Ajax, simulating browser interaction solves the problem.
See the following example:
import time
from selenium import webdriver
from bs4 import BeautifulSoup

def test3():
    url = 'https://www.toutiao.com/'
    try:
        # Note: PhantomJS support was deprecated in Selenium 3.8 and removed in Selenium 4
        driver = webdriver.PhantomJS()
        driver.get(url)
        time.sleep(1)  # give dynamically loaded content time to render
        response = BeautifulSoup(driver.page_source, 'html.parser')
        print(response)
        driver.refresh()
        driver.implicitly_wait(1)
    except Exception as e:
        print('something is wrong!', e)

if __name__ == '__main__':
    test3()
Here, implicitly_wait() sets an implicit wait for elements to appear, and BeautifulSoup() parses the page source.
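The parsing step works the same whether the HTML comes from driver.page_source or anywhere else. A minimal sketch on a hand-written snippet (the HTML here is made up for illustration):

```python
from bs4 import BeautifulSoup

# A static snippet standing in for selenium's driver.page_source.
html = '<html><body><h1 class="title">Hello</h1><a href="/news">news</a></body></html>'
soup = BeautifulSoup(html, 'html.parser')

print(soup.h1.text)                            # Hello
print(soup.a['href'])                          # /news
print(soup.find('h1', class_='title').text)    # Hello
```

Tag access (soup.h1), attribute lookup (soup.a['href']), and find() with a class filter cover most of what a simple crawler needs from the rendered page.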
For details, see: Selenium with Python
4. The scrapy framework
To be covered in a later update; for now, see the scrapy introductory tutorial.