An Overview of Common Python 3 Web Crawling Techniques

1. The requests module

The get() method of the requests module is one of the most commonly used approaches.

First, install the requests module: pip install requests

Then, see the code below:

import requests
import urllib3

def test():
    url = 'https://www.toutiao.com/'
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.117 Safari/537.36',
        'Host': 'www.toutiao.com',
        'Accept': 'image/webp,image/apng,image/*,*/*;q=0.8',
        'Connection': 'keep-alive',
        'Accept-Encoding': 'gzip, deflate, br',
        'Accept-Language': 'zh-CN,zh;q=0.9',
        # 'Cookie': ''
    }
    proxies = {"http": "http://182.96.186.173:3276",
               "https": "http://182.96.186.173:3276"}
    try:
        # Suppress the InsecureRequestWarning that verify=False triggers
        urllib3.disable_warnings()
        body = requests.get(url, headers=headers, proxies=proxies,
                            timeout=10, verify=False).text
        print(body)
    except requests.RequestException as e:
        print('something is wrong:', e)

if __name__ == "__main__":
    test()

Here, url is the address of the page to request, headers is the set of request headers, proxies supplies the proxy IP, timeout sets the response timeout, verify=False skips SSL certificate verification, and urllib3.disable_warnings() suppresses the InsecureRequestWarning that verify=False would otherwise trigger.
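Because free proxies like the one above are often unstable, it can help to retry failed requests. Below is a minimal sketch (not part of the original example) that reuses a requests Session with urllib3's Retry; the retry count and backoff values are arbitrary choices:

import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

session = requests.Session()
# Retry up to 3 times on connection errors and common 5xx responses
retry = Retry(total=3, backoff_factor=0.5,
              status_forcelist=[500, 502, 503, 504])
session.mount('http://', HTTPAdapter(max_retries=retry))
session.mount('https://', HTTPAdapter(max_retries=retry))

resp = session.get('https://www.toutiao.com/', timeout=10)
print(resp.status_code, resp.encoding)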

For details, see:

Open-source repository: (link)

Chinese API documentation: (link)

2. The urllib module

Python 3 has no urllib2 module; its functionality was merged into the urllib module. urllib ships with Python 3, so no installation is needed.
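As a quick sketch of that renaming, the Python 2 urllib2.urlopen() call maps directly onto urllib.request.urlopen():

# Python 2:
#   import urllib2
#   body = urllib2.urlopen(url).read()
# Python 3 equivalent:
from urllib import request
body = request.urlopen('https://www.toutiao.com/', timeout=5).read()
print(len(body))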

A fuller example, adding a proxy IP and a random User-Agent, is shown below:

from urllib import request
from fake_useragent import UserAgent  # pip install fake-useragent

def test2():
    url = 'https://www.toutiao.com/'
    proxies = {"http": "http://182.96.186.173:3276"}
    ua = UserAgent()
    try:
        proxy_handler = request.ProxyHandler(proxies)
        opener = request.build_opener(proxy_handler)
        opener.addheaders = [('User-Agent', ua.random)]
        request.install_opener(opener)
        body = request.urlopen(url, timeout=5)
        response = body.read().decode('utf-8')
        print(response)
    except Exception as e:
        print('something is wrong:', e)

if __name__ == '__main__':
    test2()

The fake_useragent package supplies a different, randomly chosen User-Agent string on each call; for a detailed write-up, see: (link)
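As a minimal illustration (the target URL is just the one used throughout this article), the random header can also be attached to a single Request object instead of installing a global opener:

from urllib import request
from fake_useragent import UserAgent

ua = UserAgent()
print(ua.random)  # a different browser string on each call

# Attach the header to one Request instead of a global opener
req = request.Request('https://www.toutiao.com/',
                      headers={'User-Agent': ua.random})
print(request.urlopen(req, timeout=5).status)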

For full details, see the urllib module API documentation.

3. Combining selenium with PhantomJS

For sites that load content dynamically via Ajax, you can drive a headless browser and simulate clicks to render the page.

See the code below:

from selenium import webdriver
from bs4 import BeautifulSoup
import time

def test3():
    url = 'https://www.toutiao.com/'
    try:
        driver = webdriver.PhantomJS()
        driver.get(url)
        time.sleep(1)  # give the page a moment to render
        response = BeautifulSoup(driver.page_source, 'html.parser')
        print(response)
        driver.refresh()
        driver.implicitly_wait(1)
        driver.quit()
    except Exception as e:
        print('something is wrong:', e)

if __name__ == '__main__':
    test3()

Here, implicitly_wait() sets an implicit wait, and BeautifulSoup() parses the rendered page source.
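The example above never actually clicks anything. Below is a minimal sketch of an explicit wait plus a simulated click using WebDriverWait; the (By.TAG_NAME, 'a') locator is only a placeholder, not a selector taken from toutiao.com:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.PhantomJS()
driver.get('https://www.toutiao.com/')
try:
    # Wait up to 10 seconds for the first link to become clickable
    link = WebDriverWait(driver, 10).until(
        EC.element_to_be_clickable((By.TAG_NAME, 'a')))
    link.click()
finally:
    driver.quit()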

For full details, see: selenium with python

4. The scrapy framework

To be covered in a future update; for now, see: scrapy framework beginner tutorial
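Until then, here is a minimal sketch of what a scrapy spider looks like (the spider name and parse logic are placeholders, not from the original article):

import scrapy

class ToutiaoSpider(scrapy.Spider):
    name = 'toutiao'
    start_urls = ['https://www.toutiao.com/']

    def parse(self, response):
        # Print the page title as a smoke test
        print(response.css('title::text').extract_first())

Save it as toutiao_spider.py and run it with: scrapy runspider toutiao_spider.py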
