Scraping historical weather data from tianqi.com with Python 3.7, and getting past the site's anti-scraping check

   Disclaimer: the data scraped by this code is for research and study only. Do not use it for commercial purposes; any commercial dispute arising from such use is the user's own responsibility.

       I recently needed historical PM2.5 data for all of China's provincial capital cities, and tianqi.com (http://www.tianqi.com) happens to offer a historical-data lookup, so I searched online for existing Python scraping code and mainly drew on this post: https://blog.csdn.net/haha_point/article/details/77197230#commentsedit. That post, however, has two problems:

1. It is written for Python 2.7, and Python 3 differs from Python 2 in quite a few ways. For this scraper the main difference is urllib: in 2.7 a plain import urllib is enough, while in 3.7 you need import urllib.request;
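For example, fetching a page looks like this in the two versions (a minimal sketch; only the module layout differs, the call itself behaves the same for our purposes):

# Python 2.7
# import urllib
# html = urllib.urlopen("http://www.tianqi.com").read()

# Python 3.7
import urllib.request
html = urllib.request.urlopen("http://www.tianqi.com").read()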

2. www.tianqi.com has added an anti-scraping check. Since we are after historical data, the pages we actually need follow the pattern http://lishi.tianqi.com/(city)/(date).html; for example, Beijing's data for July 2019 is at http://lishi.tianqi.com/beijing/201907.html. But if you open that link directly in a fresh browser, you get a page like this:

[Figure 1: the verification page returned when the history link is opened directly in a fresh browser]

This is also why many of the Python scrapers currently circulating online can no longer fetch the data. The reason is that you must visit www.tianqi.com first; only after that do the history pages above load normally. My guess is that the front end writes a cookie when you visit the www.tianqi.com homepage. So open www.tianqi.com, press F12, and inspect the cookie the page carries (see the screenshot below, with the cookie portion underlined in red). Then pass that cookie along in the request headers, using req = urllib.request.Request(url=url, headers=my_headers), where my_headers is the dict shown in the full code below (the Cookie entry is the part that matters).

[Figure 2: browser DevTools (F12) on www.tianqi.com, with the Cookie value underlined in red]
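Hard-coding a cookie copied from the browser works, but the cookie eventually expires. If you would rather not paste it by hand, you can try letting Python visit the homepage first and carry over whatever cookie the server sets, using the standard library's http.cookiejar. This is only a sketch of the idea and I have not verified it against this site; if the anti-scraping cookie is written by front-end JavaScript rather than by a Set-Cookie header, this will not pick it up and the manual copy above is still needed:

import http.cookiejar
import urllib.request

# A jar that remembers cookies across requests made through this opener.
cookie_jar = http.cookiejar.CookieJar()
opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(cookie_jar))
opener.addheaders = [("User-Agent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64)")]

# Visit the homepage first so any server-set cookie lands in the jar,
# then request the history page through the same opener.
opener.open("http://www.tianqi.com")
html = opener.open("http://lishi.tianqi.com/beijing/201907.html").read()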

 The full code is as follows:

import socket
import urllib.error
import urllib.request

from bs4 import BeautifulSoup

# Give up on connections that stall for more than 30 seconds.
socket.setdefaulttimeout(30.0)


def parseTianqi(url):
    my_headers = {
        "Host": "lishi.tianqi.com",
        "Connection": "keep-alive",
        "Upgrade-Insecure-Requests": "1",
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.99 Safari/537.36",
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3",
        "Accept-Language": "zh-CN,zh;q=0.8,en;q=0.6",
        # No Accept-Encoding header: urllib does not decompress gzip automatically.
        "Referer": "http://lishi.tianqi.com/",
        # Copy this Cookie value from your own browser session (see Figure 2);
        # the one below came from my session and will eventually expire.
        "Cookie": "cityPy=xianqu; cityPy_expire=1565422933; UM_distinctid=16c566dd356244-05e0d9cb0c361-3f385c06-1fa400-16c566dd357642; Hm_lvt_ab6a683aa97a52202eab5b3a9042a8d2=1564818134; CNZZDATA1275796416=927309794-1564814113-%7C1564814113; Hm_lpvt_ab6a683aa97a52202eab5b3a9042a8d2=1564818280"}
    req = urllib.request.Request(url=url, headers=my_headers)
    fails = 0
    while fails < 3:
        try:
            response = urllib.request.urlopen(req)
            # The page is served as GBK; decode it so BeautifulSoup gets a str.
            return response.read().decode('gbk')
        except urllib.error.URLError:
            fails += 1
            print('Network problem, retrying:', fails)
    return None



def writeCsv(data, file_name):
    out = open(file_name, 'w', encoding='utf-8')
    soup = BeautifulSoup(data, 'html.parser')
    # The history table sits in <div class="tqtongji2">: one <ul> per row,
    # with the first <ul> holding the column headers (the first <a> inside
    # the div is just the month heading, so it is not written out).
    weather_list = soup.select('div[class="tqtongji2"]')
    for weather in weather_list:
        ul_list = weather.select('ul')
        for i, ul in enumerate(ul_list):
            if i == 0:
                continue  # skip the header row
            li_list = ul.select('li')
            row = ','.join((li.string or '') for li in li_list)
            out.write(row + '\n')
    out.close()


# Fetch Beijing's July 2019 history page and write it to a CSV file.

if __name__ == "__main__":
    data = parseTianqi("http://lishi.tianqi.com/beijing/201907.html")
    if data:
        writeCsv(data, "beijing_201907.csv")
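
Since my real goal was the historical data for every provincial capital, the same two functions can be driven from a loop. A sketch of that batch run; the city slugs and month list here are only illustrative, and I have not checked every pinyin slug against the site:

import time

cities = ["beijing", "shanghai", "guangzhou", "chengdu"]  # extend to the full capital list
months = ["201905", "201906", "201907"]
for city in cities:
    for month in months:
        url = "http://lishi.tianqi.com/{}/{}.html".format(city, month)
        data = parseTianqi(url)
        if data:
            writeCsv(data, "{}_{}.csv".format(city, month))
        time.sleep(2)  # be polite: pause between requests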

 
