Scraping Nationwide Air Quality Data

The weather isn't great today. I was just checking it when I noticed that this site has good data, so today let's scrape the whole thing! Link: http://www.pm25.com/

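First, open the site and check whether the data is present in the page's response source by inspecting the requests in the browser's Network panel. The result: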

[Image 1: Network panel capture of the page response]
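The data is right there in the response source. So we parse the page's response with lxml, turning the source into an etree tree, use XPath to extract the link for each city, request every link, and then parse the response of each detail page (for example, the Beijing detail page: http://www.pm25.com/beijing.html). That is the rough plan; at the end the data is saved to a CSV file. Some XPath queries come back empty on certain pages, and indexing into the empty result raises an error, so every lookup is wrapped in a try. The code: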

# -*- coding: utf-8 -*-
# @Time : 2021/1/23 3:27
# @FileName: Pm2.5.py
# @Software: PyCharm

import requests
from lxml import etree
import csv


class Weather():
    # Initialization
    def __init__(self):
        # Start URL; parse_data later overwrites self.url with each city's link
        self.url = 'http://www.pm25.com/'
        self.headers = {
            "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) "
                          "Chrome/87.0.4280.88 Safari/537.36 "
        }

    # Send a GET request to self.url and return the response
    def get_data(self):
        response = requests.get(url=self.url, headers=self.headers)
        return response

    # Parse the index page, then request and parse each city's detail page
    def parse_data(self, response):
        html = etree.HTML(response.content)
        link_list = html.xpath('//*[@id="scrollbar1"]/div[3]/div/div[3]/div/dl/dd/a/@href')
        for link in link_list:
            link = 'http://www.pm25.com' + link
            # Request and parse the detail page
            self.url = link
            response = self.get_data()
            html = etree.HTML(response.content)
            # Reset every field for each city, so a failed lookup leaves an empty
            # string instead of an undefined name or the previous city's value
            city_name = qua = aqi_num = pm = add_weather = ''
            try:
                city_name = html.xpath("/html/body/div[6]/div/div[1]/h2/text()")[0]
            except IndexError:
                pass
            try:
                qua = html.xpath("/html/body/div[6]/div/div[3]/div[1]/p/span[1]/text()")[0]
            except IndexError:
                pass
            try:
                aqi_num = html.xpath("/html/body/div[6]/div/div[3]/div[1]/a/text()")[0]
            except IndexError:
                pass
            try:
                pm = html.xpath("/html/body/div[6]/div/div[3]/div[2]/p[1]/span/text()")[0] + ' ug/m3'
            except IndexError:
                pass
            try:
                wea = html.xpath("/html/body/div[6]/div/div[4]/div/p/span/text()")[0]
                temp = html.xpath("/html/body/div[6]/div/div[4]/div/p/text()")[1]
                add_weather = wea + temp
            except IndexError:
                pass

            data = f"City: {city_name}, Air quality: {qua}, AQI: {aqi_num}, PM2.5: {pm}, Weather: {add_weather}"
            print(data)
            # No separate save function: the row is written right here, using the
            # module-level csv_writer created in the __main__ block
            csv_writer.writerow([city_name, qua, aqi_num, pm, add_weather])

    # Entry point: fetch the index page, then parse it
    def run(self):
        response = self.get_data()
        self.parse_data(response)


if __name__ == '__main__':
    # Make sure this runs only once; if it does not run exactly once, it will ...
    with open('info.csv', 'a', newline='') as f:
        csv_writer = csv.writer(f)
        csv_writer.writerow(["City", "Air quality", 'AQI', "PM2.5", "Weather"])
        weather = Weather()
        weather.run()
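
The script above builds each detail-page URL by string concatenation and guards every XPath lookup with its own try block. As a rough sketch of how those two steps might be tidied, the helpers below use only the standard library. The names `absolute_url`, `first_or_default` and `BASE_URL` are my own illustrative choices, not part of the original script, and the helper takes any list, which is what `html.xpath(...)` returns.

```python
from urllib.parse import urljoin

# Assumption: the same site root the script uses as its start URL.
BASE_URL = "http://www.pm25.com/"


def absolute_url(href, base=BASE_URL):
    """Resolve a link from the index page against the site root.

    urljoin copes with both "/beijing.html" and "beijing.html",
    whereas plain concatenation only works for the first form.
    """
    return urljoin(base, href)


def first_or_default(results, index=0, default=""):
    """Return results[index], or default when the query matched too few nodes."""
    try:
        return results[index]
    except IndexError:
        return default
```

With these in place, a lookup such as `city_name = first_or_default(html.xpath("..."))` would replace one try block, and the temperature lookup would pass `index=1`. A missing field then becomes an empty string for that city only.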

For convenience I didn't define a separate save function. The output looks like this:
[Image 2: output of the script]
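
If you did want a separate save function, one possible shape is sketched below. It is an assumption on my part rather than the script's actual design: `save_rows` and `HEADER` are hypothetical names, and the function writes the header row only when the file is new or empty, so running the scraper a second time in append mode would not repeat the header.

```python
import csv
import os

# Hypothetical column names for illustration; the script uses its own header row.
HEADER = ["city", "air_quality", "aqi", "pm25", "weather"]


def save_rows(path, rows, header=HEADER):
    """Append rows to a CSV file, writing the header only if the file is new or empty."""
    need_header = not os.path.exists(path) or os.path.getsize(path) == 0
    # newline="" lets the csv module control line endings, as in the script
    with open(path, "a", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        if need_header:
            writer.writerow(header)
        writer.writerows(rows)
```

In this variant `parse_data` would collect `[city_name, qua, aqi_num, pm, add_weather]` lists and return them, and the caller would pass the result to `save_rows('info.csv', rows)` instead of relying on a module-level writer.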
The CSV file we saved looks like this:
[Image 3: the saved CSV file]

If you enjoyed this, please give it a like!

