一个简单的抓取中国城市空气质量的python包chinaAQI

中国地级市的空气质量查询包

几日之前,帮助别人写了个程序来抓取国家环保部网站上全国各个地级市的空气质量。
当时觉得写完就得了,代码就一直放着没动,现在整理文件夹,又翻了出来,觉得当时写的结构不好,重新把组织结构想了一遍。
由于觉得可能同样有人会需要查询这些数均进行分析,我就直接把它们写成了个包,方便别人使用。
有兴趣的可以直接pip install chinaAQI试一试, 或者访问http://github.com/ludlows/chinaAQI。

用法:

   使用python3写的:Python version: 3.0 or above
    pip install chinaAQI
    import datetime
    import chinaAQI
    city = '北京市'
    start = datetime.date(2015,1,1)
    end = datetime.date(2016,1,1)
    crawler = chinaAQI.Crawler(city, start, end)
    crawler.scrapy()
    print(crawler.data)

   返回的数据类型为:列表,其中每个元素是个元组。如下:
(1.number 2.city name     3.date            4.air quality index 5.air quality degree 6.primary pollutan)
('1',      ,         '北京市',  '2015-12-28',         '149',                         '轻度污染',          'PM2.5')

实现方法:

    1.全局变量:

     
URLBASE = 'http://datacenter.mep.gov.cn/report/air_daily/air_dairy.jsp?city={}&startdate={}&enddate={}'
AQIKINDS = ('优', '良', '轻度污染', '中度污染', '重度污染',  '严重污染')
URLBASE作为拼接url的基本结构。
AQIKINDS是空气质量的六个等级。

    
    2.URLBuilder


URLBuilder用来生成查询字符串。依赖urlib.request, urllib.parse
class URLBuilder(object):
    '''
    build url for getting data
    '''
    def __init__(self, name, start_str, end_str):
        '''
        name: str, city name, chinese
        start_str: date str, 2015-05-09
        end_str: date object, 2016-01-01
        '''
        self.name = name
        self.start_str = start_str
        self.end_str = end_str

    def build(self, page=1):
        '''
        return the query url
        '''
        if page == 1:
            return URLBASE.format(parse.quote(self.name), self.start_str, self.end_str)
        return ''.join((URLBASE,'&page={}')).format(parse.quote(self.name), self.start_str, self.end_str, page)

    3.TrParser

TrParser用来抽取html中的数据。依赖html.parser.HTMLParser
class TrParser(HTMLParser):
    '''
    '''
    def __init__(self):
        HTMLParser.__init__(self)
        self.data = []
        self.getData = False
        self.circle_data = []

    def handle_starttag(self, tag, attrs):
        if tag == 'tr' and ('height', '30') in attrs: # need notation!!!
            self.getData = True

    def handle_endtag(self, tag):
        if tag == 'tr':
            self.getData = False

    def handle_data(self, data):
        if self.getData:
            self.data.append(data)

4.Items

Items用于清洗数据,得到我们需要的所有信息。
class Items(object):
    def __init__(self, text):
        self.data = []
        self.trItems = TrParser()
        self.trItems.feed(text)

    def filter(self):
        self.data = (ele for ele in self.trItems.data if '\n' not in ele)
        value = []
        # add '' after '优'
        for ele in self.data:
            value.append(ele)
            if ele == '优':
                value.append('')
        self.data = value[16:]
        self.trItems.close()

5.TaskQueue

TaskQueue依赖threading,queue.Queue,是用来进行多线程抓取。
class TaskQueue(queue.Queue):
    def __init__(self, num_workers=1):
        queue.Queue.__init__(self)
        self.num_workers = num_workers
        self.start_workers()

    def add_task(self, task, *args, **kwargs):
        args = args or ()
        kwargs = kwargs or {}
        self.put((task, args, kwargs))

    def start_workers(self):
        for i in range(self.num_workers):
            task = threading.Thread(target=self.worker)
            task.daemon = True
            task.start()
    
    def worker(self):
        while True:
            tupl = self.get()
            item, args, kwargs = tupl
            item(*args, **kwargs)
            self.task_done()

6.Crawler

Crawler类利用以上工具,进行爬取,解析,返回数据。是这个包的接口。
class Crawler(object):
    def __init__(self, city_name, start_date, end_date):
        '''
        city_name: chinese name
        start_date, enddate: datetime.date() object
        '''
        self.query_builder = URLBuilder(city_name, start_date.isoformat(), end_date.isoformat())
        self.days_num = (end_date - start_date).days + 1
        self.data = None
    
    @staticmethod
    def scrapy_page(url_build, page_num, data_list, lock):
        url = url_build.build(page_num)
        html = request.urlopen(url)
        txt = html.read().decode('utf8')
        html.close()

        itms = Items(txt)
        itms.filter()

        print("the page num is {}, the data is:".format(page_num))
        print(itms.data)
        print("the length of data is", len(itms.data))
        with lock:
            idx = 0
            while idx <= len(itms.data)-6:
                sample = itms.data[idx:idx+6]
                if Crawler.checkKind(sample):
                    data_list.append(tuple(sample))
                    idx += 6
                else:
                    idx += 1
        return None

    @staticmethod
    def checkKind(sample):
        '''['1', '北京市', '2015-12-28', '149', '轻度污染', 'PM2.5']
        '''
        if not sample[0].isdigit():
            return False
        date = sample[2].replace('-','')
        if not date.isdigit():
            return False
        if not sample[3].isdigit():
            return False
        if sample[4] not in AQIKINDS:
            return False
        if sample[5].isdigit():
            return False
        return True

    def scrapy(self):
        '''
        return data in the style : [('1', '北京市','2015-12-28', '14', '优', ''), ('2', '北京市','2015-12-27', '75', '良', 'PM10')]
        '''
        num = self.days_num//30 + 1
        que = TaskQueue(5)
        self.data = []
        lock = threading.Lock()
        for pg_num in range(1, num+1):
            que.add_task(Crawler.scrapy_page, self.query_builder, pg_num, self.data, lock)
        que.join()
        return None

这样一个简易的工具包就做好了,如果你对空气质量指数有研究需要,相信chinaAQI
这个工具能给你带来一些便利。

你可能感兴趣的:(web-development)