I'm a student who just switched into this field, and my advisor suddenly handed me a small job: scrape the 15-day weather forecast for more than 100 cities.
There are tutorials online for scraping 7-day forecasts, and with some study they are not hard to follow.
Here are a few links I found especially detailed, which you can study (hopefully saving you some time):
https://blog.csdn.net/heshushun/article/details/77772408
http://blog.sina.com.cn/s/blog_132d8ac0d0102wmpl.html
This article builds on the generous sharing of these predecessors, for which I am grateful.
1. Getting the URL and analyzing the page source
There are two ways to do this.
① Open the China Weather (weather.com.cn) page for a city (Beijing as an example: http://www.weather.com.cn/weather15d/101010100.shtml)
Press F12 to open the developer tools at the bottom of the screen, then right-click the 8-15 day forecast area and choose Inspect, or press Ctrl+Shift+I.
② Fetch the page with urllib's request module
Reference code:
# -*- coding: UTF-8 -*-
from urllib import request

if __name__ == "__main__":
    response = request.urlopen("http://www.weather.com.cn/weather15d/101010100.shtml")
    html = response.read()
    html = html.decode("utf-8")
    print(html)
You can run this in PyCharm or Sublime and get the result directly. For easier viewing, I copied the 8-15 day weather section into Youdao Cloud Notes.
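If copying by hand feels clumsy, you can also dump the page to a local file and open it in any editor; a small convenience sketch (the filename is just an example):

# Save the fetched page for offline inspection (filename is arbitrary)
with open('beijing_15d.html', 'w', encoding='utf-8') as f:
    f.write(html)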
The source here is fairly tidy, differing from the 7-day page only in a few tags. The headache is the temperature part: the minimum temperature is missing its own tag, e.g. (29℃/17℃), where only the maximum is wrapped in a tag.
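For reference, one day's markup looks roughly like this (reconstructed from the patterns used below; treat the exact class names as assumptions):

<li class="t">
  <span class="time">周六（30日）</span>
  <span class="wea">多云转阴</span>
  <span class="tem"><em>29℃</em>/17℃</span>  <!-- high is wrapped in <em>, low is bare text -->
</li>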
2. Scraping the data
As a complete beginner I fumbled around for a long time without finding a good approach, and in the end asked a friend in the computer science department, who suggested regular expressions.
The regex primers online are all much the same, and in my view theory without practice is not very useful; when learning, look for posts with hands-on examples.
The article I referenced: https://www.jianshu.com/p/d1bf2f0bdc51
Its processing details are not very polished, so you need to understand the crawler's structure and make some adjustments; beginners should not copy it verbatim.
def get_data2(html, city):
    soup = BeautifulSoup(html, 'html.parser')  # build the BeautifulSoup parse tree
    articles = []
    for article in soup.find_all(class_='c15d'):  # the 8-15 day forecast block
        li = article.find_all('li')  # one <li> per day
        for day in li:
            line = city  # city
            date = re.findall('"time">(.*?)</span>', str(day))[0]  # date
            wea = re.findall('"wea">(.*?)</span>', str(day))[0]  # weather
            tem_max = re.findall('"tem"><em>(.*?)</em>', str(day))[0]  # high temperature
            tem_min = re.findall('"tem"><em>.*?</em>/(.*?)</span>', str(day))[0]  # low temperature
            articles.append([line, date, wea, tem_max, tem_min])
    return articles
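To see why the temperature needs two patterns, here is a minimal standalone check against the awkward markup (the sample string is assumed from the page source above):

import re

day = '<span class="tem"><em>29℃</em>/17℃</span>'  # low temperature has no tag of its own
tem_max = re.findall('"tem"><em>(.*?)</em>', day)[0]             # -> '29℃'
tem_min = re.findall('"tem"><em>.*?</em>/(.*?)</span>', day)[0]  # -> '17℃'
print(tem_max, tem_min)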
And the helper that fetches the HTML:
def get_content(url):
    rep = requests.get(url, timeout=60)
    rep.encoding = 'utf-8'
    return rep.text
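When looping over 100+ cities, the odd request will time out. A retry wrapper like the following can help; this is a sketch I'm adding, not part of the original tutorial:

import time
import requests

def get_content_with_retry(url, retries=3):
    # Hypothetical helper: retry transient network errors before giving up
    for attempt in range(retries):
        try:
            rep = requests.get(url, timeout=60)
            rep.encoding = 'utf-8'
            return rep.text
        except requests.RequestException:
            if attempt == retries - 1:
                raise
            time.sleep(2)  # brief pause before the next attempt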
3. Saving the data and scraping many cities quickly
def save_data(data, filename):
    with open(filename, 'a', errors='ignore', newline='') as f:  # newline='' avoids blank lines between rows
        f_csv = csv.writer(f)
        for row in data:
            f_csv.writerow(row)
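Since save_data appends with mode 'a', the CSV has no header row. If you want one, write it once before the loop; a sketch (the column names are my own choice):

import csv

def write_header(filename):
    # Hypothetical helper: create the file with a header row before appending data
    with open(filename, 'w', errors='ignore', newline='') as f:
        csv.writer(f).writerow(['city', 'date', 'weather', 'high', 'low'])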
To make multi-city scraping convenient, city names are looked up in city.txt (which you can download online or from the reference links given earlier).
def get_url(city_name):
    url = 'http://www.weather.com.cn/weather15d/'
    with open('city.txt', 'r', encoding='UTF-8') as fs:
        lines = fs.readlines()
        for line in lines:
            if city_name in line:
                code = line.split('=')[0].strip()
                return url + code + '.shtml'
    raise ValueError('invalid city name')
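For get_url to work, each line of city.txt must pair a station code with a city name; judging from the split('=') logic, the format is roughly:

101010100=北京
101020100=上海

so get_url('北京') returns http://www.weather.com.cn/weather15d/101010100.shtml.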
4. Main program
if __name__ == '__main__':
    cities = input('city name: ').split(' ')
    for city in cities:
        url = get_url(city)
        html = get_content(url)
        result = get_data2(html, city)
        save_data(result, r'G:\weather8-15d.csv')  # raw string so the backslash in the path is kept literal
When running, separate the city names with spaces, e.g. 北京 上海 广州. The scraped rows are appended to the CSV file.
5. Complete code
# -*- coding: utf-8 -*-
import requests
import csv
from bs4 import BeautifulSoup
import re


def get_url(city_name):
    url = 'http://www.weather.com.cn/weather15d/'
    with open('city.txt', 'r', encoding='UTF-8') as fs:
        lines = fs.readlines()
        for line in lines:
            if city_name in line:
                code = line.split('=')[0].strip()
                return url + code + '.shtml'
    raise ValueError('invalid city name')


def get_content(url):
    rep = requests.get(url, timeout=60)
    rep.encoding = 'utf-8'
    return rep.text


def get_data2(html, city):
    soup = BeautifulSoup(html, 'html.parser')  # build the BeautifulSoup parse tree
    articles = []
    for article in soup.find_all(class_='c15d'):  # the 8-15 day forecast block
        li = article.find_all('li')  # one <li> per day
        for day in li:
            line = city  # city
            date = re.findall('"time">(.*?)</span>', str(day))[0]  # date
            wea = re.findall('"wea">(.*?)</span>', str(day))[0]  # weather
            tem_max = re.findall('"tem"><em>(.*?)</em>', str(day))[0]  # high temperature
            tem_min = re.findall('"tem"><em>.*?</em>/(.*?)</span>', str(day))[0]  # low temperature
            articles.append([line, date, wea, tem_max, tem_min])
    return articles


def save_data(data, filename):
    with open(filename, 'a', errors='ignore', newline='') as f:  # newline='' avoids blank lines between rows
        f_csv = csv.writer(f)
        for row in data:
            f_csv.writerow(row)


if __name__ == '__main__':
    cities = input('city name: ').split(' ')
    for city in cities:
        url = get_url(city)
        html = get_content(url)
        result = get_data2(html, city)
        save_data(result, r'G:\weather8-15d.csv')  # raw string so the backslash in the path is kept literal