Personal blog: http://www.chenjianqu.com/
Original post: http://www.chenjianqu.com/show-93.html
Use a crawler to scrape the Weibo hot search list and the weather forecast, and email them to yourself on a schedule. Contents:
1. Scraping the Weibo hot search
2. Sending the email
3. Scraping the weather forecast
4. The combined program
Scraping the Weibo Hot Search
I use Python regular expressions for the scraping here. It is a primitive approach, but it works well for simple scraping tasks. First open the Weibo hot search page: https://d.weibo.com/231650_ctg1_-_all#. Then press F12 to open the developer tools, and locate the page element that holds the content you want to scrape; for the hot search, that is the element wrapping each entry in the list.
Next, switch to the Network tab and refresh the page to find the request that carries the content. Searching through it, the content turns out to be in the 231650_ctg1_-_all entry under Doc. Look at that request's headers; they are needed when writing the code. The Python code is below:
import re
import requests
date_url='https://d.weibo.com/231650_ctg1_-_all'
user_agent = r'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36'
header = {
'Content-Type':'application/x-www-form-urlencoded',
'User-Agent':user_agent,
'Connection': 'keep-alive',
'Host':'d.weibo.com',
'Referer':r'https://weibo.com/?category=1760',
'Sec-Fetch-Mode':'navigate',
'Sec-Fetch-Site':'same-origin',
'Sec-Fetch-User':'?1',
'Upgrade-Insecure-Requests':'1',
'Cookie':r'SINAGLOBAL=3157249405177.425.1576929340602; SCF=Al6xXQQ55-6jcuFXUVP0A6SEVlMaKwwCLiZUNjT9niWFZphUNGW7iw5NY4L42KvBQbIpbHZIIsILhHH8bZ5OnbM.; SUHB=0WGdKi-XaWA8Uj; ALF=1611383135; SUB=_2AkMpZMs-f8NxqwJRmPoVxW3rb4VwzAHEieKfODrlJRMxHRl-yT9kqn0vtRB6AuTl0ValAGtvAToNrCinxEZouvLjQMeG; SUBP=0033WrSXqPxfM72-Ws9jqgMF55529P9D9W5va2LfoCEFCfOQu6BQpoCk; login_sid_t=8bc131d1a2f8aa8965871b50f73c6c2d; cross_origin_proto=SSL; _s_tentry=passport.weibo.com; UOR=www.pythontip.com,widget.weibo.com,www.baidu.com; Apache=5219702142720.495.1580745741891; ULV=1580745741899:5:1:1:5219702142720.495.1580745741891:1579837074450; YF-Page-G0=46fe8b26d816d699836422a078175e33|1580745781|1580745767'
}
r = requests.post(url=date_url, headers=header)
raw_text=r.text
# Note: the HTML fragments inside several of the patterns below did not
# survive publication; only the plain-text parts of the regexes remain.
re_s1 = r""                         # matches the block wrapping the whole list
re_s2 = r"- "                       # matches one hot-search item
re_pic = r""                        # matches an item's image tag
re_pic_src = r"src=(.*?)jpg"        # captures the image URL
re_sub = r"(.*?)div>"               # captures the subtitle text
re_link = r""                       # matches an item's link tag
re_link_src = r"href=(.*?) class="  # captures the item's link
re_key = r"#(.*?)#"                 # captures the keyword between # marks
s1 = re.findall(re_s1, raw_text, re.S | re.M)
for line_s1 in s1:
    s2 = re.findall(re_s2, line_s1, re.S | re.M)
    # each hot-search item
    for line_s2 in s2:
        # get the keyword
        key = re.findall(re_key, line_s2, re.S | re.M)
        print(key[0])
        # get the image URL
        pic_s = re.findall(re_pic, line_s2, re.S | re.M)
        src = re.findall(re_pic_src, pic_s[0], re.S | re.M)
        print(src[0].replace('\\', '') + 'jpg')
        # get the subtitle
        subtitle = re.findall(re_sub, line_s2, re.S | re.M)
        print(subtitle[0].replace('\\t', '').replace('\\n', '').replace('<\\/', ''))
        # get the item's link
        link = re.findall(re_link, line_s2, re.S | re.M)
        link_src = re.findall(re_link_src, link[0], re.S | re.M)
        print(link_src[0].replace('\\', ''))
        print('\n')
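The page embeds the item HTML inside a JavaScript string, so every slash arrives escaped as `\/`; that is why the code above strips backslashes before using a URL. A minimal illustration, with a made-up captured string in the shape `re_pic_src` would produce:

```python
# Made-up sample of what re_pic_src captures: a JS-escaped URL fragment
# (leading quote included, trailing "jpg" cut off by the non-greedy match).
raw_src = r'"https:\/\/wx4.sinaimg.cn\/large\/demo.'
url = raw_src.replace('\\', '') + 'jpg'  # drop the escapes, re-append "jpg"
print(url)  # "https://wx4.sinaimg.cn/large/demo.jpg
```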
#################################################################################
Scraped results:
远程办公
"https://wx4.sinaimg.cn/large/59853be1ly1gbjb12koefj206o06ogoi.jpg
"https://s.weibo.com/weibo?q=%23%E8%BF%9C%E7%A8%8B%E5%8A%9E%E5%85%AC%23"
下一站是幸福
"https://wx1.sinaimg.cn/large/0079PGXzly1gb409yn3poj30dw0dwq3r.jpg
@微博电视剧 推荐:《下一站是幸福》(原《资深少女的初恋》),讲述...
"https://s.weibo.com/weibo?q=%23%E4%B8%8B%E4%B8%80%E7%AB%99%E6%98%AF%E5%B9%B8%E7%A6%8F%23"
过多睡眠不利于当前健康调整
"https://wx3.sinaimg.cn/large/6a5ce645ly1gbj96c9fgrj205q05qglj.jpg
3日,国家卫生健康委召开新闻发布会,北京回龙观医院党委书记杨甫德表...
"https://s.weibo.com/weibo?q=%23%E8%BF%87%E5%A4%9A%E7%9D%A1%E7%9C%A0%E4%B8%8D%E5%88%A9%E4%BA%8E%E5%BD%93%E5%89%8D%E5%81%A5%E5%BA%B7%E8%B0%83%E6%95%B4%23"
李兰娟回应疫苗进展
"https://wx4.sinaimg.cn/large/9e5389bbly1gbjaa14qwsj20c80c8t9g.jpg
2月2日凌晨,中国工程院院士、国家卫健委高级别专家组成员李兰娟带领...
"https://s.weibo.com/weibo?q=%23%E6%9D%8E%E5%85%B0%E5%A8%9F%E5%9B%9E%E5%BA%94%E7%96%AB%E8%8B%97%E8%BF%9B%E5%B1%95%23"
抗疫行动
"https://wx2.sinaimg.cn/large/005C79Jbly1gbjozauqc6j30dw0dw0ti.jpg
疫情让人恐惧,也让我们团结一心!@好友一起#手写加油接力# 为身边的...
"https://s.weibo.com/weibo?q=%23%E6%8A%97%E7%96%AB%E8%A1%8C%E5%8A%A8%23"
2020最大心愿
"https://wx2.sinaimg.cn/large/a716fd45ly1gbiy5n6qqrj20dw0dwmzd.jpg
2020最大心愿:国泰民安! 转发海报,一起许下2020年的愿望!
"https://s.weibo.com/weibo?q=%232020%E6%9C%80%E5%A4%A7%E5%BF%83%E6%84%BF%23"
武汉最新城市宣传片
"https://wx2.sinaimg.cn/large/7a273328ly1g7sxt0udwnj20ba0baabb.jpg
"https://s.weibo.com/weibo?q=%23%E6%AD%A6%E6%B1%89%E6%9C%80%E6%96%B0%E5%9F%8E%E5%B8%82%E5%AE%A3%E4%BC%A0%E7%89%87%23"
儿童和孕产妇是新型肺炎易感人群
"https://wx2.sinaimg.cn/large/60718250ly1gbj8qp16a8j20bl0bl0t1.jpg
"https://s.weibo.com/weibo?q=%23%E5%84%BF%E7%AB%A5%E5%92%8C%E5%AD%95%E4%BA%A7%E5%A6%87%E6%98%AF%E6%96%B0%E5%9E%8B%E8%82%BA%E7%82%8E%E6%98%93%E6%84%9F%E4%BA%BA%E7%BE%A4%23"
福尔摩斯式破解病毒传染迷局
"https://wx3.sinaimg.cn/large/9e5389bbly1gbjkfi69hvj20dw0dw3yx.jpg
日前,天津某百货大楼内部相继出现5例确诊病例,从起初的3个病例来看...
"https://s.weibo.com/weibo?q=%23%E7%A6%8F%E5%B0%94%E6%91%A9%E6%96%AF%E5%BC%8F%E7%A0%B4%E8%A7%A3%E7%97%85%E6%AF%92%E4%BC%A0%E6%9F%93%E8%BF%B7%E5%B1%80%23"
宝石gem经纪人回应
"https://wx2.sinaimg.cn/large/4b79be8bly1gbjcd3ja43j208o08o74u.jpg
"https://s.weibo.com/weibo?q=%23%E5%AE%9D%E7%9F%B3gem%E7%BB%8F%E7%BA%AA%E4%BA%BA%E5%9B%9E%E5%BA%94%23"
手写加油接力
"https://wx1.sinaimg.cn/large/005C79Jbly1gbig4h9v7dj30dw0dwgm0.jpg
@好友 接力,手写祝福,为奋战在所有一线的工作者们加油打气,武汉加...
"https://s.weibo.com/weibo?q=%23%E6%89%8B%E5%86%99%E5%8A%A0%E6%B2%B9%E6%8E%A5%E5%8A%9B%23"
宁波一次聚餐祈福25人确诊
"https://wx3.sinaimg.cn/large/6a5ce645ly1gbje76xyhkj20dw0dwwfd.jpg
2月3日,据宁波市政府新闻办召开新闻发布会通报:患者胡某,无湖北(...
"https://s.weibo.com/weibo?q=%23%E5%AE%81%E6%B3%A2%E4%B8%80%E6%AC%A1%E8%81%9A%E9%A4%90%E7%A5%88%E7%A6%8F25%E4%BA%BA%E7%A1%AE%E8%AF%8A%23"
北京发现41起聚集性病例
"https://wx2.sinaimg.cn/large/9e5389bbly1gbjbvb0z0wj20dw0dwgmx.jpg
今日,北京市新型冠状病毒感染的肺炎疫情防控工作新闻发布会介绍,截...
"https://s.weibo.com/weibo?q=%23%E5%8C%97%E4%BA%AC%E5%8F%91%E7%8E%B041%E8%B5%B7%E8%81%9A%E9%9B%86%E6%80%A7%E7%97%85%E4%BE%8B%23"
锦衣之下
"https://wx2.sinaimg.cn/large/006WpiUTly1g8pdxpnafnj30dw0dwdib.jpg
由艺能传媒、欢瑞世纪、芒果超媒、快乐阳光出品,总导演尹涛、导演刘...
"https://s.weibo.com/weibo?q=%23%E9%94%A6%E8%A1%A3%E4%B9%8B%E4%B8%8B%23"
确诊病例门把手测出病毒核酸
"https://wx4.sinaimg.cn/large/a716fd45ly1gbj1jn8ogfj206n06n3yq.jpg
日前,广州市疾控中心在疫情监测中,在一名确诊患者家中门把手上发现...
"https://s.weibo.com/weibo?q=%23%E7%A1%AE%E8%AF%8A%E7%97%85%E4%BE%8B%E9%97%A8%E6%8A%8A%E6%89%8B%E6%B5%8B%E5%87%BA%E7%97%85%E6%AF%92%E6%A0%B8%E9%85%B8%23"
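The same non-greedy `(.*?)` capture with `re.S | re.M` can be tried on a tiny snippet; the tag and class name below are inventions for illustration, not the real Weibo markup:

```python
import re

# Made-up HTML; the real page structure differs.
html = '<a href="https://example.com/item" class="name">#远程办公#</a>'
key = re.findall(r"#(.*?)#", html, re.S | re.M)                # keyword between # marks
link = re.findall(r'href="(.*?)" class=', html, re.S | re.M)   # non-greedy URL capture
print(key[0], link[0])  # 远程办公 https://example.com/item
```

`re.S` makes `.` match newlines as well, which matters when an item's markup spans several lines of the page source.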
Sending the Email
This part follows the runoob (菜鸟教程) Python email tutorial, using QQ Mail's SMTP server to send the mail. SMTP (Simple Mail Transfer Protocol) is the set of rules for transferring mail from a source address to a destination address; it controls how mail is relayed. Python's smtplib module provides a convenient way to send email, wrapping the SMTP protocol in a simple API. In QQ Mail, go to "Settings -> Accounts -> enable the POP3/SMTP service -> obtain an authorization code", and use the authorization code as the login password. The resulting code:
#!/usr/bin/python
# -*- coding: UTF-8 -*-
import smtplib
from email.mime.text import MIMEText
from email.utils import formataddr

my_sender = '[email protected]'  # sender's address
my_pass = 'xxx'               # sender's password (the QQ authorization code)
my_user = '[email protected]'     # recipient's address

def mail():
    ret = True
    try:
        msg = MIMEText('邮件内容:测试', 'plain', 'utf-8')
        msg['From'] = formataddr(["AlexChen", my_sender])  # sender nickname and address
        msg['To'] = formataddr(["JianquChen", my_user])    # recipient nickname and address
        msg['Subject'] = "邮件测试"                         # subject line of the mail
        server = smtplib.SMTP_SSL("smtp.qq.com", 465)      # sender's SMTP server; SSL uses port 465
        server.login(my_sender, my_pass)                   # log in with the authorization code
        server.sendmail(my_sender, [my_user, ], msg.as_string())
        server.quit()                                      # close the connection
    except Exception:
        ret = False
    return ret

ret = mail()
if ret:
    print("邮件发送成功")
else:
    print("邮件发送失败")
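mail() above sends a fixed subject and body; the combined program needs to put the scraped text into the mail instead. A small sketch of that refactor, where the function name build_msg and its parameters are my own, not from the original:

```python
from email.mime.text import MIMEText
from email.utils import formataddr

def build_msg(subject, content, sender, receiver):
    """Assemble a UTF-8 plain-text message; sending stays the same as in mail()."""
    msg = MIMEText(content, 'plain', 'utf-8')
    msg['From'] = formataddr(["AlexChen", sender])
    msg['To'] = formataddr(["JianquChen", receiver])
    msg['Subject'] = subject
    return msg

msg = build_msg('每日播报', '天气+热搜正文', '[email protected]', '[email protected]')
print(msg['Subject'])  # 每日播报
```

Separating message construction from sending also lets you inspect the message without touching the network.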
Correction: it seems what this actually scrapes is not the hot search list... but that is not the point here.
Scraping the Weather Forecast
The weather forecast is fetched with the code from <树莓派智能家居-天气预报和实时温湿度监控>, as follows:
import requests
import json

def getWeather(city, date=0):
    s = ''
    rb = requests.get('http://wthrcdn.etouch.cn/weather_mini?city=' + city)
    #print(rb.text)
    data = json.loads(rb.text)
    if(data['status'] == 1000):
        d = data['data']
        if(date == 0):
            s += d['city'] + '今天' + d['forecast'][0]['type'] + ','
            s += d['forecast'][0]['low'][2:] + '到' + d['forecast'][0]['high'][2:] + ','
            s += d['forecast'][0]['fengxiang'] + d['forecast'][0]['fengli'][8:] + ','
            s += '当前室外温度:' + d['wendu'] + '度,'
            s += d['ganmao']
        elif(date > 0 and date < 5):
            s += d['city']
            if(date == 1):
                s += '明天'
            elif(date == 2):
                s += '后天'
            else:
                s += d['forecast'][date]['date']
            s += d['forecast'][date]['type'] + ','
            s += d['forecast'][date]['low'][2:] + '到' + d['forecast'][date]['high'][2:] + ','
            s += d['forecast'][date]['fengxiang'] + d['forecast'][date]['fengli'][8:]
        elif(date == -1):
            s += d['city'] + '昨天' + d['yesterday']['type'] + ','
            s += d['yesterday']['low'][2:] + '到' + d['yesterday']['high'][2:] + ','
            s += d['yesterday']['fx'] + d['yesterday']['fl'][8:]
    else:
        s = '天气请求失败'
    return s

print(getWeather("钦州市", date=0))
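The combined program below schedules its sends with a my_times list of [hour, minute] pairs. Pulling that check into a pure helper keeps it testable; the function name is_send_time and the minute-granularity comparison are my assumptions, not code from the original:

```python
import datetime

def is_send_time(now, times):
    """Return True when now's hour and minute match one of the [hour, minute] pairs."""
    return [now.hour, now.minute] in times

my_times = [[13, 57], [13, 54]]
now = datetime.datetime(2020, 2, 4, 13, 57)
print(is_send_time(now, my_times))  # True
```

A loop that calls this once per minute (e.g. with time.sleep(60)) would fire at most once per listed time.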
The Combined Program
The full program is as follows:
# -*- coding: UTF-8 -*-
import datetime
import time
import smtplib
from email.mime.text import MIMEText
from email.utils import formataddr
import json
import re
import requests

my_sender = '[email protected]'  # sender's address
my_pass = 'xxx'               # sender's password (the QQ authorization code)
my_user = '[email protected]'     # recipient's address

# scheduled send times as [hour, minute]
my_times = [
    [13, 57],
    [13, 54]
]

def getWeather(city, date=0):
    s = ''
    rb = requests.get('http://wthrcdn.etouch.cn/weather_mini?city=' + city)
    #print(rb.text)
    data = json.loads(rb.text)
    if(data['status'] == 1000):
        d = data['data']
        if(date == 0):
            s += d['city'] + '今天' + d['forecast'][0]['type'] + ','
            s += d['forecast'][0]['low'][2:] + '到' + d['forecast'][0]['high'][2:] + ','
            s += d['forecast'][0]['fengxiang'] + d['forecast'][0]['fengli'][8:] + ','
            s += '当前室外温度:' + d['wendu'] + '度,'
            s += d['ganmao']
        elif(date > 0 and date < 5):
            s += d['city']
            if(date == 1):
                s += '明天'
            elif(date == 2):
                s += '后天'
            else:
                s += d['forecast'][date]['date']
            s += d['forecast'][date]['type'] + ','
            s += d['forecast'][date]['low'][2:] + '到' + d['forecast'][date]['high'][2:] + ','
            s += d['forecast'][date]['fengxiang'] + d['forecast'][date]['fengli'][8:]
        elif(date == -1):
            s += d['city'] + '昨天' + d['yesterday']['type'] + ','
            s += d['yesterday']['low'][2:] + '到' + d['yesterday']['high'][2:] + ','
            s += d['yesterday']['fx'] + d['yesterday']['fl'][8:]
    else:
        s = '天气请求失败'
    return s + '\n'

def getWeibo():
    date_url = 'https://d.weibo.com/231650_ctg1_-_all'
    user_agent = r'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36'
    header = {
        'Content-Type': 'application/x-www-form-urlencoded',
        'User-Agent': user_agent,
        'Connection': 'keep-alive',
        'Host': 'd.weibo.com',
        'Referer': r'https://weibo.com/?category=1760',
        'Sec-Fetch-Mode': 'navigate',
        'Sec-Fetch-Site': 'same-origin',
        'Sec-Fetch-User': '?1',
        'Upgrade-Insecure-Requests': '1',
        'Cookie': r'SINAGLOBAL=3157249405177.425.1576929340602; SCF=Al6xXQQ55-6jcuFXUVP0A6SEVlMaKwwCLiZUNjT9niWFZphUNGW7iw5NY4L42KvBQbIpbHZIIsILhHH8bZ5OnbM.; SUHB=0WGdKi-XaWA8Uj; ALF=1611383135; SUB=_2AkMpZMs-f8NxqwJRmPoVxW3rb4VwzAHEieKfODrlJRMxHRl-yT9kqn0vtRB6AuTl0ValAGtvAToNrCinxEZouvLjQMeG; SUBP=0033WrSXqPxfM72-Ws9jqgMF55529P9D9W5va2LfoCEFCfOQu6BQpoCk; login_sid_t=8bc131d1a2f8aa8965871b50f73c6c2d; cross_origin_proto=SSL; _s_tentry=passport.weibo.com; UOR=www.pythontip.com,widget.weibo.com,www.baidu.com; Apache=5219702142720.495.1580745741891; ULV=1580745741899:5:1:1:5219702142720.495.1580745741891:1579837074450; YF-Page-G0=46fe8b26d816d699836422a078175e33|1580745781|1580745767'
    }
    r = requests.post(url=date_url, headers=header)
    raw_text = r.text
    re_s1 = r""
    re_s2 = r"- "
    re_pic = r""
    re_pic_src = r"src=(.*?)jpg"
    re_sub = r"