[Python Learning] Crawlers crawlers crawlers crawlers~

Day 8
Most of the crawler code online is for py2 (:з」∠)
Today I found a py3 crawler and tried crawling my school's portal with it

import io
import sys
import urllib.request

# Headers: a browser User-Agent plus the session cookie copied from a logged-in browser
web_header = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36',
    'Cookie': 'iPlanetDirectoryPro=AQIC5wM2LY4SfczGmS5S1wsHjs3f8d%2FvQadvCPz780%2B9%2B1o%3D%40AAJTSQACMDI%3D%23; JSESSIONID=0000g4W05n040WWJHMgWwYK6u41:172u4qcnp'}

# Rewrap stdout so print() emits utf-8
sys.stdout = io.TextIOWrapper(sys.stdout.buffer, encoding='utf8')

url_mh = 'http://xxxx.xxxx.xxx.xx/index.portal'  # the page shown after portal login
req = urllib.request.Request(url=url_mh, headers=web_header)  # request with the custom headers
resp = urllib.request.urlopen(req)
data = resp.read()  # raw response bytes
print(data.decode('utf-8'))  # decode the bytes before printing

Breaking the code down: web_header holds the header info, cookie included. Before this I always wrote the cookie separately with the cookie module, so doing it this way is really simple = = (:з」∠)
sys.stdout gets rewrapped so printed output is encoded as utf8
url_mh is the portal page you land on after logging in successfully
req is the Request set up to carry the modified headers
resp is what comes back after sending the request (a GET here, not a POST, since no data is attached); the returned data is fetched with read()
then it gets displayed in utf-8
The two utf-8s both have to be set right: one is the display encoding, the other is the decoding encoding
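
A tiny self-contained sketch of how those two encodings interact (the byte string here just stands in for resp.read(), it isn't the real portal data):

import io
import sys

# one utf-8: rewrap stdout so print() emits utf-8 regardless of the console default
sys.stdout = io.TextIOWrapper(sys.stdout.buffer, encoding='utf8')

# pretend these bytes came back from resp.read()
data = '公告'.encode('utf-8')

# the other utf-8: decode the response bytes before printing
print(data.decode('utf-8'))

If either one is wrong you get mojibake or a UnicodeDecodeError, which is why both need setting.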

After a lot of debugging I can now scrape the announcements

import io
import sys
import urllib.request
import re
import requests

# Same headers as before: browser User-Agent plus the session cookie
web_header = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36',
    'Cookie': 'iPlanetDirectoryPro=AQIC5wM2LY4SfczGmS5S1wsHjs3f8d%2FvQadvCPz780%2B9%2B1o%3D%40AAJTSQACMDI%3D%23; JSESSIONID=0000g4W05n040WWJHMgWwYK6u41:172u4qcnp'}

sys.stdout = io.TextIOWrapper(sys.stdout.buffer, encoding='utf8')
url_mh = 'http://XXX/index.portal'
req = urllib.request.Request(url=url_mh, headers=web_header)
# resp = urllib.request.urlopen(req)
# data = resp.read()
# print(data.decode('utf8'))

# Switched to requests here: same headers, and setting .encoding makes decoding painless
resp = requests.get(url=url_mh, headers=web_header)
resp.encoding = 'utf-8'
# print(resp.text)
# Placeholder pattern only: the original regex (matching the portal's announcement
# markup) was garbled when this post was published, so swap in your own
gonggao = re.findall(r'<a href="[^"]*">(.*?)</a>', resp.text, re.S)
for each in gonggao:
    print(each)
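
For the record, re.S makes . match newlines too, which matters because the announcement HTML spans multiple lines. A toy demo with a made-up snippet (not the portal's real markup):

import re

html = '<li><a href="/news/1">First notice\n</a></li>\n<li><a href="/news/2">Second notice</a></li>'

# with re.S the lazy group can cross the newline inside the first tag;
# without it the first notice would not match at all
for title in re.findall(r'<a href="[^"]*">(.*?)</a>', html, re.S):
    print(title.strip())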

It's a bit late today, so exporting to a text file will have to wait until next time~
