Simulating a Login to My University's Website to Scrape Grades

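After reading crawler articles and watching videos for a while, I felt I understood them, yet my own projects kept running into trouble, which tells me my fundamentals are still shaky. Today I'll try logging in to my university's website and then scraping the information I want.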

System: Windows 10 1803

Tools: PyCharm 1703

Python version: 3.6

Packet capture tool: Charles

Modules used: requests, PIL, BeautifulSoup/lxml, os

My university's academic administration system: http://220.178.71.156:85/(jnw0uoqufqsohg3jngkaci55)/default2.aspx


Figure 1: Login page

Logging in with the packet capture tool running shows the data that the form submits.

Figure 2: Captured data

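This is the data that gets POSTed to the server during a simulated login. We also need to submit the captcha, but it is generated randomly on each request, so we need to find the URL that serves the captcha image.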

Figure 3: Captcha URL

# Download the captcha
import os
import requests
from PIL import Image

s = requests.session()   # create one session and reuse it for every later request
imgUrl = "http://220.178.71.156:85/(jnw0uoqufqsohg3jngkaci55)/CheckCode.aspx?"
imgresponse = s.get(imgUrl, stream=True)
print(s.cookies)
image = imgresponse.content
DstDir = os.getcwd() + "\\"
print("Saving captcha to: " + DstDir + "code.png" + "\n")
try:
    with open(DstDir + "code.png", "wb") as png:
        png.write(image)   # the with block closes the file, so no finally clause is needed
except IOError:
    print("IO Error\n")

# Open the captcha so it can be read and typed in manually
img = Image.open(DstDir + "code.png")   # same file that was just saved
img.show()


Hmm, I saved every captcha as a PNG, which opens automatically in the image viewer, so fetching and using the random captcha is done.
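As a rough sketch, the save step could be wrapped in a helper that builds the path with pathlib instead of a hand-written Windows separator. The name save_captcha and its parameters are hypothetical, introduced only for illustration; the bytes would come from imgresponse.content.

```python
from pathlib import Path


def save_captcha(content: bytes, directory: str = ".", filename: str = "code.png") -> Path:
    """Write captcha bytes to directory/filename and return the full path.

    Hypothetical helper: `content` is assumed to be the raw image bytes
    returned by the captcha request.
    """
    target = Path(directory) / filename
    target.write_bytes(content)  # opens, writes and closes the file in one call
    return target
```

Passing the returned path to Image.open would keep the download step and the viewing step in agreement about where the file lives.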

url = 'http://220.178.71.156:85/(jnw0uoqufqsohg3jngkaci55)/default2.aspx'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.181 Safari/537.36'}
data = {
    '__VIEWSTATE': 'dDwtMTg3MTM5OTI5MTs7Pu9NgXuEf8Rr/BvvkUWH8oCYiXB2',
    'TextBox1': 'my student ID',
    'TextBox2': 'my password',
    'TextBox3': input('Enter the captcha: '),
    'Button1': '',
    'lbLanguage': '',
}
response = s.post(url=url, data=data, headers=headers)
# A successful login redirects to a URL that carries the student ID as xh=
if 'xh=' in response.url:
    print('Login succeeded')
else:
    print('Login failed')


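After logging in, we are redirected to another URL: http://220.178.71.156:85/(jnw0uoqufqsohg3jngkaci55)/xs_main.aspx?xh=15040****

The value at the end is my student ID, so the code finishes with a small check: if the URL contains 'xh=', the login succeeded.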
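A substring test also matches 'xh=' anywhere in the URL, including inside another parameter's value. A slightly stricter check, sketched here with the standard library, parses the query string and looks for a non-empty xh parameter. The function name login_succeeded is made up for this example, and whether a failed login ever carries an xh parameter on this site is an assumption I have not verified.

```python
from urllib.parse import urlparse, parse_qs


def login_succeeded(final_url: str) -> bool:
    """Return True if the URL's query string has a non-empty xh parameter."""
    query = parse_qs(urlparse(final_url).query)  # blank values are dropped by default
    return bool(query.get("xh"))
```

It would be called as login_succeeded(response.url) in place of the 'xh=' substring test.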


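After logging in, though, I couldn't scrape the grades directly: every attempt returned an empty list. I stared at it for a long time without finding the problem and finally asked a friend. The cause is that the grades live at a different URL, not the one the login redirects to.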


Figure 4: New URL

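Starting from the initial URL, we have to work our way to the URL that actually serves the grades, so the link has to be assembled. First, search the source of the landing page.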


Figure 5: Searching for the URL

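The only matching link I could find on the initial page is this one. Compared with the URL we need, it looks slightly different: after xm= the target URL has an encoded string, whereas here it is my name. I decided to try scraping the grades with this URL first and to dig further only if it didn't work. (In fact I had already found the final URL and then tried this one as well; the result was the same!)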

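from lxml import etree

# Fetch the landing page with the logged-in session and read the grade-query link
link0 = s.get(response.url, headers=headers).text
s1 = etree.HTML(link0)
link = s1.xpath('//*[@id="headDiv"]/ul/li[4]/ul/li[3]/a/@href')
url2 = 'http://220.178.71.156:85/(jnw0uoqufqsohg3jngkaci55)/' + link[0]   # xpath returns a list; take its first item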

Scrape the href attribute, then join it to the base URL to obtain the link we need.
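The join can also be done with urllib.parse.urljoin, which resolves a relative href against the page it came from, so the session directory does not have to be typed in by hand. This is only a sketch: build_grades_url is a name invented for the example, and it assumes the scraped href is relative to the landing page, as the manual concatenation above implies.

```python
from urllib.parse import urljoin


def build_grades_url(page_url: str, hrefs: list) -> str:
    """Resolve the first scraped href against the page it was found on."""
    if not hrefs:
        # An empty XPath result usually means the login failed or the page layout changed
        raise ValueError("no grade-query link found; the XPath matched nothing")
    return urljoin(page_url, hrefs[0])
```

With the variables from the code above, the call would be build_grades_url(response.url, link). Raising on an empty list makes a failed lookup visible immediately instead of producing a malformed URL.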


Figure 6: Data to submit

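This is the data that must be submitted to fetch the grades. Because I clicked "all grades", the semester and academic-year fields are submitted empty; you can fill in those fields to fetch the grades for a specific term.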

data2 = {

'__EVENTTARGET': '',

'__EVENTARGUMENT': '',

'__VIEWSTATE': 'dDwxMzc0MjAwNjg2O3Q8cDxsPFNvcnRFeHByZXM7c2ZkY2JrO2RnMztkeWJ5c2NqO1NvcnREaXJlO3hoO3N0cl90YWJfYmpnO2NqY3hfbHNiO3p4Y2pjeHhzOz47bDxrY21jO1xlO2JqZztcZTthc2M7MTUwNDAyMjE2O3pmX2N4Y2p0al8xNTA0MDIyMTY7XGU7MDs+PjtsPGk8MT47PjtsPHQ8O2w8aTw0PjtpPDEwPjtpPDE5PjtpPDMwPjtpPDMyPjtpPDM0PjtpPDM2PjtpPDM4PjtpPDM5PjtpPDQxPjtpPDQzPjtpPDQ1PjtpPDQ3PjtpPDQ5PjtpPDUxPjtpPDUzPjtpPDU1PjtpPDU3PjtpPDU5PjtpPDYxPjtpPDYyPjtpPDYzPjtpPDY1PjtpPDY3PjtpPDY5PjtpPDcxPjtpPDczPjtpPDc1PjtpPDc3PjtpPDc5PjtpPDgwPjs+O2w8dDx0PDt0PGk8MTk+O0A8XGU7MjAwMS0yMDAyOzIwMDItMjAwMzsyMDAzLTIwMDQ7MjAwNC0yMDA1OzIwMDUtMjAwNjsyMDA2LTIwMDc7MjAwNy0yMDA4OzIwMDgtMjAwOTsyMDA5LTIwMTA7MjAxMC0yMDExOzIwMTEtMjAxMjsyMDEyLTIwMTM7MjAxMy0yMDE0OzIwMTQtMjAxNTsyMDE1LTIwMTY7MjAxNi0yMDE3OzIwMTctMjAxODsyMDE4LTIwMTk7PjtAPFxlOzIwMDEtMjAwMjsyMDAyLTIwMDM7MjAwMy0yMDA0OzIwMDQtMjAwNTsyMDA1LTIwMDY7MjAwNi0yMDA3OzIwMDctMjAwODsyMDA4LTIwMDk7MjAwOS0yMDEwOzIwMTAtMjAxMTsyMDExLTIwMTI7MjAxMi0yMDEzOzIwMTMtMjAxNDsyMDE0LTIwMTU7MjAxNS0yMDE2OzIwMTYtMjAxNzsyMDE3LTIwMTg7MjAxOC0yMDE5Oz4+Oz47Oz47dDx0PHA8cDxsPERhdGFUZXh0RmllbGQ7RGF0YVZhbHVlRmllbGQ7PjtsPGtjeHptYztrY3h6ZG07Pj47Pjt0PGk8OT47QDzlv4Xkv67or7476YCJ5L+u6K++O+WFrOWFseWfuuehgOivvjvlrp7ot7Xor7475LiT5Lia5qC45b+D6K++O+S4k+S4muivvjvkuJPkuJrpgInkv67or74757u85ZCI5a6e6Le16K++O1xlOz47QDwxOzI7Mzs0OzU7Njs3Ozg7XGU7Pj47Pjs7Pjt0PHA8cDxsPFZpc2libGU7PjtsPG88Zj47Pj47Pjs7Pjt0PHA8cDxsPFRleHQ7PjtsPFxlOz4+Oz47Oz47dDxwPHA8bDxUZXh0O1Zpc2libGU7PjtsPOWtpuWPt++8mjE1MDQwMjIxNjtvPHQ+Oz4+Oz47Oz47dDxwPHA8bDxUZXh0O1Zpc2libGU7PjtsPOWnk+WQje+8mum7hOa1qTtvPHQ+Oz4+Oz47Oz47dDxwPHA8bDxUZXh0O1Zpc2libGU7PjtsPOWtpumZou+8muS/oeaBr+S4juiuoeeul+acuuezuztvPHQ+Oz4+Oz47Oz47dDxwPHA8bDxUZXh0O1Zpc2libGU7PjtsPOS4k+S4mu+8mjtvPHQ+Oz4+Oz47Oz47dDxwPHA8bDxUZXh0O1Zpc2libGU7PjtsPOmAmuS/oeW3peeoiztvPHQ+Oz4+Oz47Oz47dDxwPHA8bDxUZXh0Oz47bDzkuJPkuJrmlrnlkJHvvJo7Pj47Pjs7Pjt0PHA8cDxsPFRleHQ7VmlzaWJsZTs+O2w86KGM5pS/54+t77yaMjAxNee6p+mAmuS/oeW3peeoizLnj607bzx0Pjs+Pjs+Ozs+O3Q8cDxwPGw8VmlzaWJsZTs+O2w8bzxmPjs+Pjs+Ozs+O3Q8QDA8cDxwPGw8VmlzaWJsZTs+O2w8bzxmPjs+PjtwPGw8c3R5bGU7
PjtsPERJU1BMQVk6bm9uZTs+Pj47Ozs7Ozs7Ozs7Pjs7Pjt0PDtsPGk8MTM+Oz47bDx0PEAwPDs7Ozs7Ozs7Ozs+Ozs+Oz4+O3Q8cDxwPGw8VGV4dDtWaXNpYmxlOz47bDzoh7Pku4rmnKrpgJrov4for77nqIvmiJDnu6nvvJo7bzx0Pjs+Pjs+Ozs+O3Q8QDA8cDxwPGw8UGFnZUNvdW50O18hSXRlbUNvdW50O18hRGF0YVNvdXJjZUl0ZW1Db3VudDtEYXRhS2V5czs+O2w8aTwxPjtpPDE+O2k8MT47bDw+Oz4+O3A8bDxzdHlsZTs+O2w8RElTUExBWTpibG9jazs+Pj47Ozs7Ozs7Ozs7PjtsPGk8MD47PjtsPHQ8O2w8aTwxPjs+O2w8dDw7bDxpPDA+O2k8MT47aTwyPjtpPDM+O2k8ND47aTw1Pjs+O2w8dDxwPHA8bDxUZXh0Oz47bDxKWDAyMTAxNDs+Pjs+Ozs+O3Q8cDxwPGw8VGV4dDs+O2w86K6h566X5py65a+86K66Oz4+Oz47Oz47dDxwPHA8bDxUZXh0Oz47bDzlv4Xkv67or747Pj47Pjs7Pjt0PHA8cDxsPFRleHQ7PjtsPDIuMDs+Pjs+Ozs+O3Q8cDxwPGw8VGV4dDs+O2w8MDs+Pjs+Ozs+O3Q8cDxwPGw8VGV4dDs+O2w8Jm5ic3BcOzs+Pjs+Ozs+Oz4+Oz4+Oz4+O3Q8QDA8cDxwPGw8VmlzaWJsZTs+O2w8bzxmPjs+PjtwPGw8c3R5bGU7PjtsPERJU1BMQVk6bm9uZTs+Pj47Ozs7Ozs7Ozs7Pjs7Pjt0PEAwPHA8cDxsPFZpc2libGU7PjtsPG88Zj47Pj47cDxsPHN0eWxlOz47bDxESVNQTEFZOm5vbmU7Pj4+Ozs7Ozs7Ozs7Oz47Oz47dDxAMDw7Ozs7Ozs7Ozs7Pjs7Pjt0PEAwPHA8cDxsPFZpc2libGU7PjtsPG88Zj47Pj47cDxsPHN0eWxlOz47bDxESVNQTEFZOm5vbmU7Pj4+Ozs7Ozs7Ozs7Oz47Oz47dDxAMDxwPHA8bDxWaXNpYmxlOz47bDxvPGY+Oz4+O3A8bDxzdHlsZTs+O2w8RElTUExBWTpub25lOz4+Pjs7Ozs7Ozs7Ozs+Ozs+O3Q8QDA8cDxwPGw8VmlzaWJsZTs+O2w8bzxmPjs+Pjs+Ozs7Ozs7Ozs7Oz47Oz47dDxAMDxwPHA8bDxWaXNpYmxlOz47bDxvPGY+Oz4+O3A8bDxzdHlsZTs+O2w8RElTUExBWTpub25lOz4+Pjs7Ozs7Ozs7Ozs+Ozs+O3Q8QDA8cDxwPGw8VmlzaWJsZTs+O2w8bzxmPjs+PjtwPGw8c3R5bGU7PjtsPERJU1BMQVk6bm9uZTs+Pj47Ozs7Ozs7Ozs7Pjs7Pjt0PEAwPDtAMDw7O0AwPHA8bDxIZWFkZXJUZXh0Oz47bDzliJvmlrDlhoXlrrk7Pj47Ozs7PjtAMDxwPGw8SGVhZGVyVGV4dDs+O2w85Yib5paw5a2m5YiGOz4+Ozs7Oz47QDA8cDxsPEhlYWRlclRleHQ7PjtsPOWIm+aWsOasoeaVsDs+Pjs7Ozs+Ozs7Pjs7Ozs7Ozs7Oz47Oz47dDxwPHA8bDxUZXh0O1Zpc2libGU7PjtsPOacrOS4k+S4muWFsTExMeS6ujtvPGY+Oz4+Oz47Oz47dDxwPHA8bDxWaXNpYmxlOz47bDxvPGY+Oz4+Oz47Oz47dDxwPHA8bDxWaXNpYmxlOz47bDxvPGY+Oz4+Oz47Oz47dDxwPHA8bDxWaXNpYmxlOz47bDxvPGY+Oz4+Oz47Oz47dDxwPHA8bDxUZXh0Oz47bDxaSlU7Pj47Pjs7Pjt0PHA8cDxsPEltYWdlVXJsOz47bDwuL2V4Y2VsLzk3MDUyNzEuanBnOz4+Oz47Oz47Pj47Pj47PgGGyciVaIkDb4w+sTsVpJ8ImvRN',

'btn_zcj': '(unable to decode value)',  # placeholder shown by Charles when it could not decode the captured value

'hidLanguage': '',

'ddl_kcxz': '',

'ddlXQ': '',

'ddlXN': '',}
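The __VIEWSTATE value above was copied from one captured request, and it belongs to that particular page state. ASP.NET WebForms pages carry this value in a hidden input, so it may be more robust to read it from the page right before posting. The sketch below uses only the standard library; HiddenFieldParser and extract_hidden_fields are hypothetical names, and I have not checked how this particular site validates the field.

```python
from html.parser import HTMLParser


class HiddenFieldParser(HTMLParser):
    """Collect name/value pairs of every <input type="hidden"> on a page."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag != "input":
            return
        attributes = dict(attrs)
        is_hidden = (attributes.get("type") or "").lower() == "hidden"
        if is_hidden and attributes.get("name"):
            self.fields[attributes["name"]] = attributes.get("value") or ""


def extract_hidden_fields(html: str) -> dict:
    """Return a dict of hidden form fields found in the given HTML text."""
    parser = HiddenFieldParser()
    parser.feed(html)
    return parser.fields
```

One way to use it would be to GET the grades page with the session first, pass the response text to extract_hidden_fields, and copy the '__VIEWSTATE' entry into data2 before sending the POST.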

headers1 = {

'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',

'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.181 Safari/537.36', 'Referer': 'http://220.178.71.156:85/(jnw0uoqufqsohg3jngkaci55)/xscjcx.aspx?xh=150402216&xm=%BB%C6%BA%C6&gnmkdm=N121605',

'Connection': 'keep-alive',

'Accept-Encoding': 'gzip, deflate',

'Accept-Language': 'zh-CN,zh;q=0.9',

'Upgrade-Insecure-Requests': '1'

}

data4 = s.post(url2, data=data2, headers=headers1).text   # url2 is the grades URL assembled earlier
s2 = etree.HTML(data4)
with open(r'C:\Users\linx00\Desktop\cj.txt', 'w', encoding='utf-8') as f:
    # In columns 1, 2 and 4 the text may sit inside an <a> element, so an XPath union (|) matches both forms
    years = s2.xpath('//*[@id="Form1"]/div[2]/div/span/table/tr/td[1]/a/text()|//*[@id="Form1"]/div[2]/div/span/table/tr/td[1]/text()')
    xueqi = s2.xpath('//*[@id="Form1"]/div[2]/div/span/table/tr/td[2]/a/text()|//*[@id="Form1"]/div[2]/div/span/table/tr/td[2]/text()')
    kcmc = s2.xpath('//*[@id="Form1"]/div[2]/div/span/table/tr/td[4]/a/text()|//*[@id="Form1"]/div[2]/div/span/table/tr/td[4]/text()')
    kcxz = s2.xpath('//*[@id="Form1"]/div[2]/div/span/table/tr/td[5]/text()')
    xuefen = s2.xpath('//*[@id="Form1"]/div[2]/div/span/table/tr/td[7]/text()')
    jidian = s2.xpath('//*[@id="Form1"]/div[2]/div/span/table/tr/td[8]/text()')
    chengji = s2.xpath('//*[@id="Form1"]/div[2]/div/span/table/tr/td[9]/text()')
    bkchengji = s2.xpath('//*[@id="Form1"]/div[2]/div/span/table/tr/td[11]/text()')
    f.write('{}\n,{}\n,{}\n,{}\n,{}\n,{}\n,{}\n,{}\n'.format(years, xueqi, kcmc, kcxz, xuefen, jidian, chengji, bkchengji))


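With that, all the grades I wanted are saved. (The output is not pretty, though, and I'm not sure how to format it properly. If any expert can help, please do!)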
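One possible way to tidy the output is to write one course per row with the csv module instead of dumping each column list on its own line. This is a sketch, not a tested fix for this site: write_grades_csv is a hypothetical helper, and it assumes the column lists are aligned and equally long. That may not hold if some cells are empty, because text() returns nothing for an empty cell and zip stops at the shortest list.

```python
import csv
import io


def write_grades_csv(out, header, *columns):
    """Write parallel column lists as rows to a CSV file-like object."""
    writer = csv.writer(out)
    writer.writerow(header)
    for row in zip(*columns):  # one tuple per course, taken across all columns
        writer.writerow([cell.strip() for cell in row])


# Small self-contained demonstration with made-up values
demo = io.StringIO()
write_grades_csv(demo, ["year", "course"], ["2015-2016 "], [" Math"])
```

For the real data, the file could be opened with newline='' and passed as out, followed by a header list and the eight lists (years, xueqi, kcmc, kcxz, xuefen, jidian, chengji, bkchengji). The resulting file opens as a table in a spreadsheet program.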


Figure 7: Grade table
