Scraping every problem from the HDU OJ with Python (images excluded)


title: Scraping the HDU problem set with Python
date: 2018-05-07 01:39:09
tags:
- python3
- web scraping
categories: python3

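description: Scrape every problem from the HDU online judge. The site has been unstable lately, so I saved the problems locally to make practicing easier.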

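Several techniques come into play here: pulling the content out of the page with regular expressions and then stripping the HTML tags, making the request look like it comes from a browser, and handling errors.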

import urllib.request
import re

def getHtml(url):
    ## send the browser-like headers along with the request
    req = urllib.request.Request(url, headers=headers)
    page = urllib.request.urlopen(req)
    html = page.read()
    unicodehtml = html.decode("gbk")    ## decode the GBK bytes into a str
    return unicodehtml

def zhenghe(str1,id,imgre):
    ## fetch the page at str1+id and return every match of imgre
    html=getHtml( str1+id )
    return re.findall(imgre,html)

headers = {'User-Agent':
    'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) '
    'Chrome/51.0.2704.63 Safari/537.36'}    ## custom request headers to imitate a browser
num = 1000
Url = "http://acm.hdu.edu.cn/showproblem.php?pid="
reg = r'.*?[\s\S]*?'    ## regular expression (the HTML tags that delimit it are missing here)
imgre=re.compile(reg)
while num<=6275:
    list = zhenghe(Url,str(num),imgre)
    t = open("hdu题库/hdu%s.txt"%num,"w")
    for i in list:
        dr = re.compile(r'<[^>]+>',re.S)
        dd = dr.sub('',i)    ## the two lines above strip the HTML tags
        t.write(dd)    ## write to the file
        t.write("\n")
    t.close()
    num=num+1
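The tag-stripping step can be tried on its own, without touching the network. The sketch below is only an illustration: the `sample` fragment is made up, and `strip_tags` is a hypothetical helper name that does not appear in the script. It also decodes HTML entities with `html.unescape`, which the script does not do.

```python
import html
import re

TAG_RE = re.compile(r'<[^>]+>')

def strip_tags(fragment):
    """Remove HTML tags, then decode entities such as &lt; and &amp;."""
    # Turn explicit line breaks into newlines before the tags disappear.
    text = TAG_RE.sub('', fragment.replace('<br>', '\n'))
    # Unescape last, so a decoded "<" is never mistaken for a tag.
    return html.unescape(text)

# Made-up fragment for illustration; the real page markup may differ.
sample = '<div class=panel_content>Calculate <i>A + B</i>.<br>1 &lt;= A, B &lt;= 1000</div>'
print(strip_tags(sample))
```

Unescaping after the substitution matters: done the other way round, text such as `1 <= A` could be partly swallowed by the tag pattern.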

There is a problem, though: if decoding the page as GBK fails, the program terminates, so some error handling is needed.

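def getHtml(url):
    req = urllib.request.Request(url, headers=headers)
    page = urllib.request.urlopen(req)
    html = page.read()
    try:
        unicodehtml = html.decode("gbk")
    except UnicodeDecodeError:
        print("%s could not be decoded\n"%url)
        return "a"  ## on error, return the placeholder string "a"
    return unicodehtml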

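while num<=6275:
    list = zhenghe(Url,str(num),imgre)
    if not list:  ## a page that failed to decode produces no matches, so write nothing
        num=num+1  ## move on to the next problem, otherwise the loop never ends
        continue
    t = open("hdu题库/hdu%s.txt"%num,"w")
    ## ... the rest of the loop body is unchanged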
At this point it basically works, but it is still not ideal: the input and output sections are not separated, and there is no title:
[Image 1]
Refining the regular expressions takes care of this. The final version of the code:

import urllib.request
import re

def getHtml(url):
    req = urllib.request.Request(url, headers=headers)
    page = urllib.request.urlopen(req)
    html = page.read()
    try:
        unicodehtml = html.decode("gbk")
    except UnicodeDecodeError:
        print("%s could not be decoded\n"%url)
        return "a"
    return unicodehtml

def zhenghe(str1,id,imgre):
    html=getHtml( str1+id )
    return re.findall(imgre,html)

headers = {'User-Agent':
    'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) '
    'Chrome/51.0.2704.63 Safari/537.36'}
num = 1013
Url = "http://acm.hdu.edu.cn/showproblem.php?pid="
## one pattern per section (the HTML tags that delimit each pattern are missing here)
reg = {}
reg[0] = r".*?[\s\S]"
reg[1] = r".*?[\s\S]*?"
reg[2] = r".*?[\s\S]*?"
reg[3] = r'Sample Input.*?[\s\S]*?'
reg[4] = r'Sample Output.*?[\s\S]*?'
while num<=1013:
    t = open("hdu题库/hdu%s.txt"%num,"a")
    for j in range (0,5):
        imgre=re.compile(reg[j])
        list = zhenghe(Url,str(num),imgre)
        if not list:    ## nothing matched, or the page failed to decode
            continue
        for i in list:
            dr = re.compile(r'<[^>]+>',re.S)
            dd = dr.sub('',i)    ## strip the HTML tags
            dr = re.compile(r'Input',re.S)
            dd = dr.sub('Input\n',dd)    ## put the section heading on its own line
            dr = re.compile(r'Output',re.S)
            dd = dr.sub('Output\n',dd)
            t.write(dd)
            t.write("\n\n")
    t.close()
    num=num+1

[Image 2]
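The idea of one pattern per section can also be expressed as a single pattern with two capture groups, so that each page is scanned once and the section names come out together with their text. This is only a sketch under stated assumptions: the `SAMPLE_HTML` fragment and its `panel_title` / `panel_content` layout are made up for illustration and may not match the real pages, and `extract_sections` is a hypothetical helper name.

```python
import re

# Hypothetical page fragment; the real HDU markup may differ.
SAMPLE_HTML = (
    "<h1>A + B Problem</h1>"
    "<div class=panel_title>Problem Description</div>"
    "<div class=panel_content>Calculate A + B.</div>"
    "<div class=panel_title>Sample Input</div>"
    "<div class=panel_content>1 1</div>"
    "<div class=panel_title>Sample Output</div>"
    "<div class=panel_content>2</div>"
)

# Group 1 captures the section name, group 2 the section body.
SECTION_RE = re.compile(
    r"<div class=panel_title>(.*?)</div>\s*<div class=panel_content>(.*?)</div>",
    re.S,
)

def extract_sections(page):
    """Return (title, {section name: text}) for one problem page."""
    title_match = re.search(r"<h1>(.*?)</h1>", page, re.S)
    title = title_match.group(1) if title_match else ""
    sections = {name.strip(): body.strip() for name, body in SECTION_RE.findall(page)}
    return title, sections

title, sections = extract_sections(SAMPLE_HTML)
print(title)
for name, body in sections.items():
    print(name)
    print(body)
    print()
```

A non-greedy pattern like this stops at the first closing `</div>`, so a section whose body contains nested `div` elements would be cut short; an HTML parser would be more robust in that case.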
