Scraping the Kekenet (可可英语) website with a Python crawler: CET-4 translation

The Kekenet CET-4 exam preparation page

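Crawler basics:

  • 1. url: the address of the web page to fetch.
  • 2. If the page has anti-scraping measures, add a header:
header = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36'}
  • 3. Build a request that imitates a browser:
request = urllib2.Request(url, headers=header)
  • 4. Open the URL:
response = urllib2.urlopen(request)
  • 5. Read the page source:
html = response.read()
  • 6. Write the regular expression:
pattern = re.compile(r"----regular expression----")
  • 7. Match the regular expression against the source:
items = re.findall(pattern, html)  # note: items is a list that holds every match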
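The seven steps above can be put together in one small script. What follows is only a sketch, written for Python 3, where `urllib2` has become `urllib.request`. The function names `build_request`, `fetch_html` and `find_items`, the `utf-8` default encoding and the sample HTML are all assumptions made for illustration; they do not come from the Kekenet page itself.

```python
import re
import urllib.request

# Browser-like User-Agent, copied from the header shown in step 2.
HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36"
    )
}


def build_request(url):
    """Steps 1-3: wrap the URL and the headers in a Request object."""
    return urllib.request.Request(url, headers=HEADERS)


def fetch_html(url, encoding="utf-8"):
    """Steps 4-5: open the URL and return the decoded page source.

    The encoding is an assumption; use the one the target page declares.
    """
    with urllib.request.urlopen(build_request(url)) as response:
        return response.read().decode(encoding, errors="replace")


def find_items(html, pattern):
    """Steps 6-7: compile the pattern and return every match as a list."""
    return re.findall(re.compile(pattern, re.S), html)


if __name__ == "__main__":
    # Hypothetical inline sample so the sketch runs without network access.
    sample = '<li class="item">first</li>\n<li class="item">second</li>'
    print(find_items(sample, r'<li class="item">(.*?)</li>'))
```

For a real page, `fetch_html(url)` would supply the text passed to `find_items`. Keeping the download and the matching in separate functions lets the regular expression be tried on a saved copy of the page before any request is sent.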
  • The code:
#coding=utf-8
import urllib2
from constants import url
import re
import sys
import os

reload(sys)
sys.setdefaultencoding('utf-8')  # work around garbled text caused by encoding problems (Python 2 only)

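def get_title(url):
    req = urllib2.Request(url)
    # Add a browser-like User-Agent header
    req.add_header('User-Agent',
                   'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.109 Safari/537.36')
    res = urllib2.urlopen(req)
    html = res.read()
    # The pattern ends right after the lazy group, so (.*?) captures an empty
    # string unless a closing delimiter follows it.
    response = re.compile(r'id="nrtitle">(.*?)', re.S)
    title = re.findall(response, html)[0]  # first match found by the re module
    # Strip zero-width spaces (U+200B, written here as UTF-8 bytes)
    title = title.replace('\xe2\x80\x8b', '')
    # Remove the &#34; (double quote) entity piece by piece; this also removes
    # every other '&', ';', '#' and '34' that appears in the title
    title = title.replace('&', '')
    title = title.replace(';', '')
    title = title.replace('#', '')
    title = title.replace('34', '')
    return title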
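Two details of `get_title` are worth trying out on a small sample. A lazy group with nothing after it matches the empty string, and removing `'34'` character by character also eats digits that belong to the title. Below is a hedged Python 3 sketch of one alternative. It uses `html.unescape` from the standard library instead of the chained `replace` calls, so quotes are decoded and kept rather than deleted. The closing `</h1>` tag, the names `extract_title` and `clean_title`, and the sample markup are assumptions; the real closing tag is not visible in the listing above.

```python
import html
import re

# Assumption: the title is closed by </h1>. The real closing tag is not
# visible in the listing, so adjust this to the actual page markup.
TITLE_PATTERN = re.compile(r'id="nrtitle">(.*?)</h1>', re.S)


def extract_title(page):
    """Return the raw title text, or None when the pattern does not match."""
    match = TITLE_PATTERN.search(page)
    return match.group(1) if match else None


def clean_title(raw):
    """Decode HTML entities, drop zero-width spaces, trim whitespace."""
    text = html.unescape(raw)          # "&#34;" becomes a double quote
    text = text.replace("\u200b", "")  # remove zero-width spaces
    return text.strip()


if __name__ == "__main__":
    # Hypothetical markup, not copied from the real page.
    sample = '<h1 id="nrtitle">&#34;CET-4&#34; translation 2034\u200b</h1>'
    # Lazy group with nothing after it: the capture is empty.
    print(re.findall(r'id="nrtitle">(.*?)', sample))
    # With a closing delimiter the title is captured, and "2034" survives.
    print(clean_title(extract_title(sample)))
```

If the quotes should be dropped as in the original function, one extra `text.replace('"', '')` after `html.unescape` does that without touching the digits.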



def get_first_page(url):
    req = urllib2.Request(url)
    req.add_header('User-Agent',
    'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.109 Safari/537.36')
    res = urllib2.urlopen(req)
    html = res.read()
    response = re.compile(r'(.*?)
                    
                    
