A Simple Crawler with urllib2

1. Fetch the page

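import urllib2

# Send a browser-like User-Agent; some sites reject the default urllib2 one.
user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
headers = {'User-Agent': user_agent}

def gethtml(url):
    # Build the request with the custom headers and fetch the page.
    request = urllib2.Request(url, headers=headers)
    response = urllib2.urlopen(request)
    # Decode as UTF-8, ignoring bytes that cannot be decoded.
    html = response.read().decode('utf-8', 'ignore')
    return html

url = raw_input("url:")
print gethtml(url)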
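
On Python 3, where urllib2 no longer exists, the same fetch can be sketched with urllib.request. This is only a rough equivalent, not the original code: the build_request helper and the 10-second timeout are assumptions added for illustration.

```python
import urllib.request

USER_AGENT = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
HEADERS = {'User-Agent': USER_AGENT}

def build_request(url):
    # Hypothetical helper: attach the same User-Agent header as above.
    return urllib.request.Request(url, headers=HEADERS)

def gethtml(url, timeout=10):
    # timeout=10 is an assumed value; adjust it for the target site.
    request = build_request(url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        # Decode as UTF-8, ignoring bytes that cannot be decoded.
        return response.read().decode('utf-8', 'ignore')
```

Calling gethtml with a page URL should return the decoded page source as a string, which can then be passed to the extraction step.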

2. Extract the information you want

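import re

def getText(html):
    # Write the regex for your target page: put the page's HTML tags
    # around each (.*?) group. re.S lets '.' match newlines too.
    pattern = re.compile('.*?(.*?).*?(.*?).*?\n(.*?)\n(.*?)', re.S)
    texts = pattern.findall(html)
    for text in texts:
        # The pattern has four groups, so each match is a 4-tuple.
        print text[0], text[1], text[2], text[3]  # print the text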
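
To make the extraction step concrete, here is a minimal sketch of the same findall technique in Python 3 against a made-up page. The markup in SAMPLE_HTML, the tag and class names, and the get_text name are all hypothetical; a real page needs its own pattern.

```python
import re

# Hypothetical markup: each entry is a <div class="item"> with four fields.
SAMPLE_HTML = '''
<div class="item">
<h2>Alice</h2><span class="age">30</span>
<p class="text">First post</p>
<i class="votes">12</i>
</div>
<div class="item">
<h2>Bob</h2><span class="age">25</span>
<p class="text">Second
post</p>
<i class="votes">7</i>
</div>
'''

# Each (.*?) is a non-greedy capture group bounded by the tags around it.
# re.S lets '.' match newlines, so a field may span several lines.
ITEM_PATTERN = re.compile(
    r'<div class="item">.*?<h2>(.*?)</h2>.*?<span class="age">(.*?)</span>'
    r'.*?<p class="text">(.*?)</p>.*?<i class="votes">(.*?)</i>',
    re.S)

def get_text(html):
    # findall returns one tuple per match, with one element per group.
    return ITEM_PATTERN.findall(html)
```

Each tuple returned by get_text(SAMPLE_HTML) holds the four captured fields in order, so it can be unpacked directly in a loop, for example as name, age, text, votes.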
