Author's resume: http://resume.hackycoder.cn
I have recently been studying machine learning algorithms (regression, classification, clustering and so on), but while learning I had no data to practice on. So I decided to crawl news articles from the major Chinese news sites, train on them, and then predict the category of future news. That is how my crawler journey began.
A roundup of the major Chinese news sites (to be continued):
Sohu News:
Politics: http://m.sohu.com/cr/32/?page=2&_smuid=1qCppj0Q1MiPQWJ4Q8qOj1&v=2
Society: http://m.sohu.com/cr/53/?page=2&_smuid=1qCppj0Q1MiPQWJ4Q8qOj1&v=2
World: http://m.sohu.com/cr/57/?_smuid=1qCppj0Q1MiPQWJ4Q8qOj1&v=2
General pattern: http://m.sohu.com/cr/4/?page=4 where the first 4 is the category id and the second 4 is the page number (see the URL-building sketch after this list)
NetEase News
Recommended: http://3g.163.com/touch/article/list/BA8J7DG9wangning/20-20.html change the 20-20 segment to page through the list
News: http://3g.163.com/touch/article/list/BBM54PGAwangning/0-10.html
Entertainment: http://3g.163.com/touch/article/list/BA10TA81wangning/0-10.html
Sports: http://3g.163.com/touch/article/list/BA8E6OEOwangning/0-10.html
Finance: http://3g.163.com/touch/article/list/BA8EE5GMwangning/0-10.html
Fashion: http://3g.163.com/touch/article/list/BA8F6ICNwangning/0-10.html
Military: http://3g.163.com/touch/article/list/BAI67OGGwangning/0-10.html
Mobile: http://3g.163.com/touch/article/list/BAI6I0O5wangning/0-10.html
Technology: http://3g.163.com/touch/article/list/BA8D4A3Rwangning/0-10.html
Games: http://3g.163.com/touch/article/list/BAI6RHDKwangning/0-10.html
Digital: http://3g.163.com/touch/article/list/BAI6JOD9wangning/0-10.html
Education: http://3g.163.com/touch/article/list/BA8FF5PRwangning/0-10.html
Health: http://3g.163.com/touch/article/list/BDC4QSV3wangning/0-10.html
Automobile: http://3g.163.com/touch/article/list/BA8DOPCSwangning/0-10.html
Home: http://3g.163.com/touch/article/list/BAI6P3NDwangning/0-10.html
Real estate: http://3g.163.com/touch/article/list/BAI6MTODwangning/0-10.html
Travel: http://3g.163.com/touch/article/list/BEO4GINLwangning/0-10.html
Parenting: http://3g.163.com/touch/article/list/BEO4PONRwangning/0-10.html
To be continued...
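To make the two URL patterns above concrete, here is a minimal sketch of how the list-page URLs can be assembled. The helper names are mine (not part of either site's API), and reading the NetEase 0-10 segment as a start/count pair is my interpretation of the pattern.
# -*- coding:utf-8 -*-
# Hypothetical URL-building helpers for the list pages above.

def sohuListUrl(categoryId, page):
    # e.g. sohuListUrl(32, 2) -> http://m.sohu.com/cr/32/?page=2
    return "http://m.sohu.com/cr/%d/?page=%d" % (categoryId, page)

def neteaseListUrl(channelId, start, count):
    # e.g. neteaseListUrl('BBM54PGAwangning', 0, 10)
    #      -> http://3g.163.com/touch/article/list/BBM54PGAwangning/0-10.html
    return "http://3g.163.com/touch/article/list/%s/%d-%d.html" % (channelId, start, count)

if __name__ == '__main__':
    print(sohuListUrl(32, 2))                         # Sohu politics, page 2
    print(neteaseListUrl('BBM54PGAwangning', 0, 10))  # NetEase news channel, first 10 items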
This crawler mainly uses two packages, urllib2 and BeautifulSoup. Taking Sohu News as the example, I wrote a simple crawler that fetches page content. It has not been optimized in any way, so it can hang or appear to freeze.
# -*- coding:utf-8 -*-
'''
Created on 2016-3-15
@author: AndyCoder
'''
import urllib2
from bs4 import BeautifulSoup
import socket
import httplib


class Spider(object):
    """Spider for one Sohu list page."""

    def __init__(self, url):
        self.url = url

    def getNextUrls(self):
        """Collect the article links found on this list page."""
        urls = []
        request = urllib2.Request(self.url)
        request.add_header('User-Agent', 'Mozilla/5.0 (Windows NT 6.1; '
                           'WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/48.0.2564.116 Safari/537.36')
        try:
            html = urllib2.urlopen(request)
        except socket.timeout:
            return urls
        except urllib2.URLError:
            return urls
        except httplib.BadStatusLine:
            return urls
        soup = BeautifulSoup(html, 'html.parser')
        for link in soup.find_all('a'):
            href = link.get('href')
            if not href:                      # skip anchors without an href
                continue
            print("http://m.sohu.com" + href)
            if href[0] == '/':                # keep only site-relative links
                urls.append("http://m.sohu.com" + href)
        return urls


def getNews(url):
    """Fetch one article page and return its text content."""
    print url
    xinwen = ''
    request = urllib2.Request(url)
    request.add_header('User-Agent', 'Mozilla/5.0 (Windows NT 6.1; '
                       'WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/48.0.2564.116 Safari/537.36')
    try:
        html = urllib2.urlopen(request)
    except urllib2.HTTPError, e:
        print e.code
        return xinwen
    soup = BeautifulSoup(html, 'html.parser')
    for news in soup.select('p.para'):
        xinwen += news.get_text()             # get_text() already returns unicode
    return xinwen


class News(object):
    """
    source: site the news was crawled from
    title: title of the article
    time: published time of the article
    content: body text of the article
    type: category of the article
    """

    def __init__(self, source, title, time, content, type):
        self.source = source
        self.title = title
        self.time = time
        self.content = content
        self.type = type


file = open('C:/test.txt', 'a')
for i in range(38, 50):                       # Sohu category ids
    for j in range(1, 5):                     # pages 1 to 4 of each category
        url = "http://m.sohu.com/cr/" + str(i) + "/?page=" + str(j)
        print url
        s = Spider(url)
        for newsUrl in s.getNextUrls():
            file.write(getNews(newsUrl).encode('utf-8'))
            file.write("\n")
            print "---------------------------"
Running the code above, I ran into several problems that interrupt the crawler or make it slow. A few of them are listed below:
Proxy servers
You can find some proxy servers online and configure the crawler to use them, which works around the IP-blocking problem. The code is as follows:
def setProxy(pro):
    # Route HTTPS requests through the given proxy for every later urlopen call
    proxy_support = urllib2.ProxyHandler({'https': pro})
    opener = urllib2.build_opener(proxy_support, urllib2.HTTPHandler)
    urllib2.install_opener(opener)
As for status problems: if a page cannot be fetched, just discard it, since dropping a small number of pages does not affect the later work.
def getHtml(url, pro):
    request = urllib2.Request(url)
    setProxy(pro)                              # install the proxy before opening the URL
    request.add_header('User-Agent', 'Mozilla/5.0 (Windows NT 6.1; '
                       'WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/48.0.2564.116 Safari/537.36')
    try:
        html = urllib2.urlopen(request)
        statusCode = html.getcode()
        if statusCode != 200:                  # page not reachable: just discard it
            return None
    except socket.timeout:
        return None
    except urllib2.URLError:
        return None
    except httplib.BadStatusLine:
        return None
    return html
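For completeness, this is roughly how getHtml could replace the raw urlopen calls in the Spider code above. The proxy address here is only a placeholder, not a real server.
# Rough usage sketch: assumes the imports and getHtml definition above are in scope,
# and that '1.2.3.4:8080' is replaced with a real proxy.
pro = '1.2.3.4:8080'
html = getHtml("http://m.sohu.com/cr/32/?page=1", pro)
if html is not None:
    soup = BeautifulSoup(html, 'html.parser')
    for link in soup.find_all('a'):
        href = link.get('href')
        if href and href[0] == '/':
            print("http://m.sohu.com" + href)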
As for the speed problem, the crawling can be parallelized with multiple processes. Once the URLs have been parsed out, a Redis sorted set can be used as a queue, which solves both the duplicate-URL problem and the multi-process coordination problem. (Not yet implemented.)
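Since that part is not implemented in this post, here is only a rough sketch of the idea. It assumes a local Redis server and the redis-py 3.x client; the key names and the worker function are made up for illustration.
# -*- coding:utf-8 -*-
# Rough sketch: a Redis sorted set used as a shared, deduplicated URL queue.
# Assumes a local Redis server and redis-py 3.x; key names are hypothetical.
import time
import redis
from multiprocessing import Process

r = redis.StrictRedis(host='localhost', port=6379, db=0, decode_responses=True)
QUEUE_KEY = 'news:url_queue'

def pushUrl(url):
    # Members of a sorted set are unique, so re-adding an already queued URL
    # only updates its score instead of creating a duplicate entry.
    r.zadd(QUEUE_KEY, {url: time.time()})      # score = enqueue time (redis-py 3.x style)

def popUrl():
    # Take the oldest URL; zrem returns 1 only for the process that actually
    # removed it, so two workers never end up fetching the same page.
    candidates = r.zrange(QUEUE_KEY, 0, 0)
    if not candidates:
        return None
    url = candidates[0]
    if r.zrem(QUEUE_KEY, url) == 1:
        return url
    return None                                # another worker grabbed it first; just retry

def worker():
    while True:
        url = popUrl()
        if url is None:
            break
        # fetch and parse the page here, e.g. with getNews(url)

if __name__ == '__main__':
    processes = [Process(target=worker) for _ in range(4)]
    for p in processes:
        p.start()
    for p in processes:
        p.join()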
Last night I gave it a try and crawled part of Sohu News, roughly 50*5*15 = 3750 pages, from which it parsed out more than 2000 news articles. With a connection of almost 1 Mbps it took 1101 s, about 18 minutes.