The content and code in this article draw on the following references:
http://blog.csdn.net/u010454729/article/details/50900716
http://cuiqingcai.com/912.html
http://zqdevres.qiniucdn.com/data/20160426130231/index.html
This article shows how to quickly build a simple crawler with scrapy + beautifulsoup + a MongoDB database: given a list of search keywords, it crawls the city geography information from Baidu Baike and stores the results in MongoDB.
scrapy is a Python crawler framework. With scrapy you can stand up a crawler quickly and only need small modifications to suit your own needs; it also supports concurrent and distributed crawling. beautifulsoup is used to parse the pages. This article only walks through writing a simple crawler with scrapy and does not cover concurrency or distributed crawling.
First, install scrapy:
pip install scrapy
After the installation finishes, run pip list to check: if scrapy shows up in the list, the installation succeeded and you can start writing the crawler.
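Another quick check is to ask the command-line tool itself for its version:

scrapy version

If a version number is printed, scrapy is on your PATH and ready to use.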
Step 1: create the crawler project
scrapy startproject tutorial
The last argument is the project name, here tutorial. You will find that scrapy has automatically created a tutorial folder that already contains some files; we only need to modify a few of them.
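For reference, the generated project usually looks like this (the exact files can differ slightly between scrapy versions):

tutorial/
    scrapy.cfg            # deploy configuration
    tutorial/
        __init__.py
        items.py          # item definitions (step 2)
        pipelines.py      # item pipelines (step 4)
        settings.py       # project settings (step 5)
        spiders/          # your spider code goes here (step 3)
            __init__.py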
Step 2: in items.py under the tutorial\ directory, define the format of the information we want to scrape. For example, I want to extract three pieces of information from a city page on Baidu Baike: the city name, its location, and its climate. items.py can then be written as follows:
# -*- coding: utf-8 -*-
# Define here the models for your scraped items
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/items.html
import scrapy
from scrapy.item import Item,Field

class BaikeItem(scrapy.Item):
    name = Field()
    location = Field()
    climate = Field()
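A scrapy Item behaves much like a dictionary whose keys are restricted to the fields declared above; that is exactly how the spider in the next step fills it in. A tiny illustration (the values are made up):

# -*- coding: utf-8 -*-
from tutorial.items import BaikeItem

item = BaikeItem()
item['name'] = u"上海市"           # made-up example value
item['location'] = u"长江三角洲"    # made-up example value
print item['name']
# assigning to a field that was not declared, e.g. item['population'] = 0,
# raises a KeyError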
Step 3: create your own spider file under the tutorial\spiders directory. The file name is arbitrary (I created Baike.py), because when scrapy runs it goes by the name attribute set inside the file, not by the file name. Baike.py defines a spider class with two main methods, start_requests(self) and parse(self, response). start_requests(self) yields the URLs you want to crawl, and parse parses the content of the pages that come back. parse uses the item defined earlier, so the item has to be imported at the top of the file with from your_project_name.items import your_item. I use beautifulsoup here to locate and extract the page content; xpath is another common option. Baidu Baike pages are fairly messy, so my handling is somewhat convoluted. The basic idea is to first find the table of contents on the page and then use its level structure to pull out the sub-entries under the geography ("地理") entry: start from the geography entry and stop as soon as the next level-1 entry is reached. Note that parse may also write the extracted content to a file and so on, but that is optional; the results can also be post-processed in pipelines.py. The full Baike.py is as follows:
#!/usr/bin/python2.7
# -*- coding: utf-8 -*-
import xlrd
import uniout
import sys
reload(sys)
sys.setdefaultencoding('utf-8')
import codecs                         # open output files with an explicit encoding
import scrapy
from scrapy.selector import Selector
from bs4 import BeautifulSoup as bs   # used to parse the html
import os                             # used to create folders
import urllib2                        # used for fetching pages
import urllib                         # used to percent-encode the query keywords
import re                             # used to match urls
from tutorial.items import BaikeItem
class Baike(scrapy.Spider):
    name = "baike"
    allowed_domains = ["baidu.com"]

    # read the search keywords (one city name per row) from the first column
    # of the first sheet of an Excel file
    def read_xls_file(self, filename):
        data = []
        xlrd.Book.encoding = "utf-8"
        book = xlrd.open_workbook(filename)
        if book.nsheets <= 0:
            return data
        table = book.sheet_by_index(0)
        data = table.col_values(0)
        return data
    def start_requests(self):
        # queries = ["上海市"]
        queries = self.read_xls_file("citytest.xlsx")
        # print 'read data:', queries
        # queries = ["上海市","武汉市","绵阳市","南京市","重庆市","明光市"]
        # build the Baidu Baike search url for each keyword
        urls = []
        for query in queries:
            print query
            # percent-encode the keyword so the url stays valid
            query = urllib.quote(query.encode('utf-8'))
            url = "http://baike.baidu.com/search/word?word=" + query
            urls.append(url)
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)
    def parse(self, response):
        items = []
        soup = bs(response.body, "html.parser")
        # step 1: locate the level-1 catalog entry "地理" (geography)
        title_li = soup.find_all("li", class_="level1")
        titles = []
        texts1 = []
        s0 = u"地理"
        item = BaikeItem()
        for i in title_li:
            span = i.find_next("span", class_="text")
            text = span.get_text()                 # get_text() extracts the plain text
            if s0 in text:
                # walk the catalog entries that follow the geography entry
                for li in i.find_all_next("li"):
                    classes = li.get('class')
                    if classes is None:
                        continue
                    if classes[0] == "level1":
                        # the next level-1 entry marks the end of the geography section
                        break
                    if classes[0] == "level2":
                        span = li.find_next("span", class_="text")
                        texts1.append(span.get_text())
                break
        for t in texts1:
            titles.append(t)
        print titles
        ################################################################
        # step 2: extract the paragraphs under each geography sub-heading
        text_h3 = soup.find_all("h3", class_="title-text")   # section headings
        texts = []
        cityname = ""
        locationstr = u"位置"   # "location"
        climatestr = u"气候"   # "climate"
        for i in text_h3:
            text = i.get_text()
            for title in titles:
                if title in text:
                    texts.append(title)
                    # walk every node that follows this heading
                    for next_item in i.find_all_next():
                        if next_item.name == "h3":
                            # the next heading marks the end of this section
                            break
                        if next_item.name == "div":
                            attr_classes = next_item.get('class')
                            if attr_classes and attr_classes[0] == "para":
                                text = next_item.get_text()      # paragraph text
                                text = text.replace("\n", "")    # drop the many line breaks
                                if text:
                                    texts.append(text)
                                    if locationstr in title:
                                        item['location'] = text
                                    if climatestr in title:
                                        item['climate'] = text
        # the city name sits in the info box at the top of the page
        dl = soup.find("dl", class_="basicInfo-block basicInfo-left")
        if dl is not None:
            dd = dl.find_next("dd", class_="basicInfo-item value")
            cityname = dd.get_text()
            item['name'] = cityname.strip()
        items.append(item)
        # optionally also dump the extracted text to a local file
        filename = response.url.split("/")[-2] + '.txt'
        write = codecs.open(filename, "w", encoding='utf-8')
        write.write(u"城市名称")
        write.write("\n")
        write.write(cityname)
        write.write("\n")
        for t in texts:
            write.write(t)
            write.write("\n")
        write.close()
        return items
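Before running the full crawl it can be handy to feed parse a single saved page and check what comes out. A minimal sketch, assuming you have saved a Baike city page to shanghai.html (both the file name and the url below are placeholders):

# test_parse.py -- run the spider's parse method on one saved page
from scrapy.http import HtmlResponse
from tutorial.spiders.Baike import Baike

body = open("shanghai.html").read()    # page saved from the browser (placeholder)
response = HtmlResponse(url="http://baike.baidu.com/item/placeholder", body=body)
spider = Baike()
for item in spider.parse(response):
    print item.get('name'), item.get('location')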
Step 4: pipelines.py. This file sits in the tutorial directory and is where the extracted results are handled: you can serialize them to JSON and write them to a file, or connect to a database and store them there. Here I connect to a MongoDB database and write the results into it. This requires MongoDB to be installed and running on the local machine; if it is not installed, comment out the database code before running. The full pipelines.py is as follows:
# -*- coding: utf-8 -*-
# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html
import pymongo
from scrapy.conf import settings
from scrapy import signals
import json
import codecs

class JsonWithEncodingBaikePipeline(object):
    def __init__(self):
        # self.file = codecs.open('cityInfo.json', 'w', encoding='utf-8')
        # connect to the MongoDB instance configured in settings.py
        self.client = pymongo.MongoClient(host=settings['MONGO_HOST'], port=settings['MONGO_PORT'])
        self.db = self.client[settings['MONGO_DB']]      # database handle
        self.coll = self.db[settings['MONGO_COLL']]      # collection handle

    def process_item(self, item, spider):
        postItem = dict(item)        # convert the item to a plain dict
        self.coll.insert(postItem)   # insert one record into the collection
        return item                  # returning the item keeps it visible in the console; optional
        ## line = json.dumps(dict(item), ensure_ascii=False) + "\n"
        ## self.file.write(line)
        ## return item

    def spider_closed(self, spider):
        # only relevant when the json file output above is enabled
        self.file.close()

class TutorialPipeline(object):
    def process_item(self, item, spider):
        return item
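After a crawl has finished, a quick way to confirm that the records actually reached MongoDB is a small standalone script (a sketch, reusing the same host/port/database/collection values as in settings.py below):

# check_mongo.py -- sanity check of the inserted records
import pymongo

client = pymongo.MongoClient("127.0.0.1", 27017)
coll = client["xhltest"]["baikecity"]
print coll.count()                     # number of inserted city records
for doc in coll.find().limit(3):       # peek at a few documents
    print doc.get("name"), doc.get("location")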
Step 5: make a few settings in settings.py. The full settings.py is as follows:
# -*- coding: utf-8 -*-
# Scrapy settings for tutorial project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
# http://doc.scrapy.org/en/latest/topics/settings.html
# http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html
# http://scrapy.readthedocs.org/en/latest/topics/spider-middleware.html
BOT_NAME = 'tutorial'
SPIDER_MODULES = ['tutorial.spiders']
NEWSPIDER_MODULE = 'tutorial.spiders'
# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'tutorial (+http://www.yourdomain.com)'
# Obey robots.txt rules
ROBOTSTXT_OBEY = False
## MongoDB settings
MONGO_HOST = "127.0.0.1"    # host IP
MONGO_PORT = 27017          # port
MONGO_DB = "xhltest"        # database name
MONGO_COLL = "baikecity"    # collection name
# MONGO_USER = "user"       # only needed if authentication is enabled
# MONGO_PSW = "123456"
DOWNLOAD_DELAY = 2
# Configure item pipelines
# See http://scrapy.readthedocs.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
'tutorial.pipelines.JsonWithEncodingBaikePipeline': 300,
}
# Enable and configure the AutoThrottle extension (disabled by default)
# See http://doc.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False
# Enable and configure HTTP caching (disabled by default)
# See http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
Step 6: the crawler is now complete. Run
scrapy crawl baike
and the crawler goes to work. The scraped city geography information ends up in the baikecity collection of MongoDB, as well as in the per-city .txt files written by the spider.