This really is a long story.
I started by reading the official Scrapy documentation; once the install succeeded I created a new project.
Out of the box, the project contains the following (the command that generates this layout is shown right after the list):
· scrapy.cfg: the project's configuration file
· tutorial/: the project's Python module; this is where your code goes
· tutorial/items.py: the project's item definitions
· tutorial/pipelines.py: the project's pipelines
· tutorial/settings.py: the project's settings
· tutorial/spiders/: the directory that holds the spider code
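For reference, this layout is exactly what Scrapy's project generator produces when run from the directory where you want the project to live:

scrapy startproject tutorial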
First, define the item file:
import scrapy
from scrapy import Item, Field

class yilongItem(scrapy.Item):
    title = scrapy.Field()
    contentnews = scrapy.Field()
    pass
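Items behave like dictionaries with a fixed set of fields, so once the class above exists you can do things like this (purely illustrative values):

item = yilongItem()
item['title'] = u'some hotel name'
item['contentnews'] = u'some review text'
print item['title']        # reads like a dict
# item['foo'] = 'x'        # would raise KeyError: 'foo' is not a declared field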
Next, create a Spider by subclassing scrapy.Spider and defining the following three attributes (a minimal sketch follows this list):
· name: identifies the Spider. The name must be unique; you cannot give two different Spiders the same name.
· start_urls: the list of URLs the Spider crawls from when it starts, so the first pages fetched will come from this list. Subsequent URLs are extracted from the data of those initial pages.
· parse(): a method of the spider. When called, the Response object produced for each downloaded start URL is passed to it as its only argument. The method is responsible for parsing the response data, extracting data (producing items), and generating Request objects for URLs that need further processing.
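Put together, a minimal spider with those three attributes looks roughly like this (a generic sketch, not the eLong spider built later):

import scrapy

class MinimalSpider(scrapy.Spider):
    name = "minimal"                       # must be unique within the project
    start_urls = ["http://example.com/"]   # crawling starts from these URLs

    def parse(self, response):
        # called once per downloaded start URL, with the Response as the only argument
        self.logger.info("fetched %s", response.url)
        # parse response.body here, then yield items and/or further Requests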
Then create a main.py file in the project directory with the following content:
from scrapy import cmdline
# note: this user_agent variable is defined here but never actually used by cmdline.execute
user_agent = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_3) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.54 Safari/536.5'
cmdline.execute(['scrapy', 'crawl', 'yilong'])
Then I wrote my own spider file, which I named yilong.py. The imports alone are already quite numerous:
# coding:utf-8
import scrapy
import requests
import json
import re
import urllib2
from bs4 import BeautifulSoup
from scrapy.http import Request
from scrapy.http import FormRequest
from tutorial.items import yilongItem
The crawling process is largely the same as with plain AJAX requests; the difference is here:
item = yilongItem()
item['title'] = m1
item['contentnews']=m2
yield item
Define an item and then yield it. With yield the method becomes a generator, so it can keep yielding new requests (or items); with return it would only ever produce a single request.
I tried using two yields in one spider, but it did not work: the fields always ended up in separate rows when written to the CSV file, so I went back to a single yield, as the sketch below shows.
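A sketch of what went wrong (hypothetical fragment, same field names as above): yielding two half-filled items gives the exporter two separate rows, while filling both fields on one item keeps everything in one row.

# two yields -> two CSV rows, each with one empty column
item1 = yilongItem()
item1['title'] = m1
yield item1
item2 = yilongItem()
item2['contentnews'] = m2
yield item2

# one yield -> one CSV row with both columns filled
item = yilongItem()
item['title'] = m1
item['contentnews'] = m2
yield item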
Create the pipelines.py file, which needs the following (a fuller example of what process_item can do comes after the parameter notes):
class TutorialPipeline(object):
    def process_item(self, item, spider):
        return item
item (Item object) – the item that was scraped
spider (Spider object) – the spider that scraped the item
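process_item is the hook where per-item processing goes. A pipeline that, say, drops items with an empty review would look like this (hypothetical, not used in this project):

from scrapy.exceptions import DropItem

class CleanPipeline(object):
    def process_item(self, item, spider):
        # discard items that have no review text
        if not item.get('contentnews'):
            raise DropItem("missing contentnews in %s" % item)
        return item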
The most important file of all is settings.py. It needs the export format, the output location, the encoding of the exported content, and the column headers of the table (the custom CSV exporter it references is sketched right after the snippet).
BOT_NAME = 'tutorial'
FEED_EXPORT_ENCODING = 'gbk'
FEED_URI = 'file:///D:/tutorial/123456.csv'
FEED_FORMAT = 'csv'
FEED_EXPORTERS = {
'csv': 'tutorial.spiders.csv_item_exporter.MyProjectCsvItemExporter',
}
# CSV_DELIMITER = "\t"  # tab delimiter
FIELDS_TO_EXPORT = ['title',
'contentnews',
]
SPIDER_MODULES = ['tutorial.spiders']
NEWSPIDER_MODULE = 'tutorial.spiders'
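FEED_EXPORTERS above points at a custom exporter, tutorial/spiders/csv_item_exporter.py, which the post never shows. A minimal sketch of what such a class usually looks like, assuming the same old Scrapy/Python 2 APIs used elsewhere in this project, is:

# csv_item_exporter.py -- a reconstruction, not taken from the original project
from scrapy.conf import settings
from scrapy.contrib.exporter import CsvItemExporter

class MyProjectCsvItemExporter(CsvItemExporter):
    def __init__(self, *args, **kwargs):
        # pick up the delimiter and the column order from settings.py
        kwargs['delimiter'] = settings.get('CSV_DELIMITER', ',')
        fields_to_export = settings.get('FIELDS_TO_EXPORT', [])
        if fields_to_export:
            kwargs['fields_to_export'] = fields_to_export
        super(MyProjectCsvItemExporter, self).__init__(*args, **kwargs)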
One more crucial point lives in this file: my code kept failing to extract any text, until I finally set
ROBOTSTXT_OBEY = False
When you actually need the content to be crawled, this must be False.
It defaults to True, meaning the rules in robots.txt are obeyed. So what is robots.txt?
robots.txt is a file that follows the Robots exclusion protocol; it lives on the website's server.
Its job is to tell search-engine crawlers which directories of the site they are not welcome to crawl and index.
After Scrapy starts, it fetches the site's robots.txt first and uses it to decide the crawl scope.
Of course, we are not building a search engine here, and in some cases the content we want is exactly what robots.txt forbids.
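For reference, robots.txt is just a plain-text list of rules; a typical (made-up) one looks like this:

User-agent: *
Disallow: /admin/
Disallow: /private/

With ROBOTSTXT_OBEY = True, Scrapy's robots.txt downloader middleware will refuse to download any URL under the disallowed paths.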
The complete code follows:
dmoz_spider.py
import scrapy

class DmozItem(scrapy.Item):
    title = scrapy.Field()
    link = scrapy.Field()
    desc = scrapy.Field()

class DmozSpider(scrapy.Spider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "https://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
        "https://www.dmoz.org/Computers/Programming/Languages/Python/Resources/"
    ]

    def parse(self, response):
        filename = response.url.split("/")[-2]
        with open(filename, 'wb') as f:
            f.write(response.body)
main.py
#coding=utf-8
from scrapy import cmdline
user_agent = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_3) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.54 Safari/536.5'
cmdline.execute(['scrapy','crawl','yilong'])
items.py
import scrapy
from scrapy import Item, Field

class yilongItem(scrapy.Item):
    title = scrapy.Field()
    contentnews = scrapy.Field()
    # contenttime = scrapy.Field()
    pass
pipelines.py
# -*- coding: utf-8 -*-
# Define your item pipelines here
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html
class TutorialPipeline(object):
    def process_item(self, item, spider):
        return item
settings.py
# -*- coding: utf-8 -*-
# Scrapy settings for tutorial project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
# http://doc.scrapy.org/en/latest/topics/settings.html
# http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html
# http://scrapy.readthedocs.org/en/latest/topics/spider-middleware.html
BOT_NAME = 'tutorial'
FEED_EXPORT_ENCODING = 'gbk'
FEED_URI = 'file:///D:/tutorial/123456.csv'
FEED_FORMAT = 'csv'
FEED_EXPORTERS = {
'csv': 'tutorial.spiders.csv_item_exporter.MyProjectCsvItemExporter',
}
# CSV_DELIMITER = "\t"
FIELDS_TO_EXPORT = ['title',
'contentnews',
]
SPIDER_MODULES = ['tutorial.spiders']
NEWSPIDER_MODULE = 'tutorial.spiders'
# Crawl responsibly by identifying yourself (and your website) on the user-agent
USER_AGENT = 'tutorial (+http://www.yourdomain.com)'
# Obey robots.txt rules
ROBOTSTXT_OBEY = False
# FEED_EXPORTERS_BASE = {
# 'csv': 'scrapy.contrib.exporter.CsvItemExporter'
# }
# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32
# Configure a delay for requests for the same website (default: 0)
# See http://scrapy.readthedocs.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16
# Disable cookies (enabled by default)
#COOKIES_ENABLED = False
# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False
# Override the default request headers:
DEFAULT_REQUEST_HEADERS = {
'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
'Accept-Language': 'en',
}
# Enable or disable spider middlewares
# See http://scrapy.readthedocs.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
# 'tutorial.middlewares.TutorialSpiderMiddleware': 543,
#}
# Enable or disable downloader middlewares
# See http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
# 'tutorial.middlewares.MyCustomDownloaderMiddleware': 543,
#}
# Enable or disable extensions
# See http://scrapy.readthedocs.org/en/latest/topics/extensions.html
#EXTENSIONS = {
# 'scrapy.extensions.telnet.TelnetConsole': None,
#}
# Configure item pipelines
# See http://scrapy.readthedocs.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
'tutorial.pipelines.TutorialPipeline': 300,
}
yilong1.py
# coding:utf-8
import scrapy
import requests
import json
import re
import urllib2
from bs4 import BeautifulSoup
from scrapy.http import Request
from scrapy.http import FormRequest
from tutorial.items import yilongItem
class yilongspider(scrapy.Spider):
    name = 'yilong'
    allowed_domains = ["elong.com"]
    custom_settings = {
        "headers": {
            'Accept': 'application/json, text/javascript, */*; q=0.01',
            'Accept-Encoding': 'gzip, deflate',
            'Accept-Language': 'zh-CN,zh;q=0.8,en-US;q=0.5,en;q=0.3',
            'Connection': 'keep-alive',
            'Cookie': 'CookieGuid=158948cd-563a-4211-8a3e-a92bd263cba0; page_time=1509628060819%2C1509628152402%2C1509628219681%2C1509693907739%2C1509693938986%2C1509693967862%2C1509701358453%2C1509701412108%2C1509764088450%2C1509764288344%2C1509772980043%2C1509773457586%2C1509773753700%2C1509774349211%2C1509774650832%2C1509775873404%2C1509775901516%2C1509776293716%2C1509776386781%2C1509776424331%2C1509776587625%2C1509776889277%2C1509778200633%2C1509779461446; _RF1=111.225.131.187; _RSG=zs4ixjEH93BfifEVp1.6NB; _RDG=28df89f13e8f75279825ba18abc6dd550e; _RGUID=b0c944eb-7bb0-4dbd-b5c6-f81cdf3a9b29; ShHotel=CityID=0101&CityNameCN=%E5%8C%97%E4%BA%AC%E5%B8%82&CityName=%E5%8C%97%E4%BA%AC%E5%B8%82&OutDate=2017-11-06&CityNameEN=beijing&InDate=2017-11-05; _fid=j9jp7iwv-bf9b-4c1d-8c4f-893b780f205c; newjava1=a79d58d364ea53c8c9161ec3b2f45c8d; JSESSIONID=EB741B472421B3E455397D577D009DE8; SessionGuid=6a99e379-6ae4-4a74-b0a4-f94dfe60ab5a; Esid=1ec4caeb-c805-4d96-8ddf-43a6a23eed52; com.eLong.CommonService.OrderFromCookieInfo=Status=1&Orderfromtype=1&Isusefparam=0&Pkid=50&Parentid=50000&Coefficient=0.0&Makecomefrom=0&Cookiesdays=0&Savecookies=0&Priority=8000; fv=pcweb; s_cc=true; s_sq=%5B%5BB%5D%5D; s_visit=1',
            'Host': 'hotel.elong.com ',
            'Upgrade-Insecure-Requests': '1',
            'User-Agent': 'Mozilla/5.0 (Windows NT 6.3; WOW64; rv:56.0) Gecko/20100101 Firefox/56.0',
            'Content-Type': 'application/x-www-form-urlencoded; charset=UTF-8',
            'X-Requested-With': 'XMLHttpRequest',
            'Content-Length': '1062'},
    }

    def start_requests(self):
        url = "http://hotel.elong.com/ajax/list/asyncsearch"
        requests = []  # note: this local list shadows the requests module inside this method only
        for i in range(0, 1):
            formdata = {'listRequest.orderFromID': '50',
                        'listRequest.pageIndex': str(i)}
            # print formdata
            request = FormRequest(url, callback=self.parse_item, formdata=formdata)
            requests.append(request)
        return requests

    def parse_item(self, response):
        data0 = json.loads(response.body)
        data1 = data0['value']['hotelIds']
        data2 = data1.split(',')
        for k in data2:
            url2 = 'http://hotel.elong.com/' + str(k)
            page = urllib2.urlopen(url2)
            html = page.read()
            soup = BeautifulSoup(html, "html.parser")
            # item = yilongItem()
            for tag1 in soup.find_all('div', class_="hdetail_view"):
                m_1 = tag1.get_text(strip=True)
                m1 = m_1 + '\n'
                # yield item1
            for s in range(0, 1):
                url3 = 'http://hotel.elong.com/ajax/detail/gethotelreviews/?hotelId=' + str(
                    k) + '&recommendedType=0&pageIndex=' + str(
                    s) + '&mainTagId=0&subTagId=0&code=9253708&elongToken=j9cpvej8-4dea-4d07-a3b9-54b27a2797e9&ctripToken=88cf3b41-c4a2-4e49-a411-16af6b55ebec&_=1509281280799'
                r = requests.get(url3).text
                data = json.loads(r)
                # item2 = yilongItem()
                for i in data["contents"]:
                    m2 = i['content'].encode('utf-8')
                    # item2["contenttime"] = i['createTimeString'].encode('utf-8')
                    # print item2["contenttime"]
                    # yield item2
                    item = yilongItem()
                    item['title'] = m1
                    item['contentnews'] = m2
                    yield item
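If everything above is wired together, running main.py (or scrapy crawl yilong from the project root) should write the two exported columns to D:/tutorial/123456.csv under the feed settings shown earlier, so the file would start roughly like this (illustrative content only):

title,contentnews
<hotel detail text>,<review text>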