Reference link: http://essun.blog.51cto.com/721033/1703367
#yum remove audit
#yum install python-pip -y    # covered earlier in the WeChat crawler on CentOS 7 post
#pip install setuptools
#pip install setuptools --upgrade
#sudo pip install --upgrade pyasn1    # a step I added myself; without it the install kept erroring out
# pip install Scrapy
Collecting Scrapy
  Using cached Scrapy-1.0.3-py2-none-any.whl
Requirement already satisfied (use --upgrade to upgrade): cssselect>=0.9 in /usr/lib/python2.7/site-packages (from Scrapy)
Requirement already satisfied (use --upgrade to upgrade): queuelib in /usr/lib/python2.7/site-packages (from Scrapy)
Requirement already satisfied (use --upgrade to upgrade): pyOpenSSL in /usr/lib/python2.7/site-packages (from Scrapy)
Requirement already satisfied (use --upgrade to upgrade): w3lib>=1.8.0 in /usr/lib/python2.7/site-packages (from Scrapy)
Collecting lxml (from Scrapy)
  Using cached lxml-3.4.4.tar.gz
Collecting Twisted>=10.0.0 (from Scrapy)
  Using cached Twisted-15.4.0.tar.bz2
Requirement already satisfied (use --upgrade to upgrade): six>=1.5.2 in /usr/lib/python2.7/site-packages (from Scrapy)
Collecting service-identity (from Scrapy)
  Using cached service_identity-14.0.0-py2.py3-none-any.whl
Requirement already satisfied (use --upgrade to upgrade): cryptography>=0.7 in /usr/lib64/python2.7/site-packages (from pyOpenSSL->Scrapy)
Collecting zope.interface>=3.6.0 (from Twisted>=10.0.0->Scrapy)
  Using cached zope.interface-4.1.3.tar.gz
Collecting characteristic>=14.0.0 (from service-identity->Scrapy)
  Using cached characteristic-14.3.0-py2.py3-none-any.whl
Collecting pyasn1-modules (from service-identity->Scrapy)
  Using cached pyasn1_modules-0.0.8-py2.py3-none-any.whl
Requirement already satisfied (use --upgrade to upgrade): pyasn1 in /usr/lib/python2.7/site-packages (from service-identity->Scrapy)
Requirement already satisfied (use --upgrade to upgrade): idna>=2.0 in /usr/lib/python2.7/site-packages (from cryptography>=0.7->pyOpenSSL->Scrapy)
Requirement already satisfied (use --upgrade to upgrade): setuptools in /usr/lib/python2.7/site-packages (from cryptography>=0.7->pyOpenSSL->Scrapy)
Requirement already satisfied (use --upgrade to upgrade): enum34 in /usr/lib/python2.7/site-packages (from cryptography>=0.7->pyOpenSSL->Scrapy)
Requirement already satisfied (use --upgrade to upgrade): ipaddress in /usr/lib/python2.7/site-packages (from cryptography>=0.7->pyOpenSSL->Scrapy)
Requirement already satisfied (use --upgrade to upgrade): cffi>=1.1.0 in /usr/lib64/python2.7/site-packages (from cryptography>=0.7->pyOpenSSL->Scrapy)
Requirement already satisfied (use --upgrade to upgrade): pycparser in /usr/lib/python2.7/site-packages (from cffi>=1.1.0->cryptography>=0.7->pyOpenSSL->Scrapy)
Installing collected packages: lxml, zope.interface, Twisted, characteristic, pyasn1-modules, service-identity, Scrapy
  Running setup.py install for lxml
  Running setup.py install for zope.interface
  Running setup.py install for Twisted
Successfully installed Scrapy-1.0.3 Twisted-15.4.0 characteristic-14.3.0 lxml-3.4.4 pyasn1-modules-0.0.8 service-identity-14.0.0 zope.interface-4.1.3
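With the installation finished, a quick sanity check from the Python shell confirms that Scrapy and its main dependencies import cleanly (a minimal check I added; the expected versions simply mirror the packages installed above):

import scrapy
import twisted
import lxml.etree
import OpenSSL

print(scrapy.__version__)        # expected: 1.0.3
print(twisted.version.short())   # expected: 15.4.0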
[root@localhost workspace]# scrapy startproject tutorial
2015-10-15 21:54:24 [scrapy] INFO: Scrapy 1.0.3 started (bot: scrapybot)
2015-10-15 21:54:24 [scrapy] INFO: Optional features available: ssl, http11
2015-10-15 21:54:24 [scrapy] INFO: Overridden settings: {}
New Scrapy project 'tutorial' created in:
    /workspace/tutorial

You can start your first spider with:
    cd tutorial
    scrapy genspider example example.com
[root@localhost workspace]# tree
.
└── tutorial
├── scrapy.cfg
└── tutorial
├── __init__.py
├── items.py
├── pipelines.py
├── settings.py
└── spiders
└── __init__.py
3 directories, 6 files
Items are the containers that will hold the scraped data. They work like Python dictionaries, but provide extra protection: assigning to an undefined field raises an error, which guards against typos.
An Item is declared by creating a subclass of scrapy.item.Item and defining its attributes as scrapy.item.Field objects, much like an object-relational mapping (ORM).
We model the data we want from dmoz.org as an Item: we want each site's name, URL and description, so we define fields for these three attributes. To do this, edit the items.py file under the tutorial directory; our Item class will look like this:
# -*- coding: utf-8 -*-
from scrapy.item import Item, Field

class DmozItem(Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    title = Field()
    link = Field()
    desc = Field()
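As a quick illustration of the dict-like behaviour and the typo protection (a hypothetical interactive session I added, not part of the project files):

from tutorial.items import DmozItem

item = DmozItem(title="Example site", link="http://example.com")
print(item["title"])       # reads like a dict: "Example site"
item["desc"] = "a demo"    # assigning to a defined field works
# item["author"] = "me"    # would raise KeyError: 'DmozItem does not support field: author'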
8. The First Spider
A Spider is a user-written class used to scrape information from a domain (or group of domains).
It defines an initial list of URLs to download, how to follow links, and how to parse page contents to extract items.
To build a Spider, you must subclass scrapy.spider.BaseSpider and define three main, mandatory attributes:

name: the spider's unique identifier, used to invoke it from the command line.
start_urls: the initial list of URLs the spider will crawl.
parse(): the method responsible for parsing the downloaded data, extracting the scraped data (as items) and following further URLs.

Here is the code of our first spider; name it dmoz_spider.py and save it under the tutorial/spiders directory.
from scrapy.spider import BaseSpider

class DmozSpider(BaseSpider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://news.sohu.com/20161120/n473657485.shtml"
    ]

    def parse(self, response):
        # for the start URL above this yields the filename "20161120"
        filename = response.url.split("/")[-2]
        open(filename, 'wb').write(response.body)
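The parse() above just dumps the raw response body to a file. As a sketch of the item-extraction side (my own illustrative example, assuming the DmozItem defined earlier; the spider name, start URL and XPath expressions are placeholders), parse() can instead build and yield items:

from scrapy.spider import BaseSpider
from tutorial.items import DmozItem  # the Item class defined earlier

class DmozItemSpider(BaseSpider):
    name = "dmoz_items"    # hypothetical second spider
    allowed_domains = ["dmoz.org"]
    start_urls = ["http://www.dmoz.org/Computers/Programming/Languages/Python/Books/"]

    def parse(self, response):
        # one item per <li> entry: link text, href and surrounding text
        for sel in response.xpath('//ul/li'):
            item = DmozItem()
            item['title'] = sel.xpath('a/text()').extract()
            item['link'] = sel.xpath('a/@href').extract()
            item['desc'] = sel.xpath('text()').extract()
            yield item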
9. Running the Spider
To put our spider to work, go back to the project's top-level directory and run:
#scrapy crawl dmoz
At this point the link copied from WeChat on the phone may fail to be fetched; if so, change ROBOTSTXT_OBEY = True to ROBOTSTXT_OBEY = False in settings.py.
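For reference, the relevant line in tutorial/settings.py ends up as (a minimal sketch; only this one setting changes):

# tutorial/settings.py
# Disable robots.txt compliance so pages whose robots.txt forbids crawlers can still be fetched
ROBOTSTXT_OBEY = False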
That basically completes the setup.