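These two classes are used in the rules attribute of a CrawlSpider. Their parameters are covered in plenty of places online, so I won't repeat them here. What I want to talk about are a few gotchas that nearly did me in.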
from scrapy.spiders import Rule
from scrapy.linkextractors import LinkExtractor
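1. rules defines how URLs found in a response are crawled. Each extracted URL is requested in turn, and the resulting response is parsed and/or followed according to that Rule's callback and follow settings.
Two points to stress here. First, URLs are extracted from every response that comes back, including the responses to the initial start_urls requests. Second, every Rule in the rules list is applied to each response; if the same link matches more than one Rule, the first matching Rule in the list handles it.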
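2. The allow parameter doesn't need a full regular expression for the URLs you want to extract. A partial pattern is enough, as long as it tells the target links apart from the rest. And most importantly, even if the original page writes its links as relative URLs, the LinkExtractor class still extracts them as absolute URLs. This class is seriously impressive.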
start_urls = ['https://www.kanunu8.com/book2/10935/index.html']

def parse(self, response):
    # Raw string so the backslashes reach the regex engine untouched
    link = LinkExtractor(allow=r'\d{6}\.html', restrict_xpaths='//div//table//a')
    links = link.extract_links(response)
    print(links)
[Link(url='https://www.kanunu8.com/book2/10935/194600.html', text='楔子', fragment='', nofollow=False),
Link(url='https://www.kanunu8.com/book2/10935/194601.html', text='第一章', fragment='', nofollow=False),
Link(url='https://www.kanunu8.com/book2/10935/194602.html', text='第二章', fragment='', nofollow=False),
Link(url='https://www.kanunu8.com/book2/10935/194603.html', text='第三章', fragment='', nofollow=False),
Link(url='https://www.kanunu8.com/book2/10935/194604.html', text='第四章', fragment='', nofollow=False),
Link(url='https://www.kanunu8.com/book2/10935/194605.html', text='第五章', fragment='', nofollow=False),
Link(url='https://www.kanunu8.com/book2/10935/194606.html', text='第六章', fragment='', nofollow=False),
Link(url='https://www.kanunu8.com/book2/10935/194607.html', text='第七章', fragment='', nofollow=False),
Link(url='https://www.kanunu8.com/book2/10935/194608.html', text='第八章', fragment='', nofollow=False),
Link(url='https://www.kanunu8.com/book2/10935/194609.html', text='第九章', fragment='', nofollow=False),
Link(url='https://www.kanunu8.com/book2/10935/194610.html', text='第十章', fragment='', nofollow=False),
Link(url='https://www.kanunu8.com/book2/10935/194611.html', text='第十一章', fragment='', nofollow=False),
Link(url='https://www.kanunu8.com/book2/10935/194612.html', text='第十二章', fragment='', nofollow=False),
Link(url='https://www.kanunu8.com/book2/10935/194613.html', text='第十三章', fragment='', nofollow=False),
Link(url='https://www.kanunu8.com/book2/10935/194614.html', text='后记', fragment='', nofollow=False)]
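See that? The page only gives relative addresses, yet LinkExtractor resolves them against the page URL and returns absolute ones. Pretty impressive. Note also that links is a list of Link objects. Looping over it like this:
for link in links:
    print(link.url)
gives you the absolute URLs directly. That is very convenient, because there is no need to call response.urljoin() yourself.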
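For anyone curious why a partial allow pattern and relative links both work, here is a rough standard-library sketch of the two behaviours. It is not Scrapy's actual implementation: the function name resolve_and_filter and the sample hrefs are invented for illustration. It relies on two facts: relative hrefs are resolved against the page URL, and the allow pattern is searched for anywhere in the resulting URL rather than having to match it in full.

```python
import re
from urllib.parse import urljoin


def resolve_and_filter(page_url, hrefs, allow):
    """Resolve hrefs against page_url and keep those containing the allow pattern."""
    pattern = re.compile(allow)
    # Step 1: turn every href (relative or not) into an absolute URL
    absolute = [urljoin(page_url, href) for href in hrefs]
    # Step 2: re.search matches anywhere in the URL, so a partial pattern is enough
    return [url for url in absolute if pattern.search(url)]


page_url = "https://www.kanunu8.com/book2/10935/index.html"
# Hypothetical hrefs, written the way a page might list them
hrefs = ["194600.html", "194601.html", "../index.html", "/files/cover.jpg"]

for url in resolve_and_filter(page_url, hrefs, r"\d{6}\.html"):
    print(url)
```

Only the two chapter links survive the filter: the other two hrefs are resolved as well, but they contain no six-digit file name followed by .html.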