New to Scrapy, I hit a crawling requirement in a project that needed the data embedded in a <script> tag of an HTML page. Sample URL: http://www.pbcsf.tsinghua.edu.cn/portal/list/index/id/10.html
There are several ways to get at it.
response.xpath("//body//script/text()") returns the text content of the script tags; from there you can process the string however you like, for example by parsing it into JSON.
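A minimal, self-contained sketch of that first approach, using lxml directly in place of a live Scrapy response; the HTML snippet and its field values are made up, but the variable name "data" mirrors the target page:

```python
import json
import re

from lxml import etree

# Hypothetical HTML resembling the target page: a <script> tag that
# assigns a JSON array to a variable named "data".
html_text = """
<html><body>
<script>
var data = [{"title": "An Li", "url": "/teacher/1.html"}];
</script>
</body></html>
"""

tree = etree.HTML(html_text)
# Equivalent to response.xpath("//body//script/text()") in Scrapy.
script_text = tree.xpath("//body//script/text()")[0]

# Cut the JSON array out of the JavaScript and load it as Python objects.
json_str = re.search(r"var data = (\[.*?\]);", script_text, re.S).group(1)
data = json.loads(json_str)
print(data[0]["title"])  # -> An Li
```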
for match in re.finditer('"title":"(.*?)"', response.text):
    print(match.group())
# Output:
# "title":"安砾"
# "title":"陈卓"
# "title":"金涛"
# "title":"鞠建东"
# "title":"李波"
# "title":"廖理"
The regular expression "title":"(.*?)" can be checked with an online regex tester.
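Calling group(1) instead of group() yields just the captured name. A small sketch, with an inline sample string standing in for response.text:

```python
import re

# Sample text standing in for response.text on the real page.
text = '[{"title":"安砾","url":"/a.html"},{"title":"陈卓","url":"/b.html"}]'

# group(1) returns only the captured name, not the full '"title":"..."' match.
names = [m.group(1) for m in re.finditer(r'"title":"(.*?)"', text)]
print(names)  # ['安砾', '陈卓']
```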
Another option is to analyze the script with js2xml, which already supports Python 3.6.
js2xml is a tool that converts JavaScript code into an XML document; after the conversion, the data can be extracted much more conveniently with XPath.
The full spider code:
import scrapy
import js2xml
from bs4 import BeautifulSoup
from lxml import etree


class PbcsftsinghuaeducnSpider(scrapy.Spider):
    name = 'PbcsfTsinghuaEduCn'
    allowed_domains = ['pbcsf.tsinghua.edu.cn']

    def start_requests(self):
        url = 'http://www.pbcsf.tsinghua.edu.cn/portal/list/index/id/10.html'
        yield scrapy.Request(url, dont_filter=True, callback=self.parse)

    def parse(self, response):
        soup = BeautifulSoup(response.text, 'html.parser')
        # Only the first few <script> tags contain the "data" variable.
        for script in soup.select('body script')[:6]:
            if not script.string:  # skip external scripts (src attribute only)
                continue
            # Parse the JavaScript into an XML document, then query it with XPath.
            parsed_js = js2xml.parse(script.string, encoding='utf-8', debug=False)
            selector = etree.HTML(js2xml.pretty_print(parsed_js))
            for obj in selector.xpath("//var[@name='data']//object"):
                url = response.urljoin(obj.xpath(".//property[@name='url']/string/text()")[0])
                name = obj.xpath(".//property[@name='title']/string/text()")[0]
                print(name + ':\t' + url)
For more details on js2xml, see https://github.com/scrapinghub/js2xml/blob/master/README.md