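Nokogiri (鋸) makes it easy and painless to work with XML and HTML from Ruby. It provides a sensible, easy-to-understand API for reading, writing, modifying, and querying documents. It is fast and standards-compliant because it relies on native parsers such as libxml2 (CRuby) and xerces (JRuby).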
Nokogiri
Installation
Environment:
Requirements:
- Ruby >= 2.6
- JRuby >= 9.3.0.0
Command:
gem install nokogiri
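Installation may fail on the server, but that is not a problem: once the pages have been scraped locally, scraping can still continue on the remote side, so usage is not affected.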
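Usage
Searching with css is the more convenient option. Note that when selecting by id, the selector needs a leading #, for example doc.css('#content_views').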
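Also note that for the scraped content you can use content.children directly: the result still includes the p and code tags. If you use content.content instead, you get only the plain text (the Chinese characters) with the tags stripped.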
#! /usr/bin/env ruby
require 'nokogiri'
require 'open-uri'

# Fetch and parse HTML document
doc = Nokogiri::HTML(URI.open('https://nokogiri.org/tutorials/installing_nokogiri.html'))

# Search for nodes by css
doc.css('nav ul.menu li a', 'article h2').each do |link|
  puts link.content
end

# Search for nodes by xpath
doc.xpath('//nav//ul//li/a', '//article//h2').each do |link|
  puts link.content
end

# Or mix and match
doc.search('nav ul.menu li a', '//article//h2').each do |link|
  puts link.content
end
Method: Nokogiri::XML::Searchable#css
Defined in:
lib/nokogiri/xml/searchable.rb
#css(*args) ⇒ Object
call-seq:
css(*rules, [namespace-bindings, custom-pseudo-class])
Search this object for CSS rules. rules must be one or more CSS selectors. For example:
node.css('title')
node.css('body h1.bold')
node.css('div + p.green', 'div#one')
A hash of namespace bindings may be appended. For example:
node.css('bike|tire', {'bike' => 'http://schwinn.com/'})
Custom CSS pseudo classes may also be defined which are mapped to a custom XPath function. To define custom pseudo classes, create a class and implement the custom pseudo class you want defined. The first argument to the method will be the matching context NodeSet. Any other arguments are ones that you pass in. For example:
handler = Class.new {
  def regex(node_set, regex)
    node_set.find_all { |node| node['some_attribute'] =~ /#{regex}/ }
  end
}.new
node.css('title:regex("\w+")', handler)
Some XPath syntax is supported in CSS queries. For example, to query for an attribute:
node.css('img > @href') # returns all href attributes on an img element
node.css('img / @href') # same

# ⚠ this returns class attributes from all div elements AND THEIR CHILDREN!
node.css('div @class')

node.css