Using goquery, a Web-Scraping Library for Golang

Introduction

goquery brings a syntax and a set of features similar to jQuery to the Go language. It is based on Go’s net/html package and the CSS Selector library cascadia. Since the net/html parser returns nodes, and not a full-featured DOM tree, jQuery’s stateful manipulation functions (like height(), css(), detach()) have been left off.

In short, goquery provides jQuery-like functionality, so elements can be selected quickly with CSS selectors.

GitHub: https://github.com/PuerkitoBio/goquery

Installation: go get github.com/PuerkitoBio/goquery

example1
package main

import (
  "fmt"
  "log"
  "net/http"
  "github.com/PuerkitoBio/goquery"
)

func ExampleScrape() {
  // Request the HTML page.
  res, err := http.Get("http://metalsucks.net")
  if err != nil {
    log.Fatal(err)
  }
  defer res.Body.Close()
  if res.StatusCode != 200 {
    log.Fatalf("status code error: %d %s", res.StatusCode, res.Status)
  }

  // Load the HTML document
  doc, err := goquery.NewDocumentFromReader(res.Body)
  if err != nil {
    log.Fatal(err)
  }

  // Find the review items
  doc.Find(".sidebar-reviews article .content-block").Each(func(i int, s *goquery.Selection) {
    // For each item found, get the band and title
    band := s.Find("a").Text()
    title := s.Find("i").Text()
    fmt.Printf("Review %d: %s - %s\n", i, band, title)
  })
}

func main() {
  ExampleScrape()
}

Output:

go run main.go
Review 0: Code Orange - Underneath
Review 1: Ozzy Osbourne - Ordinary Man
Review 2: Kvelertak - Splid
Review 3: Suicide Silence - Become the Hunter
Review 4: Cattle Decapitation - Death Atlas
example2
package main

import (
	"fmt"
	"log"
	"net/http"

	"github.com/PuerkitoBio/goquery"
)

func ExampleScrape() {
	// Request the HTML page.
	res, err := http.Get("https://github.com/PuerkitoBio/goquery")
	if err != nil {
		log.Fatal(err)
	}
	defer res.Body.Close()
	if res.StatusCode != 200 {
		log.Fatalf("status code error: %d %s", res.StatusCode, res.Status)
	}

	// Load the HTML document
	doc, err := goquery.NewDocumentFromReader(res.Body)
	if err != nil {
		log.Fatal(err)
	}

	// Select every div with the class Box-body
	doc.Find("div.Box-body").Each(func(i int, s *goquery.Selection) {
		// For each match, print the combined text of its <p> descendants
		txt := s.Find("p").Text()
		fmt.Printf("%s\n", txt)
	})
}

func main() {
	ExampleScrape()
}

Output (Text() concatenates the text of every matched <p> element, so the paragraphs run together):

goquery brings a syntax and a set of features similar to jQuery to the Go language. It is based on Go’s net/html package and the CSS Selector library cascadia. Since the net/html parser returns nodes, and not a full-featured DOM tree, jQuery’s stateful manipulation functions (like height(), css(), detach()) have been left off.Also, because the net/html parser requires UTF-8 encoding, so does goquery: it is the caller’s responsibility to ensure that the source document provides UTF-8 encoded HTML. See the wiki for various options to do this.Syntax-wise, it is as close as possible to jQuery, with the same function names when possible, and that warm and fuzzy chainable interface. jQuery being the ultra-popular library that it is, I felt that writing a similar HTML-manipulating library was better to follow its API than to start anew (in the same spirit as Go’s fmt package), even though some of its methods are less than intuitive (looking at you, index()…).Please note that because of the net/html dependency, goquery requires Go1.1+.(optional) To run unit tests:(optional) To run benchmarks (warning: it runs for a few minutes):Note that goquery’s API is now stable, and will not break.goquery exposes two structs, Document and Selection, and the Matcher interface. Unlike jQuery, which is loaded as part of a DOM document, and thus acts on its containing document, goquery doesn’t know which HTML document to act upon. So it needs to be told, and that’s what the Document type is for. It holds the root document node as the initial Selection value to manipulate.jQuery often has many variants for the same function (no argument, a selector string argument, a jQuery object argument, a DOM element argument, …). 
Instead of exposing the same features in goquery as a single method with variadic empty interface arguments, statically-typed signatures are used following this naming convention:Utility functions that are not in jQuery but are useful in Go are implemented as functions (that take a *Selection as parameter), to avoid a potential naming clash on the *Selection’s methods (reserved for jQuery-equivalent behaviour).The complete godoc reference documentation can be found here.Please note that Cascadia’s selectors do not necessarily match all supported selectors of jQuery (Sizzle). See the cascadia project for details. Invalid selector strings compile to a Matcher that fails to match any node. Behaviour of the various functions that take a selector string as argument follows from that fact, e.g. (where ~ is an invalid selector string):See some tips and tricks in the wiki.Adapted from example_test.go:There are a number of ways you can support the project:If you desperately want to send money my way, I have a BuyMeACoffee.com page:The BSD 3-Clause license, the same as the Go language. Cascadia’s license is here.

function

func NewDocumentFromResponse(res *http.Response) (*Document, error)

NewDocumentFromResponse is another Document constructor that takes an http response as argument. It loads the specified response’s document, parses it, and stores the root Document node, ready to be manipulated. The response’s body is closed on return.
Deprecated: Use goquery.NewDocumentFromReader with the response’s body.

func (s *Selection) Each(f func(int, *Selection)) *Selection

Each iterates over a Selection object, executing a function for each matched element. It returns the current Selection object. The function f is called for each element in the selection with the index of the element in that selection starting at 0, and a *Selection that contains only that element.

func (s *Selection) Filter(selector string) *Selection

Filter reduces the set of matched elements to those that match the selector string. It returns a new Selection object for this subset of matching elements.

func (s *Selection) Find(selector string) *Selection

Find gets the descendants of each element in the current set of matched elements, filtered by a selector. It returns a new Selection object containing these matched elements.
