Colly: A Golang Crawler Development Example

Yesterday I happened to see a friend share an article about Colly, a crawler framework written in Golang:

Writing Crawlers in Golang (6) - Using colly

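Colly is a fast, lightweight crawler framework written in Golang. It supports asynchronous, parallel, and distributed scraping, and it can handle cookies and sessions.

Colly's official documentation is short and clear, and it comes with many examples, so it is worth reading.

I previously wrote a crawler with net/http and goquery (a concurrent Golang crawler that scraped a well-known gaming media site). Colly also uses goquery to parse HTML.

goquery's selector syntax is not difficult and can be picked up quickly.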

The entry point of Colly is a Collector instance. The Collector manages all requests and data processing.

Colly has several commonly used callbacks, listed here in the order they run:

OnRequest

Called before a request is sent.

OnError

Called if an error occurs during the request.

OnResponse

Called after a response is received.

OnHTML

Called right after OnResponse if the received content is HTML.

OnXML

Called right after OnHTML if the received content is HTML or XML.

OnScraped

Called after the OnXML callbacks.

As an example I use Gamersky (游民星空), a site I have scraped before. The first step is to initialize the Collector instance.

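    c := colly.NewCollector(
        colly.Async(true),
        colly.UserAgent("Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36"),
    )

    // A LimitRule needs a DomainGlob or DomainRegexp; without one,
    // Limit returns an error and the rule is never applied.
    if err := c.Limit(&colly.LimitRule{
        DomainGlob:  "*",
        Parallelism: 2,
        RandomDelay: 5 * time.Second,
    }); err != nil {
        log.Fatal(err)
    }

    detailCollector := c.Clone()

    // ... register callbacks and call Visit here ...

    // Block until all asynchronous requests have finished.
    c.Wait()
    detailCollector.Wait()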

This enables asynchronous mode, sets the parallelism to 2, and adds a random delay of up to 5 seconds before each request.

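detailCollector is a Collector that copies the configuration of c (Clone does not copy the registered callbacks). Using separate Collectors lets each kind of page have its own callbacks.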

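Because asynchronous mode is enabled, and Colly's asynchronous mode is built on sync.WaitGroup, we have to call Wait() to block until all requests have finished.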
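The reason Wait() is needed can be shown with sync.WaitGroup alone. The sketch below is an illustration under assumptions, not Colly code: `fetchAll` is a hypothetical helper, and the example.com URLs are placeholders that are never actually requested.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// fetchAll starts one goroutine per URL and blocks until every goroutine
// has called Done. It returns how many simulated requests completed.
func fetchAll(urls []string) int64 {
	var wg sync.WaitGroup
	var done int64
	for _, u := range urls {
		wg.Add(1)
		go func(u string) {
			defer wg.Done()
			_ = u // a real crawler would send the request here
			atomic.AddInt64(&done, 1)
		}(u)
	}
	// Without this call the function could return before any goroutine ran,
	// which is what happens when an async crawl is started and never waited on.
	wg.Wait()
	return atomic.LoadInt64(&done)
}

func main() {
	n := fetchAll([]string{
		"https://example.com/a",
		"https://example.com/b",
		"https://example.com/c",
	})
	fmt.Println("completed:", n) // prints "completed: 3"
}
```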

    c.OnRequest(func(r *colly.Request) {
        fmt.Println("Visiting", r.URL.String())
    })

    c.Visit("https://www.gamersky.com/news/")

Visit requests the given URL, and the OnRequest callback runs before each request is sent.

    c.OnHTML("a[class=tt]", func(e *colly.HTMLElement) {
        // Resolve the href against the page URL in case it is relative.
        detailCollector.Visit(e.Request.AbsoluteURL(e.Attr("href")))
    })

OnHTML is called when the response is HTML. Here a goquery selector picks out every article link on the front page.

detailCollector then requests each of those URLs.
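Links taken from an href attribute may be relative, and a second Collector needs a full URL to visit. Colly's Request type offers AbsoluteURL for this; the standard-library sketch below shows the kind of resolution involved. `absoluteURL` is a hypothetical helper, and the `example.shtml` paths are made up for illustration.

```go
package main

import (
	"fmt"
	"net/url"
)

// absoluteURL resolves href against the URL of the page it was found on.
// It returns "" when either value cannot be parsed.
func absoluteURL(pageURL, href string) string {
	base, err := url.Parse(pageURL)
	if err != nil {
		return ""
	}
	ref, err := url.Parse(href)
	if err != nil {
		return ""
	}
	return base.ResolveReference(ref).String()
}

func main() {
	page := "https://www.gamersky.com/news/"
	fmt.Println(absoluteURL(page, "/news/example.shtml"))   // https://www.gamersky.com/news/example.shtml
	fmt.Println(absoluteURL(page, "example.shtml"))         // https://www.gamersky.com/news/example.shtml
	fmt.Println(absoluteURL(page, "https://example.com/x")) // https://example.com/x
}
```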

    type news struct {
        Title     string
        URL       string
        Contents  string
        CrawledAt time.Time
    }

    detailCollector.OnResponse(func(r *colly.Response) {
        fmt.Println(r.StatusCode)
    })

    detailCollector.OnHTML("body", func(e *colly.HTMLElement) {
        n := news{}
        n.Title = e.ChildText("div[class=Mid2L_tit]>h1")
        n.URL = e.Request.URL.String()
        n.Contents = e.ChildText("div[class=Mid2L_con]>p")
        n.CrawledAt = time.Now()
        log.Println(n)
    })

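When a response for detailCollector arrives, its OnResponse callback runs; here I print the HTTP status code of each request. After that, OnHTML parses the returned HTML and extracts the data I need.

At this point I have all the data I need. Persisting it is the next step, which I will not cover here.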

Full code

package main

import (
	"fmt"
	"log"
	"time"

	"github.com/gocolly/colly"
)

// news holds the fields extracted from one article page.
type news struct {
	Title     string
	URL       string
	Contents  string
	CrawledAt time.Time
}

func main() {
	c := colly.NewCollector(
		colly.Async(true),
		colly.UserAgent("Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36"),
	)

	// A LimitRule needs a DomainGlob or DomainRegexp; without one,
	// Limit returns an error and the rule is never applied.
	if err := c.Limit(&colly.LimitRule{
		DomainGlob:  "*",
		Parallelism: 2,
		RandomDelay: 5 * time.Second,
	}); err != nil {
		log.Fatal(err)
	}

	// detailCollector handles the article pages.
	detailCollector := c.Clone()

	detailCollector.OnResponse(func(r *colly.Response) {
		fmt.Println(r.StatusCode)
	})

	detailCollector.OnHTML("body", func(e *colly.HTMLElement) {
		n := news{}
		n.Title = e.ChildText("div[class=Mid2L_tit]>h1")
		n.URL = e.Request.URL.String()
		n.Contents = e.ChildText("div[class=Mid2L_con]>p")
		n.CrawledAt = time.Now()
		log.Println(n)
	})

	c.OnHTML("a[class=tt]", func(e *colly.HTMLElement) {
		// Resolve the href against the page URL in case it is relative.
		detailCollector.Visit(e.Request.AbsoluteURL(e.Attr("href")))
	})

	c.OnRequest(func(r *colly.Request) {
		fmt.Println("Visiting", r.URL.String())
	})

	c.Visit("https://www.gamersky.com/news/")

	// Block until all asynchronous requests have finished.
	c.Wait()
	detailCollector.Wait()
}


GitHub: github.com/3inchtime/c…

This is a basic Colly example. Colly has many more features; for the other methods and options, see the official documentation.
