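Background

A colleague found Tencent's microservice development documentation. For work reasons they can't stay on the external network all the time, so they asked me to crawl the documentation and turn it into a PDF. At first I thought this would be simple: it's just crawling a gitbook, and there are plenty of ready-made tools online.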

soeasy.jpg

~~The not-so-simple link:~~ https://tsf-gitbook-1257356411.cos.ap-chengdu.myqcloud.com/1.12.4/usage/%E4%BA%A7%E5%93%81%E7%AE%80%E4%BB%8B/%E4%BA%A7%E5%93%81%E6%A6%82%E8%BF%B0.html

What makes the link not so simple

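The root link (https://tsf-gitbook-1257356411.cos.ap-chengdu.myqcloud.com/1.12.4/usage/) is not accessible.
The links are inconsistent: the HTML source you get after clicking the first chapter and the one you get after clicking the second chapter differ in both the number and the content of their links.
Some links cannot be clicked, possibly because those articles are not finished yet.
Because the root link is not accessible, the pages can only be reached through paths containing Chinese characters, which makes many ready-made tools unusable
(https://github.com/TruthHun/converter)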
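To make the "Chinese path" problem concrete, here is a minimal sketch that decodes the percent-encoded path of the link above using only Go's standard `net/url` package. The helper name `decodePath` is made up for this illustration and is not part of the project.

```go
package main

import (
	"fmt"
	"net/url"
)

// decodePath (hypothetical helper) returns the human-readable path of a URL.
// url.Parse stores the percent-decoded form of the path in u.Path.
func decodePath(rawURL string) (string, error) {
	u, err := url.Parse(rawURL)
	if err != nil {
		return "", err
	}
	return u.Path, nil
}

func main() {
	link := "https://tsf-gitbook-1257356411.cos.ap-chengdu.myqcloud.com/1.12.4/usage/" +
		"%E4%BA%A7%E5%93%81%E7%AE%80%E4%BB%8B/%E4%BA%A7%E5%93%81%E6%A6%82%E8%BF%B0.html"
	p, err := decodePath(link)
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	// The decoded path contains Chinese directory and file names.
	fmt.Println(p)
}
```

A tool that assumes ASCII-only paths, or that starts from the root link, has nothing to work with here, which is why the crawl has to start from a concrete chapter page.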

++A nonstandard gitbook like this can only be crawled with code you write yourself++

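Approach

Find a tool that can convert HTML to PDF
Fetch all the HTML pages, pack them into a single zip file, and convert that directly to PDF with the tool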

Using ready-made tools

Download the tools

Calibre (HTML to PDF)

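Install calibre
Download: https://calibre-ebook.com/download
Install the calibre build that matches your system. Note that you need calibre 3.x; 2.x is not as capable, so simply installing the latest version is fine. After installing, add calibre to your system PATH. If the command below prints a 3.x (or later) version, the installation succeeded.
ebook-convert --version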
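The same check can be scripted. The sketch below runs `ebook-convert --version` and extracts the major version. It assumes the output contains text of the form `calibre X.Y.Z`; that format, and the helper name `calibreMajor`, are assumptions for illustration rather than something the project relies on.

```go
package main

import (
	"fmt"
	"os/exec"
	"regexp"
	"strconv"
)

// versionRe matches "calibre 3." and captures the major number.
// Assumption: the --version output contains "calibre X.Y.Z".
var versionRe = regexp.MustCompile(`calibre (\d+)\.`)

// calibreMajor (hypothetical helper) extracts the major version number
// from the text printed by `ebook-convert --version`.
func calibreMajor(output string) (int, error) {
	m := versionRe.FindStringSubmatch(output)
	if m == nil {
		return 0, fmt.Errorf("no calibre version found in %q", output)
	}
	return strconv.Atoi(m[1])
}

func main() {
	out, err := exec.Command("ebook-convert", "--version").Output()
	if err != nil {
		fmt.Println("ebook-convert is not available on PATH:", err)
		return
	}
	major, err := calibreMajor(string(out))
	if err != nil {
		fmt.Println(err)
		return
	}
	fmt.Println("calibre major version:", major, "new enough:", major >= 3)
}
```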

Google Chrome (save the current page's HTML)

Usage:

[Screenshot: saving the current page's HTML from the browser]

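So far I haven't found a tool that can save the HTML of all pages in one click. If you know of one, please tell me. I'm not asking for it for free; I'll pay.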

Usage

Generate an .epub file. I won't go into detail here; see the coding approach below for how it is done.
Run the command ebook-convert demo.epub demo.pdf

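The coding approach

This approach evolved from the tool-based one; the overall idea is the same.

Fetch all the content and save it as HTML

1. Starting from the crawl URL set in the config file, collect every page to crawl that is linked from the current page
2. After fetching, it turned out that every crawled HTML page contains the sidebar table-of-contents links, which is very unfriendly to the PDF's table of contents
3. So we take only the content of the gitbook's book body and build our own HTML around it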

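body := htmlquery.Find(doc, "//div[@class='page-inner']")
if len(body) != 0 {
    pdfBody := body[0]
    // Serialize the matched node, including the node itself.
    htmlBody := htmlquery.OutputHTML(pdfBody, true)
    // Assumed skeleton: only the two %v placeholders (title, body) are certain.
    htmlTempleta := `<!DOCTYPE html>
<html>
<head>
    <meta charset="UTF-8">
    <title>%v</title>
</head>
<body>
%v
</body>
</html>`
    htmlTempleta = fmt.Sprintf(htmlTempleta, book.Title, htmlBody)
}

// The same template, filled with a single value as both the title and the body.
htmlTempleta := `<!DOCTYPE html>
<html>
<head>
    <meta charset="UTF-8">
    <title>%v</title>
</head>
<body>
%v
</body>
</html>`
htmlTempleta = fmt.Sprintf(htmlTempleta, value, value)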

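Assemble the epub file (reference link)
1. Generate the mimetype file
2. Generate the container.xml file
3. Generate the table-of-contents file
4. Generate the cover-page logo
5. Generate the content.opf file
6. Pack the generated files into a zip archive, then change the extension to .epub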
Convert the epub to PDF with the calibre command (PDF options are optional)

args := []string{
        this.BasePath+"/content.epub",
        this.BasePath + "/" + output + "/book.pdf",
    }
    // page size
    if len(this.Config.PaperSize) > 0 {
        args = append(args, "--paper-size", this.Config.PaperSize)
    }
    // font size
    if len(this.Config.FontSize) > 0 {
        args = append(args, "--pdf-default-font-size", this.Config.FontSize)
    }

    //header template
    if len(this.Config.Header) > 0 {
        args = append(args, "--pdf-header-template", this.Config.Header)
    }

    //footer template
    if len(this.Config.Footer) > 0 {
        args = append(args, "--pdf-footer-template", this.Config.Footer)
    }

    if len(this.Config.MarginLeft) > 0 {
        args = append(args, "--pdf-page-margin-left", this.Config.MarginLeft)
    }
    if len(this.Config.MarginTop) > 0 {
        args = append(args, "--pdf-page-margin-top", this.Config.MarginTop)
    }
    if len(this.Config.MarginRight) > 0 {
        args = append(args, "--pdf-page-margin-right", this.Config.MarginRight)
    }
    if len(this.Config.MarginBottom) > 0 {
        args = append(args, "--pdf-page-margin-bottom", this.Config.MarginBottom)
    }

    // additional options
    if len(this.Config.More) > 0 {
        args = append(args, this.Config.More...)
    }
    fmt.Println(args)
    cmd := exec.Command(ebookConvert, args...)
    return cmd.Run()

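Notes

This approach does not work for gitbooks in general. To make it work for another book you need to change the crawl logic: the code lives in crawl/htmlspider, and the scraping logic there is what has to be adapted.
Because some links cannot be followed, the automatically generated JSON file needs to be edited by hand; see the documentation in the GitHub repository for details.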
Showing off the result

Go practice project (1): converting HTML to PDF and adding your own watermark
