Converting a web page to a PDF document with Python

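Approach:
1. Scrape the HTML markup of the page content you want to turn into a PDF.
2. Place the scraped markup inside a body tag to assemble a complete HTML document. (I remember there being a library that can do this, but I searched for a long time and couldn't find it; if you remember it, please leave a comment below.)
3. Use the pdfkit library to convert the assembled HTML into a PDF document.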
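
Step 2 can be sketched with plain string formatting. The helper name wrap_fragment and its title parameter are hypothetical, introduced only for illustration; this is a minimal sketch that assumes the scraped fragment is already a valid HTML string.

```python
from html import escape


def wrap_fragment(fragment: str, title: str = "Title") -> str:
    """Wrap an HTML fragment in a minimal, complete HTML document.

    Hypothetical helper for illustration; not part of pdfkit or lxml.
    """
    # Declaring the charset helps non-ASCII text (e.g. Chinese) render correctly.
    return (
        "<!DOCTYPE html>\n"
        "<html>\n"
        "<head>\n"
        '<meta charset="UTF-8">\n'
        f"<title>{escape(title)}</title>\n"
        "</head>\n"
        "<body>\n"
        f"{fragment}\n"
        "</body>\n"
        "</html>\n"
    )


if __name__ == "__main__":
    print(wrap_fragment("<div><p>Hello</p></div>", title="Demo"))
```

The title is escaped because it is plain text, while the fragment is inserted unchanged because it is already markup.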

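Installing and using the pdfkit library
pip install pdfkit
You also need to install the companion software wkhtmltox (download it from the official website and click Next through the installer).
Then add the bin directory inside the wkhtmltox installation directory to the PATH environment variable.
pdfkit.from_file('path to the HTML file', 'path to the output PDF file')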
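
If editing PATH is inconvenient, pdfkit also accepts an explicit path to the wkhtmltopdf executable through pdfkit.configuration. The sketch below first looks for the executable on PATH and falls back to a path supplied by the caller; the function names find_wkhtmltopdf and html_file_to_pdf are assumptions made for illustration.

```python
import shutil
from typing import Optional


def find_wkhtmltopdf() -> Optional[str]:
    """Return the full path of wkhtmltopdf if it is on PATH, else None."""
    return shutil.which("wkhtmltopdf")


def html_file_to_pdf(html_path: str, pdf_path: str,
                     wkhtmltopdf_path: Optional[str] = None) -> None:
    """Convert an HTML file to PDF, optionally with an explicit binary path."""
    import pdfkit  # imported lazily so the PATH check works without pdfkit

    exe = wkhtmltopdf_path or find_wkhtmltopdf()
    if exe is None:
        raise FileNotFoundError(
            "wkhtmltopdf not found; install wkhtmltox or pass its path"
        )
    config = pdfkit.configuration(wkhtmltopdf=exe)
    pdfkit.from_file(html_path, pdf_path, configuration=config)


if __name__ == "__main__":
    print("wkhtmltopdf found at:", find_wkhtmltopdf())
```

Running the script on its own is a quick way to confirm whether the PATH setup worked before attempting a conversion.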

Here is the code:

import pdfkit
import requests
from lxml import etree

# Scrape the main body of a CSDN article
def spider(url):
    headers = {
        'authority': 'blog.csdn.net',
        'pragma': 'no-cache',
        'cache-control': 'no-cache',
        'sec-ch-ua': '"Google Chrome";v="95", "Chromium";v="95", ";Not A Brand";v="99"',
        'sec-ch-ua-mobile': '?0',
        'sec-ch-ua-platform': '"Windows"',
        'dnt': '1',
        'upgrade-insecure-requests': '1',
        'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/95.0.4638.69 Safari/537.36',
        'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
        'sec-fetch-site': 'same-origin',
        'sec-fetch-mode': 'navigate',
        'sec-fetch-user': '?1',
        'sec-fetch-dest': 'document',
        'referer': 'https://blog.csdn.net/CXY00000?spm=1008.2194.3001.5343',
        'accept-language': 'zh-CN,zh;q=0.9,en-US;q=0.8,en;q=0.7',
        'cookie': '',  # replace with your own cookie
    }
    # print(headers)
    # A timeout keeps the script from hanging if the server never responds
    response = requests.get(url=url, headers=headers, timeout=10)
    # print(response.text)
    tree = etree.HTML(response.text)
    lis=tree.xpath('//div[@class="blog-content-box"]')[0]
    lis_str=etree.tostring(lis,encoding='unicode')
    # print(lis_str)
    html1 = '''
<!DOCTYPE html>
<html>
    <head>
        <meta charset="UTF-8">
            <title>Title</title>
    </head>
        <body>
            {}
        </body>
</html>
        '''.format(lis_str)

    with open('./csdn.html','w',encoding='utf8') as fp:
        fp.write(html1)

# Generate a PDF from the HTML file
def makepdf(out_name):
    pdfkit.from_file('./csdn.html','./'+str(out_name)+'.pdf')




if __name__ == '__main__':
    url = input('Enter the URL to scrape: ')
    out_name = input('Enter the output PDF file name: ')
    spider(url)
    makepdf(out_name)
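
pdfkit also provides from_string, which converts an HTML string directly, so the intermediate csdn.html file is optional. The following is a sketch of that variant rather than the code used above; pdf_name and html_to_pdf are hypothetical helper names, and the UTF-8 encoding option is an assumption that suits Chinese article text.

```python
def pdf_name(out_name: str) -> str:
    """Return out_name with a .pdf extension, adding one only if missing."""
    name = out_name.strip()
    return name if name.lower().endswith(".pdf") else name + ".pdf"


def html_to_pdf(html: str, out_name: str) -> str:
    """Convert a complete HTML string to a PDF and return the output path."""
    import pdfkit  # requires wkhtmltopdf to be installed and on PATH

    path = pdf_name(out_name)
    # 'encoding' is passed through to wkhtmltopdf as --encoding UTF-8.
    pdfkit.from_string(html, path, options={"encoding": "UTF-8"})
    return path


if __name__ == "__main__":
    print(pdf_name("csdn"))      # csdn.pdf
    print(pdf_name("csdn.pdf"))  # csdn.pdf
```

Normalizing the name this way avoids producing a file such as csdn.pdf.pdf when the user already typed the extension.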

Download link for the standalone exe tool:
csdn文章转pdf(可见即可转).exe
https://jhc001.lanzouw.com/idWn4x0n50b
Password: 7eeh
