Merging docx files with aspose.words + docx and removing the Aspose marks

Motivation

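For work, I needed to merge several Word documents and convert the result to HTML for display on client devices, preserving the original style as far as possible. This post addresses three problems: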

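  • Merging multiple Word documents (mainly by appending one after another)
  • Converting the merged document to an HTML file, which involves rendering English and Japanese fonts as they appear in Word and converting the images in the merged document to base64
  • Since Aspose is a commercial product, removing the Aspose marks from the converted output without cracking the library, so that it can be used for free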

Installing the main tools

Main code

  • Importing packages
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
# DESC: 1. Merge multiple docx files with docx
#       2. Convert docx to HTML with aspose
#       3. Add, delete, and modify HTML elements and content with bs4

import os
import re
import pandas as pd
import aspose.words as aw
import aspose.words.saving as saving
from bs4 import BeautifulSoup
from docx import Document
from docxcompose.composer import Composer
  • Merging Word documents
def merge_docx(docx_list: list, docx_merge_tar: str, docx_list_src: str) -> str:
    """
    Merge Word documents.
    For now the documents are only appended one after another; no page breaks are inserted.
    """
    if not docx_list:
        raise ValueError("input is empty.")
    # With a single file there is nothing to merge, so return its path as-is
    if len(docx_list) == 1:
        return os.path.join(docx_list_src, docx_list[0])
    # Use the first document as the base document
    base_docx = Document(os.path.join(docx_list_src, docx_list[0]))
    base_docx_composer = Composer(base_docx)
    # Append the remaining documents to the base document with composer.append
    for next_docx in docx_list[1:]:
        next_docx_path = os.path.join(docx_list_src, next_docx)
        base_docx_composer.append(Document(next_docx_path))
    base_docx_composer.save(docx_merge_tar)
    print("merge docx list ok.")
    return docx_merge_tar
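
merge_docx expects the file names in the order they should be appended, but the post does not show how that list is built. Below is a minimal sketch of one way to collect it, assuming all input files sit directly in one folder and that alphabetical order is the desired merge order. The helper name list_docx_files is hypothetical and not part of the original code.

```python
import os


def list_docx_files(src_dir: str) -> list:
    """Return the .docx file names in src_dir, sorted by name.

    Hypothetical helper for illustration; not part of the original post.
    """
    names = []
    for name in os.listdir(src_dir):
        # Word creates "~$..." lock files while a document is open; skip them.
        if name.startswith("~$"):
            continue
        if name.lower().endswith(".docx"):
            names.append(name)
    # Sorting fixes the append order, so the first name becomes the base document.
    return sorted(names)
```

The result can be passed straight in, for example merge_docx(list_docx_files(src_dir), target_path, src_dir). If the merge order matters and does not follow the file names, an explicit list is the safer choice.
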
  • Converting Word to HTML
def aspose_convert_docx_html(docx_file_path: str, html_file_path: str) -> str:
    """
    Convert a Word document to HTML with aspose.words for Python
    """
    docx = aw.Document(docx_file_path)
    # Set the conversion options
    save_options = saving.HtmlSaveOptions(aw.SaveFormat.HTML)
    # Embed images in the HTML as base64
    save_options.export_images_as_base64 = True
    docx.save(html_file_path, save_options)
    return html_file_path
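
To check that the images really ended up embedded as base64 rather than written out as separate files, the generated HTML can be inspected with the standard library alone. This is a rough sketch; the names ImageSrcCollector and split_image_sources are made up for illustration, and it assumes embedded images appear as <img> tags whose src is a data: URI.

```python
from html.parser import HTMLParser


class ImageSrcCollector(HTMLParser):
    """Collect the src attribute of every <img> tag."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            # A missing src is recorded as an empty string.
            src = dict(attrs).get("src") or ""
            self.sources.append(src)


def split_image_sources(html_text: str):
    """Return (embedded, external) lists of <img> sources."""
    collector = ImageSrcCollector()
    collector.feed(html_text)
    # Embedded images use a data: URI; anything else points to a separate resource.
    embedded = [s for s in collector.sources if s.startswith("data:")]
    external = [s for s in collector.sources if not s.startswith("data:")]
    return embedded, external
```

An empty external list after conversion suggests the HTML file is self-contained as far as images are concerned.
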
  • Removing the Aspose marks
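def del_aspose_elemet(html_tar_file: str, to_tar_file: str):
    """
    Remove the Aspose marks from the HTML file
    """
    # Use context managers so both files are closed even if parsing fails
    with open(html_tar_file, "r", encoding="utf-8") as html_content:
        soup = BeautifulSoup(html_content, features="lxml")
    # Remove the elements whose style carries -aw-headerfooter-type
    # (the header/footer blocks in the Aspose output)
    for tag in soup.find_all(style=re.compile("-aw-headerfooter-type:")):
        tag.extract()
    # Remove the "Evaluation Only" paragraph, if there is one
    word_key_tag = soup.find("p", string=re.compile("Evaluation Only"))
    if word_key_tag is not None:
        word_key_tag.extract()

    with open(to_tar_file, "w", encoding="utf-8") as f:
        f.write(soup.prettify())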
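
After cleaning, it can be worth confirming that none of the markers are left in the output. The sketch below does a plain text search for the same two patterns used in the function above. The name find_leftover_marks is hypothetical, and other Aspose versions may emit different markers, so treat the pattern list as an assumption.

```python
import re

# Markers taken from the cleaning function in this post;
# other Aspose versions may emit different ones (assumption).
ASPOSE_MARK_PATTERNS = [
    re.compile(r"-aw-headerfooter-type:"),
    re.compile(r"Evaluation Only"),
]


def find_leftover_marks(html_text: str) -> list:
    """Return the patterns that still match somewhere in html_text."""
    return [p.pattern for p in ASPOSE_MARK_PATTERNS if p.search(html_text)]
```

Because this searches the raw text, a document whose own content contains the phrase "Evaluation Only" would be reported as well, so a non-empty result needs a manual look.
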

Test

if __name__ == '__main__':
    docx_file_path = r"D:\merge_tar\demo.docx"
    html_file_path = r"D:\merge_tar\demo.html"
    aspose_convert_docx_html(docx_file_path, html_file_path)

    process_file_path = r"D:\merge_tar\demo_d.html"
    del_aspose_elemet(html_file_path, process_file_path)
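
The test hardcodes three related paths. As a small sketch, the two output paths could instead be derived from the input path, assuming Windows-style paths as in the test and the same "_d" suffix for the cleaned file. The helper name derive_output_paths is hypothetical.

```python
from pathlib import PureWindowsPath


def derive_output_paths(docx_file_path: str):
    """Return (html_path, cleaned_html_path) next to the given .docx file."""
    # PureWindowsPath parses Windows-style paths on any operating system.
    docx_path = PureWindowsPath(docx_file_path)
    html_path = docx_path.with_suffix(".html")
    # The cleaned file keeps the same stem with "_d" appended, as in the test.
    cleaned_path = docx_path.with_name(docx_path.stem + "_d.html")
    return str(html_path), str(cleaned_path)
```

On a non-Windows machine with native paths, pathlib.Path would be the natural replacement for PureWindowsPath.
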

Test results

  • demo.docx

[Image 1: demo.docx]

  • Word converted to HTML by Aspose

[Image 2: Word converted to HTML by Aspose]

  • After removing the Aspose marks

[Image 3: after removing the Aspose marks]

Afterword

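  • Aspose offers many conversion options; see the demos in the aspose.words GitHub repository for details
  • bs4 is a powerful tool for processing HTML
  • This post mainly records the results of handling documents at work; if it is useful to you, all the better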
