Word to HTML conversion (doc and docx, images embedded as Base64)

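I recently worked on a project that needed Word-to-HTML conversion: the user uploads a Word document, which is converted to HTML for online preview and editing. Because I store the previewed and edited document in S3, I chose to convert the images in the Word file directly to Base64 and upload them to S3 as part of the HTML. The advantage is that no separate store (MongoDB, for example) is needed for the images. The drawback is that the resulting HTML is somewhat larger than the source Word file, since Base64 expands binary data by roughly one third. Which approach is better depends on your own situation.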
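To put a rough number on that size cost: Base64 encodes every 3 bytes as 4 characters, so an embedded image grows by about one third. The sketch below uses only java.util.Base64. The class name, the method names and the zero-filled byte array are made up for illustration and stand in for real image data.

```java
import java.util.Base64;

public class Base64Overhead {

    // Number of Base64 characters (including padding) needed for n raw bytes.
    static long encodedLength(long n) {
        return 4 * ((n + 2) / 3);
    }

    // Builds a data URI of the kind placed in <img src="..."> in the converted HTML.
    static String toDataUri(String mimeType, byte[] content) {
        return "data:" + mimeType + ";base64," + Base64.getEncoder().encodeToString(content);
    }

    public static void main(String[] args) {
        byte[] fakeImage = new byte[300_000]; // placeholder for real image bytes
        String uri = toDataUri("image/png", fakeImage);
        System.out.println("raw bytes:     " + fakeImage.length);
        System.out.println("base64 chars:  " + encodedLength(fakeImage.length));
        System.out.println("data URI size: " + uri.length());
    }
}
```

The data URI is slightly longer than the Base64 payload alone because of the "data:<mime type>;base64," prefix.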

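There are plenty of articles online about converting Word to HTML, so I will not show the variant that saves the Word images to a folder. I only post the code that converts the images to Base64, for anyone who needs it. If you spot a problem, please point it out so we can all improve.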

First, the required jars:

dependencies {
    compile group: 'fr.opensagres.xdocreport', name: 'xdocreport', version: '2.0.2'
    // https://mvnrepository.com/artifact/org.apache.poi/poi
    compile group: 'org.apache.poi', name: 'poi', version: '4.1.0'
    // https://mvnrepository.com/artifact/org.apache.poi/poi-scratchpad
    compile group: 'org.apache.poi', name: 'poi-scratchpad', version: '4.1.0'
    // https://mvnrepository.com/artifact/org.apache.poi/poi-ooxml
    compile group: 'org.apache.poi', name: 'poi-ooxml', version: '4.1.0'
    // https://mvnrepository.com/artifact/org.apache.poi/poi-ooxml-schemas
    compile group: 'org.apache.poi', name: 'poi-ooxml-schemas', version: '4.1.0'
    // https://mvnrepository.com/artifact/org.apache.poi/ooxml-schemas
    compile group: 'org.apache.poi', name: 'ooxml-schemas', version: '1.4'
}

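I manage this project with Gradle; with Maven the dependencies are the same. Two things to watch: on Gradle 7 or later the compile configuration has been removed, so use implementation instead, and the code below also calls FileUtils from Apache Commons IO, which is not in the list above, so add commons-io if it is not already on your classpath.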

Because doc and docx are converted in different ways, I give the code for each separately.
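Since the two formats need different converters and the file arrives as an upload, it can help to decide which path to take from the file's first bytes rather than from its extension. The following is only a sketch: WordFormatSniffer, WordFormat and detect are hypothetical names, not part of POI or xdocreport. It relies on two well-known signatures: legacy .doc files are OLE2 compound files, and .docx files are ZIP archives. A match therefore means "probably doc" or "probably docx", because other Office formats share the same containers.

```java
public class WordFormatSniffer {

    enum WordFormat { DOC, DOCX, UNKNOWN }

    // Signature of an OLE2 compound file (container used by legacy .doc).
    private static final int[] OLE2_MAGIC = {0xD0, 0xCF, 0x11, 0xE0, 0xA1, 0xB1, 0x1A, 0xE1};
    // Local file header signature of a ZIP archive (container used by .docx).
    private static final int[] ZIP_MAGIC = {0x50, 0x4B, 0x03, 0x04};

    // head: the first bytes of the uploaded file (8 bytes are enough).
    static WordFormat detect(byte[] head) {
        if (startsWith(head, OLE2_MAGIC)) return WordFormat.DOC;
        if (startsWith(head, ZIP_MAGIC)) return WordFormat.DOCX;
        return WordFormat.UNKNOWN;
    }

    private static boolean startsWith(byte[] data, int[] magic) {
        if (data == null || data.length < magic.length) return false;
        for (int i = 0; i < magic.length; i++) {
            if ((data[i] & 0xFF) != magic[i]) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        byte[] zipLike = {0x50, 0x4B, 0x03, 0x04, 0x14, 0x00};
        byte[] ole2Like = {(byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
                           (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1};
        System.out.println(detect(zipLike));
        System.out.println(detect(ole2Like));
        System.out.println(detect(new byte[0]));
    }
}
```

In a real upload handler the caller would read the first 8 bytes, call detect, and then reopen or reset the stream before handing it to the matching converter.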

doc to HTML

import org.apache.commons.io.FileUtils;
import org.apache.poi.hwpf.HWPFDocumentCore;
import org.apache.poi.hwpf.converter.WordToHtmlConverter;
import org.apache.poi.hwpf.converter.WordToHtmlUtils;
import org.apache.poi.hwpf.usermodel.Picture;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.transform.*;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import java.io.*;
import java.nio.charset.StandardCharsets;
import java.util.Base64;

public class DocToHtml {
    public static void main(String[] args) throws ParserConfigurationException, TransformerException, IOException {
        new DocToHtml().docToHtml();
    }

    public void docToHtml() throws IOException, ParserConfigurationException, TransformerException {
        // Load the .doc file; try-with-resources closes the input stream
        HWPFDocumentCore wordDocument;
        try (InputStream in = new FileInputStream("D:\\345.doc")) {
            wordDocument = WordToHtmlUtils.loadDoc(in);
        }

        // Convert the Word document into an HTML DOM, embedding images via ImageConverter
        WordToHtmlConverter wordToHtmlConverter = new ImageConverter(
                DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument()
        );
        wordToHtmlConverter.processDocument(wordDocument);
        Document htmlDocument = wordToHtmlConverter.getDocument();

        // Serialize the DOM as UTF-8 HTML
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        Transformer serializer = TransformerFactory.newInstance().newTransformer();
        serializer.setOutputProperty(OutputKeys.ENCODING, "UTF-8");
        serializer.setOutputProperty(OutputKeys.INDENT, "yes");
        serializer.setOutputProperty(OutputKeys.METHOD, "html");
        serializer.transform(new DOMSource(htmlDocument), new StreamResult(out));

        // Decode with the same charset the serializer wrote, not the platform default
        String result = new String(out.toByteArray(), StandardCharsets.UTF_8);
        FileUtils.writeStringToFile(new File("D:\\", "a.html"), result, "utf-8");
    }

    // Static nested class: it does not need access to the enclosing DocToHtml instance
    public static class ImageConverter extends WordToHtmlConverter {

        public ImageConverter(Document document) {
            super(document);
        }

        // Called for each picture when no PicturesManager is set: emit an <img> with a data URI
        @Override
        protected void processImageWithoutPicturesManager(Element currentBlock, boolean inlined, Picture picture) {
            Element imgNode = currentBlock.getOwnerDocument().createElement("img");
            // getContent() returns the decoded image bytes (getRawContent() may still be compressed).
            // The basic encoder is used because the MIME encoder inserts line breaks.
            String base64 = Base64.getEncoder().encodeToString(picture.getContent());
            imgNode.setAttribute("src", "data:" + picture.getMimeType() + ";base64," + base64);
            currentBlock.appendChild(imgNode);
        }
    }
}
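A note on the encoder used for the src attribute: Base64.getMimeEncoder() wraps its output at 76 characters with CRLF line separators, whereas Base64.getEncoder() emits one unbroken line, which is the safer choice for a data URI. The small comparison below illustrates the difference. EncoderComparison and its helper methods are names invented for this sketch, and the zero-filled array stands in for picture bytes.

```java
import java.util.Base64;

public class EncoderComparison {

    // Basic encoder: a single line, no separators.
    static String basic(byte[] data) {
        return Base64.getEncoder().encodeToString(data);
    }

    // MIME encoder: lines of at most 76 characters separated by "\r\n".
    static String mime(byte[] data) {
        return Base64.getMimeEncoder().encodeToString(data);
    }

    public static void main(String[] args) {
        byte[] data = new byte[100]; // long enough to need more than one 76-character line
        String b = basic(data);
        String m = mime(data);
        System.out.println("basic contains line break: " + b.contains("\r\n"));
        System.out.println("mime contains line break:  " + m.contains("\r\n"));
        // Stripping the line breaks from the MIME output yields the basic encoding
        System.out.println("same payload: " + m.replace("\r\n", "").equals(b));
    }
}
```

Small images may fit in a single 76-character line and show no difference, so the problem tends to surface only with realistically sized pictures.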

The result looks like this:

[Image 1]

Rendering after conversion

[Image 2]

docx to HTML

import fr.opensagres.poi.xwpf.converter.xhtml.Base64EmbedImgManager;
import fr.opensagres.poi.xwpf.converter.xhtml.XHTMLConverter;
import fr.opensagres.poi.xwpf.converter.xhtml.XHTMLOptions;
import org.apache.commons.io.FileUtils;

import org.apache.poi.xwpf.usermodel.XWPFDocument;
import java.io.*;

/**
 * Created by liushiyu
 * Converts docx to HTML
 */
public class DocxToHtml {

    // Convert a docx file to an HTML string
    public String docxToHtml(String fileName) throws IOException {
        // try-with-resources closes both the input stream and the document
        try (InputStream in = new FileInputStream(fileName);
             XWPFDocument docxDocument = new XWPFDocument(in)) {
            XHTMLOptions options = XHTMLOptions.create();
            // Embed images as Base64 instead of writing them to disk
            options.setImageManager(new Base64EmbedImgManager());
            // Convert to HTML
            ByteArrayOutputStream htmlStream = new ByteArrayOutputStream();
            XHTMLConverter.getInstance().convert(docxDocument, htmlStream, options);
            return htmlStream.toString();
        }
    }

    public static void main(String[] args) throws Exception {
        DocxToHtml test = new DocxToHtml();
        FileUtils.writeStringToFile(new File("D:\\", "a2.html"), test.docxToHtml("D:\\567.docx"), "utf-8");
    }
}

The result looks like this:

[Image 3]

HTML conversion result

[Image 4]
