实现文档在线预览的方式除了上篇文章[《文档在线预览(一)通过将txt、word、pdf转成图片实现在线预览功能》](https://blog.csdn.net/q2qwert/article/details/130884607)说的将文档转成图片的实现方式外,还有转成pdf,前端通过pdf.js、pdfobject.js等插件来实现在线预览,以及本文将要说到的将文档转成html的方式来实现在线预览。
以下代码分别提供基于aspose、pdfbox、spire来实现来实现txt、word、pdf、ppt、word等文件转图片的需求。
Aspose 是一家致力于.Net ,Java,SharePoint,JasperReports和SSRS组件的提供商,数十个国家的数千机构都有用过aspose组件,创建、编辑、转换或渲染 Office、OpenOffice、PDF、图像、ZIP、CAD、XPS、EPS、PSD 和更多文件格式。注意aspose是商用组件,未经授权导出文件里面都是是水印(尊重版权,远离破解版)。
需要在项目的pom文件里添加如下依赖
<dependency>
<groupId>com.asposegroupId>
<artifactId>aspose-wordsartifactId>
<version>23.1version>
dependency>
<dependency>
<groupId>com.asposegroupId>
<artifactId>aspose-pdfartifactId>
<version>23.1version>
dependency>
<dependency>
<groupId>com.asposegroupId>
<artifactId>aspose-cellsartifactId>
<version>23.1version>
dependency>
<dependency>
<groupId>com.asposegroupId>
<artifactId>aspose-slidesartifactId>
<version>23.1version>
dependency>
因为aspose和spire虽然好用,但是都是是商用组件,所以这里也提供使用开源库操作的方式的方式。
POI是Apache软件基金会用Java编写的免费开源的跨平台的 Java API,Apache POI提供API给Java程序对Microsoft Office格式档案读和写的功能。
Apache PDFBox是一个开源Java库,支持PDF文档的开发和转换。 使用此库,您可以开发用于创建,转换和操作PDF文档的Java程序。
需要在项目的pom文件里添加如下依赖
<dependency>
<groupId>org.apache.pdfboxgroupId>
<artifactId>pdfboxartifactId>
<version>2.0.4version>
dependency>
<dependency>
<groupId>org.apache.poigroupId>
<artifactId>poiartifactId>
<version>5.2.0version>
dependency>
<dependency>
<groupId>org.apache.poigroupId>
<artifactId>poi-ooxmlartifactId>
<version>5.2.0version>
dependency>
<dependency>
<groupId>org.apache.poigroupId>
<artifactId>poi-scratchpadartifactId>
<version>5.2.0version>
dependency>
<dependency>
<groupId>org.apache.poigroupId>
<artifactId>poi-excelantartifactId>
<version>5.2.0version>
dependency>
spire一款专业的Office编程组件,涵盖了对Word、Excel、PPT、PDF等文件的读写、编辑、查看功能。spire提供免费版本,但是存在只能导出前3页以及只能导出前500行的限制,只要达到其一就会触发限制。需要超出前3页以及只能导出前500行的限制的这需要购买付费版(尊重版权,远离破解版)。这里使用免费版进行演示。
spire在添加pom之前还得先添加maven仓库来源
<repository>
<id>com.e-iceblueid>
<name>e-icebluename>
<url>https://repo.e-iceblue.cn/repository/maven-public/url>
repository>
接着在项目的pom文件里添加如下依赖
免费版:
<dependency>
<groupId>e-icebluegroupId>
<artifactId>spire.office.freeartifactId>
<version>5.3.1version>
dependency>
付费版版:
<dependency>
<groupId>e-icebluegroupId>
<artifactId>spire.officeartifactId>
<version>5.3.1version>
dependency>
public static String wordToHtmlStr(String wordPath) {
try {
Document doc = new Document(wordPath); // Address是将要被转化的word文档
String htmlStr = doc.toString();
return htmlStr;
} catch (Exception e) {
e.printStackTrace();
}
return null;
}
public String wordToHtmlStr(String wordPath) throws TransformerException, IOException, ParserConfigurationException {
String htmlStr = null;
String ext = wordPath.substring(wordPath.lastIndexOf("."));
if (ext.equals(".docx")) {
htmlStr = word2007ToHtmlStr(wordPath);
} else if (ext.equals(".doc")){
htmlStr = word2003ToHtmlStr(wordPath);
} else {
throw new RuntimeException("文件格式不正确");
}
return htmlStr;
}
public String word2007ToHtmlStr(String wordPath) throws IOException {
// 使用内存输出流
try(ByteArrayOutputStream out = new ByteArrayOutputStream()){
word2007ToHtmlOutputStream(wordPath, out);
return out.toString();
}
}
private void word2007ToHtmlOutputStream(String wordPath,OutputStream out) throws IOException {
ZipSecureFile.setMinInflateRatio(-1.0d);
InputStream in = Files.newInputStream(Paths.get(wordPath));
XWPFDocument document = new XWPFDocument(in);
XHTMLOptions options = XHTMLOptions.create().setIgnoreStylesIfUnused(false).setImageManager(new Base64EmbedImgManager());
// 使用内存输出流
XHTMLConverter.getInstance().convert(document, out, options);
}
private String word2003ToHtmlStr(String wordPath) throws TransformerException, IOException, ParserConfigurationException {
org.w3c.dom.Document htmlDocument = word2003ToHtmlDocument(wordPath);
// Transform document to string
StringWriter writer = new StringWriter();
TransformerFactory tf = TransformerFactory.newInstance();
Transformer transformer = tf.newTransformer();
transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "no");
transformer.setOutputProperty(OutputKeys.METHOD, "html");
transformer.setOutputProperty(OutputKeys.INDENT, "yes");
transformer.setOutputProperty(OutputKeys.ENCODING, "UTF-8");
transformer.transform(new DOMSource(htmlDocument), new StreamResult(writer));
return writer.toString();
}
private org.w3c.dom.Document word2003ToHtmlDocument(String wordPath) throws IOException, ParserConfigurationException {
InputStream input = Files.newInputStream(Paths.get(wordPath));
HWPFDocument wordDocument = new HWPFDocument(input);
WordToHtmlConverter wordToHtmlConverter = new WordToHtmlConverter(
DocumentBuilderFactory.newInstance().newDocumentBuilder()
.newDocument());
wordToHtmlConverter.setPicturesManager((content, pictureType, suggestedName, widthInches, heightInches) -> {
System.out.println(pictureType);
if (PictureType.UNKNOWN.equals(pictureType)) {
return null;
}
BufferedImage bufferedImage = ImgUtil.toImage(content);
String base64Img = ImgUtil.toBase64(bufferedImage, pictureType.getExtension());
// 带图片的word,则将图片转为base64编码,保存在一个页面中
StringBuilder sb = (new StringBuilder(base64Img.length() + "data:;base64,".length()).append("data:;base64,").append(base64Img));
return sb.toString();
});
// 解析word文档
wordToHtmlConverter.processDocument(wordDocument);
return wordToHtmlConverter.getDocument();
}
public String wordToHtmlStr(String wordPath) throws IOException {
try(ByteArrayOutputStream outputStream = new ByteArrayOutputStream()) {
Document document = new Document();
document.loadFromFile(wordPath);
document.saveToFile(outputStream, FileFormat.Html);
return outputStream.toString();
}
}
public static String pdfToHtmlStr(String pdfPath) throws IOException, ParserConfigurationException {
PDDocument document = PDDocument.load(new File(pdfPath));
Writer writer = new StringWriter();
new PDFDomTree().writeText(document, writer);
writer.close();
document.close();
return writer.toString();
}
public String pdfToHtmlStr(String pdfPath) throws IOException, ParserConfigurationException {
PDDocument document = PDDocument.load(new File(pdfPath));
Writer writer = new StringWriter();
new PDFDomTree().writeText(document, writer);
writer.close();
document.close();
return writer.toString();
}
public String pdfToHtmlStr(String pdfPath) throws IOException, ParserConfigurationException {
try(ByteArrayOutputStream outputStream = new ByteArrayOutputStream()) {
PdfDocument pdf = new PdfDocument();
pdf.loadFromFile(pdfPath);
return outputStream.toString();
}
}
public static String excelToHtmlStr(String excelPath) throws Exception {
FileInputStream fileInputStream = new FileInputStream(excelPath);
Workbook workbook = new XSSFWorkbook(fileInputStream);
DataFormatter dataFormatter = new DataFormatter();
FormulaEvaluator formulaEvaluator = workbook.getCreationHelper().createFormulaEvaluator();
Sheet sheet = workbook.getSheetAt(0);
StringBuilder htmlStringBuilder = new StringBuilder();
htmlStringBuilder.append("Excel to HTML using Java and POI library ");
htmlStringBuilder.append("");
htmlStringBuilder.append("");
for (Row row : sheet) {
htmlStringBuilder.append("");
for (Cell cell : row) {
CellType cellType = cell.getCellType();
if (cellType == CellType.FORMULA) {
formulaEvaluator.evaluateFormulaCell(cell);
cellType = cell.getCachedFormulaResultType();
}
String cellValue = dataFormatter.formatCellValue(cell, formulaEvaluator);
htmlStringBuilder.append("").append(cellValue).append(" ");
}
htmlStringBuilder.append(" ");
}
htmlStringBuilder.append("
");
return htmlStringBuilder.toString();
}
返回的html字符串:
<html><head><title>Excel to HTML using Java and POI librarytitle><style>table, th, td { border: 1px solid black; }style>head><body><table><tr><td>序号td><td>姓名td><td>性别td><td>联系方式td><td>地址td>tr><tr><td>1td><td>张晓玲td><td>女td><td>11111111111td><td>上海市浦东新区xx路xx弄xx号td>tr><tr><td>2td><td>王小二td><td>男td><td>1222222td><td>上海市浦东新区xx路xx弄xx号td>tr><tr><td>1td><td>张晓玲td><td>女td><td>11111111111td><td>上海市浦东新区xx路xx弄xx号td>tr><tr><td>2td><td>王小二td><td>男td><td>1222222td><td>上海市浦东新区xx路xx弄xx号td>tr><tr><td>1td><td>张晓玲td><td>女td><td>11111111111td><td>上海市浦东新区xx路xx弄xx号td>tr><tr><td>2td><td>王小二td><td>男td><td>1222222td><td>上海市浦东新区xx路xx弄xx号td>tr><tr><td>1td><td>张晓玲td><td>女td><td>11111111111td><td>上海市浦东新区xx路xx弄xx号td>tr><tr><td>2td><td>王小二td><td>男td><td>1222222td><td>上海市浦东新区xx路xx弄xx号td>tr><tr><td>1td><td>张晓玲td><td>女td><td>11111111111td><td>上海市浦东新区xx路xx弄xx号td>tr><tr><td>2td><td>王小二td><td>男td><td>1222222td><td>上海市浦东新区xx路xx弄xx号td>tr><tr><td>1td><td>张晓玲td><td>女td><td>11111111111td><td>上海市浦东新区xx路xx弄xx号td>tr><tr><td>2td><td>王小二td><td>男td><td>1222222td><td>上海市浦东新区xx路xx弄xx号td>tr><tr><td>1td><td>张晓玲td><td>女td><td>11111111111td><td>上海市浦东新区xx路xx弄xx号td>tr><tr><td>2td><td>王小二td><td>男td><td>1222222td><td>上海市浦东新区xx路xx弄xx号td>tr>table>body>html>
public String excelToHtmlStr(String excelPath) throws Exception {
FileInputStream fileInputStream = new FileInputStream(excelPath);
try (Workbook workbook = WorkbookFactory.create(new File(excelPath))){
DataFormatter dataFormatter = new DataFormatter();
FormulaEvaluator formulaEvaluator = workbook.getCreationHelper().createFormulaEvaluator();
org.apache.poi.ss.usermodel.Sheet sheet = workbook.getSheetAt(0);
StringBuilder htmlStringBuilder = new StringBuilder();
htmlStringBuilder.append("Excel to HTML using Java and POI library ");
htmlStringBuilder.append("");
htmlStringBuilder.append("");
for (Row row : sheet) {
htmlStringBuilder.append("");
for (Cell cell : row) {
CellType cellType = cell.getCellType();
if (cellType == CellType.FORMULA) {
formulaEvaluator.evaluateFormulaCell(cell);
cellType = cell.getCachedFormulaResultType();
}
String cellValue = dataFormatter.formatCellValue(cell, formulaEvaluator);
htmlStringBuilder.append("").append(cellValue).append(" ");
}
htmlStringBuilder.append(" ");
}
htmlStringBuilder.append("
");
return htmlStringBuilder.toString();
}
}
public String excelToHtmlStr(String excelPath) throws Exception {
try(ByteArrayOutputStream outputStream = new ByteArrayOutputStream()) {
Workbook workbook = new Workbook();
workbook.loadFromFile(excelPath);
workbook.saveToStream(outputStream, com.spire.xls.FileFormat.HTML);
return outputStream.toString();
}
}
有时我们是需要的不仅仅返回html字符串,而是需要生成一个html文件这时应该怎么做呢?一个改动量小的做法就是使用org.apache.commons.io包下的FileUtils工具类写入目标地址:
首先需要引入pom:
<dependency>
<groupId>commons-iogroupId>
<artifactId>commons-ioartifactId>
<version>2.8.0version>
dependency>
相关代码:
String htmlStr = FileConvertUtil.pdfToHtmlStr("D:\\书籍\\电子书\\小说\\历史小说\\最后的可汗.doc");
FileUtils.write(new File("D:\\test\\doc.html"), htmlStr, "utf-8");
除此之外,还可以对上面的代码进行一些调整,已实现生成html文件,代码调整如下:
public static void wordToHtml(String wordPath, String htmlPath) {
try {
File sourceFile = new File(wordPath);
String path = htmlPath + File.separator + sourceFile.getName().substring(0, sourceFile.getName().lastIndexOf(".")) + ".html";
File file = new File(path); // 新建一个空白pdf文档
FileOutputStream os = new FileOutputStream(file);
Document doc = new Document(wordPath); // Address是将要被转化的word文档
HtmlSaveOptions options = new HtmlSaveOptions();
options.setExportImagesAsBase64(true);
options.setExportRelativeFontSize(true);
doc.save(os, options);
} catch (Exception e) {
e.printStackTrace();
}
}
public void wordToHtml(String wordPath, String htmlPath) throws TransformerException, IOException, ParserConfigurationException {
htmlPath = FileUtil.getNewFileFullPath(wordPath, htmlPath, "html");
String ext = wordPath.substring(wordPath.lastIndexOf("."));
if (ext.equals(".docx")) {
word2007ToHtml(wordPath, htmlPath);
} else if (ext.equals(".doc")){
word2003ToHtml(wordPath, htmlPath);
} else {
throw new RuntimeException("文件格式不正确");
}
}
public void word2007ToHtml(String wordPath, String htmlPath) throws TransformerException, IOException, ParserConfigurationException {
//try(OutputStream out = Files.newOutputStream(Paths.get(path))){
try(FileOutputStream out = new FileOutputStream(htmlPath)){
word2007ToHtmlOutputStream(wordPath, out);
}
}
private void word2007ToHtmlOutputStream(String wordPath,OutputStream out) throws IOException {
ZipSecureFile.setMinInflateRatio(-1.0d);
InputStream in = Files.newInputStream(Paths.get(wordPath));
XWPFDocument document = new XWPFDocument(in);
XHTMLOptions options = XHTMLOptions.create().setIgnoreStylesIfUnused(false).setImageManager(new Base64EmbedImgManager());
// 使用内存输出流
XHTMLConverter.getInstance().convert(document, out, options);
}
public void word2003ToHtml(String wordPath, String htmlPath) throws TransformerException, IOException, ParserConfigurationException {
org.w3c.dom.Document htmlDocument = word2003ToHtmlDocument(wordPath);
// 生成html文件地址
try(OutputStream outStream = Files.newOutputStream(Paths.get(htmlPath))){
DOMSource domSource = new DOMSource(htmlDocument);
StreamResult streamResult = new StreamResult(outStream);
TransformerFactory factory = TransformerFactory.newInstance();
Transformer serializer = factory.newTransformer();
serializer.setOutputProperty(OutputKeys.ENCODING, "utf-8");
serializer.setOutputProperty(OutputKeys.INDENT, "yes");
serializer.setOutputProperty(OutputKeys.METHOD, "html");
serializer.transform(domSource, streamResult);
}
}
private org.w3c.dom.Document word2003ToHtmlDocument(String wordPath) throws IOException, ParserConfigurationException {
InputStream input = Files.newInputStream(Paths.get(wordPath));
HWPFDocument wordDocument = new HWPFDocument(input);
WordToHtmlConverter wordToHtmlConverter = new WordToHtmlConverter(
DocumentBuilderFactory.newInstance().newDocumentBuilder()
.newDocument());
wordToHtmlConverter.setPicturesManager((content, pictureType, suggestedName, widthInches, heightInches) -> {
System.out.println(pictureType);
if (PictureType.UNKNOWN.equals(pictureType)) {
return null;
}
BufferedImage bufferedImage = ImgUtil.toImage(content);
String base64Img = ImgUtil.toBase64(bufferedImage, pictureType.getExtension());
// 带图片的word,则将图片转为base64编码,保存在一个页面中
StringBuilder sb = (new StringBuilder(base64Img.length() + "data:;base64,".length()).append("data:;base64,").append(base64Img));
return sb.toString();
});
// 解析word文档
wordToHtmlConverter.processDocument(wordDocument);
return wordToHtmlConverter.getDocument();
}
public void wordToHtml(String wordPath, String htmlPath) {
htmlPath = FileUtil.getNewFileFullPath(wordPath, htmlPath, "html");
Document document = new Document();
document.loadFromFile(wordPath);
document.saveToFile(htmlPath, FileFormat.Html);
}
转换成html的效果:
因为使用的是免费版,存在页数和字数限制,需要完整功能的的可以选择付费版本。PS:这回76页的文档居然转成功了前50页。
public static void pdfToHtml(String pdfPath, String htmlPath) throws IOException, ParserConfigurationException {
File file = new File(pdfPath);
String path = htmlPath + File.separator + file.getName().substring(0, file.getName().lastIndexOf(".")) + ".html";
PDDocument document = PDDocument.load(new File(pdfPath));
Writer writer = new PrintWriter(path, "UTF-8");
new PDFDomTree().writeText(document, writer);
writer.close();
document.close();
}
public void pdfToHtml(String pdfPath, String htmlPath) throws IOException, ParserConfigurationException {
String path = FileUtil.getNewFileFullPath(pdfPath, htmlPath, "html");
PDDocument document = PDDocument.load(new File(pdfPath));
Writer writer = new PrintWriter(path, "UTF-8");
new PDFDomTree().writeText(document, writer);
writer.close();
document.close();
}
public void pdfToHtml(String pdfPath, String htmlPath) throws IOException, ParserConfigurationException {
htmlPath = FileUtil.getNewFileFullPath(pdfPath, htmlPath, "html");
PdfDocument pdf = new PdfDocument();
pdf.loadFromFile(pdfPath);
pdf.saveToFile(htmlPath, com.spire.pdf.FileFormat.HTML);
}
图片版PDF文件验证结果:
因为使用的是免费版,所以只有前三页是正常的。。。有超过三页需求的可以选择付费版本。
文字版PDF原文件效果:
报错了无法转换。。。
java.lang.NullPointerException
at com.spire.pdf.PdfPageWidget.spr┢⅛(Unknown Source)
at com.spire.pdf.PdfPageWidget.getSize(Unknown Source)
at com.spire.pdf.PdfPageBase.spr†™—(Unknown Source)
at com.spire.pdf.PdfPageBase.getActualSize(Unknown Source)
at com.spire.pdf.PdfPageBase.getSection(Unknown Source)
at com.spire.pdf.general.PdfDestination.spr︻┎—(Unknown Source)
at com.spire.pdf.general.PdfDestination.spr┻┑—(Unknown Source)
at com.spire.pdf.general.PdfDestination.getElement(Unknown Source)
at com.spire.pdf.primitives.PdfDictionary.setProperty(Unknown Source)
at com.spire.pdf.bookmarks.PdfBookmark.setDestination(Unknown Source)
at com.spire.pdf.bookmarks.PdfBookmarkWidget.spr┭┘—(Unknown Source)
at com.spire.pdf.bookmarks.PdfBookmarkWidget.getDestination(Unknown Source)
at com.spire.pdf.PdfDocumentBase.spr╻⅝(Unknown Source)
at com.spire.pdf.widget.PdfPageCollection.spr┦⅝(Unknown Source)
at com.spire.pdf.widget.PdfPageCollection.removeAt(Unknown Source)
at com.spire.pdf.PdfDocumentBase.spr┞⅝(Unknown Source)
at com.spire.pdf.PdfDocument.loadFromFile(Unknown Source)
public void excelToHtml(String excelPath, String htmlPath) throws Exception {
htmlPath = FileUtil.getNewFileFullPath(excelPath, htmlPath, "html");
Workbook workbook = new Workbook(excelPath);
com.aspose.cells.HtmlSaveOptions options = new com.aspose.cells.HtmlSaveOptions();
workbook.save(htmlPath, options);
}
public void excelToHtml(String excelPath, String htmlPath) throws Exception {
String path = FileUtil.getNewFileFullPath(excelPath, htmlPath, "html");
try(FileOutputStream fileOutputStream = new FileOutputStream(path)){
String htmlStr = excelToHtmlStr(excelPath);
byte[] bytes = htmlStr.getBytes();
fileOutputStream.write(bytes);
}
}
public String excelToHtmlStr(String excelPath) throws Exception {
FileInputStream fileInputStream = new FileInputStream(excelPath);
try (Workbook workbook = WorkbookFactory.create(new File(excelPath))){
DataFormatter dataFormatter = new DataFormatter();
FormulaEvaluator formulaEvaluator = workbook.getCreationHelper().createFormulaEvaluator();
org.apache.poi.ss.usermodel.Sheet sheet = workbook.getSheetAt(0);
StringBuilder htmlStringBuilder = new StringBuilder();
htmlStringBuilder.append("Excel to HTML using Java and POI library ");
htmlStringBuilder.append("");
htmlStringBuilder.append("");
for (Row row : sheet) {
htmlStringBuilder.append("");
for (Cell cell : row) {
CellType cellType = cell.getCellType();
if (cellType == CellType.FORMULA) {
formulaEvaluator.evaluateFormulaCell(cell);
cellType = cell.getCachedFormulaResultType();
}
String cellValue = dataFormatter.formatCellValue(cell, formulaEvaluator);
htmlStringBuilder.append("").append(cellValue).append(" ");
}
htmlStringBuilder.append(" ");
}
htmlStringBuilder.append("
");
return htmlStringBuilder.toString();
}
}
public void excelToHtml(String excelPath, String htmlPath) throws Exception {
htmlPath = FileUtil.getNewFileFullPath(excelPath, htmlPath, "html");
Workbook workbook = new Workbook();
workbook.loadFromFile(excelPath);
workbook.saveToFile(htmlPath, com.spire.xls.FileFormat.HTML);
}
从上述的效果展示我们可以发现其实转成html效果不是太理想,很多细节样式没有还原,这其实是因为这类转换往往都是追求目标是通过使用文档中的语义信息并忽略其他细节来生成简单干净的 HTML,所以在转换过程中复杂样式被忽略,比如居中、首行缩进、字体,文本大小,颜色。举个例子在转换是 会将应用标题 1 样式的任何段落转换为 h1 元素,而不是尝试完全复制标题的样式。所以转成html的显示效果往往和原文档不太一样。这意味着对于较复杂的文档而言,这种转换不太可能是完美的。但如果都是只使用简单样式文档或者对文档样式不太关心的这种方式也不妨一试。
PS:如果想要展示效果好的话,其实可以将上篇文章《文档在线预览(一)通过将txt、word、pdf转成图片实现在线预览功能》说的内容和本文结合起来使用,即将文档里的内容都生成成图片(很可能是多张图片),然后将生成的图片全都放到一个html页面里 ,用html+css来保持样式并实现多张图片展示,再将html返回。开源组件kkfilevie就是用的就是这种做法。
kkfileview展示效果如下:
下图是kkfileview返回的html代码,从html代码我们可以看到kkfileview其实是将文件(txt文件除外)每页的内容都转成了图片,然后将这些图片都嵌入到一个html里,再返回给用户一个html页面。