http://zhuyufufu.iteye.com/blog/2009600
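The article linked above shows how to convert a PDF to images with PDFBox, but that approach has problems:
1. With a 180 MB PDF, parsing fails immediately with an exception (increasing the JVM heap size fixes this).
2. With a PDF of more than 500 pages, processing is very slow.
3. In the test sample, Chinese text renders correctly, but English letters, digits, and parentheses come out garbled.
4. The jar is large, over 9 MB.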
So I switched components and reimplemented the example with PDFRender (the com.sun.pdfview library).
Here is the code:
package com.zas.pdfrender.test;

import java.awt.Graphics2D;
import java.awt.Image;
import java.awt.Rectangle;
import java.awt.RenderingHints;
import java.awt.image.BufferedImage;
import java.io.File;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;

import javax.imageio.ImageIO;

import com.sun.pdfview.PDFFile;
import com.sun.pdfview.PDFPage;

public class PDFRenderTest {

    public static void convert(String inputPDFPath, String outputFDir) throws IOException {
        // Check that the PDF exists and create the output folder if needed
        File file = new File(inputPDFPath);
        if (!file.exists()) {
            throw new FileNotFoundException("File not found: " + inputPDFPath);
        }
        File outputFolder = new File(outputFDir);
        if (!outputFolder.exists()) {
            outputFolder.mkdirs();
        }

        // Map the file into memory and build the PDFFile.
        // The mapping stays valid after the channel and file are closed.
        ByteBuffer buf;
        try (RandomAccessFile raf = new RandomAccessFile(file, "r");
             FileChannel channel = raf.getChannel()) {
            buf = channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size());
        }
        PDFFile pdffile = new PDFFile(buf);
        System.out.println("PDF pages: " + pdffile.getNumPages() + " , " + inputPDFPath);

        // Convert each page (page numbers start at 1)
        for (int i = 1; i <= pdffile.getNumPages(); i++) {
            PDFPage page = pdffile.getPage(i);
            Rectangle rect = new Rectangle(0, 0,
                    (int) page.getBBox().getWidth(),
                    (int) page.getBBox().getHeight());
            Image img = page.getImage(rect.width, rect.height, // width & height
                    rect,  // clip rect
                    null,  // null for the ImageObserver
                    true,  // fill background with white
                    true   // block until drawing is done
            );
            BufferedImage tag = new BufferedImage(rect.width, rect.height, BufferedImage.TYPE_INT_RGB);
            Graphics2D g = tag.createGraphics();
            //g.setRenderingHint(RenderingHints.KEY_ANTIALIASING, RenderingHints.VALUE_ANTIALIAS_ON);
            g.drawImage(img, 0, 0, rect.width, rect.height, null);
            g.dispose();

            // Write one PNG per page, named by page number, so the data matches the .png extension
            ImageIO.write(tag, "png", new File(outputFolder, i + ".png"));
        }
    }

    public static void main(final String[] args) throws IOException {
        String inputPDFPath = "D:\\pdf\\ppt\\2010110档案管理系统需求分析说明书正式.pdf";
        String outputFDir = "D:\\pdf\\222222222222010110系统需求分析说明书正式\\";
        PDFRenderTest.convert(inputPDFPath, outputFDir);
    }
}
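The listing passes the bounding-box width and height to getImage as the output size. PDF page dimensions are by default measured in points (1/72 inch), so this works out to roughly one pixel per point, about 72 DPI. Below is a minimal sketch of computing a larger output size for a target DPI. RenderSize and forDpi are hypothetical names, not part of PDFRender, and the sketch assumes getImage scales the clip rectangle to the requested width and height, which should be checked against the library version in use.

```java
import java.awt.Dimension;

/** Hypothetical helper (not part of PDFRender): converts a page size in points to pixels. */
final class RenderSize {
    // Default PDF user-space unit: 72 points per inch.
    private static final double POINTS_PER_INCH = 72.0;

    private RenderSize() {}

    /**
     * Returns the pixel size needed to draw a page of the given size (in points)
     * at the given resolution. Each side is rounded to the nearest pixel and is at least 1.
     */
    static Dimension forDpi(double widthPt, double heightPt, int dpi) {
        if (dpi <= 0) {
            throw new IllegalArgumentException("dpi must be positive: " + dpi);
        }
        int w = (int) Math.round(widthPt * dpi / POINTS_PER_INCH);
        int h = (int) Math.round(heightPt * dpi / POINTS_PER_INCH);
        return new Dimension(Math.max(1, w), Math.max(1, h));
    }
}
```

Under that assumption, the returned width and height would replace rect.width and rect.height in the getImage call, the BufferedImage constructor, and drawImage, while the clip rectangle stays at the bounding-box size. Larger images cost more memory and time per page.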
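Result:
PDFs convert to images correctly, with no garbled text.
Problems:
1. The converted images are slightly rough.
2. Once a PDF exceeds 500 pages, it is just as unbearably slow as PDFBox; multithreading looks like the only way out.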
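The multithreading idea could be sketched as below: each page becomes one task on a fixed-size thread pool. This is only an outline under assumptions. ParallelPageWriter, writeAll, and the renderPage callback are hypothetical names, and whether PDFFile and PDFPage can safely be used from several threads at once has not been verified here. If they cannot, rendering would need to be synchronized (leaving only the PNG encoding parallel) or each worker would need its own PDFFile.

```java
import java.awt.image.BufferedImage;
import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.IntFunction;
import javax.imageio.ImageIO;

/** Hypothetical helper: renders and writes pages concurrently, one task per page. */
final class ParallelPageWriter {
    private ParallelPageWriter() {}

    /**
     * Calls renderPage for pages 1..pageCount on a pool of the given size and
     * writes each result to outputDir as "<page>.png". Returns the files in page order.
     */
    static List<File> writeAll(int pageCount, int threads,
                               IntFunction<BufferedImage> renderPage,
                               File outputDir) throws IOException, InterruptedException {
        if (!outputDir.isDirectory() && !outputDir.mkdirs()) {
            throw new IOException("Cannot create output folder: " + outputDir);
        }
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            // Submit one task per page; the futures keep page order.
            List<Future<File>> pending = new ArrayList<>();
            for (int i = 1; i <= pageCount; i++) {
                final int pageNo = i;
                pending.add(pool.submit(() -> {
                    BufferedImage img = renderPage.apply(pageNo);
                    File out = new File(outputDir, pageNo + ".png");
                    if (!ImageIO.write(img, "png", out)) {
                        throw new IOException("No PNG writer available");
                    }
                    return out;
                }));
            }
            // Wait for every page and surface the first failure as an IOException.
            List<File> written = new ArrayList<>();
            for (Future<File> f : pending) {
                try {
                    written.add(f.get());
                } catch (ExecutionException e) {
                    throw new IOException("Page conversion failed", e.getCause());
                }
            }
            return written;
        } finally {
            pool.shutdown();
        }
    }
}
```

With the listing above, renderPage would wrap the per-page body of the loop (getPage, getImage, drawing into a BufferedImage) and return the BufferedImage. A pool size near the number of CPU cores is a reasonable starting point, since every in-flight page holds a full image in memory.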
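Adobe Acrobat X Pro still gives the best PDF-to-image results, but it offers no programmatic interface, is paid software, and does not seem to support Linux.
There are two days of technical pre-research time left; next I will look into converting documents to HTML.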