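4. Main classes:
PDFTextStripper: the core text-extraction component. It provides methods for setting whether the output is sorted by position and for choosing the first and last page to extract.
PDDocument: the in-memory representation of a PDF document. It can be loaded either from the file system or from a URL.
LucenePDFDocument: the class used for integrating with Lucene. Its getDocument method (overloaded to accept either a URL or a local file) converts a PDF file into a Lucene Document object and automatically extracts fields and adds them to the Document, so the result can be written straight into an index through IndexWriter.
5. Parsing PDF content with PDFBox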
import java.io.FileOutputStream;
import java.io.OutputStreamWriter;
import java.io.Writer;

import org.pdfbox.pdmodel.PDDocument;
import org.pdfbox.util.PDFTextStripper;

public class PdfParser {
    public static void main(String[] args) throws Exception {
        // Load the PDF into memory
        PDDocument doc = PDDocument.load("d:/文档.pdf");
        int pagenumber = doc.getNumberOfPages();
        // Write the extracted text out as UTF-8
        FileOutputStream fos = new FileOutputStream("d:/文档.txt");
        Writer writer = new OutputStreamWriter(fos, "UTF-8");
        PDFTextStripper stripper = new PDFTextStripper();
        stripper.setSortByPosition(false);
        // stripper.setWordSeparator(""); // with this, Chinese output contains no spaces
        // Extract pages 1 to 4 only
        stripper.setStartPage(1);
        stripper.setEndPage(4);
        stripper.writeText(doc, writer);
        doc.close();
        writer.close();
    }
}
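If changing the word separator is not desirable (for example because the PDF mixes Chinese and English), a similar effect can be approximated by cleaning the extracted text afterwards. The following is only a minimal sketch using the standard library; `CjkSpaceCleaner` and `removeCjkSpaces` are hypothetical names introduced here and are not part of PDFBox.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of PDFBox: post-processes extracted text.
final class CjkSpaceCleaner {
    // Spaces or tabs that sit directly between two Han characters.
    private static final Pattern GAP =
            Pattern.compile("(?<=\\p{IsHan})[ \\t]+(?=\\p{IsHan})");

    // Removes whitespace between Han characters and leaves all other
    // whitespace (for example between English words) untouched.
    static String removeCjkSpaces(String text) {
        return GAP.matcher(text).replaceAll("");
    }

    public static void main(String[] args) {
        // "\u4e2d \u6587" is the two characters of "中文" separated by a space.
        System.out.println(removeCjkSpaces("\u4e2d \u6587 PDF text"));
    }
}
```

The lookbehind and lookahead mean that only gaps with a Han character on both sides are removed, so the space between a Chinese character and a Latin word is kept.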
Next, let's look at an indexing class.
package com.qianyan.pdf;

import java.io.File;
import java.io.IOException;

import net.paoding.analysis.analyzer.PaodingAnalyzer;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexWriter;
import org.pdfbox.searchengine.lucene.LucenePDFDocument;

public class TestLucenePDFDocument {
    public static void main(String[] args) throws IOException {
        String indexDir = "d:/luceneindex";
        Analyzer analyzer = new PaodingAnalyzer();
        // true: create a new index, overwriting any existing one
        IndexWriter writer = new IndexWriter(indexDir, analyzer, true);
        // Convert the PDF into a Lucene Document with its fields filled in
        Document doc = LucenePDFDocument.getDocument(new File("d:/explain.pdf"));
        writer.addDocument(doc);
        writer.close();
    }
}
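Another way to parse PDFs is described below (processing PDF documents with XPDF).
1. First download the required packages from the Download page of the XPDF website (xpdfbin-win-3.03.zip, xpdf-chinese-simplified.tar.gz).
2. Configuration:
1) Copy the extracted xpdf-chinese-simplified directory into the doc directory of the extracted xpdfbin-win-3.03.
2) In the doc directory, find the file sample-xpdfrc, rename it to xpdfrc, then open it for editing:
1) #textEncoding UTF-8 (find this line and remove the leading #)
2) Add the following code at the end of the file (note: the directories must match your own extraction directories):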
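As a rough sketch of what these lines usually look like: the xpdf-chinese-simplified package ships an add-to-xpdfrc file whose directives have the shape below. The `D:\xpdfbin-win-3.03` prefix is a hypothetical install location chosen for illustration, not a path taken from this article; substitute the directory you actually extracted to.

```
# Hypothetical paths; replace D:\xpdfbin-win-3.03 with your own extraction directory.
cidToUnicode Adobe-GB1   D:\xpdfbin-win-3.03\doc\xpdf-chinese-simplified\Adobe-GB1.cidToUnicode
unicodeMap   ISO-2022-CN D:\xpdfbin-win-3.03\doc\xpdf-chinese-simplified\ISO-2022-CN.unicodeMap
unicodeMap   EUC-CN      D:\xpdfbin-win-3.03\doc\xpdf-chinese-simplified\EUC-CN.unicodeMap
unicodeMap   GBK         D:\xpdfbin-win-3.03\doc\xpdf-chinese-simplified\GBK.unicodeMap
cMapDir      Adobe-GB1   D:\xpdfbin-win-3.03\doc\xpdf-chinese-simplified\CMap
toUnicodeDir             D:\xpdfbin-win-3.03\doc\xpdf-chinese-simplified\CMap
```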
3) Implementation: to make it easier to run the command-line xpdf tools, we wrap the arguments in an XpdfParams class.
package com.qianyan.xpdf;

public class XpdfParams {
    private String convertor = "";
    private String layout = "";
    private String encoding = "";
    private String source = "";
    private String target = "";

    public String getConvertor() {
        return convertor;
    }

    public void setConvertor(String convertor) {
        this.convertor = convertor;
    }

    public String getLayout() {
        return layout;
    }

    public void setLayout(String layout) {
        this.layout = layout;
    }

    public String getEncoding() {
        return encoding;
    }

    public void setEncoding(String encoding) {
        this.encoding = encoding;
    }

    public String getSource() {
        return source;
    }

    public void setSource(String source) {
        this.source = source;
    }

    public String getTarget() {
        return target;
    }

    public void setTarget(String target) {
        this.target = target;
    }

    public String getCMD() {
        return convertor + " " + layout + " " + encoding + " " + source + " " + target + " ";
    }
}
package com.qianyan.xpdf;

import java.io.IOException;

public class TestRuntime {
    public static void main(String[] args) throws IOException {
        XpdfParams xparam = new XpdfParams();
        // Path to the pdftotext executable from the XPDF binaries
        xparam.setConvertor("D:\\program files (x86)\\xpdf\\bin64\\pdftotext.exe");
        // Output encoding; must match a textEncoding known to xpdfrc
        xparam.setEncoding("-enc UTF-8");
        xparam.setSource("E:\\lucene.pdf");
        xparam.setTarget("E:\\lucene.txt");
        String cmd = xparam.getCMD();
        // Starts pdftotext; this call does not wait for the conversion to finish
        Runtime.getRuntime().exec(cmd);
    }
}
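Because Runtime.exec(String) splits the command on whitespace and returns immediately, paths containing spaces and the completion of the conversion are both left to chance. One possible alternative, sketched here under the assumption that only the `-enc` option is needed, passes each argument separately through ProcessBuilder and waits for the exit code. `PdfToTextRunner`, `buildCommand`, and `run` are hypothetical names introduced for illustration and are not part of XPDF.

```java
import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration; not part of XPDF.
final class PdfToTextRunner {
    // Builds the argument list. Each argument is its own element, so a
    // space inside the executable path or a file name is preserved.
    static List<String> buildCommand(String exe, String encoding,
                                     String source, String target) {
        List<String> cmd = new ArrayList<>();
        cmd.add(exe);
        cmd.add("-enc");
        cmd.add(encoding);
        cmd.add(source);
        cmd.add(target);
        return cmd;
    }

    // Starts the converter and blocks until it exits, returning its exit code.
    static int run(List<String> cmd) throws IOException, InterruptedException {
        ProcessBuilder pb = new ProcessBuilder(cmd);
        pb.inheritIO(); // forward the tool's messages to this console
        return pb.start().waitFor();
    }

    public static void main(String[] args) throws Exception {
        List<String> cmd = buildCommand(
                "D:\\program files (x86)\\xpdf\\bin64\\pdftotext.exe",
                "UTF-8", "E:\\lucene.pdf", "E:\\lucene.txt");
        System.out.println(cmd);
        // Only attempt the conversion when the executable is actually present.
        if (new File(cmd.get(0)).isFile()) {
            System.out.println("exit code: " + run(cmd));
        }
    }
}
```

Waiting for the exit code matters when the next step, such as indexing the generated text file, depends on the conversion having finished.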