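Everything in a PDF document is positioned by coordinates, and the content consists mainly of text, images, and lines. Tables can therefore be detected by examining where the lines lie. The PDFBox API on its own does not make it easy to read content together with its coordinates.
Pdf2Dom, which is built on the Apache PDFBox library, converts a PDF to HTML by rendering every element at its absolute coordinates, so its rendering callbacks receive the content along with its position.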
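The idea of finding a table from line positions can be sketched without any PDF library. The following is a minimal illustration, not part of Pdf2Dom: the names `TableGridSketch`, `countCells`, and `EPS` are hypothetical, each segment is assumed to be a plain `{x1, y1, x2, y2}` array, and the coordinates in `main` are sample values. Segments are split into horizontal and vertical ones, and the distinct y and x positions are taken as row and column boundaries.

```java
import java.util.ArrayList;
import java.util.List;

class TableGridSketch {
    // Assumed tolerance (in page units) for treating two coordinates as equal.
    static final float EPS = 1.0f;

    // Returns {rows, cols} of the grid formed by the segments, or {0, 0} if there is none.
    static int[] countCells(List<float[]> segments) {
        List<Float> ys = new ArrayList<>(); // y positions of horizontal lines
        List<Float> xs = new ArrayList<>(); // x positions of vertical lines
        for (float[] s : segments) {
            if (Math.abs(s[1] - s[3]) < EPS) {
                addDistinct(ys, s[1]);      // y1 == y2: horizontal
            } else if (Math.abs(s[0] - s[2]) < EPS) {
                addDistinct(xs, s[0]);      // x1 == x2: vertical
            }
            // Diagonal segments are ignored.
        }
        if (ys.size() < 2 || xs.size() < 2) {
            return new int[] {0, 0};        // a grid needs at least two lines each way
        }
        return new int[] {ys.size() - 1, xs.size() - 1};
    }

    // Adds v unless a value within EPS of it is already present.
    static void addDistinct(List<Float> values, float v) {
        for (float existing : values) {
            if (Math.abs(existing - v) < EPS) {
                return;
            }
        }
        values.add(v);
    }

    public static void main(String[] args) {
        List<float[]> segs = new ArrayList<>();
        // Three horizontal lines.
        segs.add(new float[] {84.36f, 321.84f, 510.94f, 321.84f});
        segs.add(new float[] {84.36f, 353.54f, 510.94f, 353.54f});
        segs.add(new float[] {84.36f, 385.24f, 510.94f, 385.24f});
        // Five vertical lines.
        for (float x : new float[] {84.6f, 191.1f, 297.6f, 404.15f, 510.7f}) {
            segs.add(new float[] {x, 321.6f, x, 385.0f});
        }
        int[] grid = countCells(segs);
        System.out.println(grid[0] + " rows x " + grid[1] + " columns");
    }
}
```

With the sample segments this prints `2 rows x 4 columns`. The tolerance matters because line endpoints rarely coincide exactly; the sketch also assumes a single table per page and does not check that the lines actually intersect.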
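The PDF document to be parsed:
Two dependencies are needed: pdfbox (Maven coordinates org.apache.pdfbox:pdfbox) and pdf2dom (net.sf.cssbox:pdf2dom).
MyPdf.java, the PDF parsing code: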
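package com.penngo.pdf;

import java.io.File;
import java.io.IOException;
import java.util.List;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.fit.pdfdom.PDFDomTree;
import org.fit.pdfdom.PathSegment;
import org.fit.pdfdom.TextMetrics;
import org.fit.pdfdom.resource.ImageResource;
import org.w3c.dom.bootstrap.DOMImplementationRegistry;
import org.w3c.dom.ls.DOMImplementationLS;
import org.w3c.dom.ls.LSOutput;
import org.w3c.dom.ls.LSSerializer;

// Overrides the Pdf2Dom rendering callbacks to print each element with its coordinates.
public class MyPdf extends PDFDomTree {
    public MyPdf() throws IOException {
        super();
    }

    // Called at the start of every page (页码 = page number).
    @Override
    protected void startNewPage() {
        System.out.println("====页码:" + pagecnt);
        super.startNewPage();
    }

    // Called for every text fragment (文本 = text).
    @Override
    protected void renderText(String data, TextMetrics metrics) {
        System.out.println("====文本:" + data
                + ",x:" + (int) metrics.getX() + ",top:" + (int) metrics.getTop()
                + ",width:" + (int) metrics.getWidth() + ",height:" + (int) metrics.getHeight());
        curpage.appendChild(createTextElement(data, metrics.getWidth()));
    }

    // Called for every drawn path; prints each segment (线条 = line).
    @Override
    protected void renderPath(List<PathSegment> path, boolean stroke, boolean fill) throws IOException {
        System.out.println("====线条size:" + path.size());
        for (int i = 0; i < path.size(); i++) {
            PathSegment seg = path.get(i);
            System.out.println("====线条" + i + ",x1:" + seg.getX1() + ",y1:" + seg.getY1()
                    + ",x2:" + seg.getX2() + ",y2:" + seg.getY2()
                    + ",stroke:" + stroke + ",fill:" + fill);
        }
        super.renderPath(path, stroke, fill);
    }

    // Called for every image (图片 = image).
    @Override
    protected void renderImage(float x, float y, float width, float height, ImageResource resource) throws IOException {
        System.out.println("====图片:" + "x:" + x + ",y:" + y + ",width:" + width + ",height:" + height);
        curpage.appendChild(createImageElement(x, y, width, height, resource));
    }

    public void parsePdf(PDDocument doc) {
        try {
            // The serializer is only needed if the DOM is written out as HTML; it is not used here.
            DOMImplementationRegistry registry = DOMImplementationRegistry.newInstance();
            DOMImplementationLS impl = (DOMImplementationLS) registry.getDOMImplementation("LS");
            LSSerializer writer = impl.createLSSerializer();
            LSOutput output = impl.createLSOutput();
            writer.getDomConfig().setParameter("format-pretty-print", true);
            // Building the DOM triggers the render* callbacks above.
            createDOM(doc);
        } catch (Exception e) {
            e.printStackTrace();
        }
    }

    public static void main(String[] args) {
        File pdfFile = new File("F:\\dev\\test\\test.pdf");
        // try-with-resources closes the document when parsing is done.
        try (PDDocument document = PDDocument.load(pdfFile)) {
            MyPdf pdfDomTree = new MyPdf();
            pdfDomTree.parsePdf(document);
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}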
Output (页码 = page number, 文本 = text, 图片 = image, 线条 = line segment):
====页码:0
====文本:文章:https://blog.csdn.net/penngo/article/details/125436956,x:90,top:81,width:312,height:13
====文本:1.2,x:90,top:112,width:14,height:13
====文本:安装,x:110,top:112,width:20,height:13
====文本:HTML,x:133,top:112,width:29,height:13
====文本:Publisher,x:166,top:112,width:46,height:13
====图片:x:90.0,y:139.2,width:414.48,height:177.36
====文本:插件,x:215,top:112,width:20,height:13
====文本:表头,x:90,top:331,width:20,height:13
====文本:1,x:113,top:331,width:6,height:13
====文本:表头,x:196,top:331,width:20,height:13
====文本:2,x:220,top:331,width:6,height:13
====文本:表头,x:303,top:331,width:20,height:13
====文本:3,x:326,top:331,width:6,height:13
====文本:表头,x:409,top:331,width:20,height:13
====文本:4,x:433,top:331,width:6,height:13
====文本:列,x:90,top:362,width:10,height:13
====文本:1,x:103,top:362,width:6,height:13
====文本:列,x:196,top:362,width:10,height:13
====文本:2,x:209,top:362,width:6,height:13
====文本:列,x:303,top:362,width:10,height:13
====文本:3,x:316,top:362,width:6,height:13
====文本:列,x:409,top:362,width:10,height:13
====线条size:1
====线条0,x1:84.36,y1:321.83997,x2:510.94,y2:321.83997,stroke:true,fill:false
====线条size:1
====线条0,x1:84.36,y1:353.54,x2:510.94,y2:353.54,stroke:true,fill:false
====线条size:1
====线条0,x1:84.36,y1:385.24,x2:510.94,y2:385.24,stroke:true,fill:false
====线条size:1
====线条0,x1:84.6,y1:321.59998,x2:84.6,y2:385.0,stroke:true,fill:false
====线条size:1
====线条0,x1:191.1,y1:321.59998,x2:191.1,y2:385.0,stroke:true,fill:false
====线条size:1
====线条0,x1:297.6,y1:321.59998,x2:297.6,y2:385.0,stroke:true,fill:false
====线条size:1
====线条0,x1:404.15,y1:321.59998,x2:404.15,y2:385.0,stroke:true,fill:false
====线条size:1
====线条0,x1:510.7,y1:321.59998,x2:510.7,y2:385.0,stroke:true,fill:false
====文本:4,x:422,top:362,width:6,height:13
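Using the coordinates printed above, the text fragments can be assigned to table cells. The sketch below is only an illustration under stated assumptions: `CellAssignSketch`, `Fragment`, `indexOf`, and `fill` are hypothetical names, the grid boundaries are the line positions from the output copied by hand, and fragments are assumed to arrive in reading order within each cell. A fragment goes to the cell whose boundaries enclose its top-left corner.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class CellAssignSketch {
    // Hypothetical holder for one printed text fragment.
    static final class Fragment {
        final String text;
        final float x;
        final float top;

        Fragment(String text, float x, float top) {
            this.text = text;
            this.x = x;
            this.top = top;
        }
    }

    // Index i with bounds[i] <= v < bounds[i + 1], or -1 if v lies outside the grid.
    static int indexOf(float[] bounds, float v) {
        for (int i = 0; i + 1 < bounds.length; i++) {
            if (v >= bounds[i] && v < bounds[i + 1]) {
                return i;
            }
        }
        return -1;
    }

    // xs and ys are the sorted column and row boundaries of the grid.
    static String[][] fill(float[] xs, float[] ys, List<Fragment> fragments) {
        String[][] cells = new String[ys.length - 1][xs.length - 1];
        for (String[] row : cells) {
            Arrays.fill(row, "");
        }
        for (Fragment f : fragments) {
            int row = indexOf(ys, f.top);
            int col = indexOf(xs, f.x);
            if (row >= 0 && col >= 0) {
                cells[row][col] += f.text; // fragments outside the table are skipped
            }
        }
        return cells;
    }

    public static void main(String[] args) {
        float[] xs = {84.6f, 191.1f, 297.6f, 404.15f, 510.7f}; // vertical lines
        float[] ys = {321.84f, 353.54f, 385.24f};              // horizontal lines
        List<Fragment> f = new ArrayList<>();
        f.add(new Fragment("安装", 110, 112)); // above the table, ignored
        f.add(new Fragment("表头", 90, 331));  f.add(new Fragment("1", 113, 331));
        f.add(new Fragment("表头", 196, 331)); f.add(new Fragment("2", 220, 331));
        f.add(new Fragment("表头", 303, 331)); f.add(new Fragment("3", 326, 331));
        f.add(new Fragment("表头", 409, 331)); f.add(new Fragment("4", 433, 331));
        f.add(new Fragment("列", 90, 362));    f.add(new Fragment("1", 103, 362));
        f.add(new Fragment("列", 196, 362));   f.add(new Fragment("2", 209, 362));
        f.add(new Fragment("列", 303, 362));   f.add(new Fragment("3", 316, 362));
        f.add(new Fragment("列", 409, 362));   f.add(new Fragment("4", 422, 362));
        for (String[] row : fill(xs, ys, f)) {
            System.out.println(String.join(" | ", row));
        }
    }
}
```

With these values the result is a 2 x 4 table: 表头1 to 表头4 in the first row and 列1 to 列4 in the second. In a real parser the fragments and segments would be collected in `renderText` and `renderPath` instead of being typed in, and fragments might need sorting by `x` before they are joined.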