Reading text, images, and lines with their coordinates from a PDF in Java

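Everything in a PDF document is positioned by coordinates. The content consists mainly of text, images, and lines; a table can be recognized by examining where the lines are. The PDFBox API on its own does not make it convenient to read content together with its coordinates.
Pdf2Dom, which is built on the Apache PDFBox library, converts a PDF to HTML by rendering each element at its absolute coordinates. Overriding its render methods therefore gives access to the content and its position.
The PDF document to be parsed: [Figure 1: the sample PDF]
Two dependencies are needed: pdfbox and pdf2dom.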
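
The idea of recognizing a table from line positions can be sketched as follows. This is a minimal illustration, not part of PDFBox or Pdf2Dom: the class `TableGridSketch`, its `Segment` and `Grid` types, and the tolerance `tol` are hypothetical names introduced here. It assumes every table border arrives as a straight horizontal or vertical segment, and it ignores merged cells and multiple tables on one page.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

// Hypothetical helper for illustration only; not a PDFBox or Pdf2Dom class.
class TableGridSketch {
    /** One straight line segment in page coordinates. */
    static final class Segment {
        final float x1, y1, x2, y2;
        Segment(float x1, float y1, float x2, float y2) {
            this.x1 = x1; this.y1 = y1; this.x2 = x2; this.y2 = y2;
        }
        boolean isHorizontal(float tol) { return Math.abs(y1 - y2) <= tol; }
        boolean isVertical(float tol) { return Math.abs(x1 - x2) <= tol; }
    }

    /** Sorted column (xs) and row (ys) boundaries of the detected grid. */
    static final class Grid {
        final List<Float> xs, ys;
        Grid(List<Float> xs, List<Float> ys) { this.xs = xs; this.ys = ys; }
        int columns() { return Math.max(0, xs.size() - 1); }
        int rows() { return Math.max(0, ys.size() - 1); }
    }

    /** Walks the sorted values and drops any that lie within tol of the previous kept value. */
    static List<Float> distinct(TreeSet<Float> sorted, float tol) {
        List<Float> out = new ArrayList<>();
        for (float v : sorted) {
            if (out.isEmpty() || v - out.get(out.size() - 1) > tol) {
                out.add(v);
            }
        }
        return out;
    }

    /** Horizontal segments contribute row boundaries, vertical ones column boundaries. */
    static Grid detect(List<Segment> segments, float tol) {
        TreeSet<Float> xs = new TreeSet<>();
        TreeSet<Float> ys = new TreeSet<>();
        for (Segment s : segments) {
            if (s.isHorizontal(tol)) {
                ys.add(s.y1);
            } else if (s.isVertical(tol)) {
                xs.add(s.x1);
            }
        }
        return new Grid(distinct(xs, tol), distinct(ys, tol));
    }

    public static void main(String[] args) {
        // Segment coordinates taken from this article's sample run.
        List<Segment> segs = new ArrayList<>();
        segs.add(new Segment(84.36f, 321.84f, 510.94f, 321.84f));
        segs.add(new Segment(84.36f, 353.54f, 510.94f, 353.54f));
        segs.add(new Segment(84.36f, 385.24f, 510.94f, 385.24f));
        segs.add(new Segment(84.6f, 321.6f, 84.6f, 385.0f));
        segs.add(new Segment(191.1f, 321.6f, 191.1f, 385.0f));
        segs.add(new Segment(297.6f, 321.6f, 297.6f, 385.0f));
        segs.add(new Segment(404.15f, 321.6f, 404.15f, 385.0f));
        segs.add(new Segment(510.7f, 321.6f, 510.7f, 385.0f));
        Grid g = detect(segs, 1.0f);
        System.out.println(g.columns() + " columns x " + g.rows() + " rows"); // prints "4 columns x 2 rows"
    }
}
```

Three horizontal and five vertical borders give a grid of 4 columns by 2 rows, which matches the header row and the single data row of the sample table.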

MyPdf.java, the PDF parsing code

package com.penngo.pdf;

import java.io.File;
import java.io.IOException;
import java.util.List;

import javax.xml.parsers.ParserConfigurationException;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.fit.pdfdom.PDFDomTree;
import org.fit.pdfdom.PathSegment;
import org.fit.pdfdom.TextMetrics;
import org.fit.pdfdom.resource.ImageResource;

/**
 * Subclasses Pdf2Dom's PDFDomTree and logs every text run, line segment, and
 * image with its coordinates while the DOM is being built.
 * The printed labels stay in Chinese so they match the sample output:
 * 页码 = page number, 文本 = text, 线条 = line, 图片 = image.
 */
public class MyPdf extends PDFDomTree {
    // Declaring ParserConfigurationException as well keeps this compiling against
    // Pdf2Dom versions whose super constructor throws it.
    public MyPdf() throws IOException, ParserConfigurationException {
        super();
    }

    @Override
    protected void startNewPage() {
        // The sample run prints 0 for the first page.
        System.out.println("====页码:" + pagecnt);
        super.startNewPage();
    }

    @Override
    protected void renderText(String data, TextMetrics metrics) {
        System.out.println("====文本:" + data + ",x:" + (int) metrics.getX() + ",top:" + (int) metrics.getTop()
                + ",width:" + (int) metrics.getWidth() + ",height:" + (int) metrics.getHeight());
        curpage.appendChild(createTextElement(data, metrics.getWidth()));
    }

    @Override
    protected void renderPath(List<PathSegment> path, boolean stroke, boolean fill) throws IOException {
        // Print every segment of the path; iterating also avoids path.get(0) failing on an empty path.
        System.out.println("====线条size:" + path.size());
        for (int i = 0; i < path.size(); i++) {
            PathSegment seg = path.get(i);
            System.out.println("====线条" + i + ",x1:" + seg.getX1() + ",y1:" + seg.getY1()
                    + ",x2:" + seg.getX2() + ",y2:" + seg.getY2() + ",stroke:" + stroke + ",fill:" + fill);
        }
        super.renderPath(path, stroke, fill);
    }

    @Override
    protected void renderImage(float x, float y, float width, float height, ImageResource resource) throws IOException {
        System.out.println("====图片:" + "x:" + x + ",y:" + y + ",width:" + width + ",height:" + height);
        curpage.appendChild(createImageElement(x, y, width, height, resource));
    }

    public void parsePdf(PDDocument doc) {
        try {
            // Building the DOM drives the render* callbacks above.
            createDOM(doc);
        } catch (Exception e) {
            e.printStackTrace();
        }
    }

    public static void main(String[] args) {
        File pdfFile = new File("F:\\dev\\test\\test.pdf");
        // PDDocument.load is the PDFBox 2.x API; try-with-resources closes the document.
        try (PDDocument document = PDDocument.load(pdfFile)) {
            MyPdf pdfDomTree = new MyPdf();
            pdfDomTree.parsePdf(document);
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}

Output (labels: 页码 = page number, 文本 = text, 图片 = image, 线条 = line):

====页码:0
====文本:文章:https://blog.csdn.net/penngo/article/details/125436956,x:90,top:81,width:312,height:13
====文本:1.2,x:90,top:112,width:14,height:13
====文本:安装,x:110,top:112,width:20,height:13
====文本:HTML,x:133,top:112,width:29,height:13
====文本:Publisher,x:166,top:112,width:46,height:13
====图片:x:90.0,y:139.2,width:414.48,height:177.36
====文本:插件,x:215,top:112,width:20,height:13
====文本:表头,x:90,top:331,width:20,height:13
====文本:1,x:113,top:331,width:6,height:13
====文本:表头,x:196,top:331,width:20,height:13
====文本:2,x:220,top:331,width:6,height:13
====文本:表头,x:303,top:331,width:20,height:13
====文本:3,x:326,top:331,width:6,height:13
====文本:表头,x:409,top:331,width:20,height:13
====文本:4,x:433,top:331,width:6,height:13
====文本:列,x:90,top:362,width:10,height:13
====文本:1,x:103,top:362,width:6,height:13
====文本:列,x:196,top:362,width:10,height:13
====文本:2,x:209,top:362,width:6,height:13
====文本:列,x:303,top:362,width:10,height:13
====文本:3,x:316,top:362,width:6,height:13
====文本:列,x:409,top:362,width:10,height:13
====线条size:1
====线条0,x1:84.36,y1:321.83997,x2:510.94,y2:321.83997,stroke:true,fill:false
====线条size:1
====线条0,x1:84.36,y1:353.54,x2:510.94,y2:353.54,stroke:true,fill:false
====线条size:1
====线条0,x1:84.36,y1:385.24,x2:510.94,y2:385.24,stroke:true,fill:false
====线条size:1
====线条0,x1:84.6,y1:321.59998,x2:84.6,y2:385.0,stroke:true,fill:false
====线条size:1
====线条0,x1:191.1,y1:321.59998,x2:191.1,y2:385.0,stroke:true,fill:false
====线条size:1
====线条0,x1:297.6,y1:321.59998,x2:297.6,y2:385.0,stroke:true,fill:false
====线条size:1
====线条0,x1:404.15,y1:321.59998,x2:404.15,y2:385.0,stroke:true,fill:false
====线条size:1
====线条0,x1:510.7,y1:321.59998,x2:510.7,y2:385.0,stroke:true,fill:false
====文本:4,x:422,top:362,width:6,height:13
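
With the line positions known, each text run can be placed in a table cell by comparing its `x` and `top` against the grid boundaries. The sketch below is one possible way to do that, not something Pdf2Dom provides: `CellLocatorSketch` and `indexOf` are hypothetical names, the boundary arrays are copied by hand from the output above, and it assumes the text and line coordinates share the same origin, as they appear to in this sample.

```java
import java.util.Arrays;

// Hypothetical helper for illustration only; not a PDFBox or Pdf2Dom class.
class CellLocatorSketch {
    /**
     * Returns the index i with bounds[i] <= v < bounds[i + 1],
     * or -1 when v lies outside the sorted boundaries.
     */
    static int indexOf(float[] bounds, float v) {
        if (bounds.length < 2 || v < bounds[0] || v >= bounds[bounds.length - 1]) {
            return -1;
        }
        int pos = Arrays.binarySearch(bounds, v);
        // For a value between two boundaries, binarySearch returns -(insertionPoint) - 1.
        return pos >= 0 ? pos : -pos - 2;
    }

    public static void main(String[] args) {
        // Column and row boundaries read from the vertical and horizontal lines in the output.
        float[] xs = {84.6f, 191.1f, 297.6f, 404.15f, 510.7f};
        float[] ys = {321.84f, 353.54f, 385.24f};
        // The text run "列" reported at x:196, top:362.
        System.out.println("row " + indexOf(ys, 362f) + ", col " + indexOf(xs, 196f)); // prints "row 1, col 1"
    }
}
```

Indices are zero-based, so the run at x:196, top:362 falls in the second row and second column, and the header run at x:90, top:331 falls in row 0, column 0.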
