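As the title says, this post shows how to get the height and width of a PDF page. Only the first page is measured here.
There are two solutions, one using the pdfplumber package and one using the PyPDF2 package.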
# Approach 1: pdfplumber
import time
import pdfplumber

path = 'E:/data/DT_test/PDF_test/all_type.pdf'

def run(path):
    # Open the PDF and read the size of the first page only.
    with pdfplumber.open(path) as pdf:
        page_1 = pdf.pages[0]
        return page_1.height, page_1.width

start = time.time()
height, width = run(path)
print('height: %s, width: %s' % (height, width))  # height: 841.920, width: 595.200
print('cost time:', time.time() - start)  # cost time: 0.07300710678100586
# Approach 2: PyPDF2 (PdfFileReader and getPage are the pre-3.0 API)
import time
from PyPDF2 import PdfFileReader

path = 'E:/data/DT_test/PDF_test/all_type.pdf'

def run(path):
    # Keep the file open while reading, and close it afterwards.
    with open(path, 'rb') as f:
        pdf = PdfFileReader(f)
        page_1 = pdf.getPage(0)
        # MediaBox is [x0, y0, x1, y1]; a 90 or 270 degree rotation swaps height and width.
        if page_1.get('/Rotate', 0) in [90, 270]:
            return page_1['/MediaBox'][2], page_1['/MediaBox'][3]
        else:
            return page_1['/MediaBox'][3], page_1['/MediaBox'][2]

start = time.time()
height, width = run(path)
print('height: %s, width: %s' % (height, width))  # height: 841.92, width: 595.2
print('cost time:', time.time() - start)  # cost time: 0.007000923156738281
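The rotation handling in the PyPDF2 version can also be written as a small pure function, which makes it easy to check without a PDF file. This is only a sketch: the name page_height_width is hypothetical, and subtracting the lower-left corner is an added assumption to cover a MediaBox that does not start at (0, 0), which the code above does not handle.

```python
def page_height_width(media_box, rotate=0):
    """Return (height, width) of a page as displayed.

    media_box: four numbers [x0, y0, x1, y1] (lower-left and upper-right corners).
    rotate: the page's /Rotate value in degrees, a multiple of 90.
    """
    x0, y0, x1, y1 = (float(v) for v in media_box)
    width = abs(x1 - x0)
    height = abs(y1 - y0)
    # A quarter-turn rotation (90 or 270) shows the page on its side, so swap the two.
    if rotate % 180 == 90:
        height, width = width, height
    return height, width


print(page_height_width([0, 0, 595.2, 841.92]))      # (841.92, 595.2)
print(page_height_width([0, 0, 595.2, 841.92], 90))  # (595.2, 841.92)
```

With PyPDF2 it could be called as page_height_width(page_1['/MediaBox'], page_1.get('/Rotate', 0)).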
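When parsing a PDF with PyPDF2, you may run into files that are encrypted and cannot be parsed. Check pdf.isEncrypted to find out: it is True for an encrypted file and False otherwise. An encrypted file can be decrypted first with the following command and then processed as above: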
qpdf --decrypt file_path new_file_path
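If the decryption step needs to run from the same script, the command can be called through subprocess. The sketch below assumes the qpdf executable is installed and on PATH, and that the file can be opened without supplying a password. The helper names build_decrypt_command and decrypt_pdf are hypothetical.

```python
import shutil
import subprocess


def build_decrypt_command(src, dst):
    # Same arguments as the command line above: qpdf --decrypt src dst
    return ["qpdf", "--decrypt", src, dst]


def decrypt_pdf(src, dst):
    """Write a decrypted copy of src to dst using the qpdf executable."""
    if shutil.which("qpdf") is None:
        raise RuntimeError("qpdf was not found on PATH")
    # check=True raises CalledProcessError if qpdf exits with a non-zero status.
    subprocess.run(build_decrypt_command(src, dst), check=True)
    return dst


print(build_decrypt_command("file_path", "new_file_path"))
```

After decrypt_pdf(path, new_path) succeeds, new_path can be passed to run() in place of the original file.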
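These are the two extraction methods I have found so far. The pdfplumber approach is simple but slower, and the time it takes grows with file size. The PyPDF2 approach is slightly more involved but faster, and a larger file has little effect on extraction time.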
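The timings quoted in this post come from a single call each, and such measurements can vary from run to run (for example because of disk caching). A slightly steadier comparison is to repeat the call and keep the fastest run. The helper best_time below is a hypothetical sketch, and the sum workload is only a stand-in so the snippet runs on its own.

```python
import time


def best_time(func, *args, repeat=5):
    """Call func(*args) `repeat` times; return (last result, fastest elapsed seconds)."""
    best = float("inf")
    result = None
    for _ in range(repeat):
        start = time.perf_counter()
        result = func(*args)
        best = min(best, time.perf_counter() - start)
    return result, best


# Stand-in workload; replace sum and range(1000) with run and path to time either approach.
result, seconds = best_time(sum, range(1000))
print(result, seconds)
```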