Five PDF-processing libraries: PyPDF3, Textract, tika, pdfPlumber, pdfMiner
Install it with the pip package manager; the other libraries are installed the same way.
pip install PyPDF3
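The advantage of this library is that it is easy to install. It does extract the text in the file, but it breaks the words of a single line across several lines and even splits individual words apart, so the extraction quality is not very high.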
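import PyPDF3

# Open the PDF in binary mode; the with-block closes the file afterwards
with open(r'国际文件/National-Strategy-for-Advanced-Manufacturing-10072022.pdf', 'rb') as fhandle:
    pdfReader = PyPDF3.PdfFileReader(fhandle)
    # Loop over every page (54 in this document) instead of hard-coding the count
    for i in range(pdfReader.getNumPages()):
        pagehandle = pdfReader.getPage(i)
        print(pagehandle.extractText())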
N
ATIONAL
S
TRATEGY FOR
A
DVANCED
M
ANUFACTURING
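The fragmentation shown above follows a visible pattern: a lone capital letter on one line, then the rest of the word on the next. As a rough illustration only, the sketch below glues such pieces back together. The function name `rejoin_fragments` is hypothetical and the rule is a heuristic based on this one sample; it would wrongly merge a genuine one-letter word such as "A" with the word after it.

```python
def rejoin_fragments(text: str) -> str:
    """Heuristically merge lines that were split after a lone capital letter."""
    pieces = [line.strip() for line in text.splitlines() if line.strip()]
    result = ""
    glue_next = False  # True when the previous piece was a lone capital letter
    for piece in pieces:
        if not result:
            result = piece
        elif glue_next:
            result += piece        # "N" + "ATIONAL" -> "NATIONAL"
        else:
            result += " " + piece  # otherwise treat the line break as a space
        glue_next = len(piece) == 1 and piece.isupper()
    return result


# The fragmented title exactly as printed above
sample = "N\nATIONAL\nS\nTRATEGY FOR\nA\nDVANCED\nM\nANUFACTURING"
print(rejoin_fragments(sample))
```

On this sample the heuristic recovers the title on a single line, but it is not a general fix for the extraction problem.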
textract
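It recognizes English text accurately. The text that textract returns directly is a byte string; calling decode on it gives a normal text string. In practice the extraction accuracy is very high.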
import textract

# process() returns the extracted text as bytes
text = textract.process("国际文件/National-Strategy-for-Advanced-Manufacturing-10072022.pdf")
# Decode the bytes into a regular str before printing
string = text.decode("utf-8")
print(string)
NATIONAL STRATEGY FOR ADVANCED MANUFACTURING
A Report by the SUBCOMMITTEE ON ADVANCED MANUFACTURING
COMMITTEE ON TECHNOLOGY of the
NATIONAL SCIENCE AND TECHNOLOGY COUNCIL
October 2022
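The bytes-to-string step can be tried without a PDF at hand. The sketch below uses a hand-made byte string as a stand-in for the extractor's return value. The helper name `to_text` and the choice of `errors="replace"` are assumptions made for illustration, not something textract requires.

```python
def to_text(raw: bytes, encoding: str = "utf-8") -> str:
    """Decode extractor output, replacing undecodable bytes instead of raising."""
    return raw.decode(encoding, errors="replace")


# Stand-in for the bytes that an extractor would return
raw = "October 2022".encode("utf-8")
print(type(raw).__name__)
print(to_text(raw))
```

With `errors="replace"`, a stray invalid byte becomes the Unicode replacement character rather than stopping the script with a `UnicodeDecodeError`. Whether that trade-off is acceptable depends on how the text is used afterwards.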
Installation:
pip install tika
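tika-python, the Python port of the Apache Tika library, starts a Tika REST server in the background, so Java 7+ must be installed on the system for the library to work.
from tika import parser

file = "国际文件/National-Strategy-for-Advanced-Manufacturing-10072022.pdf"
# from_file() returns a dictionary; the extracted text is under 'content'
file_data = parser.from_file(file)
text = file_data['content']
print(text)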
NATIONAL STRATEGY FOR
ADVANCED MANUFACTURING
A Report by the
SUBCOMMITTEE ON ADVANCED MANUFACTURING
COMMITTEE ON TECHNOLOGY
of the
NATIONAL SCIENCE AND TECHNOLOGY COUNCIL
October 2022 October 2022
The final text extraction result is quite good.
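Since the text comes back inside a dictionary, a small wrapper can make the lookup more defensive. The sketch below runs on a hand-written dictionary rather than a real Tika response. It assumes the `content` value may be empty or missing and may carry blank lines; the helper name `extract_content` is hypothetical.

```python
def extract_content(parsed: dict) -> str:
    """Return the 'content' field as cleaned text, or "" when it is missing."""
    content = parsed.get("content") or ""
    # Strip each line and drop the blank ones
    lines = [line.strip() for line in content.splitlines()]
    return "\n".join(line for line in lines if line)


# Hand-written stand-in for the dictionary returned by parser.from_file
fake_result = {"content": "\n\n\nNATIONAL STRATEGY FOR\nADVANCED MANUFACTURING\n\n"}
print(extract_content(fake_result))
```

Dropping blank lines also removes paragraph breaks, so this cleanup suits a quick inspection better than layout-sensitive processing.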
pdfPlumber is simple to install and easy to use, and it extracts body text well.
import pdfplumber

with pdfplumber.open(r'国际文件/National-Strategy-for-Advanced-Manufacturing-10072022.pdf') as pdf:
    # Iterate over every page instead of hard-coding the page count (54 here)
    for page in pdf.pages:
        print(page.extract_text())
N S
ATIONAL TRATEGY FOR
A M
DVANCED ANUFACTURING
A Report by the
SUBCOMMITTEE ON ADVANCED MANUFACTURING
COMMITTEE ON TECHNOLOGY
of the
NATIONAL SCIENCE AND TECHNOLOGY COUNCIL
OOccttoobbeerr 22002222
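The last line of this output has every character doubled. pdfplumber documents a page-level `dedupe_chars()` method for duplicated characters; the sketch below is not that method but a text-only heuristic with hypothetical helper names. It collapses a line only when every word in it is made of doubled characters, so ordinary lines pass through unchanged.

```python
def is_doubled(token: str) -> bool:
    """True when every character of the token appears twice in a row."""
    return len(token) >= 2 and len(token) % 2 == 0 and token[0::2] == token[1::2]


def collapse_doubled_line(line: str) -> str:
    """Collapse 'OOcctt...' style lines; leave all other lines untouched."""
    tokens = line.split()
    if tokens and all(is_doubled(t) for t in tokens):
        return " ".join(t[0::2] for t in tokens)
    return line


print(collapse_doubled_line("OOccttoobbeerr 22002222"))
print(collapse_doubled_line("A Report by the"))
```

Short tokens that really consist of a repeated character (a line containing only "22", for example) would be collapsed incorrectly, so the all-words-doubled rule is a guard rather than a guarantee.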
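pdfMiner's official documentation is detailed, but the library is somewhat complicated to use, and you need to read the example code carefully to get started. Its text extraction accuracy is nevertheless quite good.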
import io

from pdfminer3.converter import TextConverter
from pdfminer3.layout import LAParams
from pdfminer3.pdfinterp import PDFPageInterpreter, PDFResourceManager
from pdfminer3.pdfpage import PDFPage

# The converter writes the extracted text into an in-memory buffer
resource_manager = PDFResourceManager()
fake_file_handle = io.StringIO()
converter = TextConverter(resource_manager, fake_file_handle, laparams=LAParams())
page_interpreter = PDFPageInterpreter(resource_manager, converter)

with open('国际文件/National-Strategy-for-Advanced-Manufacturing-10072022.pdf', 'rb') as fh:
    # Feed each page through the interpreter; the text accumulates in the buffer
    for page in PDFPage.get_pages(fh,
                                  caching=True,
                                  check_extractable=True):
        page_interpreter.process_page(page)
    text = fake_file_handle.getvalue()

# close open handles
converter.close()
fake_file_handle.close()

print(text)
NATIONAL STRATEGY FOR
ADVANCED MANUFACTURING
A Report by the
SUBCOMMITTEE ON ADVANCED MANUFACTURING
COMMITTEE ON TECHNOLOGY
of the
NATIONAL SCIENCE AND TECHNOLOGY COUNCIL
October 2022
October 2022
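The comparison above judges each library by eye. One possible way to put a rough number on it is to normalize whitespace and score each output against a reference string with `difflib` from the standard library. The function names and the choice of `SequenceMatcher` are assumptions for this sketch, and the sample strings are short excerpts copied from the outputs above, not full-document results.

```python
import difflib


def normalize(text: str) -> str:
    """Collapse all runs of whitespace (including newlines) to single spaces."""
    return " ".join(text.split())


def similarity(candidate: str, reference: str) -> float:
    """Similarity ratio in [0, 1] between two whitespace-normalized strings."""
    matcher = difflib.SequenceMatcher(None, normalize(candidate), normalize(reference))
    return matcher.ratio()


reference = "NATIONAL STRATEGY FOR ADVANCED MANUFACTURING"
tika_like = "NATIONAL STRATEGY FOR\nADVANCED MANUFACTURING"
pypdf_like = "N\nATIONAL\nS\nTRATEGY FOR\nA\nDVANCED\nM\nANUFACTURING"

print(similarity(tika_like, reference))
print(similarity(pypdf_like, reference))
```

Line breaks alone do not lower the score because they are normalized away, while split words do. A character-level ratio says nothing about reading order or tables, so it only complements a manual check.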