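Recently someone asked me to parse a PDF of printed material. I tried the methods offered by Tencent, Alibaba, and Baidu, and none of them worked well. Later I found that a startup, 庖丁科技 (Paoding Technology), has optimized for this case fairly well, so I bought their API. Here is my Python implementation.
Official site: http://www.pdflux.com/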
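# upload_file.py (the second script imports upload from this module)
import json

import requests
from Get_Token import encode_url  # signing helper downloaded from the PDFlux console


def get_file_content(filePath):
    """Read a file and return its raw bytes."""
    with open(filePath, 'rb') as fp:
        return fp.read()


def upload(URL, fname):
    """Sign the URL, upload the PDF, and return the raw response text."""
    url = encode_url(URL, 'pdflux', 'qTLmxhIi20YH')
    # Open the file in a with-block so the handle is closed after the request.
    with open(fname, 'rb') as fp:
        r = requests.post(url, files={'file': fp})
    return r.text


if __name__ == "__main__":
    fname = '../chengdu/1995.pdf'
    user = ''  # fill in your own user name
    URL = 'http://saas.pdflux.com/api/v1/saas/upload?user={}&force_updata=true'.format(user)
    result = upload(URL, fname)
    print(result)
    json_file = fname + '.json'
    with open(json_file, "w") as fp:
        # Parse the response text first (assuming it is JSON, like the status
        # response) so the file holds real JSON rather than a quoted string.
        fp.write(json.dumps(json.loads(result), indent=4))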
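Some of you may wonder where Get_Token comes from. Once their staff give you an account, you can log in and download it. Login address: https://saas.pdflux.com/#/login
We had our account opened directly by their staff; if you need one, just contact them.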
import json
import os
import time

import requests
from Get_Token import encode_url
from upload_file import upload


def get_status(uuid, user):
    """Return the raw status response for an uploaded document."""
    URL = 'http://saas.pdflux.com/api/v1/saas/document/{}?user={}'.format(uuid, user)
    url = encode_url(URL, 'pdflux', 'qTLmxhIi20YH')
    r = requests.get(url)
    return r.text


def download_data(uuid, file_name, user):
    """Download the Excel export of a parsed document to file_name."""
    url = 'http://saas.pdflux.com/api/v1/saas/document/{}/excel?user={}'.format(uuid, user)
    down_url = encode_url(url, 'pdflux', 'qTLmxhIi20YH')
    down_res = requests.get(url=down_url)
    with open(file_name, "wb") as code:
        code.write(down_res.content)


def test_status(user=''):
    uuid = 'fad4c522-c71c-11ea-ba3d-00163e028884'
    # uuid='fb892010-c6a6-11ea-ba3d-00163e028884'
    res = get_status(uuid, user)  # get_status needs both uuid and user
    print(res)


if __name__ == "__main__":
    fnames = ['./pdf_data/1988.pdf', './pdf_data/1989.pdf', './pdf_data/1990.pdf', './pdf_data/1991.pdf', './pdf_data/1992.pdf', './pdf_data/1993.pdf', '../pdf_data/1996.pdf']
    user = ''  # fill in your own user name
    uuids = []  # fill in the uuid of each uploaded file, in the same order as fnames
    for uuid, fname in zip(uuids, fnames):
        file_name = fname + '.xls'
        if os.path.exists(file_name):
            continue  # already downloaded
        while True:
            res = json.loads(get_status(uuid, user))
            print(res)
            # parsed == 2 is treated as "parsing finished"
            if res['data']['parsed'] == 2:
                download_data(uuid, file_name, user)
                break
            time.sleep(20)  # check again in 20 seconds
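Fill in your user, the uuids, and so on, then just wait: once processing finishes, the results are downloaded. In the end, all the tables in a PDF are merged into a single Excel file, which is very convenient. I am also a bit curious how they implemented this PDF table parsing. I tried many open-source solutions and big-company APIs, and none of them reached this level.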
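One thing to watch for: the wait loop in the script never gives up, so a file that never reaches parsed == 2 will block every file after it. Below is a minimal sketch of the same polling with a timeout. The names wait_until_parsed, fetch_status, interval, timeout, and sleep are my own, not part of the PDFlux API. The only thing taken from the script is the check that data.parsed equals 2; I don't know what the other values mean, which is why the sketch relies on a timeout rather than guessing at a failure code.

```python
import time


def wait_until_parsed(fetch_status, uuid, user, interval=20, timeout=1800, sleep=time.sleep):
    """Poll fetch_status(uuid, user) until data.parsed == 2 or the timeout passes.

    fetch_status is a hypothetical callable that returns the decoded status dict.
    Returns True once parsing has finished and False if the timeout is reached.
    """
    waited = 0
    while True:
        res = fetch_status(uuid, user)
        # Tolerate a missing or empty 'data' field instead of raising KeyError.
        if (res.get('data') or {}).get('parsed') == 2:
            return True
        if waited >= timeout:
            return False
        sleep(interval)
        waited += interval
```

With the functions from the script above, it could be used like this: if wait_until_parsed(lambda u, usr: json.loads(get_status(u, usr)), uuid, user) returns True, call download_data(uuid, file_name, user); otherwise log the uuid and move on to the next file. Passing the status function in as an argument also makes it possible to try the loop without touching the network.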