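According to the official documentation, files can be submitted to Solr with curl: https://wiki.apache.org/solr/ExtractingRequestHandler
See the official wiki for details on how it works.
So how can the same thing be done from Python?
1. Configure ExtractingRequestHandler in solrconfig.xml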
<requestHandler name="/update/extract" class="solr.handler.extraction.ExtractingRequestHandler" startup="lazy">
<lst name="defaults">
<str name="fmap.content">text</str> <!-- the field that receives the content Tika extracts from the PDF -->
<str name="lowernames">true</str>
<str name="uprefix">ignored_</str> <!-- important: works together with the dynamic field defined in schema.xml -->
</lst>
</requestHandler>
2. Configure the fields in schema.xml
<fields>
<field name="_version_" type="long" indexed="true" stored="true"/>
<field name="subject" type="text" indexed="true" stored="true"/>
<field name="text" type="text" indexed="true" stored="false" multiValued="true"/>
<field name="id" type="string" indexed="true" stored="true" required="true" /> <!-- primary key -->
<dynamicField name="ignored_*" type="ignored"/> <!-- matches the ignored_ uprefix configured in step 1 -->
</fields>
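To make the interplay between the two files concrete, here is a small Python sketch of where a field name produced by Tika would end up under the settings above. It only approximates the order described in the Solr documentation (lowernames first, then fmap, then uprefix). The names map_field_name, SCHEMA_FIELDS, FMAP and UPREFIX are made up for illustration and are not part of Solr.

```python
import re

# Fields declared explicitly in the schema.xml above (illustrative copy).
SCHEMA_FIELDS = {"_version_", "subject", "text", "id"}

# Defaults taken from the solrconfig.xml above.
FMAP = {"content": "text"}
UPREFIX = "ignored_"


def map_field_name(name, lowernames=True):
    """Approximate how an extracted field name is mapped to an index field."""
    if lowernames:
        # Lowercase, then replace anything that is not a-z or 0-9 with "_".
        name = re.sub(r"[^a-z0-9]", "_", name.lower())
    # Apply the fmap.<source>=<target> rename, if one is configured.
    name = FMAP.get(name, name)
    # Fields unknown to the schema get the prefix, so ignored_* catches them.
    if name not in SCHEMA_FIELDS:
        name = UPREFIX + name
    return name


for raw in ("content", "subject", "Content-Type"):
    print(raw, "->", map_field_name(raw))
```

With these settings the extracted body lands in text, subject is kept as it is, and metadata such as Content-Type falls into the ignored_* dynamic field instead of causing an unknown-field error.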
3. The official wiki submits a file with curl, for example:
curl "http://localhost:8983/solr/update/extract?literal.id=doc2&commit=true" -F "[email protected]"
So how is this done in Python?
Here we use the third-party Python library pycurl, which can be downloaded from https://pypi.python.org/pypi/pycurl
Here is the code:
import pycurl
from io import BytesIO

# qa_file is the Solr core; literal.id supplies the value of the id field
url = "http://localhost/solr/qa_file/update/extract?literal.id=filename&commit=true"
cur = pycurl.Curl()
fp = BytesIO()  # collects the response body
cur.setopt(pycurl.WRITEFUNCTION, fp.write)
cur.setopt(pycurl.FOLLOWLOCATION, 1)
cur.setopt(pycurl.MAXREDIRS, 5)
cur.setopt(pycurl.CONNECTTIMEOUT, 60)
cur.setopt(pycurl.TIMEOUT, 300)
cur.setopt(cur.POST, 1)
cur.setopt(cur.URL, url)
# cur.setopt(cur.POSTFIELDS, urllib.urlencode(post_data_dic))
# Upload the file as multipart/form-data, the equivalent of curl -F
cur.setopt(cur.HTTPPOST, [("file", (cur.FORM_FILE, r"E:\1Python\test.pdf"))])
cur.perform()
status = cur.getinfo(cur.HTTP_CODE)
bbody = fp.getvalue().decode("utf-8")
print(status, "\n", bbody)
cur.close()
literal.id=filename sets the id field configured in the schema above, that is, the primary key.
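Since the id usually comes from a real file name, it may contain spaces or other characters that are not safe in a URL. The following is a minimal sketch of building the extract URL with proper escaping. The helper name build_extract_url and its parameters are hypothetical; the base URL and the core name qa_file are taken from the code above.

```python
from urllib.parse import urlencode


def build_extract_url(base_url, core, doc_id, commit=True):
    """Build the /update/extract URL, escaping the literal.id value."""
    params = {
        "literal.id": doc_id,
        "commit": "true" if commit else "false",
    }
    return "{}/{}/update/extract?{}".format(
        base_url.rstrip("/"), core, urlencode(params)
    )


# Hypothetical file name containing a space.
print(build_extract_url("http://localhost/solr", "qa_file", "my report.pdf"))
```

The resulting string can be passed to cur.setopt(cur.URL, ...) in place of the hard-coded url.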
With this in place, the PDF can be found in Solr by searching for its content.
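As a rough illustration of such a search, the sketch below queries the text field and returns the ids of the matching documents. It assumes the core exposes the standard /select handler with JSON output. The function names build_select_url and search_ids are hypothetical, and the keyword is passed through without escaping Solr query syntax. Because text is declared with stored="false" in the schema above, only fields such as id can be returned, not the extracted content itself.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def build_select_url(base_url, core, keyword):
    """Build a /select URL that searches the text field for keyword."""
    params = {"q": "text:" + keyword, "fl": "id", "wt": "json"}
    return "{}/{}/select?{}".format(base_url.rstrip("/"), core, urlencode(params))


def search_ids(base_url, core, keyword):
    """Run the query and return the ids of the matching documents."""
    with urlopen(build_select_url(base_url, core, keyword), timeout=30) as resp:
        data = json.load(resp)
    return [doc["id"] for doc in data["response"]["docs"]]


# Only builds the URL; call search_ids(...) against a running Solr instance.
print(build_select_url("http://localhost/solr", "qa_file", "solr"))
```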