Indexing documents (Word, PDF, etc.) with Elasticsearch

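Install ES 5.x

The kernel limit vm.max_map_count has to be raised first, otherwise Elasticsearch will not start. (This is an operating-system setting rather than a JVM option.)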

sudo sysctl -w vm.max_map_count=262144
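As a quick sanity check before starting Elasticsearch, the current limit can be compared against the value used above. This is a minimal Python sketch that assumes a Linux host exposing the setting at /proc/sys/vm/max_map_count; the function names are illustrative and not part of any Elasticsearch tooling.

```python
from pathlib import Path

REQUIRED_MAX_MAP_COUNT = 262144  # value used in the sysctl command above


def is_sufficient(current: int, required: int = REQUIRED_MAX_MAP_COUNT) -> bool:
    """Return True when the kernel limit is at least the required value."""
    return current >= required


def read_max_map_count(path: str = "/proc/sys/vm/max_map_count") -> int:
    """Read the current limit (assumption: Linux, proc file present)."""
    return int(Path(path).read_text().strip())


# Usage on a Linux host (not executed here):
#   print(is_sufficient(read_max_map_count()))
```

Note that sysctl -w does not survive a reboot; to make the change permanent, the setting is usually added to /etc/sysctl.conf.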

 

1. Indexing files with Elasticsearch requires a plugin

ES version | Plugin name | Reference
Before ES 5.0 | mapper-attachments | https://qbox.io/blog/index-attachments-files-elasticsearch-mapper
ES 5.0 and later | ingest-attachment | https://qbox.io/blog/how-to-index-attachments-and-files-to-elasticsearch-5-0-using-ingest-api , https://www.elastic.co/guide/en/elasticsearch/plugins/5.6/using-ingest-attachment.html

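The existing ES cluster was running 2.3.5, so I first tried installing mapper-attachments for 2.3.5. The installation failed because the downloaded plugin build reported that it was built for ES 2.0. It seemed to install successfully on a 2.4 cluster, but please test that yourself. Since I also wanted to upgrade ES to 5.x anyway, I went with ES 5.6.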

2. Install the plugin (ingest-attachment)

sudo bin/elasticsearch-plugin install ingest-attachment

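Notes: 1. If Elasticsearch is deployed with Docker, this command must be run inside the container, otherwise it has no effect.

       2. After the attachment plugin is installed, the ES cluster must be restarted for the plugin to take effect.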
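After the restart, it can be worth confirming that every node actually reports the plugin. The sketch below is one way to do that in Python using the `_cat/plugins` API with `format=json`; it assumes the cluster listens on localhost:19200 as in the commands in this post, and the node names and version in the sample data are made up for illustration.

```python
import json
import urllib.request


def nodes_with_plugin(rows, plugin="ingest-attachment"):
    """Return the sorted names of nodes that report the given plugin.

    `rows` is the parsed JSON from `_cat/plugins?format=json`: a list of
    objects with "name" (node name) and "component" (plugin name) keys.
    """
    return sorted({row["name"] for row in rows if row.get("component") == plugin})


def fetch_plugin_rows(base_url="http://localhost:19200"):
    """Fetch the plugin listing from a running cluster (not called here)."""
    with urllib.request.urlopen(f"{base_url}/_cat/plugins?format=json") as resp:
        return json.loads(resp.read().decode("utf-8"))


# Hypothetical response for a two-node cluster where only one node has the plugin.
sample_rows = [
    {"name": "node-1", "component": "ingest-attachment", "version": "5.6.0"},
    {"name": "node-2", "component": "analysis-icu", "version": "5.6.0"},
]
print(nodes_with_plugin(sample_rows))  # ['node-1']
```

A node missing from the returned list either does not have the plugin installed or has not been restarted since the installation.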

 

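3. Create an attachment pipeline

Note: the properties to extract can be specified. The pipeline below selects "content", "title", "author", "keywords", "date", "content_length" and "content_type"; the 5.6 plugin documentation also lists "name" and "language".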

curl -XPUT 'http://localhost:19200/_ingest/pipeline/attachment?pretty' -H 'Content-Type: application/json' -d '{ "description" : "Extract attachment information encoded in Base64 with UTF-8 charset", "processors" : [ { "attachment" : { "field" : "data", "properties": [ "content", "title", "author", "keywords", "date", "content_length", "content_type" ] } } ] }'
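Before indexing real documents, the pipeline can be tried out with the ingest simulate API (`POST _ingest/pipeline/attachment/_simulate`), which runs the processors on a sample document without indexing anything. A minimal Python sketch, assuming the cluster is reachable at the same localhost:19200 address; the function names and the sample text are illustrative only.

```python
import base64
import json
import urllib.request


def build_simulate_body(raw: bytes) -> dict:
    """Wrap raw file bytes in the request body expected by the simulate API."""
    encoded = base64.b64encode(raw).decode("ascii")
    return {"docs": [{"_source": {"data": encoded}}]}


def simulate(body: dict, base_url: str = "http://localhost:19200") -> dict:
    """POST the body to the pipeline's _simulate endpoint (needs a running cluster)."""
    req = urllib.request.Request(
        f"{base_url}/_ingest/pipeline/attachment/_simulate",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read().decode("utf-8"))


# Build (but do not send) a request for a small made-up text payload.
body = build_simulate_body(b"hello attachment")
print(json.dumps(body))
```

Calling `simulate(body)` against a running cluster should return the document as the pipeline would transform it, including the extracted attachment fields.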

4. Create the index test

curl -XPUT 'http://localhost:19200/test/' -H 'Content-Type: application/json' -d '{ "settings":{ "index":{ "number_of_shards":1, "number_of_replicas":1 } } }'

5. Create the mapping

curl -XPUT 'http://localhost:19200/test/_mapping/document/' -H 'Content-Type: application/json' -d '
{
  "document": {
    "_source": { "excludes": [ "data", "attachment.content" ] },
    "properties": {
      "filename": { "type": "text" },
      "attachment": {
        "properties": {
          "date": { "type": "date" },
          "content_type": { "type": "text", "fields": { "keyword": { "ignore_above": 256, "type": "keyword" } } },
          "author": { "type": "text", "fields": { "keyword": { "ignore_above": 256, "type": "keyword" } } },
          "title": { "type": "text", "fields": { "keyword": { "ignore_above": 256, "type": "keyword" } } },
          "content": { "type": "text" },
          "content_length": { "type": "long" }
        }
      },
      "data": { "type": "binary", "store": false },
      "filePath": { "type": "keyword" },
      "downloadTimes": { "type": "long" },
      "source": { "type": "keyword" },
      "type": { "type": "keyword" },
      "uploadTime": { "type": "date" },
      "viewTimes": { "type": "long" },
      "fileType": { "type": "keyword" }
    }
  }
}'

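Notes:

1. To index the content field without storing it, it must be excluded from _source; otherwise every query against large files returns the full extracted content. Setting "store": false alone has no effect, because the value is still returned as part of _source.

Reference: http://blog.csdn.net/napoay/article/details/62233031

"_source": { "excludes": [ "data", "attachment.content" ] },

2. "type": "keyword" is for exact-match search (the value is not analyzed).

"source": { "type": "keyword" }

3. ES 5 removed the string type; analyzed full-text fields now use text.

"content": { "type": "text" }

4. data is the Base64 encoding of the original document. It is mapped as binary, does not need to be visible, and is therefore also excluded from _source.

"data": { "type": "binary", "store": false }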

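6. Index data

Note: data is the Base64 encoding of the original document. When indexing through the Java API, read the file contents into a Base64 string and put it in the data field.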

curl -XPUT 'http://localhost:19200/test/document/test_id2?pipeline=attachment&pretty' -H 'Content-Type: application/json' -d '{ "source":"北京地区", "filename":"测试文档", "data": "UWJveCBlbmFibGVzIGxhdW5jaGluZyBzdXBwb3J0ZWQsIGZ1bGx5LW1hbmFnZWQsIFJFU1RmdWwgRWxhc3RpY3NlYXJjaCBTZXJ2aWNlIGluc3RhbnRseS4g" }'
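The note above refers to the Java API; the same step is sketched here in Python instead, using only the standard library. It reads a file, Base64-encodes its bytes into the data field and sends the document through the attachment pipeline. The function names, the sample file name and the "demo" source value are illustrative assumptions; the index, type and pipeline names are the ones used in this post.

```python
import base64
import json
import urllib.parse
import urllib.request


def build_document(raw: bytes, filename: str, source: str) -> dict:
    """Build the JSON document; `data` holds the Base64-encoded file bytes."""
    return {
        "source": source,
        "filename": filename,
        "data": base64.b64encode(raw).decode("ascii"),
    }


def index_file(path: str, doc_id: str, source: str,
               base_url: str = "http://localhost:19200") -> dict:
    """Read a file and PUT it through the attachment pipeline (needs a running cluster)."""
    with open(path, "rb") as f:
        raw = f.read()
    doc = build_document(raw, filename=path, source=source)
    url = (f"{base_url}/test/document/"
           f"{urllib.parse.quote(doc_id, safe='')}?pipeline=attachment")
    req = urllib.request.Request(
        url,
        data=json.dumps(doc).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read().decode("utf-8"))


# Encoding only, no network access: the four bytes "Qbox" as Base64.
doc = build_document(b"Qbox", filename="sample.txt", source="demo")
print(doc["data"])  # UWJveA==
```

Reading the file in binary mode matters: Word and PDF files are not text, so the bytes must be encoded exactly as they are on disk.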

7. Query

curl -XPOST 'http://localhost:19200/test/document/_search?pretty' -H 'Content-Type: application/json' -d '{ "query": { "bool": { "must": [ { "match_phrase": { "attachment.content": "Qbox" } }, { "term": { "source": "北京地区" } } ] } } }'
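The same query can be built and sent from Python. This is a sketch under the assumption that the index, type and field names match the ones used above; the helper names and the sample response are hypothetical. Because the mapping excludes attachment.content from _source, the hits are expected to carry metadata such as filename but not the extracted text.

```python
import json
import urllib.request


def build_query(phrase: str, source: str) -> dict:
    """Phrase match on the extracted content plus an exact match on source."""
    return {
        "query": {
            "bool": {
                "must": [
                    {"match_phrase": {"attachment.content": phrase}},
                    {"term": {"source": source}},
                ]
            }
        }
    }


def search(body: dict, base_url: str = "http://localhost:19200") -> dict:
    """POST the query to /test/document/_search (needs a running cluster)."""
    req = urllib.request.Request(
        f"{base_url}/test/document/_search",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read().decode("utf-8"))


def hit_summaries(response: dict) -> list:
    """Return (document id, filename) pairs from a search response."""
    return [(h["_id"], h["_source"].get("filename"))
            for h in response["hits"]["hits"]]


# Hypothetical, trimmed response used only to show the shape being parsed.
sample_response = {
    "hits": {"hits": [{"_id": "test_id2", "_source": {"filename": "sample"}}]}
}
print(hit_summaries(sample_response))  # [('test_id2', 'sample')]
```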

Other references:

https://www.elastic.co/guide/en/elasticsearch/plugins/5.6/using-ingest-attachment.html

https://www.elastic.co/guide/en/elasticsearch/reference/5.5/binary.html

 
