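A friend recently asked me how to search directories, file names, and file contents across a huge volume of files on disk. The solution had to be simple and efficient, easy to maintain, and cheap. After some thought I decided that using ES to index and search the documents was a suitable choice, so I wrote some code to implement it. Below I describe the design and the implementation.
Because the files are spread across different machines, the system is built around disk-scanning agents: the scan service is deployed as an agent on each server that hosts a target disk and runs as a scheduled task, while all indexes are built centrally in ES. ES itself is of course deployed as a distributed, highly available cluster. The search service is deployed together with the scan agent, which simplifies the architecture while keeping it distributed.
ES (Elasticsearch) is the only third-party software this project depends on. ES can be deployed with Docker; the steps are as follows.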
docker pull docker.elastic.co/elasticsearch/elasticsearch:6.3.2
docker run -e ES_JAVA_OPTS="-Xms256m -Xmx256m" -d -p 9200:9200 -p 9300:9300 --name es01 docker.elastic.co/elasticsearch/elasticsearch:6.3.2
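Once the container is running, open http://localhost:9200 in a browser. If the page loads and shows the ES node information, the deployment succeeded.
Besides the basic Spring Boot starters, the project needs the ES-related packages: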
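<dependency>
    <groupId>org.springframework.boot</groupId>
    <artifactId>spring-boot-starter-data-elasticsearch</artifactId>
</dependency>
<dependency>
    <groupId>io.searchbox</groupId>
    <artifactId>jest</artifactId>
    <version>5.3.3</version>
</dependency>
<dependency>
    <groupId>net.sf.jmimemagic</groupId>
    <artifactId>jmimemagic</artifactId>
    <version>0.1.4</version>
</dependency>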
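The ES address goes into application.yml. To keep the program simple, the root directory of the disk to scan (index-root) is configured there as well; the scan task later walks this directory recursively and picks up every indexable file under it.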
server:
  port: @elasticsearch.port@
spring:
  application:
    name: @project.artifactId@
  profiles:
    active: dev
  elasticsearch:
    jest:
      uris: http://127.0.0.1:9200
index-root: /Users/crazyicelee/mywokerspace
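Because the file's directory, its name, and its body text all have to be searchable, each of them is defined as an indexed field, and the id is annotated with the JestId annotation that the ES client requires.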
package com.crazyice.lee.accumulation.search.data;

import io.searchbox.annotations.JestId;
import lombok.Data;

@Data
public class Article {
    @JestId
    private Integer id;
    private String author;
    private String title;
    private String path;
    private String content;
    private String fileFingerprint;
}
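Since every file under the configured directory has to be scanned, the directory is traversed recursively, and files that have already been processed are marked to save work. There are two ways to identify a file's type: a more precise check based on the file content (Magic), or a rough check based on the file extension. This part is the core component of the whole system.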
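To make the difference between the two detection modes concrete, here is a minimal sketch that uses only the JDK. The class `FileTypeSniffer` and its method names are hypothetical and not part of the project (the project itself uses jmimemagic for the content-based mode); the sketch only recognises three well-known signatures.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Locale;

/** Hypothetical helper that contrasts extension-based and content-based detection. */
final class FileTypeSniffer {

    /** Coarse guess: the lower-cased text after the last dot, or "" if there is no dot. */
    static String byExtension(String fileName) {
        int dot = fileName.lastIndexOf('.');
        return dot < 0 ? "" : fileName.substring(dot + 1).toLowerCase(Locale.ROOT);
    }

    /** Content-based guess from the first four "magic" bytes of the file. */
    static String byMagic(Path file) throws IOException {
        byte[] head = new byte[4];
        int filled = 0;
        try (InputStream in = Files.newInputStream(file)) {
            // read() may return fewer bytes than requested, so loop until full or EOF
            int n;
            while (filled < head.length
                    && (n = in.read(head, filled, head.length - filled)) != -1) {
                filled += n;
            }
        }
        if (filled < 4) {
            return "unknown";
        }
        if (matches(head, 0x25, 0x50, 0x44, 0x46)) {
            return "pdf";   // "%PDF"
        }
        if (matches(head, 0x50, 0x4B, 0x03, 0x04)) {
            return "zip";   // ZIP container, which is also what .docx files are
        }
        if (matches(head, 0xD0, 0xCF, 0x11, 0xE0)) {
            return "ole2";  // legacy Office container, e.g. .doc
        }
        return "unknown";
    }

    private static boolean matches(byte[] head, int... signature) {
        for (int i = 0; i < signature.length; i++) {
            if ((head[i] & 0xFF) != signature[i]) {
                return false;
            }
        }
        return true;
    }
}
```

The extension check is cheap but trusts the file name; the content check costs one small read per file and still works when a file has been renamed.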
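An MD5 hash of the file content is computed and stored in an ES index field as the file's fingerprint. Each time the index is rebuilt, the scanner checks whether that MD5 already exists; if it does, the file is not indexed again. This prevents duplicate index entries and also avoids re-indexing every file after the system restarts.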
// Check whether the file has already been indexed
private JSONObject isIndex(File file) {
    JSONObject result = new JSONObject();
    // Use the MD5 as the file fingerprint and search for it in the index
    String fileFingerprint = Md5CaculateUtil.getMD5(file);
    result.put("fileFingerprint", fileFingerprint);
    SearchSourceBuilder searchSourceBuilder = new SearchSourceBuilder();
    searchSourceBuilder.query(QueryBuilders.termQuery("fileFingerprint", fileFingerprint));
    Search search = new Search.Builder(searchSourceBuilder.toString())
            .addIndex("diskfile")
            .addType("files")
            .build();
    try {
        // Execute the search
        SearchResult searchResult = jestClient.execute(search);
        result.put("isIndex", searchResult.getTotal() > 0);
    } catch (IOException e) {
        log.error("{}", e.getLocalizedMessage());
        // Treat a failed lookup as "not indexed" so callers always find the key
        result.put("isIndex", false);
    }
    return result;
}
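The listing above calls `Md5CaculateUtil.getMD5`, whose source is not shown in this article. As a rough idea of what such a helper might look like, here is a sketch built on the JDK's `MessageDigest`; the class name `FileFingerprint` and the method `md5Hex` are hypothetical, and the real utility may differ.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

/** Hypothetical stand-in for the fingerprint helper used by the scanner. */
final class FileFingerprint {

    /** Streams the file through MD5 and returns the digest as 32 lower-case hex digits. */
    static String md5Hex(Path file) throws IOException {
        MessageDigest md;
        try {
            md = MessageDigest.getInstance("MD5");
        } catch (NoSuchAlgorithmException e) {
            // MD5 is a required algorithm on every Java platform
            throw new IllegalStateException("MD5 not available", e);
        }
        byte[] buffer = new byte[8192];
        try (InputStream in = Files.newInputStream(file)) {
            int n;
            while ((n = in.read(buffer)) != -1) {
                md.update(buffer, 0, n); // hash chunk by chunk; the file is never fully in memory
            }
        }
        StringBuilder hex = new StringBuilder(32);
        for (byte b : md.digest()) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }
}
```

Streaming matters here because the scanner may meet very large files. MD5 is adequate for spotting identical content, but it is not collision-resistant, so it should not be relied on where files may be crafted adversarially.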
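Each file type is read differently, but the processing logic is largely the same, so an abstract class is used: the shared logic lives in the parent class and the type-specific handling lives in the subclasses. The parent class converts a file into an index object and converts the content format; each subclass implements reading the file content into a String.
1. Abstract parent class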
package com.crazyice.lee.accumulation.search.inter;

import com.crazyice.lee.accumulation.search.data.Article;
import com.crazyice.lee.accumulation.search.utils.Md5CaculateUtil;

import java.io.File;

public abstract class ReadFileContent {

    // Convert a file into an index object
    public Article Read(File file, String serviceIP) {
        Article article = new Article();
        article.setTitle(file.getName());
        article.setAuthor(file.getParent());
        article.setPath("file://" + serviceIP + ":" + file.getPath());
        article.setContent(readToString(file));
        article.setFileFingerprint(Md5CaculateUtil.getMD5(file));
        return article;
    }

    // Convert the content format for display in a web page
    public String charFilter(String s) {
        if (s.length() > 0) {
            // Replace \r\n, \r, \n and the like with an HTML line-break tag.
            // The source chains a second replaceAll after this one, but its
            // pattern did not survive extraction, so it is not reproduced here.
            return s.replaceAll("(\r\n|\r|\n|\n\r)+", "<br>");
        } else {
            return "";
        }
    }

    // Read the file content
    public abstract String readToString(File file);
}
2. Subclass for reading plain-text files
package com.crazyice.lee.accumulation.search.impl;

import com.crazyice.lee.accumulation.search.inter.ReadFileContent;
import lombok.extern.slf4j.Slf4j;

import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;

@Slf4j
public class TxtFile extends ReadFileContent {

    public String readToString(File file) {
        StringBuilder result = new StringBuilder();
        // Decode through a Reader so that a multi-byte UTF-8 character
        // split across two reads is still decoded correctly
        try (Reader in = new InputStreamReader(new FileInputStream(file), StandardCharsets.UTF_8)) {
            char[] buffer = new char[8192];
            int length;
            while ((length = in.read(buffer)) != -1) {
                result.append(buffer, 0, length);
            }
        } catch (IOException e) {
            // Also covers FileNotFoundException
            log.error("{}", e.getLocalizedMessage());
        }
        return charFilter(result.toString());
    }
}
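One detail is worth spelling out when text is read in fixed-size chunks: if every byte chunk is decoded on its own, a multi-byte UTF-8 character that straddles two chunks is corrupted, whereas decoding through a `Reader` keeps the decoder state between reads. The sketch below shows both variants side by side; the class `Utf8ReadDemo` and its methods are hypothetical and exist only for illustration.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;

/** Hypothetical demo comparing per-chunk decoding with Reader-based decoding. */
final class Utf8ReadDemo {

    /** Decodes every byte chunk separately; characters spanning two chunks break. */
    static String decodePerChunk(InputStream in, int chunkSize) throws IOException {
        StringBuilder sb = new StringBuilder();
        byte[] buf = new byte[chunkSize];
        int n;
        while ((n = in.read(buf)) != -1) {
            sb.append(new String(buf, 0, n, StandardCharsets.UTF_8));
        }
        return sb.toString();
    }

    /** Lets the Reader carry incomplete byte sequences over to the next read. */
    static String decodeWithReader(InputStream in, int chunkSize) throws IOException {
        StringBuilder sb = new StringBuilder();
        char[] buf = new char[chunkSize];
        try (Reader reader = new InputStreamReader(in, StandardCharsets.UTF_8)) {
            int n;
            while ((n = reader.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
        }
        return sb.toString();
    }
}
```

With an 8 KB buffer the problem only shows up at chunk boundaries, so it is easy to miss in testing; documents that consist mostly of Chinese text (three bytes per character in UTF-8) will hit it regularly.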
3. Subclass for reading doc files
package com.crazyice.lee.accumulation.search.impl;

import com.crazyice.lee.accumulation.search.inter.ReadFileContent;
import lombok.extern.slf4j.Slf4j;
import org.apache.poi.hwpf.extractor.WordExtractor;
import org.apache.poi.openxml4j.util.ZipSecureFile;

import java.io.File;
import java.io.FileInputStream;

@Slf4j
public class DocFile extends ReadFileContent {

    public String readToString(File file) {
        StringBuilder result = new StringBuilder();
        // Use WordExtractor from the HWPF component to extract the text of a Word document
        try (FileInputStream in = new FileInputStream(file)) {
            ZipSecureFile.setMinInflateRatio(-1.0d);
            WordExtractor extractor = new WordExtractor(in);
            result.append(extractor.getText());
        } catch (Exception e) {
            log.error("{}", e.getLocalizedMessage());
        }
        return charFilter(result.toString());
    }
}
4. Subclass for reading docx files
package com.crazyice.lee.accumulation.search.impl;

import com.crazyice.lee.accumulation.search.inter.ReadFileContent;
import lombok.extern.slf4j.Slf4j;
import org.apache.poi.openxml4j.util.ZipSecureFile;
import org.apache.poi.xwpf.extractor.XWPFWordExtractor;
import org.apache.poi.xwpf.usermodel.XWPFDocument;

import java.io.File;
import java.io.FileInputStream;

@Slf4j
public class DocxFile extends ReadFileContent {

    public String readToString(File file) {
        StringBuilder result = new StringBuilder();
        // Relax the inflate-ratio check before the document (a ZIP container) is
        // opened; setting it after the document has been loaded comes too late
        ZipSecureFile.setMinInflateRatio(-1.0d);
        try (FileInputStream in = new FileInputStream(file);
             XWPFDocument doc = new XWPFDocument(in)) {
            XWPFWordExtractor extractor = new XWPFWordExtractor(doc);
            result.append(extractor.getText());
        } catch (Exception e) {
            log.error("{}", e.getLocalizedMessage());
        }
        return charFilter(result.toString());
    }
}
5. Subclass for reading pdf files
package com.crazyice.lee.accumulation.search.impl;

import com.crazyice.lee.accumulation.search.inter.ReadFileContent;
import lombok.extern.slf4j.Slf4j;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;

import java.io.File;

@Slf4j
public class PdfFile extends ReadFileContent {

    public String readToString(File file) {
        StringBuilder result = new StringBuilder();
        try (PDDocument document = PDDocument.load(file)) {
            PDFTextStripper stripper = new PDFTextStripper();
            stripper.setSortByPosition(true);
            int pages = document.getNumberOfPages();
            // PDFTextStripper page numbers are 1-based and the end page is inclusive,
            // so extract exactly one page per iteration to avoid duplicated text
            for (int page = 1; page <= pages; page++) {
                stripper.setStartPage(page);
                stripper.setEndPage(page);
                result.append(stripper.getText(document));
            }
        } catch (Exception e) {
            log.error("{}", e.getLocalizedMessage());
        }
        return charFilter(result.toString());
    }
}
The disk scan runs in two steps.
1. Recursively scan the configured directory for all indexable files and store the files to be processed in a List.
// Traverse all files under the given directory
public void find(String pathName) {
    // Get the File object for pathName
    File dirFile = new File(pathName);
    // Check for read permission
    if (!dirFile.canRead()) {
        log.info("do not read");
        return;
    }
    if (!dirFile.isDirectory()) {
        String fileType = fileType(dirFile, JUDGE);
        if (FILETYPE.contains(fileType)) {
            destFile.add(dirFile);
        }
    } else {
        // Get the names of all files and directories in this directory
        String[] fileList = dirFile.list();
        if (fileList == null) {
            // list() returns null when an I/O error occurs
            return;
        }
        for (String subFile : fileList) {
            File file = new File(dirFile.getPath(), subFile);
            if (!file.canRead()) {
                continue;
            }
            if (file.isDirectory()) {
                // Skip links, then recurse into the directory
                if (fileType(file, JUDGE).equals("link")) {
                    continue;
                }
                try {
                    find(file.getCanonicalPath());
                } catch (Exception e) {
                    log.error("{}", e.getLocalizedMessage());
                }
            } else {
                // Ignore temporary files, whose names start with ~$
                if (file.getName().startsWith("~$")) {
                    continue;
                }
                if (FILETYPE.contains(fileType(file, JUDGE))) {
                    destFile.add(file);
                }
            }
        }
    }
    log.info("Files scanned so far: {}", destFile.size());
}
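For comparison, the same kind of traversal can be written with `java.nio.file.Files.walk`, which does not follow symbolic links by default. This is only a sketch under simplifying assumptions: the class `IndexableFileFinder` is hypothetical, file types are judged by extension alone, and an unreadable directory will surface as an `UncheckedIOException` rather than being skipped as in the hand-written recursion.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.stream.Collectors;
import java.util.stream.Stream;

/** Hypothetical alternative to the recursive scan, based on Files.walk. */
final class IndexableFileFinder {

    /** Returns readable regular files under root whose extension is in the given set. */
    static List<Path> find(Path root, Set<String> extensions) throws IOException {
        try (Stream<Path> paths = Files.walk(root)) {
            return paths
                    .filter(Files::isRegularFile)
                    .filter(Files::isReadable)
                    // skip temporary files whose names start with ~$
                    .filter(p -> !p.getFileName().toString().startsWith("~$"))
                    .filter(p -> extensions.contains(extensionOf(p)))
                    .sorted()
                    .collect(Collectors.toList());
        }
    }

    /** Lower-cased extension of the file name, or "" if there is none. */
    static String extensionOf(Path p) {
        String name = p.getFileName().toString();
        int dot = name.lastIndexOf('.');
        return dot < 0 ? "" : name.substring(dot + 1).toLowerCase(Locale.ROOT);
    }
}
```

The stream is closed with try-with-resources because `Files.walk` keeps directory handles open until the stream is closed.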
2. Process the collected files in parallel with a stream.
// Process the files in parallel with a stream
public void doneFile(String method, Boolean onlyFileType) {
    destFile.parallelStream().forEach(file -> createIndex(file, method, onlyFileType));
}
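A scheduled task scans the configured directory so that the index is built incrementally as files are added. It runs the two file-processing steps above in order, which lets the file index be built efficiently by multiple threads in parallel.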
package com.crazyice.lee.accumulation.search.service;

import lombok.extern.slf4j.Slf4j;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.beans.factory.annotation.Value;
import org.springframework.context.annotation.Configuration;
import org.springframework.scheduling.annotation.Scheduled;
import org.springframework.stereotype.Component;

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;

@Configuration
@Component
@Slf4j
public class CreateIndexTask {

    @Autowired
    private DirectoryRecurse directoryRecurse;

    @Value("${index-root}")
    private String indexRoot;

    // Seconds field set to 0 so the task fires once at minutes 5 and 35 of every hour;
    // a "*" in the seconds field would fire on every second of those minutes
    @Scheduled(cron = "0 5/30 * * * ?")
    private void addIndex() {
        log.info("Root directory: {}", indexRoot);
        directoryRecurse.find(indexRoot);
        directoryRecurse.doneFile("ext", false);
        // Sort the file types by frequency; the generic parameters were lost in the
        // source listing, so Map.Entry<String, Integer> is assumed here
        List<Map.Entry<String, Integer>> list = new ArrayList<>(directoryRecurse.getFileTypes().entrySet());
        // Descending order
        Collections.sort(list, (o1, o2) -> o2.getValue().compareTo(o1.getValue()));
        log.info("File types: {}", list);
        // Free the space
        directoryRecurse.getDestFile().clear();
        directoryRecurse.getFileTypes().clear();
    }
}
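The task above logs the file types sorted by how often they occur, but the code that fills that map is not shown in this article. Because the files are processed with a parallel stream, whatever collects the counts has to be thread-safe. The sketch below shows one way this could be done with `ConcurrentHashMap.merge`; the class `FileTypeStats` and the `String` to `Integer` map shape are assumptions, not the project's actual code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical helper for counting file types from a parallel stream. */
final class FileTypeStats {

    /** Counts occurrences of each type; merge() is atomic on ConcurrentHashMap. */
    static Map<String, Integer> count(List<String> fileTypes) {
        Map<String, Integer> counts = new ConcurrentHashMap<>();
        fileTypes.parallelStream().forEach(type -> counts.merge(type, 1, Integer::sum));
        return counts;
    }

    /** Returns the entries ordered from most to least frequent. */
    static List<Map.Entry<String, Integer>> sortedDescending(Map<String, Integer> counts) {
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(counts.entrySet());
        entries.sort((a, b) -> b.getValue().compareTo(a.getValue()));
        return entries;
    }
}
```

A plain `HashMap` updated from `parallelStream().forEach(...)` can lose updates or corrupt its internal state, which is why the concurrent map is used here.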
The search service and its UI are implemented with a thymeleaf template, and matched keywords are returned to the front-end UI highlighted.
package com.crazyice.lee.accumulation.search.web;

import com.crazyice.lee.accumulation.search.data.Article;
import io.searchbox.client.JestClient;
import io.searchbox.core.Search;
import io.searchbox.core.SearchResult;
import lombok.extern.slf4j.Slf4j;
import org.elasticsearch.index.query.QueryBuilders;
import org.elasticsearch.search.builder.SearchSourceBuilder;
import org.elasticsearch.search.fetch.subphase.highlight.HighlightBuilder;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.stereotype.Controller;
import org.springframework.ui.Model;
import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.RequestMapping;
import org.springframework.web.bind.annotation.RequestMethod;
import org.springframework.web.bind.annotation.RequestParam;

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

@Controller
@Slf4j
class SearchController {

    @Autowired
    private JestClient jestClient;

    @GetMapping("/")
    public String index() {
        return "index";
    }

    @RequestMapping(value = "/search", method = RequestMethod.GET)
    public String search(@RequestParam String keyword, Model model) {
        SearchSourceBuilder searchSourceBuilder = new SearchSourceBuilder();
        searchSourceBuilder.query(QueryBuilders.queryStringQuery(keyword));
        HighlightBuilder highlightBuilder = new HighlightBuilder();
        // Highlight the path field
        HighlightBuilder.Field highlightPath = new HighlightBuilder.Field("path");
        highlightPath.highlighterType("unified");
        highlightBuilder.field(highlightPath);
        // Highlight the title field
        HighlightBuilder.Field highlightTitle = new HighlightBuilder.Field("title");
        highlightTitle.highlighterType("unified");
        highlightBuilder.field(highlightTitle);
        // Highlight the content field
        HighlightBuilder.Field highlightContent = new HighlightBuilder.Field("content");
        highlightContent.highlighterType("unified");
        highlightBuilder.field(highlightContent);
        // Apply the highlight configuration
        searchSourceBuilder.highlighter(highlightBuilder);
        log.info("Search query: {}", searchSourceBuilder.toString());
        // Build the search request
        Search search = new Search.Builder(searchSourceBuilder.toString())
                .addIndex("diskfile")
                .addType("files")
                .build();
        try {
            // Execute the search
            SearchResult result = jestClient.execute(search);
            List<Article> articles = new ArrayList<>();
            result.getHits(Article.class).forEach(value -> {
                if (value.highlight != null && value.highlight.get("content") != null) {
                    // Join the highlighted fragments of the content field
                    StringBuilder highlightContentBuffer = new StringBuilder();
                    value.highlight.get("content").forEach(highlightContentBuffer::append);
                    value.source.setHighlightContent(highlightContentBuffer.toString());
                }
                articles.add(value.source);
            });
            model.addAttribute("articles", articles);
            model.addAttribute("keyword", keyword);
            return "search";
        } catch (IOException e) {
            log.error("{}", e.getLocalizedMessage());
        }
        return "search";
    }
}
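With the thymeleaf template engine integrated, the search results are rendered directly as web pages. The templates consist of a main search page and a results page, wired up through the @Controller annotation and the Model object.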