Lucene 3.4 download page: http://lucene.apache.org/ (14 September 2011)
Introduction, quoted from the official website:
- What Is Apache Lucene?
- The Apache Lucene™ project develops open-source search software, including:
- Apache Lucene Core™ (formerly named Lucene Java), our flagship sub-project, provides a Java-based indexing and search implementation, as well as spellchecking, hit highlighting and advanced analysis/tokenization capabilities.
- Apache Solr™ is our high performance enterprise search server, with XML/HTTP and JSON/Python/Ruby APIs, hit highlighting, faceted search, caching, replication, distributed search, database integration, web admin and search interfaces.
- Apache PyLucene™ is a Python port of the Lucene Core project.
- Apache Open Relevance Project™ is a subproject with the aim of collecting and distributing free materials for relevance testing and performance.
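★ Example: this example searches a .txt document for a keyword. If the keyword is found, it reports the number of matches and prints the file name, path, size, and content.
★ Principle: first collect the data and build the index. Content is copied from the information source (a database, the internet, and so on) to the local machine, processed, and stored in the index library, which is a set of binary files. Searches then run against this local collection. Text is tokenized by an analyzer both when the index is built and when it is searched, and the same analyzer is used in both cases. The index library can be thought of as the index table plus the data and documents that the table refers to. At search time, the analyzer processes the keyword, the resulting terms are looked up in the index table, and the matching data is found through the table.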
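The principle above can be illustrated with a toy inverted index in plain Java. This is only a sketch of the idea and not how Lucene stores or searches its index: the class `TinyIndexSketch`, its stand-in analyzer `analyze` (lower-casing and splitting on anything that is not a letter or digit), and the in-memory maps are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Toy inverted index. Hypothetical illustration only, NOT Lucene's implementation.
class TinyIndexSketch {
    // Index table: term -> numbers of the documents that contain the term
    private final Map<String, Set<Integer>> indexTable = new HashMap<>();
    // Stored data: document number -> original text
    private final List<String> documents = new ArrayList<>();

    // Stand-in analyzer (an assumption): lower-case, then split on non-alphanumeric characters.
    static List<String> analyze(String text) {
        List<String> terms = new ArrayList<>();
        for (String term : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!term.isEmpty()) {
                terms.add(term);
            }
        }
        return terms;
    }

    // Indexing: store the text and register each of its terms in the index table.
    int add(String text) {
        int docId = documents.size();
        documents.add(text);
        for (String term : analyze(text)) {
            Set<Integer> ids = indexTable.get(term);
            if (ids == null) {
                ids = new TreeSet<>();
                indexTable.put(term, ids);
            }
            ids.add(docId);
        }
        return docId;
    }

    // Searching: analyze the keyword with the SAME analyzer, then look the terms up.
    // Returns the documents that contain every term of the keyword.
    List<String> search(String keyword) {
        Set<Integer> hits = null;
        for (String term : analyze(keyword)) {
            Set<Integer> ids = indexTable.get(term);
            if (ids == null) {
                return new ArrayList<>(); // a term missing from the table means no match
            }
            if (hits == null) {
                hits = new TreeSet<>(ids);
            } else {
                hits.retainAll(ids);
            }
        }
        List<String> result = new ArrayList<>();
        if (hits != null) {
            for (int id : hits) {
                result.add(documents.get(id));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        TinyIndexSketch index = new TinyIndexSketch();
        index.add("hello1 world test for fd. document document");
        index.add("Just a case; hel");
        System.out.println(index.search("HELLO1").size());   // prints 1
        System.out.println(index.search("zazazaza").size()); // prints 0
    }
}
```

Because `analyze` is applied to both the documents and the keyword, `HELLO1` finds the document containing `hello1`. If a different analyzer were used at search time, it could produce terms that never appear in the index table.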
★ Hands-on example:
Create a test file hello.txt with the following content:
- hello1 world test for fd. document document
- Just a case; hel
- hello是 测试测试搜索 1 hrllo hello hello hello
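1. Create a Java project.
2. Import the jars required by Lucene 3.4 (the demo also uses @Test, so JUnit 4 must be on the build path):
lucene-core-3.4.0.jar // core jar
contrib\highlighter\lucene-highlighter-3.4.0.jar // highlighting
contrib\analyzers\lucene-analyzers-3.4.0.jar // analyzers
Create the local data source folder luceneDataSource and the index folder luceneIndex.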
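The two folders and the test file can also be created from code instead of by hand. The following is a minimal sketch using only the JDK. The class name `DemoSetupSketch`, the `prepare` helper, and the choice of UTF-8 for hello.txt are assumptions, and the encoding should match whatever charset the file is later read with.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;

// Hypothetical setup helper, not part of the original project.
class DemoSetupSketch {
    // Creates luceneDataSource/hello.txt and an empty luceneIndex folder under baseDir.
    static Path prepare(Path baseDir) throws IOException {
        Path dataSource = Files.createDirectories(baseDir.resolve("luceneDataSource"));
        Files.createDirectories(baseDir.resolve("luceneIndex"));
        Path hello = dataSource.resolve("hello.txt");
        Files.write(hello, Arrays.asList(
                "hello1 world test for fd. document document",
                "Just a case; hel",
                // Third line of hello.txt, with its CJK characters written as Unicode escapes
                "hello\u662f \u6d4b\u8bd5\u6d4b\u8bd5\u641c\u7d22 1 hrllo hello hello hello"),
                StandardCharsets.UTF_8); // assumption: the file is stored as UTF-8
        return hello;
    }

    public static void main(String[] args) throws IOException {
        // Assumption: the current working directory is the project root.
        System.out.println(prepare(Paths.get(".")).toAbsolutePath());
    }
}
```

Running `main` from the project root should leave both folders in place and print the absolute path of hello.txt, which can then be used for `filePath` in the demo.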
3. LuceneDemo.java source code:
- import java.io.File;
- import org.apache.lucene.analysis.Analyzer;
- import org.apache.lucene.analysis.standard.StandardAnalyzer;
- import org.apache.lucene.document.Document;
- import org.apache.lucene.index.IndexWriter;
- import org.apache.lucene.index.IndexWriter.MaxFieldLength;
- import org.apache.lucene.queryParser.MultiFieldQueryParser;
- import org.apache.lucene.queryParser.QueryParser;
- import org.apache.lucene.search.Filter;
- import org.apache.lucene.search.IndexSearcher;
- import org.apache.lucene.search.Query;
- import org.apache.lucene.search.ScoreDoc;
- import org.apache.lucene.search.TopDocs;
- import org.apache.lucene.store.FSDirectory;
- import org.apache.lucene.util.Version;
- import org.junit.Test;
- import com.yaxing.utils.File2Document;
- public class LuceneDemo {
- String filePath = "J:\\MyEclipse-8.6\\lucene\\LuceneDemo\\luceneDataSource\\hello.txt";
- File indexPath = new File("J:\\MyEclipse-8.6\\lucene\\LuceneDemo\\luceneIndex");
- Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_34);
- /**
- * Build the index. IndexWriter handles adding, deleting, and updating documents.
- */
- @Test
- public void creatIndex() throws Exception {
- // file-->Document
- Document doc = File2Document.file2Document(filePath);
- //Directory dir = FSDirectory.open(indexPath);
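- // true = create a new index, overwriting any existing one. This constructor still works in 3.4 but is deprecated in favor of IndexWriterConfig.
- IndexWriter indexWriter = new IndexWriter(FSDirectory.open(indexPath), analyzer, true,MaxFieldLength.LIMITED);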
- indexWriter.addDocument(doc);
- indexWriter.close();
- }
- /**
- * Search. IndexSearcher runs queries against the index.
- */
- @Test
- public void search() throws Exception {
- String queryString = "搜索";
- // Parse the search text into a Query
- String[] fields = {"name","content"};
- QueryParser queryParser = new MultiFieldQueryParser(Version.LUCENE_34, fields, analyzer); // query parser
- Query query = queryParser.parse(queryString);
- // Run the query
- IndexSearcher indexSearcher = new IndexSearcher(FSDirectory.open(indexPath));
- Filter filter = null;
- TopDocs topDocs = indexSearcher.search(query, filter, 10000); // TopDocs holds the hits, much like a collection
- System.out.println("总共有【"+topDocs.totalHits+"】条匹配结果."); // prints "N matching results in total."
- // Print each hit
- for(ScoreDoc scoreDoc:topDocs.scoreDocs){
- int docSn = scoreDoc.doc; // internal document number
- Document doc = indexSearcher.doc(docSn); // fetch the document by its number
- File2Document.printDocumentInfo(doc); // print the document's fields
- }
- indexSearcher.close(); // release the index files
- }
- }
4. File2Document.java source code:
- package com.yaxing.utils; // matches the import in LuceneDemo
- import java.io.BufferedReader;
- import java.io.File;
- import java.io.FileInputStream;
- import java.io.FileNotFoundException;
- import java.io.IOException;
- import java.io.InputStreamReader;
- import org.apache.lucene.document.Document;
- import org.apache.lucene.document.Field;
- import org.apache.lucene.document.Field.Index;
- import org.apache.lucene.document.Field.Store;
- public class File2Document {
- // File properties: content, name, size, path
- public static Document file2Document(String path){
- File file = new File(path);
- Document doc = new Document();
- // Store: whether the field value is stored in the index (Store.YES or Store.NO)
- // Index: whether and how the field is indexed; Index.ANALYZED tokenizes the value before indexing it
- doc.add(new Field("name",file.getName(),Store.YES,Index.ANALYZED));
- doc.add(new Field("content",readFileContent(file),Store.YES,Index.ANALYZED)); // readFileContent() reads the file content
- doc.add(new Field("size",String.valueOf(file.length()),Store.YES,Index.NOT_ANALYZED)); // not tokenized; file size (long) converted to String
- doc.add(new Field("path",file.getAbsolutePath(),Store.YES,Index.NOT_ANALYZED)); // not tokenized; searching by file path is not needed
- return doc;
- }
- /**
- * Read the file content.
- */
- private static String readFileContent(File file) {
- StringBuilder content = new StringBuilder();
- try {
- // Decodes with the platform default charset
- BufferedReader reader = new BufferedReader(new InputStreamReader(new FileInputStream(file)));
- try {
- for(String line=null;(line = reader.readLine())!=null;){
- content.append(line).append("\n");
- }
- } finally {
- reader.close(); // always release the file handle
- }
- } catch (IOException e) {
- // Also covers FileNotFoundException, which is a subclass of IOException
- e.printStackTrace();
- }
- return content.toString(); // empty string on failure, since Field rejects a null value
- }
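- /**
- * Two ways to get the value of the "name" field:
- * 1. Field field = doc.getField("name");
- *    field.stringValue();
- * 2. doc.get("name");
- */
- // The listing breaks off at this point. The method below is a reconstruction inferred from the call in LuceneDemo and from the test output.
- public static void printDocumentInfo(Document doc) {
- System.out.println("name -->" + doc.get("name"));
- System.out.println("content -->" + doc.get("content"));
- System.out.println("path -->" + doc.get("path"));
- System.out.println("size -->" + doc.get("size"));
- }
- }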
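Since hello.txt contains Chinese text and readFileContent decodes it with the platform default charset, the indexed content can differ from machine to machine. A possible variant that names the charset explicitly is sketched below with JDK classes only. The class name `ReadContentSketch`, the extra `charset` parameter, and the use of UTF-8 in `main` are assumptions and not part of the original File2Document.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical variant of readFileContent with an explicit charset.
class ReadContentSketch {
    // Reads the whole file, terminating every line with "\n" as the original method does.
    static String readFileContent(Path file, Charset charset) throws IOException {
        StringBuilder content = new StringBuilder();
        // try-with-resources closes the reader even if reading fails
        try (BufferedReader reader = Files.newBufferedReader(file, charset)) {
            for (String line; (line = reader.readLine()) != null;) {
                content.append(line).append("\n");
            }
        }
        return content.toString();
    }

    public static void main(String[] args) throws IOException {
        // Assumption: the file to read is passed as the first argument and is UTF-8 encoded.
        if (args.length > 0) {
            System.out.print(readFileContent(Paths.get(args[0]), StandardCharsets.UTF_8));
        }
    }
}
```

One design difference to be aware of: `Files.newBufferedReader` throws an exception when the bytes are not valid in the given charset, whereas `InputStreamReader` silently substitutes replacement characters, so a wrong charset guess shows up immediately instead of producing a garbled index.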
5. JUnit test results (the program prints "总共有【N】条匹配结果.", meaning "N matching results in total."):
String queryString = "搜索";
- 总共有【1】条匹配结果.
- name -->hello.txt
- content -->hello1 world test for fd. document document
- Just a case; hel
- hello是 测试测试搜索 1 hrllo hello hello hello
- path -->J:\MyEclipse-8.6\lucene\LuceneDemo\luceneDataSource\hello.txt
- size -->109
String queryString = "hello";
- 总共有【1】条匹配结果.
- name -->hello.txt
- content -->hello1 world test for fd. document document
- Just a case; hel
- hello是 测试测试搜索 1 hrllo hello hello hello
- path -->J:\MyEclipse-8.6\lucene\LuceneDemo\luceneDataSource\hello.txt
- size -->109
The index files are created in the luceneIndex folder.
String queryString = "zazazaza";
- 总共有【0】条匹配结果.