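We have already covered Query in depth, but we skipped CustomScoreQuery at the time, so today we will look at it. As the class name suggests, it lets you customize scoring: it wraps an ordinary query and lets you adjust the score that query produces for each matching document, which changes the final score and therefore the ranking.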
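A document's score reflects how valuable it is for the query: higher-scoring documents are returned first and shown nearer the top. The score depends on factors such as how often a term occurs in the document (term frequency), how many documents contain the term (document frequency), and term boosts. These factors are fixed and do not change as time passes. Suppose you want more recently published books to rank higher, that is, the closer the publication date is to today, the larger the boost. Or suppose you want articles written by users I follow to appear first and articles by other users to appear later. CustomScoreQuery provides a hook for exactly these scenarios. All you have to do is extend CustomScoreQuery, override getCustomScoreProvider to return your own CustomScoreProvider implementation, and override that provider's customScore method with your own logic.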
Here is a usage example:
package com.yida.framework.lucene5.function;

import java.io.IOException;

import org.apache.lucene.index.LeafReaderContext;
import org.apache.lucene.index.NumericDocValues;
import org.apache.lucene.index.SortedDocValues;
import org.apache.lucene.queries.CustomScoreProvider;

public class RecencyBoostCustomScoreProvider extends CustomScoreProvider {
    // Boost multiplier
    private double multiplier;
    // Number of days from 1970-01-01 until today
    private int day;
    // Maximum age in days; older documents get no boost
    private int maxDaysAgo;
    // Name of the date field
    private String dayField;
    // Cached field values
    private NumericDocValues publishDay;
    private SortedDocValues titleValues;

    public RecencyBoostCustomScoreProvider(LeafReaderContext context, double multiplier,
            int day, int maxDaysAgo, String dayField) {
        super(context);
        this.multiplier = multiplier;
        this.day = day;
        this.maxDaysAgo = maxDaysAgo;
        this.dayField = dayField;
        try {
            publishDay = context.reader().getNumericDocValues(dayField);
            titleValues = context.reader().getSortedDocValues("title2");
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

    /**
     * subQueryScore: the score produced by the ordinary Query
     * valSrcScore: the score produced by the FunctionQuery
     */
    @Override
    public float customScore(int docId, float subQueryScore, float valSrcScore)
            throws IOException {
        // title is only needed by the debug output below
        String title = titleValues.get(docId).utf8ToString();
        int daysAgo = (int) (day - publishDay.get(docId));
        //System.out.println(title + ":" + daysAgo + ":" + maxDaysAgo);
        // Boost only documents newer than maxDaysAgo (6 years in the test below)
        if (daysAgo < maxDaysAgo) {
            float boost = (float) (multiplier * (maxDaysAgo - daysAgo) / maxDaysAgo);
            return (float) (subQueryScore * (1.0 + boost));
        }
        return subQueryScore;
    }
}
package com.yida.framework.lucene5.function;

import java.io.IOException;

import org.apache.lucene.index.LeafReaderContext;
import org.apache.lucene.queries.CustomScoreProvider;
import org.apache.lucene.queries.CustomScoreQuery;
import org.apache.lucene.search.Query;

public class RecencyBoostCustomScoreQuery extends CustomScoreQuery {
    // Boost multiplier
    private double multiplier;
    // Number of days from 1970-01-01 until today
    private int day;
    // Maximum age in days; older documents get no boost
    private int maxDaysAgo;
    // Name of the date field
    private String dayField;

    public RecencyBoostCustomScoreQuery(Query subQuery, double multiplier,
            int day, int maxDaysAgo, String dayField) {
        super(subQuery);
        this.multiplier = multiplier;
        this.day = day;
        this.maxDaysAgo = maxDaysAgo;
        this.dayField = dayField;
    }

    // Called once per index segment; hands scoring over to our provider
    @Override
    protected CustomScoreProvider getCustomScoreProvider(LeafReaderContext context)
            throws IOException {
        return new RecencyBoostCustomScoreProvider(context, multiplier, day, maxDaysAgo, dayField);
    }
}
package com.yida.framework.lucene5.function;

import java.io.IOException;
import java.nio.file.Paths;
import java.util.Date;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.queryparser.classic.ParseException;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.Sort;
import org.apache.lucene.search.SortField;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

import com.yida.framework.lucene5.util.Constans;

/**
 * CustomScoreQuery test
 * @author Lanxiaowei
 *
 */
public class CustomScoreQueryTest {
    public static void main(String[] args) throws IOException, ParseException {
        String indexDir = "C:/lucenedir";
        Directory directory = FSDirectory.open(Paths.get(indexDir));
        IndexReader reader = DirectoryReader.open(directory);
        IndexSearcher searcher = new IndexSearcher(reader);

        // Today expressed as days since 1970-01-01
        int day = (int) (new Date().getTime() / Constans.PRE_DAY_MILLISECOND);
        QueryParser parser = new QueryParser("contents", new StandardAnalyzer());
        Query query = parser.parse("java in action");
        // Boost up to 2x for books published within the last 6 years
        Query customScoreQuery = new RecencyBoostCustomScoreQuery(query, 2.0, day, 6 * 365, "pubmonthAsDay");
        // Sort by score first, then by title
        Sort sort = new Sort(new SortField[] {
            SortField.FIELD_SCORE,
            new SortField("title2", SortField.Type.STRING)});
        TopDocs hits = searcher.search(customScoreQuery, null, Integer.MAX_VALUE, sort, true, false);

        for (int i = 0; i < hits.scoreDocs.length; i++) {
            // Either way of fetching the Document works; searcher.doc simply calls reader.document internally
            //Document doc = reader.document(hits.scoreDocs[i].doc);
            Document doc = searcher.doc(hits.scoreDocs[i].doc);
            System.out.println((1 + i) + ": " + doc.get("title") + ": pubmonth=" + doc.get("pubmonth")
                + " score=" + hits.scoreDocs[i].score);
        }
        reader.close();
        directory.close();
    }
}
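To make the effect of customScore concrete, the arithmetic can be tried on its own. The following is only a sketch that copies the formula into a standalone class with no Lucene dependency; the class name RecencyBoostDemo, the method boostedScore and the sample ages are made up for illustration, while the multiplier 2.0 and the 6 * 365 day window are the values used in the test above.

```java
// Standalone sketch of the arithmetic inside customScore (no Lucene classes involved).
// RecencyBoostDemo and boostedScore are hypothetical names used only for this illustration.
final class RecencyBoostDemo {

    // Same formula as the provider: documents newer than maxDaysAgo
    // receive a boost that shrinks linearly as the document gets older.
    static float boostedScore(float subQueryScore, int daysAgo, int maxDaysAgo, double multiplier) {
        if (daysAgo < maxDaysAgo) {
            float boost = (float) (multiplier * (maxDaysAgo - daysAgo) / maxDaysAgo);
            return (float) (subQueryScore * (1.0 + boost));
        }
        return subQueryScore;
    }

    public static void main(String[] args) {
        int maxDaysAgo = 6 * 365; // 2190 days
        // Published today: full boost, score is tripled
        System.out.println(boostedScore(1.0f, 0, maxDaysAgo, 2.0));    // 3.0
        // Three years old: half of the boost remains
        System.out.println(boostedScore(1.0f, 1095, maxDaysAgo, 2.0)); // 2.0
        // Older than the window: score is left untouched
        System.out.println(boostedScore(1.0f, 3000, maxDaysAgo, 2.0)); // 1.0
    }
}
```

With a multiplier of 2.0, a brand-new book scores three times what the sub-query alone would give it, and the advantage fades to nothing at the six-year mark.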
package com.yida.framework.lucene5.sort;

import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.nio.file.Paths;
import java.text.ParseException;
import java.util.ArrayList;
import java.util.Date;
import java.util.List;
import java.util.Properties;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.BinaryDocValuesField;
import org.apache.lucene.document.DateTools;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.IntField;
import org.apache.lucene.document.NumericDocValuesField;
import org.apache.lucene.document.SortedDocValuesField;
import org.apache.lucene.document.SortedNumericDocValuesField;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.IndexWriterConfig.OpenMode;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.BytesRef;

/**
 * Creates the test index
 * @author Lanxiaowei
 *
 */
public class CreateTestIndex {
    public static void main(String[] args) throws IOException {
        String dataDir = "C:/data";
        String indexDir = "C:/lucenedir";

        Directory dir = FSDirectory.open(Paths.get(indexDir));
        Analyzer analyzer = new StandardAnalyzer();
        IndexWriterConfig indexWriterConfig = new IndexWriterConfig(analyzer);
        indexWriterConfig.setOpenMode(OpenMode.CREATE_OR_APPEND);
        IndexWriter writer = new IndexWriter(dir, indexWriterConfig);

        List<File> results = new ArrayList<File>();
        findFiles(results, new File(dataDir));
        System.out.println(results.size() + " books to index");

        for (File file : results) {
            Document doc = getDocument(dataDir, file);
            writer.addDocument(doc);
        }
        writer.close();
        dir.close();
    }

    /**
     * Recursively finds all .properties files under the given directory
     *
     * @param result
     * @param dir
     */
    private static void findFiles(List<File> result, File dir) {
        for (File file : dir.listFiles()) {
            if (file.getName().endsWith(".properties")) {
                result.add(file);
            } else if (file.isDirectory()) {
                findFiles(result, file);
            }
        }
    }

    /**
     * Reads a .properties file and builds a Document from it
     *
     * @param rootDir
     * @param file
     * @return
     * @throws IOException
     */
    public static Document getDocument(String rootDir, File file) throws IOException {
        Properties props = new Properties();
        // Close the stream once the properties have been loaded
        try (FileInputStream in = new FileInputStream(file)) {
            props.load(in);
        }

        Document doc = new Document();

        // The category is the file's directory relative to the data root
        String category = file.getParent().substring(rootDir.length());
        category = category.replace(File.separatorChar, '/');

        String isbn = props.getProperty("isbn");
        String title = props.getProperty("title");
        String author = props.getProperty("author");
        String url = props.getProperty("url");
        String subject = props.getProperty("subject");
        String pubmonth = props.getProperty("pubmonth");

        System.out.println("title:" + title + "\n" + "author:" + author + "\n" + "subject:" + subject + "\n"
            + "pubmonth:" + pubmonth + "\n" + "category:" + category + "\n---------");

        doc.add(new StringField("isbn", isbn, Field.Store.YES));
        doc.add(new StringField("category", category, Field.Store.YES));
        doc.add(new SortedDocValuesField("category", new BytesRef(category)));
        doc.add(new TextField("title", title, Field.Store.YES));
        doc.add(new Field("title2", title.toLowerCase(), Field.Store.YES,
            Field.Index.NOT_ANALYZED_NO_NORMS,
            Field.TermVector.WITH_POSITIONS_OFFSETS));
        // DocValues copy of title2, read by the score provider and used for sorting
        //doc.add(new BinaryDocValuesField("title2", new BytesRef(title.getBytes())));
        doc.add(new SortedDocValuesField("title2", new BytesRef(title.getBytes())));

        String[] authors = author.split(",");
        for (String a : authors) {
            doc.add(new Field("author", a, Field.Store.YES,
                Field.Index.NOT_ANALYZED,
                Field.TermVector.WITH_POSITIONS_OFFSETS));
        }

        doc.add(new Field("url", url, Field.Store.YES,
            Field.Index.NOT_ANALYZED_NO_NORMS));
        doc.add(new Field("subject", subject, Field.Store.YES,
            Field.Index.ANALYZED,
            Field.TermVector.WITH_POSITIONS_OFFSETS));

        doc.add(new IntField("pubmonth", Integer.parseInt(pubmonth), Field.Store.YES));
        doc.add(new NumericDocValuesField("pubmonth", Integer.parseInt(pubmonth)));

        Date d = null;
        try {
            d = DateTools.stringToDate(pubmonth);
        } catch (ParseException pe) {
            throw new RuntimeException(pe);
        }
        // Publication date as days since 1970-01-01; this is the field the score provider reads
        int day = (int) (d.getTime() / (1000 * 3600 * 24));
        doc.add(new IntField("pubmonthAsDay", day, Field.Store.YES));
        doc.add(new NumericDocValuesField("pubmonthAsDay", day));

        // Catch-all field that the test query searches
        for (String text : new String[] { title, subject, author, category }) {
            doc.add(new Field("contents", text, Field.Store.NO,
                Field.Index.ANALYZED,
                Field.TermVector.WITH_POSITIONS_OFFSETS));
        }
        return doc;
    }
}
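The indexing code turns pubmonth into a whole number of days since 1970-01-01 by dividing epoch milliseconds by 1000 * 3600 * 24. The sketch below reproduces that conversion with java.time only. It assumes pubmonth is a yyyyMM string and that the date means the first day of that month in UTC; the sample value "201012" and the names PubMonthAsDay, toEpochDay and toEpochDayFromMillis are hypothetical and not taken from the original data set.

```java
import java.time.LocalDate;

// Sketch of the "days since 1970-01-01" conversion used for the pubmonthAsDay field.
// Assumption: pubmonth is a yyyyMM string, interpreted as the first day of the month in UTC.
final class PubMonthAsDay {

    // Parse yyyyMM by hand and ask java.time for the epoch day.
    static int toEpochDay(String pubmonth) {
        int year = Integer.parseInt(pubmonth.substring(0, 4));
        int month = Integer.parseInt(pubmonth.substring(4, 6));
        return (int) LocalDate.of(year, month, 1).toEpochDay();
    }

    // The same division the indexing code applies to Date.getTime().
    static int toEpochDayFromMillis(long epochMillis) {
        return (int) (epochMillis / (1000L * 3600 * 24));
    }

    public static void main(String[] args) {
        // 2010-12-01T00:00:00Z is 1291161600000 ms after the epoch
        System.out.println(toEpochDay("201012"));                // 14944
        System.out.println(toEpochDayFromMillis(1291161600000L)); // 14944
    }
}
```

Storing the date as a small integer like this is what lets the score provider compute daysAgo with a single subtraction per document.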
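The key part is how the fields are created at index time. Inside CustomScoreQuery we need access to all values of a given field, and then we look up the value for a particular document ID. Lucene uses FieldCache (the field cache) for this. Without it, we would have to go through the IndexReader for every docId and read the segment files in the index directory to fetch the field value. One document means one disk I/O, which becomes very expensive when the index holds a large number of documents. To reduce the number of disk reads, Lucene introduced the field cache. Internally it is simply a Map<String,Object> whose key is the field name, as the source shows: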
IndexReader.getNumericDocValues
@Override
public final NumericDocValues getNumericDocValues(String field) throws IOException {
    ensureOpen();
    Map<String,Object> dvFields = docValuesLocal.get();

    Object previous = dvFields.get(field);
    if (previous != null && previous instanceof NumericDocValues) {
        return (NumericDocValues) previous;
    } else {
        FieldInfo fi = getDVField(field, DocValuesType.NUMERIC);
        if (fi == null) {
            return null;
        }
        NumericDocValues dv = getDocValuesReader().getNumeric(fi);
        dvFields.put(field, dv);
        return dv;
    }
}
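One more thing to note: the field cache only works for DocValues fields, which is why the indexing code adds a SortedDocValuesField. We also need to sort by that field, so SortedDocValuesField is used; for string values you can use BinaryDocValuesField, and for numeric values NumericDocValuesField. The field cache is an advanced internal Lucene API and is transparent to users. All you need to know is that DocValues fields let you use the field cache to improve query performance. Caching also costs memory, so run performance tests first and decide whether to use the field cache based on the results. Whenever you need to read a field's value for each document from inside a Query, you should consider using the field cache.
For a given IndexReader and field, the first access to the field cache loads that field's values for all documents into the cache (that is, into memory). A cache entry is identified by the IndexReader and the field name together. In other words, you should keep the same IndexReader object across queries, because every IndexReader has its own set of field caches. If you create a new IndexReader for every query, you end up with N field caches in memory, which is like planting N time bombs, and none of those caches is ever reused.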
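The cost of opening a new reader per query can be illustrated with a toy model. This is not Lucene code: FakeReader stands in for an IndexReader, PerReaderCacheDemo for the cache, and the loads counter for the expensive "read every value from disk" step; all of these names are hypothetical. The only point it demonstrates is that a cache keyed by reader plus field name is hit when the reader is reused and missed whenever a fresh reader appears.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.WeakHashMap;

// Toy model of a cache keyed by (reader, field name). Not Lucene code; all names are hypothetical.
final class PerReaderCacheDemo {

    // Stand-in for an IndexReader that can supply one field's values.
    static final class FakeReader {
        final long[] values;
        FakeReader(long[] values) { this.values = values; }
    }

    // Outer key: the reader instance. Inner key: the field name.
    private final Map<FakeReader, Map<String, long[]>> cache = new WeakHashMap<>();
    // Counts how often the expensive load step actually ran.
    int loads = 0;

    long[] getValues(FakeReader reader, String field) {
        Map<String, long[]> perReader = cache.computeIfAbsent(reader, r -> new HashMap<>());
        return perReader.computeIfAbsent(field, f -> {
            loads++;                      // pretend this reads every value from disk
            return reader.values.clone();
        });
    }

    public static void main(String[] args) {
        PerReaderCacheDemo demo = new PerReaderCacheDemo();
        long[] data = {14944L, 14610L};

        // Reusing one reader: the field is loaded once and then served from memory.
        FakeReader shared = new FakeReader(data);
        for (int i = 0; i < 3; i++) {
            demo.getValues(shared, "pubmonthAsDay");
        }
        System.out.println("loads with a shared reader: " + demo.loads);

        // A new reader per query: every query pays the full load again.
        for (int i = 0; i < 3; i++) {
            demo.getValues(new FakeReader(data), "pubmonthAsDay");
        }
        System.out.println("loads after three fresh readers: " + demo.loads);
    }
}
```

In this model the shared reader triggers one load for three lookups, while three fresh readers trigger three more loads, which mirrors why a long-lived IndexReader should be shared across searches.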