Lucene: Indexing numbers, dates, and times, and field truncation

Although most content is textual in nature, in many cases handling numeric or date/time values is crucial. In a commerce setting, the product’s price, and perhaps other numeric attributes like weight and height, are clearly important. A video search engine may index the duration of each video. Press releases and articles have a time-stamp.

 

Indexing numbers

There are two common scenarios in which indexing numbers is important.

  • In one scenario, numbers are embedded in the text to be indexed, and you want to make sure those numbers are preserved and indexed as their own tokens so that you can use them later as ordinary tokens in searches. To enable this, simply pick an analyzer that doesn’t discard numbers (for example, WhitespaceAnalyzer and StandardAnalyzer keep numeric tokens, whereas SimpleAnalyzer and StopAnalyzer discard them).
  • In the other scenario, you have a field that contains a single number, and you want to index it as a numeric value and then use it for precise (equals) matching, range searching, and/or sorting (see the sketches below).

Lucene’s NumericField class covers this second case. For example, to index a product’s price:

doc.add(new NumericField("price").setDoubleValue(19.99));
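
The following is a minimal sketch of the search side, assuming Lucene 2.9/3.x, the usual org.apache.lucene.search imports, and an existing directory variable holding the index that contains the price field above; it shows range matching and sorting on a NumericField:

IndexSearcher searcher = new IndexSearcher(directory);

// Match documents whose price is between 10.00 and 20.00, inclusive at both ends.
Query priceRange = NumericRangeQuery.newDoubleRange("price", 10.0, 20.0, true, true);

// Sort the matching documents by price, ascending.
Sort byPrice = new Sort(new SortField("price", SortField.DOUBLE));

TopDocs hits = searcher.search(priceRange, null, 10, byPrice);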

 

Indexing dates and times

 

Such values are easily handled by first converting them to an equivalent int or long value, and then indexing that value as a number. The simplest approach is to call Date.getTime, which returns the value of a Java Date object with millisecond precision:

doc.add(new NumericField("timestamp")
        .setLongValue(new Date().getTime()));

 

If you don’t need millisecond precision, you can quantize the value before indexing it. For example, to index with day resolution, divide the millisecond timestamp down to whole days:

doc.add(new NumericField("day")
        .setIntValue((int) (new Date().getTime()/1000/3600/24)));  // milliseconds to days since epoch

 

If you instead need to index a specific part of the date, such as the day of the month, extract it with java.util.Calendar and index that value:

Calendar cal = Calendar.getInstance();
cal.setTime(date);                        // date is the java.util.Date being indexed
doc.add(new NumericField("dayOfMonth")
        .setIntValue(cal.get(Calendar.DAY_OF_MONTH)));

 

------------------------------------------------------------------------------------------------------------------------------------

Field truncation

Some applications index documents whose sizes aren’t known in advance. As a safety mechanism to control the amount of RAM and hard disk space used, you may want to limit the amount of input they are allowed to index per field. It’s also possible that a large binary document is accidentally misclassified as a text document, or contains binary content embedded in it that your document filter failed to process, which quickly adds many absurd binary terms to your index, much to your horror. Other applications deal with documents of known size, but you’d like to index only a portion of each. For example, you may want to index only the first 200 words of each document.

 

To support these diverse cases, IndexWriter allows you to truncate per-field indexing so that only the first N terms are indexed for an analyzed field. When you instantiate IndexWriter, you must pass in a MaxFieldLength instance expressing this limit. MaxFieldLength provides two convenient default instances: MaxFieldLength.UNLIMITED, which means no truncation will take place, and MaxFieldLength.LIMITED, which means fields are truncated at 10,000 terms. You can also instantiate MaxFieldLength with your own limit.
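
The following is a minimal sketch of the three options, assuming Lucene 2.9/3.x, where MaxFieldLength is the nested class IndexWriter.MaxFieldLength and dir and analyzer are an existing Directory and Analyzer:

// (Alternatives; in practice you would open only one IndexWriter on a directory.)

// No truncation: every term of every field is indexed.
IndexWriter unlimited = new IndexWriter(dir, analyzer,
    IndexWriter.MaxFieldLength.UNLIMITED);

// Default safety limit: each field is truncated after 10,000 terms.
IndexWriter limited = new IndexWriter(dir, analyzer,
    IndexWriter.MaxFieldLength.LIMITED);

// Custom limit: index only the first 200 terms of each analyzed field.
IndexWriter first200 = new IndexWriter(dir, analyzer,
    new IndexWriter.MaxFieldLength(200));

Note that the limit counts terms produced by the analyzer, not raw characters, so "the first 200 words" corresponds to MaxFieldLength(200) only to the extent that the analyzer emits one term per word.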

 
