Starting with version 4.2, Lucene provides a document classification function. In this article, we will run document classification in both Lucene and Mahout on the same corpus and compare the results.
Lucene implements Naive Bayes and k-NN rule classifiers. Trunk, which will become Lucene 5, the next major release, additionally implements a boolean (2-class) perceptron classifier. We use Lucene 4.6.1, the most recent version at the time of writing, to perform document classification with Naive Bayes and the k-NN rule.
On the Mahout side, we will do document classification with Naive Bayes and Random Forest.
Overview of Lucene Document Classification
Lucene’s classifier for document classification is defined by the Classifier interface.
```java
public interface Classifier<T> {

  /**
   * Assign a class (with score) to the given text String
   * @param text a String containing text to be classified
   * @return a {@link ClassificationResult} holding assigned class of type <code>T</code> and score
   * @throws IOException If there is a low-level I/O error.
   */
  public ClassificationResult<T> assignClass(String text) throws IOException;

  /**
   * Train the classifier using the underlying Lucene index
   * @param atomicReader the reader to use to access the Lucene index
   * @param textFieldName the name of the field used to compare documents
   * @param classFieldName the name of the field containing the class assigned to documents
   * @param analyzer the analyzer used to tokenize / filter the unseen text
   * @param query the query to filter which documents use for training
   * @throws IOException If there is a low-level I/O error.
   */
  public void train(AtomicReader atomicReader, String textFieldName, String classFieldName,
      Analyzer analyzer, Query query) throws IOException;
}
```
Because Classifier uses an index as its training data, you need to open an IndexReader on a prepared index and pass it as the first argument of the train() method. The second argument is the name of the Lucene field containing the text that has been tokenized and indexed, and the third argument is the name of the Lucene field holding the document category. Likewise, an Analyzer is passed as the fourth argument and a Query as the fifth. The Analyzer is the one used to tokenize the unknown text to be classified (in my personal opinion this is a bit awkward, and it would be better passed to the assignClass() method described below). The Query narrows down the documents used for training; pass null if there is no need to do so. The train() method has two more overloads with different arguments, but I will skip them here.
After calling train() on the Classifier interface, call the assignClass() method with an unknown document as a String to obtain the classification result. Classifier is an interface that uses Java generics, and assignClass() returns a ClassificationResult parameterized with the type variable T.
```java
public class ClassificationResult<T> {

  private final T assignedClass;
  private final double score;

  /**
   * Constructor
   * @param assignedClass the class <code>T</code> assigned by a {@link Classifier}
   * @param score the score for the assignedClass as a <code>double</code>
   */
  public ClassificationResult(T assignedClass, double score) {
    this.assignedClass = assignedClass;
    this.score = score;
  }

  /**
   * retrieve the result class
   * @return a <code>T</code> representing an assigned class
   */
  public T getAssignedClass() {
    return assignedClass;
  }

  /**
   * retrieve the result score
   * @return a <code>double</code> representing a result score
   */
  public double getScore() {
    return score;
  }
}
```
Calling the getAssignedClass() method of ClassificationResult gives you the classification result as type T.
Note that Lucene’s classifier is unusual in that the train() method does little work while assignClass() does most of it. This is very different from other commonly used machine learning software. In the learning phase of typical machine learning software, a model file is created by learning the corpus according to the selected machine learning algorithm (this is where most of the time and effort is spent; as Mahout is based on Hadoop, it uses MapReduce to reduce the time required here). In the classification phase, an unknown document is then classified by referring to the previously created model file, which usually requires few resources.
As Lucene uses an index as its model file, the train() method, which is the learning phase, does almost nothing (learning completes as soon as the index is created). Lucene’s index, however, is optimized for high-speed keyword search and is not an ideal format for a document classification model. Therefore, document classification is done by searching the index in the assignClass() method, which is the classification phase. Contrary to commonly used machine learning software, Lucene’s classifier thus requires high computing power in the classification phase. For sites mainly focused on search, this function should still be appealing, since it enables document classification with no additional cost beyond creating the index.
Now, let’s quickly go through how the two implementing classes of the Classifier interface do document classification, and actually call them from a program.
Using Lucene SimpleNaiveBayesClassifier
SimpleNaiveBayesClassifier is the first implementing class of the Classifier interface. As the name indicates, it is a Naive Bayes classifier. Naive Bayes classification finds the class c that maximizes the conditional probability P(c|d), the probability of class c given document d. Applying Bayes’ theorem to P(c|d) shows that it suffices to find the c that maximizes P(c)P(d|c). Logarithms are usually taken to avoid underflow, and the assignClass() method of SimpleNaiveBayesClassifier repeats this calculation once per class to perform MLE (maximum likelihood estimation).
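To make the per-class log-probability calculation concrete, here is a minimal, self-contained sketch of multinomial Naive Bayes with Laplace smoothing. This is a hypothetical helper class for illustration only, not Lucene’s actual implementation; class and method names are my own.

```java
import java.util.*;

// Minimal multinomial Naive Bayes sketch: scores each class by
// log P(c) + sum over tokens of log P(w|c) (Laplace-smoothed) and
// returns the argmax -- the computation repeated once per class.
// Hypothetical illustration, not Lucene's SimpleNaiveBayesClassifier.
public class NaiveBayesSketch {

  // classDocs maps a class label to its training documents (token lists)
  public static String classify(Map<String, List<List<String>>> classDocs,
                                List<String> unknown) {
    // vocabulary size and total document count, for smoothing and priors
    Set<String> vocab = new HashSet<>();
    int totalDocs = 0;
    for (List<List<String>> docs : classDocs.values()) {
      totalDocs += docs.size();
      for (List<String> d : docs) vocab.addAll(d);
    }
    String best = null;
    double bestScore = Double.NEGATIVE_INFINITY;
    for (Map.Entry<String, List<List<String>>> e : classDocs.entrySet()) {
      // term frequencies within this class
      Map<String, Integer> tf = new HashMap<>();
      int classTokens = 0;
      for (List<String> d : e.getValue()) {
        for (String w : d) {
          tf.merge(w, 1, Integer::sum);
          classTokens++;
        }
      }
      // log P(c) + sum of log P(w|c), computed in log space to avoid underflow
      double score = Math.log((double) e.getValue().size() / totalDocs);
      for (String w : unknown) {
        int count = tf.getOrDefault(w, 0);
        score += Math.log((count + 1.0) / (classTokens + vocab.size()));
      }
      if (score > bestScore) { bestScore = score; best = e.getKey(); }
    }
    return best;
  }
}
```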
Now let’s use SimpleNaiveBayesClassifier, but before that, we need to prepare training data in an index. Here we use the livedoor news corpus. Let’s add the livedoor news corpus to the index via Solr, using the following schema definition.
```xml
<?xml version="1.0" encoding="UTF-8" ?>
<schema name="example" version="1.5">
  <fields>
    <field name="url" type="string" indexed="true" stored="true" required="true" multiValued="false"/>
    <field name="cat" type="string" indexed="true" stored="true" required="true" multiValued="false"/>
    <field name="title" type="text_ja" indexed="true" stored="true" multiValued="false"/>
    <field name="body" type="text_ja" indexed="true" stored="true" multiValued="true"/>
    <field name="date" type="date" indexed="true" stored="true"/>
  </fields>
  <uniqueKey>url</uniqueKey>
  <types>
    <fieldType name="string" class="solr.StrField" sortMissingLast="true"/>
    <fieldType name="boolean" class="solr.BoolField" sortMissingLast="true"/>
    <fieldType name="int" class="solr.TrieIntField" precisionStep="0" positionIncrementGap="0"/>
    <fieldType name="float" class="solr.TrieFloatField" precisionStep="0" positionIncrementGap="0"/>
    <fieldType name="long" class="solr.TrieLongField" precisionStep="0" positionIncrementGap="0"/>
    <fieldType name="double" class="solr.TrieDoubleField" precisionStep="0" positionIncrementGap="0"/>
    <fieldType name="date" class="solr.TrieDateField" precisionStep="0" positionIncrementGap="0"/>
    <fieldType name="text_ja" class="solr.TextField" positionIncrementGap="100" autoGeneratePhraseQueries="false">
      <analyzer>
        <tokenizer class="solr.JapaneseTokenizerFactory" mode="search"/>
        <filter class="solr.JapaneseBaseFormFilterFactory"/>
        <filter class="solr.JapanesePartOfSpeechStopFilterFactory" tags="lang/stoptags_ja.txt"/>
        <filter class="solr.CJKWidthFilterFactory"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_ja.txt"/>
        <filter class="solr.JapaneseKatakanaStemFilterFactory" minimumLength="4"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
    </fieldType>
  </types>
</schema>
```
Note that the cat field holds the classification class, while the body field is the field to learn from. First, start Solr with the above schema.xml and add the livedoor news corpus. You can stop Solr as soon as you finish adding the corpus.
Next, we need a Java program that uses SimpleNaiveBayesClassifier. To keep things simple, we classify the very same documents we used for training. The program looks as follows.
```java
import java.io.File;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.apache.lucene.analysis.ja.JapaneseAnalyzer;
import org.apache.lucene.classification.ClassificationResult;
import org.apache.lucene.classification.SimpleNaiveBayesClassifier;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.AtomicReader;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.SlowCompositeReaderWrapper;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.BytesRef;
import org.apache.lucene.util.Version;

public final class TestLuceneIndexClassifier {

  public static final String INDEX = "solr2/collection1/data/index";
  public static final String[] CATEGORIES = {
    "dokujo-tsushin",
    "it-life-hack",
    "kaden-channel",
    "livedoor-homme",
    "movie-enter",
    "peachy",
    "smax",
    "sports-watch",
    "topic-news"
  };
  private static int[][] counts;
  private static Map<String, Integer> catindex;

  public static void main(String[] args) throws Exception {
    init();
    final long startTime = System.currentTimeMillis();
    SimpleNaiveBayesClassifier classifier = new SimpleNaiveBayesClassifier();
    IndexReader reader = DirectoryReader.open(dir());
    AtomicReader ar = SlowCompositeReaderWrapper.wrap(reader);
    classifier.train(ar, "body", "cat", new JapaneseAnalyzer(Version.LUCENE_46));
    final int maxdoc = reader.maxDoc();
    for(int i = 0; i < maxdoc; i++){
      Document doc = ar.document(i);
      String correctAnswer = doc.get("cat");
      final int cai = idx(correctAnswer);
      ClassificationResult<BytesRef> result = classifier.assignClass(doc.get("body"));
      String classified = result.getAssignedClass().utf8ToString();
      final int cli = idx(classified);
      counts[cai][cli]++;
    }
    final long endTime = System.currentTimeMillis();
    final int elapse = (int)(endTime - startTime) / 1000;

    // print results
    int fc = 0, tc = 0;
    for(int i = 0; i < CATEGORIES.length; i++){
      for(int j = 0; j < CATEGORIES.length; j++){
        System.out.printf(" %3d ", counts[i][j]);
        if(i == j){
          tc += counts[i][j];
        }
        else{
          fc += counts[i][j];
        }
      }
      System.out.println();
    }
    float accrate = (float)tc / (float)(tc + fc);
    float errrate = (float)fc / (float)(tc + fc);
    System.out.printf("\n\n*** accuracy rate = %f, error rate = %f; time = %d (sec); %d docs\n",
        accrate, errrate, elapse, maxdoc);
    reader.close();
  }

  static Directory dir() throws IOException {
    return FSDirectory.open(new File(INDEX));
  }

  static void init(){
    counts = new int[CATEGORIES.length][CATEGORIES.length];
    catindex = new HashMap<String, Integer>();
    for(int i = 0; i < CATEGORIES.length; i++){
      catindex.put(CATEGORIES[i], i);
    }
  }

  static int idx(String cat){
    return catindex.get(cat);
  }
}
```
Here we specified JapaneseAnalyzer as the Analyzer (there is a slight difference from index creation, where we used JapaneseTokenizer and the related TokenFilters through Solr). The string array CATEGORIES hard-codes the document categories. Executing this program displays a confusion matrix like Mahout’s, with rows and columns in the same order as the hard-coded category array.
Executing this program displays the following.
```
 760    0    4   23   37   37    2    2    5
  40  656    7   44   25    4   90    1    3
  87   57  392  102   68   24  113    5   16
  40   15    6  391   33    8   16    2    0
  14    2    0    5  845    2    0    1    1
 134    2    2   26  107  549   19    3    0
  43   36   13   17   26   36  693    5    1
   6    0    0   23   35    0    1  829    6
  10    9    9   25   66    6    5   45  595

*** accuracy rate = 0.775078, error rate = 0.224922; time = 67 (sec); 7367 docs
```
The classification accuracy rate came to 77%.
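The accuracy and error rates printed above are simply the diagonal of the confusion matrix over the total count. As a small standalone sketch of that arithmetic (hypothetical helper class, mirroring the evaluation loop in the program):

```java
// Accuracy from a confusion matrix: diagonal entries are correct
// classifications, everything else is an error. Hypothetical helper
// matching the tc/fc tally in the test program.
public class ConfusionStats {
  public static double accuracy(int[][] counts) {
    int tc = 0, fc = 0;
    for (int i = 0; i < counts.length; i++) {
      for (int j = 0; j < counts[i].length; j++) {
        if (i == j) tc += counts[i][j];
        else fc += counts[i][j];
      }
    }
    return (double) tc / (tc + fc);
  }
}
```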
Using Lucene KNearestNeighborClassifier
Another implementing class of Classifier is KNearestNeighborClassifier. KNearestNeighborClassifier takes k, which must be at least 1, as a constructor argument when creating an instance. You can use exactly the same program as for SimpleNaiveBayesClassifier; all you need to do is replace the line that creates the SimpleNaiveBayesClassifier instance with KNearestNeighborClassifier.
As described before, the assignClass() method does all the work for KNearestNeighborClassifier as well, and one interesting point is that it uses Lucene’s MoreLikeThis. MoreLikeThis is a tool that turns a criterion document into a query and performs a search, so that you can find documents similar to it. KNearestNeighborClassifier uses MoreLikeThis to find the k documents most similar to the unknown document passed to the assignClass() method, then applies majority rule to those k documents to determine the category of the unknown document.
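The majority-rule step over the k retrieved documents can be sketched on its own. This is a hypothetical illustration of the vote, assuming the labels of the k nearest documents have already been fetched (e.g. from MoreLikeThis hits); it is not KNearestNeighborClassifier’s actual code.

```java
import java.util.*;

// Majority vote over the category labels of the k nearest documents.
// Hypothetical sketch of the final step of k-NN classification; the
// neighbor labels are assumed to be retrieved already.
public class MajorityVote {
  public static String vote(List<String> kNearestLabels) {
    Map<String, Integer> tally = new HashMap<>();
    String best = null;
    int bestCount = 0;
    for (String label : kNearestLabels) {
      int c = tally.merge(label, 1, Integer::sum);  // increment this label's count
      if (c > bestCount) { bestCount = c; best = label; }
    }
    return best;
  }
}
```

With k=1 this degenerates to returning the single most similar document’s category, which is why the choice of k can change the result noticeably, as seen below.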
Executing the same program with KNearestNeighborClassifier displays the following when k=1.
```
 724   14   28   22    6   30    8   18   20
 121  630   41   13    2    9   35    6   13
 165   28  582   10    5   16   26    7   25
 229   15   15  213    6   14    6    2   11
 134   37   15    8  603   12   19    7   35
 266   38   39   24   14  412   22    9   18
 810   16    1    3    2    3   32    1    2
 316   18   14   12    5    7    8  439   81
 362   17   29   10    1    7    7   16  321

*** accuracy rate = 0.536989, error rate = 0.463011; time = 13 (sec); 7367 docs
```
Now the accuracy rate is 53%. Moreover, with k=3 the accuracy rate goes down to 48%.
```
 652    5   78    3    7   40   13   38   34
 127  540   82   15    1   10   58   23   14
 169   34  553    3    7   16   38   15   29
 242   10   32  156   12   13   15   10   21
 136   30   21    9  592   11   19   15   37
 309   34   58    5   23  318   40   28   27
 810    8    3    1    0   10   37    1    0
 312    8   44    7    5    2   13  442   67
 362   11   45    5    6   10   16   34  281

*** accuracy rate = 0.484729, error rate = 0.515271; time = 9 (sec); 7367 docs
```
Document Classification by NLP4L and Mahout
If you want to use a Lucene index as input data for Mahout, a handy command is available. However, since our purpose is supervised document classification, we also need to output the field that specifies the class, in addition to the document vectors.
Tools that make this easy are MSDDumper and TermsDumper from NLP4L, which we developed. NLP4L stands for Natural Language Processing for Lucene and is a natural language processing tool set that treats Lucene indexes as corpora.
Depending on the settings, MSDDumper and TermsDumper select and extract important words from a Lucene field according to measures like tf*idf, and output them in a format that Mahout commands can easily read. Let’s use this function to select 2,000 important words from the body field of the index and run the Mahout classification.
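The idea of ranking terms by tf*idf to keep only important words can be sketched as follows. This is a hypothetical, self-contained illustration of the scoring, not MSDDumper’s actual code; the class name and ranking details (max tf*idf over documents, idf = log(N/df) + 1) are my own choices for the example.

```java
import java.util.*;

// Select the top-n terms of a token-list corpus by tf*idf.
// Hypothetical sketch of "important word" selection, not MSDDumper itself.
public class TfIdfSelector {
  // docs: each document is a list of tokens; terms are ranked by their
  // maximum tf*idf score over all documents.
  public static List<String> topTerms(List<List<String>> docs, int n) {
    int numDocs = docs.size();
    // document frequency of each term
    Map<String, Integer> df = new HashMap<>();
    for (List<String> d : docs) {
      for (String w : new HashSet<>(d)) df.merge(w, 1, Integer::sum);
    }
    // best (maximum) tf*idf score seen for each term
    Map<String, Double> best = new HashMap<>();
    for (List<String> d : docs) {
      Map<String, Integer> tf = new HashMap<>();
      for (String w : d) tf.merge(w, 1, Integer::sum);
      for (Map.Entry<String, Integer> e : tf.entrySet()) {
        double idf = Math.log((double) numDocs / df.get(e.getKey())) + 1.0;
        best.merge(e.getKey(), e.getValue() * idf, Math::max);
      }
    }
    List<String> terms = new ArrayList<>(best.keySet());
    terms.sort((a, b) -> Double.compare(best.get(b), best.get(a)));
    return terms.subList(0, Math.min(n, terms.size()));
  }
}
```

A term that is frequent in one document but rare across the corpus scores high; a term that appears everywhere scores low, which is the usual rationale for tf*idf-based feature selection.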
Looking only at the result, Mahout Naive Bayes shows an accuracy rate of 96%.
```
=======================================================
Summary
-------------------------------------------------------
Correctly Classified Instances   :  7128  96.7689%
Incorrectly Classified Instances :   238   3.2311%
Total Classified Instances       :  7366
=======================================================
Confusion Matrix
-------------------------------------------------------
  a    b    c    d    e    f    g    h    i   <--Classified as
823    1    1    6   12   19    2    4    2  |  870  a = dokujo-tsushin
  1  848    2    1    0    1   11    4    2  |  870  b = it-life-hack
  5    6  830    1    1    0    3    1   17  |  864  c = kaden-channel
  2    6    6  486    3    1    6    0    0  |  510  d = livedoor-homme
  0    0    1    1  865    1    0    1    1  |  870  e = movie-enter
 31    3    6   12   14  762    6    4    4  |  842  f = peachy
  0    0    2    0    0    1  867    0    0  |  870  g = smax
  0    0    0    1    0    0    0  897    2  |  900  h = sports-watch
  2    4    1    1    0    0    0   12  750  |  770  i = topic-news
=======================================================
Statistics
-------------------------------------------------------
Kappa                                      0.955
Accuracy                                 96.7689%
Reliability                              87.0076%
Reliability (standard deviation)           0.307
```
Also, Mahout Random Forest shows an accuracy rate of 97%.