Running the Bayes classifier test example in Mahout 0.9 is very simple: execute the command below, then choose option 2.
$MAHOUT_HOME/examples/bin/classify-20newsgroups.sh
[jifeng@jifeng01 hadoop]$ $MAHOUT_HOME/examples/bin/classify-20newsgroups.sh
Please select a number to choose the corresponding task to run
1. cnaivebayes
2. naivebayes
3. sgd
4. clean -- cleans up the work area in /tmp/mahout-work-jifeng
Enter your choice :
Choose option 2 and wait for the run to finish:
Enter your choice : 2
ok. You chose 2 and we'll use naivebayes
creating work directory at /tmp/mahout-work-jifeng
Downloading 20news-bydate
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 13.7M  100 13.7M    0     0  96135      0  0:02:30  0:02:30 --:--:--  103k
Extracting...
+ echo 'Preparing 20newsgroups data'
Preparing 20newsgroups data
+ rm -rf /tmp/mahout-work-jifeng/20news-all
+ mkdir /tmp/mahout-work-jifeng/20news-all
+ cp -R /tmp/mahout-work-jifeng/20news-bydate/20news-bydate-test/alt.atheism /tmp/mahout-work-jifeng/20news-bydate/20news-bydate-test/comp.graphics /tmp/mahout-work-jifeng/20news-bydate/20news-bydate-test/comp.os.ms-windows.misc /tmp/mahout-work-jifeng/20news-bydate/20news-bydate-test/comp.sys.ibm.pc.hardware /tmp/mahout-work-jifeng/20news-bydate/20news-bydate-test/comp.sys.mac.hardware /tmp/mahout-work-jifeng/20news-bydate/20news-bydate-test/comp.windows.x /tmp/mahout-work-jifeng/20news-bydate/20news-bydate-test/misc.forsale /tmp/mahout-work-jifeng/20news-bydate/20news-bydate-test/rec.autos /tmp/mahout-work-jifeng/20news-bydate/20news-bydate-test/rec.motorcycles /tmp/mahout-work-jifeng/20news-bydate/20news-bydate-test/rec.sport.baseball /tmp/mahout-work-jifeng/20news-bydate/20news-bydate-test/rec.sport.hockey /tmp/mahout-work-jifeng/20news-bydate/20news-bydate-test/sci.crypt /tmp/mahout-work-jifeng/20news-bydate/20news-bydate-test/sci.electronics /tmp/mahout-work-jifeng/20news-bydate/20news-bydate-test/sci.med /tmp/mahout-work-jifeng/20news-bydate/20news-bydate-test/sci.space /tmp/mahout-work-jifeng/20news-bydate/20news-bydate-test/soc.religion.christian /tmp/mahout-work-jifeng/20news-bydate/20news-bydate-test/talk.politics.guns /tmp/mahout-work-jifeng/20news-bydate/20news-bydate-test/talk.politics.mideast /tmp/mahout-work-jifeng/20news-bydate/20news-bydate-test/talk.politics.misc /tmp/mahout-work-jifeng/20news-bydate/20news-bydate-test/talk.religion.misc /tmp/mahout-work-jifeng/20news-bydate/20news-bydate-train/alt.atheism /tmp/mahout-work-jifeng/20news-bydate/20news-bydate-train/comp.graphics /tmp/mahout-work-jifeng/20news-bydate/20news-bydate-train/comp.os.ms-windows.misc /tmp/mahout-work-jifeng/20news-bydate/20news-bydate-train/comp.sys.ibm.pc.hardware /tmp/mahout-work-jifeng/20news-bydate/20news-bydate-train/comp.sys.mac.hardware /tmp/mahout-work-jifeng/20news-bydate/20news-bydate-train/comp.windows.x /tmp/mahout-work-jifeng/20news-bydate/20news-bydate-train/misc.forsale /tmp/mahout-work-jifeng/20news-bydate/20news-bydate-train/rec.autos /tmp/mahout-work-jifeng/20news-bydate/20news-bydate-train/rec.motorcycles /tmp/mahout-work-jifeng/20news-bydate/20news-bydate-train/rec.sport.baseball /tmp/mahout-work-jifeng/20news-bydate/20news-bydate-train/rec.sport.hockey /tmp/mahout-work-jifeng/20news-bydate/20news-bydate-train/sci.crypt /tmp/mahout-work-jifeng/20news-bydate/20news-bydate-train/sci.electronics /tmp/mahout-work-jifeng/20news-bydate/20news-bydate-train/sci.med /tmp/mahout-work-jifeng/20news-bydate/20news-bydate-train/sci.space /tmp/mahout-work-jifeng/20news-bydate/20news-bydate-train/soc.religion.christian /tmp/mahout-work-jifeng/20news-bydate/20news-bydate-train/talk.politics.guns /tmp/mahout-work-jifeng/20news-bydate/20news-bydate-train/talk.politics.mideast /tmp/mahout-work-jifeng/20news-bydate/20news-bydate-train/talk.politics.misc /tmp/mahout-work-jifeng/20news-bydate/20news-bydate-train/talk.religion.misc /tmp/mahout-work-jifeng/20news-all
+ '[' /home/jifeng/hadoop/hadoop-1.2.1 '!=' '' ']'
+ '[' '' == '' ']'
+ echo 'Copying 20newsgroups data to HDFS'
Copying 20newsgroups data to HDFS
+ set +e
+ /home/jifeng/hadoop/hadoop-1.2.1/bin/hadoop dfs -rmr /tmp/mahout-work-jifeng/20news-all
Warning: $HADOOP_HOME is deprecated.
rmr: cannot remove /tmp/mahout-work-jifeng/20news-all: No such file or directory.
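A note on the "cannot remove" message above: the trace shows the `-rmr` cleanup wrapped between `set +e` and `set -e`, so a failed removal on a first run (when the HDFS directory does not exist yet) does not abort the script. The sketch below illustrates that pattern only. It is not the Mahout script itself: `WORK_DIR`, `cleanup_status`, and the local `rm`/`mkdir` calls are stand-ins assumed here in place of the `hadoop dfs` commands, so that it runs without a cluster.

```shell
#!/usr/bin/env bash
# Sketch of the "tolerate a failing cleanup" pattern seen in the trace.
# Assumption: local rm/mkdir stand in for `hadoop dfs -rmr` / `hadoop dfs -put`.
set -e                                  # strict mode: abort on any failing command

WORK_DIR="$(mktemp -d)/20news-all"      # hypothetical path that does not exist yet

set +e                                  # the cleanup is allowed to fail
rm -r "$WORK_DIR" 2>/dev/null           # fails on a first run: nothing to remove
cleanup_status=$?
set -e                                  # restore strict mode for the real work

mkdir -p "$WORK_DIR"                    # the step that must succeed
echo "cleanup exit status: $cleanup_status"
echo "work dir ready: $WORK_DIR"
```

Without the `set +e` guard, the failed removal would stop a script running under `set -e` before any data was copied.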
+ set -e
+ /home/jifeng/hadoop/hadoop-1.2.1/bin/hadoop dfs -put /tmp/mahout-work-jifeng/20news-all /tmp/mahout-work-jifeng/20news-all
Warning: $HADOOP_HOME is deprecated.
+ echo 'Creating sequence files from 20newsgroups data'
Creating sequence files from 20newsgroups data
+ ./bin/mahout seqdirectory -i /tmp/mahout-work-jifeng/20news-all -o /tmp/mahout-work-jifeng/20news-seq -ow
MAHOUT_LOCAL is not set; adding HADOOP_CONF_DIR to classpath.
Warning: $HADOOP_HOME is deprecated.
Running on hadoop, using /home/jifeng/hadoop/hadoop-1.2.1/bin/hadoop and HADOOP_CONF_DIR=/home/jifeng/hadoop/hadoop-1.2.1/conf
MAHOUT-JOB: /home/jifeng/hadoop/mahout-distribution-0.9/mahout-examples-0.9-job.jar
Warning: $HADOOP_HOME is deprecated.
14/09/20 15:35:58 WARN driver.MahoutDriver: No seqdirectory.props found on classpath, will use command-line arguments only
14/09/20 15:35:59 INFO common.AbstractJob: Command line arguments: {--charset=[UTF-8], --chunkSize=[64], --endPhase=[2147483647], --fileFilterClass=[org.apache.mahout.text.PrefixAdditionFilter], --input=[/tmp/mahout-work-jifeng/20news-all], --keyPrefix=[], --method=[mapreduce], --output=[/tmp/mahout-work-jifeng/20news-seq], --overwrite=null, --startPhase=[0], --tempDir=[temp]}
14/09/20 15:36:05 INFO input.FileInputFormat: Total input paths to process : 18846
14/09/20 15:36:06 INFO util.NativeCodeLoader: Loaded the native-hadoop library
14/09/20 15:36:06 WARN snappy.LoadSnappy: Snappy native library not loaded
14/09/20 15:36:20 INFO mapred.JobClient: Running job: job_201409201505_0001
14/09/20 15:36:21 INFO mapred.JobClient: map 0% reduce 0%
14/09/20 15:36:34 INFO mapred.JobClient: map 10% reduce 0%
14/09/20 15:36:37 INFO mapred.JobClient: map 17% reduce 0%
14/09/20 15:36:41 INFO mapred.JobClient: map 20% reduce 0%
14/09/20 15:36:44 INFO mapred.JobClient: map 24% reduce 0%
14/09/20 15:36:47 INFO mapred.JobClient: map 29% reduce 0%
14/09/20 15:36:50 INFO mapred.JobClient: map 33% reduce 0%
14/09/20 15:36:53 INFO mapred.JobClient: map 38% reduce 0%
14/09/20 15:36:56 INFO mapred.JobClient: map 43% reduce 0%
14/09/20 15:36:59 INFO mapred.JobClient: map 49% reduce 0%
14/09/20 15:37:01 INFO mapred.JobClient: map 55% reduce 0%
14/09/20 15:37:04 INFO mapred.JobClient: map 59% reduce 0%
14/09/20 15:37:07 INFO mapred.JobClient: map 66% reduce 0%
14/09/20 15:37:10 INFO mapred.JobClient: map 73% reduce 0%
14/09/20 15:37:13 INFO mapred.JobClient: map 81% reduce 0%
14/09/20 15:37:16 INFO mapred.JobClient: map 92% reduce 0%
14/09/20 15:37:20 INFO mapred.JobClient: map 100% reduce 0%
14/09/20 15:37:20 INFO mapred.JobClient: Job complete: job_201409201505_0001
14/09/20 15:37:20 INFO mapred.JobClient: Counters: 18
14/09/20 15:37:20 INFO mapred.JobClient: Job Counters
14/09/20 15:37:20 INFO mapred.JobClient: SLOTS_MILLIS_MAPS=56433
14/09/20 15:37:20 INFO mapred.JobClient: Total time spent by all reduces waiting after reserving slots (ms)=0
14/09/20 15:37:20 INFO mapred.JobClient: Total time spent by all maps waiting after reserving slots (ms)=0
14/09/20 15:37:20 INFO mapred.JobClient: Launched map tasks=1
14/09/20 15:37:20 INFO mapred.JobClient: SLOTS_MILLIS_REDUCES=0
14/09/20 15:37:20 INFO mapred.JobClient: File Output Format Counters
14/09/20 15:37:20 INFO mapred.JobClient: Bytes Written=19202391
14/09/20 15:37:20 INFO mapred.JobClient: FileSystemCounters
14/09/20 15:37:20 INFO mapred.JobClient: HDFS_BYTES_READ=37622181
14/09/20 15:37:20 INFO mapred.JobClient: FILE_BYTES_WRITTEN=59535
14/09/20 15:37:20 INFO mapred.JobClient: HDFS_BYTES_WRITTEN=19202391
14/09/20 15:37:20 INFO mapred.JobClient: File Input Format Counters
14/09/20 15:37:20 INFO mapred.JobClient: Bytes Read=0
14/09/20 15:37:20 INFO mapred.JobClient: Map-Reduce Framework
14/09/20 15:37:20 INFO mapred.JobClient: Map input records=18846
14/09/20 15:37:20 INFO mapred.JobClient: Physical memory (bytes) snapshot=84213760
14/09/20 15:37:20 INFO mapred.JobClient: Spilled Records=0
14/09/20 15:37:20 INFO mapred.JobClient: CPU time spent (ms)=21400
14/09/20 15:37:20 INFO mapred.JobClient: Total committed heap usage (bytes)=57679872
14/09/20 15:37:20 INFO mapred.JobClient: Virtual memory (bytes) snapshot=349954048
14/09/20 15:37:20 INFO mapred.JobClient: Map output records=18846
14/09/20 15:37:20 INFO mapred.JobClient: SPLIT_RAW_BYTES=1767178
14/09/20 15:37:20 INFO driver.MahoutDriver: Program took 81580 ms (Minutes: 1.3596666666666666)
+ echo 'Converting sequence files to vectors'
Converting sequence files to vectors
+ ./bin/mahout seq2sparse -i /tmp/mahout-work-jifeng/20news-seq -o /tmp/mahout-work-jifeng/20news-vectors -lnorm -nv -wt tfidf
MAHOUT_LOCAL is not set; adding HADOOP_CONF_DIR to classpath.
Warning: $HADOOP_HOME is deprecated.
Running on hadoop, using /home/jifeng/hadoop/hadoop-1.2.1/bin/hadoop and HADOOP_CONF_DIR=/home/jifeng/hadoop/hadoop-1.2.1/conf
MAHOUT-JOB: /home/jifeng/hadoop/mahout-distribution-0.9/mahout-examples-0.9-job.jar
Warning: $HADOOP_HOME is deprecated.
14/09/20 15:37:23 WARN driver.MahoutDriver: No seq2sparse.props found on classpath, will use command-line arguments only
14/09/20 15:37:23 INFO vectorizer.SparseVectorsFromSequenceFiles: Maximum n-gram size is: 1
14/09/20 15:37:23 INFO vectorizer.SparseVectorsFromSequenceFiles: Minimum LLR value: 1.0
14/09/20 15:37:23 INFO vectorizer.SparseVectorsFromSequenceFiles: Number of reduce tasks: 1
14/09/20 15:37:23 INFO vectorizer.SparseVectorsFromSequenceFiles: Tokenizing documents in /tmp/mahout-work-jifeng/20news-seq
14/09/20 15:37:27 INFO input.FileInputFormat: Total input paths to process : 1
14/09/20 15:37:28 INFO mapred.JobClient: Running job: job_201409201505_0002
14/09/20 15:37:29 INFO mapred.JobClient: map 0% reduce 0%
14/09/20 15:37:48 INFO mapred.JobClient: map 100% reduce 0%
14/09/20 15:37:49 INFO mapred.JobClient: Job complete: job_201409201505_0002
14/09/20 15:37:49 INFO mapred.JobClient: Counters: 19
14/09/20 15:37:49 INFO mapred.JobClient: Job Counters
14/09/20 15:37:49 INFO mapred.JobClient: SLOTS_MILLIS_MAPS=16049
14/09/20 15:37:49 INFO mapred.JobClient: Total time spent by all reduces waiting after reserving slots (ms)=0
14/09/20 15:37:49 INFO mapred.JobClient: Total time spent by all maps waiting after reserving slots (ms)=0
14/09/20 15:37:49 INFO mapred.JobClient: Rack-local map tasks=1
14/09/20 15:37:49 INFO mapred.JobClient: Launched map tasks=1
14/09/20 15:37:49 INFO mapred.JobClient: SLOTS_MILLIS_REDUCES=0
14/09/20 15:37:49 INFO mapred.JobClient: File Output Format Counters
14/09/20 15:37:49 INFO mapred.JobClient: Bytes Written=27503580
14/09/20 15:37:49 INFO mapred.JobClient: FileSystemCounters
14/09/20 15:37:49 INFO mapred.JobClient: HDFS_BYTES_READ=19202523
14/09/20 15:37:49 INFO mapred.JobClient: FILE_BYTES_WRITTEN=57471
14/09/20 15:37:49 INFO mapred.JobClient: HDFS_BYTES_WRITTEN=27503580
14/09/20 15:37:49 INFO mapred.JobClient: File Input Format Counters
14/09/20 15:37:49 INFO mapred.JobClient: Bytes Read=19202391
14/09/20 15:37:49 INFO mapred.JobClient: Map-Reduce Framework
14/09/20 15:37:49 INFO mapred.JobClient: Map input records=18846
14/09/20 15:37:49 INFO mapred.JobClient: Physical memory (bytes) snapshot=90681344
14/09/20 15:37:49 INFO mapred.JobClient: Spilled Records=0
14/09/20 15:37:49 INFO mapred.JobClient: CPU time spent (ms)=5370
14/09/20 15:37:49 INFO mapred.JobClient: Total committed heap usage (bytes)=31916032
14/09/20 15:37:49 INFO mapred.JobClient: Virtual memory (bytes) snapshot=519794688
14/09/20 15:37:49 INFO mapred.JobClient: Map output records=18846
14/09/20 15:37:49 INFO mapred.JobClient: SPLIT_RAW_BYTES=132
14/09/20 15:37:49 INFO vectorizer.SparseVectorsFromSequenceFiles: Creating Term Frequency Vectors
14/09/20 15:37:49 INFO vectorizer.DictionaryVectorizer: Creating dictionary from /tmp/mahout-work-jifeng/20news-vectors/tokenized-documents and saving at /tmp/mahout-work-jifeng/20news-vectors/wordcount
14/09/20 15:37:52 INFO input.FileInputFormat: Total input paths to process : 1
14/09/20 15:37:52 INFO mapred.JobClient: Running job: job_201409201505_0003
14/09/20 15:37:53 INFO mapred.JobClient: map 0% reduce 0%
14/09/20 15:38:08 INFO mapred.JobClient: map 100% reduce 0%
14/09/20 15:38:15 INFO mapred.JobClient: map 100% reduce 33%
14/09/20 15:38:16 INFO mapred.JobClient: map 100% reduce 100%
14/09/20 15:38:17 INFO mapred.JobClient: Job complete: job_201409201505_0003
14/09/20 15:38:17 INFO mapred.JobClient: Counters: 29
14/09/20 15:38:17 INFO mapred.JobClient: Job Counters
14/09/20 15:38:17 INFO mapred.JobClient: Launched reduce tasks=1
14/09/20 15:38:17 INFO mapred.JobClient: SLOTS_MILLIS_MAPS=8755
14/09/20 15:38:17 INFO mapred.JobClient: Total time spent by all reduces waiting after reserving slots (ms)=0
14/09/20 15:38:17 INFO mapred.JobClient: Total time spent by all maps waiting after reserving slots (ms)=0
14/09/20 15:38:17 INFO mapred.JobClient: Launched map tasks=1
14/09/20 15:38:17 INFO mapred.JobClient: Data-local map tasks=1
14/09/20 15:38:17 INFO mapred.JobClient: SLOTS_MILLIS_REDUCES=8294
14/09/20 15:38:17 INFO mapred.JobClient: File Output Format Counters
14/09/20 15:38:17 INFO mapred.JobClient: Bytes Written=2315037
14/09/20 15:38:17 INFO mapred.JobClient: FileSystemCounters
14/09/20 15:38:17 INFO mapred.JobClient: FILE_BYTES_READ=11857906
14/09/20 15:38:17 INFO mapred.JobClient: HDFS_BYTES_READ=27503736
14/09/20 15:38:17 INFO mapred.JobClient: FILE_BYTES_WRITTEN=15512128
14/09/20 15:38:17 INFO mapred.JobClient: HDFS_BYTES_WRITTEN=2315037
14/09/20 15:38:17 INFO mapred.JobClient: File Input Format Counters
14/09/20 15:38:17 INFO mapred.JobClient: Bytes Read=27503580
14/09/20 15:38:17 INFO mapred.JobClient: Map-Reduce Framework
14/09/20 15:38:17 INFO mapred.JobClient: Map output materialized bytes=3538084
14/09/20 15:38:17 INFO mapred.JobClient: Map input records=18846
14/09/20 15:38:17 INFO mapred.JobClient: Reduce shuffle bytes=3538084
14/09/20 15:38:17 INFO mapred.JobClient: Spilled Records=849345
14/09/20 15:38:17 INFO mapred.JobClient: Map output bytes=39462740
14/09/20 15:38:17 INFO mapred.JobClient: Total committed heap usage (bytes)=219021312
14/09/20 15:38:17 INFO mapred.JobClient: CPU time spent (ms)=6420
14/09/20 15:38:17 INFO mapred.JobClient: Combine input records=3026242
14/09/20 15:38:17 INFO mapred.JobClient: SPLIT_RAW_BYTES=156
14/09/20 15:38:17 INFO mapred.JobClient: Reduce input records=192904
14/09/20 15:38:17 INFO mapred.JobClient: Reduce input groups=192904
14/09/20 15:38:17 INFO mapred.JobClient: Combine output records=554873
14/09/20 15:38:17 INFO mapred.JobClient: Physical memory (bytes) snapshot=243875840
14/09/20 15:38:17 INFO mapred.JobClient: Reduce output records=93563
14/09/20 15:38:17 INFO mapred.JobClient: Virtual memory (bytes) snapshot=699461632
14/09/20 15:38:17 INFO mapred.JobClient: Map output records=2664273
14/09/20 15:38:21 INFO input.FileInputFormat: Total input paths to process : 1
14/09/20 15:38:21 INFO mapred.JobClient: Running job: job_201409201505_0004
14/09/20 15:38:22 INFO mapred.JobClient: map 0% reduce 0%
14/09/20 15:38:32 INFO mapred.JobClient: map 100% reduce 0%
14/09/20 15:38:39 INFO mapred.JobClient: map 100% reduce 33%
14/09/20 15:38:42 INFO mapred.JobClient: map 100% reduce 91%
14/09/20 15:38:44 INFO mapred.JobClient: map 100% reduce 100%
14/09/20 15:38:44 INFO mapred.JobClient: Job complete: job_201409201505_0004
14/09/20 15:38:44 INFO mapred.JobClient: Counters: 29
14/09/20 15:38:44 INFO mapred.JobClient: Job Counters
14/09/20 15:38:44 INFO mapred.JobClient: Launched reduce tasks=1
14/09/20 15:38:44 INFO mapred.JobClient: SLOTS_MILLIS_MAPS=5169
14/09/20 15:38:44 INFO mapred.JobClient: Total time spent by all reduces waiting after reserving slots (ms)=0
14/09/20 15:38:44 INFO mapred.JobClient: Total time spent by all maps waiting after reserving slots (ms)=0
14/09/20 15:38:44 INFO mapred.JobClient: Launched map tasks=1
14/09/20 15:38:44 INFO mapred.JobClient: Data-local map tasks=1
14/09/20 15:38:44 INFO mapred.JobClient: SLOTS_MILLIS_REDUCES=11769
14/09/20 15:38:44 INFO mapred.JobClient: File Output Format Counters
14/09/20 15:38:44 INFO mapred.JobClient: Bytes Written=29314118
14/09/20 15:38:44 INFO mapred.JobClient: FileSystemCounters
14/09/20 15:38:44 INFO mapred.JobClient: FILE_BYTES_READ=29226519
14/09/20 15:38:44 INFO mapred.JobClient: HDFS_BYTES_READ=27503736
14/09/20 15:38:44 INFO mapred.JobClient: FILE_BYTES_WRITTEN=54668830
14/09/20 15:38:44 INFO mapred.JobClient: HDFS_BYTES_WRITTEN=29314118
14/09/20 15:38:44 INFO mapred.JobClient: File Input Format Counters
14/09/20 15:38:44 INFO mapred.JobClient: Bytes Read=27503580
14/09/20 15:38:44 INFO mapred.JobClient: Map-Reduce Framework
14/09/20 15:38:44 INFO mapred.JobClient: Map output materialized bytes=27274291
14/09/20 15:38:44 INFO mapred.JobClient: Map input records=18846
14/09/20 15:38:44 INFO mapred.JobClient: Reduce shuffle bytes=27274291
14/09/20 15:38:44 INFO mapred.JobClient: Spilled Records=37692
14/09/20 15:38:44 INFO mapred.JobClient: Map output bytes=27199343
14/09/20 15:38:44 INFO mapred.JobClient: Total committed heap usage (bytes)=258617344
14/09/20 15:38:44 INFO mapred.JobClient: CPU time spent (ms)=5930
14/09/20 15:38:44 INFO mapred.JobClient: Combine input records=0
14/09/20 15:38:44 INFO mapred.JobClient: SPLIT_RAW_BYTES=156
14/09/20 15:38:44 INFO mapred.JobClient: Reduce input records=18846
14/09/20 15:38:44 INFO mapred.JobClient: Reduce input groups=18846
14/09/20 15:38:44 INFO mapred.JobClient: Combine output records=0
14/09/20 15:38:44 INFO mapred.JobClient: Physical memory (bytes) snapshot=285229056
14/09/20 15:38:44 INFO mapred.JobClient: Reduce output records=18846
14/09/20 15:38:44 INFO mapred.JobClient: Virtual memory (bytes) snapshot=699805696
14/09/20 15:38:44 INFO mapred.JobClient: Map output records=18846
14/09/20 15:38:46 INFO input.FileInputFormat: Total input paths to process : 1
14/09/20 15:38:46 INFO mapred.JobClient: Running job: job_201409201505_0005
14/09/20 15:38:47 INFO mapred.JobClient: map 0% reduce 0%
14/09/20 15:38:58 INFO mapred.JobClient: map 100% reduce 0%
14/09/20 15:39:05 INFO mapred.JobClient: map 100% reduce 33%
14/09/20 15:39:07 INFO mapred.JobClient: Job complete: job_201409201505_0005
14/09/20 15:39:07 INFO mapred.JobClient: Counters: 29
14/09/20 15:39:07 INFO mapred.JobClient: Job Counters
14/09/20 15:39:07 INFO mapred.JobClient: Launched reduce tasks=1
14/09/20 15:39:07 INFO mapred.JobClient: SLOTS_MILLIS_MAPS=5851
14/09/20 15:39:07 INFO mapred.JobClient: Total time spent by all reduces waiting after reserving slots (ms)=0
14/09/20 15:39:07 INFO mapred.JobClient: Total time spent by all maps waiting after reserving slots (ms)=0
14/09/20 15:39:07 INFO mapred.JobClient: Launched map tasks=1
14/09/20 15:39:07 INFO mapred.JobClient: Data-local map tasks=1
14/09/20 15:39:07 INFO mapred.JobClient: SLOTS_MILLIS_REDUCES=9887
14/09/20 15:39:07 INFO mapred.JobClient: File Output Format Counters
14/09/20 15:39:07 INFO mapred.JobClient: Bytes Written=29314118
14/09/20 15:39:07 INFO mapred.JobClient: FileSystemCounters
14/09/20 15:39:07 INFO mapred.JobClient: FILE_BYTES_READ=29059398
14/09/20 15:39:07 INFO mapred.JobClient: HDFS_BYTES_READ=29314272
14/09/20 15:39:07 INFO mapred.JobClient: FILE_BYTES_WRITTEN=58235856
14/09/20 15:39:07 INFO mapred.JobClient: HDFS_BYTES_WRITTEN=29314118
14/09/20 15:39:07 INFO mapred.JobClient: File Input Format Counters
14/09/20 15:39:07 INFO mapred.JobClient: Bytes Read=29314118
14/09/20 15:39:07 INFO mapred.JobClient: Map-Reduce Framework
14/09/20 15:39:07 INFO mapred.JobClient: Map output materialized bytes=29059398
14/09/20 15:39:07 INFO mapred.JobClient: Map input records=18846
14/09/20 15:39:07 INFO mapred.JobClient: Reduce shuffle bytes=29059398
14/09/20 15:39:07 INFO mapred.JobClient: Spilled Records=37692
14/09/20 15:39:07 INFO mapred.JobClient: Map output bytes=28984080
14/09/20 15:39:07 INFO mapred.JobClient: Total committed heap usage (bytes)=161316864
14/09/20 15:39:07 INFO mapred.JobClient: CPU time spent (ms)=3580
14/09/20 15:39:07 INFO mapred.JobClient: Combine input records=0
14/09/20 15:39:07 INFO mapred.JobClient: SPLIT_RAW_BYTES=154
14/09/20 15:39:07 INFO mapred.JobClient: Reduce input records=18846
14/09/20 15:39:07 INFO mapred.JobClient: Reduce input groups=18846
14/09/20 15:39:07 INFO mapred.JobClient: Combine output records=0
14/09/20 15:39:07 INFO mapred.JobClient: Physical memory (bytes) snapshot=213598208
14/09/20 15:39:07 INFO mapred.JobClient: Reduce output records=18846
14/09/20 15:39:07 INFO mapred.JobClient: Virtual memory (bytes) snapshot=698466304
14/09/20 15:39:07 INFO mapred.JobClient: Map output records=18846
14/09/20 15:39:07 INFO common.HadoopUtil: Deleting /tmp/mahout-work-jifeng/20news-vectors/partial-vectors-0
14/09/20 15:39:07 INFO vectorizer.SparseVectorsFromSequenceFiles: Calculating IDF
14/09/20 15:39:09 INFO input.FileInputFormat: Total input paths to process : 1
14/09/20 15:39:09 INFO mapred.JobClient: Running job: job_201409201505_0006
14/09/20 15:39:10 INFO mapred.JobClient: map 0% reduce 0%
14/09/20 15:39:23 INFO mapred.JobClient: map 100% reduce 0%
14/09/20 15:39:30 INFO mapred.JobClient: map 100% reduce 33%
14/09/20 15:39:31 INFO mapred.JobClient: map 100% reduce 100%
14/09/20 15:39:31 INFO mapred.JobClient: Job complete: job_201409201505_0006
14/09/20 15:39:31 INFO mapred.JobClient: Counters: 29
14/09/20 15:39:31 INFO mapred.JobClient: Job Counters
14/09/20 15:39:31 INFO mapred.JobClient: Launched reduce tasks=1
14/09/20 15:39:31 INFO mapred.JobClient: SLOTS_MILLIS_MAPS=10231
14/09/20 15:39:31 INFO mapred.JobClient: Total time spent by all reduces waiting after reserving slots (ms)=0
14/09/20 15:39:31 INFO mapred.JobClient: Total time spent by all maps waiting after reserving slots (ms)=0
14/09/20 15:39:31 INFO mapred.JobClient: Rack-local map tasks=1
14/09/20 15:39:31 INFO mapred.JobClient: Launched map tasks=1
14/09/20 15:39:31 INFO mapred.JobClient: SLOTS_MILLIS_REDUCES=8205
14/09/20 15:39:31 INFO mapred.JobClient: File Output Format Counters
14/09/20 15:39:31 INFO mapred.JobClient: Bytes Written=1890073
14/09/20 15:39:31 INFO mapred.JobClient: FileSystemCounters
14/09/20 15:39:31 INFO mapred.JobClient: FILE_BYTES_READ=4880830
14/09/20 15:39:31 INFO mapred.JobClient: HDFS_BYTES_READ=29314273
14/09/20 15:39:31 INFO mapred.JobClient: FILE_BYTES_WRITTEN=6306594
14/09/20 15:39:31 INFO mapred.JobClient: HDFS_BYTES_WRITTEN=1890073
14/09/20 15:39:31 INFO mapred.JobClient: File Input Format Counters
14/09/20 15:39:31 INFO mapred.JobClient: Bytes Read=29314118
14/09/20 15:39:31 INFO mapred.JobClient: Map-Reduce Framework
14/09/20 15:39:31 INFO mapred.JobClient: Map output materialized bytes=1309902
14/09/20 15:39:31 INFO mapred.JobClient: Map input records=18846
14/09/20 15:39:31 INFO mapred.JobClient: Reduce shuffle bytes=1309902
14/09/20 15:39:31 INFO mapred.JobClient: Spilled Records=442190
14/09/20 15:39:31 INFO mapred.JobClient: Map output bytes=31005336
14/09/20 15:39:31 INFO mapred.JobClient: Total committed heap usage (bytes)=132190208
14/09/20 15:39:31 INFO mapred.JobClient: CPU time spent (ms)=5200
14/09/20 15:39:31 INFO mapred.JobClient: Combine input records=2838840
14/09/20 15:39:31 INFO mapred.JobClient: SPLIT_RAW_BYTES=155
14/09/20 15:39:31 INFO mapred.JobClient: Reduce input records=93564
14/09/20 15:39:31 INFO mapred.JobClient: Reduce input groups=93564
14/09/20 15:39:31 INFO mapred.JobClient: Combine output records=348626
14/09/20 15:39:31 INFO mapred.JobClient: Physical memory (bytes) snapshot=186679296
14/09/20 15:39:31 INFO mapred.JobClient: Reduce output records=93564
14/09/20 15:39:31 INFO mapred.JobClient: Virtual memory (bytes) snapshot=699461632
14/09/20 15:39:31 INFO mapred.JobClient: Map output records=2583778
14/09/20 15:39:31 INFO vectorizer.SparseVectorsFromSequenceFiles: Pruning
14/09/20 15:39:35 INFO input.FileInputFormat: Total input paths to process : 1
14/09/20 15:39:35 INFO mapred.JobClient: Running job: job_201409201505_0007
14/09/20 15:39:36 INFO mapred.JobClient: map 0% reduce 0%
14/09/20 15:39:49 INFO mapred.JobClient: map 100% reduce 0%
14/09/20 15:39:57 INFO mapred.JobClient: map 100% reduce 33%
14/09/20 15:40:00 INFO mapred.JobClient: map 100% reduce 67%
14/09/20 15:40:02 INFO mapred.JobClient: map 100% reduce 100%
14/09/20 15:40:04 INFO mapred.JobClient: Job complete: job_201409201505_0007
14/09/20 15:40:04 INFO mapred.JobClient: Counters: 29
14/09/20 15:40:04 INFO mapred.JobClient: Job Counters
14/09/20 15:40:04 INFO mapred.JobClient: Launched reduce tasks=1
14/09/20 15:40:04 INFO mapred.JobClient: SLOTS_MILLIS_MAPS=11081
14/09/20 15:40:04 INFO mapred.JobClient: Total time spent by all reduces waiting after reserving slots (ms)=0
14/09/20 15:40:04 INFO mapred.JobClient: Total time spent by all maps waiting after reserving slots (ms)=0
14/09/20 15:40:04 INFO mapred.JobClient: Launched map tasks=1
14/09/20 15:40:04 INFO mapred.JobClient: Data-local map tasks=1
14/09/20 15:40:04 INFO mapred.JobClient: SLOTS_MILLIS_REDUCES=13644
14/09/20 15:40:04 INFO mapred.JobClient: File Output Format Counters
14/09/20 15:40:04 INFO mapred.JobClient: Bytes Written=28689283
14/09/20 15:40:04 INFO mapred.JobClient: FileSystemCounters
14/09/20 15:40:04 INFO mapred.JobClient: FILE_BYTES_READ=9646422
14/09/20 15:40:04 INFO mapred.JobClient: HDFS_BYTES_READ=29314273
14/09/20 15:40:04 INFO mapred.JobClient: FILE_BYTES_WRITTEN=15601678
14/09/20 15:40:04 INFO mapred.JobClient: HDFS_BYTES_WRITTEN=28689283
14/09/20 15:40:04 INFO mapred.JobClient: File Input Format Counters
14/09/20 15:40:04 INFO mapred.JobClient: Bytes Read=29314118
14/09/20 15:40:04 INFO mapred.JobClient: Map-Reduce Framework
14/09/20 15:40:04 INFO mapred.JobClient: Map output materialized bytes=7741585
14/09/20 15:40:04 INFO mapred.JobClient: Map input records=18846
14/09/20 15:40:04 INFO mapred.JobClient: Reduce shuffle bytes=7741585
14/09/20 15:40:04 INFO mapred.JobClient: Spilled Records=37692
14/09/20 15:40:04 INFO mapred.JobClient: Map output bytes=28984080
14/09/20 15:40:04 INFO mapred.JobClient: Total committed heap usage (bytes)=192774144
14/09/20 15:40:04 INFO mapred.JobClient: CPU time spent (ms)=8050
14/09/20 15:40:04 INFO mapred.JobClient: Combine input records=0
14/09/20 15:40:04 INFO mapred.JobClient: SPLIT_RAW_BYTES=155
14/09/20 15:40:04 INFO mapred.JobClient: Reduce input records=18846
14/09/20 15:40:04 INFO mapred.JobClient: Reduce input groups=18846
14/09/20 15:40:04 INFO mapred.JobClient: Combine output records=0
14/09/20 15:40:04 INFO mapred.JobClient: Physical memory (bytes) snapshot=306360320
14/09/20 15:40:04 INFO mapred.JobClient: Reduce output records=18846
14/09/20 15:40:04 INFO mapred.JobClient: Virtual memory (bytes) snapshot=1039331328
14/09/20 15:40:04 INFO mapred.JobClient: Map output records=18846
14/09/20 15:40:05 INFO input.FileInputFormat: Total input paths to process : 1
14/09/20 15:40:05 INFO mapred.JobClient: Running job: job_201409201505_0008
14/09/20 15:40:06 INFO mapred.JobClient: map 0% reduce 0%
14/09/20 15:40:15 INFO mapred.JobClient: map 100% reduce 0%
14/09/20 15:40:22 INFO mapred.JobClient: map 100% reduce 33%
14/09/20 15:40:25 INFO mapred.JobClient: map 100% reduce 100%
14/09/20 15:40:25 INFO mapred.JobClient: Job complete: job_201409201505_0008
14/09/20 15:40:25 INFO mapred.JobClient: Counters: 29
14/09/20 15:40:25 INFO mapred.JobClient: Job Counters
14/09/20 15:40:25 INFO mapred.JobClient: Launched reduce tasks=1
14/09/20 15:40:25 INFO mapred.JobClient: SLOTS_MILLIS_MAPS=7062
14/09/20 15:40:25 INFO mapred.JobClient: Total time spent by all reduces waiting after reserving slots (ms)=0
14/09/20 15:40:25 INFO mapred.JobClient: Total time spent by all maps waiting after reserving slots (ms)=0
14/09/20 15:40:25 INFO mapred.JobClient: Launched map tasks=1
14/09/20 15:40:25 INFO mapred.JobClient: Data-local map tasks=1
14/09/20 15:40:25 INFO mapred.JobClient: SLOTS_MILLIS_REDUCES=9478
14/09/20 15:40:25 INFO mapred.JobClient: File Output Format Counters
14/09/20 15:40:25 INFO mapred.JobClient: Bytes Written=28689283
14/09/20 15:40:25 INFO mapred.JobClient: FileSystemCounters
14/09/20 15:40:25 INFO mapred.JobClient: FILE_BYTES_READ=28437750
14/09/20 15:40:25 INFO mapred.JobClient: HDFS_BYTES_READ=28689448
14/09/20 15:40:25 INFO mapred.JobClient: FILE_BYTES_WRITTEN=56991474
14/09/20 15:40:25 INFO mapred.JobClient: HDFS_BYTES_WRITTEN=28689283
14/09/20 15:40:25 INFO mapred.JobClient: File Input Format Counters
14/09/20 15:40:25 INFO mapred.JobClient: Bytes Read=28689283
14/09/20 15:40:25 INFO mapred.JobClient: Map-Reduce Framework
14/09/20 15:40:25 INFO mapred.JobClient: Map output materialized bytes=28437750
14/09/20 15:40:25 INFO mapred.JobClient: Map input records=18846
14/09/20 15:40:25 INFO mapred.JobClient: Reduce shuffle bytes=28437750
14/09/20 15:40:25 INFO mapred.JobClient: Spilled Records=37692
14/09/20 15:40:25 INFO mapred.JobClient: Map output bytes=28362505
14/09/20 15:40:25 INFO mapred.JobClient: Total committed heap usage (bytes)=160694272
14/09/20 15:40:25 INFO mapred.JobClient: CPU time spent (ms)=3090
14/09/20 15:40:25 INFO mapred.JobClient: Combine input records=0
14/09/20 15:40:25 INFO mapred.JobClient: SPLIT_RAW_BYTES=165
14/09/20 15:40:25 INFO mapred.JobClient: Reduce input records=18846
14/09/20 15:40:25 INFO mapred.JobClient: Reduce input groups=18846
14/09/20 15:40:25 INFO mapred.JobClient: Combine output records=0
14/09/20 15:40:25 INFO mapred.JobClient: Physical memory (bytes) snapshot=212541440
14/09/20 15:40:25 INFO mapred.JobClient: Reduce output records=18846
14/09/20 15:40:25 INFO mapred.JobClient: Virtual memory (bytes) snapshot=698466304
14/09/20 15:40:25 INFO mapred.JobClient: Map output records=18846
14/09/20 15:40:25 INFO common.HadoopUtil: Deleting /tmp/mahout-work-jifeng/20news-vectors/tf-vectors-partial
14/09/20 15:40:25 INFO common.HadoopUtil: Deleting /tmp/mahout-work-jifeng/20news-vectors/tf-vectors-toprune
14/09/20 15:40:29 INFO input.FileInputFormat: Total input paths to process : 1
14/09/20 15:40:29 INFO mapred.JobClient: Running job: job_201409201505_0009
14/09/20 15:40:30 INFO mapred.JobClient: map 0% reduce 0%
14/09/20 15:40:40 INFO mapred.JobClient: map 100% reduce 0%
14/09/20 15:40:48 INFO mapred.JobClient: map 100% reduce 33%
14/09/20 15:40:51 INFO mapred.JobClient: map 100% reduce 89%
14/09/20 15:40:53 INFO mapred.JobClient: map 100% reduce 100%
14/09/20 15:40:54 INFO mapred.JobClient: Job complete: job_201409201505_0009
14/09/20 15:40:54 INFO mapred.JobClient: Counters: 29
14/09/20 15:40:54 INFO mapred.JobClient: Job Counters
14/09/20 15:40:54 INFO mapred.JobClient: Launched reduce tasks=1
14/09/20 15:40:54 INFO mapred.JobClient: SLOTS_MILLIS_MAPS=9479
14/09/20 15:40:54 INFO mapred.JobClient: Total time spent by all reduces waiting after reserving slots (ms)=0
14/09/20 15:40:54 INFO mapred.JobClient: Total time spent by all maps waiting after reserving slots (ms)=0
14/09/20 15:40:54 INFO mapred.JobClient: Rack-local map tasks=1
14/09/20 15:40:54 INFO mapred.JobClient: Launched map tasks=1
14/09/20 15:40:54 INFO mapred.JobClient: SLOTS_MILLIS_REDUCES=12699
14/09/20 15:40:54 INFO mapred.JobClient: File Output Format Counters
14/09/20 15:40:54 INFO mapred.JobClient: Bytes Written=28689283
14/09/20 15:40:54 INFO mapred.JobClient: FileSystemCounters
14/09/20 15:40:54 INFO mapred.JobClient: FILE_BYTES_READ=30342579
14/09/20 15:40:54 INFO mapred.JobClient: HDFS_BYTES_READ=28689430
14/09/20 15:40:54 INFO mapred.JobClient: FILE_BYTES_WRITTEN=56995482
14/09/20 15:40:54 INFO mapred.JobClient: HDFS_BYTES_WRITTEN=28689283
14/09/20 15:40:54 INFO mapred.JobClient: File Input Format Counters
14/09/20 15:40:54 INFO mapred.JobClient: Bytes Read=28689283
14/09/20 15:40:54 INFO mapred.JobClient: Map-Reduce Framework
14/09/20 15:40:54 INFO mapred.JobClient: Map output materialized bytes=28437750
14/09/20 15:40:54 INFO mapred.JobClient: Map input records=18846
14/09/20 15:40:54 INFO mapred.JobClient: Reduce shuffle bytes=28437750
14/09/20 15:40:54 INFO mapred.JobClient: Spilled Records=37692
14/09/20 15:40:54 INFO mapred.JobClient: Map output bytes=28362505
14/09/20 15:40:54 INFO mapred.JobClient: Total committed heap usage (bytes)=192151552
14/09/20 15:40:54 INFO mapred.JobClient: CPU time spent (ms)=6690
14/09/20 15:40:54 INFO mapred.JobClient: Combine input records=0
14/09/20 15:40:54 INFO mapred.JobClient: SPLIT_RAW_BYTES=147
14/09/20 15:40:54 INFO mapred.JobClient: Reduce input records=18846
14/09/20 15:40:54 INFO mapred.JobClient: Reduce input groups=18846
14/09/20 15:40:54 INFO mapred.JobClient: Combine output records=0
14/09/20 15:40:54 INFO mapred.JobClient: Physical memory (bytes) snapshot=294711296
14/09/20 15:40:54 INFO mapred.JobClient: Reduce output records=18846
14/09/20 15:40:54 INFO mapred.JobClient: Virtual memory (bytes) snapshot=1024688128
14/09/20 15:40:54 INFO mapred.JobClient: Map output records=18846
14/09/20 15:40:55 INFO input.FileInputFormat: Total input paths to process : 1
14/09/20 15:40:55 INFO mapred.JobClient: Running job: job_201409201505_0010
14/09/20 15:40:56 INFO mapred.JobClient: map 0% reduce 0%
14/09/20 15:41:06 INFO mapred.JobClient: map 100% reduce 0%
14/09/20 15:41:13 INFO mapred.JobClient: map 100% reduce 33%
14/09/20 15:41:15 INFO mapred.JobClient: Job complete: job_201409201505_0010
14/09/20 15:41:15 INFO mapred.JobClient: Counters: 29
14/09/20 15:41:15 INFO mapred.JobClient: Job Counters
14/09/20 15:41:15 INFO mapred.JobClient: Launched reduce tasks=1
14/09/20 15:41:15 INFO mapred.JobClient: SLOTS_MILLIS_MAPS=7371
14/09/20 15:41:15 INFO mapred.JobClient: Total time spent by all reduces waiting after reserving slots (ms)=0
14/09/20 15:41:15 INFO mapred.JobClient: Total time spent by all maps waiting after reserving slots (ms)=0
14/09/20 15:41:15 INFO mapred.JobClient: Launched map tasks=1
14/09/20 15:41:15 INFO mapred.JobClient: Data-local map tasks=1
14/09/20 15:41:15 INFO mapred.JobClient: SLOTS_MILLIS_REDUCES=9851
14/09/20 15:41:15 INFO mapred.JobClient: File Output Format Counters
14/09/20
15:41:15 INFO mapred.JobClient: Bytes Written=28689283 14/09/20 15:41:15 INFO mapred.JobClient: FileSystemCounters 14/09/20 15:41:15 INFO mapred.JobClient: FILE_BYTES_READ=28437750 14/09/20 15:41:15 INFO mapred.JobClient: HDFS_BYTES_READ=28689437 14/09/20 15:41:15 INFO mapred.JobClient: FILE_BYTES_WRITTEN=56992548 14/09/20 15:41:15 INFO mapred.JobClient: HDFS_BYTES_WRITTEN=28689283 14/09/20 15:41:15 INFO mapred.JobClient: File Input Format Counters 14/09/20 15:41:15 INFO mapred.JobClient: Bytes Read=28689283 14/09/20 15:41:15 INFO mapred.JobClient: Map-Reduce Framework 14/09/20 15:41:15 INFO mapred.JobClient: Map output materialized bytes=28437750 14/09/20 15:41:15 INFO mapred.JobClient: Map input records=18846 14/09/20 15:41:15 INFO mapred.JobClient: Reduce shuffle bytes=28437750 14/09/20 15:41:15 INFO mapred.JobClient: Spilled Records=37692 14/09/20 15:41:15 INFO mapred.JobClient: Map output bytes=28362505 14/09/20 15:41:15 INFO mapred.JobClient: Total committed heap usage (bytes)=160694272 14/09/20 15:41:15 INFO mapred.JobClient: CPU time spent (ms)=3450 14/09/20 15:41:15 INFO mapred.JobClient: Combine input records=0 14/09/20 15:41:15 INFO mapred.JobClient: SPLIT_RAW_BYTES=154 14/09/20 15:41:15 INFO mapred.JobClient: Reduce input records=18846 14/09/20 15:41:15 INFO mapred.JobClient: Reduce input groups=18846 14/09/20 15:41:15 INFO mapred.JobClient: Combine output records=0 14/09/20 15:41:15 INFO mapred.JobClient: Physical memory (bytes) snapshot=213041152 14/09/20 15:41:15 INFO mapred.JobClient: Reduce output records=18846 14/09/20 15:41:15 INFO mapred.JobClient: Virtual memory (bytes) snapshot=700563456 14/09/20 15:41:15 INFO mapred.JobClient: Map output records=18846 14/09/20 15:41:15 INFO common.HadoopUtil: Deleting /tmp/mahout-work-jifeng/20news-vectors/partial-vectors-0 14/09/20 15:41:15 INFO driver.MahoutDriver: Program took 232258 ms (Minutes: 3.8709666666666664) + echo 'Creating training and holdout set with a random 80-20 split of the generated vector 
dataset' Creating training and holdout set with a random 80-20 split of the generated vector dataset + ./bin/mahout split -i /tmp/mahout-work-jifeng/20news-vectors/tfidf-vectors --trainingOutput /tmp/mahout-work-jifeng/20news-train-vectors --testOutput /tmp/mahout-work-jifeng/20news-test-vectors --randomSelectionPct 40 --overwrite --sequenceFiles -xm sequential MAHOUT_LOCAL is not set; adding HADOOP_CONF_DIR to classpath. Warning: $HADOOP_HOME is deprecated. Running on hadoop, using /home/jifeng/hadoop/hadoop-1.2.1/bin/hadoop and HADOOP_CONF_DIR=/home/jifeng/hadoop/hadoop-1.2.1/conf MAHOUT-JOB: /home/jifeng/hadoop/mahout-distribution-0.9/mahout-examples-0.9-job.jar Warning: $HADOOP_HOME is deprecated. 14/09/20 15:41:17 WARN driver.MahoutDriver: No split.props found on classpath, will use command-line arguments only 14/09/20 15:41:18 INFO common.AbstractJob: Command line arguments: {--endPhase=[2147483647], --input=[/tmp/mahout-work-jifeng/20news-vectors/tfidf-vectors], --method=[sequential], --overwrite=null, --randomSelectionPct=[40], --sequenceFiles=null, --startPhase=[0], --tempDir=[temp], --testOutput=[/tmp/mahout-work-jifeng/20news-test-vectors], --trainingOutput=[/tmp/mahout-work-jifeng/20news-train-vectors]} 14/09/20 15:41:19 INFO utils.SplitInput: part-r-00000 has 162419 lines 14/09/20 15:41:19 INFO utils.SplitInput: part-r-00000 test split size is 64968 based on random selection percentage 40 14/09/20 15:41:20 INFO util.NativeCodeLoader: Loaded the native-hadoop library 14/09/20 15:41:20 INFO zlib.ZlibFactory: Successfully loaded & initialized native-zlib library 14/09/20 15:41:20 INFO compress.CodecPool: Got brand-new compressor 14/09/20 15:41:21 INFO compress.CodecPool: Got brand-new compressor 14/09/20 15:41:25 INFO utils.SplitInput: file: part-r-00000, input: 162419 train: 11198, test: 7648 starting at 0 14/09/20 15:41:25 INFO driver.MahoutDriver: Program took 7698 ms (Minutes: 0.1283) + echo 'Training Naive Bayes model' Training Naive Bayes model + 
./bin/mahout trainnb -i /tmp/mahout-work-jifeng/20news-train-vectors -el -o /tmp/mahout-work-jifeng/model -li /tmp/mahout-work-jifeng/labelindex -ow MAHOUT_LOCAL is not set; adding HADOOP_CONF_DIR to classpath. Warning: $HADOOP_HOME is deprecated. Running on hadoop, using /home/jifeng/hadoop/hadoop-1.2.1/bin/hadoop and HADOOP_CONF_DIR=/home/jifeng/hadoop/hadoop-1.2.1/conf MAHOUT-JOB: /home/jifeng/hadoop/mahout-distribution-0.9/mahout-examples-0.9-job.jar Warning: $HADOOP_HOME is deprecated. 14/09/20 15:41:28 WARN driver.MahoutDriver: No trainnb.props found on classpath, will use command-line arguments only 14/09/20 15:41:28 INFO common.AbstractJob: Command line arguments: {--alphaI=[1.0], --endPhase=[2147483647], --extractLabels=null, --input=[/tmp/mahout-work-jifeng/20news-train-vectors], --labelIndex=[/tmp/mahout-work-jifeng/labelindex], --output=[/tmp/mahout-work-jifeng/model], --overwrite=null, --startPhase=[0], --tempDir=[temp]} 14/09/20 15:41:28 INFO util.NativeCodeLoader: Loaded the native-hadoop library 14/09/20 15:41:28 INFO zlib.ZlibFactory: Successfully loaded & initialized native-zlib library 14/09/20 15:41:28 INFO compress.CodecPool: Got brand-new decompressor 14/09/20 15:41:34 INFO input.FileInputFormat: Total input paths to process : 1 14/09/20 15:41:37 INFO mapred.JobClient: Running job: job_201409201505_0011 14/09/20 15:41:38 INFO mapred.JobClient: map 0% reduce 0% 14/09/20 15:41:48 INFO mapred.JobClient: map 100% reduce 0% 14/09/20 15:41:56 INFO mapred.JobClient: map 100% reduce 33% 14/09/20 15:41:57 INFO mapred.JobClient: map 100% reduce 100% 14/09/20 15:41:58 INFO mapred.JobClient: Job complete: job_201409201505_0011 14/09/20 15:41:58 INFO mapred.JobClient: Counters: 29 14/09/20 15:41:58 INFO mapred.JobClient: Job Counters 14/09/20 15:41:58 INFO mapred.JobClient: Launched reduce tasks=1 14/09/20 15:41:58 INFO mapred.JobClient: SLOTS_MILLIS_MAPS=5959 14/09/20 15:41:58 INFO mapred.JobClient: Total time spent by all reduces waiting after reserving 
slots (ms)=0 14/09/20 15:41:58 INFO mapred.JobClient: Total time spent by all maps waiting after reserving slots (ms)=0 14/09/20 15:41:58 INFO mapred.JobClient: Launched map tasks=1 14/09/20 15:41:58 INFO mapred.JobClient: Data-local map tasks=1 14/09/20 15:41:58 INFO mapred.JobClient: SLOTS_MILLIS_REDUCES=8506 14/09/20 15:41:58 INFO mapred.JobClient: File Output Format Counters 14/09/20 15:41:58 INFO mapred.JobClient: Bytes Written=2736814 14/09/20 15:41:58 INFO mapred.JobClient: FileSystemCounters 14/09/20 15:41:58 INFO mapred.JobClient: FILE_BYTES_READ=1515825 14/09/20 15:41:58 INFO mapred.JobClient: HDFS_BYTES_READ=12691767 14/09/20 15:41:58 INFO mapred.JobClient: FILE_BYTES_WRITTEN=3149124 14/09/20 15:41:58 INFO mapred.JobClient: HDFS_BYTES_WRITTEN=2736814 14/09/20 15:41:58 INFO mapred.JobClient: File Input Format Counters 14/09/20 15:41:58 INFO mapred.JobClient: Bytes Read=12691625 14/09/20 15:41:58 INFO mapred.JobClient: Map-Reduce Framework 14/09/20 15:41:58 INFO mapred.JobClient: Map output materialized bytes=1515143 14/09/20 15:41:58 INFO mapred.JobClient: Map input records=11198 14/09/20 15:41:58 INFO mapred.JobClient: Reduce shuffle bytes=1515143 14/09/20 15:41:58 INFO mapred.JobClient: Spilled Records=40 14/09/20 15:41:58 INFO mapred.JobClient: Map output bytes=16617381 14/09/20 15:41:58 INFO mapred.JobClient: Total committed heap usage (bytes)=219086848 14/09/20 15:41:58 INFO mapred.JobClient: CPU time spent (ms)=2870 14/09/20 15:41:58 INFO mapred.JobClient: Combine input records=11198 14/09/20 15:41:58 INFO mapred.JobClient: SPLIT_RAW_BYTES=142 14/09/20 15:41:58 INFO mapred.JobClient: Reduce input records=20 14/09/20 15:41:58 INFO mapred.JobClient: Reduce input groups=20 14/09/20 15:41:58 INFO mapred.JobClient: Combine output records=20 14/09/20 15:41:58 INFO mapred.JobClient: Physical memory (bytes) snapshot=204357632 14/09/20 15:41:58 INFO mapred.JobClient: Reduce output records=20 14/09/20 15:41:58 INFO mapred.JobClient: Virtual memory (bytes) 
snapshot=701775872 14/09/20 15:41:58 INFO mapred.JobClient: Map output records=11198 14/09/20 15:41:58 INFO input.FileInputFormat: Total input paths to process : 1 14/09/20 15:41:58 INFO mapred.JobClient: Running job: job_201409201505_0012 14/09/20 15:41:59 INFO mapred.JobClient: map 0% reduce 0% 14/09/20 15:42:06 INFO mapred.JobClient: map 100% reduce 0% 14/09/20 15:42:13 INFO mapred.JobClient: map 100% reduce 33% 14/09/20 15:42:14 INFO mapred.JobClient: map 100% reduce 100% 14/09/20 15:42:14 INFO mapred.JobClient: Job complete: job_201409201505_0012 14/09/20 15:42:14 INFO mapred.JobClient: Counters: 29 14/09/20 15:42:14 INFO mapred.JobClient: Job Counters 14/09/20 15:42:14 INFO mapred.JobClient: Launched reduce tasks=1 14/09/20 15:42:14 INFO mapred.JobClient: SLOTS_MILLIS_MAPS=3188 14/09/20 15:42:14 INFO mapred.JobClient: Total time spent by all reduces waiting after reserving slots (ms)=0 14/09/20 15:42:14 INFO mapred.JobClient: Total time spent by all maps waiting after reserving slots (ms)=0 14/09/20 15:42:14 INFO mapred.JobClient: Launched map tasks=1 14/09/20 15:42:14 INFO mapred.JobClient: Data-local map tasks=1 14/09/20 15:42:14 INFO mapred.JobClient: SLOTS_MILLIS_REDUCES=8219 14/09/20 15:42:14 INFO mapred.JobClient: File Output Format Counters 14/09/20 15:42:14 INFO mapred.JobClient: Bytes Written=899207 14/09/20 15:42:14 INFO mapred.JobClient: FileSystemCounters 14/09/20 15:42:14 INFO mapred.JobClient: FILE_BYTES_READ=444260 14/09/20 15:42:14 INFO mapred.JobClient: HDFS_BYTES_READ=2736948 14/09/20 15:42:14 INFO mapred.JobClient: FILE_BYTES_WRITTEN=1007696 14/09/20 15:42:14 INFO mapred.JobClient: HDFS_BYTES_WRITTEN=899207 14/09/20 15:42:14 INFO mapred.JobClient: File Input Format Counters 14/09/20 15:42:14 INFO mapred.JobClient: Bytes Read=2736814 14/09/20 15:42:14 INFO mapred.JobClient: Map-Reduce Framework 14/09/20 15:42:14 INFO mapred.JobClient: Map output materialized bytes=444252 14/09/20 15:42:14 INFO mapred.JobClient: Map input records=20 14/09/20 
15:42:14 INFO mapred.JobClient: Reduce shuffle bytes=444252 14/09/20 15:42:14 INFO mapred.JobClient: Spilled Records=4 14/09/20 15:42:14 INFO mapred.JobClient: Map output bytes=899081 14/09/20 15:42:14 INFO mapred.JobClient: Total committed heap usage (bytes)=225873920 14/09/20 15:42:14 INFO mapred.JobClient: CPU time spent (ms)=960 14/09/20 15:42:14 INFO mapred.JobClient: Combine input records=2 14/09/20 15:42:14 INFO mapred.JobClient: SPLIT_RAW_BYTES=134 14/09/20 15:42:14 INFO mapred.JobClient: Reduce input records=2 14/09/20 15:42:14 INFO mapred.JobClient: Reduce input groups=2 14/09/20 15:42:14 INFO mapred.JobClient: Combine output records=2 14/09/20 15:42:14 INFO mapred.JobClient: Physical memory (bytes) snapshot=224452608 14/09/20 15:42:14 INFO mapred.JobClient: Reduce output records=2 14/09/20 15:42:14 INFO mapred.JobClient: Virtual memory (bytes) snapshot=701775872 14/09/20 15:42:14 INFO mapred.JobClient: Map output records=2 14/09/20 15:42:15 INFO driver.MahoutDriver: Program took 47200 ms (Minutes: 0.7866666666666666) + echo 'Self testing on training set' Self testing on training set + ./bin/mahout testnb -i /tmp/mahout-work-jifeng/20news-train-vectors -m /tmp/mahout-work-jifeng/model -l /tmp/mahout-work-jifeng/labelindex -ow -o /tmp/mahout-work-jifeng/20news-testing MAHOUT_LOCAL is not set; adding HADOOP_CONF_DIR to classpath. Warning: $HADOOP_HOME is deprecated. Running on hadoop, using /home/jifeng/hadoop/hadoop-1.2.1/bin/hadoop and HADOOP_CONF_DIR=/home/jifeng/hadoop/hadoop-1.2.1/conf MAHOUT-JOB: /home/jifeng/hadoop/mahout-distribution-0.9/mahout-examples-0.9-job.jar Warning: $HADOOP_HOME is deprecated. 
14/09/20 15:42:17 WARN driver.MahoutDriver: No testnb.props found on classpath, will use command-line arguments only 14/09/20 15:42:17 INFO common.AbstractJob: Command line arguments: {--endPhase=[2147483647], --input=[/tmp/mahout-work-jifeng/20news-train-vectors], --labelIndex=[/tmp/mahout-work-jifeng/labelindex], --model=[/tmp/mahout-work-jifeng/model], --output=[/tmp/mahout-work-jifeng/20news-testing], --overwrite=null, --startPhase=[0], --tempDir=[temp]} 14/09/20 15:42:20 INFO input.FileInputFormat: Total input paths to process : 1 14/09/20 15:42:21 INFO mapred.JobClient: Running job: job_201409201505_0013 14/09/20 15:42:22 INFO mapred.JobClient: map 0% reduce 0% 14/09/20 15:42:35 INFO mapred.JobClient: map 100% reduce 0% 14/09/20 15:42:35 INFO mapred.JobClient: Job complete: job_201409201505_0013 14/09/20 15:42:35 INFO mapred.JobClient: Counters: 20 14/09/20 15:42:35 INFO mapred.JobClient: Job Counters 14/09/20 15:42:35 INFO mapred.JobClient: SLOTS_MILLIS_MAPS=9283 14/09/20 15:42:35 INFO mapred.JobClient: Total time spent by all reduces waiting after reserving slots (ms)=0 14/09/20 15:42:35 INFO mapred.JobClient: Total time spent by all maps waiting after reserving slots (ms)=0 14/09/20 15:42:35 INFO mapred.JobClient: Launched map tasks=1 14/09/20 15:42:35 INFO mapred.JobClient: Data-local map tasks=1 14/09/20 15:42:35 INFO mapred.JobClient: SLOTS_MILLIS_REDUCES=0 14/09/20 15:42:35 INFO mapred.JobClient: File Output Format Counters 14/09/20 15:42:35 INFO mapred.JobClient: Bytes Written=2109460 14/09/20 15:42:35 INFO mapred.JobClient: FileSystemCounters 14/09/20 15:42:35 INFO mapred.JobClient: FILE_BYTES_READ=3663744 14/09/20 15:42:35 INFO mapred.JobClient: HDFS_BYTES_READ=12691767 14/09/20 15:42:35 INFO mapred.JobClient: FILE_BYTES_WRITTEN=58876 14/09/20 15:42:35 INFO mapred.JobClient: HDFS_BYTES_WRITTEN=2109460 14/09/20 15:42:35 INFO mapred.JobClient: File Input Format Counters 14/09/20 15:42:35 INFO mapred.JobClient: Bytes Read=12691625 14/09/20 15:42:35 
INFO mapred.JobClient: Map-Reduce Framework
14/09/20 15:42:35 INFO mapred.JobClient: Map input records=11198
14/09/20 15:42:35 INFO mapred.JobClient: Physical memory (bytes) snapshot=57454592
14/09/20 15:42:35 INFO mapred.JobClient: Spilled Records=0
14/09/20 15:42:35 INFO mapred.JobClient: CPU time spent (ms)=5310
14/09/20 15:42:35 INFO mapred.JobClient: Total committed heap usage (bytes)=29954048
14/09/20 15:42:35 INFO mapred.JobClient: Virtual memory (bytes) snapshot=349515776
14/09/20 15:42:35 INFO mapred.JobClient: Map output records=11198
14/09/20 15:42:35 INFO mapred.JobClient: SPLIT_RAW_BYTES=142
14/09/20 15:42:36 INFO test.TestNaiveBayesDriver: Standard NB Results:
=======================================================
Summary
-------------------------------------------------------
Correctly Classified Instances : 11119 99.2945%
Incorrectly Classified Instances : 79 0.7055%
Total Classified Instances : 11198
=======================================================
Confusion Matrix
-------------------------------------------------------
a b c d e f g h i j k l m n o p q r s t <--Classified as
434 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 | 435 a = alt.atheism
0 555 0 2 1 3 0 0 0 0 0 1 0 1 0 0 0 0 0 0 | 563 b = comp.graphics
0 6 553 18 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 | 579 c = comp.os.ms-windows.misc
0 0 0 564 1 0 3 0 0 0 0 0 0 0 0 0 0 0 0 0 | 568 d = comp.sys.ibm.pc.hardware
0 0 1 0 573 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 | 575 e = comp.sys.mac.hardware
0 1 1 0 0 585 0 0 0 0 0 0 0 0 0 0 0 0 0 0 | 587 f = comp.windows.x
0 0 0 1 0 0 582 1 0 0 0 0 2 0 0 0 0 0 0 0 | 586 g = misc.forsale
0 0 0 0 1 0 1 613 0 0 0 0 0 0 0 0 0 0 0 1 | 616 h = rec.autos
0 0 0 0 0 0 1 0 603 0 0 0 0 0 0 0 0 0 0 0 | 604 i = rec.motorcycles
0 0 0 0 0 0 0 0 0 595 0 1 0 0 0 0 0 0 0 0 | 596 j = rec.sport.baseball
0 0 0 0 0 0 0 0 0 0 584 0 0 0 0 0 0 0 0 1 | 585 k = rec.sport.hockey
0 0 0 0 0 0 0 0 0 0 0 583 1 0 0 0 0 0 0 1 | 585 l = sci.crypt
0 0 0 2 0 0 0 0 0 0 0 0 584 0 0 0 0 0 0 0 | 586 m = sci.electronics
0 1 0 0 0 0 0 0 0 0 0 0 1 570 0 0 0 0 0 0 | 572 n = sci.med
0 0 0 0 0 0 0 0 0 0 0 0 0 1 617 0 0 0 0 0 | 618 o = sci.space
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 592 1 0 0 0 | 593 p = soc.religion.christian
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 565 0 0 0 | 565 q = talk.politics.mideast
0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 544 0 0 | 545 r = talk.politics.guns
7 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 1 359 1 | 370 s = talk.religion.misc
0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 5 0 464 | 470 t = talk.politics.misc
=======================================================
Statistics
-------------------------------------------------------
Kappa 0.987
Accuracy 99.2945%
Reliability 94.5236%
Reliability (standard deviation) 0.2169
14/09/20 15:42:36 INFO driver.MahoutDriver: Program took 18730 ms (Minutes: 0.31216666666666665)
+ echo 'Testing on holdout set'
Testing on holdout set
+ ./bin/mahout testnb -i /tmp/mahout-work-jifeng/20news-test-vectors -m /tmp/mahout-work-jifeng/model -l /tmp/mahout-work-jifeng/labelindex -ow -o /tmp/mahout-work-jifeng/20news-testing
MAHOUT_LOCAL is not set; adding HADOOP_CONF_DIR to classpath.
Warning: $HADOOP_HOME is deprecated.
Running on hadoop, using /home/jifeng/hadoop/hadoop-1.2.1/bin/hadoop and HADOOP_CONF_DIR=/home/jifeng/hadoop/hadoop-1.2.1/conf
MAHOUT-JOB: /home/jifeng/hadoop/mahout-distribution-0.9/mahout-examples-0.9-job.jar
Warning: $HADOOP_HOME is deprecated.
14/09/20 15:42:39 WARN driver.MahoutDriver: No testnb.props found on classpath, will use command-line arguments only 14/09/20 15:42:39 INFO common.AbstractJob: Command line arguments: {--endPhase=[2147483647], --input=[/tmp/mahout-work-jifeng/20news-test-vectors], --labelIndex=[/tmp/mahout-work-jifeng/labelindex], --model=[/tmp/mahout-work-jifeng/model], --output=[/tmp/mahout-work-jifeng/20news-testing], --overwrite=null, --startPhase=[0], --tempDir=[temp]} 14/09/20 15:42:39 INFO common.HadoopUtil: Deleting /tmp/mahout-work-jifeng/20news-testing 14/09/20 15:42:41 INFO input.FileInputFormat: Total input paths to process : 1 14/09/20 15:42:41 INFO mapred.JobClient: Running job: job_201409201505_0014 14/09/20 15:42:42 INFO mapred.JobClient: map 0% reduce 0% 14/09/20 15:42:59 INFO mapred.JobClient: map 100% reduce 0% 14/09/20 15:42:59 INFO mapred.JobClient: Job complete: job_201409201505_0014 14/09/20 15:42:59 INFO mapred.JobClient: Counters: 20 14/09/20 15:42:59 INFO mapred.JobClient: Job Counters 14/09/20 15:42:59 INFO mapred.JobClient: SLOTS_MILLIS_MAPS=8035 14/09/20 15:42:59 INFO mapred.JobClient: Total time spent by all reduces waiting after reserving slots (ms)=0 14/09/20 15:42:59 INFO mapred.JobClient: Total time spent by all maps waiting after reserving slots (ms)=0 14/09/20 15:42:59 INFO mapred.JobClient: Launched map tasks=1 14/09/20 15:42:59 INFO mapred.JobClient: Data-local map tasks=1 14/09/20 15:42:59 INFO mapred.JobClient: SLOTS_MILLIS_REDUCES=0 14/09/20 15:42:59 INFO mapred.JobClient: File Output Format Counters 14/09/20 15:42:59 INFO mapred.JobClient: Bytes Written=1440968 14/09/20 15:42:59 INFO mapred.JobClient: FileSystemCounters 14/09/20 15:42:59 INFO mapred.JobClient: FILE_BYTES_READ=3663744 14/09/20 15:42:59 INFO mapred.JobClient: HDFS_BYTES_READ=8662748 14/09/20 15:42:59 INFO mapred.JobClient: FILE_BYTES_WRITTEN=58876 14/09/20 15:42:59 INFO mapred.JobClient: HDFS_BYTES_WRITTEN=1440968 14/09/20 15:42:59 INFO mapred.JobClient: File Input Format 
Counters
14/09/20 15:42:59 INFO mapred.JobClient: Bytes Read=8662607
14/09/20 15:42:59 INFO mapred.JobClient: Map-Reduce Framework
14/09/20 15:42:59 INFO mapred.JobClient: Map input records=7648
14/09/20 15:42:59 INFO mapred.JobClient: Physical memory (bytes) snapshot=58269696
14/09/20 15:42:59 INFO mapred.JobClient: Spilled Records=0
14/09/20 15:42:59 INFO mapred.JobClient: CPU time spent (ms)=4120
14/09/20 15:42:59 INFO mapred.JobClient: Total committed heap usage (bytes)=31711232
14/09/20 15:42:59 INFO mapred.JobClient: Virtual memory (bytes) snapshot=349515776
14/09/20 15:42:59 INFO mapred.JobClient: Map output records=7648
14/09/20 15:42:59 INFO mapred.JobClient: SPLIT_RAW_BYTES=141
14/09/20 15:43:00 INFO test.TestNaiveBayesDriver: Standard NB Results:
=======================================================
Summary
-------------------------------------------------------
Correctly Classified Instances : 6913 90.3896%
Incorrectly Classified Instances : 735 9.6104%
Total Classified Instances : 7648
=======================================================
Confusion Matrix
-------------------------------------------------------
a b c d e f g h i j k l m n o p q r s t <--Classified as
327 0 0 0 1 0 0 0 0 1 0 0 0 1 0 9 0 2 20 3 | 364 a = alt.atheism
0 350 2 22 7 10 8 0 0 1 0 1 2 1 5 0 0 0 1 0 | 410 b = comp.graphics
0 25 254 73 17 20 8 0 0 0 0 1 4 2 0 0 0 0 0 2 | 406 c = comp.os.ms-windows.misc
1 4 2 375 18 3 4 1 0 0 0 0 5 0 0 0 0 0 1 0 | 414 d = comp.sys.ibm.pc.hardware
0 4 3 16 355 0 5 0 0 0 0 1 4 0 0 0 0 0 0 0 | 388 e = comp.sys.mac.hardware
0 28 0 7 8 348 2 0 1 1 0 5 0 0 1 0 0 0 0 0 | 401 f = comp.windows.x
1 5 1 15 9 0 330 11 4 0 3 0 5 2 1 0 0 0 1 1 | 389 g = misc.forsale
0 1 0 0 1 1 7 357 3 0 0 0 1 0 2 0 0 1 0 0 | 374 h = rec.autos
0 0 0 0 0 0 0 5 386 0 0 0 0 1 0 0 0 0 0 0 | 392 i = rec.motorcycles
0 0 0 1 2 0 1 0 1 389 3 0 1 0 0 0 0 0 0 0 | 398 j = rec.sport.baseball
0 0 0 1 0 0 0 0 0 5 405 0 2 0 0 1 0 0 0 0 | 414 k = rec.sport.hockey
1 2 1 0 3 2 1 0 0 0 0 386 1 1 0 0 0 4 1 3 | 406 l = sci.crypt
0 3 0 14 7 1 10 6 0 0 1 2 350 1 1 0 0 1 0 1 | 398 m = sci.electronics
1 1 1 1 2 0 6 3 0 2 1 2 2 389 2 0 1 3 1 0 | 418 n = sci.med
0 3 0 0 4 2 1 0 0 0 0 1 3 1 346 0 0 2 3 3 | 369 o = sci.space
2 3 0 1 0 0 0 0 0 1 0 0 1 4 0 382 0 1 8 1 | 404 p = soc.religion.christian
1 1 0 0 0 0 0 0 0 0 1 0 0 0 0 3 367 0 0 2 | 375 q = talk.politics.mideast
0 0 1 0 0 0 0 2 2 0 0 2 0 0 1 0 0 347 1 9 | 365 r = talk.politics.guns
24 1 0 0 0 1 1 0 0 0 1 0 0 0 0 17 3 7 196 7 | 258 s = talk.religion.misc
1 0 0 0 2 0 0 1 0 0 1 1 1 0 2 2 3 15 2 274 | 305 t = talk.politics.misc
=======================================================
Statistics
-------------------------------------------------------
Kappa 0.8758
Accuracy 90.3896%
Reliability 85.9085%
Reliability (standard deviation) 0.2138
14/09/20 15:43:00 INFO driver.MahoutDriver: Program took 21261 ms (Minutes: 0.35435)
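
The figures in the log can be cross-checked with a few lines of arithmetic. The sketch below is not part of the Mahout example; it is a hypothetical helper (the names `pct` and `summarize` are made up for illustration) that only reuses counts printed in the log. It also makes one detail visible: the script's echo message announces an "80-20 split", but the `split` command is invoked with `--randomSelectionPct 40`, and the reported counts (11198 training, 7648 test) correspond to a holdout share of roughly 40%.

```python
# Counts copied verbatim from the log output above; nothing here calls Mahout.
TRAIN_DOCS = 11198      # "train: 11198" reported by utils.SplitInput
TEST_DOCS = 7648        # "test: 7648" reported by utils.SplitInput
TOTAL_DOCS = 18846      # "Map input records=18846" in the vectorization jobs

TRAIN_CORRECT = 11119   # "Correctly Classified Instances" in the self test
TEST_CORRECT = 6913     # "Correctly Classified Instances" on the holdout set


def pct(part, whole):
    """Return part/whole as a percentage rounded to 4 decimal places."""
    return round(100.0 * part / whole, 4)


def summarize():
    """Recompute the split share and both accuracies from the raw counts."""
    return {
        "split_consistent": TRAIN_DOCS + TEST_DOCS == TOTAL_DOCS,
        "holdout_share_pct": pct(TEST_DOCS, TOTAL_DOCS),
        "train_accuracy_pct": pct(TRAIN_CORRECT, TRAIN_DOCS),
        "holdout_accuracy_pct": pct(TEST_CORRECT, TEST_DOCS),
    }


if __name__ == "__main__":
    for name, value in summarize().items():
        print(f"{name}: {value}")
```

The gap between the two accuracies (about 99.3% on the training set versus about 90.4% on the holdout set) is the expected pattern: the self test scores documents the model has already seen, so the holdout number is the more meaningful estimate of classifier quality.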