
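Mahout is an open-source project of the Apache Software Foundation (ASF). It provides scalable implementations of classic machine learning algorithms and aims to help developers build intelligent applications more quickly and easily. The Apache Mahout project is now in its third year and has had three public releases. Mahout includes many implementations, covering clustering, classification, recommendation (collaborative filtering), and frequent itemset mining. In addition, by using the Apache Hadoop library, Mahout can scale effectively in the cloud.
Grant Ingersoll, Mahout's founder, has introduced the basic concepts of machine learning and demonstrated how to use Mahout to cluster documents, make recommendations, and organize content.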





[jifeng@jifeng01 ~]$ cd hadoop
[jifeng@jifeng01 hadoop]$ ls
074600-99999-2013.gz  hadoop-1.2.1.tar.gz             mahout-distribution-0.9.tar.gz
awk                   java                            sample.txt
hadoop-1.2.1          tmp
[jifeng@jifeng01 hadoop]$ tar zxf mahout-distribution-0.9.tar.gz  

[jifeng@jifeng01 hadoop]$ ls
074600-99999-2013.gz  hadoop-1.2.1         java                     mahout-distribution-0.9.tar.gz  tmp
awk                   hadoop-1.2.1.tar.gz  mahout-distribution-0.9  sample.txt



[jifeng@jifeng01 hadoop]$ cd ..
[jifeng@jifeng01 ~]$ cat .bash_profile
# .bash_profile

# Get the aliases and functions
if [ -f ~/.bashrc ]; then
        . ~/.bashrc
fi

# User specific environment and startup programs


export PATH
export JAVA_HOME=$HOME/jdk1.7.0_45
export PATH=$JAVA_HOME/bin:$PATH
export CLASSPATH=.:$JAVA_HOME/lib/dt.jar:$JAVA_HOME/lib/tools.jar
export HADOOP_HOME=$HOME/hadoop/hadoop-1.2.1
export HADOOP_CONF_DIR=$HOME/hadoop/hadoop-1.2.1/conf
export ANT_HOME=$HOME/apache-ant-1.9.4
export HBASE_HOME=$HOME/hbase-0.94.21
export SQOOP_HOME=$HOME/sqoop-1.99.3-bin-hadoop100
export LOGDIR=$SQOOP_HOME/logs

export MAHOUT_HOME=$HOME/hadoop/mahout-distribution-0.9
export MAHOUT_CONF_DIR=$HOME/hadoop/mahout-distribution-0.9/conf

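JAVA_HOME        The JDK directory; required for Mahout to run.
HADOOP_HOME      If set, Mahout runs on the Hadoop distributed platform; otherwise it runs on a single machine.
MAHOUT_LOCAL     If set to a non-empty value, Mahout runs locally on a single machine.
MAHOUT_CONF_DIR  Path to the Mahout configuration files; defaults to $MAHOUT_HOME/src/conf.
MAHOUT_HEAPSIZE  Maximum heap size available to Mahout at run time.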
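
As a hedged illustration of the variables described above, the following fragment for ~/.bash_profile puts Mahout's bin directory on PATH and caps the heap. The PATH line and the value 2048 are assumptions and do not come from the profile shown earlier; the launcher script is understood to read MAHOUT_HEAPSIZE as a number of megabytes.

```shell
# Sketch only: the paths follow the directory layout used in this post.
export MAHOUT_HOME="$HOME/hadoop/mahout-distribution-0.9"
export MAHOUT_CONF_DIR="$MAHOUT_HOME/conf"
# Assumption: make the `mahout` launcher callable from any directory.
export PATH="$MAHOUT_HOME/bin:$PATH"
# Assumption: example value only; limits the JVM heap that Mahout may use.
export MAHOUT_HEAPSIZE=2048
```

After editing the profile, run `source ~/.bash_profile` (or log in again) so the current shell picks up the new values.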



[jifeng@jifeng01 ~]$ mahout 
MAHOUT_LOCAL is not set; adding HADOOP_CONF_DIR to classpath.
Warning: $HADOOP_HOME is deprecated.

Running on hadoop, using /home/jifeng/hadoop/hadoop-1.2.1/bin/hadoop and HADOOP_CONF_DIR=/home/jifeng/hadoop/hadoop-1.2.1/conf
MAHOUT-JOB: /home/jifeng/hadoop/mahout-distribution-0.9/mahout-examples-0.9-job.jar
Warning: $HADOOP_HOME is deprecated.

An example program must be given as the first argument.
Valid program names are:
  arff.vector: : Generate Vectors from an ARFF file or directory
  baumwelch: : Baum-Welch algorithm for unsupervised HMM training
  canopy: : Canopy clustering
  cat: : Print a file or resource as the logistic regression models would see it
  cleansvd: : Cleanup and verification of SVD output
  clusterdump: : Dump cluster output to text
  clusterpp: : Groups Clustering Output In Clusters
  cmdump: : Dump confusion matrix in HTML or text formats
  concatmatrices: : Concatenates 2 matrices of same cardinality into a single matrix
  cvb: : LDA via Collapsed Variation Bayes (0th deriv. approx)
  cvb0_local: : LDA via Collapsed Variation Bayes, in memory locally.
  evaluateFactorization: : compute RMSE and MAE of a rating matrix factorization against probes
  fkmeans: : Fuzzy K-means clustering
  hmmpredict: : Generate random sequence of observations by given HMM
  itemsimilarity: : Compute the item-item-similarities for item-based collaborative filtering
  kmeans: : K-means clustering
  lucene.vector: : Generate Vectors from a Lucene index
  lucene2seq: : Generate Text SequenceFiles from a Lucene index
  matrixdump: : Dump matrix in CSV format
  matrixmult: : Take the product of two matrices
  parallelALS: : ALS-WR factorization of a rating matrix
  qualcluster: : Runs clustering experiments and summarizes results in a CSV
  recommendfactorized: : Compute recommendations using the factorization of a rating matrix
  recommenditembased: : Compute recommendations using item-based collaborative filtering
  regexconverter: : Convert text files on a per line basis based on regular expressions
  resplit: : Splits a set of SequenceFiles into a number of equal splits
  rowid: : Map SequenceFile<Text,VectorWritable> to {SequenceFile<IntWritable,VectorWritable>, SequenceFile<IntWritable,Text>}
  rowsimilarity: : Compute the pairwise similarities of the rows of a matrix
  runAdaptiveLogistic: : Score new production data using a probably trained and validated AdaptivelogisticRegression model
  runlogistic: : Run a logistic regression model against CSV data
  seq2encoded: : Encoded Sparse Vector generation from Text sequence files
  seq2sparse: : Sparse Vector generation from Text sequence files
  seqdirectory: : Generate sequence files (of Text) from a directory
  seqdumper: : Generic Sequence File dumper
  seqmailarchives: : Creates SequenceFile from a directory containing gzipped mail archives
  seqwiki: : Wikipedia xml dump to sequence file
  spectralkmeans: : Spectral k-means clustering
  split: : Split Input data into test and train sets
  splitDataset: : split a rating dataset into training and probe parts
  ssvd: : Stochastic SVD
  streamingkmeans: : Streaming k-means clustering
  svd: : Lanczos Singular Value Decomposition
  testnb: : Test the Vector-based Bayes classifier
  trainAdaptiveLogistic: : Train an AdaptivelogisticRegression model
  trainlogistic: : Train a logistic regression using stochastic gradient descent
  trainnb: : Train the Vector-based Bayes classifier
  transpose: : Take the transpose of a matrix
  validateAdaptiveLogistic: : Validate an AdaptivelogisticRegression model against hold-out data set
  vecdist: : Compute the distances between a set of Vectors (or Cluster or Canopy, they must fit in memory) and a list of Vectors
  vectordump: : Dump vectors from a sequence file to text
  viterbi: : Viterbi decoding of hidden states from given output states sequence
[jifeng@jifeng01 ~]$ ls

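MAHOUT_LOCAL: controls whether Mahout runs locally. If this variable is set, Mahout does not run on Hadoop, and the HADOOP_CONF_DIR and HADOOP_HOME settings are ignored. If it is not set and no hadoop binary can be found, Mahout also falls back to running locally and prints: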

MAHOUT_LOCAL is not set; adding HADOOP_CONF_DIR to classpath.
hadoop binary is not in PATH,HADOOP_HOME/bin,HADOOP_PREFIX/bin, running locally
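
The choice between local and Hadoop execution described above can be sketched as a small shell function. This is a simplified illustration and not the actual bin/mahout script: the function name `mahout_run_mode` is hypothetical, and it only looks under HADOOP_HOME, whereas the message above shows that the real launcher also searches PATH and HADOOP_PREFIX.

```shell
# Simplified sketch of the decision, NOT the real bin/mahout logic.
# Prints "local" or "hadoop" depending on MAHOUT_LOCAL and on whether an
# executable hadoop launcher exists under $HADOOP_HOME/bin.
mahout_run_mode() {
    if [ -n "${MAHOUT_LOCAL:-}" ]; then
        echo "local"     # any non-empty value forces local mode
    elif [ -n "${HADOOP_HOME:-}" ] && [ -x "$HADOOP_HOME/bin/hadoop" ]; then
        echo "hadoop"    # a usable Hadoop launcher was found
    else
        echo "local"     # no Hadoop launcher, so fall back to local mode
    fi
}
```

Calling `mahout_run_mode` before starting a long job is a quick way to check which mode the current environment would select under these simplified rules.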



[jifeng@jifeng01 ~]$ cd hadoop
[jifeng@jifeng01 hadoop]$ wget
--2014-08-26 18:07:17--
Connecting to||:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 288374 (282K) [text/plain]
Saving to: `'

100%[======================================================================================>] 288,374     22.2K/s   in 30s     

2014-08-26 18:07:47 (9.46 KB/s) - `' saved [288374/288374]

[jifeng@jifeng01 hadoop]$ 


[jifeng@jifeng01 hadoop]$ cd hadoop-1.2.1
[jifeng@jifeng01 hadoop-1.2.1]$ bin/hadoop fs -mkdir ./testdata
Warning: $HADOOP_HOME is deprecated.

[jifeng@jifeng01 hadoop-1.2.1]$ bin/hadoop fs -ls ./testdata   
Warning: $HADOOP_HOME is deprecated.

[jifeng@jifeng01 hadoop-1.2.1]$ bin/hadoop fs -ls ./
Warning: $HADOOP_HOME is deprecated.

Found 10 items
-rw-r--r--   1 jifeng        supergroup         20 2014-08-06 10:30 /user/jifeng/
-rw-r--r--   1 jifeng        supergroup         29 2014-08-05 17:40 /user/jifeng/demo.txt
-rw-r--r--   3 jifeng        supergroup         24 2014-08-06 17:28 /user/jifeng/demo_c.txt
drwxr-xr-x   - jifeng        supergroup          0 2015-08-18 10:27 /user/jifeng/hbase
drwxr-xr-x   - jifeng        supergroup          0 2014-08-14 13:23 /user/jifeng/in
drwxr-xr-x   - jifeng        supergroup          0 2014-07-24 19:27 /user/jifeng/out
-rw-r--r--   1 jifeng        supergroup       1526 2014-08-06 11:36 /user/jifeng/session.log
drwxr-xr-x   - jifeng        supergroup          0 2014-08-26 18:11 /user/jifeng/testdata
[jifeng@jifeng01 hadoop-1.2.1]$ 


[jifeng@jifeng01 hadoop-1.2.1]$ bin/hadoop fs -put /home/jifeng/hadoop/ ./testdata
Warning: $HADOOP_HOME is deprecated.

[jifeng@jifeng01 hadoop-1.2.1]$ bin/hadoop fs -ls ./testdata                                            
Warning: $HADOOP_HOME is deprecated.

Found 1 items
-rw-r--r--   1 jifeng supergroup     288374 2014-08-26 18:17 /user/jifeng/testdata/
[jifeng@jifeng01 hadoop-1.2.1]$ 
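
The upload steps above (create the HDFS directory, copy the file in, list the result) can be wrapped in one helper. This is a minimal sketch: the function name `upload_to_hdfs` and the `HADOOP_BIN` override are hypothetical, and the local file path is whatever data file was downloaded earlier.

```shell
# Sketch: create an HDFS directory, upload one local file, then list it.
# HADOOP_BIN is a hypothetical override; it defaults to `hadoop` on PATH.
upload_to_hdfs() {
    local file="$1" dir="${2:-testdata}"
    local hadoop_bin="${HADOOP_BIN:-hadoop}"
    "$hadoop_bin" fs -mkdir "$dir" &&
    "$hadoop_bin" fs -put "$file" "$dir" &&
    "$hadoop_bin" fs -ls "$dir"
}
```

A relative HDFS path such as `testdata` resolves under the user's HDFS home directory, which is why the listing above shows /user/jifeng/testdata.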

Run the example job: mahout org.apache.mahout.clustering.syntheticcontrol.kmeans.Job


[jifeng@jifeng01 hadoop-1.2.1]$ cd ..
[jifeng@jifeng01 hadoop]$ mahout org.apache.mahout.clustering.syntheticcontrol.kmeans.Job
MAHOUT_LOCAL is not set; adding HADOOP_CONF_DIR to classpath.
Warning: $HADOOP_HOME is deprecated.

Running on hadoop, using /home/jifeng/hadoop/hadoop-1.2.1/bin/hadoop and HADOOP_CONF_DIR=/home/jifeng/hadoop/hadoop-1.2.1/conf
MAHOUT-JOB: /home/jifeng/hadoop/mahout-distribution-0.9/mahout-examples-0.9-job.jar
Warning: $HADOOP_HOME is deprecated.

14/08/26 18:22:37 WARN driver.MahoutDriver: No org.apache.mahout.clustering.syntheticcontrol.kmeans.Job.props found on classpath, will use command-line arguments only
14/08/26 18:22:37 INFO kmeans.Job: Running with default arguments
14/08/26 18:22:38 INFO kmeans.Job: Preparing Input
14/08/26 18:22:38 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
14/08/26 18:22:47 INFO input.FileInputFormat: Total input paths to process : 1
14/08/26 18:22:47 INFO util.NativeCodeLoader: Loaded the native-hadoop library
14/08/26 18:22:47 WARN snappy.LoadSnappy: Snappy native library not loaded
14/08/26 18:22:47 INFO mapred.JobClient: Running job: job_201408221126_0001
14/08/26 18:22:48 INFO mapred.JobClient:  map 0% reduce 0%
14/08/26 18:22:55 INFO mapred.JobClient:  map 100% reduce 0%
14/08/26 18:22:55 INFO mapred.JobClient: Job complete: job_201408221126_0001


        1.0 : [distance=36.3968678356675]: 60 = [28.487, 34.156, 28.659, 35.527, 35.250, 30.795, 31.318, 28.034, 25.375, 33.610, 29.723, 28.462, 24.683, 35.245, 34.151, 35.341, 27.724, 27.502, 28.539, 32.302, 30.586, 30.747, 28.180, 27.085, 35.723, 26.787, 29.890, 28.044, 27.663, 27.240, 25.005, 24.731, 32.778, 46.403, 47.259, 45.380, 45.677, 40.544, 39.214, 43.374, 42.056, 43.683, 42.560, 38.311, 44.878, 40.559, 46.917, 43.313, 38.759, 47.516, 38.562, 47.779, 36.325, 42.066, 44.773, 48.305, 47.137, 39.604, 37.563, 44.185]
14/08/26 18:26:31 INFO clustering.ClusterDumper: Wrote 6 clusters
14/08/26 18:26:31 INFO driver.MahoutDriver: Program took 234101 ms (Minutes: 3.901683333333333)


[jifeng@jifeng01 hadoop]$ hadoop fs -ls ./output
Warning: $HADOOP_HOME is deprecated.

Found 15 items
-rw-r--r--   1 jifeng supergroup        194 2014-08-26 18:26 /user/jifeng/output/_policy
drwxrwxrwx   - jifeng supergroup          0 2014-08-26 18:26 /user/jifeng/output/clusteredPoints
drwxrwxrwx   - jifeng supergroup          0 2014-08-26 18:22 /user/jifeng/output/clusters-0
drwxrwxrwx   - jifeng supergroup          0 2014-08-26 18:23 /user/jifeng/output/clusters-1
drwxrwxrwx   - jifeng supergroup          0 2014-08-26 18:26 /user/jifeng/output/clusters-10-final
drwxrwxrwx   - jifeng supergroup          0 2014-08-26 18:23 /user/jifeng/output/clusters-2
drwxrwxrwx   - jifeng supergroup          0 2014-08-26 18:23 /user/jifeng/output/clusters-3
drwxrwxrwx   - jifeng supergroup          0 2014-08-26 18:24 /user/jifeng/output/clusters-4
drwxrwxrwx   - jifeng supergroup          0 2014-08-26 18:24 /user/jifeng/output/clusters-5
drwxrwxrwx   - jifeng supergroup          0 2014-08-26 18:24 /user/jifeng/output/clusters-6
drwxrwxrwx   - jifeng supergroup          0 2014-08-26 18:25 /user/jifeng/output/clusters-7
drwxrwxrwx   - jifeng supergroup          0 2014-08-26 18:25 /user/jifeng/output/clusters-8
drwxrwxrwx   - jifeng supergroup          0 2014-08-26 18:26 /user/jifeng/output/clusters-9
drwxrwxrwx   - jifeng supergroup          0 2014-08-26 18:22 /user/jifeng/output/data
drwxrwxrwx   - jifeng supergroup          0 2014-08-26 18:22 /user/jifeng/output/random-seeds
[jifeng@jifeng01 hadoop]$ hadoop fs -ls ./output/data
Warning: $HADOOP_HOME is deprecated.

Found 3 items
-rw-r--r--   1 jifeng supergroup          0 2014-08-26 18:22 /user/jifeng/output/data/_SUCCESS
drwxrwxrwx   - jifeng supergroup          0 2014-08-26 18:22 /user/jifeng/output/data/_logs
-rw-r--r--   1 jifeng supergroup     335470 2014-08-26 18:22 /user/jifeng/output/data/part-m-00000
[jifeng@jifeng01 hadoop]$ hadoop fs -ls ./output/clusters-1
Warning: $HADOOP_HOME is deprecated.

Found 4 items
-rw-r--r--   1 jifeng supergroup          0 2014-08-26 18:23 /user/jifeng/output/clusters-1/_SUCCESS
drwxrwxrwx   - jifeng supergroup          0 2014-08-26 18:22 /user/jifeng/output/clusters-1/_logs
-rw-r--r--   1 jifeng supergroup        194 2014-08-26 18:23 /user/jifeng/output/clusters-1/_policy
-rw-r--r--   1 jifeng supergroup       7581 2014-08-26 18:23 /user/jifeng/output/clusters-1/part-r-00000
[jifeng@jifeng01 hadoop]$ 
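
In the listing above, each iteration writes a clusters-N directory and the last one carries a -final suffix. As a hedged sketch, the final directory can be picked out of the `hadoop fs -ls` output with awk. The function name `final_cluster_dir` is hypothetical, and the pattern assumes the naming shown in this run.

```shell
# Sketch: read `hadoop fs -ls` output on stdin and print the path of the
# directory that holds the final clusters (named clusters-<n>-final).
final_cluster_dir() {
    # The path is the last field of each listing line.
    awk '$NF ~ /\/clusters-[0-9]+-final$/ { print $NF }'
}
```

Typical use would be `hadoop fs -ls ./output | final_cluster_dir`; the printed path is the one to hand to a tool such as clusterdump from the program list above.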


To inspect the vectors generated from the input data, run:
mahout vectordump -i ./output/data/part-m-00000

Versions before Mahout 0.8 use this form instead: mahout vectordump --seqFile /user/hadoop/output/data/part-m-00000

[jifeng@jifeng01 hadoop]$ mahout vectordump -i ./output/data/part-m-00000
MAHOUT_LOCAL is not set; adding HADOOP_CONF_DIR to classpath.
Warning: $HADOOP_HOME is deprecated.

Running on hadoop, using /home/jifeng/hadoop/hadoop-1.2.1/bin/hadoop and HADOOP_CONF_DIR=/home/jifeng/hadoop/hadoop-1.2.1/conf
MAHOUT-JOB: /home/jifeng/hadoop/mahout-distribution-0.9/mahout-examples-0.9-job.jar
Warning: $HADOOP_HOME is deprecated.

14/08/26 18:44:03 WARN driver.MahoutDriver: No vectordump.props found on classpath, will use command-line arguments only
14/08/26 18:44:03 INFO common.AbstractJob: Command line arguments: {--endPhase=[2147483647], --input=[./output/data/part-m-00000], --startPhase=[0], --tempDir=[temp]}
14/08/26 18:44:03 INFO vectors.VectorDumper: Sort? false
14/08/26 18:44:08 INFO driver.MahoutDriver: Program took 5349 ms (Minutes: 0.08916666666666667)
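
For larger files it can be more convenient to write the dump to a local text file than to the console. The sketch below assumes that vectordump accepts an output option `-o` alongside `-i`, as Mahout's other command-line jobs do; the function name `dump_vectors` and the `MAHOUT_BIN` override are hypothetical.

```shell
# Sketch: dump a vector SequenceFile into a local text file.
# MAHOUT_BIN is a hypothetical override; it defaults to `mahout` on PATH.
# Assumption: vectordump supports -o for the output file.
dump_vectors() {
    local input="$1" output="$2"
    "${MAHOUT_BIN:-mahout}" vectordump -i "$input" -o "$output"
}
```

For example, `dump_vectors ./output/data/part-m-00000 points.txt` would leave the text form of the vectors in points.txt, if the option behaves as assumed.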
