47. Install the Mahout client on the master node, open a Linux shell, and run the mahout command to list the example programs bundled with Mahout. The query result is shown below.
[root@master ~]# mahout
MAHOUT_LOCAL is not set; adding HADOOP_CONF_DIR to classpath.
Running on hadoop, using /usr/hdp/2.4.3.0-227/hadoop/bin/hadoop and HADOOP_CONF_DIR=/usr/hdp/2.4.3.0-227/hadoop/conf
MAHOUT-JOB: /usr/hdp/2.4.3.0-227/mahout/mahout-examples-0.9.0.2.4.3.0-227-job.jar
WARNING: Use "yarn jar" to launch YARN applications.
An example program must be given as the first argument.
Valid program names are:
arff.vector: : Generate Vectors from an ARFF file or directory
baumwelch: : Baum-Welch algorithm for unsupervised HMM training
buildforest: : Build the random forest classifier
canopy: : Canopy clustering
cat: : Print a file or resource as the logistic regression models would see it
cleansvd: : Cleanup and verification of SVD output
clusterdump: : Dump cluster output to text
clusterpp: : Groups Clustering Output In Clusters
cmdump: : Dump confusion matrix in HTML or text formats
concatmatrices: : Concatenates 2 matrices of same cardinality into a single matrix
cvb: : LDA via Collapsed Variation Bayes (0th deriv. approx)
cvb0_local: : LDA via Collapsed Variation Bayes, in memory locally.
describe: : Describe the fields and target variable in a data set
evaluateFactorization: : compute RMSE and MAE of a rating matrix factorization against probes
fkmeans: : Fuzzy K-means clustering
hmmpredict: : Generate random sequence of observations by given HMM
itemsimilarity: : Compute the item-item-similarities for item-based collaborative filtering
kmeans: : K-means clustering
lucene.vector: : Generate Vectors from a Lucene index
lucene2seq: : Generate Text SequenceFiles from a Lucene index
matrixdump: : Dump matrix in CSV format
matrixmult: : Take the product of two matrices
parallelALS: : ALS-WR factorization of a rating matrix
qualcluster: : Runs clustering experiments and summarizes results in a CSV
recommendfactorized: : Compute recommendations using the factorization of a rating matrix
recommenditembased: : Compute recommendations using item-based collaborative filtering
regexconverter: : Convert text files on a per line basis based on regular expressions
resplit: : Splits a set of SequenceFiles into a number of equal splits
rowid: : Map SequenceFile
rowsimilarity: : Compute the pairwise similarities of the rows of a matrix
runAdaptiveLogistic: : Score new production data using a probably trained and validated AdaptivelogisticRegression model
runlogistic: : Run a logistic regression model against CSV data
seq2encoded: : Encoded Sparse Vector generation from Text sequence files
seq2sparse: : Sparse Vector generation from Text sequence files
seqdirectory: : Generate sequence files (of Text) from a directory
seqdumper: : Generic Sequence File dumper
seqmailarchives: : Creates SequenceFile from a directory containing gzipped mail archives
seqwiki: : Wikipedia xml dump to sequence file
spectralkmeans: : Spectral k-means clustering
split: : Split Input data into test and train sets
splitDataset: : split a rating dataset into training and probe parts
ssvd: : Stochastic SVD
streamingkmeans: : Streaming k-means clustering
svd: : Lanczos Singular Value Decomposition
testforest: : Test the random forest classifier
testnb: : Test the Vector-based Bayes classifier
trainAdaptiveLogistic: : Train an AdaptivelogisticRegression model
trainlogistic: : Train a logistic regression using stochastic gradient descent
trainnb: : Train the Vector-based Bayes classifier
transpose: : Take the transpose of a matrix
validateAdaptiveLogistic: : Validate an AdaptivelogisticRegression model against hold-out data set
vecdist: : Compute the distances between a set of Vectors (or Cluster or Canopy, they must fit in memory) and a list of Vectors
vectordump: : Dump vectors from a sequence file to text
viterbi: : Viterbi decoding of hidden states from given output states sequence
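Note: each driver listed above has its own options. A quick way to see them, assuming this Mahout 0.9 build exposes the standard help flag (an assumption, not shown in the output above), is:

# Print the usage and option list of a single driver (here: seqdirectory); the --help flag is assumed
mahout seqdirectory --help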
48. Use Mahout to convert the extracted contents of 20news-bydate.tar.gz into sequence files, save them to the /data/mahout/20news/output/20news-seq/ directory, and list the contents of that directory. The commands and query results are shown below.
[root@master ~]# mkdir 20news
[root@master ~]# tar -xzf 20news-bydate.tar.gz -C 20news
[root@master ~]# hadoop fs -mkdir -p /data/mahout/20news/20news-all
[root@master ~]# hadoop fs -put 20news/* /data/mahout/20news/20news-all
[root@master ~]# mahout seqdirectory -i /data/mahout/20news/20news-all -o /data/mahout/20news/output/20news-seq
MAHOUT_LOCAL is not set; adding HADOOP_CONF_DIR to classpath.
Running on hadoop, using /usr/hdp/2.4.3.0-227/hadoop/bin/hadoop and HADOOP_CONF_DIR=/usr/hdp/2.4.3.0-227/hadoop/conf
MAHOUT-JOB: /usr/hdp/2.4.3.0-227/mahout/mahout-examples-0.9.0.2.4.3.0-227-job.jar
WARNING: Use "yarn jar" to launch YARN applications.
17/05/12 05:04:32 WARN driver.MahoutDriver: No seqdirectory.props found on classpath, will use command-line arguments only
17/05/12 05:04:32 INFO common.AbstractJob: Command line arguments: {--charset=[UTF-8], --chunkSize=[64], --endPhase=[2147483647], --fileFilterClass=[org.apache.mahout.text.PrefixAdditionFilter], --input=[/data/mahout/20news/20news-all], --keyPrefix=[], --method=[mapreduce], --output=[/data/mahout/20news/output/20news-seq], --startPhase=[0], --tempDir=[temp]}
17/05/12 05:04:35 INFO impl.TimelineClientImpl: Timeline service address: http://slaver1:8188/ws/v1/timeline/
17/05/12 05:04:35 INFO client.RMProxy: Connecting to ResourceManager at slaver1/10.0.0.108:8050
17/05/12 05:04:53 INFO input.FileInputFormat: Total input paths to process : 4262
17/05/12 05:04:53 INFO input.CombineFileInputFormat: DEBUG: Terminated node allocation with : CompletedNodes: 2, size left: 8691977
17/05/12 05:05:10 INFO mapreduce.JobSubmitter: number of splits:1
17/05/12 05:05:20 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1494563840869_0001
17/05/12 05:05:21 INFO impl.YarnClientImpl: Submitted application application_1494563840869_0001
17/05/12 05:05:21 INFO mapreduce.Job: The url to track the job: http://slaver1:8088/proxy/application_1494563840869_0001/
17/05/12 05:05:21 INFO mapreduce.Job: Running job: job_1494563840869_0001
17/05/12 05:06:34 INFO mapreduce.Job: Job job_1494563840869_0001 running in uber mode : false
17/05/12 05:06:34 INFO mapreduce.Job: map 0% reduce 0%
17/05/12 05:06:59 INFO mapreduce.Job: map 14% reduce 0%
17/05/12 05:07:02 INFO mapreduce.Job: map 37% reduce 0%
17/05/12 05:07:05 INFO mapreduce.Job: map 66% reduce 0%
17/05/12 05:07:08 INFO mapreduce.Job: map 100% reduce 0%
17/05/12 05:07:15 INFO mapreduce.Job: Job job_1494563840869_0001 completed successfully
17/05/12 05:07:15 INFO mapreduce.Job: Counters: 30
File System Counters
FILE: Number of bytes read=0
FILE: Number of bytes written=140047
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=9148620
HDFS: Number of bytes written=3244364
HDFS: Number of read operations=17052
HDFS: Number of large read operations=0
HDFS: Number of write operations=2
Job Counters
Launched map tasks=1
Other local map tasks=1
Total time spent by all maps in occupied slots (ms)=49028
Total time spent by all reduces in occupied slots (ms)=0
Total time spent by all map tasks (ms)=24514
Total vcore-seconds taken by all map tasks=24514
Total megabyte-seconds taken by all map tasks=24612056
Map-Reduce Framework
Map input records=4262
Map output records=4262
Input split bytes=456643
Spilled Records=0
Failed Shuffles=0
Merged Map outputs=0
GC time elapsed (ms)=176
CPU time spent (ms)=15710
Physical memory (bytes) snapshot=250163200
Virtual memory (bytes) snapshot=2792144896
Total committed heap usage (bytes)=123207680
File Input Format Counters
Bytes Read=0
File Output Format Counters
Bytes Written=3244364
17/05/12 05:07:15 INFO driver.MahoutDriver: Program took 163575 ms (Minutes: 2.72625)
[root@master ~]# hadoop fs -ls /data/mahout/20news/output/20news-seq
Found 2 items
-rw-r--r--   3 root hdfs        0 2017-05-12 05:07 /data/mahout/20news/output/20news-seq/_SUCCESS
-rw-r--r--   3 root hdfs  3244364 2017-05-12 05:07 /data/mahout/20news/output/20news-seq/part-m-00000
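Optional check: besides listing the directory, the generated sequence file can be spot-checked with the seqdumper driver from the program list above. A minimal sketch, reusing the paths of this exercise:

# Dump the key/value pairs of the sequence file to stdout; -i is the HDFS input path
mahout seqdumper -i /data/mahout/20news/output/20news-seq/part-m-00000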
49. Use Mahout to convert the extracted contents of 20news-bydate.tar.gz into sequence files, save them to the /data/mahout/20news/output/20news-seq/ directory, and view the sequence file contents with the -text command (the first 20 lines are sufficient). The commands and query results are shown below.
[root@master ~]# mkdir 20news
[root@master ~]# tar -xzf 20news-bydate.tar.gz -C 20news
[root@master ~]# hadoop fs -mkdir -p /data/mahout/20news/20news-all
[root@master ~]# hadoop fs -put 20news/* /data/mahout/20news/20news-all
[root@master ~]# mahout seqdirectory -i /data/mahout/20news/20news-all -o /data/mahout/20news/output/20news-seq
MAHOUT_LOCAL is not set; adding HADOOP_CONF_DIR to classpath.
Running on hadoop, using /usr/hdp/2.4.3.0-227/hadoop/bin/hadoop and HADOOP_CONF_DIR=/usr/hdp/2.4.3.0-227/hadoop/conf
MAHOUT-JOB: /usr/hdp/2.4.3.0-227/mahout/mahout-examples-0.9.0.2.4.3.0-227-job.jar
WARNING: Use "yarn jar" to launch YARN applications.
17/05/12 05:04:32 WARN driver.MahoutDriver: No seqdirectory.props found on classpath, will use command-line arguments only
17/05/12 05:04:32 INFO common.AbstractJob: Command line arguments: {--charset=[UTF-8], --chunkSize=[64], --endPhase=[2147483647], --fileFilterClass=[org.apache.mahout.text.PrefixAdditionFilter], --input=[/data/mahout/20news/20news-all], --keyPrefix=[], --method=[mapreduce], --output=[/data/mahout/20news/output/20news-seq], --startPhase=[0], --tempDir=[temp]}
17/05/12 05:04:35 INFO impl.TimelineClientImpl: Timeline service address: http://slaver1:8188/ws/v1/timeline/
17/05/12 05:04:35 INFO client.RMProxy: Connecting to ResourceManager at slaver1/10.0.0.108:8050
17/05/12 05:04:53 INFO input.FileInputFormat: Total input paths to process : 4262
17/05/12 05:04:53 INFO input.CombineFileInputFormat: DEBUG: Terminated node allocation with : CompletedNodes: 2, size left: 8691977
17/05/12 05:05:10 INFO mapreduce.JobSubmitter: number of splits:1
17/05/12 05:05:20 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1494563840869_0001
17/05/12 05:05:21 INFO impl.YarnClientImpl: Submitted application application_1494563840869_0001
17/05/12 05:05:21 INFO mapreduce.Job: The url to track the job: http://slaver1:8088/proxy/application_1494563840869_0001/
17/05/12 05:05:21 INFO mapreduce.Job: Running job: job_1494563840869_0001
17/05/12 05:06:34 INFO mapreduce.Job: Job job_1494563840869_0001 running in uber mode : false
17/05/12 05:06:34 INFO mapreduce.Job: map 0% reduce 0%
17/05/12 05:06:59 INFO mapreduce.Job: map 14% reduce 0%
17/05/12 05:07:02 INFO mapreduce.Job: map 37% reduce 0%
17/05/12 05:07:05 INFO mapreduce.Job: map 66% reduce 0%
17/05/12 05:07:08 INFO mapreduce.Job: map 100% reduce 0%
17/05/12 05:07:15 INFO mapreduce.Job: Job job_1494563840869_0001 completed successfully
17/05/12 05:07:15 INFO mapreduce.Job: Counters: 30
File System Counters
FILE: Number of bytes read=0
FILE: Number of bytes written=140047
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=9148620
HDFS: Number of bytes written=3244364
HDFS: Number of read operations=17052
HDFS: Number of large read operations=0
HDFS: Number of write operations=2
Job Counters
Launched map tasks=1
Other local map tasks=1
Total time spent by all maps in occupied slots (ms)=49028
Total time spent by all reduces in occupied slots (ms)=0
Total time spent by all map tasks (ms)=24514
Total vcore-seconds taken by all map tasks=24514
Total megabyte-seconds taken by all map tasks=24612056
Map-Reduce Framework
Map input records=4262
Map output records=4262
Input split bytes=456643
Spilled Records=0
Failed Shuffles=0
Merged Map outputs=0
GC time elapsed (ms)=176
CPU time spent (ms)=15710
Physical memory (bytes) snapshot=250163200
Virtual memory (bytes) snapshot=2792144896
Total committed heap usage (bytes)=123207680
File Input Format Counters
Bytes Read=0
File Output Format Counters
Bytes Written=3244364
17/05/12 05:07:15 INFO driver.MahoutDriver: Program took 163575 ms (Minutes: 2.72625)
[root@master ~]# hadoop fs -text /data/mahout/20news/output/20news-seq/part-m-00000 | head -n 20
17/05/12 05:26:18 INFO zlib.ZlibFactory: Successfully loaded & initialized native-zlib library
17/05/12 05:26:18 INFO compress.CodecPool: Got brand-new decompressor [.deflate]
17/05/12 05:26:18 INFO compress.CodecPool: Got brand-new decompressor [.deflate]
17/05/12 05:26:18 INFO compress.CodecPool: Got brand-new decompressor [.deflate]
17/05/12 05:26:18 INFO compress.CodecPool: Got brand-new decompressor [.deflate]
/20news-bydate-test/alt.atheism/53068 From: [email protected] (dean.kaflowitz)
Subject: Re: about the bible quiz answers
Organization: AT&T
Distribution: na
Lines: 18
In article
>
>
> #12) The 2 cheribums are on the Ark of the Covenant. When God said make no
> graven image, he was refering to idols, which were created to be worshipped.
> The Ark of the Covenant wasn't wrodhipped and only the high priest could
> enter the Holy of Holies where it was kept once a year, on the Day of
> Atonement.
I am not familiar with, or knowledgeable about the original language,
but I believe there is a word for "idol" and that the translator
would have used the word "idol" instead of "graven image" had
the original said "idol." So I think you're wrong here, but
then again I could be too. I just suggesting a way to determine
text: Unable to write to output stream.
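The trailing "text: Unable to write to output stream." message only means that head closed the pipe after 20 lines. Assuming the Hadoop shell writes its log and error messages to stderr (the usual default), they can be hidden like this:

# Discard stderr so only the first 20 lines of file content are printed
hadoop fs -text /data/mahout/20news/output/20news-seq/part-m-00000 2>/dev/null | head -n 20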
50. Use Mahout to generate item recommendations for the dataset user-item-score.txt (user-item-score), using the item-based collaborative filtering algorithm with the Euclidean distance similarity measure, 3 recommendations per user, non-Boolean data, a maximum preference count of 4 and a minimum preference count of 1. Save the recommendation output to the output directory and query the contents of the output file part-r-00000 with the -cat command. The commands for running the recommendation algorithm and the query results are shown below.
[hdfs@master ~]$ hadoop fs -mkdir -p /data/mahout/project
[hdfs@master ~]$ hadoop fs -put user-item-score.txt /data/mahout/project
[hdfs@master ~]$ mahout recommenditembased -i /data/mahout/project/user-item-score.txt -o /data/mahout/project/output -n 3 -b false -s SIMILARITY_EUCLIDEAN_DISTANCE --maxPrefsPerUser 4 --minPrefsPerUser 1 --maxPrefsInItemSimilarity 4 --tempDir /data/mahout/project/temp
MAHOUT_LOCAL is not set; adding HADOOP_CONF_DIR to classpath.
Running on hadoop, using /usr/hdp/2.4.3.0-227/hadoop/bin/hadoop and HADOOP_CONF_DIR=/usr/hdp/2.4.3.0-227/hadoop/conf
MAHOUT-JOB: /usr/hdp/2.4.3.0-227/mahout/mahout-examples-0.9.0.2.4.3.0-227-job.jar
WARNING: Use "yarn jar" to launch YARN applications.
17/05/15 19:33:06 WARN driver.MahoutDriver: No recommenditembased.props found on classpath, will use command-line arguments only
17/05/15 19:33:07 INFO common.AbstractJob: Command line arguments: {--booleanData=[false], --endPhase=[2147483647], --input=[/data/mahout/project/user.txt], --maxPrefsInItemSimilarity=[4], --maxPrefsPerUser=[4], --maxSimilaritiesPerItem=[100], --minPrefsPerUser=[1], --numRecommendations=[3], --output=[/data/mahout/project/output], --similarityClassname=[SIMILARITY_EUCLIDEAN_DISTANCE], --startPhase=[0], --tempDir=[/data/mahout/project/temp]}
17/05/15 19:33:07 INFO common.AbstractJob: Command line arguments: {--booleanData=[false], --endPhase=[2147483647], --input=[/data/mahout/project/user.txt], --minPrefsPerUser=[1], --output=[/data/mahout/project/temp/preparePreferenceMatrix], --ratingShift=[0.0], --startPhase=[0], --tempDir=[/data/mahout/project/temp]}
17/05/15 19:33:08 INFO impl.TimelineClientImpl: Timeline service address: http://slaver1:8188/ws/v1/timeline/
17/05/15 19:33:08 INFO client.RMProxy: Connecting to ResourceManager at slaver1/172.16.180.123:8050
17/05/15 19:33:10 INFO input.FileInputFormat: Total input paths to process : 1
17/05/15 19:33:10 INFO mapreduce.JobSubmitter: number of splits:1
17/05/15 19:33:10 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1494874269419_0013
17/05/15 19:33:11 INFO impl.YarnClientImpl: Submitted application application_1494874269419_0013
17/05/15 19:33:11 INFO mapreduce.Job: The url to track the job: http://slaver1:8088/proxy/application_1494874269419_0013/
17/05/15 19:33:11 INFO mapreduce.Job: Running job: job_1494874269419_0013
17/05/15 19:33:18 INFO mapreduce.Job: Job job_1494874269419_0013 running in uber mode : false
17/05/15 19:33:18 INFO mapreduce.Job: map 0% reduce 0%
17/05/15 19:33:25 INFO mapreduce.Job: map 100% reduce 0%
17/05/15 19:33:33 INFO mapreduce.Job: map 100% reduce 100%
17/05/15 19:33:36 INFO mapreduce.Job: Job job_1494874269419_0013 completed successfully
17/05/15 19:33:37 INFO mapreduce.Job: Counters: 49
File System Counters
FILE: Number of bytes read=54
FILE: Number of bytes written=272323
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=341
HDFS: Number of bytes written=187
HDFS: Number of read operations=6
HDFS: Number of large read operations=0
HDFS: Number of write operations=2
Job Counters
Launched map tasks=1
Launched reduce tasks=1
Data-local map tasks=1
Total time spent by all maps in occupied slots (ms)=3313
Total time spent by all reduces in occupied slots (ms)=12410
Total time spent by all map tasks (ms)=3313
Total time spent by all reduce tasks (ms)=6205
Total vcore-seconds taken by all map tasks=3313
Total vcore-seconds taken by all reduce tasks=6205
Total megabyte-seconds taken by all map tasks=1696256
Total megabyte-seconds taken by all reduce tasks=6353920
Map-Reduce Framework
Map input records=21
Map output records=21
Map output bytes=84
Map output materialized bytes=46
Input split bytes=112
Combine input records=21
Combine output records=7
Reduce input groups=7
Reduce shuffle bytes=46
Reduce input records=7
Reduce output records=7
Spilled Records=14
Shuffled Maps =1
Failed Shuffles=0
Merged Map outputs=1
GC time elapsed (ms)=116
CPU time spent (ms)=2080
Physical memory (bytes) snapshot=656359424
Virtual memory (bytes) snapshot=5180207104
Total committed heap usage (bytes)=484442112
Shuffle Errors
BAD_ID=0
CONNECTION=0
IO_ERROR=0
WRONG_LENGTH=0
WRONG_MAP=0
WRONG_REDUCE=0
File Input Format Counters
Bytes Read=229
File Output Format Counters
Bytes Written=187
17/05/15 19:33:37 INFO impl.TimelineClientImpl: Timeline service address: http://slaver1:8188/ws/v1/timeline/
17/05/15 19:33:37 INFO client.RMProxy: Connecting to ResourceManager at slaver1/172.16.180.123:8050
17/05/15 19:33:38 INFO input.FileInputFormat: Total input paths to process : 1
17/05/15 19:33:40 INFO mapreduce.JobSubmitter: number of splits:1
17/05/15 19:33:41 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1494874269419_0014
17/05/15 19:33:43 INFO impl.YarnClientImpl: Application submission is not finished, submitted application application_1494874269419_0014 is still in NEW_SAVING
17/05/15 19:33:44 INFO impl.YarnClientImpl: Submitted application application_1494874269419_0014
17/05/15 19:33:44 INFO mapreduce.Job: The url to track the job: http://slaver1:8088/proxy/application_1494874269419_0014/
17/05/15 19:33:44 INFO mapreduce.Job: Running job: job_1494874269419_0014
17/05/15 19:33:55 INFO mapreduce.Job: Job job_1494874269419_0014 running in uber mode : false
17/05/15 19:33:55 INFO mapreduce.Job: map 0% reduce 0%
17/05/15 19:34:03 INFO mapreduce.Job: map 100% reduce 0%
17/05/15 19:34:12 INFO mapreduce.Job: map 100% reduce 100%
17/05/15 19:34:27 INFO mapreduce.Job: Job job_1494874269419_0014 completed successfully
17/05/15 19:34:27 INFO mapreduce.Job: Counters: 50
File System Counters
FILE: Number of bytes read=113
FILE: Number of bytes written=273073
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=341
HDFS: Number of bytes written=288
HDFS: Number of read operations=6
HDFS: Number of large read operations=0
HDFS: Number of write operations=2
Job Counters
Launched map tasks=1
Launched reduce tasks=1
Data-local map tasks=1
Total time spent by all maps in occupied slots (ms)=4912
Total time spent by all reduces in occupied slots (ms)=13474
Total time spent by all map tasks (ms)=4912
Total time spent by all reduce tasks (ms)=6737
Total vcore-seconds taken by all map tasks=4912
Total vcore-seconds taken by all reduce tasks=6737
Total megabyte-seconds taken by all map tasks=2514944
Total megabyte-seconds taken by all reduce tasks=6898688
Map-Reduce Framework
Map input records=21
Map output records=21
Map output bytes=147
Map output materialized bytes=105
Input split bytes=112
Combine input records=0
Combine output records=0
Reduce input groups=5
Reduce shuffle bytes=105
Reduce input records=21
Reduce output records=5
Spilled Records=42
Shuffled Maps =1
Failed Shuffles=0
Merged Map outputs=1
GC time elapsed (ms)=125
CPU time spent (ms)=1830
Physical memory (bytes) snapshot=666886144
Virtual memory (bytes) snapshot=5178028032
Total committed heap usage (bytes)=488636416
Shuffle Errors
BAD_ID=0
CONNECTION=0
IO_ERROR=0
WRONG_LENGTH=0
WRONG_MAP=0
WRONG_REDUCE=0
File Input Format Counters
Bytes Read=229
File Output Format Counters
Bytes Written=288
org.apache.mahout.cf.taste.hadoop.item.ToUserVectorsReducer$Counters
USERS=5
17/05/15 19:34:28 INFO impl.TimelineClientImpl: Timeline service address: http://slaver1:8188/ws/v1/timeline/
17/05/15 19:34:28 INFO client.RMProxy: Connecting to ResourceManager at slaver1/172.16.180.123:8050
17/05/15 19:34:29 INFO input.FileInputFormat: Total input paths to process : 1
17/05/15 19:34:29 INFO mapreduce.JobSubmitter: number of splits:1
17/05/15 19:34:29 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1494874269419_0015
17/05/15 19:34:29 INFO impl.YarnClientImpl: Submitted application application_1494874269419_0015
17/05/15 19:34:29 INFO mapreduce.Job: The url to track the job: http://slaver1:8088/proxy/application_1494874269419_0015/
17/05/15 19:34:29 INFO mapreduce.Job: Running job: job_1494874269419_0015
17/05/15 19:34:36 INFO mapreduce.Job: Job job_1494874269419_0015 running in uber mode : false
17/05/15 19:34:36 INFO mapreduce.Job: map 0% reduce 0%
17/05/15 19:34:45 INFO mapreduce.Job: map 100% reduce 0%
17/05/15 19:34:51 INFO mapreduce.Job: map 100% reduce 100%
17/05/15 19:34:52 INFO mapreduce.Job: Job job_1494874269419_0015 completed successfully
17/05/15 19:34:52 INFO mapreduce.Job: Counters: 49
File System Counters
FILE: Number of bytes read=126
FILE: Number of bytes written=272583
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=445
HDFS: Number of bytes written=335
HDFS: Number of read operations=7
HDFS: Number of large read operations=0
HDFS: Number of write operations=2
Job Counters
Launched map tasks=1
Launched reduce tasks=1
Data-local map tasks=1
Total time spent by all maps in occupied slots (ms)=5388
Total time spent by all reduces in occupied slots (ms)=6558
Total time spent by all map tasks (ms)=5388
Total time spent by all reduce tasks (ms)=3279
Total vcore-seconds taken by all map tasks=5388
Total vcore-seconds taken by all reduce tasks=3279
Total megabyte-seconds taken by all map tasks=2758656
Total megabyte-seconds taken by all reduce tasks=3357696
Map-Reduce Framework
Map input records=5
Map output records=21
Map output bytes=336
Map output materialized bytes=118
Input split bytes=157
Combine input records=21
Combine output records=7
Reduce input groups=7
Reduce shuffle bytes=118
Reduce input records=7
Reduce output records=7
Spilled Records=14
Shuffled Maps =1
Failed Shuffles=0
Merged Map outputs=1
GC time elapsed (ms)=127
CPU time spent (ms)=1970
Physical memory (bytes) snapshot=661454848
Virtual memory (bytes) snapshot=5178253312
Total committed heap usage (bytes)=486014976
Shuffle Errors
BAD_ID=0
CONNECTION=0
IO_ERROR=0
WRONG_LENGTH=0
WRONG_MAP=0
WRONG_REDUCE=0
File Input Format Counters
Bytes Read=288
File Output Format Counters
Bytes Written=335
17/05/15 19:34:52 INFO common.AbstractJob: Command line arguments: {--endPhase=[2147483647], --excludeSelfSimilarity=[true], --input=[/data/mahout/project/temp/preparePreferenceMatrix/ratingMatrix], --maxObservationsPerColumn=[4], --maxObservationsPerRow=[4], --maxSimilaritiesPerRow=[100], --numberOfColumns=[5], --output=[/data/mahout/project/temp/similarityMatrix], --randomSeed=[-9223372036854775808], --similarityClassname=[SIMILARITY_EUCLIDEAN_DISTANCE], --startPhase=[0], --tempDir=[/data/mahout/project/temp], --threshold=[4.9E-324]}
17/05/15 19:34:52 INFO impl.TimelineClientImpl: Timeline service address: http://slaver1:8188/ws/v1/timeline/
17/05/15 19:34:52 INFO client.RMProxy: Connecting to ResourceManager at slaver1/172.16.180.123:8050
17/05/15 19:34:52 INFO input.FileInputFormat: Total input paths to process : 1
17/05/15 19:34:53 INFO mapreduce.JobSubmitter: number of splits:1
17/05/15 19:34:53 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1494874269419_0016
17/05/15 19:34:53 INFO impl.YarnClientImpl: Submitted application application_1494874269419_0016
17/05/15 19:34:53 INFO mapreduce.Job: The url to track the job: http://slaver1:8088/proxy/application_1494874269419_0016/
17/05/15 19:34:53 INFO mapreduce.Job: Running job: job_1494874269419_0016
17/05/15 19:35:00 INFO mapreduce.Job: Job job_1494874269419_0016 running in uber mode : false
17/05/15 19:35:00 INFO mapreduce.Job: map 0% reduce 0%
17/05/15 19:35:05 INFO mapreduce.Job: map 100% reduce 0%
17/05/15 19:35:11 INFO mapreduce.Job: map 100% reduce 100%
17/05/15 19:35:13 INFO mapreduce.Job: Job job_1494874269419_0016 completed successfully
17/05/15 19:35:13 INFO mapreduce.Job: Counters: 49
File System Counters
FILE: Number of bytes read=50
FILE: Number of bytes written=272971
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=493
HDFS: Number of bytes written=150
HDFS: Number of read operations=7
HDFS: Number of large read operations=0
HDFS: Number of write operations=3
Job Counters
Launched map tasks=1
Launched reduce tasks=1
Data-local map tasks=1
Total time spent by all maps in occupied slots (ms)=3341
Total time spent by all reduces in occupied slots (ms)=6930
Total time spent by all map tasks (ms)=3341
Total time spent by all reduce tasks (ms)=3465
Total vcore-seconds taken by all map tasks=3341
Total vcore-seconds taken by all reduce tasks=3465
Total megabyte-seconds taken by all map tasks=1710592
Total megabyte-seconds taken by all reduce tasks=3548160
Map-Reduce Framework
Map input records=7
Map output records=1
Map output bytes=52
Map output materialized bytes=42
Input split bytes=158
Combine input records=1
Combine output records=1
Reduce input groups=1
Reduce shuffle bytes=42
Reduce input records=1
Reduce output records=0
Spilled Records=2
Shuffled Maps =1
Failed Shuffles=0
Merged Map outputs=1
GC time elapsed (ms)=129
CPU time spent (ms)=1890
Physical memory (bytes) snapshot=665804800
Virtual memory (bytes) snapshot=5178810368
Total committed heap usage (bytes)=488636416
Shuffle Errors
BAD_ID=0
CONNECTION=0
IO_ERROR=0
WRONG_LENGTH=0
WRONG_MAP=0
WRONG_REDUCE=0
File Input Format Counters
Bytes Read=335
File Output Format Counters
Bytes Written=98
17/05/15 19:35:14 INFO impl.TimelineClientImpl: Timeline service address: http://slaver1:8188/ws/v1/timeline/
17/05/15 19:35:14 INFO client.RMProxy: Connecting to ResourceManager at slaver1/172.16.180.123:8050
17/05/15 19:35:15 INFO input.FileInputFormat: Total input paths to process : 1
17/05/15 19:35:16 INFO mapreduce.JobSubmitter: number of splits:1
17/05/15 19:35:16 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1494874269419_0017
17/05/15 19:35:17 INFO impl.YarnClientImpl: Submitted application application_1494874269419_0017
17/05/15 19:35:17 INFO mapreduce.Job: The url to track the job: http://slaver1:8088/proxy/application_1494874269419_0017/
17/05/15 19:35:17 INFO mapreduce.Job: Running job: job_1494874269419_0017
17/05/15 19:35:24 INFO mapreduce.Job: Job job_1494874269419_0017 running in uber mode : false
17/05/15 19:35:24 INFO mapreduce.Job: map 0% reduce 0%
17/05/15 19:35:33 INFO mapreduce.Job: map 100% reduce 0%
17/05/15 19:35:39 INFO mapreduce.Job: map 100% reduce 100%
17/05/15 19:35:40 INFO mapreduce.Job: Job job_1494874269419_0017 completed successfully
17/05/15 19:35:40 INFO mapreduce.Job: Counters: 52
File System Counters
FILE: Number of bytes read=166
FILE: Number of bytes written=276957
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=545
HDFS: Number of bytes written=447
HDFS: Number of read operations=8
HDFS: Number of large read operations=0
HDFS: Number of write operations=5
Job Counters
Launched map tasks=1
Launched reduce tasks=1
Data-local map tasks=1
Total time spent by all maps in occupied slots (ms)=6262
Total time spent by all reduces in occupied slots (ms)=7262
Total time spent by all map tasks (ms)=6262
Total time spent by all reduce tasks (ms)=3631
Total vcore-seconds taken by all map tasks=6262
Total vcore-seconds taken by all reduce tasks=3631
Total megabyte-seconds taken by all map tasks=3206144
Total megabyte-seconds taken by all reduce tasks=3718144
Map-Reduce Framework
Map input records=7
Map output records=22
Map output bytes=476
Map output materialized bytes=158
Input split bytes=158
Combine input records=22
Combine output records=8
Reduce input groups=8
Reduce shuffle bytes=158
Reduce input records=8
Reduce output records=5
Spilled Records=16
Shuffled Maps =1
Failed Shuffles=0
Merged Map outputs=1
GC time elapsed (ms)=154
CPU time spent (ms)=4190
Physical memory (bytes) snapshot=666284032
Virtual memory (bytes) snapshot=5179322368
Total committed heap usage (bytes)=489684992
Shuffle Errors
BAD_ID=0
CONNECTION=0
IO_ERROR=0
WRONG_LENGTH=0
WRONG_MAP=0
WRONG_REDUCE=0
File Input Format Counters
Bytes Read=335
File Output Format Counters
Bytes Written=363
org.apache.mahout.math.hadoop.similarity.cooccurrence.RowSimilarityJob$Counters
NEGLECTED_OBSERVATIONS=2
ROWS=7
USED_OBSERVATIONS=19
17/05/15 19:35:40 INFO impl.TimelineClientImpl: Timeline service address: http://slaver1:8188/ws/v1/timeline/
17/05/15 19:35:40 INFO client.RMProxy: Connecting to ResourceManager at slaver1/172.16.180.123:8050
17/05/15 19:35:44 INFO input.FileInputFormat: Total input paths to process : 1
17/05/15 19:35:45 INFO mapreduce.JobSubmitter: number of splits:1
17/05/15 19:35:45 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1494874269419_0018
17/05/15 19:35:45 INFO impl.YarnClientImpl: Submitted application application_1494874269419_0018
17/05/15 19:35:45 INFO mapreduce.Job: The url to track the job: http://slaver1:8088/proxy/application_1494874269419_0018/
17/05/15 19:35:45 INFO mapreduce.Job: Running job: job_1494874269419_0018
17/05/15 19:35:57 INFO mapreduce.Job: Job job_1494874269419_0018 running in uber mode : false
17/05/15 19:35:57 INFO mapreduce.Job: map 0% reduce 0%
17/05/15 19:36:07 INFO mapreduce.Job: map 100% reduce 0%
17/05/15 19:36:14 INFO mapreduce.Job: map 100% reduce 100%
17/05/15 19:36:15 INFO mapreduce.Job: Job job_1494874269419_0018 completed successfully
17/05/15 19:36:15 INFO mapreduce.Job: Counters: 51
File System Counters
FILE: Number of bytes read=160
FILE: Number of bytes written=275869
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=576
HDFS: Number of bytes written=365
HDFS: Number of read operations=10
HDFS: Number of large read operations=0
HDFS: Number of write operations=2
Job Counters
Launched map tasks=1
Launched reduce tasks=1
Data-local map tasks=1
Total time spent by all maps in occupied slots (ms)=8191
Total time spent by all reduces in occupied slots (ms)=8114
Total time spent by all map tasks (ms)=8191
Total time spent by all reduce tasks (ms)=4057
Total vcore-seconds taken by all map tasks=8191
Total vcore-seconds taken by all reduce tasks=4057
Total megabyte-seconds taken by all map tasks=4193792
Total megabyte-seconds taken by all reduce tasks=4154368
Map-Reduce Framework
Map input records=5
Map output records=19
Map output bytes=632
Map output materialized bytes=152
Input split bytes=129
Combine input records=19
Combine output records=7
Reduce input groups=7
Reduce shuffle bytes=152
Reduce input records=7
Reduce output records=7
Spilled Records=14
Shuffled Maps =1
Failed Shuffles=0
Merged Map outputs=1
GC time elapsed (ms)=195
CPU time spent (ms)=6470
Physical memory (bytes) snapshot=681738240
Virtual memory (bytes) snapshot=5182803968
Total committed heap usage (bytes)=490733568
Shuffle Errors
BAD_ID=0
CONNECTION=0
IO_ERROR=0
WRONG_LENGTH=0
WRONG_MAP=0
WRONG_REDUCE=0
File Input Format Counters
Bytes Read=363
File Output Format Counters
Bytes Written=365
org.apache.mahout.math.hadoop.similarity.cooccurrence.RowSimilarityJob$Counters
COOCCURRENCES=47
PRUNED_COOCCURRENCES=0
17/05/15 19:36:16 INFO impl.TimelineClientImpl: Timeline service address: http://slaver1:8188/ws/v1/timeline/
17/05/15 19:36:16 INFO client.RMProxy: Connecting to ResourceManager at slaver1/172.16.180.123:8050
17/05/15 19:36:17 INFO input.FileInputFormat: Total input paths to process : 1
17/05/15 19:36:17 INFO mapreduce.JobSubmitter: number of splits:1
17/05/15 19:36:17 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1494874269419_0019
17/05/15 19:36:18 INFO impl.YarnClientImpl: Submitted application application_1494874269419_0019
17/05/15 19:36:18 INFO mapreduce.Job: The url to track the job: http://slaver1:8088/proxy/application_1494874269419_0019/
17/05/15 19:36:18 INFO mapreduce.Job: Running job: job_1494874269419_0019
17/05/15 19:36:25 INFO mapreduce.Job: Job job_1494874269419_0019 running in uber mode : false
17/05/15 19:36:25 INFO mapreduce.Job: map 0% reduce 0%
17/05/15 19:36:31 INFO mapreduce.Job: map 100% reduce 0%
17/05/15 19:36:37 INFO mapreduce.Job: map 100% reduce 100%
17/05/15 19:36:38 INFO mapreduce.Job: Job job_1494874269419_0019 completed successfully
17/05/15 19:36:38 INFO mapreduce.Job: Counters: 49
File System Counters
FILE: Number of bytes read=249
FILE: Number of bytes written=273343
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=505
HDFS: Number of bytes written=500
HDFS: Number of read operations=7
HDFS: Number of large read operations=0
HDFS: Number of write operations=2
Job Counters
Launched map tasks=1
Launched reduce tasks=1
Data-local map tasks=1
Total time spent by all maps in occupied slots (ms)=4469
Total time spent by all reduces in occupied slots (ms)=6400
Total time spent by all map tasks (ms)=4469
Total time spent by all reduce tasks (ms)=3200
Total vcore-seconds taken by all map tasks=4469
Total vcore-seconds taken by all reduce tasks=3200
Total megabyte-seconds taken by all map tasks=2288128
Total megabyte-seconds taken by all reduce tasks=3276800
Map-Reduce Framework
Map input records=7
Map output records=22
Map output bytes=512
Map output materialized bytes=241
Input split bytes=140
Combine input records=22
Combine output records=7
Reduce input groups=7
Reduce shuffle bytes=241
Reduce input records=7
Reduce output records=7
Spilled Records=14
Shuffled Maps =1
Failed Shuffles=0
Merged Map outputs=1
GC time elapsed (ms)=113
CPU time spent (ms)=3290
Physical memory (bytes) snapshot=665755648
Virtual memory (bytes) snapshot=5179650048
Total committed heap usage (bytes)=486014976
Shuffle Errors
BAD_ID=0
CONNECTION=0
IO_ERROR=0
WRONG_LENGTH=0
WRONG_MAP=0
WRONG_REDUCE=0
File Input Format Counters
Bytes Read=365
File Output Format Counters
Bytes Written=500
17/05/15 19:36:38 INFO impl.TimelineClientImpl: Timeline service address: http://slaver1:8188/ws/v1/timeline/
17/05/15 19:36:38 INFO client.RMProxy: Connecting to ResourceManager at slaver1/172.16.180.123:8050
17/05/15 19:36:38 INFO input.FileInputFormat: Total input paths to process : 1
17/05/15 19:36:38 INFO input.FileInputFormat: Total input paths to process : 1
17/05/15 19:36:39 INFO mapreduce.JobSubmitter: number of splits:2
17/05/15 19:36:39 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1494874269419_0020
17/05/15 19:36:39 INFO impl.YarnClientImpl: Submitted application application_1494874269419_0020
17/05/15 19:36:39 INFO mapreduce.Job: The url to track the job: http://slaver1:8088/proxy/application_1494874269419_0020/
17/05/15 19:36:39 INFO mapreduce.Job: Running job: job_1494874269419_0020
17/05/15 19:36:47 INFO mapreduce.Job: Job job_1494874269419_0020 running in uber mode : false
17/05/15 19:36:47 INFO mapreduce.Job: map 0% reduce 0%
17/05/15 19:36:54 INFO mapreduce.Job: map 50% reduce 0%
17/05/15 19:36:55 INFO mapreduce.Job: map 100% reduce 0%
17/05/15 19:37:00 INFO mapreduce.Job: map 100% reduce 100%
17/05/15 19:37:01 INFO mapreduce.Job: Job job_1494874269419_0020 completed successfully
17/05/15 19:37:01 INFO mapreduce.Job: Counters: 49
File System Counters
FILE: Number of bytes read=309
FILE: Number of bytes written=410207
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=1453
HDFS: Number of bytes written=542
HDFS: Number of read operations=11
HDFS: Number of large read operations=0
HDFS: Number of write operations=2
Job Counters
Launched map tasks=2
Launched reduce tasks=1
Data-local map tasks=2
Total time spent by all maps in occupied slots (ms)=11488
Total time spent by all reduces in occupied slots (ms)=6586
Total time spent by all map tasks (ms)=11488
Total time spent by all reduce tasks (ms)=3293
Total vcore-seconds taken by all map tasks=11488
Total vcore-seconds taken by all reduce tasks=3293
Total megabyte-seconds taken by all map tasks=5881856
Total megabyte-seconds taken by all reduce tasks=3372032
Map-Reduce Framework
Map input records=12
Map output records=28
Map output bytes=423
Map output materialized bytes=306
Input split bytes=665
Combine input records=0
Combine output records=0
Reduce input groups=7
Reduce shuffle bytes=306
Reduce input records=28
Reduce output records=7
Spilled Records=56
Shuffled Maps =2
Failed Shuffles=0
Merged Map outputs=2
GC time elapsed (ms)=244
CPU time spent (ms)=6050
Physical memory (bytes) snapshot=1123991552
Virtual memory (bytes) snapshot=7530958848
Total committed heap usage (bytes)=849870848
Shuffle Errors
BAD_ID=0
CONNECTION=0
IO_ERROR=0
WRONG_LENGTH=0
WRONG_MAP=0
WRONG_REDUCE=0
File Input Format Counters
Bytes Read=0
File Output Format Counters
Bytes Written=542
17/05/15 19:37:01 INFO impl.TimelineClientImpl: Timeline service address: http://slaver1:8188/ws/v1/timeline/
17/05/15 19:37:01 INFO client.RMProxy: Connecting to ResourceManager at slaver1/172.16.180.123:8050
17/05/15 19:37:02 INFO input.FileInputFormat: Total input paths to process : 1
17/05/15 19:37:03 INFO mapreduce.JobSubmitter: number of splits:1
17/05/15 19:37:03 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1494874269419_0021
17/05/15 19:37:03 INFO impl.YarnClientImpl: Submitted application application_1494874269419_0021
17/05/15 19:37:03 INFO mapreduce.Job: The url to track the job: http://slaver1:8088/proxy/application_1494874269419_0021/
17/05/15 19:37:03 INFO mapreduce.Job: Running job: job_1494874269419_0021
17/05/15 19:37:10 INFO mapreduce.Job: Job job_1494874269419_0021 running in uber mode : false
17/05/15 19:37:10 INFO mapreduce.Job: map 0% reduce 0%
17/05/15 19:37:17 INFO mapreduce.Job: map 100% reduce 0%
17/05/15 19:37:24 INFO mapreduce.Job: map 100% reduce 100%
17/05/15 19:37:25 INFO mapreduce.Job: Job job_1494874269419_0021 completed successfully
17/05/15 19:37:25 INFO mapreduce.Job: Counters: 49
File System Counters
FILE: Number of bytes read=274
FILE: Number of bytes written=273455
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=866
HDFS: Number of bytes written=185
HDFS: Number of read operations=10
HDFS: Number of large read operations=0
HDFS: Number of write operations=2
Job Counters
Launched map tasks=1
Launched reduce tasks=1
Data-local map tasks=1
Total time spent by all maps in occupied slots (ms)=4874
Total time spent by all reduces in occupied slots (ms)=6604
Total time spent by all map tasks (ms)=4874
Total time spent by all reduce tasks (ms)=3302
Total vcore-seconds taken by all map tasks=4874
Total vcore-seconds taken by all reduce tasks=3302
Total megabyte-seconds taken by all map tasks=2495488
Total megabyte-seconds taken by all reduce tasks=3381248
Map-Reduce Framework
Map input records=7
Map output records=19
Map output bytes=768
Map output materialized bytes=266
Input split bytes=137
Combine input records=0
Combine output records=0
Reduce input groups=5
Reduce shuffle bytes=266
Reduce input records=19
Reduce output records=5
Spilled Records=38
Shuffled Maps =1
Failed Shuffles=0
Merged Map outputs=1
GC time elapsed (ms)=124
CPU time spent (ms)=2150
Physical memory (bytes) snapshot=597028864
Virtual memory (bytes) snapshot=5181710336
Total committed heap usage (bytes)=401080320
Shuffle Errors
BAD_ID=0
CONNECTION=0
IO_ERROR=0
WRONG_LENGTH=0
WRONG_MAP=0
WRONG_REDUCE=0
File Input Format Counters
Bytes Read=542
File Output Format Counters
Bytes Written=185
17/05/15 19:37:25 INFO driver.MahoutDriver: Program took 259068 ms (Minutes: 4.3178)
[hdfs@master ~]$ hadoop fs -cat /data/mahout/project/output/part-r-00000
1 [105:3.5941463,104:3.4639049]
2 [106:3.5,105:2.714964,107:2.0]
3 [103:3.59246,102:3.458911]
4 [107:4.7381864,105:4.2794304,102:4.170158]
5 [103:3.8962872,102:3.8564017,107:3.7692602]
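Each output line has the form userID<TAB>[itemID:predictedScore,...]. As an optional sketch (field layout assumed from the listing above), the top-ranked recommendation per user can be extracted with awk:

# Strip the brackets around field 2, split it on commas, and print the user ID with the first item:score pair
hadoop fs -cat /data/mahout/project/output/part-r-00000 | \
  awk '{gsub(/[\[\]]/, "", $2); split($2, recs, ","); print $1, recs[1]}'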