Mahout exercise:
Install the Mahout Client on the master node, then open a Linux shell and run the mahout command to list the example programs that ship with Mahout.
[root@master ~]# mahout
MAHOUT_LOCAL is not set; adding HADOOP_CONF_DIR to classpath.
Running on hadoop, using /usr/hdp/2.6.1.0-129/hadoop/bin/hadoop and HADOOP_CONF_DIR=/usr/hdp/2.6.1.0-129/hadoop/conf
MAHOUT-JOB: /usr/hdp/2.6.1.0-129/mahout/mahout-examples-0.9.0.2.6.1.0-129-job.jar
An example program must be given as the first argument.
Valid program names are:
arff.vector: : Generate Vectors from an ARFF file or directory
baumwelch: : Baum-Welch algorithm for unsupervised HMM training
buildforest: : Build the random forest classifier
canopy: : Canopy clustering
cat: : Print a file or resource as the logistic regression models would see it
cleansvd: : Cleanup and verification of SVD output
clusterdump: : Dump cluster output to text
clusterpp: : Groups Clustering Output In Clusters
cmdump: : Dump confusion matrix in HTML or text formats
concatmatrices: : Concatenates 2 matrices of same cardinality into a single matrix
cvb: : LDA via Collapsed Variation Bayes (0th deriv. approx)
cvb0_local: : LDA via Collapsed Variation Bayes, in memory locally.
describe: : Describe the fields and target variable in a data set
evaluateFactorization: : compute RMSE and MAE of a rating matrix factorization against probes
fkmeans: : Fuzzy K-means clustering
hmmpredict: : Generate random sequence of observations by given HMM
itemsimilarity: : Compute the item-item-similarities for item-based collaborative filtering
kmeans: : K-means clustering
lucene.vector: : Generate Vectors from a Lucene index
lucene2seq: : Generate Text SequenceFiles from a Lucene index
matrixdump: : Dump matrix in CSV format
matrixmult: : Take the product of two matrices
parallelALS: : ALS-WR factorization of a rating matrix
qualcluster: : Runs clustering experiments and summarizes results in a CSV
recommendfactorized: : Compute recommendations using the factorization of a rating matrix
recommenditembased: : Compute recommendations using item-based collaborative filtering
regexconverter: : Convert text files on a per line basis based on regular expressions
resplit: : Splits a set of SequenceFiles into a number of equal splits
rowid: : Map SequenceFile<Text,VectorWritable> to {SequenceFile<IntWritable,VectorWritable>, SequenceFile<IntWritable,Text>}
rowsimilarity: : Compute the pairwise similarities of the rows of a matrix
runAdaptiveLogistic: : Score new production data using a probably trained and validated AdaptivelogisticRegression model
runlogistic: : Run a logistic regression model against CSV data
seq2encoded: : Encoded Sparse Vector generation from Text sequence files
seq2sparse: : Sparse Vector generation from Text sequence files
seqdirectory: : Generate sequence files (of Text) from a directory
seqdumper: : Generic Sequence File dumper
seqmailarchives: : Creates SequenceFile from a directory containing gzipped mail archives
seqwiki: : Wikipedia xml dump to sequence file
spectralkmeans: : Spectral k-means clustering
split: : Split Input data into test and train sets
splitDataset: : split a rating dataset into training and probe parts
ssvd: : Stochastic SVD
streamingkmeans: : Streaming k-means clustering
svd: : Lanczos Singular Value Decomposition
testforest: : Test the random forest classifier
testnb: : Test the Vector-based Bayes classifier
trainAdaptiveLogistic: : Train an AdaptivelogisticRegression model
trainlogistic: : Train a logistic regression using stochastic gradient descent
trainnb: : Train the Vector-based Bayes classifier
transpose: : Take the transpose of a matrix
validateAdaptiveLogistic: : Validate an AdaptivelogisticRegression model against hold-out data set
vecdist: : Compute the distances between a set of Vectors (or Cluster or Canopy, they must fit in memory) and a list of Vectors
vectordump: : Dump vectors from a sequence file to text
viterbi: : Viterbi decoding of hidden states from given output states sequence
19/05/16 13:28:15 WARN util.ShutdownHookManager: ShutdownHook '' timeout, java.util.concurrent.TimeoutException
java.util.concurrent.TimeoutException
at java.util.concurrent.FutureTask.get(FutureTask.java:205)
at org.apache.hadoop.util.ShutdownHookManager$1.run(ShutdownHookManager.java:67)
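The ShutdownHookManager TimeoutException warning appears only after the program list has already been printed and can be ignored for this exercise. Any of the listed programs is launched through the same mahout wrapper; as a minimal sketch (not part of the original transcript), the driver used in the next step can be asked for its supported options:
mahout seqdirectory --help     # prints the driver's option list (e.g. --input, --output); exact options depend on the Mahout version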
Use Mahout to convert the extracted contents of 20news-bydate.tar.gz into sequence files, save them to the /data/mahout/20news/output/20news-seq/ directory, and then list that directory.
[root@master ~]# mkdir 20new
[root@master ~]# tar -zxf /opt/20news-bydate.tar.gz -C 20new/
[root@master ~]# hadoop fs -mkdir -p /data/mahout/20news/20news-all
[root@master ~]# hadoop fs -put 20new/* /data/mahout/20news/20news-all
[hdfs@master ~]$ mahout seqdirectory -i /data/mahout/20news/20news-all -o /data/mahout/20news/output/20news-seq
MAHOUT_LOCAL is not set; adding HADOOP_CONF_DIR to classpath.
Running on hadoop, using /usr/hdp/2.6.1.0-129/hadoop/bin/hadoop and HADOOP_CONF_DIR=/usr/hdp/2.6.1.0-129/hadoop/conf
MAHOUT-JOB: /usr/hdp/2.6.1.0-129/mahout/mahout-examples-0.9.0.2.6.1.0-129-job.jar
19/05/21 15:15:26 WARN driver.MahoutDriver: No seqdirectory.props found on classpath, will use command-line arguments only
19/05/21 15:16:25 INFO mapreduce.JobSubmitter: number of splits:1
19/05/21 15:16:26 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1558132743195_0001
19/05/21 15:16:29 INFO impl.YarnClientImpl: Application submission is not finished, submitted application application_1558132743195_0001 is still in NEW_SAVING
19/05/21 15:16:31 INFO impl.YarnClientImpl: Submitted application application_1558132743195_0001
19/05/21 15:16:31 INFO mapreduce.Job: The url to track the job: http://slaver1.hadoop:8088/proxy/application_1558132743195_0001/
19/05/21 15:16:31 INFO mapreduce.Job: Running job: job_1558132743195_0001
19/05/21 15:16:52 INFO mapreduce.Job: Job job_1558132743195_0001 running in uber mode : false
19/05/21 15:16:52 INFO mapreduce.Job: map 0% reduce 0%
19/05/21 15:17:03 INFO mapreduce.Job: map 7% reduce 0%
19/05/21 15:17:06 INFO mapreduce.Job: map 12% reduce 0%
19/05/21 15:17:09 INFO mapreduce.Job: map 16% reduce 0%
19/05/21 15:17:12 INFO mapreduce.Job: map 20% reduce 0%
19/05/21 15:17:14 INFO mapreduce.Job: map 22% reduce 0%
19/05/21 15:17:17 INFO mapreduce.Job: map 25% reduce 0%
19/05/21 15:17:20 INFO mapreduce.Job: map 32% reduce 0%
19/05/21 15:17:23 INFO mapreduce.Job: map 39% reduce 0%
19/05/21 15:17:26 INFO mapreduce.Job: map 47% reduce 0%
19/05/21 15:17:29 INFO mapreduce.Job: map 55% reduce 0%
19/05/21 15:17:32 INFO mapreduce.Job: map 61% reduce 0%
19/05/21 15:17:35 INFO mapreduce.Job: map 67% reduce 0%
19/05/21 15:17:38 INFO mapreduce.Job: map 75% reduce 0%
19/05/21 15:17:41 INFO mapreduce.Job: map 83% reduce 0%
19/05/21 15:17:44 INFO mapreduce.Job: map 92% reduce 0%
19/05/21 15:17:47 INFO mapreduce.Job: map 100% reduce 0%
19/05/21 15:17:47 INFO mapreduce.Job: Job job_1558132743195_0001 completed successfully
19/05/21 15:17:48 INFO mapreduce.Job: Counters: 30
File System Counters
FILE: Number of bytes read=0
FILE: Number of bytes written=151414
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=35728752
HDFS: Number of bytes written=12816846
HDFS: Number of read operations=71984
HDFS: Number of large read operations=0
HDFS: Number of write operations=2
Job Counters
Launched map tasks=1
Other local map tasks=1
Total time spent by all maps in occupied slots (ms)=107006
Total time spent by all reduces in occupied slots (ms)=0
Total time spent by all map tasks (ms)=53503
Total vcore-milliseconds taken by all map tasks=53503
Total megabyte-milliseconds taken by all map tasks=82180608
Map-Reduce Framework
Map input records=17995
Map output records=17995
Input split bytes=2055363
Spilled Records=0
Failed Shuffles=0
Merged Map outputs=0
GC time elapsed (ms)=358
CPU time spent (ms)=39420
Physical memory (bytes) snapshot=335376384
Virtual memory (bytes) snapshot=3257303040
Total committed heap usage (bytes)=176160768
File Input Format Counters
Bytes Read=0
File Output Format Counters
Bytes Written=12816846
19/05/21 15:17:48 INFO driver.MahoutDriver: Program took 138608 ms (Minutes: 2.3101333333333334)
[hdfs@master ~]$ hadoop fs -ls /data/mahout/20news/output/20news-seq
Found 2 items
-rw-r--r-- 3 hdfs hdfs 0 2019-05-21 15:17 /data/mahout/20news/output/20news-seq/_SUCCESS
-rw-r--r-- 3 hdfs hdfs 12816846 2019-05-21 15:17 /data/mahout/20news/output/20news-seq/part-m-00000
View the first 20 lines of the generated sequence file:
hadoop fs -text /data/mahout/20news/output/20news-seq/part-m-00000 | head -n 20
In article
'>
'>
'> #12) The 2 cheribums are on the Ark of the Covenant. When God said make no
'> graven image, he was refering to idols, which were created to be worshipped.
'> The Ark of the Covenant wasn't wrodhipped and only the high priest could
'> enter the Holy of Holies where it was kept once a year, on the Day of
'> Atonement.
I am not familiar with, or knowledgeable about the original language,
but I believe there is a word for "idol" and that the translator
would have used the word "idol" instead of "graven image" had
the original said "idol." So I think you're wrong here, but
then again I could be too. I just suggesting a way to determine
text: Unable to write to output stream.
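The trailing "text: Unable to write to output stream." message is expected here: head closes the pipe after 20 lines, so hadoop fs -text reports a broken output stream. The same file can also be inspected with Mahout's own seqdumper program listed above, which prints the sequence-file keys and values explicitly; a minimal sketch, assuming the same output path:
mahout seqdumper -i /data/mahout/20news/output/20news-seq/part-m-00000 | head -n 20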
Upload the user-item-score.txt rating file to HDFS and use Mahout's item-based collaborative filtering to compute 3 recommendations per user with the Euclidean distance similarity:
hadoop fs -mkdir -p /data/mahout/project
hadoop fs -put /opt/user-item-score.txt /data/mahout/project
mahout recommenditembased -i /data/mahout/project/user-item-score.txt -o /data/mahout/project/output -n 3 -b false -s SIMILARITY_EUCLIDEAN_DISTANCE --maxPrefsPerUser 4 --minPrefsPerUser 1 --maxPrefsInItemSimilarity 4 --tempDir /data/mahout/project/temp
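recommenditembased expects the rating file to contain one userID,itemID,score triple per line. The lines below are a hypothetical illustration of that format and of how the generated recommendations could be inspected; the actual contents of /opt/user-item-score.txt are not shown in the original.
# hypothetical rating lines in user-item-score.txt (userID,itemID,score):
#   1,101,5.0
#   1,102,3.0
#   2,101,2.0
hadoop fs -cat /data/mahout/project/output/part-r-* | head
# each output line holds a userID followed by its recommendation list, e.g. [itemID1:estimatedScore,itemID2:estimatedScore,...]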