In the Hadoop framework, a beginner only needs to implement a Mapper and a Reducer class to build an inverted index. As an exercise you can download the 20 Newsgroups dataset: http://qwone.com/~jason/20Newsgroups/.
From my own test: without consolidating the files first, running all 19,997 articles on Hadoop nearly maxed out 16 GB of RAM...
So this is only a test run; to keep things simple, the Mapper, the Reducer, and the main method are all placed in a single WordCount class.
1> WordCount is the main driver program.
public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);

    Path path = new Path(args[0]);
    // FileSystem fs = FileSystem.get(conf);     // fully distributed
    FileSystem fs = path.getFileSystem(conf);    // pseudo-distributed
    if (fs.exists(path)) {
        // Walk the files inside the directory; there is one more level of
        // sub-directories here, so args[0] is the top-level directory path.
        FileStatus[] fileStatus = fs.listStatus(path);
        for (FileStatus fileStatus1 : fileStatus) {
            FileStatus[] fileStatus2 = fs.listStatus(fileStatus1.getPath());
            for (FileStatus fileStatu : fileStatus2) {
                // System.out.println(fileStatu.getPath().toString());
                FileInputFormat.addInputPath(job, fileStatu.getPath());
            }
        }
    }
    fs.close();
    // FileInputFormat.addInputPath(job, new Path(args[0])); // run on a single file; the argument is then a file path

    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
}
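As a side note, instead of listing the two directory levels by hand, a shorter variant is to hand FileInputFormat the top-level directory and let it recurse. This is only a sketch, assuming FileInputFormat.setInputDirRecursive is available in your Hadoop 2.x build:

    // Sketch: let FileInputFormat recurse into the per-newsgroup sub-directories
    // instead of adding every file individually (assumes a Hadoop 2.x API).
    FileInputFormat.setInputDirRecursive(job, true);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // args[0] = top-level input directory
    FileOutputFormat.setOutputPath(job, new Path(args[1]));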
Note how the FileSystem is instantiated: it differs between a fully distributed and a pseudo-distributed setup. Using the fully distributed form on a pseudo-distributed setup throws:
Exception in thread "main" java.lang.IllegalArgumentException: Wrong FS: hdfs://localhost:9000/user/hadoop/input3, expected: file:///
Related link: https://blog.csdn.net/huangjing_whlg/article/details/39341643
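A minimal sketch of the two ways to obtain a FileSystem handle; the explicit-URI variant is my own addition, assuming the hdfs://localhost:9000 address seen in the exception above:

    Configuration conf = new Configuration();
    Path path = new Path(args[0]);

    // Resolves the file system from the path's scheme (hdfs://, file://, ...),
    // so it works no matter what fs.defaultFS happens to be on the classpath.
    FileSystem fsFromPath = path.getFileSystem(conf);

    // Uses fs.defaultFS from the loaded configuration; if core-site.xml is not
    // on the classpath this falls back to file:///, which is what triggers the
    // "Wrong FS: hdfs://... expected: file:///" exception above.
    FileSystem fsFromConf = FileSystem.get(conf);

    // Third option (my addition): name the HDFS endpoint explicitly.
    FileSystem fsExplicit = FileSystem.get(java.net.URI.create("hdfs://localhost:9000"), conf);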
2> The Mapper. I used Lucene for the text processing: StopAnalyzer + PorterStemFilter, i.e. tokenization plus stemming. The required packages can be pulled in as dependencies in pom.xml. The commented-out code uses the plain tokenizer; try both and compare the results.
public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable>
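For reference, here is a minimal self-contained sketch of such a mapper (my own reconstruction, not the original code): the word#docId key format, the "contents" field name, and the use of FileSplit to recover the document id are assumptions, and the commented-out StandardAnalyzer line stands in for the "plain tokenizer" mentioned above. The imports are the ones in the full listing further down.

    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            // Use the input file name as the document id for the inverted index (assumption).
            String docId = ((FileSplit) context.getInputSplit()).getPath().getName();
            // Analyzer analyzer = new StandardAnalyzer();    // plain tokenizer, for comparison (hypothetical)
            Analyzer analyzer = new StopAnalyzer();            // tokenizes, lower-cases, drops English stop words
            try (TokenStream stream = new PorterStemFilter(    // Porter stemming on top of the token stream
                    analyzer.tokenStream("contents", value.toString()))) {
                CharTermAttribute term = stream.addAttribute(CharTermAttribute.class);
                stream.reset();
                while (stream.incrementToken()) {
                    word.set(term.toString() + "#" + docId);   // key = term#docId, value = 1
                    context.write(word, one);
                }
                stream.end();
            }
            analyzer.close();
        }
    }

With term#docId as the key, the IntSumReducer below simply sums the ones into a per-document term frequency, which matches the word + docid + tf output shown at the end.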
I have already compared the results for you: plain tokenizer on the left, plain tokenizer plus stemming on the right, as shown in the figure below.
3> The Reducer, which does simple counting.

public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable val : values) {
            sum += val.get();
        }
        result.set(sum);
        context.write(key, result);
    }
}
4> Set the run arguments: args[0] is the input directory on HDFS and args[1] is the output directory (which must not exist yet).
This run uses only a subset of the 20 Newsgroups data. The full source and pom.xml follow.
WordCount:
import java.io.*;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.core.StopAnalyzer;
import org.apache.lucene.analysis.en.PorterStemFilter;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class WordCount {

    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable>
pom.xml:
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>

    <groupId>Test.hadoop2</groupId>
    <artifactId>HadoopTest2</artifactId>
    <version>1.0-SNAPSHOT</version>

    <repositories>
        <repository>
            <id>nexus-aliyun</id>
            <name>nexus-aliyun</name>
            <url>http://maven.aliyun.com/nexus/content/groups/public/</url>
            <releases>
                <enabled>true</enabled>
            </releases>
            <snapshots>
                <enabled>false</enabled>
            </snapshots>
        </repository>
    </repositories>

    <dependencies>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-hdfs</artifactId>
            <version>2.9.0</version>
        </dependency>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-common</artifactId>
            <version>2.9.0</version>
        </dependency>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-yarn-common</artifactId>
            <version>2.9.0</version>
        </dependency>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-mapreduce-client-common</artifactId>
            <version>2.9.0</version>
        </dependency>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-auth</artifactId>
            <version>2.9.0</version>
        </dependency>
        <dependency>
            <groupId>org.apache.lucene</groupId>
            <artifactId>lucene-analyzers-common</artifactId>
            <version>7.3.0</version>
        </dependency>
        <dependency>
            <groupId>org.apache.lucene</groupId>
            <artifactId>lucene-core</artifactId>
            <version>7.3.0</version>
        </dependency>
        <dependency>
            <groupId>org.apache.lucene</groupId>
            <artifactId>lucene-analyzers-icu</artifactId>
            <version>7.3.0</version>
        </dependency>
        <dependency>
            <groupId>jfree</groupId>
            <artifactId>jfreechart</artifactId>
            <version>1.0.13</version>
        </dependency>
    </dependencies>

    <build>
        <plugins>
            <plugin>
                <artifactId>maven-dependency-plugin</artifactId>
                <configuration>
                    <excludeTransitive>false</excludeTransitive>
                    <stripVersion>true</stripVersion>
                    <outputDirectory>./lib</outputDirectory>
                </configuration>
            </plugin>
        </plugins>
    </build>
</project>
Result: each output line is word + docid + tf.
That wraps up this beginner series. I am only just getting started myself. Haha.