Hadoop Inverted Index

I have seen many Hadoop inverted index examples, but I wanted to write one of my own that reflects my understanding of MapReduce in Hadoop.

Here are the three articles used as input:

accident.txt

CHENGDU - Death toll from a colliery blast on Saturday in southwest China's Sichuan Province rose to 27, local authorities said.
As of 11:13 pm, 81 miners were rescued. Sixteen of them were injured and are treated in local hospitals, sources said.
The accident occurred at around 2 pm in Taozigou coal mine, Luxian County in the city of Luzhou, according to an official statement.
An investigation into the accident is underway.
It is the second coal mine accident in 24 hours in the country.
On Friday evening, 12 miners were killed and two others injured in a colliery gas explosion in southwest China's Guizhou Province, local authorities said on Saturday.
Taozigou coal mine [Photo/Xinhua]
Taozigou coal mine [Photo/Xinhua]

million.txt

NANCHANG - Rainstorms have battered southern and eastern China over the past five days, killing six people in Hunan Province, local authorities said Saturday.
Contiuous strong rain started to hit the central China province on Monday killing six people, the Hunan provincial flood prevention and drought control headquarters said.
As of Saturday, rainstorms have affected about 850,000 people, toppled more than 2,200 homes and forced 14,000 citizens to relocate in Hunan.
Heavy rainfall has also led to the flooding of major reservoirs and rivers.
Rainstorms have affected about 196,800 people in east China's Jiangxi Province, local authorities said Saturday.
As of 11 p.m. Friday, the heavy rain, which started from Tuesday, has battered 26 counties in Jiangxi, the provincial flood prevention and drought control headquarters said.
Local governments have relocated 6,019 residents to avoid potential risks.
The downpours have also damaged or destroyed 202 houses and ruined 16,710 hectares of crops, as well as causing high water levels of rivers and lakes, and several landslides.
Flood prevention authorities in Jiangxi warned of floods on Thursday due to the rising rivers and lakes.
The headquarters also ordered several reservoirs in the province to release water as levels had gone over or were approaching alarm lines because of the heavy rain.
No casualties have been reported as a result of the rainfall in Jiangxi.

Philippines.txt

TAIPEI - Taipei mayor Hau Lung-bin announced on Saturday the suspension of inter-city exchanges with the Philippines after a Taiwanese fisherman was shot dead by Philippine coast guards at sea.
The Philippines will also not be allowed to take part in Dragon Boat Festival races in Taipei on June 12, Hau said.
Hau condemned the Philippines over the shooting, and called it a violent act to fire upon an unarmed fisherman. He urged the Philippine government to apologize, release investigation reports and hold those responsible to account.
He also advised the Taiwanese authorities to take a hard stance on the Philippines by halting Philippine-bound tourism, suspending labor imports from the country and increasing fishing protection patrols.
The shooting happened on Thursday morning 164 nautical miles southeast of the southernmost tip of Taiwan, according to the island's coast guard authority.
The victim was identified as Hung Shih-Cheng, 65, one of four crew members of the Taiwanese fishing vessel Guang Ta Hsin 28. Hung's body was taken back to Taiwan early Saturday morning.

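The result I expect looks like this:

word	total count-article name:count-article name:count-article name:count

If a word does not occur in an article, that article is simply left out of the entry.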
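
To make the target format concrete, here is a minimal sketch that builds one such line from per-file counts. The class name `PostingFormat` and the method `format` are hypothetical and exist only for this illustration; a `LinkedHashMap` is assumed so that files appear in insertion order.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class PostingFormat {

    // Builds "word<TAB>total-file:count-file:count" from per-file counts.
    // A file in which the word never occurs is simply absent from the map.
    static String format(String word, Map<String, Integer> perFile) {
        int total = 0;
        StringBuilder postings = new StringBuilder();
        for (Map.Entry<String, Integer> e : perFile.entrySet()) {
            total += e.getValue();
            postings.append('-').append(e.getKey()).append(':').append(e.getValue());
        }
        return word + "\t" + total + postings;
    }

    public static void main(String[] args) {
        Map<String, Integer> perFile = new LinkedHashMap<>();
        perFile.put("million.txt", 2);
        perFile.put("accident.txt", 1);
        // Prints: As<TAB>3-million.txt:2-accident.txt:1
        System.out.println(format("As", perFile));
    }
}
```

The total is just the sum of the per-file counts, so it never has to be stored separately.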

First, we need a utility class that splits each line (a String) into individual words:

package com.test.util;

import java.util.ArrayList;
import java.util.List;

/**
 * @author hadoop
 *
 */
public class StringUtil {
	
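	/**
	 * Splits a string into words (maximal runs of ASCII letters).
	 * @param value the line of text to split
	 * @return the words in order of appearance
	 */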
	public static List<String> getWords(String value)
	{
		List<String> wordList = new ArrayList<String>();
		int len = value.length();
		char[] charArray = value.toCharArray();
		char[] word = new char[len]; // a word can never be longer than the line itself
		int wordIndex = 0;
		for(int i = 0; i < len; i++)
		{
			char c = charArray[i];
			if((c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z'))
			{
				word[wordIndex] = c;
				wordIndex++;
			}
			else
			{
				if(wordIndex > 0)
				{
					wordList.add(String.valueOf(word, 0, wordIndex));
					wordIndex = 0;
				}
			}
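		}
		// Flush the last word when the line ends with a letter.
		if(wordIndex > 0)
		{
			wordList.add(String.valueOf(word, 0, wordIndex));
		}
		return wordList;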
	}
}
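
For comparison, the same splitting rule (maximal runs of ASCII letters, everything else is a separator) can be sketched with `java.util.regex`. This is an alternative, not the approach used in the class above, and the class name `RegexWords` is hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class RegexWords {

    // One or more ASCII letters; digits, punctuation and spaces act as separators.
    private static final Pattern WORD = Pattern.compile("[A-Za-z]+");

    static List<String> getWords(String value) {
        List<String> words = new ArrayList<>();
        Matcher m = WORD.matcher(value);
        while (m.find()) {
            words.add(m.group());
        }
        return words;
    }

    public static void main(String[] args) {
        // Prints: [As, of, p, m, Friday, China, s, rain]
        System.out.println(getWords("As of 11 p.m. Friday, China's rain"));
    }
}
```

Because digits and punctuation only separate words, "p.m." yields "p" and "m", and "China's" yields "China" and "s". That is why single letters show up as words in the final index.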
Next comes our MapReduce job:

import java.io.IOException;
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;

import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.MapWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

import com.test.util.HdfsFileUtil;
import com.test.util.Prop;
import com.test.util.StringUtil;

/**
 * @author hadoop
 * 
 */
public class Seach {

	private static Log log = LogFactory.getLog(Seach.class);

	
	
	public static class Map extends Mapper<Object, Text, Text, MapWritable> {

		private FileSplit split;
		
		private static IntWritable data = new IntWritable(1);
		
		public void map(Object key, Text value, Context context)
				throws IOException, InterruptedException {
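			// The input split tells us which file the current line came from.
			split = (FileSplit)context.getInputSplit();
			// Every record for this line carries the same <file name, 1> payload,
			// so build it once instead of once per word.
			MapWritable map = new MapWritable();
			map.put(new Text(split.getPath().getName()), data);
			List<String> wordList = StringUtil.getWords(value.toString());
			int len = wordList.size();
			for(int i = 0; i < len; i++)
			{
				context.write(new Text(wordList.get(i)), map);
			}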
		}
	}

	public static class Reduce extends Reducer<Text, MapWritable, Text, Text> {

		public void reduce(Text key, Iterable<MapWritable> values,
				Context context) throws IOException, InterruptedException {
			int count = 0;
			java.util.Map<String, Integer> countMap = new HashMap<String, Integer>();
			Iterator<MapWritable> iterator = values.iterator();
			while(iterator.hasNext())
			{
				MapWritable curMap = iterator.next();
				// Each map output value holds exactly one <file name, 1> entry.
				String fileName = curMap.keySet().iterator().next().toString();
				if(countMap.containsKey(fileName))
				{
					countMap.put(fileName, countMap.get(fileName) + 1);
				}
				else
				{
					countMap.put(fileName, 1);
				}
				count++;
			}
			int fCount = 0;
			String value = "";
			Iterator<String> it = countMap.keySet().iterator();
			while(it.hasNext())
			{
				if(fCount > 0)
				{
					value += "-";
				}
				String fileName = it.next();
				value += fileName + ":" + countMap.get(fileName);
				fCount++;
			}
			context.write(key, new Text(count + "-" + value));
		}
	}

	/**
	 * @param args
	 */
	public static void main(String[] args) throws Exception {
		String inputPath = "search_in";
    	String outPath = "search_out";
    	Configuration conf = new Configuration();
        conf.set("mapred.job.tracker", Prop.HADOOP_MAPRED_JOB_TRACKER);
        conf.set("fs.default.name", Prop.HDFS_HOST);
        HdfsFileUtil.checkAndDelete(conf, "/" + Prop.HDFS_DIRECTORY + "/" + inputPath);
        HdfsFileUtil.checkAndDelete(conf, "/" + Prop.HDFS_DIRECTORY + "/" + outPath);
    	HdfsFileUtil.upload(conf, Prop.LOCAL_HDFS_DIRECTORY + "/" + inputPath, Prop.HDFS_DIRECTORY + "/" + inputPath);
    	
        Job job = new Job(conf, "search");
        job.setJarByClass(Seach.class);
        // Set the Mapper and Reducer classes
        job.setMapperClass(Map.class);
        job.setReducerClass(Reduce.class);
        // Set the output types
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(MapWritable.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        // Set the input and output directories
        FileInputFormat.addInputPath(job, new Path(inputPath));
        FileOutputFormat.setOutputPath(job, new Path(outPath));
        
        boolean exitStatus = job.waitForCompletion(true);
		HdfsFileUtil.download(conf, "/" + Prop.HDFS_DIRECTORY + "/" + outPath, Prop.LOCAL_HDFS_DIRECTORY + "/" + outPath);
		System.exit(exitStatus ? 0 : 1);
	}
}
You can skip the HdfsFileUtil class; it only wraps a few HDFS file system operations to make debugging easier.

The key point is the principle:

The data emitted by map looks like this:

<word, <file name, count>>


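The reduce step is then easy to follow: for each word it only needs to count the total number of occurrences and the number of occurrences in each file, then format the result and write it out.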
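
The map and reduce steps described here can be sketched in plain Java, without Hadoop, to see the data flow end to end. This is only a local simulation under stated assumptions: the class `LocalInvertedIndex`, its methods, and the sample file names `a.txt` and `b.txt` are hypothetical, and a `TreeMap` stands in for the sorted grouping that the shuffle performs.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class LocalInvertedIndex {

    private static final Pattern WORD = Pattern.compile("[A-Za-z]+");

    // "Map" step: emit one (word, fileName) pair per word occurrence.
    static List<String[]> map(String fileName, String line) {
        List<String[]> pairs = new ArrayList<>();
        Matcher m = WORD.matcher(line);
        while (m.find()) {
            pairs.add(new String[] {m.group(), fileName});
        }
        return pairs;
    }

    // "Shuffle" and "reduce" steps: group by word, count per file, then format.
    static List<String> reduce(List<String[]> pairs) {
        // TreeMap keeps the words sorted, similar to the sorted keys a reducer sees.
        Map<String, Map<String, Integer>> grouped = new TreeMap<>();
        for (String[] pair : pairs) {
            grouped.computeIfAbsent(pair[0], k -> new LinkedHashMap<>())
                   .merge(pair[1], 1, Integer::sum);
        }
        List<String> lines = new ArrayList<>();
        for (Map.Entry<String, Map<String, Integer>> e : grouped.entrySet()) {
            int total = 0;
            StringBuilder postings = new StringBuilder();
            for (Map.Entry<String, Integer> f : e.getValue().entrySet()) {
                total += f.getValue();
                postings.append('-').append(f.getKey()).append(':').append(f.getValue());
            }
            lines.add(e.getKey() + "\t" + total + postings);
        }
        return lines;
    }

    public static void main(String[] args) {
        List<String[]> pairs = new ArrayList<>();
        pairs.addAll(map("a.txt", "coal mine, coal"));
        pairs.addAll(map("b.txt", "mine"));
        // Prints two lines: "coal<TAB>2-a.txt:2" and "mine<TAB>2-a.txt:1-b.txt:1"
        reduce(pairs).forEach(System.out::println);
    }
}
```

In the real job the grouping by word is done by the framework between map and reduce, so the reducer only has to count and format.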

The resulting output file is as follows:

An	1-accident.txt:1
As	3-million.txt:2-accident.txt:1
Boat	1-Philippines.txt:1
CHENGDU	1-accident.txt:1
Cheng	1-Philippines.txt:1
China	5-million.txt:3-accident.txt:2
Contiuous	1-million.txt:1
County	1-accident.txt:1
Death	1-accident.txt:1
Dragon	1-Philippines.txt:1
Festival	1-Philippines.txt:1
Flood	1-million.txt:1
Friday	2-million.txt:1-accident.txt:1
Guang	1-Philippines.txt:1
Guizhou	1-accident.txt:1
Hau	3-Philippines.txt:3
He	2-Philippines.txt:2
Heavy	1-million.txt:1
Hsin	1-Philippines.txt:1
Hunan	3-million.txt:3
Hung	2-Philippines.txt:2
It	1-accident.txt:1
Jiangxi	4-million.txt:4
June	1-Philippines.txt:1
Local	1-million.txt:1
Lung	1-Philippines.txt:1
Luxian	1-accident.txt:1
Luzhou	1-accident.txt:1
Monday	1-million.txt:1
NANCHANG	1-million.txt:1
No	1-million.txt:1
On	1-accident.txt:1
Philippine	3-Philippines.txt:3
Philippines	4-Philippines.txt:4
Photo	2-accident.txt:2
Province	4-million.txt:2-accident.txt:2
Rainstorms	2-million.txt:2
Saturday	7-million.txt:3-Philippines.txt:2-accident.txt:2
Shih	1-Philippines.txt:1
Sichuan	1-accident.txt:1
Sixteen	1-accident.txt:1
TAIPEI	1-Philippines.txt:1
Ta	1-Philippines.txt:1
Taipei	2-Philippines.txt:2
Taiwan	2-Philippines.txt:2
Taiwanese	3-Philippines.txt:3
Taozigou	3-accident.txt:3
The	6-million.txt:2-Philippines.txt:3-accident.txt:1
Thursday	2-million.txt:1-Philippines.txt:1
Tuesday	1-million.txt:1
Xinhua	2-accident.txt:2
a	6-million.txt:1-Philippines.txt:3-accident.txt:2
about	2-million.txt:2
accident	3-accident.txt:3
according	2-Philippines.txt:1-accident.txt:1
account	1-Philippines.txt:1
act	1-Philippines.txt:1
advised	1-Philippines.txt:1
affected	2-million.txt:2
after	1-Philippines.txt:1
alarm	1-million.txt:1
allowed	1-Philippines.txt:1
also	5-million.txt:3-Philippines.txt:2
an	2-Philippines.txt:1-accident.txt:1
and	14-million.txt:9-Philippines.txt:3-accident.txt:2
announced	1-Philippines.txt:1
apologize	1-Philippines.txt:1
approaching	1-million.txt:1
are	1-accident.txt:1
around	1-accident.txt:1
as	5-million.txt:4-Philippines.txt:1
at	2-Philippines.txt:1-accident.txt:1
authorities	6-million.txt:3-Philippines.txt:1-accident.txt:2
authority	1-Philippines.txt:1
avoid	1-million.txt:1
back	1-Philippines.txt:1
battered	2-million.txt:2
be	1-Philippines.txt:1
because	1-million.txt:1
been	1-million.txt:1
bin	1-Philippines.txt:1
blast	1-accident.txt:1
body	1-Philippines.txt:1
bound	1-Philippines.txt:1
by	2-Philippines.txt:2
called	1-Philippines.txt:1
casualties	1-million.txt:1
causing	1-million.txt:1
central	1-million.txt:1
citizens	1-million.txt:1
city	2-Philippines.txt:1-accident.txt:1
coal	4-accident.txt:4
coast	2-Philippines.txt:2
colliery	2-accident.txt:2
condemned	1-Philippines.txt:1
control	2-million.txt:2
counties	1-million.txt:1
country	2-Philippines.txt:1-accident.txt:1
crew	1-Philippines.txt:1
crops	1-million.txt:1
damaged	1-million.txt:1
days	1-million.txt:1
dead	1-Philippines.txt:1
destroyed	1-million.txt:1
downpours	1-million.txt:1
drought	2-million.txt:2
due	1-million.txt:1
early	1-Philippines.txt:1
east	1-million.txt:1
eastern	1-million.txt:1
evening	1-accident.txt:1
exchanges	1-Philippines.txt:1
explosion	1-accident.txt:1
fire	1-Philippines.txt:1
fisherman	2-Philippines.txt:2
fishing	2-Philippines.txt:2
five	1-million.txt:1
flood	2-million.txt:2
flooding	1-million.txt:1
floods	1-million.txt:1
forced	1-million.txt:1
four	1-Philippines.txt:1
from	3-million.txt:1-Philippines.txt:1-accident.txt:1
gas	1-accident.txt:1
gone	1-million.txt:1
government	1-Philippines.txt:1
governments	1-million.txt:1
guard	1-Philippines.txt:1
guards	1-Philippines.txt:1
had	1-million.txt:1
halting	1-Philippines.txt:1
happened	1-Philippines.txt:1
hard	1-Philippines.txt:1
has	2-million.txt:2
have	6-million.txt:6
headquarters	3-million.txt:3
heavy	2-million.txt:2
hectares	1-million.txt:1
high	1-million.txt:1
hit	1-million.txt:1
hold	1-Philippines.txt:1
homes	1-million.txt:1
hospitals	1-accident.txt:1
hours	1-accident.txt:1
houses	1-million.txt:1
identified	1-Philippines.txt:1
imports	1-Philippines.txt:1
in	17-million.txt:7-Philippines.txt:2-accident.txt:8
increasing	1-Philippines.txt:1
injured	2-accident.txt:2
inter	1-Philippines.txt:1
into	1-accident.txt:1
investigation	2-Philippines.txt:1-accident.txt:1
is	2-accident.txt:2
island	1-Philippines.txt:1
it	1-Philippines.txt:1
killed	1-accident.txt:1
killing	2-million.txt:2
labor	1-Philippines.txt:1
lakes	2-million.txt:2
landslides	1-million.txt:1
led	1-million.txt:1
levels	2-million.txt:2
lines	1-million.txt:1
local	5-million.txt:2-accident.txt:3
m	1-million.txt:1
major	1-million.txt:1
mayor	1-Philippines.txt:1
members	1-Philippines.txt:1
miles	1-Philippines.txt:1
mine	4-accident.txt:4
miners	2-accident.txt:2
more	1-million.txt:1
morning	2-Philippines.txt:2
nautical	1-Philippines.txt:1
not	1-Philippines.txt:1
occurred	1-accident.txt:1
of	16-million.txt:8-Philippines.txt:5-accident.txt:3
official	1-accident.txt:1
on	8-million.txt:2-Philippines.txt:4-accident.txt:2
one	1-Philippines.txt:1
or	2-million.txt:2
ordered	1-million.txt:1
others	1-accident.txt:1
over	3-million.txt:2-Philippines.txt:1
p	1-million.txt:1
part	1-Philippines.txt:1
past	1-million.txt:1
patrols	1-Philippines.txt:1
people	4-million.txt:4
pm	2-accident.txt:2
potential	1-million.txt:1
prevention	3-million.txt:3
protection	1-Philippines.txt:1
province	2-million.txt:2
provincial	2-million.txt:2
races	1-Philippines.txt:1
rain	3-million.txt:3
rainfall	2-million.txt:2
rainstorms	1-million.txt:1
release	2-million.txt:1-Philippines.txt:1
relocate	1-million.txt:1
relocated	1-million.txt:1
reported	1-million.txt:1
reports	1-Philippines.txt:1
rescued	1-accident.txt:1
reservoirs	2-million.txt:2
residents	1-million.txt:1
responsible	1-Philippines.txt:1
result	1-million.txt:1
rising	1-million.txt:1
risks	1-million.txt:1
rivers	3-million.txt:3
rose	1-accident.txt:1
ruined	1-million.txt:1
s	5-million.txt:1-Philippines.txt:2-accident.txt:2
said	8-million.txt:4-Philippines.txt:1-accident.txt:3
sea	1-Philippines.txt:1
second	1-accident.txt:1
several	2-million.txt:2
shooting	2-Philippines.txt:2
shot	1-Philippines.txt:1
six	2-million.txt:2
sources	1-accident.txt:1
southeast	1-Philippines.txt:1
southern	1-million.txt:1
southernmost	1-Philippines.txt:1
southwest	2-accident.txt:2
stance	1-Philippines.txt:1
started	2-million.txt:2
statement	1-accident.txt:1
strong	1-million.txt:1
suspending	1-Philippines.txt:1
suspension	1-Philippines.txt:1
take	2-Philippines.txt:2
taken	1-Philippines.txt:1
than	1-million.txt:1
the	25-million.txt:10-Philippines.txt:11-accident.txt:4
them	1-accident.txt:1
those	1-Philippines.txt:1
tip	1-Philippines.txt:1
to	15-million.txt:6-Philippines.txt:7-accident.txt:2
toll	1-accident.txt:1
toppled	1-million.txt:1
tourism	1-Philippines.txt:1
treated	1-accident.txt:1
two	1-accident.txt:1
unarmed	1-Philippines.txt:1
underway	1-accident.txt:1
upon	1-Philippines.txt:1
urged	1-Philippines.txt:1
vessel	1-Philippines.txt:1
victim	1-Philippines.txt:1
violent	1-Philippines.txt:1
warned	1-million.txt:1
was	3-Philippines.txt:3
water	2-million.txt:2
well	1-million.txt:1
were	4-million.txt:1-accident.txt:3
which	1-million.txt:1
will	1-Philippines.txt:1
with	1-Philippines.txt:1
This matches the original design. Very good.


