Hadoop: Common Problems and Solutions

1. Running the jar fails with "class wordcount.WordCountMapper not found"

Cause: run() never calls setJarByClass, so Hadoop cannot tell which jar contains the job classes.
Fix: add job.setJarByClass(getClass()); to run() in WordCountJob.java.

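2. Running the jar fails with Exception in thread "main" java.lang.ClassNotFoundException

Causes:
1) project.ext.mainclass in build.gradle is configured incorrectly.
Fix: edit the project's build.gradle and set project.ext.mainclass to packageName.mainClassName (substitute the real names, for example 'wordcount.WordCountJob').

2) The source folder layout is wrong.
Fix: the source folder must be exactly src/main/java. Packages go inside the source folder and classes go inside their packages.
Notes:
- When Eclipse builds the jar it reads build.gradle, so the packageName and mainClassName given in mainclass must match the real package and class names.
- The source folder structure must be src/main/java.
- Each top-level (non-inner) class must have the same name as its source file.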

3. Running the jar fails with java.lang.NullPointerException

If the failing class is a custom grouping comparator, declared as public class NaturalKeyGroupingComparator extends WritableComparator:

It must have a no-argument constructor that calls super, for example:

public NaturalKeyGroupingComparator() {
    // The super(CompositeKey.class, true) call is required. Without it the
    // job fails with:
    //   java.lang.NullPointerException at
    //   org.apache.hadoop.io.WritableComparator.compare(WritableComparator.java:157)
    super(CompositeKey.class, true);
}

Another cause: an object was never instantiated. In the error log, "at wordcount.WordCountMapper.map(WordCountMapper.java:26)" points at line 26, where outputValue is used without being initialized. The code only declares private IntWritable outputValue; and never allocates it with new IntWritable(), yet later calls outputValue.set(...), which throws.
Fix: allocate the object at declaration: private IntWritable outputValue = new IntWritable();
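The declared-but-never-allocated field described above can be reproduced without Hadoop. The following is a minimal sketch: Box is a hypothetical stand-in for a mutable value type such as IntWritable, and the class and method names are made up for illustration.

```java
public class UninitializedFieldDemo {
    /** Hypothetical stand-in for a mutable value type such as IntWritable. */
    static final class Box {
        private int value;
        void set(int v) { value = v; }
        int get() { return value; }
    }

    private Box declaredOnly;             // declared, never assigned: stays null
    private Box initialized = new Box();  // allocated at declaration

    /** Returns true if calling set() on the declared-only field throws NullPointerException. */
    public static boolean declaredOnlyThrows() {
        UninitializedFieldDemo demo = new UninitializedFieldDemo();
        try {
            demo.declaredOnly.set(1);
            return false;
        } catch (NullPointerException e) {
            return true;
        }
    }

    /** Returns the value stored through the field that was allocated at declaration. */
    public static int initializedWorks() {
        UninitializedFieldDemo demo = new UninitializedFieldDemo();
        demo.initialized.set(1);
        return demo.initialized.get();
    }

    public static void main(String[] args) {
        System.out.println("declared only throws NPE: " + declaredOnlyThrows());
        System.out.println("initialized field holds: " + initializedWorks());
    }
}
```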

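4. A variable declared inside a method cannot take an access modifier (private/protected/public); adding one is a compile error.

Access levels for class members:

            same class   same package   subclass   other packages
public          √             √            √             √
protected       √             √            √             ×
default         √             √            ×             ×
private         √             ×            ×             ×

(1) public: accessible from all other classes. Any public class, interface, or exception on the CLASSPATH can be accessed. It is typically used for whatever an object or class exposes to the outside as its interface.
(2) private: can be accessed and modified only from within the class itself.
(3) protected: accessible from the class itself, its subclasses, and classes in the same package.
(4) default: accessible from classes in the same package. It applies when no modifier is written, and is also called "friendly" (package-private) access.

Static fields: if a field is declared static, it belongs to the class rather than to any one object of the class. There is exactly one such field per class, whereas every object has its own copy of each instance field (a field not declared static).

Static constants: a field declared static final is a static constant. Neither keyword can be left out. Without static it becomes an instance field that must be accessed through an object. Without final it becomes an ordinary static field that can be modified.

Static methods: a static method is one that does not operate on an object. Math.pow is a static method: it uses no Math object when it runs, in other words it has no implicit parameter. Because a static method cannot operate on an object, it cannot access instance fields, but it can access the static fields of its own class. A static method can be called through an object, but that is confusing because the result has nothing to do with the object. Prefer calling it through the class name.

Use a static method in two situations:
1. The method does not need to access object state, and everything it needs is supplied through explicit parameters.
2. The method only needs to access static fields of the class.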
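A small sketch of the static field, static constant, and static method rules above. The class and its members are hypothetical names chosen for illustration.

```java
public class StaticDemo {
    public static final double TAX_RATE = 0.25; // static constant: shared and unchangeable
    public static int created = 0;              // static field: one copy for the whole class
    public final int id;                        // instance field: one copy per object

    public StaticDemo() {
        created++;        // every constructor call updates the single shared counter
        id = created;
    }

    /** Static method: no implicit "this"; it works only on its explicit parameter. */
    public static double taxOn(double amount) {
        return amount * TAX_RATE;
    }

    public static void main(String[] args) {
        StaticDemo a = new StaticDemo();
        StaticDemo b = new StaticDemo();
        System.out.println(a.id + " " + b.id + " " + StaticDemo.created);
        System.out.println(StaticDemo.taxOn(100.0)); // called through the class name
        System.out.println(Math.pow(2, 10));         // Math.pow needs no Math object
    }
}
```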


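5. Java generics

1) Overview
Before generics, Java types were divided into primitive types and complex types, and complex types into arrays and classes. With generics, a single complex type subdivides into many more types. For example, the former type List now subdivides into List<Object>, List<String>, and so on. Note that List<Object> and List<String> are two different types with no inheritance relationship between them, even though String extends Object. The following code is illegal:
    List<String> ls = new ArrayList<String>();
    List<Object> lo = ls;
The reason for this design: according to lo's declaration, the compiler would let you add any object (an Integer, for example) through lo, but the object is really a List<String>, so the integrity of its element type would be broken.
Before generics, a method that had to support several data types needed one overload per type. Generics solve that problem (polymorphism), and go further by letting you state the relationships among the parameters and the return value. For example
public void write(Integer i, Integer[] ia);
public void write(Double d, Double[] da);
have the generic version
public <T> void write(T t, T[] ta);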
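As a runnable sketch of the point above, one generic method can stand in for the two overloads. The class name and the String return value are assumptions added so the result can be printed.

```java
import java.util.Arrays;

public class GenericWriteDemo {
    /** One generic method replaces the Integer and Double overloads. */
    public static <T> String write(T t, T[] ta) {
        return t + " " + Arrays.toString(ta);
    }

    public static void main(String[] args) {
        // T is inferred as Integer for the first call and Double for the second.
        System.out.println(write(1, new Integer[] {2, 3}));
        System.out.println(write(1.5, new Double[] {2.5}));
    }
}
```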

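2) Definition and use
Naming style for type parameters: use short names for formal type parameters (a single character if possible) and avoid lowercase letters, so that they are easy to tell apart from ordinary formal parameters. Use T for a type whenever nothing more specific distinguishes it, which is often the case in generic methods. With several type parameters, use letters that neighbor T in the alphabet, such as S. If a generic method appears inside a generic class, avoid using the same names for the method's type parameters and the class's type parameters. The same applies to inner classes.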

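- Defining a class with type parameters
When defining a class with type parameters, list one or more type parameter names inside <> immediately after the class name. Each parameter can also be given a bound, and multiple parameters are separated by commas. Once defined, the type parameters can be used almost anywhere in the class after the point of definition (except in static blocks, static fields, and static methods), just like ordinary types.

- Defining a method with type parameters
When defining a method with type parameters, list one or more type parameter names inside <> immediately after the visibility modifier (for example public). Each parameter can also be given a bound, and multiple parameters are separated by commas. Once defined, the type parameters can be used anywhere in the method after the point of definition, just like ordinary types. For example:
 public <T, S extends T> T testGenericMethodDefine(T t, S s){
     ...
 }
Note: the main purpose of a method with type parameters is to express relationships among its parameters and its return value. In this example those are the inheritance relationship between T and S, and the return type being the same as the first type parameter. If all you want is polymorphism, prefer a wildcard (wildcards are covered in a later subsection). For example
 public <T> void testGenericMethodDefine2(List<T> s){
     ...
 }
should be changed to
 public void testGenericMethodDefine2(List<?> s){
     ...
 }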

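3) Assigning type parameters
When type parameters of a class or method are assigned, every one of them must be assigned. Otherwise the result is a compile error.

- Assigning the type parameters of a class
There are two ways to do this.
First, when declaring a variable of the class or instantiating it. For example
List<String> list;
list = new ArrayList<String>();
Second, when extending the class or implementing the interface. For example
public class MyList<E> extends ArrayList<E> implements List<E> {...}

- Assigning the type parameters of a method
When a generic method is called, the compiler assigns the type parameters automatically and reports a compile error if it cannot. For example
 public <T> T testGenericMethodDefine3(T t, List<T> list){
     ...
 }
 public <T> T testGenericMethodDefine4(List<T> list1, List<T> list2){
     ...
 }

 Number n = null;
 Integer i = null;
 Object o = null;
 testGenericMethodDefine(n, i); // T is Number, S is Integer
 testGenericMethodDefine(o, i); // T is Object, S is Integer

 List<Number> list1 = null;
 testGenericMethodDefine3(i, list1); // T is Number

 List<Integer> list2 = null;
 testGenericMethodDefine4(list1, list2); // compile error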

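- Wildcards
 The two subsections above assign concrete values to type parameters. A type parameter can also be given an unknown value. For example
 List<?> unknownList;
 List<? extends Number> unknownNumberList;
 List<? super Integer> unknownBaseLineIntgerList;
Note: in the Java collections framework, a container whose type argument is unknown can only be read from. Elements cannot be added to it, because with the element type unknown the compiler cannot tell whether the added element is compatible with the container. The only exception is null. (This applies to ? and ? extends; a List<? super Integer> does accept Integer values.)

 List<String> listString;
 List<?> unknownList2 = listString;
 unknownList = unknownList2;
 listString = unknownList; // compile error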
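The read-only behaviour of wildcard containers can be sketched as follows. The class and method names are hypothetical, and the line that would not compile is left as a comment.

```java
import java.util.ArrayList;
import java.util.List;

public class WildcardDemo {
    /** Reads from a list of some unknown Number subtype. */
    public static double sum(List<? extends Number> numbers) {
        double total = 0;
        for (Number n : numbers) {
            total += n.doubleValue();
        }
        // numbers.add(Integer.valueOf(1)); // compile error: the element type is unknown
        return total;
    }

    /** A lower-bounded wildcard accepts Integer values. */
    public static void fill(List<? super Integer> sink, int count) {
        for (int i = 0; i < count; i++) {
            sink.add(i);
        }
    }

    public static void main(String[] args) {
        List<Integer> ints = new ArrayList<>();
        fill(ints, 4);                  // adds 0, 1, 2, 3
        System.out.println(sum(ints));
        List<?> unknown = ints;
        unknown.add(null);              // null is the only value that can be added
        System.out.println(unknown.size());
    }
}
```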

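4) Arrays and generics
 You can declare an array whose element type is a parameterized class, but you cannot create one.
 List<Integer>[] iListArray;
 new ArrayList<Integer>[10]; // compile-time error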

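5) How it works
Java generics are a compile-time technique. Generic type information is not present at run time; only the class itself carries the definitions of its type parameters. Generics are implemented by a front-end step of the Java compiler called erasure. You can (basically) think of it as a source-to-source translation that turns the generic version of the code into a non-generic version. Erasure removes essentially all generic type information. Everything between angle brackets is thrown away, so for example a List<String> type becomes List. All remaining uses of type variables are replaced by the upper bound of the type variable (usually Object). And wherever the resulting code is not type-correct, a cast to the appropriate type is inserted.
       <T> T badCast(T t, Object o) {
         return (T) o; // unchecked warning
       }
Type parameters do not exist at run time. This means they add no time or space overhead, which is good. Unfortunately, it also means you cannot rely on them in casts.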

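- A generic class is shared by all its invocations
What does the following code print?
       List<String> l1 = new ArrayList<String>();
       List<Integer> l2 = new ArrayList<Integer>();
       System.out.println(l1.getClass() == l2.getClass());
You might say false, but you would be wrong. It prints true, because all instances of a generic class have the same run-time class regardless of their actual type arguments. Indeed, what makes a class generic is that it behaves the same for all of its possible type arguments; the same class can be viewed as having many different types. As a consequence, the static variables and methods of a class are also shared among all instances. That is why it is illegal to use a type parameter (which belongs to a particular instance) in a static method or static initializer, or in the declaration or initializer of a static variable.

- Casts and instanceof
Another implication of a generic class being shared by all its instances is that it makes no sense to ask whether an instance is an instance of a particular invocation of a generic type:
       Collection cs = new ArrayList<String>();
       if (cs instanceof Collection<String>) {...} // illegal
Similarly, a cast such as
       Collection<String> cstr = (Collection<String>) cs;
gives an unchecked warning, because the run-time system will not perform this check for you.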
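Both effects of erasure described above can be observed directly. This is a minimal sketch with hypothetical class and method names; the double cast through List<?> is only there to make the unchecked conversion compile.

```java
import java.util.ArrayList;
import java.util.List;

public class ErasureDemo {
    /** All invocations of a generic class share one run-time class. */
    public static boolean sameRuntimeClass() {
        List<String> l1 = new ArrayList<String>();
        List<Integer> l2 = new ArrayList<Integer>();
        return l1.getClass() == l2.getClass();
    }

    /** An unchecked cast is not verified at run time; the failure appears later, on use. */
    @SuppressWarnings("unchecked")
    public static boolean castFailsOnlyOnUse() {
        List<Integer> ints = new ArrayList<Integer>();
        ints.add(42);
        List<String> strings = (List<String>) (List<?>) ints; // no exception here
        try {
            String s = strings.get(0); // the compiler-inserted cast to String fails here
            return s == null;
        } catch (ClassCastException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println(sameRuntimeClass());
        System.out.println(castFailsOnlyOnUse());
    }
}
```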

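6) Generics and Class
Since JDK 1.5, the class java.lang.Class is generic. It is an interesting example of using generics for something other than a container class. Now that Class has a type parameter T, you may well ask what T stands for. It stands for the type that the Class object represents. For example, the type of String.class is Class<String>, and the type of Serializable.class is Class<Serializable>. This can be used to improve the type safety of your reflection code.
In particular, since the newInstance() method of Class now returns a T, you get more precise types when creating objects reflectively. For example, suppose you need to write a utility method that performs a database query, given as a string of SQL, and returns a collection of the objects in the database that match the query. One way is to pass in a factory object explicitly, with code like this:
interface Factory<T> {
      T make();
}
public <T> Collection<T> select(Factory<T> factory, String statement) {
       Collection<T> result = new ArrayList<T>();
       /* run sql query using jdbc */
       for (int i = 0; i < 10; i++) { /* iterate over jdbc results */
            T item = factory.make();
            /* use reflection and set all of item's fields from sql results */
            result.add(item);
       }
       return result;
}
You can call it like this:
select(new Factory<EmpInfo>() {
    public EmpInfo make() {
        return new EmpInfo();
        }
       }, "selection string");
or declare a class EmpInfoFactory that implements the Factory interface:
class EmpInfoFactory implements Factory<EmpInfo> {...
    public EmpInfo make() { return new EmpInfo(); }
}
and then call:
select(getMyEmpInfoFactory(), "selection string");
The drawback of this solution is that it requires one of two things: a verbose anonymous factory class at the call site, or a factory class declared for every type used plus a factory instance passed at the call site, which is rather unnatural. It is much more natural to use the class literal as the factory object, which reflection can then use. Without generics the code might be:
Collection emps = sqlUtility.select(EmpInfo.class, "select * from emps"); ...
public static Collection select(Class c, String sqlStatement) {
    Collection result = new ArrayList();
    /* run sql query using jdbc */
    for (/* iterate over jdbc results */) {
        Object item = c.newInstance();
        /* use reflection and set all of item's fields from sql results */
        result.add(item);
    }
    return result;
}
But this does not give us a collection of the precise type we want. Now that Class is generic, we can write:
Collection<EmpInfo> emps = sqlUtility.select(EmpInfo.class, "select * from emps"); ...
public static <T> Collection<T> select(Class<T> c, String sqlStatement) {
    Collection<T> result = new ArrayList<T>();
    /* run sql query using jdbc */
    for (/* iterate over jdbc results */) {
        T item = c.newInstance();
        /* use reflection and set all of item's fields from sql results */
        result.add(item);
    }
    return result;
}
which gives us the collection we want in a type-safe way. This technique is very useful, and it has become an idiom used extensively in the new APIs for manipulating annotations.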
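The Class<T> idiom can be tried without a database. In this sketch the JDBC loop is replaced by a simple counter, EmpInfo is a hypothetical row type, and getDeclaredConstructor().newInstance() is used because Class.newInstance() is deprecated in newer JDKs.

```java
import java.util.ArrayList;
import java.util.Collection;

public class ClassTokenDemo {
    /** Hypothetical row type; it needs a public no-argument constructor. */
    public static class EmpInfo {
        public String name = "unset";
        public EmpInfo() {}
    }

    /** Creates count instances of c reflectively and returns them typed as T. */
    public static <T> Collection<T> select(Class<T> c, int count) {
        Collection<T> result = new ArrayList<T>();
        try {
            for (int i = 0; i < count; i++) {
                T item = c.getDeclaredConstructor().newInstance();
                result.add(item);
            }
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("cannot instantiate " + c.getName(), e);
        }
        return result;
    }

    public static void main(String[] args) {
        // No cast is needed: the element type follows from the class literal.
        Collection<EmpInfo> emps = select(EmpInfo.class, 3);
        System.out.println(emps.size());
    }
}
```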

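7) Compatibility between old and new code
- To keep old code compiling, the compiler (javac) accepts the following. Type safety is then your own responsibility.
List l = new ArrayList<String>();
List<String> l = new ArrayList();

- When upgrading your library to generics, be careful with covariant return types.
For example, given the code
public class Foo {
    public Foo create(){
        return new Foo();
    }
}

public class Bar extends Foo {
    public Foo create(){
        return new Bar();
    }
}
if you adopt the covariant-return style and change Bar to
public class Bar extends Foo {
    public Bar create(){
        return new Bar();
    }
}
be careful with the clients of your library.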
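A sketch of the covariant-return version, using nested classes so it fits in one file. The comment describes one concrete way a client can break: a client subclass of Bar that still overrides create() with return type Foo no longer compiles, because Foo is not a subtype of Bar.

```java
public class CovariantReturnDemo {
    public static class Foo {
        public Foo create() {
            return new Foo();
        }
    }

    public static class Bar extends Foo {
        @Override
        public Bar create() {   // narrower return type than Foo.create()
            return new Bar();
        }
    }

    // A client class written against the old Bar, such as
    //   class ClientBar extends Bar { public Foo create() { ... } }
    // would stop compiling after the change above.

    public static void main(String[] args) {
        Bar bar = new Bar().create();   // no cast needed any more
        Foo viaFoo = ((Foo) bar).create();
        System.out.println(viaFoo.getClass().getSimpleName());
    }
}
```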

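6. Mapper class overview


1) The InputFormat produces InputSplits and uses a RecordReader to turn these logical units (InputSplits) into the input of the map tasks. An InputSplit is the logical representation of the smallest unit of input that a map task processes.
2) Client code uses the Job class to set parameters and to run the MapReduce program on the Hadoop cluster.
3) The Mapper class is instantiated within the Job, and settings are passed to it through the MapContext object. Parameters can be set with Job.getConfiguration().set("myKey", "myVal").
4) The Mapper's run() method calls its own setup() to apply settings.
5) The Mapper's run() method calls map(KeyInType, ValInType, Context) to process the key/value pairs of the InputSplit one by one. Results are emitted through the Context parameter with Context.write(KeyOutType, ValOutType). User code can also use the Context to report status. In addition, Job.setCombinerClass(Class) sets a Combiner that aggregates the intermediate map output locally on the mapper side, which reduces the data sent to the Reducer.
6) The Mapper's run() method calls cleanup().

Some notes:
1) All intermediate results are grouped automatically by the MapReduce framework and then passed to the Reducer(s) to produce the final result. Job.setGroupingComparatorClass(Class) sets the Comparator used for grouping. If no such Comparator is set, the values that share a key are not sorted among themselves.
2) Mapper output is sorted and divided into R partitions (R is the number of Reducers). A custom Partitioner class decides which data goes to which Reducer.
3) Intermediate data is usually stored in the simple format (key-len, key, value-len, value). User code can specify a CompressionCodec to compress it.
4) The number of maps is determined by the total size of the input. As a rule of thumb, 10 to 100 map tasks per node gives good parallelism, and about 300 map tasks per node is acceptable when each map needs very little CPU. Every task takes time to initialize, so tasks should not be too small; each should run for at least about a minute on average. The mapreduce.job.maps parameter sets the number of maps. More precisely, the number of tasks is controlled by InputFormat.getSplits(), which users can override.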

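7. Reducer class overview


A Reducer has three main phases:
1) Shuffle: the Reducer fetches the intermediate data from the relevant Mapper nodes.
2) Sort: while shuffling, the Reducer sorts the data it fetches. Job.setGroupingComparatorClass(Class) can be used in this phase to implement a secondary sort on the values of one key.
3) Reduce: the call sequence is much like the Mapper's; run() calls setup(), reduce(), and cleanup(). The Reducer's output is not sorted.

- The number of reduces is set with Job.setNumReduceTasks(int). The Hadoop documentation suggests 0.95 or 1.75 multiplied by (number of nodes × maximum containers per node).
- The number of reduces can also be set to 0, in which case the map output is written directly to the file system.
- A Reduce can also use the Mark-Reset feature. In short, you can place a mark while iterating over the intermediate data produced by the maps and later call reset to return to the most recent mark. There is one restriction, shown in the example below: inside the reduce method you must wrap the reduce Iterator in a new MarkableIterator yourself.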
public void reduce(IntWritable key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
  MarkableIterator<IntWritable> mitr = new MarkableIterator<IntWritable>(values.iterator());
  // Mark the position
  mitr.mark();
  while (mitr.hasNext()) {
    IntWritable i = mitr.next();
    // Do the necessary processing
  }

  // Reset
  mitr.reset();
  // Iterate all over again. Since mark was called before the first
  // call to mitr.next() in this example, we will iterate over all
  // the values now
  while (mitr.hasNext()) {
    IntWritable i = mitr.next();
    // Do the necessary processing
  }
}

The Partitioner divides the intermediate map output into as many partitions as there are reduces, usually by hashing. The default implementation is HashPartitioner.
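The arithmetic behind hash partitioning can be sketched in plain Java. The method below follows the same formula as HashPartitioner.getPartition (clear the sign bit of the hash code, then take the remainder), but the class name is hypothetical and the keys are ordinary Java objects rather than Writable types, whose hash codes differ.

```java
public class HashPartitionDemo {
    /** Maps a key to a partition in the range [0, numReduceTasks). */
    public static int getPartition(Object key, int numReduceTasks) {
        // Masking with Integer.MAX_VALUE keeps the result non-negative
        // even when hashCode() is negative.
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        String[] keys = {"a", "b", "hadoop"};
        for (String key : keys) {
            System.out.println(key + " -> partition " + getPartition(key, 3));
        }
    }
}
```

Equal keys always land in the same partition, which is what lets one reduce task see every value for a given key.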

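A MapReduce program can report application status through the Context of the mapper or reducer.

When writing a MapReduce program, we usually create a Job object in the main function, set its JobName, configure the input and output paths, set our Mapper and Reducer classes, set the InputFormat and the correct output types, and so on. We then call job.waitForCompletion() to submit the job to the JobTracker and wait for it to run and return. That is the usual job setup. The JobTracker initializes the Job, obtains the input splits, and assigns the tasks one by one to TaskTrackers. A TaskTracker receives tasks in the response to its heartbeat and starts a JVM to run each task it receives.

A Job provides the means to configure a job, read its configuration, submit it, track its progress, and control it. The Job class extends JobContext. JobContext provides read access to the job configuration, such as the job ID, the job's Mapper class, Reducer class, input format, and output format; all of these except the job ID are read-only. On top of JobContext, the Job class adds the ability to set the job configuration, track progress, submit the job, and control it.
A Job object has two states, DEFINE and RUNNING. It is in the DEFINE state when created, and only in that state can it be used to configure the job, for example the number of reduce tasks, the InputFormat class, the Mapper class, and the Partitioner class. These settings are stored in the configuration object conf. When the job is submitted with submit(), the Job object moves to the RUNNING state. The job has then been submitted and is being scheduled, so those parameters can no longer be set. For a job in the RUNNING state we can read the progress of the job and of its map and reduce tasks through the *Progress() methods. These methods obtain their data from info, a RunningJob object, which is the set of interfaces for querying the job that is actually running.

waitForCompletion() first submits the job with submit() and then waits for info.waitForCompletion() to return when the job has finished. The verbose parameter decides whether progress information is printed for the user. submit() first checks that the new API is used correctly; setUseNewAPI() does this by checking whether properties of the old API have been set. It then calls connect() to connect to the JobTracker and submits. The actual submission is done by a JobClient object, which returns a RunningJob object that tracks the job's progress and holds the job ID assigned by the JobTracker.

 getCounters() returns the job's counters. Counters collect statistics about the job, such as the number of failed map tasks or the number of records written by the reduces. There are built-in counters and user-defined counters; the latter collect whatever specific information the user needs. Each task periodically sends its counters to its TaskTracker, and the TaskTracker forwards them to the JobTracker, where they are aggregated. This means counters are global.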

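8. Mapper class methods


The Mapper class has four methods: setup(), cleanup(), run(), and map(). The run method calls the other three: setup, map, and cleanup.
- By default setup and cleanup do nothing, and each is executed only once. setup usually does preparation before the map calls, such as reading job configuration.
- cleanup runs last, after all map calls have finished, and does closing work such as releasing resources. To do any setup or cleanup work, override these methods in your Mapper/Reducer subclass.
- map is reimplemented in the subclass; it is our custom map method. It is called inside a while loop, so it runs many times. run is the method that each map task invokes.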

/**
 * Licensed to the Apache Software Foundation (ASF) under one
 * or more contributor license agreements.  See the NOTICE file
 * distributed with this work for additional information
 * regarding copyright ownership.  The ASF licenses this file
 * to you under the Apache License, Version 2.0 (the
 * "License"); you may not use this file except in compliance
 * with the License.  You may obtain a copy of the License at
 *
 *     http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */

package org.apache.hadoop.mapreduce;

import java.io.IOException;

import org.apache.hadoop.classification.InterfaceAudience;
import org.apache.hadoop.classification.InterfaceStability;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.RawComparator;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.mapreduce.task.MapContextImpl;

/**
 * Maps input key/value pairs to a set of intermediate key/value pairs.
 *
 * Maps are the individual tasks which transform input records into a
 * intermediate records. The transformed intermediate records need not be of
 * the same type as the input records. A given input pair may map to zero or
 * many output pairs.
 *
 * The Hadoop Map-Reduce framework spawns one map task for each
 * {@link InputSplit} generated by the {@link InputFormat} for the job.
 * Mapper implementations can access the {@link Configuration} for
 * the job via the {@link JobContext#getConfiguration()}.
 *
 * The framework first calls
 * {@link #setup(org.apache.hadoop.mapreduce.Mapper.Context)}, followed by
 * {@link #map(Object, Object, Context)}
 * for each key/value pair in the InputSplit. Finally
 * {@link #cleanup(Context)} is called.
 *
 * All intermediate values associated with a given output key are
 * subsequently grouped by the framework, and passed to a {@link Reducer} to
 * determine the final output. Users can control the sorting and grouping by
 * specifying two key {@link RawComparator} classes.
 *
 * The Mapper outputs are partitioned per
 * Reducer. Users can control which keys (and hence records) go to
 * which Reducer by implementing a custom {@link Partitioner}.
 *
 * Users can optionally specify a combiner, via
 * {@link Job#setCombinerClass(Class)}, to perform local aggregation of the
 * intermediate outputs, which helps to cut down the amount of data transferred
 * from the Mapper to the Reducer.
 *
 * Applications can specify if and how the intermediate
 * outputs are to be compressed and which {@link CompressionCodec}s are to be
 * used via the Configuration.
 *
 * If the job has zero
 * reduces then the output of the Mapper is directly written
 * to the {@link OutputFormat} without sorting by keys.
 *
 * Example:
 *
 * public class TokenCounterMapper
 *     extends Mapper<Object, Text, Text, IntWritable>{
 *
 *   private final static IntWritable one = new IntWritable(1);
 *   private Text word = new Text();
 *
 *   public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
 *     StringTokenizer itr = new StringTokenizer(value.toString());
 *     while (itr.hasMoreTokens()) {
 *       word.set(itr.nextToken());
 *       context.write(word, one);
 *     }
 *   }
 * }
 *
 * Applications may override the {@link #run(Context)} method to exert
 * greater control on map processing e.g. multi-threaded Mappers
 * etc.
 *
 * @see InputFormat
 * @see JobContext
 * @see Partitioner
 * @see Reducer
 */
@InterfaceAudience.Public
@InterfaceStability.Stable
public class Mapper<KEYIN, VALUEIN, KEYOUT, VALUEOUT> {

  /**
   * The Context passed on to the {@link Mapper} implementations.
   */
  public abstract class Context
    implements MapContext<KEYIN,VALUEIN,KEYOUT,VALUEOUT> {
  }

  /**
   * Called once at the beginning of the task.
   */
  protected void setup(Context context
                       ) throws IOException, InterruptedException {
    // NOTHING
  }

  /**
   * Called once for each key/value pair in the input split. Most applications
   * should override this, but the default is the identity function.
   */
  @SuppressWarnings("unchecked")
  protected void map(KEYIN key, VALUEIN value,
                     Context context) throws IOException, InterruptedException {
    context.write((KEYOUT) key, (VALUEOUT) value);
  }

  /**
   * Called once at the end of the task.
   */
  protected void cleanup(Context context
                         ) throws IOException, InterruptedException {
    // NOTHING
  }

  /**
   * Expert users can override this method for more complete control over the
   * execution of the Mapper.
   * @param context
   * @throws IOException
   */
  public void run(Context context) throws IOException, InterruptedException {
    setup(context);
    try {
      while (context.nextKeyValue()) {
        map(context.getCurrentKey(), context.getCurrentValue(), context);
      }
    } finally {
      cleanup(context);
    }
  }
}

Reducer class source


package org.apache.hadoop.mapreduce;

import java.io.IOException;

import org.apache.hadoop.classification.InterfaceAudience;
import org.apache.hadoop.classification.InterfaceStability;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.task.annotation.Checkpointable;

import java.util.Iterator;

/**
 * Reduces a set of intermediate values which share a key to a smaller set of
 * values.
 *
 * Reducer implementations
 * can access the {@link Configuration} for the job via the
 * {@link JobContext#getConfiguration()} method.
 *
 * Reducer has 3 primary phases:
 *
 *   1. Shuffle
 *
 *      The Reducer copies the sorted output from each
 *      {@link Mapper} using HTTP across the network.
 *
 *   2. Sort
 *
 *      The framework merge sorts Reducer inputs by
 *      keys
 *      (since different Mappers may have output the same key).
 *
 *      The shuffle and sort phases occur simultaneously i.e. while outputs are
 *      being fetched they are merged.
 *
 *      SecondarySort
 *
 *      To achieve a secondary sort on the values returned by the value
 *      iterator, the application should extend the key with the secondary
 *      key and define a grouping comparator. The keys will be sorted using the
 *      entire key, but will be grouped using the grouping comparator to decide
 *      which keys and values are sent in the same call to reduce. The grouping
 *      comparator is specified via
 *      {@link Job#setGroupingComparatorClass(Class)}. The sort order is
 *      controlled by
 *      {@link Job#setSortComparatorClass(Class)}.
 *
 *      For example, say that you want to find duplicate web pages and tag them
 *      all with the url of the "best" known example. You would set up the job
 *      like:
 *        - Map Input Key: url
 *        - Map Input Value: document
 *        - Map Output Key: document checksum, url pagerank
 *        - Map Output Value: url
 *        - Partitioner: by checksum
 *        - OutputKeyComparator: by checksum and then decreasing pagerank
 *        - OutputValueGroupingComparator: by checksum
 *
 *   3. Reduce
 *
 *      In this phase the
 *      {@link #reduce(Object, Iterable, Context)}
 *      method is called for each <key, (collection of values)> in
 *      the sorted inputs.
 *
 *      The output of the reduce task is typically written to a
 *      {@link RecordWriter} via
 *      {@link Context#write(Object, Object)}.
 *
 * The output of the Reducer is not re-sorted.
 *
 * Example:
 *
 * public class IntSumReducer<Key> extends Reducer<Key,IntWritable,
 *                                                 Key,IntWritable> {
 *   private IntWritable result = new IntWritable();
 *
 *   public void reduce(Key key, Iterable<IntWritable> values,
 *                      Context context) throws IOException, InterruptedException {
 *     int sum = 0;
 *     for (IntWritable val : values) {
 *       sum += val.get();
 *     }
 *     result.set(sum);
 *     context.write(key, result);
 *   }
 * }
 *
 * @see Mapper
 * @see Partitioner
 */
@Checkpointable
@InterfaceAudience.Public
@InterfaceStability.Stable
public class Reducer<KEYIN,VALUEIN,KEYOUT,VALUEOUT> {

  /**
   * The Context passed on to the {@link Reducer} implementations.
   */
  public abstract class Context
    implements ReduceContext<KEYIN,VALUEIN,KEYOUT,VALUEOUT> {
  }

  /**
   * Called once at the start of the task.
   */
  protected void setup(Context context
                       ) throws IOException, InterruptedException {
    // NOTHING
  }

  /**
   * This method is called once for each key. Most applications will define
   * their reduce class by overriding this method. The default implementation
   * is an identity function.
   */
  @SuppressWarnings("unchecked")
  protected void reduce(KEYIN key, Iterable<VALUEIN> values, Context context
                        ) throws IOException, InterruptedException {
    for(VALUEIN value: values) {
      context.write((KEYOUT) key, (VALUEOUT) value);
    }
  }

  /**
   * Called once at the end of the task.
   */
  protected void cleanup(Context context
                         ) throws IOException, InterruptedException {
    // NOTHING
  }

  /**
   * Advanced application writers can use the
   * {@link #run(org.apache.hadoop.mapreduce.Reducer.Context)} method to
   * control how the reduce task works.
   */
  public void run(Context context) throws IOException, InterruptedException {
    setup(context);
    try {
      while (context.nextKey()) {
        reduce(context.getCurrentKey(), context.getValues(), context);
        // If a back up store is used, reset it
        Iterator<VALUEIN> iter = context.getValues().iterator();
        if(iter instanceof ReduceContext.ValueIterator) {
          ((ReduceContext.ValueIterator<VALUEIN>)iter).resetBackupStore();
        }
      }
    } finally {
      cleanup(context);
    }
  }
}

9. Running the jar fails with "java.lang.NoSuchMethodException: grep.Grep$GrepMapper.<init>()"


Cause: GrepMapper is an inner class of Grep and its declaration is missing static.
Fix: add static to the class declaration.
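Why static matters can be shown with plain reflection. A non-static inner class constructor takes the enclosing instance as a hidden first parameter, so a reflective lookup of a no-argument constructor fails. The sketch below uses hypothetical class names and only the JDK; it assumes, as the error message suggests, that the framework looks up a no-argument constructor when it instantiates the mapper.

```java
public class NestedClassReflectionDemo {
    /** Non-static inner class: its constructor implicitly takes the outer instance. */
    public class InnerMapper {
        public InnerMapper() {}
    }

    /** Static nested class: it has a true no-argument constructor. */
    public static class StaticMapper {
        public StaticMapper() {}
    }

    /** Returns true if cls declares a constructor with no parameters. */
    public static boolean hasNoArgConstructor(Class<?> cls) {
        try {
            cls.getDeclaredConstructor();
            return true;
        } catch (NoSuchMethodException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println("inner:  " + hasNoArgConstructor(InnerMapper.class));
        System.out.println("static: " + hasNoArgConstructor(StaticMapper.class));
    }
}
```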

10. Setting a combiner


- In run(), call job.setCombinerClass(xxSumReducer.class).
- xxSumReducer can be one of FieldSelectionReducer, IntSumReducer, LongSumReducer, SecondarySort.Reduce, or WordCount.IntSumReducer.

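11. In totalOrderPartition, pass KeyValueTextInputFormat.class to setInputFormatClass().


Notes:
TextInputFormat
Reads plain text files. The file is split into lines ending in LF or CR. The key is the position of each line (its byte offset, a LongWritable) and the value is the content of the line, a Text.
KeyValueTextInputFormat
Also reads text files. If a line is divided into two parts by the separator (tab by default), the first part is the key and the rest is the value. If there is no separator, the whole line is the key and the value is empty.
SequenceFileInputFormat
Reads sequence files. A sequence file is Hadoop's own binary format for storing data. It has two subclasses: SequenceFileAsBinaryInputFormat reads the key and value as BytesWritable;
SequenceFileAsTextInputFormat reads the key and value as Text.
SequenceFileInputFilter
Selects the records of a sequence file that satisfy a filter, which is specified with setFilterClass. Three filters are built in: RegexFilter keeps records whose key matches a given regular expression; PercentFilter takes a parameter f and keeps records whose record number % f == 0; MD5Filter takes a parameter f and keeps records where MD5(key) % f == 0.
NLineInputFormat
Added in 0.18.x. Splits a file by lines, for example so that every line of the file goes to its own map. The key is the position of each line (its byte offset, a LongWritable) and the value is the content of the line, a Text.
CompositeInputFormat: joins multiple data sources.
TextOutputFormat: writes plain text files in the format key + "\t" + value (tab is the default separator).
NullOutputFormat: Hadoop's /dev/null; the output is discarded.
SequenceFileOutputFormat: writes sequence files.
MultipleSequenceFileOutputFormat, MultipleTextOutputFormat: write records to different files according to the key.
DBInputFormat and DBOutputFormat: read from and write to a database; expected to be added in version 0.19.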
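The key/value rule described for KeyValueTextInputFormat (split at the first separator, whole line as key when there is none) can be sketched in plain Java. The class and method here are hypothetical helpers, not part of Hadoop.

```java
public class KeyValueLineDemo {
    /**
     * Splits a line at the first occurrence of separator and returns {key, value}.
     * When the separator is absent the whole line is the key and the value is empty.
     */
    public static String[] split(String line, char separator) {
        int pos = line.indexOf(separator);
        if (pos < 0) {
            return new String[] {line, ""};
        }
        return new String[] {line.substring(0, pos), line.substring(pos + 1)};
    }

    public static void main(String[] args) {
        String[] kv = split("IBM\t2010-01-04\t132.45", '\t');
        System.out.println("key=" + kv[0] + " value=" + kv[1]);
        String[] keyOnly = split("no separator here", '\t');
        System.out.println("key=" + keyOnly[0] + " value=[" + keyOnly[1] + "]");
    }
}
```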

12. Implementing a secondary sort


Main steps:
1) [Sort] Write a custom key class that implements WritableComparable and use it as the map output key. The important part is the compareTo() method.
The key class must implement WritableComparable, which requires defining compareTo.
Without a custom key class, part of the output value cannot take part in the sort.
public int compareTo(Stock arg0) {
    // Order by symbol first, then by date within the same symbol
    int response = this.symbol.compareTo(arg0.symbol);
    if (response == 0) {
        response = this.date.compareTo(arg0.date);
    }
    return response;
}

Compares this object with the specified object for order. Returns a negative integer, zero, or a positive integer as this object is less than, equal to, or greater than the specified object.
The implementor must ensure sgn(x.compareTo(y)) == -sgn(y.compareTo(x)) for all x and y. (This implies that x.compareTo(y) must throw an exception iff y.compareTo(x) throws an exception.)
The implementor must also ensure that the relation is transitive: (x.compareTo(y)>0 && y.compareTo(z)>0) implies x.compareTo(z)>0.
Finally, the implementor must ensure that x.compareTo(y)==0 implies that sgn(x.compareTo(z)) == sgn(y.compareTo(z)), for all z.
It is strongly recommended, but not strictly required that (x.compareTo(y)==0) == (x.equals(y)). Generally speaking, any class that implements the Comparable interface and violates this condition should clearly indicate this fact. The recommended language is "Note: this class has a natural ordering that is inconsistent with equals."
In the foregoing description, the notation sgn(expression) designates the mathematical signum function, which is defined to return one of -1, 0, or 1 according to whether the value of expression is negative, zero or positive.

2) [Partition] Write a custom partitioner that partitions on only part of the key class's attributes.
char firstLetter = key.getSymbol().trim().charAt(0);
return (firstLetter - 'A') % numPartitions;
In run(), add job.setPartitionerClass(xxx.class);
Without a custom partitioner the default HashPartitioner is used, and records can no longer be grouped by the intended part of the output key.

3) [Group] Write a custom grouping comparator class that extends WritableComparator. The important part is the compare() method, which decides when two key objects belong to the same group.

public int compare(WritableComparable a, WritableComparable b) {
    // Keys with the same symbol fall into the same reduce call
    Stock lhs = (Stock) a;
    Stock rhs = (Stock) b;
    return lhs.getSymbol().compareTo(rhs.getSymbol());
}
In run(), add job.setGroupingComparatorClass(xx.class);

public void setGroupingComparatorClass(Class<? extends RawComparator> cls
 ) throws IllegalStateException {
    ensureState(JobState.DEFINE);
    conf.setOutputValueGroupingComparator(cls);
}

About setGroupingComparatorClass: Define the comparator that controls which keys are grouped together for a single call to Reducer.reduce(Object, Iterable, org.apache.hadoop.mapreduce.Reducer.Context)

void org.apache.hadoop.mapred.JobConf.setOutputValueGroupingComparator(Class theClass)

Set the user defined RawComparator comparator for grouping keys in the input to the reduce.
This comparator should be provided if the equivalence rules for keys for sorting the intermediates are
 different from those for grouping keys before each call to Reducer.reduce(Object, java.util.Iterator,
 OutputCollector, Reporter).
For key-value pairs (K1,V1) and (K2,V2), the values (V1, V2) are passed in a single call to the reduce
 function if K1 and K2 compare as equal.
Since setOutputKeyComparatorClass(Class) can be used to control how keys are sorted, this can be used
 in conjunction to simulate secondary sort on values.
Note: This is not a guarantee of the reduce sort being stable in any sense. (In any case, with the order of
 available map-outputs to the reduce being non-deterministic, it wouldn't make that much sense.)
Parameters:
    theClass the comparator class to be used for grouping keys. It should implement RawComparator.
See Also:
    setOutputKeyComparatorClass(Class)
    setCombinerKeyGroupingComparator(Class)

4) Define the Reducer's output value class. It implements the Writable interface; the important methods are write(), readFields(), and toString().
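The interplay of the sort and grouping steps above can be simulated without a cluster. In this sketch Stock is a simplified, hypothetical composite key with plain String fields (the real key would implement WritableComparable). The list is sorted on the whole key and then grouped on the symbol alone, so each group sees its dates in sorted order.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class SecondarySortDemo {
    /** Hypothetical composite key: natural key (symbol) plus secondary key (date). */
    public static final class Stock implements Comparable<Stock> {
        final String symbol;
        final String date;

        public Stock(String symbol, String date) {
            this.symbol = symbol;
            this.date = date;
        }

        @Override
        public int compareTo(Stock other) {
            int response = symbol.compareTo(other.symbol);
            return response != 0 ? response : date.compareTo(other.date);
        }
    }

    /** Sorts by the full key, then groups by symbol only, as a grouping comparator would. */
    public static Map<String, List<String>> sortAndGroup(List<Stock> keys) {
        List<Stock> sorted = new ArrayList<>(keys);
        Collections.sort(sorted);                        // sort phase: whole key
        Map<String, List<String>> groups = new LinkedHashMap<>();
        for (Stock s : sorted) {                         // grouping phase: symbol only
            groups.computeIfAbsent(s.symbol, k -> new ArrayList<>()).add(s.date);
        }
        return groups;
    }

    public static void main(String[] args) {
        List<Stock> keys = new ArrayList<>();
        keys.add(new Stock("IBM", "2010-02"));
        keys.add(new Stock("AAPL", "2010-03"));
        keys.add(new Stock("IBM", "2010-01"));
        keys.add(new Stock("AAPL", "2010-01"));
        System.out.println(sortAndGroup(keys));
    }
}
```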

13. Error while running the jar

15/12/01 03:27:34 INFO mapreduce.Job: Task Id : attempt_1447915042936_0028_r_000000_0, Status : FAILED
Error: java.lang.NumberFormatException: For input string: "dividends"
    at sun.misc.FloatingDecimal.readJavaFormatString(FloatingDecimal.java:1250)
    at java.lang.Double.parseDouble(Double.java:540)
    at customSort.Customsort$CustomReducer.reduce(Customsort.java:46)
    at customSort.Customsort$CustomReducer.reduce(Customsort.java:35)
    at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:171)
    at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:627)
    at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:389)
    at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:164)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:415)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
    at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)

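Cause: a number format conversion failed. The input contains the English string "dividends", which cannot be parsed as a number.
Fix: the code contains the lines below. Strings must not be compared with !=, which compares references and is therefore always true here, so the check meant to skip that line never takes effect. Use the equals() method instead.
String [] words = StringUtils.split(value.toString(), ',');
if(words[0]!="exchange")  // wrong: change to !words[0].equals("exchange")

Java has two ways to test whether strings are equal:
1. The == operator, which tests whether the two references point to the same string object. For example, with String a="abc"; String b="abc"; the expression a==b returns true. That is because string values are immutable in Java and identical string literals are stored only once in memory, so a and b point to the same object. But with String a=new String("abc"); String b=new String("abc"); the expression a==b returns false, because a and b point to different objects.
2. The equals method, which compares the contents of the strings. For example, with String a=new String("abc"); String b=new String("abc"); the call a.equals(b) returns true. So to avoid the problem above, use equals to test string equality.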
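The difference between the two comparisons can be checked with a few lines. In this sketch the class and method names are hypothetical; new String(...) stands in for a string produced at run time, such as a field returned by split.

```java
public class StringEqualityDemo {
    /** Reference comparison: true only if both variables point to the same object. */
    public static boolean sameReference(String a, String b) {
        return a == b;
    }

    /** Content comparison: true if both strings hold the same characters. */
    public static boolean sameContent(String a, String b) {
        return a.equals(b);
    }

    public static void main(String[] args) {
        String literal = "exchange";
        String parsed = new String("exchange"); // a distinct object with equal content
        System.out.println(sameReference(literal, "exchange")); // both are the interned literal
        System.out.println(sameReference(literal, parsed));
        System.out.println(sameContent(literal, parsed));
    }
}
```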

About the equals() method:
    /**
     * Compares this string to the specified object.  The result is {@code
     * true} if and only if the argument is not {@code null} and is a {@code
     * String} object that represents the same sequence of characters as this
     * object.
     *
     * @param  anObject
     *         The object to compare this {@code String} against
     *
     * @return  {@code true} if the given object represents a {@code String}
     *          equivalent to this string, {@code false} otherwise
     *
     * @see  #compareTo(String)
     * @see  #equalsIgnoreCase(String)
     */
    public boolean equals(Object anObject) {
        if (this == anObject) {
            return true;
        }
        if (anObject instanceof String) {
            String anotherString = (String)anObject;
            int n = value.length;
            if (n == anotherString.value.length) {
                char v1[] = value;
                char v2[] = anotherString.value;
                int i = 0;
                while (n-- != 0) {
                    if (v1[i] != v2[i])
                        return false;
                    i++;
                }
                return true;
            }
        }
        return false;
    }

14. Running the jar fails because the NameNode is in safe mode

[root@sandbox jar]# yarn jar customsort.jar /xuefei/dividends /xuefei/output
org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.SafeModeException): Cannot delete /xuefei/output. Name node is in safe mode.
Resources are low on NN. Please add or free up more resources then turn off safe mode manually. NOTE:  If you turn off safe mode before adding resources, the NN will immediately return to safe mode. Use "hdfs dfsadmin -safemode leave" to turn safe mode off.
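Cause: not enough disk space, so the NameNode cannot write files (one possible cause).
Fix:
1) Try leaving safe mode. In the attempt below it was refused because root is not the HDFS superuser, so the command has to be run as the HDFS superuser:
[root@sandbox jar]# hdfs dfsadmin -safemode leave
safemode: Access denied for user root. Superuser privilege is required
2) Delete large files to free disk space, then restart and check whether the NameNode recovers.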

15、Java的this用法


/**
 * @author fengzhi-neusoft
 *
 * 本示例为了说明this的三种用法!
 */
package test;
public class ThisTest {
    private int i=0;
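    // First constructor: takes one int parameter
    ThisTest(int i){
       this.i=i+1;// this.i is the member variable i, not the parameter i
       System.out.println("Int constructor i——this.i:  "+i+"——"+this.i);
       System.out.println("i-1:"+(i-1)+"this.i+1:"+(this.i+1));
       // The two lines printed above show that i and this.i are different variables
    }
    // Second constructor: takes one String parameter
    ThisTest(String s){
       System.out.println("String constructor:  "+s);
    }
    // Third constructor: takes an int parameter and a String parameter
    ThisTest(int i,String s){
       this(s);// this(...) calls the second constructor
       //this(i);
       /* this(i) is not allowed here. Only a constructor may invoke another constructor
       through this(...), the call must be the first statement of the constructor,
       and a constructor can therefore make only one such call. */
       this.i=i++;// this.i is the member variable; it receives the parameter's old value, then the parameter is incremented
       System.out.println("Int constructor:  "+i+"\n"+"String constructor:  "+s);
    }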
    public ThisTest increment(){
       this.i++;
       return this;// returns the current object, which is of type ThisTest
    }
    public static void main(String[] args){
       ThisTest tt0=new ThisTest(10);
       ThisTest tt1=new ThisTest("ok");
       ThisTest tt2=new ThisTest(20,"ok again!");

       System.out.println(tt0.increment().increment().increment().i);
       // tt0.increment() increments i and returns tt0 itself,
       // so each chained call increments the same object again
    }
}

Output:

Int constructor i——this.i:  10——11
i-1:9this.i+1:12
String constructor:  ok
String constructor:  ok again!
Int constructor:  21
String constructor:  ok again!
14
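The comments cover the details, so they are not repeated here. In summary, this has three main uses:
1. It refers to the current object.
2. It refers to a member variable of the class rather than a method parameter, which matters when a parameter and a member variable share the same name. This is really a special case of the first use, but it is common enough to call out separately.
3. Inside a constructor, this(...) invokes another constructor of the same class whose parameter types match the arguments. Only one such call is allowed, and it must be the first statement.
Also note that this cannot be used in a static method. Some people even define a static method as "a method without this"; that is an exaggeration, but it makes the point that this is unavailable in static methods.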

16. Running the jar fails with Error: java.lang.NullPointerException at org.apache.hadoop.io.WritableComparator.compare(WritableComparator.java:128)

15/12/03 23:41:28 INFO mapreduce.Job: Task Id : attempt_1449157066412_0001_r_000000_0, Status : FAILED
Error: java.lang.NullPointerException
    at org.apache.hadoop.io.WritableComparator.compare(WritableComparator.java:128)
    at org.apache.hadoop.mapreduce.task.ReduceContextImpl.nextKeyValue(ReduceContextImpl.java:158)
    at org.apache.hadoop.mapreduce.task.ReduceContextImpl.nextKey(ReduceContextImpl.java:121)
    at org.apache.hadoop.mapreduce.lib.reduce.WrappedReducer$Context.nextKey(WrappedReducer.java:302)
    at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:170)
    at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:627)
    at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:389)
    at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:167)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:415)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1594)
    at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:162)
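Cause: the custom WritableComparator subclass does not define a no-argument constructor that calls super(KeyClass.class, true), so the comparator's internal key instances are never created.
Solution: add such a constructor to the class, as follows: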
    public CustomCompator() {
        super(Stock.class, true);
        // TODO Auto-generated constructor stub
    }
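
Why the second constructor argument matters can be shown with a simplified stand-in. The sketch below is not the Hadoop class: BaseComparatorSketch and ComparatorConstructorDemo are hypothetical names, and the Supplier replaces Hadoop's reflective key creation. It only assumes what the error above suggests, namely that the scratch keys exist only when createInstances is true.

```java
import java.util.function.Supplier;

// Simplified stand-in for the pattern; NOT org.apache.hadoop.io.WritableComparator.
class BaseComparatorSketch<K extends Comparable<K>> {
    private final K key1; // scratch instances, created only on request
    private final K key2;

    // Mirrors the (keyClass, createInstances) idea: with false, the scratch keys stay null.
    protected BaseComparatorSketch(Supplier<K> factory, boolean createInstances) {
        if (createInstances) {
            key1 = factory.get();
            key2 = factory.get();
        } else {
            key1 = null;
            key2 = null;
        }
    }

    // Stands in for the byte-level compare(), which dereferences the scratch keys.
    int compareScratchKeys() {
        return key1.compareTo(key2); // NullPointerException when createInstances was false
    }
}

class ComparatorConstructorDemo {
    // Returns true when comparing fails with a NullPointerException.
    static boolean failsWithNpe(boolean createInstances) {
        BaseComparatorSketch<String> c =
            new BaseComparatorSketch<String>(() -> "key", createInstances);
        try {
            c.compareScratchKeys();
            return false;
        } catch (NullPointerException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println("createInstances=false fails: " + failsWithNpe(false));
        System.out.println("createInstances=true fails: " + failsWithNpe(true));
    }
}
```

In the real subclass, the equivalent of passing true is the super(Stock.class, true) call shown above.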

17. About the NullWritable class


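NullWritable is a special Writable whose serialized length is 0. Its write and read methods are empty: it reads nothing from the stream and writes nothing to it, and serves only as a placeholder. In MapReduce, for example, a key or value that is not needed can be declared as NullWritable. NullWritable is an immutable singleton.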
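
The behaviour described above can be imitated in plain Java. This is a sketch only: NullPlaceholder is a hypothetical stand-in so that the example runs without Hadoop on the classpath, and in a real job the shared instance would be obtained with NullWritable.get().

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical stand-in that mimics the described behaviour of NullWritable.
final class NullPlaceholder {
    private static final NullPlaceholder INSTANCE = new NullPlaceholder();

    private NullPlaceholder() { }          // no public constructor: one shared instance

    static NullPlaceholder get() { return INSTANCE; }

    void write(DataOutput out) throws IOException {
        // intentionally empty: nothing is written to the stream
    }
}

class NullPlaceholderDemo {
    // Serializes the placeholder and returns how many bytes were produced.
    static int serializedLength() {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            NullPlaceholder.get().write(new DataOutputStream(bytes));
            return bytes.size();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(serializedLength());                               // 0
        System.out.println(NullPlaceholder.get() == NullPlaceholder.get());   // true
    }
}
```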

18. About InputFormat

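1) Extend the RecordReader class and implement initialize(), nextKeyValue(), getCurrentKey(), getCurrentValue(), getProgress() and the other required methods.
The RecordReader class: "The record reader breaks the data into key/value pairs for input to the Mapper." The mapper calls these methods to obtain the key/value pairs it takes as input.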
- initialize() Called once at initialization.
- nextKeyValue() Read the next key, value pair, return true if a key/value pair was read
- getCurrentKey() Get the current key, return the current key or null if there is no current key
- getCurrentValue() Get the current value, return the object that was read
- getProgress() The current progress of the record reader through its data, return a number between 0.0 and 1.0 that is the fraction of the data read
- close() close the record reader

public void run(Context context) throws IOException, InterruptedException {
    setup(context);
    try {
      while (context.nextKeyValue()) {
        map(context.getCurrentKey(), context.getCurrentValue(), context);
      }
    } finally {
      cleanup(context);
    }
  }
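
The contract between the reader methods and the run() loop above can be tried out without a cluster. The sketch below is an in-memory imitation: LineReaderSketch and RecordLoopDemo are hypothetical names, the key is assumed to be a zero-based line index, and none of this is the Hadoop RecordReader class itself.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Hypothetical in-memory reader; it only imitates the RecordReader method names.
class LineReaderSketch {
    private final Iterator<String> lines;
    private final int total;
    private long lineNumber = -1;   // key: zero-based line index
    private String current;         // value: the line text

    LineReaderSketch(List<String> input) {
        this.lines = input.iterator();
        this.total = input.size();
    }

    // Advances to the next pair; returns false when the input is exhausted.
    boolean nextKeyValue() {
        if (!lines.hasNext()) {
            return false;
        }
        current = lines.next();
        lineNumber++;
        return true;
    }

    Long getCurrentKey() { return lineNumber < 0 ? null : Long.valueOf(lineNumber); }

    String getCurrentValue() { return current; }

    // Fraction of the input consumed so far, between 0.0 and 1.0.
    float getProgress() { return total == 0 ? 1.0f : (lineNumber + 1) / (float) total; }
}

class RecordLoopDemo {
    // Same shape as the run() loop: pull pairs until nextKeyValue() returns false.
    static List<String> run(List<String> input) {
        LineReaderSketch reader = new LineReaderSketch(input);
        List<String> seen = new ArrayList<>();
        while (reader.nextKeyValue()) {
            seen.add(reader.getCurrentKey() + "=" + reader.getCurrentValue());
        }
        return seen;
    }

    public static void main(String[] args) {
        System.out.println(run(Arrays.asList("a,1", "b,2")));   // [0=a,1, 1=b,2]
    }
}
```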

2) Extend the FileInputFormat class and implement the createRecordReader() method.
Create a record reader for a given split. The framework will call RecordReader.initialize(InputSplit, TaskAttemptContext) before the split is used.

Overrides: createRecordReader(...) in InputFormat
Parameters:
split the split to be read
context the information about the task
Returns:
a new record reader
Throws:
IOException
InterruptedException

3) Register it on the job with job.setInputFormatClass(xx)

19. With SequenceFileOutputFormat as the output format, viewing the output file fails


[root@sandbox jar]# hdfs dfs -text /xuefei/output/custominput/part-r-00000
-text: Fatal internal error
java.lang.RuntimeException: java.io.IOException: WritableName can't load class: customInput.Stock
    at org.apache.hadoop.io.SequenceFile$Reader.getKeyClass(SequenceFile.java:2016)
    at org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1947)
    at org.apache.hadoop.io.SequenceFile$Reader.initialize(SequenceFile.java:1813)
    at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1762)
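Cause: output written by SequenceFileOutputFormat cannot be displayed directly; the reader has to load the key/value classes (here customInput.Stock).
Solution: put the class files from the bin directory on the Hadoop classpath.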

20. Differences between WritableComparable, Writable, Comparable, WritableComparator and Comparator

implements WritableComparable
implements Writable
implements Comparable
extends WritableComparator
extends Comparator

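In short:
1. extends inherits from a parent class. Any class that is not declared final can be extended, including abstract classes.
2. Java does not support multiple inheritance of classes (one class inheriting behavior and state from several parent classes), but a similar effect can be achieved with interfaces, which is where implements comes in.
3. A class can extend only one class, but it can implement several interfaces, separated by commas,
      for example  class A extends B implements C,D,E

In more formal terms:
      extends inherits from a class; implements implements an interface.

Classes and interfaces are different: a class contains implementation code, whereas an interface traditionally contains none and only declares methods (since Java 8, interfaces may also carry default and static methods).
Java provides inheritance, and in addition the concept of an interface. Because Java's inheritance is single inheritance (a class can have only one parent class), Java uses interfaces in place of C++'s multiple inheritance. An interface specifies the contract that two communicating objects must follow.
From the point of view of the Java language, an interface means:
      when objects of different classes need to share certain methods or constants, those members are declared in an interface, so that objects of all those classes can be used through it in the same way.
      A Java class can extend only one parent class (with the extends keyword), but it can implement many interfaces (with the implements keyword).

What is the difference between extends and implements?
      For a class, extends is used for (single) inheritance from a class, while implements is used to implement an interface.
      Interfaces were introduced to provide part of what multiple inheritance offers. An interface only declares method signatures and leaves the method bodies to the implementing classes. Instances of those classes can be treated as instances of the interface. Interfaces can also extend one another, and here multiple inheritance is allowed.

 Note: an interface can extend several other interfaces.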
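
The rules above fit into one small example. All type names here (Walker, Swimmer, Amphibious, Animal, Duck) are made up for illustration.

```java
interface Walker { String walk(); }

interface Swimmer { String swim(); }

// An interface may extend several interfaces.
interface Amphibious extends Walker, Swimmer { }

// An abstract class can still be extended; only a final class cannot.
abstract class Animal {
    String name() { return "animal"; }
}

// A class extends at most one class but may implement several interfaces.
class Duck extends Animal implements Amphibious, Comparable<Duck> {
    @Override public String walk() { return "walks"; }
    @Override public String swim() { return "swims"; }
    @Override public int compareTo(Duck other) { return 0; }
}

class ExtendsImplementsDemo {
    static String describe() {
        Duck duck = new Duck();
        Walker walker = duck;   // an instance can be used through any interface it implements
        return duck.name() + "/" + walker.walk() + "/" + duck.swim();
    }

    public static void main(String[] args) {
        System.out.println(describe());   // animal/walks/swims
    }
}
```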

WritableComparable
org.apache.hadoop.io.WritableComparable

public interface WritableComparable<T> extends Writable, Comparable<T> {
}
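
A key type that is both serializable and sortable has to provide write(), readFields() and compareTo(). The following sketch shows the shape of such a class in plain Java. WritableSketch is a local stand-in with the same two methods as Writable so that the example compiles without Hadoop, and StockKey with its symbol and date fields is a hypothetical key, not the author's Stock class.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Local stand-in for org.apache.hadoop.io.Writable (same two methods).
interface WritableSketch {
    void write(DataOutput out) throws IOException;
    void readFields(DataInput in) throws IOException;
}

// Hypothetical composite key: ordered by symbol first, then by date.
class StockKey implements WritableSketch, Comparable<StockKey> {
    String symbol = "";
    String date = "";

    StockKey() { }   // no-arg constructor so empty instances can be created and then filled

    StockKey(String symbol, String date) {
        this.symbol = symbol;
        this.date = date;
    }

    @Override public void write(DataOutput out) throws IOException {
        out.writeUTF(symbol);
        out.writeUTF(date);
    }

    @Override public void readFields(DataInput in) throws IOException {
        symbol = in.readUTF();   // fields are read in the same order as write() wrote them
        date = in.readUTF();
    }

    @Override public int compareTo(StockKey other) {
        int bySymbol = symbol.compareTo(other.symbol);
        return bySymbol != 0 ? bySymbol : date.compareTo(other.date);
    }
}

class StockKeyDemo {
    // Serializes the key to bytes and reads it back into a fresh instance.
    static StockKey roundTrip(StockKey key) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            key.write(new DataOutputStream(bytes));
            StockKey copy = new StockKey();
            copy.readFields(new DataInputStream(new ByteArrayInputStream(bytes.toByteArray())));
            return copy;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        StockKey copy = roundTrip(new StockKey("AIT", "2015-12-15"));
        System.out.println(copy.symbol + " " + copy.date);
    }
}
```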

Writable
org.apache.hadoop.io.Writable

@Public
@Stable
A serializable object which implements a simple, efficient, serialization protocol, based on DataInput and DataOutput.

Any key or value type in the Hadoop Map-Reduce framework implements this interface.

Implementations typically implement a static read(DataInput) method which constructs a new instance, calls readFields(DataInput) and returns the instance.

Example:

     public class MyWritable implements Writable {
       // Some data
       private int counter;
       private long timestamp;

       public void write(DataOutput out) throws IOException {
         out.writeInt(counter);
         out.writeLong(timestamp);
       }

       public void readFields(DataInput in) throws IOException {
         counter = in.readInt();
         timestamp = in.readLong();
       }

       public static MyWritable read(DataInput in) throws IOException {
         MyWritable w = new MyWritable();
         w.readFields(in);
         return w;
       }
     }

@InterfaceAudience.Public
@InterfaceStability.Stable
public interface Writable {
  /**
   * Serialize the fields of this object to out.
   *
   * @param out DataOutput to serialize this object into.
   * @throws IOException
   */
  void write(DataOutput out) throws IOException;

  /**
   * Deserialize the fields of this object from in.
   *
   * For efficiency, implementations should attempt to re-use storage in the
   * existing object where possible.
   *
   * @param in DataInput to deserialize this object from.
   * @throws IOException
   */
  void readFields(DataInput in) throws IOException;
}

Comparable
java.lang.Comparable

This interface imposes a total ordering on the objects of each class that implements it. This ordering is referred to as the class's natural ordering, and the class's compareTo method is referred to as its natural comparison method.

Lists (and arrays) of objects that implement this interface can be sorted automatically by Collections.sort (and Arrays.sort). Objects that implement this interface can be used as keys in a sorted map or as elements in a sorted set, without the need to specify a comparator.

The natural ordering for a class C is said to be consistent with equals if and only if e1.compareTo(e2) == 0 has the same boolean value as e1.equals(e2) for every e1 and e2 of class C. Note that null is not an instance of any class, and e.compareTo(null) should throw a NullPointerException even though e.equals(null) returns false.

It is strongly recommended (though not required) that natural orderings be consistent with equals. This is so because sorted sets (and sorted maps) without explicit comparators behave "strangely" when they are used with elements (or keys) whose natural ordering is inconsistent with equals. In particular, such a sorted set (or sorted map) violates the general contract for set (or map), which is defined in terms of the equals method.

For example, if one adds two keys a and b such that (!a.equals(b) && a.compareTo(b) == 0) to a sorted set that does not use an explicit comparator, the second add operation returns false (and the size of the sorted set does not increase) because a and b are equivalent from the sorted set's perspective.

Virtually all Java core classes that implement Comparable have natural orderings that are consistent with equals. One exception is java.math.BigDecimal, whose natural ordering equates BigDecimal objects with equal values and different precisions (such as 4.0 and 4.00).

For the mathematically inclined, the relation that defines the natural ordering on a given class C is:

       {(x, y) such that x.compareTo(y) <= 0}.

The quotient for this total order is:
       {(x, y) such that x.compareTo(y) == 0}.

It follows immediately from the contract for compareTo that the quotient is an equivalence relation on C, and that the natural ordering is a total order on C. When we say that a class's natural ordering is consistent with equals, we mean that the quotient for the natural ordering is the equivalence relation defined by the class's equals(Object) method:
     {(x, y) such that x.equals(y)}.
This interface is a member of the Java Collections Framework.

Type Parameters:
<T> the type of objects that this object may be compared to
Since:
1.2
Author:
Josh Bloch
See Also:
java.util.Comparator

public interface Comparable<T> {
    public int compareTo(T o);
}
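
The BigDecimal case mentioned in the Javadoc above is easy to reproduce. The sketch below adds 4.0 and 4.00 to a sorted set (which relies on compareTo) and to a hash set (which relies on equals and hashCode); the class name NaturalOrderingDemo is made up for illustration.

```java
import java.math.BigDecimal;
import java.util.HashSet;
import java.util.Set;
import java.util.TreeSet;

class NaturalOrderingDemo {
    // Returns { size of the TreeSet, size of the HashSet } after adding 4.0 and 4.00 to both.
    static int[] sizes() {
        BigDecimal a = new BigDecimal("4.0");
        BigDecimal b = new BigDecimal("4.00");

        Set<BigDecimal> sorted = new TreeSet<>();   // decides equality with compareTo
        Set<BigDecimal> hashed = new HashSet<>();   // decides equality with equals/hashCode
        sorted.add(a);
        sorted.add(b);
        hashed.add(a);
        hashed.add(b);
        return new int[] { sorted.size(), hashed.size() };
    }

    public static void main(String[] args) {
        int[] s = sizes();
        System.out.println("TreeSet: " + s[0] + ", HashSet: " + s[1]);   // TreeSet: 1, HashSet: 2
    }
}
```

The same mismatch can affect a custom key class if its compareTo() and equals() disagree, which is one reason to keep them consistent.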

WritableComparator
org.apache.hadoop.io.WritableComparator

@Public
@Stable
A Comparator for WritableComparables.

This base implementation uses the natural ordering. To define alternate orderings, override compare(WritableComparable, WritableComparable).

One may optimize compare-intensive operations by overriding compare(byte [], int, int, byte [], int, int). Static utility methods are provided to assist in optimized implementations of this method.

public class WritableComparator implements RawComparator {

  private static final ConcurrentHashMap<Class, WritableComparator> comparators
          = new ConcurrentHashMap<Class, WritableComparator>(); // registry

  /** Get a comparator for a {@link WritableComparable} implementation. */
  public static WritableComparator get(Class c) {
    WritableComparator comparator = comparators.get(c);
    if (comparator == null) {
      // force the static initializers to run
      forceInit(c);
      // look to see if it is defined now
      comparator = comparators.get(c);
      // if not, use the generic one
      if (comparator == null) {
        comparator = new WritableComparator(c, true);
      }
    }
    return comparator;
  }

  /**
   * Force initialization of the static members.
   * As of Java 5, referencing a class doesn't force it to initialize. Since
   * this class requires that the classes be initialized to declare their
   * comparators, we force that initialization to happen.
   * @param cls the class to initialize
   */
  private static void forceInit(Class cls) {
    try {
      Class.forName(cls.getName(), true, cls.getClassLoader());
    } catch (ClassNotFoundException e) {
      throw new IllegalArgumentException("Can't initialize class " + cls, e);
    }
  }

  /** Register an optimized comparator for a {@link WritableComparable}
   * implementation. Comparators registered with this method must be
   * thread-safe. */
  public static void define(Class c, WritableComparator comparator) {
    comparators.put(c, comparator);
  }

  private final Class keyClass;
  private final WritableComparable key1;
  private final WritableComparable key2;
  private final DataInputBuffer buffer;

  protected WritableComparator() {
    this(null);
  }

  /** Construct for a {@link WritableComparable} implementation. */
  protected WritableComparator(Class keyClass) {
    this(keyClass, false);
  }

  protected WritableComparator(Class keyClass,
      boolean createInstances) {
    this.keyClass = keyClass;
    if (createInstances) {
      key1 = newKey();
      key2 = newKey();
      buffer = new DataInputBuffer();
    } else {
      key1 = key2 = null;
      buffer = null;
    }
  }

  /** Returns the WritableComparable implementation class. */
  public Class getKeyClass() { return keyClass; }

  /** Construct a new {@link WritableComparable} instance. */
  public WritableComparable newKey() {
    return ReflectionUtils.newInstance(keyClass, null);
  }

  /** Optimization hook.  Override this to make SequenceFile.Sorter's scream.
   *
   * The default implementation reads the data into two {@link
   * WritableComparable}s (using {@link
   * Writable#readFields(DataInput)}, then calls {@link
   * #compare(WritableComparable,WritableComparable)}.
   */
  @Override
  public int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2) {
    try {
      buffer.reset(b1, s1, l1);                   // parse key1
      key1.readFields(buffer);

      buffer.reset(b2, s2, l2);                   // parse key2
      key2.readFields(buffer);
    } catch (IOException e) {
      throw new RuntimeException(e);
    }

    return compare(key1, key2);                   // compare them
  }

  /** Compare two WritableComparables.
   *
   * The default implementation uses the natural ordering, calling {@link
   * Comparable#compareTo(Object)}.
   */
  @SuppressWarnings("unchecked")
  public int compare(WritableComparable a, WritableComparable b) {
    return a.compareTo(b);
  }

  @Override
  public int compare(Object a, Object b) {
    return compare((WritableComparable)a, (WritableComparable)b);
  }

  /** Lexicographic order of binary data. */
  public static int compareBytes(byte[] b1, int s1, int l1,
                                 byte[] b2, int s2, int l2) {
    return FastByteComparisons.compareTo(b1, s1, l1, b2, s2, l2);
  }

  /** Compute hash for binary data. */
  public static int hashBytes(byte[] bytes, int offset, int length) {
    int hash = 1;
    for (int i = offset; i < offset + length; i++)
      hash = (31 * hash) + (int)bytes[i];
    return hash;
  }

  /** Compute hash for binary data. */
  public static int hashBytes(byte[] bytes, int length) {
    return hashBytes(bytes, 0, length);
  }

  /** Parse an unsigned short from a byte array. */
  public static int readUnsignedShort(byte[] bytes, int start) {
    return (((bytes[start]   & 0xff) <<  8) +
            ((bytes[start+1] & 0xff)));
  }

  /** Parse an integer from a byte array. */
  public static int readInt(byte[] bytes, int start) {
    return (((bytes[start  ] & 0xff) << 24) +
            ((bytes[start+1] & 0xff) << 16) +
            ((bytes[start+2] & 0xff) <<  8) +
            ((bytes[start+3] & 0xff)));
  }

  /** Parse a float from a byte array. */
  public static float readFloat(byte[] bytes, int start) {
    return Float.intBitsToFloat(readInt(bytes, start));
  }

  /** Parse a long from a byte array. */
  public static long readLong(byte[] bytes, int start) {
    return ((long)(readInt(bytes, start)) << 32) +
      (readInt(bytes, start+4) & 0xFFFFFFFFL);
  }

  /** Parse a double from a byte array. */
  public static double readDouble(byte[] bytes, int start) {
    return Double.longBitsToDouble(readLong(bytes, start));
  }

  /**
   * Reads a zero-compressed encoded long from a byte array and returns it.
   * @param bytes byte array with decode long
   * @param start starting index
   * @throws java.io.IOException
   * @return deserialized long
   */
  public static long readVLong(byte[] bytes, int start) throws IOException {
    int len = bytes[start];
    if (len >= -112) {
      return len;
    }
    boolean isNegative = (len < -120);
    len = isNegative ? -(len + 120) : -(len + 112);
    if (start+1+len>bytes.length)
      throw new IOException(
          "Not enough number of bytes for a zero-compressed integer");
    long i = 0;
    for (int idx = 0; idx < len; idx++) {
      i = i << 8;
      i = i | (bytes[start+1+idx] & 0xFF);
    }
    return (isNegative ? (i ^ -1L) : i);
  }

  /**
   * Reads a zero-compressed encoded integer from a byte array and returns it.
   * @param bytes byte array with the encoded integer
   * @param start start index
   * @throws java.io.IOException
   * @return deserialized integer
   */
  public static int readVInt(byte[] bytes, int start) throws IOException {
    return (int) readVLong(bytes, start);
  }
}

Comparator
org.apache.hadoop.io.Text.Comparator

  /** A WritableComparator optimized for Text keys. */
  public static class Comparator extends WritableComparator {
    public Comparator() {
      super(Text.class);
    }

    @Override
    public int compare(byte[] b1, int s1, int l1,
                       byte[] b2, int s2, int l2) {
      int n1 = WritableUtils.decodeVIntSize(b1[s1]);
      int n2 = WritableUtils.decodeVIntSize(b2[s2]);
      return compareBytes(b1, s1+n1, l1-n1, b2, s2+n2, l2-n2);
    }
  }
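
The Text comparator above skips the length prefix and then compares the remaining bytes directly, without deserializing the keys. The sketch below illustrates that idea in plain Java under a simplifying assumption: each value is encoded as one length byte followed by its UTF-8 bytes (so payloads must stay under 128 bytes), rather than the variable-length prefix Text really uses. RawBytesCompareSketch and its methods are hypothetical, and compareBytes here is a hand-written unsigned lexicographic comparison, not Hadoop's implementation.

```java
import java.nio.charset.StandardCharsets;

class RawBytesCompareSketch {
    // Lexicographic comparison of two byte ranges, treating each byte as unsigned.
    static int compareBytes(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2) {
        int n = Math.min(l1, l2);
        for (int i = 0; i < n; i++) {
            int x = b1[s1 + i] & 0xff;
            int y = b2[s2 + i] & 0xff;
            if (x != y) {
                return x - y;
            }
        }
        return l1 - l2;   // when one range is a prefix of the other, the shorter sorts first
    }

    // Assumed layout for this sketch: one length byte, then the UTF-8 payload.
    static byte[] encode(String s) {
        byte[] payload = s.getBytes(StandardCharsets.UTF_8);
        byte[] out = new byte[payload.length + 1];
        out[0] = (byte) payload.length;
        System.arraycopy(payload, 0, out, 1, payload.length);
        return out;
    }

    // Compares two encoded strings by their payload only, skipping the 1-byte prefix.
    static int compareEncoded(String a, String b) {
        byte[] x = encode(a);
        byte[] y = encode(b);
        return compareBytes(x, 1, x.length - 1, y, 1, y.length - 1);
    }

    public static void main(String[] args) {
        // "b" sorts after "abc" even though its length prefix is smaller.
        System.out.println(compareEncoded("b", "abc") > 0);   // true
    }
}
```

Comparing from offset 0 instead would order the values by their length byte first, which is why the prefix has to be skipped.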

21. About the map-side join (replicated join)

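Steps:
1) Put the small data set into a LocalResource:
job.addCacheFile(new URI("afile.txt"));
job.addCacheArchive(new URI("helper.zip"));
2) In the mapper's setup() method, read the small data set and load it into memory (in a data structure such as a HashMap or TreeMap).
3) As the map() method reads each record of the large data set, match it against the small data set to filter it.
4) When a match is found, concatenate the records from the two data sets and emit the result as output.
A LocalResource represents a file or library needed to run a container. Typical LocalResources are the jar files the container needs, configuration required before the container starts (remote service URLs, application settings and so on), and static dictionary files. The NodeManager is responsible for localizing these resources before it runs the container.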
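
Steps 2 to 4 can be sketched in plain Java, leaving out the cache-file handling. ReplicatedJoinSketch is a hypothetical class whose setup() and map() methods only play the roles of the mapper methods, and the "key,value" line format and sample records are assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class ReplicatedJoinSketch {
    private final Map<String, String> small = new HashMap<>();

    // Plays the role of setup(): load the small data set ("key,value" lines) into memory.
    void setup(List<String> smallLines) {
        for (String line : smallLines) {
            String[] parts = line.split(",", 2);
            small.put(parts[0], parts[1]);
        }
    }

    // Plays the role of map(): returns the joined record, or null when there is no match.
    String map(String bigLine) {
        String[] parts = bigLine.split(",", 2);
        String match = small.get(parts[0]);
        return match == null ? null : parts[0] + "," + parts[1] + "," + match;
    }

    // Runs the join over the large data set and collects the matching records.
    static List<String> join(List<String> smallLines, List<String> bigLines) {
        ReplicatedJoinSketch mapper = new ReplicatedJoinSketch();
        mapper.setup(smallLines);
        List<String> out = new ArrayList<>();
        for (String line : bigLines) {
            String joined = mapper.map(line);
            if (joined != null) {
                out.add(joined);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(join(
            Arrays.asList("A,alpha", "B,beta"),    // small data set
            Arrays.asList("A,1", "C,3")));         // large data set -> [A,1,alpha]
    }
}
```

String keys work here because String already defines equals() and hashCode(); a custom key class needs its own overrides.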
 

22. Running the jar fails: Error: java.lang.NullPointerException
    at mapsidejoin.Mapsidejoin$MapsideMapper.map(Mapsidejoin.java:37)

15/12/15 22:50:46 INFO mapreduce.Job: Task Id : attempt_1450188830584_0013_m_000000_1, Status : FAILED

Error: java.lang.NullPointerException
    at mapsidejoin.Mapsidejoin$MapsideMapper.map(Mapsidejoin.java:37)
    at mapsidejoin.Mapsidejoin$MapsideMapper.map(Mapsidejoin.java:26)
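Cause: HashMap.containsKey() looks the key up through the key class's hashCode() and equals() methods.
Solution: the Stock class (which implements WritableComparable) was missing overrides of equals() and hashCode(); add them.
Java code (getEntry() from java.util.HashMap in older JDK releases):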

  /**
   * Returns the entry associated with the specified key in the
   * HashMap. Returns null if the HashMap contains no mapping
   * for the key.
   */
  final Entry<K,V> getEntry(Object key) {
      int hash = (key == null) ? 0 : hash(key.hashCode());
      for (Entry<K,V> e = table[indexFor(hash, table.length)]; e != null; e = e.next) {
          Object k;
          if (e.hash == hash && ((k = e.key) == key || (key != null && key.equals(k))))
              return e;
      }
      return null;
  }
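
The effect of the missing overrides can be reproduced with two small key classes. PlainKey and ValueKey below are hypothetical and stand in for the Stock class before and after the fix; they are plain Java classes, not Writables.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;

// Key without equals()/hashCode(): inherits identity-based versions from Object.
class PlainKey {
    final String symbol;

    PlainKey(String symbol) { this.symbol = symbol; }
}

// Same field, but with value-based equals() and hashCode().
class ValueKey {
    final String symbol;

    ValueKey(String symbol) { this.symbol = symbol; }

    @Override public boolean equals(Object o) {
        return o instanceof ValueKey && symbol.equals(((ValueKey) o).symbol);
    }

    @Override public int hashCode() { return Objects.hash(symbol); }
}

class HashMapKeyDemo {
    // Looks up an equal-looking but distinct PlainKey object: not found.
    static boolean plainLookup() {
        Map<PlainKey, String> m = new HashMap<>();
        m.put(new PlainKey("AIT"), "x");
        return m.containsKey(new PlainKey("AIT"));
    }

    // Looks up an equal ValueKey object: found.
    static boolean valueLookup() {
        Map<ValueKey, String> m = new HashMap<>();
        m.put(new ValueKey("AIT"), "x");
        return m.containsKey(new ValueKey("AIT"));
    }

    public static void main(String[] args) {
        System.out.println("without overrides: " + plainLookup());   // false
        System.out.println("with overrides: " + valueLookup());      // true
    }
}
```

When the lookup misses, get() returns null, and dereferencing that result is the usual route to the NullPointerException shown above.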

23. Running the jar fails with UnknownHostException

root@HWX_Java:~/java/labs/Solutions/Lab7.1/MapSideJoin# yarn jar mapsidejoin.jar AIT

15/12/15 23:07:36 INFO impl.TimelineClientImpl: Timeline service address: http://resourcemanager:8188/ws/v1/timeline/
15/12/15 23:07:36 INFO client.RMProxy: Connecting to ResourceManager at resourcemanager/172.17.0.3:8050
15/12/15 23:07:37 INFO mapreduce.JobSubmitter: Cleaning up the staging area /user/root/.staging/job_1450188830584_0017
java.lang.IllegalArgumentException: java.net.UnknownHostException: sandbox

 

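24. In a map-side join that stores the small table in a HashMap, the key class must override hashCode() (together with equals()); otherwise HashMap.containsKey() cannot find the matching entry.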

 

25. TotalOrderPartitioner fails with: Error: java.lang.IllegalArgumentException: Can't read partitions file
    at org.apache.hadoop.mapreduce.lib.partition.TotalOrderPartitioner.setConf(TotalOrderPartitioner.java:116)

16/06/09 14:01:20 INFO mapreduce.Job: Task Id : attempt_1465443169491_0015_m_000000_2, Status : FAILED
Error: java.lang.IllegalArgumentException: Can't read partitions file
    at org.apache.hadoop.mapreduce.lib.partition.TotalOrderPartitioner.setConf(TotalOrderPartitioner.java:116)
    at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:76)
    at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:136)
    at org.apache.hadoop.mapred.MapTask$NewOutputCollector.<init>(MapTask.java:701)
    at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:770)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:341)
    at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:164)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:415)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
    at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)
Caused by: java.io.IOException: Wrong number of partitions in keyset
    at org.apache.hadoop.mapreduce.lib.partition.TotalOrderPartitioner.setConf(TotalOrderPartitioner.java:90)
    ... 10 more
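Cause: job.setNumReduceTasks(3); was called after the InputSampler.Sampler sampling step, so the partition file was written for a different number of reducers than the job ran with.
Solution: call job.setNumReduceTasks(3) before the sampling step.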
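
Why the order matters can be simulated in plain Java. The sketch rests on two assumptions taken from the error message rather than from the Hadoop source: the sampling step writes one split point fewer than the number of reducers configured at that moment, and the partitioner later checks that count against the final number of reducers. PartitionFileSketch and its methods are hypothetical; the default of one reducer is the standard MapReduce default.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

class PartitionFileSketch {
    // Stands in for the sampling step: picks numReducers - 1 split points from the sorted samples.
    static List<String> writeSplitPoints(List<String> samples, int numReducers) {
        List<String> sorted = new ArrayList<>(samples);
        Collections.sort(sorted);
        List<String> splits = new ArrayList<>();
        for (int i = 1; i < numReducers; i++) {
            splits.add(sorted.get(i * sorted.size() / numReducers));
        }
        return splits;
    }

    // Stands in for the partitioner's check when it loads the partition file.
    static void checkSplitPoints(List<String> splits, int numReducers) {
        if (splits.size() != numReducers - 1) {
            throw new IllegalArgumentException("Wrong number of partitions in keyset");
        }
    }

    // Returns true when the simulated job fails the partition-file check.
    static boolean jobFails(boolean setReducersBeforeSampling) {
        List<String> samples = Arrays.asList("d", "a", "f", "b", "e", "c");
        int reducersAtSampling = setReducersBeforeSampling ? 3 : 1;   // 1 is the default reducer count
        List<String> splits = writeSplitPoints(samples, reducersAtSampling);
        try {
            checkSplitPoints(splits, 3);   // the job finally runs with 3 reducers
            return false;
        } catch (IllegalArgumentException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println("reducers set after sampling fails: " + jobFails(false));    // true
        System.out.println("reducers set before sampling fails: " + jobFails(true));    // false
    }
}
```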

 

 

 

 

 

 

 

 
