Preparation before installing Hadoop

  1. On the installed Ubuntu system, add a user with sudo privileges.

root@nodeA:~# sudo adduser zyx

Adding user `zyx' ...

Adding new group `zyx' (1001) ...

Adding new user `zyx' (1001) with group `zyx' ...

Creating home directory `/home/zyx' ...

Copying files from `/etc/skel' ...

Enter new UNIX password:

Retype new UNIX password:

passwd: password updated successfully

Changing the user information for zyx

Enter the new value, or press ENTER for the default

        Full Name []: ^Cadduser: `/usr/bin/chfn zyx' exited from signal 2. Exiting.

root@nodeA:~#

root@nodeA:~# sudo  usermod -G admin -a zyx

root@nodeA:~#
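If the interactive questionnaire is not wanted, the same user can be created non-interactively. A minimal sketch (the `admin` group matches the command above; newer Ubuntu releases use the `sudo` group instead):

sudo adduser --disabled-password --gecos "" zyx   # create the user and home directory without prompts
sudo passwd zyx                                   # set the password separately
sudo usermod -a -G admin zyx                      # grant sudo via the admin group (on newer Ubuntu: -G sudo)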

  2. Set up passwordless SSH login

(1) On the namenode, enable passwordless login to the local machine

zyx@nodeA:~$ ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa

Generating public/private dsa key pair.

Created directory '/home/zyx/.ssh'.

Your identification has been saved in /home/zyx/.ssh/id_dsa.

Your public key has been saved in /home/zyx/.ssh/id_dsa.pub.

The key fingerprint is:

65:2e:e0:df:2e:61:a5:19:6a:ab:0e:38:45:a9:6a:2b zyx@nodeA

The key's randomart image is:

+--[ DSA 1024]----+

|                 |

|   .             |

|  o   .   o      |

| o   . ..+.      |

|. .   ..S=.      |

|.o    o.=o       |

|+..  . o...      |

|E...  . ..       |

|.. .o.   ..      |

+-----------------+

zyx@nodeA:~$ cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys

zyx@nodeA:~$

(2) Enable passwordless login from the namenode to the other datanodes

hadoop@nodeB:~$ scp hadoop@nodea:/home/hadoop/.ssh/id_dsa.pub /home/hadoop

hadoop@nodea's password:

id_dsa.pub                                    100%  602     0.6KB/s   00:00   

hadoop@nodeB:~$ cat id_dsa.pub >> .ssh/authorized_keys

hadoop@nodeB:~$ sudo ufw disable
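To confirm that the key-based login works, test from the namenode; neither connection should ask for a password. A quick check (the hostname nodeb follows the naming used above, and the same account name is assumed on both machines):

# run on the namenode; neither command should prompt for a password
ssh localhost exit
ssh nodeb exit
# if a password is still requested, tighten the permissions on ~/.ssh
chmod 700 ~/.ssh && chmod 600 ~/.ssh/authorized_keys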

  3. Copy the JDK file (jdk-6u20-linux-i586.bin) to Linux

Use the F-Secure SSH File Transfer Trial tool and simply drag and drop the file.
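If no graphical SFTP client is at hand, scp from the local machine works just as well. A sketch, assuming the .bin file is in the current directory and the target directory /home/zyx/jdk already exists on the node:

scp jdk-6u20-linux-i586.bin zyx@nodea:/home/zyx/jdk/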

  4. Installing and configuring jdk-6u20-linux-i586.bin

(1) Installation

zyx@nodeA:~$ ls

Examples  jdk

zyx@nodeA:~$ cd jdk

zyx@nodeA:~/jdk$ ls

jdk-6u20-linux-i586.bin

zyx@nodeA:~/jdk$ chmod a+x jdk*

zyx@nodeA:~/jdk$ ./jdk*

The license agreement is then displayed; accept it with yes and press Enter, and the installation completes.

zyx@nodeA:~/jdk$ ls

jdk1.6.0_20  jdk-6u20-linux-i586.bin

(2) Configuration

Open .bashrc with root@nodeA:/home/zyx# vi .bashrc, then append the following lines at the end:

export JAVA_HOME=/home/zyx/jdk/jdk1.6.0_20

export JRE_HOME=/home/zyx/jdk/jdk1.6.0_20/jre

export CLASSPATH=$CLASSPATH:$JAVA_HOME/lib:$JRE_HOME/lib

export PATH=$JAVA_HOME/bin:$JRE_HOME/bin:$PATH:$HOME/bin
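After saving .bashrc, reload it and check that the new JDK is the one being picked up (a quick sanity check):

source ~/.bashrc
java -version        # should report java version "1.6.0_20"
which java           # should point into /home/zyx/jdk/jdk1.6.0_20/bin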

  5. Installing Hadoop

Download URL:

http://labs.renren.com/apache-mirror/hadoop/core/hadoop-0.20.2/hadoop-0.20.2.tar.gz

Put hadoop-0.20.2.tar.gz under /home/zyx/hadoop, then extract it:

zyx@nodeB:~/hadoop$ tar -zvxf hadoop-0.20.2.tar.gz

Set the environment variables by appending the following to /home/zyx/.bashrc:

zyx@nodeA:~$ vi .bashrc

export HADOOP_HOME=/home/zyx/hadoop/hadoop-0.20.2

export PATH=$HADOOP_HOME/bin:$PATH
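Reload .bashrc again and confirm that the hadoop command is on the PATH:

source ~/.bashrc
hadoop version       # should report Hadoop 0.20.2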

  6. Configuring Hadoop
(1) Configure the Java environment in conf/hadoop-env.sh:

export JAVA_HOME=/home/zyx/jdk/jdk1.6.0_20

(2) Configure the conf/masters and conf/slaves files; this only needs to be done on the namenode (example contents are sketched right after this list).
(3) Configure core-site.xml, hdfs-site.xml, and mapred-site.xml:
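For reference, conf/masters holds the host that runs the secondary namenode and conf/slaves lists the datanode/tasktracker hosts, one per line. A minimal sketch for the single-node setup used in the listings below (everything on 192.168.1.103):

# conf/masters
localhost
# conf/slaves
localhost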

zyx@nodeC:~/hadoop-0.20.2/conf$ more core-site.xml

<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://192.168.1.103:54310</value>
    <!-- alternative: hdfs://192.168.1.103:9000 -->
  </property>
</configuration>

zyx@nodeC:~/hadoop-0.20.2/conf$ more hdfs-site.xml

<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>

zyx@nodeC:~/hadoop-0.20.2/conf$ more mapred-site.xml

<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>hdfs://192.168.1.103:54320</value>
    <!-- alternative: hdfs://192.168.1.103:9001 -->
  </property>
</configuration>

  7. Running Hadoop

(0) Format the namenode:

zyx@nodeC:~/hadoop-0.20.2/bin$ hadoop namenode -format
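Formatting only initializes HDFS; the daemons still have to be started before the processes below will appear:

zyx@nodeC:~/hadoop-0.20.2/bin$ start-all.sh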

(1) Check the processes with jps:

zyx@nodeC:~/hadoop-0.20.2/bin$ jps

31030 NameNode

31488 TaskTracker

31283 SecondaryNameNode

31372 JobTracker

31145 DataNode

31599 Jps

(2) Check the cluster status:

zyx@nodeC:~/hadoop-0.20.2/bin$ hadoop dfsadmin -report

Configured Capacity: 304716488704 (283.79 GB)

Present Capacity: 270065557519 (251.52 GB)

DFS Remaining: 270065532928 (251.52 GB)

DFS Used: 24591 (24.01 KB)

DFS Used%: 0%

Under replicated blocks: 0

Blocks with corrupt replicas: 0

Missing blocks: 0

-------------------------------------------------

Datanodes available: 1 (1 total, 0 dead)

Name: 192.168.1.103:50010

Decommission Status : Normal

Configured Capacity: 304716488704 (283.79 GB)

DFS Used: 24591 (24.01 KB)

Non DFS Used: 34650931185 (32.27 GB)

DFS Remaining: 270065532928(251.52 GB)

DFS Used%: 0%

DFS Remaining%: 88.63%

Last contact: Fri Apr 23 15:39:10 CST 2010
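The same status information is also served by the built-in web interfaces, assuming the 0.20.2 default ports: the NameNode UI at http://192.168.1.103:50070 and the JobTracker UI at http://192.168.1.103:50030.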

(3) Stop the daemons:

zyx@nodeC:~/hadoop-0.20.2/bin$ stop-all.sh

stopping jobtracker

localhost: stopping tasktracker

stopping namenode

localhost: stopping datanode

localhost: stopping secondarynamenode

  8. Running a simple Java program
(1) First create two files, file01 and file02, on the local disk:

[cuijj@station1 ~]$ echo "Hello cuijj bye cuijj" > file01
[cuijj@station1 ~]$ echo "Hello Hadoop Goodbye Hadoop" > file02

(2) Create an input directory in HDFS:

[cuijj@station1 ~]$ hadoop dfs -mkdir input

(3) Copy file01 and file02 into the input directory in HDFS:

zyx@nodeC:~$ hadoop dfs -copyFromLocal /home/zyx/file0* input

(4) Check that the input directory now exists in HDFS:

zyx@nodeC:~$ hadoop dfs -ls

Found 1 items

drwxr-xr-x   - zyx supergroup          0 2010-04-23 16:40 /user/zyx/input

(5) Check that file01 and file02 were copied into the input directory:

zyx@nodeC:~$ hadoop dfs -ls input

Found 2 items

-rw-r--r--   1 zyx supergroup          0 2010-04-23 16:40 /user/zyx/input/file01

-rw-r--r--   1 zyx supergroup          0 2010-04-23 16:40 /user/zyx/input/file02

(6) Run wordcount (make sure there is no output directory in HDFS yet):

zyx@nodeC:~/hadoop-0.20.2$ hadoop jar hadoop-0.20.2-examples.jar wordcount input output

10/04/24 09:25:10 INFO input.FileInputFormat: Total input paths to process : 2

10/04/24 09:25:11 INFO mapred.JobClient: Running job: job_201004240840_0001

10/04/24 09:25:12 INFO mapred.JobClient:  map 0% reduce 0%

10/04/24 09:25:22 INFO mapred.JobClient:  map 100% reduce 0%

10/04/24 09:25:34 INFO mapred.JobClient:  map 100% reduce 100%

10/04/24 09:25:36 INFO mapred.JobClient: Job complete: job_201004240840_0001

10/04/24 09:25:36 INFO mapred.JobClient: Counters: 17

10/04/24 09:25:36 INFO mapred.JobClient:   Job Counters

10/04/24 09:25:36 INFO mapred.JobClient:     Launched reduce tasks=1

10/04/24 09:25:36 INFO mapred.JobClient:     Launched map tasks=2

10/04/24 09:25:36 INFO mapred.JobClient:     Data-local map tasks=2

10/04/24 09:25:36 INFO mapred.JobClient:   FileSystemCounters

10/04/24 09:25:36 INFO mapred.JobClient:     FILE_BYTES_READ=79

10/04/24 09:25:36 INFO mapred.JobClient:     HDFS_BYTES_READ=50

10/04/24 09:25:36 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=228

10/04/24 09:25:36 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=41

10/04/24 09:25:36 INFO mapred.JobClient:   Map-Reduce Framework

10/04/24 09:25:36 INFO mapred.JobClient:     Reduce input groups=5

10/04/24 09:25:36 INFO mapred.JobClient:     Combine output records=6

10/04/24 09:25:36 INFO mapred.JobClient:     Map input records=2

10/04/24 09:25:36 INFO mapred.JobClient:     Reduce shuffle bytes=85

10/04/24 09:25:36 INFO mapred.JobClient:     Reduce output records=5

10/04/24 09:25:36 INFO mapred.JobClient:     Spilled Records=12

10/04/24 09:25:36 INFO mapred.JobClient:     Map output bytes=82

10/04/24 09:25:36 INFO mapred.JobClient:     Combine input records=8

10/04/24 09:25:36 INFO mapred.JobClient:     Map output records=8

10/04/24 09:25:36 INFO mapred.JobClient:     Reduce input records=6

(7) View the results:

zyx@nodeC:~/hadoop-0.20.2$ hadoop fs -cat output/part-r-00000

Goodbye 1

Hadoop  2

Hello   2

bye     1

cuijj   2

  9. Installing MapReduce
  10. Running MapReduce programs

  11. Compiling a .java program against Hadoop:

root@nodeC:/home/zyx/hadoop-0.20.2# javac -classpath /home/zyx/hadoop-0.20.2/hadoop-0.20.2-core.jar:/home/zyx/hadoop-0.20.2/lib/commons-cli-1.2.jar -d /home/zyx/wordcount_class /home/zyx/hadoop-0.20.2/src/examples/org/apache/hadoop/examples/WordCount.java
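Note that javac -d does not create the destination directory, so it has to exist before the command above is run:

mkdir -p /home/zyx/wordcount_class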

  12. Packaging the .class files into a .jar file:

root@nodeC:/home/zyx/wordcount_class/org/apache/hadoop/examples# jar -cvf /home/zyx/wordcount.jar /home/zyx/wordcount_class/ .

added manifest

adding: home/zyx/wordcount_class/(in = 0) (out= 0)(stored 0%)

adding: home/zyx/wordcount_class/org/(in = 0) (out= 0)(stored 0%)

adding: home/zyx/wordcount_class/org/apache/(in = 0) (out= 0)(stored 0%)

adding: home/zyx/wordcount_class/org/apache/hadoop/(in = 0) (out= 0)(stored 0%)

adding: home/zyx/wordcount_class/org/apache/hadoop/examples/(in = 0) (out= 0)(stored 0%)

adding: home/zyx/wordcount_class/org/apache/hadoop/examples/WordCount.class(in = 1911) (out= 996)(deflated 47%)

adding: home/zyx/wordcount_class/org/apache/hadoop/examples/WordCount$TokenizerMapper.class(in = 1790) (out= 765)(deflated 57%)

adding: home/zyx/wordcount_class/org/apache/hadoop/examples/WordCount$IntSumReducer.class(in = 1789) (out= 746)(deflated 58%)

adding: WordCount.class(in = 1911) (out= 996)(deflated 47%)

adding: WordCount$TokenizerMapper.class(in = 1790) (out= 765)(deflated 57%)
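The command above embeds the full home/zyx/... path inside the jar, as the "adding: home/zyx/..." lines show. Passing the class directory with -C keeps only the package path in the jar, which is what hadoop jar expects. A sketch of the cleaner form, followed by a hedged example of running the resulting jar (the output directory name here is arbitrary):

jar -cvf /home/zyx/wordcount.jar -C /home/zyx/wordcount_class/ .
hadoop jar /home/zyx/wordcount.jar org.apache.hadoop.examples.WordCount input output2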

Example: WordCount v1.0

Before diving into the details, let's look at a Map/Reduce application example to get an initial sense of how it works.

WordCount is a simple application that counts the number of occurrences of each word in a given input data set.

This application runs under all three Hadoop installation modes: standalone, pseudo-distributed, and fully distributed.

Source code

WordCount.java

package org.myorg;

import java.io.IOException;
import java.util.*;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;
import org.apache.hadoop.util.*;

public class WordCount {

   // Mapper: splits each input line into tokens and emits <word, 1> for every token.
   public static class Map extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable> {
     private final static IntWritable one = new IntWritable(1);
     private Text word = new Text();

     public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
       String line = value.toString();
       StringTokenizer tokenizer = new StringTokenizer(line);
       while (tokenizer.hasMoreTokens()) {
         word.set(tokenizer.nextToken());
         output.collect(word, one);
       }
     }
   }

   // Reducer (also used as combiner): sums the counts emitted for each word.
   public static class Reduce extends MapReduceBase implements Reducer<Text, IntWritable, Text, IntWritable> {
     public void reduce(Text key, Iterator<IntWritable> values, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
       int sum = 0;
       while (values.hasNext()) {
         sum += values.next().get();
       }
       output.collect(key, new IntWritable(sum));
     }
   }

   public static void main(String[] args) throws Exception {
     JobConf conf = new JobConf(WordCount.class);
     conf.setJobName("wordcount");

     conf.setOutputKeyClass(Text.class);
     conf.setOutputValueClass(IntWritable.class);

     conf.setMapperClass(Map.class);
     conf.setCombinerClass(Reduce.class);
     conf.setReducerClass(Reduce.class);

     conf.setInputFormat(TextInputFormat.class);
     conf.setOutputFormat(TextOutputFormat.class);

     // args[0] = input path in HDFS, args[1] = output path (must not exist yet)
     FileInputFormat.setInputPaths(conf, new Path(args[0]));
     FileOutputFormat.setOutputPath(conf, new Path(args[1]));

     JobClient.runJob(conf);
   }
}

Usage

Assuming that the environment variable HADOOP_HOME points to the root of the installation and HADOOP_VERSION to the installed Hadoop version, compile WordCount.java and create a jar as follows:

$ mkdir wordcount_classes
$ javac -classpath ${HADOOP_HOME}/hadoop-${HADOOP_VERSION}-core.jar -d wordcount_classes WordCount.java
$ jar -cvf /usr/joe/wordcount.jar -C wordcount_classes/ .

Assume that:

  • /usr/joe/wordcount/input - the input directory in HDFS
  • /usr/joe/wordcount/output - the output directory in HDFS

Use the sample text files as input:

$ bin/hadoop dfs -ls /usr/joe/wordcount/input/
/usr/joe/wordcount/input/file01
/usr/joe/wordcount/input/file02

$ bin/hadoop dfs -cat /usr/joe/wordcount/input/file01
Hello World Bye World

$ bin/hadoop dfs -cat /usr/joe/wordcount/input/file02
Hello Hadoop Goodbye Hadoop

Run the application:

$ bin/hadoop jar /usr/joe/wordcount.jar org.myorg.WordCount /usr/joe/wordcount/input /usr/joe/wordcount/output

The output is:

$ bin/hadoop dfs -cat /usr/joe/wordcount/output/part-00000
Bye 1
Goodbye 1
Hadoop 2
Hello 2
World 2

Applications can use the -files option to specify a comma-separated list of paths to be made available in the current working directory of each task. The -libjars option adds jar files to the classpaths of the map and reduce tasks. The -archives option passes archives as arguments; each archive is unpacked and a symbolic link named after it is created in the task's current working directory. For more details on command-line options, see the Commands manual.

Running the wordcount example with -libjars and -files:
hadoop jar hadoop-examples.jar wordcount -files cachefile.txt -libjars mylib.jar input output
