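This post walks through configuring Hadoop. Hadoop releases move quickly, so some details will differ between versions; consult the official documentation for the release you install. Installing Hadoop has a few prerequisites: Sun Java 6 (a later version also works; I have not tried OpenJDK), a dedicated Hadoop system user, and a configured SSH setup (SSH here means the OpenSSH server, which is used for remote operations across nodes).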
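1. Installing the Sun JDK on Linux. The steps are as follows.
Reference: http://www.devsniper.com/ubuntu-12-04-install-sun-jdk-6-7/
- Download the Sun JDK 6 .bin installer (jdk-6u32-linux-x64.bin in this walkthrough)
- Make sure the file is executable
chmod +x jdk-6u32-linux-x64.bin
- Run the .bin file
./jdk-6u32-linux-x64.bin
- Move the extracted directory to its target location (create /usr/lib/jvm first if it does not exist)
sudo mkdir -p /usr/lib/jvm
sudo mv jdk1.6.0_32 /usr/lib/jvm/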
- Register the new Java binaries with the system's alternatives mechanism
sudo update-alternatives --install /usr/bin/javac javac /usr/lib/jvm/jdk1.6.0_32/bin/javac 1
sudo update-alternatives --install /usr/bin/java java /usr/lib/jvm/jdk1.6.0_32/bin/java 1
sudo update-alternatives --install /usr/bin/javaws javaws /usr/lib/jvm/jdk1.6.0_32/bin/javaws 1
- If several Java versions are installed, choose the system default
sudo update-alternatives --config javac
sudo update-alternatives --config java
sudo update-alternatives --config javaws
- Verify the Java version
java -version
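Before moving on, it can help to confirm that the directory moved into /usr/lib/jvm really is a usable JDK. The following is a minimal sketch, not part of the original steps: `check_jdk_dir` is a hypothetical helper name, and it only checks that `bin/java` and `bin/javac` exist and are executable.

```shell
# check_jdk_dir DIR: succeed only if DIR looks like an unpacked JDK,
# i.e. it contains executable bin/java and bin/javac.
# (Hypothetical helper for illustration; not part of the JDK or Hadoop.)
check_jdk_dir() {
    for jdk_tool in java javac; do
        if [ ! -x "$1/bin/$jdk_tool" ]; then
            echo "missing or not executable: $1/bin/$jdk_tool" >&2
            return 1
        fi
    done
    echo "JDK layout looks fine: $1"
}

# Example, using the path from the steps above:
#   check_jdk_dir /usr/lib/jvm/jdk1.6.0_32
```

The same path is the value to use for JAVA_HOME later on.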
2. Adding a Hadoop system user
We need a dedicated hadoop user to run Hadoop.
$ sudo addgroup hadoop                    # create the group
$ sudo adduser --ingroup hadoop hduser    # create the user inside that group
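To confirm that the account ended up in the right group, the output of `id -nG` can be checked. This is a small sketch under the assumption that the user and group names are the ones used above; `user_in_group` is a hypothetical helper name.

```shell
# user_in_group USER GROUP: succeed if USER is a member of GROUP.
# (Hypothetical helper for illustration.)
user_in_group() {
    # id -nG prints the user's group names separated by spaces.
    for grp in $(id -nG "$1" 2>/dev/null); do
        if [ "$grp" = "$2" ]; then
            return 0
        fi
    done
    return 1
}

# Example:
#   user_in_group hduser hadoop && echo "hduser is in group hadoop"
```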
3. Configuring SSH
SSH's role was described above, so we go straight to configuring it. Note: to avoid typing a password on every remote login, the key is generated with an empty passphrase.
user@ubuntu:~$ su - hduser
hduser@ubuntu:~$ ssh-keygen -t rsa -P ""
Generating public/private rsa key pair.
Enter file in which to save the key (/home/hduser/.ssh/id_rsa):
Created directory '/home/hduser/.ssh'.
Your identification has been saved in /home/hduser/.ssh/id_rsa.
Your public key has been saved in /home/hduser/.ssh/id_rsa.pub.
The key fingerprint is:
9b:82:ea:58:b4:e0:35:d7:ff:19:66:a6:ef:ae:0e:d2 hduser@ubuntu
The key's randomart image is:
[...snipp...]
hduser@ubuntu:~$
Next, allow SSH logins with the newly generated key by appending the public key to the list of authorized keys:
hduser@ubuntu:~$ cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys
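One thing the post does not cover: with the default `StrictModes yes` setting, sshd ignores an `authorized_keys` file that other users can write to, and the login then falls back to a password prompt. The following is a hedged sketch of tightening the permissions; `tighten_ssh_perms` is a hypothetical helper name, and the modes 700 and 600 are the conventional choice rather than something taken from the original text.

```shell
# tighten_ssh_perms DIR: restrict DIR to its owner (mode 700) and the
# authorized_keys file inside it to owner read/write (mode 600).
# (Hypothetical helper for illustration.)
tighten_ssh_perms() {
    chmod 700 "$1" &&
    chmod 600 "$1/authorized_keys"
}

# Example:
#   tighten_ssh_perms "$HOME/.ssh"
#
# Once the host key has been accepted in a first interactive login,
# a non-interactive check fails instead of prompting for a password:
#   ssh -o BatchMode=yes localhost true
```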
Finally, test that you can connect to the local machine:
hduser@ubuntu:~$ ssh localhost
The authenticity of host 'localhost (::1)' can't be established.
RSA key fingerprint is d7:87:25:47:ae:02:00:eb:1d:75:4f:bb:44:f9:36:26.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added 'localhost' (RSA) to the list of known hosts.
Linux ubuntu 2.6.32-22-generic #33-Ubuntu SMP Wed Apr 28 13:27:30 UTC 2010 i686 GNU/Linux
Ubuntu 10.04 LTS
[...snipp...]
hduser@ubuntu:~$
If you see output like the above and are not asked for a password, the SSH setup works.
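4. Installing Hadoop. Download a Hadoop release from the official Apache website. The commands below use the hadoop-1.0.3 tarball; the sample log output later in this post was captured on 0.20.2.
After downloading, place the tarball in /usr/local and run:
$ cd /usr/local
$ sudo tar xzf hadoop-1.0.3.tar.gz
$ sudo mv hadoop-1.0.3 hadoop
$ sudo chown -R hduser:hadoop hadoop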
Update $HOME/.bashrc by appending the following to the end of the file. JAVA_HOME must point at the JDK directory installed in step 1:
# Set Hadoop-related environment variables
export HADOOP_HOME=/usr/local/hadoop

# Set JAVA_HOME (we will also configure JAVA_HOME directly for Hadoop later on)
export JAVA_HOME=/usr/lib/jvm/jdk1.6.0_32

# Some convenient aliases and functions for running Hadoop-related commands
unalias fs &> /dev/null
alias fs="hadoop fs"
unalias hls &> /dev/null
alias hls="fs -ls"

# If you have LZO compression enabled in your Hadoop cluster and
# compress job outputs with LZOP (not covered in this tutorial):
# Conveniently inspect an LZOP compressed file from the command
# line; run via:
#
# $ lzohead /hdfs/path/to/lzop/compressed/file.lzo
#
# Requires installed 'lzop' command.
#
lzohead () {
    hadoop fs -cat "$1" | lzop -dc | head -1000 | less
}

# Add Hadoop bin/ directory to PATH
export PATH=$PATH:$HADOOP_HOME/bin
Now configure Hadoop's own files.
Start with /usr/local/hadoop/conf/hadoop-env.sh.
Replace ${JAVA_HOME} with your JDK installation path, that is, change
# The java implementation to use. Required.
# export JAVA_HOME=${JAVA_HOME}
to
# The java implementation to use. Required.
export JAVA_HOME=/usr/lib/jvm/jdk1.6.0_32
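Next, edit conf/core-site.xml, adding the following properties between the <configuration> and </configuration> tags:
<property>
  <name>hadoop.tmp.dir</name>
  <value>/app/hadoop/tmp</value>
  <description>A base for other temporary directories.</description>
</property>

<property>
  <name>fs.default.name</name>
  <value>hdfs://localhost:54310</value>
  <description>The name of the default file system. A URI whose scheme and authority determine the FileSystem implementation. The uri's scheme determines the config property (fs.SCHEME.impl) naming the FileSystem implementation class. The uri's authority is used to determine the host, port, etc. for a filesystem.</description>
</property>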
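The directory named in hadoop.tmp.dir has to exist and be writable by the user that runs Hadoop, a step the post does not show. Below is a minimal sketch, assuming the hduser:hadoop account from step 2; `prepare_dir` is a hypothetical helper name, and mode 750 is an assumption on my part, not a value from the original.

```shell
# prepare_dir DIR [OWNER]: create DIR (and any missing parents),
# optionally hand it to OWNER in user:group form, and restrict access
# to the owner and group. (Hypothetical helper for illustration.)
prepare_dir() {
    mkdir -p "$1" || return 1
    if [ -n "${2:-}" ]; then
        chown "$2" "$1" || return 1
    fi
    chmod 750 "$1"
}

# Example, run as root because /app does not belong to hduser:
#   prepare_dir /app/hadoop/tmp hduser:hadoop
```

If the directory is missing or owned by another user, formatting the NameNode is likely to fail with a permission error.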
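Then edit conf/mapred-site.xml, again inside the <configuration> element:
<property>
  <name>mapred.job.tracker</name>
  <value>localhost:54311</value>
  <description>The host and port that the MapReduce job tracker runs at. If "local", then jobs are run in-process as a single map and reduce task.</description>
</property>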
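Finally, edit conf/hdfs-site.xml, inside the <configuration> element:
<property>
  <name>dfs.replication</name>
  <value>1</value>
  <description>Default block replication. The actual number of replications can be specified when the file is created. The default is used if replication is not specified in create time.</description>
</property>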
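All three site files share one layout: an XML document whose root `<configuration>` element holds `<property>` entries, each with a `<name>` and a `<value>`. As an illustration of that layout, here is a small sketch that generates such a file from name/value pairs. `hadoop_property` and `write_site_xml` are hypothetical helpers, not Hadoop tools, and the sketch does not escape XML special characters in values.

```shell
# hadoop_property NAME VALUE: print one <property> element.
# (Hypothetical helper; values are written as-is, without XML escaping.)
hadoop_property() {
    printf '  <property>\n    <name>%s</name>\n    <value>%s</value>\n  </property>\n' "$1" "$2"
}

# write_site_xml FILE NAME VALUE [NAME VALUE ...]: write a complete
# site file containing one property per NAME/VALUE pair.
write_site_xml() {
    site_file="$1"
    shift
    {
        printf '<?xml version="1.0"?>\n<configuration>\n'
        while [ "$#" -ge 2 ]; do
            hadoop_property "$1" "$2"
            shift 2
        done
        printf '</configuration>\n'
    } > "$site_file"
}

# Example, using the values from this post:
#   write_site_xml conf/hdfs-site.xml dfs.replication 1
#   write_site_xml conf/mapred-site.xml mapred.job.tracker localhost:54311
```

Editing the files by hand works just as well; the generator only makes the structure explicit.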
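Before starting Hadoop for the first time, the HDFS filesystem must be formatted. Run the following command once; formatting again later erases the data stored in HDFS.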
hduser@ubuntu:~$ /usr/local/hadoop/bin/hadoop namenode -format
10/05/08 16:59:56 INFO namenode.NameNode: STARTUP_MSG:
/************************************************************
STARTUP_MSG: Starting NameNode
STARTUP_MSG:   host = ubuntu/127.0.1.1
STARTUP_MSG:   args = [-format]
STARTUP_MSG:   version = 0.20.2
STARTUP_MSG:   build = https://svn.apache.org/repos/asf/hadoop/common/branches/branch-0.20 -r 911707; compiled by 'chrisdo' on Fri Feb 19 08:07:34 UTC 2010
************************************************************/
10/05/08 16:59:56 INFO namenode.FSNamesystem: fsOwner=hduser,hadoop
10/05/08 16:59:56 INFO namenode.FSNamesystem: supergroup=supergroup
10/05/08 16:59:56 INFO namenode.FSNamesystem: isPermissionEnabled=true
10/05/08 16:59:56 INFO common.Storage: Image file of size 96 saved in 0 seconds.
10/05/08 16:59:57 INFO common.Storage: Storage directory .../hadoop-hduser/dfs/name has been successfully formatted.
10/05/08 16:59:57 INFO namenode.NameNode: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at ubuntu/127.0.1.1
************************************************************/
hduser@ubuntu:/usr/local/hadoop$
Start the single-node cluster:
hduser@ubuntu:/usr/local/hadoop$ bin/start-all.sh
starting namenode, logging to /usr/local/hadoop/bin/../logs/hadoop-hduser-namenode-ubuntu.out
localhost: starting datanode, logging to /usr/local/hadoop/bin/../logs/hadoop-hduser-datanode-ubuntu.out
localhost: starting secondarynamenode, logging to /usr/local/hadoop/bin/../logs/hadoop-hduser-secondarynamenode-ubuntu.out
starting jobtracker, logging to /usr/local/hadoop/bin/../logs/hadoop-hduser-jobtracker-ubuntu.out
localhost: starting tasktracker, logging to /usr/local/hadoop/bin/../logs/hadoop-hduser-tasktracker-ubuntu.out
hduser@ubuntu:/usr/local/hadoop$
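The start script reports five daemons: namenode, datanode, secondarynamenode, jobtracker, and tasktracker. A quick way to see whether they all stayed up is the JDK's `jps` tool, which lists running Java processes as `<pid> <name>`. The sketch below filters that output; `missing_daemons` is a hypothetical helper name, and the process names it looks for are the ones Hadoop 0.20/1.x uses.

```shell
# missing_daemons: read `jps` output on stdin and print the name of each
# expected Hadoop daemon that is not listed. Prints nothing when all
# five are running. (Hypothetical helper for illustration.)
missing_daemons() {
    jps_out="$(cat)"
    for daemon in NameNode DataNode SecondaryNameNode JobTracker TaskTracker; do
        # Compare the whole second field, so that "NameNode" does not
        # accidentally match "SecondaryNameNode".
        printf '%s\n' "$jps_out" |
            awk -v n="$daemon" '$2 == n { found = 1 } END { exit !found }' ||
            echo "$daemon"
    done
}

# Example:
#   jps | missing_daemons
```

If a daemon is reported missing, its log file under /usr/local/hadoop/logs is the first place to look.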
You can use the following command to check which ports Hadoop is listening on:
hduser@ubuntu:~$ sudo netstat -plten | grep java
tcp   0  0 0.0.0.0:50070   0.0.0.0:*  LISTEN  1001  9236  2471/java
tcp   0  0 0.0.0.0:50010   0.0.0.0:*  LISTEN  1001  9998  2628/java
tcp   0  0 0.0.0.0:48159   0.0.0.0:*  LISTEN  1001  8496  2628/java
tcp   0  0 0.0.0.0:53121   0.0.0.0:*  LISTEN  1001  9228  2857/java
tcp   0  0 127.0.0.1:54310 0.0.0.0:*  LISTEN  1001  8143  2471/java
tcp   0  0 127.0.0.1:54311 0.0.0.0:*  LISTEN  1001  9230  2857/java
tcp   0  0 0.0.0.0:59305   0.0.0.0:*  LISTEN  1001  8141  2471/java
tcp   0  0 0.0.0.0:50060   0.0.0.0:*  LISTEN  1001  9857  3005/java
tcp   0  0 0.0.0.0:49900   0.0.0.0:*  LISTEN  1001  9037  2785/java
tcp   0  0 0.0.0.0:50030   0.0.0.0:*  LISTEN  1001  9773  2857/java
hduser@ubuntu:~$
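In that listing, 54310 and 54311 are the ports set in core-site.xml and mapred-site.xml. To test for a specific port in a script rather than by eye, the netstat output can be filtered on its local-address and state columns. This is a sketch under the assumption that the column layout matches the listing above; `port_listening` is a hypothetical helper name.

```shell
# port_listening PORT: read `netstat -plten` output on stdin and succeed
# if some socket in LISTEN state is bound to PORT.
# (Hypothetical helper for illustration.)
port_listening() {
    awk -v p="$1" '
        $6 == "LISTEN" {
            # Field 4 is the local address, e.g. 127.0.0.1:54310;
            # the port is whatever follows the last colon.
            n = split($4, parts, ":")
            if (parts[n] == p) found = 1
        }
        END { exit !found }'
}

# Example:
#   sudo netstat -plten | port_listening 54310 && echo "NameNode RPC port is open"
```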
Stop the single-node cluster:
hduser@ubuntu:/usr/local/hadoop$ bin/stop-all.sh
stopping jobtracker
localhost: stopping tasktracker
stopping namenode
localhost: stopping datanode
localhost: stopping secondarynamenode
hduser@ubuntu:/usr/local/hadoop$