= HADOOP =
* Install JRE
* centos
[1] Remove any old JDK (if one is installed)
* java -version (any output means an old JDK is present)
* rpm -qa | grep java
* rpm -e --nodeps <pkg> (for each package printed by rpm -qa | grep java; see the sketch below)
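* A minimal sketch of the removal step (the package name below is illustrative; substitute whatever rpm -qa | grep java actually prints):
{{{
# List the installed Java packages
rpm -qa | grep java
# Remove each one by name, e.g. (hypothetical package name):
rpm -e --nodeps java-1.6.0-openjdk-1.6.0.0-1.7.b09.el6.x86_64
}}}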
[2] Download & install the JDK
* http://java.com/zh_CN/download/manual.jsp
* rpm -ivh jre-package.rpm (fresh install)
* rpm -Uvh jre-package.rpm (upgrade an existing package)
* ubuntu
[1] Remove any old JDK (if one is installed)
* sudo apt-get purge openjdk* (this clears out leftover openjdk packages)
[2] Install the new JDK (Oracle JDK 7)
* sudo add-apt-repository ppa:eugenesan/java
* sudo apt-get update
* sudo apt-get install oracle-java7-installer
* javac -version
* Install HADOOP [ cdh3 ]
* centos
[1] Add cloudera-cdh3.repo (http://archive.cloudera.com/redhat/6/x86_64/cdh/cloudera-cdh3.repo)
* save it as /etc/yum.repos.d/cloudera-cdh3.repo
* yum search hadoop
* install the hadoop packages (namenode, datanode, the base package, etc.)
* configure the hadoop files (core-site.xml, hdfs-site.xml, mapred-site.xml, hadoop-env.sh)
* format the namenode as the hdfs user (sudo -u hdfs hadoop namenode -format)
* start the hadoop services (namenode, secondarynamenode, datanode, jobtracker, tasktracker, zookeeper)
* hadoop jar hadoop-examples.jar wordcount input output (the examples jar takes the program name, e.g. wordcount, as its first argument)
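* Putting the CentOS steps together, a minimal sketch (package and service names follow the CDH3 docs; adjust to your setup):
{{{
# Fetch the repo file, then install the core packages (run as root)
curl -o /etc/yum.repos.d/cloudera-cdh3.repo http://archive.cloudera.com/redhat/6/x86_64/cdh/cloudera-cdh3.repo
yum install hadoop-0.20 hadoop-0.20-namenode hadoop-0.20-secondarynamenode \
    hadoop-0.20-datanode hadoop-0.20-jobtracker hadoop-0.20-tasktracker
# Format HDFS as the hdfs user, then bring the daemons up
sudo -u hdfs hadoop namenode -format
for s in namenode secondarynamenode datanode jobtracker tasktracker; do
    service hadoop-0.20-$s start
done
}}}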
* ubuntu
[1] Download an official Hadoop release: http://www.apache.org/dyn/closer.cgi/hadoop/common/
* unpack the tarball (tar zxvf hadoop-2.4.0.tar.gz -> /usr/local/hadoop, matching HADOOP_PREFIX below)
* edit etc/hadoop/hadoop-env.sh
* export JAVA_HOME=/usr/java/latest
* export HADOOP_PREFIX=/usr/local/hadoop
* run bin/hadoop and check that it prints its usage message (see the sketch below)
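* The download/setup steps above as one runnable sketch (the install path /usr/local/hadoop is just one choice, not mandatory):
{{{
# Unpack and move into place
tar zxvf hadoop-2.4.0.tar.gz -C /usr/local
mv /usr/local/hadoop-2.4.0 /usr/local/hadoop
cd /usr/local/hadoop
# Point hadoop-env.sh at the JDK and the install prefix
echo 'export JAVA_HOME=/usr/java/latest' >> etc/hadoop/hadoop-env.sh
echo 'export HADOOP_PREFIX=/usr/local/hadoop' >> etc/hadoop/hadoop-env.sh
# Should print the hadoop usage message
bin/hadoop
}}}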
[2] Choose a Hadoop mode
[1] Local (Standalone) Mode
* mkdir input
* cp etc/hadoop/*.xml input
* bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.4.0.jar grep input output 'dfs[a-z.]+'
* cat output/*
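* One gotcha: the example job refuses to run if output already exists, so clear it before re-running:
{{{
rm -rf output   # in standalone mode the output dir lives on the local filesystem
}}}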
[2] Pseudo-Distributed Mode
* edit etc/hadoop/core-site.xml
{{{
<configuration>
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://localhost:9000</value>
    </property>
</configuration>
}}}
* edit etc/hadoop/hdfs-site.xml (replication factor 1, since there is only a single datanode)
{{{
<configuration>
    <property>
        <name>dfs.replication</name>
        <value>1</value>
    </property>
</configuration>
}}}
* ssh localhost (make sure passwordless ssh login works; see the sketch below)
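* If ssh localhost prompts for a password, set up key-based login first (this mirrors the Hadoop single-node setup docs):
{{{
ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa
cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys
chmod 0600 ~/.ssh/authorized_keys
}}}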
* format the namenode: bin/hdfs namenode -format
* start hdfs: sbin/start-dfs.sh
* the NameNode web UI should now be reachable at: http://your.server.host:50070/
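* A quick way to confirm the daemons are up (jps ships with the JDK):
{{{
jps   # should list NameNode, DataNode and SecondaryNameNode
}}}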
* test that everything works
{{{
# Create a home directory in HDFS to hold the data
bin/hdfs dfs -mkdir /user
bin/hdfs dfs -mkdir /user/your-user-name
# Put the test input files into HDFS
bin/hdfs dfs -put etc/hadoop input
# Run the MapReduce job
bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.4.0.jar grep input output 'dfs[a-z.]+'
# Fetch the results out of HDFS
bin/hdfs dfs -get output output
cat output/*  # prints the matched lines
# Or inspect the results directly in HDFS
bin/hdfs dfs -cat output/*
# When finished, stop the namenode, secondarynamenode and friends
sbin/stop-dfs.sh
}}}
[Tips] You can also run the job on YARN
* edit etc/hadoop/mapred-site.xml (if the file does not exist, see the note after the block below)
{{{
<configuration>
    <property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
    </property>
</configuration>
}}}
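* Note: in the 2.4.0 tarball this file may exist only as a template, so copy it first if needed:
{{{
cp etc/hadoop/mapred-site.xml.template etc/hadoop/mapred-site.xml
}}}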
* edit etc/hadoop/yarn-site.xml
{{{
<configuration>
    <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
    </property>
</configuration>
}}}
* start the ResourceManager and NodeManagers: sbin/start-yarn.sh
* the ResourceManager web UI should be up at: http://your-server.host:8088/
* run a mapreduce job (same as above; see the sketch after this list)
* to shut down: sbin/stop-yarn.sh
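* The same grep example as above, now scheduled through YARN (assumes start-dfs.sh and start-yarn.sh are both running):
{{{
bin/hdfs dfs -put etc/hadoop input
bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.4.0.jar grep input output 'dfs[a-z.]+'
bin/hdfs dfs -cat output/*
}}}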
[Tips Again]
* when starting hdfs you may hit a bug: sed: -e expression #1, char 6: unknown option to `s and so on
* the fix is as follows:
{{{
# In etc/hadoop/hadoop-env.sh, add:
export HADOOP_COMMON_LIB_NATIVE_DIR=${HADOOP_PREFIX}/lib/native
# and change HADOOP_OPTS to:
export HADOOP_OPTS="-Djava.library.path=$HADOOP_PREFIX/lib"
}}}
[Tips X3]
* leaving safe mode
* bin/hdfs dfsadmin -safemode leave (a dfsadmin subcommand now, no longer under bin/hadoop)
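* To check the current state before forcing an exit:
{{{
bin/hdfs dfsadmin -safemode get    # reports whether safe mode is ON or OFF
bin/hdfs dfsadmin -safemode leave
}}}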
[3] Fully-Distributed Mode
* Prerequisites: download a stable Hadoop release (http://www.apache.org/dyn/closer.cgi/hadoop/common/)
# TODO