This post records installing the Hadoop, YARN, and HBase components of Cloudera CDH with yum (rpm packages). Note that before formatting the namenode you must manually create several directories and set their permissions.
1. Set up hosts
Temporarily set the hostname, using node1 as an example:
sudo hostname node1
Make sure /etc/hosts contains the IP and FQDN of each node. If you are using DNS, keeping this information in /etc/hosts is not strictly necessary, but it is best practice.
Make sure /etc/sysconfig/network contains HOSTNAME=node1.
Check the network: run the command below to verify that the hostname is configured and resolves to the correct IP.
host -v -t A `hostname`
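For reference, a minimal /etc/hosts layout might look like the following (the IP addresses and the example.com domain are placeholders for your own values):
192.168.1.101   node1.example.com   node1
192.168.1.102   node2.example.com   node2
192.168.1.103   node3.example.com   node3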
When specifying nodes in the Hadoop configuration files core-site.xml, mapred-site.xml, and yarn-site.xml, use hostnames rather than IP addresses.
2. Disable SELinux
setenforce 0
vim /etc/sysconfig/selinux # set SELINUX=disabled
3. Flush iptables
iptables -F
4. Check that the /tmp directory on every node has permissions 1777; if it does not, fix it.
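A quick way to check and, if necessary, correct this:
stat -c '%a %n' /tmp # should print 1777 /tmp
chmod 1777 /tmp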
5. Set up clock synchronization
Install ntp on all nodes:
yum install ntp
Enable it at boot:
chkconfig ntpd on
Start ntpd on all nodes:
/etc/init.d/ntpd start
To make the clients use a local NTP server, edit /etc/ntp.conf and add the following line:
server $LOCAL_SERVER_IP OR HOSTNAME
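You can then confirm that each node is syncing against the expected server, for example with:
ntpq -p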
Check the JDK version:
java -version
If the version is lower than 1.6 update 31, uninstall it:
rpm -qa | grep java
yum remove {java-1.*}
Verify that the default JDK has been removed:
which java
Install the JDK, either with yum or by downloading and installing jdk-6u31-linux-x64.bin manually (download link: here).
yum install jdk -y
Create symlinks:
ln -s XXXXX/jdk1.6.0_31 /usr/java/default
ln -s /usr/java/default/bin/java /usr/bin/java
Set the environment variables:
echo "export JAVA_HOME=/usr/java/latest" >>/root/.bashrc
echo "export PATH=\$JAVA_HOME/bin:\$PATH" >> /root/.bashrc
source /root/.bashrc
Verify the version:
java -version
You should see output like the following:
java version "1.6.0_31"
Java(TM) SE Runtime Environment (build 1.6.0_31-b04)
Java HotSpot(TM) 64-Bit Server VM (build 20.6-b01, mixed mode)
Check whether JAVA_HOME is set in the sudo environment:
sudo env | grep JAVA_HOME
If JAVA_HOME does not appear in the output, edit the /etc/sudoers file:
vi /etc/sudoers
and add:
Defaults env_keep+=JAVA_HOME
Download the repository tarball from here, extract it, and set up a local or FTP yum repository; see Creating a Local Yum Repository.
On the master node, check the available packages and install the master daemons:
yum list hadoop
yum install hadoop-hdfs-namenode
yum install hadoop-hdfs-secondarynamenode
yum install hadoop-yarn-resourcemanager
yum install hadoop-mapreduce-historyserver
On the worker (datanode) nodes:
yum list hadoop
yum install hadoop-hdfs-datanode
yum install hadoop-yarn-nodemanager
yum install hadoop-mapreduce
yum install zookeeper-server
yum install hadoop-httpfs
yum install hadoop-debuginfo
sudo cp -r /etc/hadoop/conf.dist /etc/hadoop/conf.my_cluster
sudo alternatives --verbose --install /etc/hadoop/conf hadoop-conf /etc/hadoop/conf.my_cluster 50
sudo alternatives --set hadoop-conf /etc/hadoop/conf.my_cluster
Hadoop reads its configuration from the /etc/hadoop/conf path by default. After the commands above, /etc/hadoop/conf is a symlink (managed by alternatives) pointing to the /etc/hadoop/conf.my_cluster directory.
Go into /etc/hadoop/conf and edit the configuration files.
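To double-check which configuration directory is currently active, you can inspect the alternatives entry:
alternatives --display hadoop-conf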
Edit core-site.xml:
<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://node1</value>
</property>
<property>
<name>fs.trash.interval</name>
<value>10080</value>
</property>
<property>
<name>fs.trash.checkpoint.interval</name>
<value>10080</value>
</property>
<property>
<name>io.bytes.per.checksum</name>
<value>4096</value>
</property>
</configuration>
Edit hdfs-site.xml:
<configuration>
<property>
<name>dfs.replication</name>
<value>3</value>
</property>
<property>
<name>hadoop.tmp.dir</name>
<value>/opt/data/hadoop</value>
</property>
<property>
<name>dfs.block.size</name>
<value>268435456</value>
</property>
<property>
<name>dfs.permissions.superusergroup</name>
<value>hadoop</value>
</property>
<property>
<name>dfs.namenode.handler.count</name>
<value>100</value>
</property>
<property>
<name>dfs.datanode.handler.count</name>
<value>100</value>
</property>
<property>
<name>dfs.datanode.balance.bandwidthPerSec</name>
<value>1048576</value>
</property>
<property>
<name>dfs.namenode.http-address</name>
<value>node1:50070</value>
</property>
<property>
<name>dfs.namenode.secondary.http-address</name>
<value>node1:50090</value>
</property>
<property>
<name>dfs.webhdfs.enabled</name>
<value>true</value>
</property>
</configuration>
Edit the masters and slaves files.
Note:
The NameNode new-generation size should be 1/8 of the maximum heap size (-Xmx). Please check this, as the default setting may not be appropriate. To change it, edit /etc/hadoop/conf/hadoop-env.sh and set the -XX:MaxNewSize parameter to 1/8 of the value of the maximum heap size (-Xmx) parameter.
For NameNode high availability, see Introduction to HDFS High Availability.
The following checkpoint-related parameters can be configured in hdfs-site.xml:
dfs.namenode.checkpoint.check.period
dfs.namenode.checkpoint.txns
dfs.namenode.checkpoint.dir
dfs.namenode.checkpoint.edits.dir
dfs.namenode.num.checkpoints.retained
To set up multiple secondary namenodes, see multi-host-secondarynamenode-configuration.
The default paths and permission requirements in Hadoop are as follows:
Property Owner Permissions Default path
hadoop.tmp.dir hdfs:hdfs drwx------ /var/hadoop
dfs.namenode.name.dir hdfs:hdfs drwx------ file://${hadoop.tmp.dir}/dfs/name
dfs.datanode.data.dir hdfs:hdfs drwx------ file://${hadoop.tmp.dir}/dfs/data
dfs.namenode.checkpoint.dir hdfs:hdfs drwx------ file://${hadoop.tmp.dir}/dfs/namesecondary
yarn.nodemanager.local-dirs yarn:yarn drwxr-xr-x ${hadoop.tmp.dir}/nm-local-dir
yarn.nodemanager.log-dirs yarn:yarn drwxr-xr-x ${yarn.log.dir}/userlogs
yarn.nodemanager.remote-app-log-dir /tmp/logs
My configuration is as follows:
hadoop.tmp.dir /opt/data/hadoop
dfs.namenode.name.dir ${hadoop.tmp.dir}/dfs/name
dfs.datanode.data.dir ${hadoop.tmp.dir}/dfs/data
dfs.namenode.checkpoint.dir ${hadoop.tmp.dir}/dfs/namesecondary
yarn.nodemanager.local-dirs /opt/data/yarn/local
yarn.nodemanager.log-dirs /var/log/hadoop-yarn/logs
yarn.nodemanager.remote-app-log-dir /var/log/hadoop-yarn/app
In Hadoop, dfs.permissions.superusergroup defaults to hdfs; my hdfs-site.xml changes it to hadoop.
For the ports used by each service, see Configuring Ports for CDH4.
Create the name directory on the namenode:
mkdir -p /opt/data/hadoop/dfs/name
chown -R hdfs:hadoop /opt/data/hadoop/dfs/name
chmod 700 /opt/data/hadoop/dfs/name
Create the data directory on all datanodes:
mkdir -p /opt/data/hadoop/dfs/data
chown -R hdfs:hadoop /opt/data/hadoop/dfs/data
chmod 700 /opt/data/hadoop/dfs/data
Create the namesecondary directory on the secondarynamenode node:
mkdir -p /opt/data/hadoop/dfs/namesecondary
chown -R hdfs:hadoop /opt/data/hadoop/dfs/namesecondary
chmod 700 /opt/data/hadoop/dfs/namesecondary
Create the YARN local directory on all datanodes:
mkdir -p /opt/data/hadoop/yarn/local
chown -R yarn:yarn /opt/data/hadoop/yarn/local
chmod 700 /opt/data/hadoop/yarn/local
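A quick check that the ownership and modes came out as expected (run on each node; only the directories relevant to that node will exist):
ls -ld /opt/data/hadoop/dfs/name /opt/data/hadoop/dfs/data /opt/data/hadoop/dfs/namesecondary /opt/data/hadoop/yarn/local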
Sync the configuration directory to the other nodes:
sudo scp -r /etc/hadoop/conf root@nodeX:/etc/hadoop/conf
Format the namenode as the hdfs user:
sudo -u hdfs hdfs namenode -format
A small script to check whether the DataNode process is running:
#!/bin/bash
if ! jps | grep -q DataNode ; then
    echo "ERROR: datanode not up"
fi
Start (or restart) all HDFS services on each node:
for x in `cd /etc/init.d ; ls hadoop-hdfs-*` ; do sudo service $x restart ; done
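Once the HDFS daemons are running, a simple sanity check is to ask for a cluster report:
sudo -u hdfs hdfs dfsadmin -report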
Configure everything on one machine first, then sync it to the others.
Edit mapred-site.xml:
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
<property>
<name>mapreduce.jobtracker.staging.root.dir</name>
<value>/user</value>
</property>
<property>
<name>mapreduce.jobhistory.address</name>
<value>node1:10020</value>
</property>
<property>
<name>mapreduce.jobhistory.webapp.address</name>
<value>node1:19888</value>
</property>
<property>
<name>mapred.child.java.opts</name>
<value>-Xmx512m -XX:+UseConcMarkSweepGC -XX:ParallelCMSThreads=1 -XX:ParallelGCThreads=1</value>
</property>
<property>
<name>mapreduce.task.io.sort.factor</name>
<value>100</value>
</property>
<property>
<name>mapreduce.task.io.sort.mb</name>
<value>200</value>
</property>
<property>
<name>mapreduce.reduce.shuffle.parallelcopies</name>
<value>16</value>
<!-- Usually between the square root of the number of nodes and half the number of nodes; for clusters with fewer than 20 nodes, use the number of nodes -->
</property>
<property>
<name>mapreduce.task.timeout</name>
<value>1800000</value>
</property>
<property>
<name>mapreduce.tasktracker.map.tasks.maximum</name>
<value>4</value>
</property>
<property>
<name>mapreduce.tasktracker.reduce.tasks.maximum</name>
<value>2</value>
</property>
</configuration>
Edit yarn-site.xml:
<configuration>
<property>
<name>yarn.resourcemanager.resource-tracker.address</name>
<value>node1:8031</value>
</property>
<property>
<name>yarn.resourcemanager.address</name>
<value>node1:8032</value>
</property>
<property>
<name>yarn.resourcemanager.scheduler.address</name>
<value>node1:8030</value>
</property>
<property>
<name>yarn.resourcemanager.admin.address</name>
<value>node1:8033</value>
</property>
<property>
<name>yarn.resourcemanager.webapp.address</name>
<value>node1:8088</value>
</property>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce.shuffle</value>
</property>
<property>
<name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
<value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
<property>
<name>yarn.log-aggregation-enable</name>
<value>true</value>
</property>
<property>
<name>yarn.application.classpath</name>
<value>
$HADOOP_CONF_DIR,
$HADOOP_COMMON_HOME/*,$HADOOP_COMMON_HOME/lib/*,
$HADOOP_HDFS_HOME/*,$HADOOP_HDFS_HOME/lib/*,
$HADOOP_MAPRED_HOME/*,$HADOOP_MAPRED_HOME/lib/*,
$YARN_HOME/*,$YARN_HOME/lib/*
</value>
</property>
<property>
<name>yarn.nodemanager.local-dirs</name>
<value>/opt/hadoop/yarn/local</value>
</property>
<property>
<name>yarn.nodemanager.log-dirs</name>
<value>/var/log/hadoop-yarn/logs</value>
</property>
<property>
<name>yarn.nodemanager.remote-app-log-dir</name>
<value>/var/log/hadoop-yarn/apps</value>
</property>
</configuration>
Create the required directories in HDFS and set their ownership and permissions:
sudo -u hdfs hadoop fs -mkdir /tmp
sudo -u hdfs hadoop fs -chmod -R 1777 /tmp
sudo -u hdfs hadoop fs -mkdir /user/history
sudo -u hdfs hadoop fs -chmod 1777 /user/history
sudo -u hdfs hadoop fs -chown yarn /user/history
sudo -u hdfs hadoop fs -mkdir /user/history/done
sudo -u hdfs hadoop fs -chmod 777 /user/history/done
sudo -u hdfs hadoop fs -chown yarn /user/history/done
sudo -u hdfs hadoop fs -mkdir /var/log/hadoop-yarn
sudo -u hdfs hadoop fs -chown yarn:mapred /var/log/hadoop-yarn
[root@node1 data]# sudo -u hdfs hadoop fs -ls -R /
drwxrwxrwt - hdfs hadoop 0 2012-04-19 14:31 /tmp
drwxr-xr-x - hdfs hadoop 0 2012-05-31 10:26 /user
drwxrwxrwt - yarn hadoop 0 2012-04-19 14:31 /user/history
drwxrwxrwx - yarn hadoop 0 2012-04-19 14:31 /user/history/done
drwxr-xr-x - hdfs hadoop 0 2012-05-31 15:31 /var
drwxr-xr-x - hdfs hadoop 0 2012-05-31 15:31 /var/log
drwxr-xr-x - yarn mapred 0 2012-05-31 15:31 /var/log/hadoop-yarn
Start the MapReduce history server:
/etc/init.d/hadoop-mapreduce-historyserver start
Start the YARN services:
for x in `cd /etc/init.d ; ls hadoop-yarn-*` ; do sudo service $x start ; done
Create a home directory in HDFS for your user:
sudo -u hdfs hadoop fs -mkdir /user/$USER
sudo -u hdfs hadoop fs -chown $USER /user/$USER
Set HADOOP_MAPRED_HOME in the environment:
export HADOOP_MAPRED_HOME=/usr/lib/hadoop-mapreduce
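At this point you can submit a small test job to verify that MapReduce runs on YARN; this assumes the examples jar sits at the usual CDH path /usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar:
hadoop jar /usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar pi 2 100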
Configure the services to start at boot:
sudo chkconfig hadoop-hdfs-namenode on
sudo chkconfig hadoop-hdfs-datanode on
sudo chkconfig hadoop-hdfs-secondarynamenode on
sudo chkconfig hadoop-yarn-resourcemanager on
sudo chkconfig hadoop-yarn-nodemanager on
sudo chkconfig hadoop-mapreduce-historyserver on
sudo chkconfig hbase-master on
sudo chkconfig hbase-regionserver on
sudo chkconfig hive-metastore on
sudo chkconfig hive-server2 on
sudo chkconfig zookeeper-server on
sudo chkconfig hadoop-httpfs on
Install ZooKeeper:
yum install zookeeper*
Set up a crontab entry to purge old ZooKeeper snapshots and transaction logs:
crontab -e
15 * * * * java -cp $classpath:/usr/lib/zookeeper/lib/log4j-1.2.15.jar:\
/usr/lib/zookeeper/lib/jline-0.9.94.jar:\
/usr/lib/zookeeper/zookeeper.jar:/usr/lib/zookeeper/conf \
org.apache.zookeeper.server.PurgeTxnLog /var/zookeeper/ -n 5
Create the ZooKeeper data directory on every node that will run ZooKeeper:
mkdir -p /opt/data/zookeeper
chown -R zookeeper:zookeeper /opt/data/zookeeper
Set the ZooKeeper configuration in /etc/zookeeper/conf/zoo.cfg and sync it to the other machines:
tickTime=2000
initLimit=10
syncLimit=5
dataDir=/opt/data/zookeeper
clientPort=2181
server.1=node1:2888:3888
server.2=node2:2888:3888
server.3=node3:2888:3888
Initialize and start ZooKeeper on every node; remember to change the value of n (the myid) for each node:
service zookeeper-server init --myid=n
service zookeeper-server restart
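For example, on node1 (server.1 in zoo.cfg above) this would be:
service zookeeper-server init --myid=1
service zookeeper-server restart
echo stat | nc node1 2181 # quick health check; requires nc to be installed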
Install HBase and create its root directory in HDFS:
yum install hbase*
sudo -u hdfs hadoop fs -mkdir /hbase
sudo -u hdfs hadoop fs -chown hbase:hbase /hbase
Set up a crontab entry to clean up rotated HBase logs:
crontab -e
* 10 * * * cd /var/log/hbase/; rm -rf `ls /var/log/hbase/ | grep -P 'hbase-hbase-.+\.log\.[0-9]'` >> /dev/null &
Edit hbase-site.xml:
<configuration>
<property>
<name>hbase.cluster.distributed</name>
<value>true</value>
</property>
<property>
<name>hbase.rootdir</name>
<value>hdfs://node1:8020/hbase</value>
</property>
<property>
<name>hbase.tmp.dir</name>
<value>/opt/data/hbase</value>
</property>
<property>
<name>hbase.zookeeper.quorum</name>
<value>node1,node2,node3</value>
</property>
<property>
<name>hbase.hregion.max.filesize</name>
<value>536870912</value>
</property>
<property>
<name>hbase.hregion.memstore.flush.size</name>
<value>67108864</value>
</property>
<property>
<name>hbase.regionserver.lease.period</name>
<value>600000</value>
</property>
<property>
<name>hbase.client.retries.number</name>
<value>3</value>
</property>
<property>
<name>hbase.regionserver.handler.count</name>
<value>100</value>
</property>
<property>
<name>hbase.zookeeper.property.maxClientCnxns</name>
<value>2000</value>
</property>
<property>
<name>hfile.block.cache.size</name>
<value>0.1</value>
</property>
<property>
<name>hbase.regions.slop</name>
<value>0</value>
</property>
<property>
<name>hbase.hstore.compactionThreshold</name>
<value>10</value>
</property>
<property>
<name>hbase.hstore.blockingStoreFiles</name>
<value>30</value>
</property>
</configuration>
Start the HBase services:
service hbase-master start
service hbase-regionserver start
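To confirm the cluster came up, you can ask for the cluster status from the HBase shell, for example:
echo "status" | hbase shell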
Install Hive:
sudo yum install hive*
To install and configure the PostgreSQL database manually, see 手动安装Cloudera Hive CDH4.2 (manually installing Cloudera Hive CDH4.2).
To install it with yum instead:
sudo yum install postgresql-server
Initialize the database:
sudo service postgresql initdb
Edit postgresql.conf; after the changes the relevant settings should look like this:
sudo cat /var/lib/pgsql/data/postgresql.conf | grep -e listen -e standard_conforming_strings
listen_addresses = '*'
standard_conforming_strings = off
Edit pg_hba.conf and add the following line:
host all all 0.0.0.0 0.0.0.0 md5
Start the database:
sudo service postgresql start
Enable it at boot:
chkconfig postgresql on
Install the JDBC driver and link it into Hive's lib directory:
sudo yum install postgresql-jdbc
ln -s /usr/share/java/postgresql-jdbc.jar /usr/lib/hive/lib/postgresql-jdbc.jar
Create the database and user:
bash# sudo -u postgres psql
bash$ psql
postgres=# CREATE USER hiveuser WITH PASSWORD 'redhat';
postgres=# CREATE DATABASE metastore owner=hiveuser;
postgres=# GRANT ALL privileges ON DATABASE metastore TO hiveuser;
postgres=# \q;
bash$ psql -U hiveuser -d metastore
postgres=# \i /usr/lib/hive/scripts/metastore/upgrade/postgres/hive-schema-0.10.0.postgres.sql
SET
SET
..
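You can verify that the metastore schema was created by listing its tables, for example:
psql -U hiveuser -d metastore -c '\dt'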
Edit hive-site.xml:
<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://node1:8020</value>
</property>
<property>
<name>javax.jdo.option.ConnectionURL</name>
<value>jdbc:postgresql://node1/metastore</value>
</property>
<property>
<name>javax.jdo.option.ConnectionDriverName</name>
<value>org.postgresql.Driver</value>
</property>
<property>
<name>javax.jdo.option.ConnectionUserName</name>
<value>hiveuser</value>
</property>
<property>
<name>javax.jdo.option.ConnectionPassword</name>
<value>redhat</value>
</property>
<property>
<name>mapred.job.tracker</name>
<value>node1:8031</value>
</property>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
<property>
<name>datanucleus.autoCreateSchema</name>
<value>false</value>
</property>
<property>
<name>datanucleus.fixedDatastore</name>
<value>true</value>
</property>
<property>
<name>hive.metastore.warehouse.dir</name>
<value>/user/hive/warehouse</value>
</property>
<property>
<name>hive.metastore.uris</name>
<value>thrift://node1:9083</value>
</property>
<property>
<name>hive.metastore.local</name>
<value>false</value>
</property>
<property>
<name>hive.support.concurrency</name>
<value>true</value>
</property>
<property>
<name>hive.zookeeper.quorum</name>
<value>node2,node3,node1</value>
</property>
<property>
<name>hive.hwi.listen.host</name>
<value>node1</value>
</property>
<property>
<name>hive.hwi.listen.port</name>
<value>9999</value>
</property>
<property>
<name>hive.hwi.war.file</name>
<value>lib/hive-hwi-0.10.0-cdh4.2.0.war</value>
</property>
<property>
<name>hive.merge.mapredfiles</name>
<value>true</value>
</property>
</configuration>
Add the HADOOP_MAPRED_HOME environment variable to /etc/hadoop/conf/hadoop-env.sh; without it, running MapReduce on YARN fails with an UNKNOWN RPC TYPE exception:
export HADOOP_MAPRED_HOME=/usr/lib/hadoop-mapreduce
It is recommended to set the permissions of /user/hive/warehouse to 1777 so that all users can create and access tables but cannot delete tables that do not belong to them (each user's home directory lives under /user, e.g. /user/root for the root user). /tmp in HDFS must be world-writable. Create the warehouse directory and set its permissions:
sudo -u hdfs hadoop fs -mkdir /user/hive/warehouse
sudo -u hdfs hadoop fs -chmod 1777 /user/hive/warehouse
sudo -u hdfs hadoop fs -chown hive /user/hive/warehouse
Start the Hive services:
service hive-metastore start
service hive-server start
service hive-server2 start
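A quick way to confirm that the metastore and HiveServer2 are listening on their ports (9083 as configured above, and HiveServer2's default 10000):
netstat -tlnp | grep -E '9083|10000'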
$ /usr/lib/hive/bin/beeline
beeline> !connect jdbc:hive2://localhost:10000 username password org.apache.hive.jdbc.HiveDriver
0: jdbc:hive2://localhost:10000> SHOW TABLES;
show tables;
+-----------+
| tab_name |
+-----------+
+-----------+
No rows selected (0.238 seconds)
0: jdbc:hive2://localhost:10000>
Its SQL syntax follows the SQLLine CLI (see here); you cannot use HiveServer's SQL statements in it.
For HBase integration, the following jars need to be added in Hive:
ADD JAR /usr/lib/hive/lib/zookeeper.jar;
ADD JAR /usr/lib/hive/lib/hbase.jar;
ADD JAR /usr/lib/hive/lib/hive-hbase-handler-0.10.0-cdh4.2.0.jar;
ADD JAR /usr/lib/hive/lib/guava-11.0.2.jar;
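With those jars loaded, an HBase-backed table can be created from Hive roughly like this (the table and column names are just an illustration):
CREATE TABLE hbase_test(key int, value string)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,cf1:val")
TBLPROPERTIES ("hbase.table.name" = "hbase_test");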
The CDH 4.3 rpms already include Snappy by default, so a separate installation may not be needed.
Install Snappy on every node:
yum install snappy snappy-devel
Make Snappy available to Hadoop:
ln -sf /usr/lib64/libsnappy.so /usr/lib/hadoop/lib/native/
The CDH 4.3 rpms do not include LZO by default, so it needs to be installed separately.
Install it on every node:
yum install lzo lzo-devel hadoop-lzo hadoop-lzo-native