This environment is set up on Huawei Cloud; for background, see:
华为云——数字中国创新大赛·鲲鹏赛道·天府赛区暨四川鲲鹏应用开发者大赛
This document describes in detail the lab procedure for the BigData Pro solution in the Huawei Cloud Kunpeng ecosystem, covering cluster setup and verification. The big data component versions used in this guide are listed below:
Component | Version |
---|---|
Hadoop | 2.8.3 |
Spark | 2.3.0 |
Hive | 2.3.3 |
HBase | 2.1.0 |
Hostnames of each node:
Internal IP | Hostname |
---|---|
192.168.1.122 | node1 |
192.168.1.27 | node2 |
192.168.1.133 | node3 |
192.168.1.101 | node4 |
vi /etc/hosts
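A minimal sketch of the /etc/hosts entries, based on the node table above (append these lines on every node):
192.168.1.122 node1
192.168.1.27 node2
192.168.1.133 node3
192.168.1.101 node4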
systemctl stop firewalld
systemctl disable firewalld
ssh-keygen -t rsa
ssh-copy-id root@node1
ssh-copy-id root@node2
ssh-copy-id root@node3
ssh-copy-id root@node4
fdisk /dev/vdb
partprobe
mkfs -t ext4 /dev/vdb1
mount /dev/vdb1 /home
df -h
blkid
To make the mount persistent across reboots, append the following line to /etc/fstab, replacing the UUID with the value reported by blkid:
UUID=<UUID reported by blkid> /home ext4 defaults 1 1
Because the Kunpeng cluster is built on Huawei Cloud instances, the node clocks are already synchronized.
To check, use Xshell to send the date command to all nodes at the same time; the output should be identical on every node.
Run the following commands on every node:
mkdir -p /home/modules/data/buf/
mkdir -p /home/test_tools/
mkdir -p /home/nm/localdir
wget https://big-data-pro-test.obs.cn-east-3.myhuaweicloud.com/arm_bigdata_suite.tar.gz
wget http://mirrors.huaweicloud.com/centos-altarch/7.7.1908/isos/aarch64/CentOS-7-aarch64-Everything-1908.iso
The downloaded software packages are shown in the figure below.
tar zxvf OpenJDK8U-jdk_aarch64_linux_hotspot_8u191b12.tar.gz -C /usr/lib/jvm
export JAVA_HOME=/usr/lib/jvm/jdk8u191-b12
scp -r /usr/lib/jvm/jdk8u191-b12/ <hostname>:/usr/lib/jvm    # run once for each of the other nodes
source /etc/profile
java -version
mount -o loop /home/arm_bigdata_suite/CentOS-7-aarch64-Everything-1908.iso /media/
df -h
mv /etc/yum.repos.d/* /tmp/
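Create a repo file for the mounted ISO, for example /etc/yum.repos.d/local.repo (the file name is an assumption; any name ending in .repo works), with the following content: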
[local]
name=local
baseurl=file:///media
enabled=1
gpgcheck=0
yum clean all
yum makecache
yum list | grep libaio
yum list |grep mysql-connector-java
The following resources are required for this setup and should be prepared in advance:
Resource | Notes |
---|---|
Huawei Cloud region to use | Choose any region that provides Kunpeng ECS instances; the parallel file system and the Kunpeng ECS instances created later must be in the same region. |
Parallel file system | 1. How to create: log in to Huawei Cloud, switch to the region above, open Object Storage Service, click Parallel File System, then click Create Parallel File System. 2. After creation, record the file system name (also referred to below as the bucket name). To avoid interference, each big data cluster should use a different OBS bucket. |
Access keys (AK and SK) | Obtain them and keep for later use. |
OBS regional endpoint | The OBS endpoint (regional domain name) of the region where the bucket above resides. |
cp /home/arm_bigdata_suite/hadoop-2.8.3.tar.gz /home/modules/
cd /home/modules/
tar zxvf hadoop-2.8.3.tar.gz
cd /home/modules/hadoop-2.8.3/etc/hadoop
In hadoop-env.sh, replace export JAVA_HOME=${JAVA_HOME} with:
export JAVA_HOME=/usr/lib/jvm/jdk8u191-b12
In core-site.xml, the values of fs.obs.access.key, fs.obs.secret.key, and fs.obs.endpoint must be modified to match your own OBS bucket:
<configuration>
  <property>
    <name>fs.obs.readahead.inputstream.enabled</name>
    <value>true</value>
  </property>
  <property>
    <name>fs.obs.buffer.max.range</name>
    <value>6291456</value>
  </property>
  <property>
    <name>fs.obs.buffer.part.size</name>
    <value>2097152</value>
  </property>
  <property>
    <name>fs.obs.threads.read.core</name>
    <value>500</value>
  </property>
  <property>
    <name>fs.obs.threads.read.max</name>
    <value>1000</value>
  </property>
  <property>
    <name>fs.obs.write.buffer.size</name>
    <value>8192</value>
  </property>
  <property>
    <name>fs.obs.read.buffer.size</name>
    <value>8192</value>
  </property>
  <property>
    <name>fs.obs.connection.maximum</name>
    <value>1000</value>
  </property>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://node1:8020</value>
  </property>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/home/modules/hadoop-2.8.3/tmp</value>
  </property>
  <property>
    <name>fs.obs.access.key</name>
    <value>RSM2WMT03R38ZLY2TCOL</value>
  </property>
  <property>
    <name>fs.obs.secret.key</name>
    <value>KaP5ajgg8FGvRP6Sy2QX7UUO4sXFFisuPoCuseB8</value>
  </property>
  <property>
    <name>fs.obs.endpoint</name>
    <value>obs.cn-north-4.myhuaweicloud.com:5080</value>
  </property>
  <property>
    <name>fs.obs.buffer.dir</name>
    <value>/home/modules/data/buf</value>
  </property>
  <property>
    <name>fs.obs.impl</name>
    <value>org.apache.hadoop.fs.obs.OBSFileSystem</value>
  </property>
  <property>
    <name>fs.obs.connection.ssl.enabled</name>
    <value>false</value>
  </property>
  <property>
    <name>fs.obs.fast.upload</name>
    <value>true</value>
  </property>
  <property>
    <name>fs.obs.socket.send.buffer</name>
    <value>65536</value>
  </property>
  <property>
    <name>fs.obs.socket.recv.buffer</name>
    <value>65536</value>
  </property>
  <property>
    <name>fs.obs.max.total.tasks</name>
    <value>20</value>
  </property>
  <property>
    <name>fs.obs.threads.max</name>
    <value>20</value>
  </property>
</configuration>
Edit hdfs-site.xml and set the configuration as follows:
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>3</value>
  </property>
  <property>
    <name>dfs.namenode.secondary.http-address</name>
    <value>node1:50090</value>
  </property>
  <property>
    <name>dfs.namenode.secondary.https-address</name>
    <value>node1:50091</value>
  </property>
</configuration>
Edit mapred-site.xml and set the configuration as follows:
<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
  <property>
    <name>mapreduce.jobhistory.address</name>
    <value>node1:10020</value>
  </property>
  <property>
    <name>mapreduce.jobhistory.webapp.address</name>
    <value>node1:19888</value>
  </property>
  <property>
    <name>mapred.task.timeout</name>
    <value>1800000</value>
  </property>
</configuration>
Edit yarn-site.xml and set the configuration as follows:
<configuration>
  <property>
    <name>yarn.nodemanager.local-dirs</name>
    <value>/home/nm/localdir</value>
  </property>
  <property>
    <name>yarn.nodemanager.resource.memory-mb</name>
    <value>28672</value>
  </property>
  <property>
    <name>yarn.scheduler.minimum-allocation-mb</name>
    <value>3072</value>
  </property>
  <property>
    <name>yarn.scheduler.maximum-allocation-mb</name>
    <value>28672</value>
  </property>
  <property>
    <name>yarn.nodemanager.resource.cpu-vcores</name>
    <value>38</value>
  </property>
  <property>
    <name>yarn.scheduler.maximum-allocation-vcores</name>
    <value>38</value>
  </property>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
  <property>
    <name>yarn.resourcemanager.hostname</name>
    <value>node1</value>
  </property>
  <property>
    <name>yarn.log-aggregation-enable</name>
    <value>true</value>
  </property>
  <property>
    <name>yarn.log-aggregation.retain-seconds</name>
    <value>106800</value>
  </property>
  <property>
    <name>yarn.nodemanager.vmem-check-enabled</name>
    <value>false</value>
    <description>Whether virtual memory limits will be enforced for containers</description>
  </property>
  <property>
    <name>yarn.nodemanager.vmem-pmem-ratio</name>
    <value>4</value>
    <description>Ratio between virtual memory to physical memory when setting memory limits for containers</description>
  </property>
  <property>
    <name>yarn.resourcemanager.scheduler.class</name>
    <value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler</value>
  </property>
  <property>
    <name>yarn.log.server.url</name>
    <value>http://node1:19888/jobhistory/logs</value>
  </property>
</configuration>
cd /home/arm_bigdata_suite/
echo y | cp snappy-java-1.0.4.1.jar /home/modules/hadoop-2.8.3/share/hadoop/kms/tomcat/webapps/kms/WEB-INF/lib/
echo y | cp snappy-java-1.0.4.1.jar /home/modules/hadoop-2.8.3/share/hadoop/tools/lib/
echo y | cp snappy-java-1.0.4.1.jar /home/modules/hadoop-2.8.3/share/hadoop/mapreduce/lib/
echo y | cp snappy-java-1.0.4.1.jar /home/modules/hadoop-2.8.3/share/hadoop/common/lib/
echo y | cp snappy-java-1.0.4.1.jar /home/modules/hadoop-2.8.3/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/lib/
cp /home/arm_bigdata_suite/hadoop-huaweicloud-2.8.3.36.jar /home/modules/hadoop-2.8.3/share/hadoop/common/lib/
cp /home/arm_bigdata_suite/hadoop-huaweicloud-2.8.3.36.jar /home/modules/hadoop-2.8.3/share/hadoop/tools/lib
cp /home/arm_bigdata_suite/hadoop-huaweicloud-2.8.3.36.jar /home/modules/hadoop-2.8.3/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/lib/
cp /home/arm_bigdata_suite/hadoop-huaweicloud-2.8.3.36.jar /home/modules/hadoop-2.8.3/share/hadoop/hdfs/lib/
for i in {2..4};do scp -r /home/modules/hadoop-2.8.3 root@node${i}:/home/modules/;done
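The exports below are assumed to go into /etc/profile (the original does not name the file, but a source /etc/profile step follows):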
export HADOOP_HOME=/home/modules/hadoop-2.8.3
export PATH=$JAVA_HOME/bin:$PATH
export PATH=$HADOOP_HOME/bin:$HADOOP_HOME/sbin:$PATH
export HADOOP_CLASSPATH=/home/modules/hadoop-2.8.3/share/hadoop/tools/lib/*:$HADOOP_CLASSPATH
source /etc/profile
hdfs namenode -format
start-dfs.sh
start-yarn.sh
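As an optional sanity check (not part of the original steps), running jps on node1 after the daemons start should typically list NameNode, SecondaryNameNode, DataNode, ResourceManager, and NodeManager:
jps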
hadoop dfs -mkdir /test_folder
hadoop dfs -ls /
hdfs dfs -ls obs://<bucket_name>/
tar zxvf /home/arm_bigdata_suite/spark-2.3.0-bin-dev-with-sparkr.tgz -C /home/modules/
cd /home/modules/
mv spark-2.3.0-bin-dev spark-2.3.0
After this completes, the directory /home/modules/spark-2.3.0 should be present:
cp /home/arm_bigdata_suite/hadoop-huaweicloud-2.8.3.36.jar /home/modules/spark-2.3.0/jars/
cd /home/arm_bigdata_suite/
echo y | cp snappy-java-1.0.4.1.jar /home/modules/spark-2.3.0/jars/
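The three exports below are assumed to belong in Spark's spark-env.sh (e.g. /home/modules/spark-2.3.0/conf/spark-env.sh); the original does not name the file: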
export SCALA_HOME=/home/modules/spark-2.3.0/examples/src/main/scala
export SPARK_HOME=/home/modules/spark-2.3.0
export SPARK_DIST_CLASSPATH=$(/home/modules/hadoop-2.8.3/bin/hadoop classpath)
for i in {2..4};do scp -r /home/modules/spark-2.3.0 root@node${i}:/home/modules/;done
After the copy completes, the directory /home/modules/spark-2.3.0 should be present on node2 to node4.
export SPARK_HOME=/home/modules/spark-2.3.0
export PATH=${SPARK_HOME}/bin:${SPARK_HOME}/sbin:$PATH
source /etc/profile
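The following statements are assumed to be run inside the Spark SQL shell; launch it first with spark-sql (the launch command is not shown in the original):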
show databases;
create database testdb;
show databases;
use testdb;
create table testtable(value INT);
desc testtable;
insert into testtable values (1000);
select * from testtable;
exit;
groupadd -r mysql && useradd -r -g mysql -s /sbin/nologin -M mysql
yum install -y libaio*
tar xvf /home/arm_bigdata_suite/mysql-5.7.27-aarch64.tar.gz -C /usr/local/
mv /usr/local/mysql-5.7.27-aarch64 /usr/local/mysql
mkdir -p /usr/local/mysql/logs
chown -R mysql:mysql /usr/local/mysql
ln -sf /usr/local/mysql/my.cnf /etc/my.cnf
cp -rf /usr/local/mysql/extra/lib* /usr/lib64/
mv /usr/lib64/libstdc++.so.6 /usr/lib64/libstdc++.so.6.old
ln -s /usr/lib64/libstdc++.so.6.0.24 /usr/lib64/libstdc++.so.6
cp -rf /usr/local/mysql/support-files/mysql.server /etc/init.d/mysqld
chmod +x /etc/init.d/mysqld
systemctl enable mysqld
export MYSQL_HOME=/usr/local/mysql
export PATH=$PATH:$MYSQL_HOME/bin
source /etc/profile
mysqld --initialize-insecure --user=mysql --basedir=/usr/local/mysql --datadir=/usr/local/mysql/data
systemctl start mysqld
systemctl status mysqld
[root@node1 ~]# mysql_secure_installation
NOTE: RUNNING ALL PARTS OF THIS SCRIPT IS RECOMMENDED FOR ALL MySQL
SERVERS IN PRODUCTION USE! PLEASE READ EACH STEP CAREFULLY!
In order to log into MySQL to secure it, we'll need the current
password for the root user. If you've just installed MySQL, and
you haven't set the root password yet, the password will be blank,
so you should just press enter here.
Enter current password for root (enter for none): <- on a fresh install, just press Enter (no password yet)
OK, successfully used password, moving on…
Setting the root password ensures that nobody can log into the MySQL
root user without the proper authorisation.
Set root password? [Y/n] # set the root password: type y and press Enter
New password: # enter Hadoop@2020 and press Enter
Re-enter new password: # enter Hadoop@2020 again and press Enter
Password updated successfully!
Reloading privilege tables..
… Success!
By default, a MySQL installation has an anonymous user, allowing anyone
to log into MySQL without having to have a user account created for
them. This is intended only for testing, and to make the installation
go a bit smoother. You should remove them before moving into a
production environment.
Remove anonymous users? [Y/n] # remove anonymous users: type Y and press Enter
… Success!
Normally, root should only be allowed to connect from 'localhost'. This
ensures that someone cannot guess at the root password from the network.
Disallow root login remotely? [Y/n] # disallow remote root login: type n and press Enter
… Success!
By default, MySQL comes with a database named 'test' that anyone can
access. This is also intended only for testing, and should be removed
before moving into a production environment.
Remove test database and access to it? [Y/n] # remove the test database: just press Enter
- Dropping test database…
… Success!
- Removing privileges on test database…
… Success!
Reloading the privilege tables will ensure that all changes made so far
will take effect immediately.
Reload privilege tables now? [Y/n] # reload the privilege tables: just press Enter
… Success!
Cleaning up…
All done! If you've completed all of the above steps, your MySQL
installation should now be secure.
Thanks for using MySQL!
[root@node1~]#
systemctl restart mysqld
systemctl status mysqld
mysql -u root -p
mysql>use mysql;
mysql>update user set host = '%' where user = 'root';
mysql>select host, user from user;
mysql>flush privileges;
yum -y install mysql-connector-java
cp /home/arm_bigdata_suite/apache-hive-2.3.3-bin.tar.gz /home/modules/
cd /home/modules/
tar zxvf apache-hive-2.3.3-bin.tar.gz
mv apache-hive-2.3.3-bin hive-2.3.3
cp /usr/share/java/mysql-connector-java.jar /home/modules/hive-2.3.3/lib/
Create hive-site.xml in the Hive conf directory, paste the following content into the file, then save and exit.
[Note] The value of javax.jdo.option.ConnectionPassword in this file must match the MySQL root password set earlier.
<configuration>
  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:mysql://node1:3306/hive_metadata?createDatabaseIfNotExist=true</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>com.mysql.jdbc.Driver</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionUserName</name>
    <value>root</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionPassword</name>
    <value>Hadoop@2020</value>
  </property>
  <property>
    <name>hive.strict.checks.cartesian.product</name>
    <value>false</value>
  </property>
</configuration>
In hive-env.sh (assumed path: /home/modules/hive-2.3.3/conf/hive-env.sh; the original does not name the file), set:
HADOOP_HOME=/home/modules/hadoop-2.8.3
cd /home/modules/hive-2.3.3/bin
./schematool -initSchema -dbType mysql
for i in {2..4};do scp -r /home/modules/hive-2.3.3 root@node${i}:/home/modules/;done
After the copy completes, the directory /home/modules/hive-2.3.3 should be present on node2 to node4.
export HIVE_HOME=/home/modules/hive-2.3.3
export PATH=${HIVE_HOME}/bin:$PATH
source /etc/profile
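The following statements are assumed to be run inside the Hive CLI; launch it first with hive (the launch command is not shown in the original):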
show databases;
create database testdb;
show databases;
use testdb;
create table testtable(value INT);
desc testtable;
insert into testtable values (1000);
select * from testtable;
exit;
After entering the CLI, run the following statements to access Hive data stored on OBS (replace bucket_name with the OBS bucket actually in use beforehand) and verify the basic storage-compute separation capability:
use testdb;
create table testtable_obs(a int, b string) row format delimited fields terminated by ","
stored as textfile location "obs://bucket_name/testtable_obs";
insert into testtable_obs values (1,'test');
select * from testtable_obs;
exit;
tar zxvf /home/arm_bigdata_suite/hbase-2.1.0-bin.tar.gz -C /home/modules/
After extraction, the directory /home/modules/hbase-2.1.0 should be present.
In hbase-env.sh (presumably /home/modules/hbase-2.1.0/conf/hbase-env.sh), set the JDK path:
export JAVA_HOME=/usr/lib/jvm/jdk8u191-b12
vi /home/modules/hbase-2.1.0/conf/hbase-site.xml and replace the content between <configuration> and </configuration> with the following.
[Note] In this configuration file, bucket_name in the hbase.rootdir parameter must be replaced with the bucket actually in use.
<configuration>
  <property>
    <name>hbase.regionserver.handler.count</name>
    <value>1000</value>
  </property>
  <property>
    <name>hbase.client.write.buffer</name>
    <value>5242880</value>
  </property>
  <property>
    <name>hbase.rootdir</name>
    <value>obs://bucket_name/hbasetest</value>
  </property>
  <property>
    <name>zookeeper.session.timeout</name>
    <value>120000</value>
  </property>
  <property>
    <name>hbase.zookeeper.property.tickTime</name>
    <value>6000</value>
  </property>
  <property>
    <name>hbase.zookeeper.property.dataDir</name>
    <value>/home/modules/hbase-2.1.0/data/zookeeper</value>
  </property>
  <property>
    <name>hbase.cluster.distributed</name>
    <value>true</value>
  </property>
  <property>
    <name>hbase.zookeeper.quorum</name>
    <value>node1,node2,node3</value>
  </property>
  <property>
    <name>hbase.tmp.dir</name>
    <value>/home/modules/hbase-2.1.0/tmp</value>
  </property>
  <property>
    <name>hbase.wal.provider</name>
    <value>org.apache.hadoop.hbase.wal.FSHLogProvider</value>
  </property>
  <property>
    <name>hbase.wal.dir</name>
    <value>hdfs://node1:8020/hbase</value>
  </property>
  <property>
    <name>hbase.client.write.buffer</name>
    <value>2097152</value>
  </property>
  <property>
    <name>hbase.regionserver.handler.count</name>
    <value>200</value>
  </property>
  <property>
    <name>hbase.hstore.compaction.min</name>
    <value>6</value>
  </property>
  <property>
    <name>hbase.hregion.memstore.block.multiplier</name>
    <value>16</value>
  </property>
  <property>
    <name>hfile.block.cache.size</name>
    <value>0.2</value>
  </property>
  <property>
    <name>hbase.master.maxclockskew</name>
    <value>150000</value>
  </property>
</configuration>
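The host names below are assumed to go into the regionservers file (/home/modules/hbase-2.1.0/conf/regionservers), one per line: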
node1
node2
node3
node4
cp /home/modules/hadoop-2.8.3/etc/hadoop/core-site.xml /home/modules/hbase-2.1.0/conf/
echo y | cp /home/modules/hadoop-2.8.3/share/hadoop/common/hadoop-common-2.8.3.jar /home/modules/hbase-2.1.0/lib/
md5sum /home/modules/hbase-2.1.0/lib/hadoop-common-2.8.3.jar
md5sum /home/modules/hadoop-2.8.3/share/hadoop/common/hadoop-common-2.8.3.jar
cp /home/arm_bigdata_suite/hadoop-huaweicloud-2.8.3.36.jar /home/modules/hbase-2.1.0/lib/
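The GC options below are assumed to be appended to hbase-env.sh (/home/modules/hbase-2.1.0/conf/hbase-env.sh); the original does not name the file: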
export SERVER_GC_OPTS="-Xms20480M -Xmx20480M -XX:NewSize=2048M -XX:MaxNewSize=2048M -XX:MetaspaceSize=512M -XX:MaxMetaspaceSize=512M -XX:MaxDirectMemorySize=2048M -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -XX:+CMSParallelRemarkEnabled -XX:CMSInitiatingOccupancyFraction=65 -XX:+PrintGCDetails -Dsun.rmi.dgc.client.gcInterval=0x7FFFFFFFFFFFFFE -Dsun.rmi.dgc.server.gcInterval=0x7FFFFFFFFFFFFFE -XX:-OmitStackTraceInFastThrow -XX:+PrintGCTimeStamps -XX:+PrintGCDateStamps -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=10 -XX:GCLogFileSize=1M"
for i in {2..4};do scp -r /home/modules/hbase-2.1.0 root@node${i}:/home/modules/;done
After the copy completes, the directory /home/modules/hbase-2.1.0 should be present on node2 to node4.
export HBASE_HOME=/home/modules/hbase-2.1.0
export PATH=${HBASE_HOME}/bin:$PATH
source /etc/profile
start-hbase.sh
After startup, check that the HBase Master web UI can be opened: http://<node1 elastic IP>:16010/master-status
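The commands below are assumed to be run inside the HBase shell; launch it first with hbase shell (the launch command is not shown in the original):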
create 't', 'f'
put 't', 'rold', 'f', 'v'
scan 't'
describe 't'
disable 't'
drop 't'
exit
tar zxvf /home/arm_bigdata_suite/HiBench.tar.gz -C /home/test_tools/
cd /home/test_tools/HiBench/bin/workloads/micro/wordcount/prepare
sh prepare.sh
After the data has been prepared, run the following command to see where the data is stored (replace bucket_name with the real bucket name):
hdfs dfs -ls obs://bucket_name/test1/HiBench/Wordcount/Input
cd /home/test_tools/HiBench/bin/workloads/micro/wordcount/hadoop
sh run.sh
cd /home/test_tools/HiBench/bin/workloads/micro/wordcount/spark
sh run.sh
hdfs dfs -ls obs://bucket_name/test1/HiBench/Wordcount/Output
cd /home/test_tools/HiBench/bin/workloads/micro/terasort/prepare
sh prepare.sh
cd /home/test_tools/HiBench/bin/workloads/micro/terasort/hadoop
sh run.sh
cd /home/test_tools/HiBench/bin/workloads/micro/terasort/spark
sh run.sh
hdfs dfs -ls obs://bucket_name/test1/HiBench/Terasort/Output
Tool preparation
tar zxvf /home/arm_bigdata_suite/ycsb-0.12.0.tar.gz -C /home/test_tools/
chmod 755 -R /home/test_tools/ycsb-0.12.0/
Data preparation phase
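The pre-split table below is created in the HBase shell; launch it first with hbase shell (the launch command is not shown in the original):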
create 'BTable','family',{SPLITS => (1..20).map {|i| "user#{1000+i*(9999-1000)/20}"}}
exit
cd /home/test_tools/ycsb-0.12.0/
nohup bin/ycsb load hbase10 -P workloads/workloada -cp /home/modules/hbase-2.1.0//conf:/home/modules/hbase-2.1.0/lib/*:hbase10-binding/lib/hbase10-binding-0.12.0.jar -p table=BTable -p columnfamily=family -p recordcount=10000 -threads 100 -s 2> workload-loadrun.txt -s 1> workload-loadresult.txt &
In this test, workload models a, b, c, and d are run in turn:
YCSB workload | Write | Read |
---|---|---|
Model a | 10% | 90% |
Model b | 50% | 50% |
Model c | 90% | 10% |
Model d | 100% | 0% |
cd /home/test_tools/ycsb-0.12.0/
nohup bin/ycsb run hbase10 -P workloads/workloada -cp /home/modules/hbase-2.1.0//conf:/home/modules/hbase-2.1.0/lib/*:hbase10-binding/lib/hbase10-binding-0.12.0.jar -p table=BTable -p columnfamily=family -p operationcount=10000 -threads 100 -s 2> workloada-run.txt -s 1> workloada-result.txt &
nohup bin/ycsb run hbase10 -P workloads/workloadb -cp /home/modules/hbase-2.1.0//conf:/home/modules/hbase-2.1.0/lib/*:hbase10-binding/lib/hbase10-binding-0.12.0.jar -p table=BTable -p columnfamily=family -p operationcount=10000 -threads 100 -s 2> workloadb-run.txt -s 1> workloadb-result.txt &
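A sketch of the analogous run commands for models c and d, assuming the standard YCSB workload files workloadc and workloadd are used (only a and b are shown in the original):
nohup bin/ycsb run hbase10 -P workloads/workloadc -cp /home/modules/hbase-2.1.0//conf:/home/modules/hbase-2.1.0/lib/*:hbase10-binding/lib/hbase10-binding-0.12.0.jar -p table=BTable -p columnfamily=family -p operationcount=10000 -threads 100 -s 2> workloadc-run.txt -s 1> workloadc-result.txt &
nohup bin/ycsb run hbase10 -P workloads/workloadd -cp /home/modules/hbase-2.1.0//conf:/home/modules/hbase-2.1.0/lib/*:hbase10-binding/lib/hbase10-binding-0.12.0.jar -p table=BTable -p columnfamily=family -p operationcount=10000 -threads 100 -s 2> workloadd-run.txt -s 1> workloadd-result.txt &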
On node1, running the following command (the argument is the bucket name) performs a full re-initialization of the big data cluster. This means HDFS in this cluster will be re-initialized, obs://${bucket_name}/hbasetest001/* will be emptied, the HBase metadata will be deleted, and the cluster will be restarted, so use this script with caution.
[root@node1 ~]# sh /home/arm_bigdata_suite/complete_clean_restart.sh
bucket_name is empty. Please check it! Usage:
sh /home/arm_bigdata_suite/complete_clean_restart.sh [bucket_name]
[root@node1 ~]#
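An example invocation following the usage message above (replace <bucket_name> with the real bucket name):
[root@node1 ~]# sh /home/arm_bigdata_suite/complete_clean_restart.sh <bucket_name>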
After the script finishes, check that the following page opens normally:
http://<node1 elastic IP>:16010/master-status