Environment Variable Configuration
Create a soft directory under the root directory (/soft).
The installation packages below should all be placed in that directory, extracted, and given symbolic links:
0 jdk-8u191-linux-x64.tar
1 hadoop-2.7.2.tar
2 spark-2.3.1-bin-hadoop2.7
3 scala-2.11.12
4 zookeeper-3.4.10.tar
5 kafka_2.11-1.1.1 (version updated)
6 redis-3.2.12.tar
7 hbase-1.2.9-bin.tar
8 apache-tomcat-7.0.91.tar
Then on master, open the /etc/environment file with vim, copy all of its contents into /etc/environment on every slave, and run source /etc/environment to apply the changes. The contents are:
JAVA_HOME="/soft/jdk/"
HADOOP_HOME="/soft/hadoop/"
HIVE_HOME="/soft/hive"
HBASE_HOME="/soft/hbase"
ZK_HOME="/soft/zookeeper"
KAFKA_HOME="/soft/kafka"
SCALA_HOME="/soft/scala"
SPARK_HOME="/soft/spark"
BIGDL_HOME="/soft/bigdl"
SQOOP_HOME="/soft/sqoop"
KE_HOME="/soft/kafka-eagle/kafka-eagle-web-1.2.4"
PATH="/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/soft/jdk/bin:/soft/hadoop/bin:/soft/hadoop/sbin:/soft/hive/bin:/soft/zookeeper/bin:/soft/hbase/bin:/soft/kafka/bin:/soft/scala/bin:/soft/spark/bin:/soft/spark/sbin:/soft/bigdl/bin:/soft/sqoop/bin:/soft/kafka-eagle/kafka-eagle-web-1.2.4/bin"
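The paths above assume version-independent symlinks under /soft. A minimal sketch for creating them, assuming the archives listed earlier were extracted under /soft and unpack to their usual directory names (adjust the names to whatever you actually see after extraction):
cd /soft
ln -s jdk1.8.0_191 jdk
ln -s hadoop-2.7.2 hadoop
ln -s scala-2.11.12 scala
ln -s spark-2.3.1-bin-hadoop2.7 spark
ln -s zookeeper-3.4.10 zookeeper
ln -s kafka_2.11-1.1.1 kafka
ln -s hbase-1.2.9 hbase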
Hadoop Cluster Setup
1. Copy the Hadoop archive to the master host and extract it (I extracted it to /home/soft/ here), then configure the environment variables the same way as for the JDK.
2. Inside the hadoop-2.7.2 directory, create 4 directories first (hdfs goes under hadoop-2.7.2, and the other three go under hdfs; a mkdir -p one-liner is sketched after these commands):
sudo mkdir hdfs
cd hdfs
sudo mkdir data
sudo mkdir tmp
sudo mkdir name
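Equivalently, assuming the same /home/soft/hadoop-2.7.2 install path, a single command creates all four directories:
sudo mkdir -p /home/soft/hadoop-2.7.2/hdfs/{name,data,tmp}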
3. Edit the Hadoop configuration files
First go to the configuration directory: cd /home/soft/hadoop-2.7.2/etc/hadoop (again, use your own path), then run
ls
to list the files in that directory.
For cluster/distributed mode, five configuration files under /home/soft/hadoop-2.7.2/etc/hadoop need to be modified: slaves, core-site.xml, hdfs-site.xml, mapred-site.xml, and yarn-site.xml.
slaves
Every host listed in the slaves file is used as a DataNode, and every slave runs a DataNode, so write the slave hostnames into this file as follows.
Run the command:
sudo gedit slaves
Delete localhost and write slave1. If you do not see localhost, check whether you opened the wrong directory; the file is the slaves under .../etc/hadoop. If there are several slaves, add their hostnames here as well, e.g. slave1, slave2, and so on.
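For example, with two slaves the slaves file would contain just the hostnames, one per line:
slave1
slave2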
core-site.xml
sudo gedit core-site.xml
Add the following inside the <configuration> element:
<property>
  <name>fs.defaultFS</name>
  <value>hdfs://master:9000</value>
</property>
<property>
  <name>hadoop.tmp.dir</name>
  <value>/home/soft/hadoop-2.7.2/hdfs/tmp</value>
</property>
<property>
  <name>fs.trash.interval</name>
  <value>10080</value>
</property>
hdfs-site.xml
sudo gedit hdfs-site.xml
Add the following inside the <configuration> element:
<property>
  <name>dfs.namenode.secondary.http-address</name>
  <value>master:9001</value>
</property>
<property>
  <name>dfs.replication</name>
  <value>2</value>
</property>
<property>
  <name>dfs.namenode.name.dir</name>
  <value>/home/soft/hadoop-2.7.2/hdfs/name</value>
</property>
<property>
  <name>dfs.datanode.data.dir</name>
  <value>/home/soft/hadoop-2.7.2/hdfs/data</value>
</property>
mapred-site.xml
This file does not exist by default, so first run
cp mapred-site.xml.template mapred-site.xml
to create a copy from the template, then run
sudo gedit mapred-site.xml
Add the following inside the <configuration> element:
<property>
  <name>mapreduce.framework.name</name>
  <value>yarn</value>
</property>
yarn-site.xml
sudo gedit yarn-site.xml
Add the following inside the <configuration> element:
<property>
  <name>yarn.nodemanager.aux-services</name>
  <value>mapreduce_shuffle</value>
</property>
<property>
  <name>yarn.resourcemanager.hostname</name>
  <value>master</value>
</property>
Perform the steps above on the master host, then format the NameNode. Format it only once, with the following command:
hadoop namenode -format
Note: as long as "successfully formatted" appears in the output, the format succeeded.
Next, copy Hadoop to slave1, slave2, ... and the other slaves:
scp -r hadoop-2.7.2 zhjc@slave1:/home/soft/
Note: zhjc is the slave's username, set when slave1 was created.
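With several slaves, a small loop saves some typing; this sketch assumes the same zhjc user and /home/soft/ target path as above (extend the host list to your cluster):
for h in slave1 slave2; do
  scp -r hadoop-2.7.2 zhjc@$h:/home/soft/
done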
After copying, configure the Hadoop environment variables on slave1 in the same way, then test with the method above. Test: if hadoop version prints a result, it worked. The other slaves should be set up the same way as slave1.
6. Starting Hadoop
There are two ways:
start-all.sh
or
start-dfs.sh
start-yarn.sh
If running jps on master shows the NameNode, SecondaryNameNode, and ResourceManager processes,
and running jps on slave1 shows the DataNode and NodeManager processes,
the cluster was set up successfully. To stop the cluster:
stop-all.sh
7. Finally, use the bundled example to check that the Hadoop cluster can run jobs
Use the command:
hadoop jar /home/soft/hadoop-2.7.2/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.2.jar pi 2 10
This estimates Pi: pi is the example name, the first 2 is the number of map tasks, and the 10 is the number of random samples per map (related to the Monte Carlo method used).
At the end the job prints an estimated value of Pi.
The Hadoop cluster setup is now complete.
Spark Cluster (Fully Distributed)
This step follows the Hadoop cluster setup; it is assumed that the environment variables on the master and slaves are already configured.
The four run modes of a Spark cluster
1. Local
Runs on a single machine; generally used for development and testing.
2. Yarn
The Spark client connects directly to YARN; no separate Spark cluster needs to be built.
3. Standalone
Builds a Spark cluster consisting of a Master plus Workers; Spark runs on that cluster.
4. Mesos
The Spark client connects directly to Mesos; no separate Spark cluster needs to be built.
All other configuration matches master; modify the slaves file under conf/ and then send it to all slaves (see the scp sketch after the list below).
vim slaves
# cluster workers
slave1
slave2
slave3
slave4
slave5
slave6
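The scp sketch referenced above, assuming the zhjc user and the /soft/spark symlink from the environment setup (adjust the host list to your cluster):
for h in slave1 slave2 slave3 slave4 slave5 slave6; do
  scp /soft/spark/conf/slaves /soft/spark/conf/spark-env.sh zhjc@$h:/soft/spark/conf/
done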
Standalone startup:
[zhjc@master spark]# ./sbin/start-all.sh
When submitting tasks, standalone mode is not used; jobs are handed directly to YARN instead.
The configuration is as follows (open the conf/spark-env.sh file under the Spark directory on master to view it):
export JAVA_HOME=/soft/jdk                        # Java home
export SCALA_HOME=/soft/scala                     # Scala home
export SPARK_WORKER_MEMORY=8g                     # maximum memory available on each worker node
export SPARK_MASTER_IP=master                     # master node hostname/IP
export HADOOP_HOME=/soft/hadoop                   # Hadoop path
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop    # Hadoop configuration directory
export SPARK_DIST_CLASSPATH=$(hadoop classpath):$(hbase classpath)
export SPARK_YARN_USER_ENV=/soft/hadoop/etc/hadoop/
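A quick smoke test once HDFS, YARN, and Spark are all up; this sketch assumes the spark-2.3.1-bin-hadoop2.7 package listed earlier, whose bundled examples jar is spark-examples_2.11-2.3.1.jar (adjust the path if your version differs):
spark-submit --master yarn --deploy-mode cluster \
  --class org.apache.spark.examples.SparkPi \
  /soft/spark/examples/jars/spark-examples_2.11-2.3.1.jar 100
The application should end in state FINISHED in the YARN ResourceManager UI, with the Pi estimate in the driver container's stdout log.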
ZooKeeper Cluster Setup
It is assumed that the environment variables on the master and slaves are already configured.
Go into the zookeeper directory and first create a data directory (used for storing data). Then go into the conf directory, rename zoo_sample.cfg to zoo.cfg with mv, and edit it with vi or vim.
Create a myid file in the data directory to specify the node id:
[zhjc@master data]# vim myid    // master's myid is 1, slave1's myid is 2, and so on
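Equivalently, the id can be written without an editor; run the matching line on each host (paths assume the /soft/zookeeper symlink and the data directory created above):
echo 1 > /soft/zookeeper/data/myid    # on master
echo 2 > /soft/zookeeper/data/myid    # on slave1; use 3 on slave2, and so on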
Open the zoo.cfg file under the conf directory:
[zhjc@master conf]# vim zoo.cfg
Modify dataDir and the server.1, server.2, server.3 ... entries; each number must match the number in that machine's myid file, since this id identifies the machine and is used for leader election at startup.
The configuration is as follows (open the zoo.cfg file on master to view it):
# The number of milliseconds of each tick
tickTime=2000
# The number of ticks that the initial
# synchronization phase can take
initLimit=10
# The number of ticks that can pass between
# sending a request and getting an acknowledgement
syncLimit=5
# the directory where the snapshot is stored.
# do not use /tmp for storage, /tmp here is just
# example sakes.
dataDir=/soft/zookeeper/data
# the port at which the clients will connect
clientPort=2181
# the maximum number of client connections.
# increase this if you need to handle more clients
#maxClientCnxns=60
#
# Be sure to read the maintenance section of the
# administrator guide before turning on autopurge.
#
# http://zookeeper.apache.org/doc/current/zookeeperAdmin.html#sc_maintenance
#
# The number of snapshots to retain in dataDir
#autopurge.snapRetainCount=3
# Purge task interval in hours
# Set to "0" to disable auto purge feature
#autopurge.purgeInterval=1
server.1=master:2888:3888
server.2=slave1:2888:3888
server.3=slave2:2888:3888
server.4=slave3:2888:3888
server.5=slave4:2888:3888
server.6=slave5:2888:3888
server.7=slave6:2888:3888
Starting the ZooKeeper cluster (every master and slave must be started; run the following command on all of them):
[zhjc@master bin]# ./zkServer.sh start
On each node, go into ZooKeeper's bin directory and start ZooKeeper with zkServer.sh start, then check its state with zkServer.sh status.
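As a convenience, both steps can be driven from master over SSH; this sketch assumes passwordless SSH and the /soft/zookeeper symlink:
for h in master slave1 slave2 slave3 slave4 slave5 slave6; do
  ssh $h "/soft/zookeeper/bin/zkServer.sh start"
done
for h in master slave1 slave2 slave3 slave4 slave5 slave6; do
  ssh $h "/soft/zookeeper/bin/zkServer.sh status"    # one node should report Mode: leader, the rest Mode: follower
done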
Kafka Cluster Setup
It is assumed that the environment variables on the master and slaves are already configured.
Modify the config/server.properties file under the Kafka directory:
[zhjc@master conf]# vim server.properties
Set broker.id=1 (the default is 0); master is 1, and the other nodes continue in sequence.
The configuration is as follows (open the server.properties file on master to view it):
############################# Server Basics #############################
# The id of the broker. This must be set to a unique integer for each broker.
broker.id=1
############################# Socket Server Settings #############################
# The address the socket server listens on. It will get the value returned from
# java.net.InetAddress.getCanonicalHostName() if not configured.
# FORMAT:
# listeners = listener_name://host_name:port
# EXAMPLE:
# listeners = PLAINTEXT://your.host.name:9092
#listeners=PLAINTEXT://:9092
# Hostname and port the broker will advertise to producers and consumers. If not set,
# it uses the value for "listeners" if configured. Otherwise, it will use the value
# returned from java.net.InetAddress.getCanonicalHostName().
#advertised.listeners=PLAINTEXT://your.host.name:9092
# Maps listener names to security protocols, the default is for them to be the same. See the config documentation for more details
#listener.security.protocol.map=PLAINTEXT:PLAINTEXT,SSL:SSL,SASL_PLAINTEXT:SASL_PLAINTEXT,SASL_SSL:SASL_SSL
# The number of threads that the server uses for receiving requests from the network and sending responses to the network
num.network.threads=3
# The number of threads that the server uses for processing requests, which may include disk I/O
num.io.threads=8
# The send buffer (SO_SNDBUF) used by the socket server
socket.send.buffer.bytes=102400
# The receive buffer (SO_RCVBUF) used by the socket server
socket.receive.buffer.bytes=102400
# The maximum size of a request that the socket server will accept (protection against OOM)
socket.request.max.bytes=104857600
############################# Log Basics #############################
# A comma separated list of directories under which to store log files
log.dirs=/tmp/kafka-logs
# The default number of log partitions per topic. More partitions allow greater
# parallelism for consumption, but this will also result in more files across
# the brokers.
num.partitions=1
# The number of threads per data directory to be used for log recovery at startup and flushing at shutdown.
# This value is recommended to be increased for installations with data dirs located in RAID array.
num.recovery.threads.per.data.dir=1
############################# Internal Topic Settings #############################
# The replication factor for the group metadata internal topics "__consumer_offsets" and "__transaction_state"
# For anything other than development testing, a value greater than 1 is recommended for to ensure availability such as 3.
offsets.topic.replication.factor=1
transaction.state.log.replication.factor=1
transaction.state.log.min.isr=1
############################# Log Flush Policy #############################
# Messages are immediately written to the filesystem but by default we only fsync() to sync
# the OS cache lazily. The following configurations control the flush of data to disk.
# There are a few important trade-offs here:
# 1. Durability: Unflushed data may be lost if you are not using replication.
# 2. Latency: Very large flush intervals may lead to latency spikes when the flush does occur as there will be a lot of data to flush.
# 3. Throughput: The flush is generally the most expensive operation, and a small flush interval may lead to excessive seeks.
# The settings below allow one to configure the flush policy to flush data after a period of time or
# every N messages (or both). This can be done globally and overridden on a per-topic basis.
# The number of messages to accept before forcing a flush of data to disk
#log.flush.interval.messages=10000
# The maximum amount of time a message can sit in a log before we force a flush
#log.flush.interval.ms=1000
############################# Log Retention Policy #############################
# The following configurations control the disposal of log segments. The policy can
# be set to delete segments after a period of time, or after a given size has accumulated.
# A segment will be deleted whenever *either* of these criteria are met. Deletion always happens
# from the end of the log.
# The minimum age of a log file to be eligible for deletion due to age
log.retention.hours=72
# A size-based retention policy for logs. Segments are pruned from the log unless the remaining
# segments drop below log.retention.bytes. Functions independently of log.retention.hours.
#log.retention.bytes=1073741824
# The maximum size of a log segment file. When this size is reached a new log segment will be created.
log.segment.bytes=1073741824
# The interval at which log segments are checked to see if they can be deleted according
# to the retention policies
log.retention.check.interval.ms=300000
############################# Zookeeper #############################
# Zookeeper connection string (see zookeeper docs for details).
# This is a comma separated host:port pairs, each corresponding to a zk
# server. e.g. "127.0.0.1:3000,127.0.0.1:3001,127.0.0.1:3002".
# You can also append an optional chroot string to the urls to specify the
# root directory for all kafka znodes.
zookeeper.connect=master:2181,slave1:2181,slave2:2181,slave3:2181,slave4:2181,slave5:2181,slave6:2181
# Timeout in ms for connecting to zookeeper
zookeeper.connection.timeout.ms=6000
############################# Group Coordinator Settings #############################
# The following configuration specifies the time, in milliseconds, that the GroupCoordinator will delay the initial consumer rebalance.
# The rebalance will be further delayed by the value of group.initial.rebalance.delay.ms as new members join the group, up to a maximum of max.poll.interval.ms.
# The default value for this is 3 seconds.
# We override this to 0 here as it makes for a better out-of-the-box experience for development and testing.
# However, in production environments the default value of 3 seconds is more suitable as this will help to avoid unnecessary, and potentially expensive, rebalances during application startup.
group.initial.rebalance.delay.ms=0
Finally, create the log directory on every node: mkdir /hdfs/kafka, and create symbolic links as needed. Once that is done, the Kafka cluster installation is complete.
Starting Kafka (every master and slave must be started; run the following command on each of them):
[zhjc@master bin]# nohup kafka-server-start.sh /soft/kafka/config/server.properties > /dev/null 2>&1 &
or:
[zhjc@master bin]# kafka-server-start.sh -daemon /soft/kafka/config/server.properties
Newer Kafka versions do not need nohup; the -daemon flag alone runs the broker in the background. After starting, check with jps: if a Kafka process is present, startup succeeded. Creating topics, producing, and consuming work basically the same as before; to stop, run bin/kafka-server-stop.sh.
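A simple end-to-end check after all brokers are up (the topic name test is arbitrary; the ZooKeeper connect string matches the one configured above):
kafka-topics.sh --create --zookeeper master:2181 --replication-factor 3 --partitions 3 --topic test
kafka-topics.sh --list --zookeeper master:2181
If test shows up in the list on every node, the brokers have joined the same cluster.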
Standalone Redis Installation on CentOS
Starting Redis: go to the /soft/redis directory and run:
redis-server redis.conf
Stopping Redis:
redis-cli shutdown
Deleting keys with a fixed prefix:
redis-cli -p 6379 --scan --pattern "C0*" | xargs -L 5000 redis-cli -n 0 -p 6379 DEL
Redis installation:
Find the Redis package, extract it into the appropriate directory, then compile and install it.
Building Redis requires a C compiler. If gcc is not installed, install it online with: yum install gcc-c++
cd redis-4.0.9                       # enter the Redis source directory
make                                 # compile
make install PREFIX=/soft/redis      # install
PREFIX specifies the installation directory. After installation there is a bin directory under /soft/redis; go into it and run ll to inspect the binaries.
At this point Redis is installed and can be started directly with ./redis-server. This runs it in the foreground; press Ctrl+C to stop it.
Redis can also be started through an init script. The compiled source tree contains utils/redis_init_script. First copy this script into /etc/init.d and name it redis_<port> (here it was renamed to redis_6379), where <port> is the port Redis should listen on and that clients connect through. Then change the REDISPORT variable in the script to the same port.
Next, create a directory for the Redis configuration files and one for the Redis persistence files:
/etc/redis            holds the Redis configuration files
/var/redis/<port>     holds the Redis persistence files (here /var/redis/6379)
Edit the configuration file
Copy the configuration template redis-4.0.9/redis.conf into /etc/redis and name it after the port (e.g. 6379.conf), then edit some of its parameters:
daemonize yes                        run Redis as a daemon
pidfile /var/run/redis_<port>.pid    location of the Redis PID file
port <port>                          port Redis listens on
dir /var/redis/<port>                where the persistence files are stored
#requirepass foobared                uncomment and change it if you want to set a password
bind 127.0.0.1                       change the default 127.0.0.1 to 0.0.0.0 (no restriction) so external hosts can connect
Redis can now also be started and stopped with the following commands:
/etc/init.d/redis_6379 start
/etc/init.d/redis_6379 stop
Starting Redis automatically with the system:
chkconfig redis_6379 on
After the step above, Redis can also be started and stopped directly with the following commands:
service redis_6379 start
service redis_6379 stop
With this in place, Redis starts automatically whenever the system reboots.
The stop method above does stop Redis, but Redis may be in the middle of syncing in-memory data to disk, and forcibly terminating the process could lose data. The correct way to stop Redis is to send it the SHUTDOWN command:
redis-cli SHUTDOWN
When Redis receives SHUTDOWN, it first disconnects all clients, then persists data according to its configuration, and finally exits.
Redis also handles the SIGTERM signal gracefully, so killing the Redis process by its PID shuts it down cleanly as well, with the same effect as sending SHUTDOWN.
If external access is needed, first check whether the firewall is blocking the port,
then change the bind setting in the configuration file from the default 127.0.0.1 to 0.0.0.0.
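A quick connectivity check from another machine; 192.168.139.10 here is a placeholder for your Redis host's address (add -a <password> if requirepass was set):
redis-cli -h 192.168.139.10 -p 6379 ping
A healthy instance replies with PONG.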
Changing Hadoop's Default Log Level
Modify the log4j.properties configuration:
# Define some default values that can be overridden by system properties
hadoop.root.logger=WARN,console
However, this setting is overridden by system properties!
The following 2 files also need to be changed before the default log level actually changes (I only changed the HDFS side here; do the same for YARN as needed):
The first is ${HADOOP_HOME}/etc/hadoop/hadoop-env.sh; change INFO to WARN:
# Command specific options appended to HADOOP_OPTS when specified
export HADOOP_NAMENODE_OPTS="-Xmx30720m -Dhadoop.security.logger=${HADOOP_SECURITY_LOGGER:-WARN,RFAS} -Dhdfs.audit.logger=${HDFS_AUDIT_LOGGER:-WARN,NullAppender} $HADOOP_NAMENODE_OPTS"
The YARN equivalent, ${HADOOP_HOME}/etc/hadoop/yarn-env.sh, needs the same change.
The startup script ${HADOOP_HOME}/sbin/hadoop-daemon.sh also needs the same change:
export HADOOP_ROOT_LOGGER=${HADOOP_ROOT_LOGGER:-"WARN,RFA"}
The YARN startup script ${HADOOP_HOME}/sbin/yarn-daemon.sh needs the same change as well.
Finally, restart the NameNode and the new log level takes effect.
Transferring MySQL Data to HBase
The following packages are needed:
1 sqoop-1.4.6.bin__hadoop-2.0.4-alpha.tar
2 mysql-connector-java-5.1.38
Common commands:
// sqoop transfer
sqoop import --connect jdbc:mysql://192.168.139.1:3306/transportation --username wcc --password 123456 --table 't_videodata_raw' --where "CreateTime>='2019-03-27 00:00:00' AND CreateTime < '2019-03-28 00:00:00'" --hbase-table 'transportation:t_videodata_raw' --hbase-row-key 'Rowkey' --column-family 'info' --split-by 'Rowkey'
// build the rowkey by concatenation
UPDATE t_link_set as t SET Rowkey =CONCAT(LinkID,'_',create_time)
// submit to yarn
spark-submit --master yarn --deploy-mode cluster --driver-memory 1G --executor-memory 1500m --executor-cores 2 --class Forecastion.knnTest SparkTrain-1.0-SNAPSHOT.jar
// count the rows in an hbase table
hbase org.apache.hadoop.hbase.mapreduce.RowCounter "transportation:t_videodata_raw"
// transfer + scheduling
***************************Scheduling***************************************
sudo yum install crontabs
sudo systemctl enable crond (enable crond at boot)
sudo systemctl start crond (start the crond service)
sudo systemctl status crond (check its status)
sudo nano /etc/crontab
1 0 * * * root /usr/local/mycommand.sh (this runs the script once a day, one minute past midnight)
sudo crontab /etc/crontab
sudo crontab -l
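To schedule the transfer below, a crontab entry along these lines can be added; /soft/sqoop/transfer.sh is a hypothetical name for a script containing the sqoop import that follows, and the log path is likewise just an example:
1 0 * * * root /soft/sqoop/transfer.sh >> /var/log/sqoop-transfer.log 2>&1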
***************************Transfer***************************************
yesday=$(date -d last-day +%Y-%m-%d)
export SQOOP_HOME=/soft/sqoop
sqoop import \
  --connect jdbc:mysql://192.168.139.1:3306/whtmb \
  --username xxxx \
  --password xxxx1234 \
  --table 't_videodata_raw' \
  --check-column CreateTime \
  --incremental lastmodified \
  --last-value ${yesday} \
  --hbase-table 'transportation:t_videodata_raw' \
  --merge-key 'RowKey' \
  --hbase-row-key 'RowKey' \
  --column-family 'info' \
  --split-by 'RowKey'
#sqoop import --connect jdbc:mysql://192.168.139.1:3306/whtmb --username wcc --password 123456 --table 't_earthmagnetic_raw' --hbase-table 'transportation:t_earthmagnetic_raw' --hbase-row-key 'RowKey' --colu$
#sqoop import --connect jdbc:mysql://192.168.139.1:3306/whtmb --username ${rdbms_username} --password ${rdbms_pwd} --table ${rdbms_table1} --hbase-table ${hbase_table1} --hbase-row-key ${row_key} --column-fam$
#sqoop import --connect ${rdbms_url} --username ${rdbms_username} --password ${rdbms_pwd} --table ${rdbms_table2} --hbase-table ${hbase_table2} --hbase-row-key ${row_key} --column-family ${column-family} --sp$
#sqoop import --connect ${rdbms_url} --username ${rdbms_username} --password ${rdbms_pwd} --table ${rdbms_table3} --hbase-table ${hbase_table3} --hbase-row-key ${row_key} --column-family ${column-family} --sp$
echo "Waiting for the batch tasks to finish"
wait
echo "Starting the next batch of imports"
Commands to run:
sqoop list-databases --connect jdbc:mysql://192.168.139.1:3306/whtmb --username xxxx --password xxxx1234
sqoop import --connect jdbc:mysql://192.168.139.1:3306/whtmb --username xxxx --password xxxx1234 --table 't_videodata_raw' --hbase-table 'transportation:t_videodata_raw' --hbase-row-key 'Rowkey' --column-family 'info' --split-by 'Rowkey'
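To confirm that the import landed, a quick check from the HBase shell using the table name above:
hbase shell
scan 'transportation:t_videodata_raw', {LIMIT => 1}
For a full row count, the RowCounter MapReduce job shown earlier is the faster option on large tables.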