Complete cluster setup: Hadoop, Spark, ZooKeeper, Kafka, Redis, etc., plus changing Hadoop's default log level.

Environment variable configuration

Create a soft directory under the root directory (/soft).

The packages below all go into that directory; unpack each one and create a symlink to it (e.g. /soft/hadoop -> /soft/hadoop-2.7.2):

0 jdk-8u191-linux-x64.tar
1 hadoop-2.7.2.tar
2 spark-2.3.1-bin-hadoop2.7
3 scala-2.11.12
4 zookeeper-3.4.10.tar
5 kafka_2.11-1.1.1 (version updated)
6 redis-3.2.12.tar
7 hbase-1.2.9-bin.tar
8 apache-tomcat-7.0.91.tar

Then on master open /etc/environment with vim, copy its full contents into /etc/environment on every slave, and run source /etc/environment to apply it.

JAVA_HOME="/soft/jdk/"
HADOOP_HOME="/soft/hadoop/"
HIVE_HOME="/soft/hive"
HBASE_HOME="/soft/hbase"
ZK_HOME="/soft/zookeeper"
KAFKA_HOME="/soft/kafka"
SCALA_HOME="/soft/scala"
SPARK_HOME="/soft/spark"
BIGDL_HOME="/soft/bigdl"
SQOOP_HOME="/soft/sqoop"
KE_HOME="/soft/kafka-eagle/kafka-eagle-web-1.2.4"
PATH="/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/soft/jdk/bin:/soft/hadoop/bin:/soft/hadoop/sbin:/soft/hive/bin:/soft/zookeeper/bin:/soft/hbase/bin:/soft/kafka/bin:/soft/scala/bin:/soft/spark/bin:/soft/spark/sbin:/soft/bigdl/bin:/soft/sqoop/bin:/soft/kafka-eagle/kafka-eagle-web-1.2.4/bin"

Hadoop cluster setup

1. Copy the Hadoop tarball onto the master host and unpack it (here it is unpacked under /home/soft/), then configure its environment variables the same way as for the JDK.

2. Inside the hadoop-2.7.2 directory, create four directories (hdfs goes under hadoop-2.7.2; the other three go under hdfs):

sudo mkdir hdfs
cd hdfs
sudo mkdir data
sudo mkdir tmp
sudo mkdir name
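Equivalently, the same layout can be created in one command from inside the hadoop-2.7.2 directory:

# creates hdfs/ plus its data, tmp and name subdirectories
sudo mkdir -p hdfs/{data,tmp,name}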

3. Edit Hadoop's configuration files.
First go to the configuration directory: cd /home/soft/hadoop-2.7.2/etc/hadoop (again, use your own path), then run

ls

to list the files in that directory.

For cluster/distributed mode, five files under /home/soft/hadoop-2.7.2/etc/hadoop need to be modified: slaves, core-site.xml, hdfs-site.xml, mapred-site.xml, yarn-site.xml.

slaves

Every host listed in the slaves file is used as a DataNode; each slave runs a DataNode, so the slave host names go into this file.
Run:

sudo gedit slaves

Delete localhost and write slave1. If you do not see localhost, check that you opened the right file; it is the slaves file under etc/hadoop. With more than one slave, add the other slave host names here as well, one per line: slave1, slave2, ... (see the example below).
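For reference, with the six slaves used later in this guide the slaves file is simply one host name per line:

slave1
slave2
slave3
slave4
slave5
slave6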

core-site.xml
sudo gedit core-site.xml

Add:



<configuration>
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://master:9000</value>
    </property>
    <property>
        <name>hadoop.tmp.dir</name>
        <value>/home/soft/hadoop-2.7.2/hdfs/tmp</value>
    </property>
    <property>
        <name>fs.trash.interval</name>
        <value>10080</value>
    </property>
</configuration>


hdfs-site.xml
sudo gedit hdfs-site.xml

Add:



<configuration>
    <property>
        <name>dfs.namenode.secondary.http-address</name>
        <value>master:9001</value>
    </property>
    <property>
        <name>dfs.replication</name>
        <value>2</value>
    </property>
    <property>
        <name>dfs.namenode.name.dir</name>
        <value>/home/soft/hadoop-2.7.2/hdfs/name</value>
    </property>
    <property>
        <name>dfs.datanode.data.dir</name>
        <value>/home/soft/hadoop-2.7.2/hdfs/data</value>
    </property>
</configuration>


mapred-site.xml

If this file does not exist, first run

cp mapred-site.xml.template mapred-site.xml

to create a copy, and then

sudo gedit mapred-site.xml

Add:



<configuration>
    <property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
    </property>
</configuration>

yarn-site.xml
sudo gedit yarn-site.xml

Add:



<configuration>
    <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
    </property>
    <property>
        <name>yarn.resourcemanager.hostname</name>
        <value>master</value>
    </property>
</configuration>


All of the steps above are done on the master host. Then format the NameNode (do this only once) with the following command:

hadoop namenode -format

Note: as long as the output contains "successfully formatted", the format succeeded.

Next, copy Hadoop to slave1, slave2, and the other slaves:

scp -r hadoop-2.7.2 zhjc@slave1:/home/soft/

Note: zhjc is the user name on the slave, set when slave1 was created.

After copying, configure the Hadoop environment variables on slave1 the same way, then test as above: if hadoop version prints output, it works. Keep the other slaves identical to slave1.
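With several slaves, the copy and the check can be looped from master; a rough sketch assuming passwordless ssh as zhjc and that /etc/environment has already been distributed:

# copy hadoop to every slave and print the version it reports there
for h in slave1 slave2 slave3 slave4 slave5 slave6; do
  scp -r /home/soft/hadoop-2.7.2 zhjc@"$h":/home/soft/
  ssh zhjc@"$h" '/home/soft/hadoop-2.7.2/bin/hadoop version | head -n 1'
done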

6. Starting Hadoop

There are two ways:

start-all.sh

or

start-dfs.sh
start-yarn.sh

If jps on master now shows the master-side daemons (typically NameNode, SecondaryNameNode, and ResourceManager), and jps on slave1 shows DataNode and NodeManager, the cluster has been set up successfully. To stop the cluster:

stop-all.sh

7. Finally, run the bundled example to check that the Hadoop cluster can actually run a job

Run:

hadoop jar /home/soft/hadoop-2.7.2/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.2.jar  pi 2 10

This estimates pi: pi is the example name, the first number (2) is the number of map tasks, and the second (10) is the number of random samples generated per map (this follows from the Monte Carlo method it uses).

When the estimated value of pi is printed at the end, the Hadoop cluster setup is complete.

Spark cluster (fully distributed)

This step assumes the Hadoop cluster is already set up and that the environment variables on master and all slaves are configured.

The four run modes of a Spark cluster

1. Local

Runs on a single machine; generally used for development and testing.

2. YARN

The Spark client connects directly to YARN; no separate Spark cluster needs to be built.

3. Standalone

Builds a Spark cluster of a Master plus Workers; Spark runs inside that cluster.

4. Mesos

The Spark client connects directly to Mesos; no separate Spark cluster needs to be built.

Everything else is configured the same as on master. Edit the slaves file under conf/, then send it to every slave:

vim slaves
# cluster worker list
slave1
slave2
slave3
slave4
slave5
slave6

Standalone startup:

[zhjc@master spark]# ./sbin/start-all.sh

When submitting jobs we do not use standalone mode; jobs are handed straight to YARN.

The configuration is as follows (open the conf/spark-env.sh file under the Spark directory on master to see it):

export JAVA_HOME=/soft/jdk   # Java home
export SCALA_HOME=/soft/scala # Scala home
export SPARK_WORKER_MEMORY=8g  # maximum memory available to each worker node
export SPARK_MASTER_IP=master   # IP/hostname of the Spark master
export HADOOP_HOME=/soft/hadoop  # Hadoop home
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop # Hadoop configuration directory
export SPARK_DIST_CLASSPATH=$(hadoop classpath):$(hbase classpath)
export SPARK_YARN_USER_ENV=/soft/hadoop/etc/hadoop/
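Pushing the Spark configuration to every worker can be scripted the same way; a sketch assuming the /soft/spark symlink exists on each slave and passwordless ssh as zhjc:

# push the spark config files to every worker
for h in slave1 slave2 slave3 slave4 slave5 slave6; do
  scp /soft/spark/conf/spark-env.sh /soft/spark/conf/slaves zhjc@"$h":/soft/spark/conf/
done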

ZooKeeper cluster setup

Environment variables on master and all slaves are assumed to be configured.

Go into the zookeeper directory and first create a data directory (for ZooKeeper's data). Then go into conf, rename zoo_sample.cfg to zoo.cfg with mv, and edit it with vi or vim.

Create a myid file in the data directory to give each node its id:

[zhjc@master data]# vim myid  // myid is 1 on master, 2 on slave1, and so on
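The ids can also be written from master in one pass; a rough sketch, assuming /soft/zookeeper/data already exists on every node, the zhjc user can write to it, and the id assignment matches the server.N entries below (master=1, slave1=2, ..., slave6=7):

echo 1 > /soft/zookeeper/data/myid        # on master
i=2
for h in slave1 slave2 slave3 slave4 slave5 slave6; do
  ssh zhjc@"$h" "echo $i > /soft/zookeeper/data/myid"
  i=$((i+1))
done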

Open zoo.cfg under conf:

[zhjc@master conf]# vim zoo.cfg

Set dataDir and the server.1, server.2, server.3 ... entries; the number in each server.N must match the number in that node's myid file, since this id identifies the machine during leader election at startup.

The configuration is as follows (open zoo.cfg on master to see it):

# The number of milliseconds of each tick
tickTime=2000
# The number of ticks that the initial
# synchronization phase can take
initLimit=10
# The number of ticks that can pass between
# sending a request and getting an acknowledgement
syncLimit=5
# the directory where the snapshot is stored.
# do not use /tmp for storage, /tmp here is just
# example sakes.
dataDir=/soft/zookeeper/data
# the port at which the clients will connect
clientPort=2181
# the maximum number of client connections.
# increase this if you need to handle more clients
#maxClientCnxns=60
#
# Be sure to read the maintenance section of the
# administrator guide before turning on autopurge.
#
# http://zookeeper.apache.org/doc/current/zookeeperAdmin.html#sc_maintenance
#
# The number of snapshots to retain in dataDir
#autopurge.snapRetainCount=3
# Purge task interval in hours
# Set to "0" to disable auto purge feature
#autopurge.purgeInterval=1
server.1=master:2888:3888
server.2=slave1:2888:3888
server.3=slave2:2888:3888
server.4=slave3:2888:3888
server.5=slave4:2888:3888
server.6=slave5:2888:3888
server.7=slave6:2888:3888

Starting the ZooKeeper cluster (every node, master and slaves, must be started; run the following on each):

[zhjc@master bin]# ./zkServer.sh start

On each node, go into ZooKeeper's bin directory and start it with zkServer.sh start, then check its state with zkServer.sh status.
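Starting and checking all seven nodes can also be driven from master; a sketch assuming passwordless ssh and the same /soft/zookeeper path everywhere:

# start zookeeper everywhere, then check each node's role
for h in master slave1 slave2 slave3 slave4 slave5 slave6; do
  ssh zhjc@"$h" '/soft/zookeeper/bin/zkServer.sh start'
done
for h in master slave1 slave2 slave3 slave4 slave5 slave6; do
  ssh zhjc@"$h" '/soft/zookeeper/bin/zkServer.sh status'
done

Once a quorum has formed, status reports one leader and the rest followers; if it errors out right after startup, wait a few seconds and run it again.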

Kafka cluster setup

Environment variables on master and all slaves are assumed to be configured.

Edit the config/server.properties file under the kafka directory:

[zhjc@master conf]# vim server.properties

Set broker.id=1 (the default is 0); master gets 1, and the other nodes follow in order.

The configuration is as follows (open server.properties on master to see it):

############################# Server Basics #############################

# The id of the broker. This must be set to a unique integer for each broker.
broker.id=1
############################# Socket Server Settings #############################

# The address the socket server listens on. It will get the value returned from
# java.net.InetAddress.getCanonicalHostName() if not configured.
#   FORMAT:
#     listeners = listener_name://host_name:port
#   EXAMPLE:
#     listeners = PLAINTEXT://your.host.name:9092
#listeners=PLAINTEXT://:9092

# Hostname and port the broker will advertise to producers and consumers. If not set,
# it uses the value for "listeners" if configured.  Otherwise, it will use the value
# returned from java.net.InetAddress.getCanonicalHostName().
#advertised.listeners=PLAINTEXT://your.host.name:9092

# Maps listener names to security protocols, the default is for them to be the same. See the config documentation for more details
#listener.security.protocol.map=PLAINTEXT:PLAINTEXT,SSL:SSL,SASL_PLAINTEXT:SASL_PLAINTEXT,SASL_SSL:SASL_SSL

# The number of threads that the server uses for receiving requests from the network and sending responses to the network
num.network.threads=3

# The number of threads that the server uses for processing requests, which may include disk I/O
num.io.threads=8

# The send buffer (SO_SNDBUF) used by the socket server
socket.send.buffer.bytes=102400

# The receive buffer (SO_RCVBUF) used by the socket server
socket.receive.buffer.bytes=102400

# The maximum size of a request that the socket server will accept (protection against OOM)
socket.request.max.bytes=104857600


############################# Log Basics #############################

# A comma separated list of directories under which to store log files
log.dirs=/tmp/kafka-logs

# The default number of log partitions per topic. More partitions allow greater
# parallelism for consumption, but this will also result in more files across
# the brokers.
num.partitions=1

# The number of threads per data directory to be used for log recovery at startup and flushing at shutdown.
# This value is recommended to be increased for installations with data dirs located in RAID array.
num.recovery.threads.per.data.dir=1

############################# Internal Topic Settings  #############################
# The replication factor for the group metadata internal topics "__consumer_offsets" and "__transaction_state"
# For anything other than development testing, a value greater than 1 is recommended for to ensure availability such as 3.
offsets.topic.replication.factor=1
transaction.state.log.replication.factor=1
transaction.state.log.min.isr=1

############################# Log Flush Policy #############################

# Messages are immediately written to the filesystem but by default we only fsync() to sync
# the OS cache lazily. The following configurations control the flush of data to disk.
# There are a few important trade-offs here:
#    1. Durability: Unflushed data may be lost if you are not using replication.
#    2. Latency: Very large flush intervals may lead to latency spikes when the flush does occur as there will be a lot of data to flush.
#    3. Throughput: The flush is generally the most expensive operation, and a small flush interval may lead to excessive seeks.
# The settings below allow one to configure the flush policy to flush data after a period of time or
# every N messages (or both). This can be done globally and overridden on a per-topic basis.

# The number of messages to accept before forcing a flush of data to disk
#log.flush.interval.messages=10000

# The maximum amount of time a message can sit in a log before we force a flush
#log.flush.interval.ms=1000

############################# Log Retention Policy #############################

# The following configurations control the disposal of log segments. The policy can
# be set to delete segments after a period of time, or after a given size has accumulated.
# A segment will be deleted whenever *either* of these criteria are met. Deletion always happens
# from the end of the log.
# The minimum age of a log file to be eligible for deletion due to age
log.retention.hours=72

# A size-based retention policy for logs. Segments are pruned from the log unless the remaining
# segments drop below log.retention.bytes. Functions independently of log.retention.hours.
#log.retention.bytes=1073741824

# The maximum size of a log segment file. When this size is reached a new log segment will be created.
log.segment.bytes=1073741824

# The interval at which log segments are checked to see if they can be deleted according
# to the retention policies
log.retention.check.interval.ms=300000

############################# Zookeeper #############################

# Zookeeper connection string (see zookeeper docs for details).
# This is a comma separated host:port pairs, each corresponding to a zk
# server. e.g. "127.0.0.1:3000,127.0.0.1:3001,127.0.0.1:3002".
# You can also append an optional chroot string to the urls to specify the
# root directory for all kafka znodes.
zookeeper.connect=master:2181,slave1:2181,slave2:2181,slave3:2181,slave4:2181,slave5:2181,slave6:2181

# Timeout in ms for connecting to zookeeper
zookeeper.connection.timeout.ms=6000


############################# Group Coordinator Settings #############################

# The following configuration specifies the time, in milliseconds, that the GroupCoordinator will delay the initial consumer rebalance.
# The rebalance will be further delayed by the value of group.initial.rebalance.delay.ms as new members join the group, up to a maximum of max.poll.interval.ms.
# The default value for this is 3 seconds.
# We override this to 0 here as it makes for a better out-of-the-box experience for development and testing.
# However, in production environments the default value of 3 seconds is more suitable as this will help to avoid unnecessary, and potentially expensive, rebalances during application startup.
group.initial.rebalance.delay.ms=0

Finally, create the log directory on every node (mkdir /hdfs/kafka) and add symlinks as needed; once that is done, the Kafka cluster installation is complete.
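The same server.properties can be pushed from master and the broker.id fixed up per node in one pass; a rough sketch, assuming the id scheme above (master=1, slave1=2, ...) and passwordless ssh as zhjc:

# copy master's server.properties to each broker and give it a unique broker.id
i=2
for h in slave1 slave2 slave3 slave4 slave5 slave6; do
  scp /soft/kafka/config/server.properties zhjc@"$h":/soft/kafka/config/server.properties
  ssh zhjc@"$h" "sed -i 's/^broker.id=.*/broker.id=$i/' /soft/kafka/config/server.properties"
  i=$((i+1))
done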

Starting Kafka (every node, master and slaves, runs the following):

[zhjc@master bin]# nohup kafka-server-start.sh /soft/kafka/config/server.properties > /dev/null 2>&1 &

or:

[zhjc@master bin]# kafka-server-start.sh -daemon /soft/kafka/config/server.properties

Newer Kafka versions do not need to be backgrounded with nohup; the -daemon flag runs the broker in the background. After starting, jps should show a Kafka process. Creating topics, producing, and consuming work much as before, and the broker is stopped with bin/kafka-server-stop.sh.
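Once every broker is up, a quick smoke test with the scripts shipped in Kafka 1.1.1 (the topic name test is arbitrary):

# create and inspect a test topic (Kafka 1.1 still manages topics through ZooKeeper)
kafka-topics.sh --create --zookeeper master:2181 --replication-factor 2 --partitions 3 --topic test
kafka-topics.sh --describe --zookeeper master:2181 --topic test
# type a few lines into the producer, then read them back with the consumer
kafka-console-producer.sh --broker-list master:9092 --topic test
kafka-console-consumer.sh --bootstrap-server master:9092 --topic test --from-beginning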

Single-node Redis installation on CentOS

Starting Redis: in the /soft/redis directory run:

redis-server redis.conf

Stopping Redis:

redis-cli shutdown

Deleting keys with a fixed prefix:

redis-cli  -p 6379 --scan --pattern "C0*" | xargs -L 5000 redis-cli -n 0 -p 6379 DEL

Installing Redis:

Find the Redis package, unpack it into the target directory, then compile and install from source.

Building Redis requires a C toolchain; if gcc is missing, install it with: yum install gcc-c++

cd redis-4.0.9 //enter the Redis source directory
make  //compile
make install PREFIX=/soft/redis //install

PREFIX sets the install directory. After installation there is a bin folder under /soft/redis; go into it and run ll to see the binaries.

At this point Redis is installed. It can be started directly with ./redis-server, which runs it in the foreground; Ctrl+C stops it.

Redis can also be started from an init script. The compiled source tree contains utils/redis_init_script. Copy it into /etc/init.d and name it redis_<port> (here it was renamed to redis_6379), where <port> is the port Redis should listen on and that clients will connect to. Then set the REDISPORT variable inside the script to the same port.
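Written out, the steps in the previous paragraph look roughly like this (assuming the source tree was unpacked as /soft/redis-4.0.9; the script's EXEC/CONF paths may also need to point at your install locations):

sudo cp /soft/redis-4.0.9/utils/redis_init_script /etc/init.d/redis_6379
sudo chmod +x /etc/init.d/redis_6379
sudo vi /etc/init.d/redis_6379     # make sure REDISPORT=6379 and that it points at the installed redis-server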

Then create the directory for Redis configuration files and the directory for Redis persistence files:

/etc/redis holds the Redis configuration files

/var/redis/<port> holds the Redis persistence files (here /var/redis/6379)

Edit the configuration file

Copy the template redis-4.0.9/redis.conf into /etc/redis, name it after the port (e.g. 6379.conf), and then edit the following parameters.
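A sketch of the directory setup and the config copy for port 6379 (paths as assumed above):

sudo mkdir -p /etc/redis /var/redis/6379
sudo cp /soft/redis-4.0.9/redis.conf /etc/redis/6379.conf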

daemonize yes            run Redis as a daemon
pidfile /var/run/redis_<port>.pid    location of the Redis PID file
port <port>              the port Redis listens on
dir /var/redis/<port>    where the persistence files are stored
#requirepass foobared    uncomment and set a value if you want to require a password
bind 127.0.0.1           change the default 127.0.0.1 to 0.0.0.0 (no restriction) so external hosts can connect

Redis can now also be started and stopped with:

/etc/init.d/redis_6379 start
/etc/init.d/redis_6379 stop

Start Redis automatically at boot:

chkconfig redis_6379 on

After the above, Redis can also be started and stopped with the following commands:

service redis_6379 start

service redis_6379 stop

With this, Redis starts automatically whenever the system reboots.

The stop command above works, but Redis may be in the middle of syncing in-memory data to disk, and forcibly killing the process could lose data. The proper way to stop Redis is to send it the SHUTDOWN command:

redis-cli SHUTDOWN

When Redis receives SHUTDOWN, it first closes all client connections, then persists data according to its configuration, and finally exits.
Redis also handles SIGTERM gracefully, so killing the Redis process by PID shuts it down cleanly as well, with the same effect as SHUTDOWN.

If external access is needed, first check that the firewall is not blocking the port,

then change the bind setting in the configuration file from the default 127.0.0.1 to 0.0.0.0.
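After that, a quick check from another machine (replace <redis-host> with the Redis server's address; if requirepass is set, add -a <password>):

redis-cli -h <redis-host> -p 6379 ping    # should answer PONG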

Changing Hadoop's default log level

Change the setting in log4j.properties:

# Define some default values that can be overridden by system properties
hadoop.root.logger=WARN,console

This setting alone is overridden by system properties!

Two more files must be changed before the default log level actually changes (only HDFS is covered here; do the same for YARN as needed):

First, in ${HADOOP_HOME}/etc/hadoop/hadoop-env.sh, change INFO to WARN:

# Command specific options appended to HADOOP_OPTS when specified
export HADOOP_NAMENODE_OPTS="-Xmx30720m -Dhadoop.security.logger=${HADOOP_SECURITY_LOGGER:-WARN,RFAS} -Dhdfs.audit.logger=${HDFS_AUDIT_LOGGER:-WARN,NullAppender} $HADOOP_NAMENODE_OPTS"

YARN needs the same change: edit ${HADOOP_HOME}/etc/hadoop/yarn-env.sh the same way.

The start script ${HADOOP_HOME}/sbin/hadoop-daemon.sh also needs the same change:

export HADOOP_ROOT_LOGGER=${HADOOP_ROOT_LOGGER:-"WARN,RFA"}

The YARN start script, ${HADOOP_HOME}/sbin/yarn-daemon.sh, needs the same change as well.

Finally, restart the NameNode.
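For a one-off command you can also override the logger without editing any scripts, since the hadoop launcher honors the HADOOP_ROOT_LOGGER environment variable; for example:

HADOOP_ROOT_LOGGER=WARN,console hadoop fs -ls /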

Transferring MySQL data into HBase

The following packages are needed:

1 sqoop-1.4.6.bin__hadoop-2.0.4-alpha.tar

2 mysql-connector-java-5.1.38

Common commands:

//sqoop transfer
sqoop import --connect jdbc:mysql://192.168.139.1:3306/transportation --username wcc --password 123456 --table 't_videodata_raw' --where "CreateTime>='2019-03-27 00:00:00' AND CreateTime < '2019-03-28 00:00:00'" --hbase-table 'transportation:t_videodata_raw' --hbase-row-key 'Rowkey' --column-family 'info' --split-by 'Rowkey'
//concatenate the rowkey
UPDATE t_link_set as t SET Rowkey =CONCAT(LinkID,'_',create_time)
//submit to yarn
spark-submit --master yarn --deploy-mode cluster --driver-memory 1G --executor-memory 1500m --executor-cores 2 --class Forecastion.knnTest SparkTrain-1.0-SNAPSHOT.jar

//count rows in an hbase table
hbase org.apache.hadoop.hbase.mapreduce.RowCounter "transportation:t_videodata_raw"
//transfer + scheduling
***************************scheduling***************************************
sudo yum install crontabs
sudo systemctl enable crond (enable at boot)
sudo systemctl start crond (start the crond service)
sudo systemctl status crond (check its status)
sudo nano /etc/crontab
1 0 * * * root /usr/local/mycommand.sh (runs the script once a day at one minute past midnight)
sudo crontab /etc/crontab
sudo crontab -l
***************************transfer***************************************
yesday=$(date -d last-day +%Y-%m-%d)
export SQOOP_HOME=/soft/sqoop

sqoop import \
  --connect jdbc:mysql://192.168.139.1:3306/whtmb \
  --username xxxx \
  --password xxxx1234 \
  --table 't_videodata_raw' \
  --check-column CreateTime \
  --incremental lastmodified \
  --last-value ${yesday} \
  --hbase-table 'transportation:t_videodata_raw' \
  --merge-key 'RowKey' \
  --hbase-row-key 'RowKey' \
  --column-family 'info' \
  --split-by 'RowKey'
#sqoop import --connect jdbc:mysql://192.168.139.1:3306/whtmb --username wcc --password 123456 --table 't_earthmagnetic_raw'  --hbase-table 'transportation:t_earthmagnetic_raw' --hbase-row-key 'RowKey' --colu$

#sqoop import --connect jdbc:mysql://192.168.139.1:3306/whtmb --username ${rdbms_username} --password ${rdbms_pwd} --table ${rdbms_table1} --hbase-table ${hbase_table1} --hbase-row-key ${row_key} --column-fam$
#sqoop import --connect ${rdbms_url} --username ${rdbms_username} --password ${rdbms_pwd} --table ${rdbms_table2} --hbase-table ${hbase_table2} --hbase-row-key ${row_key} --column-family ${column-family} --sp$
#sqoop import --connect ${rdbms_url} --username ${rdbms_username} --password ${rdbms_pwd} --table ${rdbms_table3} --hbase-table ${hbase_table3} --hbase-row-key ${row_key} --column-family ${column-family} --sp$
echo "等待批量任务完成"
         wait
echo "开始下一批导入"

Commands to run:

sqoop list-databases --connect jdbc:mysql://192.168.139.1:3306/whtmb --username xxxx --password xxxx1234

sqoop import --connect jdbc:mysql://192.168.139.1:3306/whtmb --username xxxx --password xxxx1234 --table 't_videodata_raw' --hbase-table 'transportation:t_videodata_raw' --hbase-row-key 'Rowkey' --column-family 'info' --split-by 'Rowkey'
