Complete cluster setup: Hadoop, Spark, ZooKeeper, Kafka, Redis, etc., plus changing Hadoop's default log level.

Environment variable configuration

Create a soft directory under the root directory (/soft).

The packages below all go into that directory; unpack each one and create a symlink to it (e.g. /soft/hadoop -> /soft/hadoop-2.7.2):

0 jdk-8u191-linux-x64.tar
1 hadoop-2.7.2.tar
2 spark-2.3.1-bin-hadoop2.7
3 scala-2.11.12
4 zookeeper-3.4.10.tar
5 kafka_2.11-1.1.1 (version updated)
6 redis-3.2.12.tar
7 hbase-1.2.9-bin.tar
8 apache-tomcat-7.0.91.tar

Then on master open /etc/environment with vim, copy its full contents into /etc/environment on every slave, and run source /etc/environment to apply it.

JAVA_HOME="/soft/jdk/"
HADOOP_HOME="/soft/hadoop/"
HIVE_HOME="/soft/hive"
HBASE_HOME="/soft/hbase"
ZK_HOME="/soft/zookeeper"
KAFKA_HOME="/soft/kafka"
SCALA_HOME="/soft/scala"
SPARK_HOME="/soft/spark"
BIGDL_HOME="/soft/bigdl"
SQOOP_HOME="/soft/sqoop"
KE_HOME="/soft/kafka-eagle/kafka-eagle-web-1.2.4"
PATH="/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/soft/jdk/bin:/soft/hadoop/bin:/soft/hadoop/sbin:/soft/hive/bin:/soft/zookeeper/bin:/soft/hbase/bin:/soft/kafka/bin:/soft/scala/bin:/soft/spark/bin:/soft/spark/sbin:/soft/bigdl/bin:/soft/sqoop/bin:/soft/kafka-eagle/kafka-eagle-web-1.2.4/bin"

Hadoop cluster setup

1. Copy the Hadoop tarball onto the master host and unpack it (here it is unpacked under /home/soft/), then configure its environment variables the same way as for the JDK.

2. Inside the hadoop-2.7.2 directory, create four directories (hdfs goes under hadoop-2.7.2; the other three go under hdfs):

sudo mkdir hdfs
cd hdfs
sudo mkdir data
sudo mkdir tmp
sudo mkdir name
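Equivalently, the same layout can be created in one command from inside the hadoop-2.7.2 directory:

# creates hdfs/ plus its data, tmp and name subdirectories
sudo mkdir -p hdfs/{data,tmp,name}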

3. Edit Hadoop's configuration files.
First go to the configuration directory: cd /home/soft/hadoop-2.7.2/etc/hadoop (again, use your own path), then run

ls

to list the files in that directory.

For cluster/distributed mode, five files under /home/soft/hadoop-2.7.2/etc/hadoop need to be modified: slaves, core-site.xml, hdfs-site.xml, mapred-site.xml, yarn-site.xml.

slaves

Every host listed in the slaves file is used as a DataNode; each slave runs a DataNode, so the slave host names go into this file.
Run:

sudo gedit slaves

Delete localhost and write slave1. If you do not see localhost, check that you opened the right file; it is the slaves file under etc/hadoop. With more than one slave, add the other slave host names here as well, one per line: slave1, slave2, ... (see the example below).
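For reference, with the six slaves used later in this guide the slaves file is simply one host name per line:

slave1
slave2
slave3
slave4
slave5
slave6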

core-site.xml
sudo gedit core-site.xml

Add:



<configuration>
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://master:9000</value>
    </property>
    <property>
        <name>hadoop.tmp.dir</name>
        <value>/home/soft/hadoop-2.7.2/hdfs/tmp</value>
    </property>
    <property>
        <name>fs.trash.interval</name>
        <value>10080</value>
    </property>
</configuration>


hdfs-site.xml
sudo gedit hdfs-site.xml

Add:



<configuration>
    <property>
        <name>dfs.namenode.secondary.http-address</name>
        <value>master:9001</value>
    </property>
    <property>
        <name>dfs.replication</name>
        <value>2</value>
    </property>
    <property>
        <name>dfs.namenode.name.dir</name>
        <value>/home/soft/hadoop-2.7.2/hdfs/name</value>
    </property>
    <property>
        <name>dfs.datanode.data.dir</name>
        <value>/home/soft/hadoop-2.7.2/hdfs/data</value>
    </property>
</configuration>


mapred-site.xml

If this file does not exist, first run

cp mapred-site.xml.template mapred-site.xml

to create a copy, and then

sudo gedit mapred-site.xml

Add:



<configuration>
    <property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
    </property>
</configuration>

yarn-site.xml
sudo gedit yarn-site.xml

Add:



<configuration>
    <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
    </property>
    <property>
        <name>yarn.resourcemanager.hostname</name>
        <value>master</value>
    </property>
</configuration>


All of the steps above are done on the master host. Then format the NameNode (do this only once) with the following command:

hadoop namenode -format

Note: as long as the output contains "successfully formatted", the format succeeded.

Next, copy Hadoop to slave1, slave2, and the other slaves:

scp -r hadoop-2.7.2 zhjc@slave1:/home/soft/

Note: zhjc is the user name on the slave, set when slave1 was created.

After copying, configure the Hadoop environment variables on slave1 the same way, then test as above: if hadoop version prints output, it works. Keep the other slaves identical to slave1.
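With several slaves, the copy and the check can be looped from master; a rough sketch assuming passwordless ssh as zhjc and that /etc/environment has already been distributed:

# copy hadoop to every slave and print the version it reports there
for h in slave1 slave2 slave3 slave4 slave5 slave6; do
  scp -r /home/soft/hadoop-2.7.2 zhjc@"$h":/home/soft/
  ssh zhjc@"$h" '/home/soft/hadoop-2.7.2/bin/hadoop version | head -n 1'
done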

6. Starting Hadoop

There are two ways:

start-all.sh

or

start-dfs.sh
start-yarn.sh

If jps on master now shows the master-side daemons (typically NameNode, SecondaryNameNode, and ResourceManager), and jps on slave1 shows DataNode and NodeManager, the cluster has been set up successfully. To stop the cluster:

stop-all.sh

7. Finally, run the bundled example to check that the Hadoop cluster can actually run a job

Run:

hadoop jar /home/soft/hadoop-2.7.2/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.2.jar  pi 2 10

This estimates pi: pi is the example name, the first number (2) is the number of map tasks, and the second (10) is the number of random samples generated per map (this follows from the Monte Carlo method it uses).

When the estimated value of pi is printed at the end, the Hadoop cluster setup is complete.

Spark cluster (fully distributed)

This step assumes the Hadoop cluster is already set up and that the environment variables on master and all slaves are configured.

The four run modes of a Spark cluster

1. Local

Runs on a single machine; generally used for development and testing.

2. YARN

The Spark client connects directly to YARN; no separate Spark cluster needs to be built.

3. Standalone

Builds a Spark cluster of a Master plus Workers; Spark runs inside that cluster.

4. Mesos

The Spark client connects directly to Mesos; no separate Spark cluster needs to be built.

Everything else is configured the same as on master. Edit the slaves file under conf/, then send it to every slave:

vim slaves
# cluster worker list
slave1
slave2
slave3
slave4
slave5
slave6

Standalone startup:

[zhjc@master spark]# ./sbin/start-all.sh

When submitting jobs we do not use standalone mode; jobs are handed straight to YARN.

The configuration is as follows (open the conf/spark-env.sh file under the Spark directory on master to see it):

export JAVA_HOME=/soft/jdk   # Java home
export SCALA_HOME=/soft/scala # Scala home
export SPARK_WORKER_MEMORY=8g  # maximum memory available to each worker node
export SPARK_MASTER_IP=master   # IP/hostname of the Spark master
export HADOOP_HOME=/soft/hadoop  # Hadoop home
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop # Hadoop configuration directory
export SPARK_DIST_CLASSPATH=$(hadoop classpath):$(hbase classpath)
export SPARK_YARN_USER_ENV=/soft/hadoop/etc/hadoop/
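Pushing the Spark configuration to every worker can be scripted the same way; a sketch assuming the /soft/spark symlink exists on each slave and passwordless ssh as zhjc:

# push the spark config files to every worker
for h in slave1 slave2 slave3 slave4 slave5 slave6; do
  scp /soft/spark/conf/spark-env.sh /soft/spark/conf/slaves zhjc@"$h":/soft/spark/conf/
done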

ZooKeeper cluster setup

Environment variables on master and all slaves are assumed to be configured.

Go into the zookeeper directory and first create a data directory (for ZooKeeper's data). Then go into conf, rename zoo_sample.cfg to zoo.cfg with mv, and edit it with vi or vim.

Create a myid file in the data directory to give each node its id:

[zhjc@master data]# vim myid  // myid is 1 on master, 2 on slave1, and so on
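The ids can also be written from master in one pass; a rough sketch, assuming /soft/zookeeper/data already exists on every node, the zhjc user can write to it, and the id assignment matches the server.N entries below (master=1, slave1=2, ..., slave6=7):

echo 1 > /soft/zookeeper/data/myid        # on master
i=2
for h in slave1 slave2 slave3 slave4 slave5 slave6; do
  ssh zhjc@"$h" "echo $i > /soft/zookeeper/data/myid"
  i=$((i+1))
done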

Open zoo.cfg under conf:

[zhjc@master conf]# vim zoo.cfg

Set dataDir and the server.1, server.2, server.3 ... entries; the number in each server.N must match the number in that node's myid file, since this id identifies the machine during leader election at startup.

The configuration is as follows (open zoo.cfg on master to see it):

# The number of milliseconds of each tick
tickTime=2000
# The number of ticks that the initial
# synchronization phase can take
initLimit=10
# The number of ticks that can pass between
# sending a request and getting an acknowledgement
syncLimit=5
# the directory where the snapshot is stored.
# do not use /tmp for storage, /tmp here is just
# example sakes.
dataDir=/soft/zookeeper/data
# the port at which the clients will connect
clientPort=2181
# the maximum number of client connections.
# increase this if you need to handle more clients
#maxClientCnxns=60
#
# Be sure to read the maintenance section of the
# administrator guide before turning on autopurge.
#
# http://zookeeper.apache.org/doc/current/zookeeperAdmin.html#sc_maintenance
#
# The number of snapshots to retain in dataDir
#autopurge.snapRetainCount=3
# Purge task interval in hours
# Set to "0" to disable auto purge feature
#autopurge.purgeInterval=1
server.1=master:2888:3888
server.2=slave1:2888:3888
server.3=slave2:2888:3888
server.4=slave3:2888:3888
server.5=slave4:2888:3888
server.6=slave5:2888:3888
server.7=slave6:2888:3888

Starting the ZooKeeper cluster (every node, master and slaves, must be started; run the following on each):

[zhjc@master bin]# ./zkServer.sh start

On each node, go into ZooKeeper's bin directory and start it with zkServer.sh start, then check its state with zkServer.sh status.
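Starting and checking all seven nodes can also be driven from master; a sketch assuming passwordless ssh and the same /soft/zookeeper path everywhere:

# start zookeeper everywhere, then check each node's role
for h in master slave1 slave2 slave3 slave4 slave5 slave6; do
  ssh zhjc@"$h" '/soft/zookeeper/bin/zkServer.sh start'
done
for h in master slave1 slave2 slave3 slave4 slave5 slave6; do
  ssh zhjc@"$h" '/soft/zookeeper/bin/zkServer.sh status'
done

Once a quorum has formed, status reports one leader and the rest followers; if it errors out right after startup, wait a few seconds and run it again.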

Kafka cluster setup

Environment variables on master and all slaves are assumed to be configured.

Edit the config/server.properties file under the kafka directory:

[zhjc@master conf]# vim server.properties

Set broker.id=1 (the default is 0); master gets 1, and the other nodes follow in order.

The configuration is as follows (open server.properties on master to see it):

############################# Server Basics #############################

# The id of the broker. This must be set to a unique integer for each broker.
broker.id=1
############################# Socket Server Settings #############################

# The address the socket server listens on. It will get the value returned from
# java.net.InetAddress.getCanonicalHostName() if not configured.
#   FORMAT:
#     listeners = listener_name://host_name:port
#   EXAMPLE:
#     listeners = PLAINTEXT://your.host.name:9092
#listeners=PLAINTEXT://:9092

# Hostname and port the broker will advertise to producers and consumers. If not set,
# it uses the value for "listeners" if configured.  Otherwise, it will use the value
# returned from java.net.InetAddress.getCanonicalHostName().
#advertised.listeners=PLAINTEXT://your.host.name:9092

# Maps listener names to security protocols, the default is for them to be the same. See the config documentation for more details
#listener.security.protocol.map=PLAINTEXT:PLAINTEXT,SSL:SSL,SASL_PLAINTEXT:SASL_PLAINTEXT,SASL_SSL:SASL_SSL

# The number of threads that the server uses for receiving requests from the network and sending responses to the network
num.network.threads=3

# The number of threads that the server uses for processing requests, which may include disk I/O
num.io.threads=8

# The send buffer (SO_SNDBUF) used by the socket server
socket.send.buffer.bytes=102400

# The receive buffer (SO_RCVBUF) used by the socket server
socket.receive.buffer.bytes=102400

# The maximum size of a request that the socket server will accept (protection against OOM)
socket.request.max.bytes=104857600


############################# Log Basics #############################

# A comma separated list of directories under which to store log files
log.dirs=/tmp/kafka-logs

# The default number of log partitions per topic. More partitions allow greater
# parallelism for consumption, but this will also result in more files across
# the brokers.
num.partitions=1

# The number of threads per data directory to be used for log recovery at startup and flushing at shutdown.
# This value is recommended to be increased for installations with data dirs located in RAID array.
num.recovery.threads.per.data.dir=1

############################# Internal Topic Settings  #############################
# The replication factor for the group metadata internal topics "__consumer_offsets" and "__transaction_state"
# For anything other than development testing, a value greater than 1 is recommended for to ensure availability such as 3.
offsets.topic.replication.factor=1
transaction.state.log.replication.factor=1
transaction.state.log.min.isr=1

############################# Log Flush Policy #############################

# Messages are immediately written to the filesystem but by default we only fsync() to sync
# the OS cache lazily. The following configurations control the flush of data to disk.
# There are a few important trade-offs here:
#    1. Durability: Unflushed data may be lost if you are not using replication.
#    2. Latency: Very large flush intervals may lead to latency spikes when the flush does occur as there will be a lot of data to flush.
#    3. Throughput: The flush is generally the most expensive operation, and a small flush interval may lead to excessive seeks.
# The settings below allow one to configure the flush policy to flush data after a period of time or
# every N messages (or both). This can be done globally and overridden on a per-topic basis.

# The number of messages to accept before forcing a flush of data to disk
#log.flush.interval.messages=10000

# The maximum amount of time a message can sit in a log before we force a flush
#log.flush.interval.ms=1000

############################# Log Retention Policy #############################

# The following configurations control the disposal of log segments. The policy can
# be set to delete segments after a period of time, or after a given size has accumulated.
# A segment will be deleted whenever *either* of these criteria are met. Deletion always happens
# from the end of the log.
# The minimum age of a log file to be eligible for deletion due to age
log.retention.hours=72

# A size-based retention policy for logs. Segments are pruned from the log unless the remaining
# segments drop below log.retention.bytes. Functions independently of log.retention.hours.
#log.retention.bytes=1073741824

# The maximum size of a log segment file. When this size is reached a new log segment will be created.
log.segment.bytes=1073741824

# The interval at which log segments are checked to see if they can be deleted according
# to the retention policies
log.retention.check.interval.ms=300000

############################# Zookeeper #############################

# Zookeeper connection string (see zookeeper docs for details).
# This is a comma separated host:port pairs, each corresponding to a zk
# server. e.g. "127.0.0.1:3000,127.0.0.1:3001,127.0.0.1:3002".
# You can also append an optional chroot string to the urls to specify the
# root directory for all kafka znodes.
zookeeper.connect=master:2181,slave1:2181,slave2:2181,slave3:2181,slave4:2181,slave5:2181,slave6:2181

# Timeout in ms for connecting to zookeeper
zookeeper.connection.timeout.ms=6000


############################# Group Coordinator Settings #############################

# The following configuration specifies the time, in milliseconds, that the GroupCoordinator will delay the initial consumer rebalance.
# The rebalance will be further delayed by the value of group.initial.rebalance.delay.ms as new members join the group, up to a maximum of max.poll.interval.ms.
# The default value for this is 3 seconds.
# We override this to 0 here as it makes for a better out-of-the-box experience for development and testing.
# However, in production environments the default value of 3 seconds is more suitable as this will help to avoid unnecessary, and potentially expensive, rebalances during application startup.
group.initial.rebalance.delay.ms=0

Finally, create the log directory on every node (mkdir /hdfs/kafka) and add symlinks as needed; once that is done, the Kafka cluster installation is complete.
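The same server.properties can be pushed from master and the broker.id fixed up per node in one pass; a rough sketch, assuming the id scheme above (master=1, slave1=2, ...) and passwordless ssh as zhjc:

# copy master's server.properties to each broker and give it a unique broker.id
i=2
for h in slave1 slave2 slave3 slave4 slave5 slave6; do
  scp /soft/kafka/config/server.properties zhjc@"$h":/soft/kafka/config/server.properties
  ssh zhjc@"$h" "sed -i 's/^broker.id=.*/broker.id=$i/' /soft/kafka/config/server.properties"
  i=$((i+1))
done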

Starting Kafka (every node, master and slaves, runs the following):

[zhjc@master bin]# nohup kafka-server-start.sh /soft/kafka/config/server.properties > /dev/null 2>&1 &

or:

[zhjc@master bin]# kafka-server-start.sh -daemon /soft/kafka/config/server.properties

Newer Kafka versions do not need to be backgrounded with nohup; the -daemon flag runs the broker in the background. After starting, jps should show a Kafka process. Creating topics, producing, and consuming work much as before, and the broker is stopped with bin/kafka-server-stop.sh.
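Once every broker is up, a quick smoke test with the scripts shipped in Kafka 1.1.1 (the topic name test is arbitrary):

# create and inspect a test topic (Kafka 1.1 still manages topics through ZooKeeper)
kafka-topics.sh --create --zookeeper master:2181 --replication-factor 2 --partitions 3 --topic test
kafka-topics.sh --describe --zookeeper master:2181 --topic test
# type a few lines into the producer, then read them back with the consumer
kafka-console-producer.sh --broker-list master:9092 --topic test
kafka-console-consumer.sh --bootstrap-server master:9092 --topic test --from-beginning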

Single-node Redis installation on CentOS

Starting Redis: in the /soft/redis directory run:

redis-server redis.conf

Stopping Redis:

redis-cli shutdown

Deleting keys with a fixed prefix:

redis-cli  -p 6379 --scan --pattern "C0*" | xargs -L 5000 redis-cli -n 0 -p 6379 DEL

Installing Redis:

Find the Redis package, unpack it into the target directory, then compile and install from source.

Building Redis requires a C toolchain; if gcc is missing, install it with: yum install gcc-c++

cd redis-4.0.9 //enter the Redis source directory
make  //compile
make install PREFIX=/soft/redis //install

PREFIX sets the install directory. After installation there is a bin folder under /soft/redis; go into it and run ll to see the binaries.

At this point Redis is installed. It can be started directly with ./redis-server, which runs it in the foreground; Ctrl+C stops it.

Redis can also be started from an init script. The compiled source tree contains utils/redis_init_script. Copy it into /etc/init.d and name it redis_<port> (here it was renamed to redis_6379), where <port> is the port Redis should listen on and that clients will connect to. Then set the REDISPORT variable inside the script to the same port.
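Written out, the steps in the previous paragraph look roughly like this (assuming the source tree was unpacked as /soft/redis-4.0.9; the script's EXEC/CONF paths may also need to point at your install locations):

sudo cp /soft/redis-4.0.9/utils/redis_init_script /etc/init.d/redis_6379
sudo chmod +x /etc/init.d/redis_6379
sudo vi /etc/init.d/redis_6379     # make sure REDISPORT=6379 and that it points at the installed redis-server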

Then create the directory for Redis configuration files and the directory for Redis persistence files:

/etc/redis holds the Redis configuration files

/var/redis/<port> holds the Redis persistence files (here /var/redis/6379)

Edit the configuration file

Copy the template redis-4.0.9/redis.conf into /etc/redis, name it after the port (e.g. 6379.conf), and then edit the following parameters.
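A sketch of the directory setup and the config copy for port 6379 (paths as assumed above):

sudo mkdir -p /etc/redis /var/redis/6379
sudo cp /soft/redis-4.0.9/redis.conf /etc/redis/6379.conf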

daemonize yes            run Redis as a daemon
pidfile /var/run/redis_<port>.pid    location of the Redis PID file
port <port>              the port Redis listens on
dir /var/redis/<port>    where the persistence files are stored
#requirepass foobared    uncomment and set a value if you want to require a password
bind 127.0.0.1           change the default 127.0.0.1 to 0.0.0.0 (no restriction) so external hosts can connect

Redis can now also be started and stopped with:

/etc/init.d/redis_6379 start
/etc/init.d/redis_6379 stop

Start Redis automatically at boot:

chkconfig redis_6379 on

After the above, Redis can also be started and stopped with the following commands:

service redis_6379 start

service redis_6379 stop

With this, Redis starts automatically whenever the system reboots.

The stop command above works, but Redis may be in the middle of syncing in-memory data to disk, and forcibly killing the process could lose data. The proper way to stop Redis is to send it the SHUTDOWN command:

redis-cli SHUTDOWN

When Redis receives SHUTDOWN, it first closes all client connections, then persists data according to its configuration, and finally exits.
Redis also handles SIGTERM gracefully, so killing the Redis process by PID shuts it down cleanly as well, with the same effect as SHUTDOWN.

If external access is needed, first check that the firewall is not blocking the port,

then change the bind setting in the configuration file from the default 127.0.0.1 to 0.0.0.0.
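After that, a quick check from another machine (replace <redis-host> with the Redis server's address; if requirepass is set, add -a <password>):

redis-cli -h <redis-host> -p 6379 ping    # should answer PONG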

Changing Hadoop's default log level

Change the setting in log4j.properties:

# Define some default values that can be overridden by system properties
hadoop.root.logger=WARN,console

This setting alone is overridden by system properties!

Two more files must be changed before the default log level actually changes (only HDFS is covered here; do the same for YARN as needed):

First, in ${HADOOP_HOME}/etc/hadoop/hadoop-env.sh, change INFO to WARN:

# Command specific options appended to HADOOP_OPTS when specified
export HADOOP_NAMENODE_OPTS="-Xmx30720m -Dhadoop.security.logger=${HADOOP_SECURITY_LOGGER:-WARN,RFAS} -Dhdfs.audit.logger=${HDFS_AUDIT_LOGGER:-WARN,NullAppender} $HADOOP_NAMENODE_OPTS"

YARN needs the same change: edit ${HADOOP_HOME}/etc/hadoop/yarn-env.sh the same way.

The start script ${HADOOP_HOME}/sbin/hadoop-daemon.sh also needs the same change:

export HADOOP_ROOT_LOGGER=${HADOOP_ROOT_LOGGER:-"WARN,RFA"}

The YARN start script, ${HADOOP_HOME}/sbin/yarn-daemon.sh, needs the same change as well.

Finally, restart the NameNode.
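For a one-off command you can also override the logger without editing any scripts, since the hadoop launcher honors the HADOOP_ROOT_LOGGER environment variable; for example:

HADOOP_ROOT_LOGGER=WARN,console hadoop fs -ls /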

Transferring MySQL data into HBase

The following packages are needed:

1 sqoop-1.4.6.bin__hadoop-2.0.4-alpha.tar

2 mysql-connector-java-5.1.38

Common commands:

//sqoop transfer
sqoop import --connect jdbc:mysql://192.168.139.1:3306/transportation --username wcc --password 123456 --table 't_videodata_raw' --where "CreateTime>='2019-03-27 00:00:00' AND CreateTime < '2019-03-28 00:00:00'" --hbase-table 'transportation:t_videodata_raw' --hbase-row-key 'Rowkey' --column-family 'info' --split-by 'Rowkey'
//concatenate the rowkey
UPDATE t_link_set as t SET Rowkey =CONCAT(LinkID,'_',create_time)
//submit to yarn
spark-submit --master yarn --deploy-mode cluster --driver-memory 1G --executor-memory 1500m --executor-cores 2 --class Forecastion.knnTest SparkTrain-1.0-SNAPSHOT.jar

//count rows in an hbase table
hbase org.apache.hadoop.hbase.mapreduce.RowCounter "transportation:t_videodata_raw"
//transfer + scheduling
***************************scheduling***************************************
sudo yum install crontabs
sudo systemctl enable crond (enable at boot)
sudo systemctl start crond (start the crond service)
sudo systemctl status crond (check its status)
sudo nano /etc/crontab
1 0 * * * root /usr/local/mycommand.sh (runs the script once a day at one minute past midnight)
sudo crontab /etc/crontab
sudo crontab -l
***************************transfer***************************************
yesday=$(date -d last-day +%Y-%m-%d)
export SQOOP_HOME=/soft/sqoop

sqoop import \
  --connect jdbc:mysql://192.168.139.1:3306/whtmb \
  --username xxxx \
  --password xxxx1234 \
  --table 't_videodata_raw' \
  --check-column CreateTime \
  --incremental lastmodified \
  --last-value ${yesday} \
  --hbase-table 'transportation:t_videodata_raw' \
  --merge-key 'RowKey' \
  --hbase-row-key 'RowKey' \
  --column-family 'info' \
  --split-by 'RowKey'
#sqoop import --connect jdbc:mysql://192.168.139.1:3306/whtmb --username wcc --password 123456 --table 't_earthmagnetic_raw'  --hbase-table 'transportation:t_earthmagnetic_raw' --hbase-row-key 'RowKey' --colu$

#sqoop import --connect jdbc:mysql://192.168.139.1:3306/whtmb --username ${rdbms_username} --password ${rdbms_pwd} --table ${rdbms_table1} --hbase-table ${hbase_table1} --hbase-row-key ${row_key} --column-fam$
#sqoop import --connect ${rdbms_url} --username ${rdbms_username} --password ${rdbms_pwd} --table ${rdbms_table2} --hbase-table ${hbase_table2} --hbase-row-key ${row_key} --column-family ${column-family} --sp$
#sqoop import --connect ${rdbms_url} --username ${rdbms_username} --password ${rdbms_pwd} --table ${rdbms_table3} --hbase-table ${hbase_table3} --hbase-row-key ${row_key} --column-family ${column-family} --sp$
echo "等待批量任务完成"
         wait
echo "开始下一批导入"

Commands to run:

sqoop list-databases --connect jdbc:mysql://192.168.139.1:3306/whtmb --username xxxx --password xxxx1234

sqoop import --connect jdbc:mysql://192.168.139.1:3306/whtmb --username xxxx --password xxxx1234 --table 't_videodata_raw' --hbase-table 'transportation:t_videodata_raw' --hbase-row-key 'Rowkey' --column-family 'info' --split-by 'Rowkey'
