1. ZooKeeper error
2017-12-13 16:47:55,968 [myid:] - INFO [main-SendThread(localhost:2181):ClientCnxn$SendThread@975] - Opening socket connection to server localhost/127.0.0.1:2181. Will not attempt to authenticate using SASL (unknown error)
2017-12-13 16:47:55,968 [myid:] - WARN [main-SendThread(localhost:2181):ClientCnxn$SendThread@1102] - Session 0x0 for server null, unexpected error, closing socket connection and attempting reconnect
java.net.ConnectException: Connection refused
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:350)
at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1081)
Cause: the ZooKeeper node was down. Starting it again resolves the error.
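Before restarting anything, it can help to confirm that nothing is listening on the ZooKeeper port, which is what "Connection refused" suggests. The following is a minimal sketch, not part of the original notes: the helper name `is_port_open` is hypothetical, and `localhost:2181` is simply the address taken from the log above.

```python
import socket


def is_port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        # create_connection raises OSError (e.g. ConnectionRefusedError) on failure.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    # Address taken from the log; adjust for your own ensemble.
    print("ZooKeeper reachable:", is_port_open("localhost", 2181))
```

A result of False only says the TCP port is closed or unreachable. It does not say why the ZooKeeper process stopped.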
2. Kafka consumer error: Job aborted due to stage failure: kafka.common.OffsetOutOfRangeException
Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 (TID 0, localhost): kafka.common.OffsetOutOfRangeException
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
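Kafka message retention is log.retention.hours=168 (7 days).
Fix: the offset recorded for the consumer group is older than the earliest message Kafka still stores. The blog posts listed under References explain this in more detail.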
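The condition behind this exception can be written as a simple range check. This is a sketch under stated assumptions: the function names are hypothetical, and the committed offset in the example is made up for illustration. Only 3546232 comes from these notes.

```python
def is_out_of_range(committed: int, earliest: int, latest: int) -> bool:
    """True if the committed offset falls outside what the broker still holds."""
    return committed < earliest or committed > latest


def clamp_offset(committed: int, earliest: int, latest: int) -> int:
    """Move an out-of-range offset to the nearest valid one."""
    return max(earliest, min(committed, latest))


if __name__ == "__main__":
    earliest, latest = 3546232, 4000000   # latest is an illustrative value
    committed = 3000000                   # illustrative: older than the earliest message
    print(is_out_of_range(committed, earliest, latest))
    print(clamp_offset(committed, earliest, latest))
```

Clamping to the earliest offset matches the fix described here: the consumer resumes from the oldest message that still exists, and the expired messages in between are lost.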
Get the minimum (earliest) offset of the topic mysqlslowlog (replace topic_name with the topic name):
./kafka-run-class.sh kafka.tools.GetOffsetShell --broker-list=node:9092 --topic topic_name --time -2
Get the maximum (latest) offset of the topic mysqlslowlog:
./kafka-run-class.sh kafka.tools.GetOffsetShell --broker-list=node:9092 --topic topic_name --time -1
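To use these values in a script, the command output has to be parsed. The sketch below assumes GetOffsetShell prints one `topic:partition:offset` line per partition. That format is an assumption to verify against your Kafka version, and `parse_offsets` is a hypothetical helper name.

```python
from typing import Dict


def parse_offsets(output: str) -> Dict[int, int]:
    """Map partition id -> offset from lines shaped like 'topic:partition:offset'."""
    offsets: Dict[int, int] = {}
    for line in output.splitlines():
        line = line.strip()
        if not line:
            continue
        # Split from the right so a topic name containing ':' cannot break parsing.
        _topic, partition, offset = line.rsplit(":", 2)
        if not offset:
            continue  # no offset reported for this partition
        # If several offsets are listed (comma-separated), keep the first one.
        offsets[int(partition)] = int(offset.split(",")[0])
    return offsets


if __name__ == "__main__":
    sample = "mysqlslowlog:0:3546232\nmysqlslowlog:1:3546100\n"  # partition 1 is illustrative
    print(parse_offsets(sample))
```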
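Update the topic partition's offset in ZooKeeper (zk):
# Check the offset stored for partition 0
get /rootdir/consumers/[consumer_group]/offsets/mysqlslowlog/0
# Set the offset of partition 0 to the minimum value
set /rootdir/consumers/[consumer_group]/offsets/mysqlslowlog/0 3546232
Alternatively, the following command resets all partitions to the minimum value in one go (the tool also takes the consumer properties file and the topic as arguments):
./kafka-run-class.sh kafka.tools.UpdateOffsetsInZK earliest consumer.properties mysqlslowlog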
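When a topic has many partitions, the per-partition `set` commands can be generated rather than typed by hand. A minimal sketch, assuming the ZooKeeper path layout shown above. `build_set_commands` and the group name `my_group` are hypothetical, and the output is meant to be reviewed and then pasted into the ZooKeeper CLI.

```python
from typing import Dict, List


def build_set_commands(root: str, group: str, topic: str,
                       earliest_by_partition: Dict[int, int]) -> List[str]:
    """Build one zkCli 'set' command per partition, ordered by partition id."""
    return [
        f"set {root}/consumers/{group}/offsets/{topic}/{partition} {offset}"
        for partition, offset in sorted(earliest_by_partition.items())
    ]


if __name__ == "__main__":
    # Partition 0 uses the value from the notes; the group name is a placeholder.
    for cmd in build_set_commands("/rootdir", "my_group", "mysqlslowlog", {0: 3546232}):
        print(cmd)
```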
References:
http://blog.csdn.net/xueba207/article/details/51135423
http://blog.csdn.net/xueba207/article/details/51174818
3. Error when restarting an HBase RegionServer node:
Server ...,1514436003346 has been rejected; Reported time is too far out of sync with master. Time difference of 136758ms > max allowed of 30000ms
This is usually caused by the clocks of the HMaster node and the RegionServer node being out of sync. Synchronize the time and restart the node.
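The check the master applies boils down to comparing the clock difference with a limit. The sketch below reproduces that arithmetic with the numbers from the error message. The function names and the example timestamps are hypothetical. The 30000 ms limit is copied from the message and is believed to correspond to the `hbase.master.maxclockskew` setting.

```python
MAX_SKEW_MS = 30000  # "max allowed" value copied from the error message


def skew_ms(master_time_ms: int, regionserver_time_ms: int) -> int:
    """Absolute clock difference between the two nodes, in milliseconds."""
    return abs(master_time_ms - regionserver_time_ms)


def would_be_rejected(master_time_ms: int, regionserver_time_ms: int,
                      max_skew_ms: int = MAX_SKEW_MS) -> bool:
    """True if the difference exceeds the allowed skew."""
    return skew_ms(master_time_ms, regionserver_time_ms) > max_skew_ms


if __name__ == "__main__":
    master = 1514436000000           # illustrative epoch milliseconds
    regionserver = master + 136758   # difference reported in the error
    print(would_be_rejected(master, regionserver))
```

Once the clocks are synchronized (for example via NTP), the difference should fall well under the limit and the RegionServer can register again.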
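4. Decommissioning an HDFS DataNode: the DataNode stays in the Decommission In Progress state
Check in the web UI:
# Blocks below the required replication factor
Under replicated blocks :2979
# Blocks with no replicas
Blocks with no live replicas: 0
# Blocks below the required replication factor that are still being written
Under Replicated Blocks In files under construction:1
Alternatively, check the DataNode status with ./bin/hdfs dfsadmin -report.
The replication factor is 2. Under replicated blocks keeps falling, and once it reaches 0 the node should be fully decommissioned.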
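To watch the counter fall without refreshing the web page, the number can be pulled out of saved report text. A minimal sketch: `under_replicated_blocks` is a hypothetical helper, and it assumes the text contains a line shaped like the "Under replicated blocks" line quoted above.

```python
import re
from typing import Optional

# Matches "Under replicated blocks :2979" but not the
# "Under Replicated Blocks In files under construction" line.
_PATTERN = re.compile(r"Under replicated blocks\s*:\s*(\d+)", re.IGNORECASE)


def under_replicated_blocks(report: str) -> Optional[int]:
    """Return the under-replicated block count, or None if the line is missing."""
    match = _PATTERN.search(report)
    return int(match.group(1)) if match else None


if __name__ == "__main__":
    sample = (
        "Under replicated blocks :2979\n"
        "Blocks with no live replicas: 0\n"
        "Under Replicated Blocks In files under construction:1\n"
    )
    print(under_replicated_blocks(sample))
```

Decommissioning is expected to finish once this value reaches 0.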
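In addition, because DataNodes in the same rack usually hold one of the replicas, the DataNode can be taken offline faster by lowering the replication factor:
# Check the cluster state
./bin/hadoop fsck / -blocks -locations -files
# Change the replication factor (only do this when Blocks with no live replicas is 0)
./bin/hadoop fs -setrep -R 1 /
# Stop the DataNode
./sbin/hadoop-daemon.sh stop datanode
# Remove the node from the slaves list and the rack list
# Run refreshNodes, or restart the NameNodes one at a time
./bin/hdfs dfsadmin -refreshNodes
./bin/yarn rmadmin -refreshNodes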
5. Error while removing an HDFS DataNode
Failed to add xxxxxxxx:50010: You cannot have a rack and a non-rack node at the same level of the network topology.
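Fix:
Check the rack list with ./bin/hdfs dfsadmin -printTopology
Refresh:
./bin/hdfs dfsadmin -refreshNodes
./bin/yarn rmadmin -refreshNodes
This did not help:
(1) the web page still showed the DataNode with status dead;
(2) the error "You cannot have a rack and a non-rack node at the same level of the network topology." was still reported.
Restarting the NameNodes one at a time made the change take effect:
./sbin/hadoop-daemon.sh stop namenode
./sbin/hadoop-daemon.sh start namenode
Then check the rack information with
./bin/hdfs dfsadmin -printTopology
The node that should have been removed is no longer listed.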
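One way to spot a likely cause of this error is to check whether the racks in the -printTopology output sit at different depths, for example /default-rack next to /dc1/rack1. The sketch below is a heuristic, not the NameNode's own check. It assumes the output consists of "Rack: <path>" lines followed by indented node lines. The function names and the sample addresses are hypothetical.

```python
from typing import Dict, List


def parse_topology(output: str) -> Dict[str, List[str]]:
    """Group node lines under the 'Rack: <path>' line that precedes them."""
    racks: Dict[str, List[str]] = {}
    current = None
    for raw in output.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith("Rack:"):
            current = line[len("Rack:"):].strip()
            racks.setdefault(current, [])
        elif current is not None:
            racks[current].append(line)
    return racks


def rack_depths(racks: Dict[str, List[str]]) -> Dict[str, int]:
    """Number of path components per rack, e.g. '/dc1/rack1' -> 2."""
    return {rack: len([part for part in rack.split("/") if part]) for rack in racks}


def has_mixed_depths(racks: Dict[str, List[str]]) -> bool:
    """True if racks are defined at more than one depth."""
    return len(set(rack_depths(racks).values())) > 1


if __name__ == "__main__":
    sample = (
        "Rack: /default-rack\n"
        "   10.0.0.1:50010 (node1)\n"
        "\n"
        "Rack: /dc1/rack1\n"
        "   10.0.0.2:50010 (node2)\n"
    )
    racks = parse_topology(sample)
    print(rack_depths(racks))
    print(has_mixed_depths(racks))
```

If the depths are mixed, a node that falls back to the default rack is a plausible culprit, and the rack mapping for that node is worth checking before restarting the NameNode.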