A write-up of real problems hit on our big data platform. The cluster runs CDH 5.8, and the issues cover Hadoop, Spark, Hive, Kafka, HBase, Phoenix, Impala, Sqoop, and CDH itself. This is a first pass over recent problems and will be updated from time to time.
NodeManager fails to start
2016-09-07 14:28:46,434 FATAL org.apache.hadoop.yarn.server.nodemanager.NodeManager: Error starting NodeManager
org.apache.hadoop.yarn.exceptions.YarnRuntimeException: java.net.BindException: Problem binding to [0.0.0.0:8040]
java.net.BindException: Address already in use; For more details see: http://wiki.apache.org/hadoop/BindException
at org.apache.hadoop.yarn.factories.impl.pb.RpcServerFactoryPBImpl.getServer(RpcServerFactoryPBImpl.java:139)
at org.apache.hadoop.yarn.ipc.HadoopYarnProtoRPC.getServer(HadoopYarnProtoRPC.java:65)
at org.apache.hadoop.yarn.ipc.YarnRPC.getServer(YarnRPC.java:54)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService.createServer(ResourceLocalizationService.java:278)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService.serviceStart(ResourceLocalizationService.java:258)
at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
at org.apache.hadoop.service.CompositeService.serviceStart(CompositeService.java:121)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl.serviceStart(ContainerManagerImpl.java:293)
at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
at org.apache.hadoop.service.CompositeService.serviceStart(CompositeService.java:121)
at org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceStart(NodeManager.java:199)
at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
at org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartNodeManager(NodeManager.java:339)
Cause: port 8040 was already taken. Running netstat -apn | grep 8040 showed a leftover NodeManager process still running.
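A quick way to confirm and clear the port (the PID is whatever the first command reports, not a fixed value):
netstat -apn | grep 8040    # note the PID of the process holding the port
kill <pid>                  # stop the stale NodeManager, then start the role again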
Problems deploying Spark on YARN
Started from the command line with spark-shell --master yarn:
org.apache.spark.SparkException: Yarn application has already ended! It might have been killed or unable to launch application master.
at org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.waitForApplication(YarnClientSchedulerBackend.scala:124)
at org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.start(YarnClientSchedulerBackend.scala:64)
at org.apache.spark.scheduler.TaskSchedulerImpl.start(TaskSchedulerImpl.scala:144)
at org.apache.spark.SparkContext.<init>(SparkContext.scala:530)
at org.apache.spark.repl.SparkILoop.createSparkContext(SparkILoop.scala:1017)
at $iwC$$iwC.<init>(<console>:15)
at $iwC.<init>(<console>:24)
at <init>(<console>:26)
at .<init>(<console>:30)
at .<clinit>(<console>)
at .<init>(<console>:7)
at .<clinit>(<console>)
at $print(<console>)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
The ApplicationMaster log contained only:
Starting the Spark SQL server
16/09/08 18:45:26 INFO service.AbstractService: Service:HiveServer2 is started.
16/09/08 18:45:26 INFO thriftserver.HiveThriftServer2: HiveThriftServer2 started
16/09/08 18:45:26 ERROR thrift.ThriftCLIService: Error starting HiveServer2: could not start ThriftBinaryCLIService
org.apache.thrift.transport.TTransportException: Could not create ServerSocket on address /172.16.0.43:10002.
at org.apache.thrift.transport.TServerSocket.<init>(TServerSocket.java:109)
at org.apache.thrift.transport.TServerSocket.<init>(TServerSocket.java:91)
at org.apache.thrift.transport.TServerSocket.<init>(TServerSocket.java:87)
at org.apache.hive.service.auth.HiveAuthFactory.getServerSocket(HiveAuthFactory.java:241)
at org.apache.hive.service.cli.thrift.ThriftBinaryCLIService.run(ThriftBinaryCLIService.java:66)
at java.lang.Thread.run(Thread.java:722)
16/09/08 18:45:26 INFO server.HiveServer2: Shutting down HiveServer2
16/09/08 18:45:26 INFO service.AbstractService: Service:ThriftBinaryCLIService is stopped.
16/09/08 18:45:26 INFO service.AbstractService: Service:OperationManager is stopped.
16/09/08 18:45:26 INFO service.AbstractService: Service:SessionManager is stopped.
16/09/08 18:45:26 INFO service.AbstractService: Service:CLIService is stopped.
Cause: a port conflict with Hive's own HiveServer2 process.
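One way out is to move the Spark Thrift server to a free port via hive.server2.thrift.port. A minimal sketch (10010 is an arbitrary free port; adjust the bind host to your node):
./sbin/start-thriftserver.sh \
  --master yarn \
  --hiveconf hive.server2.thrift.port=10010 \
  --hiveconf hive.server2.thrift.bind.host=172.16.0.43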
Installing Impala
Starting the impala-server service: [hadoop@master impala]$ sudo service impala-server status
Log line format: [IWEF]mmdd hh:mm:ss.uuuuuu threadid file:line] msg
E0913 15:12:06.521616 5070 logging.cc:103] stderr will be logged to this file.
log4j:WARN No appenders could be found for logger (org.apache.hadoop.metrics2.lib.MutableMetricsFactory).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
E0913 15:12:10.286190 5070 impala-server.cc:210] Could not read the HDFS root directory at hdfs://master:50070. Error was:
Failed on local exception: com.google.protobuf.InvalidProtocolBufferException: Protocol message end-group tag did not match expected tag.; Host Details : local host is: "master/172.16.0.39"; destination host is: "master":50070;
E0913 15:12:10.286229 5070 impala-server.cc:212] Aborting Impala Server startup due to improper configuration
Cause: a version problem. (Note also that the URI hdfs://master:50070 in the log points at the NameNode web UI port; the HDFS RPC address that clients should use defaults to port 8020.)
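A quick sanity check on the filesystem URI Impala should be pointed at:
hdfs getconf -confKey fs.defaultFS    # prints the URI HDFS itself advertises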
When installing software with yum install, yum kept picking up the old jar packages.
Cause: stale yum cache.
Fix: run yum clean all to clear the cache.
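A typical recovery sequence (the package name is a placeholder):
yum clean all            # drop cached metadata and packages
yum makecache            # rebuild the metadata cache
yum install <package>    # retry the install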
Deploying the production CDH 5.8 cluster, the installer got stuck on "Acquiring installation lock". The first installation attempt had failed, and this problem appeared on the re-install.
Cause: SCM files left over from the previous install attempt.
Fix: go into /tmp, list everything with ls -a, and delete the scm_prepare_node.* files along with the .scm_prepare_node.lock file.
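The cleanup as commands:
cd /tmp
ls -a                           # spot the leftover SCM files
rm -rf scm_prepare_node.*       # remove the leftover prepare files
rm -f .scm_prepare_node.lock    # remove the stale lock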
Reverse DNS resolution error: the Cloudera Manager Server hostname cannot be resolved correctly
Detecting Cloudera Manager Server...
BEGIN host -t PTR 172.16.0.240
240.0.16.172.in-addr.arpa domain name pointer localhost.
END (0)
using localhost as scm server hostname
BEGIN which python
/usr/bin/python
END (0)
BEGIN python -c 'import socket; import sys; s = socket.socket(socket.AF_INET); s.settimeout(5.0); s.connect((sys.argv[1], int(sys.argv[2]))); s.close();' localhost 7182
Traceback (most recent call last):
File "", line 1, in
File "", line 1, in connect
socket.error: [Errno 111] Connection refused
END (1)
could not contact scm server at localhost:7182, giving up
waiting for rollback request
Fix: move aside the /usr/bin/host binary on the machines that cannot connect: mv /usr/bin/host /usr/bin/host.bak
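Before (or instead of) hiding the binary, it is worth checking what reverse DNS actually returns for the CM server IP from the log above:
host -t PTR 172.16.0.240     # should name the CM server, not localhost
getent hosts 172.16.0.240    # what the resolver library itself sees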
Hadoop fails to start after installation
java.io.IOException: the path component: '/data' is world-writable. Its permissions are 0777. Please fix this or select a different socket path
Cause: the DataNode's root data directory was set to 0777, which is too permissive and insecure.
Fix: change the permissions to 755 (or back to the defaults).
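The fix as commands:
chmod 755 /data    # tighten the world-writable parent directory
ls -ld /data       # verify: should now show drwxr-xr-x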
Importing MySQL data into HBase with Sqoop, columns of MySQL type tinyint arrived in HBase as true/false.
Cause: a bad type conversion on the Sqoop side: the MySQL JDBC driver maps TINYINT(1) to boolean by default, and Sqoop carries that through.
Fix: append tinyInt1isBit=false to the JDBC URL.
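A sketch of what the import then looks like (host, database, table, and credentials below are placeholders):
sqoop import \
  --connect 'jdbc:mysql://dbhost:3306/mydb?tinyInt1isBit=false' \
  --username etl -P \
  --table orders \
  --hbase-table orders --column-family cf --hbase-row-key id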
Hive fails when executing a join
Statement:
use hivedb;
insert into table report.sitef2_tmp_payment_hbase
select concat_ws('_', member0.site_id, substr(trade0.pay_time, 0, 10), substr(trade0.pay_time, 0, 10)),
       member0.site_id,
       substr(trade0.pay_time, 0, 10) pay_time,
       count(*) trade_num,
       sum(trade0.curry_amount) trade_volume
from trade_trade_payments trade0
left join member_member member0
  on (trade0.pay_status = '20' and trade0.sup_id = member0.member_id)
group by substr(trade0.pay_time, 0, 10), member0.site_id
return code 3 from MapredLocalTask
Execution failed with exit status: 3
Obtaining error information
Task failed!
Task ID:
Stage-4
Logs:
/tmp/root/hive.log
FAILED: Execution Error, return code 3 from org.apache.hadoop.hive.ql.exec.mr.MapredLocalTask
Cause: when the join involves the HBase-mapped table, Hive auto-converts it into a local map join (MapredLocalTask), and that conversion fails.
Fix: set hive.auto.convert.join = false;
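To apply the setting just for this job from the shell (report_job.hql is a hypothetical file holding the INSERT ... SELECT above):
hive --hiveconf hive.auto.convert.join=false -f report_job.hql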
Chinese characters in table comments come out garbled when creating Hive tables.
Fix: change the character-set settings on the Hive metastore tables:
// field comment charset
alter table COLUMNS_V2 modify column COMMENT varchar(256) character set utf8;
// table comment charset (table comments are stored in TABLE_PARAMS.PARAM_VALUE)
alter table TABLE_PARAMS modify column PARAM_VALUE varchar(4000) character set utf8;
// partition comment charset
alter table PARTITION_KEYS modify column PKEY_COMMENT varchar(4000) character set utf8;
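These statements run against the metastore's backing MySQL database (the database name hive below is an assumption; use whatever your metastore database is called):
mysql -u root -p -D hive    # then paste the ALTER statements above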
The CDH cluster keeps dipping into swap, and Impala in particular becomes slow when it swaps:
Fix: tune how aggressively the kernel uses swap
http://blog.csdn.net/huaishu/article/details/8762957
[root@rhce ~]# sysctl vm.swappiness=10
vm.swappiness = 10
[root@rhce ~]# cat /proc/sys/vm/swappiness
10
The change takes effect immediately, but after a reboot the value reverts to 60.
-- To make it permanent:
add the following parameter to /etc/sysctl.conf:
vm.swappiness=10
or:
[root@rhce ~]# echo 'vm.swappiness=10' >>/etc/sysctl.conf
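After editing the file, the setting can be loaded without a reboot:
sysctl -p                      # reload /etc/sysctl.conf
cat /proc/sys/vm/swappiness    # confirm: should print 10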
Installing Kafka: installed directly from the official parcel address (https://archive.cloudera.com/kafka/parcels/latest/); downloaded and activated the parcel.
After installation, the broker failed to start.
Cause: broker_max_heap_size was too small. CDH set it to 256 MB after installation, although the usual default is 1 GB.
Fix: increase broker_max_heap_size. I raised it to 512 MB and the broker started successfully; raise it further as your workload requires.
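To confirm the heap the running broker actually got, look for the -Xmx flag on its JVM:
ps aux | grep '[k]afka.Kafka' | tr ' ' '\n' | grep Xmx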
Installing Phoenix on CDH; Phoenix fails to start
Apache Phoenix unable to connect to HBase
Starting ./sqlline.py master1:2181/kafka reports:
16/10/24 18:42:47 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Error: org.apache.hadoop.hbase.DoNotRetryIOException: Class org.apache.phoenix.coprocessor.MetaDataRegionObserver cannot be loaded Set hbase.table.sanity.checks to false at conf or table descriptor if you want to bypass sanity checks
at org.apache.hadoop.hbase.master.HMaster.warnOrThrowExceptionForFailure(HMaster.java:1707)
at org.apache.hadoop.hbase.master.HMaster.sanityCheckTableDescriptor(HMaster.java:1568)
at org.apache.hadoop.hbase.master.HMaster.createTable(HMaster.java:1497)
at org.apache.hadoop.hbase.master.MasterRpcServices.createTable(MasterRpcServices.java:468)
at org.apache.hadoop.hbase.protobuf.generated.MasterProtos$MasterService$2.callBlockingMethod(MasterProtos.java:55682)
at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2170)
at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:109)
at org.apache.hadoop.hbase.ipc.RpcExecutor.consumerLoop(RpcExecutor.java:133)
at org.apache.hadoop.hbase.ipc.RpcExecutor$1.run(RpcExecutor.java:108)
at java.lang.Thread.run(Thread.java:745) (state=08000,code=101)
org.apache.phoenix.exception.PhoenixIOException: org.apache.hadoop.hbase.DoNotRetryIOException: Class org.apache.phoenix.coprocessor.MetaDataRegionObserver cannot be loaded Set hbase.table.sanity.checks to false at conf or table descriptor if you want to bypass sanity checks
Cause: during the Phoenix install, the Phoenix server jar was only copied into HBase's lib directory; HBase was never restarted, so it never loaded the Phoenix classes.
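The deploy step, sketched with a parcel-style path (an assumption; adjust to your layout), followed by the restart that was missed:
cp phoenix-*-server.jar /opt/cloudera/parcels/CDH/lib/hbase/lib/
# then restart HBase (Master and all RegionServers) from Cloudera Manager so the
# coprocessor classes such as MetaDataRegionObserver are actually loaded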
Connecting to a remote Kafka by IP address fails:
Command: kafka-console-consumer --zookeeper 172.16.0.242:2181/kafka --topic wk_wordlibrary_title --from-beginning
[2016-10-26 14:39:57,949] WARN Fetching topic metadata with correlation id 12 for topics [Set(wk_wordlibrary_title)] from broker [BrokerEndPoint(92,node1,9092)] failed (kafka.client.ClientUtils$)
java.nio.channels.ClosedChannelException
at kafka.network.BlockingChannel.send(BlockingChannel.scala:110)
at kafka.producer.SyncProducer.liftedTree1$1(SyncProducer.scala:76)
at kafka.producer.SyncProducer.kafka$producer$SyncProducer$$doSend(SyncProducer.scala:75)
at kafka.producer.SyncProducer.send(SyncProducer.scala:120)
at kafka.client.ClientUtils$.fetchTopicMetadata(ClientUtils.scala:58)
at kafka.client.ClientUtils$.fetchTopicMetadata(ClientUtils.scala:93)
at kafka.consumer.ConsumerFetcherManager$LeaderFinderThread.doWork(ConsumerFetcherManager.scala:66)
at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:63)
[2016-10-26 14:39:57,949] WARN [console-consumer-89764_t2-1477463994654-bcfb2442-leader-finder-thread], Failed to find leader for Set([wk_wordlibrary_title,0]) (kafka.consumer.ConsumerFetcherManager$LeaderFinderThread)
Cause: the metadata the broker hands back to the client uses the machine name (node1) rather than the IP. The client cannot resolve that name, hence the exception above.
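Two common remedies for this generation of Kafka (the mapping below assumes node1 really is 172.16.0.242):
echo '172.16.0.242  node1' >> /etc/hosts    # client side: make the hostname resolvable
# broker side alternative: have the broker advertise an address clients can reach,
# e.g. advertised.host.name=172.16.0.242 in the broker configuration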
Another department uses HDFS for storage; their usage was heavy and filled the disks (replication factor 3).
Fix: first set the HDFS replication configuration to 1, then run hdfs dfs -setrep -R 1 /tmp1/logs to lower the replica count of the existing files in that directory and free space; when writing new files, set dfs.replication=1 on the command itself (hadoop dfs -D dfs.replication=1 -put 70M logs/2). Reference: http://blog.csdn.net/lskyne/article/details/8898666
HBase RegionServer on master1 fails to start; error log:
ABORTING region server master1,60020,1478833728358: Unhandled: org.apache.hadoop.hbase.ClockOutOfSyncException: Server master1,60020,1478833728358 has been rejected; Reported time is too far out of sync with master. Time difference of 36240ms > max allowed of 30000ms
at org.apache.hadoop.hbase.master.ServerManager.checkClockSkew(ServerManager.java:401)
at org.apache.hadoop.hbase.master.ServerManager.regionServerStartup(ServerManager.java:267)
at org.apache.hadoop.hbase.master.MasterRpcServices.regionServerStartup(MasterRpcServices.java:366)
at org.apache.hadoop.hbase.protobuf.generated.RegionServerStatusProtos$RegionServerStatusService$2.callBlockingMethod(RegionServerStatusProtos.java:8615)
at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2170)
at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:109)
at org.apache.hadoop.hbase.ipc.RpcExecutor.consumerLoop(RpcExecutor.java:133)
at org.apache.hadoop.hbase.ipc.RpcExecutor$1.run(RpcExecutor.java:108)
at java.lang.Thread.run(Thread.java:745)
Cause: the clock on master1 was out of sync (36 s of skew against the master's 30 s limit).
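A resync sketch, assuming an ntpd setup (0.pool.ntp.org is a placeholder; use your own NTP source):
service ntpd stop
ntpdate 0.pool.ntp.org    # one-shot clock sync
service ntpd start        # keep it in sync from now on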
The Spark SQL server (thriftserver.HiveThriftServer2) dies while executing jobs
Error: ERROR executor.CoarseGrainedExecutorBackend: Driver 172.16.0.241:43286 disassociated! Shutting down.
INFO util.ShutdownHookManager: Shutdown hook called
INFO util.ShutdownHookManager: Deleting directory /data/cdh/yarn/nm/usercache/root/appcache/application_1478760707461_0459/spark-ab6b5ac2-0d5c-4e10-90a1-75b948596b7c
INFO remote.RemoteActorRefProvider$RemotingTerminator: Remoting shut down.
ERROR executor.CoarseGrainedExecutorBackend: RECEIVED SIGNAL 15: SIGTERM
Cause: the messages themselves do not say what went wrong, but in the end it came down to memory (the executor was killed with SIGTERM).
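A hedged starting point is simply to give the Thrift server more headroom; the values below are illustrative, not tuned:
./sbin/start-thriftserver.sh \
  --master yarn \
  --driver-memory 4g \
  --executor-memory 4g \
  --conf spark.yarn.executor.memoryOverhead=1024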