How to recover the cluster after a sudden power failure

Today during testing, data was being loaded into the cluster when the power suddenly went out.

After the machines came back up, I started HDFS with ./start-dfs.sh and checked the logs, which showed:

2012-04-13 15:39:43,208 INFO org.apache.hadoop.ipc.Server: IPC Server handler 9 on 8020, call rollEditLog() from 196.1.2.160:34939: error: org.apache.hadoop.hdfs.server.namenode.SafeModeException: Checkpoint not created. Name node is in safe mode.
The ratio of reported blocks 0.8544 has not reached the threshold 0.9990. Safe mode will be turned off automatically.
org.apache.hadoop.hdfs.server.namenode.SafeModeException: Checkpoint not created. Name node is in safe mode.
The ratio of reported blocks 0.8544 has not reached the threshold 0.9990. Safe mode will be turned off automatically.
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.rollEditLog(FSNamesystem.java:4584)
        at org.apache.hadoop.hdfs.server.namenode.NameNode.rollEditLog(NameNode.java:660)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:616)
        at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:508)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:961)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:957)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:416)
        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:955)

errors like this, and safe mode does not exit on its own.
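Before changing anything, you can confirm the state from the command line (a minimal check, assuming you are in ${hadoophome}/bin):

./hadoop dfsadmin -safemode get      # prints "Safe mode is ON" while the block ratio is below the threshold
./hadoop dfsadmin -report            # shows how many DataNodes have reported in and the visible capacity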

The reason is that the power failed while data was still being written, so some blocks were lost.

At this point, running ./hadoop fsck / shows, among other things:

 Corrupt blocks: 15

which means 15 blocks are corrupt.
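If you want to see which files those corrupt blocks belong to before deleting anything, fsck can print per-file detail (a sketch; the exact wording of the output differs between Hadoop versions):

./hadoop fsck / -files -blocks | grep -iE "corrupt|missing"
# each matching line names a file whose blocks can no longer be read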

These blocks cannot be recovered; the only option is to delete them.

To do that, go to ${hadoophome}/bin and run

./hadoop dfsadmin -safemode leave    to turn safe mode off.

Then run ./hadoop fsck / -delete to remove the files whose blocks are lost.

Then run ./stop-dfs.sh to shut HDFS down, and start it up again. That completes the HDFS recovery.
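Put together, the HDFS side of the recovery is just these commands, run from ${hadoophome}/bin (a sketch of the steps described above):

./hadoop dfsadmin -safemode leave    # force the NameNode out of safe mode
./hadoop fsck / -delete              # delete the files whose blocks were lost
./stop-dfs.sh                        # shut HDFS down
./start-dfs.sh                       # start it again; it should now leave safe mode by itself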

My cluster stores and retrieves its data through HBase, so if you start HBase directly at this point it will also complain that it cannot find its data:

2012-04-13 16:08:58,172 WARN org.apache.hadoop.hbase.util.FSUtils: Waited 85176ms for lease recovery on hdfs://gsmaster/hbase/.logs/gsdata1,60020,1334287545838/gsdata1%3A60020.1334291689722:org.apache.hadoop.hdfs.protocol.AlreadyBeingCreatedException: failed to create file /hbase/.logs/gsdata1,60020,1334287545838/gsdata1%3A60020.1334291689722 for DFSClient_hb_m_gsmaster:60000_1334304432374 on client 196.1.2.160, because this file is already being created by NN_Recovery on 196.1.2.161
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.recoverLeaseInternal(FSNamesystem.java:1202)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.recoverLease(FSNamesystem.java:1157)
        at org.apache.hadoop.hdfs.server.namenode.NameNode.recoverLease(NameNode.java:404)
        at sun.reflect.GeneratedMethodAccessor7.invoke(Unknown Source)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:616)
        at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:508)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:961)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:957)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:416)
        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:955)

When that happens, first kill the HMaster process, then go to ${hadoophome}/bin and look at the directories under /hbase:

 

-ROOT-     dir   2012-04-13 17:07   rwxr-xr-x   goldsoft   supergroup
.META.     dir   2012-04-13 17:07   rwxr-xr-x   goldsoft   supergroup
.logs      dir   2012-04-13 17:12   rwxr-xr-x   goldsoft   supergroup
.oldlogs   dir   2012-04-13 17:11   rwxr-xr-x   goldsoft   supergroup

Of these four directories, delete every one except .oldlogs, using commands of the form ./hadoop fs -rmr /hbase/-ROOT-.
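Concretely, that step can look like this (a sketch; finding the HMaster pid with jps is just one convenient way, and the paths assume the default /hbase root directory):

kill $(jps | awk '/HMaster/ {print $1}')   # stop the HBase master process first
./hadoop fs -rmr /hbase/-ROOT-
./hadoop fs -rmr /hbase/.META.
./hadoop fs -rmr /hbase/.logs
# keep /hbase/.oldlogs as it is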

Then start HBase again and wait for it to finish initializing, that is, until the log shows

2012-04-13 17:08:43,901 INFO org.apache.hadoop.hbase.master.HMaster: Master has completed initialization
Once that line appears, go to ${hbasehome}/bin and run ./hbase org.jruby.Main add_table.rb /hbase/<table name> for each table that needs to be re-added.
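For example, for a table whose data lives under /hbase/my_table (the table name here is made up for illustration), the call is:

cd ${hbasehome}/bin
./hbase org.jruby.Main add_table.rb /hbase/my_table
# repeat for every user table that has to be re-registered in .META.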

To guard against losing data, it is best to delete the source data only after the load into the cluster has been confirmed successful; only then is the data safe, because nobody can predict something like a power failure.
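A minimal sketch of that load-then-verify-then-delete habit, with made-up paths and hadoop fs -put standing in for whatever the real load step is:

SRC=/data/incoming/batch_001.dat     # hypothetical local source file
DST=/staging/batch_001.dat           # hypothetical HDFS landing path
./hadoop fs -put "$SRC" "$DST"
if ./hadoop fs -test -e "$DST"; then # remove the source only after HDFS confirms the file exists
  rm "$SRC"
fi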
