A few NIO-related details of HDFS are occasionally worth keeping in mind:
(1) It uses off-heap (direct) memory
Control direct memory buffer consumption by HBaseClient
https://issues.apache.org/jira/browse/HBASE-4956
standard hbase client, asynchbase client, netty and direct memory buffers
https://groups.google.com/forum/?fromgroups=#!topic/asynchbase/xFvHuniLI1c
I thought I'd take a moment to explain what I discovered while trying to track down serious problems with the regular (non-async) hbase client and Java's nio implementation.
We were having issues running out of direct memory and here's a stack trace which says it all:
java.nio.Buffer.<init>(Buffer.java:172)
java.nio.ByteBuffer.<init>(ByteBuffer.java:259)
java.nio.ByteBuffer.<init>(ByteBuffer.java:267)
java.nio.MappedByteBuffer.<init>(MappedByteBuffer.java:64)
java.nio.DirectByteBuffer.<init>(DirectByteBuffer.java:97)
java.nio.ByteBuffer.allocateDirect(ByteBuffer.java:288)
sun.nio.ch.Util.getTemporaryDirectBuffer(Util.java:155)
sun.nio.ch.IOUtil.write(IOUtil.java:37)
sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:334)
org.apache.hadoop.net.SocketOutputStream$Writer.performIO(SocketOutputStream.java:55)
org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:142)
org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:146)
org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:107)
java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:65)
java.io.BufferedOutputStream.flush(BufferedOutputStream.java:123)
java.io.DataOutputStream.flush(DataOutputStream.java:106)
org.apache.hadoop.hbase.ipc.HBaseClient$Connection.sendParam(HBaseClient.java:518)
org.apache.hadoop.hbase.ipc.HBaseClient.call(HBaseClient.java:751)
org.apache.hadoop.hbase.ipc.HBaseRPC$Invoker.invoke(HBaseRPC.java:257)
$Proxy11.getProtocolVersion(<Unknown Source>:Unknown line)
Here you can see that an HBaseClient request is flushing a stream which has a socket channel at the other end of it. HBase has decided not to use direct memory for its byte buffers which I thought was smart since they are difficult to manage. Unfortunately, behind the scenes the JDK is noticing the lack of direct memory buffer in the socket channel write call, and it is allocating a direct memory buffer on your behalf! The size of that direct memory buffer depends on the amount of data you want to write at that time, so if you are writing 1M of data, the JDK will allocate 1M of direct memory.
The same is done on the reading side as well. If you perform channel I/O with a non-direct memory buffer, the JDK will allocate a direct memory buffer for you. In the reading case it allocates a size that equals the amount of room left in the (non-direct) buffer you passed in to the read call. WTF!? That can be a very large value.
To make matters worse, the JDK caches these direct memory buffers in thread local storage and it caches not one, but three of these arbitrarily sized buffers. (Look in sun.nio.ch.Util.getTemporaryDirectBuffer and let me know if I have interpreted the code incorrectly.) So if you have a large number of threads talking to hbase you can find yourself overflowing with direct memory buffers that you have not allocated and didn't even know about.
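A quick way to see this hidden allocation for yourself is to watch the JDK's "direct" buffer pool around a channel write done with a heap ByteBuffer. The sketch below is only an illustration (it uses a Pipe rather than a real HBase connection, and the exact caching behaviour varies by JDK version; newer JDKs can also cap the per-thread cache with -Djdk.nio.maxCachedBufferSize), but the counters reported by BufferPoolMXBean should jump by roughly the size of the heap buffer being written:

import java.lang.management.BufferPoolMXBean;
import java.lang.management.ManagementFactory;
import java.nio.ByteBuffer;
import java.nio.channels.Pipe;

public class HiddenDirectBufferDemo {
    public static void main(String[] args) throws Exception {
        Pipe pipe = Pipe.open();
        pipe.sink().configureBlocking(false);          // don't block when the OS pipe buffer fills up

        ByteBuffer heapBuf = ByteBuffer.allocate(8 * 1024 * 1024);  // 8 MB *heap* buffer

        printDirectPool("before write");
        pipe.sink().write(heapBuf);                    // JDK copies through a temporary direct buffer
        printDirectPool("after write");
    }

    static void printDirectPool(String label) {
        for (BufferPoolMXBean pool : ManagementFactory.getPlatformMXBeans(BufferPoolMXBean.class)) {
            if ("direct".equals(pool.getName())) {
                System.out.printf("%s: count=%d, used=%d bytes%n",
                        label, pool.getCount(), pool.getMemoryUsed());
            }
        }
    }
}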
This issue is what caused us to check out the asynchbase client, which happily didn't have any of these problems. The reason is that asynchbase uses netty and netty knows the proper way of using direct memory buffers for I/O. The correct way is to use direct memory buffers in manageable sizes, 16k to 64k or something like that, for the purpose of invoking a read or write system call. Netty has algorithms for calculating the best size given a particular socket connection, based on the amount of data it seems to be able to read at once, etc. Netty reads the data from the OS using direct memory and copies that data into Java byte buffers.
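The following is a deliberately simplified sketch of that idea (it is not Netty code, and the 64 KB chunk size is just an assumption): all writes go through one reusable, bounded direct buffer, so the JDK never needs to allocate a payload-sized temporary direct buffer behind your back.

import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.WritableByteChannel;

// Push a large heap payload through a channel via one reusable, bounded
// direct buffer, instead of handing the channel the whole heap buffer at once.
public final class ChunkedDirectWriter {
    private final ByteBuffer chunk = ByteBuffer.allocateDirect(64 * 1024);

    public void write(WritableByteChannel ch, byte[] payload) throws IOException {
        int offset = 0;
        while (offset < payload.length) {
            chunk.clear();
            int len = Math.min(chunk.remaining(), payload.length - offset);
            chunk.put(payload, offset, len);
            chunk.flip();
            while (chunk.hasRemaining()) {
                ch.write(chunk);               // direct buffer: no hidden copy inside the JDK
            }                                  // (assumes a blocking channel)
            offset += len;
        }
    }
}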
Now you might be wondering why you don't just pass a regular Java byte array into the read/write calls, to avoid the copy from direct memory to java heap memory, and here's the story about that. Let's assume you're doing a file or socket read. There are two cases:
- If the amount being read is < 8k, it uses a native char array on the C stack for the read system call, and then copies the result into your Java buffer.
- If the amount being read is > 8k, the JDK calls malloc, does the read system call with that buffer, copies the result into your Java buffer, and then calls free.
The reason for this is that the compacting Java garbage collector might move your Java buffer while you're blocked in the read system call, and clearly that will not do. But if you are not aware of the malloc/free being called every time you perform a read larger than 8k, you might be surprised by the performance of that. Direct memory buffers were created to avoid the malloc/free on every read. You still need to copy, but you don't need to malloc/free every time.
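The read side can be handled the same way: keep one modest, reusable direct buffer, let the kernel fill it, and do the single copy onto the Java heap yourself. Again this is only a sketch with an arbitrarily chosen 64 KB buffer, not anyone's production code:

import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.ReadableByteChannel;

// Read into a reusable direct buffer and copy into the caller's heap array,
// rather than paying a malloc/free (or a temporary direct buffer) per read call.
public final class ChunkedDirectReader {
    private final ByteBuffer chunk = ByteBuffer.allocateDirect(64 * 1024);

    public int read(ReadableByteChannel ch, byte[] dst, int off, int len) throws IOException {
        chunk.clear();
        chunk.limit(Math.min(chunk.capacity(), len));
        int n = ch.read(chunk);                 // kernel fills the direct buffer
        if (n > 0) {
            chunk.flip();
            chunk.get(dst, off, n);             // one explicit copy onto the Java heap
        }
        return n;
    }
}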
People get into trouble with direct memory buffers because you cannot free them yourself when you know you are done. Instead you need to wait for the garbage collector to run and THEN for the finalizers to be executed. You can never tell when the GC will run and/or your finalizers will be run, so it's really a crap shoot. That's why the JDK caches these buffers (which it shouldn't be creating in the first place). And the larger your heap size, the less frequent the GCs. And actually, I saw some code in the JDK which called System.gc() manually when a direct memory buffer allocation failed, which is an absolute no-no. That might work with small heap sizes, but with large heap sizes a full System.gc() can take 15 or 20 seconds. We were trying to use the G1 collector, which allows for very large heap sizes without long GC delays, but those delays were occurring because some piece of code was manually running GC. When I disabled System.gc() with a command line option, we ran out of direct memory instead.
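The GC-bound lifecycle is easy to demonstrate: drop the last reference to a direct buffer and its off-heap memory is still counted until the buffer object is actually collected and its cleaner has run. The sketch below deliberately calls System.gc() (something you would not do in production), and whether the memory actually comes back depends on the JVM and collector in use. -XX:MaxDirectMemorySize caps this pool, and the command line option mentioned above is presumably -XX:+DisableExplicitGC.

import java.lang.management.BufferPoolMXBean;
import java.lang.management.ManagementFactory;
import java.nio.ByteBuffer;

public class DirectMemoryReclaimDemo {
    public static void main(String[] args) throws Exception {
        BufferPoolMXBean direct = null;
        for (BufferPoolMXBean pool : ManagementFactory.getPlatformMXBeans(BufferPoolMXBean.class)) {
            if ("direct".equals(pool.getName())) {
                direct = pool;
            }
        }
        if (direct == null) {
            return;                                    // no direct pool bean on this JVM
        }

        ByteBuffer buf = ByteBuffer.allocateDirect(128 * 1024 * 1024);   // 128 MB off-heap
        System.out.println("after allocate : " + direct.getMemoryUsed());

        buf = null;                                    // drop the only reference ...
        System.out.println("after dropping : " + direct.getMemoryUsed()); // ... still counted

        System.gc();                                   // for demonstration only
        Thread.sleep(1000);                            // give the cleaner a moment to run
        System.out.println("after gc       : " + direct.getMemoryUsed()); // usually back down
    }
}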
http://hllvm.group.iteye.com/group/topic/27945
http://www.ibm.com/developerworks/cn/java/j-nativememory-linux/
http://www.kdgregory.com/index.php?page=java.byteBuffer
https://issues.apache.org/jira/browse/HADOOP-8069
In the Server implementation, we write with maximum 8KB write() calls, to avoid a heap malloc inside the JDK's SocketOutputStream implementation (with less than 8K it uses a stack buffer instead).
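A minimal sketch of the same trick, assuming plain OutputStream writes (this mirrors the idea from HADOOP-8069, not the actual Hadoop code): cap each write() call at 8 KB so the JDK's socket stream can stay on its stack buffer instead of malloc'ing a payload-sized native buffer.

import java.io.IOException;
import java.io.OutputStream;

public final class BoundedWrites {
    private static final int MAX_CHUNK = 8 * 1024;     // stay under the JDK's 8 KB stack-buffer limit

    public static void write(OutputStream out, byte[] data) throws IOException {
        for (int off = 0; off < data.length; off += MAX_CHUNK) {
            out.write(data, off, Math.min(MAX_CHUNK, data.length - off));
        }
    }
}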
(2) It uses a relatively large number of file descriptors (fds)
http://www.zeroc.com/forums/bug-reports/4221-possible-file-handle-leaks.html
http://code.alibabatech.com/blog/experience_766/danga_memcached_nio_leak.html
https://issues.apache.org/jira/browse/HADOOP-4346
According to https://issues.apache.org/jira/browse/HADOOP-2346, a selector takes up 3 fds: 2 for a pipe (used for wakeup(), I guess) and 1 for epoll().
$ jps
30255 DataNode
14118 Jps
$ lsof -p 30255 | wc -l
35163
$ lsof -p 30255 | grep TCP | wc -l
8117
$ lsof -p 30255 | grep pipe | wc -l
16994
$ lsof -p 30255 | grep eventpoll | wc -l
8114
8117 + 8114 + 16994 = 33225
$ jstack 30255 | grep org.apache.hadoop.hdfs.server.datanode.DataXceiver.run | wc -l
8115
On our test cluster the DataNode showed a large number of pipe and eventpoll fds.
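To convince yourself that the pipe and eventpoll entries really do come from selectors, a rough Linux-only check is to open a handful of selectors in a scratch program and watch /proc/self/fd grow by about 3 per Selector (2 pipe ends for wakeup() plus one epoll instance). This is just a sanity-check sketch, not how the DataNode itself manages its selectors:

import java.io.File;
import java.nio.channels.Selector;
import java.util.ArrayList;
import java.util.List;

public class SelectorFdDemo {
    public static void main(String[] args) throws Exception {
        System.out.println("fds before: " + openFds());
        List<Selector> selectors = new ArrayList<Selector>();
        for (int i = 0; i < 10; i++) {
            selectors.add(Selector.open());            // each open() costs roughly 3 fds on Linux
        }
        System.out.println("fds after : " + openFds());
        for (Selector s : selectors) {
            s.close();                                 // unclosed selectors leak the pipe/epoll fds
        }
        System.out.println("fds closed: " + openFds());
    }

    private static int openFds() {
        String[] fds = new File("/proc/self/fd").list();   // Linux-specific
        return fds == null ? -1 : fds.length;
    }
}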
http://search-hadoop.com/m/JIkTGc7w1S1/+%2522Too+many+open+files%2522+error%252C+which+gets+resolved+after+some+time&subj=+Too+many+open+files+error+which+gets+resolved+after+some+time
For writes, there is an extra thread waiting on i/o, so it would be 3 fds more. To simplify the earlier equation, on the client side:
for writes : max fds (for io bound load) = 7 * #write_streams
for reads : max fds (for io bound load) = 4 * #read_streams
The main socket is cleared as soon as you close the stream.
The rest of the fds stay for 10 sec (they get reused if you open more streams in the meantime).
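If you want to check your own client against these estimates without running lsof, the JVM can report its fd usage directly, assuming a HotSpot/OpenJDK JVM on a Unix-like OS where the OS MXBean implements com.sun.management.UnixOperatingSystemMXBean:

import java.lang.management.ManagementFactory;
import java.lang.management.OperatingSystemMXBean;

public class FdUsage {
    public static void main(String[] args) {
        OperatingSystemMXBean os = ManagementFactory.getOperatingSystemMXBean();
        if (os instanceof com.sun.management.UnixOperatingSystemMXBean) {
            com.sun.management.UnixOperatingSystemMXBean unix =
                    (com.sun.management.UnixOperatingSystemMXBean) os;
            System.out.println("open fds: " + unix.getOpenFileDescriptorCount()
                    + " / max: " + unix.getMaxFileDescriptorCount());
        }
    }
}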
We noticed there were a lot of HFiles; after deleting some unused files:
$ lsof -p 30255 | grep pipe | wc -l
982
$ lsof -p 30255 | wc -l
3141
$ jstack 30255 | grep org.apache.hadoop.hdfs.server.datanode.DataXceiver.run | wc -l
139