Advantages of Kosmix's KFS vs. HDFS

October 02, 2007

I was excited to learn last week that my friends at Kosmix have decided to open source a project long in the works: the Kosmix Distributed File System, or KFS (see the official blog post). A number of people have commented on this release, including Ethan Stock of zVents, who plans to use KFS along with their HyperTable clone of BigTable, and Rich Skrenta, who gives an excellent list of features of KFS.

Now, as a dumb product manager, my biggest questions were about KFS vs. HDFS, the distributed file system built by the Hadoop project. Powerset already makes extensive use of the Hadoop stack, including HDFS. So I asked Sriram Rao, the lead engineer of KFS, if he could explain to me what the difference is between HDFS and KFS. Here are some of his answers, which I think give more insight into why Kosmix chose to build KFS.

  • So why did Kosmix build KFS instead of using HDFS?  Apparently, KFS and HDFS were developed in parallel.  The KFS implementation was done from 2006 to 2007, and Kosmix now feels it's in a releasable state.  One of the reasons to stick with KFS over HDFS is that HDFS is written in Java while Kosmix's back-end is written in C++, and they were worried about the speed of the JNI interface.
  • File writing - HDFS follows a write-once, read-many model: when writing to a file, you have to write from the start to the end, and that is it.  Conversely, in KFS you can write to a file as many times as you want, write anywhere in the file (i.e., seek and write), and append to an existing file.  I've heard that Yahoo is working to fix this problem in HDFS, but it still isn't implemented.
  • Data integrity - Currently, with HDFS, after you write to a file, the data becomes “visible” to other apps only when the application closes the file.  So, if the process were to crash before closing, the data written is lost.  With KFS, the data becomes visible when it gets pushed out to the chunkservers.  For performance, clients cache data; when the cache is full or when the application chooses, data gets flushed out.
  • Data rebalancing - KFS has rudimentary support for automatic rebalancing.  When you add new nodes, or when space utilization changes among nodes, the system may migrate chunks from over-utilized nodes to under-utilized nodes.  HDFS doesn’t have such support now.
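
To make the data integrity point concrete for myself, here is a toy Python model of the two visibility rules Sriram described. This is only a sketch under my own assumptions: the ToyFile class, the simulate_crash helper, and the 4-byte cache limit are all made up for illustration, and none of this is the real KFS or HDFS API.

```python
class ToyFile:
    """Toy model of one file being written by a client (hypothetical, not a real API)."""

    def __init__(self, visible_on, cache_limit=4):
        # "flush": data is visible once pushed to the servers (the KFS behavior described above)
        # "close": data is visible only after the writer closes the file (the HDFS behavior described above)
        assert visible_on in ("flush", "close")
        self.visible_on = visible_on
        self.cache_limit = cache_limit  # assumed tiny limit so the example is easy to follow
        self.cache = bytearray()        # client-side cache, not yet sent anywhere
        self.pending = bytearray()      # sent to servers but not yet visible ("close" model only)
        self.visible = bytearray()      # what other applications can read

    def write(self, data):
        self.cache += data
        if len(self.cache) >= self.cache_limit:
            self.flush()                # cache is full, so push the data out

    def flush(self):
        if self.visible_on == "flush":
            self.visible += self.cache
        else:
            self.pending += self.cache
        self.cache.clear()

    def close(self):
        self.flush()
        if self.visible_on == "close":
            self.visible += self.pending
            self.pending.clear()


def simulate_crash(visible_on):
    """Write some data, then 'crash' by never calling close()."""
    f = ToyFile(visible_on)
    f.write(b"abcdef")  # exceeds the cache limit, so it is flushed
    f.write(b"gh")      # stays in the client cache
    return bytes(f.visible)


print(simulate_crash("flush"))  # b'abcdef'
print(simulate_crash("close"))  # b''
```

In the "flush" model only the bytes still sitting in the client cache are lost in the crash, while in the "close" model nothing written so far is visible to readers.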

Hopefully I transcribed these accurately!  Definitely check out the KFS project, as the more people contributing, the better.  Powerset will be evaluating KFS in the coming weeks to see if it has any features that can propel us ahead of using HDFS.
