Lustre vs. HDFS

1. Challenges of Hadoop + HDFS
Hadoop cannot guarantee that every task is data-local;
Saving large MapTask outputs on the local Linux file system creates an OS/disk I/O bottleneck;
Reduce nodes must shuffle map outputs over HTTP:
heavy network I/O and merge/spill operations
that consume substantial resources;
a bursty shuffle stage can exhaust memory
and cause the kernel to kill critical threads;
HDFS is time-consuming when handling many small files.


2. Some useful suggestions
Distribute MapTask intermediate results across multiple local disks; this relieves disk I/O contention (see the configuration sketch after this list);
Disperse MapTask intermediate results across multiple nodes (on Lustre); this reduces the load on the MapTask nodes;
Delay the actual network transmission, rather than transferring all intermediate results at the beginning of the ReduceTask (shuffle phase);
this spreads the network usage over time.
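For the first suggestion, stock Hadoop already has the needed knob: 'mapred.local.dir' accepts a comma-separated list of directories and spreads intermediate data across them. A minimal hadoop-site.xml fragment as a sketch, with hypothetical per-disk mount points:

```xml
<!-- Sketch only: /disk1../disk3 are hypothetical mount points,
     one per physical disk, so MapTask intermediate output is
     spread across spindles instead of hitting a single disk. -->
<property>
  <name>mapred.local.dir</name>
  <value>/disk1/mapred/local,/disk2/mapred/local,/disk3/mapred/local</value>
</property>
```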


3. Hadoop over Lustre
Each Hadoop MapTask can read in parallel from Lustre when its input is not local;
large intermediate outputs are saved on Lustre;
in the shuffle stage, Lustre creates a hardlink to the map output for the reducer node, which delays the actual network transmission and is more efficient;
Lustre can be mounted as a normal POSIX file system and is more efficient for reading/writing small files.


4. Differences at each stage between HDFS and Lustre
Map input: read (with block location information)
HDFS: each task streams its input, mostly from local disk, with rare remote network I/O.
Lustre: each task reads in parallel through the Lustre client; with location information there is less network I/O than without it.
Map output: read/write
HDFS: MapTasks write to the local Linux file system, not to HDFS.
Lustre: MapTasks write to Lustre.
Reduce input: shuffle phase read/write
HDFS: uses HTTP to fetch map output from remote mapper nodes.
Lustre: builds a hardlink to the map output.
Reduce output: write
HDFS: ReduceTasks write results to HDFS; each reducer writes serially.
Lustre: ReduceTasks write results to Lustre; each reducer can write in parallel.


5. Approach
(**) Add block location information for Lustre;
(**) Use hardlinks in the shuffle stage.
Lustre provides a POSIX-compliant UNIX file system interface, so we can use Lustre as the local file system at each node. A sketch of the hardlink idea follows.
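To make the hardlink idea concrete, here is a minimal sketch (not the actual Hadoop patch), assuming a shared Lustre mount visible at the same path on every node; the job and task paths are hypothetical:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Sketch of shuffle-by-hardlink: instead of pulling a map output over
// HTTP, the reducer creates a hard link to it on the shared Lustre
// mount. Creating the link is metadata-only; no data crosses the
// network until the reducer actually reads the file.
public class ShuffleByHardlink {
    public static void main(String[] args) throws IOException {
        // Hypothetical paths; both must be on the same Lustre file system.
        Path mapOutput   = Paths.get("/Lustre/job_0001/map_000003/file.out");
        Path reduceInput = Paths.get("/Lustre/job_0001/reduce_000000/map_000003.out");

        Files.createDirectories(reduceInput.getParent());
        Files.createLink(reduceInput, mapOutput); // link now, read later
    }
}
```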


6. Configuration
Modify the configuration Hadoop uses to build the file system:
give the path where Lustre is mounted to the variable 'fs.default.name',
and set 'mapred.local.dir' to an independent directory.
When running a job, just start the JobTracker and TaskTracker.
This way, Hadoop will use the Lustre file system to store all information.


Details of what we have done:
1. Mount Lustre at /Lustre on each node;
2. Build a hadoop-site.xml[14] configuration file and set the relevant variables (see the sketch after this list);
3. Start JobTracker and TaskTracker;
4. Now we can run MapReduce jobs over Lustre.
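A minimal hadoop-site.xml along the lines of step 2 might look as follows; the /Lustre mount point comes from step 1, while the exact mapred.local.dir path is only an example:

```xml
<?xml version="1.0"?>
<configuration>
  <!-- Point Hadoop's default file system at the Lustre mount; the
       file:// scheme makes Hadoop go through the ordinary POSIX
       interface that Lustre exposes. -->
  <property>
    <name>fs.default.name</name>
    <value>file:///Lustre</value>
  </property>
  <!-- An independent directory for intermediate (local) data. -->
  <property>
    <name>mapred.local.dir</name>
    <value>/Lustre/hadoop/mapred/local</value>
  </property>
</configuration>
```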


7. Adding Lustre block location information
Lustre provides user utilities to gather the extended attributes (striping information) of a specific file.
We set the Lustre stripe size to 32MB, so that each map task's input split resides on a single node.

We create a file containing the mapping from OST to hostname; this can be generated with the "lctl dl" command on each node. Alternatively, we add a new Java class to the Hadoop source code and rebuild it to get a new jar; through the new class we can obtain the location information of each file stored in Lustre. The location information is saved in an array. When the JobTracker pre-assigns map tasks, this information helps Hadoop give each map task to the node (OSS) where it can read its input from local disk. A sketch of the lookup is given below.
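The following is a minimal sketch of that lookup, not the actual patched Hadoop class: it assumes the OST-to-hostname map has already been dumped to a plain text file (one "index hostname" pair per line, a hypothetical format) and parses the tabular output of the standard "lfs getstripe" utility in a simplified way:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch: resolve which OSS hosts hold a Lustre file, given an
// OST-index -> hostname map prepared beforehand (e.g. from "lctl dl").
public class LustreLocation {
    // Hypothetical map file format: one "<ostIndex> <hostname>" per line.
    static Map<Integer, String> loadOstMap(String path) throws IOException {
        Map<Integer, String> map = new HashMap<>();
        for (String line : Files.readAllLines(Paths.get(path))) {
            String[] f = line.trim().split("\\s+");
            if (f.length == 2) map.put(Integer.parseInt(f[0]), f[1]);
        }
        return map;
    }

    // Ask "lfs getstripe" which OST indices a file is striped over;
    // parsing is simplified: rows after the "obdidx" header line.
    static List<Integer> ostIndices(String file) throws IOException {
        List<Integer> idx = new ArrayList<>();
        Process p = new ProcessBuilder("lfs", "getstripe", file).start();
        try (BufferedReader r = new BufferedReader(
                new InputStreamReader(p.getInputStream()))) {
            String line;
            boolean inTable = false;
            while ((line = r.readLine()) != null) {
                if (line.contains("obdidx")) { inTable = true; continue; }
                if (!inTable) continue;
                String[] f = line.trim().split("\\s+");
                if (f.length > 0 && f[0].matches("\\d+"))
                    idx.add(Integer.parseInt(f[0]));
            }
        }
        return idx;
    }

    public static void main(String[] args) throws IOException {
        Map<Integer, String> ostToHost = loadOstMap("/Lustre/ost_map.txt");
        for (int i : ostIndices("/Lustre/input/part-00000"))
            System.out.println("stripe on OST " + i + " -> " + ostToHost.get(i));
    }
}
```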


8. Input/Output Formats
There are many input/output formats in Hadoop, such as: SequenceFile(Input/Output)Format, Text(Input/Output)Format, DB(Input/Output)Format, etc. 
We can also implement our own file format by implementing the InputFormat/OutputFormat interfaces, as in the sketch below.
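For illustration, here is a minimal custom input format against the old org.apache.hadoop.mapred API (the JobTracker/TaskTracker-era API used throughout this setup). It simply reuses Hadoop's stock LineRecordReader, so it shows the required plumbing rather than a genuinely new format:

```java
import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileSplit;
import org.apache.hadoop.mapred.InputSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.LineRecordReader;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.mapred.Reporter;

// Minimal custom InputFormat: splits files like any FileInputFormat
// and reads records line by line via the stock LineRecordReader.
public class MyTextInputFormat extends FileInputFormat<LongWritable, Text> {
    @Override
    public RecordReader<LongWritable, Text> getRecordReader(
            InputSplit split, JobConf job, Reporter reporter) throws IOException {
        reporter.setStatus(split.toString());
        return new LineRecordReader(job, (FileSplit) split);
    }
}
```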

9. Test Cases and Test Results
(1) Test environment
8 nodes in total: 1 MDS/NameNode and 7 OSS/DataNodes;
each node has two 2.2GHz processors, 8GB of memory, and Gigabit Ethernet.


(2) Application Tests
General statistics applications (such as WordCount), and
computationally complex applications (such as BigMapOutput[9] and web-page analytics processing).
*Through these tests we expect to see Lustre achieve better performance than HDFS.

(3) Test1: WordCount with a big file
Item       HDFS   Lustre   Lustre   Lustre
Time (s)   303    325      330      318


(4) Test2: WordCount with many small files
Item   HDFS       Lustre
Time   1h19m16s   1h21m8s


(5) Test3: BigMapOutput with one big file
Item       HDFS   Lustre (stripesize=1M, stripecount=7)
Time (s)   207    207

(6) Test4: BigMapOutput with hardlink
Item       Lustre   Lustre with hardlink
Time (s)   446      391


(7) Test5: BigMapOutput with hardlink and location information
Item       Lustre with hardlink   Lustre with hardlink + location info
Time (s)   391                    366


(8) Test6: BigMapOutput, Map_Read phase
Item       HDFS   Lustre   Lustre with location info
Time (s)   66     111      106
To find out why HDFS achieves better performance in this phase, we analyzed the log files:
we scanned all map task logs and extracted the time each task spent reading its input.


(9) Test7: BigMapOutput, copy_MapOutput (shuffle) phase
Item       Lustre with hardlink   Lustre
Time (s)   228                    287


(10) Test8: BigMapOutput, Reduce output phase
Item       HDFS   Lustre
Time (s)   336    197
