2018-01-24 6 HDFS Architecture and Configuration

Architecture

Summary: 

HDFS is a scalable distributed filesystem. Hadoop distributes big data as blocks onto local storage that sits close to the compute. The nodes consist of heterogeneous, low-cost commodity hardware.

Key point of design:

distribute data as blocks across scalable data nodes.
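A minimal sketch (not Hadoop source code) of what "distribute data as blocks" means: a file is chopped into fixed-size blocks, assuming the classic 64 MB default block size mentioned later in this post.

```python
BLOCK_SIZE = 64 * 1024 * 1024  # 64 MB, the classic HDFS default

def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return the sizes of the blocks a file of `file_size` bytes occupies."""
    full, rest = divmod(file_size, block_size)
    return [block_size] * full + ([rest] if rest else [])

blocks = split_into_blocks(200 * 1024 * 1024)  # a 200 MB file
print(len(blocks))  # 4 blocks: three full 64 MB blocks plus an 8 MB tail
```

Each of these blocks is then stored (and replicated) on different data nodes; note the last block is allowed to be smaller than the block size.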

Features:

High data availability through data replication across different nodes.

Simplified coherency model: write once, read many.

Move computation close to the data.

Relaxed POSIX requirements to increase throughput.

Architecture:

(Figure 1: HDFS architecture diagram)

Name Node - manages the file system namespace and regulates clients' access to files.

Data Nodes - manage storage; serve read/write requests from clients; create, delete, and replicate blocks on instructions from the Name Node.
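The division of responsibilities can be illustrated with a toy in-memory model (an illustration only, not Hadoop's real classes or API): the NameNode holds only metadata — which blocks make up a file and where each replica lives — while DataNodes hold the actual bytes and replicate on the NameNode's instruction.

```python
class DataNode:
    """Stores raw block data; knows nothing about files or the namespace."""
    def __init__(self, name):
        self.name = name
        self.blocks = {}          # block_id -> bytes

    def store(self, block_id, data):
        self.blocks[block_id] = data

class NameNode:
    """Holds only metadata: namespace (file -> blocks) and block locations."""
    def __init__(self, datanodes, replication=2):
        self.datanodes = datanodes
        self.replication = replication
        self.namespace = {}       # filename -> [block_id, ...]
        self.locations = {}       # block_id -> [DataNode, ...]
        self._next_id = 0

    def write_block(self, filename, data):
        block_id = self._next_id
        self._next_id += 1
        targets = self.datanodes[: self.replication]
        for dn in targets:        # DataNodes replicate per NN instruction
            dn.store(block_id, data)
        self.namespace.setdefault(filename, []).append(block_id)
        self.locations[block_id] = targets
        return block_id

dns = [DataNode(f"dn{i}") for i in range(3)]
nn = NameNode(dns)
bid = nn.write_block("/logs/app.log", b"hello")
print([dn.name for dn in nn.locations[bid]])   # ['dn0', 'dn1']
```

Note that the NameNode never touches the payload bytes: clients learn block locations from it, then move data directly to and from DataNodes.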


Performance Envelope

Every block is represented as an object in the Name Node's memory.

The default block size is 64 MB. A file's size determines how many blocks are created, which in turn:

impacts memory usage on the Name Node and network load, from the namespace perspective;

impacts the number of map tasks that process the blocks, and further the disk I/O performance.
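The namespace impact is easy to see with back-of-the-envelope arithmetic. The figure of roughly 150 bytes of NameNode heap per namespace object used below is a commonly cited rule of thumb, not an exact number.

```python
OBJ_BYTES = 150   # rough rule of thumb per namespace object, not exact

def namenode_objects(num_files, blocks_per_file):
    # each file contributes one file object plus one object per block
    return num_files * (1 + blocks_per_file)

# ~1 GB of data stored as a single large file (16 x 64 MB blocks)...
big = namenode_objects(1, 16)
# ...versus roughly the same data as 10,000 small files of 1 block each
small = namenode_objects(10_000, 1)

print(big, small)                      # 17 vs 20000 objects
print(small * OBJ_BYTES)               # ~3 MB of NameNode heap vs ~2.5 KB
```

The same amount of data costs the Name Node thousands of times more memory when stored as many small files — which is exactly why the tips below target the small-files problem.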


How to improve performance:

- merge small files

- use SequenceFiles

- HBase / Hive configuration

- CombineFileInputFormat

- CombineFileInputFormat
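The first tip can be sketched locally: pack many small files into one large file with a simple offset index before loading into HDFS, so the Name Node tracks a handful of objects instead of thousands. This is a minimal illustration of the idea (SequenceFiles and HAR archives are the production-grade versions), not a Hadoop API.

```python
import io

def pack(files):
    """files: dict of name -> bytes. Returns (packed bytes, offset index)."""
    buf, index, offset = io.BytesIO(), {}, 0
    for name, data in files.items():
        buf.write(data)
        index[name] = (offset, len(data))   # remember where each file lives
        offset += len(data)
    return buf.getvalue(), index

packed, idx = pack({"a.log": b"alpha", "b.log": b"beta"})
off, ln = idx["b.log"]
print(packed[off:off + ln])   # b'beta' -- any small file is still addressable
```

The index plays the same role a SequenceFile's keys do: individual records stay retrievable even though the namespace only sees one large file.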


Write/Replication/Read Processes on HDFS

Initially, data is cached in a client-side buffer until it reaches one block size; then the write pipeline begins:

(Figure 2: HDFS write/replication/read process diagram)
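The client-side buffering step can be sketched as follows. This is a simplified model with an illustrative class name and a tiny block size, not the real HDFS client API: writes accumulate in a buffer, and each time the buffer reaches one block size, a full block is handed off (to what would be the replication pipeline).

```python
BLOCK_SIZE = 8  # tiny block size, for demonstration only

class BufferedWriter:
    """Toy model of an HDFS client's write path: buffer, then flush blocks."""
    def __init__(self, block_size=BLOCK_SIZE):
        self.block_size = block_size
        self.buffer = b""
        self.flushed_blocks = []   # stands in for the replication pipeline

    def write(self, data):
        self.buffer += data
        while len(self.buffer) >= self.block_size:
            # a full block leaves the client and enters the write pipeline
            self.flushed_blocks.append(self.buffer[:self.block_size])
            self.buffer = self.buffer[self.block_size:]

w = BufferedWriter()
w.write(b"abcdefghij")  # 10 bytes: one 8-byte block flushed, 2 bytes buffered
print(len(w.flushed_blocks), len(w.buffer))   # 1 2
```

In real HDFS the flushed block is streamed to the first DataNode, which forwards it along a pipeline to the remaining replicas; only data short of a full block stays on the client.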

lesson 6 - slides

HDFS command list

HDFS Architecture (official documentation)
