Below are two approaches to accessing HDFS data: the HDFS CLI commands and the application APIs.
HDFS Commands
HDFS Command List
$ /bin/hdfs dfs <subcommand>
$ /bin/hadoop fs <subcommand>
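For example (the paths here are only placeholders for illustration):
$ hdfs dfs -ls /user
$ hdfs dfs -mkdir /user/test
$ hdfs dfs -put local.txt /user/test/
$ hdfs dfs -cat /user/test/local.txt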
Besides the above commands, HDFS also provides administration commands, like:
$ hdfs dfsadmin -report --> gives a summary report on the filesystem, so you can look at the capacity, replicated-block info, datanodes, etc.
Application API
Native Java API
Base class: org.apache.hadoop.fs.FileSystem
JAVA API
The classes below are included (a short usage sketch follows the list):
FSDataInputStream (read from HDFS)
read: read bytes
readFully: read from the stream into a buffer
seek: seek to given offset
getPos: get current position in stream
FSDataOutputStream (write to HDFS)
getPos: get current position in stream
hflush: flush out the data in client's user buffer.
close: close the underlying output stream.
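A minimal sketch of writing and then reading a file through this API (assuming the Hadoop client jars and config are on the classpath; the class name and path are made up):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadWrite {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();         // picks up core-site.xml / hdfs-site.xml from the classpath
        FileSystem fs = FileSystem.get(conf);             // base class org.apache.hadoop.fs.FileSystem
        Path path = new Path("/user/cloudera/demo.txt");  // hypothetical path

        // write to HDFS with FSDataOutputStream
        FSDataOutputStream out = fs.create(path, true);   // overwrite if it already exists
        out.writeUTF("hello hdfs");
        out.hflush();                                     // flush the data in the client's user buffer
        out.close();                                      // close the underlying output stream

        // read from HDFS with FSDataInputStream
        FSDataInputStream in = fs.open(path);
        in.seek(0);                                       // seek to a given offset
        String text = in.readUTF();
        System.out.println("pos=" + in.getPos() + " text=" + text);
        in.close();

        fs.close();
    }
}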
C API
Library: libhdfs
Header file: hdfs.h
Web RESTful API
Based on HTTP, it has the methods GET, PUT, POST, and DELETE, which correspond to read, update/replace, create, and delete. HDFS exposes this as the WebHDFS API.
First, the following parameter needs to be enabled in /etc/hadoop/conf/hdfs-site.xml, either by editing the file or through the Cloudera Manager GUI:
dfs.webhdfs.enabled
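In hdfs-site.xml this is the usual property block (a sketch; set the value to true to enable WebHDFS):

<property>
  <name>dfs.webhdfs.enabled</name>
  <value>true</value>
</property>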
The security parameters are also important for the WebHDFS commands:
A. dfs.web.authentication.kerberos.principal
   Here security is on with Kerberos (SPNEGO):
   $ curl -i --negotiate -u : "http://<HOST>:<PORT>/webhdfs/v1/<PATH>?op=..."
B. dfs.web.authentication.kerberos.keytab
   Here security is on using a Hadoop delegation token:
   $ curl -i "http://<HOST>:<PORT>/webhdfs/v1/<PATH>?delegation=<TOKEN>&op=..."
C. If security is off:
   $ curl -i "http://<HOST>:<PORT>/webhdfs/v1/<PATH>?user.name=<USER>&op=..."
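The same REST calls can also be made from application code; below is a minimal Java sketch for GETFILESTATUS with security off (host, port, path, and user are taken from the tests in the troubleshooting notes below):

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

public class WebHdfsStatus {
    public static void main(String[] args) throws Exception {
        // Security off, so user.name is passed as a query parameter.
        URL url = new URL("http://quickstart.cloudera:50070/webhdfs/v1/user/spark?op=GETFILESTATUS&user.name=cloudera");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("GET");                     // GET corresponds to a read operation
        System.out.println("HTTP " + conn.getResponseCode());
        try (BufferedReader r = new BufferedReader(new InputStreamReader(conn.getInputStream()))) {
            String line;
            while ((line = r.readLine()) != null) {
                System.out.println(line);                 // JSON FileStatus response
            }
        }
        conn.disconnect();
    }
}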
Testing and troubleshooting:
1. The hdfs dfs command got the error "quickstart.cloudera:8120 failed to connected exception". It was fixed by restarting the hadoop-hdfs-namenode service.
2. I ran $ curl -i "http://quickstart.cloudera:14000/webhdfs/v1/user/spark?user.name=cloudera&op=GETFILESTATUS". It returned the error "couldn't connect to host". The root cause is that port 14000 is not serving anything, because HttpFS, which listens on 14000, is not installed by default.
I used port 50070 instead, as it is the default HDFS port for WebHDFS, and the request succeeded.
3. I ran $ curl -i -X PUT "http://quickstart.cloudera:50070/webhdfs/v1/user/josie?user.name=cloudera&op=MKDIRS". It reported the error "forbidden" because the namenode is in safe mode, since the number of live datanodes has reached the minimum of 0.
I updated the minimum number to 3 via the Cloudera Manager GUI and restarted the namenode/datanode services, but with no luck so far...
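Another option worth trying here (not something I have verified on the quickstart VM) is to check the safe-mode status directly and, if appropriate, force the namenode out of it:
$ hdfs dfsadmin -safemode get
$ hdfs dfsadmin -safemode leave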
Other less common ways to access HDFS
HDFS Gateway
Mount HDFS and then access it as a local filesystem.
Apache Flume
Collects and aggregates streaming data and moves it into HDFS.
Apache Sqoop
Transfers relational database data in bulk to HDFS.
Applications built on the HDFS core stack API, like HBase / Spark.
Reference: lesson 8-10 slides