2018-01-28 8-10 HDFS access, API

Below are two approaches to accessing HDFS data: the HDFS CLI commands and the application APIs.


HDFS Commands

HDFS Command List

$ hdfs dfs -xxx         $ hadoop fs -xxx


Besides the above commands, HDFS provides admin commands via hdfs dfsadmin, like:

$ hdfs dfsadmin -report                    --> gives a summary report on the filesystem, so you can look at the capacity, block replication info, DataNode status, etc.
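
The same summary is also available programmatically through the Java API described below. A minimal sketch of my own (not from the lesson), assuming the client's classpath carries the cluster's core-site.xml/hdfs-site.xml so fs.defaultFS points at the NameNode:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.FsStatus;

    public class FsReport {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            FsStatus status = fs.getStatus();   // capacity / used / remaining, like the summary in dfsadmin -report
            System.out.println("Capacity : " + status.getCapacity());
            System.out.println("Used     : " + status.getUsed());
            System.out.println("Remaining: " + status.getRemaining());
            fs.close();
        }
    }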

Application API

    Native Java API

    Base class: org.apache.hadoop.fs.FileSystem

    The classes below are included (a usage sketch follows this list):

        FSDataInputStream (read from HDFS)

            read : read bytes

            readFully : read from stream to buffer

            seek: seek to given offset

            getPos: get current position in stream

        FSDataOutputStream (write to HDFS)

            getPos: get current position in stream

            hflush: flush out the data in client's user buffer.

            close: close the underlying output stream.
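
    A minimal sketch of my own (not from the slides) tying these classes together; the path and file contents are made up for illustration, and the config is assumed to come from the classpath:

        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.fs.FSDataInputStream;
        import org.apache.hadoop.fs.FSDataOutputStream;
        import org.apache.hadoop.fs.FileSystem;
        import org.apache.hadoop.fs.Path;

        public class HdfsReadWrite {
            public static void main(String[] args) throws Exception {
                Configuration conf = new Configuration();         // picks up core-site.xml / hdfs-site.xml
                FileSystem fs = FileSystem.get(conf);             // the base class from above
                Path file = new Path("/user/cloudera/demo.txt");  // hypothetical path

                // Write path: FSDataOutputStream
                FSDataOutputStream out = fs.create(file, true);   // overwrite if the file exists
                out.writeBytes("hello hdfs\n");                   // 11 bytes
                out.hflush();                                     // flush the client's user buffer
                out.close();                                      // close the underlying output stream

                // Read path: FSDataInputStream
                FSDataInputStream in = fs.open(file);
                byte[] buffer = new byte[11];
                in.readFully(buffer);                             // read from the stream into a buffer
                System.out.println("pos after readFully = " + in.getPos());
                in.seek(0);                                       // seek to a given offset (back to the start)
                int firstByte = in.read();                        // read a single byte ('h')
                System.out.println("content = " + new String(buffer) + ", first byte = " + (char) firstByte);
                in.close();
                fs.close();
            }
        }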

    C API

    Library: libhdfs

    Header file: hdfs.h

    Web RESTful API

    Based on HTTP, it has the methods GET, PUT, POST, and DELETE, which correspond to read, update/replace, create, and delete operations (the WebHDFS API).

    First, WebHDFS needs to be enabled via the parameter below in /etc/hadoop/conf/hdfs-site.xml, either by editing the file or through the Cloudera Manager GUI:

        dfs.webhdfs.enabled

    Security parameters are also important for WebHDFS commands (a Java example follows these variants):

        A. dfs.web.authentication.kerberos.principal

        When security is on, with Kerberos (SPNEGO) authentication:

            $ curl -i --negotiate -u : "http://<HOST>:<PORT>/webhdfs/v1/<PATH>?op=..."

        B. dfs.web.authentication.kerberos.keytab

        When security is on, using a Hadoop delegation token:

            $ curl -i "http://<HOST>:<PORT>/webhdfs/v1/<PATH>?delegation=<TOKEN>&op=..."

        C. If security is off:

            $ curl -i "http://<HOST>:<PORT>/webhdfs/v1/<PATH>?[user.name=<USER>&]op=..."


Testing and troubleshooting:

1. The hdfs dfs command failed with a "quickstart.cloudera:8120 failed to connect" exception. It was fixed by restarting the hadoop-hdfs-namenode service.

2. I ran $ curl -i "http://quickstart.cloudera:14000/webhdfs/v1/user/spark?user.name=cloudera&op=GETFILESTATUS". It returned the error "couldn't connect to host". The root cause is that nothing is listening on 14000, because HttpFS (which serves on that port) is not installed by default.

I used 50070 instead, since it is the default HDFS NameNode web port that serves WebHDFS. Succeeded.

3. I ran $ curl -i -X PUT "http://quickstart.cloudera:50070/webhdfs/v1/user/josie?user.name=cloudera&op=MKDIRS". It reported the error "Forbidden" because the NameNode was in safe mode, reporting that the number of live DataNodes had reached the minimum of 0.

I updated the minimum number to 3 via the Cloudera Manager GUI and restarted the NameNode/DataNode services. No luck so far....
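
As a side note (my own sketch, not part of the lesson), the safe mode state can also be checked from the Java API; this assumes fs.defaultFS is an hdfs:// URI so the cast to DistributedFileSystem holds:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.hdfs.DistributedFileSystem;
    import org.apache.hadoop.hdfs.protocol.HdfsConstants.SafeModeAction;

    public class SafeModeCheck {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            DistributedFileSystem dfs = (DistributedFileSystem) fs;

            // SAFEMODE_GET only queries the state; it does not change it.
            boolean inSafeMode = dfs.setSafeMode(SafeModeAction.SAFEMODE_GET);
            System.out.println("NameNode in safe mode: " + inSafeMode);

            // Equivalent of `hdfs dfsadmin -safemode leave` (use with care):
            // dfs.setSafeMode(SafeModeAction.SAFEMODE_LEAVE);

            dfs.close();
        }
    }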



Other less common ways to access HDFS

HDFS Gateway 

    Mount HDFS and then access it as a local filesystem

Apache Flume

    Collects and aggregates streaming data and moves it into HDFS

Apache Sqoop

    Transfers relational database data in bulk to HDFS


Applications that use the HDFS core stack API, like HBase / Spark


lesson 8-10 slides
