Hadoop 2.2 Programming: DFS API Operations

 

1. Reading data from a Hadoop URL

Explanation: for Java to read data from Hadoop's DFS, it must be able to recognize the hdfs URL scheme. We therefore provide Hadoop's FsUrlStreamHandlerFactory as the JVM-wide factory through Java's URL.setURLStreamHandlerFactory() method.

Caveat: this approach has a drawback. setURLStreamHandlerFactory() can be called only once per JVM, so if a third-party component has already set a URLStreamHandlerFactory, Hadoop users can no longer use this method to read data from Hadoop.

Quick reference:

 1.[java.net.URL]

    methods:

        InputStream openStream()

        static void setURLStreamHandlerFactory(URLStreamHandlerFactory fac)

 2.[org.apache.hadoop.fs.FsUrlStreamHandlerFactory]

    declaration:

        public class FsUrlStreamHandlerFactory
        extends Object
        implements URLStreamHandlerFactory

3.[org.apache.hadoop.io.IOUtils]

    method:

    static void copyBytes(InputStream in, OutputStream out, int buffSize, boolean close)

Code:

import java.io.InputStream;
import java.net.URL;

import org.apache.hadoop.fs.FsUrlStreamHandlerFactory;
import org.apache.hadoop.io.IOUtils;

public class URLCat {
  static {
    // register Hadoop's factory so java.net.URL understands hdfs:// URLs;
    // this can be done only once per JVM
    URL.setURLStreamHandlerFactory(new FsUrlStreamHandlerFactory());
  }

  public static void main(String[] args) throws Exception {
    InputStream in = null;
    try {
      in = new URL(args[0]).openStream();
      IOUtils.copyBytes(in, System.out, 4096, false);
    } finally {
      IOUtils.closeStream(in);
    }
  }
}

Steps to run:

$source $YARN_HOME/libexec/hadoop-config.sh

$mkdir myclass

$javac -cp $CLASSPATH URLCat.java -d myclass

$jar -cvf urlcat.jar -C myclass ./

# assume we have a file bar.txt in HDFS at /user/grid/bar.txt;
# we then run the class through yarn:

$yarn jar urlcat.jar URLCat hdfs:///user/grid/bar.txt
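For a quick check, the same file can also be printed with the built-in HDFS shell, which is handy for verifying URLCat's output:

$hadoop fs -cat hdfs:///user/grid/bar.txt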

 

2. Reading data using the HDFS API

Explanation: using Hadoop's FileSystem API avoids the once-per-JVM limitation of setURLStreamHandlerFactory() described above.

Quick reference:

 (1)org.apache.hadoop.conf.Configured

    |__ org.apache.hadoop.fs.FileSystem

            public abstract class FileSystem

            extends Configured

            implements Closeable

            [method]:

            static FileSystem get(URI uri, Configuration conf)

            FSDataInputStream open(Path f)

(2)java.io.InputStream

    |__ java.io.FilterInputStream

          |__ java.io.DataInputStream

               |__ org.apache.hadoop.fs.FSDataInputStream

                        public class FSDataInputStream

                        extends DataInputStream

                        implements Seekable, PositionedReadable, Closeable

                        [methods]:

                         void seek(long desired)

                         long getPos()

(3)org.apache.hadoop.fs.Path

    public class Path

    extends Object

    implements Comparable

    [methods]:

    Path(String pathString)

                                                                                                                                                           

(4)java.net.URI

    public final class URI

    extends Object

    implements Comparable<URI>, Serializable

    [methods]:

    static URI create(String str)

Code:

import java.net.URI;
import java.io.InputStream;

import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.conf.Configuration;

public class URICat {
  public static void main(String[] args) throws Exception {
    String uri = args[0];
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(URI.create(uri), conf);
    InputStream in = null;
    try {
      in = fs.open(new Path(uri));
      IOUtils.copyBytes(in, System.out, 4096, false);
    } finally {
      IOUtils.closeStream(in);
    }
  }
}

 

Steps to run:

$source $YARN_HOME/libexec/hadoop-config.sh

$mkdir myclass

$javac -cp $CLASSPATH URICat.java -d myclass

$jar -cvf uricat.jar -C myclass ./

$yarn jar uricat.jar URICat /user/grid/bar.txt
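The quick reference above also lists FSDataInputStream's seek() and getPos(), which URICat does not exercise. Below is a minimal sketch (the class name SeekDouble and the print-twice behaviour are illustrative additions, not part of the original example) that reads the same file twice by seeking back to the beginning:

import java.net.URI;

import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.conf.Configuration;

public class SeekDouble {
  public static void main(String[] args) throws Exception {
    String uri = args[0];
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(URI.create(uri), conf);
    FSDataInputStream in = null;   // keep the concrete type so seek()/getPos() are available
    try {
      in = fs.open(new Path(uri));
      IOUtils.copyBytes(in, System.out, 4096, false);
      System.out.println("pos after first pass: " + in.getPos());
      in.seek(0);                  // rewind to the start of the file
      IOUtils.copyBytes(in, System.out, 4096, false);
    } finally {
      IOUtils.closeStream(in);
    }
  }
}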

 

Note 1: because we go through the FileSystem API, the input file path may omit the full hdfs:// URI prefix, as shown in the steps above.

Note 2: FileSystem is an abstract class, so an instance cannot be obtained with new FileSystem(); it must come from the static get() method.

Note 3: mind Java's upcasting, which shows up in the inheritance chains of the various streams listed in the quick reference.

Note 4: Configuration conf = new Configuration();

  • A Configuration is populated from <name>/<value> pairs in XML resource files, with these rules:

    if a resource is named by a String, it is looked up on the classpath;

    if a resource is named by a Path, it is read directly from the local filesystem, without checking the classpath;

  • if the user specifies nothing, the two default resources core-default.xml and core-site.xml are loaded;

  • users can add their own XML file to supply custom configuration (see the sketch after these notes):

    conf.addResource("my_configuration.xml");
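To make Note 4 concrete, here is a minimal sketch; the resource name my_configuration.xml and the property my.test.key are hypothetical and only illustrate how an added resource is read back:

import org.apache.hadoop.conf.Configuration;

public class ConfDemo {
  public static void main(String[] args) {
    Configuration conf = new Configuration();      // loads core-default.xml and core-site.xml
    conf.addResource("my_configuration.xml");      // extra resource, looked up on the classpath (String name)
    // read a property from the merged configuration, falling back to a default value
    System.out.println(conf.get("my.test.key", "not set"));
  }
}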

3. Writing data

3.1 Copying a local file to HDFS

  • Version 1: FileCopy with the copyBytes() method

Quick reference:

  • The core is a single call that copies raw bytes from the InputStream to the OutputStream:

static void copyBytes(InputStream in, OutputStream out, int buffSize, boolean close)

 

  • We create a FileInputStream(localsrc) instance, wrap it in a BufferedInputStream, and upcast it to InputStream:

    FileInputStream(String name)

  • We ask the FileSystem to produce the OutputStream:

FSDataOutputStream create(Path f, Progressable progress)

Code:

import java.net.URI;
import java.io.InputStream;
import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.OutputStream;

import org.apache.hadoop.util.Progressable;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.conf.Configuration;

public class FileCopyWithProgress {
  public static void main(String[] args) throws Exception {
    String localsrc = args[0];
    String dst = args[1];
    // local source: buffered stream over the local file, upcast to InputStream
    InputStream in = new BufferedInputStream(new FileInputStream(localsrc));
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(URI.create(dst), conf);
    // destination stream on HDFS; print a dot on every progress callback
    OutputStream out = fs.create(new Path(dst), new Progressable() {
        public void progress() { System.out.print("."); }
    });
    IOUtils.copyBytes(in, out, 4096, true);
  }
}

 

Steps to run:

$. $YARN_HOME/libexec/hadoop-config.sh 

$javac -cp $CLASSPATH -d my_class FileCopyWithProgress.java

$jar -cvf filecopywithprogress.jar -C my_class/ .

# assume we have a local file /home/grid/foo.out; we then run yarn as below

$yarn jar filecopywithprogress.jar FileCopyWithProgress /home/grid/foo.out hdfs:///user/grid/copied_foo.out

# we can do a check for the copied file

$hadoop fs -ls -R /user/grid/

 

Note: from this point on, a different way of compiling and running the code is used.

  • Version 2: using FileSystem's copyFromLocalFile() method

Code:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.conf.Configuration;

public class FileCopyFromLocal {
  public static void main(String[] args) throws Exception {
    String localSrc = args[0];
    String dst = args[1];
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    fs.copyFromLocalFile(new Path(localSrc), new Path(dst));
  }
}

 

Steps to run:

$source $YARN_HOME/libexec/hadoop-config.sh

$javac FileCopyFromLocal.java -d class/

$jar -cvf filecopyfromlocal.jar -C class ./

$export HADOOP_CLASSPATH=$CLASSPATH:filecopyfromlocal.jar

# suppose we have a file bar.txt on the local disk; the following command copies it to HDFS

$yarn FileCopyFromLocal bar.txt hdfs:///user/grid/kissyou

# we can check the copied file on hdfs

$hadoop fs -ls /user/grid/

-rw-r--r--   3 grid supergroup        899 2013-11-17 01:33 /user/grid/kissyou
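FileSystem also offers the reverse direction. The following is a minimal sketch (the class name FileCopyToLocal is illustrative) that copies an HDFS file back to the local filesystem with copyToLocalFile(); it can be compiled and run the same way as FileCopyFromLocal:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.conf.Configuration;

public class FileCopyToLocal {
  public static void main(String[] args) throws Exception {
    String src = args[0];   // HDFS source path, e.g. hdfs:///user/grid/kissyou
    String dst = args[1];   // local destination path
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    fs.copyToLocalFile(new Path(src), new Path(dst));
  }
}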

 

3.2 Creating directories and files

  • Creating a directory: FileSystem.mkdirs()

Code:

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;

public class CreateDir {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    String dst = args[0];
    FileSystem fs = FileSystem.get(conf);
    fs.mkdirs(new Path(dst));
  }
}

 

Steps to run:

$source $YARN_HOME/libexec/hadoop-config.sh

$javac CreateDir.java -d class/

$jar -cvf createdir.jar -C class ./

$export HADOOP_CLASSPATH=$CLASSPATH:createdir.jar

$yarn CreateDir hdfs:///user/grid/kissyou

# we can check the created directory on hdfs

$hadoop fs -ls /user/grid/

drwxr-xr-x   - grid supergroup          0 2013-11-17 01:33 /user/grid/kissyou

 

  • Creating a file: FileSystem.create()

Code:

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;

public class CreateFile {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    String dst = args[0];
    FileSystem fs = FileSystem.get(conf);
    fs.create(new Path(dst));
  }
}

 

Steps to run:

$source $YARN_HOME/libexec/hadoop-config.sh

$javac CreateFile.java -d class/

$jar -cvf createfile.jar -C class ./

$export HADOOP_CLASSPATH=$CLASSPATH:createfile.jar

$yarn CreateFile hdfs:///user/grid/kissyou.txt

# we can check the created file on hdfs

$hadoop fs -ls /user/grid/

-rw-r--r--   3 grid supergroup        899 2013-11-17 01:33 /user/grid/kissyou.txt

 

Three things to note:

1. A file foo and a directory foo/ with the same name cannot both be created under the same path; otherwise the following exception is thrown at runtime (a guard against this is sketched after these notes):

    fs.FileAlreadyExistsException

2. mkdirs() is called automatically during copy and write operations, so directories are rarely created by calling mkdirs() by hand.

3. The official API documentation describes mkdirs() as "Make the given file and all non-existent parents into directories", so directory creation in Hadoop is recursive, equivalent to the Linux command:

    $mkdir -p foo/bar/qzx

and likewise equivalent to the HDFS shell command:

    %$YARN_HOME/bin/hadoop fs -mkdir -p hdfs:///foo/bar/qzx
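As a minimal sketch of the first note (the class name SafeCreate is illustrative), the snippet below checks with exists() before creating, instead of letting create() fail on a conflicting name:

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.conf.Configuration;

public class SafeCreate {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path dst = new Path(args[0]);
    if (fs.exists(dst)) {
      // a file or directory with this name already exists
      System.err.println(dst + " already exists, skipping create()");
    } else {
      fs.create(dst).close();   // create the file and close the returned stream
    }
  }
}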

4. Testing files and getting FileStatus

Note: some APIs are deprecated in Hadoop 2.2. Only the deprecated methods used in this example are listed below, together with their current replacements.

    deprecated APIs:

(1)java.lang.Object

    |__ org.apache.hadoop.fs.FileStatus

       //deprecated method:

        boolean isDir() //Deprecated. Use isFile(), 

                        //isDirectory(), and isSymlink() instead.

(2)java.lang.Object

    |__org.apache.hadoop.conf.Configured

        |__org.apache.hadoop.fs.FileSystem

            //deprecated methods:

            boolean isDirectory(Path f)    //Deprecated. Use 

                                           //getFileStatus() instead 

            short getReplication(Path src) //Deprecated. Use 

                                           //getFileStatus() instead 

            long getLength(Path f)         //Deprecated. Use

                                           //getFileStatus()instead
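A minimal sketch of the migration (the class name DeprecationDemo is illustrative): the deprecated FileSystem calls are replaced by a single getFileStatus() lookup, and the fields are then read from the returned FileStatus:

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.conf.Configuration;

public class DeprecationDemo {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    Path p = new Path(args[0]);

    // deprecated style (still compiles in Hadoop 2.2, but warns):
    //   fs.isDirectory(p);  fs.getLength(p);  fs.getReplication(p);

    // replacement: fetch the FileStatus once, then read its fields
    FileStatus st = fs.getFileStatus(p);
    System.out.println(p + " isDirectory: " + st.isDirectory());
    System.out.println(p + " length: " + st.getLen());
    System.out.println(p + " replication: " + st.getReplication());
  }
}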

Code:

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;

public class TestFileStatus {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    FileStatus stat = fs.getFileStatus(new Path(args[0]));
    if (stat.isDirectory()) {
      System.out.println(stat.getPath().toUri().getPath() + " is a directory.");
    } else if (stat.isFile()) {
      System.out.println(stat.getPath().toUri().getPath() + " is a file.");
      System.out.println(stat.getPath().toUri().getPath() + " getBlockSize: " + stat.getBlockSize());
      System.out.println(stat.getPath().toUri().getPath() + " getLen(): " + stat.getLen());
      System.out.println(stat.getPath().toUri().getPath() + " getOwner(): " + stat.getOwner());
      System.out.println(stat.getPath().toUri().getPath() + " getGroup(): " + stat.getGroup());
      System.out.println(stat.getPath().toUri().getPath() + " getAccessTime(): " + stat.getAccessTime());
      System.out.println(stat.getPath().toUri().getPath() + " getModificationTime(): " + stat.getModificationTime());
      System.out.println(stat.getPath().toUri().getPath() + " getPermission(): " + stat.getPermission());
      System.out.println(stat.getPath().toUri().getPath() + " hashcode(): " + stat.hashCode());
      System.out.println(stat.getPath().toUri().getPath() + " getPath(): " + stat.getPath());
    }
  }
}

 

First, a small bonus ^_^: the little script below makes it easy to compile the sources and build the jar file:

#!/usr/bin/env sh
CWD=$(pwd)
export CLASSPATH=''
. $YARN_HOME/libexec/hadoop-config.sh

if [ -d class ]; then
  rm -rf class/*
else
  mkdir $CWD/class
fi

for f in $@
  do
    srcs="$srcs $CWD/$f"
  done

javac $srcs -d class

if [ $? -ne 0 ] ;then
  echo Error found when compiling the code!
  exit 1
fi

class=$( cat $1 |grep 'package'|sed -e "s/\(package\s\)\|\(;\)//g" ).$(echo $1 | sed -r 's/(.*).java/echo \1/ge')
jarfile=$(echo $1 | sed -r 's/(.*)\.java/echo \L\1\.jar/ge')

jar -cvf $CWD/$jarfile -C $CWD/class . > /dev/null 2>&1
#echo jar -cvf $jarfile -C class .
echo -----------------CMD Lines-----------------------
echo source $YARN_HOME/libexec/hadoop-config.sh >sourceIt.sh
echo export HADOOP_CLASSPATH=$jarfile:'$CLASSPATH'>>sourceIt.sh
echo source  $CWD/sourceIt.sh
echo yarn $class  [command args]...

Steps to run:

Note: for simplicity, the script assumes that in

$./compack.sh args1 args2 args3 ...    # args1 must be the source file containing the main class

$chmod 500 compack.sh

$./compack.sh TestFileStatus.java

# the script then prints the following message:

-----------------CMD Lines------------------

source /home/grid/hadoop-2.2.0-src/hadoop-dist/target/hadoop-2.2.0/task/DFSAPIProgramming/sourceIt.sh

yarn TestFileStatus  [command args]...

$source sourceIt.sh

# suppose we have a file "part-m-00000" in HDFS; run yarn as below

$yarn TestFileStatus /user/hive/warehouse/footbl/part-m-00000

Output:


/user/hive/warehouse/footbl/part-m-00000 is a file.

/user/hive/warehouse/footbl/part-m-00000 getBlockSize: 134217728

/user/hive/warehouse/footbl/part-m-00000 getLen(): 1275

/user/hive/warehouse/footbl/part-m-00000 getOwner(): grid

/user/hive/warehouse/footbl/part-m-00000 getGroup(): supergroup

/user/hive/warehouse/footbl/part-m-00000 getAccessTime(): 1384675957784

/user/hive/warehouse/footbl/part-m-00000 getModificationTime(): 1384675958368

/user/hive/warehouse/footbl/part-m-00000 getPermission(): rw-r--r--

/user/hive/warehouse/footbl/part-m-00000 hashcode(): 1096001837

/user/hive/warehouse/footbl/part-m-00000 getPath(): hdfs://cluster1:9000/user/hive/warehouse/footbl/part-m-00000
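Beyond the per-file metadata printed above, the same FileStatus can be used to ask where the file's blocks live. A minimal sketch (the class name ShowBlockLocations is illustrative, and it takes the same single path argument):

import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.conf.Configuration;

public class ShowBlockLocations {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    FileStatus stat = fs.getFileStatus(new Path(args[0]));
    // ask the NameNode for the location of every block of the file
    BlockLocation[] blocks = fs.getFileBlockLocations(stat, 0, stat.getLen());
    for (BlockLocation b : blocks) {
      System.out.println("offset " + b.getOffset() + ", length " + b.getLength()
          + ", hosts " + java.util.Arrays.toString(b.getHosts()));
    }
  }
}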

 

5. Listing files & globbing files

  • Listing files

Code:

import org.apache.hadoop.fs.FileUtil;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.conf.Configuration;

public class ListFiles {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    Path[] paths = new Path[args.length];
    for (int i = 0; i < args.length; i++) {
      paths[i] = new Path(args[i]);
    }

    FileStatus[] status = fs.listStatus(paths);
    Path[] pathList = FileUtil.stat2Paths(status);
    for (Path p : pathList) {
      System.out.println(p);
    }
  }
}

 

Steps to run:

$./compack.sh ListFiles.java 

$source sourceIt.sh

$yarn ListFiles /user/hive/warehouse/footbl /user/grid/

 

Output:

hdfs://cluster1:9000/user/hive/warehouse/footbl/_SUCCESS

hdfs://cluster1:9000/user/hive/warehouse/footbl/part-m-00000

hdfs://cluster1:9000/user/grid/kiss

hdfs://cluster1:9000/user/grid/kissyou

hdfs://cluster1:9000/user/grid/missyou

  • Filter files

Quick reference:

  1. java.lang.Object

  |__ org.apache.hadoop.conf.Configured

       |__ org.apache.hadoop.fs.FileSystem

            public abstract class FileSystem

            extends Configured

            implements Closeable

            //method:

            FileStatus[] globStatus(Path pathPattern, PathFilter filter)  

   2. org.apache.hadoop.fs.PathFilter

    public interface PathFilter

    //method:

    boolean accept(Path path)

Code:

// MyFilter.java
package org.apache.hadoop.MyCode;

import org.apache.hadoop.fs.PathFilter;
import org.apache.hadoop.fs.Path;

public class MyFilter implements PathFilter {
  private final String regex;

  public MyFilter(String regex) {
    this.regex = regex;
  }

  public boolean accept(Path path) {
    return path.toString().matches(regex);
  }
}

// ListStatusWithPattern.java
package org.apache.hadoop.MyCode;

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.FileUtil;
import org.apache.hadoop.conf.Configuration;

public class ListStatusWithPattern {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    // args[0] is the glob pattern, args[1] is the regex handed to the filter
    FileStatus[] status = fs.globStatus(new Path(args[0]), new MyFilter(args[1]));
    Path[] pathList = FileUtil.stat2Paths(status);

    for (Path p : pathList) {
      System.out.println(p);
    }
  }
}

Steps to run:

$source $YARN_HOME/libexec/hadoop-config.sh

$mkdir class

$javac ListStatusWithPattern.java  MyFilter.java -d class

$jar -cvf liststatuswithpattern.jar -C class ./

$export HADOOP_CLASSPATH=liststatuswithpattern.jar:$CLASSPATH

#suppose we have the following four entries in HDFS

$hadoop fs -ls /user/grid/

Found 4 items

drwxr-xr-x   - grid supergroup          0 2013-11-17 01:06 /user/grid/kiss

-rw-r--r--   3 grid supergroup          0 2013-11-17 06:05 /user/grid/kissyou

drwxr-xr-x   - grid supergroup          0 2013-11-17 19:33 /user/grid/miss

-rw-r--r--   3 grid supergroup        899 2013-11-17 01:33 /user/grid/missyou

# then we can run the command to filter the matched file

$yarn jar liststatuswithpattern.jar org.apache.hadoop.MyCode.ListStatusWithPattern "hdfs:///user/grid/*ss*" "^.*grid/[k].*$"

Or the script given earlier can be used to compile, package, and print the main yarn command line:

$./compack.sh ListStatusWithPattern.java MyFilter.java   # note: the script takes the file containing the main class as its first argument

-----------------CMD Lines-----------------------

source /home/grid/hadoop-2.2.0-src/hadoop-dist/target/hadoop-2.2.0/task/DFSAPIProgramming/sourceIt.sh

yarn org.apache.hadoop.MyCode.ListStatusWithPattern [command args]...

$source /home/grid/hadoop-2.2.0-src/hadoop-dist/target/hadoop-2.2.0/task/DFSAPIProgramming/sourceIt.sh

$yarn org.apache.hadoop.MyCode.ListStatusWithPattern "hdfs:///user/grid/*ss*" "^.*grid/[k].*$"

Output:



hdfs://cluster1:9000/user/grid/kiss

hdfs://cluster1:9000/user/grid/kissyou
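A PathFilter is not tied to globStatus(); FileSystem.listStatus() accepts one as well. A minimal sketch (the class name ListStatusWithFilter is illustrative) that lists a directory and keeps only the children accepted by the same MyFilter:

package org.apache.hadoop.MyCode;

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileUtil;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.conf.Configuration;

public class ListStatusWithFilter {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    // listStatus() also accepts a PathFilter: list a directory, keep only matching children
    FileStatus[] status = fs.listStatus(new Path(args[0]), new MyFilter(args[1]));
    for (Path p : FileUtil.stat2Paths(status)) {
      System.out.println(p);
    }
  }
}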

 

(End)
