1. Reading data from a Hadoop URL
Overview: for Java to read data from Hadoop's HDFS through a URL, Java must first be able to recognize the hdfs URL scheme. We therefore hand HDFS's FsUrlStreamHandlerFactory to Java as its URL stream handler factory, which is done with java.net.URL's setURLStreamHandlerFactory() method.
Note: this approach has a drawback. setURLStreamHandlerFactory() can be called at most once per JVM, so if a third-party component has already set a URLStreamHandlerFactory, a Hadoop user can no longer use this mechanism to read data from Hadoop.
Quick reference:
1. [java.net.URL] methods:
   InputStream openStream()
   static void setURLStreamHandlerFactory(URLStreamHandlerFactory fac)
2. [org.apache.hadoop.fs.FsUrlStreamHandlerFactory]:
   public class FsUrlStreamHandlerFactory extends Object implements URLStreamHandlerFactory
3. [org.apache.hadoop.io.IOUtils] method:
   static void copyBytes(InputStream in, OutputStream out, int buffSize, boolean close)
Code:
import java.io.InputStream;
import java.net.URL;

import org.apache.hadoop.fs.FsUrlStreamHandlerFactory;
import org.apache.hadoop.io.IOUtils;

public class URLCat {
    static {
        URL.setURLStreamHandlerFactory(new FsUrlStreamHandlerFactory());
    }

    public static void main(String[] args) throws Exception {
        InputStream in = null;
        try {
            in = new URL(args[0]).openStream();
            IOUtils.copyBytes(in, System.out, 4096, false);
        } finally {
            IOUtils.closeStream(in);
        }
    }
}
Steps to run:
$source $YARN_HOME/libexec/hadoop-config.sh
$mkdir myclass
$javac -cp $CLASSPATH URLCat.java -d myclass
$jar -cvf urlcat.jar -C myclass ./
# assume we have a file bar.txt in hdfs: /user/grid/bar.txt
# then we run yarn with this command
$yarn jar urlcat.jar URLCat hdfs:///user/grid/bar.txt
2. Reading data using the HDFS FileSystem API
Overview: using Hadoop's FileSystem API avoids the limitation described above, where setURLStreamHandlerFactory() may only be called once per JVM.
Quick reference:
(1) org.apache.hadoop.conf.Configured
    |__ org.apache.hadoop.fs.FileSystem
    public abstract class FileSystem extends Configured implements Closeable
    [methods]:
    static FileSystem get(URI uri, Configuration conf)
    FSDataInputStream open(Path f)
(2) java.io.InputStream
    |__ java.io.FilterInputStream
        |__ java.io.DataInputStream
            |__ org.apache.hadoop.fs.FSDataInputStream
    public class FSDataInputStream extends DataInputStream implements Seekable, PositionedReadable, Closeable
    [methods]:
    void seek(long desired)
    long getPos()
(3) org.apache.hadoop.fs.Path
    public class Path extends Object implements Comparable
    [methods]:
    Path(String pathString)
(4) java.net.URI
    public final class URI extends Object implements Comparable<URI>, Serializable
    [methods]:
    static URI create(String str)
Code:
import java.net.URI;
import java.io.InputStream;

import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.conf.Configuration;

public class URICat {
    public static void main(String[] args) throws Exception {
        String uri = args[0];
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create(uri), conf);
        InputStream in = null;
        try {
            in = fs.open(new Path(uri));
            IOUtils.copyBytes(in, System.out, 4096, false);
        } finally {
            IOUtils.closeStream(in);
        }
    }
}
Steps to run:
$source $YARN_HOME/libexec/hadoop-config.sh
$mkdir myclass
$javac -cp $CLASSPATH URICat.java -d myclass
$jar -cvf uricat.jar -C myclass ./
$yarn jar uricat.jar URICat /user/grid/bar.txt
Note 1: because we go through the FileSystem API, the input file path may omit the full hdfs:// URI scheme, as in the command above.
Note 2: FileSystem is an abstract class, so you cannot obtain an instance with new FileSystem(); call its static get() method instead.
Note 3: mind the upcasting in Java, reflected in the stream inheritance hierarchy shown in the quick reference above (the FSDataInputStream methods seek() and getPos() listed there are demonstrated in the sketch after these notes).
Note 4: Configuration conf = new Configuration();
A Configuration is populated from <name>/<value> pairs in XML resource files. The rules for locating a resource x are:
if x is named by a String, the classpath is searched for a file with that name;
if x is named by a Path, it is looked up directly on the local filesystem, without checking the classpath;
if the user specifies nothing, the two default resources core-default.xml and core-site.xml are loaded;
users can add their own configuration resources by pointing at an XML file of their own:
conf.addResource("my_configuration.xml");
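The quick reference above also lists FSDataInputStream's seek() and getPos(), which URICat does not use. Here is a minimal sketch of them (the class name DoubleCat and the double-print idea are my own, not from the original): it streams the file to standard output, seeks back to the beginning, and streams it again. Keep in mind that seek() is a relatively expensive operation and should be used sparingly.

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class DoubleCat {
    public static void main(String[] args) throws Exception {
        String uri = args[0];
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create(uri), conf);
        FSDataInputStream in = null;   // keep the concrete type so seek()/getPos() are available
        try {
            in = fs.open(new Path(uri));
            IOUtils.copyBytes(in, System.out, 4096, false);
            System.out.println("position after first pass: " + in.getPos());
            in.seek(0);                // jump back to the start of the file
            IOUtils.copyBytes(in, System.out, 4096, false);
        } finally {
            IOUtils.closeStream(in);
        }
    }
}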
3. Writing data
3.1 Copying a local file to HDFS
Version 1: file copy with the copyBytes() method
Quick reference:
The core is a single line that copies raw bytes from the InputStream to the OutputStream:
static void copyBytes(InputStream in, OutputStream out, int buffSize, boolean close)
We create a FileInputStream(localsrc), wrap it in a BufferedInputStream, and upcast the result to InputStream:
FileInputStream(String name)
FileSystem is then used to produce the OutputStream:
FSDataOutputStream create(Path f, Progressable progress)
Code:
import java.net.URI;
import java.io.InputStream;
import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.OutputStream;

import org.apache.hadoop.util.Progressable;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.conf.Configuration;

public class FileCopyWithProgress {
    public static void main(String[] args) throws Exception {
        String localsrc = args[0];
        String dst = args[1];
        InputStream in = new BufferedInputStream(new FileInputStream(localsrc));
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create(dst), conf);
        OutputStream out = fs.create(new Path(dst), new Progressable() {
            public void progress() { System.out.print("."); }
        });
        IOUtils.copyBytes(in, out, 4096, true);
    }
}
Steps to run:
$. $YARN_HOME/libexec/hadoop-config.sh
$javac -cp $CLASSPATH -d my_class FileCopyWithProgress.java
$jar -cvf filecopywithprogress.jar -C my_class/ .
# assume we have a local file foo.out in directory /home/grid, then run yarn as below
$yarn jar filecopywithprogress.jar FileCopyWithProgress /home/grid/foo.out hdfs:///user/grid/copied_foo.out
# we can do a check for the copied file
$hadoop fs -ls -R /user/grid/
Note: from here on, a different way of compiling and running the code is used.
Version 2: using FileSystem's copyFromLocalFile() method
Code:
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.conf.Configuration;

public class FileCopyFromLocal {
    public static void main(String[] args) throws Exception {
        String localSrc = args[0];
        String dst = args[1];
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        fs.copyFromLocalFile(new Path(localSrc), new Path(dst));
    }
}
Steps to run:
$source $YARN_HOME/libexec/hadoop-config.sh
$javac FileCopyFromLocal.java -d class/
$jar -cvf filecopyfromlocal.jar -C class ./
$export HADOOP_CLASSPATH=$CLASSPATH:filecopyfromlocal.jar
# suppose we have a file bar.txt on the local disk; use the following command to copy it to hdfs
$yarn FileCopyFromLocal bar.txt hdfs:///user/grid/kissyou
# we can check the copied file on hdfs
$hadoop fs -ls /user/grid/
-rw-r--r--   3 grid supergroup        899 2013-11-17 01:33 /user/grid/kissyou
3.2 Creating directories and files
Creating a directory: FileSystem.mkdirs()
Code:
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;

public class CreateDir {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        String dst = args[0];
        FileSystem fs = FileSystem.get(conf);
        fs.mkdirs(new Path(dst));
    }
}
Steps to run:
$source $YARN_HOME/libexec/hadoop-config.sh
$javac CreateDir.java -d class/
$jar -cvf createdir.jar -C class ./
$export HADOOP_CLASSPATH=$CLASSPATH:createdir.jar
$yarn CreateDir hdfs:///user/grid/kissyou
# we can check the created directory on hdfs
$hadoop fs -ls /user/grid/
drwxr-xr-x   - grid supergroup          0 2013-11-17 01:33 /user/grid/kissyou
Creating a file: FileSystem.create()
Code:
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;

public class CreateFile {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        String dst = args[0];
        FileSystem fs = FileSystem.get(conf);
        // create() returns an FSDataOutputStream; close it so the empty file is finalized
        fs.create(new Path(dst)).close();
    }
}
Steps to run:
$source $YARN_HOME/libexec/hadoop-config.sh
$javac CreateFile.java -d class/
$jar -cvf createfile.jar -C class ./
$export HADOOP_CLASSPATH=$CLASSPATH:createfile.jar
$yarn CreateFile hdfs:///user/grid/kissyou.txt
# we can check the created file on hdfs
$hadoop fs -ls /user/grid/
-rw-r--r--   3 grid supergroup        899 2013-11-17 01:33 /user/grid/kissyou.txt
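create() does more than register an empty entry: it returns an FSDataOutputStream that can be written to. Below is a minimal sketch of actually writing data (the class name WriteFile and the sample content are mine, for illustration only); it writes a short string into a new HDFS file and closes the stream:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class WriteFile {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        FSDataOutputStream out = fs.create(new Path(args[0]));
        try {
            out.write("hello hdfs\n".getBytes("UTF-8"));   // write raw bytes into the new file
        } finally {
            out.close();                                    // flush and finalize the file
        }
    }
}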
Three things to note:
1. You cannot create a file foo and a directory foo/ with the same name under the same path; doing so throws a runtime exception (see the sketch after these notes):
fs.FileAlreadyExistsException
2. mkdirs() is called automatically when we copy or write files, so we rarely need to call it by hand to create directories.
3. The official API documentation describes mkdirs() as "Make the given file and all non-existent parents into directories", so directory creation in Hadoop is recursive, equivalent to the Linux command:
$mkdir -p foo/bar/qzx
and likewise equivalent to the HDFS shell command:
%$YARN_HOME/bin/hadoop fs -mkdir -p hdfs:///foo/bar/qzx
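A small sketch tying points 1 and 3 together (the class name SafeMkdir is mine; the target path is whatever you pass as args[0]): it checks whether the path already exists before calling mkdirs(), which in a single call creates any missing parent directories:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SafeMkdir {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path p = new Path(args[0]);   // e.g. hdfs:///foo/bar/qzx
        if (fs.exists(p)) {
            // creating a directory over an existing file of the same name is what
            // triggers fs.FileAlreadyExistsException at runtime
            System.out.println(p + " already exists, nothing to do");
        } else {
            // like `mkdir -p`: all non-existent parents are created as well
            System.out.println("mkdirs returned " + fs.mkdirs(p));
        }
    }
}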
4. Testing files and getting FileStatus
Note: some APIs are deprecated in Hadoop 2.2. Only the deprecated methods touched by this example are listed below, together with the methods that replace them.
Deprecated APIs:
(1) java.lang.Object
    |__ org.apache.hadoop.fs.FileStatus
    // deprecated method:
    boolean isDir()   // Deprecated. Use isFile(), isDirectory(), and isSymlink() instead.
(2) java.lang.Object
    |__ org.apache.hadoop.conf.Configured
        |__ org.apache.hadoop.fs.FileSystem
    // deprecated methods:
    boolean isDirectory(Path f)     // Deprecated. Use getFileStatus() instead.
    short getReplication(Path src)  // Deprecated. Use getFileStatus() instead.
    long getLength(Path f)          // Deprecated. Use getFileStatus() instead.
Code:
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;

public class TestFileStatus {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        FileStatus stat = fs.getFileStatus(new Path(args[0]));
        if (stat.isDirectory()) {
            System.out.println(stat.getPath().toUri().getPath() + " is a directory.");
        } else if (stat.isFile()) {
            System.out.println(stat.getPath().toUri().getPath() + " is a file.");
            System.out.println(stat.getPath().toUri().getPath() + " getBlockSize: " + stat.getBlockSize());
            System.out.println(stat.getPath().toUri().getPath() + " getLen(): " + stat.getLen());
            System.out.println(stat.getPath().toUri().getPath() + " getOwner(): " + stat.getOwner());
            System.out.println(stat.getPath().toUri().getPath() + " getGroup(): " + stat.getGroup());
            System.out.println(stat.getPath().toUri().getPath() + " getAccessTime(): " + stat.getAccessTime());
            System.out.println(stat.getPath().toUri().getPath() + " getModificationTime(): " + stat.getModificationTime());
            System.out.println(stat.getPath().toUri().getPath() + " getPermission(): " + stat.getPermission());
            System.out.println(stat.getPath().toUri().getPath() + " hashcode(): " + stat.hashCode());
            System.out.println(stat.getPath().toUri().getPath() + " getPath(): " + stat.getPath());
        }
    }
}
First, a little bonus ^_^: with the small script below (written by me) you can conveniently compile the sources and build the jar:
#!/usr/bin/env sh
CWD=$(pwd)
export CLASSPATH=''
. $YARN_HOME/libexec/hadoop-config.sh

if [ -d class ]; then
    rm -rf class/*
else
    mkdir $CWD/class
fi

for f in $@
do
    srcs="$srcs $CWD/$f"
done

javac $srcs -d class

if [ $? -ne 0 ] ; then
    echo Error found when compiling the code!
    exit 1
fi

class=$( cat $1 | grep 'package' | sed -e "s/\(package\s\)\|\(;\)//g" ).$(echo $1 | sed -r 's/(.*).java/echo \1/ge')
jarfile=$(echo $1 | sed -r 's/(.*)\.java/echo \L\1\.jar/ge')

jar -cvf $CWD/$jarfile -C $CWD/class . > /dev/null 2>&1
#echo jar -cvf $jarfile -C class .
echo -----------------CMD Lines-----------------------
echo source $YARN_HOME/libexec/hadoop-config.sh > sourceIt.sh
echo export HADOOP_CLASSPATH=$jarfile:'$CLASSPATH' >> sourceIt.sh
echo source $CWD/sourceIt.sh
echo yarn $class [command args]...
Steps to run:
Note: to keep things simple, the script assumes that in
$./compack.sh args1 args2 args3...
args1 is the source file that contains the main class.
$chmod 500 compack.sh
$./compack.sh TestFileStatus.java
# the script then reminds you with the following message:
-----------------CMD Lines------------------
source /home/grid/hadoop-2.2.0-src/hadoop-dist/target/hadoop-2.2.0/task/DFSAPIProgramming/sourceIt.sh
yarn TestFileStatus [command args]...
$source sourceIt.sh
# suppose we have a file "part-m-00000" in hdfs; run yarn as below
$yarn TestFileStatus /user/hive/warehouse/footbl/part-m-00000
Output:
/user/hive/warehouse/footbl/part-m-00000 is a file.
/user/hive/warehouse/footbl/part-m-00000 getBlockSize: 134217728
/user/hive/warehouse/footbl/part-m-00000 getLen(): 1275
/user/hive/warehouse/footbl/part-m-00000 getOwner(): grid
/user/hive/warehouse/footbl/part-m-00000 getGroup(): supergroup
/user/hive/warehouse/footbl/part-m-00000 getAccessTime(): 1384675957784
/user/hive/warehouse/footbl/part-m-00000 getModificationTime(): 1384675958368
/user/hive/warehouse/footbl/part-m-00000 getPermission(): rw-r--r--
/user/hive/warehouse/footbl/part-m-00000 hashcode(): 1096001837
/user/hive/warehouse/footbl/part-m-00000 getPath(): hdfs://cluster1:9000/user/hive/warehouse/footbl/part-m-00000
5. Listing and globbing files
Listing files
Code:
import org.apache.hadoop.fs.FileUtil;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.conf.Configuration;

public class ListFiles {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        Path[] paths = new Path[args.length];
        for (int i = 0; i < args.length; i++) {
            paths[i] = new Path(args[i]);
        }

        FileStatus[] status = fs.listStatus(paths);
        Path[] pathList = FileUtil.stat2Paths(status);
        for (Path p : pathList) {
            System.out.println(p);
        }
    }
}
Steps to run:
$./compack.sh ListFiles.java
$source sourceIt.sh
$yarn ListFiles /user/hive/warehouse/footbl /user/grid/
output:
hdfs://cluster1:9000/user/hive/warehouse/footbl/_SUCCESS
hdfs://cluster1:9000/user/hive/warehouse/footbl/part-m-00000
hdfs://cluster1:9000/user/grid/kiss
hdfs://cluster1:9000/user/grid/kissyou
hdfs://cluster1:9000/user/grid/missyou
Filtering files
Quick reference:
1. java.lang.Object
   |__ org.apache.hadoop.conf.Configured
       |__ org.apache.hadoop.fs.FileSystem
   public abstract class FileSystem extends Configured implements Closeable
   // method:
   FileStatus[] globStatus(Path pathPattern, PathFilter filter)
2. org.apache.hadoop.fs.PathFilter
   public interface PathFilter
   // method:
   boolean accept(Path path)
Code:
package org.apache.hadoop.MyCode;

import org.apache.hadoop.fs.PathFilter;
import org.apache.hadoop.fs.Path;

public class MyFilter implements PathFilter {
    private final String regex;

    public MyFilter(String regex) {
        this.regex = regex;
    }

    public boolean accept(Path path) {
        return path.toString().matches(regex);
    }
}
package org.apache.hadoop.MyCode;

import org.apache.hadoop.MyCode.MyFilter;

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.FileUtil;
import org.apache.hadoop.conf.Configuration;

public class ListStatusWithPattern {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // args[0] is a glob pattern expanded by globStatus();
        // args[1] is a Java regular expression applied by MyFilter
        FileStatus[] status = fs.globStatus(new Path(args[0]), new MyFilter(args[1]));
        Path[] pathList = FileUtil.stat2Paths(status);

        for (Path p : pathList) {
            System.out.println(p);
        }
    }
}
Steps to run:
$source $YARN_HOME/libexec/hadoop-config.sh
$mkdir class
$javac ListStatusWithPattern.java MyFilter.java -d class
$jar -cvf liststatuswithpattern.jar -C class ./
$export HADOOP_CLASSPATH=liststatuswithpattern.jar:$CLASSPATH
# suppose we have four entries in hdfs like below
$hadoop fs -ls /user/grid/
Found 4 items
drwxr-xr-x   - grid supergroup          0 2013-11-17 01:06 /user/grid/kiss
-rw-r--r--   3 grid supergroup          0 2013-11-17 06:05 /user/grid/kissyou
drwxr-xr-x   - grid supergroup          0 2013-11-17 19:33 /user/grid/miss
-rw-r--r--   3 grid supergroup        899 2013-11-17 01:33 /user/grid/missyou
# then we can run the command to list the matching files
$yarn jar liststatuswithpattern.jar org.apache.hadoop.MyCode.ListStatusWithPattern "hdfs:///user/grid/*ss*" "^.*grid/[k].*$"
Alternatively, you can use the script given earlier to compile, package, and print the main yarn command line:
$./compack.sh ListStatusWithPattern.java MyFilter.java
# note: the script assumes the first source file given contains the main class
-----------------CMD Lines-----------------------
source /home/grid/hadoop-2.2.0-src/hadoop-dist/target/hadoop-2.2.0/task/DFSAPIProgramming/sourceIt.sh
yarn org.apache.hadoop.MyCode.ListStatusWithPattern [command args]...
$source /home/grid/hadoop-2.2.0-src/hadoop-dist/target/hadoop-2.2.0/task/DFSAPIProgramming/sourceIt.sh
$yarn org.apache.hadoop.MyCode.ListStatusWithPattern "hdfs:///user/grid/*ss*" "^.*grid/[k].*$"
output:
hdfs://cluster1:9000/user/grid/kiss
hdfs://cluster1:9000/user/grid/kissyou
(End)