一、Hadoop environment preparation (the Hadoop version I installed is hadoop-1.2.1)
First, configure the Hadoop environment variable HADOOP_CLASSPATH (my Hadoop is installed under /home/grid/hadoop-1.2.1) in conf/hadoop-env.sh:
# Set Hadoop-specific environment variables here.
# The only required environment variable is JAVA_HOME. All others are
# optional. When running a distributed configuration it is best to
# set JAVA_HOME in this file, so that it is correctly defined on
# remote nodes.
# The java implementation to use. Required.
export JAVA_HOME=/usr/jdk1.6.0_45
# Extra Java CLASSPATH elements. Optional.
export HADOOP_CLASSPATH=/home/grid/hadoop-1.2.1/myclass
I created the myclass directory under the Hadoop installation directory.
Confirm that the HADOOP_CLASSPATH environment variable takes effect:
[grid@h1 conf]$ source hadoop-env.sh
[grid@h1 conf]$ echo $HADOOP_CLASSPATH
/home/grid/hadoop-1.2.1/myclass
二、Writing and compiling the URLCat source code
1、What the code does
It calls the Hadoop Java API to read a file from HDFS.
The Hadoop Java API documentation is available at http://hadoop.apache.org/docs/stable/api/index.html
2、The code
[grid@h1 myclass]$ cat URLCat.java
import org.apache.hadoop.fs.FsUrlStreamHandlerFactory;
import org.apache.hadoop.io.IOUtils;
import java.io.InputStream;
import java.net.URL;

public class URLCat {
    static {
        // Register Hadoop's URL stream handler factory so java.net.URL
        // understands the hdfs:// scheme (can only be done once per JVM).
        URL.setURLStreamHandlerFactory(new FsUrlStreamHandlerFactory());
    }

    public static void main(String[] args) throws Exception {
        InputStream in = null; // input stream for the file
        try {
            in = new URL(args[0]).openStream();             // open the URL passed as the first argument
            IOUtils.copyBytes(in, System.out, 4096, false); // copy the stream's contents to the screen (System.out)
        } finally {
            IOUtils.closeStream(in);
        }
    }
}
3、Compiling and running the code
[grid@h1 myclass]$ javac URLCat.java
URLCat.java:1: package org.apache.hadoop.fs does not exist
import org.apache.hadoop.fs.FsUrlStreamHandlerFactory;
^
URLCat.java:2: package org.apache.hadoop.io does not exist
import org.apache.hadoop.io.IOUtils;
^
URLCat.java:9: cannot find symbol
symbol  : class FsUrlStreamHandlerFactory
location: class URLCat
URL.setURLStreamHandlerFactory(new FsUrlStreamHandlerFactory());
^
URLCat.java:16: cannot find symbol
symbol  : variable IOUtils
location: class URLCat
IOUtils.copyBytes(in, System.out, 4096, false);
^
URLCat.java:19: cannot find symbol
symbol  : variable IOUtils
location: class URLCat
IOUtils.closeStream(in);
^
5 errors
The classpath must be specified for the compilation to succeed:
[grid@h1 myclass]$ javac -cp /home/grid/hadoop-1.2.1/hadoop-core-1.2.1.jar URLCat.java
[grid@h1 myclass]$ ls
URLCat.class URLCat.java
The compilation succeeded and produced URLCat.class in the $HADOOP_CLASSPATH directory, so the hadoop launcher script can find the class.
Run the code:
[grid@h1 myclass]$ hadoop fs -ls in
Found 4 items
-rw-r--r-- 2 grid supergroup 101 2013-08-24 12:50 /user/grid/in/VERSION
-rw-r--r-- 2 grid supergroup 7 2013-08-23 21:30 /user/grid/in/test3.txt
-rw-r--r-- 2 grid supergroup 12 2013-08-23 21:30 /user/grid/in/text1.txt
-rw-r--r-- 2 grid supergroup 13 2013-08-23 21:30 /user/grid/in/text2.txt
[grid@h1 myclass]$ hadoop fs -cat in/test3.txt
hadoop
[grid@h1 myclass]$ hadoop URLCat hdfs://h1:9000/user/grid/in/test3.txt
hadoop
4、Notes on the APIs used in the code
1、Hadoop base classes
org.apache.hadoop.fs Class FsUrlStreamHandlerFactory
java.lang.Object
org.apache.hadoop.fs.FsUrlStreamHandlerFactory
All Implemented Interfaces:
URLStreamHandlerFactory
public class FsUrlStreamHandlerFactory
extends Object
implements URLStreamHandlerFactory
Factory for URL stream handlers. There is only one handler whose job is to create UrlConnections. A FsUrlConnection relies on FileSystem to choose the appropriate FS implementation. Before returning our handler, we make sure that FileSystem knows an implementation for the requested scheme/protocol.
FsUrlStreamHandlerFactory: a factory for URL stream handlers. To use it, pass an FsUrlStreamHandlerFactory to java.net.URL's setURLStreamHandlerFactory() method; this registration can only be performed once per JVM.
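Because setURLStreamHandlerFactory() may already have been claimed by another part of the application, the same read can also be done through the FileSystem API instead of java.net.URL. The sketch below is only an illustration and is not part of the original example (the class name FileSystemCat is made up); it assumes the hadoop-core-1.2.1.jar is on the classpath, just like URLCat:

import java.io.InputStream;
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

// Hypothetical alternative to URLCat: read an HDFS file without touching
// the global URL stream handler factory.
public class FileSystemCat {
    public static void main(String[] args) throws Exception {
        String uri = args[0];                      // e.g. hdfs://h1:9000/user/grid/in/test3.txt
        Configuration conf = new Configuration();  // loads the cluster configuration from the classpath
        FileSystem fs = FileSystem.get(URI.create(uri), conf);
        InputStream in = null;
        try {
            in = fs.open(new Path(uri));           // returns an FSDataInputStream
            IOUtils.copyBytes(in, System.out, 4096, false);
        } finally {
            IOUtils.closeStream(in);
        }
    }
}

If compiled into the same myclass directory, it would be run the same way, e.g. hadoop FileSystemCat hdfs://h1:9000/user/grid/in/test3.txt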
org.apache.hadoop.io Class IOUtils
java.lang.Object
org.apache.hadoop.io.IOUtils
public class IOUtils
extends Object
A utility class for I/O-related functionality.
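In URLCat, the fourth argument of copyBytes() controls whether the streams are closed automatically when the copy finishes. Passing true lets IOUtils do the closing, so the explicit finally block is no longer needed. A minimal hypothetical variant (URLCatAutoClose is not part of the original example) to illustrate this:

import java.io.InputStream;
import java.net.URL;
import org.apache.hadoop.fs.FsUrlStreamHandlerFactory;
import org.apache.hadoop.io.IOUtils;

// Hypothetical variant of URLCat, for illustration only.
public class URLCatAutoClose {
    static {
        URL.setURLStreamHandlerFactory(new FsUrlStreamHandlerFactory());
    }

    public static void main(String[] args) throws Exception {
        InputStream in = new URL(args[0]).openStream();
        // close = true: copyBytes closes both 'in' and System.out after the
        // copy, which is acceptable here because the program exits right away.
        IOUtils.copyBytes(in, System.out, 4096, true);
    }
}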