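Lu Chunli's work notes: who says programmers can't have an artistic side?
Accessing HDFS through the Hadoop shell and the Java API
The work note "Hadoop 2.6 cluster setup" already got the cluster environment up and running. This note walks through some HDFS operations on it.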
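1. Accessing HDFS from the shell
HDFS is designed mainly for processing massive amounts of data, which means it stores a large number of files. HDFS splits these files into blocks and stores the blocks on different DataNodes. It provides a shell interface that hides the internal details of block storage, and all Hadoop operations are launched through the bin/hadoop script.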
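To make the splitting concrete, here is a minimal sketch of the arithmetic only. It assumes a block size of 128 MB, which is the Hadoop 2.x default for dfs.blocksize; a cluster may be configured differently. The class and method names are made up for illustration and are not part of Hadoop, and the file sizes are arbitrary.

```java
/** Illustrative only: estimates how many HDFS blocks a file of a given length occupies. */
final class BlockCountSketch {
    // Assumption: 128 MB blocks (the Hadoop 2.x default for dfs.blocksize).
    static final long BLOCK_SIZE = 128L * 1024 * 1024;

    /** Ceiling division: the last block may be only partially filled. */
    static long blockCount(long fileLength, long blockSize) {
        if (fileLength < 0 || blockSize <= 0) {
            throw new IllegalArgumentException("length must be >= 0 and block size > 0");
        }
        return (fileLength + blockSize - 1) / blockSize;
    }

    public static void main(String[] args) {
        long[] sizes = {63L, 93046447L, 300L * 1024 * 1024};
        for (long size : sizes) {
            System.out.println(size + " bytes -> " + blockCount(size, BLOCK_SIZE) + " block(s)");
        }
    }
}
```

Each of those blocks is then replicated and placed on DataNodes by HDFS itself, which is the part the shell hides from the user.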
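Running the hadoop command without any arguments prints a description of all of its subcommands. The one that deals with HDFS files is hadoop fs (the other subcommands of the hadoop script are not covered here).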
[hadoop@nnode ~]$ hadoop fs
Usage: hadoop fs [generic options]
        [-appendToFile <localsrc> ... <dst>]
        [-cat [-ignoreCrc] <src> ...]
        [-checksum <src> ...]
        [-chgrp [-R] GROUP PATH...]
        [-chmod [-R] <MODE[,MODE]... | OCTALMODE> PATH...]
        [-chown [-R] [OWNER][:[GROUP]] PATH...]
        [-copyFromLocal [-f] [-p] [-l] <localsrc> ... <dst>]
        [-copyToLocal [-p] [-ignoreCrc] [-crc] <src> ... <localdst>]
        [-count [-q] [-h] <path> ...]
        [-cp [-f] [-p | -p[topax]] <src> ... <dst>]
        [-createSnapshot <snapshotDir> [<snapshotName>]]
        [-deleteSnapshot <snapshotDir> <snapshotName>]
        [-df [-h] [<path> ...]]
        [-du [-s] [-h] <path> ...]
        [-expunge]
        [-get [-p] [-ignoreCrc] [-crc] <src> ... <localdst>]
        [-getfacl [-R] <path>]
        [-getfattr [-R] {-n name | -d} [-e en] <path>]
        [-getmerge [-nl] <src> <localdst>]
        [-help [cmd ...]]
        [-ls [-d] [-h] [-R] [<path> ...]]
        [-mkdir [-p] <path> ...]
        [-moveFromLocal <localsrc> ... <dst>]
        [-moveToLocal <src> <localdst>]
        [-mv <src> ... <dst>]
        [-put [-f] [-p] [-l] <localsrc> ... <dst>]
        [-renameSnapshot <snapshotDir> <oldName> <newName>]
        [-rm [-f] [-r|-R] [-skipTrash] <src> ...]
        [-rmdir [--ignore-fail-on-non-empty] <dir> ...]
        [-setfacl [-R] [{-b|-k} {-m|-x <acl_spec>} <path>]|[--set <acl_spec> <path>]]
        [-setfattr {-n name [-v value] | -x name} <path>]
        [-setrep [-R] [-w] <rep> <path> ...]
        [-stat [format] <path> ...]
        [-tail [-f] <file>]
        [-test -[defsz] <path>]
        [-text [-ignoreCrc] <src> ...]
        [-touchz <path> ...]
        [-usage [cmd ...]]

Generic options supported are
-conf <configuration file>     specify an application configuration file
-D <property=value>            use value for given property
-fs <local|namenode:port>      specify a namenode
-jt <local|resourcemanager:port>    specify a ResourceManager
-files <comma separated list of files>    specify comma separated files to be copied to the map reduce cluster
-libjars <comma separated list of jars>    specify comma separated jar files to include in the classpath.
-archives <comma separated list of archives>    specify comma separated archives to be unarchived on the compute machines.

The general command line syntax is
bin/hadoop command [genericOptions] [commandOptions]
Hadoop 2.6 describes hadoop fs as "Deprecated, use hdfs dfs instead." (I have not worked with versions earlier than 2.6, so I did not look into which release introduced this notice; hadoop fs still works.)
[hadoop@nnode ~]$ hdfs dfs
Usage: hadoop fs [generic options]
        [-appendToFile <localsrc> ... <dst>]
        [-cat [-ignoreCrc] <src> ...]
        [-checksum <src> ...]
        [-chgrp [-R] GROUP PATH...]
        [-chmod [-R] <MODE[,MODE]... | OCTALMODE> PATH...]
        [-chown [-R] [OWNER][:[GROUP]] PATH...]
        [-copyFromLocal [-f] [-p] [-l] <localsrc> ... <dst>]
        [-copyToLocal [-p] [-ignoreCrc] [-crc] <src> ... <localdst>]
        [-count [-q] [-h] <path> ...]
        [-cp [-f] [-p | -p[topax]] <src> ... <dst>]
        [-createSnapshot <snapshotDir> [<snapshotName>]]
        [-deleteSnapshot <snapshotDir> <snapshotName>]
        [-df [-h] [<path> ...]]
        [-du [-s] [-h] <path> ...]
        [-expunge]
        [-get [-p] [-ignoreCrc] [-crc] <src> ... <localdst>]
        [-getfacl [-R] <path>]
        [-getfattr [-R] {-n name | -d} [-e en] <path>]
        [-getmerge [-nl] <src> <localdst>]
        [-help [cmd ...]]
        [-ls [-d] [-h] [-R] [<path> ...]]
        [-mkdir [-p] <path> ...]
        [-moveFromLocal <localsrc> ... <dst>]
        [-moveToLocal <src> <localdst>]
        [-mv <src> ... <dst>]
        [-put [-f] [-p] [-l] <localsrc> ... <dst>]
        [-renameSnapshot <snapshotDir> <oldName> <newName>]
        [-rm [-f] [-r|-R] [-skipTrash] <src> ...]
        [-rmdir [--ignore-fail-on-non-empty] <dir> ...]
        [-setfacl [-R] [{-b|-k} {-m|-x <acl_spec>} <path>]|[--set <acl_spec> <path>]]
        [-setfattr {-n name [-v value] | -x name} <path>]
        [-setrep [-R] [-w] <rep> <path> ...]
        [-stat [format] <path> ...]
        [-tail [-f] <file>]
        [-test -[defsz] <path>]
        [-text [-ignoreCrc] <src> ...]
        [-touchz <path> ...]
        [-usage [cmd ...]]

Generic options supported are
-conf <configuration file>     specify an application configuration file
-D <property=value>            use value for given property
-fs <local|namenode:port>      specify a namenode
-jt <local|resourcemanager:port>    specify a ResourceManager
-files <comma separated list of files>    specify comma separated files to be copied to the map reduce cluster
-libjars <comma separated list of jars>    specify comma separated jar files to include in the classpath.
-archives <comma separated list of archives>    specify comma separated archives to be unarchived on the compute machines.

The general command line syntax is
bin/hadoop command [genericOptions] [commandOptions]
For example:
[hadoop@nnode ~]$ hdfs dfs -ls -R /user/hadoop
-rw-r--r--   2 hadoop hadoop       2297 2015-06-29 14:44 /user/hadoop/20130913152700.txt.gz
-rw-r--r--   2 hadoop hadoop        211 2015-06-29 14:45 /user/hadoop/20130913160307.txt.gz
-rw-r--r--   2 hadoop hadoop   93046447 2015-07-18 18:01 /user/hadoop/apache-hive-1.2.0-bin.tar.gz
-rw-r--r--   2 hadoop hadoop    4139112 2015-06-28 22:54 /user/hadoop/httpInterceptor_192.168.1.101_1_20130913160307.txt
-rw-r--r--   2 hadoop hadoop        240 2015-05-30 20:54 /user/hadoop/lucl.gz
-rw-r--r--   2 hadoop hadoop         63 2015-05-27 23:55 /user/hadoop/lucl.txt
-rw-r--r--   2 hadoop hadoop    9994248 2015-06-29 14:12 /user/hadoop/scalog.txt
-rw-r--r--   2 hadoop hadoop    2664495 2015-06-28 20:54 /user/hadoop/scalog.txt.gz
-rw-r--r--   3 hadoop hadoop   28026803 2015-06-24 21:16 /user/hadoop/test.txt.gz
-rw-r--r--   2 hadoop hadoop      28551 2015-05-27 23:54 /user/hadoop/zookeeper.out
[hadoop@nnode ~]$
# The dot below means the current directory. I am operating as the hadoop user, so it is equivalent to /user/hadoop.
# HDFS treats /user/{hadoop-user} as the default home directory, but you can also create your own directories under / with the mkdir command.
[hadoop@nnode ~]$ hdfs dfs -ls -R .
-rw-r--r--   2 hadoop hadoop       2297 2015-06-29 14:44 20130913152700.txt.gz
-rw-r--r--   2 hadoop hadoop        211 2015-06-29 14:45 20130913160307.txt.gz
-rw-r--r--   2 hadoop hadoop   93046447 2015-07-18 18:01 apache-hive-1.2.0-bin.tar.gz
-rw-r--r--   2 hadoop hadoop    4139112 2015-06-28 22:54 httpInterceptor_192.168.1.101_1_20130913160307.txt
-rw-r--r--   2 hadoop hadoop        240 2015-05-30 20:54 lucl.gz
-rw-r--r--   2 hadoop hadoop         63 2015-05-27 23:55 lucl.txt
-rw-r--r--   2 hadoop hadoop    9994248 2015-06-29 14:12 scalog.txt
-rw-r--r--   2 hadoop hadoop    2664495 2015-06-28 20:54 scalog.txt.gz
-rw-r--r--   3 hadoop hadoop   28026803 2015-06-24 21:16 test.txt.gz
-rw-r--r--   2 hadoop hadoop      28551 2015-05-27 23:54 zookeeper.out
[hadoop@nnode ~]$
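The way a relative path such as "." or lucl.txt ends up under /user/hadoop can be imitated with java.net.URI alone. This is only a rough sketch of the idea, assuming the home directory is /user/<user> as on this cluster; it is not how Hadoop resolves paths internally, and the class and method names are hypothetical.

```java
import java.net.URI;

/** Illustrative only: mimics resolving a path against an assumed /user/<user> home directory. */
final class HomeDirSketch {
    /** Relative paths are resolved against the home directory; absolute paths are returned unchanged. */
    static URI resolve(String user, String path) {
        // Assumption: the home directory is /user/<user>, and the user name is URI-safe.
        URI home = URI.create("/user/" + user + "/");
        return home.resolve(path);
    }

    public static void main(String[] args) {
        System.out.println(resolve("hadoop", "."));
        System.out.println(resolve("hadoop", "lucl.txt"));
        System.out.println(resolve("hadoop", "/tmp/other.txt"));
    }
}
```

The same idea explains why the two listings above show the same files, once with full paths and once with bare names.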
If you are not sure how a particular hdfs command works, check its help text:
[hadoop@nnode ~]$ hdfs dfs -help ls
-ls [-d] [-h] [-R] [<path> ...] :
  List the contents that match the specified file pattern. If path is not
  specified, the contents of /user/<currentUser> will be listed. Directory entries
  are of the form:
        permissions - userId groupId sizeOfDirectory(in bytes)
  modificationDate(yyyy-MM-dd HH:mm) directoryName

  and file entries are of the form:
        permissions numberOfReplicas userId groupId sizeOfFile(in bytes)
  modificationDate(yyyy-MM-dd HH:mm) fileName

  -d  Directories are listed as plain files.
  -h  Formats the sizes of files in a human-readable fashion rather than a number
      of bytes.
  -R  Recursively list the contents of directories.
[hadoop@nnode ~]$
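The entry format described in the help text can be picked apart with a few lines of plain Java. The sketch below is an assumption-laden illustration, not a Hadoop utility: it expects exactly the column order shown above (permissions, replicas, owner, group, size, date, time, name), and the class name is made up.

```java
/** Illustrative only: splits one line of "hdfs dfs -ls" output into its columns. */
final class LsLineSketch {
    final String permissions;
    final String replicas;   // "-" for directories, a number for files
    final String owner;
    final String group;
    final long size;
    final String modified;   // yyyy-MM-dd HH:mm
    final String path;

    private LsLineSketch(String[] f) {
        permissions = f[0];
        replicas = f[1];
        owner = f[2];
        group = f[3];
        size = Long.parseLong(f[4]);
        modified = f[5] + " " + f[6];
        path = f[7];
    }

    static LsLineSketch parse(String line) {
        // Limit the split to 8 fields so that a name containing spaces stays in one piece.
        String[] fields = line.trim().split("\\s+", 8);
        if (fields.length != 8) {
            throw new IllegalArgumentException("unexpected ls line: " + line);
        }
        return new LsLineSketch(fields);
    }

    public static void main(String[] args) {
        LsLineSketch e = parse(
            "-rw-r--r--   2 hadoop hadoop       2297 2015-06-29 14:44 /user/hadoop/20130913152700.txt.gz");
        System.out.println(e.path + " | replicas=" + e.replicas + " | bytes=" + e.size);
    }
}
```

Inside a Java program the FileSystem API shown later is the better way to get the same metadata; parsing shell output is mostly useful in quick scripts.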
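2. Accessing HDFS through the Java API
Hadoop stores data on DataNodes, while the NameNode records where that data is stored. Communication between Hadoop components is built on RPC. The NameNode acts as the RPC server (dfs.namenode.rpc-address specifies the host name and port of the RPC endpoint), and the FileSystem class provided by Hadoop is the abstraction of the RPC client.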
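a.) Reading HDFS data through java.net.URL
For a Java program to recognize Hadoop's hdfs:// URLs, a stream handler factory has to be registered through URL.setURLStreamHandlerFactory(...).
This method can be called only once per Java virtual machine, so the call is usually placed in a static initializer block.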
package com.invic.hdfs;

import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.URL;

import org.apache.hadoop.fs.FsUrlStreamHandlerFactory;
import org.apache.hadoop.io.IOUtils;

/**
 * Reads a specific file on HDFS through the Java API.
 *
 * @author lucl
 */
public class MyHdfsOfJavaApi {
    static {
        /**
         * An extra URLStreamHandlerFactory is needed so that Java can recognize Hadoop's hdfs URLs.
         * The JVM allows this method to be called only once. If some other code has already
         * registered a factory, this program will not be able to read data from Hadoop this way.
         */
        URL.setURLStreamHandlerFactory(new FsUrlStreamHandlerFactory());
    }

    public static void main(String[] args) throws IOException {
        String path = "hdfs://nnode:8020/user/hadoop/lucl.txt";
        InputStream in = null;
        try {
            in = new URL(path).openStream();
            OutputStream ou = System.out;
            int buffer = 4096;
            boolean close = false;   // leave System.out open after copying
            IOUtils.copyBytes(in, ou, buffer, close);
        } finally {
            // Close the input stream even if the copy fails.
            IOUtils.closeStream(in);
        }
    }
}
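The once-per-JVM restriction can be observed with JDK classes alone, without Hadoop on the classpath. The sketch below installs a do-nothing factory (a stand-in for FsUrlStreamHandlerFactory) and then tries again; the class and method names are hypothetical.

```java
import java.net.URL;
import java.net.URLStreamHandler;
import java.net.URLStreamHandlerFactory;

/** Illustrative only: shows that a second URL.setURLStreamHandlerFactory call is rejected. */
final class FactoryOnceSketch {
    /** Returns true if the factory was installed, false if this JVM already had one. */
    static boolean tryInstall(URLStreamHandlerFactory factory) {
        try {
            URL.setURLStreamHandlerFactory(factory);
            return true;
        } catch (Error alreadyDefined) {
            // The JDK signals a repeated registration with java.lang.Error.
            return false;
        }
    }

    public static void main(String[] args) {
        // Returning null tells the JDK to fall back to its built-in protocol handlers.
        URLStreamHandlerFactory noop = new URLStreamHandlerFactory() {
            @Override
            public URLStreamHandler createURLStreamHandler(String protocol) {
                return null;
            }
        };
        System.out.println("first call installed:  " + tryInstall(noop));
        System.out.println("second call installed: " + tryInstall(noop));
    }
}
```

In a fresh JVM the first call succeeds and the second is rejected, which is why this approach is fragile inside containers or frameworks that may have registered their own factory. The FileSystem API in the next section avoids the problem.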
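b.) Accessing HDFS through Hadoop's FileSystem
Hadoop has an abstract notion of a file system, and HDFS is just one implementation of it. The Java abstract class org.apache.hadoop.fs.FileSystem defines the file system interface in Hadoop.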
java.lang.Object
  org.apache.hadoop.conf.Configured
    org.apache.hadoop.fs.FileSystem
      |--org.apache.hadoop.fs.FilterFileSystem
      |----org.apache.hadoop.fs.ChecksumFileSystem
      |------org.apache.hadoop.fs.LocalFileSystem
      |--org.apache.hadoop.fs.ftp.FTPFileSystem
      |--org.apache.hadoop.fs.s3native.NativeS3FileSystem
      |--org.apache.hadoop.fs.RawLocalFileSystem
      |--org.apache.hadoop.fs.viewfs.ViewFileSystem
package com.invic.hdfs;

import java.io.IOException;
import java.io.OutputStream;
import java.net.URI;
import java.util.Scanner;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.FileUtil;
import org.apache.hadoop.fs.LocatedFileStatus;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.PathFilter;
import org.apache.hadoop.fs.RemoteIterator;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.util.Progressable;

/**
 * Access through the FileSystem API.
 *
 * FileSystem get(Configuration)            uses the core-site.xml found on the classpath; defaults to the local file system
 * FileSystem get(URI, Configuration)       selects the file system to use from the URI
 * FileSystem get(URI, Configuration, user) accesses the file system as the given user, which matters for security
 *
 * @author lucl
 */
public class MyHdfsOfFS {
    private static String HOST = "hdfs://nnode";
    private static String PORT = "8020";
    private static String NAMENODE = HOST + ":" + PORT;

    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        String path = NAMENODE + "/user/";

        /**
         * The user is set to hadoop here, so relative paths are looked up
         * under that user's home directory on HDFS.
         */
        String user = "hadoop";
        FileSystem fs = null;
        try {
            fs = FileSystem.get(URI.create(path), conf, user);
        } catch (InterruptedException e) {
            e.printStackTrace();
        }
        if (null == fs) {
            return;
        }

        /**
         * Create directories recursively
         */
        boolean mkdirs = fs.mkdirs(new Path("invic/test/mvtech"));
        if (mkdirs) {
            System.out.println("Dir 'invic/test/mvtech' create success.");
        }

        /**
         * Check whether the directory exists
         */
        boolean exists = fs.exists(new Path("invic/test/mvtech"));
        if (exists) {
            System.out.println("Dir 'invic/test/mvtech' exists.");
        }

        /**
         * FSDataInputStream supports random access (seek).
         * Without a user, lucl.txt would be looked up as /user/Administrator/lucl.txt,
         * because I run this from Eclipse on Windows.
         * With the user passed to get() above, the path becomes /user/<that user>/lucl.txt.
         */
        FSDataInputStream in = fs.open(new Path("lucl.txt"));
        OutputStream os = System.out;
        int buffSize = 4096;
        boolean close = false;
        IOUtils.copyBytes(in, os, buffSize, close);
        System.out.println("\r\nSeek back to the start and read the file again......");
        in.seek(0);
        IOUtils.copyBytes(in, os, buffSize, close);
        IOUtils.closeStream(in);

        /**
         * Create a file
         */
        FSDataOutputStream create = fs.create(new Path("sample.txt"));
        create.write("This is my first sample file.".getBytes());
        create.flush();
        create.close();

        /**
         * Copy a local file to HDFS
         */
        fs.copyFromLocalFile(new Path("F:\\Mvtech\\ftpfile\\cg-10086.com.csv"), new Path("cg-10086.com.csv"));

        /**
         * Append to a file.
         * Note: writeChars emits two bytes per char, unlike write(getBytes()) above.
         */
        FSDataOutputStream append = fs.append(new Path("sample.txt"));
        append.writeChars("\r\n");
        append.writeChars("New day, new World.");
        append.writeChars("\r\n");
        IOUtils.closeStream(append);

        /**
         * Using the progress callback
         */
        FSDataOutputStream progress = fs.create(new Path("progress.txt"), new Progressable() {
            @Override
            public void progress() {
                System.out.println("write is in progress......");
            }
        });
        // Write keyboard input to HDFS until the user types "quit"
        Scanner sc = new Scanner(System.in);
        System.out.print("Please type your input : ");
        String name = sc.nextLine();
        while (!"quit".equals(name)) {
            // Skip blank input, but always read the next line so the loop cannot spin forever.
            if (!name.trim().isEmpty()) {
                progress.writeChars(name);
            }
            System.out.print("Please type your input : ");
            name = sc.nextLine();
        }
        // Close the stream so the data is actually committed to HDFS.
        progress.close();

        /**
         * List files recursively
         */
        RemoteIterator<LocatedFileStatus> it = fs.listFiles(new Path(path), true);
        while (it.hasNext()) {
            LocatedFileStatus loc = it.next();
            System.out.println(loc.getPath().getName() + "|" + loc.getLen() + "|" + loc.getOwner());
        }

        /**
         * File or directory metadata: length, block size, replication,
         * modification time, owner and permissions
         */
        FileStatus status = fs.getFileStatus(new Path("lucl.txt"));
        System.out.println(status.getPath().getName() + "|" + status.getPath().getParent().getName() + "|"
                + status.getBlockSize() + "|" + status.getReplication() + "|" + status.getOwner());

        /**
         * listStatus lists the files in a directory. If the argument is a file,
         * it returns an array holding a single FileStatus.
         */
        fs.listStatus(new Path(path));
        fs.listStatus(new Path(path), new PathFilter() {
            @Override
            public boolean accept(Path tmpPath) {
                // Keep only .txt files
                return tmpPath.getName().endsWith(".txt");
            }
        });
        // A set of paths can be passed in; the results are merged into one array.
        // fs.listStatus(Path [] files);
        FileStatus [] mergeStatus = fs.listStatus(new Path[]{new Path("lucl.txt"), new Path("progress.txt"), new Path("sample.txt")});
        Path [] listPaths = FileUtil.stat2Paths(mergeStatus);
        for (Path p : listPaths) {
            System.out.println(p);
        }

        /**
         * File pattern matching (glob)
         */
        FileStatus [] patternStatus = fs.globStatus(new Path("*.txt"));
        for (FileStatus stat : patternStatus) {
            System.out.println(stat.getPath());
        }

        /**
         * Delete data
         */
        boolean recursive = true;
        fs.delete(new Path("demo.txt"), recursive);

        fs.close();
    }
}
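One detail in the listing above is easy to miss: the file is created with write(byte[]) but appended to with writeChars(String). FSDataOutputStream extends java.io.DataOutputStream, so the difference can be shown with the JDK alone. The following is a small sketch with hypothetical names, writing to an in-memory buffer instead of HDFS.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

/** Illustrative only: compares the bytes produced by writeChars and by write(byte[]). */
final class WriteCharsSketch {
    /** writeChars writes every char as two bytes, high byte first. */
    static byte[] viaWriteChars(String text) {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buf)) {
            out.writeChars(text);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buf.toByteArray();
    }

    /** write(byte[]) writes exactly the encoded bytes, here UTF-8. */
    static byte[] viaWrite(String text) {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buf)) {
            out.write(text.getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buf.toByteArray();
    }

    public static void main(String[] args) {
        String text = "New day, new World.";
        System.out.println("chars:            " + text.length());
        System.out.println("writeChars bytes: " + viaWriteChars(text).length);
        System.out.println("write bytes:      " + viaWrite(text).length);
    }
}
```

If the file is meant to be read back as ordinary text, for example with hdfs dfs -cat, writing encoded bytes consistently is probably the safer choice.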
c.) Accessing an HDFS cluster
package com.invic.hdfs;

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.LocatedFileStatus;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.RemoteIterator;
import org.apache.log4j.Logger;

/**
 * Accesses HDFS through the Hadoop cluster (nameservice) rather than a single NameNode.
 *
 * @author lucl
 */
public class MyClusterHdfs {
    public static void main(String[] args) throws IOException {
        System.setProperty("hadoop.home.dir", "E:\\Hadoop\\hadoop-2.6.0\\hadoop-2.6.0\\");

        Logger logger = Logger.getLogger(MyClusterHdfs.class);

        Configuration conf = new Configuration();

        conf.set("fs.defaultFS", "hdfs://cluster");
        conf.set("dfs.nameservices", "cluster");
        conf.set("dfs.ha.namenodes.cluster", "nn1,nn2");
        conf.set("dfs.namenode.rpc-address.cluster.nn1", "nnode:8020");
        conf.set("dfs.namenode.rpc-address.cluster.nn2", "dnode1:8020");
        conf.set("dfs.client.failover.proxy.provider.cluster",
                "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider");

        FileSystem fs = FileSystem.get(conf);

        // Recursively list every file under the root directory
        RemoteIterator<LocatedFileStatus> it = fs.listFiles(new Path("/"), true);
        while (it.hasNext()) {
            LocatedFileStatus loc = it.next();
            logger.info(loc.getPath().getName() + "|" + loc.getLen() + "|" + loc.getOwner());
        }

        /*for (int i = 0; i < 500; i++) {
            String str = "the sequence is " + i;
            logger.info(str);
        }*/

        try {
            Thread.sleep(10);
        } catch (InterruptedException e) {
            e.printStackTrace();
        }

        System.exit(0);
    }
}
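The six conf.set(...) calls above follow a naming pattern: most keys embed the nameservice ID, and the RPC address keys also embed a NameNode ID. The sketch below derives the same keys from those two inputs using only JDK collections. The helper class is hypothetical and is not part of Hadoop; the key names and the provider class are the ones used in the listing.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Illustrative only: builds the client-side HA property names used in the listing above. */
final class HaClientConfigSketch {
    static Map<String, String> haClientProperties(String nameservice,
                                                  Map<String, String> rpcAddressByNameNodeId) {
        Map<String, String> props = new LinkedHashMap<>();
        props.put("fs.defaultFS", "hdfs://" + nameservice);
        props.put("dfs.nameservices", nameservice);
        // Comma-separated NameNode IDs, in the order they were supplied.
        props.put("dfs.ha.namenodes." + nameservice,
                String.join(",", rpcAddressByNameNodeId.keySet()));
        for (Map.Entry<String, String> e : rpcAddressByNameNodeId.entrySet()) {
            props.put("dfs.namenode.rpc-address." + nameservice + "." + e.getKey(), e.getValue());
        }
        props.put("dfs.client.failover.proxy.provider." + nameservice,
                "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider");
        return props;
    }

    public static void main(String[] args) {
        Map<String, String> nameNodes = new LinkedHashMap<>();
        nameNodes.put("nn1", "nnode:8020");
        nameNodes.put("nn2", "dnode1:8020");
        for (Map.Entry<String, String> e : haClientProperties("cluster", nameNodes).entrySet()) {
            System.out.println(e.getKey() + " = " + e.getValue());
        }
    }
}
```

Each resulting entry could be handed to conf.set(key, value). In practice the same properties usually live in hdfs-site.xml and core-site.xml on the classpath, which avoids hard-coding them.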
Notes:
System.setProperty("hadoop.home.dir", "E:\\Hadoop\\hadoop-2.6.0\\hadoop-2.6.0\\");
# Set the Hadoop home path on the first line of the main method. Otherwise, on Windows, errors like the following may be reported:
15/07/19 22:05:54 DEBUG util.Shell: Failed to detect a valid hadoop home directory
java.io.IOException: HADOOP_HOME or hadoop.home.dir are not set.
    at org.apache.hadoop.util.Shell.checkHadoopHome(Shell.java:302)
    at org.apache.hadoop.util.Shell.<clinit>(Shell.java:327)
    at org.apache.hadoop.util.GenericOptionsParser.preProcessForWindows(GenericOptionsParser.java:438)
    at org.apache.hadoop.util.GenericOptionsParser.parseGeneralOptions(GenericOptionsParser.java:484)
    at org.apache.hadoop.util.GenericOptionsParser.<init>(GenericOptionsParser.java:170)
    at org.apache.hadoop.util.GenericOptionsParser.<init>(GenericOptionsParser.java:153)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:64)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:84)
    at com.invic.mapreduce.wordcount.WordCounterTool.main(WordCounterTool.java:29)
15/07/19 22:05:54 ERROR util.Shell: Failed to locate the winutils binary in the hadoop binary path
java.io.IOException: Could not locate executable null\bin\winutils.exe in the Hadoop binaries.
    at org.apache.hadoop.util.Shell.getQualifiedBinPath(Shell.java:355)
    at org.apache.hadoop.util.Shell.getWinUtilsPath(Shell.java:370)
    at org.apache.hadoop.util.Shell.<clinit>(Shell.java:363)
    at org.apache.hadoop.util.GenericOptionsParser.preProcessForWindows(GenericOptionsParser.java:438)
    at org.apache.hadoop.util.GenericOptionsParser.parseGeneralOptions(GenericOptionsParser.java:484)
    at org.apache.hadoop.util.GenericOptionsParser.<init>(GenericOptionsParser.java:170)
    at org.apache.hadoop.util.GenericOptionsParser.<init>(GenericOptionsParser.java:153)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:64)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:84)
    at com.invic.mapreduce.wordcount.WordCounterTool.main(WordCounterTool.java:29)
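The "null\bin\winutils.exe" in the second error suggests that the home directory was simply missing when the path was assembled. A program can check for this up front, before any Hadoop class is touched. The sketch below is one possible pre-flight check under that assumption; the class and method names are hypothetical, and it only looks at the two settings named in the error message.

```java
import java.io.File;

/** Illustrative only: checks the Hadoop home settings before Hadoop classes are loaded. */
final class HadoopHomeSketch {
    /** Prefers the system property value, then the environment value; returns null if neither is set. */
    static String resolveHome(String systemProperty, String environmentVariable) {
        if (systemProperty != null && !systemProperty.trim().isEmpty()) {
            return systemProperty;
        }
        if (environmentVariable != null && !environmentVariable.trim().isEmpty()) {
            return environmentVariable;
        }
        return null;
    }

    /** Location where winutils.exe is expected, relative to the home directory. */
    static File winutils(String home) {
        return new File(new File(home, "bin"), "winutils.exe");
    }

    public static void main(String[] args) {
        String home = resolveHome(System.getProperty("hadoop.home.dir"), System.getenv("HADOOP_HOME"));
        if (home == null) {
            System.out.println("Neither hadoop.home.dir nor HADOOP_HOME is set.");
            return;
        }
        File exe = winutils(home);
        System.out.println(exe.getPath() + (exe.isFile() ? " found" : " is missing"));
    }
}
```

Running a check like this first gives a clearer message than the stack traces above, though setting hadoop.home.dir as the author does remains the actual fix.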