Hadoop Archives (HAR) is a special archive format. Each Hadoop Archive maps to a file system directory and carries the *.har extension. It contains metadata, in the form of the _index and _masterindex files, plus the data files (part-*). The _index file records the name and location of every file stored inside the archive.
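Since a *.har archive is just a directory on the underlying file system, its metadata can be inspected directly with ordinary FS shell commands. A minimal sketch, assuming the archive /tmp/f1.har created later in this article:
# Peek at the master index and the per-file index of an existing archive
hadoop fs -cat /tmp/f1.har/_masterindex
hadoop fs -cat /tmp/f1.har/_index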
The shell command for creating an archive has the following syntax:
[root@node1 ~]# hadoop archive -help
usage: archive <-archiveName <NAME>.har> <-p <parent path>> [-r <replication factor>] <src>* <dest>
-archiveName <arg> Name of the Archive. This is mandatory option
-help Show the usage
-p <arg> Parent path of sources. This is mandatory option
-r <arg> Replication factor archive files
The -archiveName option specifies the name of the archive to create, e.g. foo.har, and that name should end with the *.har extension.
The inputs are file system path names, written the same way as usual, and the resulting archive is stored under the destination directory.
Note that creating an archive runs a MapReduce job, so the command should be executed on a cluster with MapReduce (YARN) available.
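As an illustration (the paths and replication factor here are made up), the following would archive the directories dir1 and dir2, both located under the parent path /user/hadoop, into zoo.har stored under /outputdir:
# -p gives the common parent of the sources; -r 2 stores the archive with replication factor 2
hadoop archive -archiveName zoo.har -p /user/hadoop -r 2 dir1 dir2 /outputdir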
An archive is exposed as a file system layer, so all FS shell commands work on it, just with a different URI.
Note, however, that archives are immutable: rename, delete, and create operations all return errors.
# The URI for a Hadoop Archive is:
har://scheme-hostname:port/archivepath/fileinarchive
# If scheme-hostname is omitted, the default file system is used, and the URI takes the form:
har:///archivepath/fileinarchive
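For example, the two commands below point at the same archive (a sketch: the host and port are placeholders, take them from your own fs.defaultFS; the archive path is the one created later in this article):
# Default file system form
hadoop fs -ls -R har:///tmp/f1.har
# Fully qualified form: the underlying scheme (hdfs) plus the NameNode address
hadoop fs -ls -R har://hdfs-node1:8020/tmp/f1.har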
Easy in theory, but practice is where things trip you up.
[root@node1 ~]# hadoop archive -archiveName f1.har -p /data/tmp/f1 /tmp
Permission denied: user=root, access=EXECUTE, inode="/tmp":hadoop:supergroup:drwx------
at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.check(FSPermissionChecker.java:399)
at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkTraverse(FSPermissionChecker.java:315)
at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(FSPermissionChecker.java:242)
at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(FSPermissionChecker.java:193)
at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkTraverse(FSPermissionChecker.java:606)
at org.apache.hadoop.hdfs.server.namenode.FSDirectory.checkTraverse(FSDirectory.java:1845)
at org.apache.hadoop.hdfs.server.namenode.FSDirectory.checkTraverse(FSDirectory.java:1863)
at org.apache.hadoop.hdfs.server.namenode.FSDirectory.resolvePath(FSDirectory.java:686)
at org.apache.hadoop.hdfs.server.namenode.FSDirStatAndListingOp.getFileInfo(FSDirStatAndListingOp.java:112)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getFileInfo(FSNamesystem.java:3208)
at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getFileInfo(NameNodeRpcServer.java:1186)
at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getFileInfo(ClientNamenodeProtocolServerSideTranslatorPB.java:982)
at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:528)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1086)
at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:1029)
at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:957)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1762)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2957)
This happens because I started the NameNode as the hadoop user, so HDFS treats hadoop as its superuser; root is not automatically the HDFS superuser, and the two must not be confused.
For more on how HDFS handles permissions, see my earlier post in this column: How does HDFS manage permissions?
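Here I simply switch to the hadoop user. An alternative sketch would be to loosen the permissions on /tmp itself while running as the HDFS superuser (hadoop in my setup):
# 1777 = world-writable with the sticky bit, the conventional mode for a shared /tmp
hadoop fs -chmod 1777 /tmp
# or hand the directory over to root outright:
# hadoop fs -chown root:supergroup /tmp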
[root@node1 ~]# su - hadoop
Last login: Fri May 28 21:52:33 CST 2021 on pts/0
[hadoop@node1 ~]$ hadoop archive -archiveName f1.har -p /data/tmp/f1 /tmp
2021-06-05 17:33:35,563 INFO client.RMProxy: Connecting to ResourceManager at node1/172.16.68.201:18040
2021-06-05 17:33:36,847 INFO client.RMProxy: Connecting to ResourceManager at node1/172.16.68.201:18040
2021-06-05 17:33:36,903 INFO client.RMProxy: Connecting to ResourceManager at node1/172.16.68.201:18040
2021-06-05 17:33:37,800 INFO mapreduce.JobResourceUploader: Disabling Erasure Coding for path: /tmp/hadoop-yarn/staging/hadoop/.staging/job_1614270487333_0008
2021-06-05 17:33:38,474 INFO mapreduce.JobSubmitter: number of splits:1
2021-06-05 17:33:38,948 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1614270487333_0008
2021-06-05 17:33:38,952 INFO mapreduce.JobSubmitter: Executing with tokens: []
2021-06-05 17:33:39,552 INFO conf.Configuration: resource-types.xml not found
2021-06-05 17:33:39,552 INFO resource.ResourceUtils: Unable to find 'resource-types.xml'.
2021-06-05 17:33:40,027 INFO impl.YarnClientImpl: Submitted application application_1614270487333_0008
2021-06-05 17:33:40,221 INFO mapreduce.Job: The url to track the job: http://node1:18088/proxy/application_1614270487333_0008/
2021-06-05 17:33:40,241 INFO mapreduce.Job: Running job: job_1614270487333_0008
2021-06-05 17:33:51,514 INFO mapreduce.Job: Job job_1614270487333_0008 running in uber mode : false
2021-06-05 17:33:51,517 INFO mapreduce.Job: map 0% reduce 0%
2021-06-05 17:33:58,652 INFO mapreduce.Job: map 100% reduce 0%
2021-06-05 17:34:06,727 INFO mapreduce.Job: map 100% reduce 100%
2021-06-05 17:34:07,756 INFO mapreduce.Job: Job job_1614270487333_0008 completed successfully
2021-06-05 17:34:07,909 INFO mapreduce.Job: Counters: 54
File System Counters
FILE: Number of bytes read=69
FILE: Number of bytes written=473469
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=285
HDFS: Number of bytes written=71
HDFS: Number of read operations=16
HDFS: Number of large read operations=0
HDFS: Number of write operations=12
HDFS: Number of bytes read erasure-coded=0
Job Counters
Launched map tasks=1
Launched reduce tasks=1
Other local map tasks=1
Total time spent by all maps in occupied slots (ms)=4932
Total time spent by all reduces in occupied slots (ms)=5542
Total time spent by all map tasks (ms)=4932
Total time spent by all reduce tasks (ms)=5542
Total vcore-milliseconds taken by all map tasks=4932
Total vcore-milliseconds taken by all reduce tasks=5542
Total megabyte-milliseconds taken by all map tasks=5050368
Total megabyte-milliseconds taken by all reduce tasks=5675008
Map-Reduce Framework
Map input records=1
Map output records=1
Map output bytes=61
Map output materialized bytes=69
Input split bytes=118
Combine input records=0
Combine output records=0
Reduce input groups=1
Reduce shuffle bytes=69
Reduce input records=1
Reduce output records=0
Spilled Records=2
Shuffled Maps =1
Failed Shuffles=0
Merged Map outputs=1
GC time elapsed (ms)=210
CPU time spent (ms)=2310
Physical memory (bytes) snapshot=533671936
Virtual memory (bytes) snapshot=5206069248
Total committed heap usage (bytes)=363331584
Peak Map Physical memory (bytes)=310661120
Peak Map Virtual memory (bytes)=2601361408
Peak Reduce Physical memory (bytes)=223010816
Peak Reduce Virtual memory (bytes)=2604707840
Shuffle Errors
BAD_ID=0
CONNECTION=0
IO_ERROR=0
WRONG_LENGTH=0
WRONG_MAP=0
WRONG_REDUCE=0
File Input Format Counters
Bytes Read=167
File Output Format Counters
Bytes Written=0
[hadoop@node1 ~]$
This confirms that creating an archive really does submit a MapReduce job.
[hadoop@node1 ~]$ hadoop fs -ls -R /tmp/f1.har
-rw-r--r-- 3 hadoop supergroup 0 2021-06-05 17:34 /tmp/f1.har/_SUCCESS
-rw-r--r-- 3 hadoop supergroup 57 2021-06-05 17:34 /tmp/f1.har/_index
-rw-r--r-- 3 hadoop supergroup 14 2021-06-05 17:34 /tmp/f1.har/_masterindex
-rw-r--r-- 3 hadoop supergroup 0 2021-06-05 17:33 /tmp/f1.har/part-0
[hadoop@node1 ~]$ hadoop fs -cat har://tmp/f1.har
cat: URI: har://tmp/f1.har is an invalid Har URI since '-' not found. Expecting har://<scheme>-<host>/<path>.
[hadoop@node1 ~]$ hadoop fs -cat har:///tmp/f1.har
-_-|| Another careless slip: one missing slash and it errors out.
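With the extra slash in place, the har:/// URI works like any other file system path. To get the original files back out of the archive, copy them to a regular directory, either sequentially with cp or in parallel with distcp (the destination /data/tmp/f1_restored below is made up):
# List the files stored inside the archive
hadoop fs -ls -R har:///tmp/f1.har
# Un-archive sequentially,
hadoop fs -cp har:///tmp/f1.har/* /data/tmp/f1_restored
# or in parallel with a DistCp job
hadoop distcp har:///tmp/f1.har /data/tmp/f1_restored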