One Article to Understand What HDFS Archives Really Are

Preface

This article belongs to my column "1000 Questions to Master the Big Data Technology Stack" (《1000个问题搞定大数据技术体系》). The column is the author's original work; please credit the source when quoting, and please point out any gaps or mistakes in the comments. Thanks!

For the column's table of contents and reference list, see 1000个问题搞定大数据技术体系.

Main Text

What are Hadoop Archives?

Hadoop Archives are a special archive format. One Hadoop Archive corresponds to one filesystem directory.

A Hadoop Archive has the extension *.har.

A Hadoop Archive contains metadata (in the form of _index and _masterindex files) and data files (part-*).

The index files record the names of the files in the archive and their locations within the part files.
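Since the index files are plain text, you can peek at them directly with cat; a minimal sketch, assuming an archive already exists at /tmp/foo.har:

# Inspect the archive's metadata (the archive path here is hypothetical)
hadoop fs -cat /tmp/foo.har/_index
hadoop fs -cat /tmp/foo.har/_masterindex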

How do you create an Archive?

The shell command for creating an archive has the following syntax:

[root@node1 ~]# hadoop archive -help
usage: archive <-archiveName <NAME>.har> <-p <parent path>> [-r <replication factor>] <src>* <dest>
               
 -archiveName <arg>   Name of the Archive. This is mandatory option
 -help                Show the usage
 -p <arg>             Parent path of sources. This is mandatory option
 -r <arg>             Replication factor archive files

The -archiveName option specifies the name of the archive to create, e.g. foo.har.

The archive name should carry the *.har extension.

The inputs are filesystem path names, written the same way as any other path.

The created archive is saved under the destination directory.

Note that creating an archive runs a MapReduce job, so you should run this command on a MapReduce cluster.
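For example, a sketch based on the usage above (the parent path /user/hadoop and the sources dir1 and dir2 are assumptions):

# Archive dir1 and dir2 (both under the parent /user/hadoop) into
# /user/archives/foo.har, with a replication factor of 3 for the archive files
hadoop archive -archiveName foo.har -p /user/hadoop -r 3 dir1 dir2 /user/archives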

How do you view the files in an Archive?

An archive is exposed to the outside world as a filesystem layer, so all the FS shell commands work on it, just with a different URI.

Also, an archive is immutable, so rename, delete, and create operations all return errors.

# The URI for a Hadoop Archive is:
har://scheme-hostname:port/archivepath/fileinarchive

# If no scheme-hostname is given, the default filesystem is used. In that case the URI format is:
har:///archivepath/fileinarchive
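Putting the two URI forms together, a hedged sketch (host name, port, and paths are assumptions):

# Fully qualified form, spelling out the underlying scheme and host
hadoop fs -ls har://hdfs-node1:8020/user/archives/foo.har
# Short form, resolved against the default filesystem
hadoop fs -ls har:///user/archives/foo.har
# Because archives are immutable, a delete through the har scheme fails
hadoop fs -rm har:///user/archives/foo.har/dir1/file1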

Practice

Theory is easy; putting it into practice is harder.

  1. Package the f1 file under /data/tmp into a har file and put it under the /tmp directory. It errors out right away...
[root@node1 ~]# hadoop archive -archiveName f1.har -p /data/tmp/f1 /tmp
Permission denied: user=root, access=EXECUTE, inode="/tmp":hadoop:supergroup:drwx------
        at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.check(FSPermissionChecker.java:399)
        at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkTraverse(FSPermissionChecker.java:315)
        at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(FSPermissionChecker.java:242)
        at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(FSPermissionChecker.java:193)
        at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkTraverse(FSPermissionChecker.java:606)
        at org.apache.hadoop.hdfs.server.namenode.FSDirectory.checkTraverse(FSDirectory.java:1845)
        at org.apache.hadoop.hdfs.server.namenode.FSDirectory.checkTraverse(FSDirectory.java:1863)
        at org.apache.hadoop.hdfs.server.namenode.FSDirectory.resolvePath(FSDirectory.java:686)
        at org.apache.hadoop.hdfs.server.namenode.FSDirStatAndListingOp.getFileInfo(FSDirStatAndListingOp.java:112)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getFileInfo(FSNamesystem.java:3208)
        at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getFileInfo(NameNodeRpcServer.java:1186)
        at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getFileInfo(ClientNamenodeProtocolServerSideTranslatorPB.java:982)
        at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
        at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:528)
        at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1086)
        at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:1029)
        at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:957)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:422)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1762)
        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2957)

This happens because I started the NameNode as the hadoop user, so HDFS treats hadoop as the superuser. The root user is not necessarily the HDFS superuser; don't confuse the two.

For more on HDFS permission management, see my blog post: How does HDFS manage permissions?
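A hedged alternative to switching users: have the hadoop superuser loosen /tmp first, then keep working as root (permissive, for illustration only):

# Run as the hadoop superuser; 777 lets any user traverse and write /tmp
hadoop fs -chmod 777 /tmp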

  2. Switch to the hadoop user and retry
[root@node1 ~]# su - hadoop
Last login: Fri May 28 21:52:33 CST 2021 on pts/0
[hadoop@node1 ~]$ hadoop archive -archiveName f1.har -p /data/tmp/f1 /tmp
2021-06-05 17:33:35,563 INFO client.RMProxy: Connecting to ResourceManager at node1/172.16.68.201:18040
2021-06-05 17:33:36,847 INFO client.RMProxy: Connecting to ResourceManager at node1/172.16.68.201:18040
2021-06-05 17:33:36,903 INFO client.RMProxy: Connecting to ResourceManager at node1/172.16.68.201:18040
2021-06-05 17:33:37,800 INFO mapreduce.JobResourceUploader: Disabling Erasure Coding for path: /tmp/hadoop-yarn/staging/hadoop/.staging/job_1614270487333_0008
2021-06-05 17:33:38,474 INFO mapreduce.JobSubmitter: number of splits:1
2021-06-05 17:33:38,948 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1614270487333_0008
2021-06-05 17:33:38,952 INFO mapreduce.JobSubmitter: Executing with tokens: []
2021-06-05 17:33:39,552 INFO conf.Configuration: resource-types.xml not found
2021-06-05 17:33:39,552 INFO resource.ResourceUtils: Unable to find 'resource-types.xml'.
2021-06-05 17:33:40,027 INFO impl.YarnClientImpl: Submitted application application_1614270487333_0008
2021-06-05 17:33:40,221 INFO mapreduce.Job: The url to track the job: http://node1:18088/proxy/application_1614270487333_0008/
2021-06-05 17:33:40,241 INFO mapreduce.Job: Running job: job_1614270487333_0008
2021-06-05 17:33:51,514 INFO mapreduce.Job: Job job_1614270487333_0008 running in uber mode : false
2021-06-05 17:33:51,517 INFO mapreduce.Job:  map 0% reduce 0%
2021-06-05 17:33:58,652 INFO mapreduce.Job:  map 100% reduce 0%
2021-06-05 17:34:06,727 INFO mapreduce.Job:  map 100% reduce 100%
2021-06-05 17:34:07,756 INFO mapreduce.Job: Job job_1614270487333_0008 completed successfully
2021-06-05 17:34:07,909 INFO mapreduce.Job: Counters: 54
        File System Counters
                FILE: Number of bytes read=69
                FILE: Number of bytes written=473469
                FILE: Number of read operations=0
                FILE: Number of large read operations=0
                FILE: Number of write operations=0
                HDFS: Number of bytes read=285
                HDFS: Number of bytes written=71
                HDFS: Number of read operations=16
                HDFS: Number of large read operations=0
                HDFS: Number of write operations=12
                HDFS: Number of bytes read erasure-coded=0
        Job Counters 
                Launched map tasks=1
                Launched reduce tasks=1
                Other local map tasks=1
                Total time spent by all maps in occupied slots (ms)=4932
                Total time spent by all reduces in occupied slots (ms)=5542
                Total time spent by all map tasks (ms)=4932
                Total time spent by all reduce tasks (ms)=5542
                Total vcore-milliseconds taken by all map tasks=4932
                Total vcore-milliseconds taken by all reduce tasks=5542
                Total megabyte-milliseconds taken by all map tasks=5050368
                Total megabyte-milliseconds taken by all reduce tasks=5675008
        Map-Reduce Framework
                Map input records=1
                Map output records=1
                Map output bytes=61
                Map output materialized bytes=69
                Input split bytes=118
                Combine input records=0
                Combine output records=0
                Reduce input groups=1
                Reduce shuffle bytes=69
                Reduce input records=1
                Reduce output records=0
                Spilled Records=2
                Shuffled Maps =1
                Failed Shuffles=0
                Merged Map outputs=1
                GC time elapsed (ms)=210
                CPU time spent (ms)=2310
                Physical memory (bytes) snapshot=533671936
                Virtual memory (bytes) snapshot=5206069248
                Total committed heap usage (bytes)=363331584
                Peak Map Physical memory (bytes)=310661120
                Peak Map Virtual memory (bytes)=2601361408
                Peak Reduce Physical memory (bytes)=223010816
                Peak Reduce Virtual memory (bytes)=2604707840
        Shuffle Errors
                BAD_ID=0
                CONNECTION=0
                IO_ERROR=0
                WRONG_LENGTH=0
                WRONG_MAP=0
                WRONG_REDUCE=0
        File Input Format Counters 
                Bytes Read=167
        File Output Format Counters 
                Bytes Written=0
[hadoop@node1 ~]$ 

This proves that creating an archive really does submit a MapReduce job.
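You can also confirm this from the YARN side; a small sketch using the application id from the log above:

# Query the application's status via the YARN CLI
yarn application -status application_1614270487333_0008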

  3. Take a look at the archive we just created
[hadoop@node1 ~]$ hadoop fs -ls -R /tmp/f1.har
-rw-r--r--   3 hadoop supergroup          0 2021-06-05 17:34 /tmp/f1.har/_SUCCESS
-rw-r--r--   3 hadoop supergroup         57 2021-06-05 17:34 /tmp/f1.har/_index
-rw-r--r--   3 hadoop supergroup         14 2021-06-05 17:34 /tmp/f1.har/_masterindex
-rw-r--r--   3 hadoop supergroup          0 2021-06-05 17:33 /tmp/f1.har/part-0
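The listing above is the archive's physical layout on HDFS. To browse its logical contents instead, go through the har scheme, e.g.:

# List what is inside the archive, as seen through the har filesystem
hadoop fs -ls -R har:///tmp/f1.har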
  4. View the f1 file inside the archive
[hadoop@node1 ~]$ hadoop fs -cat har://tmp/f1.har
cat: URI: har://tmp/f1.har is an invalid Har URI since '-' not found.  Expecting har://<scheme>-<host>/<path>.
[hadoop@node1 ~]$ hadoop fs -cat har:///tmp/f1.har

-_-|| Another careless slip and another error: the URI was one slash short.
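To read a single file inside the archive rather than the archive root, append its path to the har URI; a hedged sketch (the inner file name f1 is an assumption):

# cat one file from inside the archive
hadoop fs -cat har:///tmp/f1.har/f1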
