Hadoop Shell Commands and WordCount

Preface

In the previous two chapters we covered the three ways to install Hadoop (Standalone mode / Pseudo-Distributed mode / Cluster mode). In this chapter we walk through basic operations with the HDFS shell commands. The official documentation is linked in the references under Hadoop Shell Commands.


Main Content

Prerequisites

A Hadoop cluster has been installed and started. The NameNode web UI shows the HDFS directory tree.
[Figure 1: HDFS directory listing in the NameNode web UI]
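Before running any commands, it helps to confirm the daemons are actually up. A minimal check with jps on a pseudo-distributed setup (the PIDs below are illustrative):

localhost:current Sean$ jps
11201 NameNode
11285 DataNode
11393 SecondaryNameNode
11530 ResourceManager
11618 NodeManager
11704 Jps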

Basic Operations

The operations used most on any file system are create, delete, read, and update, plus permission management; these map directly onto the basic HDFS shell commands. Each is shown below:

  • List a directory: ls
# Relative to the default file system
localhost:current Sean$ hadoop fs -ls /
Picked up JAVA_TOOL_OPTIONS: -Dfile.encoding=UTF-8
19/03/30 16:15:42 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Found 3 items
-rw-r--r--   1 Sean supergroup          2 2019-03-25 11:55 /1.log
drwx------   - Sean supergroup          0 2019-03-25 12:11 /tmp
drwxr-xr-x   - Sean supergroup          0 2019-03-25 13:16 /user

# Using the full URI
localhost:current Sean$ hadoop fs -ls hdfs://localhost:9000/
Picked up JAVA_TOOL_OPTIONS: -Dfile.encoding=UTF-8
19/03/30 16:16:32 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Found 3 items
-rw-r--r--   1 Sean supergroup          2 2019-03-25 11:55 hdfs://localhost:9000/1.log
drwx------   - Sean supergroup          0 2019-03-25 12:11 hdfs://localhost:9000/tmp
drwxr-xr-x   - Sean supergroup          0 2019-03-25 13:16 hdfs://localhost:9000/user
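ls also accepts -R for a recursive listing and -h for human-readable sizes; a quick sketch:

# recursive, human-readable listing of the whole namespace
localhost:current Sean$ hadoop fs -ls -R -h /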
  • Upload a file: put
# Upload a file
localhost:current Sean$ hadoop fs -put hello2019.sh /
# Verify the uploaded file
localhost:current Sean$ hadoop fs -ls /
Picked up JAVA_TOOL_OPTIONS: -Dfile.encoding=UTF-8
Found 4 items
-rw-r--r--   1 Sean supergroup          2 2019-03-25 11:55 /1.log
-rw-r--r--   1 Sean supergroup         10 2019-03-30 16:19 /hello2019.sh
drwx------   - Sean supergroup          0 2019-03-25 12:11 /tmp
drwxr-xr-x   - Sean supergroup          0 2019-03-25 13:16 /user

# Manually concatenating the block files reconstructs the original file.
cat blk_1073741983 >> tmp.file
cat blk_1073741984 >> tmp.file


The default HDFS block size is 128 MB; a file larger than that is split into two or more blocks, which is why the two block files above can be concatenated back into the original.
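To see how a file was split and where its blocks live, hdfs fsck can be used; a minimal sketch against this article's example file (output abridged):

localhost:current Sean$ hdfs fsck /hello2019.sh -files -blocks -locations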

  • View file contents: cat
# View via the hadoop shell
localhost:current Sean$ hadoop fs -cat /hello2019.sh
hello2019

# View the raw block file directly on the local file system
localhost:current Sean$ cat finalized/subdir0/subdir0/blk_1073741983
hello2019

localhost:current Sean$ pwd
/Users/Sean/Software/hadoop/current/tmp/dfs/data/current/BP-586017156-127.0.0.1-1553485799471/current

  • Download a file: get
localhost:current Sean$ hadoop fs -get /hello2019.sh
localhost:current Sean$ ls
VERSION		dfsUsed		finalized	hello2019.sh	rbw
localhost:current Sean$ cat hello2019.sh
hello2019
  • Create a directory: mkdir
localhost:current Sean$ hadoop fs -mkdir -p /wordcount/input
localhost:current Sean$ hadoop fs -ls /wordcount
Found 1 items
drwxr-xr-x   - Sean supergroup          0 2019-03-30 16:40 /wordcount/input

Other Commands

Running hadoop fs with no arguments prints the full list of supported commands.

localhost:mapreduce Sean$ hadoop fs
Usage: hadoop fs [generic options]
	[-appendToFile <localsrc> ... <dst>]
	[-cat [-ignoreCrc] <src> ...]
	[-checksum <src> ...]
	[-chgrp [-R] GROUP PATH...]
	[-chmod [-R] <MODE[,MODE]... | OCTALMODE> PATH...]
	[-chown [-R] [OWNER][:[GROUP]] PATH...]
	[-copyFromLocal [-f] [-p] [-l] <localsrc> ... <dst>]
	[-copyToLocal [-p] [-ignoreCrc] [-crc] <src> ... <localdst>]
	[-count [-q] [-h] <path> ...]
	[-cp [-f] [-p | -p[topax]] <src> ... <dst>]
	[-createSnapshot <snapshotDir> [<snapshotName>]]
	[-deleteSnapshot <snapshotDir> <snapshotName>]
	[-df [-h] [<path> ...]]
	[-du [-s] [-h] <path> ...]
	[-expunge]
	[-find <path> ... <expression> ...]
	[-get [-p] [-ignoreCrc] [-crc] <src> ... <localdst>]
	[-getfacl [-R] <path>]
	[-getfattr [-R] {-n name | -d} [-e en] <path>]
	[-getmerge [-nl] <src> <localdst>]
	[-help [cmd ...]]
	[-ls [-d] [-h] [-R] [<path> ...]]
	[-mkdir [-p] <path> ...]
	[-moveFromLocal <localsrc> ... <dst>]
	[-moveToLocal <src> <localdst>]
	[-mv <src> ... <dst>]
	[-put [-f] [-p] [-l] <localsrc> ... <dst>]
	[-renameSnapshot <snapshotDir> <oldName> <newName>]
	[-rm [-f] [-r|-R] [-skipTrash] <src> ...]
	[-rmdir [--ignore-fail-on-non-empty] <dir> ...]
	[-setfacl [-R] [{-b|-k} {-m|-x <acl_spec>} <path>]|[--set <acl_spec> <path>]]
	[-setfattr {-n name [-v value] | -x name} <path>]
	[-setrep [-R] [-w] <rep> <path> ...]
	[-stat [format] <path> ...]
	[-tail [-f] <file>]
	[-test -[defsz] <path>]
	[-text [-ignoreCrc] <src> ...]
	[-touchz <path> ...]
	[-truncate [-w] <length> <path> ...]
	[-usage [cmd ...]]

Generic options supported are
-conf <configuration file>     specify an application configuration file
-D <property=value>            use value for given property
-fs <local|namenode:port>      specify a namenode
-jt <local|resourcemanager:port>    specify a ResourceManager
-files <comma separated list of files>    specify comma separated files to be copied to the map reduce cluster
-libjars <comma separated list of jars>    specify comma separated jar files to include in the classpath.
-archives <comma separated list of archives>    specify comma separated archives to be unarchived on the compute machines.

The general command line syntax is
bin/hadoop command [genericOptions] [commandOptions]
  • help
    Prints the command reference manual.
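For a single command, -help prints its detailed usage and -usage prints the one-line synopsis; for example:

# detailed usage for one or more commands
localhost:mapreduce Sean$ hadoop fs -help rm rmdir
# just the synopsis
localhost:mapreduce Sean$ hadoop fs -usage rm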

  • mkdir
    Create a directory. hadoop fs -mkdir -p /abc/acc

  • moveFromLocal / moveToLocal
    Move a local file to HDFS (the local source is deleted): hadoop fs -moveFromLocal abc.txt /
    Move an HDFS file to local (the HDFS source is deleted): hadoop fs -moveToLocal /abc.txt . (note that on the 2.7.x line this option is reported as not yet implemented)

  • appendToFile
    Append a local file to a file on HDFS. hadoop fs -appendToFile abc.txt /hello2019.txt

localhost:mapreduce Sean$ echo xxoo >> hello.txt
localhost:mapreduce Sean$ hadoop fs -appendToFile hello.txt /hello2019.sh
localhost:mapreduce Sean$ hadoop fs -cat /hello2019.sh
Picked up JAVA_TOOL_OPTIONS: -Dfile.encoding=UTF-8
hello2019
xxoo
  • cat
    Display a file's contents. hadoop fs -cat /hello2019.sh. For a large file, page the output with hadoop fs -cat /hello2019.sh | more, or show just the end with hadoop fs -tail /hello2019.sh

  • tail
    Display the end of a file. hadoop fs -tail /hello2019.sh

  • text
    Print a file's contents as text. hadoop fs -text /hello2019.sh

  • chgrp / chmod / chown
    chgrp changes a file's group; chmod changes permissions; chown changes owner and group.

hadoop fs -chmod 666 /hello2019.txt
hadoop fs -chown someuser:somegrp /hello2019.txt

localhost:mapreduce Sean$ hadoop fs -chmod 777 /hello2019.sh
localhost:mapreduce Sean$ hadoop fs -ls /
Picked up JAVA_TOOL_OPTIONS: -Dfile.encoding=UTF-8
19/03/30 17:09:52 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Found 5 items
-rw-r--r--   1 Sean supergroup          2 2019-03-25 11:55 /1.log
-rwxrwxrwx   1 Sean supergroup         15 2019-03-30 16:55 /hello2019.sh
drwx------   - Sean supergroup          0 2019-03-25 12:11 /tmp
drwxr-xr-x   - Sean supergroup          0 2019-03-25 13:16 /user
drwxr-xr-x   - Sean supergroup          0 2019-03-30 16:43 /wordcount

# HDFS has no built-in user database, so you can chown to a user/group that was never created on the system.
localhost:mapreduce Sean$ hadoop fs -chown hellokitty:hello  /hello2019.sh
localhost:mapreduce Sean$ hadoop fs -ls /
Picked up JAVA_TOOL_OPTIONS: -Dfile.encoding=UTF-8
19/03/30 17:10:40 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Found 5 items
-rw-r--r--   1 Sean       supergroup          2 2019-03-25 11:55 /1.log
-rwxrwxrwx   1 hellokitty hello              15 2019-03-30 16:55 /hello2019.sh
drwx------   - Sean       supergroup          0 2019-03-25 12:11 /tmp
drwxr-xr-x   - Sean       supergroup          0 2019-03-25 13:16 /user
drwxr-xr-x   - Sean       supergroup          0 2019-03-30 16:43 /wordcount
  • copyFromLocal / copyToLocal
    Copy from the local file system to HDFS; copy from HDFS to the local file system.
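A quick sketch of both directions (file names follow this article's examples):

localhost:mapreduce Sean$ hadoop fs -copyFromLocal hello.txt /wordcount/input/
localhost:mapreduce Sean$ hadoop fs -copyToLocal /hello2019.sh ./hello2019.local.sh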

  • cp
    Copy within HDFS. hadoop fs -cp /hello2019.sh /a/hello2019.sh

  • mv
    Move within HDFS. hadoop fs -mv /hello2019.sh /a/

  • get
    Download to the local file system; similar to copyToLocal. hadoop fs -get /hello.sh

  • getmerge
    Merge multiple HDFS files into one local file on download. hadoop fs -getmerge /wordcount/output/* hellomerge.sh

localhost:mapreduce Sean$ hadoop fs -getmerge /wordcount/output/* hellomerge.sh
localhost:mapreduce Sean$ cat hellomerge.sh
2019	1
able	1
cat	2
hello	1
kitty	1
pitty	2
  • put
    Upload to HDFS; similar to copyFromLocal. hadoop fs -put hello2019.sh /

  • rm
    Delete. hadoop fs -rm -r /hello2019.sh

# -r means recursive
localhost:mapreduce Sean$ hadoop fs -rm -r /1.log
Deleted /1.log
localhost:mapreduce Sean$ hadoop fs -ls /
Found 4 items
-rwxrwxrwx   1 hellokitty hello              15 2019-03-30 16:55 /hello2019.sh
drwx------   - Sean       supergroup          0 2019-03-25 12:11 /tmp
drwxr-xr-x   - Sean       supergroup          0 2019-03-25 13:16 /user
drwxr-xr-x   - Sean       supergroup          0 2019-03-30 16:43 /wordcount
  • rmdir
    Delete an empty directory. hadoop fs -rmdir /abbc
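rmdir refuses to remove a non-empty directory unless --ignore-fail-on-non-empty is passed (which suppresses the error but still keeps the directory); a sketch with this article's paths, error text approximate:

localhost:mapreduce Sean$ hadoop fs -rmdir /wordcount
rmdir: `/wordcount': Directory is not empty
localhost:mapreduce Sean$ hadoop fs -rmdir --ignore-fail-on-non-empty /wordcount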

  • df
    Report file-system capacity and usage. hadoop fs -df -h /

localhost:mapreduce Sean$ hadoop fs -df -h /
Picked up JAVA_TOOL_OPTIONS: -Dfile.encoding=UTF-8
Filesystem                Size   Used  Available  Use%
hdfs://localhost:9000  465.7 G  5.9 M    169.1 G    0%
  • du
    Report the size of a directory or file. hadoop fs -du -s -h /abc/d
# -s summarize the total; -h human-readable units
localhost:mapreduce Sean$ hadoop fs -du -s -h /wordcount/
86  /wordcount

localhost:mapreduce Sean$ hadoop fs -du -s -h hdfs://localhost:9000/*
15  hdfs://localhost:9000/hello2019.sh
4.7 M  hdfs://localhost:9000/tmp
266.0 K  hdfs://localhost:9000/user
86  hdfs://localhost:9000/wordcount
  • count
    Count the directories, files, and bytes under a path. hadoop fs -count /aaa/
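The output columns are DIR_COUNT, FILE_COUNT, CONTENT_SIZE, and PATHNAME; a sketch against this article's directory (the numbers are illustrative):

localhost:mapreduce Sean$ hadoop fs -count /wordcount/
           3            2                 86 /wordcount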

  • setrep
    Set the replication factor for a file.

localhost:mapreduce Sean$ hadoop fs -setrep 3 /wordcount/input/hello2019.sh
Replication 3 set: /wordcount/input/hello2019.sh

If there are only 3 DataNodes but you set the replication factor to 10, 10 replicas will not actually be created. The value is the replication factor recorded in the NameNode's metadata, not necessarily the real number of replicas, which depends on how many DataNodes exist.
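To read back the replication factor recorded in the metadata, -stat with the %r format specifier can be used; a sketch (the value shown is illustrative):

localhost:mapreduce Sean$ hadoop fs -stat %r /wordcount/input/hello2019.sh
3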


WordCount
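The job reads its input from /wordcount/input, so that directory needs at least one text file before the run; a minimal setup step (consistent with the setrep example above, which already references /wordcount/input/hello2019.sh):

localhost:mapreduce Sean$ hadoop fs -put hello2019.sh /wordcount/input/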

localhost:mapreduce Sean$ hadoop jar hadoop-mapreduce-examples-2.7.5.jar wordcount /wordcount/input/ /wordcount/output
Picked up JAVA_TOOL_OPTIONS: -Dfile.encoding=UTF-8
19/03/30 16:43:30 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
19/03/30 16:43:31 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
19/03/30 16:43:32 INFO input.FileInputFormat: Total input paths to process : 1
19/03/30 16:43:32 INFO mapreduce.JobSubmitter: number of splits:1
19/03/30 16:43:32 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1553933297569_0001
19/03/30 16:43:33 INFO impl.YarnClientImpl: Submitted application application_1553933297569_0001
19/03/30 16:43:33 INFO mapreduce.Job: The url to track the job: http://localhost:8088/proxy/application_1553933297569_0001/
19/03/30 16:43:33 INFO mapreduce.Job: Running job: job_1553933297569_0001
19/03/30 16:43:43 INFO mapreduce.Job: Job job_1553933297569_0001 running in uber mode : false
19/03/30 16:43:43 INFO mapreduce.Job:  map 0% reduce 0%
19/03/30 16:43:48 INFO mapreduce.Job:  map 100% reduce 0%
19/03/30 16:43:54 INFO mapreduce.Job:  map 100% reduce 100%
19/03/30 16:43:54 INFO mapreduce.Job: Job job_1553933297569_0001 completed successfully
19/03/30 16:43:54 INFO mapreduce.Job: Counters: 49
	File System Counters
		FILE: Number of bytes read=74
		FILE: Number of bytes written=243693
		FILE: Number of read operations=0
		FILE: Number of large read operations=0
		FILE: Number of write operations=0
		HDFS: Number of bytes read=157
		HDFS: Number of bytes written=44
		HDFS: Number of read operations=6
		HDFS: Number of large read operations=0
		HDFS: Number of write operations=2
	Job Counters
		Launched map tasks=1
		Launched reduce tasks=1
		Data-local map tasks=1
		Total time spent by all maps in occupied slots (ms)=3271
		Total time spent by all reduces in occupied slots (ms)=3441
		Total time spent by all map tasks (ms)=3271
		Total time spent by all reduce tasks (ms)=3441
		Total vcore-milliseconds taken by all map tasks=3271
		Total vcore-milliseconds taken by all reduce tasks=3441
		Total megabyte-milliseconds taken by all map tasks=3349504
		Total megabyte-milliseconds taken by all reduce tasks=3523584
	Map-Reduce Framework
		Map input records=7
		Map output records=8
		Map output bytes=74
		Map output materialized bytes=74
		Input split bytes=115
		Combine input records=8
		Combine output records=6
		Reduce input groups=6
		Reduce shuffle bytes=74
		Reduce input records=6
		Reduce output records=6
		Spilled Records=12
		Shuffled Maps =1
		Failed Shuffles=0
		Merged Map outputs=1
		GC time elapsed (ms)=123
		CPU time spent (ms)=0
		Physical memory (bytes) snapshot=0
		Virtual memory (bytes) snapshot=0
		Total committed heap usage (bytes)=306184192
	Shuffle Errors
		BAD_ID=0
		CONNECTION=0
		IO_ERROR=0
		WRONG_LENGTH=0
		WRONG_MAP=0
		WRONG_REDUCE=0
	File Input Format Counters
		Bytes Read=42
	File Output Format Counters
		Bytes Written=44

The output directory must not already exist, or the job will refuse to run. After this test run, the output directory contains two files: _SUCCESS and part-r-00000. The first is an empty marker indicating the job completed successfully; the second holds the actual results.
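Listing the output directory shows both files; a sketch (the 44 bytes match the job's Bytes Written counter above, timestamps illustrative):

localhost:mapreduce Sean$ hadoop fs -ls /wordcount/output
Found 2 items
-rw-r--r--   1 Sean supergroup          0 2019-03-30 16:43 /wordcount/output/_SUCCESS
-rw-r--r--   1 Sean supergroup         44 2019-03-30 16:43 /wordcount/output/part-r-00000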

localhost:mapreduce Sean$ hadoop fs -cat /wordcount/output/part-r-00000
2019	1
able	1
cat	2
hello	1
kitty	1
pitty	2


localhost:mapreduce Sean$ ls
hadoop-mapreduce-client-app-2.7.5.jar			hadoop-mapreduce-client-hs-plugins-2.7.5.jar		hadoop-mapreduce-examples-2.7.5.jar
hadoop-mapreduce-client-common-2.7.5.jar		hadoop-mapreduce-client-jobclient-2.7.5-tests.jar	lib
hadoop-mapreduce-client-core-2.7.5.jar			hadoop-mapreduce-client-jobclient-2.7.5.jar		lib-examples
hadoop-mapreduce-client-hs-2.7.5.jar			hadoop-mapreduce-client-shuffle-2.7.5.jar		sources

This directory ships with many example jobs that are worth exploring on your own.
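Running the examples jar without arguments prints the list of available programs (output abridged to three entries):

localhost:mapreduce Sean$ hadoop jar hadoop-mapreduce-examples-2.7.5.jar
An example program must be given as the first argument.
Valid program names are:
  grep: A map/reduce program that counts the matches of a regex in the input.
  pi: A map/reduce program that estimates Pi using a quasi-Monte Carlo method.
  wordcount: A map/reduce program that counts the words in the input files.
  ...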


Basic Principles (Overview)

Files stored in HDFS are ultimately stored on local disks; HDFS is just a distributed file system layered on top of them. Let's look at where our earlier files ended up.

Note: since my DataNode and NameNode are installed on the same machine, the directory tree contains data for both:

localhost:tmp Sean$ tree
.
├── dfs
│   ├── data
│   │   ├── current
│   │   │   ├── BP-586017156-127.0.0.1-1553485799471
│   │   │   │   ├── current
│   │   │   │   │   ├── VERSION
│   │   │   │   │   ├── dfsUsed
│   │   │   │   │   ├── finalized
│   │   │   │   │   │   └── subdir0
│   │   │   │   │   │       └── subdir0
│   │   │   │   │   │           ├── blk_1073741825
│   │   │   │   │   │           ├── blk_1073741825_1001.meta
│   │   │   │   │   └── rbw
│   │   │   │   ├── scanner.cursor
│   │   │   │   └── tmp
│   │   │   └── VERSION
│   │   └── in_use.lock
│   ├── name
│   │   ├── current
│   │   │   ├── VERSION
│   │   │   ├── edits_0000000000000000001-0000000000000000118
│   │   │   ├── edits_inprogress_0000000000000001233
│   │   │   ├── fsimage_0000000000000001230
│   │   │   ├── fsimage_0000000000000001230.md5
│   │   │   ├── fsimage_0000000000000001232
│   │   │   ├── fsimage_0000000000000001232.md5
│   │   │   └── seen_txid
│   │   └── in_use.lock
│   └── namesecondary
│       ├── current
│       │   ├── VERSION
│       │   ├── edits_0000000000000000001-0000000000000000118
│       │   ├── edits_0000000000000000119-0000000000000000943
│       │   ├── edits_0000000000000001231-0000000000000001232
│       │   ├── fsimage_0000000000000001230
│       │   ├── fsimage_0000000000000001230.md5
│       │   ├── fsimage_0000000000000001232
│       │   └── fsimage_0000000000000001232.md5
│       └── in_use.lock
└── nm-local-dir
    ├── filecache
    ├── nmPrivate
    └── usercache
  • NameNode: manages the file system metadata
  • DataNode: stores and serves the file blocks (data transfer over Socket / Netty)
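To connect the two views, you can locate the raw block files that back HDFS on the DataNode's local disk; a sketch run from the tmp directory above (the block ID matches the tree listing; your paths will differ):

localhost:tmp Sean$ find dfs/data -name 'blk_*' ! -name '*.meta'
dfs/data/current/BP-586017156-127.0.0.1-1553485799471/current/finalized/subdir0/subdir0/blk_1073741825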

HDFS Commands

  • hdfs dfsadmin -report: view cluster status
localhost:Desktop Sean$ hdfs dfsadmin -report
Picked up JAVA_TOOL_OPTIONS: -Dfile.encoding=UTF-8
19/04/03 15:51:20 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Configured Capacity: 500068036608 (465.72 GB)
Present Capacity: 182055092460 (169.55 GB)
DFS Remaining: 182048903168 (169.55 GB)
DFS Used: 6189292 (5.90 MB)
DFS Used%: 0.00%
Under replicated blocks: 27
Blocks with corrupt replicas: 0
Missing blocks: 0
Missing blocks (with replication factor 1): 0

-------------------------------------------------
Live datanodes (1):

Name: 127.0.0.1:50010 (localhost)
Hostname: localhost
Decommission Status : Normal
Configured Capacity: 500068036608 (465.72 GB)
DFS Used: 6189292 (5.90 MB)
Non DFS Used: 313001643796 (291.51 GB)
DFS Remaining: 182048903168 (169.55 GB)
DFS Used%: 0.00%
DFS Remaining%: 36.40%
Configured Cache Capacity: 0 (0 B)
Cache Used: 0 (0 B)
Cache Remaining: 0 (0 B)
Cache Used%: 100.00%
Cache Remaining%: 0.00%
Xceivers: 1
Last contact: Wed Apr 03 15:51:21 CST 2019
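Two related health checks pair well with -report; both commands are standard, output omitted here:

# check whether the NameNode is in safe mode
localhost:Desktop Sean$ hdfs dfsadmin -safemode get
# file-system health report, including under-replicated blocks like the 27 above
localhost:Desktop Sean$ hdfs fsck /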

Reference

[1]. Hadoop Shell Commands (official documentation)
[2]. An introduction to the hadoop and hdfs commands in Hadoop
[3]. Hadoop Study Notes 4: Common HDFS Commands
