As data volumes grow, a single operating system can no longer hold all of the data, so the data has to be spread across the disks of many machines. Managing and maintaining files scattered over multiple hosts is inconvenient, so a system is needed to manage files across machines: a distributed file system. HDFS is just one kind of distributed file system.
HDFS (Hadoop Distributed File System) is a file system for storing files, locating them through a directory tree. It is also distributed: many servers work together to provide the service, and each server in the cluster plays its own role.
Note: all of the following operations are run from the Hadoop installation's root directory.
# Start HDFS
sbin/start-dfs.sh
# Start YARN
sbin/start-yarn.sh
# Show the help for a command (here: rm)
hadoop fs -help rm
# List the root directory
hadoop fs -ls /
# Create a directory (recursively) on HDFS
hadoop fs -mkdir -p /IronmanJay/people
# Create a test file
touch zhangsan.txt
# Move (cut and paste) the local file to HDFS
hadoop fs -moveFromLocal ./zhangsan.txt /IronmanJay/people
# Create a test file and add some content
touch lisi.txt
echo "wo shi da hao ren" >> lisi.txt
# Append a local file to the end of an existing HDFS file
hadoop fs -appendToFile lisi.txt /IronmanJay/people/zhangsan.txt
# Display the file contents
hadoop fs -cat /IronmanJay/people/zhangsan.txt
# Change permissions
hadoop fs -chmod 666 /IronmanJay/people/zhangsan.txt
# Change the owning user and group
hadoop fs -chown IronmanJay:IronmanJay /IronmanJay/people/zhangsan.txt
# Copy a local file to HDFS
hadoop fs -copyFromLocal README.txt /
# Copy a file from HDFS to the local file system
hadoop fs -copyToLocal /IronmanJay/people/zhangsan.txt ./
# Copy a file within HDFS
hadoop fs -cp /IronmanJay/people/zhangsan.txt /newzhangsan.txt
# Move a file within HDFS
hadoop fs -mv /newzhangsan.txt /IronmanJay/IronmanJay/
# Download a file from HDFS (equivalent to copyToLocal)
hadoop fs -get /IronmanJay/people/zhangsan.txt ./
# Merge multiple HDFS files into one local file while downloading
hadoop fs -getmerge /user/IronmanJay/test/* ./merge.txt
# Upload a local file to HDFS (equivalent to copyFromLocal)
hadoop fs -put ./merge.txt /user/IronmanJay/test/
# Show the end of a file
hadoop fs -tail /IronmanJay/people/zhangsan.txt
# Delete a file
hadoop fs -rm /user/IronmanJay/test/wangwu.txt
# Create an empty directory
hadoop fs -mkdir /test
# Remove an empty directory
hadoop fs -rmdir /test
# Show the total size of a directory
hadoop fs -du -s -h /user/IronmanJay/test
# Set the replication factor of a file (only takes effect if the cluster has at least that many DataNodes)
hadoop fs -setrep 10 /IronmanJay/people/zhangsan.txt
Import the corresponding dependency coordinates into pom.xml, then add the log4j configuration that follows them.
<dependencies>
    <dependency>
        <groupId>junit</groupId>
        <artifactId>junit</artifactId>
        <version>RELEASE</version>
    </dependency>
    <dependency>
        <groupId>org.apache.logging.log4j</groupId>
        <artifactId>log4j-core</artifactId>
        <version>2.8.2</version>
    </dependency>
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-common</artifactId>
        <version>2.7.2</version>
    </dependency>
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-client</artifactId>
        <version>2.7.2</version>
    </dependency>
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-hdfs</artifactId>
        <version>2.7.2</version>
    </dependency>
    <dependency>
        <groupId>jdk.tools</groupId>
        <artifactId>jdk.tools</artifactId>
        <version>1.8</version>
        <scope>system</scope>
        <systemPath>D:/Software/Java/jdk1.8.0_131/lib/tools.jar</systemPath>
    </dependency>
</dependencies>
log4j.rootLogger=INFO, stdout
log4j.appender.stdout=org.apache.log4j.ConsoleAppender
log4j.appender.stdout.layout=org.apache.log4j.PatternLayout
log4j.appender.stdout.layout.ConversionPattern=%d %p [%c] - %m%n
log4j.appender.logfile=org.apache.log4j.FileAppender
log4j.appender.logfile.File=target/spring.log
log4j.appender.logfile.layout=org.apache.log4j.PatternLayout
log4j.appender.logfile.layout.ConversionPattern=%d %p [%c] - %m%n
// Test the connection
public static void main(String[] args) throws IOException, URISyntaxException, InterruptedException {
    Configuration conf = new Configuration();
    // Configure the client to run against the cluster
    conf.set("fs.defaultFS", "hdfs://hadoop102:9000");
    // 1. Get the HDFS client
    FileSystem fs = FileSystem.get(new URI("hdfs://hadoop102:9000"), conf, "root");
    // 2. Create a path on HDFS
    fs.mkdirs(new Path("/IronmanJay/BaiRui/HaiZi/WeiLai"));
    // 3. Close the resource
    fs.close();
    System.out.println("over");
}
// File upload
@Test
public void testCopyFromLocalFile() throws URISyntaxException, IOException, InterruptedException {
    // 1. Get the fs object
    Configuration conf = new Configuration();
    conf.set("dfs.replication", "2");
    FileSystem fs = FileSystem.get(new URI("hdfs://hadoop102:9000"), conf, "root");
    // 2. Call the upload API
    fs.copyFromLocalFile(new Path("D:/test.txt"), new Path("/test2.txt"));
    // 3. Close the resource
    fs.close();
}
// File download
@Test
public void testCopyToLocalFile() throws URISyntaxException, IOException, InterruptedException {
    // 1. Get the fs object
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(new URI("hdfs://hadoop102:9000"), conf, "root");
    // 2. Download the file (delSrc = false, useRawLocalFileSystem = true)
    fs.copyToLocalFile(false, new Path("/test.txt"), new Path("d:/text3.txt"), true);
    // 3. Close the resource
    fs.close();
}
// Delete a directory
@Test
public void testDelete() throws URISyntaxException, IOException, InterruptedException {
    // 1. Get the fs object
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(new URI("hdfs://hadoop102:9000"), conf, "root");
    // 2. Delete recursively
    fs.delete(new Path("/IronmanJay"), true);
    // 3. Close the resource
    fs.close();
}
// Rename a file
@Test
public void testRename() throws URISyntaxException, IOException, InterruptedException {
    // 1. Get the fs object
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(new URI("hdfs://hadoop102:9000"), conf, "root");
    // 2. Rename the file
    fs.rename(new Path("/test.txt"), new Path("/test3.txt"));
    // 3. Close the resource
    fs.close();
}
// View file details (name, permissions, length, block information)
@Test
public void testListFiles() throws URISyntaxException, IOException, InterruptedException {
    // 1. Get the fs object
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(new URI("hdfs://hadoop102:9000"), conf, "root");
    // 2. List the file details
    RemoteIterator<LocatedFileStatus> listFiles = fs.listFiles(new Path("/"), true);
    while (listFiles.hasNext()) {
        LocatedFileStatus fileStatus = listFiles.next();
        // File name
        System.out.println(fileStatus.getPath().getName());
        // File permissions
        System.out.println(fileStatus.getPermission());
        // File length
        System.out.println(fileStatus.getLen());
        // Block information
        BlockLocation[] blockLocations = fileStatus.getBlockLocations();
        for (BlockLocation blockLocation : blockLocations) {
            String[] hosts = blockLocation.getHosts();
            for (String host : hosts) {
                System.out.println(host);
            }
        }
        System.out.println("----------separator----------");
    }
    // 3. Close the resource
    fs.close();
}
// Upload a file with I/O streams
@Test
public void putFileToHDFS() throws URISyntaxException, IOException, InterruptedException {
    // 1. Get the fs object
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(new URI("hdfs://hadoop102:9000"), conf, "root");
    // 2. Open an input stream on the local file
    FileInputStream fis = new FileInputStream(new File("D:/banhua.txt"));
    // 3. Create an output stream on HDFS
    FSDataOutputStream fos = fs.create(new Path("/banzhang.txt"));
    // 4. Copy the stream
    IOUtils.copyBytes(fis, fos, conf);
    // 5. Close the resources
    IOUtils.closeStream(fos);
    IOUtils.closeStream(fis);
    fs.close();
}
// Download a file with I/O streams
@Test
public void getFileFromHDFS() throws URISyntaxException, IOException, InterruptedException {
    // 1. Get the fs object
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(new URI("hdfs://hadoop102:9000"), conf, "root");
    // 2. Open an input stream on HDFS
    FSDataInputStream fis = fs.open(new Path("/banzhang.txt"));
    // 3. Create an output stream on the local file system
    FileOutputStream fos = new FileOutputStream(new File("D:/banzhang.txt"));
    // 4. Copy the stream
    IOUtils.copyBytes(fis, fos, conf);
    // 5. Close the resources
    IOUtils.closeStream(fos);
    IOUtils.closeStream(fis);
    fs.close();
}
For the common case, when the replication factor is three, HDFS’s placement policy is to put one replica on one node in the local rack, another on a different node in the local rack, and the last on a different node in a different rack.
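Replica placement for an existing file can be verified from the command line with fsck, which lists each block together with the DataNodes and racks holding its replicas. A minimal sketch, assuming the zhangsan.txt file uploaded earlier still exists at that path:
# List the blocks of a file and where each replica is stored
hdfs fsck /IronmanJay/people/zhangsan.txt -files -blocks -locations -racks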
After the NameNode has been formatted, the following files appear in the /opt/module/hadoop-2.7.2/data/tmp/dfs/name/current directory (see the sketch after this list for how to inspect them):
fsimage_0000000000000000000
fsimage_0000000000000000000.md5
seen_txid
VERSION
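The fsimage and edits files under the current directory are binary, but Hadoop ships offline viewers that convert them into a readable form. A minimal sketch using hypothetical file names (substitute the actual fsimage/edits names found in the current directory):
# Dump an fsimage file to XML with the Offline Image Viewer (oiv)
hdfs oiv -p XML -i fsimage_0000000000000000000 -o /opt/fsimage.xml
# Dump an edits file to XML with the Offline Edits Viewer (oev)
hdfs oev -p XML -i edits_0000000000000000001-0000000000000000002 -o /opt/edits.xml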
A SecondaryNameNode checkpoint is triggered either every dfs.namenode.checkpoint.period seconds or once dfs.namenode.checkpoint.txns operations have accumulated, whichever comes first. The following properties control this behaviour (hdfs-default.xml defaults shown; they can be overridden in hdfs-site.xml):
<property>
    <name>dfs.namenode.checkpoint.period</name>
    <value>3600</value>
</property>
<property>
    <name>dfs.namenode.checkpoint.txns</name>
    <value>1000000</value>
    <description>Number of operations between checkpoints</description>
</property>
<property>
    <name>dfs.namenode.checkpoint.check.period</name>
    <value>60</value>
    <description>Check the operation count once a minute</description>
</property>
After a NameNode failure, the metadata can be recovered in either of the following two ways.
Method 1: copy the data from the SecondaryNameNode into the directory where the NameNode stores its data.
# Kill the NameNode process
kill -9 <NameNode-PID>
# Delete the NameNode's stored data
rm -rf /opt/module/hadoop-2.7.2/data/tmp/dfs/name/*
# Copy the SecondaryNameNode's data into the NameNode data directory
scp -r IronmanJay@hadoop104:/opt/module/hadoop-2.7.2/data/tmp/dfs/namesecondary/* ./name/
# Restart the NameNode
sbin/hadoop-daemon.sh start namenode
Method 2: start the NameNode daemon with the -importCheckpoint option, which copies the SecondaryNameNode data into the NameNode directory.
Add the following to hdfs-site.xml:
<property>
    <name>dfs.namenode.checkpoint.period</name>
    <value>120</value>
</property>
<property>
    <name>dfs.namenode.name.dir</name>
    <value>/opt/module/hadoop-2.7.2/data/tmp/dfs/name</value>
</property>
# Kill the NameNode process
kill -9 <NameNode-PID>
# Delete the NameNode's stored data
rm -rf /opt/module/hadoop-2.7.2/data/tmp/dfs/name/*
[IronmanJay@hadoop102 dfs]$ scp -r IronmanJay@hadoop104:/opt/module/hadoop-2.7.2/data/tmp/dfs/namesecondary ./
[IronmanJay@hadoop102 namesecondary]$ rm -rf in_use.lock
[IronmanJay@hadoop102 dfs]$ pwd
/opt/module/hadoop-2.7.2/data/tmp/dfs
[IronmanJay@hadoop102 dfs]$ ls
data name namesecondary
# Import the checkpoint from namesecondary into the NameNode's name directory
bin/hdfs namenode -importCheckpoint
# Start the NameNode
sbin/hadoop-daemon.sh start namenode
While the cluster is in safe mode, it cannot perform important (write) operations. Once cluster startup completes, it exits safe mode automatically.
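Safe mode can also be queried and controlled manually with dfsadmin. A minimal sketch of the relevant subcommands (run from the Hadoop root directory, consistent with the note at the top of this section):
# Query the current safe mode state
bin/hdfs dfsadmin -safemode get
# Enter safe mode manually
bin/hdfs dfsadmin -safemode enter
# Leave safe mode manually
bin/hdfs dfsadmin -safemode leave
# Block until safe mode is off (useful in scripts)
bin/hdfs dfsadmin -safemode wait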
①: To store the NameNode metadata in multiple local directories (each directory holds an identical copy), add the following to hdfs-site.xml
<property>
    <name>dfs.namenode.name.dir</name>
    <value>file:///${hadoop.tmp.dir}/dfs/name1,file:///${hadoop.tmp.dir}/dfs/name2</value>
</property>
②: Stop the cluster and delete all of the data in the data and logs directories
rm -rf data/ logs/
③: Format the cluster and start it
# Format the cluster
bin/hdfs namenode -format
# Start the cluster
sbin/start-dfs.sh
④: Check the result
[IronmanJay@hadoop102 dfs]$ ll
total 12
drwx------. 3 IronmanJay IronmanJay 4096 Feb 17 04:01 data
drwxrwxr-x. 3 IronmanJay IronmanJay 4096 Feb 17 04:01 name1
drwxrwxr-x. 3 IronmanJay IronmanJay 4096 Feb 17 04:01 name2
To commission a new data node, proceed as follows.
①: On the new node, refresh the environment variables and start the DataNode and NodeManager daemons directly
source /etc/profile
sbin/hadoop-daemon.sh start datanode
sbin/yarn-daemon.sh start nodemanager
②: Check on the web UI whether the new node has joined successfully
③: If the data is unbalanced, rebalance the cluster with the following command
./start-balancer.sh
Hosts added to the whitelist are allowed to connect to the NameNode; hosts not on the whitelist are evicted from the cluster. The steps are as follows.
# Create a dfs.hosts file in the etc/hadoop configuration directory and list the allowed hosts
vi dfs.hosts
hadoop102
hadoop103
hadoop104
Then add the dfs.hosts property to hdfs-site.xml:
<property>
    <name>dfs.hosts</name>
    <value>/opt/module/hadoop-2.7.2/etc/hadoop/dfs.hosts</value>
</property>
# Distribute the updated configuration file to all nodes
xsync hdfs-site.xml
# Refresh the NameNode
hdfs dfsadmin -refreshNodes
# Refresh the ResourceManager
yarn rmadmin -refreshNodes
# Rebalance the cluster if the data is unbalanced
./start-balancer.sh
Hosts on the blacklist are forcibly decommissioned. The steps are as follows.
# Create a dfs.hosts.exclude file in the etc/hadoop configuration directory and list the hosts to decommission
vi dfs.hosts.exclude
hadoop105
Then add the dfs.hosts.exclude property to hdfs-site.xml:
<property>
    <name>dfs.hosts.exclude</name>
    <value>/opt/module/hadoop-2.7.2/etc/hadoop/dfs.hosts.exclude</value>
</property>
# Refresh the NameNode
hdfs dfsadmin -refreshNodes
# Rebalance the cluster if the data is unbalanced
./start-balancer.sh