Hadoop 核心-HDFS

1:HDFS 的 API 操作

1.1 配置Windows下Hadoop环境

在windows系统需要配置hadoop运行环境，否则直接运行代码会出现以下问题:

缺少winutils.exe

Could not locate executable null \bin\winutils.exe in the hadoop binaries

缺少hadoop.dll

Unable to load native-hadoop library for your platform… using builtin-Java classes where applicable

步骤:

第一步：将hadoop2.7.5文件夹拷贝到一个没有中文没有空格的路径下面

第二步：在windows上面配置hadoop的环境变量： HADOOP_HOME，并将%HADOOP_HOME%\bin添加到path中

第三步：把hadoop2.7.5文件夹中bin目录下的hadoop.dll文件放到系统盘: C:\Windows\System32 目录

第四步：关闭windows重启

1.2 导入 Maven 依赖


     
        
            org.apache.hadoop
            hadoop-common
            2.7.5
        
        
            org.apache.hadoop
            hadoop-client
            2.7.5
        
        
            org.apache.hadoop
            hadoop-hdfs
            2.7.5
        
        
            org.apache.hadoop
            hadoop-mapreduce-client-core
            2.7.5
        
        
            junit
            junit
            RELEASE
        
    
    
        
            
                org.apache.maven.plugins
                maven-compiler-plugin
                3.1
                
                    1.8
                    1.8
                    UTF-8
                    
                
            
            
                org.apache.maven.plugins
                maven-shade-plugin
                2.4.3
                
                    
                        package
                        
                            shade
                        
                        
                            true

1.3 使用url方式访问数据（了解）

@Test
public void demo1()throws  Exception{
    //第一步：注册hdfs 的url
    URL.setURLStreamHandlerFactory(new FsUrlStreamHandlerFactory());

    //获取文件输入流
    InputStream inputStream  = new URL("hdfs://node01:8020/a.txt").openStream();
    //获取文件输出流
    FileOutputStream outputStream = new FileOutputStream(new File("D:\\hello.txt"));

    //实现文件的拷贝
    IOUtils.copy(inputStream, outputStream);

    //关闭流
    IOUtils.closeQuietly(inputStream);
    IOUtils.closeQuietly(outputStream);
}

1.4 使用文件系统方式访问数据（掌握）

1.4.1 涉及的主要类

在 Java 中操作 HDFS, 主要涉及以下 Class:

Configuration
- 该类的对象封转了客户端或者服务器的配置
FileSystem
- 该类的对象是一个文件系统对象, 可以用该对象的一些方法来对文件进行操作, 通过 FileSystem 的静态方法 get 获得该对象
```
FileSystem fs = FileSystem.get(conf)
```
  - get 方法从 conf 中的一个参数 fs.defaultFS 的配置值判断具体是什么类型的文件系统
  - 如果我们的代码中没有指定 fs.defaultFS, 并且工程 ClassPath 下也没有给定相应的配置, conf 中的默认值就来自于 Hadoop 的 Jar 包中的 core-default.xml
  - 默认值为 file:///, 则获取的不是一个 DistributedFileSystem 的实例, 而是一个本地文件系统的客户端对象

1.4.2 获取 FileSystem 的几种方式

第一种方式

@Test
public void getFileSystem1() throws IOException {
    Configuration configuration = new Configuration();
    //指定我们使用的文件系统类型:
    configuration.set("fs.defaultFS", "hdfs://node01:8020/");

    //获取指定的文件系统
    FileSystem fileSystem = FileSystem.get(configuration);
    System.out.println(fileSystem.toString());

}

第二种方式

@Test
public void getFileSystem2() throws  Exception{
    FileSystem fileSystem = FileSystem.get(new URI("hdfs://node01:8020"), new       Configuration());
    System.out.println("fileSystem:"+fileSystem);
}

第三种方式

@Test
public void getFileSystem3() throws  Exception{
    Configuration configuration = new Configuration();
    configuration.set("fs.defaultFS", "hdfs://node01:8020");
    FileSystem fileSystem = FileSystem.newInstance(configuration);
    System.out.println(fileSystem.toString());
}

第四种方式

//@Test
public void getFileSystem4() throws  Exception{
    FileSystem fileSystem = FileSystem.newInstance(new URI("hdfs://node01:8020") ,new Configuration());
    System.out.println(fileSystem.toString());
}

1.4.3 遍历 HDFS 中所有文件

使用 API 遍历

@Test
public void listMyFiles()throws Exception{
    //获取fileSystem类
    FileSystem fileSystem = FileSystem.get(new URI("hdfs://node01:8020"), new Configuration());
    //获取RemoteIterator 得到所有的文件或者文件夹，第一个参数指定遍历的路径，第二个参数表示是否要递归遍历
    RemoteIterator locatedFileStatusRemoteIterator = fileSystem.listFiles(new Path("/"), true);
    while (locatedFileStatusRemoteIterator.hasNext()){
        LocatedFileStatus next = locatedFileStatusRemoteIterator.next();
        System.out.println(next.getPath().toString());
    }
    fileSystem.close();
}

1.4.4 HDFS 上创建文件夹

@Test
public void mkdirs() throws  Exception{
    FileSystem fileSystem = FileSystem.get(new URI("hdfs://node01:8020"), new Configuration());
    boolean mkdirs = fileSystem.mkdirs(new Path("/hello/mydir/test"));
    fileSystem.close();
}

1.4.4 下载文件

@Test
public void getFileToLocal()throws  Exception{
    FileSystem fileSystem = FileSystem.get(new URI("hdfs://node01:8020"), new Configuration());
    FSDataInputStream inputStream = fileSystem.open(new Path("/timer.txt"));
    FileOutputStream  outputStream = new FileOutputStream(new File("e:\\timer.txt"));
    IOUtils.copy(inputStream,outputStream );
    IOUtils.closeQuietly(inputStream);
    IOUtils.closeQuietly(outputStream);
    fileSystem.close();
}

1.4.5 HDFS 文件上传

@Test
public void putData() throws  Exception{
    FileSystem fileSystem = FileSystem.get(new URI("hdfs://node01:8020"), new Configuration());
    fileSystem.copyFromLocalFile(new Path("file:///c:\\install.log"),new Path("/hello/mydir/test"));
    fileSystem.close();
}

1.4.6 hdfs访问权限控制

停止hdfs集群，在node01机器上执行以下命令

cd /export/servers/hadoop-2.7.5
sbin/stop-dfs.sh

修改node01机器上的hdfs-site.xml当中的配置文件

cd /export/servers/hadoop-2.7.5/etc/hadoop
vim hdfs-site.xml


    dfs.permissions.enabled
    true

修改完成之后配置文件发送到其他机器上面去

scp hdfs-site.xml node02:$PWD
scp hdfs-site.xml node03:$PWD

重启hdfs集群

cd /export/servers/hadoop-2.7.5
sbin/start-dfs.sh

随意上传一些文件到我们hadoop集群当中准备测试使用

cd /export/servers/hadoop-2.7.5/etc/hadoop
hdfs dfs -mkdir /config
hdfs dfs -put *.xml /config
hdfs dfs -chmod 600 /config/core-site.xml

使用代码准备下载文件

@Test
public void getConfig()throws  Exception{
    FileSystem fileSystem = FileSystem.get(new URI("hdfs://node01:8020"), new Configuration(),"hadoop");
    fileSystem.copyToLocalFile(new Path("/config/core-site.xml"),new Path("file:///c:/core-site.xml"));
    fileSystem.close();
}

1.4.7 小文件合并

由于 Hadoop 擅长存储大文件，因为大文件的元数据信息比较少，如果 Hadoop 集群当中有大量的小文件，那么每个小文件都需要维护一份元数据信息，会大大的增加集群管理元数据的内存压力，所以在实际工作当中，如果有必要一定要将小文件合并成大文件进行一起处理

在我们的 HDFS 的 Shell 命令模式下，可以通过命令行将很多的 hdfs 文件合并成一个大文件下载到本地

cd /export/servers
hdfs dfs -getmerge /config/*.xml ./hello.xml

既然可以在下载的时候将这些小文件合并成一个大文件一起下载，那么肯定就可以在上传的时候将小文件合并到一个大文件里面去

@Test
public void mergeFile() throws  Exception{
    //获取分布式文件系统
    FileSystem fileSystem = FileSystem.get(new URI("hdfs://192.168.52.250:8020"), new Configuration(),"root");
    FSDataOutputStream outputStream = fileSystem.create(new Path("/bigfile.txt"));
    //获取本地文件系统
    LocalFileSystem local = FileSystem.getLocal(new Configuration());
    //通过本地文件系统获取文件列表，为一个集合
    FileStatus[] fileStatuses = local.listStatus(new Path("file:///E:\\input"));
    for (FileStatus fileStatus : fileStatuses) {
        FSDataInputStream inputStream = local.open(fileStatus.getPath());
       IOUtils.copy(inputStream,outputStream);
        IOUtils.closeQuietly(inputStream);
    }
    IOUtils.closeQuietly(outputStream);
    local.close();
    fileSystem.close();
}

Hadoop 核心-HDFS

Hadoop 核心-HDFS

1:HDFS 的 API 操作

1.1 配置Windows下Hadoop环境

1.2 导入 Maven 依赖

1.3 使用url方式访问数据（了解）

1.4 使用文件系统方式访问数据（掌握）

1.4.1 涉及的主要类

1.4.2 获取 FileSystem 的几种方式

1.4.3 遍历 HDFS 中所有文件

1.4.4 HDFS 上创建文件夹

1.4.4 下载文件

1.4.5 HDFS 文件上传

1.4.6 hdfs访问权限控制

1.4.7 小文件合并

你可能感兴趣的:(Hadoop 核心-HDFS)