大数据开发之HDFS的API操作过程

创建maven工程并导入jar包


    cloudera
    https://repository.cloudera.com/artifactory/cloudera-repos/


    org.apache.hadoop
    hadoop-client
    2.6.0-mr1-cdh5.14.0


    org.apache.hadoop
    hadoop-common
    2.6.0-cdh5.14.0


    org.apache.hadoop
    hadoop-hdfs
    2.6.0-cdh5.14.0




    org.apache.hadoop
    hadoop-mapreduce-client-core
    2.6.0-cdh5.14.0



    junit
    junit
    4.11
    test


    org.testng
    testng
    RELEASE


    
        org.apache.maven.plugins
        maven-compiler-plugin
        3.0
        
            1.8
            1.8
            UTF-8
            
        
    


    
        org.apache.maven.plugins
        maven-shade-plugin
        2.4.3
        
            
                package
                
                    shade
                
                
                    true

使用文件系统方式访问数据

在 java 中操作 HDFS，主要涉及以下 Class：

Configuration：该类的对象封装了客户端或者服务器的配置;

FileSystem：该类的对象是一个文件系统对象，可以用该对象的一些方法来对文件进行操作，通过 FileSystem 的静态方法 get 获得该对象。

FileSystem fs = FileSystem.get(conf)

get 方法从 conf 中的一个参数 fs.defaultFS 的配置值判断具体是什么类型的文件系统。如果我们的代码中没有指定 fs.defaultFS，并且工程 classpath下也没有给定相应的配置，conf中的默认值就来自于hadoop的jar包中的core-default.xml ，默认值为：file:/// ，则获取的将不是一个DistributedFileSystem 的实例，而是一个本地文件系统的客户端对象

获取FileSystem的几种方式

第一种方式获取FileSystem

@Test
public void getFileSystem() throws URISyntaxException, IOException {
Configuration configuration = new Configuration();

FileSystem fileSystem = FileSystem.get(new URI("hdfs://192.168.47.100:8020"), configuration);
System.out.println(fileSystem.toString());

}

第二种方式获取FileSystem

@Test
public void getFileSystem2() throws URISyntaxException, IOException {

Configuration configuration = new Configuration();
configuration.set("fs.defaultFS","hdfs://192.168.47.100:8020");
FileSystem fileSystem = FileSystem.get(new URI("/"), configuration);
System.out.println(fileSystem.toString());

}

第三种获取FileSystem类的方式

@Test
public void getFileSystem3() throws URISyntaxException, IOException {

Configuration configuration = new Configuration();
FileSystem fileSystem = FileSystem.newInstance(new URI("hdfs://192.168.47.100:8020"), configuration);
System.out.println(fileSystem.toString());

}

第四种获取FileSystem类的方式

@Test
public void getFileSystem4() throws Exception{

Configuration configuration = new Configuration();
configuration.set("fs.defaultFS","hdfs://192.168.47.100:8020");
FileSystem fileSystem = FileSystem.newInstance(configuration);
System.out.println(fileSystem.toString());

}

递归遍历文件系统当中的所有文件

通过递归遍历hdfs文件系统

@Test
public void listFile() throws Exception{

FileSystem fileSystem = FileSystem.get(new URI("hdfs://192.168.47.100:8020"), new Configuration());
FileStatus[] fileStatuses = fileSystem.listStatus(new Path("/"));
for (FileStatus fileStatus : fileStatuses) {
    if(fileStatus.isDirectory()){
        Path path = fileStatus.getPath();
        listAllFiles(fileSystem,path);
    }else{
        System.out.println("文件路径为"+fileStatus.getPath().toString());
    }
}

}
public void listAllFiles(FileSystem fileSystem,Path path) throws Exception{

FileStatus[] fileStatuses = fileSystem.listStatus(path);
for (FileStatus fileStatus : fileStatuses) {
    if(fileStatus.isDirectory()){
        listAllFiles(fileSystem,fileStatus.getPath());
    }else{
        Path path1 = fileStatus.getPath();
        System.out.println("文件路径为"+path1);
    }
}

}

官方提供的API直接遍历

/**

递归遍历官方提供的API版本
@throws Exception
*/

@Test
public void listMyFiles()throws Exception{

//获取fileSystem类
FileSystem fileSystem = FileSystem.get(new URI("hdfs://192.168.52.100:8020"), new Configuration());
//获取RemoteIterator 得到所有的文件或者文件夹，第一个参数指定遍历的路径，第二个参数表示是否要递归遍历
RemoteIterator locatedFileStatusRemoteIterator = fileSystem.listFiles(new Path("/"), true);
while (locatedFileStatusRemoteIterator.hasNext()){
    LocatedFileStatus next = locatedFileStatusRemoteIterator.next();
    System.out.println(next.getPath().toString());
}
fileSystem.close();

}

下载文件到本地

程序执行的main方法

拷贝文件的到本地
@throws Exception
*/

@Test
public void getFileToLocal()throws Exception{

FileSystem fileSystem = FileSystem.get(new URI("hdfs://192.168.47.100:8020"), new Configuration());
FSDataInputStream open = fileSystem.open(new Path("/test/input/install.log"));
FileOutputStream fileOutputStream = new FileOutputStream(new File("c:\\install.log"));
IOUtils.copy(open,fileOutputStream );
IOUtils.closeQuietly(open);
IOUtils.closeQuietly(fileOutputStream);
fileSystem.close();

}

hdfs上创建文件夹

@Test
public void mkdirs() throws Exception{

FileSystem fileSystem = FileSystem.get(new URI("hdfs://192.168.52.100:8020"), new Configuration());
boolean mkdirs = fileSystem.mkdirs(new Path("/hello/mydir/test"));
fileSystem.close();

}

hdfs文件上传

@Test
public void putData() throws Exception{

FileSystem fileSystem = FileSystem.get(new URI("hdfs://192.168.47.100:8020"), new Configuration());
fileSystem.copyFromLocalFile(new Path("file:///c:\\install.log"),new Path("/hello/mydir/test"));
fileSystem.close();

}

HDFS的小文件合并

由于hadoop擅长存储大文件，因为大文件的元数据信息比较少，如果hadoop集群当中有大量的小文件，那么每个小文件都需要维护一份元数据信息，会大大的增加集群管理元数据的内存压力，所以在实际工作当中，如果有必要一定要将小文件合并成大文件进行一起处理

在我们的hdfs 的shell命令模式下，可以通过命令行将很多的hdfs文件合并成一个大文件下载到本地，命令如下

cd /export/servers

hdfs dfs -getmerge /config/*.xml ./hello.xml

既然可以在下载的时候将这些小文件合并成一个大文件一起下载，那么肯定就可以在上传的时候将小文件合并到一个大文件里面去

代码如下：

/**

将多个本地系统文件，上传到hdfs，并合并成一个大的文件
@throws Exception
*/

@Test
public void mergeFile() throws Exception{

//获取分布式文件系统
FileSystem fileSystem = FileSystem.get(new URI("hdfs://192.168.47.100:8020"), new Configuration(),"root");
FSDataOutputStream outputStream = fileSystem.create(new Path("/bigfile.xml"));
//获取本地文件系统
LocalFileSystem local = FileSystem.getLocal(new Configuration());
//通过本地文件系统获取文件列表，为一个集合
FileStatus[] fileStatuses = local.listStatus(new Path("file:///F:\\上传小文件合并"));
for (FileStatus fileStatus : fileStatuses) {
    FSDataInputStream inputStream = local.open(fileStatus.getPath());
   IOUtils.copy(inputStream,outputStream);
    IOUtils.closeQuietly(inputStream);
}
IOUtils.closeQuietly(outputStream);
local.close();
fileSystem.close();

}

大数据开发之HDFS的API操作过程

你可能感兴趣的:(大数据hdfs)