Getting the row count and file size of a Parquet file

The code:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.Path;
    import org.apache.parquet.hadoop.Footer;
    import org.apache.parquet.hadoop.ParquetFileReader;
    import org.apache.parquet.hadoop.metadata.BlockMetaData;

    // assumes a logger field on the enclosing class, e.g. an SLF4J or Log4j logger
    public static void getParquetFileSizeAndRowCount() throws Exception {
        Path inputPath = new Path("/user/hive/warehouse/user_parquet");
        Configuration conf = new Configuration();
        FileStatus[] inputFileStatuses = inputPath.getFileSystem(conf).globStatus(inputPath);

        for (FileStatus fs : inputFileStatuses) {
            // skipRowGroups = false so the footer keeps per-row-group (block) metadata
            for (Footer f : ParquetFileReader.readFooters(conf, fs, false)) {
                // each block is one row group; sizes and row counts are per row group
                for (BlockMetaData b : f.getParquetMetadata().getBlocks()) {
                    logger.info("TotalByteSize:" + b.getTotalByteSize()
                            + "   CompressedSize:" + b.getCompressedSize()
                            + "   rowCount:" + b.getRowCount());
                }
            }
        }
    }

Output:

    18/10/26 10:38:20 INFO hadoop.ParquetFileReader: Initiating action with parallelism: 5
    18/10/26 10:38:20 INFO hadoop.ParquetFileReader: reading another 1 footers
    18/10/26 10:38:20 INFO hadoop.ParquetFileReader: Initiating action with parallelism: 5
    18/10/26 10:38:20 INFO test.HDFSTest: TotalByteSize:106324460   CompressedSize:106324460   rowCount:53285496
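Each line logged above describes a single row group (block), so a file with several row groups produces several lines, and the file-level totals are the sums over all of its blocks. A minimal sketch of that aggregation, using plain `long` pairs in place of `BlockMetaData` for illustration (in the real code the values would come from `getCompressedSize()` and `getRowCount()`); the class name `BlockStatsTotal` is hypothetical:

```java
public class BlockStatsTotal {
    // Sum per-block (compressedSize, rowCount) pairs into file-level totals.
    // Illustrative stand-in: real values come from BlockMetaData getters.
    public static long[] totals(long[][] blocks) {
        long totalBytes = 0;
        long totalRows = 0;
        for (long[] b : blocks) {
            totalBytes += b[0]; // compressed size of this row group
            totalRows += b[1];  // rows in this row group
        }
        return new long[]{totalBytes, totalRows};
    }

    public static void main(String[] args) {
        // two row groups: the one from the log above plus a small second one
        long[][] blocks = {{106324460L, 53285496L}, {1000L, 50L}};
        long[] t = totals(blocks);
        System.out.println("totalBytes=" + t[0] + " totalRows=" + t[1]);
        // prints: totalBytes=106325460 totalRows=53285546
    }
}
```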

Relevant excerpt from pom.xml:

    
    <properties>
        <hadoop.version>2.8.4</hadoop.version>
        <parquet.version>1.10.0</parquet.version>
    </properties>

    <dependencies>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-common</artifactId>
            <version>${hadoop.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-hdfs</artifactId>
            <version>${hadoop.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-client</artifactId>
            <version>${hadoop.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.parquet</groupId>
            <artifactId>parquet-common</artifactId>
            <version>${parquet.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.parquet</groupId>
            <artifactId>parquet-encoding</artifactId>
            <version>${parquet.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.parquet</groupId>
            <artifactId>parquet-column</artifactId>
            <version>${parquet.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.parquet</groupId>
            <artifactId>parquet-hadoop</artifactId>
            <version>${parquet.version}</version>
        </dependency>
    </dependencies>
