Compiling support for Spark to read and write OSS (CDH 5.x)

Preface

Background: use Spark to read files from HDFS and write them to OSS.
Hadoop: 2.6.0-cdh5.15.1
Spark: 2.4.1
Main reference: https://blog.csdn.net/wankund...
This post adds notes and pitfalls on top of that write-up.
Compiling hadoop-aliyun
Recent Hadoop versions support Aliyun OSS access out of the box, but this version does not, so the support has to be compiled in:
  • Pull the hadoop trunk branch and copy the hadoop-tools/hadoop-aliyun module code into the corresponding module of the CDH project.
  • Modify hadoop-tools/pom.xml:
    • add hadoop-aliyun as a child module (see the sketch after this list)
  • Change the Java version in the root pom.xml to 1.8, since hadoop-aliyun uses Java 8 lambda syntax; alternatively, rewrite that code to avoid lambdas.
  • Modify hadoop-aliyun/pom.xml: update the version and the related OSS and HTTP dependencies, and use the shade plugin to bundle those dependencies into the jar.
  • Code changes:
    • change import org.apache.commons.lang3 to import org.apache.commons.lang
    • copy the BlockingThreadPoolExecutorService and SemaphoredDelegatingExecutor classes from the (CDH) hadoop-aws module into org.apache.hadoop.util
  • Build the hadoop-aliyun module:
    • mvn clean package -pl hadoop-tools/hadoop-aliyun
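
The two build-file edits above look roughly like this (a minimal sketch; surrounding elements follow the stock Hadoop build layout, and the Java-version property name may differ in your tree). In hadoop-tools/pom.xml, register the child module:

  <modules>
    <!-- existing modules ... -->
    <module>hadoop-aliyun</module>
  </modules>

And in the root pom.xml, raise the compiler level to 1.8:

  <properties>
    <!-- property name assumed; some trees configure maven-compiler-plugin
         source/target directly instead -->
    <javac.version>1.8</javac.version>
  </properties>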

The final hadoop-aliyun/pom.xml looks like this:

<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
  <modelVersion>4.0.0</modelVersion>
  <parent>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-project</artifactId>
    <version>2.6.0-cdh5.15.1</version>
    <relativePath>../../hadoop-project</relativePath>
  </parent>
  <artifactId>hadoop-aliyun</artifactId>
  <name>Apache Hadoop Aliyun OSS support</name>
  <packaging>jar</packaging>

  <properties>
    <file.encoding>UTF-8</file.encoding>
    <downloadSources>true</downloadSources>
  </properties>

  <profiles>
    <profile>
      <id>tests-off</id>
      <activation>
        <file>
          <missing>src/test/resources/auth-keys.xml</missing>
        </file>
      </activation>
      <properties>
        <maven.test.skip>true</maven.test.skip>
      </properties>
    </profile>
    <profile>
      <id>tests-on</id>
      <activation>
        <file>
          <exists>src/test/resources/auth-keys.xml</exists>
        </file>
      </activation>
      <properties>
        <maven.test.skip>false</maven.test.skip>
      </properties>
    </profile>
  </profiles>

  <build>
    <plugins>
      <plugin>
        <groupId>org.codehaus.mojo</groupId>
        <artifactId>findbugs-maven-plugin</artifactId>
        <configuration>
          <findbugsXmlOutput>true</findbugsXmlOutput>
          <xmlOutput>true</xmlOutput>
          <excludeFilterFile>${basedir}/dev-support/findbugs-exclude.xml</excludeFilterFile>
          <effort>Max</effort>
        </configuration>
      </plugin>
      <plugin>
        <groupId>org.apache.maven.plugins</groupId>
        <artifactId>maven-surefire-plugin</artifactId>
        <configuration>
          <forkedProcessTimeoutInSeconds>3600</forkedProcessTimeoutInSeconds>
        </configuration>
      </plugin>
      <plugin>
        <groupId>org.apache.maven.plugins</groupId>
        <artifactId>maven-dependency-plugin</artifactId>
        <executions>
          <execution>
            <id>deplist</id>
            <phase>compile</phase>
            <goals>
              <goal>list</goal>
            </goals>
            <configuration>
              <!-- build a shellprofile -->
              <outputFile>${project.basedir}/target/hadoop-tools-deps/${project.artifactId}.tools-optional.txt</outputFile>
            </configuration>
          </execution>
        </executions>
      </plugin>
      <plugin>
        <groupId>org.apache.maven.plugins</groupId>
        <artifactId>maven-shade-plugin</artifactId>
        <version>3.1.0</version>
        <executions>
          <execution>
            <id>shade-aliyun-sdk-oss</id>
            <phase>package</phase>
            <goals>
              <goal>shade</goal>
            </goals>
            <configuration>
              <!-- option names assumed; the original values were
                   false/true/true/true -->
              <shadedArtifactAttached>false</shadedArtifactAttached>
              <promoteTransitiveDependencies>true</promoteTransitiveDependencies>
              <createDependencyReducedPom>true</createDependencyReducedPom>
              <createSourcesJar>true</createSourcesJar>
              <relocations>
                <relocation>
                  <pattern>org.apache.http</pattern>
                  <shadedPattern>com.xxx.thirdparty.org.apache.http</shadedPattern>
                </relocation>
              </relocations>
            </configuration>
          </execution>
        </executions>
      </plugin>
    </plugins>
  </build>

  <dependencies>
    <dependency>
      <groupId>junit</groupId>
      <artifactId>junit</artifactId>
      <scope>test</scope>
    </dependency>

    <dependency>
      <groupId>com.aliyun.oss</groupId>
      <artifactId>aliyun-sdk-oss</artifactId>
      <version>3.4.1</version>
    </dependency>
    <dependency>
      <groupId>org.apache.httpcomponents</groupId>
      <artifactId>httpclient</artifactId>
      <version>4.4.1</version>
    </dependency>
    <dependency>
      <groupId>org.apache.httpcomponents</groupId>
      <artifactId>httpcore</artifactId>
      <version>4.4.1</version>
    </dependency>

    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-common</artifactId>
      <exclusions>
        <exclusion>
          <groupId>org.apache.httpcomponents</groupId>
          <artifactId>httpclient</artifactId>
        </exclusion>
        <exclusion>
          <groupId>org.apache.httpcomponents</groupId>
          <artifactId>httpcore</artifactId>
        </exclusion>
      </exclusions>
      <scope>provided</scope>
    </dependency>

    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-common</artifactId>
      <scope>test</scope>
      <type>test-jar</type>
    </dependency>
    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-distcp</artifactId>
      <scope>test</scope>
    </dependency>
    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-distcp</artifactId>
      <scope>test</scope>
      <type>test-jar</type>
    </dependency>
    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-yarn-server-tests</artifactId>
      <scope>test</scope>
      <type>test-jar</type>
    </dependency>
    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-mapreduce-client-jobclient</artifactId>
      <scope>test</scope>
    </dependency>
    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-mapreduce-examples</artifactId>
      <scope>test</scope>
      <type>jar</type>
    </dependency>
  </dependencies>
</project>

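After packaging, it is worth confirming that the relocation actually landed in the shaded jar (jar path assumed from the module layout above):

jar tf hadoop-tools/hadoop-aliyun/target/hadoop-aliyun-2.6.0-cdh5.15.1.jar | grep com/xxx/thirdparty/org/apache/http | head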

Spark reading and writing OSS files

import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

val inputPath = "hdfs:///xxx"
val outputPath = "oss://bucket/OSS_FILES"

// OSS endpoint, credentials, and the FileSystem implementation class
val conf = new SparkConf()
conf.set("spark.hadoop.fs.oss.endpoint", "oss-cn-xxx")
conf.set("spark.hadoop.fs.oss.accessKeyId", "xxx")
conf.set("spark.hadoop.fs.oss.accessKeySecret", "xxx")
conf.set("spark.hadoop.fs.oss.impl", "org.apache.hadoop.fs.aliyun.oss.AliyunOSSFileSystem")
conf.set("spark.hadoop.fs.oss.buffer.dir", "/tmp/oss")
conf.set("spark.hadoop.fs.oss.connection.secure.enabled", "false")
conf.set("spark.hadoop.fs.oss.connection.maximum", "2048")

val spark = SparkSession.builder().config(conf).getOrCreate()

// read from HDFS (orc assumed here) and write out to OSS
val df = spark.read.orc(inputPath)
df.write.format("orc").mode("overwrite").save(outputPath)
Other approaches, such as Spark SQL or reading OSS through the HDFS API, are covered by the third reference link below.
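
As a quick sanity check, the freshly written files can be read back with the same session, either directly or through a temp view (a minimal sketch; the view name oss_files is made up):

// direct read straight from oss://
val ossDf = spark.read.orc(outputPath)
// hypothetical view name, just for the Spark SQL round-trip
ossDf.createOrReplaceTempView("oss_files")
spark.sql("SELECT COUNT(*) FROM oss_files").show()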

spark submit

spark-submit \
--class org.example.HdfsToOSS \
--master yarn \
--deploy-mode cluster \
--num-executors 2 \
--executor-cores 2 \
--executor-memory 3G \
--driver-cores 1  \
--driver-memory 3G \
--conf "spark.driver.extraClassPath=hadoop-common-2.6.0-cdh5.15.1.jar" \
--conf "spark.executor.extraClassPath=hadoop-common-2.6.0-cdh5.15.1.jar" \
--jars ./hadoop-aliyun-2.6.0-cdh5.15.1.jar,./hadoop-common-2.6.0-cdh5.15.1.jar \
./spark-2.4-worker-1.0-SNAPSHOT.jar
Note the extraClassPath settings: without special configuration, Spark loads its own bundled hadoop-common jar first, and a version mismatch can surface as ClassNotFoundException. Jars specified via extraClassPath are loaded with higher precedence.
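
To confirm which hadoop-common actually won, the jar a Hadoop class was loaded from can be printed inside the driver or an executor (a small diagnostic sketch using the standard JDK API):

// Prints the jar that org.apache.hadoop.conf.Configuration came from;
// getCodeSource can be null for bootstrap-loaded classes.
val src = classOf[org.apache.hadoop.conf.Configuration].getProtectionDomain.getCodeSource
println(if (src != null) src.getLocation.toString else "bootstrap classloader")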

My knowledge here is limited; corrections are welcome.

References
