Preface
Background: use Spark to read files from HDFS and write them to OSS.
hadoop : 2.6.0-cdh5.15.1
spark : 2.4.1
Main reference: https://blog.csdn.net/wankund...
This post adds notes on caveats and pitfalls.
Compiling hadoop-aliyun
Newer Hadoop releases support Aliyun OSS access out of the box, but this version (2.6.0-cdh5.15.1) does not, so the module has to be compiled manually.
- Pull the Hadoop trunk branch and copy the hadoop-tools/hadoop-aliyun module into the corresponding place in the CDH source tree.
- Modify hadoop-tools/pom.xml to add hadoop-aliyun as a child module (see the sketch after this list).
- Change the Java version in the root pom.xml to 1.8, since hadoop-aliyun uses Java 8 lambda syntax (alternatively, rewrite those parts of the code to avoid lambdas).
- Modify hadoop-aliyun/pom.xml: set the version, add the related OSS and HTTP dependencies, and use the shade plugin to bundle (and relocate) those dependencies into the jar.
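The hadoop-tools/pom.xml change is just the module registration; a minimal sketch (the other module entries are whatever the CDH tree already contains):

<!-- hadoop-tools/pom.xml: register the copied module -->
<modules>
  ...
  <module>hadoop-aliyun</module>
</modules>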
Code changes
- Change import org.apache.commons.lang3 to import org.apache.commons.lang (a bulk replacement sketch follows this list).
- Copy the BlockingThreadPoolExecutorService and SemaphoredDelegatingExecutor classes from the (CDH) hadoop-aws module into the org.apache.hadoop.util package.
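For the commons-lang3 import swap, one option is a bulk replacement over the copied sources (a sketch, assuming GNU sed and the module living under hadoop-tools/hadoop-aliyun):

# replace commons-lang3 imports with commons-lang across the copied module
grep -rl 'org.apache.commons.lang3' hadoop-tools/hadoop-aliyun/src \
  | xargs sed -i 's/org\.apache\.commons\.lang3/org.apache.commons.lang/g'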
Build the hadoop-aliyun module
- mvn clean package -pl hadoop-tools/hadoop-aliyun
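If the rest of the CDH tree is not already installed in the local Maven repository, building the module's dependencies in the same run and skipping tests usually helps; a sketch using standard Maven flags (not from the original post):

mvn clean package -pl hadoop-tools/hadoop-aliyun -am -DskipTests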
The final hadoop-aliyun/pom.xml looks like this:
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0">
  <modelVersion>4.0.0</modelVersion>
  <parent>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-project</artifactId>
    <version>2.6.0-cdh5.15.1</version>
    <relativePath>../../hadoop-project</relativePath>
  </parent>
  <artifactId>hadoop-aliyun</artifactId>
  <name>Apache Hadoop Aliyun OSS support</name>
  <packaging>jar</packaging>
  <properties>
    <file.encoding>UTF-8</file.encoding>
    <downloadSources>true</downloadSources>
  </properties>
  <profiles>
    <profile>
      <id>tests-off</id>
      <activation>
        <file><missing>src/test/resources/auth-keys.xml</missing></file>
      </activation>
      <properties>
        <maven.test.skip>true</maven.test.skip>
      </properties>
    </profile>
    <profile>
      <id>tests-on</id>
      <activation>
        <file><exists>src/test/resources/auth-keys.xml</exists></file>
      </activation>
      <properties>
        <maven.test.skip>false</maven.test.skip>
      </properties>
    </profile>
  </profiles>
  <build>
    <plugins>
      <plugin>
        <groupId>org.codehaus.mojo</groupId>
        <artifactId>findbugs-maven-plugin</artifactId>
        <configuration>
          <findbugsXmlOutput>true</findbugsXmlOutput>
          <xmlOutput>true</xmlOutput>
          <excludeFilterFile>${basedir}/dev-support/findbugs-exclude.xml</excludeFilterFile>
          <effort>Max</effort>
        </configuration>
      </plugin>
      <plugin>
        <groupId>org.apache.maven.plugins</groupId>
        <artifactId>maven-surefire-plugin</artifactId>
        <configuration>
          <forkedProcessTimeoutInSeconds>3600</forkedProcessTimeoutInSeconds>
        </configuration>
      </plugin>
      <plugin>
        <groupId>org.apache.maven.plugins</groupId>
        <artifactId>maven-dependency-plugin</artifactId>
        <executions>
          <execution>
            <id>deplist</id>
            <phase>compile</phase>
            <goals><goal>list</goal></goals>
            <configuration>
              <outputFile>${project.basedir}/target/hadoop-tools-deps/${project.artifactId}.tools-optional.txt</outputFile>
            </configuration>
          </execution>
        </executions>
      </plugin>
      <plugin>
        <groupId>org.apache.maven.plugins</groupId>
        <artifactId>maven-shade-plugin</artifactId>
        <version>3.1.0</version>
        <executions>
          <execution>
            <id>shade-aliyun-sdk-oss</id>
            <phase>package</phase>
            <goals><goal>shade</goal></goals>
            <configuration>
              <!-- the original configuration also sets four boolean options here
                   (false, true, true, true); their element names were lost in the
                   original paste -->
              <relocations>
                <relocation>
                  <pattern>org.apache.http</pattern>
                  <shadedPattern>com.xxx.thirdparty.org.apache.http</shadedPattern>
                </relocation>
              </relocations>
            </configuration>
          </execution>
        </executions>
      </plugin>
    </plugins>
  </build>
  <dependencies>
    <dependency>
      <groupId>junit</groupId>
      <artifactId>junit</artifactId>
      <scope>test</scope>
    </dependency>
    <dependency>
      <groupId>com.aliyun.oss</groupId>
      <artifactId>aliyun-sdk-oss</artifactId>
      <version>3.4.1</version>
    </dependency>
    <dependency>
      <groupId>org.apache.httpcomponents</groupId>
      <artifactId>httpclient</artifactId>
      <version>4.4.1</version>
    </dependency>
    <dependency>
      <groupId>org.apache.httpcomponents</groupId>
      <artifactId>httpcore</artifactId>
      <version>4.4.1</version>
    </dependency>
    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-common</artifactId>
      <exclusions>
        <exclusion>
          <groupId>org.apache.httpcomponents</groupId>
          <artifactId>httpclient</artifactId>
        </exclusion>
        <exclusion>
          <groupId>org.apache.httpcomponents</groupId>
          <artifactId>httpcore</artifactId>
        </exclusion>
      </exclusions>
      <scope>provided</scope>
    </dependency>
    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-common</artifactId>
      <scope>test</scope>
      <type>test-jar</type>
    </dependency>
    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-distcp</artifactId>
      <scope>test</scope>
    </dependency>
    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-distcp</artifactId>
      <scope>test</scope>
      <type>test-jar</type>
    </dependency>
    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-yarn-server-tests</artifactId>
      <scope>test</scope>
      <type>test-jar</type>
    </dependency>
    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-mapreduce-client-jobclient</artifactId>
      <scope>test</scope>
    </dependency>
    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-mapreduce-examples</artifactId>
      <scope>test</scope>
      <type>jar</type>
    </dependency>
  </dependencies>
</project>
Reading and writing OSS files with Spark
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

val inputPath = "hdfs:///xxx"
val outputPath = "oss://bucket/OSS_FILES"

val conf = new SparkConf()
conf.set("spark.hadoop.fs.oss.endpoint", "oss-cn-xxx")
conf.set("spark.hadoop.fs.oss.accessKeyId", "xxx")
conf.set("spark.hadoop.fs.oss.accessKeySecret", "xxx")
conf.set("spark.hadoop.fs.oss.impl", "org.apache.hadoop.fs.aliyun.oss.AliyunOSSFileSystem")
conf.set("spark.hadoop.fs.oss.buffer.dir", "/tmp/oss")
conf.set("spark.hadoop.fs.oss.connection.secure.enabled", "false")
conf.set("spark.hadoop.fs.oss.connection.maximum", "2048")

val spark = SparkSession.builder().config(conf).getOrCreate()

// read from HDFS (ORC assumed here) and write the result to OSS
val df = spark.read.format("orc").load(inputPath)
df.write.format("orc").mode("overwrite").save(outputPath)
Other approaches, such as reading OSS via Spark SQL or through the HDFS command line, are covered in the third reference link.
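As a small illustration of the Spark SQL route (a sketch that reuses the spark session and outputPath from the snippet above; the view name is made up):

// expose the ORC files on OSS as a temporary view and query them with SQL
spark.sql(s"CREATE TEMPORARY VIEW oss_orc USING orc OPTIONS (path '$outputPath')")
spark.sql("SELECT COUNT(*) FROM oss_orc").show()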
spark submit
spark-submit \
--class org.example.HdfsToOSS \
--master yarn \
--deploy-mode cluster \
--num-executors 2 \
--executor-cores 2 \
--executor-memory 3G \
--driver-cores 1 \
--driver-memory 3G \
--conf "spark.driver.extraClassPath=hadoop-common-2.6.0-cdh5.15.1.jar" \
--conf "spark.executor.extraClassPath=hadoop-common-2.6.0-cdh5.15.1.jar" \
--jars ./hadoop-aliyun-2.6.0-cdh5.15.1.jar,./hadoop-common-2.6.0-cdh5.15.1.jar \
./spark-2.4-worker-1.0-SNAPSHOT.jar
Note the extraClassPath settings. Without them, Spark loads its own bundled hadoop-common jar; if that version does not match, you may hit ClassNotFoundException. Pointing spark.driver.extraClassPath / spark.executor.extraClassPath at the correct jar makes it take precedence on the classpath.
My knowledge here is limited; if anything is wrong, corrections are welcome.
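A quick way to confirm which hadoop-common actually wins on the classpath is to print where one of its classes was loaded from (a debugging sketch, not part of the original post):

// should point at hadoop-common-2.6.0-cdh5.15.1.jar rather than Spark's bundled copy
println(classOf[org.apache.hadoop.conf.Configuration]
  .getProtectionDomain.getCodeSource.getLocation)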