Spark on ACK

Build the Spark Docker image
Download spark-2.4.8-bin-hadoop2.7.tgz.

Note: do not use the "without hadoop" Spark package here. If you do, the built image will be missing some classes at runtime, for example log4j.
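
For reference, a sketch of fetching the release from the Apache archive (adjust the mirror or URL as needed):

wget https://archive.apache.org/dist/spark/spark-2.4.8/spark-2.4.8-bin-hadoop2.7.tgz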

tar -xvf spark-2.4.8-bin-hadoop2.7.tgz
cd spark-2.4.8-bin-hadoop2.7

Edit the Spark Dockerfile

vim kubernetes/dockerfiles/spark/Dockerfile

On line 18, replace FROM openjdk:8-jdk-slim with FROM openjdk:8-jdk-slim-buster.
The default openjdk base image is Debian 11, and the spark-py image built later is layered on top of this base. Debian 11 ships Python 3.8 or newer, while Spark 2.4 does not support Python versions above 3.7, so PySpark jobs fail with "TypeError: an integer is required (got type bytes)".

Switching the base image to Debian 10 (buster) gives you Python 3.7 instead.
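
If you prefer not to edit the Dockerfile by hand, the same change can be made with sed, and the Python version inside the resulting spark-py image can be checked after the build (tag v2.4.8 assumed, matching the build command below):

sed -i 's|openjdk:8-jdk-slim|openjdk:8-jdk-slim-buster|' kubernetes/dockerfiles/spark/Dockerfile
# after the images are built, confirm the spark-py image ships Python 3.7
docker run --rm --entrypoint python3 spark-py:v2.4.8 --version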

Build the images

bin/docker-image-tool.sh -t v2.4.8 build
Because the apt-get sources are hosted overseas, the build can be slow; the easiest fix is to route the build through a proxy.
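
One way to do that is to add a proxies section to the Docker client config at ~/.docker/config.json (merge it into the existing file rather than overwriting it); Docker injects these as proxy environment variables into build containers, so apt-get inside the build goes through the proxy. The address below is only a placeholder:

{
  "proxies": {
    "default": {
      "httpProxy": "http://127.0.0.1:7890",
      "httpsProxy": "http://127.0.0.1:7890",
      "noProxy": "localhost,127.0.0.1"
    }
  }
}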

If you have no proxy, you can switch to a domestic mirror instead (be aware that switching mirrors can introduce package dependency issues that make the spark-py image fail to build):

vim kubernetes/dockerfiles/spark/Dockerfile
Insert the following between lines 29 and 31 of the Dockerfile:
ADD sources.list /etc/apt/sources.list
Then place a sources.list file in the spark-2.4.8-bin-hadoop2.7 directory.
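
A Debian 10 (buster) sources.list pointing at a domestic mirror might look like this (Aliyun mirror used purely as an example), created from inside the spark-2.4.8-bin-hadoop2.7 directory:

cat > sources.list <<'EOF'
deb http://mirrors.aliyun.com/debian/ buster main contrib non-free
deb http://mirrors.aliyun.com/debian/ buster-updates main contrib non-free
deb http://mirrors.aliyun.com/debian-security buster/updates main contrib non-free
EOF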
Once the build finishes, there will be three images:

spark, spark-py and spark-r

Push all three images to your image registry.
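
A minimal sketch of tagging and pushing them, assuming the same ACR registry address used later in this post:

for img in spark spark-py spark-r; do
  docker tag $img:v2.4.8 acr-test01-registry.cn-beijing.cr.aliyuncs.com/netops/$img:v2.4.8
  docker push acr-test01-registry.cn-beijing.cr.aliyuncs.com/netops/$img:v2.4.8
done

Alternatively, build with -r <repo> and let bin/docker-image-tool.sh -r <repo> -t v2.4.8 push handle the push.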

Spark with OSS support
To let Spark read from and write to OSS, you can modify the Spark Dockerfile again:

vim kubernetes/dockerfiles/spark/Dockerfile
Add the following ADD lines right below the COPY data /opt/spark/data line:
ADD https://repo1.maven.org/maven2/com/aliyun/odps/hadoop-fs-oss/3.3.8-public/hadoop-fs-oss-3.3.8-public.jar $SPARK_HOME/jars
ADD https://repo1.maven.org/maven2/com/aliyun/oss/aliyun-sdk-oss/3.8.1/aliyun-sdk-oss-3.8.1.jar $SPARK_HOME/jars
ADD https://repo1.maven.org/maven2/org/aspectj/aspectjweaver/1.9.5/aspectjweaver-1.9.5.jar $SPARK_HOME/jars
ADD https://repo1.maven.org/maven2/org/jdom/jdom/1.1.3/jdom-1.1.3.jar $SPARK_HOME/jars
Alternatively, keep the original image untouched and write a separate Dockerfile that builds a new image on top of it:
FROM acr-test01-registry.cn-beijing.cr.aliyuncs.com/netops/spark-py:v2.4.8
RUN mkdir -p /opt/spark/jars
# If you need OSS (to read data from OSS or write event logs to OSS), add the following JARs to the image
ADD https://repo1.maven.org/maven2/com/aliyun/odps/hadoop-fs-oss/3.3.8-public/hadoop-fs-oss-3.3.8-public.jar $SPARK_HOME/jars
ADD https://repo1.maven.org/maven2/com/aliyun/oss/aliyun-sdk-oss/3.8.1/aliyun-sdk-oss-3.8.1.jar $SPARK_HOME/jars
ADD https://repo1.maven.org/maven2/org/aspectj/aspectjweaver/1.9.5/aspectjweaver-1.9.5.jar $SPARK_HOME/jars
ADD https://repo1.maven.org/maven2/org/jdom/jdom/1.1.3/jdom-1.1.3.jar $SPARK_HOME/jars
docker build -t ack-spark-oss:v2.4.8 .
docker tag ack-spark-oss:v2.4.8 acr-test01-registry.cn-beijing.cr.aliyuncs.com/netops/ack-spark-oss:v2.4.8
docker push acr-test01-registry.cn-beijing.cr.aliyuncs.com/netops/ack-spark-oss:v2.4.8
Spark on ACK YAML

These YAML manifests are used to submit Spark jobs.

The configuration differs slightly between Scala/Java and Python jobs. Here is the Scala example:

apiVersion: "sparkoperator.k8s.io/v1beta2"
kind: SparkApplication
metadata:
  name: spark-pi
  namespace: default
spec:
  type: Scala
  mode: cluster
  image: "acr-test01-registry.cn-beijing.cr.aliyuncs.com/netops/ack-spark-2.4.5:v9"
  imagePullPolicy: Always
  mainClass: org.apache.spark.examples.SparkPi
  mainApplicationFile: "oss://qa-oss/spark-examples_2.11-2.4.8.jar"
  sparkConf:
    "spark.eventLog.enabled": "true"
    "spark.eventLog.dir": "oss://qa-oss/spark-events"
    "spark.hadoop.fs.oss.impl": "org.apache.hadoop.fs.aliyun.oss.AliyunOSSFileSystem"
    "spark.hadoop.fs.oss.endpoint": "oss-cn-beijing-internal.aliyuncs.com"
    "spark.hadoop.fs.oss.accessKeySecret": "OSd0RVN"
    "spark.hadoop.fs.oss.accessKeyId": "LTADXrW"
  sparkVersion: "2.4.5"
  imagePullSecrets: [spark]
  restartPolicy:
    type: Never
  driver:
    cores: 2
    coreLimit: "2"
    memory: "3g"
    memoryOverhead: "1g"
    labels:
      version: 2.4.5
    serviceAccount: spark
    annotations:
      k8s.aliyun.com/eci-kube-proxy-enabled: 'true'
      k8s.aliyun.com/eci-image-cache: "true"
  executor:
    cores: 2
    instances: 5
    memory: "3g"
    memoryOverhead: "1g"
    labels:
      version: 2.4.5
    annotations:
      k8s.aliyun.com/eci-kube-proxy-enabled: 'true'
      k8s.aliyun.com/eci-image-cache: "true"

If your image registry is public, you do not need the imagePullSecrets field.

If your registry requires authentication, imagePullSecrets is how you provide it; [spark] here is the name of a Kubernetes docker-registry Secret containing the registry username and password.
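
A sketch of creating that Secret (the credentials below are placeholders):

kubectl create secret docker-registry spark \
  --docker-server=acr-test01-registry.cn-beijing.cr.aliyuncs.com \
  --docker-username=<your-username> \
  --docker-password=<your-password> \
  -n default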

mainApplicationFile is the location of the job's jar (or Python file). It can live in OSS or HDFS; if you use local://, the file must already be inside the image.
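
If the jar lives in OSS, as in the manifest above, it can be uploaded with Alibaba Cloud's ossutil tool (assuming ossutil is installed and configured; the bucket name is taken from the example):

ossutil cp spark-examples_2.11-2.4.8.jar oss://qa-oss/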

The sparkConf section enables event logging for the Spark history server; remove it if you don't need it.
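
Assuming the manifest above is saved as spark-pi.yaml and the Spark operator is installed in the cluster, the job can be submitted and followed like this (the driver pod name follows the operator's default <name>-driver pattern):

kubectl apply -f spark-pi.yaml
kubectl get sparkapplication spark-pi -n default
kubectl logs -f spark-pi-driver -n default

The Python variant of the manifest looks like this: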

apiVersion: "sparkoperator.k8s.io/v1beta2"
kind: SparkApplication
metadata:
  name: spark-pi
  namespace: default
spec:
  type: Python
  mode: cluster
  image: "acr-test01-registry.cn-beijing.cr.aliyuncs.com/netops/ack-spark-2.4.5:v9"
  imagePullPolicy: Always
  mainApplicationFile: "local:///opt/spark/examples/src/main/python/pi.py"
  sparkVersion: "2.4.5"
  pythonVersion: "3"
  imagePullSecrets: [spark]
  restartPolicy:
    type: Never
  driver:
    cores: 2
    coreLimit: "2"
    memory: "3g"
    memoryOverhead: "1g"
    labels:
      version: 2.4.5
    serviceAccount: spark
    annotations:
      k8s.aliyun.com/eci-kube-proxy-enabled: 'true'
      k8s.aliyun.com/eci-image-cache: "true"
  executor:
    cores: 2
    instances: 5
    memory: "3g"
    memoryOverhead: "1g"
    labels:
      version: 2.4.5
    annotations:
      k8s.aliyun.com/eci-kube-proxy-enabled: 'true'
      k8s.aliyun.com/eci-image-cache: "true"
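
While an application is running, the Spark UI can be reached by port-forwarding the driver pod (again assuming the default <name>-driver naming) and opening http://localhost:4040:

kubectl port-forward pod/spark-pi-driver 4040:4040 -n default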