Preface
- The latest AWS EMR release at the time of writing is 6.15, which ships with Python 3.7; this post tries uploading and using Python 3.11 instead
Packaging the Python environment
Tech stack
- Ubuntu 22.04 (x86), Linux kernel 5.15
- Python 3.11.5
- pyspark 3.4.1
- conda 23.10.0
- conda-pack 0.7.1
- The official recommendation is to compile and install the Python environment on Amazon Linux 2, but testing shows that a virtual environment created with Miniconda on Ubuntu also works
- Install Miniconda on Ubuntu, then create a Python 3.11 environment with conda
conda create -n EMRServerless python=3.11
- Activate the EMRServerless virtual environment and install pyspark 3.4.1 (EMR 6.15 ships Spark 3.4.1). This step turned out to be unnecessary: AWS uses its own pyspark build anyway
conda activate EMRServerless
pip3 install pyspark==3.4.1 -i https://pypi.tuna.tsinghua.edu.cn/simple
- Switch back to the base environment and install conda-pack
conda activate base
pip3 install conda-pack -i https://pypi.tuna.tsinghua.edu.cn/simple
- Use conda-pack to package the environment into py311.tar.gz
conda pack --prefix /home/qbit/miniconda3/envs/EMRServerless -o py311.tar.gz
- Unpacking the archive shows that the largest directory inside is
.../lib/python3.11/site-packages/pyspark/jars
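- To see where the bulk of the archive comes from without fully extracting it, the member sizes can be summed per directory; a minimal sketch using the standard-library tarfile module (run next to py311.tar.gz):
import tarfile
from collections import Counter

# Sum member sizes grouped by directory prefix to see which part of py311.tar.gz dominates
sizes = Counter()
with tarfile.open("py311.tar.gz", "r:gz") as tar:
    for member in tar.getmembers():
        # e.g. lib/python3.11/site-packages/pyspark/jars
        prefix = "/".join(member.name.split("/")[:5])
        sizes[prefix] += member.size

for prefix, size in sizes.most_common(5):
    print(f"{size / 1024 / 1024:8.1f} MB  {prefix}")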
- Upload py311.tar.gz to S3 at the following path
s3://your-bucket/usr/qbit/py311_env/py311.tar.gz
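- The upload can also be scripted; a minimal boto3 sketch (boto3 is installed further below; assumes AWS credentials are already configured locally, with the bucket and key from the path above):
import boto3

# Upload the packed environment to the S3 path that spark.archives will point at later
s3 = boto3.client("s3")
s3.upload_file(
    Filename="py311.tar.gz",
    Bucket="your-bucket",
    Key="usr/qbit/py311_env/py311.tar.gz",
)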
Installing and testing the AWS CLI
- Install the AWS CLI and set up a basic configuration
[profile emr-serverless]
aws_access_key_id = ACCESS-KEY-ID-OF-IAM-USER
aws_secret_access_key = SECRET-ACCESS-KEY-ID-OF-IAM-USER
region = cn-northwest-1
output = json
- Test that the installation works
aws emr-serverless help
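- The same check can be done from Python with boto3 (installed further below); a sketch that lists existing applications using the emr-serverless profile above:
import boto3

# Listing applications only succeeds if credentials, region and permissions are set up correctly
session = boto3.Session(profile_name="emr-serverless")
client = session.client("emr-serverless")
print(client.list_applications(maxResults=5))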
Creating the AWS IAM role
- In the current directory on your machine, create a file named emr-serverless-trust-policy.json; EMR Serverless cannot run without this trust policy. The file content is as follows
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "EMRServerlessTrustPolicy",
            "Action": "sts:AssumeRole",
            "Effect": "Allow",
            "Principal": {
                "Service": "emr-serverless.amazonaws.com"
            }
        }
    ]
}
- Create the role, referencing the trust policy file above
aws iam create-role \
--role-name EMRServerlessS3RuntimeRole \
--assume-role-policy-document file://emr-serverless-trust-policy.json
- In the current directory, create a file named emr-qbit-access-policy.json; this policy grants the S3 permissions. Replace your-bucket with your own bucket name, and note that the aws partition in the official example's ARNs must be changed to aws-cn. The file content is as follows
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "FullAccessToOutputBucket",
            "Effect": "Allow",
            "Action": [
                "s3:PutObject",
                "s3:GetObject",
                "s3:ListBucket",
                "s3:DeleteObject"
            ],
            "Resource": [
                "arn:aws-cn:s3:::your-bucket",
                "arn:aws-cn:s3:::your-bucket/*"
            ]
        }
    ]
}
- Create the S3 access policy; keep the Arn from the response for later use
aws iam create-policy \
--policy-name EMRServerlessS3AccessPolicy \
--policy-document file://emr-qbit-access-policy.json
- Attach the S3 access policy to the Serverless role; the arn in the command is the Arn of the S3 access policy
aws iam attach-role-policy \
--role-name EMRServerlessS3RuntimeRole \
--policy-arn arn:aws-cn:iam::XXX:policy/EMRServerlessS3AccessPolicy
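- To double-check that the policy is now attached to the role, a small boto3 sketch (role name as created above):
import boto3

iam = boto3.Session(profile_name="emr-serverless").client("iam")
attached = iam.list_attached_role_policies(RoleName="EMRServerlessS3RuntimeRole")
for policy in attached["AttachedPolicies"]:
    print(policy["PolicyName"], policy["PolicyArn"])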
Running WordCount
- For a sample scheduling script, see https://github.com/aws-samples/emr-serverless-samples/blob/main/examples/python-api/emr_serverless.py
- Overview of the class and its methods
class EMRServerless:
    def create_application(self, name: str, release_label: str, wait: bool = True):
        """Create the application"""

    def start_application(self, wait: bool = True) -> None:
        """Start the application"""

    def stop_application(self, wait: bool = True) -> None:
        """Stop the application"""

    def delete_application(self) -> None:
        """Delete the application"""

    def run_spark_job(
        self,
        script_location: str,
        job_role_arn: str,
        arguments: list(),
        s3_bucket_name: str,
        wait: bool = True,
    ) -> str:
        """Submit a Spark job"""

    def get_job_run(self, job_run_id: str) -> dict:
        """Get the job run info"""

    def fetch_driver_log(
        self, s3_bucket_name: str, job_run_id: str, log_type: str = "stdout"
    ) -> str:
        """Fetch the driver log"""
- Looking at the run_spark_job function on its own, note the sparkSubmitParameters argument
def run_spark_job(
    self,
    script_location: str,
    job_role_arn: str,
    arguments: list(),
    s3_bucket_name: str,
    wait: bool = True,
) -> str:
    response = self.client.start_job_run(
        applicationId=self.application_id,
        executionRoleArn=job_role_arn,
        jobDriver={
            "sparkSubmit": {
                "entryPoint": script_location,
                "entryPointArguments": arguments,
                "sparkSubmitParameters": (
                    "--conf spark.driver.cores=1 "
                    "--conf spark.driver.memory=4g "
                    "--conf spark.executor.instances=1 "
                    "--conf spark.executor.cores=1 "
                    "--conf spark.executor.memory=4g "
                    "--conf spark.archives=s3://your-bucket/usr/qbit/py311_env/py311.tar.gz#py311 "
                    "--conf spark.emr-serverless.driverEnv.PYSPARK_DRIVER_PYTHON=/home/hadoop/py311/bin/python "
                    "--conf spark.emr-serverless.driverEnv.PYSPARK_PYTHON=/home/hadoop/py311/bin/python "
                    "--conf spark.executorEnv.PYSPARK_PYTHON=/home/hadoop/py311/bin/python"
                )
            }
        },
        configurationOverrides={
            "monitoringConfiguration": {
                "s3MonitoringConfiguration": {
                    "logUri": f"s3://{s3_bucket_name}/{self.s3_log_prefix}"
                }
            }
        },
    )
    job_run_id = response.get("jobRunId")
    job_done = False
    while wait and not job_done:
        jr_response = self.get_job_run(job_run_id)
        job_done = jr_response.get("state") in [
            "SUCCESS",
            "FAILED",
            "CANCELLING",
            "CANCELLED",
        ]
    return job_run_id
- Upload the input text file to s3://your-bucket/usr/qbit/input/word.txt
- The script uploaded to s3://your-bucket/usr/qbit/wordcount.py is as follows
import pyspark
from pyspark.sql import SparkSession
import sys

# Create the SparkSession
spark = SparkSession.builder.appName("WordCountExample").getOrCreate()

# Read the text file into an RDD
text_file_path = "s3://your-bucket/usr/qbit/input/word.txt"
text_rdd = spark.sparkContext.textFile(text_file_path)

# Split each line into words with flatMap
words_rdd = text_rdd.flatMap(lambda line: line.split(" "))

# Map each word to a (word, 1) key-value pair
word_count_rdd = words_rdd.map(lambda word: (word, 1))

# Run the word count with reduceByKey
word_counts = word_count_rdd.reduceByKey(lambda a, b: a + b)

# Save the result
word_counts.saveAsTextFile("s3://your-bucket/usr/qbit/output/")

# Stop the SparkSession
spark.stop()

# Print the Python and pyspark versions
print(f"Python version: {sys.version}")
print(f"pyspark version: {pyspark.__version__}")
- Install boto3
pip3 install boto3 -i https://pypi.tuna.tsinghua.edu.cn/simple
- Invoke emr_serverless.py to run wordcount.py
python emr_serverless.py \
--job-role-arn arn:aws-cn:iam::XXX:role/EMRServerlessS3RuntimeRole \
--s3-bucket zt-hadoop-cn-northwest-1
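- The CLI entry point above wraps the EMRServerless class shown earlier; driving it directly from Python looks roughly like this (a sketch: the no-argument constructor and the emr-6.15.0 release label are assumptions, check the sample script for the exact interface):
from emr_serverless import EMRServerless

emr = EMRServerless()  # assumption: the sample class can be constructed without arguments
emr.create_application(name="qbit-wordcount", release_label="emr-6.15.0")
emr.start_application()

# Submit wordcount.py and wait for a terminal state
job_run_id = emr.run_spark_job(
    script_location="s3://your-bucket/usr/qbit/wordcount.py",
    job_role_arn="arn:aws-cn:iam::XXX:role/EMRServerlessS3RuntimeRole",
    arguments=[],
    s3_bucket_name="your-bucket",
)
print(emr.get_job_run(job_run_id).get("state"))
print(emr.fetch_driver_log("your-bucket", job_run_id, log_type="stdout"))

emr.stop_application()
emr.delete_application()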
- Check the wordcount output under the s3://your-bucket/usr/qbit/output/ directory
- The version info in the output shows that Python is the uploaded interpreter, while pyspark is AWS's own build
Python version: 3.11.5 (main, Sep 11 2023, 13:54:46) [GCC 11.2.0]
pyspark version: 3.4.1+amzn.2
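- To pull the result files back locally, a boto3 sketch (assuming the default part-file layout that saveAsTextFile writes under the output prefix):
import boto3

s3 = boto3.client("s3")
resp = s3.list_objects_v2(Bucket="your-bucket", Prefix="usr/qbit/output/")
for obj in resp.get("Contents", []):
    if "part-" in obj["Key"]:
        body = s3.get_object(Bucket="your-bucket", Key=obj["Key"])["Body"].read()
        print(body.decode("utf-8"))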
Bringing in third-party libraries
- Third-party libraries have transitive dependencies, so they cannot simply be zipped up and shipped directly. Instead, install them into the conda environment and pack and upload everything together, as in the sketch below
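- As a quick check that a packed library is really the one the cluster uses, the job script can print its version and install location; a sketch that assumes, for example, pandas was pip-installed into the EMRServerless environment before running conda-pack (pandas is only a stand-in here, it is not installed in this post):
# Hypothetical example: pandas stands in for any third-party library packed into py311.tar.gz
import pandas as pd

print(f"pandas version: {pd.__version__}")
# On EMR Serverless this should point somewhere under the unpacked py311 archive
print(f"pandas location: {pd.__file__}")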
Bringing in your own code
- Your own code folder can be packed into a util.zip file and brought in with the parameter below; an import sketch follows
--conf spark.submit.pyFiles=s3://your-bucket/qbit/util.zip
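- Inside the job script, modules from util.zip can then be imported as usual; a sketch assuming a hypothetical module util/text_tools.py with a tokenize function inside the archive:
# Hypothetical layout: util.zip contains util/__init__.py and util/text_tools.py
from pyspark.sql import SparkSession
from util.text_tools import tokenize  # hypothetical helper shipped via spark.submit.pyFiles

spark = SparkSession.builder.appName("WordCountWithUtil").getOrCreate()
text_rdd = spark.sparkContext.textFile("s3://your-bucket/usr/qbit/input/word.txt")
words_rdd = text_rdd.flatMap(tokenize)  # replaces the inline lambda from wordcount.py
print(words_rdd.take(10))
spark.stop()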
This post is from qbit snap