py4j.protocol.Py4JJavaError: An error... (configuring PySpark to read data from AWS S3)

Error message

py4j.protocol.Py4JJavaError: An error occurred while calling o29.csv.
java.lang.IllegalAccessError: tried to access method org.apache.hadoop.metrics2.lib.MutableCounterLong

Cause

After installing and configuring PySpark, reading data from an S3 server produced one error after another. I tried the fixes posted online; in the end the root cause was a version mismatch between the hadoop-aws and aws-java-sdk jars loaded at startup. After testing several combinations, the following pairing works:
hadoop-aws:2.7.3
aws-java-sdk:1.7.4
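The pairing matters because hadoop-aws must match the Hadoop version PySpark was built against, and each hadoop-aws release was in turn compiled against a specific aws-java-sdk. As a small sketch (helper name and structure are my own, for illustration), the `--packages` argument can be assembled from a pinned mapping so the two versions are kept in one place:

```python
# Pinned, known-good Maven coordinates (from the pairing above).
PINNED_PACKAGES = {
    "org.apache.hadoop:hadoop-aws": "2.7.3",
    "com.amazonaws:aws-java-sdk": "1.7.4",
}

def build_submit_args(packages):
    """Join coordinate -> version pairs into a PYSPARK_SUBMIT_ARGS string."""
    coords = ",".join(f"{name}:{ver}" for name, ver in packages.items())
    return f"--packages {coords} pyspark-shell"

print(build_submit_args(PINNED_PACKAGES))
# --packages org.apache.hadoop:hadoop-aws:2.7.3,com.amazonaws:aws-java-sdk:1.7.4 pyspark-shell
```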

Demo program

Only the s3a:// protocol works here; the program below can connect to both the China and global S3 regions.
Replace the access id and key with your own.
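Which region you target is controlled by the `fs.s3a.endpoint` setting in the program below. As a hedged sketch (the function is my own helper, not part of any library), the endpoint host follows a simple pattern: China regions live on the amazonaws.com.cn domain, all others on amazonaws.com:

```python
def s3_endpoint(region):
    """Return the fs.s3a.endpoint host for an AWS region name.
    China regions (cn-*) use the .com.cn domain; others use .com."""
    suffix = ".cn" if region.startswith("cn-") else ""
    return f"s3.{region}.amazonaws.com{suffix}"

print(s3_endpoint("cn-north-1"))  # s3.cn-north-1.amazonaws.com.cn
print(s3_endpoint("us-east-1"))   # s3.us-east-1.amazonaws.com
```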

import pyspark
import os

# Load the matching hadoop-aws and aws-java-sdk jars before the JVM starts.
os.environ['PYSPARK_SUBMIT_ARGS'] = (
    "--packages org.apache.hadoop:hadoop-aws:2.7.3,"
    "com.amazonaws:aws-java-sdk:1.7.4 "
    "pyspark-shell"
)
access_id = 'your_access_id'
access_key = 'your_access_key'

spark = pyspark.sql.SparkSession.builder.master('local').appName("hxy_test_script").getOrCreate()
sc = spark.sparkContext
hadoop_conf = sc._jsc.hadoopConfiguration()

hadoop_conf.set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
hadoop_conf.set("fs.s3a.access.key", access_id)
hadoop_conf.set("fs.s3a.secret.key", access_key)
# China (Beijing) region; use s3.<region>.amazonaws.com for global regions.
hadoop_conf.set("fs.s3a.endpoint", "s3.cn-north-1.amazonaws.com.cn")

path_list = ['s3a://bucket/DATA/00.csv']
df = spark.read.csv(path_list, header=True)  # read via the SparkSession; no separate SQLContext needed
print(df.count())
df.show()  # show() prints the frame itself and returns None
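Hard-coding the id and key in the script is convenient for a demo but risky in practice. A minimal alternative sketch (assuming the standard `AWS_ACCESS_KEY_ID` / `AWS_SECRET_ACCESS_KEY` variable names; the helper itself is hypothetical) reads them from the environment and fails fast when one is missing:

```python
import os

def credentials_from_env():
    """Read S3 credentials from the standard AWS environment variables,
    raising a clear error when either variable is missing."""
    try:
        return os.environ["AWS_ACCESS_KEY_ID"], os.environ["AWS_SECRET_ACCESS_KEY"]
    except KeyError as missing:
        raise RuntimeError(f"missing credential variable: {missing}") from None

# In the demo above, this would replace the hard-coded strings:
# access_id, access_key = credentials_from_env()
```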
