With the growing popularity of AWS, more and more companies store their data on S3 to build data lakes. This article shows how to read data from S3 with PySpark, process and display it with the structured APIs, and finally gives a brief comparison of writing directly to S3 versus writing locally first and then uploading to S3.
Reading from S3 with Spark requires the hadoop-aws related jars under $SPARK_HOME/jars; currently aws-java-sdk-1.7.4.jar and hadoop-aws-2.7.7.jar are the recommended versions. Both jars can usually be found under $HADOOP_HOME/share/hadoop/tools/lib/ and only need to be copied into $SPARK_HOME/jars. If Hadoop is not installed, the jars can be downloaded from the web.
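Alternatively, instead of copying the jars, Spark can be pointed at them directly through the spark.jars configuration. A minimal sketch, assuming the two jars were downloaded to a local directory /path/to/jars (a placeholder):
from pyspark.sql import SparkSession
# sketch: load the S3A jars from a local directory (the path is a placeholder)
spark = SparkSession.builder\
    .config("spark.jars", "/path/to/jars/hadoop-aws-2.7.7.jar,/path/to/jars/aws-java-sdk-1.7.4.jar")\
    .getOrCreate()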
In the code below, a .cfg configuration file stores the AWS access credentials, which are read with the configparser module.
import os
import pyspark
from pyspark.sql import SparkSession
import configparser
config = configparser.ConfigParser()
config.read("dl.cfg")
AWS_ACCESS_KEY_ID = config['AWS']['AWS_ACCESS_KEY_ID']
AWS_SECRET_ACCESS_KEY = config['AWS']['AWS_SECRET_ACCESS_KEY']
spark = SparkSession.builder\
.config("spark.hadoop.fs.s3a.access.key", AWS_ACCESS_KEY_ID)\
.config("spark.hadoop.fs.s3a.secret.key", AWS_SECRET_ACCESS_KEY)\
.config("spark.hadoop.fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")\
.getOrCreate()
You can also launch the SparkSession by specifying the package in the configuration; it is then downloaded automatically at runtime, which makes SparkSession initialization rather slow, so this approach is not recommended:
import os
import pyspark
from pyspark.sql import SparkSession
os.environ['AWS_ACCESS_KEY_ID'] = config['AWS']['AWS_ACCESS_KEY_ID']
os.environ['AWS_SECRET_ACCESS_KEY'] = config['AWS']['AWS_SECRET_ACCESS_KEY']
spark = SparkSession.builder\
.config("spark.jars.packages","org.apache.hadoop:hadoop-aws:2.7.7")\
.getOrCreate()
# read a single file
df_song = spark.read.format('json').load('s3a://MyBucket/song_data/A/A/A/TRAAAAK128F9318786.json')
# read multiple files
df_songs = spark.read.format('json').load('s3a://MyBucket/song_data/A/A/A/*.json')
# cache the data to speed up subsequent reads
df_songs.cache()
df_songs.count()
24
df_songs.select('artist_id').take(5)
[Row(artist_id='ARTC1LV1187B9A4858'),
Row(artist_id='ARA23XO1187B9AF18F'),
Row(artist_id='ARSVTNL1187B992A91'),
Row(artist_id='AR73AIO1187B9AD57B'),
Row(artist_id='ARXQBR11187B98A2CC')]
# use cache() with caution if the file is huge
df = spark.read.csv("s3a://MyBucket/pagila/payment/payment.csv").cache()
df.printSchema()
df.show(5)
df.count()
root
|-- _c0: string (nullable = true)
+--------------------+
| _c0|
+--------------------+
|payment_id;custom...|
|16050;269;2;7;1.9...|
|16051;269;1;98;0....|
|16052;269;2;678;6...|
|16053;269;2;703;0...|
+--------------------+
only showing top 5 rows
16050
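As a side note, if a cached DataFrame is no longer needed (or turns out to be too large), the cache can be released explicitly with the standard unpersist() call before moving on:
df.unpersist()  # release the cached data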
Notice that Spark's CSV reader defaults to a comma separator, while this file is semicolon-delimited, so everything is read into a single column; we also need to tell Spark that the file has a header row.
# read the csv in a different way here; it is essentially the same
df = spark.read\
.format('csv')\
.option('sep',';')\
.option('inferSchema',True)\
.option('header',True)\
.load("s3a://MyBucket/pagila/payment/payment.csv")
# use cache() with caution if the file is huge
df.cache()
# the columns are now split correctly
df.printSchema()
root
|-- payment_id: integer (nullable = true)
|-- customer_id: integer (nullable = true)
|-- staff_id: integer (nullable = true)
|-- rental_id: integer (nullable = true)
|-- amount: double (nullable = true)
|-- payment_date: string (nullable = true)
df.show(5)
+----------+-----------+--------+---------+------+--------------------+
|payment_id|customer_id|staff_id|rental_id|amount| payment_date|
+----------+-----------+--------+---------+------+--------------------+
| 16050| 269| 2| 7| 1.99|2017-01-24 21:40:...|
| 16051| 269| 1| 98| 0.99|2017-01-25 15:16:...|
| 16052| 269| 2| 678| 6.99|2017-01-28 21:44:...|
| 16053| 269| 2| 703| 0.99|2017-01-29 00:58:...|
| 16054| 269| 1| 750| 4.99|2017-01-29 08:10:...|
+----------+-----------+--------+---------+------+--------------------+
only showing top 5 rows
Convert the payment_date column from string to timestamp.
import pyspark.sql.functions as fn
df_new = df.withColumn('payment_date',fn.to_timestamp('payment_date','yyyy-MM-dd HH:mm:ss'))
df_new.printSchema()
df_new.show(5)
root
|-- payment_id: integer (nullable = true)
|-- customer_id: integer (nullable = true)
|-- staff_id: integer (nullable = true)
|-- rental_id: integer (nullable = true)
|-- amount: double (nullable = true)
|-- payment_date: timestamp (nullable = true)
+----------+-----------+--------+---------+------+-------------------+
|payment_id|customer_id|staff_id|rental_id|amount| payment_date|
+----------+-----------+--------+---------+------+-------------------+
| 16050| 269| 2| 7| 1.99|2017-01-24 21:40:19|
| 16051| 269| 1| 98| 0.99|2017-01-25 15:16:50|
| 16052| 269| 2| 678| 6.99|2017-01-28 21:44:14|
| 16053| 269| 2| 703| 0.99|2017-01-29 00:58:02|
| 16054| 269| 1| 750| 4.99|2017-01-29 08:10:06|
+----------+-----------+--------+---------+------+-------------------+
only showing top 5 rows
# DataFrame API approach
df_new = df_new.withColumn('month',fn.month('payment_date'))
df_new.show(5)
+----------+-----------+--------+---------+------+-------------------+-----+
|payment_id|customer_id|staff_id|rental_id|amount| payment_date|month|
+----------+-----------+--------+---------+------+-------------------+-----+
| 16050| 269| 2| 7| 1.99|2017-01-24 21:40:19| 1|
| 16051| 269| 1| 98| 0.99|2017-01-25 15:16:50| 1|
| 16052| 269| 2| 678| 6.99|2017-01-28 21:44:14| 1|
| 16053| 269| 2| 703| 0.99|2017-01-29 00:58:02| 1|
| 16054| 269| 1| 750| 4.99|2017-01-29 08:10:06| 1|
+----------+-----------+--------+---------+------+-------------------+-----+
only showing top 5 rows
# SQL API approach
df_new = df_new.drop('month')  # first drop the month column added above
df_new.createOrReplaceTempView('vw_df_payment')
df_new = spark.sql('''
select payment_id
,customer_id
,staff_id
,rental_id
,amount
,payment_date
,month(payment_date) as month
from vw_df_payment
''')
df_new.show(5)
+----------+-----------+--------+---------+------+-------------------+-----+
|payment_id|customer_id|staff_id|rental_id|amount| payment_date|month|
+----------+-----------+--------+---------+------+-------------------+-----+
| 16050| 269| 2| 7| 1.99|2017-01-24 21:40:19| 1|
| 16051| 269| 1| 98| 0.99|2017-01-25 15:16:50| 1|
| 16052| 269| 2| 678| 6.99|2017-01-28 21:44:14| 1|
| 16053| 269| 2| 703| 0.99|2017-01-29 00:58:02| 1|
| 16054| 269| 1| 750| 4.99|2017-01-29 08:10:06| 1|
+----------+-----------+--------+---------+------+-------------------+-----+
only showing top 5 rows
df_new.createOrReplaceTempView('vw_df_payment')
spark.sql('''
select Month,sum(amount) as Revenue
from vw_df_payment
group by month
order by month
''').show()
+-----+------------------+
|Month| Revenue|
+-----+------------------+
| 1| 4824.429999999856|
| 2| 9631.879999999608|
| 3|23886.560000002115|
| 4|28559.460000003943|
| 5| 514.180000000001|
+-----+------------------+
df.count()
16049
df_new.count()
16049
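The Revenue figures above carry floating-point noise because amount is typed as a double. If cleaner output is wanted, the sum can be rounded inside the query; a small sketch against the same temp view (round is a built-in Spark SQL function):
spark.sql('''
    select month, round(sum(amount), 2) as Revenue
    from vw_df_payment
    group by month
    order by month
''').show()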
Inferring the schema is fairly slow and can easily end up not matching the actual data, so next we specify the schema explicitly.
A schema is specified either by constructing a StructType, or directly with a SQL-DDL-style string.
# from pyspark.sql.types import StructType,StructField,IntegerType,DoubleType,DateType,TimestampType
# my_schema = StructType([
# StructField('payment_id' ,IntegerType()),
# StructField('customer_id',IntegerType()),
# StructField('staff_id',IntegerType()),
# StructField('rental_id',IntegerType()),
# StructField('amount',DoubleType()),
# StructField('payment_date',TimestampType())
# ])
# Besides the StructType definition above, a DDL-style schema like the one below also works
# (varchar corresponds to string, datetime corresponds to timestamp)
# Note: when a timestamp column is involved, always specify timestampFormat as
# yyyy-MM-dd HH:mm:ss. With yyyy-MM-dd HH:mm:ss.SSSSSS, Spark reads the microseconds of
# 2017-01-24 21:40:19.996577+00 as 996577000 and carries them over into the minutes,
# so either drop the fractional SSSSSS part, or read the column as a string first and
# convert it separately (see the sketch after the take(5) results below)
my_schema = '''
payment_id int,
customer_id int,
staff_id int,
rental_id int,
amount double,
payment_date timestamp
'''
df = spark.read\
.format('csv')\
.schema(my_schema)\
.option('sep',';')\
.option('header',True)\
.option("timestampFormat", "yyyy-MM-dd HH:mm:ss")\
.load("s3a://MyBucket/pagila/payment/payment.csv")
df.printSchema()
df.show(5)
root
|-- payment_id: integer (nullable = true)
|-- customer_id: integer (nullable = true)
|-- staff_id: integer (nullable = true)
|-- rental_id: integer (nullable = true)
|-- amount: double (nullable = true)
|-- payment_date: timestamp (nullable = true)
+----------+-----------+--------+---------+------+-------------------+
|payment_id|customer_id|staff_id|rental_id|amount| payment_date|
+----------+-----------+--------+---------+------+-------------------+
| 16050| 269| 2| 7| 1.99|2017-01-24 21:40:19|
| 16051| 269| 1| 98| 0.99|2017-01-25 15:16:50|
| 16052| 269| 2| 678| 6.99|2017-01-28 21:44:14|
| 16053| 269| 2| 703| 0.99|2017-01-29 00:58:02|
| 16054| 269| 1| 750| 4.99|2017-01-29 08:10:06|
+----------+-----------+--------+---------+------+-------------------+
only showing top 5 rows
df.select('payment_date').take(5)
[Row(payment_date=datetime.datetime(2017, 1, 24, 21, 40, 19)),
Row(payment_date=datetime.datetime(2017, 1, 25, 15, 16, 50)),
Row(payment_date=datetime.datetime(2017, 1, 28, 21, 44, 14)),
Row(payment_date=datetime.datetime(2017, 1, 29, 0, 58, 2)),
Row(payment_date=datetime.datetime(2017, 1, 29, 8, 10, 6))]
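As mentioned in the schema comments above, if the raw payment_date values did carry fractional seconds (e.g. 2017-01-24 21:40:19.996577+00), one workaround is to read the column as a string and convert it separately. A minimal sketch under that assumption, truncating to whole seconds before conversion:
import pyspark.sql.functions as fn

str_schema = '''
    payment_id int,
    customer_id int,
    staff_id int,
    rental_id int,
    amount double,
    payment_date string
'''
df_raw = spark.read\
    .format('csv')\
    .schema(str_schema)\
    .option('sep',';')\
    .option('header',True)\
    .load("s3a://MyBucket/pagila/payment/payment.csv")
# keep only 'yyyy-MM-dd HH:mm:ss' (the first 19 characters), then convert
df_fixed = df_raw.withColumn(
    'payment_date',
    fn.to_timestamp(fn.substring('payment_date', 1, 19), 'yyyy-MM-dd HH:mm:ss')
)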
df.createOrReplaceTempView('vw_df_payment')
spark.sql('''
select month(payment_date) as Month,sum(amount) as Revenue
from vw_df_payment
group by month(payment_date)
order by month
''').show()
+-----+------------------+
|Month| Revenue|
+-----+------------------+
| 1| 4824.429999999856|
| 2| 9631.879999999608|
| 3|23886.560000002115|
| 4|28559.460000003943|
| 5| 514.180000000001|
+-----+------------------+
Writing directly to S3 is presumably rather slow; you can instead save locally first and then upload to S3, or save to HDFS on AWS EMR and then copy to S3. Although this is more cumbersome, it can be faster.
import time
s = time.time()
df.write.format('json')\
.mode('overwrite')\
.save('s3a://bucket-vincent/payment.json')
print(f'\nrun time:{time.time() - s} s')
run time:121.28353834152222 s
The copy to S3 is done with the AWS CLI command:
import subprocess as sub
s = time.time()
df.write.format('json')\
.mode('overwrite')\
.save('/home/ghost/workdata/payment.json')
s3_copy_cmd = [
'aws','s3','cp',
'/home/ghost/workdata/payment.json',
's3://bucket-vincent/payment.json',
'--recursive','--exclude','*.crc'
]
sub.run(s3_copy_cmd)
print(f'\nrun time:{time.time() - s} s')
run time:30.206376314163208 s
As you can see, the latter approach takes noticeably less time.