Real-time data arriving in Kafka can be consumed by Flume and saved to HDFS. For historical data already sitting in Kafka, however, Flume is no use; instead, Spark Structured Streaming can consume the topic's JSON data from the earliest offset and save it to HDFS in JSON form.
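Note: the Kafka source is not bundled with Spark itself, so pyspark (and later spark-submit) must be launched with the spark-sql-kafka connector on the classpath; a sketch, assuming a Spark 2.4 / Scala 2.11 build (match the coordinate to your own deployment):
pyspark --packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.4.5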
# Run pyspark to enter the interactive shell and execute the following code
# to inspect the JSON data read from Kafka.
# topic: dl_face; Kafka cluster: kafka1:9092,kafka2:9092,kafka3:9092
# startingOffsets: earliest means consume the topic from the beginning
lines=spark.readStream.format("kafka").option("kafka.bootstrap.servers","kafka1:9092,kafka2:9092,kafka3:9092").option("subscribe","dl_face").option("startingOffsets","earliest").load()
# Print the stream to the console
lines.writeStream.outputMode("update").format("console").start()
# Result (first 20 rows)
+----+--------------------+-------+---------+------+--------------------+-------------+
| key| value| topic|partition|offset| timestamp|timestampType|
+----+--------------------+-------+---------+------+--------------------+-------------+
|null|[7B 22 73 75 73 7...|dl_face| 4| 3|2020-05-06 19:17:...| 0|
|null|[7B 22 73 75 73 7...|dl_face| 4| 4|2020-05-06 19:17:...| 0|
|null|[7B 22 73 75 73 7...|dl_face| 4| 5|2020-05-06 19:18:...| 0|
|null|[7B 22 73 75 73 7...|dl_face| 4| 6|2020-05-06 19:18:...| 0|
|null|[7B 22 73 75 73 7...|dl_face| 4| 7|2020-05-06 19:18:...| 0|
|null|[7B 22 73 75 73 7...|dl_face| 4| 8|2020-05-06 19:18:...| 0|
|null|[7B 22 73 75 73 7...|dl_face| 4| 9|2020-05-06 19:18:...| 0|
|null|[7B 22 73 75 73 7...|dl_face| 4| 10|2020-05-06 19:18:...| 0|
|null|[7B 22 73 75 73 7...|dl_face| 4| 11|2020-05-06 19:18:...| 0|
|null|[7B 22 73 75 73 7...|dl_face| 4| 12|2020-05-06 19:18:...| 0|
|null|[7B 22 73 75 73 7...|dl_face| 4| 13|2020-05-06 19:18:...| 0|
|null|[7B 22 73 75 73 7...|dl_face| 4| 14|2020-05-06 19:18:...| 0|
|null|[7B 22 73 75 73 7...|dl_face| 4| 15|2020-05-06 19:18:...| 0|
|null|[7B 22 73 75 73 7...|dl_face| 4| 16|2020-05-06 19:18:...| 0|
|null|[7B 22 73 75 73 7...|dl_face| 4| 17|2020-05-06 19:19:...| 0|
|null|[7B 22 73 75 73 7...|dl_face| 4| 18|2020-05-06 19:19:...| 0|
|null|[7B 22 73 75 73 7...|dl_face| 4| 19|2020-05-06 20:29:...| 0|
|null|[7B 22 73 75 73 7...|dl_face| 4| 20|2020-05-06 20:30:...| 0|
|null|[7B 22 73 75 73 7...|dl_face| 4| 21|2020-05-06 20:30:...| 0|
|null|[7B 22 73 75 73 7...|dl_face| 4| 22|2020-05-06 20:30:...| 0|
+----+--------------------+-------+---------+------+--------------------+-------------+
# The JSON payload we need is held in the "value" column of the result above
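The value column holds the raw message bytes (7B 22 is ASCII for {"). To inspect readable JSON in the shell, cast value to a string before printing; a minimal sketch using the console sink's truncate option:
lines.selectExpr("CAST(value AS STRING) AS json").writeStream.outputMode("append").format("console").option("truncate", "false").start()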
#!/usr/bin/env python3
# -*- coding:utf-8 -*-
# @Time: 5/7/20 4:10 PM
# @Author: Damon
# @Software: PyCharm
from pyspark.sql import SparkSession, functions
from pyspark.sql.types import StructType, StructField, StringType, IntegerType, LongType
if __name__ == "__main__":
    # Create the SparkSession entry point
    spark = SparkSession.builder.appName("consumer_result").getOrCreate()
    spark.sparkContext.setLogLevel("WARN")
    # In the pyspark interactive shell you can start from here, after importing the required types.
    # Note: a JSON field declared as TimestampType() is automatically converted to a date when parsed
    # (and the unit may be taken as s rather than ms). Read such timestamp fields with LongType();
    # IntegerType() does not work, because a millisecond epoch overflows a 32-bit integer and parses as null.
    schema = StructType([StructField("suspectState", IntegerType(), True),
                         StructField("updatedTime", LongType(), True),
                         StructField("createdUserid", StringType(), True),
                         StructField("isDelete", IntegerType(), True),
                         StructField("keyState", IntegerType(), True),
                         StructField("cardType", StringType(), True),
                         StructField("facePathFull", StringType(), True),
                         StructField("facedbName", StringType(), True),
                         StructField("refFaceId", StringType(), True),
                         StructField("engineFaceId", StringType(), True),
                         StructField("facedbId", StringType(), True),
                         StructField("cardPathFull", StringType(), True),
                         StructField("cardId", StringType(), True),
                         StructField("createdTime", LongType(), True),
                         StructField("id", StringType(), True),
                         StructField("state", IntegerType(), True)])
    # Load the JSON data from the Kafka topic dl_face; kafka1:9092,kafka2:9092,kafka3:9092 are the
    # hostnames and ports of the deployed Kafka cluster.
    lines = spark.readStream.format("kafka") \
        .option("kafka.bootstrap.servers", "kafka1:9092,kafka2:9092,kafka3:9092") \
        .option("subscribe", "dl_face") \
        .option("startingOffsets", "earliest") \
        .load() \
        .selectExpr("CAST(value AS STRING)")
    # Parse the JSON string in the value column into a struct with the schema above, then select its fields.
    df = lines.select(functions.from_json(functions.col("value"), schema).alias("parse_value")) \
        .select("parse_value.suspectState", "parse_value.updatedTime", "parse_value.createdUserid",
                "parse_value.isDelete", "parse_value.keyState", "parse_value.cardType",
                "parse_value.facePathFull", "parse_value.facedbName", "parse_value.refFaceId",
                "parse_value.engineFaceId", "parse_value.facedbId", "parse_value.cardPathFull",
                "parse_value.cardId", "parse_value.createdTime", "parse_value.id",
                "parse_value.state")
    df.printSchema()
    # Debug: write the stream to the console to check that the parsed data is correct.
    # query = df \
    #     .writeStream.outputMode("update").format("console") \
    #     .start()
    # Both "checkpointLocation" and the output "path" must be specified; here both are HDFS paths.
    query = df.writeStream \
        .format("json") \
        .option("checkpointLocation", "/home/dl_data/dl_face/checkpoint") \
        .option("path", "/home/dl_data/dl_face/20200506") \
        .start()
    # Not needed in the interactive shell
    query.awaitTermination()
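Once the stream has committed a few batches, the output can be verified by reading the directory back in batch mode; a sketch, assuming the same output path as above:
# Read the JSON files written by the streaming query back as a batch DataFrame
out = spark.read.json("/home/dl_data/dl_face/20200506")
out.printSchema()
out.show(5, truncate=False)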
Because the JSON data contains the timestamp fields "updatedTime" and "createdTime", whose values are epoch times in ms, declaring these fields as TimestampType() makes Spark convert the parsed values to dates automatically, and the unit may be interpreted as s rather than ms. Declaring the timestamp fields as LongType works correctly; declaring them as IntegerType yields null for every value, because a millisecond epoch overflows a 32-bit integer.
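If a real timestamp column is needed downstream, the LongType field can be converted after parsing; a minimal sketch, assuming the value is a millisecond epoch (the derived column name is illustrative):
# Divide the ms epoch by 1000 and cast: Spark interprets the result as seconds since the epoch
df = df.withColumn("createdTime_ts", (functions.col("createdTime") / 1000).cast("timestamp"))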