Spark Structured Streaming Tutorial 02 (Consuming JSON Data from Kafka)

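1 Prepare the Kafka data source

First, push the JSON record below to Kafka. It is a single simulated record. Once Structured Streaming reads it, it is treated as one row of an unbounded table. This table holds user access logs and has three fields: uid (user ID), timestamp (access time as a millisecond epoch timestamp), and agent (the User-Agent of the user's client).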

{
	"uid": "ef16382c8acce8ec",
	"timestamp": 1594983278059,
	"agent": "Mozilla/5.0 (Linux; Android 10; Redmi K30 5G Build/QKQ1.191222.002; wv) AppleWebKit/537.36 (KHTML, like Gecko) Version/4.0 Chrome/80.0.3987.99 Mobile Safari/537.36"
}
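
To get more than one record into the topic, a small generator along these lines can print several records in the same shape, one JSON object per line. This is only a sketch: the class name UserLogSamples, the one-second spacing, and the reuse of a single uid are assumptions made for illustration, and the output still has to be sent to Kafka with whatever producer you normally use.

```java
import java.util.ArrayList;
import java.util.List;

public class UserLogSamples {
    // Agent string copied from the sample record. It contains no quotes or
    // backslashes, so it can be embedded in JSON without escaping.
    static final String AGENT = "Mozilla/5.0 (Linux; Android 10; Redmi K30 5G Build/QKQ1.191222.002; wv) "
            + "AppleWebKit/537.36 (KHTML, like Gecko) Version/4.0 Chrome/80.0.3987.99 Mobile Safari/537.36";

    // Builds `count` single-line JSON records, one second apart, starting at startMillis.
    static List<String> generate(String uid, long startMillis, int count) {
        List<String> lines = new ArrayList<>();
        for (int i = 0; i < count; i++) {
            long ts = startMillis + i * 1000L;
            lines.add("{\"uid\":\"" + uid + "\",\"timestamp\":" + ts
                    + ",\"agent\":\"" + AGENT + "\"}");
        }
        return lines;
    }

    public static void main(String[] args) {
        // uid and start time are taken from the sample record above
        for (String line : generate("ef16382c8acce8ec", 1594983278059L, 5)) {
            System.out.println(line);
        }
    }
}
```

Each printed line is one Kafka message value. Keeping a record on a single line matters if you feed the output to a line-oriented producer.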

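2 The code

After pushing a few more JSON records like the one above to Kafka, we can start writing the Structured Streaming code, shown below. It is written in Groovy. If you don't know Groovy, just read it as Java without semicolons (;). To run it, create a file ending in .groovy inside the top1024b.etl package in IDEA, paste in the code below, and run it the same way you would run a Java class. The environment and dependencies (including the Groovy dependency) are all covered in the previous post.
Click here to read the previous post

Main code: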

package top1024b.etl

import groovy.transform.CompileStatic
import org.apache.spark.SparkConf
import org.apache.spark.sql.Dataset
import org.apache.spark.sql.Row
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.StreamingQuery

@CompileStatic
class Test02 {
    static void main(String[] args) throws Exception {
        SparkSession spark = SparkSession
                .builder().config(new SparkConf().setMaster("local[*]").set("spark.sql.shuffle.partitions", "1"))
                .appName("JavaStructuredNetworkWordCount")
                .getOrCreate()

        Dataset<Row> df = spark
                .readStream()
                .format("kafka")
                .option("kafka.bootstrap.servers", "192.168.0.1:9092")
                .option("subscribe", "user_log")
                .option("startingOffsets", "earliest")
                .load()

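        // DataSetSql registers the current Dataset as temp view "t" before each SQL step
        DataSetSql ds = new DataSetSql(spark, df)

        // \$ escapes the dollar sign so Groovy does not treat it as GString interpolation
        String sql = """
            SELECT
                get_json_object ( VALUE, '\$.uid' ) as uid,
                get_json_object ( VALUE, '\$.timestamp' ) as timestamp,
                get_json_object ( VALUE, '\$.agent' ) as agent
            FROM
                t
        """.toString().trim()

        // Step 1: cast the binary Kafka value to a string. Step 2: extract the three JSON fields.
        df = ds
                .exe("select CAST(value AS STRING) from t")
                .exe(sql)
                .get()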

        StreamingQuery query = df.writeStream()
                .outputMode("update")
                .format("console")
                .start()

        query.awaitTermination()
    }
}

Utility class:

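package top1024b.etl

import groovy.transform.CompileStatic
import org.apache.spark.sql.Dataset
import org.apache.spark.sql.Row
import org.apache.spark.sql.SparkSession

// Chains SQL statements: each exe() call runs against the result of the previous one
@CompileStatic
class DataSetSql {
    private Dataset<Row> ds
    private SparkSession spark

    // Parameters are typed so the assignments pass @CompileStatic type checking
    DataSetSql(SparkSession spark, Dataset<Row> ds) {
        this.ds = ds
        this.spark = spark
    }

    // Registers the current Dataset as temp view "t", then replaces it with the query result
    DataSetSql exe(String sql) {
        ds.createOrReplaceTempView("t")
        ds = spark.sql(sql)

        this
    }

    // Returns the Dataset produced by the last exe() call
    Dataset<Row> get() {
        ds
    }
}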

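After copying the code above, pay attention to the following points:

  • .option("kafka.bootstrap.servers", "192.168.0.1:9092"): change this to the actual IP and port of your Kafka broker; separate multiple brokers with commas
  • .option("subscribe", "user_log"): change this to the topic you pushed the data to; mine is user_log; separate multiple topics with commas
  • .option("startingOffsets", "earliest"): this starts reading from the earliest offset available in Kafka; to start from the newest offset instead, replace earliest with latest
  • @CompileStatic: this annotation makes Groovy compile the class statically, so the normally dynamic Groovy code runs at close to Java speed
  • .exe("select CAST(value AS STRING) from t"): by default, the Kafka source in Structured Streaming exposes the columns shown in the table below. key and value are byte arrays, so CAST(value AS STRING) is needed to turn value into a string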
+----+--------------------+---------+---------+-------+--------------------+-------------+
| key|               value|    topic|partition| offset|           timestamp|timestampType|
+----+--------------------+---------+---------+-------+--------------------+-------------+
|null|[7B 22 75 69 64 2...| user_log|        0|5109826|2020-07-20 18:20:...|            0|
+----+--------------------+---------+---------+-------+--------------------+-------------+
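
To see what that cast does to the value column, the bytes in the table can be decoded by hand. The sketch below is plain Java with no Spark involved, and the class name KafkaValueBytes is made up for illustration. It decodes only the five bytes that are fully visible in the truncated output (7B 22 75 69 64) as UTF-8 text, which is how Spark interprets binary data when casting it to a string.

```java
import java.nio.charset.StandardCharsets;

public class KafkaValueBytes {
    // Interprets a Kafka record value as UTF-8 text, mirroring the effect of
    // CAST(value AS STRING) on the binary value column.
    static String decode(byte[] value) {
        return new String(value, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // The five complete bytes shown in the value column: 7B 22 75 69 64
        byte[] prefix = {0x7B, 0x22, 0x75, 0x69, 0x64};
        System.out.println(decode(prefix)); // prints {"uid
    }
}
```

The decoded prefix is the start of the JSON record from section 1, which is why get_json_object can only be applied after the cast.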

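3 Run the program

Run Test02 and you should see output like the following. Each time new data arrives, the number after Batch: increases by 1 and the result table is printed to the console. I think I covered this in the previous post, but I'll repeat it here anyway.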

-------------------------------------------
Batch: 0
-------------------------------------------
Code generated in 8.759419 ms
+--------------------+-------------+--------------------+
|                 uid|    timestamp|               agent|
+--------------------+-------------+--------------------+
|     869068032689124|1595206351765|Mozilla/5.0 (Linu...|
|     869068032689124|1595206351855|Mozilla/5.0 (Linu...|
|     869068032689124|1595206352110|Mozilla/5.0 (Linu...|
|     869068032689124|1595206352592|Mozilla/5.0 (Linu...|
|     869068032689124|1595206352763|Mozilla/5.0 (Linu...|
|     869068032689124|1595206352841|Mozilla/5.0 (Linu...|
|     869068032689124|1595206354639|Mozilla/5.0 (Linu...|
|     869068032689124|1595206355869|Mozilla/5.0 (Linu...|
|     869068032689124|1595206361842|Mozilla/5.0 (Linu...|
|     869068032689124|1595206361943|Mozilla/5.0 (Linu...|
|     869068032689124|1595206362016|Mozilla/5.0 (Linu...|
|     869068032689124|1595206363860|Mozilla/5.0 (Linu...|
|     869068032689124|1595206364792|Mozilla/5.0 (Linu...|
|     869068032689124|1595206364879|Mozilla/5.0 (Linu...|
|C995DCAB-060A-433...|1595206421047|%E7%82%B9%E8%B4%A...|
|C995DCAB-060A-433...|1595206427094|%E7%82%B9%E8%B4%A...|
|C995DCAB-060A-433...|1595206429983|%E7%82%B9%E8%B4%A...|
|C995DCAB-060A-433...|1595206430600|%E7%82%B9%E8%B4%A...|
|     868144035543674|1595206639415|Mozilla/5.0 (Linu...|
|     868144035543674|1595206649778|Mozilla/5.0 (Linu...|
+--------------------+-------------+--------------------+
only showing top 20 rows
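
The timestamp column holds millisecond epoch values, and because get_json_object returns strings they arrive as text. If you want to check one by eye, a conversion like the sketch below is enough. It is plain Java outside Spark, the class name AccessTime is an assumption, and the value used is the one from the sample record in section 1.

```java
import java.time.Instant;

public class AccessTime {
    // Parses a millisecond epoch value given as text and returns it as a UTC instant.
    static Instant toInstant(String epochMillis) {
        return Instant.ofEpochMilli(Long.parseLong(epochMillis));
    }

    public static void main(String[] args) {
        // timestamp of the sample record pushed to Kafka in section 1
        System.out.println(toInstant("1594983278059")); // prints 2020-07-17T10:54:38.059Z
    }
}
```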

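4 Closing remarks

Official documentation for connecting Structured Streaming to Kafka (the "Structured Streaming + Kafka Integration Guide" in the Spark documentation):
Click here to visit

Don't keep holding it against me that I wrote this code in Groovy. Groovy is the best language in the world.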
