**Outline:** You can use the Pulsar Spark Connector to read data from Pulsar and write the results back to Pulsar. This article walks through how to use the Pulsar Spark Connector.
The Pulsar Spark Connector was open-sourced on July 9, 2019; the source code and user guide are available here.
The following examples use the Homebrew package manager to download and install the software on macOS; you can use another package manager depending on your needs and operating system.
```bash
# Install Homebrew
/usr/bin/ruby -e "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/master/install)"
# Install AdoptOpenJDK 8
brew tap adoptopenjdk/openjdk
brew cask install adoptopenjdk8
# Extract Spark 2.4.3, then download and extract Pulsar 2.4.0
tar xvfz spark-2.4.3-bin-hadoop2.7.tgz
wget https://archive.apache.org/dist/pulsar/pulsar-2.4.0/apache-pulsar-2.4.0-bin.tar.gz
tar xvfz apache-pulsar-2.4.0-bin.tar.gz
# Install Maven
brew install maven
```
(1) Use the archetype provided by the Scala Maven Plugin to scaffold a Scala project.

```bash
mvn archetype:generate
```
In the list that appears, select the latest version of net.alchim31.maven:scala-archetype-simple (currently 1.7), then specify a groupId, artifactId, and version for the new project. This example uses:

```
groupId: com.example
artifactId: connector-test
version: 1.0-SNAPSHOT
```
After these steps, the skeleton of a Maven-based Scala project is in place.
(2) In _pom.xml_ at the project root, add the Spark and Pulsar Spark Connector dependencies, and use _maven-shade-plugin_ to package the project.
a. Define the version information for the dependencies.

```xml
<properties>
  <maven.compiler.source>1.8</maven.compiler.source>
  <maven.compiler.target>1.8</maven.compiler.target>
  <encoding>UTF-8</encoding>
  <scala.version>2.11.12</scala.version>
  <scala.compat.version>2.11</scala.compat.version>
  <spark.version>2.4.3</spark.version>
  <pulsar-spark-connector.version>2.4.0</pulsar-spark-connector.version>
  <spec2.version>4.2.0</spec2.version>
  <maven-shade-plugin.version>3.1.0</maven-shade-plugin.version>
</properties>
```
b. Add the Spark and Pulsar Spark Connector dependencies.

```xml
<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-core_${scala.compat.version}</artifactId>
  <version>${spark.version}</version>
  <scope>provided</scope>
</dependency>
<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-sql_${scala.compat.version}</artifactId>
  <version>${spark.version}</version>
  <scope>provided</scope>
</dependency>
<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-catalyst_${scala.compat.version}</artifactId>
  <version>${spark.version}</version>
  <scope>provided</scope>
</dependency>
<dependency>
  <groupId>io.streamnative.connectors</groupId>
  <artifactId>pulsar-spark-connector_${scala.compat.version}</artifactId>
  <version>${pulsar-spark-connector.version}</version>
</dependency>
```
c. Add the Maven repository that hosts _pulsar-spark-connector_.
```xml
<repositories>
  <repository>
    <id>central</id>
    <layout>default</layout>
    <url>https://repo1.maven.org/maven2</url>
  </repository>
  <repository>
    <id>bintray-streamnative-maven</id>
    <name>bintray</name>
    <url>https://dl.bintray.com/streamnative/maven</url>
  </repository>
</repositories>
```
d. Use _maven-shade-plugin_ to package the example classes together with _pulsar-spark-connector_.
```xml
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-shade-plugin</artifactId>
  <version>${maven-shade-plugin.version}</version>
  <executions>
    <execution>
      <phase>package</phase>
      <goals>
        <goal>shade</goal>
      </goals>
      <configuration>
        <createDependencyReducedPom>true</createDependencyReducedPom>
        <promoteTransitiveDependencies>true</promoteTransitiveDependencies>
        <minimizeJar>false</minimizeJar>
        <artifactSet>
          <includes>
            <include>io.streamnative.connectors:*</include>
          </includes>
        </artifactSet>
        <filters>
          <filter>
            <artifact>*:*</artifact>
            <excludes>
              <exclude>META-INF/*.SF</exclude>
              <exclude>META-INF/*.DSA</exclude>
              <exclude>META-INF/*.RSA</exclude>
            </excludes>
          </filter>
        </filters>
        <transformers>
          <transformer implementation="org.apache.maven.plugins.shade.resource.ServicesResourceTransformer" />
          <transformer implementation="org.apache.maven.plugins.shade.resource.PluginXmlResourceTransformer" />
        </transformers>
      </configuration>
    </execution>
  </executions>
</plugin>
```
The example project includes the following programs. StreamRead reads data from Pulsar as a stream and prints it to the console:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession
  .builder()
  .appName("data-read")
  .config("spark.cores.max", 2)
  .getOrCreate()

val ds = spark.readStream
  .format("pulsar")
  .option("service.url", "pulsar://localhost:6650")
  .option("admin.url", "http://localhost:8088")
  .option("topic", "topic-test")
  .load()

ds.printSchema() // Print the schema of topic-test to verify the read succeeded

val query = ds.writeStream
  .outputMode("append")
  .format("console")
  .start()

query.awaitTermination()
```
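Beyond a single `topic`, the connector's user guide also describes `topics`/`topicsPattern` options for subscribing to multiple topics and a `startingOffsets` option for choosing where consumption begins. A hedged sketch, assuming the same local cluster (the topic pattern below is illustrative):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession
  .builder()
  .appName("data-read-pattern")
  .config("spark.cores.max", 2)
  .getOrCreate()

// Subscribe to every topic matching a pattern, starting from the
// earliest available message instead of only newly arriving ones.
val ds = spark.readStream
  .format("pulsar")
  .option("service.url", "pulsar://localhost:6650")
  .option("admin.url", "http://localhost:8088")
  .option("topicsPattern", "persistent://public/default/topic-.*")
  .option("startingOffsets", "earliest")
  .load()
```

Option names are taken from the connector's guide for version 2.4.0; treat them as assumptions if your version differs.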
BatchWrite writes a batch of data to Pulsar:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession
  .builder()
  .appName("data-sink")
  .config("spark.cores.max", 2)
  .getOrCreate()

import spark.implicits._

spark.createDataset(1 to 10)
  .write
  .format("pulsar")
  .option("service.url", "pulsar://localhost:6650")
  .option("admin.url", "http://localhost:8088")
  .option("topic", "topic-test")
  .save()
```
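Besides a batch write, a streaming DataFrame can also be written back to Pulsar by using `pulsar` as the `writeStream` format. A minimal sketch, assuming a streaming DataFrame `ds` such as the one built in StreamRead; the target topic `topic-out` and the checkpoint path are illustrative names, not part of the original project:

```scala
// Streaming sink sketch: continuously forward a streaming DataFrame
// into another Pulsar topic. Structured Streaming requires a
// checkpointLocation for non-console sinks.
val sink = ds.writeStream
  .format("pulsar")
  .option("service.url", "pulsar://localhost:6650")
  .option("admin.url", "http://localhost:8088")
  .option("topic", "topic-out")
  .option("checkpointLocation", "/tmp/pulsar-sink-checkpoint")
  .start()

sink.awaitTermination()
```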
First, configure and start single-node Spark and Pulsar clusters; next, package the example project and submit the two jobs with spark-submit; finally, observe the execution results.
1. Set the Spark log level to WARN.

```bash
cd ${spark.dir}/conf
cp log4j.properties.template log4j.properties
```

In a text editor, change the log level to WARN:

```
log4j.rootCategory=WARN, console
```
2. Start the Spark cluster.

```bash
cd ${spark.dir}
sbin/start-all.sh
```
3. Change the Pulsar WebService port to 8088 (edit ${pulsar.dir}/conf/standalone.conf) to avoid a port conflict with Spark.

```
webServicePort=8088
```
4. Start the Pulsar cluster.

```bash
bin/pulsar standalone
```
5. Package the example project.

```bash
cd ${connector_test.dir}
mvn package
```
6. Submit the StreamRead job, then the BatchWrite job, with spark-submit.

```bash
${spark.dir}/bin/spark-submit --class com.example.StreamRead --master spark://localhost:7077 ${connector_test.dir}/target/connector-test-1.0-SNAPSHOT.jar
${spark.dir}/bin/spark-submit --class com.example.BatchWrite --master spark://localhost:7077 ${connector_test.dir}/target/connector-test-1.0-SNAPSHOT.jar
```
StreamRead first prints the schema of topic-test, confirming the read succeeded:

```
root
 |-- value: integer (nullable = false)
 |-- __key: binary (nullable = true)
 |-- __topic: string (nullable = true)
 |-- __messageId: binary (nullable = true)
 |-- __publishTime: timestamp (nullable = true)
 |-- __eventTime: timestamp (nullable = true)
```
Before BatchWrite runs, topic-test contains no data, so Batch 0 is empty; once BatchWrite has written the numbers 1 through 10, they appear in Batch 1:

```
Batch: 0
+-----+-----+-------+-----------+-------------+-----------+
|value|__key|__topic|__messageId|__publishTime|__eventTime|
+-----+-----+-------+-----------+-------------+-----------+
+-----+-----+-------+-----------+-------------+-----------+

Batch: 1
+-----+-----+--------------------+--------------------+--------------------+-----------+
|value|__key|             __topic|         __messageId|       __publishTime|__eventTime|
+-----+-----+--------------------+--------------------+--------------------+-----------+
|    6| null|persistent://publ...|[08 86 01 10 02 2...|2019-07-08 14:51:...|       null|
|    7| null|persistent://publ...|[08 86 01 10 02 2...|2019-07-08 14:51:...|       null|
|    8| null|persistent://publ...|[08 86 01 10 02 2...|2019-07-08 14:51:...|       null|
|    9| null|persistent://publ...|[08 86 01 10 02 2...|2019-07-08 14:51:...|       null|
|   10| null|persistent://publ...|[08 86 01 10 02 2...|2019-07-08 14:51:...|       null|
|    1| null|persistent://publ...|[08 86 01 10 03 2...|2019-07-08 14:51:...|       null|
|    2| null|persistent://publ...|[08 86 01 10 03 2...|2019-07-08 14:51:...|       null|
|    3| null|persistent://publ...|[08 86 01 10 03 2...|2019-07-08 14:51:...|       null|
|    4| null|persistent://publ...|[08 86 01 10 03 2...|2019-07-08 14:51:...|       null|
|    5| null|persistent://publ...|[08 86 01 10 03 2...|2019-07-08 14:51:...|       null|
+-----+-----+--------------------+--------------------+--------------------+-----------+
```
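As the output shows, each row carries the payload in `value` plus Pulsar metadata columns (`__key`, `__topic`, `__messageId`, `__publishTime`, `__eventTime`). When only the payload matters, it can be projected out; a sketch, continuing from the streaming DataFrame `ds` defined in StreamRead:

```scala
// Keep only the payload column, dropping the Pulsar metadata columns
// __key, __topic, __messageId, __publishTime and __eventTime.
val values = ds.select("value")
```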
At this point, we have set up Pulsar and Spark clusters, scaffolded the example project, used the Pulsar Spark Connector to read Pulsar data into Spark and write Spark data back to Pulsar, and submitted the programs to verify the results.
For the complete example program, see here.