Spark 2.2.1 使用JDBC 操作其他数据库的案例与解读
Spark SQL提供了一个JDBC数据源,可以通过JDBC从其他数据库读取数据。这种方式应优先于使用JdbcRDD,因为它直接返回DataFrame,便于在Spark SQL中进行处理,也可以很容易地和其他数据源进行Join操作。在Java或Python中使用JDBC数据源也更方便,因为它不需要用户提供ClassTag。(注意,这和Spark SQL JDBC服务器不同,后者允许其他应用程序通过Spark SQL执行查询。)
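下面先给出一段最简单的示意代码,展示如何把一张远程数据库的表通过JDBC读成DataFrame。其中的URL、用户名、密码和表名都是假设值,仅用于说明用法,并假设JDBC驱动已按后文方式加入CLASSPATH:
import java.util.Properties

// 连接属性:用户名、密码均为示例值
val props = new Properties()
props.setProperty("user", "root")
props.setProperty("password", "root")

// 将远程MySQL中的某张表(这里假设为test库的person表)读成DataFrame
val df = spark.read.jdbc("jdbc:mysql://localhost:3306/test", "person", props)
df.show()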
在执行Spark Shell或者Spark Submit命令的时候,需要通过--driver-class-path选项配置对应数据库的JDBC驱动路径。例如,在Spark Shell中连接MySQL数据库时,需要使用下面的选项:
--driver-class-path /usr/local/apache-hive-1.2.1/lib/mysql-connector-java-5.1.13-bin.jar
为了解决提交应用时Executor上经常找不到驱动包的问题,下面给出两种场景下的案例,帮助读者熟悉不同场景下提交应用时参数选项的使用方法。
(一) 集群中所有节点部署相同,并在集群上的某一节点启动Driver Program
这种场景下,所有节点上都部署了Spark和Hive,部署路径相同,且驱动Jar包位于Hive的lib目录下。此时可以将Hive的lib目录下的驱动Jar包以绝对路径的形式添加到--driver-class-path中,这样每个节点都能在该绝对路径下的CLASSPATH中找到驱动类。
将Master节点上的mysql-connector-java-5.1.13-bin.jar分发到Worker1、Worker2、Worker3各个节点,否则运行时Worker节点上会报找不到驱动类的异常:“Caused by: java.lang.ClassNotFoundException: com.mysql.jdbc.Driver”。这里将mysql-connector-java-5.1.13-bin.jar分发到各节点的/usr/local/spark-2.2.1-bin-hadoop2.6/jars/目录下。
root@master:~# scp -rq /usr/local/apache-hive-1.2.1/ [email protected]:/usr/local/apache-hive-1.2.1/
root@master:~# scp -rq /usr/local/apache-hive-1.2.1/ [email protected]:/usr/local/apache-hive-1.2.1/
root@master:~# scp -rq /usr/local/apache-hive-1.2.1/ [email protected]:/usr/local/apache-hive-1.2.1/
root@worker1:~# ls -ltr /usr/local/apache-hive-1.2.1/lib/mysql-connector-java-5.1.13-bin.jar
-rw-r--r-- 1 root root 767492 Feb 20 10:06 /usr/local/apache-hive-1.2.1/lib/mysql-connector-java-5.1.13-bin.jar
root@worker1:~# cp /usr/local/apache-hive-1.2.1/lib/mysql-connector-java-5.1.13-bin.jar /usr/local/spark-2.2.1-bin-hadoop2.6/jars/
……
以Spark Shell中连接MySQL数据库为例,对应的命令为:
root@master:~# spark-shell --master spark://192.168.189.1:7077 \
> --driver-class-path /usr/local/apache-hive-1.2.1/lib/mysql-connector-java-5.1.13-bin.jar \
> --executor-memory 512m --total-executor-cores 4
1) 登录MySQL,在MySQL数据库中查看Hive数据库的表。
root@master:~# mysql -uroot -proot
Welcome to the MySQL monitor. Commands end with ; or \g.
Your MySQL connection id is 38
Server version: 5.5.47-0ubuntu0.14.04.1 (Ubuntu)
……
查询MySQL中的数据库:
mysql> show databases;
+--------------------+
| Database           |
+--------------------+
| information_schema |
| hive               |
| mysql              |
| performance_schema |
| spark              |
| sparkstreaming     |
+--------------------+
6 rows in set (0.31 sec)
使用hive数据库。
mysql> use hive;
Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A
Database changed
查询hive数据库中的表。
mysql> show tables;
+---------------------------+
| Tables_in_hive            |
+---------------------------+
| BUCKETING_COLS            |
| CDS                       |
| COLUMNS_V2                |
| DATABASE_PARAMS           |
| DBS                       |
| FUNCS                     |
| FUNC_RU                   |
| GLOBAL_PRIVS              |
| IDXS                      |
| INDEX_PARAMS              |
| PARTITIONS                |
| PARTITION_KEYS            |
| PARTITION_KEY_VALS        |
| PARTITION_PARAMS          |
| PART_COL_PRIVS            |
| PART_COL_STATS            |
| PART_PRIVS                |
| ROLES                     |
| SDS                       |
| SD_PARAMS                 |
| SEQUENCE_TABLE            |
| SERDES                    |
| SERDE_PARAMS              |
| SKEWED_COL_NAMES          |
| SKEWED_COL_VALUE_LOC_MAP  |
| SKEWED_STRING_LIST        |
| SKEWED_STRING_LIST_VALUES |
| SKEWED_VALUES             |
| SORT_COLS                 |
| TABLE_PARAMS              |
| TAB_COL_STATS             |
| TBLS                      |
| TBL_COL_PRIVS             |
| TBL_PRIVS                 |
| VERSION                   |
+---------------------------+
35 rows in set (0.00 sec)
2) 启动Spark-Shell,通过Spark SQL查询第三方数据库MySQL中的表。
root@master:~# spark-shell --master spark://192.168.189.1:7077 \
> --driver-class-path /usr/local/apache-hive-1.2.1/lib/mysql-connector-java-5.1.13-bin.jar \
> --executor-memory 512m --total-executor-cores 4
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
18/02/20 09:31:19 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Spark context Web UI available at http://master:4040
Spark context available as 'sc' (master = spark://192.168.189.1:7077, app id = app-20180220093128-0000).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.2.1
3) 在Spark 2.2.1中使用spark.sqlContext.read.jdbc方式加载第三方JDBC数据源MySQL中hive数据库的TBLS表。
scala> import java.util.Properties
import java.util.Properties
scala> val jdbcDF = spark.sqlContext.read.jdbc("jdbc:mysql://192.168.189.1:3306/hive?user=root&password=root", "TBLS", new Properties)
18/02/20 14:21:40 WARN metastore.ObjectStore: Failed to get database global_temp, returning NoSuchObjectException
jdbcDF: org.apache.spark.sql.DataFrame = [TBL_ID: bigint, CREATE_TIME: int ... 9 more fields]
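加载完成后,可以先查看一下Spark SQL从JDBC表推断出的Schema(示意代码,输出的具体字段以实际的TBLS表结构为准):
// 查看JDBC表映射成DataFrame后的Schema
jdbcDF.printSchema()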
4) 将jdbcDF注册为临时表table。从临时表table中查询记录。
scala> jdbcDF.createOrReplaceTempView("table")
scala> spark.sql("select * from table")
res2: org.apache.spark.sql.DataFrame = [TBL_ID: bigint, CREATE_TIME: int ... 9 more fields]
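注意spark.sql返回的仍然是一个DataFrame,只有触发Action才会真正执行查询。下面是一段继续查询临时表并显示结果的示意代码(列名以TBLS表的实际结构为准):
// 对上文注册的临时表table执行查询,并显示结果
spark.sql("select TBL_ID, TBL_NAME, TBL_TYPE from table").show()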
使用数据源API,可以将远程数据库的表装载成一个DataFrame或Spark SQL的临时表。支持的选项如表3-3所示:
属性名 | 含义
url | 要连接的JDBC URL。
dbtable | 需要读取的JDBC表。注意,SQL查询语句中'FROM'子句里任何有效的内容都可以作为该选项的值,例如可以不直接使用整个表,而是使用括号括起来的子查询。
driver | 连接该URL所需的JDBC驱动类名。在执行JDBC命令让驱动注册到JDBC子系统之前,Master和Worker节点都需要能够加载该类。
partitionColumn, lowerBound, upperBound, numPartitions | 这些选项如果指定其中任何一个,就必须全部同时指定。它们描述了从多个Worker并行读取数据时如何对表进行分区,其中partitionColumn必须是所查询表中的数值型列。
表3-3 数据库连接选项
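下面给出一段使用表3-3中分区选项进行并行读取的示意代码。其中的分区列、上下界和分区数均为假设值,实际使用时应根据表中数值列的取值范围来设置:
import java.util.Properties

val props = new Properties()
props.setProperty("user", "root")
props.setProperty("password", "root")

// 按数值列TBL_ID把读取任务切分成4个分区,由多个Executor并行读取;
// lowerBound/upperBound只用于计算各分区的边界,不会过滤数据
val partitionedDF = spark.read.jdbc(
  "jdbc:mysql://192.168.189.1:3306/hive",
  "TBLS",
  "TBL_ID",   // partitionColumn,必须是数值型列
  1L,         // lowerBound(假设值)
  200L,       // upperBound(假设值)
  4,          // numPartitions
  props)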
(二) 在集群上的某一节点启动Driver Program
这种场景下,需要保证两点:一是Driver Program能找到驱动类,二是执行任务的Executor能找到驱动类。
这里仍使用第一种情况下的JDBC表作为测试表。
1) 启动Spark-Shell。
root@master:~# spark-shell --master spark://192.168.189.1:7077 \
> --driver-class-path /usr/local/apache-hive-1.2.1/lib/mysql-connector-java-5.1.13-bin.jar \
> --jars /usr/local/apache-hive-1.2.1/lib/mysql-connector-java-5.1.13-bin.jar \
> --executor-memory 512m --total-executor-cores 4
命令参数说明:
- 通过--driver-class-path选项,将Driver Program所在节点上的驱动Jar包路径添加到CLASSPATH中。由于此时默认以Client部署模式提交,Driver Program在提交节点上运行,因此这里可以使用相对路径;
- 通过--jars选项,上传Executor需要使用的Jar包。上传的Jar包会自动添加到执行节点的CLASSPATH中,因此Executor执行时可以找到驱动类,不需要再手动添加到CLASSPATH上。这里同样把本地的驱动Jar包作为--jars参数的值。
注:通过--jars上传的Jar包,会在任务执行时由各执行节点自动下载。
2) 使用spark.sqlContext.load方式加载第三方JDBC数据源MySQL中hive数据库的TBLS表。
scala> val jdbcDF = spark.sqlContext.load("jdbc", Map(
     |   "url" -> "jdbc:mysql://192.168.189.1:3306/hive?user=root&password=root",
     |   "dbtable" -> "TBLS"))
warning: there was one deprecation warning; re-run with -deprecation for details
18/02/20 14:41:15 WARN metastore.ObjectStore: Failed to get database global_temp, returning NoSuchObjectException
jdbcDF: org.apache.spark.sql.DataFrame = [TBL_ID: bigint, CREATE_TIME: int ... 9 more fields]
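上面的输出中出现了deprecation warning,这是因为sqlContext.load在Spark 2.x中已被标记为过时。作为参考,下面给出一段等价的推荐写法的示意代码(URL与表名沿用本例环境):
// 使用DataFrameReader的format/option方式读取JDBC数据源(Spark 2.x推荐写法)
val jdbcDF2 = spark.read
  .format("jdbc")
  .option("url", "jdbc:mysql://192.168.189.1:3306/hive?user=root&password=root")
  .option("dbtable", "TBLS")
  .load()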
3) 显示jdbc表的内容。
scala> jdbcDF.show
+------+-----------+-----+----------------+-----+---------+------+--------------+-------------+------------------+------------------+
|TBL_ID|CREATE_TIME|DB_ID|LAST_ACCESS_TIME|OWNER|RETENTION| SD_ID|      TBL_NAME|     TBL_TYPE|VIEW_EXPANDED_TEXT|VIEW_ORIGINAL_TEXT|
+------+-----------+-----+----------------+-----+---------+------+--------------+-------------+------------------+------------------+
......
|   151| 1519025857|    6|               0| root|        0|100581|         pokes|MANAGED_TABLE|              null|              null|
|   156| 1519026222|    6|               0| root|        0|100586|    pokes_test|MANAGED_TABLE|              null|              null|
+------+-----------+-----+----------------+-----+---------+------+--------------+-------------+------------------+------------------+
扩展:这里使用Spark Shell,只支持Client部署模式。如果使用Spark Submit方式提交Spark应用程序,则可以使用Cluster部署模式,此时Driver Program会由Master(Standalone集群)或ResourceManager(Spark on YARN)负责调度,可以去掉针对本地Driver Program的CLASSPATH设置,即去掉--driver-class-path选项,--jars上传的驱动Jar包会自动添加到实际运行节点上Driver Program的CLASSPATH中。--jars指定的Jar包会随应用一起上传,如果这种应用场景比较常用,建议改用配置属性,将驱动类的Jar包预先部署到集群各节点上。
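下面是一段通过配置属性预先部署驱动Jar包的示意配置,写在conf/spark-defaults.conf中;路径仅为示例,需与各节点上驱动Jar包的实际路径一致:
# conf/spark-defaults.conf(示例配置,路径需根据实际环境调整)
spark.driver.extraClassPath    /usr/local/apache-hive-1.2.1/lib/mysql-connector-java-5.1.13-bin.jar
spark.executor.extraClassPath  /usr/local/apache-hive-1.2.1/lib/mysql-connector-java-5.1.13-bin.jar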
当执行Spark应用程序时,如果出现ClassNotFoundException、连接建立失败,或者SQL语句解析异常等情况,可以从以下两点进行故障排除:
1) JDBC驱动程序类对客户端会话和所有Executor的初始类加载器必须是可见的。这是因为Java的DriverManager类会进行安全检查,在建立连接时会忽略所有对初始类加载器不可见的驱动类。一个方便的方法是修改所有Worker节点上的compute_classpath.sh,使其包含驱动Jar包;包含驱动Jar包时,需要注意设置的路径应该是绝对路径(全路径)。
例如,在SPARK_CLASSPATH中添加Jar包依赖时,必须使用实际执行节点上存在的路径;如果路径设置错误,会因为找不到驱动类而抛出异常。异常信息可能包含以下内容:
scala> import java.util.Properties
import java.util.Properties
scala> val jdbcDF = spark.sqlContext.read.jdbc("jdbc:mysql://192.168.189.1:3306/hive?user=root&password=root", "TBLS", new Properties)
18/02/20 14:02:17 WARN metastore.ObjectStore: Failed to get database global_temp, returning NoSuchObjectException
jdbcDF: org.apache.spark.sql.DataFrame = [TBL_ID: bigint, CREATE_TIME: int ... 9 more fields]
scala> jdbcDF.show
18/02/20 14:03:47 WARN scheduler.TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0, worker2, executor 2): java.lang.ClassNotFoundException: com.mysql.jdbc.Driver
at org.apache.spark.repl.ExecutorClassLoader.findClass(ExecutorClassLoader.scala:82)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
at org.apache.spark.sql.execution.datasources.jdbc.DriverRegistry$.register(DriverRegistry.scala:38)
at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$createConnectionFactory$1.apply(JdbcUtils.scala:53)
at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$createConnectionFactory$1.apply(JdbcUtils.scala:52)
at org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD.compute(JDBCRDD.scala:286)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:108)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:338)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.ClassNotFoundException: com.mysql.jdbc.Driver
at java.lang.ClassLoader.findClass(ClassLoader.java:530)
at org.apache.spark.util.ParentClassLoader.findClass(ParentClassLoader.scala:26)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at org.apache.spark.util.ParentClassLoader.loadClass(ParentClassLoader.scala:34)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
at org.apache.spark.util.ParentClassLoader.loadClass(ParentClassLoader.scala:30)
at org.apache.spark.repl.ExecutorClassLoader.findClass(ExecutorClassLoader.scala:77)
... 20 more
18/02/20 14:03:59 WARN scheduler.TaskSetManager: Lost task 0.3 in stage 0.0 (TID 3, worker1, executor 0): java.lang.ClassNotFoundException: com.mysql.jdbc.Driver
at org.apache.spark.repl.ExecutorClassLoader.findClass(ExecutorClassLoader.scala:82)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
at org.apache.spark.sql.execution.datasources.jdbc.DriverRegistry$.register(DriverRegistry.scala:38)
at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$createConnectionFactory$1.apply(JdbcUtils.scala:53)
at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$createConnectionFactory$1.apply(JdbcUtils.scala:52)
at org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD.compute(JDBCRDD.scala:286)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:108)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:338)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.ClassNotFoundException: com.mysql.jdbc.Driver
at java.lang.ClassLoader.findClass(ClassLoader.java:530)
at org.apache.spark.util.ParentClassLoader.findClass(ParentClassLoader.scala:26)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at org.apache.spark.util.ParentClassLoader.loadClass(ParentClassLoader.scala:34)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
at org.apache.spark.util.ParentClassLoader.loadClass(ParentClassLoader.scala:30)
at org.apache.spark.repl.ExecutorClassLoader.findClass(ExecutorClassLoader.scala:77)
... 20 more
18/02/20 14:03:59 ERROR scheduler.TaskSetManager: Task 0 in stage 0.0 failed 4 times; aborting job
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 (TID 3, worker1, executor 0): java.lang.ClassNotFoundException: com.mysql.jdbc.Driver
at org.apache.spark.repl.ExecutorClassLoader.findClass(ExecutorClassLoader.scala:82)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
at org.apache.spark.sql.execution.datasources.jdbc.DriverRegistry$.register(DriverRegistry.scala:38)
at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$createConnectionFactory$1.apply(JdbcUtils.scala:53)
at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$createConnectionFactory$1.apply(JdbcUtils.scala:52)
at org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD.compute(JDBCRDD.scala:286)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:108)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:338)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.ClassNotFoundException: com.mysql.jdbc.Driver
at java.lang.ClassLoader.findClass(ClassLoader.java:530)
at org.apache.spark.util.ParentClassLoader.findClass(ParentClassLoader.scala:26)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at org.apache.spark.util.ParentClassLoader.loadClass(ParentClassLoader.scala:34)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
at org.apache.spark.util.ParentClassLoader.loadClass(ParentClassLoader.scala:30)
at org.apache.spark.repl.ExecutorClassLoader.findClass(ExecutorClassLoader.scala:77)
... 20 more
Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1517)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1505)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1504)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1504)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:814)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:814)
at scala.Option.foreach(Option.scala:257)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:814)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1732)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1687)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1676)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:630)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2029)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2050)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2069)
at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:336)
at org.apache.spark.sql.execution.CollectLimitExec.executeCollect(limit.scala:38)
at org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$collectFromPlan(Dataset.scala:2861)
at org.apache.spark.sql.Dataset$$anonfun$head$1.apply(Dataset.scala:2150)
at org.apache.spark.sql.Dataset$$anonfun$head$1.apply(Dataset.scala:2150)
at org.apache.spark.sql.Dataset$$anonfun$55.apply(Dataset.scala:2842)
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:65)
at org.apache.spark.sql.Dataset.withAction(Dataset.scala:2841)
at org.apache.spark.sql.Dataset.head(Dataset.scala:2150)
at org.apache.spark.sql.Dataset.take(Dataset.scala:2363)
at org.apache.spark.sql.Dataset.showString(Dataset.scala:241)
at org.apache.spark.sql.Dataset.show(Dataset.scala:637)
at org.apache.spark.sql.Dataset.show(Dataset.scala:596)
at org.apache.spark.sql.Dataset.show(Dataset.scala:605)
... 48 elided
Caused by: java.lang.ClassNotFoundException: com.mysql.jdbc.Driver
at org.apache.spark.repl.ExecutorClassLoader.findClass(ExecutorClassLoader.scala:82)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
at org.apache.spark.sql.execution.datasources.jdbc.DriverRegistry$.register(DriverRegistry.scala:38)
at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$createConnectionFactory$1.apply(JdbcUtils.scala:53)
at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$createConnectionFactory$1.apply(JdbcUtils.scala:52)
at org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD.compute(JDBCRDD.scala:286)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:108)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:338)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.ClassNotFoundException: com.mysql.jdbc.Driver
at java.lang.ClassLoader.findClass(ClassLoader.java:530)
at org.apache.spark.util.ParentClassLoader.findClass(ParentClassLoader.scala:26)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at org.apache.spark.util.ParentClassLoader.loadClass(ParentClassLoader.scala:34)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
at org.apache.spark.util.ParentClassLoader.loadClass(ParentClassLoader.scala:30)
at org.apache.spark.repl.ExecutorClassLoader.findClass(ExecutorClassLoader.scala:77)
... 20 more
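补充说明:除了把驱动Jar包加入各节点的CLASSPATH外,还可以在读取时通过表3-3中的driver选项显式指定驱动类名,避免依赖DriverManager的自动发现;前提仍然是驱动Jar包本身能够被Driver和Executor加载到。下面仅为示意代码:
import java.util.Properties

// 通过连接属性显式指定JDBC驱动类名(URL与表名沿用上文环境)
val props = new Properties()
props.setProperty("driver", "com.mysql.jdbc.Driver")
val jdbcDF = spark.read.jdbc(
  "jdbc:mysql://192.168.189.1:3306/hive?user=root&password=root", "TBLS", props)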
2) 一些数据库,如H2,会将所有名称转换为大写。在Spark SQL中,需要使用大写来引用这些名称。
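例如,假设CLASSPATH中已包含H2的JDBC驱动,且H2中存在一张建表时未加引号的people表(下面的URL、表名和列名均为假设值,仅作示意):
// H2会把未加引号的标识符统一转成大写,因此这里用大写的"PEOPLE"引用该表
val h2DF = spark.read.jdbc("jdbc:h2:tcp://localhost/~/test", "PEOPLE", new java.util.Properties)
h2DF.createOrReplaceTempView("people")
// 列名同样需要使用大写来引用
spark.sql("SELECT NAME FROM people").show()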