Spark 2.2.1: Using JDBC to Access Other Databases — Case Study and Explanation

Spark SQL includes a data source that can read data from other databases over JDBC. This approach is preferred over JdbcRDD because it returns the result directly as a DataFrame, which is easy to process in Spark SQL and to join with other data sources. The JDBC data source is also easier to use from Java or Python because it does not require the user to provide a ClassTag. (Note that this is different from the Spark SQL JDBC server, which allows other applications to run queries through Spark SQL.)

When running spark-shell or spark-submit, the path to the JDBC driver for the target database must be supplied with --driver-class-path. For example, to connect to MySQL from the Spark shell, use the following option:

--driver-class-path /usr/local/apache-hive-1.2.1/lib/mysql-connector-java-5.1.13-bin.jar

 

A common problem when submitting an application is that the executors cannot find the driver JAR. The two cases below show how to use the submit options in different scenarios to avoid this.

(I) All nodes in the cluster have the same deployment, and the Driver Program is started on one of the cluster nodes

In this scenario, Spark and Hive are installed on every node under the same paths, and the driver class is in Hive's lib directory. The driver JAR from Hive's lib directory can therefore be added to --driver-class-path using its absolute path, because that same absolute path resolves to the driver class on every node's CLASSPATH.

Distribute mysql-connector-java-5.1.13-bin.jar from the Master node to Worker1, Worker2 and Worker3; otherwise tasks on the worker nodes fail with "Caused by: java.lang.ClassNotFoundException: com.mysql.jdbc.Driver". Copy mysql-connector-java-5.1.13-bin.jar into /usr/local/spark-2.2.1-bin-hadoop2.6/jars/ on each node.

root@master:~# scp -rq /usr/local/apache-hive-1.2.1/ [email protected]:/usr/local/apache-hive-1.2.1/

root@master:~# scp -rq /usr/local/apache-hive-1.2.1/ [email protected]:/usr/local/apache-hive-1.2.1/

root@master:~# scp -rq /usr/local/apache-hive-1.2.1/ [email protected]:/usr/local/apache-hive-1.2.1/

root@worker1:~# ls -ltr /usr/local/apache-hive-1.2.1/lib/mysql-connector-java-5.1.13-bin.jar
-rw-r--r-- 1 root root 767492 Feb 20 10:06 /usr/local/apache-hive-1.2.1/lib/mysql-connector-java-5.1.13-bin.jar

root@worker1:~# cp /usr/local/apache-hive-1.2.1/lib/mysql-connector-java-5.1.13-bin.jar /usr/local/spark-2.2.1-bin-hadoop2.6/jars/

…….

 

Taking a MySQL connection from the Spark shell as an example, the command is:

root@master:~# spark-shell --master spark://192.168.189.1:7077 \
>     --driver-class-path /usr/local/apache-hive-1.2.1/lib/mysql-connector-java-5.1.13-bin.jar \
>     --executor-memory 512m --total-executor-cores 4

 

1) Log in to MySQL and look at the tables of the hive database.

root@master:~# mysql -uroot -proot
Welcome to the MySQL monitor.  Commands end with ; or \g.
Your MySQL connection id is 38
Server version: 5.5.47-0ubuntu0.14.04.1 (Ubuntu)

……

Query the databases in MySQL.

mysql> show databases;
+--------------------+
| Database           |
+--------------------+
| information_schema |
| hive               |
| mysql              |
| performance_schema |
| spark              |
| sparkstreaming     |
+--------------------+
6 rows in set (0.31 sec)

Switch to the hive database.

mysql> use hive;
Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A
Database changed

List the tables in the hive database.

mysql> show tables;
+---------------------------+
| Tables_in_hive            |
+---------------------------+
| BUCKETING_COLS            |
| CDS                       |
| COLUMNS_V2                |
| DATABASE_PARAMS           |
| DBS                       |
| FUNCS                     |
| FUNC_RU                   |
| GLOBAL_PRIVS              |
| IDXS                      |
| INDEX_PARAMS              |
| PARTITIONS                |
| PARTITION_KEYS            |
| PARTITION_KEY_VALS        |
| PARTITION_PARAMS          |
| PART_COL_PRIVS            |
| PART_COL_STATS            |
| PART_PRIVS                |
| ROLES                     |
| SDS                       |
| SD_PARAMS                 |
| SEQUENCE_TABLE            |
| SERDES                    |
| SERDE_PARAMS              |
| SKEWED_COL_NAMES          |
| SKEWED_COL_VALUE_LOC_MAP  |
| SKEWED_STRING_LIST        |
| SKEWED_STRING_LIST_VALUES |
| SKEWED_VALUES             |
| SORT_COLS                 |
| TABLE_PARAMS              |
| TAB_COL_STATS             |
| TBLS                      |
| TBL_COL_PRIVS             |
| TBL_PRIVS                 |
| VERSION                   |
+---------------------------+
35 rows in set (0.00 sec)

 

2) Start the Spark shell and query the table in the external MySQL database through Spark SQL.

root@master:~# spark-shell --master spark://192.168.189.1:7077 \
>     --driver-class-path /usr/local/apache-hive-1.2.1/lib/mysql-connector-java-5.1.13-bin.jar \
>     --executor-memory 512m --total-executor-cores 4
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
18/02/20 09:31:19 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Spark context Web UI available at http://master:4040
Spark context available as 'sc' (master = spark://192.168.189.1:7077, app id = app-20180220093128-0000).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.2.1

 

3) In Spark 2.2.1, load the TBLS table of the hive database in the external MySQL database via spark.sqlContext.read.jdbc.

scala> import java.util.Properties
import java.util.Properties

scala> val jdbcDF = spark.sqlContext.read.jdbc("jdbc:mysql://192.168.189.1:3306/hive?user=root&password=root", "TBLS", new Properties)
18/02/20 14:21:40 WARN metastore.ObjectStore: Failed to get database global_temp, returning NoSuchObjectException
jdbcDF: org.apache.spark.sql.DataFrame = [TBL_ID: bigint, CREATE_TIME: int ... 9 more fields]
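In the call above the user name and password are embedded in the JDBC URL. They can instead be passed through the java.util.Properties object, which keeps the URL free of credentials. A minimal sketch under the same assumptions as above (host 192.168.189.1, user root, password root, table TBLS):

// A minimal sketch: pass the credentials through Properties instead of the URL.
// Host, port, user, password and table name are the same assumptions as in the example above.
import java.util.Properties

val connProps = new Properties()
connProps.put("user", "root")
connProps.put("password", "root")
connProps.put("driver", "com.mysql.jdbc.Driver")   // optionally pin the driver class explicitly

val jdbcPropsDF = spark.read.jdbc(
  "jdbc:mysql://192.168.189.1:3306/hive",   // JDBC URL without embedded credentials
  "TBLS",                                   // table in the remote database
  connProps)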

 

4) Register jdbcDF as a temporary view named table and query records from it.

scala> jdbcDF.createOrReplaceTempView("table")

scala> spark.sql("select * from table")
res2: org.apache.spark.sql.DataFrame = [TBL_ID: bigint, CREATE_TIME: int ... 9 more fields]
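Note that spark.sql only builds a DataFrame; calling show() on the result actually runs the query and prints the rows. For example (TBL_ID, TBL_NAME and TBL_TYPE are columns of the TBLS table, as the output further below shows):

scala> spark.sql("select TBL_ID, TBL_NAME, TBL_TYPE from table").show()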

 

Using the data source API, a table in a remote database can be loaded as a DataFrame or as a Spark SQL temporary view. The supported options are listed in Table 3-3:

url: The JDBC URL to connect to.

dbtable: The JDBC table to read; it should be readable. Note that anything valid in the FROM clause of a SQL query can be used here; for example, a subquery in parentheses can be used instead of a full table.

driver: The class name of the JDBC driver needed to connect to this URL. Both the master and the workers need to load this class before a JDBC command is run, so that the driver registers itself with the JDBC subsystem.

partitionColumn, lowerBound, upperBound, numPartitions: If any of these options is specified, all of them must be. They describe how to partition the table when it is read in parallel from multiple workers. partitionColumn must be a numeric column of the table being queried.

Table 3-3  Database connection options
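As an illustration of how these options fit together, the sketch below reads TBLS in parallel using the partitioning options from Table 3-3. TBL_ID is a numeric column of TBLS; the lower/upper bounds and the partition count are illustrative assumptions and only control how the partitions are split, not which rows are read:

// A minimal sketch of a parallel (partitioned) JDBC read using the options in Table 3-3.
val partitionedDF = spark.read.format("jdbc")
  .option("url", "jdbc:mysql://192.168.189.1:3306/hive?user=root&password=root")
  .option("dbtable", "TBLS")
  .option("driver", "com.mysql.jdbc.Driver")
  .option("partitionColumn", "TBL_ID")   // must be a numeric column
  .option("lowerBound", "1")             // illustrative bound (assumption)
  .option("upperBound", "200")           // illustrative bound (assumption)
  .option("numPartitions", "4")          // read with 4 parallel partitions
  .load()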

(II) The Driver Program is started on a single node of the cluster

In this scenario two things must be guaranteed: the Driver Program can find the driver class, and the Executors that run the tasks can find the driver class.

The JDBC table from case (I) is used as the test table.

1) Start the Spark shell.

root@master:~# spark-shell --master spark://192.168.189.1:7077 \
>     --driver-class-path /usr/local/apache-hive-1.2.1/lib/mysql-connector-java-5.1.13-bin.jar \
>     --jars /usr/local/apache-hive-1.2.1/lib/mysql-connector-java-5.1.13-bin.jar \
>     --executor-memory 512m --total-executor-cores 4

 

Explanation of the command options:

- The --driver-class-path option adds the driver JAR on the Driver Program's node to its CLASSPATH. Because the application is submitted in the default client deploy mode, the Driver Program runs on the submitting node, so a local (even relative) path can be used.

- The --jars option uploads the JARs that the Executors need. Uploaded JARs are added automatically to the CLASSPATH at the execution sites, so the Executors can resolve the driver class without any manual CLASSPATH changes. Here the local driver JAR is also passed as the --jars argument.

Note: the JARs uploaded with --jars are downloaded automatically at execution time.

 

2) Load the TBLS table of the hive database in the external MySQL database via spark.sqlContext.load.

scala> val jdbcDF = spark.sqlContext.load("jdbc", Map(
     |   "url" -> "jdbc:mysql://192.168.189.1:3306/hive?user=root&password=root",
     |   "dbtable" -> "TBLS"))
warning: there was one deprecation warning; re-run with -deprecation for details
18/02/20 14:41:15 WARN metastore.ObjectStore: Failed to get database global_temp, returning NoSuchObjectException
jdbcDF: org.apache.spark.sql.DataFrame = [TBL_ID: bigint, CREATE_TIME: int ... 9 more fields]
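The deprecation warning above comes from sqlContext.load, which is deprecated in Spark 2.x. A minimal, non-deprecated equivalent uses spark.read.format("jdbc") with the same options:

// Equivalent, non-deprecated form of the load() call above.
val jdbcReadDF = spark.read.format("jdbc")
  .option("url", "jdbc:mysql://192.168.189.1:3306/hive?user=root&password=root")
  .option("dbtable", "TBLS")
  .load()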

 

3) Display the contents of the JDBC table.

scala> jdbcDF.show
+------+-----------+-----+----------------+-----+---------+------+--------------+-------------+------------------+------------------+
|TBL_ID|CREATE_TIME|DB_ID|LAST_ACCESS_TIME|OWNER|RETENTION| SD_ID|      TBL_NAME|     TBL_TYPE|VIEW_EXPANDED_TEXT|VIEW_ORIGINAL_TEXT|
+------+-----------+-----+----------------+-----+---------+------+--------------+-------------+------------------+------------------+
......
|   151| 1519025857|    6|               0| root|        0|100581|         pokes|MANAGED_TABLE|              null|              null|
|   156| 1519026222|    6|               0| root|        0|100586|    pokes_test|MANAGED_TABLE|              null|              null|
+------+-----------+-----+----------------+-----+---------+------+--------------+-------------+------------------+------------------+

 

 

Extension: the Spark shell used here supports only the client deploy mode. If the application is submitted with spark-submit, the cluster deploy mode can be used as well. In that case the Driver Program is scheduled by the Master (standalone cluster) or by the ResourceManager (Spark on YARN), so the CLASSPATH setting for a local Driver Program, i.e. the --driver-class-path option, can be dropped; the driver JAR uploaded with --jars is automatically added to the CLASSPATH of the Driver Program on whichever node it actually runs. --jars uploads the JARs with every application submission, so if this scenario is common it is better to deploy the driver JAR to the cluster nodes and point to it with configuration properties, as sketched below.
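A sketch of such configuration properties in conf/spark-defaults.conf, assuming the driver JAR sits at the same absolute path on every node, as in this chapter's layout:

# Put the MySQL driver JAR on the classpath of both the driver and the executors.
spark.driver.extraClassPath    /usr/local/apache-hive-1.2.1/lib/mysql-connector-java-5.1.13-bin.jar
spark.executor.extraClassPath  /usr/local/apache-hive-1.2.1/lib/mysql-connector-java-5.1.13-bin.jar

With these properties in place, the driver JAR is visible to the Driver Program and the Executors without passing --driver-class-path or --jars on every submission.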

When running a Spark application, failures such as ClassNotFoundException, connection setup errors, or SQL parsing exceptions can be investigated from the following two angles:

1) The JDBC driver class must be visible to the primordial class loader of the client session and of all Executors. This is because Java's DriverManager performs security checks and, when a connection is opened, ignores any driver that is not visible to the primordial class loader. A convenient way to satisfy this is to modify compute_classpath.sh on all Worker nodes so that it includes the driver JARs; the JAR paths must be full (absolute) paths.

For example, when adding JAR dependencies to SPARK_CLASSPATH, the path must be the one that actually exists at execution time; a wrong path makes the driver lookup fail and raises an exception. The error output may contain the following:

 

scala> import java.util.Properties
import java.util.Properties

scala> val jdbcDF = spark.sqlContext.read.jdbc("jdbc:mysql://192.168.189.1:3306/hive?user=root&password=root", "TBLS", new Properties)
18/02/20 14:02:17 WARN metastore.ObjectStore: Failed to get database global_temp, returning NoSuchObjectException
jdbcDF: org.apache.spark.sql.DataFrame = [TBL_ID: bigint, CREATE_TIME: int ... 9 more fields]

scala> jdbcDF.show
18/02/20 14:03:47 WARN scheduler.TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0, worker2, executor 2): java.lang.ClassNotFoundException: com.mysql.jdbc.Driver
        at org.apache.spark.repl.ExecutorClassLoader.findClass(ExecutorClassLoader.scala:82)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
        at org.apache.spark.sql.execution.datasources.jdbc.DriverRegistry$.register(DriverRegistry.scala:38)
        at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$createConnectionFactory$1.apply(JdbcUtils.scala:53)
        at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$createConnectionFactory$1.apply(JdbcUtils.scala:52)
        at org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD.compute(JDBCRDD.scala:286)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
        at org.apache.spark.scheduler.Task.run(Task.scala:108)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:338)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.ClassNotFoundException: com.mysql.jdbc.Driver
        at java.lang.ClassLoader.findClass(ClassLoader.java:530)
        at org.apache.spark.util.ParentClassLoader.findClass(ParentClassLoader.scala:26)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
        at org.apache.spark.util.ParentClassLoader.loadClass(ParentClassLoader.scala:34)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
        at org.apache.spark.util.ParentClassLoader.loadClass(ParentClassLoader.scala:30)
        at org.apache.spark.repl.ExecutorClassLoader.findClass(ExecutorClassLoader.scala:77)
        ... 20 more

18/02/20 14:03:59 WARN scheduler.TaskSetManager: Lost task 0.3 in stage 0.0 (TID 3, worker1, executor 0): java.lang.ClassNotFoundException: com.mysql.jdbc.Driver
        at org.apache.spark.repl.ExecutorClassLoader.findClass(ExecutorClassLoader.scala:82)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
        at org.apache.spark.sql.execution.datasources.jdbc.DriverRegistry$.register(DriverRegistry.scala:38)
        at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$createConnectionFactory$1.apply(JdbcUtils.scala:53)
        at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$createConnectionFactory$1.apply(JdbcUtils.scala:52)
        at org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD.compute(JDBCRDD.scala:286)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
        at org.apache.spark.scheduler.Task.run(Task.scala:108)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:338)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.ClassNotFoundException: com.mysql.jdbc.Driver
        at java.lang.ClassLoader.findClass(ClassLoader.java:530)
        at org.apache.spark.util.ParentClassLoader.findClass(ParentClassLoader.scala:26)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
        at org.apache.spark.util.ParentClassLoader.loadClass(ParentClassLoader.scala:34)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
        at org.apache.spark.util.ParentClassLoader.loadClass(ParentClassLoader.scala:30)
        at org.apache.spark.repl.ExecutorClassLoader.findClass(ExecutorClassLoader.scala:77)
        ... 20 more

18/02/20 14:03:59 ERROR scheduler.TaskSetManager: Task 0 in stage 0.0 failed 4 times; aborting job
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 (TID 3, worker1, executor 0): java.lang.ClassNotFoundException: com.mysql.jdbc.Driver
        at org.apache.spark.repl.ExecutorClassLoader.findClass(ExecutorClassLoader.scala:82)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
        at org.apache.spark.sql.execution.datasources.jdbc.DriverRegistry$.register(DriverRegistry.scala:38)
        at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$createConnectionFactory$1.apply(JdbcUtils.scala:53)
        at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$createConnectionFactory$1.apply(JdbcUtils.scala:52)
        at org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD.compute(JDBCRDD.scala:286)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
        at org.apache.spark.scheduler.Task.run(Task.scala:108)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:338)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.ClassNotFoundException: com.mysql.jdbc.Driver
        at java.lang.ClassLoader.findClass(ClassLoader.java:530)
        at org.apache.spark.util.ParentClassLoader.findClass(ParentClassLoader.scala:26)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
        at org.apache.spark.util.ParentClassLoader.loadClass(ParentClassLoader.scala:34)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
        at org.apache.spark.util.ParentClassLoader.loadClass(ParentClassLoader.scala:30)
        at org.apache.spark.repl.ExecutorClassLoader.findClass(ExecutorClassLoader.scala:77)
        ... 20 more

Driver stacktrace:
  at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1517)
  at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1505)
  at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1504)
  at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
  at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
  at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1504)
  at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:814)
  at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:814)
  at scala.Option.foreach(Option.scala:257)
  at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:814)
  at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1732)
  at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1687)
  at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1676)
  at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
  at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:630)
  at org.apache.spark.SparkContext.runJob(SparkContext.scala:2029)
  at org.apache.spark.SparkContext.runJob(SparkContext.scala:2050)
  at org.apache.spark.SparkContext.runJob(SparkContext.scala:2069)
  at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:336)
  at org.apache.spark.sql.execution.CollectLimitExec.executeCollect(limit.scala:38)
  at org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$collectFromPlan(Dataset.scala:2861)
  at org.apache.spark.sql.Dataset$$anonfun$head$1.apply(Dataset.scala:2150)
  at org.apache.spark.sql.Dataset$$anonfun$head$1.apply(Dataset.scala:2150)
  at org.apache.spark.sql.Dataset$$anonfun$55.apply(Dataset.scala:2842)
  at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:65)
  at org.apache.spark.sql.Dataset.withAction(Dataset.scala:2841)
  at org.apache.spark.sql.Dataset.head(Dataset.scala:2150)
  at org.apache.spark.sql.Dataset.take(Dataset.scala:2363)
  at org.apache.spark.sql.Dataset.showString(Dataset.scala:241)
  at org.apache.spark.sql.Dataset.show(Dataset.scala:637)
  at org.apache.spark.sql.Dataset.show(Dataset.scala:596)
  at org.apache.spark.sql.Dataset.show(Dataset.scala:605)
  ... 48 elided
Caused by: java.lang.ClassNotFoundException: com.mysql.jdbc.Driver
  at org.apache.spark.repl.ExecutorClassLoader.findClass(ExecutorClassLoader.scala:82)
  at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
  at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
  at org.apache.spark.sql.execution.datasources.jdbc.DriverRegistry$.register(DriverRegistry.scala:38)
  at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$createConnectionFactory$1.apply(JdbcUtils.scala:53)
  at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$createConnectionFactory$1.apply(JdbcUtils.scala:52)
  at org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD.compute(JDBCRDD.scala:286)
  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
  at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
  at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
  at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
  at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
  at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
  at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
  at org.apache.spark.scheduler.Task.run(Task.scala:108)
  at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:338)
  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
  at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.ClassNotFoundException: com.mysql.jdbc.Driver
  at java.lang.ClassLoader.findClass(ClassLoader.java:530)
  at org.apache.spark.util.ParentClassLoader.findClass(ParentClassLoader.scala:26)
  at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
  at org.apache.spark.util.ParentClassLoader.loadClass(ParentClassLoader.scala:34)
  at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
  at org.apache.spark.util.ParentClassLoader.loadClass(ParentClassLoader.scala:30)
  at org.apache.spark.repl.ExecutorClassLoader.findClass(ExecutorClassLoader.scala:77)
  ... 20 more

2) Some databases, such as H2, convert all names to upper case. In Spark SQL you must use the upper-case names to refer to those tables and columns.
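For example (a hedged sketch; the H2 URL and table are illustrative assumptions, not part of this chapter's environment), a table created as people in H2 is stored as PEOPLE and must be referenced in upper case:

val h2DF = spark.read.format("jdbc")
  .option("url", "jdbc:h2:tcp://localhost/~/test")   // illustrative H2 URL (assumption)
  .option("driver", "org.h2.Driver")
  .option("dbtable", "PEOPLE")                       // upper-case name as stored by H2
  .load()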

 

