Spark: loading partial data and adding filter conditions over JDBC

When using Spark SQL to connect to MySQL, Oracle, Greenplum, etc. over JDBC, there are cases where we do not need to load the entire table. For example:

  1. Only some of the columns are needed
  2. Only the rows that match a filter condition are needed

In such cases, you need to change the value passed to option("dbtable", tablename) when setting up the JDBC read. See the property descriptions on the Spark website (Spark 2.3, "JDBC To Other Databases", detailed property list):

| Property Name | Meaning |
| --- | --- |
| url | The JDBC URL to connect to. The source-specific connection properties may be specified in the URL. e.g., jdbc:postgresql://localhost/test?user=fred&password=secret |
| dbtable | The JDBC table that should be read. Note that anything that is valid in a FROM clause of a SQL query can be used. For example, instead of a full table you could also use a subquery in parentheses. |
| driver | The class name of the JDBC driver to use to connect to this URL. |
| … | … |

dbtable: the JDBC table that should be read. Instead of a full table name, a subquery in parentheses can also be used.
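As a quick illustration (the table and column names here are only placeholders), both of the following are valid dbtable values; the key point is that a subquery must be wrapped in parentheses and given an alias:

// a full table
val fullTable = "test.info"

// a subquery selecting only some columns and rows; the alias "t" is required
val subQuery = "(select id, name from test.info where gender = 'man') t"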

The test code is as follows:

import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

object JDBCSource {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("Greenplum_test").setMaster("local[*]")
    val spark = SparkSession.builder().config(conf).getOrCreate()
    spark.sparkContext.setLogLevel("WARN")

    // dbtable is used as the source of the generated SELECT statement.
    // When passing a subquery instead of a table name, an alias must be provided:
    val tablename = "(select id, name, gender from test.info where gender = 'man') temp"

    val data = spark.read
      .format("jdbc")
      .option("driver", "com.mysql.jdbc.Driver")
      .option("url", "jdbc:mysql://localhost:3306/test")
      .option("dbtable", tablename)   // pass the subquery in as the table
      .option("user", "username")
      .option("password", "password")
      .load()

    data.show()
  }
}
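The Spark JDBC data source also supports column pruning and filter pushdown, so a similar result can often be obtained by loading the plain table name and applying select/where on the resulting DataFrame; for simple predicates Spark translates them into the WHERE clause of the query it sends to the database. A minimal sketch under that assumption, reusing the spark session from the example above (table, columns and credentials are placeholders):

// Alternative: load the table name directly and let Spark push the
// selected columns and the simple predicate down to the database.
val pushedDown = spark.read
  .format("jdbc")
  .option("driver", "com.mysql.jdbc.Driver")
  .option("url", "jdbc:mysql://localhost:3306/test")
  .option("dbtable", "test.info")
  .option("user", "username")
  .option("password", "password")
  .load()
  .select("id", "name", "gender")
  .where("gender = 'man'")

pushedDown.show()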
