In Spark's terms, a DataFrame is a distributed collection of rows. You can picture it as a table in a relational database, or as an Excel sheet with column headers. Like an RDD, it has the following characteristics:
Supported data sources:
Syntax for creating a DataFrame:
Dataset<Row> df = spark.read().json("examples/src/main/resources/people.json");
The entry point of Spark SQL: SparkSession
Code:

import org.apache.spark.sql.SparkSession;

SparkSession spark = SparkSession
    .builder()
    .appName("Java Spark SQL basic example")
    .config("spark.some.config.option", "some-value")
    .getOrCreate();

With a SparkSession, an application can create DataFrames from an existing RDD, from a Hive table, or from Spark data sources.
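As a minimal sketch of the RDD/in-memory path (the class name, column names, and local master setting are illustrative assumptions, not part of the original examples), a DataFrame can be built directly from a list of rows with an explicit schema:

```java
package org.example;

import java.util.Arrays;
import java.util.List;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;

public class CreateFromListExample {
    public static void main(String[] args) {
        SparkSession spark = SparkSession
                .builder()
                .appName("CreateFromListExample")
                .master("local[*]")  // local mode, for trying this out without a cluster
                .getOrCreate();
        // Declare the schema explicitly: name (string), age (long).
        StructType schema = DataTypes.createStructType(Arrays.asList(
                DataTypes.createStructField("name", DataTypes.StringType, true),
                DataTypes.createStructField("age", DataTypes.LongType, true)));
        List<Row> rows = Arrays.asList(
                RowFactory.create("Michael", 12L),
                RowFactory.create("Andy", 13L));
        Dataset<Row> df = spark.createDataFrame(rows, schema);
        df.show();
        spark.stop();
    }
}
```

The same createDataFrame call also accepts a JavaRDD<Row> in place of the list.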
JSON test file:
{"name": "Michael", "age": 12}
{"name": "Andy", "age": 13}
{"name": "Justin", "age": 8}
Code:

package org.example;

import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

public class SparkSQLTest4 {
    public static void main(String[] args) {
        SparkSession spark = SparkSession
                .builder()
                .appName("SparkSQLTest4")
                .config("spark.some.config.option", "some-value")
                .getOrCreate();
        Dataset<Row> df = spark.read().json("file:///home/pyspark/test.json");
        df.show();
        spark.stop();
    }
}
CSV test file:
Code:

package org.example;

import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

public class SparkSQLTest5 {
    public static void main(String[] args) {
        SparkSession spark = SparkSession
                .builder()
                .appName("SparkSQLTest5")
                .config("spark.some.config.option", "some-value")
                .getOrCreate();
        Dataset<Row> df = spark.read()
                .format("csv")
                .option("header", "true")
                // .option("inferSchema", "true")  // optionally infer column types instead of reading every column as a string
                .load("file:///home/pyspark/emp.csv");
        df.show();
        spark.stop();
    }
}
Querying a Hive table (note that this requires Hive support on the SparkSession):
Code:

package org.example;

import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

public class SparkSQLTest2 {
    public static void main(String[] args) {
        SparkSession spark = SparkSession
                .builder()
                .appName("SparkSQLTest2")
                .config("spark.some.config.option", "some-value")
                .enableHiveSupport()  // required to query Hive tables
                .getOrCreate();
        Dataset<Row> sqlDF = spark.sql("SELECT * FROM test.ods_fact_sale limit 100");
        sqlDF.show();
        spark.stop();
    }
}
Reading from a relational database over JDBC:
Code:

package org.example;

import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

public class SparkSQLTest3 {
    public static void main(String[] args) {
        SparkSession spark = SparkSession
                .builder()
                .appName("SparkSQLTest3")
                .config("spark.some.config.option", "some-value")
                .getOrCreate();
        Dataset<Row> jdbcDF = spark.read()
                .format("jdbc")
                .option("url", "jdbc:mysql://10.31.1.123:3306/test")
                .option("dbtable", "(SELECT * FROM EMP) tmp")
                .option("user", "root")
                .option("password", "abc123")
                .load();
        jdbcDF.printSchema();
        jdbcDF.show();
        spark.stop();
    }
}
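By default the JDBC source reads the table through a single connection. The same read can be parallelized with Spark's JDBC partitioning options (partitionColumn, lowerBound, upperBound, numPartitions). A sketch reusing the connection details above; the EMPNO bounds and partition count are illustrative assumptions:

```java
package org.example;

import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

public class SparkSQLJdbcPartitioned {
    public static void main(String[] args) {
        SparkSession spark = SparkSession
                .builder()
                .appName("SparkSQLJdbcPartitioned")
                .getOrCreate();
        Dataset<Row> jdbcDF = spark.read()
                .format("jdbc")
                .option("url", "jdbc:mysql://10.31.1.123:3306/test")
                .option("dbtable", "EMP")
                .option("user", "root")
                .option("password", "abc123")
                // Split the scan into 4 parallel tasks over ranges of the numeric column EMPNO.
                // The bounds only shape the partition ranges; rows outside them are still read.
                .option("partitionColumn", "EMPNO")
                .option("lowerBound", "7000")
                .option("upperBound", "8000")
                .option("numPartitions", "4")
                .load();
        jdbcDF.show();
        spark.stop();
    }
}
```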
We will use the four tables of the classic SCOTT schema to walk through the Spark SQL exercises:
emp
dept
bonus
salgrade
Calling describe() on a DataFrame produces summary statistics for each column, somewhat like the statistics a relational database keeps about its tables.
Code:
package org.example;

import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

public class SparkSQLTest7 {
    public static void main(String[] args) {
        SparkSession spark = SparkSession
                .builder()
                .appName("SparkSQLTest7")
                .config("spark.some.config.option", "some-value")
                .enableHiveSupport()  // required to query Hive tables
                .getOrCreate();
        spark.sql("use test");
        Dataset<Row> sqlDF = spark.sql("SELECT * FROM emp");
        sqlDF.describe().show();
        spark.stop();
    }
}
Test output:
As the output shows, describe() computes statistics (count, mean, stddev, min, max) for every column of the DataFrame.
In some scenarios we only need some of the DataFrame's columns; select handles this:
Code:
package org.example;

import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

public class SparkSQLTest8 {
    public static void main(String[] args) {
        SparkSession spark = SparkSession
                .builder()
                .appName("SparkSQLTest8")
                .config("spark.some.config.option", "some-value")
                .enableHiveSupport()  // required to query Hive tables
                .getOrCreate();
        spark.sql("use test");
        Dataset<Row> sqlDF = spark.sql("SELECT * FROM emp");
        sqlDF.select("ename", "hiredate").show();
        spark.stop();
    }
}
In other scenarios we need to rename columns, add columns, or drop columns.
Code:
package org.example;

import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

public class SparkSQLTest9 {
    public static void main(String[] args) {
        SparkSession spark = SparkSession
                .builder()
                .appName("SparkSQLTest9")
                .config("spark.some.config.option", "some-value")
                .enableHiveSupport()  // required to query Hive tables
                .getOrCreate();
        spark.sql("use test");
        Dataset<Row> sqlDF = spark.sql("SELECT * FROM emp");
        // print the column names
        System.out.println("\n\n");
        for (String col : sqlDF.columns()) {
            System.out.println(col);
        }
        System.out.println("\n\n");
        // drop a column
        sqlDF.drop("comm").show();
        // add (or replace) a column; the second argument must be a Column expression
        // sqlDF.withColumn("new_comm", sqlDF.col("sal")).show();
        // rename a column
        sqlDF.withColumnRenamed("comm", "comm_new").show();
        spark.stop();
    }
}
Filtering is done with filter; you can also use where, which is simply an alias for filter.
Code:
package org.example;

import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

public class SparkSQLTest10 {
    public static void main(String[] args) {
        SparkSession spark = SparkSession
                .builder()
                .appName("SparkSQLTest10")
                .config("spark.some.config.option", "some-value")
                .enableHiveSupport()  // required to query Hive tables
                .getOrCreate();
        spark.sql("use test");
        Dataset<Row> sqlDF = spark.sql("SELECT * FROM emp");
        sqlDF.where("comm is not null").show();
        spark.stop();
    }
}
Common aggregate operations:

Operation | Description |
---|---|
avg/mean | average |
count | number of values |
countDistinct | number of distinct values |
max | maximum |
min | minimum |
sum | sum |
sumDistinct | sum of the distinct values |
skewness | skewness |
stddev | standard deviation |

Code:
package org.example;

import org.apache.spark.sql.*;

public class SparkSQLTest11 {
    public static void main(String[] args) {
        SparkSession spark = SparkSession
                .builder()
                .appName("SparkSQLTest11")
                .config("spark.some.config.option", "some-value")
                .enableHiveSupport()  // required to query Hive tables
                .getOrCreate();
        spark.sql("use test");
        Dataset<Row> sqlDF = spark.sql("SELECT * FROM emp");
        sqlDF.groupBy("deptno").agg(
                functions.avg("sal").alias("avg_sal"),
                functions.max("comm").alias("max_comm")).show();
        spark.stop();
    }
}
For more complex scenarios, we can implement the logic as a user-defined function (UDF).
Code:
package org.example;

import org.apache.spark.sql.*;
import org.apache.spark.sql.api.java.UDF1;
import org.apache.spark.sql.types.DataTypes;

public class SparkSQLTest12 {
    public static void main(String[] args) {
        SparkSession spark = SparkSession
                .builder()
                .appName("SparkSQLTest12")
                .config("spark.some.config.option", "some-value")
                .getOrCreate();
        // The type parameters <Integer, Integer> are required for the @Override to compile.
        spark.udf().register("plusOne", new UDF1<Integer, Integer>() {
            @Override
            public Integer call(Integer x) {
                return x + 1;
            }
        }, DataTypes.IntegerType);
        spark.sql("SELECT plusOne(5)").show();
        spark.stop();
    }
}
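A registered UDF can also be invoked through the DataFrame API rather than a SQL string, via functions.callUDF. A sketch that assumes the test database and emp table from the earlier examples:

```java
package org.example;

import org.apache.spark.sql.*;
import org.apache.spark.sql.api.java.UDF1;
import org.apache.spark.sql.types.DataTypes;

public class SparkSQLUdfOnColumn {
    public static void main(String[] args) {
        SparkSession spark = SparkSession
                .builder()
                .appName("SparkSQLUdfOnColumn")
                .enableHiveSupport()  // required to query Hive tables
                .getOrCreate();
        // Register the same plusOne UDF as above, here as a lambda.
        spark.udf().register("plusOne",
                (UDF1<Integer, Integer>) x -> x + 1,
                DataTypes.IntegerType);
        spark.sql("use test");
        Dataset<Row> df = spark.sql("SELECT * FROM emp");
        // Apply the UDF to a column instead of embedding it in a SQL string.
        df.select(functions.callUDF("plusOne", df.col("deptno"))).show();
        spark.stop();
    }
}
```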
Syntax:
DataFrame.join(other, on=None, how=None)
other: the DataFrame to join with
on: str, list or Column, optional
how: str, optional
default inner. Must be one of: inner, cross, outer, full, fullouter, full_outer, left, leftouter, left_outer, right, rightouter, right_outer, semi, leftsemi, left_semi, anti, leftanti and left_anti
(The signature above is the PySpark form; the Java equivalent is Dataset.join(Dataset<?> right, Column joinExprs, String joinType), as used below.)
Code:
package org.example;

import org.apache.spark.sql.*;

public class SparkSQLTest13 {
    public static void main(String[] args) {
        SparkSession spark = SparkSession
                .builder()
                .appName("SparkSQLTest13")
                .config("spark.some.config.option", "some-value")
                .enableHiveSupport()  // required to query Hive tables
                .getOrCreate();
        spark.sql("use test");
        Dataset<Row> df1 = spark.sql("SELECT * FROM emp");
        Dataset<Row> df2 = spark.sql("SELECT * FROM dept");
        Dataset<Row> df3 = df1.join(df2, df1.col("deptno").equalTo(df2.col("deptno")), "inner")
                .select(df1.col("empno"), df1.col("ename"), df2.col("dname"), df2.col("loc"));
        df3.show();
        spark.stop();
    }
}
Here we use a right outer join:
Code:
package org.example;

import org.apache.spark.sql.*;

public class SparkSQLTest14 {
    public static void main(String[] args) {
        SparkSession spark = SparkSession
                .builder()
                .appName("SparkSQLTest14")
                .config("spark.some.config.option", "some-value")
                .enableHiveSupport()  // required to query Hive tables
                .getOrCreate();
        spark.sql("use test");
        Dataset<Row> df1 = spark.sql("SELECT * FROM emp");
        Dataset<Row> df2 = spark.sql("SELECT * FROM dept");
        Dataset<Row> df3 = df1.join(df2, df1.col("deptno").equalTo(df2.col("deptno")), "right")
                .select(df1.col("empno"), df1.col("ename"), df2.col("dname"), df2.col("loc"));
        df3.show();
        spark.stop();
    }
}
Syntax:
DataFrame.orderBy(*cols, **kwargs)
-- returns a new DataFrame sorted by the specified columns
Parameter: ascending (bool or list, optional)
A boolean or a list of booleans (defaults to True), selecting ascending versus descending order. Specify a list for multiple sort orders; if a list is given, its length must equal the length of cols.
(PySpark form shown; in Java the sort direction is expressed on the Column itself, e.g. col.desc(), as below.)
Code:
package org.example;

import org.apache.spark.sql.*;

public class SparkSQLTest15 {
    public static void main(String[] args) {
        SparkSession spark = SparkSession
                .builder()
                .appName("SparkSQLTest15")
                .config("spark.some.config.option", "some-value")
                .enableHiveSupport()  // required to query Hive tables
                .getOrCreate();
        spark.sql("use test");
        Dataset<Row> df1 = spark.sql("SELECT * FROM emp");
        Dataset<Row> df2 = spark.sql("SELECT * FROM dept");
        Dataset<Row> df3 = df1.join(df2, df1.col("deptno").equalTo(df2.col("deptno")), "right")
                .select(df1.col("empno"), df1.col("ename"), df2.col("dname"), df2.col("loc"));
        Dataset<Row> df4 = df3.orderBy(df3.col("dname").desc(), df3.col("ename").asc());
        df4.show();
        spark.stop();
    }
}
The sql function on SparkSession lets an application run SQL queries programmatically and returns the result as a Dataset<Row>.
Code:
package org.example;

import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

public class SparkSQLTest16 {
    public static void main(String[] args) {
        SparkSession spark = SparkSession
                .builder()
                .appName("SparkSQLTest16")
                .config("spark.some.config.option", "some-value")
                .getOrCreate();
        Dataset<Row> df = spark.read().json("file:///home/pyspark/test.json");
        df.createOrReplaceTempView("people");
        spark.sql("select * from people where age = 12").show();
        spark.stop();
    }
}
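A temp view created with createOrReplaceTempView is scoped to the SparkSession that created it. When a view must be shared across sessions within the same application, a global temp view can be used instead; it lives in the reserved global_temp database. A sketch reusing the same JSON file (the class name is illustrative):

```java
package org.example;

import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

public class SparkSQLGlobalTempView {
    public static void main(String[] args) throws Exception {
        SparkSession spark = SparkSession
                .builder()
                .appName("SparkSQLGlobalTempView")
                .getOrCreate();
        Dataset<Row> df = spark.read().json("file:///home/pyspark/test.json");
        df.createGlobalTempView("people");
        // Global temp views must be qualified with the global_temp database.
        spark.sql("SELECT * FROM global_temp.people WHERE age = 12").show();
        // They remain visible from a new session within the same application.
        spark.newSession().sql("SELECT * FROM global_temp.people").show();
        spark.stop();
    }
}
```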