Notes and takeaways on using the Spark SQL Dataset API
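When you use Spark SQL, a query hands back a Dataset holding the result, so you need to know the Dataset operations in order to cope with whatever unreasonable requirements the business comes up with.
The operations are roughly as listed below. Note that you can also create a DataFrame, but what you get back is still a Dataset (in the Java API a DataFrame is simply Dataset<Row>).
Overall it feels much like working with Java Stream methods, or with Flink operators. There is no need to learn everything up front; look up each method when you need it.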
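To make the Java Stream comparison concrete, here is a small plain-Java sketch of a filter, project, limit pipeline. The Student class and the namesOlderThan13 method are made up for this illustration and are not part of Spark. The only point is that where, select, and limit on a Dataset read much like filter, map, and limit on a Stream.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical record type used only for this analogy; not a Spark class.
class Student {
    final String name;
    final long age;
    final String sex;

    Student(String name, long age, String sex) {
        this.name = name;
        this.age = age;
        this.sex = sex;
    }
}

class StreamAnalogy {
    // Roughly the shape of: SELECT name FROM students WHERE age > 13 LIMIT n
    static List<String> namesOlderThan13(List<Student> students, int n) {
        return students.stream()
                .filter(s -> s.age > 13)  // plays the role of where / filter
                .map(s -> s.name)         // plays the role of select
                .limit(n)                 // plays the role of limit
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Student> students = Arrays.asList(
                new Student("Tom", 12, "M"),
                new Student("Ann", 14, "F"),
                new Student("Bob", 15, "M"),
                new Student("Eve", 16, "F"));
        System.out.println(namesOlderThan13(students, 2)); // prints [Ann, Bob]
    }
}
```

The analogy only goes so far: a Stream runs in one JVM over in-memory objects, while a Dataset describes a distributed query plan.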
Return type | Method | Notes |
void | show() Displays the top 20 rows of Dataset in a tabular form. | Shows the first 20 rows by default |
void | show(boolean truncate) Displays the top 20 rows of Dataset in a tabular form. | When truncate is true, a field longer than 20 characters is cut off |
void | show(int numRows) Displays the Dataset in a tabular form. | Sets the number of rows to display |
void | show(int numRows, boolean truncate) Displays the Dataset in a tabular form. | Sets the number of rows, and whether fields longer than 20 characters are cut off |
void | show(int numRows, int truncate) Displays the Dataset in a tabular form. | Sets the number of rows, and the maximum number of characters shown per field |
void | show(int numRows, int truncate, boolean vertical) Displays the Dataset in a tabular form. | As above; when vertical is true, each row is printed vertically, one line per column value |
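The truncate parameter is easier to picture with a small model of what happens to a single cell. The sketch below is plain Java written for this note, not Spark's own code. It assumes, based on my understanding of how show renders cells, that a value longer than the limit is shortened so that it ends in "..." and fits within the limit, and that a limit of 0 or less turns truncation off. The class and method names are hypothetical.

```java
class ShowTruncation {
    // Models how one cell might be rendered for a given truncate width.
    // Assumption: mirrors the observable behavior of show, not its source.
    static String truncateCell(String cell, int truncate) {
        if (truncate > 0 && cell.length() > truncate) {
            if (truncate < 4) {
                // Too narrow for an ellipsis, so just cut the string
                return cell.substring(0, truncate);
            }
            // Keep room for the three-character ellipsis
            return cell.substring(0, truncate - 3) + "...";
        }
        // A non-positive width, or a short enough value, is left untouched
        return cell;
    }

    public static void main(String[] args) {
        // show(boolean truncate) with true corresponds to a width of 20 here
        System.out.println(truncateCell("abcdefghijklmnopqrstuvwxyz", 20)); // prints abcdefghijklmnopq...
        System.out.println(truncateCell("short", 20));                      // prints short
    }
}
```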
describe reports on the specified columns; if no columns are given, it covers all of them (numeric and string columns).
Dataset<Row> | describe(scala.collection.Seq<String> cols) Computes basic statistics for numeric and string columns, including count, mean, stddev, min, and max. |
Dataset<Row> | describe(String... cols) Computes basic statistics for numeric and string columns, including count, mean, stddev, min, and max. |
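This method is rather interesting: it returns summary statistics for the specified fields, namely count, mean, stddev, min, and max.
The statistics mean the following: count is the number of non-null values, mean is the average, stddev is the standard deviation, and min and max are the smallest and largest values.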
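To see what these five numbers are, here is a hand-rolled version in plain Java for a single numeric column. It is only a sketch for checking your intuition, not how Spark computes them. It assumes stddev is the sample standard deviation (dividing by count - 1), which as far as I know is what describe reports. The class and method names are made up.

```java
import java.util.Arrays;

class DescribeByHand {
    // Returns {count, mean, stddev, min, max} for a non-empty array of values.
    static double[] describe(double[] values) {
        int count = values.length;
        double mean = Arrays.stream(values).average().orElse(Double.NaN);
        double squaredDiffs = 0.0;
        for (double v : values) {
            squaredDiffs += (v - mean) * (v - mean);
        }
        // Assumption: sample standard deviation, so divide by count - 1
        double stddev = count > 1 ? Math.sqrt(squaredDiffs / (count - 1)) : Double.NaN;
        double min = Arrays.stream(values).min().orElse(Double.NaN);
        double max = Arrays.stream(values).max().orElse(Double.NaN);
        return new double[] {count, mean, stddev, min, max};
    }

    public static void main(String[] args) {
        // Ages 12, 14, 16 give count 3, mean 14, stddev 2, min 12, max 16
        System.out.println(Arrays.toString(describe(new double[] {12, 14, 16})));
    }
}
```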
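The following methods return only a small number of Rows, so pay attention to what each one returns.
- first: gets the first row
- head: gets the first row
- head(int n): gets the first n rows, returned as an Array
- take(int n): gets the first n rows, returned as an Array
- takeAsList(int n): gets the first n rows, returned as a List
A Java example: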
// first
Row first = studentDataset.first();
// head
Row head = studentDataset.head();
// head(2): from Java the Array result is typed as Object, so cast it
Row[] heads = (Row[]) studentDataset.head(2);
// take(2)
Row[] take = (Row[]) studentDataset.take(2);
// takeAsList(2)
List<Row> rows = studentDataset.takeAsList(2);
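The methods that follow behave roughly like the SQL keywords they are named after. Use them as your requirements dictate.
Return type | Method | Notes |
Dataset<T> | where(Column condition) Filters rows using the given condition. | Builds the condition with Spark's Column methods |
Dataset<T> | where(String conditionExpr) Filters rows using the given SQL expression. | Takes a SQL-style expression string |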
Example:
// where(String conditionExpr)
Dataset<Row> wheredataset = studentDataset.where("age > 13 and sex = '男'");
wheredataset.show();
// where(Column condition)
Dataset<Row> whereDataset2 = studentDataset.where(studentDataset.col("age").gt(13).and(studentDataset.col("sex").equalTo("男")));
whereDataset2.show();
filter is similar to where, with two additional overloads, four in total. The Scala-specific one is skipped here.
Dataset<T> | filter(Column condition) Filters rows using the given condition. |
Dataset<T> | filter(FilterFunction<T> func) (Java-specific) Returns a new Dataset that only contains elements where func returns true. |
Dataset<T> | filter(scala.Function1<T,Object> func) (Scala-specific) Returns a new Dataset that only contains elements where func returns true. |
Dataset<T> | filter(String conditionExpr) Filters rows using the given SQL expression. |
Example:
// filter(String conditionExpr)
studentDataset.filter("age > 13 and sex = '男'").show();
// filter(Column condition)
studentDataset.filter(studentDataset.col("age").gt(13).and(studentDataset.col("sex").equalTo("男"))).show();
// filter(FilterFunction<Row> func)
studentDataset.filter(new FilterFunction<Row>() {
    @Override
    public boolean call(Row value) throws Exception {
        // age is read as a long here; "男".equals(...) also tolerates a null sex
        return value.<Long>getAs("age") > 13 && "男".equals(value.getAs("sex"));
    }
}).show();
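select picks the specified columns. The official API offers several overloads, and you can also apply operations to the selected columns.
Return type | Method |
Dataset<Row> | select(Column... cols) Selects a set of column based expressions. |
Dataset<Row> | select(scala.collection.Seq<Column> cols) Selects a set of column based expressions. |
Dataset<Row> | select(String col, scala.collection.Seq<String> cols) Selects a set of columns. |
Dataset<Row> | select(String col, String... cols) Selects a set of columns. |
<U1> Dataset<U1> | select(TypedColumn<T,U1> c1) Returns a new Dataset by computing the given Column expression for each element. |
<U1,U2> Dataset<scala.Tuple2<U1,U2>> | select(TypedColumn<T,U1> c1, TypedColumn<T,U2> c2) Returns a new Dataset by computing the given Column expressions for each element. |
<U1,U2,U3> Dataset<scala.Tuple3<U1,U2,U3>> | select(TypedColumn<T,U1> c1, TypedColumn<T,U2> c2, TypedColumn<T,U3> c3) Returns a new Dataset by computing the given Column expressions for each element. |
<U1,U2,U3,U4> Dataset<scala.Tuple4<U1,U2,U3,U4>> | select(TypedColumn<T,U1> c1, TypedColumn<T,U2> c2, TypedColumn<T,U3> c3, TypedColumn<T,U4> c4) Returns a new Dataset by computing the given Column expressions for each element. |
<U1,U2,U3,U4,U5> Dataset<scala.Tuple5<U1,U2,U3,U4,U5>> | select(TypedColumn<T,U1> c1, TypedColumn<T,U2> c2, TypedColumn<T,U3> c3, TypedColumn<T,U4> c4, TypedColumn<T,U5> c5) Returns a new Dataset by computing the given Column expressions for each element. |
Example: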
// select(String col, String... cols)
studentDataset.select("name","age","sex").show(2);
// select(String col, scala.collection.Seq<String> cols)
ArrayStack<String> arraySeq = new ArrayStack<>();
arraySeq.push("sex");
arraySeq.push("age");
studentDataset.select("name", arraySeq).show(2);
// select(Column... cols): a Column can transform a field, e.g. add 5 to age and alias the result as age+5
studentDataset.select(studentDataset.apply("name"), studentDataset.col("age").plus(5).as("age+5"), studentDataset.col("sex")).show(2);
// select(scala.collection.Seq<Column> cols)
ArrayStack<Column> arraySeqColumn = new ArrayStack<>();
arraySeqColumn.push(studentDataset.col("sex"));
arraySeqColumn.push(studentDataset.col("age"));
arraySeqColumn.push(studentDataset.col("name"));
studentDataset.select(arraySeqColumn).show(2);
selectExpr lets you apply special processing to the specified fields by writing SQL expressions.
Dataset<Row> | selectExpr(scala.collection.Seq<String> exprs) Selects a set of SQL expressions. |
Dataset<Row> | selectExpr(String... exprs) Selects a set of SQL expressions. |
Example:
studentDataset.selectExpr("name","age","age+1 as otherAge","round(age)","sex as 性别").show(2);
drop does the opposite of select: it removes a column.
Dataset<Row> | drop(Column col) Returns a new Dataset with a column dropped. |
Dataset<Row> | drop(scala.collection.Seq<String> colNames) Returns a new Dataset with columns dropped. |
Dataset<Row> | drop(String... colNames) Returns a new Dataset with columns dropped. |
Dataset<Row> | drop(String colName) Returns a new Dataset with a column dropped. |
Example:
studentDataset.drop("age").show(2);
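The limit method takes the first n rows of a Dataset and returns them as a new Dataset object.
Unlike take and head, limit is not an action: take and head both return an Array, whereas limit returns a new Dataset produced by a transformation.
Dataset<T> | limit(int n) Returns a new Dataset by taking the first n rows. |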
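The difference between a transformation and an action has a close cousin in Java Streams, where limit is an intermediate operation and collect is a terminal one. The plain-Java sketch below uses a counter to show that nothing is processed until the terminal operation runs. It is an analogy only, with made-up class and method names, and says nothing about how Spark schedules work.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.stream.Collectors;
import java.util.stream.Stream;

class LazyLimitAnalogy {
    // Counts how many elements were actually pulled through the pipeline
    static final AtomicInteger visited = new AtomicInteger();

    // Like Dataset.limit: only describes the work, processes nothing yet
    static Stream<Integer> firstN(List<Integer> data, int n) {
        return data.stream()
                .peek(x -> visited.incrementAndGet())
                .limit(n);
    }

    public static void main(String[] args) {
        Stream<Integer> pipeline = firstN(Arrays.asList(1, 2, 3, 4, 5), 2);
        System.out.println(visited.get()); // prints 0, nothing has run yet

        // Like take / head: the terminal operation triggers the work
        List<Integer> result = pipeline.collect(Collectors.toList());
        System.out.println(result); // prints [1, 2]
    }
}
```

In Spark terms, limit(n) followed by show() or takeAsList(n) is what finally runs the query.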
orderBy and sort both sort by the specified fields, ascending by default. They are used the same way, and both support sorting on multiple fields.
studentDataset.sort(studentDataset.col("age").desc(),studentDataset.apply("name").desc()).show(5);
studentDataset.orderBy(studentDataset.col("age").desc(),studentDataset.apply("name").desc()).show(5);