DT大数据梦工厂
The file /home/pengyucheng/java/rdd2dfram.txt contains the following 4 records:
1,hadoop,11
2,spark,7
3,flink,5
4,ivy,27
Task: write code that queries and prints to the console every record whose third field is greater than 7. For example, the first record, 1,hadoop,11, has 11 as its third field, which is greater than 7, so it should be printed.
This is a typical Spark SQL problem: query the records whose third field is greater than 7. The key is converting the unstructured data in the txt file into the structured data (with concrete column types, much like a MySQL table) that a DataFrame holds. Spark offers two ways to do this conversion:
“Spark SQL supports two different methods for converting existing RDDs into DataFrames. The first method uses reflection to infer the schema of an RDD that contains specific types of objects. This reflection based approach leads to more concise code and works well when you already know the schema while writing your Spark application.
The second method for creating DataFrames is through a programmatic interface that allows you to construct a schema and then apply it to an existing RDD. While this method is more verbose, it allows you to construct DataFrames when the columns and their types are not known until runtime.”
This article walks through the first conversion method with a worked example; the next article will analyze the second method.
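For reference, the following is a minimal sketch of what the second (programmatic) approach looks like with the Spark 1.x Java API: the schema is constructed at runtime from StructField/StructType objects and then applied to an RDD of Rows. The class name RDD2DataFrameProgrammatically and the query used here are illustrative assumptions only; the next article covers this approach properly.

import java.util.ArrayList;
import java.util.List;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SQLContext;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;

public class RDD2DataFrameProgrammatically
{
    public static void main(String[] args)
    {
        SparkConf conf = new SparkConf().setMaster("local[*]").setAppName("RDD2DataFrameProgrammatically");
        JavaSparkContext sc = new JavaSparkContext(conf);
        SQLContext sqlContext = new SQLContext(sc);
        // parse each line into a generic Row instead of a Person bean
        JavaRDD<Row> rowRDD = sc.textFile("file:///home/pengyucheng/java/rdd2dfram.txt")
            .map(new Function<String, Row>() {
                private static final long serialVersionUID = 1L;
                @Override
                public Row call(String line) throws Exception {
                    String[] attrs = line.split(",");
                    return RowFactory.create(Integer.valueOf(attrs[0]), attrs[1], Integer.valueOf(attrs[2]));
                }
            });
        // build the schema at runtime and apply it to the RDD of Rows
        List<StructField> fields = new ArrayList<StructField>();
        fields.add(DataTypes.createStructField("id", DataTypes.IntegerType, true));
        fields.add(DataTypes.createStructField("name", DataTypes.StringType, true));
        fields.add(DataTypes.createStructField("age", DataTypes.IntegerType, true));
        StructType schema = DataTypes.createStructType(fields);
        DataFrame df = sqlContext.createDataFrame(rowRDD, schema);
        df.registerTempTable("persons");
        sqlContext.sql("select * from persons where age >= 6").show();
    }
}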
The reflection-based version follows. First, the Person bean that encapsulates one record of the file:

package main.scala;

import java.io.Serializable;

/**
 * Encapsulates one record of the input file.
 * @author pengyucheng
 */
public class Person implements Serializable
{
    private static final long serialVersionUID = 1L;
    private int id;
    private String name;
    private int age;

    public int getId() {
        return id;
    }
    public void setId(int id) {
        this.id = id;
    }
    public String getName() {
        return name;
    }
    public void setName(String name) {
        this.name = name;
    }
    public int getAge() {
        return age;
    }
    public void setAge(int age) {
        this.age = age;
    }
    @Override
    public String toString() {
        return "Person [id=" + id + ", name=" + name + ", age=" + age + "]";
    }
}
The Java implementation:

package cool.pengych.spark.sql;

import java.util.List;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SQLContext;

/**
 * Converting an RDD into a DataFrame by reflection
 * @author pengyucheng
 */
public class RDD2DataFrameByReflection
{
    public static void main(String[] args)
    {
        /*
         * 1. Create the SQLContext
         */
        SparkConf conf = new SparkConf().setMaster("local[*]").setAppName("RDD2DataFrameByReflection");
        JavaSparkContext sc = new JavaSparkContext(conf);
        SQLContext sqlContext = new SQLContext(sc);
        /*
         * 2. Read the external file and map each line into a Person inside a JavaRDD
         */
        JavaRDD<String> lines = sc.textFile("file:///home/pengyucheng/java/rdd2dfram.txt");
        JavaRDD<Person> persons = lines.map(new Function<String, Person>(){
            private static final long serialVersionUID = 1L;
            @Override
            public Person call(String line) throws Exception
            {
                String[] attrs = line.split(",");
                Person person = new Person();
                person.setId(Integer.valueOf(attrs[0]));
                person.setAge(Integer.valueOf(attrs[2]));// age sits at index 2 in the input line
                person.setName(attrs[1]);
                return person;
            }
        });
        /*
         * 3. Create the DataFrame: under the hood, reflection extracts all the fields of Person
         *    and, combined with the RDD itself, produces the DataFrame
         */
        DataFrame df = sqlContext.createDataFrame(persons, Person.class);
        df.registerTempTable("persons");
        DataFrame bigDatas = sqlContext.sql("select * from persons where age >= 6");
        JavaRDD<Row> bigDatasRDD = bigDatas.javaRDD();
        JavaRDD<Person> result = bigDatasRDD.map(new Function<Row, Person>() {
            private static final long serialVersionUID = 1L;
            @Override
            public Person call(Row row) throws Exception
            {
                Person person = new Person();
                person.setId(row.getInt(1));
                person.setAge(row.getInt(0));// age is now at index 0: the columns get reordered during the DataFrame -> JavaRDD conversion
                person.setName(row.getString(2));
                return person;
            }
        });
        /*
         * 4. Print the result to the console
         */
        List<Person> personList = result.collect();
        for (Person person : personList)
        {
            System.out.println(person);
        }
    }
}
The same program in Scala:

package main.scala

import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.sql.SQLContext

object RDD2DataFrameByReflection {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[*]").setAppName("RDD2DataFrameByReflection")
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)
    // read the file and map each line to a Person
    val persons = sc.textFile("file:///home/pengyucheng/java/rdd2dfram.txt").map(row => {
      val attrs = row.split(",")
      val person = new Person
      person.setId(Integer.valueOf(attrs(0)))
      person.setAge(Integer.valueOf(attrs(2)))
      person.setName(attrs(1))
      person
    })
    // create the DataFrame by reflection and register it as a temporary table
    sqlContext.createDataFrame(persons, classOf[Person]).registerTempTable("persons")
    val personsRDD = sqlContext.sql("select * from persons where age >= 6").rdd
    val result = personsRDD.map(row => {
      val person = new Person
      person.setId(row.getInt(1))
      person.setAge(row.getInt(0)) // columns are reordered, so age is now at index 0
      person.setName(row.getString(2))
      person
    })
    result.collect.foreach(println)
  }
}
Console output:
16/05/25 18:10:16 INFO DAGScheduler: Job 0 finished: collect at RDD2DataFrameByReflection.java:69, took 0.795109 s
Person [id=1, name=hadoop, age=11]
Person [id=2, name=spark, age=7]
Person [id=4, name=ivy, age=27]
16/05/25 18:10:16 INFO SparkContext: Invoking stop() from shutdown hook
Two exceptions hit while developing this example are worth analyzing. The first:
16/05/25 17:42:59 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)
java.lang.IllegalAccessException: Class org.apache.spark.sql.SQLContext$$anonfun$org$apache$spark$sql$SQLContext$$beansToRows$1$$anonfun$apply$1 can not access a member of class cool.pengych.spark.sql.Person with modifiers "public"
at sun.reflect.Reflection.ensureMemberAccess(Reflection.java:102)
at java.lang.reflect.AccessibleObject.slowCheckMemberAccess(AccessibleObject.java:296)
at java.lang.reflect.AccessibleObject.checkAccess(AccessibleObject.java:288)
The exception comes from the line below, which converts the RDD into a DataFrame by using reflection to read the fields of the Person class dynamically. This requires Person (as listed above) to be declared public; if it is not, the exception above is thrown.
DataFrame df = sqlContext.createDataFrame(persons, Person.class);
The second:
Caused by: java.lang.ClassCastException: java.lang.String cannot be cast to java.lang.Integer
at scala.runtime.BoxesRunTime.unboxToInt(BoxesRunTime.java:106)
at org.apache.spark.sql.Row$class.getInt(Row.scala:218)
person.setAge(Integer.valueOf(attrs[2])); // age starts out at index 2
person.setAge(row.getInt(0)); // age ends up at index 0
age starts out at index 2. My guess is that after the data is converted to a DataFrame, Spark reorders the columns (in ascending alphabetical order) as a query optimization, so in the query result age sits at index 0. If you write person.setAge(row.getInt(2)) instead, you are actually reading the name field, which is why the ClassCastException above is thrown.
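To remove the dependence on column positions altogether, the fields of a Row can be looked up by name. Below is a minimal sketch, assuming Spark 1.3 or later (where Row.fieldIndex is available), of an alternative to the Row-to-Person mapping in the Java code above:

JavaRDD<Person> result = bigDatasRDD.map(new Function<Row, Person>() {
    private static final long serialVersionUID = 1L;
    @Override
    public Person call(Row row) throws Exception
    {
        Person person = new Person();
        // look each column up by name, so the alphabetical reordering no longer matters
        person.setId(row.getInt(row.fieldIndex("id")));
        person.setAge(row.getInt(row.fieldIndex("age")));
        person.setName(row.getString(row.fieldIndex("name")));
        return person;
    }
});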
For reference, the related Spark 1.x API:
def createDataFrame(data: java.util.List[_], beanClass: Class[_]): DataFrame
dataFrame.rdd
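A minimal usage sketch of the List-based createDataFrame overload listed above, assuming the same sqlContext and Person class as in the Java example (plus a java.util.ArrayList import); the sample values are made up purely for illustration:

List<Person> people = new ArrayList<Person>();
Person p = new Person();
p.setId(5);      // illustrative values only
p.setName("kafka");
p.setAge(9);
people.add(p);
// build a DataFrame directly from a local Java collection instead of an RDD
DataFrame smallDF = sqlContext.createDataFrame(people, Person.class);
smallDF.show();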