1. Creating a DF from a file
(1) Loading a CSV file into a DF with SQLContext's csvFile function
import com.databricks.spark.csv._
val students=sqlContext.csvFile(filePath="StudentData.csv", useHeader=true, delimiter='|')
The csvFile method takes the path of the CSV file to load (filePath). If the file carries a header row, set useHeader to true so the first line is read as the column names; delimiter specifies the separator between columns. The resulting students object has type org.apache.spark.sql.DataFrame.
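To sanity-check the load, a minimal sketch of inspecting the result (the "name" column is an assumption about the header of StudentData.csv):

students.printSchema()          // column names are taken from the header row
students.show(5)                // display the first 5 rows
students.select("name").show()  // "name" is a hypothetical header column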
(2) Loading a CSV file into a DF with SQLContext's load function
val options = Map("header" -> "true", "path" -> "E:\\StudentData.csv")  // "header" reads the first row as column names
val newStudents = sqlContext.read.options(options).format("com.databricks.spark.csv").load()
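The options map accepts any spark-csv option. As a sketch, "delimiter" and "inferSchema" are two common additions (both are documented spark-csv options; the file path is reused from above):

// Sketch: the same load, with a custom delimiter and automatic type inference
val moreOptions = Map(
  "header"      -> "true",
  "delimiter"   -> "|",       // columns separated by '|', matching the csvFile example
  "inferSchema" -> "true",    // infer column types instead of defaulting everything to String
  "path"        -> "E:\\StudentData.csv")
val typedStudents = sqlContext.read.options(moreOptions).format("com.databricks.spark.csv").load()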
(3) Loading a JSON file into a DF with SQLContext's read API
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
val df = sqlContext.read.json("examples/src/main/resources/people.json")
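people.json ships with the Spark distribution and contains name/age records, so a quick sketch of inspecting and filtering the result:

df.printSchema()                  // age: long, name: string
df.show()
df.filter(df("age") > 21).show()  // Column-based filter on the age field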
(4) Creating a DF with SQLContext's createDataFrame method
This takes three steps, as the code below shows:
Step 1: convert the raw RDD to an RDD[Row].
Step 2: build a StructType that describes the schema.
Step 3: call createDataFrame on the SQLContext with the RDD[Row] and the schema.
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}
import org.apache.spark.sql.{DataFrame, Row, SQLContext}
import org.apache.spark.{SparkConf, SparkContext}
def createDataFrame(sparkCtx: SparkContext, sqlCtx: SQLContext): Unit = {
  // Step 1: parse each text line into a Row(name, age, studentNo)
  val rowRDD = sparkCtx.textFile("D://***/studentInfo.txt")
    .map(_.split(","))
    .map(p => Row(p(0), p(1).toInt, p(2)))
  // Step 2: describe the schema that matches the Row layout above
  val schema = StructType(
    Seq(
      StructField("name", StringType, true),
      StructField("age", IntegerType, true),
      StructField("studentNo", StringType, true)
    )
  )
  // Step 3: combine the RDD[Row] with the schema
  val dataDF = sqlCtx.createDataFrame(rowRDD, schema)
  dataDF.show()
}
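A minimal sketch of wiring the method up (the app name and local master URL are assumptions for a standalone test):

val conf = new SparkConf().setAppName("CreateDataFrameDemo").setMaster("local[2]")
val sc = new SparkContext(conf)
val sqlCtx = new SQLContext(sc)
createDataFrame(sc, sqlCtx)   // reads studentInfo.txt and prints the resulting DF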
// Use a case class instead of an explicit schema; it must be defined
// outside the method so Spark can derive the column types from it.
case class Student(name: String, age: Int, studentNO: String)

def rddToDFCase(sparkCtx: SparkContext, sqlCtx: SQLContext): DataFrame = {
  // Import the implicits; otherwise the RDD has no toDF method
  import sqlCtx.implicits._
  val dataFrame = sparkCtx.textFile("E:/scala_workspacey/studentInfo.txt", 2)
    .map(x => x.split(","))
    .map(x => Student(x(0), x(1).trim().toInt, x(2)))
    .toDF()
  dataFrame
}
Note: in Spark 1.6 the implicit RDD-to-DataFrame conversion was moved out into the separate SQLContext.implicits object, so converting an RDD to a DataFrame now requires this import.
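Once the conversion succeeds, the DF can be registered and queried with SQL; a hedged sketch (the temp-table name is an assumption; registerTempTable is the Spark 1.x API):

val students = rddToDFCase(sc, sqlCtx)
students.registerTempTable("student")
sqlCtx.sql("SELECT name, age FROM student WHERE age >= 18").show()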
2. Creating a DF from a database
(1) From a MySQL table
Method 1:
val jdbcDF = sqlCtx.read
.format("jdbc")
.option("url", "jdbc:mysql://localhost:3306/test")
.option("dbtable", "sy_users")
.option("user", "root")
.option("password", "password")
.load()
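For large tables, the same source can read in parallel by splitting on a numeric column; partitionColumn, lowerBound, upperBound, and numPartitions are standard options of the JDBC source, while the "id" column here is an assumption about sy_users:

val partitionedDF = sqlCtx.read
  .format("jdbc")
  .option("url", "jdbc:mysql://localhost:3306/test")
  .option("dbtable", "sy_users")
  .option("user", "root")
  .option("password", "password")
  .option("partitionColumn", "id")   // hypothetical numeric column to split on
  .option("lowerBound", "1")
  .option("upperBound", "100000")
  .option("numPartitions", "4")      // 4 parallel JDBC connections
  .load()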
Method 2:
import java.util.Properties

val prop = new Properties()
prop.put("user", "root")
prop.put("password", "****")
prop.put("driver", "com.mysql.jdbc.Driver")
val data = sqlCtx.read.jdbc("jdbc:mysql://localhost:3306/test", "sy_users", prop)
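The same Properties object also works for writing a DataFrame back to MySQL; a minimal sketch (the target table sy_users_copy is hypothetical):

import org.apache.spark.sql.SaveMode
data.write.mode(SaveMode.Append).jdbc("jdbc:mysql://localhost:3306/test", "sy_users_copy", prop)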
(2) From a Hive table
import org.apache.spark.sql.hive.HiveContext

val sparkCtx = new SparkContext(new SparkConf())
val hiveCtx = new HiveContext(sparkCtx)
// The query result is a DataFrame; take its underlying RDD[Row]
val student_rowRDD = hiveCtx.sql("select * from sparktest.student").rdd
// Rebuild a DataFrame from the RDD[Row] with an explicit schema
val schema = StructType(List(
  StructField("id", IntegerType, true),
  StructField("name", StringType, true),
  StructField("gender", StringType, true),
  StructField("age", IntegerType, true)))
val studentDataFrame = hiveCtx.createDataFrame(student_rowRDD, schema)
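Note that hiveCtx.sql already returns a DataFrame in Spark 1.x, so the detour through .rdd is only needed when you want to impose your own schema; the direct route is simply:

val directDF = hiveCtx.sql("select * from sparktest.student")
directDF.printSchema()
directDF.show()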