Creating DataFrames with Scala in Spark 1.6.0

I. Creating a DataFrame from a file

(1) Load a CSV file with SQLContext's csvFile function (from the spark-csv package)

// spark-csv's implicits add the csvFile method to SQLContext
import com.databricks.spark.csv._
val students = sqlContext.csvFile(filePath = "StudentData.csv", useHeader = true, delimiter = '|')

The csvFile method takes the path of the CSV file to load (filePath). If the file has a header row, set useHeader to true so that the first line is read as the column names; delimiter specifies the separator between columns. The resulting students object has the type org.apache.spark.sql.DataFrame.
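To verify the load, a quick sanity check (a minimal sketch; the actual output depends on the columns in StudentData.csv):

students.printSchema()
students.show(3)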

(2) Load a CSV file with SQLContext's load function

val options = Map("header" -> "true", "path" -> "E:\\StudentData.csv")  
val newStudents = sqlContext.read.options(options).format("com.databricks.spark.csv").load()
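Equivalently, each option can be passed individually and the path handed straight to load; a minimal sketch of the same read (newStudents2 is just an illustrative name):

val newStudents2 = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "true")
  .load("E:\\StudentData.csv")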

(3) Load a JSON file with SQLContext's read

val sqlContext = new org.apache.spark.sql.SQLContext(sc) 
val df = sqlContext.read.json("examples/src/main/resources/people.json")
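A quick look at the result (the name and age fields below follow the people.json shipped with the Spark examples):

df.select("name").show()
df.filter(df("age") > 21).show()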

(4) Create a DataFrame with SQLContext's createDataFrame method

  (1) Convert the RDD to an RDD[Row]

  (2) Build a StructType schema

  (3) Call createDataFrame on the SQLContext

import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.{SparkConf, SparkContext}

def createDataFrame(sparkCtx: SparkContext, sqlCtx: SQLContext): Unit = {
  // (1) Read the text file and map each line to a Row
  val rowRDD = sparkCtx.textFile("D://***/studentInfo.txt")
    .map(_.split(","))
    .map(p => Row(p(0), p(1).toInt, p(2)))
  // (2) Describe the columns with a StructType
  val schema = StructType(
    Seq(
      StructField("name", StringType, true),
      StructField("age", IntegerType, true),
      StructField("studentNo", StringType, true)
    )
  )
  // (3) Combine the RDD[Row] and the schema into a DataFrame
  val dataDF = sqlCtx.createDataFrame(rowRDD, schema)
  dataDF.show()
}

(5) Create a DataFrame from an RDD of a case class via toDF
// Use the case class Student; Spark infers the schema via reflection
import org.apache.spark.sql.DataFrame

case class Student(name: String, age: Int, studentNO: String)

def rddToDFCase(sparkCtx: SparkContext, sqlCtx: SQLContext): DataFrame = {
  // Import the implicit conversions; otherwise the RDD has no toDF method
  import sqlCtx.implicits._
  val dataFrame = sparkCtx.textFile("E:/scala_workspacey/studentInfo.txt", 2)
    .map(x => x.split(","))
    .map(x => Student(x(0), x(1).trim.toInt, x(2)))
    .toDF()
  dataFrame
}

Note: in Spark 1.6 the implicit RDD-to-DataFrame conversions were factored out into the SQLContext.implicits object, so this import is now required before an RDD can be converted to a DataFrame.
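A minimal sketch of the import at work, assuming an existing sc and sqlContext (e.g. in spark-shell):

import sqlContext.implicits._
// Without the import above, toDF does not compile
val df = sc.parallelize(Seq(("Tom", 20), ("Jane", 19))).toDF("name", "age")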

II. Creating a DataFrame from a database

(1) From a MySQL table

Method 1:
val jdbcDF = sqlCtx.read
  .format("jdbc")
  .option("url", "jdbc:mysql://localhost:3306/test")
  .option("dbtable", "sy_users")
  .option("user", "root")
  .option("password", "password")
  .load()
Method 2:
import java.util.Properties

val prop = new Properties()
prop.put("user", "root")
prop.put("password", "****")
prop.put("driver", "com.mysql.jdbc.Driver")
val data = sqlCtx.read.jdbc("jdbc:mysql://localhost:3306/test", "sy_users", prop)
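With either method, the returned DataFrame can be registered and queried with SQL; a minimal sketch reusing the sy_users table from above:

data.registerTempTable("sy_users")
sqlCtx.sql("SELECT * FROM sy_users").show()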

(2) From a Hive table

import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}
import org.apache.spark.{SparkConf, SparkContext}

val sparkCtx = new SparkContext(new SparkConf())
val hiveCtx = new HiveContext(sparkCtx)
// Run the query, then drop down to the underlying RDD[Row]
val student_rowRDD = hiveCtx.sql("select * from sparktest.student").rdd

val schema = StructType(List(
    StructField("id", IntegerType, true),
    StructField("name", StringType, true),
    StructField("gender", StringType, true),
    StructField("age", IntegerType, true)))

val studentDataFrame = hiveCtx.createDataFrame(student_rowRDD, schema)
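Note that hiveCtx.sql itself already returns a DataFrame, so the explicit schema above is only needed when you want to re-declare column names or types; otherwise a single line suffices:

val studentDF = hiveCtx.sql("select * from sparktest.student")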

