Spark: querying arbitrary fields and outputting the result as a DataFrame

When writing a Spark program, querying a particular field of a CSV file is usually done in one of the following ways.

**Method (1):** query directly with a DataFrame

val df = sqlContext.read
    .format("com.databricks.spark.csv")
    .option("header", "true") // Use first line of all files as header
    .schema(customSchema) // customSchema: a user-defined StructType describing the csv columns
    .load("cars.csv")
val selectedData = df.select("year", "model")


Reference: https://github.com/databricks/spark-csv

The CSV-reading code above is the Spark 1.x style; the Spark 2.x call looks slightly different:

val df = sparkSession.read
    .format("com.databricks.spark.csv")
    .option("header", "true")
    .option("mode", "DROPMALFORMED") // drop lines that fail to parse
    .load("people.csv")
    .cache()
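
Incidentally, Spark 2.x also ships a built-in CSV data source, so the databricks spark-csv package is not strictly required; a minimal equivalent read (assuming the same people.csv with a header row) might look like this:

val df = sparkSession.read
    .option("header", "true")
    .option("mode", "DROPMALFORMED")
    .csv("people.csv")
    .cache()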

**Method (2):** build a case class.

case class Person(name: String, age: Long)
// For implicit conversions from RDDs to DataFrames
import spark.implicits._

// Create an RDD of Person objects from a text file, convert it to a DataFrame
val peopleDF = spark.sparkContext
  .textFile("examples/src/main/resources/people.txt")
  .map(_.split(","))
  .map(attributes => Person(attributes(0), attributes(1).trim.toInt))
  .toDF()
// Register the DataFrame as a temporary view
peopleDF.createOrReplaceTempView("people")

// SQL statements can be run by using the sql methods provided by Spark
val teenagersDF = spark.sql("SELECT name, age FROM people WHERE age BETWEEN 13 AND 19")

// The columns of a row in the result can be accessed by field index
teenagersDF.map(teenager => "Name: " + teenager(0)).show()
// +------------+
// |       value|
// +------------+
// |Name: Justin|
// +------------+

This example comes from the Spark 2.2.0 documentation.

Reference: http://spark.apache.org/docs/latest/sql-programming-guide.html

Both of the approaches above are fine for testing against small files whose header has at most a few dozen columns, for example when I only need a user's Name, Age, and Sex fields.

In practice, however, you run into these problems:

  1. I am not sure in advance which fields will be queried;
  2. I am not sure in advance how many fields will be queried.

For those cases the examples above are no longer enough. Fortunately there is a third approach.
**Method (3):**

import org.apache.spark.sql.Row
import org.apache.spark.sql.types._

// Create an RDD
val peopleRDD = spark.sparkContext.textFile("examples/src/main/resources/people.txt")

// The schema is encoded in a string
val schemaString = "name age"

// Generate the schema based on the string of schema
val fields = schemaString.split(" ")
  .map(fieldName => StructField(fieldName, StringType, nullable = true))
val schema = StructType(fields)

// Convert records of the RDD (people) to Rows
val rowRDD = peopleRDD
  .map(_.split(","))
  .map(attributes => Row(attributes(0), attributes(1).trim))

// Apply the schema to the RDD
val peopleDF = spark.createDataFrame(rowRDD, schema)

// Creates a temporary view using the DataFrame
peopleDF.createOrReplaceTempView("people")

// SQL can be run over a temporary view created using DataFrames
val results = spark.sql("SELECT name FROM people")

// The results of SQL queries are DataFrames and support all the normal RDD operations
// The columns of a row in the result can be accessed by field index or by field name
results.map(attributes => "Name: " + attributes(0)).show()
// +-------------+
// |        value|
// +-------------+
// |Name: Michael|
// |   Name: Andy|
// | Name: Justin|
// +-------------+

This example also comes from the Spark documentation and still produces a DataFrame, but here the schema of the queried fields is built from StructField and StructType, and each field of a result row is accessed by numeric index instead of a concrete name such as Name or Age. In practice, though, method (3) behaves much like methods (1) and (2): it still does not solve the problems raised above, so it needs further improvement.
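
For reference, a result row can also be read by field name instead of index; assuming spark.implicits._ is in scope as in method (2), the last line of method (3) could equally be written as:

// access the column by name rather than by position
results.map(attributes => "Name: " + attributes.getAs[String]("name")).show()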

**Method (4):**

val df = sparkSession.read
  .format("com.databricks.spark.csv")
  .option("header", "true")
  .option("mode", "DROPMALFORMED")
  .load("people.csv")
  .cache()
val schemaString = "name,age"
// register a temporary view
df.createOrReplaceTempView("people")
// run the SQL query
val dataDF = sparkSession.sql("select " + schemaString + " from people")
// convert the DataFrame to an RDD[Row]
val dfrdd = dataDF.rdd
// build the schema from the same comma-separated field list
val fields = schemaString.split(",").map(fieldName => StructField(fieldName, StringType, nullable = true))
val schema = StructType(fields)
// turn the RDD back into a DataFrame
val newDF = sparkSession.createDataFrame(dfrdd, schema)

This addresses the problems raised above: the field list is just a string, so neither the field names nor the number of fields has to be fixed in advance.
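
The point of method (4) is that schemaString no longer needs to be hard-coded; it can be assembled at runtime from whatever list of columns is actually requested. A minimal sketch (the columnsToQuery array here is illustrative):

// the column list is only known at runtime, e.g. parsed from user input or a config file
val columnsToQuery: Array[String] = Array("name", "age")
val schemaString = columnsToQuery.mkString(",")
val dataDF = sparkSession.sql("select " + schemaString + " from people")
dataDF.show()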

DataFrames are fast, especially in recent versions. Even so, in a production environment we may still use RDDs to transform the data into the shape we want. In that case, it can be written like this:

The idea is to take the full array of values of each CSV row, pull out only the required fields, and turn them into a new array. For example, if the CSV file has 15 columns and only 2 of them, "NAME" and "AGE", are wanted, the selection can be done by matching arrays. The main steps are:

  1. When reading the CSV file, record the position (say, column n) of every field in the first header line, which gives array 1;
  2. Put the fields to be queried into array 2;
  3. Match array 2 against array 1 and record the position in array 1 of each field from array 2, which produces a new array 3;
  4. Array 3 records where the fields to be read sit within the full field list; with it, neither the concrete field names nor the number of fields is needed, which makes reading and writing the data convenient.

val queryArr = Array("NAME", "AGE")

    val rowRDD2 = rowRDD.map(attributes => {
      // the full field array of one csv row
      val myattributes: Array[String] = attributes
      // array 3: the column positions of the fields to query, e.g. column n
      val mycolumnsNameIndexArr: Array[Int] = colsNameIndexArrBroadcast.value
      val mycolumnsNameDataArrb: ArrayBuffer[String] = new ArrayBuffer[String]()
      // pick out only the values at the positions recorded in array 3
      for (i <- 0 until mycolumnsNameIndexArr.length) {
        mycolumnsNameDataArrb += myattributes(mycolumnsNameIndexArr(i)).toString
      }
      val mycolumnsNameDataArr: Array[String] = mycolumnsNameDataArrb.toArray
      mycolumnsNameDataArr
    }).map(x => Row(x)).cache() // each resulting Row holds the extracted array as a single value

Here attributes is the full field array of one row (its columns line up with array 1), and mycolumnsNameIndexArr is array 3.
Each row of the resulting RDD is therefore an array of just the selected values; by iterating over the rows you can then turn the rows into columns.
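
If instead you want each selected value to land in its own column, so that the extracted rows can be turned straight back into a DataFrame with a schema built as in method (4), one variant (a sketch only, reusing rowRDD, queryArr and colsNameIndexArrBroadcast from the snippets above, with the imports from method (3) in scope) is to spread the array into the Row rather than wrapping it:

    val rowRDD3 = rowRDD.map(attributes => {
      // array 3, read from the broadcast variable as above
      val idx = colsNameIndexArrBroadcast.value
      // spread the selected values into separate columns instead of one array-valued column
      Row.fromSeq(idx.map(i => attributes(i)))
    })
    val querySchema = StructType(queryArr.map(name => StructField(name, StringType, nullable = true)))
    val queryDF = sparkSession.createDataFrame(rowRDD3, querySchema)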

The function that generates array 3 is shown below:

import scala.collection.mutable.ArrayBuffer
import scala.util.control.Breaks._

  /**
    * Description: find the position (column index) of each queried field in the first csv header line, e.g. column n
    * Author: zhouyang
    * Date 2017/11/14 11:13
    * @param header : String
    * @param columnsNameArr : Array[String]
    * @return Array[Int]
    */
  def getColumnsNameIndexArr(header: String, columnsNameArr: Array[String]): Array[Int] = {
    val tempHeaderArr: Array[String] = header.split(",")
    val indexArrb = new ArrayBuffer[Int]()
    if (tempHeaderArr.length > 0) {
      for (j <- 0 until columnsNameArr.length) {
        val columnsNameStrTemp = columnsNameArr(j)
        var i: Int = 0
        breakable {
          // scan the header fields until the queried field is found
          while (i < tempHeaderArr.length) {
            if (tempHeaderArr(i) == columnsNameStrTemp) {
              indexArrb += i // record the column position of the matched field
              break()
            }
            i += 1
          }
        }
      }
    }
    indexArrb.toArray
  }
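
To connect the pieces: the header line is read once on the driver, passed to getColumnsNameIndexArr, and the resulting index array is broadcast so that the extraction map above can use it on the executors. A plausible wiring (the people.csv path and variable names here are illustrative, not taken from the original article) is:

// read the raw csv as an RDD of lines
val linesRDD = sparkSession.sparkContext.textFile("people.csv")
// the header is the first line of the file
val header = linesRDD.first()
// array 3: positions of the queried fields within the header
val colsNameIndexArr = getColumnsNameIndexArr(header, queryArr)
// broadcast array 3 so every executor can look it up cheaply
val colsNameIndexArrBroadcast = sparkSession.sparkContext.broadcast(colsNameIndexArr)
// drop the header line and split each remaining line into its full field array
val rowRDD = linesRDD.filter(_ != header).map(_.split(",", -1))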

For the full write-up, see: https://blog.csdn.net/cafebar123/article/details/79509456
