When writing a Spark program, querying a particular field of a CSV file is usually done like this:
**Method (1):** query directly with a DataFrame.
```scala
val df = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "true") // Use first line of all files as header
  .schema(customSchema)     // customSchema: a user-defined StructType describing the columns
  .load("cars.csv")
val selectedData = df.select("year", "model")
```
Reference: https://github.com/databricks/spark-csv
The snippet above reads a CSV file the Spark 1.x way; the Spark 2.x way looks a little different:
```scala
val df = sparkSession.read.format("com.databricks.spark.csv")
  .option("header", "true")
  .option("mode", "DROPMALFORMED")
  .load("people.csv")
  .cache()
```
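Incidentally, since Spark 2.0 a CSV reader is built into Spark itself, so the external `com.databricks.spark.csv` package is no longer required. A minimal sketch of the equivalent call:

```scala
// Spark 2.x ships with a native CSV data source
val df2 = sparkSession.read
  .option("header", "true")        // treat the first line as the header
  .option("mode", "DROPMALFORMED") // drop rows that fail to parse
  .csv("people.csv")
  .cache()
```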
**Method (2):** build a case class.
```scala
case class Person(name: String, age: Long)

// For implicit conversions from RDDs to DataFrames
import spark.implicits._

// Create an RDD of Person objects from a text file, convert it to a DataFrame
val peopleDF = spark.sparkContext
  .textFile("examples/src/main/resources/people.txt")
  .map(_.split(","))
  .map(attributes => Person(attributes(0), attributes(1).trim.toInt))
  .toDF()

// Register the DataFrame as a temporary view
peopleDF.createOrReplaceTempView("people")

// SQL statements can be run by using the sql methods provided by Spark
val teenagersDF = spark.sql("SELECT name, age FROM people WHERE age BETWEEN 13 AND 19")

// The columns of a row in the result can be accessed by field index
teenagersDF.map(teenager => "Name: " + teenager(0)).show()
// +------------+
// |       value|
// +------------+
// |Name: Justin|
// +------------+
```
This example comes from the Spark 2.2.0 website.
Reference: http://spark.apache.org/docs/latest/sql-programming-guide.html
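In Spark 2.x the same case class can also be mapped straight onto a CSV file to get a typed Dataset. A minimal sketch, assuming a `people.csv` whose columns match `Person` (the file path is a placeholder):

```scala
import org.apache.spark.sql.Encoders

// Derive the schema from the case class, then read the CSV as a Dataset[Person]
val peopleDS = spark.read
  .option("header", "true")
  .schema(Encoders.product[Person].schema)
  .csv("examples/src/main/resources/people.csv")
  .as[Person]

peopleDS.filter(p => p.age >= 13 && p.age <= 19).show()
```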
The two approaches above are fine for quick tests on small files whose headers have no more than a few dozen fields, for example querying just a user's Name, Age, and Sex.
In practice, however, you run into this problem: the file may have many columns, and the set of fields to query may only be known at runtime, so a hard-coded select list or case class is no longer enough. Fortunately there is a third approach.
**Method (3):**
```scala
import org.apache.spark.sql.Row
import org.apache.spark.sql.types._

// Create an RDD
val peopleRDD = spark.sparkContext.textFile("examples/src/main/resources/people.txt")

// The schema is encoded in a string
val schemaString = "name age"

// Generate the schema based on the string of schema
val fields = schemaString.split(" ")
  .map(fieldName => StructField(fieldName, StringType, nullable = true))
val schema = StructType(fields)

// Convert records of the RDD (people) to Rows
val rowRDD = peopleRDD
  .map(_.split(","))
  .map(attributes => Row(attributes(0), attributes(1).trim))

// Apply the schema to the RDD
val peopleDF = spark.createDataFrame(rowRDD, schema)

// Creates a temporary view using the DataFrame
peopleDF.createOrReplaceTempView("people")

// SQL can be run over a temporary view created using DataFrames
val results = spark.sql("SELECT name FROM people")

// The results of SQL queries are DataFrames and support all the normal RDD operations
// The columns of a row in the result can be accessed by field index or by field name
results.map(attributes => "Name: " + attributes(0)).show()
// +-------------+
// |        value|
// +-------------+
// |Name: Michael|
// |   Name: Andy|
// | Name: Justin|
// +-------------+
```
This example also comes from the Spark website. It still ends up with a DataFrame, but the schema is built programmatically with StructField and StructType, and each queried field in the result is accessed by a positional index rather than by the concrete names Name and Age. In effect, though, method (3) behaves much like methods (1) and (2) and still does not solve the problem raised above, so it needs a further tweak.
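One small improvement worth noting: a result row's fields can also be read by name with `Row.getAs`, which avoids brittle positional indices. A minimal sketch against the `results` DataFrame above:

```scala
// Access a column by field name instead of by index
results.map(row => "Name: " + row.getAs[String]("name")).show()
```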
**Method (4):**
```scala
import org.apache.spark.sql.types.{StringType, StructField, StructType}

val df = sparkSession.read.format("com.databricks.spark.csv")
  .option("header", "true")
  .option("mode", "DROPMALFORMED")
  .load("people.csv")
  .cache()
val schemaString = "name,age"
// Register a temporary view
df.createOrReplaceTempView("people")
// Build the SQL query from the field list
val dataDF = sparkSession.sql("select " + schemaString + " from people")
// Convert to an RDD
val dfrdd = dataDF.rdd
// Rebuild a schema from the same field list
val fields = schemaString.split(",").map(fieldName => StructField(fieldName, StringType, nullable = true))
val schema = StructType(fields)
// Convert the RDD back to a DataFrame
val newDF = sparkSession.createDataFrame(dfrdd, schema)
```
This solves the problem raised above, since the field list is just a string that can be assembled at runtime.
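The same dynamic selection can also be done without going through SQL, by turning the field list into `Column` objects. A minimal sketch, assuming the same `df` and `schemaString` as above:

```scala
import org.apache.spark.sql.functions.col

// Split the comma-separated field list and select all columns at once
val cols = schemaString.split(",").map(name => col(name.trim))
val newDF2 = df.select(cols: _*)
```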
DataFrames are fast, especially in the newer versions. Of course, in a production environment you may still use RDDs to transform the data into the shape you want. In that case, you can write it like this:
The idea is to take the array holding one full CSV row, pull out only the fields you need, and pack them into a new array. For example, suppose the CSV file has 15 columns and you want 2 of them, "NAME" and "AGE". The columns can be matched by their positions in the array; the main idea is:
```scala
import scala.collection.mutable.ArrayBuffer
import org.apache.spark.sql.Row

val queryArr = Array("NAME", "AGE")
val rowRDD2 = rowRDD.map(attributes => {
  // attributes: the array holding one full row of the CSV
  val myattributes: Array[String] = attributes
  // The positions (column indices) of the fields to query, e.g. column n
  val mycolumnsNameIndexArr: Array[Int] = colsNameIndexArrBroadcast.value
  val mycolumnsNameDataArrb: ArrayBuffer[String] = new ArrayBuffer[String]()
  for (i <- 0 until mycolumnsNameIndexArr.length) {
    mycolumnsNameDataArrb += myattributes(mycolumnsNameIndexArr(i))
  }
  val mycolumnsNameDataArr: Array[String] = mycolumnsNameDataArrb.toArray
  mycolumnsNameDataArr
}).map(x => Row.fromSeq(x)).cache() // Row.fromSeq puts each extracted field in its own column
```
Here attributes is the array holding the full row, and mycolumnsNameIndexArr is the array of matched column positions. Each row of the resulting RDD now contains just the queried fields; by iterating over the rows you can also turn rows into columns.
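If you need a DataFrame again afterwards, a schema can be rebuilt from `queryArr` and applied to `rowRDD2`. A minimal sketch, assuming each extracted row has one entry per queried column:

```scala
import org.apache.spark.sql.types.{StringType, StructField, StructType}

// Build a schema whose fields match the extracted columns, then apply it
val querySchema = StructType(queryArr.map(name => StructField(name, StringType, nullable = true)))
val queryDF = sparkSession.createDataFrame(rowRDD2, querySchema)
```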
The method that generates the index array is as follows:
```scala
import scala.util.control.Breaks._

/**
 * Description: find the position (column index) of each queried field in the
 * first (header) line of the CSV.
 * Author: zhouyang
 * Date: 2017/11/14 11:13
 * @param header : String
 * @param columnsNameArr : Array[String]
 * @return Array[Int]
 */
def getColumnsNameIndexArr(header: String, columnsNameArr: Array[String]): Array[Int] = {
  val tempHeaderArr: Array[String] = header.split(",")
  val indexArrb = new ArrayBuffer[Int]()
  if (tempHeaderArr.length > 0) {
    for (j <- 0 until columnsNameArr.length) {
      val columnsNameStrTemp = columnsNameArr(j)
      var i: Int = 0
      breakable {
        // Scan the header fields until the queried name is found
        while (i < tempHeaderArr.length) {
          if (tempHeaderArr(i) == columnsNameStrTemp) {
            indexArrb += i
            break()
          }
          i += 1
        }
      }
    }
  }
  indexArrb.toArray
}
```
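A short usage sketch, tying the helper back to the broadcast variable used in the RDD snippet above (the file name `people.csv` is a placeholder):

```scala
// Hypothetical setup: read the raw file and take its first line as the header
val rawRDD = sparkSession.sparkContext.textFile("people.csv")
val header = rawRDD.first()

// Locate the queried columns in the header and broadcast their positions,
// producing the colsNameIndexArrBroadcast variable used earlier
val colsNameIndexArr = getColumnsNameIndexArr(header, Array("NAME", "AGE"))
val colsNameIndexArrBroadcast = sparkSession.sparkContext.broadcast(colsNameIndexArr)
```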
For the full article, see: https://blog.csdn.net/cafebar123/article/details/79509456