Spark data processing boils down to this: load data in some format (txt, json, csv, parquet, MySQL, Hive, HBase), which is the read step; apply whatever transformations are needed; and then save the result in the format the user wants, which is the write step.
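For example, a minimal read/transform/write round trip might look like the sketch below (the paths, column name, and filter condition here are made up for illustration):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("ReadWriteDemo").master("local[*]").getOrCreate()

// read: load data in some format
val people = spark.read.json("input/people.json")
// process: apply whatever transformations are needed
val adults = people.filter("age >= 18")
// write: save the result in the desired format
adults.write.parquet("output/adults")

spark.stop()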
Spark offers three data abstractions:
RDD
DataFrame
Dataset
An RDD can be converted to a DataFrame, and a DataFrame can be converted to a Dataset. Datasets are statically typed (static-typing) and type-safe at runtime (runtime type-safety).
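A minimal sketch of these conversions (assuming a SparkSession named spark, and using the case class Info defined later in this article):

import spark.implicits._

val rdd = spark.sparkContext.parallelize(Seq(Info(1, "xiaoming", 20), Info(2, "xiaohua", 22)))
// RDD -> DataFrame
val df = rdd.toDF()
// DataFrame -> Dataset[Info]
val ds = df.as[Info]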
The three APIs differ in when syntax errors and analysis errors are detected:

                   SQL        DataFrame      Dataset
  syntax errors    runtime    compile time   compile time
  analysis errors  runtime    runtime        compile time
With SQL, both kinds of errors are only detected at runtime. For example:
select * from stu;
-- the following statement has a syntax error: "select" is misspelled
slect * from stu;
-- this is only discovered when the statement is actually executed
-- suppose the stu table has a name column
select name from stu;
-- analysis error: the column name is misspelled
select nmea from stu;
-- only after the statement is run and analyzed do we learn that there is no nmea column
DataFrame
df.select("name").show()
//此时要是select拼写出错,在compile time,就会检测到错误
df.select("nmea").show()
//当name出错时,需要运行并解析之后,才发现错误
Dataset
dataset.map(line => line.name)
//if name were misspelled here, the error would already be caught at compile time
1 Creating a DataFrame via reflection
This approach has a precondition: the schema of the data must be known in advance; the DataFrame is then obtained through inference and implicit conversion.
import org.apache.spark.sql.{SparkSession, Row}
import org.apache.spark.sql.types.{StringType, IntegerType, StructType, StructField}

def inferReflection(spark: SparkSession): Unit = {
  //get the RDD
  val infoRdd = spark.sparkContext.textFile(".../info.txt")
  //import the implicit conversions
  import spark.implicits._
  //after splitting each line, convert directly with toDF()
  val df = infoRdd.map(_.split(",")).map(line => Info(line(0).toInt, line(1), line(2).toInt)).toDF()
  df.printSchema
  df.select("name").show()
}

case class Info(id: Int, name: String, age: Int)
Contents of info.txt:
1,xiaoming,20
2,xiaohua,22
3,pig,23
4,dog,21
2 Creating a DataFrame programmatically
If the schema is not known in advance, it has to be defined by hand, using
org.apache.spark.sql.Row and org.apache.spark.sql.types.{StructType, StructField}
import org.apache.spark.sql.{SparkSession, Row}
import org.apache.spark.sql.types.{StringType, IntegerType, StructType, StructField}

def program(spark: SparkSession): Unit = {
  //get the RDD and turn each line into a Row
  val infoRdd = spark.sparkContext.textFile(".../info.txt")
  val rowRdd = infoRdd.map(_.split(",")).map(line => Row(line(0).toInt, line(1), line(2).toInt))
  //define the schema by hand
  val structType = StructType(Array(
    StructField("id", IntegerType, true),
    StructField("name", StringType, true),
    StructField("age", IntegerType, true)))
  //build the DataFrame from the Row RDD and the schema
  val df = spark.createDataFrame(rowRdd, structType)
  df.printSchema
  df.select("name").show()
}
1 Reading json, parquet, csv, etc.
val spark = SparkSession.builder().appName("haha").master("local[*]").getOrCreate()
//path is the file path
val df1 = spark.read.json(path)
val df2 = spark.read.csv(path)
val df3 = spark.read.parquet(path)
//the corresponding save statements are
df1.write.json(path)
df1.write.csv(path)
df1.write.parquet(path)
//to avoid the result being written into many partitions, lower the number of shuffle partitions (the default is 200); here I set it to 10, and 1 works as well
spark.sqlContext.setConf("spark.sql.shuffle.partitions", "10")
//you can also chain option() calls to specify a header, infer the schema, and so on
spark.read.option("header", "true").option("inferSchema", "true").csv(path)
2 Reading a Hive table
Start spark-shell --master local[*]; then, at the Scala prompt:
//load a Hive table directly; tablename is the table name
val df1 = spark.table("tablename")
//a query also returns a DataFrame
val df2 = spark.sql("select * from tablename")
//and, in passing, here is how to save a DataFrame as a Hive table
df1.write.saveAsTable("tablename_save")
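Note that spark-shell already gives you a Hive-enabled session (when Spark is configured with Hive support); in a standalone application you have to enable it yourself when building the SparkSession. A minimal sketch:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("HiveApp")
  .master("local[*]")
  .enableHiveSupport() // required for spark.table / spark.sql to see Hive tables
  .getOrCreate()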
3 Reading data from MySQL
Read, approach 1:
spark.read.format("jdbc")
.option("url", "jdbc:mysql://localhost:3306/hive")
.option("dbtable", "hive.TBLS")
.option("user", "root")
.option("password", "root")
.option("driver", "com.mysql.jdbc.Driver")
.load()
//hive.TBLS means the database is named hive and the table is TBLS
//note that the JDBC driver must be specified and available on the classpath
Read, approach 2:
import java.util.Properties
val connectionProperties = new Properties()
connectionProperties.put("user", "root")
connectionProperties.put("password", "root")
connectionProperties.put("driver", "com.mysql.jdbc.Driver")
(1)
val jdbcDF2 = spark.read.jdbc("jdbc:mysql://localhost:3306", "hive.TBLS", connectionProperties)
(2)
// Specifying custom data types for the read schema
connectionProperties.put("customSchema", "id DECIMAL(38, 0), name STRING")
val jdbcDF3 = spark.read.jdbc("jdbc:postgresql:dbserver", "schema.tablename", connectionProperties)
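For a large table, the JDBC reader can also split the load into parallel partitions along a numeric column; a sketch (the partition column and bounds here are assumptions, not from the original):

val jdbcDF4 = spark.read.format("jdbc")
  .option("url", "jdbc:mysql://localhost:3306/hive")
  .option("dbtable", "hive.TBLS")
  .option("user", "root")
  .option("password", "root")
  .option("driver", "com.mysql.jdbc.Driver")
  .option("partitionColumn", "TBL_ID") // must be a numeric, date, or timestamp column
  .option("lowerBound", "1")
  .option("upperBound", "1000")
  .option("numPartitions", "4")
  .load()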
Write, approach 1:
jdbcDF.write.format("jdbc")
.option("url", "jdbc:mysql://localhost:3306")
.option("dbtable", "hive.writejdbc").option("user", "root")
.option("password", "root")
.option("driver", "com.mysql.jdbc.Driver")
.save()
Write, approach 2:
// Specifying column names and data types for the created table on write
jdbcDF.write
.option("createTableColumnTypes", "name CHAR(64), comments VARCHAR(1024)")
.jdbc("jdbc:mysql://localhost:3306", "test.writejdbc2", connectionProperties)
Write, approach 3:
jdbcDF2.write.jdbc("jdbc:mysql://localhost:3306", "test.writejdbc3", connectionProperties)
//here test is the database name and writejdbc2 / writejdbc3 are the table names
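By default these writes fail if the target table already exists; a save mode changes that behaviour. A sketch, reusing connectionProperties from above:

import org.apache.spark.sql.SaveMode

//append to an existing table (or use SaveMode.Overwrite to replace it) instead of failing
jdbcDF2.write
  .mode(SaveMode.Append)
  .jdbc("jdbc:mysql://localhost:3306", "test.writejdbc3", connectionProperties)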
package com.imooc.spark

import org.apache.spark.sql.SparkSession

object DatasetApp {

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("DatasetApp")
      .master("local[2]").getOrCreate()

    //note: the implicit conversions must be imported
    import spark.implicits._

    val path = "file:///Users/rocky/data/sales.csv"

    //how does Spark parse a csv file?
    val df = spark.read.option("header", "true").option("inferSchema", "true").csv(path)
    df.show

    //convert directly; the case class Sales is defined below
    val ds = df.as[Sales]

    //map over the Dataset and print the itemId column
    ds.map(line => line.itemId).show

    spark.stop()
  }

  case class Sales(transactionId: Int, customerId: Int, itemId: Int, amountPaid: Double)
}
//infoDF is a DataFrame; register it as a temporary view and it can be queried with spark.sql
infoDF.createOrReplaceTempView("infos")
spark.sql("select * from infos where age > 30").show()