- spark-sql
- spark-sql operations in the CLI
- jar dependencies needed for spark-sql programming
- inferring the schema through reflection
- specifying the schema with StructType
- loading data from MySQL through spark-shell
- saving data to MySQL
- hive-on-spark
spark-sql
DataFrame: a distributed data container that looks like a two-dimensional table in a relational database; it is essentially a table. Besides the data itself it also records the structural information of the data, the schema. Like Hive, a DataFrame supports nested data types (struct, array, map).
Dataset: a strongly typed, distributed collection of objects, introduced as an experimental API in Spark 1.6; it combines the type safety of an RDD with the optimized execution of Spark SQL.
A key feature of spark-sql: it supports multiple data sources.
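As a rough illustration of the multi-source support, the same read API can load JSON, Parquet, or a JDBC table. This is only a sketch; the file paths and connection settings below are placeholders rather than values from these notes.
// a minimal sketch, assuming an existing SQLContext named sqlContext (Spark 1.6);
// the file paths and MySQL settings are placeholder assumptions
val jsonDF    = sqlContext.read.json("hdfs://hadoop01:9000/some.json")       // JSON source
val parquetDF = sqlContext.read.parquet("hdfs://hadoop01:9000/some.parquet") // Parquet source
val jdbcDF    = sqlContext.read.format("jdbc").options(Map(
  "url"      -> "jdbc:mysql://hadoop01:3306/bigdata",
  "driver"   -> "com.mysql.jdbc.Driver",
  "dbtable"  -> "person",
  "user"     -> "root",
  "password" -> "root")).load()                                              // JDBC source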
spark-sql operations in the CLI
DSL: domain-specific language; the DSL-style syntax is shown below (a self-contained sketch follows this block).
First obtain an RDD.
val df=rdd.toDF("id","name","age") //sets the column names, i.e. the schema
df.show displays the table's data
df.select("name").show queries the name column
df.select("name","id").show queries several columns at once
df.select("name","age").filter(col("age")>5).show selects name and age, keeping only rows where age is greater than 5
df.printSchema prints the schema
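Pulling these one-liners together, a minimal runnable DSL sketch might look as follows; the input path matches the person.txt used later in these notes, but the assumed line format here is id,name,age separated by commas.
// a minimal DSL-style sketch for Spark 1.6; the line format id,name,age is an assumption
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.functions.col

object DslDemo {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("DslDemo").setMaster("local[*]")
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._                        // enables rdd.toDF
    val rdd = sc.textFile("hdfs://hadoop01:9000/person.txt")
      .map(_.split(","))
      .map(p => (p(0).toInt, p(1), p(2).toInt))
    val df = rdd.toDF("id", "name", "age")               // column names become the schema
    df.show()
    df.select("name", "age").filter(col("age") > 5).show()
    df.printSchema()
    sc.stop()
  }
}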
SQL-style syntax
After obtaining the df:
df.registerTempTable("animal") //registers the df as a table; the argument is the table name
sqlContext.sql("sql statement").show runs SQL statements through the sqlContext
sqlContext.sql("desc t_person").show displays the table's schema
The typical spark-sql workflow:
hdfs -> sc.textFile -> RDD -> case class Person -> rdd.map(..) -> toDF -> df.registerTempTable("t_table") -> sqlContext.sql("sql statement")
jar dependencies needed for spark-sql programming
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-sql_2.10</artifactId>
    <version>1.6.1</version>
</dependency>
Inferring the schema through reflection
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, SQLContext}

object InterSchema {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("InterSchema").setMaster("local[*]")
    val sc = new SparkContext(conf)
    val sqlContext: SQLContext = new SQLContext(sc)
    // each line of person.txt is assumed to be comma-separated: id,name,age,fv
    val file: RDD[String] = sc.textFile("hdfs://hadoop01:9000/person.txt")
    val lineRDD: RDD[Array[String]] = file.map(_.split(","))
    // map each split line onto the Person case class so Spark can infer the schema by reflection
    val personRDD: RDD[Person] = lineRDD.map(p => Person(p(0).toInt, p(1), p(2).toInt, p(3).toInt))
    import sqlContext.implicits._
    val personDF: DataFrame = personRDD.toDF()
    personDF.registerTempTable("t_person")
    val sql = "select name,age,fv from t_person where age>20 order by fv asc limit 10"
    val res: DataFrame = sqlContext.sql(sql)
    res.write.mode("append").json("hdfs://hadoop01:9000/spark-sql/out20171026")
    sc.stop()
  }
}
// the case class is declared outside main so reflection can inspect its fields
case class Person(id: Int, name: String, age: Int, fv: Int)
Specifying the schema with StructType
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, Row, SQLContext}
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

object StructTypeSchema {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("StructTypeSchema").setMaster("local[*]")
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)
    // person.txt lines are assumed to be comma-separated: id,name,age,fv
    val linesRDD: RDD[Array[String]] = sc.textFile("hdfs://hadoop01:9000/person.txt", 1).map(_.split(","))
    // describe the columns explicitly instead of relying on reflection
    val schema: StructType = StructType(Array(
      StructField("id", IntegerType, false),
      StructField("name", StringType, true),
      StructField("age", IntegerType, true),
      StructField("fv", IntegerType, true)
    ))
    val rowRDD: RDD[Row] = linesRDD.map(p => Row(p(0).toInt, p(1), p(2).toInt, p(3).toInt))
    val personDF: DataFrame = sqlContext.createDataFrame(rowRDD, schema)
    personDF.registerTempTable("t_person")
    val sql = "select name,age,fv from t_person order by fv asc"
    val res: DataFrame = sqlContext.sql(sql)
    res.write.mode("append").json("C://person")
    sc.stop()
  }
}
Loading data from MySQL through spark-shell
Start spark-shell from the CLI:
./spark/bin/spark-shell \
--master spark://hadoop01:7077 \
--jars mysql-connector-java-5.1.35-bin.jar (path to the MySQL connector jar) \
--driver-class-path mysql-connector-java-5.1.35-bin.jar (same jar path as above)
Load the person table from the bigdata database:
val jdbcDF=sqlContext.read.format("jdbc").options(Map("url"->"jdbc:mysql://hadoop01:3306/bigdata","driver"->"com.mysql.jdbc.Driver","dbtable"->"person","user"->"root","password"->"root")).load()
jdbcDF.show() displays all rows of the table
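A possible follow-up inside the same spark-shell session, reusing the jdbcDF defined above; this is only a sketch and assumes the MySQL person table has name and age columns.
jdbcDF.printSchema()                                  // column names and types read back from MySQL
jdbcDF.registerTempTable("t_person_mysql")            // expose the JDBC DataFrame to SQL
sqlContext.sql("select name, age from t_person_mysql where age > 20").show()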
Saving data to MySQL
import java.util.Properties
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, Row, SQLContext}
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

object InsertDataToMySQL {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("InsertDataToMySQL").setMaster("local[*]")
    val sc = new SparkContext(conf)
    val sqlContext: SQLContext = new SQLContext(sc)
    // read the source data; each line is assumed to be "name,age"
    val linesRDD: RDD[Array[String]] = sc.textFile("hdfs://hadoop01:9000/person.txt").map(_.split(","))
    val schema: StructType = StructType(List(
      StructField("name", StringType, false),
      StructField("age", IntegerType, false)
    ))
    val mappedRDD: RDD[Row] = linesRDD.map(p => Row(p(0), p(1).toInt))
    val personDF: DataFrame = sqlContext.createDataFrame(mappedRDD, schema)
    // JDBC connection properties for MySQL
    val prop: Properties = new Properties()
    prop.put("user", "root")
    prop.put("password", "root")
    prop.put("driver", "com.mysql.jdbc.Driver")
    personDF.write.mode("append").jdbc("jdbc:mysql://hadoop01:3306/bigdata", "person", prop)
    sc.stop()
  }
}
To package the job, comment out setMaster("local[*]").
Run clean first, then package, then upload the jar.
./spark/bin/spark-submit --class com.sam.study.InsertDataToMySQL --master spark://hadoop01:7077 --jars /home/mysql-connector-java-5.1.35-bin.jar --driver-class-path /home/mysql-connector-java-5.1.35-bin.jar /root/person1.jar
hive-on-spark
If MySQL runs on Windows, change its character set to latin1.
1. Copy core-site.xml from the Hadoop configuration into Spark's conf directory. It is needed because spark-sql generates temporary jars that are cached on HDFS and pulled back at execution time.
2. Copy hive-site.xml from Hive's conf directory into Spark's conf directory as well.
3. Launch bin/spark-sql:
spark-sql --master spark://hadoop01:7077
4. Then use it like MySQL: write SQL statements such as create table ...
5. Load data: load data local inpath "/root/person.txt" into table person
6. Query: select ...
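The same workflow can also be driven from code instead of the spark-sql shell; below is a minimal sketch using HiveContext (Spark 1.6), which reads hive-site.xml from Spark's conf directory. The table name person matches step 5, while the column list is an assumption based on the person.txt format used earlier in these notes.
// a minimal hive-on-spark sketch; the column list (id,name,age,fv) is assumed, not from these notes
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

object HiveOnSparkDemo {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("HiveOnSparkDemo")
    val sc = new SparkContext(conf)
    val hiveContext = new HiveContext(sc)     // picks up hive-site.xml from Spark's conf directory
    hiveContext.sql("create table if not exists person (id int, name string, age int, fv int) " +
      "row format delimited fields terminated by ','")
    hiveContext.sql("load data local inpath '/root/person.txt' into table person")
    hiveContext.sql("select * from person where age > 20").show()
    sc.stop()
  }
}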