使用SparkSQL/DataFrame读取HBase表

最近更新

发现有很多同学发私信问我这个jar包的事情,说找不到类,今天特意更新一下:

HBaseContext类:
https://github.com/apache/hbase/tree/master/hbase-spark/src/main/scala/org/apache/hadoop/hbase/spark

HBaseTableCatalog类:https://github.com/apache/hbase/tree/master/hbase-spark/src/main/scala/org/apache/spark/sql/datasources/hbase

这两个类都是在hbase-spark模块中的,这个jar包下载地址:http://maven.wso2.org/nexus/content/repositories/Apache/org/apache/hbase/hbase-spark/2.0.0-SNAPSHOT/


HBase-Spark Connector(在HBase-Spark 模块中)利用了在Spark-1.2.0中引入的DataSource API(SPARK-3247),在简单的HBase KV存储和复杂的关系型SQL查询之间架起了桥梁,使得用户可以在HBase上使用Spark执行复杂的数据分析工作。HBase Dataframe是一个标准的Spark Dataframe,能够与任何其他的数据源进行交互,比如Hive,Orc,Parquet,JSON等。HBase-Spark Connector应用了关键技术,如分区剪枝(partition pruning),列剪枝(column pruning),谓詞下推(predicate pushdown)和数据局部性(data locality)。

要使用HBase-Spark Connector,用户需要定义在HBase和Spark表之间的映射关系的schema目录,准备数据,并且填充到HBase表中,然后加载HBase Dataframe。之后,用户可以使用SQL查询做集成查询和访问记录HBase的表。以下描述了这个的基本步骤:

1、定义目录(Define catalog)
2、保存DataFrame
3、加载DataFrame
4、SQL 查询
更多详细请参考文献[2][3]

定义目录(Define catalog)

def catalog = s"""{
       |"table":{"namespace":"default", "name":"table1"},
       |"rowkey":"key",
       |"columns":{
         |"col0":{"cf":"rowkey", "col":"key", "type":"string"},
         |"col1":{"cf":"cf1", "col":"col1", "type":"string"}
       |}
     |}""".stripMargin

定义类

case class HBaseRecord(
   col0: String,
   col1: String)

保存数据

val data = (0 to 255).map { i =>  HBaseRecord(i.toString, "extra")}

sc.parallelize(data).toDF.write.options(Map(HBaseTableCatalog.tableCatalog -> catalog, HBaseTableCatalog.newTable -> "5")).format("org.apache.hadoop.hbase.spark").save()
:34: error: not found: value HBaseTableCatalog
              sc.parallelize(data).toDF.write.options(Map(HBaseTableCatalog.tableCatalog -> catalog, HBaseTableCatalog.newTable -> "5")).format("org.apache.hadoop.hbase.spark").save()

出现以上错误,解决方法为导入相应包,如下:

scala> import org.apache.spark.sql.datasources.hbase.{HBaseTableCatalog}
import org.apache.spark.sql.datasources.hbase.HBaseTableCatalog
scala> sc.parallelize(data).toDF.write.options(Map(HBaseTableCatalog.tableCatalog -> catalog, HBaseTableCatalog.newTable -> "5")).format("org.apache.hadoop.hbase.spark ").save()
:29: error: value toDF is not a member of org.apache.spark.rdd.RDD[HBaseRecord]
              sc.parallelize(data).toDF.write.options(Map(HBaseTableCatalog.tableCatalog -> catalog, HBaseTableCatalog.newTable -> "5")).format("org.apache.hadoop.hbase.spark ").save()

報错了,说toDF不是org.apache.spark.rdd.RDD[HBaseRecord]的成员。之所以出现这个问题,是因为之前启动Spark-shell的时候,sqlContext不能使用[1],没有创建成功,因为我的集群的没有启动,所以,解决方法就是启动整个集群,然后重新进入spark-shell。当然,在启动的时候,要加上一个hbase-spark包到classpath,如下:

hadoop@master:~$ spark-1.6.0-bin-hadoop2.4/bin/spark-shell --jars /home/yang/Downloads/hbase-spark-2.0.0-20160316.173537-2.jar 

因为我们有用到HBaseTableCatalog类,这个类是在hbase-spark包里的,所以,我们要引入这个包。

继续执行:

scala> sc.parallelize(data).toDF.write.options(Map(HBaseTableCatalog.tableCatalog -> catalog, HBaseTableCatalog.newTable -> "5")).format("org.apache.hadoop.hbase.spark").save()
java.lang.NullPointerException
    at org.apache.hadoop.hbase.spark.HBaseRelation.(DefaultSource.scala:125)
    at org.apache.hadoop.hbase.spark.DefaultSource.createRelation(DefaultSource.scala:74)
    at org.apache.spark.sql.execution.datasources.ResolvedDataSource$.apply(ResolvedDataSource.scala:222)
    at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:148)
    at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:36)
    at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:41)
    at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:43)
    at $iwC$$iwC$$iwC$$iwC$$iwC.(:45)
    at $iwC$$iwC$$iwC$$iwC.(:47)
    at $iwC$$iwC$$iwC.(:49)
    at $iwC$$iwC.(:51)
    at $iwC.(:53)
    at (:55)
    at .(:59)
    at .()
    at .(:7)
    at .()
    at $print()
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:1065)
    at org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1346)
    at org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:840)
    at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:871)
    at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:819)
    at org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:857)
    at org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:902)
    at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:814)
    at org.apache.spark.repl.SparkILoop.processLine$1(SparkILoop.scala:657)
    at org.apache.spark.repl.SparkILoop.innerLoop$1(SparkILoop.scala:665)
    at org.apache.spark.repl.SparkILoop.org$apache$spark$repl$SparkILoop$$loop(SparkILoop.scala:670)
	at org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply$mcZ$sp(SparkILoop.scala:997)
    at org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:945)
    at org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:945)
    at scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:135)
    at org.apache.spark.repl.SparkILoop.org$apache$spark$repl$SparkILoop$$process(SparkILoop.scala:945)
	at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:1059)
	at org.apache.spark.repl.Main$.main(Main.scala:31)
	at org.apache.spark.repl.Main.main(Main.scala)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:731)
    at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:181)
    at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:206)
    at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:121)
    at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)

出现上述问题的原因是,由于没有初始化HBaseContext对象,所以,只有创建一个HBaseContext对象就好了。但注意,由于在spark-shell中缺少相应的hbase依赖包,所以,在此,你想创建HBaseContext包时,需要引入相应包,比较麻烦,建议在IDE里测试这个程序,比如Scala IDE,或Intellij。

以下是完整程序:



import org.apache.hadoop.fs.Path
import org.apache.hadoop.hbase.{ HBaseConfiguration, HColumnDescriptor, HTableDescriptor }
import org.apache.hadoop.hbase.client.{ HBaseAdmin, HTable, Put }
import org.apache.hadoop.hbase.mapreduce.TableInputFormat
import org.apache.hadoop.hbase.spark.HBaseContext
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.sql._
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.datasources.hbase._
import org.apache.hadoop.hbase.spark.datasources.HBaseScanPartition
import org.apache.hadoop.hbase.util.Bytes

case class HBaseRecord(
  col0: String,
  col1: Int)

object HBaseRecord {
  def apply(i: Int, t: Int): HBaseRecord = {
    val s = s"""row${"%03d".format(i)}"""
    HBaseRecord(s,
      i)
  }
}

object Test {
  def main(args: Array[String]) {

    val conf = new SparkConf().setAppName("test spark sql");
    conf.setMaster("spark://master:7077");
    val sc = new SparkContext("local", "test") //new SparkContext(conf)//
    val config = HBaseConfiguration.create()
    //config.addResource("/home/hadoop/hbase-1.2.2/conf/hbase-site.xml");
    //config.set("hbase.zookeeper.quorum", "node1,node2,node3");
    val hbaseContext = new HBaseContext(sc, config, null)

    def catalog = s"""{
       |"table":{"namespace":"default", "name":"table4"},
       |"rowkey":"key",
       |"columns":{
         |"col0":{"cf":"rowkey", "col":"key", "type":"string"},
         |"col1":{"cf":"cf1", "col":"col1", "type":"int"}
       |}
     |}""".stripMargin

    val sqlContext = new SQLContext(sc);
    import sqlContext.implicits._

    def withCatalog(cat: String): DataFrame = {
      sqlContext
        .read
        .options(Map(HBaseTableCatalog.tableCatalog -> cat))
        .format("org.apache.hadoop.hbase.spark")
        .load()
    }
    val df = withCatalog(catalog)

    val res = df.select("col1")
    //res.save("hdfs://master:9000/user/yang/a.txt")
    res.show()
    df.registerTempTable("table4")
    sqlContext.sql("select count(col0),sum(col1) from table4 where col1>'20' and col1<'26' ").show
    println("-----------------------------------------------------");
    sqlContext.sql("select count(col1),avg(col1) from table4").show
  }
}

以上程序会用到很多JAR包,大家可以自行下载,如有需要,也可向我邮件发送。

参考文献:
[1] https://community.cloudera.com/t5/Advanced-Analytics-Apache-Spark/Spark-Scala-Error-value-toDF-is-not-a-member-of-org-apache-spark/m-p/29994#M973
[2] https://hbase.apache.org/book.html#_sparksql_dataframes
[3] https://strongyoung.gitbooks.io/hbase-reference-guide/content/hbase_spark/86sparksqldataframes.html

你可能感兴趣的:(Spark,HBase,Scala,Hadoop)