A heads-up: this was very much a process of stepping on pits. Although the setup in this first part does run, the version configuration is actually problematic, so be sure to read the second part.
Environment:
Scala: 2.12.1
Spark: 2.4.4
HBase: 2.2.3
Preface:
In an earlier article I used pyspark, and it wore me out; Python development for this really didn't feel that good. After reading some articles online and asking a few friends, I decided to learn to use Scala instead. (You can take a look at this; I found it quite convincing.)
Setup:
Since this is also my first time working with Scala, if the same is true for you, these may help:
Installing a Scala environment for IntelliJ IDEA on Windows (detailed, for beginners).
Developing Scala in IDEA.
For the Spark/HBase part I followed this tutorial: http://dblab.xmu.edu.cn/blog/1316-2/.
I only move on to the next step once the code has tested successfully, but because my environment versions differ from the tutorial's, I changed a few places.
First, create the table and insert some data (paste the whole block into the hbase shell and it all runs in one go; am I not considerate). A small client-API check in Scala follows the shell commands.
create 'student','info'
put 'student','1','info:name','Xueqian'
put 'student','1','info:gender','F'
put 'student','1','info:age','23'
put 'student','2','info:name','Weiliang'
put 'student','2','info:gender','M'
put 'student','2','info:age','24'
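If you want to sanity-check the data from code before bringing Spark into the picture, here is a minimal sketch using the plain HBase client API. The quorum address 192.168.0.111 and the /hbase-unsecure parent znode are just the values used later in this post; adjust them to your own cluster, and drop the znode line if your hbase-site.xml does not set it.
import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.{ConnectionFactory, Get}
import org.apache.hadoop.hbase.util.Bytes

object HbaseClientCheck {
  def main(args: Array[String]): Unit = {
    val conf = HBaseConfiguration.create()
    conf.set("hbase.zookeeper.quorum", "192.168.0.111")     // assumption: same quorum as used later in this post
    conf.set("hbase.zookeeper.property.clientPort", "2181")
    conf.set("zookeeper.znode.parent", "/hbase-unsecure")   // only needed if your hbase-site.xml sets it
    val connection = ConnectionFactory.createConnection(conf)
    val table = connection.getTable(TableName.valueOf("student"))
    // fetch row "1" inserted by the shell commands above
    val result = table.get(new Get(Bytes.toBytes("1")))
    val name = Bytes.toString(result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name")))
    println("info:name of row 1 = " + name)
    table.close()
    connection.close()
  }
}
If this prints "Xueqian", the client can reach ZooKeeper and HBase, and any later failures are more likely Spark-side than connectivity-side.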
POM:
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>
    <groupId>com.shct</groupId>
    <artifactId>sparkhbasetest</artifactId>
    <version>1.0-SNAPSHOT</version>

    <properties>
        <jdk.version>1.8</jdk.version>
        <scala.version>2.12.1</scala.version>
        <spark.version>2.4.4</spark.version>
        <hadoop.version>3.1.2</hadoop.version>
        <hbase.version>2.2.3</hbase.version>
    </properties>

    <dependencies>
        <!--scala-->
        <dependency>
            <groupId>org.scala-lang</groupId>
            <artifactId>scala-library</artifactId>
            <version>${scala.version}</version>
        </dependency>
        <dependency>
            <groupId>org.scala-lang</groupId>
            <artifactId>scala-reflect</artifactId>
            <version>${scala.version}</version>
        </dependency>
        <dependency>
            <groupId>org.scala-lang</groupId>
            <artifactId>scala-compiler</artifactId>
            <version>${scala.version}</version>
        </dependency>
        <!--spark-core-->
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-core_2.11</artifactId>
            <version>${spark.version}</version>
        </dependency>
        <!--hadoop-->
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-client</artifactId>
            <version>${hadoop.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-common</artifactId>
            <version>${hadoop.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-hdfs</artifactId>
            <version>${hadoop.version}</version>
        </dependency>
        <!--hbase-->
        <dependency>
            <groupId>org.apache.hbase</groupId>
            <artifactId>hbase-client</artifactId>
            <version>${hbase.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.hbase</groupId>
            <artifactId>hbase-server</artifactId>
            <version>${hbase.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.hbase</groupId>
            <artifactId>hbase</artifactId>
            <version>${hbase.version}</version>
            <type>pom</type>
        </dependency>
        <dependency>
            <groupId>org.apache.hbase</groupId>
            <artifactId>hbase-mapreduce</artifactId>
            <version>${hbase.version}</version>
        </dependency>
    </dependencies>

    <build>
        <sourceDirectory>src/main/scala</sourceDirectory>
        <plugins>
            <plugin>
                <groupId>org.scala-tools</groupId>
                <artifactId>maven-scala-plugin</artifactId>
                <version>2.15.2</version>
                <executions>
                    <execution>
                        <goals>
                            <goal>compile</goal>
                            <goal>testCompile</goal>
                        </goals>
                    </execution>
                </executions>
            </plugin>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-shade-plugin</artifactId>
                <version>3.1.0</version>
                <executions>
                    <execution>
                        <phase>package</phase>
                        <goals>
                            <goal>shade</goal>
                        </goals>
                        <configuration>
                            <createDependencyReducedPom>false</createDependencyReducedPom>
                            <filters>
                                <filter>
                                    <artifact>*:*</artifact>
                                    <excludes>
                                        <exclude>META-INF/*.SF</exclude>
                                        <exclude>META-INF/*.DSA</exclude>
                                        <exclude>META-INF/*.RSA</exclude>
                                    </excludes>
                                </filter>
                            </filters>
                            <transformers>
                                <transformer implementation="org.apache.maven.plugins.shade.resource.ManifestResourceTransformer">
                                    <mainClass>com.test.SparkCount</mainClass>
                                </transformer>
                            </transformers>
                        </configuration>
                    </execution>
                </executions>
            </plugin>
        </plugins>
    </build>
</project>
Reading:
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.hbase._
import org.apache.hadoop.hbase.client._
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.util.Bytes
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf
import org.apache.hadoop.hbase.mapreduce.TableInputFormat
object sparkHbaseRead {
  def main(args: Array[String]): Unit = {
    val conf = HBaseConfiguration.create()
    val sc = new SparkContext(new SparkConf())
    // set the table to scan
    // conf.set("hbase.mapreduce.inputtable", "Studet")
    conf.set(TableInputFormat.INPUT_TABLE, "student")
    conf.set("zookeeper.znode.parent", "/hbase-unsecure")
    conf.set("hbase.zookeeper.property.clientPort", "2181")
    val stuRDD = sc.newAPIHadoopRDD(conf, classOf[TableInputFormat],
      classOf[org.apache.hadoop.hbase.io.ImmutableBytesWritable],
      classOf[org.apache.hadoop.hbase.client.Result])
    val count = stuRDD.count()
    println("Students RDD Count:" + count)
    stuRDD.cache()
    stuRDD.foreach({ case (_, result) =>
      val key = Bytes.toString(result.getRow)
      val name = Bytes.toString(result.getValue("info".getBytes, "name".getBytes))
      val gender = Bytes.toString(result.getValue("info".getBytes, "gender".getBytes))
      val age = Bytes.toString(result.getValue("info".getBytes, "age".getBytes))
      println("Row key:" + key + " Name:" + name + " Gender:" + gender + " Age:" + age)
    })
  }
}
The change I made here is adding the zookeeper.znode.parent setting; without it you may run into the following error:
java.util.concurrent.ExecutionException: org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode for /hbase/hbaseid
Caused by: java.io.IOException: org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode for /hbase
zookeeper.znode.parent is a setting from your HBase configuration file hbase-site.xml. If that file does not set the property, you don't need to set it here either: HBase then falls back to the default parent znode /hbase, which is exactly the path the error above complains about when the cluster actually uses a different one.
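If you are not sure whether your cluster overrides the parent znode, a quick check from code is to print whatever value the loaded configuration ends up with. This is a minimal sketch under the assumption that hbase-site.xml is on the classpath (HBaseConfiguration.create() picks it up automatically):
import org.apache.hadoop.hbase.HBaseConfiguration

object ZnodeParentCheck {
  def main(args: Array[String]): Unit = {
    // HBaseConfiguration.create() loads hbase-default.xml and, if present, hbase-site.xml from the classpath
    val conf = HBaseConfiguration.create()
    // the built-in default is /hbase; some distributions use /hbase-unsecure or /hbase-secure instead
    println("zookeeper.znode.parent = " + conf.get("zookeeper.znode.parent", "/hbase"))
    println("hbase.zookeeper.quorum = " + conf.get("hbase.zookeeper.quorum"))
  }
}
If the printed value does not match what your HBase cluster actually registered in ZooKeeper, set it explicitly in code as shown above.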
Writing:
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.mapreduce.TableOutputFormat
import org.apache.spark._
import org.apache.hadoop.mapreduce.Job
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.client.Result
import org.apache.hadoop.hbase.client.Put
import org.apache.hadoop.hbase.util.Bytes
object sparkHbaseWrite {
  def main(args: Array[String]): Unit = {
    val sparkConf = new SparkConf().setAppName("SparkWriteHBase").setMaster("local")
    val sc = new SparkContext(sparkConf)
    val tablename = "student"
    // HBase connection settings
    val hconf = sc.hadoopConfiguration
    hconf.set("hbase.zookeeper.property.clientPort", "2181")
    hconf.set("zookeeper.znode.parent", "/hbase-unsecure")
    hconf.set("hbase.zookeeper.quorum", "192.168.0.111")
    hconf.set(TableOutputFormat.OUTPUT_TABLE, tablename)
    val job = Job.getInstance(hconf)
    job.setOutputKeyClass(classOf[ImmutableBytesWritable])
    job.setOutputValueClass(classOf[Result])
    job.setOutputFormatClass(classOf[TableOutputFormat[ImmutableBytesWritable]])
    val indataRDD = sc.makeRDD(Array("7,小明,M,10", "8,小红,F,12")) // build two rows of records
    val rdd = indataRDD.map(_.split(',')).map { arr =>
      val put = new Put(Bytes.toBytes(arr(0))) // row key value
      // the old HBase 1.x calls, which no longer compile against hbase-client 2.x:
      // put.add(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes(arr(1)))   // value of the info:name column
      // put.add(Bytes.toBytes("info"), Bytes.toBytes("gender"), Bytes.toBytes(arr(2))) // value of the info:gender column
      // put.add(Bytes.toBytes("info"), Bytes.toBytes("age"), Bytes.toBytes(arr(3)))    // value of the info:age column
      put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes(arr(1)))   // value of the info:name column
      put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("gender"), Bytes.toBytes(arr(2))) // value of the info:gender column
      put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("age"), Bytes.toBytes(arr(3)))    // value of the info:age column
      (new ImmutableBytesWritable, put)
    }
    rdd.saveAsNewAPIHadoopDataset(job.getConfiguration)
    sc.stop()
  }
}
For the write side, the main change is again in the configuration, where I added zookeeper.znode.parent. I also found that put.add throws an error: since HBase 2.x.x the Put interface in hbase-client changed, and Put.add(family, qualifier, value) became Put.addColumn.
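To make the API change concrete, here is a small standalone sketch of the same cell written with the old and the new call. The row key "9" and the name "Xiaogang" are hypothetical values for illustration only; the old three-byte-array overload of add is gone from hbase-client 2.x, so only the addColumn line compiles against the 2.2.3 client used here:
import org.apache.hadoop.hbase.client.Put
import org.apache.hadoop.hbase.util.Bytes

object PutApiContrast {
  def main(args: Array[String]): Unit = {
    val put = new Put(Bytes.toBytes("9")) // hypothetical row key, just for illustration

    // hbase-client 1.x style -- no longer exists in 2.x:
    // put.add(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("Xiaogang"))

    // hbase-client 2.x style: addColumn(family, qualifier, value)
    put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("Xiaogang"))
    println(put)
  }
}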