1. WAL: write-ahead log
The WAL exists for disaster recovery: if a RegionServer crashes, replaying the log recovers the data that was still in memory and had not yet been flushed to disk. If the write to the WAL fails, the whole write operation is treated as failed.
Consequently, enabling the WAL reduces write throughput, while disabling it means that data still sitting in the MemStore is lost if the server goes down before it is flushed to disk.
Workaround: disable the WAL and flush the MemStore to disk manually.
// disable the WAL for this Put
put.setDurability(Durability.SKIP_WAL)
After the writes, call a flushTable operation in place of the WAL:
def flushTable(table: String, conf: Configuration): Unit = {
  var connection: Connection = null
  var admin: Admin = null
  try {
    connection = ConnectionFactory.createConnection(conf)
    admin = connection.getAdmin
    // flush the table's MemStore to disk
    admin.flush(TableName.valueOf(table))
  } catch {
    case e: Exception => e.printStackTrace()
  } finally {
    if (null != admin) {
      admin.close()
    }
    if (null != connection) {
      connection.close()
    }
  }
}
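Putting the two pieces together, a minimal sketch of a write path that skips the WAL and then flushes manually; the table name, column family and values below are illustrative assumptions, not taken from the original code:

import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.{ConnectionFactory, Durability, Put}
import org.apache.hadoop.hbase.util.Bytes

val conf = HBaseConfiguration.create()
val connection = ConnectionFactory.createConnection(conf)
val htable = connection.getTable(TableName.valueOf("access_log")) // hypothetical table
try {
  val put = new Put(Bytes.toBytes("rowkey-001"))
  put.addColumn(Bytes.toBytes("o"), Bytes.toBytes("ip"), Bytes.toBytes("1.2.3.4"))
  put.setDurability(Durability.SKIP_WAL) // skip the WAL for this mutation
  htable.put(put)
} finally {
  htable.close()
  connection.close()
}

// after the batch of writes, flush the MemStore to disk instead of relying on the WAL
flushTable("access_log", conf)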
2. Bulk loading with HFile
https://hbase.apache.org/2.1/book.html#arch.bulk.load
Use Spark to write the DataFrame/RDD out as HFiles directly, then load those files into HBase.
Notes:
<1> HFile bulk loading is the fastest of all the loading approaches, but it assumes an initial import into an empty table; if the table already holds data, loading HFiles into it may trigger region splits (pre-splitting the table helps here, see the sketch below);
<2> The output key/value types must be <ImmutableBytesWritable, KeyValue> or <ImmutableBytesWritable, Put>.
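For note <1>, the usual practice is to bulk-load into a freshly created, pre-split table. The createTable(day, conf) helper referenced in the job below is not shown here; the following is a hypothetical sketch of such a helper (the table naming, the "o" column family and the split keys are assumptions):

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.hbase.{HColumnDescriptor, HTableDescriptor, TableName}
import org.apache.hadoop.hbase.client.ConnectionFactory
import org.apache.hadoop.hbase.util.Bytes

// hypothetical reconstruction of createTable(day, conf): one table per day,
// a single column family "o", and a few pre-split regions
def createTable(day: String, conf: Configuration): String = {
  val tableName = s"access_$day" // assumed naming convention
  val connection = ConnectionFactory.createConnection(conf)
  val admin = connection.getAdmin
  try {
    val tn = TableName.valueOf(tableName)
    if (!admin.tableExists(tn)) {
      val desc = new HTableDescriptor(tn)
      desc.addFamily(new HColumnDescriptor("o"))
      // pre-split so the bulk load does not trigger splits afterwards (split keys are illustrative)
      val splitKeys = Array("2", "4", "6", "8").map(Bytes.toBytes)
      admin.createTable(desc, splitKeys)
    }
    tableName
  } finally {
    admin.close()
    connection.close()
  }
}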
// 1. build (rowkey, KeyValue) pairs from the source DataFrame
// mapPartitions is used here so work is batched per partition
val hbaseInfoRDD = logDF.rdd.mapPartitions(partition => {
  partition.flatMap(x => {
    val ip = x.getAs[String]("ip")
    val country = x.getAs[String]("country")
    val province = x.getAs[String]("province")
    val city = x.getAs[String]("city")
    val formattime = x.getAs[String]("formattime")
    val method = x.getAs[String]("method")
    val url = x.getAs[String]("url")
    val protocal = x.getAs[String]("protocal")
    val status = x.getAs[String]("status")
    val bytessent = x.getAs[String]("bytessent")
    val referer = x.getAs[String]("referer")
    val browsername = x.getAs[String]("browsername")
    val browserversion = x.getAs[String]("browserversion")
    val osname = x.getAs[String]("osname")
    val osversion = x.getAs[String]("osversion")
    val ua = x.getAs[String]("ua")

    val columns = scala.collection.mutable.HashMap[String, String]()
    columns.put("ip", ip)
    columns.put("country", country)
    columns.put("province", province)
    columns.put("city", city)
    columns.put("formattime", formattime)
    columns.put("method", method)
    columns.put("url", url)
    columns.put("protocal", protocal)
    columns.put("status", status)
    columns.put("bytessent", bytessent)
    columns.put("referer", referer)
    columns.put("browsername", browsername)
    columns.put("browserversion", browserversion)
    columns.put("osname", osname)
    columns.put("osversion", osversion)

    val rowkey = getRowKey(day, referer + url + ip + ua) // build the HBase rowkey
    val rk = Bytes.toBytes(rowkey)

    val list = new ListBuffer[((String, String), KeyValue)]()
    // one KeyValue per column in this rowkey's column family
    for ((k, v) <- columns) {
      val keyValue = new KeyValue(rk, "o".getBytes, Bytes.toBytes(k), Bytes.toBytes(v))
      list += ((rowkey, k) -> keyValue)
    }
    list.toList
  })
}).sortByKey() // HFiles require cells in sorted order
  .map(x => (new ImmutableBytesWritable(Bytes.toBytes(x._1._1)), x._2))
val conf = new Configuration()
conf.set("hbase.rootdir", "hdfs://hadoop000:8020/hbase")
conf.set("hbase.zookeeper.quorum", "hadoop000:2181")

val tableName = createTable(day, conf)
// tell the output format which table to write into
conf.set(TableOutputFormat.OUTPUT_TABLE, tableName)

val job = NewAPIHadoopJob.getInstance(conf)
val table = new HTable(conf, tableName)
HFileOutputFormat2.configureIncrementalLoad(job, table.getTableDescriptor, table.getRegionLocator)

val output = "hdfs://hadoop000:8020/etl/access3/hbase"
val outputPath = new Path(output)
hbaseInfoRDD.saveAsNewAPIHadoopFile(
  output,
  classOf[ImmutableBytesWritable],
  classOf[KeyValue],
  classOf[HFileOutputFormat2],
  job.getConfiguration
)

if (FileSystem.get(conf).exists(outputPath)) {
  val load = new LoadIncrementalHFiles(job.getConfiguration)
  load.doBulkLoad(outputPath, table)
  FileSystem.get(conf).delete(outputPath, true)
}

logInfo(s"job finished successfully... $day")
spark.stop()
[Analysis]:
The whole process breaks down into two steps:
<1> Prepare the data with a MapReduce job,
using HFileOutputFormat2:
HFileOutputFormat2.configureIncrementalLoad(
  job,
  table.getTableDescriptor,
  table.getRegionLocator)
This is one of the overloads of configureIncrementalLoad.
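The job above uses the HBase 1.x-style new HTable(conf, tableName). On HBase 2.x, where that constructor and getTableDescriptor are deprecated or removed, a roughly equivalent setup would look like the sketch below (reusing conf, job and tableName from the code above):

import org.apache.hadoop.hbase.TableName
import org.apache.hadoop.hbase.client.ConnectionFactory
import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat2

val connection = ConnectionFactory.createConnection(conf)
val hbaseTable = connection.getTable(TableName.valueOf(tableName))
val regionLocator = connection.getRegionLocator(TableName.valueOf(tableName))
// same effect: wires the partitioner, reducer and per-family compression/bloom settings into the job
HFileOutputFormat2.configureIncrementalLoad(job, hbaseTable.getDescriptor, regionLocator)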
<2> Complete the data load
(1) First write the prepared data to the target output path with saveAsNewAPIHadoopFile:
val output = "hdfs://hadoop000:8020/etl/access3/hbase"
val outputPath = new Path(output)
hbaseInfoRDD.saveAsNewAPIHadoopFile(
  output,
  classOf[ImmutableBytesWritable],
  classOf[KeyValue],
  classOf[HFileOutputFormat2],
  job.getConfiguration
)
(2) Then load the generated HFiles from the output path into the corresponding HBase table:
if (FileSystem.get(conf).exists(outputPath)) {
  val load = new LoadIncrementalHFiles(job.getConfiguration)
  load.doBulkLoad(outputPath, table)
  FileSystem.get(conf).delete(outputPath, true)
}
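On HBase 2.2+, LoadIncrementalHFiles is deprecated in favor of org.apache.hadoop.hbase.tool.BulkLoadHFiles; on such versions the load step above could be replaced with something like this sketch (same outputPath and tableName as above):

import org.apache.hadoop.hbase.TableName
import org.apache.hadoop.hbase.tool.BulkLoadHFiles

// rough equivalent of LoadIncrementalHFiles.doBulkLoad on newer HBase versions
BulkLoadHFiles.create(job.getConfiguration).bulkLoad(TableName.valueOf(tableName), outputPath)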
3. Writing a Spark DataFrame into HBase with the Hortonworks SHC connector
Straight to the code:
/**
 * Save a DataFrame to HBase
 *
 * @param df        the DataFrame to write
 * @param namespace HBase namespace
 * @param table     table name
 * @param rowkey    name of the rowkey column
 * @param column    name of the data column
 */
private def saveToHbase(df: DataFrame, namespace: String, table: String, rowkey: String, column: String): Unit = {
  def catalog = s"""{
    |"table":{"namespace":"$namespace", "name":"$table"},
    |"rowkey":"$rowkey",
    |"columns":{
    | "$rowkey":{"cf":"rowkey", "col":"$rowkey", "type":"string"},
    | "$column":{"cf":"t", "col":"$column", "type":"string"}
    |}
    |}""".stripMargin

  println(s"""start to persist in HBASE: table:$table, column:$column...""")
  df.write
    .mode(SaveMode.Overwrite)
    .options(Map(HBaseTableCatalog.tableCatalog -> catalog))
    .format("org.apache.spark.sql.execution.datasources.hbase")
    .save()
  println(s"""persist in HBASE success...""")
}
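A hypothetical call site (the namespace, table and column names are made up for illustration; df is assumed to contain the two referenced columns):

// writes df's "id" column as the rowkey and "url" as a column in family "t"
saveToHbase(df, namespace = "default", table = "access_stat", rowkey = "id", column = "url")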
Querying data back from HBase:
The withCatalog method:
def withCatalog(cat: String): DataFrame = {
  sqlContext
    .read
    .options(Map(HBaseTableCatalog.tableCatalog -> cat))
    .format("org.apache.spark.sql.execution.datasources.hbase")
    .load()
}
Querying:
val df = withCatalog(catalog)
val s = df.filter((($"col0" <= "row050" && $"col0" > "row040") ||
    $"col0" === "row005" ||
    $"col0" === "row020" ||
    $"col0" === "r20" ||
    $"col0" <= "row005") &&
    ($"col4" === 1 ||
     $"col4" === 42))
  .select("col0", "col1", "col4")
s.show
Note: DataFrame fields that are null must be converted to something non-null (an empty string, for example) before the write, otherwise the insert fails.
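A simple way to handle that is DataFrameNaFunctions, as in the sketch below, which replaces nulls in all string columns with empty strings before calling the writer above (the call-site names reuse the hypothetical example):

// replace nulls in string columns with "" so the HBase write does not fail on null cells
val cleaned = df.na.fill("")
saveToHbase(cleaned, namespace = "default", table = "access_stat", rowkey = "id", column = "url")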