Fast-loading data into HBase via Hive (01)

Requirement: parse XML files and write the records into HBase (the XML files are GBK-encoded, so a naive Spark read produces mojibake).
Pain point: writing the rows into HBase one by one is far too slow and time-consuming.

1. Reading the GBK-encoded files in Spark without garbling

import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapred.TextInputFormat
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

object ParseXml {
  def main(args: Array[String]): Unit = {
    // build the SparkSession
    val spark = SparkSession.builder.master("local[*]").appName("Parse_xml").getOrCreate()
    // read the raw bytes with hadoopFile and decode them as GBK instead of UTF-8
    val data_DS: RDD[String] = spark.sparkContext
      .hadoopFile("/Users/Desktop/2017003_2010-2019/2018-2019",
        classOf[TextInputFormat], classOf[LongWritable], classOf[Text])
      .map(pair => new String(pair._2.getBytes, 0, pair._2.getLength, "GBK"))
    import spark.implicits._

    data_DS.toDF().createOrReplaceTempView("categ_entry")
    // columns: SHENQINGH, FEIYONGZLMC, JIAOFEIJE, JIAOFEISJ, JIAOFEIRXM, PIAOJUDM, SHOUJUH
    spark.sql("select * from categ_entry").write.csv("data/data_csv_2")
    spark.close()
  }
}

2. Parsing the XML and writing it out as CSV

import org.apache.spark.sql.SparkSession

object ParseXml2 {
  def main(args: Array[String]): Unit = {
    // build the SparkSession
    val spark = SparkSession.builder.master("local[*]").appName("Parse_xml").getOrCreate()
    val df = spark.read
      .format("com.databricks.spark.xml")
      // rowTag names the XML element that wraps one record; adjust it to your schema
      .option("rowTag", "categ_entry")
      .load("data/data_csv_2")
    // register a temp view
    df.createOrReplaceTempView("categ_entry")
    // columns: SHENQINGH, FEIYONGZLMC, JIAOFEIJE, JIAOFEISJ, JIAOFEIRXM, PIAOJUDM, SHOUJUH
    spark.sql("select SHENQINGH,FEIYONGZLMC,JIAOFEIJE,JIAOFEISJ,JIAOFEIRXM,PIAOJUDM,SHOUJUH from categ_entry")
      .write.csv("data/result_2")
    spark.close()
  }
}

The resulting CSV data looks like this:

2014208081375,实用,180.0,20150630,芜湖,,47526269
2014208081375,新型,150.0,20141231,芜湖,,41375489
2014208081375,实用,180.0,20151224,芜湖,,49007979

3. Load the CSV into a Hive table (Hive can LOAD every file under a directory at once)

load data local inpath "/na/20200513/hive/result" into table hive_info_paid_20200513;
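The LOAD statement assumes the staging table already exists. The post does not show that DDL; a minimal sketch matching the seven CSV columns above (all types assumed to be string) would be:

```sql
CREATE TABLE hive_info_paid_20200513(
  SHENQINGH   string,
  FEIYONGZLMC string,
  JIAOFEIJE   string,
  JIAOFEISJ   string,
  JIAOFEIRXM  string,
  PIAOJUDM    string,
  SHOUJUH     string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','  -- matches Spark's default CSV delimiter
STORED AS TEXTFILE;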

4. Create a Hive table mapped onto an HBase table

CREATE TABLE ods_hive_patent_info_paid_20200513(
  key string comment "hbase rowkey",
  SHENQINGH string comment "application number",
  JIAOFEISJ string comment "payment date",
  JIAOFEIJE string comment "payment amount",
  FEIYONGZLMC string comment "fee category name",
  JIAOFEIRXM string comment "payer name",
  unit string,
  recNum string,
  currency string,
  num string
)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,cf:SHENQINGH,cf:JIAOFEISJ,cf:JIAOFEIJE,cf:FEIYONGZLMC,cf:JIAOFEIRXM,cf:unit,cf:recNum,cf:currency,cf:num")
TBLPROPERTIES("hbase.table.name" = "process_fee_20200513");

(I created a Hive-managed (internal) table here, so dropping it in Hive also deletes the backing HBase table; create an external table instead if that is not what you want. I won't elaborate further.)
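The post omits the step that actually moves the rows into HBase. With the staging table from step 3 and the mapped table above, a single INSERT ... SELECT through the HBase storage handler performs the bulk write. This is a sketch: the rowkey expression (application number plus receipt number) is my assumption, as are the empty placeholders for the columns that have no CSV source; pick any expression that is unique per row:

```sql
INSERT INTO TABLE ods_hive_patent_info_paid_20200513
SELECT
  concat(SHENQINGH, '_', SHOUJUH) AS key,  -- assumed rowkey: must be unique per row
  SHENQINGH,
  JIAOFEISJ,
  JIAOFEIJE,
  FEIYONGZLMC,
  JIAOFEIRXM,
  '' AS unit,      -- no source column in the CSV; placeholders assumed
  '' AS recNum,
  '' AS currency,
  '' AS num
FROM hive_info_paid_20200513;
```

Note that INSERT OVERWRITE is not supported on HBase-backed (non-native) tables, so INSERT INTO is used here.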

With that, all one hundred million rows land neatly in the HBase table.
