User behavior logs:
110.85.18.234 - - [30/Jan/2019:00:00:21 +0800] "GET /course/list?c=cb HTTP/1.1" 200 12800 "www.imooc.com" "https://www.imooc.com/course/list?c=data" - "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36 SE 2.X MetaSr 1.0" "-" 10.100.16.243:80 200 0.172 0.172
218.74.48.154 - - [30/Jan/2019:00:00:22 +0800] "GET /.well-known/apple-app-site-association HTTP/1.1" 200 165 "www.imooc.com" "-" - "swcd (unknown version) CFNetwork/974.2.1 Darwin/18.0.0" "-" 10.100.135.47:80 200 0.001 0.001
113.77.139.245 - - [30/Jan/2019:00:00:22 +0800] "GET /static/img/common/new.png HTTP/1.1" 200 1020 "www.imooc.com" "https://www.imooc.com/" - "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3642.0 Safari/537.36" "-" 10.100.16.241:80 200 0.001 0.001
113.77.139.245 - - [30/Jan/2019:00:00:22 +0800] "GET /static/img/menu_icon.png HTTP/1.1" 200 4816 "www.imooc.com" "https://www.imooc.com/" - "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3642.0 Safari/537.36" "-" 10.100.16.243:80 200 0.001 0.001
106.38.241.68 - - [30/Jan/2019:00:00:22 +0800] "GET /wenda/detail/430191 HTTP/1.1" 200 8702 "www.imooc.com" "-" - "Sogou web spider/4.0(+http://www.sogou.com/docs/help/webmasters.htm#07)" "-" 10.100.137.42:80 200 0.216 0.219
111.197.20.121 - - [30/Jan/2019:00:00:22 +0800] "GET /.well-known/apple-app-site-association HTTP/1.1" 200 165 "www.imooc.com" "-" - "swcd (unknown version) CFNetwork/893.14.2 Darwin/17.3.0" "-" 10.100.136.65:80 200 0.001 0.001
110.85.18.234 - - [30/Jan/2019:00:00:22 +0800] "GET /u/card%20?jsonpcallback=jQuery19106008894766558066_1548777623367&_=1548777623368 HTTP/1.1" 200 382 "www.imooc.com" "https://www.imooc.com/course/list?c=cb" - "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36 SE 2.X MetaSr 1.0" "-" 10.100.16.241:80 200 0.059 0.059
110.85.18.234 - - [30/Jan/2019:00:00:22 +0800] "GET /activity/newcomer HTTP/1.1" 200 444 "www.imooc.com" "https://www.imooc.com/course/list?c=cb" - "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36 SE 2.X MetaSr 1.0" "-" 10.100.136.64:80 200 0.103 0.103
194.54.83.182 - - [30/Jan/2019:00:00:22 +0800] "GET / HTTP/1.1" 301 178 "imooc.com" "-" - "Go-http-client/1.1" "-" - - - 0.000
110.85.18.234 - - [30/Jan/2019:00:00:22 +0800] "GET /u/loading HTTP/1.1" 200 64 "www.imooc.com" "https://www.imooc.com/course/list?c=cb" - "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36 SE 2.X MetaSr 1.0" "-" 10.100.16.243:80 200 0.015 0.015
Only a few of these fields are needed by the downstream business, so for simplicity we extract just those fields.
import spark.implicits._
val logRDD: RDD[String] = spark.sparkContext.textFile("/Users/rocky/IdeaProjects/imooc-workspace/spark-project-train/src/data/test-access.log")
var logDF: DataFrame = logRDD.map(x => {
  // the log format is fairly complex and has to be split more than once
  val splits: Array[String] = x.split("\"")
  val splits2: Array[String] = x.split(" ")
  val ip: String = splits(0)
  val time: String = splits(3)
  val url: String = splits(4)
  val referer: String = splits(8)
  val ua: String = splits(10)
  // the UA parsing (step 3 below) can also be done right here,
  // producing browsername / browserversion / osname / osversion
  (ip, time, url, referer, ua, browsername, browserversion, osname, osversion)
}).toDF("ip", "time", "url", "referer", "ua", "browsername", "browserversion", "osname", "osversion")
IP lookup data:
1.0.1.0|1.0.3.255|16777472|16778239|亚洲|中国|福建|福州||电信|350100|China|CN|119.306239|26.075302
1.0.8.0|1.0.15.255|16779264|16781311|亚洲|中国|广东|广州||电信|440100|China|CN|113.280637|23.125178
1.0.32.0|1.0.63.255|16785408|16793599|亚洲|中国|广东|广州||电信|440100|China|CN|113.280637|23.125178
1.1.0.0|1.1.0.255|16842752|16843007|亚洲|中国|福建|福州||电信|350100|China|CN|119.306239|26.075302
1.1.2.0|1.1.7.255|16843264|16844799|亚洲|中国|福建|福州||电信|350100|China|CN|119.306239|26.075302
1.1.8.0|1.1.63.255|16844800|16859135|亚洲|中国|广东|广州||电信|440100|China|CN|113.280637|23.125178
1.2.0.0|1.2.1.255|16908288|16908799|亚洲|中国|福建|福州||电信|350100|China|CN|119.306239|26.075302
val ipRowRDD: RDD[String] = spark.sparkContext.textFile("file:///Users/rocky/IdeaProjects/imooc-workspace/sparksql-train/data/ip.txt")
val ipRuleDF: DataFrame = ipRowRDD.map(x => {
val splits: Array[String] = x.split("\\|")
val startIP: Long = splits(2).toLong
val endIP: Long = splits(3).toLong
val province: String = splits(6)
val city: String = splits(7)
val isp: String = splits(9)
(startIP, endIP, province, city, isp)
}).toDF("start_ip", "end_ip", "province", "city", "isp")
To use UDFs, you must import:
import org.apache.spark.sql.functions._
1. Use a UDF to add an ip_long column that converts the ip string to a Long, so that it can later be joined against ipRuleDF.
def getLongIp() = udf((ip: String) => {
  val splits: Array[String] = ip.split("[.]")
  var ipNum = 0L
  for (i <- splits.indices) {
    ipNum = splits(i).toLong | ipNum << 8L
  }
  ipNum
})
logDF = logDF.withColumn("ip_long", getLongIp()($"ip"))
2. Use a UDF to normalize the date into the yyyyMMddHHmm format.
import java.util.{Date, Locale}
import org.apache.commons.lang3.time.FastDateFormat

def formatTime() = udf((time: String) => {
  FastDateFormat.getInstance("yyyyMMddHHmm").format(
    new Date(FastDateFormat.getInstance("dd/MMM/yyyy:HH:mm:ss Z", Locale.ENGLISH)
      .parse(time.substring(time.indexOf("[") + 1, time.lastIndexOf("]"))).getTime
    ))
})
logDF = logDF.withColumn("formattime", formatTime()(logDF("time")))
3. This step does not use a UDF: the User-Agent field is parsed at the RDD stage, before the DataFrame is created, adding the four columns browsername, browserversion, osname and osversion.
Use a ready-made parser library from GitHub to speed up development (a usage sketch follows the dependency):
<dependency>
<groupId>cz.mallat.uasparser</groupId>
<artifactId>uasparser</artifactId>
<version>0.6.2</version>
</dependency>
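A minimal sketch of how the UA parsing with this library might look (the helper object is mine; the UASparser/OnlineUpdater constructor and the exact getter names are assumptions based on the uasparser API, so check them against the version you pull in):

import cz.mallat.uasparser.{OnlineUpdater, UASparser}

object UAParserUtil {
  // build the parser once per JVM; it loads the UA database bundled with the jar
  lazy val parser = new UASparser(OnlineUpdater.getVendoredInputStream())

  // returns (browsername, browserversion, osname, osversion)
  def parse(ua: String): (String, String, String, String) = {
    val info = parser.parse(ua)
    (info.getUaFamily, info.getBrowserVersionInfo, info.getOsFamily, info.getOsName)
  }
}

// inside the RDD map shown above, before toDF(...):
// val (browsername, browserversion, osname, osversion) = UAParserUtil.parse(ua)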
4. Join logDF with ipRuleDF to add the three columns province, city and isp.
logDF.createOrReplaceTempView("logs")
ipRuleDF.createOrReplaceTempView("ips")
val sql = SQLUtils.SQL
val result: DataFrame = spark.sql(sql)
object SQLUtils {
lazy val SQL = "select " +
"logs.ip ," +
"logs.formattime," +
"logs.url," +
"logs.referer," +
"logs.browsername," +
"logs.browserversion," +
"logs.osname," +
"logs.osversion," +
"logs.ua" +
",ips.province as provincename" +
",ips.city as cityname" +
",ips.isp as isp" +
"from logs left join " +
"ips on logs.ip_long between ips.start_ip and ips.end_ip "
}
1. Design the rowkey as day + crc32(referer+url+ip+ua); every row of the DataFrame gets its own rowkey.
val rowkey = getRowKey(day, referer+url+ip+ua)

// requires java.util.zip.CRC32, org.apache.commons.lang3.StringUtils and org.apache.hadoop.hbase.util.Bytes
def getRowKey(time: String, info: String) = {
  val builder = new StringBuilder(time) // interview topic: StringBuilder vs StringBuffer
  builder.append("_") // interview topic: prefer a builder over "+" for string concatenation
  val crc32 = new CRC32()
  crc32.reset()
  if (StringUtils.isNotEmpty(info)) {
    crc32.update(Bytes.toBytes(info))
  }
  builder.append(crc32.getValue)
  builder.toString()
}
2. RDD[sql.Row] => RDD[(ImmutableBytesWritable, Put)].
Each DataFrame row has its own rowkey, each rowkey corresponds to one Put object, and the row's (cf, column, value) entries are all written into that Put.
val hbaseInfoRDD = logDF.rdd.map(x => {
val ip = x.getAs[String]("ip")
val formattime = x.getAs[String]("formattime")
val province = x.getAs[String]("province")
val city = x.getAs[String]("city")
val url = x.getAs[String]("url")
val referer = x.getAs[String]("referer")
val browsername = x.getAs[String]("browsername")
val browserversion = x.getAs[String]("browserversion")
val osname = x.getAs[String]("osname")
val osversion = x.getAs[String]("osversion")
val ua = x.getAs[String]("ua")
val columns = scala.collection.mutable.HashMap[String,String]()
columns.put("ip",ip)
columns.put("province",province)
columns.put("city",city)
columns.put("formattime",formattime)
columns.put("url",url)
columns.put("referer",referer)
columns.put("browsername",browsername)
columns.put("browserversion",browserversion)
columns.put("osname",osname)
columns.put("osversion",osversion)
val rowkey = getRowKey(day, referer+url+ip+ua) // HBase rowkey
val put = new Put(Bytes.toBytes(rowkey)) // the Put object to write to HBase
// all of this row's columns, written under the column family "o"
for((k,v) <- columns) {
put.addColumn(Bytes.toBytes("o"), Bytes.toBytes(k), Bytes.toBytes(v))
}
(new ImmutableBytesWritable(Bytes.toBytes(rowkey)), put)
})
3. Write to HBase. One table is created per day; if the job fails and has to be rerun, the existing table should be dropped first and the data written again (a sketch of the createTable helper follows the code below).
val day = args(0) // the day to process, passed in from outside
val input = s"hdfs://hadoop000:8020/access/$day/*"
val conf = new Configuration()
conf.set("hbase.rootdir","hdfs://hadoop000:8020/hbase")
conf.set("hbase.zookeeper.quorum","hadoop000:2181")
val tableName = createTable(day, conf)
// set which table the data is written into
conf.set(TableOutputFormat.OUTPUT_TABLE, tableName)
// save the data
hbaseInfoRDD.saveAsNewAPIHadoopFile(
"hdfs://hadoop000:8020/etl/access/hbase",
classOf[ImmutableBytesWritable], //kClass
classOf[Put], //vClass
classOf[TableOutputFormat[ImmutableBytesWritable]], //outputFormatClass
conf
)
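The createTable(day, conf) helper used above is not shown in the snippet. A minimal sketch of what it might look like, following the drop-and-recreate-on-rerun idea (the table name pattern access_<day> and the column family "o" match the rest of the code; the implementation details here are an assumption, not the original helper):

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.hbase.{HColumnDescriptor, HTableDescriptor, TableName}
import org.apache.hadoop.hbase.client.ConnectionFactory

def createTable(day: String, conf: Configuration): String = {
  val tableName = "access_" + day
  val connection = ConnectionFactory.createConnection(conf)
  val admin = connection.getAdmin
  try {
    val table = TableName.valueOf(tableName)
    // rerun safety: drop the existing table first, then recreate it
    if (admin.tableExists(table)) {
      admin.disableTable(table)
      admin.deleteTable(table)
    }
    val descriptor = new HTableDescriptor(table)
    descriptor.addFamily(new HColumnDescriptor("o")) // the column family used for all columns
    admin.createTable(descriptor)
    tableName
  } finally {
    admin.close()
    connection.close()
  }
}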
1. Scan the HBase table: add the columns you need to the Scan object and read the data through Spark's newAPIHadoopRDD.
// connect to HBase
val conf = new Configuration()
conf.set("hbase.rootdir", "hdfs://hadoop000:8020/hbase")
conf.set("hbase.zookeeper.quorum", "hadoop000:2181")
val tableName = "access_" + day
conf.set(TableInputFormat.INPUT_TABLE, tableName) // which table to read from
val scan = new Scan()
// set the column family to query
scan.addFamily(Bytes.toBytes("o"))
// set the columns to query
scan.addColumn(Bytes.toBytes("o"), Bytes.toBytes("country"))
scan.addColumn(Bytes.toBytes("o"), Bytes.toBytes("province"))
scan.addColumn(Bytes.toBytes("o"), Bytes.toBytes("browsername"))
// serialize the Scan and set it on the configuration
conf.set(TableInputFormat.SCAN, Base64.encodeBytes(ProtobufUtil.toScan(scan).toByteArray))
// read the data through Spark's newAPIHadoopRDD
val hbaseRDD = spark.sparkContext.newAPIHadoopRDD(conf,
classOf[TableInputFormat], //inputFormatClass
classOf[ImmutableBytesWritable], //kClass
classOf[Result] //vClass
)
2. Three ways to analyze the RDD we just read: count the total per browser and sort the result in descending order.
cache() is the most common optimization point here.
hbaseRDD.cache() // the most common optimization: cache the RDD, since it is reused three times below
// 1) statistics with Spark Core
hbaseRDD.map(x => {
val browsername = Bytes.toString(x._2.getValue("o".getBytes, "browsername".getBytes))
(browsername, 1)
}).reduceByKey(_ + _)
.map(x => (x._2, x._1)).sortByKey(false)
.map(x => (x._2, x._1)).foreach(println)
// 2) statistics with the DataFrame API
hbaseRDD.map(x => {
val browsername = Bytes.toString(x._2.getValue("o".getBytes, "browsername".getBytes))
Browser(browsername)
}).toDF().select("browsername").groupBy("browsername").count().orderBy(desc("count")).show(false)
// 3) statistics with Spark SQL
hbaseRDD.map(x => {
val browsername = Bytes.toString(x._2.getValue("o".getBytes, "browsername".getBytes))
Browser(browsername)
}).toDF().createOrReplaceTempView("tmp")
spark.sql("select browsername,count(1) cnt from tmp group by browsername order by cnt desc").show(false)
case class Browser(browsername: String)
The day argument (yesterday's date in yyyyMMdd format) can be generated with:
date -d"1 day ago" +"%Y%m%d"
Create the MySQL table:
create table if not exists browser_stat(
day varchar(10) not null, -- no primary key needed: if the job fails and is rerun, the old rows for that day are simply deleted first
browser varchar(100) not null,
cnt int
) engine=innodb default charset=utf8;
resultRDD.coalesce(1).foreachPartition(part => {
Try{
// TODO... write the statistics to MySQL
val connection = {
Class.forName("com.mysql.jdbc.Driver")
val url = "jdbc:mysql://hadoop000:3306/spark?characterEncoding=UTF-8"
val user = "root"
val password = "root"
DriverManager.getConnection(url, user, password)
}
val preAutoCommit = connection.getAutoCommit
connection.setAutoCommit(false)
val sql = "insert into browser_stat (day,browser,cnt) values(?,?,?)"
val pstmt = connection.prepareStatement(sql) // precompiled SQL statement
pstmt.addBatch(s"delete from browser_stat where day='$day'") // first add a delete for this day so a rerun replaces the old data (day is a varchar, so it needs quotes)
part.foreach(x => {
pstmt.setString(1, day)
pstmt.setString(2, x._1)
pstmt.setInt(3, x._2)
pstmt.addBatch() // batch up this partition's rows
})
pstmt.executeBatch()
connection.commit()
(connection, preAutoCommit)
} match {
case Success((connection, preAutoCommit)) => {
connection.setAutoCommit(preAutoCommit)
if(null != connection) connection.close()
}
case Failure(e) => throw e
}
})
There are several ways to write from Spark to MySQL.
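Besides the manual foreachPartition + JDBC approach above, Spark's built-in JDBC data source is another common option. A sketch, reusing the connection settings from the code above (browserStatDF is a hypothetical DataFrame with columns day, browser and cnt):

import java.util.Properties
import org.apache.spark.sql.SaveMode

val props = new Properties()
props.put("user", "root")
props.put("password", "root")
props.put("driver", "com.mysql.jdbc.Driver")

browserStatDF.write
  .mode(SaveMode.Append) // append into the existing browser_stat table
  .jdbc("jdbc:mysql://hadoop000:3306/spark?characterEncoding=UTF-8", "browser_stat", props)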
HBase write optimization: disable the WAL on each Put, and flush the table once the job is done so that the MemStore contents are persisted to StoreFiles.
put.setDurability(Durability.SKIP_WAL) // disable the WAL (write-ahead log)
def flushTable(table:String, conf:Configuration): Unit = {
var connection:Connection = null
var admin:Admin = null
try {
connection = ConnectionFactory.createConnection(conf)
admin = connection.getAdmin
admin.flush(TableName.valueOf(table)) //MemStore==>StoreFile
} catch {
case e:Exception => e.printStackTrace()
} finally {
if(null != admin) {
admin.close()
}
if(null != connection) {
connection.close()
}
}
}
val hbaseInfoRDD = logDF.rdd.mapPartitions(partition => {
partition.flatMap(x=>{
//...extraction of the fields from x omitted (same as in the Put-based version above)...
val rowkey = getRowKey(day, referer+url+ip+ua) // HBase rowkey
val rk = Bytes.toBytes(rowkey)
val list = new ListBuffer[((String,String),KeyValue)]()
// one KeyValue per (rowkey, column) under the column family "o"
for((k,v) <- columns) {
val keyValue = new KeyValue(rk, "o".getBytes, Bytes.toBytes(k),Bytes.toBytes(v))
list += (rowkey,k) -> keyValue
}
list.toList
})
}).sortByKey() // must be sorted by (rowkey, column)
.map(x => (new ImmutableBytesWritable(Bytes.toBytes(x._1._1)), x._2))
When writing to HBase directly, rows are ordered by rowkey and the cells within a row are ordered by column. Generating HFiles directly therefore requires us to sort the data by (rowkey, column) ourselves, which is why each RDD element is shaped as (rowkey, column) -> KeyValue.
val job = NewAPIHadoopJob.getInstance(conf) // NewAPIHadoopJob is an import alias for org.apache.hadoop.mapreduce.Job
val table = new HTable(conf, tableName)
// configure the MapReduce job for an incremental load into the given table
HFileOutputFormat2.configureIncrementalLoad(job,table.getTableDescriptor,table.getRegionLocator)
val output = "hdfs://hadoop000:8020/etl/access3/hbase"
val outputPath = new Path(output)
hbaseInfoRDD.saveAsNewAPIHadoopFile(
output,
classOf[ImmutableBytesWritable], //keyClass
classOf[KeyValue], //valueClass
classOf[HFileOutputFormat2], //outputFormatClass
job.getConfiguration
)
if(FileSystem.get(conf).exists(outputPath)) {
val load = new LoadIncrementalHFiles(job.getConfiguration)
load.doBulkLoad(outputPath, table)
FileSystem.get(conf).delete(outputPath, true)
}
From these interfaces you can see that a DataFrame is essentially RDD[Row] + schema.
package org.apache.spark.sql.sources
abstract class BaseRelation {
def sqlContext: SQLContext
def schema: StructType
}
@InterfaceStability.Stable
trait TableScan {
def buildScan(): RDD[Row]
}
@InterfaceStability.Stable
trait PrunedScan {
def buildScan(requiredColumns: Array[String]): RDD[Row]
}
@InterfaceStability.Stable
trait PrunedFilteredScan {
def buildScan(requiredColumns: Array[String], filters: Array[Filter]): RDD[Row]
}
@InterfaceStability.Stable
trait InsertableRelation {
def insert(data: DataFrame, overwrite: Boolean): Unit
}
Define a DefaultSource class; parameters holds the options passed in via option(). Then implement HBaseRelation.
class DefaultSource extends RelationProvider{
override def createRelation(sqlContext: SQLContext, parameters: Map[String, String]): BaseRelation = {
HBaseRelation(parameters)(sqlContext)
}
}
HBaseRelation needs to implement schema and buildScan, i.e. schema + RDD[Row].
The schema defined by option("spark.table.name", "(age int,name string,sex string)") is converted into Spark data types (a parsing sketch follows the list below):
"age int" => SparkSchema("age", "int") => StructField("age", IntegerType)
"name string" => SparkSchema("name", "string") => StructField("name", StringType)
"sex string" => SparkSchema("sex", "string") => StructField("sex", StringType)
HBase itself has no data types, so the values read back from hbaseRDD are raw bytes that have to be converted according to the Spark schema.
The value of one hbaseRDD record looks like:
keyvalues={0001/o:age/1645116727964/Put/vlen=2/seqid=0,
0001/o:name/1645116727901/Put/vlen=3/seqid=0,
0001/o:sex/1645116728008/Put/vlen=3/seqid=0}
age_value => SparkSchema("age", "int") => Integer.parseInt(new String(result.getValue(Bytes.toBytes("o"), Bytes.toBytes("age"))))
name_value => SparkSchema("name", "string") => new String(result.getValue(Bytes.toBytes("o"), Bytes.toBytes("name")))
sex_value => SparkSchema("sex", "string") => new String(result.getValue(Bytes.toBytes("o"), Bytes.toBytes("sex")))
The three values are then assembled into a Row.
case class HBaseRelation(@transient val parameters: Map[String, String])
(@transient val sqlContext: SQLContext)
extends BaseRelation with TableScan{
// sparkFields, zookeeperAddress and hbaseTable below are derived from `parameters`; that parsing code is omitted here
override def schema: StructType = {
val schema = sparkFields.map(field => {
val structField = field.fieldType.toLowerCase match {
case "string" => StructField(field.fieldName, StringType)
case "int" => StructField(field.fieldName, IntegerType)
}
structField
})
new StructType(schema)
}
override def buildScan(): RDD[Row] = {
val hbaseConf = HBaseConfiguration.create()
hbaseConf.set("hbase.zookeeper.quorum", zookeeperAddress)
hbaseConf.set(TableInputFormat.INPUT_TABLE, hbaseTable)
//hbaseConf.set(TableInputFormat.SCAN_COLUMNS, "")
val hbaseRDD = sqlContext.sparkContext.newAPIHadoopRDD(hbaseConf,
classOf[TableInputFormat],
classOf[ImmutableBytesWritable],
classOf[Result]
)
hbaseRDD.map(_._2).map(result => { // result = 11,zhu,nan
val buffer = new ArrayBuffer[Any]()
sparkFields.foreach(field => {
field.fieldType.toLowerCase match {
case "string" => {
buffer += new String(result.getValue(Bytes.toBytes("o"), Bytes.toBytes(field.fieldName)))
}
case "int" => {
buffer += Integer.parseInt(new String(result.getValue(Bytes.toBytes("o"), Bytes.toBytes(field.fieldName))))
}
}
})
Row.fromSeq(buffer)
})
}
}
This implementation can only scan the whole table; hbaseConf.set(TableInputFormat.SCAN_COLUMNS, "...") can be added to fetch only some of the columns.
val df = spark.read.format("com.chengyanban.hbase")
.option("zookeeper.address","hadoop000:2181")
.option("hbase.table.name","user")
.option("spark.table.name","(age int,name string,sex string)")
.load()
df.printSchema()
df.show()
Kudu is a columnar data store: it keeps data in strongly typed columns, and with a proper design this brings several advantages for analytical and data-warehouse workloads.
Table: in Kudu, a table is where your data lives. A table has a schema and a totally ordered primary key, and it is split into segments called tablets.
Tablet: a tablet is a contiguous segment of a table, similar to a partition in other storage engines or relational databases. A given tablet is replicated on multiple tablet servers, and at any point in time one of these replicas is considered the leader tablet. Any replica can serve reads, while writes require consensus among the set of tablet servers serving the tablet.
Tablet Server: a tablet server stores tablets and serves them to clients. For a given tablet, one tablet server acts as the leader and the others act as follower replicas of that tablet. Only the leader serves write requests, while either the leader or a follower can serve read requests. The leader is elected using the Raft consensus algorithm. One tablet server can serve multiple tablets, and one tablet can be served by multiple tablet servers.
Master: the master keeps track of all tablets, tablet servers, the Catalog Table and other metadata related to the cluster. At any given point in time there can only be one acting master (the leader). If the current leader disappears, a new master is elected using the Raft consensus algorithm.
The master also coordinates metadata operations for clients. For example, when a new table is created, the client internally sends the request to the master; the master writes the new table's metadata into the Catalog Table and coordinates the process of creating tablets on the tablet servers.
All of the master's data is stored in a single tablet, which can be replicated to all the other candidate masters.
Tablet servers send heartbeats to the master at a set interval (once per second by default).
Catalog Table: the Catalog Table is the central location for Kudu's metadata. It stores information about tables and tablets. The Catalog Table cannot be read or written directly; it can only be accessed through the metadata operations exposed in the client API.
The Catalog Table stores two categories of metadata: metadata about tables and metadata about tablets.
Kudu ships with its own Spark data source, so no custom data source is needed, which is very convenient.
val odsDF = spark.read.format("org.apache.kudu.spark.kudu")
.option("kudu.master", masterAdress)
.option("kudu.table", sourceTableName)
.load()
data.write.mode(SaveMode.Append).format("org.apache.kudu.spark.kudu")
.option("kudu.master", master)
.option("kudu.table", tableName)
.save()
You need to set the number of replicas, pick the column to hash-partition on, and set the number of buckets (a sketch of how the KuduClient used here is obtained follows the schema definition below).
val options: CreateTableOptions = new CreateTableOptions()
options.setNumReplicas(1)
val parcols: util.LinkedList[String] = new util.LinkedList[String]()
parcols.add(partitionID)
options.addHashPartitions(parcols,3)
client.createTable(tableName, schema, options)
Define the schema (the key columns must be set):
import org.apache.kudu.{Schema, Type}
import org.apache.kudu.ColumnSchema.ColumnSchemaBuilder
import scala.collection.JavaConverters._

lazy val ProvinceCitySchema: Schema = {
  val columns = List(
    new ColumnSchemaBuilder("provincename", Type.STRING).nullable(false).key(true).build(),
    new ColumnSchemaBuilder("cityname", Type.STRING).nullable(false).key(true).build(),
    new ColumnSchemaBuilder("cnt", Type.INT64).nullable(false).key(true).build()
  ).asJava
  new Schema(columns)
}
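The client used in the createTable call above is a KuduClient. A minimal sketch of how it can be built and how the pieces fit together (masterAddress and tableName are placeholder values, and the drop-if-exists step for reruns is an assumption):

import java.util
import org.apache.kudu.client.{CreateTableOptions, KuduClient}

val masterAddress = "hadoop000:7051"   // placeholder: your Kudu master address
val tableName = "province_city_stat"   // placeholder table name

val client: KuduClient = new KuduClient.KuduClientBuilder(masterAddress).build()

// rerun safety (assumption): drop the table first if it already exists
if (client.tableExists(tableName)) {
  client.deleteTable(tableName)
}

val options: CreateTableOptions = new CreateTableOptions()
options.setNumReplicas(1)
val parcols: util.LinkedList[String] = new util.LinkedList[String]()
parcols.add("provincename")            // hash-partition on provincename
options.addHashPartitions(parcols, 3)  // 3 buckets

client.createTable(tableName, ProvinceCitySchema, options)
client.close()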
References:
HBase bulkLoad: where does the time go? (HBase bulkLoad时间都花在哪?)
Introduction to Kudu (kudu介绍)