Taking advantage of some free time before the holiday, I'm writing a few posts. This one walks through an example using Spark ML, which I found on a site abroad. My earlier work all used MLlib; now I'm starting to use ML. At first the difference between ML and MLlib was fuzzy to me, but after reading the official docs a few more times it became clear. Do browse the official site when you get a chance: personally I think it's one of the best among open-source projects, and the API documentation is very rich.
Preparing the data
User site-impression data
Cookie | Site | Impressions
--------------- |-------------- | -------------
wKgQaV0lHZanDrp | live.com | 24
wKgQaV0lHZanDrp | pinterest.com | 21
rfTZLbQDwbu5mXV | wikipedia.org | 14
rfTZLbQDwbu5mXV | live.com | 1
rfTZLbQDwbu5mXV | amazon.com | 1
r1CSY234HTYdvE3 | youtube.com | 10
User geo-location data
Cookie | Lat | Lng | Impressions
--------------- |---------| --------- | ------------
wKgQaV0lHZanDrp | 34.8454 | 77.009742 | 13
wKgQaV0lHZanDrp | 31.8657 | 114.66142 | 1
rfTZLbQDwbu5mXV | 41.1428 | 74.039600 | 20
rfTZLbQDwbu5mXV | 36.6151 | 119.22396 | 4
r1CSY234HTYdvE3 | 42.6732 | 73.454185 | 4
r1CSY234HTYdvE3 | 35.6317 | 120.55839 | 5
20ep6ddsVckCmFy | 42.3448 | 70.730607 | 21
20ep6ddsVckCmFy | 29.8979 | 117.51683 | 1
You can generate data like the above yourself.
Code
import java.io.{File, PrintWriter}
import scala.util.Random

// Split 1,000 random cookie ids into 100 positive and 900 negative users
val (positive, negative) = Seq.fill(1000)(Random.alphanumeric.take(15).mkString).splitAt(100)
// generateDataset is a helper (not shown here) that builds site, geo and response records
val (pSites, pGeo, pResp) = generateDataset(positive, positivePredictors, negativePredictors, response = 1)
val (nSites, nGeo, nResp) = generateDataset(negative, negativePredictors, positivePredictors, response = 0)
// Write site impression log
val sitesW = new PrintWriter(new File("sites.csv"))
sitesW.println("cookie,site,impressions")
(pSites ++ nSites).foreach(sitesW.println)
sitesW.close()
// Write geo impression log
val geoW = new PrintWriter(new File("geo.csv"))
geoW.println("cookie,lat,lon,impressions")
(pGeo ++ nGeo).foreach(geoW.println)
geoW.close()
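`generateDataset`, `positivePredictors`, and `negativePredictors` are helpers not shown above. A minimal pure-Scala sketch of what `generateDataset` might do — the signature, site selection, and random distributions here are my assumptions, not the original code:

```scala
import scala.util.Random

// Hypothetical sketch of the data generator used above: for each cookie,
// emit site-impression CSV lines for that group's predictor sites, geo
// lines with random coordinates, and a response label line.
def generateDataset(
    cookies: Seq[String],
    likelySites: Seq[String],    // sites this group tends to visit
    unlikelySites: Seq[String],  // ignored in this simplified sketch
    response: Int
): (Seq[String], Seq[String], Seq[String]) = {
  val rnd = new Random(42) // fixed seed for reproducibility
  val siteLines = for {
    cookie <- cookies
    site   <- rnd.shuffle(likelySites).take(3)
  } yield s"$cookie,$site,${1 + rnd.nextInt(25)}"
  val geoLines = cookies.map { cookie =>
    val lat = 25.0 + rnd.nextDouble() * 20.0
    val lon = 70.0 + rnd.nextDouble() * 50.0
    s"$cookie,$lat,$lon,${1 + rnd.nextInt(20)}"
  }
  val respLines = cookies.map(c => s"$c,$response")
  (siteLines, geoLines, respLines)
}
```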
Now let's process the data.
First, gather the records that share the same cookie:
val gather = new Gather()
.setPrimaryKeyCols("cookie")
.setKeyCol("site")
.setValueCol("impressions")
.setValueAgg("sum") // sum impression by key
.setOutputCol("sites")
val gatheredSites = gather.transform(siteLog)
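Note that `Gather` comes from the spark-ext library, not core Spark ML. Conceptually it turns a long key/value log into one row per primary key holding an array of key/value pairs; in plain Scala the same aggregation looks roughly like this (the types are illustrative):

```scala
case class SiteImpression(cookie: String, site: String, impressions: Long)

// Group by cookie, then sum impressions per site inside each group --
// the plain-Scala equivalent of Gather with setValueAgg("sum").
def gatherSites(log: Seq[SiteImpression]): Map[String, Seq[(String, Long)]] =
  log.groupBy(_.cookie).map { case (cookie, rows) =>
    val perSite = rows
      .groupBy(_.site)
      .map { case (site, group) => site -> group.map(_.impressions).sum }
      .toSeq
      .sortBy(-_._2) // most-viewed site first
    cookie -> perSite
  }
```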
The output becomes:
Cookie           | Sites
-----------------|----------------------------------------------
wKgQaV0lHZanDrp  | [
                 |  { site: live.com, impressions: 24.0 },
                 |  { site: pinterest.com, impressions: 21.0 }
                 | ]
rfTZLbQDwbu5mXV  | [
                 |  { site: wikipedia.org, impressions: 14.0 },
                 |  { site: live.com, impressions: 1.0 },
                 |  { site: amazon.com, impressions: 1.0 }
                 | ]
The geo data is handled a bit differently: the lat/lon pairs are first mapped to S2 cell ids, and those cells are then gathered the same way:
val s2Transformer = new S2CellTransformer()
.setLevel(5)
.setCellCol("s2_cell")
// Gather S2 CellId log
val gatherS2Cells = new Gather()
.setPrimaryKeyCols("cookie")
.setKeyCol("s2_cell")
.setValueCol("impressions")
.setOutputCol("s2_cells")
val gatheredCells = gatherS2Cells.transform(s2Transformer.transform(geoDf))
The feature transformers used above are not yet part of Spark ML itself; the code lives in spark-ext, see
https://github.com/collectivemedia/spark-ext/blob/master/sparkext-mllib/src/main/scala/org/apache/spark/ml/feature/S2CellTransformer.scala
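`S2CellTransformer` maps each (lat, lon) pair to the id of the Google S2 geometry cell containing it; the `level` parameter controls cell size, so nearby coordinates collapse into the same categorical key. As a rough illustration only — a fixed lat/lon grid, not real S2 cells:

```scala
// Toy stand-in for S2 cell ids: bucket coordinates into a fixed
// rectangular grid. Real S2 cells come from Google's S2 library and
// tile the sphere hierarchically, but the idea is the same: nearby
// points share a cell id, which then acts as a categorical feature.
def gridCellId(lat: Double, lon: Double, degreesPerCell: Double = 4.0): String = {
  val latBucket = math.floor(lat / degreesPerCell).toInt
  val lonBucket = math.floor(lon / degreesPerCell).toInt
  s"${latBucket}_${lonBucket}"
}
```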
Now encode the gathered data into feature vectors:
val encodeS2Cells = new GatherEncoder()
.setInputCol("s2_cells")
.setOutputCol("s2_cells_f")
.setKeyCol("s2_cell")
.setValueCol("impressions")
  .setCover(0.95) // dimensionality reduction: keep keys covering 95% of impressions
This produces:
Cookie | S2 Cells Features
-----------------|------------------------
wKgQaV0lHZanDrp | [ 5.0 , 1.0 , 0 ]
rfTZLbQDwbu5mXV | [ 12.0 , 0 , 5.0 ]
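The `cover` parameter is where the dimensionality reduction happens: only the most frequent keys that together account for 95% of all impressions become feature columns, and the long tail of rare keys is dropped. A plain-Scala sketch of that selection rule — my reading of the parameter, not the library's exact code:

```scala
// Keep the most frequent keys until they jointly cover `cover` of the
// total impression mass; everything rarer is dropped.
def topCoverKeys(counts: Map[String, Double], cover: Double): Seq[String] = {
  val total  = counts.values.sum
  val sorted = counts.toSeq.sortBy(-_._2)                 // most frequent first
  val massBefore = sorted.map(_._2).scanLeft(0.0)(_ + _)  // cumulative mass before each key
  sorted.zip(massBefore).collect {
    // a key is kept if coverage is still below the threshold without it
    case ((key, _), before) if before < cover * total => key
  }
}
```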
Joining the datasets
There is also a response dataset mapping each cookie to a 0 or 1 label; the model below needs it.
// Assemble input dataset
val dataset = responseDf.as("response")
.join(gatheredSites, responseDf(Response.cookie) === gatheredSites(Sites.cookie))
.join(gatheredCells, responseDf(Response.cookie) === gatheredCells(Sites.cookie))
.select(
$"response.*",
$"sites",
$"s2_cells"
).cache()
Building the model
LogisticRegression
// Assumes the encoded columns (sites_f, s2_cells_f) have been assembled
// into a single "features" vector column, e.g. with VectorAssembler
val lr = new LogisticRegression()
  .setFeaturesCol("features")
  .setLabelCol(Response.response)
  .setProbabilityCol("probability")
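Under the hood, logistic regression scores a user as sigmoid(w·x + b), and the `probability` column holds that value. A minimal sketch of the scoring math:

```scala
// Logistic regression prediction: P(response = 1 | x) = 1 / (1 + e^-(w.x + b))
def sigmoid(z: Double): Double = 1.0 / (1.0 + math.exp(-z))

def predictProbability(weights: Array[Double], bias: Double, features: Array[Double]): Double = {
  require(weights.length == features.length, "weight/feature dimensions must match")
  val margin = weights.zip(features).map { case (w, x) => w * x }.sum + bias
  sigmoid(margin)
}
```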
After this you fit the model, get predictions, and evaluate them.
What we're really learning here is the pipeline idea: an assembly-line style of building jobs that is comfortable to work with. Going forward, Spark ML will provide more and more tools for feature extraction, transformation, and selection.