Content-Based Recommendation in Practice

Post recommendation for an anime community.

Tokenize post content

  • https://github.com/NLPchina/ansj_seg
  • https://github.com/NLPchina/ansj_seg/wiki
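
A minimal tokenization sketch in Scala, assuming ansj_seg's ToAnalysis entry point as described in the wiki above (method names can vary between versions):

    import org.ansj.splitWord.analysis.ToAnalysis
    import scala.collection.JavaConverters._

    // Segment the post body and keep the surface form of each term;
    // stop-word filtering would also happen at this step.
    val words: Seq[String] = ToAnalysis.parse("post body text")
      .getTerms.asScala
      .map(_.getName)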

Convert content into vectors

  • Train a Word2Vec model // a TF-IDF model is the alternative considered
    • http://blog.csdn.net/itplus/article/details/37969519
    • http://blog.csdn.net/zhangchen2449/article/details/52795529
    • http://blog.csdn.net/liuyuemaicha/article/details/52610911
    • Which posts participate in training: the full corpus or only the last few months (this also bounds the candidate set)
      • Posts from the most recent month
    • How many vector dimensions are appropriate
    • Parameters (a training sketch follows this list)
      • .setLearningRate($(stepSize)) // learning rate
        • setDefault(stepSize -> 0.025)
      • .setMinCount($(minCount)) // minimum token frequency
        • Tokens that appear fewer than minCount times are filtered out
        • Default 5
      • .setNumIterations($(maxIter)) // number of training iterations
        • setDefault(maxIter -> 1)
      • .setNumPartitions($(numPartitions))
        • setDefault(numPartitions -> 1)
      • .setSeed($(seed))
      • .setVectorSize($(vectorSize)) // vector dimensionality
        • setDefault(vectorSize -> 100)
      • .setWindowSize($(windowSize)) // maximum distance in a sentence between the current word and a predicted word
        • setDefault(windowSize -> 5)
      • .setMaxSentenceLength($(maxSentenceLength)) // each sentence is split into chunks of at most maxSentenceLength words
        • setDefault(maxSentenceLength -> 1000)
  • Model persistence
    • Save the model to HDFS
  • Model updates
    • How often to retrain
    • After retraining, every previously computed user and post feature vector must be recomputed, since the new model defines a new vector space
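
A minimal training sketch, assuming Spark ML's org.apache.spark.ml.feature.Word2Vec (the $(...) defaults listed above come from its implementation) and a DataFrame postWordsDf whose words column holds the tokenized posts:

    import org.apache.spark.ml.feature.Word2Vec

    val word2Vec = new Word2Vec()
      .setInputCol("words")        // tokenized post content
      .setOutputCol("features")
      .setVectorSize(100)          // default
      .setWindowSize(5)            // default
      .setMinCount(10)             // raised from the default 5 (see Issues below)
      .setMaxIter(1)               // default
      .setNumPartitions(20)        // raised from the default 1 (see Issues below)
      .setStepSize(0.025)          // learning rate, default
      .setMaxSentenceLength(1000)  // default

    val model = word2Vec.fit(postWordsDf)

    // Persist to HDFS so downstream jobs can reload it
    // (path taken from the load example further down).
    model.write.overwrite().save("/tmp/dyd/recommend/2018-03-07/")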

Compute post feature vectors with the model

  • model.transform(postWordsDf)
  • Item candidate-set selection: which posts should enter the candidate set
  • How to store the resulting vectors
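
In Spark ML, Word2VecModel.transform averages the vectors of the words in each row to produce one document vector. A sketch of computing and saving the post vectors (the postId column and the Parquet path are assumptions):

    // postWordsDf: DataFrame(postId, words); transform appends a "features" vector column
    val postFeaturesDf = model.transform(postWordsDf)

    // One storage option: a Parquet table keyed by postId
    postFeaturesDf
      .select("postId", "features")
      .write.mode("overwrite")
      .parquet("/tmp/dyd/recommend/post_features/")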

Compute user features from the posts a user has interacted with

  • For users with activity that day, compute or update their features and save them to HBase (TODO: real-time updates; see the averaging sketch after this list)
    • Behavior window
      • Over a 30-day long-term window the features barely change from day to day, so daily recomputation mostly repeats the previous result
      • Use 15 days of behavior instead
  • User feature table -- HBase
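
The aggregation isn't spelled out in these notes; a common choice is to average the vectors of the posts the user touched inside the window. A sketch assuming a hypothetical behaviorDf(userId, postId) DataFrame covering the last 15 days:

    import org.apache.spark.ml.linalg.{Vector, Vectors}

    // Average the feature vectors of all posts each user interacted with.
    val userFeatures = behaviorDf
      .join(postFeaturesDf, "postId")
      .rdd
      .map(row => (row.getAs[String]("userId"), row.getAs[Vector]("features")))
      .groupByKey()
      .mapValues { vecs =>
        val dim = vecs.head.size
        val sum = new Array[Double](dim)
        vecs.foreach(v => for (i <- 0 until dim) sum(i) += v(i))
        Vectors.dense(sum.map(_ / vecs.size))
      }
    // userFeatures: RDD[(userId, Vector)]; write it to the HBase user feature table.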

Basic pipeline

  • Model training
    • Train the model and save it to HDFS
  • User feature update
    • For users with activity yesterday, compute features over their last 15 days of behavior and save them to HBase
  • User recommendation
    • For users active yesterday, fetch their features, score a candidate set of the last 30 days of posts, and save the top 100 recommendations to the recommendation table
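
Word2VecModel does not score users against posts directly, so the ranking step is left to the job. A sketch using cosine similarity, where userVec is the user's vector from HBase and candidatePosts is a hypothetical in-memory Seq[(String, Vector)] holding the 30-day candidate vectors:

    import org.apache.spark.ml.linalg.Vector

    // Cosine similarity between two ML vectors.
    def cosine(a: Vector, b: Vector): Double = {
      val dot = (0 until a.size).map(i => a(i) * b(i)).sum
      val na  = math.sqrt(a.toArray.map(x => x * x).sum)
      val nb  = math.sqrt(b.toArray.map(x => x * x).sum)
      if (na == 0.0 || nb == 0.0) 0.0 else dot / (na * nb)
    }

    // Score every candidate post for this user and keep the 100 best.
    val top100: Seq[(String, Double)] = candidatePosts
      .map { case (postId, vec) => (postId, cosine(userVec, vec)) }
      .sortBy(-_._2)
      .take(100)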

Applications

  • Compute user features from the content of the posts a user interacts with
    • Given a user feature vector, you can derive weighted tags for that user (see the sketch after this list)
  • New-post recommendation
    • Convert the new post's content into a feature vector, then find the users most relevant to it
  • Recommend posts to users
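
For the weighted tags, Spark ML's Word2VecModel.findSynonyms accepts a vector and returns the nearest vocabulary words with similarity scores, which can serve directly as (tag, weight) pairs; userVec is the user vector from the earlier step:

    // DataFrame(word, similarity): the 20 words closest to the user's vector
    val weightedTags = model.findSynonyms(userVec, 20)
    weightedTags.show(20, truncate = false)

The new-post direction is symmetric: transform the new post into a vector, then rank user vectors against it with the same cosine scoring as above.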

Loading a previously saved model from HDFS:

    val model: Word2VecModel = Word2VecModel.load("/tmp/dyd/recommend/2018-03-07/")

Issues

Model training

Exception in thread "main" java.lang.OutOfMemoryError: Java heap space

at java.nio.HeapByteBuffer.<init>(HeapByteBuffer.java:57)
at java.nio.ByteBuffer.allocate(ByteBuffer.java:335)
at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$4.apply(TorrentBroadcast.scala:231)
at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$4.apply(TorrentBroadcast.scala:231)
at org.apache.spark.util.io.ChunkedByteBufferOutputStream.allocateNewChunkIfNeeded(ChunkedByteBufferOutputStream.scala:87)
at org.apache.spark.util.io.ChunkedByteBufferOutputStream.write(ChunkedByteBufferOutputStream.scala:75)
at net.jpountz.lz4.LZ4BlockOutputStream.flushBufferedData(LZ4BlockOutputStream.java:205)
at net.jpountz.lz4.LZ4BlockOutputStream.write(LZ4BlockOutputStream.java:158)
at java.io.ObjectOutputStream$BlockDataOutputStream.drain(ObjectOutputStream.java:1877)
at java.io.ObjectOutputStream$BlockDataOutputStream.setBlockDataMode(ObjectOutputStream.java:1786)
at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1189)
at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:348)
at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:43)
at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$blockifyObject$2.apply(TorrentBroadcast.scala:236)
at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$blockifyObject$2.apply(TorrentBroadcast.scala:236)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1307)
at org.apache.spark.broadcast.TorrentBroadcast$.blockifyObject(TorrentBroadcast.scala:237)
at org.apache.spark.broadcast.TorrentBroadcast.writeBlocks(TorrentBroadcast.scala:107)
at org.apache.spark.broadcast.TorrentBroadcast.<init>(TorrentBroadcast.scala:86)
at org.apache.spark.broadcast.TorrentBroadcastFactory.newBroadcast(TorrentBroadcastFactory.scala:34)
at org.apache.spark.broadcast.BroadcastManager.newBroadcast(BroadcastManager.scala:56)
at org.apache.spark.SparkContext.broadcast(SparkContext.scala:1387)
at org.apache.spark.mllib.feature.Word2Vec$$anonfun$fit$1.apply$mcVI$sp(Word2Vec.scala:343)
  • setNumPartitions: increased from the default 1 to 20
  • setMinCount: increased to 10

Model training: "task-result-getter-3" java.lang.OutOfMemoryError: Java heap space

  • --conf "spark.driver.maxResultSize=2g" --driver-memory 4g

Model training: Job aborted due to stage failure: Serialized task 653:11 was 226450337 bytes, which exceeds max allowed: spark.rpc.message.maxSize (134217728 bytes). Consider increasing spark.rpc.message.maxSize or using broadcast variables for large values.

  • --conf spark.rpc.message.maxSize=512

Model loading: TransportChannelHandler: Exception in connection from uhadoop-adarbt-core1/10.10.117.232:42752

java.lang.OutOfMemoryError: Java heap space

at java.nio.HeapByteBuffer.<init>(HeapByteBuffer.java:57)
at java.nio.ByteBuffer.allocate(ByteBuffer.java:335)
at io.netty.buffer.CompositeByteBuf.nioBuffer(CompositeByteBuf.java:1160)
at io.netty.buffer.AbstractDerivedByteBuf.nioBuffer(AbstractDerivedByteBuf.java:73)
at io.netty.buffer.AbstractByteBuf.nioBuffer(AbstractByteBuf.java:939)
at org.apache.spark.network.buffer.NettyManagedBuffer.nioByteBuffer(NettyManagedBuffer.java:45)
at org.apache.spark.network.BlockTransferService$$anon$1.onBlockFetchSuccess(BlockTransferService.scala:99)
at org.apache.spark.network.shuffle.RetryingBlockFetcher$RetryingBlockFetchListener.onBlockFetchSuccess(RetryingBlockFetcher.java:206)
at org.apache.spark.network.shuffle.OneForOneBlockFetcher$ChunkCallback.onSuccess(OneForOneBlockFetcher.java:72)

  • --driver-memory 4g
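
Taken together, the fixes above mean submitting with all three settings; a combined invocation might look like this (the entry class and jar names are placeholders):

    spark-submit \
      --class com.example.RecommendTrain \
      --driver-memory 4g \
      --conf spark.driver.maxResultSize=2g \
      --conf spark.rpc.message.maxSize=512 \
      recommend.jar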
