Using MongoDB mapReduce to Work Around the Built-in group Limit

Anyone who has used MongoDB for a while has likely run into the limit of its built-in group command: group does not support result sets with more than 20,000 unique groupings.

At a previous employer I built a scraper (in Groovy) for average residential-community prices on a certain website, storing the results in MongoDB. Scraped data always needs a cleaning pass, and during deduplication I hit exactly this limitation: group cannot handle more than 20,000 groupings. After some back and forth, I switched to MongoDB mapReduce to do the job.
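The core idea is simple: the map phase emits a composite key for every document, and the reduce phase counts the emitted values per key, so any key whose count exceeds 1 is a duplicate. A minimal sketch of that counting logic in plain JavaScript (the sample documents, field subset, and helper name here are invented for illustration; the real work happens server-side in MongoDB):

```javascript
// Simulate the map/reduce duplicate-counting logic in plain JavaScript.
function mapReduceCount(docs, keyFn) {
  const groups = new Map();
  for (const doc of docs) {
    // "emit(key, 1)" phase: collect a 1 for each document under its composite key
    const key = JSON.stringify(keyFn(doc));
    if (!groups.has(key)) groups.set(key, []);
    groups.get(key).push(1);
  }
  // "reduce(key, values)" phase: values.length is the per-key document count
  const counts = new Map();
  for (const [key, values] of groups) counts.set(key, values.length);
  return counts;
}

// Hypothetical sample documents
const docs = [
  { s_date: 201909, city: "Beijing", name: "A" },
  { s_date: 201909, city: "Beijing", name: "A" },
  { s_date: 201909, city: "Beijing", name: "B" },
];
const counts = mapReduceCount(docs, d => ({ s_date: d.s_date, city: d.city, name: d.name }));
console.log([...counts.values()]); // [ 2, 1 ] -> the first key is a duplicate
```

Unlike group, mapReduce writes its results to an output collection, so the number of distinct keys is not capped at 20,000.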

Below is the core of the deduplication code, with sensitive details redacted.

package com.script.thirdPriceCrawl

import com.script.util.MongoUtils
import com.mongodb.BasicDBObject
import com.mongodb.DB
import com.mongodb.DBCollection
import com.mongodb.DBObject
import com.mongodb.MapReduceOutput
import org.apache.commons.lang.StringUtils

import java.util.concurrent.ExecutorService
import java.util.concurrent.Executors
import java.util.concurrent.TimeUnit

/**
 * Web crawler:
 * average-price data scrape for residential communities on a certain website
 * http://www.xxx.com/
 *
 * Utility class that uses MapReduce to count and clean up duplicate data
 * after parsing finishes; called before the data-cleaning script runs.
 *
 * @author 小辉哥/小辉GE
 * 2019-09-05, 21:22
 */
class ThirdPartPriceDataRepeatBatchUtils {

    // Output DBCollections for the MapReduce grouping results
    def static final THIRD_PARTY_PRICE_DATAREPEAT_TEMP = "third_party_price_repeat_data_temp"
    def static final THIRD_PARTY_PRICE_DATAPRICEREPEAT_TEMP = "third_party_price_repeat_dataprice_temp"

    // Thread pool
    def static ExecutorService fixedThreadPoolQuery

    // Other constants
    def static final CONNECT_CHAR = "#####"

    /**
     * Initialize the thread pool on each run, so that merely loading the class
     * does not leave pool threads occupying memory.
     */
    def static initThreadPool() {
        fixedThreadPoolQuery = Executors.newFixedThreadPool(50)
    }

    /**
     * A full run took 277 minutes in testing; expect roughly five hours.
     *
     * Finds all duplicate parsed data for the given month.
     * Grouping key: s_date + source + city + region + name
     * Note: run after parsing finishes and before cleaning starts.
     * @param sdate
     * @param thirdPartyColl
     */
    def static findThirdPriceDataRepeatBatch(sdate, DBCollection thirdPartyColl) {
        try {
            println("findThirdPriceDataRepeatBatch: duplicate-data processing started")
            long start = System.currentTimeMillis()

            // Query condition: filter by sdate
            DBObject query = new BasicDBObject().append("s_date", sdate)

            // map function
            String mapfun = "function(){" +
                    "emit({s_date:this.s_date, source:this.source, city:this.city, region:this.region, name:this.name}, 1);" +
                    "};"

            // reduce function
            String reducefun = "function(key, values){" +
                    "return values.length;" +
                    "};"

            // Run MapReduce
            MapReduceOutput mapReduce = thirdPartyColl.mapReduce(mapfun, reducefun, THIRD_PARTY_PRICE_DATAREPEAT_TEMP, query)
            if (mapReduce != null && mapReduce.results().size() > 0) {
                // Initialize the thread pool only after mapReduce.results() succeeds,
                // so no pool is created if thirdPartyColl.mapReduce throws
                initThreadPool()
                mapReduce.results().each { DBObject o ->
                    try {
                        if (o.value > 1) {
                            fixedThreadPoolQuery.execute(new Runnable() {
                                void run() {
                                    try {
                                        println("Calling findThirdPriceDataRepeatByPriceBatch, obj: " + o.toString())
                                        findThirdPriceDataRepeatByPriceBatch((int) (o._id.s_date), o._id.source, o._id.city,
                                                o._id.region, o._id.name, thirdPartyColl)
                                    } catch (Exception e) {
                                        println "findThirdPriceDataRepeatByPriceBatch failed for DBObject " + o.toString() + ", error: " + e.getMessage()
                                    }
                                }
                            })
                        }
                    } catch (Exception e) {
                        println("findThirdPriceDataRepeatBatch: error while iterating mapReduce.results(), DBObject " + o.toString() + ", error: " + e.getLocalizedMessage())
                    }
                }
                fixedThreadPoolQuery.shutdown()
                fixedThreadPoolQuery.awaitTermination(2, TimeUnit.DAYS)
                long end = System.currentTimeMillis()
                println "findThirdPriceDataRepeatBatch: duplicate-data processing finished, took " + (end - start) + " ms"
            }
        } catch (Exception e) {
            println("findThirdPriceDataRepeatBatch failed, error: " + e.getLocalizedMessage())
        }
    }

    /**
     * Grouping key: s_date + source + city + region + name + avg_price
     * @param sdate
     * @param source
     * @param city
     * @param region
     * @param name
     * @param thirdPartyColl
     */
    def static findThirdPriceDataRepeatByPriceBatch(sdate, source, city, region, name, DBCollection thirdPartyColl) {
        // After grouping by price as well, a record counts as a duplicate not by the
        // group's own count, but because the same s_date+source+city+region+name
        // combination appears more than once in mapReduce.results(), i.e. the same
        // listing has several distinct prices.
        def dataRepeatMap = [:]
        try {
            // Query condition: filter by sdate, source, city, region and name
            DBObject query = new BasicDBObject().append("s_date", sdate)
                    .append("source", source).append("city", city).append("region", region).append("name", name)

            // map function
            String mapfun = "function(){" +
                    "emit({s_date:this.s_date, source:this.source, city:this.city, region:this.region, name:this.name, avg_price:this.avg_price}, 1);" +
                    "};"

            // reduce function
            String reducefun = "function(key, values){" +
                    "return values.length;" +
                    "};"

            // Run MapReduce
            MapReduceOutput mapReduce = thirdPartyColl.mapReduce(mapfun, reducefun, THIRD_PARTY_PRICE_DATAPRICEREPEAT_TEMP, query)
            if (mapReduce != null && mapReduce.results().size() > 0) {
                mapReduce.results().each { DBObject o ->
                    try {
                        def sd = (int) o._id.s_date
                        def keys = sd + CONNECT_CHAR + o._id.source + CONNECT_CHAR + o._id.city +
                                CONNECT_CHAR + o._id.region + CONNECT_CHAR + o._id.name
                        if (!dataRepeatMap.containsKey(keys)) {
                            println("dataRepeatMap does not contain key " + keys + ", adding it")
                            dataRepeatMap.put(keys, keys)
                        } else {
                            println("dataRepeatMap already contains key " + keys + ", deleting duplicate: " + o.toString())
                            deleteThirdPriceDataBatch((int) (o._id.s_date), o._id.source, o._id.city,
                                    o._id.region, o._id.name, thirdPartyColl)
                        }
                    } catch (Exception e) {
                        println("deleteThirdPriceDataBatch failed for DBObject " + o.toString() + ", error: " + e.getLocalizedMessage())
                    }
                }
            }
        } catch (Exception e) {
            println("findThirdPriceDataRepeatByPriceBatch failed, error: " + e.getLocalizedMessage())
        }
    }

    /**
     * Delete duplicate data
     * @param sdate
     * @param source
     * @param city
     * @param region
     * @param name
     * @param thirdPartyColl
     */
    def static deleteThirdPriceDataBatch(sdate, source, city, region, name, DBCollection thirdPartyColl) {
        DBObject obj = new BasicDBObject()
        obj.put("s_date", sdate)
        obj.put("source", source)
        obj.put("city", city)
        obj.put("region", region)
        obj.put("name", name)
        thirdPartyColl.remove(obj)
    }
}
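The decision in the second pass is the subtle part: after grouping by price as well, the same s_date#####source#####city#####region#####name key showing up more than once means one listing carries several distinct prices, and every occurrence after the first is handed to the delete routine. A minimal simulation of that decision in plain JavaScript (the sample per-price groups and the helper name are invented for illustration):

```javascript
const CONNECT_CHAR = "#####";

// Given the per-price group keys from the second mapReduce pass, return the
// groups that should be deleted: every occurrence of a
// s_date/source/city/region/name combination after the first one seen.
function findDuplicateKeys(priceGroups) {
  const seen = new Set();
  const toDelete = [];
  for (const g of priceGroups) {
    const key = [g.s_date, g.source, g.city, g.region, g.name].join(CONNECT_CHAR);
    if (seen.has(key)) {
      toDelete.push(g); // same listing with another distinct price -> duplicate
    } else {
      seen.add(key);
    }
  }
  return toDelete;
}

// Hypothetical per-price groups: listing "X" appears with two different prices
const groups = [
  { s_date: 201909, source: "siteA", city: "Beijing", region: "Chaoyang", name: "X", avg_price: 50000 },
  { s_date: 201909, source: "siteA", city: "Beijing", region: "Chaoyang", name: "X", avg_price: 51000 },
  { s_date: 201909, source: "siteA", city: "Beijing", region: "Haidian", name: "Y", avg_price: 40000 },
];
console.log(findDuplicateKeys(groups).length); // 1 -> one duplicate flagged
```

Note that the Groovy deleteThirdPriceDataBatch above removes by s_date/source/city/region/name only (without avg_price), so it deletes all documents matching that combination rather than just the extra price rows; whether that is acceptable depends on the cleaning pipeline that follows.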

The output is omitted here.

The code above is for reference only; if anything is off, corrections are welcome!

For more material like this, feel free to follow and get in touch. I look forward to exchanging ideas and discussing technology with you!

