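I am a big fan of the TV series 《亮剑》. Its protagonist, 李云龙, often says: we can't fight for half a day and still not know who the enemy is.

So in this post I will briefly analyze what problem Weibo social-circle mining actually faces today. We should not spend ages on analysis, focusing only on how the results turned out, while forgetting what the fundamental problem, the hard part, really is.

Community structure detection in complex networks has been studied for many years. There are divisive methods, agglomerative methods, methods based on network dynamics, and plenty of other, more unusual ones. Each of these methods suits particular network structures. Take the two properties mentioned in the previous post: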
- Hierarchy
- Overlap
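Some methods handle hierarchy well, such as GN and Newman's fast algorithm; others handle overlap very well, with the k-clique method as the typical example. Later, researchers proposed methods that address both properties together. For instance, a 2009 paper by @沈华伟_ICT of the Institute of Computing Technology (计算所) introduced the EAGLE algorithm, which handles both hierarchy and overlap well. On a word-association network and a scientist network it obtained good results, better than both Newman's fast algorithm and k-clique. It is a nice piece of work.

However, when the algorithms above are applied to Weibo circle mining, the results are unsatisfactory. Why? Because Weibo is a network even more "complex" than a "complex network". Take GN: it reaches decent accuracy when the average node degree is 5, 6, or 7, but once the average degree reaches 8 the results become poor. The same holds for Newman's fast algorithm. When k-clique is applied to a network with a large average degree, it produces very large communities, as the example in the previous post showed clearly. In @沈华伟_ICT's paper, the average degrees of the two networks are 9 and 3 (rounded), which is still small compared with the average out-degree on Weibo.

I computed statistics on the network of accounts I follow, keeping only mutual follows in order to shrink the network. Please forgive me for not showing a figure: the plot that networkx draws is just a tangled mess, because the graph is too dense. The statistics are as follows: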
number_of_nodes | 232 |
number_of_edges | 1467 |
degree_of_刘大鸿 | 8 |
degree_of_摇摆巴赫 | 25 |
degree_of_周运洪yunhong | 15 |
degree_of_搜狗郭昂 | 21 |
degree_of_东坡门人 | 5 |
degree_of_暖暖cathy | 5 |
degree_of_淘宝虚云 | 9 |
degree_of_IR-Lucene | 35 |
degree_of_杨先锋UU | 4 |
degree_of_hszhsh1915647047 | 6 |
degree_of_谭卫国Forest | 5 |
degree_of_荣名为宝 | 4 |
degree_of_王志超 | 14 |
degree_of_悦晓0709 | 7 |
degree_of_Joseph星之海洋 | 2 |
degree_of_bicloud | 16 |
degree_of_张刚-bert | 13 |
degree_of_gycheng | 15 |
degree_of_Humyy | 6 |
degree_of_张某_ICT | 230 |
degree_of_bill323 | 7 |
degree_of_蕃茄me | 7 |
degree_of_nancy_4633 | 2 |
degree_of_冷建成 | 5 |
degree_of_kafka0102 | 4 |
degree_of_檀林_hootch | 8 |
degree_of_jluwanghui | 1 |
degree_of_杨彦闯 | 12 |
degree_of_崔冲儿 | 2 |
degree_of_yellowleaf2010 | 10 |
degree_of_郭嘉丰_ICT | 20 |
degree_of_柯南小胖道尔 | 4 |
degree_of_公帅_ICT | 18 |
degree_of_aslyc | 5 |
degree_of_sunli1223 | 23 |
degree_of_创业者徐仁禄 | 6 |
degree_of_张诚zhangcheng | 1 |
degree_of_Darryn | 1 |
degree_of_李-曙光 | 11 |
degree_of_foxmailed | 15 |
degree_of_Firewind | 4 |
degree_of_四正 | 9 |
degree_of_燕子_lynn | 3 |
degree_of_王东wd | 3 |
degree_of_蒋涛CSDN | 22 |
degree_of_常佳佳-Jason | 16 |
degree_of_詹剑锋_中科院 | 62 |
degree_of_GUCAS老H | 17 |
degree_of_宋波simba | 13 |
degree_of_马金柱focus | 8 |
degree_of_梁公军 | 20 |
degree_of_小木118 | 1 |
degree_of_Cherwen | 1 |
degree_of_威廉他 | 11 |
degree_of_雨前LYQ | 12 |
degree_of_苏牧洋 | 1 |
degree_of_elvar2011 | 3 |
degree_of_沈华伟_ICT | 9 |
degree_of_霍泰稳 | 27 |
degree_of_Yahoo韩轶平 | 10 |
degree_of_小象公主-猎头一枚 | 6 |
degree_of_炼心-自强 | 8 |
degree_of_王向东 | 6 |
degree_of_泰山泰山 | 1 |
degree_of_视觉研究 | 10 |
degree_of_八六孩儿 | 3 |
degree_of__Diaoer | 3 |
degree_of_孟二利 | 22 |
degree_of_雨梦_yumengkk | 14 |
degree_of_wpwei | 2 |
degree_of_IT民工-老蓝 | 1 |
degree_of_TimYang | 37 |
degree_of_陈利人 | 5 |
degree_of_桂林山水78 | 6 |
degree_of_Saylove浣熊 | 1 |
degree_of_崔卫兵 | 2 |
degree_of_sigmod | 9 |
degree_of_林乐宇_冰山雪豹 | 4 |
degree_of_杨逍Venus | 11 |
degree_of_新IT民工 | 21 |
degree_of_头不疼 | 1 |
degree_of_庞崇- | 1 |
degree_of_爱的马斯特 | 14 |
degree_of_Binos_ICT | 7 |
degree_of_豆爸何锐 | 8 |
degree_of_幸运coming琳琳 | 12 |
degree_of_RefuseBT | 1 |
degree_of_网路冷眼 | 32 |
degree_of_橘子郡_guy | 15 |
degree_of_秋实Li | 4 |
degree_of_BetaCafe | 10 |
degree_of_AmyDeng_Fusionio | 38 |
degree_of_jingmouren | 10 |
degree_of_jqliu | 4 |
degree_of_影子猎手 | 9 |
degree_of_liangjz | 36 |
degree_of_bodd | 6 |
degree_of_海带丝丝 | 3 |
degree_of_宗秀倩 | 5 |
degree_of_程序媛 | 8 |
degree_of_互联网聚焦 | 2 |
degree_of_李猛-Mn | 29 |
degree_of_51CTO官方微博 | 7 |
degree_of_MapReduce | 31 |
degree_of_小樱Daisy | 1 |
degree_of_IT技术博客大学习 | 22 |
degree_of_XiaoJunHong | 19 |
degree_of_肖瑞麟Jerry | 13 |
degree_of_凌峰TB | 6 |
degree_of_董安民 | 3 |
degree_of_美国经济 | 1 |
degree_of_张启达 | 4 |
degree_of_万树-杨 | 7 |
degree_of_陈房伟 | 2 |
degree_of_围棋搜索引擎 | 9 |
degree_of_wenzhihong | 2 |
degree_of_吴尔平-andy | 6 |
degree_of_大时代投资 | 1 |
degree_of_KissDev | 46 |
degree_of_forchenyun | 13 |
degree_of_Ken王健 | 4 |
degree_of_琦大头 | 5 |
degree_of_花开花落1003 | 2 |
degree_of_张夏天_机器学习 | 15 |
degree_of_-林鸿飞- | 28 |
degree_of_solochar | 50 |
degree_of_luketty | 3 |
degree_of_黄麟晰 | 2 |
degree_of_张永生 | 10 |
degree_of_arpro | 7 |
degree_of_董明楷 | 4 |
degree_of_淘解伦 | 35 |
degree_of_ICT_朱亚东 | 21 |
degree_of_胡云华MSRA | 6 |
degree_of_zangxt | 11 |
degree_of_lordhong | 6 |
degree_of_hny101 | 4 |
degree_of_鱼晓-五毛 | 12 |
degree_of_guoyipeng | 10 |
degree_of_TreapDB | 40 |
degree_of_张杰_NoahArk | 6 |
degree_of_吕慧伟 | 4 |
degree_of___那谁__ | 15 |
degree_of_梁斌penny | 80 |
degree_of_飞雪巴啦巴巴巴 | 1 |
degree_of_建文的马甲 | 5 |
degree_of_THUIRDB | 61 |
degree_of_花火易碎 | 5 |
degree_of_潘少宁_腾讯_LAMP人 | 26 |
degree_of_Eva奶奶 | 2 |
degree_of_alue-fabre | 5 |
degree_of_肖隆平 | 1 |
degree_of_loveEmma | 11 |
degree_of_创业-育森 | 19 |
degree_of_独孤虎-李利鹏 | 10 |
degree_of_berlinix | 5 |
degree_of_gongbin | 8 |
degree_of_Abioy | 10 |
degree_of_即刻搜索JIKE | 11 |
degree_of_触景无限 | 3 |
degree_of_王上游 | 4 |
degree_of_易观胡斌 | 1 |
degree_of_小鱼西游 | 4 |
degree_of_兔杰列夫 | 4 |
degree_of_草木菁菁无畏 | 1 |
degree_of_wenzhong | 25 |
degree_of_丁国栋_ICT | 40 |
degree_of_袁小晕 | 9 |
degree_of_拓尔思 | 11 |
degree_of_zhh_1211卉 | 2 |
degree_of_nzinfo | 25 |
degree_of_王大美 | 10 |
degree_of_刘德超richard | 1 |
degree_of_聋瞎的世界 | 1 |
degree_of_丕子 | 34 |
degree_of_工体东路 | 5 |
degree_of_fishermen | 17 |
degree_of_刘克庄 | 2 |
degree_of_任勇_东京大学 | 23 |
degree_of_leeyanva | 15 |
degree_of_闫刚2012 | 1 |
degree_of_贺志明_ICT | 39 |
degree_of_winston | 9 |
degree_of_yaronli | 18 |
degree_of_bian | 15 |
degree_of_fengyuncrawl | 73 |
degree_of_王斌_ICTIR | 33 |
degree_of_SunGis | 6 |
degree_of_数据挖掘_PHP | 11 |
degree_of_OnlyXP | 8 |
degree_of_王联辉 | 22 |
degree_of_张凯1976 | 11 |
degree_of_罗大维 | 15 |
degree_of_奔三北P | 2 |
degree_of_net_ashes | 4 |
degree_of_小五丫头 | 1 |
degree_of_九州-姬野 | 31 |
degree_of_温小燕儿 | 1 |
degree_of_薄荷糖糖 | 3 |
degree_of_宁怡 | 2 |
degree_of_wave2future | 2 |
degree_of_Qunar-JarnTang | 7 |
degree_of_关毅的围脖 | 18 |
degree_of_猎头-Kevin | 19 |
degree_of_玉宇金辉 | 2 |
degree_of_liudaoru | 19 |
degree_of_王栋PKU | 13 |
degree_of_淘宝日照 | 19 |
degree_of_争一言 | 1 |
degree_of_蓝俊杰 | 1 |
degree_of_武卫东 | 23 |
degree_of_ElmerZhang | 5 |
degree_of_生活精选 | 1 |
degree_of_object | 3 |
degree_of_soker | 6 |
degree_of_asddew23f | 3 |
degree_of_孟鸿 | 7 |
degree_of_c背a井t医y志y猫c | 3 |
degree_of_佟怡峦 | 7 |
degree_of_hi郭海峰 | 2 |
degree_of_图灵杨海玲 | 32 |
degree_of_顾平Baidu | 7 |
degree_of_林语姿 | 1 |
degree_of_QLeelulu | 4 |
degree_of_张颖峰 | 21 |
degree_of_淘宝褚霸 | 41 |
degree_of_墨顏Shiver | 1 |
degree_of_微博Koth | 38 |
degree_of_魔时科技张首华 | 33 |
degree_of_longxibendi | 2 |
degree_of_timo | 10 |
average_of_degree | 12 |
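The first two rows give the number of nodes and edges, the middle rows give the degree of each node, and the last row gives the average degree, 12. Compared with the networks used in existing work, this degree is far too large. (Note: this is last year's data and certainly no longer matches the current data.)

From the analysis and data above, we can see that the main problem facing Weibo circle mining is that the Weibo network is too complex. I am a user who follows only a little over 200 accounts, and the average degree of my network has already reached 12; the networks around celebrity accounts must be far more complex. New methods are therefore urgently needed, and I hope this post helps anyone who is about to start working on social mining. As for my own solution, I have some preliminary ideas and will share them once I have results.

Some readers will argue that large-scale analysis is the real problem. I think that can wait until a working method exists; what matters most right now is the quality of the results.

I look forward to discussing this with everyone.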
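For readers who want to produce the same kind of statistics for their own account, here is a minimal sketch in plain Python. It is not the script used for the table above. The input format (a list of `(follower, followee)` pairs), the function names `mutual_follow_graph` and `degree_stats`, and the toy data are all assumptions made for illustration. The last lines apply the identity "average degree = 2 × edges / nodes" to the node and edge counts reported in the table.

```python
from collections import defaultdict


def mutual_follow_graph(follows):
    """Build an undirected adjacency map that keeps only mutual follows.

    `follows` is an iterable of (follower, followee) pairs; this input
    format is an assumption for the sketch, not the real Weibo API.
    Accounts without any mutual follow do not appear in the result.
    """
    directed = set(follows)
    adjacency = defaultdict(set)
    for a, b in directed:
        # Keep the edge only if the follow is reciprocated.
        if a != b and (b, a) in directed:
            adjacency[a].add(b)
            adjacency[b].add(a)
    return dict(adjacency)


def degree_stats(adjacency):
    """Return node count, edge count, per-node degree and average degree."""
    degrees = {node: len(neighbors) for node, neighbors in adjacency.items()}
    n = len(degrees)
    m = sum(degrees.values()) // 2  # every edge is counted at both endpoints
    average = 2 * m / n if n else 0.0
    return {
        "number_of_nodes": n,
        "number_of_edges": m,
        "degrees": degrees,
        "average_of_degree": average,
    }


# Toy data (hypothetical accounts): a<->b and a<->c are mutual,
# b->c and d->a are one-way and get dropped.
follows = [("a", "b"), ("b", "a"), ("a", "c"), ("c", "a"), ("b", "c"), ("d", "a")]
stats = degree_stats(mutual_follow_graph(follows))

# Same identity applied to the counts reported in the table above.
table_average = 2 * 1467 / 232
```

On the reported counts, `table_average` comes out between 12 and 13, so the 12 in the table looks like a truncated value rather than a rounded one. That reading is an inference from the two counts, not something the original data states.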