Difficulties Facing Weibo Social Circle Mining

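I am a big fan of the TV series 《亮剑》, in which Li Yunlong (李云龙) often says: we can't fight for half a day and still not know who the enemy is.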

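So in this post I want to briefly analyze the problems that Weibo social circle mining currently faces. We should not spend half a day on analysis, fixating on how the results look, while forgetting what the fundamental problem, the hard part, actually is.

Community structure detection in complex networks has been studied for many years. There are divisive methods, agglomerative methods, methods based on network dynamics, and plenty of more exotic ones. Each of them suits particular kinds of network structure. Take the two properties mentioned in the previous post: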

  1. Hierarchy
  2. Overlap

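Some methods handle hierarchy well, such as GN and Newman's fast algorithm; others handle overlap very well, with the k-clique method as the typical example. Later, researchers proposed methods that tackle both problems together. For example, a 2009 paper by @沈华伟_ICT of the Institute of Computing Technology introduced the EAGLE algorithm, which handles both hierarchy and overlap. On a complex network built from words and on a network of scientists it obtained good results, better than Newman's fast algorithm and the k-clique algorithm. It is a nice piece of work.

Applied to Weibo circle mining, however, these algorithms give disappointing results. Why? Because Weibo is a network even more "complex" than a "complex network". Take the GN algorithm: it achieves decent accuracy when the average node degree is 5, 6, or 7, but at 8 the results become poor. The same holds for Newman's fast algorithm. The k-clique algorithm, applied to a network with a large average degree, produces very large communities, as the example in the previous post showed clearly. In @沈华伟_ICT's paper the average degrees of the two networks are about 9 and 3 (rounded), which is still small compared with the average out-degree on Weibo.

I collected statistics on the network of accounts I follow, keeping only mutual follows to reduce its size. Apologies for not showing a figure: the graph that networkx draws is a mess, far too dense. The statistics are as follows: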

number_of_nodes 232
number_of_edges 1467
 
degree_of_刘大鸿 8
degree_of_摇摆巴赫 25
degree_of_周运洪yunhong 15
degree_of_搜狗郭昂 21
degree_of_东坡门人 5
degree_of_暖暖cathy 5
degree_of_淘宝虚云 9
degree_of_IR-Lucene 35
degree_of_杨先锋UU 4
degree_of_hszhsh1915647047 6
degree_of_谭卫国Forest 5
degree_of_荣名为宝 4
degree_of_王志超 14
degree_of_悦晓0709 7
degree_of_Joseph星之海洋 2
degree_of_bicloud 16
degree_of_张刚-bert 13
degree_of_gycheng 15
degree_of_Humyy 6
degree_of_张某_ICT 230
degree_of_bill323 7
degree_of_蕃茄me 7
degree_of_nancy_4633 2
degree_of_冷建成 5
degree_of_kafka0102 4
degree_of_檀林_hootch 8
degree_of_jluwanghui 1
degree_of_杨彦闯 12
degree_of_崔冲儿 2
degree_of_yellowleaf2010 10
degree_of_郭嘉丰_ICT 20
degree_of_柯南小胖道尔 4
degree_of_公帅_ICT 18
degree_of_aslyc 5
degree_of_sunli1223 23
degree_of_创业者徐仁禄 6
degree_of_张诚zhangcheng 1
degree_of_Darryn 1
degree_of_李-曙光 11
degree_of_foxmailed 15
degree_of_Firewind 4
degree_of_四正 9
degree_of_燕子_lynn 3
degree_of_王东wd 3
degree_of_蒋涛CSDN 22
degree_of_常佳佳-Jason 16
degree_of_詹剑锋_中科院 62
degree_of_GUCAS老H 17
degree_of_宋波simba 13
degree_of_马金柱focus 8
degree_of_梁公军 20
degree_of_小木118 1
degree_of_Cherwen 1
degree_of_威廉他 11
degree_of_雨前LYQ 12
degree_of_苏牧洋 1
degree_of_elvar2011 3
degree_of_沈华伟_ICT 9
degree_of_霍泰稳 27
degree_of_Yahoo韩轶平 10
degree_of_小象公主-猎头一枚 6
degree_of_炼心-自强 8
degree_of_王向东 6
degree_of_泰山泰山 1
degree_of_视觉研究 10
degree_of_八六孩儿 3
degree_of__Diaoer 3
degree_of_孟二利 22
degree_of_雨梦_yumengkk 14
degree_of_wpwei 2
degree_of_IT民工-老蓝 1
degree_of_TimYang 37
degree_of_陈利人 5
degree_of_桂林山水78 6
degree_of_Saylove浣熊 1
degree_of_崔卫兵 2
degree_of_sigmod 9
degree_of_林乐宇_冰山雪豹 4
degree_of_杨逍Venus 11
degree_of_新IT民工 21
degree_of_头不疼 1
degree_of_庞崇- 1
degree_of_爱的马斯特 14
degree_of_Binos_ICT 7
degree_of_豆爸何锐 8
degree_of_幸运coming琳琳 12
degree_of_RefuseBT 1
degree_of_网路冷眼 32
degree_of_橘子郡_guy 15
degree_of_秋实Li 4
degree_of_BetaCafe 10
degree_of_AmyDeng_Fusionio 38
degree_of_jingmouren 10
degree_of_jqliu 4
degree_of_影子猎手 9
degree_of_liangjz 36
degree_of_bodd 6
degree_of_海带丝丝 3
degree_of_宗秀倩 5
degree_of_程序媛 8
degree_of_互联网聚焦 2
degree_of_李猛-Mn 29
degree_of_51CTO官方微博 7
degree_of_MapReduce 31
degree_of_小樱Daisy 1
degree_of_IT技术博客大学习 22
degree_of_XiaoJunHong 19
degree_of_肖瑞麟Jerry 13
degree_of_凌峰TB 6
degree_of_董安民 3
degree_of_美国经济 1
degree_of_张启达 4
degree_of_万树-杨 7
degree_of_陈房伟 2
degree_of_围棋搜索引擎 9
degree_of_wenzhihong 2
degree_of_吴尔平-andy 6
degree_of_大时代投资 1
degree_of_KissDev 46
degree_of_forchenyun 13
degree_of_Ken王健 4
degree_of_琦大头 5
degree_of_花开花落1003 2
degree_of_张夏天_机器学习 15
degree_of_-林鸿飞- 28
degree_of_solochar 50
degree_of_luketty 3
degree_of_黄麟晰 2
degree_of_张永生 10
degree_of_arpro 7
degree_of_董明楷 4
degree_of_淘解伦 35
degree_of_ICT_朱亚东 21
degree_of_胡云华MSRA 6
degree_of_zangxt 11
degree_of_lordhong 6
degree_of_hny101 4
degree_of_鱼晓-五毛 12
degree_of_guoyipeng 10
degree_of_TreapDB 40
degree_of_张杰_NoahArk 6
degree_of_吕慧伟 4
degree_of___那谁__ 15
degree_of_梁斌penny 80
degree_of_飞雪巴啦巴巴巴 1
degree_of_建文的马甲 5
degree_of_THUIRDB 61
degree_of_花火易碎 5
degree_of_潘少宁_腾讯_LAMP人 26
degree_of_Eva奶奶 2
degree_of_alue-fabre 5
degree_of_肖隆平 1
degree_of_loveEmma 11
degree_of_创业-育森 19
degree_of_独孤虎-李利鹏 10
degree_of_berlinix 5
degree_of_gongbin 8
degree_of_Abioy 10
degree_of_即刻搜索JIKE 11
degree_of_触景无限 3
degree_of_王上游 4
degree_of_易观胡斌 1
degree_of_小鱼西游 4
degree_of_兔杰列夫 4
degree_of_草木菁菁无畏 1
degree_of_wenzhong 25
degree_of_丁国栋_ICT 40
degree_of_袁小晕 9
degree_of_拓尔思 11
degree_of_zhh_1211卉 2
degree_of_nzinfo 25
degree_of_王大美 10
degree_of_刘德超richard 1
degree_of_聋瞎的世界 1
degree_of_丕子 34
degree_of_工体东路 5
degree_of_fishermen 17
degree_of_刘克庄 2
degree_of_任勇_东京大学 23
degree_of_leeyanva 15
degree_of_闫刚2012 1
degree_of_贺志明_ICT 39
degree_of_winston 9
degree_of_yaronli 18
degree_of_bian 15
degree_of_fengyuncrawl 73
degree_of_王斌_ICTIR 33
degree_of_SunGis 6
degree_of_数据挖掘_PHP 11
degree_of_OnlyXP 8
degree_of_王联辉 22
degree_of_张凯1976 11
degree_of_罗大维 15
degree_of_奔三北P 2
degree_of_net_ashes 4
degree_of_小五丫头 1
degree_of_九州-姬野 31
degree_of_温小燕儿 1
degree_of_薄荷糖糖 3
degree_of_宁怡 2
degree_of_wave2future 2
degree_of_Qunar-JarnTang 7
degree_of_关毅的围脖 18
degree_of_猎头-Kevin 19
degree_of_玉宇金辉 2
degree_of_liudaoru 19
degree_of_王栋PKU 13
degree_of_淘宝日照 19
degree_of_争一言 1
degree_of_蓝俊杰 1
degree_of_武卫东 23
degree_of_ElmerZhang 5
degree_of_生活精选 1
degree_of_object 3
degree_of_soker 6
degree_of_asddew23f 3
degree_of_孟鸿 7
degree_of_c背a井t医y志y猫c 3
degree_of_佟怡峦 7
degree_of_hi郭海峰 2
degree_of_图灵杨海玲 32
degree_of_顾平Baidu 7
degree_of_林语姿 1
degree_of_QLeelulu 4
degree_of_张颖峰 21
degree_of_淘宝褚霸 41
degree_of_墨顏Shiver 1
degree_of_微博Koth 38
degree_of_魔时科技张首华 33
degree_of_longxibendi 2
degree_of_timo 10
 
average_of_degree 12
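
Statistics of this kind can be computed along the following lines. This is only a minimal sketch in plain Python, not the script that produced the listing: the function name `mutual_follow_stats`, the `(follower, followee)` input format, and the toy data are all assumptions made for illustration. It keeps an undirected edge only when two accounts follow each other, then reports the node count, edge count, per-node degree, and average degree.

```python
from collections import defaultdict


def mutual_follow_stats(follows):
    """Summarise the undirected graph made of mutual follows only.

    `follows` is an iterable of (follower, followee) tuples (assumed format).
    Accounts without any mutual follow do not appear in the result.
    """
    follows = set(follows)
    # Keep an undirected edge only when both directions are present.
    edges = {frozenset((a, b)) for a, b in follows if a != b and (b, a) in follows}

    degree = defaultdict(int)
    for edge in edges:
        for node in edge:
            degree[node] += 1

    n, m = len(degree), len(edges)
    return {
        "number_of_nodes": n,
        "number_of_edges": m,
        "degree": dict(degree),
        # Every edge contributes to two degrees, so the mean degree is 2m / n.
        "average_of_degree": 2 * m / n if n else 0.0,
    }


# Hypothetical toy data: "a" and "b" follow each other, as do "a" and "c";
# "c" follows "b" without being followed back, so that pair is dropped.
toy = [("a", "b"), ("b", "a"), ("a", "c"), ("c", "a"), ("c", "b")]
stats = mutual_follow_stats(toy)
for key in ("number_of_nodes", "number_of_edges", "average_of_degree"):
    print(key, stats[key])

# The same 2m / n formula applied to the counts reported in the listing.
print(round(2 * 1467 / 232, 2))  # 12.65
```

With the reported 232 nodes and 1467 edges, 2m / n comes to roughly 12.65, which agrees with the listed average of 12 if the original script truncated the value to an integer (an assumption about how it was computed).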

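In the statistics, the node and edge counts come first, followed by the degree of each node, and finally the average degree, 12. Compared with the networks used in existing work, this degree is really too large. (Note: this is last year's data and certainly no longer matches the current state.)

From the analysis and data above, the main problem facing Weibo circle mining is that the Weibo network is simply too complex. For a user like me, who follows only 200-odd accounts, the average degree already reaches 12, so the networks around celebrity accounts must be more complex still. New methods are urgently needed, and I hope this post is useful to anyone preparing to work on social network mining. As for my own solution, I have some preliminary ideas and will share them once there are results.

Some readers may argue that large-scale analysis is the real problem. I think that can wait until a workable method exists; what matters most is the quality of the results.

I hope we can discuss this together.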
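
As a small illustration of why dense networks trouble the k-clique method, here is a sketch of clique percolation restricted to k = 3, written in plain Python rather than taken from any library. Two triangles belong to the same community when they share an edge. The function name `triangle_communities` and the eight-node toy graph are made up for this example; it is not the implementation used in any of the papers mentioned.

```python
from itertools import combinations


def triangle_communities(edges):
    """3-clique percolation: merge triangles that share an edge.

    `edges` is a list of (u, v) pairs with sortable node labels.
    Returns the communities as a list of node sets, largest first.
    """
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)

    # Enumerate every triangle once, as a sorted node triple.
    triangles = set()
    for u, v in edges:
        for w in adj[u] & adj[v]:
            triangles.add(tuple(sorted((u, v, w))))
    triangles = sorted(triangles)

    # Union-find over triangle indices.
    parent = list(range(len(triangles)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Triangles that contain the same edge are merged into one group.
    first_with_edge = {}
    for i, tri in enumerate(triangles):
        for edge in combinations(tri, 2):
            j = first_with_edge.setdefault(edge, i)
            parent[find(i)] = find(j)

    groups = {}
    for i, tri in enumerate(triangles):
        groups.setdefault(find(i), set()).update(tri)
    return sorted(groups.values(), key=len, reverse=True)


# Toy graph: two 4-cliques, {0,1,2,3} and {4,5,6,7}, joined by one bridge edge.
sparse = list(combinations(range(0, 4), 2)) + list(combinations(range(4, 8), 2))
sparse.append((3, 4))
# The same graph with only two more edges across the groups.
dense = sparse + [(2, 4), (3, 5)]

sparse_sizes = [len(c) for c in triangle_communities(sparse)]
dense_sizes = [len(c) for c in triangle_communities(dense)]
print(sparse_sizes)  # [4, 4]
print(dense_sizes)   # [8]
```

On this toy graph, two extra cross edges are enough to merge the two groups of four into a single community of eight. It is the same effect, in miniature, as the oversized communities that appear when the average degree is high.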

Reposted from: https://www.cnblogs.com/sing1ee/archive/2012/02/28/2765024.html
