Community Detection Datasets

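Contents

  • Datasets based on link analysis
  • Datasets based on links and discrete attributes
  • Datasets based on links and text attributes
  • Other common dataset links
  • Datasets collected by Mark Newman
  • Social and Information Network Analysis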

Datasets based on link analysis

  • Zachary karate club

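    The Zachary network is a social network built from observations of a karate club at a US university. It contains 34 nodes and 78 edges: each node is a club member, and each edge is a friendship between two members. The karate club network has become a classic benchmark for community structure detection in complex networks [1]. [Download link]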

  • American College football

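    The College Football network is a social network that Newman built from the American college football league. It contains 115 nodes and 616 edges: each node is a team, and an edge between two nodes means the two teams played a game against each other. The 115 teams are divided into 12 conferences. Teams first play within their own conference and then play teams from other conferences, so games within a conference are more frequent than games between conferences. The conferences therefore serve as the ground-truth community structure of the network [2]. [Download link]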

  • Dolphin social network

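    The Dolphin dataset is a social network of dolphins built by D. Lusseau et al., who spent seven years observing how a group of 62 dolphins in Doubtful Sound, New Zealand, associated with one another. The network has 62 nodes and 159 edges. Nodes are dolphins, and edges indicate frequent contact between two dolphins [3]. [Download link]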

  • netscience dataset

    Netscience is a coauthorship network of scientists working on network theory and experiment. The dataset contains all components of the network, for a total of 1589 scientists [12]. [Download link; access password: 4bfc]
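
Several of the networks above ship with ground-truth communities (for example, the football conferences). As a minimal sketch of how a known partition can be checked against the link structure, the code below computes the fraction of intra-community edges and the Newman-Girvan modularity Q. The toy graph and the names `edges`, `community`, `modularity`, and `intra_fraction` are illustrative assumptions, not part of any of the datasets.

```python
from collections import defaultdict


def intra_fraction(edges, community):
    """Fraction of edges whose two endpoints share a community label."""
    same = sum(1 for u, v in edges if community[u] == community[v])
    return same / len(edges)


def modularity(edges, community):
    """Newman-Girvan modularity Q for an undirected, unweighted graph.

    Q = sum over communities c of  (e_c / m) - (d_c / (2 m))^2
    where m is the number of edges, e_c the number of edges inside c,
    and d_c the total degree of the nodes in c.
    """
    m = len(edges)
    inside = defaultdict(int)        # e_c
    total_degree = defaultdict(int)  # d_c
    for u, v in edges:
        total_degree[community[u]] += 1
        total_degree[community[v]] += 1
        if community[u] == community[v]:
            inside[community[u]] += 1
    return sum(
        inside[c] / m - (total_degree[c] / (2 * m)) ** 2
        for c in total_degree
    )


# Toy example (hypothetical data): two triangles joined by a single edge.
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
community = {0: "A", 1: "A", 2: "A", 3: "B", 4: "B", 5: "B"}

print(intra_fraction(edges, community))
print(modularity(edges, community))
```

For a real dataset, `edges` would come from the downloaded file and `community` from its ground-truth labels; a partition that matches the structure well should have most edges inside communities and a clearly positive Q.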


Datasets based on links and discrete attributes

  • Political blogs

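    This dataset was compiled by Lada Adamic in 2005 and records the political leaning of blogs. It contains 1490 nodes and 19090 edges. Each node carries one attribute (0 or 1) indicating whether the blog is liberal or conservative [4]. [Download link]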

  • DBLP Dataset

    DBLP (Digital Bibliography & Library Project) is a computer science bibliography. In this dataset, authors are treated as users, the titles of an author's papers form that user's text, and coauthorship relations form the links between users.
    DBLP data, updated monthly: [Data link]
    Preprocessed DBLP dataset: [Data link]
    DBLP dataset documentation: [Usage notes 1, Usage notes 2]

    • DBLP-10K
      DBLP bibliography data from four research areas: databases (DB), data mining (DM), information retrieval (IR), and artificial intelligence (AI). We build a coauthor graph with the top 10,000 authors and their coauthor relationships. In addition, we use two relevant attributes: prolific and primary topic. For the attribute “prolific”, authors with ≥ 20 papers are labeled as highly prolific, authors with ≥ 10 and < 20 papers are labeled as prolific, and authors with < 10 papers are labeled as low prolific. For the attribute “primary topic”, we use a topic modeling approach (PLSA) to extract 100 research topics from a document collection composed of paper titles from the selected authors. Each extracted topic consists of a probability distribution over the keywords that are most representative of the topic. Each author is then assigned one of the 100 topics as his/her primary topic [5]. [Download link; access password: 0674]
    • DBLP-1K, DBLP-5K
      These two datasets are built by selecting the top 1,000 and top 5,000 authors directly from the DBLP-10K dataset. For DBLP-5K, see reference [6].
  • Facebook Friendship Datasets

    The datasets contain the Facebook networks (from a date in Sept. 2005) of these colleges: Caltech, Princeton, Georgetown and UNC Chapel Hill. The links represent friendships on Facebook. Each user has the following attributes: ID, a student/faculty status flag, gender, major, second major/minor (if applicable), dormitory (house), year and high school [10]. [Download link; access password: 264c]
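
The “prolific” attribute of DBLP-10K is a simple threshold rule on an author's paper count. A minimal sketch of that rule, using the thresholds quoted in the DBLP-10K entry, might look like this. The function name `prolific_label` and the author names in the toy data are hypothetical.

```python
from collections import Counter


def prolific_label(paper_count: int) -> str:
    """Map a paper count to the discrete 'prolific' attribute value."""
    if paper_count >= 20:
        return "highly prolific"
    if paper_count >= 10:
        return "prolific"
    return "low prolific"


# Hypothetical paper counts, for illustration only.
paper_counts = {"author_a": 35, "author_b": 12, "author_c": 3, "author_d": 20}

# Attach the discrete attribute to every author (node) in the coauthor graph.
node_attribute = {name: prolific_label(n) for name, n in paper_counts.items()}

# Summarise how many authors fall into each attribute value.
label_histogram = Counter(node_attribute.values())
print(node_attribute)
print(label_histogram)
```

Attributed-graph clustering methods then treat this value as a categorical node attribute alongside the coauthor links.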

Datasets based on links and text attributes

  • Enron Email Dataset

    This dataset was collected and prepared by the CALO Project (A Cognitive Assistant that Learns and Organizes). It contains data from about 150 users, mostly senior management of Enron, organized into folders [7]. [Download link]

    • Enron Mail subset
      A subset of about 1700 labeled email messages (4.5 MB). These were chosen in a semi-motivated fashion (focusing on business-related emails, on the California Energy Crisis, and on emails that occurred later in the collection, while trying to avoid very personal messages, jokes, and so on). Students in the ANLP course annotated the selected messages with the category labels. Each message was labeled by two people, but no claims of consistency, comprehensiveness, or generality are made about these labelings. The subset is divided into 11 categories according to this [category scheme]. [Download link]
      The March 2005 version of the [Enron mail dataset]
  • CiteSeer

    The CiteSeer dataset consists of 3312 scientific publications classified into one of six classes. The citation network consists of 4732 links. Each publication in the dataset is described by a 0/1-valued word vector indicating the absence/presence of the corresponding word from the dictionary. The dictionary consists of 3703 unique words. The README file in the dataset provides more details [8]. [Download link]

  • Cora

    The Cora dataset consists of 2708 scientific publications classified into one of seven classes. The citation network consists of 5429 links. Each publication in the dataset is described by a 0/1-valued word vector indicating the absence/presence of the corresponding word from the dictionary. The dictionary consists of 1433 unique words. The README file in the dataset provides more details [8]. [Download link]

  • WebKB

    The WebKB dataset consists of 877 web pages classified into one of five classes. The hyperlink network consists of 1608 links. Each page in the dataset is described by a 0/1-valued word vector indicating the absence/presence of the corresponding word from the dictionary. The dictionary consists of 1703 unique words. The README file in the dataset provides more details [9]. [Download link]

  • Terrorists

    The Terrorists dataset contains information about terrorists and the relationships among them. The README file in the dataset provides more details. [Download link]

  • Terrorist Attacks

    This dataset consists of 1293 terrorist attacks, each assigned one of 6 labels indicating the type of the attack. Each attack is described by a 0/1-valued vector of attributes whose entries indicate the absence/presence of a feature. There are a total of 106 distinct features. The files in the dataset can be used to create two distinct graphs. The README file in the dataset provides more details. [Download link]

  • Flickr dataset

    The Flickr image sharing network consists of nodes that represent Flickr users and edges that indicate follow relations between users. We use the tags of images uploaded by a given user as her attributes. In this network, the ground-truth communities are defined as user-created interest-based groups that have more than five members. [Download link; access password: ffdb]
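
The CiteSeer, Cora, and WebKB entries above all describe each node by a 0/1-valued word vector plus a class label, with links stored separately. The sketch below reads such data under an assumed layout: one whitespace-separated row per node of the form `<node_id> <0/1 word flags...> <class_label>`, and one `<id> <id>` pair per link. That layout, the function names, and the toy rows are assumptions for illustration; the README shipped with each dataset is the authority on the real format.

```python
def parse_content(lines):
    """Return (features, labels) from rows of '<id> <0/1 ...> <label>'."""
    features, labels = {}, {}
    for line in lines:
        parts = line.split()
        if len(parts) < 3:
            continue  # skip blank or malformed rows
        node_id, *flags, label = parts
        features[node_id] = [int(flag) for flag in flags]
        labels[node_id] = label
    return features, labels


def parse_links(lines, known_ids):
    """Return link pairs, keeping only links between nodes that have features."""
    links = []
    for line in lines:
        parts = line.split()
        if len(parts) != 2:
            continue
        a, b = parts
        if a in known_ids and b in known_ids:
            links.append((a, b))
    return links


# Toy rows (hypothetical): a 4-word dictionary and two made-up class names.
content_rows = [
    "p1 1 0 0 1 TopicA",
    "p2 0 1 0 1 TopicB",
    "p3 0 0 1 0 TopicA",
]
link_rows = ["p1 p2", "p1 p3", "p9 p2"]  # p9 has no feature row

features, labels = parse_content(content_rows)
links = parse_links(link_rows, features.keys())
print(features["p1"], labels["p1"], links)
```

Dropping links whose endpoints have no feature row is a defensive choice in this sketch, so that every remaining link connects two nodes with attributes.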

Other common dataset links


  • Stanford Large Network Dataset Collection

  • Social networks
    online social networks, edges represent interactions between people
  • Networks with ground-truth communities
    ground-truth network communities in social and information networks
  • Communication networks
    email communication networks with edges representing communication
  • Citation networks
    nodes represent papers, edges represent citations
  • Collaboration networks
    nodes represent scientists, edges represent collaborations (co-authoring a paper)
  • Web graphs
    nodes represent webpages and edges are hyperlinks
  • Amazon networks
    nodes represent products and edges link commonly co-purchased products
  • Internet networks
    nodes represent computers and edges communication
  • Road networks
    nodes represent intersections and edges roads connecting the intersections
  • Autonomous systems
    graphs of the internet
  • Signed networks
    networks with positive and negative edges (friend/foe, trust/distrust)
  • Location-based online social networks
    Social networks with geographic check-ins
  • Wikipedia networks and metadata
    Talk, editing and voting data from Wikipedia
  • Twitter and Memetracker
    Memetracker phrases, links and 467 million Tweets
  • Online communities
    Data from online communities such as Reddit and Flickr
  • Online reviews
    Data from online review systems such as BeerAdvocate and Amazon
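
Many graphs in this collection are distributed as plain-text edge lists. The sketch below reads such a file into an adjacency map, assuming one whitespace-separated node pair per line and comment lines starting with `#`. The function name `read_edge_list` and the sample lines are hypothetical, so the layout should be checked against the description page of the specific dataset.

```python
from collections import defaultdict


def read_edge_list(lines, directed=False):
    """Build {node: set(neighbours)} from 'u v' lines, skipping '#' comments."""
    adjacency = defaultdict(set)
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        u, v = line.split()[:2]
        adjacency[u].add(v)
        if directed:
            adjacency.setdefault(v, set())  # keep sink nodes in the map
        else:
            adjacency[v].add(u)
    return dict(adjacency)


# Hypothetical sample in the assumed layout.
sample = [
    "# Undirected toy graph",
    "# FromNodeId\tToNodeId",
    "0\t1",
    "0\t2",
    "1\t2",
    "",
    "2\t3",
]

graph = read_edge_list(sample)
edge_count = sum(len(neighbours) for neighbours in graph.values()) // 2
print(len(graph), edge_count)
```

With a downloaded file, the same function can be fed an open file handle, for example `read_edge_list(open(path))`, where `path` points to the decompressed edge list.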

Datasets collected by Mark Newman

Introduction and related community detection algorithms: http://www-personal.umich.edu/~mejn/
Datasets: http://www-personal.umich.edu/~mejn/netdata/

Social and Information Network Analysis

  • KDD Cup Dataset
    http://www.cs.cornell.edu/projects/kddcup/datasets.html
  • Stack Overflow Data
    http://blog.stackoverflow.com/2009/06/stack-overflow-creative-commons-data-dump/
  • Youtube dataset
    YouTube videos are the nodes. An edge a->b means video b is in the related video list (first 20 only) of video a.
    http://netsg.cs.sfu.ca/youtubedata/
  • Amazon Data
    The data was collected by crawling the Amazon website and contains product metadata and review information about 548,552 different products (books, music CDs, DVDs and VHS video tapes).
    http://snap.stanford.edu/data/amazon-meta.html

[1]: W. W. Zachary, An information flow model for conflict and fission in small groups, Journal of Anthropological Research 33, 452-473 (1977)
[2]: M. Girvan and M. E. J. Newman, Community structure in social and biological networks, Proc. Natl. Acad. Sci. USA 99, 7821-7826 (2002).
[3]: D. Lusseau, K. Schneider, O. J. Boisseau, et al., The bottlenose dolphin community of Doubtful Sound features a large proportion of long-lasting associations, Behavioral Ecology and Sociobiology 54(4), 396-405 (2003).
[4]: L. A. Adamic and N. Glance, “The political blogosphere and the 2004 US Election”, in Proceedings of the WWW-2005 Workshop on the Weblogging Ecosystem (2005)
[5]: Zhou Y, Cheng H, Yu J X. Clustering large attributed graphs: An efficient incremental approach[C]//Data Mining (ICDM), 2010 IEEE 10th International Conference on. IEEE, 2010: 689-698.
[6]: Zhou Y, Cheng H, Yu J X. Graph clustering based on structural/attribute similarities[J]. Proceedings of the VLDB Endowment, 2009, 2(1): 718-729.
[7]: Klimt B, Yang Y. Introducing the Enron Corpus[C]//CEAS. 2004.
[8]: Yang T, Jin R, Chi Y, et al. Combining link and content for community detection: a discriminative approach[C]//Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 2009: 927-936.
[9]: Lu Q, Getoor L. Link-based classification[C]//ICML. 2003, 3: 496-503.
[10]: Dang T A, Viennet E. Community detection based on structural and attribute similarities[C]//International Conference on Digital Society (ICDS). 2012: 7-12.
[11]: Xirong Li, Cees G.M. Snoek, and Marcel Worring, Learning Social Tag Relevance by Neighbor Voting, in IEEE Transactions on Multimedia (T-MM), 2009
[12]: Newman M E J. Finding community structure in networks using the eigenvectors of matrices[J]. Physical review E, 2006, 74(3): 036104.

Note: this article only describes the datasets and provides the corresponding links. If you repost it, please cite the original link: http://blog.csdn.net/wzgang123/article/details/51089521
