Resource roundup from numb3r3: Fellows, Tools, Datasets

Opinion Analysis

  • KAVITA GANESAN 
    I am currently interested in the areas of text/opinion summarization, various opinion analysis tasks and vertical search. I enjoy seeing research being put to good use. That said, I aim at making proposed techniques practical, efficient and scalable to bridge the gap between correctness and usability.

  • Bing Liu 
    Data mining, Web mining, text mining, and machine learning.

  • Alice Oh 
    I am co-organizing a Workshop on Finding Synergies Between Texts and Networks for IEEE SocialCom-2010.

  • Samuel Brody 
    A post-doctoral research scientist in the Department of Biomedical Informatics at the Columbia University Medical Center, working with Dr. Noémie Elhadad. His area of research is unsupervised learning for natural language processing (NLP). He is interested in computational linguistics, machine learning (primarily unsupervised), corpus statistics, Bayesian inference and related subjects.

  • Noémie Elhadad 
    Research interests are in natural language processing, with particular focus on text summarization and discourse-level structuring of information.

  • Daniel Ramage 
    My research focuses on building models and tools that can help us understand the world through the lens of the text people write about it. I develop techniques to analyze academic publications, cluster tagged web pages, and mine Twitter posts, among other applications.

  • User Review Structure Analysis (URSA) 
    The goal of the URSA project is to provide a better understanding of user reviewing patterns and to develop tools to better search, understand and access user reviews. We performed an in-depth classification and analysis of a real-world restaurant review data set mined from Citysearch, New York.

Information Retrieval

Recommender System

  • Yehuda Koren 
    His research interests include recommender systems, spam filtering, and general data analysis and visualization. Yehuda led the team that won first place in the Netflix Prize competition.

  • Information Systems and Machine Learning Lab (ISMLL) 
    A well-known recommender systems research lab that publishes strong papers on recommender systems every year, including a good deal of notable work on tag recommendation.

  • Steffen Rendle
    His research focuses on factorization models for recommender systems and on tag recommendation.

  • Ocelma
    Chief Innovation Officer @ BMAT. His research focuses on Music Recommendation.

  • CHU, WEI
    Joined Yahoo! Labs in 2008.

Social Network

  • Social Network Analysis Group @ Stanford 
    The Social Network Analysis Group at Stanford University is a team of faculty, postdocs, and students who study social networks. Our work ranges from basic research on social network phenomena to advanced methods for network analysis.

Computational Advertising

  • Andrei Broder 
    Andrei Broder is a Yahoo! Fellow and Vice President for Computational Advertising. He also serves as Chief Scientist for Search and Advertising.

Machine Learning

  • Eric P. Xing 
    My principal research interests lie in the development of machine learning and statistical methodology; especially for solving problems involving automated learning, reasoning, and decision-making in high-dimensional, multimodal, and dynamic possible worlds; and for building quantitative models and predictive understandings of the evolutionary mechanism, regulatory circuitry, and developmental processes of biological systems.

  • Topic Modeling Bibliography

  • Tom Minka 
    I work in the field of Bayesian statistical inference, and I develop efficient algorithms for use in machine learning, computer vision, text retrieval, and data mining. My goal is to make Bayesian inference a standard tool for processing information.

  • Andrew McCallum 
    The main goal of my research is to dramatically increase our ability to mine actionable knowledge from unstructured text. I am especially interested in information extraction from the Web, understanding the connections between people and between organizations, expert finding, social network analysis, and mining the scientific literature & community.

  • The Gaussian Processes Web Site 
    This web site aims to provide an overview of resources concerned with probabilistic modeling, inference and learning based on Gaussian processes.

  • Matt Hoffman 
    My research focuses on Bayesian modeling of audio, audio feature extraction, music information retrieval, and the application of music information retrieval and modeling techniques to musical synthesis.

  • Shuang Hong Yang 
    Machine learning, data mining, and information retrieval. I am most interested in modeling complex data structures.

Data Mining

  • Prem Melville 
    Our work on Constrained Markov Decision Processes won the Best Application Paper Award at KDD 2010. Our IBM Research team won the KDD Cup 2009, KDD Cup 2008 and the INFORMS Data Mining Contest 2008.

Software Tools

  • Opinosis: Opinion & Text Summarization Software

  • JWI (the MIT Java Wordnet Interface) 
    JWI is written for Java 1.5.0 and has the package namespace edu.mit.jwi. The distribution does not include the Wordnet dictionary files; these can be downloaded from the Wordnet download site. This version of the software is distributed under a license that makes it free to use for non-commercial purposes, as long as proper copyright acknowledgement is made. If you are interested in obtaining a commercial license, please contact the MIT Technology Licensing Office.

  • LingPipe 
    LingPipe is a tool kit for processing text using computational linguistics. LingPipe is used to do tasks like: 

    • Find the names of people, organizations or locations in news 
    • Automatically classify Twitter search results into categories 
    • Suggest correct spellings of queries

  • STAT 
    STAT stands for Simple Text Analysis Toolkit, with the goal of providing a lightweight open source machine learning framework in Java. In contrast to existing packages, STAT aims to be substantially simpler to learn and extend. We also provide implementations of some algorithms and wrappers for using other packages within STAT.

  • Twitter Corpus Tool
    These tools are associated with the tweets corpus prepared for the TREC 2011 Microblog Task.

  • Web Recommender Project 
    Goal:

    • Define a general software design and evaluation framework for web recommendation
    • Implement and test some possible algorithmic strategies based on available open-source tools and resources

  • plda
    A parallel C++ implementation of fast Gibbs sampling of Latent Dirichlet Allocation
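
    For readers unfamiliar with the technique, the following is a minimal serial sketch of collapsed Gibbs sampling for Latent Dirichlet Allocation in Python. It is not plda's code and does not attempt its parallelization; the function name, parameter names, default hyperparameters and the assumption that words are already mapped to integer ids are all illustrative choices, not taken from the project.

```python
import random

def lda_gibbs(docs, num_topics, vocab_size, alpha=0.1, beta=0.01, iters=50, seed=0):
    """Illustrative collapsed Gibbs sampler for LDA (hypothetical helper, not plda's API).

    docs is a list of documents, each a list of integer word ids in [0, vocab_size).
    Returns the topic assignments plus the document-topic and topic-word count tables.
    """
    rng = random.Random(seed)
    doc_topic = [[0] * num_topics for _ in docs]                # n(d, k)
    topic_word = [[0] * vocab_size for _ in range(num_topics)]  # n(k, w)
    topic_total = [0] * num_topics                              # n(k)
    assign = []

    # Random initial topic for every token.
    for d, doc in enumerate(docs):
        z_d = []
        for w in doc:
            k = rng.randrange(num_topics)
            z_d.append(k)
            doc_topic[d][k] += 1
            topic_word[k][w] += 1
            topic_total[k] += 1
        assign.append(z_d)

    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the token's current assignment from the counts.
                k = assign[d][i]
                doc_topic[d][k] -= 1
                topic_word[k][w] -= 1
                topic_total[k] -= 1

                # Unnormalized conditional p(z = t | all other assignments).
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][w] + beta)
                    / (topic_total[t] + vocab_size * beta)
                    for t in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights)[0]

                # Record the newly sampled topic.
                assign[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][w] += 1
                topic_total[k] += 1

    return assign, doc_topic, topic_word


# Toy corpus: two documents over a four-word vocabulary.
toy_docs = [[0, 1, 0, 1], [2, 3, 2, 3]]
toy_assign, toy_doc_topic, toy_topic_word = lda_gibbs(toy_docs, num_topics=2, vocab_size=4)
```

    The count tables are the only state the sampler needs, which is why implementations of this kind can be distributed by partitioning documents across workers; how plda does so is not covered here.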

Datasets

OPINION / REVIEW DATASET

  • OPINOSIS dataset 
    This dataset contains sentences extracted from user reviews on a given topic. Example topics are “performance of Toyota Camry” and “sound quality of iPod nano”. In total there are 51 such topics, with each topic having approximately 100 sentences on average. The reviews were obtained from various sources: Tripadvisor (hotels), Edmunds.com (cars) and Amazon.com (various electronics).

  • Restaurant Reviews Dataset 
    This data has been collected by me (in a project with Noémie Elhadad) from http://newyork.citysearch.com/ in August 2006. Out of 17843 restaurants, only 5531 had reviews, which gives us a total of 52077 reviews. The maximum number of reviews is 242 (to give a better idea of the distribution: 25 restaurants >=100 reviews, 103 restaurants >=10 reviews). Here is the distribution of ratings (Columns = 1: Rating, 2: Review counts, 5: Percent) and cuisines (Columns = 1: Cuisine, 2: Restaurant Count, 4: Review Count; note that one restaurant can have multiple cuisines).

  • Multiple-aspect restaurant reviews 
    The corpus, introduced in Snyder and Barzilay, consists of 4,488 reviews, both in raw-text and in feature-vector form. Each review gives an explicit 1-to-5 rating for five different aspects (food, ambiance, service, value, and overall experience), along with the text of the review itself, all provided by the review author. A rating of 5 was the most common over all aspects, and Snyder and Barzilay report that 30.5% of the 3,488 reviews in their randomly-selected training set had a rating of 5 for all five aspects, although no other tuple of ratings was represented by more than 5% of the training set. The code used in Snyder and Barzilay is also distributed at the aforementioned URL. The original source for the reviews was http://www.we8there.com/; data from the same website was also used by Higashinaka et al.

  • Multi-Domain Sentiment Dataset 
    This dataset, introduced in Blitzer et al. [40], consists of product reviews from several different product types taken from Amazon.com, some with 1-to-5 star labels, some unlabeled.

  • TripAdvisor Data Set (Latent Aspect Rating Analysis) 
    Parsed reviews crawled from TripAdvisor (http://www.tripadvisor.com). Metadata includes: Author, Content, Date, Number of Reader, Number of Helpful Judgment, Overall rating, Value aspect rating, Rooms aspect rating, Location aspect rating, Cleanliness aspect rating, Check in/front desk aspect rating, Service aspect rating and Business Service aspect rating. Ratings range from 0 to 5 stars, and -1 indicates that the aspect rating is missing in the original HTML file.
    Aspect Directory: Segmented aspects from the extracted text file, with attributes including Author, Content, Date, Rating and Aspect Segmentations.
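
    Because -1 is a sentinel rather than a real score, it has to be filtered out before any averaging. The sketch below shows one way to do that in Python; the dictionary layout and field names are assumptions made for illustration (they mirror the aspect names listed above, not the actual file format), and the helper functions are hypothetical.

```python
from statistics import mean

MISSING = -1  # sentinel the dataset description uses for an absent aspect rating

# Assumed field names, copied from the aspect list in the description;
# the real files may label them differently.
ASPECTS = [
    "Value", "Rooms", "Location", "Cleanliness",
    "Check in/front desk", "Service", "Business Service",
]

def present_ratings(review):
    """Return only the aspect ratings that are actually present (not -1)."""
    return {a: review[a] for a in ASPECTS if review.get(a, MISSING) != MISSING}

def mean_aspect_rating(review):
    """Average the present aspect ratings, or return None if all are missing."""
    present = present_ratings(review)
    return mean(present.values()) if present else None

# Made-up review record for illustration only.
example = {
    "Overall": 4, "Value": 4, "Rooms": 3, "Location": 5,
    "Cleanliness": -1, "Check in/front desk": -1,
    "Service": 4, "Business Service": -1,
}
example_present = present_ratings(example)
example_mean = mean_aspect_rating(example)
```

    Treating -1 as a number instead would pull averages down and could even produce values below the 0-star minimum.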

  • Free-Text Annotations 
    Learning Document-Level Semantic Properties from Free-Text Annotations 
    The datasets used in this work are available in XML format; the individual datasets can also be downloaded separately.

  • Cornell Movie Review Data 
    A collection of movie-review datasets for use in sentiment-analysis experiments. Movie review documents are labeled with respect to their overall sentiment polarity (positive or negative) or subjective rating (e.g., "two and a half stars"), and sentences labeled with respect to their subjectivity status (subjective or objective) or polarity.

  • Dataset List
