KAVITA GANESAN
I am currently interested in the areas of text/opinion summarization, various opinion analysis tasks and vertical search. I enjoy seeing research being put to good use. That said, I aim at making proposed techniques practical, efficient and scalable to bridge the gap between correctness and usability.
Bing Liu
Data mining, Web mining, text mining, and machine learning.
Alice Oh
I am co-organizing a Workshop on Finding Synergies Between Texts and Networks for IEEE SocialCom-2010.
Samuel Brody
A post-doctoral research scientist in the Department of Biomedical Informatics at the Columbia University Medical Center, working with Dr. Noemie Elhadad. His area of research is unsupervised learning for natural language processing (NLP). He is interested in computational linguistics, machine learning (primarily unsupervised), corpus statistics, Bayesian inference and related subjects.
Noémie Elhadad
Research interests are in natural language processing, with particular focus on text summarization and discourse-level structuring of information.
Daniel Ramage
My research focuses on building models and tools that can help us understand the world through the lens of the text people write about it. I develop techniques to analyze academic publications, cluster tagged web pages, and mine Twitter posts, among other applications.
User Review Structure Analysis (URSA)
The goal of the URSA project is to provide a better understanding of user reviewing patterns and to develop tools to better search, understand and access user reviews. We performed an in-depth classification and analysis of a real-world restaurant review data set mined from Citysearch, New York.
Fabrizio Sebastiani
Research interests lie at the crossroads of Information Retrieval, Machine Learning, and Human Language Technology, with particular emphasis on text categorization and its applications.
Chris Burges
A Principal Researcher and manager of the Text Mining, Search and Navigation Group at Microsoft Research. I'm currently interested in machine learning, optimization methods, ranking, and learning for Web applications in general.
Yehuda Koren
His research interests include recommender systems, spam filtering, and general data analysis and visualization. Yehuda led the team that won first-place awards in the Netflix Prize competition.
Information Systems and Machine Learning Lab (ISMLL)
A well-known recommender systems research lab that publishes many strong papers on recommender systems every year, including a lot of good work on tag recommendation.
Steffen Rendle
His research focuses on factorization models for recommender systems and on tag recommendation.
Ocelma
Chief Innovation Officer @ BMAT. His research focuses on Music Recommendation.
CHU, WEI
Joined Yahoo Labs in 2008.
Eric P. Xing
My principal research interests lie in the development of machine learning and statistical methodology; especially for solving problems involving automated learning, reasoning, and decision-making in high-dimensional, multimodal, and dynamic possible worlds; and for building quantitative models and predictive understandings of the evolutionary mechanism, regulatory circuitry, and developmental processes of biological systems.
Tom Minka
I work in the field of Bayesian statistical inference, and I develop efficient algorithms for use in machine learning, computer vision, text retrieval, and data mining. My goal is to make Bayesian inference a standard tool for processing information.
Andrew McCallum
The main goal of my research is to dramatically increase our ability to mine actionable knowledge from unstructured text. I am especially interested in information extraction from the Web, understanding the connections between people and between organizations, expert finding, social network analysis, and mining the scientific literature & community.
The Gaussian Processes Web Site
This web site aims to provide an overview of resources concerned with probabilistic modeling, inference and learning based on Gaussian processes.
Matt Hoffman
My research focuses on Bayesian modeling of audio, audio feature extraction, music information retrieval, and the application of music information retrieval and modeling techniques to musical synthesis.
Shuang Hong Yang
Machine learning, data mining, and information retrieval. I am most interested in modeling complex data structures.
JWI (the MIT Java Wordnet Interface)
JWI is written for Java 1.5.0 and has the package namespace edu.mit.jwi. The distribution does not include the Wordnet dictionary files; these can be downloaded from the Wordnet download site. This version of the software is distributed under a license that makes it free to use for non-commercial purposes, as long as proper copyright acknowledgement is made. If you are interested in obtaining a commercial license, please contact the MIT Technology Licensing Office.
LingPipe
LingPipe is a tool kit for processing text using computational linguistics. LingPipe is used to do tasks like:
STAT
STAT stands for Simple Text Analysis Toolkit, with the goal of providing a lightweight open source machine learning framework in Java. In contrast to existing packages, STAT aims to be substantially simpler to learn and extend. We also provide implementations of some algorithms and wrappers for using other packages within STAT.
Twitter Corpus Tool
These tools are associated with the tweets corpus prepared for the TREC 2011 Microblog Task.
Web Recommender Project
Goal:
plda
A parallel C++ implementation of fast Gibbs sampling of Latent Dirichlet Allocation
OPINOSIS dataset
This dataset contains sentences extracted from user reviews on a given topic. Example topics are “performance of Toyota Camry” and “sound quality of iPod nano”. In total there are 51 such topics, each with about 100 sentences on average. The reviews were obtained from various sources: TripAdvisor (hotels), Edmunds.com (cars) and Amazon.com (various electronics).
Restaurant Reviews Dataset
This data was collected by me (in a project with Noemie Elhadad) from http://newyork.citysearch.com/ in August 2006. Out of 17843 restaurants, only 5531 had reviews, which gives a total of 52077 reviews. The maximum number of reviews for a single restaurant is 242 (to give a better idea of the distribution: 25 restaurants >=100 reviews, 103 restaurants >=10 reviews). Here is the distribution of ratings (Columns = 1: Rating, 2: Review counts, 5: Percent) and cuisines (Columns = 1: Cuisine, 2: Restaurant Count, 4: Review Count; note that one restaurant can have multiple cuisines).
Multiple-aspect restaurant reviews
The corpus, introduced in Snyder and Barzilay, consists of 4,488 reviews, both in raw-text and in feature-vector form. Each review gives an explicit 1-to-5 rating for five different aspects (food, ambiance, service, value, and overall experience) along with the text of the review itself, all provided by the review author. A rating of 5 was the most common over all aspects, and Snyder and Barzilay report that 30.5% of the 3,488 reviews in their randomly-selected training set had a rating of 5 for all five aspects, although no other tuple of ratings was represented by more than 5% of the training set. The code used in Snyder and Barzilay is also distributed at the aforementioned URL. The original source for the reviews was http://www.we8there.com/; data from the same website was also used by Higashinaka et al.
Multi-Domain Sentiment Dataset
This dataset, introduced in Blitzer et al. [40], consists of product reviews from several different product types taken from Amazon.com, some with 1-to-5 star labels, some unlabeled.
TripAdvisor Data Set (Latent Aspect Rating Analysis)
Parsed reviews crawled from TripAdvisor (http://www.tripadvisor.com). Meta data includes: Author, Content, Date, Number of Reader, Number of Helpful Judgment, Overall rating, Value aspect rating, Rooms aspect rating, Location aspect rating, Cleanliness aspect rating, Check in/front desk aspect rating, Service aspect rating and Business Service aspect rating. Ratings range from 0 to 5 stars, and -1 indicates that the aspect rating is missing in the original html file.
Aspect Directory: Segmented aspects from the extracted text file; attributes include Author, Content, Date, Rating and Aspect Segmentations.
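Because -1 marks a missing aspect rating in the TripAdvisor data rather than a real score, it should be excluded before averaging. The following is a minimal Python sketch of that idea. The dictionary layout, the key names ("Overall", "Value", "Rooms") and the helper `mean_aspect_rating` are assumptions made for illustration, not the dataset's actual file format or any official loader.

```python
from statistics import mean
from typing import Dict, List, Optional

# Sentinel from the dataset description: -1 means the aspect rating was
# missing in the original HTML file.
MISSING = -1


def mean_aspect_rating(reviews: List[Dict[str, int]], aspect: str) -> Optional[float]:
    """Average one aspect rating over reviews, skipping missing (-1) values.

    Returns None when no review carries a rating for the aspect.
    """
    observed = [r[aspect] for r in reviews if r.get(aspect, MISSING) != MISSING]
    return mean(observed) if observed else None


# Hypothetical, hand-written records; real records would be parsed from the
# dataset files, whose layout is not shown here.
reviews = [
    {"Overall": 4, "Value": 5, "Rooms": -1},
    {"Overall": 2, "Value": 3, "Rooms": 4},
    {"Overall": 5, "Value": -1, "Rooms": -1},
]

for aspect in ("Overall", "Value", "Rooms"):
    print(aspect, mean_aspect_rating(reviews, aspect))
```

Treating -1 as an ordinary number would pull every average downward, so filtering it out first keeps the per-aspect means on the 0-to-5 scale.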
Free-Text Annotations
Learning Document-Level Semantic Properties from Free-Text Annotations
The datasets used in this work are available in XML format; the individual datasets can also be downloaded.
Cornell Movie Review Data
A collection of movie-review datasets for use in sentiment-analysis experiments. Movie review documents are labeled with respect to their overall sentiment polarity (positive or negative) or subjective rating (e.g., "two and a half stars"), and sentences are labeled with respect to their subjectivity status (subjective or objective) or polarity.