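The 23rd KDD conference was held in Halifax, Canada, on August 13-17, 2017. KDD Cup is the annual data mining and knowledge discovery competition organized by ACM SIGKDD. An important part of the KDD conference, it has run since 1997, a history of twenty years, and is currently the most influential competition in data mining. Today, let's look back over these twenty years of KDD Cup.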
Intro: This year's challenge is to predict who is most likely to donate to a charity. Contestants were evaluated on accuracy on the validation data set. Note: the data used in KDD Cup 1997 is exactly the same as in KDD Cup 1998.
Data:http://www.kdd.org/kdd-cup/view/kdd-cup-1997/Data
Results:
First Place (jointly shared):
Runner Up:
Intro:
The competition task is a regression problem where the goal is to estimate the return from a direct mailing in order to maximize donation profits.
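As a minimal sketch of how a regression estimate can feed a profit-maximizing mailing decision: mail only the people whose predicted donation exceeds the cost of the mailing, then add up what actually comes back. The function name, the `cost_per_mail` parameter, and the numbers below are hypothetical illustrations, not values or rules taken from the competition.

```python
def mailing_profit(predicted, actual, cost_per_mail):
    """Net profit when mailing only people whose predicted donation exceeds the cost.

    predicted, actual: donation amounts per person (same order).
    cost_per_mail: assumed cost of sending one mailing (hypothetical parameter).
    """
    profit = 0.0
    for pred, act in zip(predicted, actual):
        if pred > cost_per_mail:          # decision uses the model's estimate only
            profit += act - cost_per_mail  # payoff uses the real donation
    return profit


predicted = [0.2, 1.5, 3.0, 0.9]   # made-up model outputs
actual = [0.0, 2.0, 4.0, 5.0]      # made-up true donations
print(mailing_profit(predicted, actual, cost_per_mail=1.0))
```

The last person in the toy data would have donated but is never mailed, which shows why a better regression estimate translates directly into profit under this rule.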
Tasks:http://www.kdd.org/kdd-cup/view/kdd-cup-1998/Tasks
Data:http://www.kdd.org/kdd-cup/view/kdd-cup-1998/Data
Results
Intro:
The task for the classifier learning contest organized in conjunction with the KDD'99 conference was to learn a predictive model (i.e. a classifier) capable of distinguishing between legitimate and illegitimate connections in a computer network.
Tasks:http://www.kdd.org/kdd-cup/view/kdd-cup-1999/Tasks
Data:http://www.kdd.org/kdd-cup/view/kdd-cup-1999/Data
Result:
Intro:
The KDD Cup 2000 domain contains clickstream and purchase data from Gazelle.com, a legwear and legcare web retailer that closed their online store on 8/18/2000.
Data:http://www.kdd.org/kdd-cup/view/kdd-cup-2000/Data
Result:
Question 1 of KDD Cup 2000
Question 2 of KDD Cup 2000
Question 3 of KDD Cup 2000
Question 4 of KDD Cup 2000
Intro:
Because of the rapid growth of interest in mining biological databases, KDD Cup 2001 was focused on data from genomics and drug design. Sufficient (yet concise) information was provided so that detailed domain knowledge was not a requirement for entry. A total of 136 groups participated to produce a total of 200 submitted predictions over the 3 tasks: 114 for Thrombin, 41 for Function, and 45 for Localization.
Tasks:http://www.kdd.org/kdd-cup/view/kdd-cup-2001/Tasks
Data:http://www.kdd.org/kdd-cup/view/kdd-cup-2001/Data
Result:
Task 1 - Thrombin
Task 2 - Function
C. Lambert (Golden Helix)
J. Sese, H. Hayashi, and S. Morishita (University of Tokyo)
D. Vogel and R. Srinivasan (A.I. Insight)
S. Pocinki, R. Wilkinson, and P. Gaffney (Lubrizol Corp.)
Task 3 - Localization
M. Schonlau (RAND)
W. DuMouchel, C. Volinsky and C. Cortes (AT & T)
B. Frasca, Z. Zheng, R. Parekh, and R. Kohavi (Blue Martini)
Intro:
This year the competition included two tasks that involved data mining in molecular biology domains. The first task focused on constructing models that can assist genome annotators by automatically extracting information from scientific articles. The second task focused on learning models that characterize the behavior of individual genes in a hidden experimental setting. Both are described in more detail on the Tasks page.
Tasks:http://www.kdd.org/kdd-cup/view/kdd-cup-2002/Tasks
Data:http://www.kdd.org/kdd-cup/view/kdd-cup-2002/Data
Result:
Task 1: Information Extraction from Biomedical Articles
Yizhar Regev and Michal Finkelstein
Design Technology Institute Ltd., Department of Mechanical Engineering at the National University of Singapore and Genome Institute of Singapore (Shi Min)
Data Mining Group, Imperial College and Inforsense Limited (Huma Lodhi and Yong Zhang)
Verity Inc. and Exelixis, Inc. (Bin Chen)
Task 2: Yeast Gene Regulation Prediction
Telstra Research Laboratories
David Vogel and Randy Axelrod
A.I. Insight Inc. and Sentara Healthcare
Marcus Denecke, Mark-A. Krogel, Marco Landwehr and Tobias Scheffer
Magdeburg University
George Forman
Hewlett Packard Laboratories
Amal Perera, Bill Jockheck, Willy Valdivia Granda, Anne Denton, Pratap Kotala and William Perrizo
North Dakota State University
Intro:
The first task involves predicting the future; contestants predict how many citations each paper will receive during the three months leading up to the KDD 2003 conference. For the second task, contestants must build a citation graph of a large subset of the archive from only the LaTeX sources. In the third task, each paper's popularity will be estimated based on partial download logs. And the last task is open! Given the large amount of data, contestants can devise their own questions and the most interesting result is the winner.
Tasks:http://www.kdd.org/kdd-cup/view/kdd-cup-2003/Tasks
Rules:http://www.kdd.org/kdd-cup/view/kdd-cup-2003/Rules
Data:http://www.kdd.org/kdd-cup/view/kdd-cup-2003/Data
Result:
I. Citation Prediction Task
Intro:
This year's competition focuses on data-mining for a variety of performance criteria such as Accuracy, Squared Error, Cross Entropy, and ROC Area. As described on this WWW-site, there are two main tasks based on two datasets from the areas of bioinformatics and quantum physics.
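For readers unfamiliar with the four criteria named here, the following is a minimal sketch of how each can be computed for a binary problem with predicted probabilities. These are the common textbook definitions; the exact formulas used by the competition's scoring may differ, and the function names and toy data are assumptions for illustration.

```python
import math


def accuracy(y_true, p, threshold=0.5):
    """Fraction of cases where the thresholded prediction matches the label."""
    hits = sum((pi >= threshold) == bool(yi) for yi, pi in zip(y_true, p))
    return hits / len(y_true)


def squared_error(y_true, p):
    """Mean squared difference between label and predicted probability."""
    return sum((yi - pi) ** 2 for yi, pi in zip(y_true, p)) / len(y_true)


def cross_entropy(y_true, p, eps=1e-12):
    """Mean negative log-likelihood; probabilities are clipped to avoid log(0)."""
    total = 0.0
    for yi, pi in zip(y_true, p):
        pi = min(max(pi, eps), 1.0 - eps)
        total -= yi * math.log(pi) + (1 - yi) * math.log(1.0 - pi)
    return total / len(y_true)


def roc_area(y_true, p):
    """Probability that a random positive is scored above a random negative."""
    pos = [pi for yi, pi in zip(y_true, p) if yi == 1]
    neg = [pi for yi, pi in zip(y_true, p) if yi == 0]
    wins = sum(1.0 if a > b else 0.5 if a == b else 0.0 for a in pos for b in neg)
    return wins / (len(pos) * len(neg))


y_true = [1, 0, 1, 0]          # made-up labels
p = [0.9, 0.2, 0.6, 0.4]       # made-up predicted probabilities
for metric in (accuracy, squared_error, cross_entropy, roc_area):
    print(metric.__name__, metric(y_true, p))
```

The same predictions can rank differently under each criterion, which is why a competition built around several criteria rewards different modelling choices.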
Tasks:http://www.kdd.org/kdd-cup/view/kdd-cup-2004/Tasks
Rules:http://www.kdd.org/kdd-cup/view/kdd-cup-2004/Rules
Data:http://www.kdd.org/kdd-cup/view/kdd-cup-2004/Data
Result:
Quantum Physics Problem
David S. Vogel, Eric Gottschalk, and Morgan C. Wang
MEDai / A.I. Insight / University of Central Florida
* Honorable Mention for ROC Area
* Honorable Mention for Cross Entropy
* Honorable Mention for SLQ Score
Inductis Inc.
* Honorable Mention for Accuracy
Golden Helix Inc.
Protein Homology Problem
Bernhard Pfahringer
University of Waikato, Computer Science Department
Yan Fu, RuiXiang Sun, Qiang Yang, Simin He, Chunli Wang, Haipeng Wang, Shiguang Shan, Junfa Liu, Wen Gao
Institute of Computing Technology, Chinese Academy of Sciences
* Honorable Mention for Squared Error
* Honorable Mention for Average Precision
David S. Vogel, Eric Gottschalk, and Morgan C. Wang
MEDai / A.I. Insight / University of Central Florida
* Honorable Mention for Top-1 Accuracy
Artificial Intelligence Unit, University of Dortmund, Germany
Intro:
This year's competition is about classifying internet user search queries. The task was specifically designed to draw participation from industry, academia, and students.
Tasks:http://www.kdd.org/kdd-cup/view/kdd-cup-2005/Tasks
Rules:http://www.kdd.org/kdd-cup/view/kdd-cup-2005/Rules
Data:http://www.kdd.org/kdd-cup/view/kdd-cup-2005/Data
Result:
Winners
Hong Kong University of Science and Technology team
Dou Shen, Rong Pan, Jiantao Sun, Junfeng Pan, Kangheng Wu, Jie Yin and Qiang Yang
Hong Kong University of Science and Technology team
Dou Shen, Rong Pan, Jiantao Sun, Junfeng Pan, Kangheng Wu, Jie Yin and Qiang Yang
Dou Shen, Rong Pan, Jiantao Sun, Junfeng Pan, Kangheng Wu, Jie Yin and Qiang Yang
Query Categorization Precision Award
Budapest University of Technology team
Zsolt T. Kardkovacs, Domonkos Tikk, Zoltan Bansaghi
Query Categorization Performance Award
MEDai/AI Insight/ Humboldt University team
David S. Vogel, Steve Bridges, Steffen Bickel, Peter Haider, Rolf Schimpfky, Peter Siemen, Tobias Scheffer
Query Categorization Creativity Award
Budapest University of Technology team
Zsolt T. Kardkovacs, Domonkos Tikk, Zoltan Bansaghi
Intro:
This year's KDD Cup challenge problem is drawn from the domain of medical data mining. The tasks are a series of Computer-Aided Detection problems revolving around the clinical problem of identifying pulmonary embolisms from three-dimensional computed tomography data. This challenging domain is characterized by:
Tasks:http://www.kdd.org/kdd-cup/view/kdd-cup-2006/Tasks
Rules:http://www.kdd.org/kdd-cup/view/kdd-cup-2006/Rules
Data:http://www.kdd.org/kdd-cup/view/kdd-cup-2006/Data
Result:
Task 1 - PE Identification
Task 2 - Patient Classification
Task 3 - Negative Predictive Value
Intro:
This year's KDD Cup focuses on predicting aspects of movie rating behavior. There are two tasks. The tasks, developed in conjunction with Netflix, have been selected to be interesting to participants from both academia and industry. You can choose to compete in either or both of the tasks.
Tasks:http://www.kdd.org/kdd-cup/view/kdd-cup-2007/Tasks
Rules:http://www.kdd.org/kdd-cup/view/kdd-cup-2007/Rules
Data:http://www.kdd.org/kdd-cup/view/kdd-cup-2007/Data
Result:
Task 1 - Who Rated What
Task 2 - How Many Ratings
Intro:
The KDD Cup 2008 challenge focuses on the problem of early detection of breast cancer from X-ray images of the breast. In a screening population, a small fraction of cancerous patients have more than one malignant lesion. To simplify the problem, we only consider one type of cancer - cancerous masses - and only include cancer patients with at most one cancerous mass per patient. The challenge will consist of two parts, each of which is related to the development of algorithms for Computer Aided Detection (CAD) of early stage breast cancer from X-ray images.
Tasks:http://www.kdd.org/kdd-cup/view/kdd-cup-2008/Tasks
Data:http://www.kdd.org/kdd-cup/view/kdd-cup-2008/Data
Result:
Challenge 1
Team Members: Claudia Perlich, Prem Melville, Grzegorz Swirszcz, Yan Liu, Saharon Rosset, Richard Lawrence.
Affiliation: IBM Research
Team Members: Hung-Yi Lo, Chun-Min Chang, Tsung-Hsien Chiang, Cho-Yi Hsiao, Anta Huang, Tsung-Ting Kuo, Wei-Chi Lai, Ming-Han Yang, Jung-Jung Yeh, Chun-Chao Yen and Shou-De Lin
Affiliation: National Taiwan University
Team Members: Yazhene Krishnaraj and Chandan K. Reddy
Affiliation: Wayne State University
Challenge 2
Team Members: Claudia Perlich, Prem Melville, Grzegorz Swirszcz, Yan Liu, Saharon Rosset, Richard Lawrence.
Affiliation: IBM Research
Team Members: Didier Baclin
Team Members: Hung-Yi Lo, Chun-Min Chang, Tsung-Hsien Chiang, Cho-Yi Hsiao, Anta Huang, Tsung-Ting Kuo, Wei-Chi Lai, Ming-Han Yang, Jung-Jung Yeh, Chun-Chao Yen and Shou-De Lin
Affiliation: National Taiwan University
Intro:
Customer Relationship Management (CRM) is a key element of modern marketing strategies. The KDD Cup 2009 offers the opportunity to work on large marketing databases from the French Telecom company Orange to predict the propensity of customers to switch provider (churn), buy new products or services (appetency), or buy upgrades or add-ons proposed to them to make the sale more profitable (up-selling).
The most practical way, in a CRM system, to build knowledge about customers is to produce scores. A score (the output of a model) is an evaluation for all instances of a target variable to explain (i.e. churn, appetency or up-selling). Tools that produce scores make it possible to project quantifiable information onto a given population. The score is computed using input variables which describe instances. Scores are then used by the information system (IS), for example, to personalize the customer relationship. An industrial customer analysis platform able to build prediction models with a very large number of input variables has been developed by Orange Labs. This platform implements several processing methods for instance and variable selection, prediction and indexation based on an efficient model combined with variable selection regularization and model averaging. The main characteristic of this platform is its ability to scale on very large datasets with hundreds of thousands of instances and thousands of variables. The rapid and robust detection of the variables that have most contributed to the output prediction can be a key factor in a marketing application.
The challenge is to beat the in-house system developed by Orange Labs. It is an opportunity to prove that you can deal with a very large database, including heterogeneous noisy data (numerical and categorical variables), and unbalanced class distributions. Time efficiency is often a crucial point. Therefore part of the competition will be time-constrained to test the ability of the participants to deliver solutions quickly.
Tasks:http://www.kdd.org/kdd-cup/view/kdd-cup-2009/Tasks
Rules:http://www.kdd.org/kdd-cup/view/kdd-cup-2009/Rules
Data:http://www.kdd.org/kdd-cup/view/kdd-cup-2009/Data
Result:
Fast Track
Ensemble Selection for the KDD Cup Orange Challenge
KDD Cup Fast Scoring on a Large Database
Slow Track
University of Melbourne entry
Stochastic Gradient Boosting
Fast Scoring on a Large Database using regularized maximum entropy model, categorical/numerical balanced AdaBoost and selective Naive Bayes
Intro:
How generally or narrowly do students learn? How quickly or slowly? Will the rate of improvement vary between students? What does it mean for one problem to be similar to another? It might depend on whether the knowledge required for one problem is the same as the knowledge required for another. But is it possible to infer the knowledge requirements of problems directly from student performance data, without human analysis of the tasks?
This year’s challenge asks you to predict student performance on mathematical problems from logs of student interaction with Intelligent Tutoring Systems. This task presents interesting technical challenges, has practical importance, and is scientifically interesting.
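Task description
Predict students' performance on mathematical problems from logs of their interaction with intelligent tutoring systems. The task is both practically important and scientifically interesting. The competition provides 3 development data sets and 2 challenge data sets, each split into a training part and a test part. The test part of the challenge data sets is hidden; participants must build a learning model that accurately predicts performance on this hidden portion.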
Tasks:http://www.kdd.org/kdd-cup/view/kdd-cup-2010-student-performance-evaluation/Tasks
Rules:http://www.kdd.org/kdd-cup/view/kdd-cup-2010-student-performance-evaluation/Rules
Data:http://www.kdd.org/kdd-cup/view/kdd-cup-2010-student-performance-evaluation/Data
Result:
All Teams
Feature engineering and classifier ensembling for KDD CUP 2010
Gradient Boosting Machines with Singular Value Decomposition
Collaborative Filtering Applied to Educational Data Mining
Student Teams
Feature engineering and classifier ensembling for KDD CUP 2010
Using HMMs and bagged decision trees to leverage rich features of user and skill
Split-Score-Predicate
Intro:
People have been fascinated by music since the dawn of humanity. A wide variety of music genres and styles has evolved, reflecting diversity in personalities, cultures and age groups. It comes as no surprise that human tastes in music are remarkably diverse, as nicely exhibited by the famous quotation: "We don't like their sound, and guitar music is on the way out" (Decca Recording Co. rejecting the Beatles, 1962).
Yahoo! Music has amassed billions of user ratings for musical pieces. When properly analyzed, the raw ratings encode information on how songs are grouped, which hidden patterns link various albums, which artists complement each other, and above all, which songs users would like to listen to.
Such an exciting analysis introduces new scientific challenges. The KDD Cup contest releases over 300 million ratings performed by over 1 million anonymized users. The ratings are given to different types of items (songs, albums, artists, genres), all tied together within a known taxonomy.
The competition is divided into two tracks:
The first track is aimed at predicting scores that users gave to various items.
The second track requires separation of loved songs from other songs.
Both tracks are open to all research groups in academia and industry.
The KDD Cup 2011 files are currently offline.
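Task description
Track 1 task: Predicting scores that users gave to various items
(music rating prediction)
Given users' historical ratings of items on Yahoo! Music, predict their ratings of other items (songs, albums and so on). Predictions are scored by the RMSE (root mean squared error) between predicted and actual ratings. The album, artist and genre of each song are also provided.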
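The RMSE criterion mentioned for Track 1 can be written in a few lines. This is a minimal sketch of the standard definition; the function name and the sample ratings are made up for illustration.

```python
import math


def rmse(predicted, actual):
    """Root mean squared error between predicted and actual ratings."""
    if len(predicted) != len(actual) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = sum((p - a) ** 2 for p, a in zip(predicted, actual))
    return math.sqrt(squared / len(predicted))


# Made-up ratings for three (user, item) pairs.
print(rmse([80, 50, 30], [90, 50, 20]))
```

Because errors are squared before averaging, a few large misses hurt the score more than many small ones.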
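Track 2 task: Separation of loved songs from other songs
(identifying whether a song was rated by the user)
Each user has 6 candidate songs: 3 that the user has rated, and 3 that the user has not rated but that are drawn from songs rated highly by users overall. Song attributes (album, artist, genre and so on) are provided as well. Participants submit a binary (0/1) classification, and the final ranking is computed from overall accuracy.
The official page for this task is offline, and no data set is available for download.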
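One way to turn per-song scores into the 0/1 answers that Track 2 asks for, sketched under the assumption (taken from the description above) that exactly half of each user's candidates are positives: label the top-scoring half 1 and the rest 0, then measure accuracy over all candidates. The scores would come from some model of your own; the function names and data here are hypothetical.

```python
def label_top_half(scores):
    """Label the highest-scoring half of one user's candidates 1, the rest 0."""
    k = len(scores) // 2
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    labels = [0] * len(scores)
    for i in order[:k]:
        labels[i] = 1
    return labels


def overall_accuracy(predicted, truth):
    """Fraction of correct 0/1 labels across all users and candidates."""
    pairs = [(p, t) for pu, tu in zip(predicted, truth) for p, t in zip(pu, tu)]
    return sum(p == t for p, t in pairs) / len(pairs)


scores = [[0.9, 0.1, 0.8, 0.3, 0.7, 0.2]]   # made-up scores for one user
truth = [[1, 0, 1, 0, 0, 1]]                # made-up ground truth
predicted = [label_top_half(s) for s in scores]
print(predicted, overall_accuracy(predicted, truth))
```

Ranking within each user, rather than thresholding globally, uses the known 3-versus-3 split and avoids having to calibrate scores across users.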
Intro:
Online social networking services have become tremendously popular in recent years, with popular social networking sites like Facebook, Twitter, and Tencent Weibo adding thousands of enthusiastic new users each day to their existing billions of actively engaged users. Since its launch in April 2010, Tencent Weibo, one of the largest micro-blogging websites in China, has become a major platform for building friendship and sharing interests online. Currently, there are more than 200 million registered users on Tencent Weibo, generating over 40 million messages each day. This scale benefits the Tencent Weibo users but it can also flood users with huge volumes of information and hence puts them at risk of information overload. Reducing the risk of information overload is a priority for improving the user experience and it also presents opportunities for novel data mining solutions. Thus, capturing users' interests and accordingly serving them with potentially interesting items (e.g. news, games, advertisements, products) is a fundamental and crucial feature for social networking websites like Tencent Weibo.
More information on KDD Cup 2012 (Track 1) can be found at Kaggle.com
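Task description
Track 1 task: Predict which users (or information sources) one user might follow in Tencent Weibo
(personalized recommendation in a social network)
Given user profiles on Tencent Weibo, SNS follow relationships, interaction records in the social network (retweet, comment, at), and the history of item recommendations over the past 30 days, predict the list of recommended items the user is most likely to accept next.
Competition page
https://www.kaggle.com/c/kddcup2012-track1#description
Competition data set
https://www.kaggle.com/c/kddcup2012-track1/data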
Intro:
Search advertising has been one of the major revenue sources of the Internet industry for years. A key technology behind search advertising is to predict the click-through rate (pCTR) of ads, as the economic model behind search advertising requires pCTR values to rank ads and to price clicks. In this task, given the training instances derived from session logs of the Tencent proprietary search engine, soso.com, participants are expected to accurately predict the pCTR of ads in the testing instances.
More information on KDD Cup 2012 (Track 2) can be found at Kaggle.com
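Task description
Track 2 task: Predict the click-through rate of ads given the query and user information
(pCTR click-through rate estimation for a search advertising system)
Given the user's query on Tencent search, the ads shown (title, description, URL and so on), each ad's relative position (its rank among the ads displayed), whether the user clicked, and attributes of advertisers and users, predict users' clicks on ads in a later period.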
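To illustrate why pCTR matters for ranking ads and pricing clicks, here is a minimal sketch of a textbook scheme: rank ads by bid × pCTR and charge each click the smallest price that would keep the ad above the next one (a generalized second-price style rule). This is an assumption for illustration only; the source does not say how soso.com actually ranks or prices ads, and the function name, bids, and pCTR values are made up.

```python
def rank_and_price(ads):
    """ads: list of (ad_id, bid, pctr) with pctr > 0.

    Returns [(ad_id, price_per_click)] in display order, ranking by bid * pctr
    and pricing each click just high enough to hold the slot.
    """
    ranked = sorted(ads, key=lambda ad: ad[1] * ad[2], reverse=True)
    result = []
    for i, (ad_id, bid, pctr) in enumerate(ranked):
        if i + 1 < len(ranked):
            _, next_bid, next_pctr = ranked[i + 1]
            price = next_bid * next_pctr / pctr   # never exceeds the ad's own bid
        else:
            price = 0.0                           # no competitor below (no reserve price here)
        result.append((ad_id, price))
    return result


ads = [("a", 2.0, 0.05), ("b", 1.0, 0.20), ("c", 4.0, 0.01)]
print(rank_and_price(ads))
```

In the toy data the ad with the lowest bid wins the top slot because its predicted click-through rate is highest, so an error in pCTR changes both the ordering and what advertisers pay.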
Competition page
https://www.kaggle.com/c/kddcup2012-track2#description
Competition data set
https://www.kaggle.com/c/kddcup2012-track2/data
Intro:
The ability to search literature and collect/aggregate metrics around publications is a central tool for modern research. Both academic and industry researchers across hundreds of scientific disciplines, from astronomy to zoology, increasingly rely on search to understand what has been published and by whom.
Microsoft Academic Search is an open platform that provides a variety of metrics and experiences for the research community, in addition to literature search. It covers more than 50 million publications and over 19 million authors across a variety of domains, with updates added each week. One of the main challenges of providing this service is caused by author-name ambiguity. On one hand, there are many authors who publish under several variations of their own name. On the other hand, different authors might share a similar or even the same name.
As a result, the profile of an author with an ambiguous name tends to contain noise, resulting in papers that are incorrectly assigned to him or her. This KDD Cup task challenges participants to determine which papers in an author profile were truly written by a given author.
More information on KDD Cup 2013 (Track 1) can be found at Kaggle.com
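Task description
Track 1 task: Author-Paper Identification Challenge
Microsoft Academic Search is an open platform covering more than 50 million publications and over 19 million authors across many academic fields, and it is updated weekly. One of the main challenges in providing this service is author-name ambiguity. On one hand, many authors publish under several variations of their name. On the other hand, different authors may share a similar or even identical name.
As a result, authors with ambiguous names tend to have papers wrongly attributed to them. This challenge asks participants to identify which papers in an author profile were truly written by that author.
Competition page
https://www.kaggle.com/c/kdd-cup-2013-author-paper-identification-challenge
Competition data set
https://www.kaggle.com/c/kdd-cup-2013-author-paper-identification-challenge/data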
Intro:
The ability to search literature and collect/aggregate metrics around publications is a central tool for modern research. Both academic and industry researchers across hundreds of scientific disciplines, from astronomy to zoology, increasingly rely on search to understand what has been published and by whom.
Microsoft Academic Search is an open platform that provides a variety of metrics and experiences for the research community, in addition to literature search. It covers more than 50 million publications and over 19 million authors across a variety of domains, with updates added each week. One of the main challenges of providing this service is caused by author-name ambiguity. This KDD Cup task challenges participants to determine which authors in a given data set are duplicates.
More information on KDD Cup 2013 (Track 2) can be found at Kaggle.com
Task description
Track 2 task: Author Disambiguation Challenge
This challenge asks participants to determine which authors in the data set are the same person.
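A very rough first step toward finding duplicate authors, sketched here as an assumption rather than any participant's method: reduce each name to a normalized key (last name plus first initial, ignoring case, accents and punctuation) and treat authors sharing a key as candidate duplicates for closer inspection. The function names, the "First Last" name order, and the sample authors are hypothetical.

```python
import unicodedata
from collections import defaultdict


def name_key(name):
    """Reduce a 'First Last' style name to (last name, first initial)."""
    # Strip accents by decomposing characters and dropping non-ASCII marks.
    ascii_name = unicodedata.normalize("NFKD", name).encode("ascii", "ignore").decode("ascii")
    tokens = ascii_name.lower().replace(".", " ").replace(",", " ").split()
    if not tokens:
        return ("", "")
    return (tokens[-1], tokens[0][0])


def candidate_duplicates(authors):
    """authors: dict of author_id -> name. Returns lists of ids that share a key."""
    groups = defaultdict(list)
    for author_id, name in authors.items():
        groups[name_key(name)].append(author_id)
    return [sorted(ids) for ids in groups.values() if len(ids) > 1]


authors = {1: "Maria Gonzalez", 2: "M. González", 3: "Li Wei"}  # made-up records
print(candidate_duplicates(authors))
```

A key this coarse also merges genuinely different people who share a surname and initial, so it only narrows the search; deciding true duplicates would need further evidence such as co-authors or venues.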
Competition page
https://www.kaggle.com/c/kdd-cup-2013-author-disambiguation
Competition data set
https://www.kaggle.com/c/kdd-cup-2013-author-disambiguation/data
Intro:
DonorsChoose.org is an online charity that makes it easy to help students in need through school donations. At any time, thousands of teachers in K-12 schools propose projects requesting materials to enhance the education of their students. When a project reaches its funding goal, they ship the materials to the school.
The 2014 KDD Cup asks participants to help DonorsChoose.org identify projects that are exceptionally exciting to the business, at the time of posting. While all projects on the site fulfill some kind of need, certain projects have a quality above and beyond what is typical. By identifying and recommending such projects early, they will improve funding outcomes, better the user experience, and help more students receive the materials they need to learn.
Successful predictions may require a broad range of analytical skills, from natural language processing on the need statements to data mining and classical supervised learning on the descriptive factors around each project.
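Task description
KDD Cup 2014 asks participants to help the charity website DonorsChoose.org pick out projects that are exceptionally exciting to the business. Every project meets some need, but only a few stand well above the typical level. By identifying and recommending such projects early, the site can improve funding outcomes, improve the user experience, and help more students receive the learning materials they need.
Competition page
https://www.kaggle.com/c/kdd-cup-2014-predicting-excitement-at-donors-choose
Competition data set
https://www.kaggle.com/c/kdd-cup-2014-predicting-excitement-at-donors-choose/data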
Intro:
Students' high dropout rate on MOOC platforms has been heavily criticized, and predicting their likelihood of dropout would be useful for maintaining and encouraging students' learning activities. Therefore, in KDD Cup 2015, we will predict dropout on XuetangX, one of the largest MOOC platforms in China.
The competition participants need to predict whether a user will drop a course within the next 10 days based on his or her prior activities. If a user leaves no records for course C in the log during the next 10 days, we define it as dropout from course C. For more details about the log, please refer to the Data Descriptions.
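Task description
Dropout rates on MOOC platforms are very high, so predicting whether students will drop out is useful for maintaining and encouraging their learning. KDD Cup 2015 focuses on predicting dropout on XuetangX, one of the largest MOOC platforms in China. Participants predict, from a user's prior activity, the probability that the user drops out within the next 10 days.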
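The dropout definition above (no log records for a course in the following 10 days) can be sketched as a labeling function. The day-level granularity, the choice to exclude the cutoff day and include day 10, and the sample dates are assumptions for illustration; the competition's Data Descriptions would define the exact boundaries.

```python
from datetime import date, timedelta


def is_dropout(log_dates, cutoff, horizon_days=10):
    """True if one user's log for one course has no record in the
    horizon_days days after cutoff (cutoff day excluded, last day included)."""
    end = cutoff + timedelta(days=horizon_days)
    return not any(cutoff < d <= end for d in log_dates)


# Made-up activity dates for one (user, course) pair.
logs = [date(2014, 6, 1), date(2014, 6, 5), date(2014, 6, 20)]
print(is_dropout(logs, cutoff=date(2014, 6, 5)))    # next record is 15 days later
print(is_dropout(logs, cutoff=date(2014, 6, 10)))   # a record falls on day 10
```

The same pair can be a dropout at one cutoff and active at another, so the label always depends on where the observation window ends.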
Competition page
http://www.kddcup2015.com/information.html
Competition data set
http://data-mining.philippe-fournier-viger.com/the-kddcup-2015-dataset-download-link/
Intro:
Finding influential nodes in a social network for identifying patterns or maximizing information diffusion has been an actively researched area with many practical applications. In addition to the obvious value to the advertising industry, the research community has long sought mechanisms to effectively disseminate new scientific discoveries and technological breakthroughs so as to advance our collective knowledge and elevate our civilization. For students, parents and funding agencies that are planning their academic pursuits or evaluating grant proposals, having an objective picture of the institutions in question is particularly essential. Partly against this backdrop we have witnessed that releasing a yearly Research Institution or University Ranking has become a tradition for many popular newspapers, magazines and academic institutes. Such rankings not only attract attention from governments, universities, students and parents, but also create debates on the scientific correctness behind the rankings. The most criticized aspect of these rankings is: the data used and the methodology employed for the ranking are mostly unknown to the public.
The 2016 KDD Cup will address this very important problem through publicly available datasets, like the Microsoft Academic Graph (MAG), a freely available dataset that includes information on academic publications and citations. This dataset, being a heterogeneous graph, can be used to study influential nodes of various types, including authors, affiliations and venues; we choose to focus on affiliations in this competition. In effect, given a research field, we are challenging the KDD Cup community to jointly develop data mining techniques to identify the best research institutions based on their publications and how they are cited in research articles.
Tasks:http://www.kdd.org/kdd-cup/view/kdd-cup-2016/Tasks
Rules:http://www.kdd.org/kdd-cup/view/kdd-cup-2016/Rules
Data:http://www.kdd.org/kdd-cup/view/kdd-cup-2016/Data
Intro:
Highway tollgates are well known bottlenecks in traffic networks. During rush hours, long queues at tollgates can overwhelm traffic management authorities. Effective preemptive countermeasures are desired to solve this challenge. Such countermeasures include expediting the toll collection process and streamlining future traffic flow. The expedition of toll collection could be simply allocating temporary toll collectors to open more lanes. Future traffic flow could be streamlined by adaptively tweaking traffic signals at upstream intersections. Preemptive countermeasures will only work when the traffic management authorities receive reliable predictions for future traffic flow. For example, if heavy traffic in the next hour is predicted, then traffic regulators could immediately deploy additional toll collectors and/or divert traffic at upstream intersections.
Traffic flow patterns vary due to different stochastic factors, such as weather conditions, holidays, time of the day, etc. The prediction of future traffic flow and ETA (Estimated Time of Arrival) is a known challenge. An unprecedentedly large amount of traffic data from mobile apps such as Waze (in the US) or Amap (in China) can help us take up that challenge. If the contestants in this proposed KDD CUP could design reliable approaches for future traffic flow and ETA prediction, then the traffic management authorities might be able to capitalize on big data & algorithms for less congestion at tollgates.
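Task description
Highway tollgates are well-known bottlenecks in traffic networks. If congestion over the next hour can be predicted in advance, traffic management authorities can act in time to divert and control traffic at upstream intersections. KDD CUP 2017 asks participants to design algorithms that predict traffic flow and vehicle arrival times, using algorithms and data to help the traffic sector reduce congestion.
Task 1: To estimate the average travel time from designated intersections to tollgates
Task 2: To predict average tollgate traffic volume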
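Both tasks work with averages over time, so a natural first step is to bucket raw records into fixed time windows. The sketch below averages travel times per route and window; the 20-minute window, the record layout `(route, start_minute, travel_seconds)`, and the sample values are assumptions for illustration, not details given in this article.

```python
from collections import defaultdict


def average_travel_time(records, window_minutes=20):
    """records: iterable of (route, start_minute, travel_seconds).

    Returns {(route, window_index): mean travel time in seconds}, where
    window_index = start_minute // window_minutes.
    """
    sums = defaultdict(lambda: [0.0, 0])   # key -> [total seconds, trip count]
    for route, start_minute, seconds in records:
        key = (route, start_minute // window_minutes)
        sums[key][0] += seconds
        sums[key][1] += 1
    return {key: total / count for key, (total, count) in sums.items()}


# Made-up trips on one route, with start times in minutes since midnight.
records = [("A-1", 5, 100.0), ("A-1", 15, 140.0), ("A-1", 25, 90.0)]
print(average_travel_time(records))
```

The trip counts kept in `sums` are the per-window volumes, so the same bucketing would also serve as a starting point for the traffic volume task.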