We have shared the data in the comma separated values (CSV) format. Each row in this data set represents a molecule. The first column contains experimental data describing an actual biological response; the molecule was seen to elicit this response (1), or not (0). The remaining columns represent molecular descriptors (d1 through d1776), these are calculated properties that can capture some of the characteristics of the molecule - for example size, shape, or elemental constitution. The descriptor matrix has been normalized.
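In short: we are given CSV files in which each row represents one molecule. The first column holds the observed biological response: a response (1) or no response (0). Columns 2 through 1777 hold the molecule's properties, such as size, shape, or elemental composition.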
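The column layout described above can be sketched with the standard library alone. The header names (`Activity`, `D1` to `D3`) and the three-descriptor sample are hypothetical stand-ins for illustration; the real file has 1776 descriptor columns.

```python
import csv
import io

# Hypothetical three-descriptor sample; the real file has d1 through d1776.
SAMPLE_CSV = """Activity,D1,D2,D3
1,0.10,0.50,0.00
0,0.25,0.75,1.00
"""


def split_labels_and_descriptors(csv_text):
    """Return (labels, descriptors) from CSV text that has one header row."""
    reader = csv.reader(io.StringIO(csv_text))
    next(reader)  # skip the header row
    labels, descriptors = [], []
    for row in reader:
        labels.append(int(row[0]))                        # first column: response, 0 or 1
        descriptors.append([float(v) for v in row[1:]])   # remaining columns: descriptors
    return labels, descriptors


labels, descriptors = split_labels_and_descriptors(SAMPLE_CSV)
print(labels)       # [1, 0]
print(descriptors)  # [[0.1, 0.5, 0.0], [0.25, 0.75, 1.0]]
```

The same split (first column versus the rest) is what the scripts in this post do with `x[0]` and `x[1:]`.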
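The competition ended long ago, but you can still make 5 submissions and see your score and where it would rank. All you need to submit is a result file in CSV format.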
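As a rough sketch of what such a result file might look like, the snippet below writes one row per test molecule. The header `MoleculeId,PredictedProbability` and the six-decimal format are taken from the scripts in this post; the helper name `write_submission` and the sample probabilities are made up for illustration.

```python
import csv
import io


def write_submission(probabilities, out_file):
    """Write one row per test molecule: 1-based ID and predicted probability."""
    writer = csv.writer(out_file, lineterminator="\n")
    writer.writerow(["MoleculeId", "PredictedProbability"])
    for molecule_id, prob in enumerate(probabilities, start=1):
        writer.writerow([molecule_id, f"{prob:.6f}"])


# Write to an in-memory buffer here; pass an open file object to write to disk.
buffer = io.StringIO()
write_submission([0.91, 0.08, 0.5], buffer)
print(buffer.getvalue())
```

With a real model, `probabilities` would be the second column of `predict_proba(test)`, in the same order as the rows of the test file.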
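The labels are 0 and 1, so this is clearly a binary classification problem.
For a binary classification problem with many features like this one, logistic regression is the first thing that comes to mind.
Here is the Python code that makes predictions with Logistic Regression: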
#!/usr/bin/env python
# coding: utf-8
'''
Created on 2014-11-24
@author: zhaohf
'''
from sklearn.linear_model import LogisticRegression
from numpy import genfromtxt, savetxt

def main():
    # Load both CSV files; [1:] drops the header row, which parses as NaN.
    dataset = genfromtxt(open('../Data/train.csv', 'r'), delimiter=',', dtype='f8')[1:]
    test = genfromtxt(open('../Data/test.csv', 'r'), delimiter=',', dtype='f8')[1:]
    # Column 0 is the label; the remaining columns are the descriptors.
    target = [x[0] for x in dataset]
    train = [x[1:] for x in dataset]
    lr = LogisticRegression()
    lr.fit(train, target)
    # Pair each 1-based molecule ID with the predicted probability of class 1.
    predicted_probs = [[index + 1, x[1]] for index, x in enumerate(lr.predict_proba(test))]
    savetxt('../Submissions/lr_benchmark.csv', predicted_probs, delimiter=',',
            fmt='%d,%f', header='MoleculeId,PredictedProbability', comments='')

if __name__ == '__main__':
    main()
Next, let's try an SVM. The code is very similar.
#!/usr/bin/env python
# coding: utf-8
'''
Created on 2014-11-24
@author: zhaohf
'''
from sklearn import svm
from numpy import genfromtxt, savetxt

def main():
    # Load both CSV files; [1:] drops the header row, which parses as NaN.
    dataset = genfromtxt(open('../Data/train.csv', 'r'), delimiter=',', dtype='f8')[1:]
    test = genfromtxt(open('../Data/test.csv', 'r'), delimiter=',', dtype='f8')[1:]
    # Column 0 is the label; the remaining columns are the descriptors.
    target = [x[0] for x in dataset]
    train = [x[1:] for x in dataset]
    # probability=True is required for predict_proba to be available.
    svc = svm.SVC(probability=True)
    svc.fit(train, target)
    # Pair each 1-based molecule ID with the predicted probability of class 1.
    predicted_probs = [[index + 1, x[1]] for index, x in enumerate(svc.predict_proba(test))]
    savetxt('../Submissions/svm_benchmark.csv', predicted_probs, delimiter=',',
            fmt='%d,%f', header='MoleculeId,PredictedProbability', comments='')

if __name__ == '__main__':
    main()
The best score on the leaderboard is 0.37356, so getting a good result in this competition still takes some effort.
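To put a score like that in context, here is a minimal sketch that assumes the leaderboard metric is logarithmic loss (lower is better). The post itself does not name the metric, so treat that as an assumption; the function below is a plain-Python illustration, not the competition's official scorer.

```python
import math


def log_loss(y_true, y_prob, eps=1e-15):
    """Mean negative log-likelihood of 0/1 labels under predicted probabilities."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)  # clip so that log(0) never occurs
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return -total / len(y_true)


# Predicting 0.5 for every molecule gives ln(2), a natural baseline to beat.
baseline = log_loss([1, 0, 1, 0], [0.5, 0.5, 0.5, 0.5])
print(round(baseline, 5))  # 0.69315

# Confident, correct predictions score lower (better).
confident = log_loss([1, 0, 1, 0], [0.9, 0.1, 0.8, 0.2])
print(confident < baseline)  # True
```

Under this assumption, computing the loss on a held-out part of the training data is one way to compare models without spending any of the limited submissions.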