Tianchi Big Data Competition: Celestial Object Classification Summary

Competition Overview

Competition link:
https://tianchi.aliyun.com/competition/introduction.htm?spm=5176.100150.711.8.5f712784mldsqp&raceId=231646

In astronomy, a spectrum describes the radiative properties of a celestial object, expressed as the distribution of radiation intensity across different wavelengths. Each observed spectrum consists mainly of a continuum produced by black-body radiation, characteristic lines (absorption and emission lines) produced by atomic energy-level transitions of the elements in the object, and noise. Astronomers typically determine an object's type from its characteristic spectral lines and physical parameters. In the current LAMOST survey data releases, spectra are grouped into four classes: stars, galaxies, quasars, and unknown objects. Each spectrum in the LAMOST dataset provides a series of radiation intensity values over the wavelength range 3690-9100 Å. Automatic spectral classification means selecting and extracting, from thousands of dimensions of spectral data, the features that are most effective for classification to build a feature space, for example using flux values at specific wavelengths or over specific bands as features, and then applying algorithms to distinguish the different object types.

This competition aims to classify LAMOST DR3 spectra automatically (STAR/GALAXY/QSO/UNKNOWN) with machine learning. Contestants need to design efficient and highly accurate algorithms to solve this practical problem in astronomical research.

Competition Data

The competition data consists of two parts: an index file (index.csv) and band files (a zip of id.txt files):

1) The first row of the index file holds the field names; each subsequent row represents one celestial object. The first field of the index file is the band-file id. The training-set index file records the band-file id together with the class label; the test-set index file records only the band-file id, and the class label is to be predicted.

2) Band files are .txt text files storing band data that has already been interpolated and resampled, separated by commas. All band files share the same band interval and sampling points; each file has 2600 sampling points.

3) Files marked train form the training set; files marked test form the first-stage test set; files marked rank form the second-stage test set.

Supplementary notes on the Unknown class:
1) The unknown class in the LAMOST dataset contains objects that could not be classified with certainty, e.g. because of poor spectral quality (low signal-to-noise ratio);
2) Unknown labels are currently assigned by a program, so the class may still contain stars, galaxies, and quasars.
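
A minimal sketch of loading the raw data with pandas/NumPy; the file names and directory layout here are assumptions based on the description above:

import numpy as np
import pandas as pd

# index file: the first column is the band-file id;
# the training index additionally carries a label column
index = pd.read_csv('train_index.csv')  # assumed file name

# one band file per object: 2600 comma-separated flux samples over 3690-9100 Angstrom
first_id = index.iloc[0, 0]
flux = np.loadtxt('train/{}.txt'.format(first_id), delimiter=',')
assert flux.shape == (2600,)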

Data Preprocessing

The features of each sample are scattered across individual files, so they cannot be fed to a model directly.
The first attempt was to aggregate the features with Python's pandas, which ran into two problems:
1. it was too slow
2. it used too much memory

For the speed problem: switched to Java to read and process the data.
For the memory problem: instead of writing all the data into one file, it was split into several feature files (see the sketch of the chunking idea below).
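
The aggregation itself was done in Java; the sketch below is a rough Python equivalent of the chunking idea, with hypothetical file names and the index layout described above (id in the first column, label in the last). It writes the <prefix><i>.csv chunks that readTotalFeature below consumes:

import pandas as pd

def aggregateToChunks(index_csv, band_dir, out_prefix, chunk_size=10000, has_label=True):
    # Merge per-object band files into several CSV chunks to bound memory use.
    index = pd.read_csv(index_csv)
    for chunk_no, start in enumerate(range(0, len(index), chunk_size)):
        rows = []
        for _, rec in index.iloc[start:start + chunk_size].iterrows():
            # one band file per object: a single line of 2600 comma-separated values
            flux = pd.read_csv('%s/%s.txt' % (band_dir, rec.iloc[0]), header=None).iloc[0]
            row = [rec.iloc[0]] + flux.tolist()
            if has_label:
                row.append(rec.iloc[-1])
            rows.append(row)
        pd.DataFrame(rows).to_csv('%s%d.csv' % (out_prefix, chunk_no),
                                  header=False, index=False)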

import pandas as pd
from sklearn.preprocessing import scale

def readTotalFeature(fileName, maxFileIndex, hasLabel=True, partition=20,
                     ewFeature=True, rawFeature=True, addQsoFeature=True, doScale=True):
    # Reads the chunked CSVs (fileName0.csv .. fileName<maxFileIndex>.csv) and
    # builds the full feature table. extractFeature, EWByRangeList and
    # qsoFeature are helper functions defined elsewhere in the notebook.
    total_train_feature = pd.DataFrame()
    for i in range(0, maxFileIndex + 1):
        train_set = pd.read_csv(fileName + str(i) + '.csv', header=None)
        renameCol = {0: 'id'}
        if hasLabel:
            renameCol[train_set.shape[1] - 1] = 'label'
        train_set.rename(columns=renameCol, inplace=True)
        # raw flux columns; named `raw` so the rawFeature flag is not shadowed
        if hasLabel:
            raw = train_set.iloc[:, 1:-1]
        else:
            raw = train_set.iloc[:, 1:]
        if doScale:
            # standardize each spectrum (row-wise) to zero mean and unit variance
            raw = pd.DataFrame(scale(raw, axis=1))
        processed_train = extractFeature(raw, partition, levelFeature=False, smooth=False)
        if ewFeature:
            # equivalent-width style features over fixed sampling-point ranges
            rangeList = [(0, 300), (0, 500), (500, 1000), (850, 1000),
                         (1000, 1300), (1500, 2000), (1500, 2600), (2500, 2600)]
            ew_feature = EWByRangeList(raw, rangeList, filterThrehold=0.5)
            processed_train = pd.concat([processed_train, ew_feature], axis=1)
        if addQsoFeature:
            # extra features targeting quasar spectra
            qso_feature = qsoFeature(raw)
            processed_train = pd.concat([processed_train, qso_feature], axis=1)
        if hasLabel:
            processed_train['label'] = train_set.label
        else:
            processed_train['id'] = train_set.id
        total_train_feature = total_train_feature.append(processed_train, ignore_index=True)
    # shuffle the rows before returning
    return total_train_feature.sample(total_train_feature.shape[0], replace=False)
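
Usage (the chunk prefixes and count are assumptions matching the aggregation sketch above):

total_train_feature = readTotalFeature('train_feature_', maxFileIndex=19)
total_test_feature = readTotalFeature('test_feature_', maxFileIndex=19, hasLabel=False)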

Feature Engineering

Feature extraction:

import random
import pandas as pd

def extractFeature(train_feature, partition=2, randomPartion=False, prefix="p",
                   onlyAvg=False, filterLow=False, levelFeature=False, smooth=False):
    # Cut each spectrum into `partition` windows and compute summary statistics
    # (mean, std, median, max, min and ratio features) per window.
    if smooth:
        # smooth each spectrum with a 100-point rolling mean
        train_feature = train_feature.rolling(100, axis=1).mean()
    if filterLow:
        # zero out low flux values
        train_feature[train_feature < 1] = 0
    plen = train_feature.shape[1] // partition  # integer division, needed for slicing
    features = pd.DataFrame()
    avgs = pd.DataFrame()
    for i in range(0, partition):
        if randomPartion:
            # random window of width plen instead of the i-th fixed partition
            pstart = random.randint(0, train_feature.shape[1] - plen - 1)
        else:
            pstart = i * plen
        pendExclude = pstart + plen + 1
        window = train_feature.iloc[:, pstart:pendExclude]
        avgC = window.mean(axis=1)
        stdC = window.std(axis=1)
        maxC = window.max(axis=1)
        minC = window.min(axis=1)
        medianC = window.median(axis=1)
        diffC = window.diff(axis=1).iloc[:, 1:]  # per-point differences; currently unused
        features[prefix + '_avg' + str(i)] = avgC
        avgs[prefix + '_avg' + str(i)] = avgC
        if not onlyAvg:
            features[prefix + '_std' + str(i)] = stdC
            features[prefix + '_median' + str(i)] = medianC
            features[prefix + '_max' + str(i)] = maxC
            features[prefix + '_min' + str(i)] = minC
            features[prefix + '_avg_m_std' + str(i)] = stdC / avgC
            features[prefix + '_max_m_avg' + str(i)] = maxC / avgC
            features[prefix + '_min_m_avg' + str(i)] = minC / avgC
    if levelFeature:
        # second-level statistics computed over the per-window means
        level_feature = extractFeature(avgs, prefix="l", partition=4)
        features = pd.concat([features, level_feature], axis=1)
    return features.fillna(0)
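
With readTotalFeature's default partition=20 on the 2600 sampling points, each window covers about 130 points and contributes 8 statistics, i.e. 160 features per spectrum:

stats = extractFeature(raw, partition=20)  # raw: DataFrame of row-scaled spectra
print(stats.shape[1])  # 160 = 20 windows x 8 statistics per window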

Train/test split

from sklearn.model_selection import train_test_split

X = total_train_feature.iloc[:, :-1]
Y = total_train_feature.iloc[:, -1]

# hold out 30% of the samples for validation
x, t_x, y, t_y = train_test_split(
    X, Y, test_size=0.3, random_state=0)

Outlier handling:

import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(0)  # shared random state; the actual seed is an assumption

def outlier_rejection(X, y):
    # Drop samples that IsolationForest flags as outliers (predict returns -1).
    model = IsolationForest(n_jobs=-1, random_state=rng)
    model.fit(X)
    y_pred = model.predict(X)
    return X[y_pred == 1], y[y_pred == 1]
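
This (X, y) -> (X, y) signature matches imbalanced-learn's FunctionSampler, so the rejection step can be wired into the resampling pipeline below; a minimal sketch:

from imblearn import FunctionSampler

# like SMOTE, a sampler step resamples only during fit, never at predict time
outlier_sampler = FunctionSampler(func=outlier_rejection)
# e.g. add ('outlier', outlier_sampler) as an extra step of the pipeline below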

Other techniques tried were PCA, SMOTE oversampling, and feature selection; SMOTE and an ANOVA-based SelectKBest filter are assembled into an imbalanced-learn pipeline below (the filter is left commented out of the final version):

from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline as ImbPipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.preprocessing import StandardScaler

smote = SMOTE()
# keep the top 60% of features by ANOVA F-value
anova_filter = SelectKBest(f_classif, k=int(x.shape[1] * 0.6))
model = RandomForestClassifier(n_jobs=4, class_weight='balanced', max_depth=70,
                               verbose=1, random_state=rng)
# alternatives tried:
# model = AdaBoostClassifier(random_state=rng)
# model = GradientBoostingClassifier(random_state=rng, verbose=1)
# model = XGBClassifier()
# model = KNeighborsClassifier(n_neighbors=3, n_jobs=-1)
scaler = StandardScaler()
clf = ImbPipeline([
#     ('anova', anova_filter),
    ('scaler', scaler),
    ('smote', smote),
    ('model', model)
])
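
A short usage sketch, evaluating on the held-out split with the competition's macro-F1 metric:

from sklearn.metrics import f1_score

clf.fit(x, y)
pred = clf.predict(t_x)
print(f1_score(t_y, pred, average='macro'))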

Model

model = RandomForestClassifier(n_jobs=4, class_weight='balanced', max_depth=70,
                               verbose=1, random_state=rng)

Tuning

Hyperparameters were tuned with scikit-learn's GridSearchCV:

from sklearn.model_selection import GridSearchCV

if validMode:  # validMode: notebook flag that enables the tuning run
    smote = SMOTE()  # note: unused in this tuning block
    parameters = {
        'max_depth': [60, 70, 80],  # best: 70
        'max_features': [0.5, 0.6, 0.7],  # best: 0.6
    #     'min_samples_split': [2, 4, 8, 20],  # best: 2
    #     'min_samples_leaf': [1, 4, 8, 20],  # best: 1
    #     'min_weight_fraction_leaf': [0, 0.05, 0.1, 0.2],  # best: 0
        'max_leaf_nodes': [800, 1000, 1200],  # best: 1000
    #     'bootstrap': [True, False],  # best: False
    #     'oob_score': [True, False],  # best: False
    #     'class_weight': [None, 'balanced'],  # best: balanced
        'n_estimators': [10, 20],  # best: 10
    #     'criterion': ['gini', 'entropy']
    }
    clf = GridSearchCV(RandomForestClassifier(n_jobs=4, max_depth=20), parameters, cv=5,
                       scoring='f1_macro', n_jobs=8, verbose=2)
    clf.fit(x, y)
    print(clf.best_params_)
    means = clf.cv_results_['mean_test_score']
    stds = clf.cv_results_['std_test_score']
    for mean, std, params in zip(means, stds, clf.cv_results_['params']):
        print("%0.3f (+/-%0.03f) for %r" % (mean, std * 2, params))

Results and Outlook

Final ranking: 83/843
