The competition ended today; the results can be seen here: https://www.kaggle.com/c/santander-customer-satisfaction/leaderboard
Public leaderboard result:
Private leaderboard result:
First, comparing the private and public results, a few things stand out:
1) Almost everyone overfit; or, put another way, the private half of the test data is less regular than the public half.
2) Five of the private top ten did not even make the top few hundred on the public leaderboard, and four of them were ranked somewhere between 1000th and 2000th. In other words, using a sound approach matters far more than blindly chasing the public leaderboard ranking!!!
3) I myself climbed from 2323rd on the public leaderboard to 1063rd on the private one, a gain of 1260 places. For someone entering this kind of competition for the first time, and buried under coursework on top of that, finishing there among 5236 teams and 5831 competitors is something I am reasonably satisfied with; with so little experience, I wasted effort on a lot of dead ends.
4) Back to the key question: what exactly counts as "a sound approach"??? These are the failures I want to examine:
1. Choose the right model: since I did not understand the data, I simply tried the following models:
models=[
    RandomForestClassifier(n_estimators=1999, criterion='gini', n_jobs=-1, random_state=SEED),
    RandomForestClassifier(n_estimators=1999, criterion='entropy', n_jobs=-1, random_state=SEED),
    ExtraTreesClassifier(n_estimators=1999, criterion='gini', n_jobs=-1, random_state=SEED),
    ExtraTreesClassifier(n_estimators=1999, criterion='entropy', n_jobs=-1, random_state=SEED),
    GradientBoostingClassifier(learning_rate=0.1, n_estimators=101, subsample=0.6, max_depth=8, random_state=SEED)
]
What I actually want to point out here is that all of these models are extremely slow! At first I never bothered to set up xgBoost because the models above felt convenient, and that choice wasted an enormous amount of time; only after switching to xgBoost did I get my final result. So when you do not yet understand the data, picking a model that is fast and generalizes well is important, and xgBoost is the first choice.
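For a quick first pass of this kind, here is a minimal sketch using the xgBoost sklearn wrapper; the parameter values are illustrative assumptions, and trainX / trainY are assumed to already be loaded as numpy arrays (nothing here is the exact configuration I used):
import xgboost as xgb
from sklearn import cross_validation

SEED=1126
#illustrative settings only: fast enough to iterate with, strong enough to beat the slow ensembles above
clf=xgb.XGBClassifier(n_estimators=300, max_depth=5, learning_rate=0.05,
                      subsample=0.7, colsample_bytree=0.7, seed=SEED)
aucs=cross_validation.cross_val_score(clf, trainX, trainY, cv=5, scoring="roc_auc")
print "xgBoost 5-fold AUC: %0.4f (+/- %0.4f)" % (aucs.mean(), aucs.std()*2)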
2. Diving straight into complicated models without any thought, without even a baseline: yes, that was me, and it came down to inexperience. Complicated models overfit easily, so the further you go, the deeper you sink; they also take a long time to run, which is a real waste of time. I only realized this when I was almost out of time, and in the end my final result actually came from a very simple model. So start by knocking out a simple model and use it as the reference point for everything you build afterwards. What counts as a simple model: the raw dataset (or a lightly processed one, e.g. dropping constant columns, filling missing values, normalizing) with logistic regression, a simple SVM, or xgBoost.
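As a concrete illustration of such a baseline (the file name, column handling and preprocessing choices below are assumptions sketched around this competition's ID/TARGET format, not the exact baseline I ran):
import pandas as pd
from sklearn import cross_validation, preprocessing
from sklearn.linear_model import LogisticRegression

train=pd.read_csv("train.csv")                       #assumed file name
trainY=train["TARGET"].values
trainX=train.drop(["ID","TARGET"], axis=1)
trainX=trainX.loc[:, trainX.std()>0]                 #drop constant columns
trainX=preprocessing.StandardScaler().fit_transform(trainX.fillna(0))
baseline=LogisticRegression(C=1.0)
aucs=cross_validation.cross_val_score(baseline, trainX, trainY, cv=5, scoring="roc_auc")
print "baseline LR 5-fold AUC: %0.4f" % aucs.mean()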
3. Trust your cross-validation results: do not just split the data into two parts, because with cross-validation you will find that some folds score very well (AUC around 0.85) while others are much worse (not even 0.82).
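To make the point concrete, here is a small sketch that prints the per-fold AUCs and their spread instead of a single train/test split number (clf, trainX, trainY and SEED are assumed to be defined as elsewhere in this post):
import numpy as np
from sklearn import metrics
from sklearn.cross_validation import StratifiedKFold

foldAucs=[]
for trainI, cvI in StratifiedKFold(trainY, n_folds=5, shuffle=True, random_state=SEED):
    clf.fit(trainX[trainI], trainY[trainI])
    foldAucs.append(metrics.roc_auc_score(trainY[cvI], clf.predict_proba(trainX[cvI])[:,1]))
print "fold AUCs: %s" % ["%0.4f" % a for a in foldAucs]
print "mean %0.4f, std %0.4f" % (np.mean(foldAucs), np.std(foldAucs))   #the std is what a single split hides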
4. On the noise problem: I never found a good way to deal with it, so it is not surprising that the final result was mediocre.
5. On handling all the zeros: normalize the features, this is absolutely necessary! Otherwise your later feature engineering will look useless, because 0+k=k, 0*k=0, 0^2=0; as for exactly how to normalize, I will leave that as a hint and say no more.
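One hedged interpretation, sketched below: standardize each column (fit on train, reuse on test), so that the zero entries are shifted away from 0 and stop annihilating engineered features. The choice of StandardScaler here is my own assumption; the text above deliberately leaves the exact transform open.
from sklearn import preprocessing

#trainX / testX are assumed to be numpy arrays of the raw, mostly-zero features
scaler=preprocessing.StandardScaler()
trainXs=scaler.fit_transform(trainX)   #fit the scaling on train...
testXs=scaler.transform(testX)         #...and apply the same scaling to test
#after centering, a raw 0 becomes -mean/std != 0, so products, sums and powers of features carry information again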
6. There are also many small details. For example, when selecting features: if your final model is a GBDT, then do the selection with a GBDT as well, because features that LR finds useful may not be useful to a GBDT model. There is much more that you only become aware of in practice, such as whether feature processing should be done on train+test or on train alone: in theory it should only be done on train, since the test set is supposed to be unknown, but in a competition where you do have the test features, you might as well use them.... I will stop here; practice is the best teacher. However busy research gets, there is still time for one competition per semester........
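A hedged sketch of the first point (select features with the same model family as the final model): rank features by a GBDT's feature_importances_ rather than by LR coefficients. The estimator settings and the 50% cut below are illustrative assumptions; trainX, trainY and SEED are assumed to be defined as elsewhere in this post.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

gbc=GradientBoostingClassifier(n_estimators=100, learning_rate=0.1, max_depth=5,
                               subsample=0.7, random_state=SEED)
gbc.fit(trainX, trainY)
imp=gbc.feature_importances_
keep=np.where(imp>np.percentile(imp, 50))[0]   #keep the more important half (illustrative threshold)
print "kept %d of %d features" % (len(keep), trainX.shape[1])
trainXsel=trainX[:, keep]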
7. Enough empty talk; here is some code. The key points are three parts: greedy feature selection, cross-validation, and blending, but the code as a whole is complete:
#!/usr/bin/env python
#-*- coding:utf-8 -*-
import pandas as pd
import numpy as np
from sklearn import preprocessing, cross_validation, metrics
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier, GradientBoostingClassifier
from sklearn.cross_validation import StratifiedKFold
from sklearn.linear_model import LogisticRegression
from sklearn.externals import joblib
SEED=1126
nFold=5
def SaveFile(submitID, testSubmit, fileName="submit.csv"):
    content="ID,TARGET"
    for i in range(submitID.shape[0]):
        content+="\n"+str(submitID[i])+","+str(testSubmit[i])
    file=open(fileName,"w")
    file.write(content)
    file.close()
def CrossValidationScore(data, label, clf, nFold=5, scoreType="accuracy"):
    if scoreType=="accuracy":
        scores=cross_validation.cross_val_score(clf,data,label,cv=nFold)
        #print("mean accuracy: %0.4f (+/- %0.4f)" % (scores.mean(), scores.std() * 2))
        return scores.mean()
    elif scoreType=="auc":
        meanAUC=0.0
        kfcv=StratifiedKFold(y=label, n_folds=nFold, shuffle=True, random_state=SEED)
        for j, (trainI, cvI) in enumerate(kfcv):
            print "Fold ", j, "^"*20
            Xtrain=data[trainI]
            Xcv=data[cvI]
            Ytrain=label[trainI]
            Ycv=label[cvI]
            clf.fit(Xtrain,Ytrain)
            probas=clf.predict_proba(Xcv)
            aucScore=metrics.roc_auc_score(Ycv, probas[:,1])
            #print "auc (fold %d/%d): %0.4f" % (j+1,nFold, aucScore)
            meanAUC+=aucScore
        #print "mean auc: %0.4f" % (meanAUC/nFold)
        return meanAUC/nFold
def GreedyFeatureAdd(clf, data, label, scoreType="accuracy", goodFeatures=[], maxFeaNum=100, eps=0.00005):
    scoreHistorys=[]
    while len(scoreHistorys)<=2 or scoreHistorys[-1]>scoreHistorys[-2]+eps:
        if len(goodFeatures)==maxFeaNum:
            break
        scores=[]
        for testFeaInd in range(data.shape[1]):
            if testFeaInd not in goodFeatures:
                #tempFeaInds=goodFeatures.append(testFeaInd);
                tempFeaInds=goodFeatures+[testFeaInd]
                tempData=data[:,tempFeaInds]
                score=CrossValidationScore(tempData, label, clf, nFold, scoreType)
                scores.append((score,testFeaInd))
                print "feature: "+str(testFeaInd)+"==>mean "+scoreType+": %0.4f" % score
        goodFeatures.append(sorted(scores)[-1][1]) #only add the feature which gets "the biggest gain score"
        scoreHistorys.append(sorted(scores)[-1][0]) #only add the biggest gain score
        #print scoreHistorys
        print "current features: %s" % sorted(goodFeatures)
    if len(goodFeatures)<maxFeaNum:
        goodFeatures.pop(-1) #the last feature added did not improve the score, so drop it
    return goodFeatures

#......load the data, run GreedyFeatureAdd, then fit the base models fold by fold and stack their
#out-of-fold predictions into dataset_trainBlend / dataset_testBlend (the blending features)......
model=LogisticRegression()
trainAucList=[]
for c in [0.0001, 0.001, 0.01, 0.03, 0.04, 0.05, 0.06, 0.07, 0.08, 0.1, 0.3, 1.0, 10.0]:
    model.C=c
    model.fit(dataset_trainBlend,trainY)
    trainAuc=metrics.roc_auc_score(trainY, model.predict_proba(dataset_trainBlend)[:,1])
    trainAucList.append((trainAuc,c))
    print "C=%f => trainAuc=%f" % (c, trainAuc)
sortedtrainAucList=sorted(trainAucList)
'''
C => trainAuc
0.0001 => 0.126..
0.001 => 0.807188
0.01 => 0.815833
0.03 => 0.820674
0.04 => 0.821295
0.05 => 0.821439 ***
0.06 => 0.821129
0.07 => 0.820521
0.08 => 0.820067
0.1 => 0.819036
0.3 => 0.813210
1.0 => 0.809002
10.0 => 0.807334
'''
model.C=sortedtrainAucList[-1][1] #0.05
model.fit(dataset_trainBlend,trainY)
trainProba=model.predict_proba(dataset_trainBlend)[:,1]
print "train auc: %f" % metrics.roc_auc_score(trainY, trainProba) #0.821439
print "model.coef_: ", model.coef_
print "Predict and saving results..."
submitProba=model.predict_proba(dataset_testBlend)[:,1]
df=pd.DataFrame(submitProba)
print df.describe()
SaveFile(submitID, submitProba, fileName="1submit.csv") #0.815536 [blending makes result < GBC 0.8199]
#!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! Blending models ISN'T a good idea when one model is OBVIOUSLY better than the others...
'''
count 75818.000000
mean 0.039187
std 0.033691
min 0.024876
25% 0.028400
50% 0.029650
75% 0.034284
max 0.806586
'''
print "MinMaxScaler predictions to [0,1]..."
mms=preprocessing.MinMaxScaler(feature_range=(0, 1))
submitProba=mms.fit_transform(submitProba)
df=pd.DataFrame(submitProba)
print df.describe()
SaveFile(submitID, submitProba, fileName="1submitScale.csv") #0.815536
'''
count 75818.000000
mean 0.018307
std 0.043099
min 0.000000
25% 0.004509
50% 0.006107
75% 0.012035
max 1.000000
'''
There is actually a lot more I want to say, but I will end this post here; after all, a lecture from someone who finished 1000+ gets annoying. I will save the rest for the next competition.
http://blog.kaggle.com/2016/02/22/profiling-top-kagglers-leustagos-current-7-highest-1/
And it coincides with what the top players say:
https://www.kaggle.com/c/santander-customer-satisfaction/forums/t/20647/congrats?page=2
Some strategies we used to reduce overfitting are as follows: 1) use both local CV and public LB test to identify a couple of potentially good model candidates; 2) in our final solution, mainly use those model candidates which give "quite large" improvements and ignore those models that only provide small improvements, since those tiny improvements might quite possibly be due to overfitting to the noise. For example, our best public LB score ensemble includes >=60 individual models, and most of them only provide about <0.000050 tiny improvement. In our final ensemble which gave us #1 place on the private LB, I kicked out most of these models with tiny improvements, and the final ensemble only includes about 5 models (here some of these 5 models are an average of running a model with many different random seeds); 3) for xgb models, run it with a couple of different seeds and slightly different parameters, and then take the average; this will also make its performance a bit more stable. 4) For those features like age<23 to identify 0 records, we only used them very conservatively.
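Point 3) above (averaging the same xgb model over several seeds) is easy to sketch; the parameters and the seed list below are illustrative assumptions, not the winners' actual settings, and trainX, trainY, testX are assumed to be loaded:
import numpy as np
import xgboost as xgb

seedProbas=[]
for s in [1126, 2016, 42]:                  #a few arbitrary seeds
    clf=xgb.XGBClassifier(n_estimators=300, max_depth=5, learning_rate=0.05,
                          subsample=0.7, colsample_bytree=0.7, seed=s)
    clf.fit(trainX, trainY)
    seedProbas.append(clf.predict_proba(testX)[:,1])
avgProba=np.mean(seedProbas, axis=0)        #averaging over seeds smooths run-to-run variance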
https://www.kaggle.com/c/santander-customer-satisfaction/forums/t/20662/overtuning-hyper-parameters-especially-re-xgboost
My question: How far should one tune the various parameters of xgboost? Here are my current limits:
As mentioned above, I find that tuning further may cause a lower score against a held out test set.
To understand tuning, it helps to understand how tree-based ensemble algorithms work. See the Prettenhofer and Louppe slides, among many, many others. Then it becomes a question of fitting the parameters to the data without overfitting (i.e. tuning the algorithm so far that it finds too many features that are specific only to the currently known data). Each parameter offers its own unique opportunity to mess things up. Specifically:
max_depth:
[1,infinity)
It's how many levels deep a tree is allowed to go. Each level allows you to catch an interaction between the previous split and a new variable. If a value < 3 works really well, you probably want to look at the data again to see why there aren't useful interactions. On the other side, the larger this value, the more possible it is to overfit training data. A good place to start is between 4 and 6. For higher values think about using one of the other parameters to add robustness. (Robustness means that the algorithm doesn't get tripped up by outliers in the training set which won't generalize well.)
min_child_weight:
[1,infinity)
This is how big each group in the tree has to be. Larger values are more robust than smaller values (less likely to result in overfitting). Use the largest value you can that doesn't seem to hurt performance. Unfortunately, this value is also sensitive to the size of the training set. Namely min_child_weight/size_train is a lower bound on the probability of a type of event trees can hope to find. (In this case the ML estimate of this probability was .0396). If max_depth is not too high, you can start with 1, however the higher max_depth is, the higher this value should also be in order to avoid overfitting.
reg_alpha, reg_lambda: These only apply if you are using the linear model as the base model in boosting and not the default tree model.
subsample:
(0,1] This is primarily to help prevent overfitting. Good values are more dependent on how much training data there is. One issue, especially using K-fold CV is that if you are below a certain threshold (very data dependent), then more data will improve performance more than anything else. So the best value from 2-fold CV may not be the best value for 5-fold CV or 10-fold CV or even all the training data... If you have enough data start between .5 and .9. If you are tired of waiting for a single run of xgboost to finish, crank this lower. If you don't have very much data, leave it between .9 and 1.
colsample_bytree:
(0,1] This is also very data dependent, albeit more on the number and quality of features. Unlike random forest, gradient boosting is less sensitive to bad features, especially if this value is small. On the other hand it needs to have enough features for each tree to build a reasonable model. A reasonable place to start is 10 <= num_feats*colsample_bytree <=50. Alternately you could mimic the random forest heuristic which is a value that gives either sqrt(num_feats) or log2(num_feats) for each tree. Side note: For this data set the difference between .701 and .7 is at most one feature depending on rounding.
gamma: Never tried this one. No help here. Nada. Got nothink.
It's very suspicious if small changes in any of these parameters make large changes in CV. We should all have the robot that goes 'Danger! Danger Will Robinson' when that happens. Also good parameters should perform well regardless of the random seed used. The script only worked well with 1234. Now it was possible that 1234 could be magic on both the public and private sets....
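Putting the advice above together, here is a hedged example of a starting configuration; every value is an assumption derived from the ranges quoted above, not a recommendation taken from the original posts:
import xgboost as xgb

#for a dataset with a few hundred features, colsample_bytree=0.1 gives a few dozen features per tree,
#inside the suggested 10 <= num_feats*colsample_bytree <= 50 range
clf=xgb.XGBClassifier(
    max_depth=5,            #start between 4 and 6
    min_child_weight=5,     #raise this together with max_depth to stay robust
    subsample=0.7,          #between .5 and .9 when there is enough data
    colsample_bytree=0.1,
    learning_rate=0.05,
    n_estimators=500,
    seed=1126)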