Twice I’ve tried to realistically present the performance of the algorithm. Twice was my paper rejected because of “unfinished methods” or “disappointing results”. There’s a whole culture of “rounding-up”, and trying to do the evaluations fairly just gives you trouble. When fair evaluations get rejected and rounders-up pass through, what do you do?
Anonymous’s story is surely common.
On any given paper, there is an incentive to “cheat” with some of the above methods. This can be hard to resist when so much rides on a paper acceptance _and_ some of the above cheats are not easily detected. Nevertheless, it should be resisted because “cheating” of this sort inevitably fools you as well as others. Fooling yourself in research is a recipe for a career that goes nowhere. Your techniques simply won’t apply well to new problems, you won’t be able to tackle competitions, and ultimately you won’t even trust your own intuition, which is fatal in research.
My best advice for anonymous is to accept that life is difficult here. Spend extra time testing on many datasets rather than a few. Spend extra time thinking about what makes a good algorithm, or not. Take the long view and note that, in the long run, the quantity of papers you write is not important, but rather their level of impact. Using a “cheat” very likely subverts long term impact.
How about an index of negative results in machine learning? There’s a Journal of Negative Results in other domains: Ecology & Evolutionary Biology, Biomedicine, and there is the Journal of Articles in Support of the Null Hypothesis. A section on negative results in machine learning conferences? This kind of information is very useful in preventing people from taking pathways that lead nowhere: if one wants to classify an algorithm into good/bad, one certainly benefits from unexpectedly bad examples too, not just unexpectedly good examples.
I visited the workshop on negative results at NIPS 2002. My impression was that it did not work well.
The difficulty with negative results in machine learning is that they are too easy. For example, there are a plethora of ways to say that “learning is impossible (in the worst case)”. On the applied side, it’s still common for learning algorithms to not work on simple-seeming problems. In this situation, positive results (this works) are generally more valuable than negative results (this doesn’t work).
This discussion reminds me of some interesting research on “anti-learning” by Adam Kowalczyk. This research studies (empirically and theoretically) machine learning algorithms that yield good performance on the training set but worse than random performance on the independent test set.
Hmm, rereading this post. What do you mean by “brittle”? Why is mutual information brittle?
Standard deviation of loss across the CV folds is not a bad summary of variation in CV performance. I’m not sure one can just reject a paper where the authors bothered to disclose the variation, rather than just plopping out the average. Standard error carries some Gaussian assumptions, but it is still a valid summary. The distribution of loss is sometimes quite close to being Gaussian, too.
As for significance, I came up with the notion of CV-values that measure how often method A is better than method B in a randomly chosen fold of cross-validation replicated very many times.
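A minimal sketch of the CV-value idea as described above, assuming it is simply the fraction of paired folds (pooled over all replications) in which method A scores higher than method B. The function name `cv_value`, the tie-handling rule, and the accuracy lists are all made up for illustration; the numbers are not results from any real experiment.

```python
def cv_value(scores_a, scores_b):
    """Fraction of paired folds in which method A beats method B.

    scores_a[i] and scores_b[i] must come from the same train/test split.
    Ties count as half a win for each side (an assumption of this sketch).
    """
    if len(scores_a) != len(scores_b) or not scores_a:
        raise ValueError("need two equally long, non-empty score lists")
    wins = 0.0
    for a, b in zip(scores_a, scores_b):
        if a > b:
            wins += 1.0
        elif a == b:
            wins += 0.5
    return wins / len(scores_a)


# Hypothetical per-fold accuracies, for illustration only.
acc_a = [0.80, 0.78, 0.82, 0.75, 0.79]
acc_b = [0.77, 0.79, 0.80, 0.75, 0.74]
print(cv_value(acc_a, acc_b))
```

A value near 0.5 would say the two methods are hard to tell apart on a random fold; values near 0 or 1 would say one of them wins almost every time.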
What I mean by brittle: Suppose you have a box which takes some feature values as input and predicts some probability of label 1 as output. You are not allowed to open this box or determine how it works other than by this process of giving it inputs and observing outputs.
Let x be an input.
Let y be an output.
Assume (x,y) are drawn from a fixed but unknown distribution D.
Let p(x) be a prediction.
For classification error I(|y – p(x)| ≥ 0.5) you can prove a theorem of the rough form:
forall D, with high probability over the draw of m examples independently from D,
expected classification error rate of the box with respect to D is bounded by a function of the observations.
What I mean by “brittle” is that no statement of this sort can be made for any unbounded loss (including log-loss which is integral to mutual information and entropy). You can of course open up the box and analyze its structure or make extra assumptions about D to get a similar but inherently more limited analysis.
The situation with leave-one-out cross validation is not so bad, but it’s still pretty bad. In particular, there exists a very simple learning algorithm/problem pair with the property that the leave-one-out estimate has the variance and deviations of a single coin flip. Yoshua Bengio and Yves Grandvalet in fact proved that there is no unbiased estimator of variance. The paper that I pointed to above shows that for K-fold cross validation on m examples, all moments of the deviations might only be as good as on a test set of size $m/K$.
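One construction with the coin-flip property can be sketched as follows. This is an assumption of the sketch, since the comment does not spell out which algorithm/problem pair it has in mind: the labels are fair coin flips independent of the features, and the hypothetical `parity_learner` ignores the features and predicts the parity of the training labels. The true error rate is 1/2, yet the leave-one-out estimate is always exactly 0 or exactly 1, decided by the parity of the whole sample.

```python
import random


def parity_learner(train_labels):
    """Ignore the features; always predict the parity of the training labels."""
    return sum(train_labels) % 2


def loo_error(labels):
    """Leave-one-out error rate of parity_learner on binary labels."""
    mistakes = 0
    for i, y in enumerate(labels):
        held_in = labels[:i] + labels[i + 1:]
        if parity_learner(held_in) != y:
            mistakes += 1
    return mistakes / len(labels)


rng = random.Random(0)
for _ in range(5):
    labels = [rng.randint(0, 1) for _ in range(20)]
    # The estimate is 0.0 when the number of 1s is even and 1.0 when it is odd.
    print(sum(labels) % 2, loo_error(labels))
```

Removing example i flips the training parity exactly when its label is 1, so every held-out prediction is right when the total count of 1s is even and wrong when it is odd. All m leave-one-out tests therefore succeed or fail together.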
I’m not sure what a ‘valid summary’ is, but leave-one-out cross validation can not provide results I trust, because I know how to break it.
I have personally observed people using leave-one-out cross validation with feature selection to quickly achieve a severe overfit.
Thanks for the explanation of brittleness! This is a problem with log-loss, but I’d say that it is not a problem with mutual information. Mutual information has well-defined upper bounds. For log-loss, you can put a bound into effect by mixing the prediction with a uniform distribution over y, bounding the maximum log-loss in a way that’s analogous to the Laplace probability estimate. While I agree that unmixed log-loss is brittle, I find classification accuracy noisy.
A reasonable compromise is Brier score. It’s a proper loss function (so it makes good probabilistic sense), and it’s a generalization of classification error where the Brier score of a non-probabilistic classifier equals its classification error, but a probabilistic classifier can benefit from distributing the odds. So, the result you mention holds also for Brier score.
If I perform 2-replicated 5-fold CV of the NBC performance on the Pima indians dataset, I get the following [0.76 0.75 0.87 0.76 0.74 0.77 0.79 0.72 0.78 0.82 0.81 0.79 0.73 0.74 0.82 0.79 0.74 0.77 0.83 0.75 0.79 0.73 0.79 0.80 0.76]. Of course, I can plop out the average of 0.78. But it is nicer to say that the standard deviation is 0.04, and summarize the result as 0.78 +- 0.04. The performance estimate is a random quantity too. In fact, if you perform many replications of cross-validation, the classification accuracy will have a Gaussian-like shape too (a bit skewed, though).
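The summary above can be reproduced from the listed accuracies with the standard library. This is a small sketch that assumes the sample standard deviation (dividing by n - 1) is the one intended; the comment does not say which variant was used, and the population version rounds to the same value here.

```python
import statistics

# The 25 fold accuracies quoted in the comment above.
accuracies = [
    0.76, 0.75, 0.87, 0.76, 0.74, 0.77, 0.79, 0.72, 0.78, 0.82,
    0.81, 0.79, 0.73, 0.74, 0.82, 0.79, 0.74, 0.77, 0.83, 0.75,
    0.79, 0.73, 0.79, 0.80, 0.76,
]

mean = statistics.mean(accuracies)
std = statistics.stdev(accuracies)  # sample standard deviation (n - 1)

print(f"{mean:.2f} +- {std:.2f}")  # prints 0.78 +- 0.04
print(min(accuracies), max(accuracies))  # the raw range, for comparison
```

Printing the minimum and maximum next to the two-moment summary shows how much of the spread the +- figure hides: the folds range from 0.72 to 0.87.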
I too recommend against LOO, for the simple reason that the above empirical summaries are often awfully strange.
Very very interesting. However, I still feel (but would love to be convinced otherwise) that when the dataset is small and no additional data can be obtained, LOO-CV is the best among the (admittedly non-ideal) choices. What do you suggest as a practical alternative for a small dataset?
I’m not convinced by your observation about people using LOO-CV with feature selection to overfit. Isn’t this just a problem with reusing the same validation set multiple times? Even if I use a completely separately drawn validation set, which Bengio and Grandvalet show yields an unbiased estimate of the variance of the prediction error, I can still easily overfit the validation set when doing feature selection, right?
This is my first post on your blog. Thanks so much for putting it up — a very nice resource!
Aleks’s technique for bounding log loss by wrapping the box in a system that mixes with the uniform distribution has a problem: it introduces perverse incentives for the box. One reason why people consider log loss is that the optimal prediction is the probability. When we mix with the uniform distribution, this is no longer true. Mixing with the uniform distribution shifts all probabilistic estimates towards 0.5, which means that if the box wants to minimize log loss, it should make an estimate p such that after mixing, you get the actual probability.
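The incentive problem can be made concrete for a binary label. This sketch assumes mixing means replacing a predicted probability p with (1 - eps) * p + eps * 0.5; the weight `eps`, the true probability `pi`, and the function names are hypothetical choices for illustration.

```python
import math


def mix_with_uniform(p, eps):
    """Binary case: blend the predicted P(y=1) with the uniform value 0.5."""
    return (1.0 - eps) * p + eps * 0.5


def expected_log_loss(q, pi):
    """Expected log loss of reporting q when the true P(y=1) is pi."""
    return -(pi * math.log(q) + (1.0 - pi) * math.log(1.0 - q))


def best_raw_prediction(pi, eps):
    """Raw prediction whose mixed version equals pi, clipped to [0, 1]."""
    p = (pi - eps * 0.5) / (1.0 - eps)
    return min(1.0, max(0.0, p))


pi, eps = 0.9, 0.1
honest = expected_log_loss(mix_with_uniform(pi, eps), pi)
shifted_p = best_raw_prediction(pi, eps)
shifted = expected_log_loss(mix_with_uniform(shifted_p, eps), pi)

print(shifted_p, honest, shifted)
print(-math.log(eps / 2))  # largest loss a mixed prediction can incur
```

A box that reports the true probability pays a higher expected loss than one that exaggerates towards the extremes by just enough to cancel the mixing. The last line shows the benefit that motivates mixing in the first place: the loss is capped at -log(eps / 2).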
David McAllester advocates truncation as a solution to the unboundedness. This has the advantage that it doesn’t create perverse incentives over all nonextreme probabilities.
Even when we swallow the issues of bounding log loss, rates of convergence are typically slower than for classification, essentially because the dynamic range of the loss is larger. Thus, we can expect log loss estimates to be more “noisy”.
Before trusting mutual information, etc., I want to see rate of convergence bounds of the form I mentioned above.
I’m not sure what Brier score is precisely, but just using L(p,y)=(p-y)^2 has all the properties mentioned.
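A short sketch of the two properties under discussion, assuming a binary label y in {0, 1} and a predicted probability p for y = 1. The helper names are illustrative only. It checks that the squared loss equals the classification error for hard 0/1 predictions, and contrasts its bounded range with the unbounded log loss.

```python
import math


def squared_loss(p, y):
    """L(p, y) = (p - y)^2 for a predicted P(y=1) = p and a label y in {0, 1}."""
    return (p - y) ** 2


def log_loss(p, y):
    """Negative log-probability assigned to the observed label."""
    prob_of_label = p if y == 1 else 1.0 - p
    return math.inf if prob_of_label == 0.0 else -math.log(prob_of_label)


# Hard (0/1) predictions: squared loss coincides with classification error.
for p, y in [(0.0, 0), (0.0, 1), (1.0, 0), (1.0, 1)]:
    print(p, y, squared_loss(p, y), int(p != y))

# A confident mistake: squared loss never exceeds 1, log loss keeps growing.
for p in (0.1, 1e-3, 1e-9, 0.0):
    print(p, squared_loss(p, 1), log_loss(p, 1))
```

Because the squared loss stays in [0, 1], the same kind of bounded-loss argument that works for classification error can be applied to it, which is not the case for the log loss.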
I consider reporting standard deviation of cross validation to be problematic. The basic reason is that it’s unclear what I’m supposed to learn. If it has a small deviation, this does not mean that I can expect the future error rate on i.i.d. samples to be within the range of the +/-. It does not mean that if I cut the data in another way (and the data is i.i.d.), I can expect to get results in the same range. There are specific simple counterexamples to each of these intuitions. So, while reporting the range of results you see may be a ‘summary’, it does not seem to contain much useful information for developing confidence in the results.
One semi-reasonable alternative is to report the confidence interval for a Binomial with m/K coin flips, which fits intuition (1), for the classifier formed by drawing randomly from the set of cross-validated classifiers. This won’t leave many people happy, because the intervals become much broader.
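A sketch of what such an interval could look like, assuming the exact (Clopper-Pearson) two-sided binomial interval is an acceptable reading of "the confidence interval for a Binomial". The values of m, K, and the observed error rate below are hypothetical, and rounding the error count to an integer is a simplification of this sketch.

```python
import math


def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(math.comb(n, i) * p**i * (1.0 - p) ** (n - i) for i in range(k + 1))


def _solve(f, target, lo=0.0, hi=1.0, iters=60):
    """Bisection for a decreasing function f on [lo, hi] with f(p) = target."""
    for _ in range(iters):
        mid = (lo + hi) / 2.0
        if f(mid) > target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


def exact_interval(errors, n, delta=0.05):
    """Two-sided Clopper-Pearson interval for the true error rate."""
    # Upper bound: largest p still consistent with seeing <= errors mistakes.
    upper = 1.0 if errors == n else _solve(
        lambda p: binom_cdf(errors, n, p), delta / 2.0)
    # Lower bound: smallest p still consistent with seeing >= errors mistakes.
    lower = 0.0 if errors == 0 else _solve(
        lambda p: binom_cdf(errors - 1, n, p), 1.0 - delta / 2.0)
    return lower, upper


m, K = 500, 10                  # hypothetical dataset size and fold count
n = m // K                      # treat the estimate as m/K coin flips
cv_error_rate = 0.22            # hypothetical cross-validation error rate
errors = round(cv_error_rate * n)
print(exact_interval(errors, n))
```

With only m/K effective coin flips, the interval is far wider than the +- figure computed from the spread across folds, which is the discomfort the comment anticipates.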
The notion that cross validation errors are “gaussian-like” is also false in general, on two counts:
This is an important issue because it’s not always obvious from experimental results (and intuitions derived from experimental results) whether the approach works. The math says that if you rely on leave-one-out cross-validation in particular you’ll end up with bad intuitions about future performance. You may not encounter this problem on some problems, but the monsters are out there.
For rif’s questions — keep in mind that I’m only really considering methods of developing confidence here. I’m ok with people using whatever ugly nasty hacks they want in producing a good predictor. You are correct about the feature selection example being about using the same validation set multiple times. (Bad!) The use of leave-one-out simply aggravated the effect of this with respect to using a holdout set because it’s easier to achieve large deviations from the expectation on a leave-one-out estimate than on a holdout set.
Developing good confidence on a small dataset is a hard problem. The simplest solution is to accept the need for a test set even though you have few examples. In this case, it might be worthwhile to compute very exact confidence intervals (code here). Doing K-fold cross validation on m examples and using confidence intervals for m/K coin flips is better, but by an unknown (and variable) amount. The theory approach, which has never yet worked well, is to very carefully use the examples for both purposes. A blend of these two approaches can be helpful, but the computation is a bit rough. I’m currently working with Matti Kääriäinen on seeing how well the progressive validation approach can be beat into shape.
And of course we should remember that all of this is only meaningful when the data is i.i.d., which it often clearly is not.
I think we have a case where the assumptions of applied machine learners differ from the assumptions of the theoretical machine learners. Let’s hash it out!
==
* (Half-)Brier score is 0.5(p-y)^2, where p and y are vectors of probabilities (p-predicted, y-observed).
* A side consequence of mixing is also truncation; but mixing is smooth, whereas truncation results in discontinuities of the gradient. There is a good justification for mixing: if you see that you misclassify in 10% of the cases on the unseen test data, you can anticipate similar error in the future, and calibrate the predictions by mixing with the uniform distribution.
* Standard deviation of the CV results is a foundation for bias/variance decomposition and a tremendous amount of work in applied statistics and machine learning. I wouldn’t toss it away so lightly, and especially not based on the argument of non-independence of folds. The purpose of non-independence of folds in the first place is that you get a better estimate of the distribution over all the training/test splits of a fixed proportion (one could say that the split is chosen by i.i.d., not the instances). You get a better estimate with 10-fold CV than by picking 10 train/test splits at random.
* Both binomial and Gaussian model of the error distribution are just models. Neither of them is ‘true’, but they are based on slightly different assumptions. I generally look at the histogram and eyeball it for gaussianity, as I have done in my example. The fact that it is a skewed distribution (with the truncated hump at ~85%) empirically invalidates the binomial error model too. One can compute the first two moments as an informative “finite” summary even if the underlying distribution has more of them.
I am not advocating ‘tossing’ cross-validation. I am saying that caution should be exercised in trusting it.
Do you have a URL for this other analysis?
You are right to be skeptical about models, but the ordering of skepticism seems important. Models which make more assumptions (and in particular which make assumptions that are clearly false) should be viewed with more skepticism.
What is the standard deviation of cross validation errors supposed to describe? I listed and dismissed a couple of possibilities, so now I’m left without an understanding.
I’d like to follow up a bit on your comment that “It’s easier to achieve large deviations from the expectation on a leave-one-out estimate than on a holdout set.” I was not familiar with this fact. Could you discuss this in more detail, or provide a reference that would help me follow this up? Quite interesting.
I didn’t mean to imply that you’d disagree with cross-validation in general. The issue at hand is whether the standard deviation of CV errors is useful or not. I can see two reasons for why one can be unhappy about it:
a) It can happen that you get accuracy of 0.99 +- 0.03. What could that mean? The standard deviation is a summary. If you provide a summary consisting of the first two moments, it does not mean that you believe in the Gaussian model – of course those statistics are not sufficient. It is a summary that roughly describes the variance of the classifier, inasmuch that the mean accuracy indicates its bias.
b) The instances in a training and test set are not i.i.d. Yes, but the above summary relates to the question: “Given a randomly chosen training/test 9:1 split of instances, what can we say about the classifier’s accuracy on the test set?” This is a different question than “Given a randomly chosen instance, what will be the classifier’s expected accuracy?”
Several people have a problem with b) and use bootstrap instead of cross-validation in bias/variance analysis. Still, I don’t see a problem with the formulation, if one doesn’t attempt to perceive CV as an approximation to making statements about i.i.d. samples.
rif – see today’s post under “Examples”.
Aleks, I regard the 0.99 +/- 0.03 issue as a symptom that the wrong statistics are being used (i.e. assuming gaussianity on obviously non-gaussian draws).
I’m not particularly interested in “Given a randomly chosen training/test 9:1 split of instances, what can we say about the classifier’s accuracy on the test set?” because I generally think the goal of learning is doing well on future examples. Why should I care about this?
Reporting 0.99 +- 0.03 does not imply that one who wrote it believes that the distribution is Gaussian. Would you argue that reporting 0.99 +- 0.03 is worse than just reporting 0.99? Anyone surely knows that the classification accuracy cannot be more than 1.0, it would be most arrogant to assume such ignorance.
CV is the de facto standard method of evaluating classifiers, and many people trust the results that come out of this. Even if I might not like this approach, it is a standard, it’s an experimental bottom line. “Future examples” are something you don’t have, something you can only make assumptions about. Cross-validation and learning curves employ the training data so as to empirically demonstrate the stability and convergence of the learning algorithm on what effectively *is* future data for the algorithm, under the weak assumption of permutability of the training data. Permutability is a weaker assumption than iid. My main problem with most applications of CV is that people don’t replicate the cross-validation on multiple assignments to folds, something that’s been pointed out quite nicely by, e.g.,
Estimating Replicability of Classifier Learning Experiments. ICML, 2004.
The problem with LOO is that you *cannot* perform multiple replications.
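Replication over fold assignments can be sketched in a few lines. The generator name `replicated_kfold` and its parameters are hypothetical; the sketch only produces index lists and leaves training and scoring to the caller.

```python
import random


def replicated_kfold(n, k, replications, seed=0):
    """Yield (replication, fold, train_idx, test_idx) for repeated K-fold CV."""
    rng = random.Random(seed)
    for r in range(replications):
        order = list(range(n))
        rng.shuffle(order)  # a fresh random assignment of examples to folds
        folds = [order[f::k] for f in range(k)]
        for f, test_idx in enumerate(folds):
            train_idx = [i for g, fold in enumerate(folds) if g != f for i in fold]
            yield r, f, train_idx, sorted(test_idx)


splits = list(replicated_kfold(n=10, k=5, replications=3))
print(len(splits))  # 3 replications x 5 folds = 15 train/test splits
```

Setting k equal to n gives leave-one-out, where every replication produces the same n singleton test sets. Reshuffling then changes nothing, which is why LOO cannot be replicated in this sense.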
If your assumptions grow from iid, you shouldn’t use cross-validation, it’s a) not solving your problem, and b) you could get better results with an evaluation method that assumes more. It is unfair to criticize CV on these grounds. One can grow a whole different breed of statistics based on permutability and training/test splitting.
Reporting 0.99 +- 0.03 does mean that the inappropriate statistics are being used.
I am not trying to claim anything about the belief of the person making the application (and certainly not trying to be arrogant).
I have a problem with reporting the +/- 0.03. It seems that it has no interesting interpretation, and the obvious statistical interpretation is simply wrong.
The standard statistical “meaning” of 0.99 +- 0.03 is a confidence interval about an observation. A confidence interval [lower_bound(observation), upper_bound(observation)] has the property that, subject to your assumptions, it will contain the true value of some parameter with high probability over the random draw of the observation. The parameter I care about is the accuracy, the probability that the classifier is correct. Since the true error rate can not go above 1, this confidence interval must be constructed with respect to the wrong assumptions about the observation generating process. This isn’t that damning though – what’s really hard to swallow is that this method routinely results in intervals which are much narrower than the standard statistical interpretation would suggest. In other words, it generates overconfidence.
> Would you argue that reporting 0.99 +- 0.03 is worse than just reporting 0.99?
Absolutely. 0.99 can be interpreted as an unbiased monte carlo estimate of the “true” accuracy. I do not have an interpretation of 0.03, and the obvious interpretations are misleading due to nongaussianity and nonindependence in the basic process. Using this obvious interpretation routinely leads to overconfidence which is what this post was about.
I don’t regard the distinction between “permutable” and “independent” as significant here, because DeFinetti’s theorem says that all exchangeable (i.e. permutable) sequences can be thought of as i.i.d. samples conditioned on the draw of a hidden random variable. We do not care what the value of this hidden random variable is because a good confidence interval for accuracy works no matter what the datageneration process is. Consequently, the ‘different breed’ you speak of will end up being the same breed.
Many people use cross validation in a way that I don’t disagree with. For example, tuning parameters might be reasonable. I don’t even have a problem with using cross validation error to report performance (except when this creates a subtle instance of “reproblem”). What seems unreasonable is making confidence interval-like statements subject to known-wrong assumptions. This seems especially unreasonable when there are simple alternatives which don’t make known-wrong assumptions.
I think you are correct: many other people (I would not say it’s quite “the” standard) try to compute (and report) confidence interval-like summaries. I think it’s harmful to do so because of the routine overconfidence this creates.
rif: Another reason LOO CV is bad is that it is asymptotically suboptimal. For example, if you use Leave One Out cross-validation for feature selection, you might end up selecting a suboptimal subset, even with an infinite training sample. The neural-nets FAQ talks about it: http://www.faqs.org/faqs/ai-faq/neural-nets/part3/section-12.html
Experimentally, Ronny Kohavi and Breiman found independently that 10 is the best number of folds for CV.
The FAQ says “cross-validation is markedly superior [to split sample validation] for small data sets; this fact is demonstrated dramatically by Goutte (1997)”. (google scholar has the paper), but I’m not sure their conclusions extend beyond their Gaussian synthetic data.
I agree with you regarding the inappropriateness of +- notation, and I also agree about the general overconfidence of confidence intervals. Over here it says: “LTCM’s loss in August 1998 was a -10.5 sigma-event on the firm’s risk model, and a -14 sigma-event in terms of the actual previous price movements.” Sometimes overfitting is very expensive: LTCM “lost” quite a few hundred million US$ (“lost” in quotes because financial transactions are largely a zero-sum game).
What if I had written 0.99(0.03), without implying that 0.03 is a confidence interval (because it is not)? It is quite rare in statistics to provide confidence intervals – usually one provides either the standard deviation of the distribution or the standard error of the estimate of the mean. Still, I consider the 0.03 a very useful piece of information, and I’m grateful to any author that is diligent enough to provide some information about the variation in the performance. I’d reject a paper that only provides the mean for a small dataset, or didn’t perform multiply replicated experiments.
As far as I’m concerned, The Right Way of dealing with confidence intervals of cross-validated loss is to perform multiple replications of cross-validation, and provide the scores at appropriate percentiles. My level of agreement with the binomial model is about at the same level as your agreement with the Gaussian model. Probability of error is meaningless: there are instances that you can almost certainly predict right, there are instances that you usually misclassify, and there are boundary instances where the predictions of the classifier vary, depending on the properties of the split. Treating all these groups as one would be misleading.
Regarding de Finetti, one has to be careful: there is a difference between finite and infinite exchangeability. The theorem goes from *infinite* exchangeability to iid. When you have an infinite set, there is no difference between forming a finite sample by sampling-with-replacement (bootstrap) versus sampling-without-replacement (cross-validation). When you have a finite set to sample from, it’s two different breeds.
As for assumptions, they are all wrong… But some are more agreeable than others.
0.99(0.03) is somewhat better, but I suspect people still interpret it as a confidence interval, even when you explicitly state that it is not.
Another problem is that I still don’t know why it’s interesting. You assert it’s very interesting, but can you explain why? How do you use it? Saying 0.99(0.03) seems semantically equivalent to saying “I achieved test set performance of 0.99 with variation 0.03 across all problems on the UCI database”, except not nearly as reassuring because the cross-validation folds do not encompass as much variation across real-world problems.
On Binomial vs. Gaussian model: the Binomial model (at least) has the advantage that it is not trivially disprovable.
On probability of error: it’s easy to criticize any small piece of information as incomplete. Nevertheless, we like small pieces of information because we can better understand and use them. “How often should I expect the classifier to be wrong in the future” seems like an interesting (if incomplete) piece of information to me. A more practical problem with your objection is that distinguishing between “always right”, “always wrong” and “sometimes right” examples is much harder, requiring more assumptions, than distinguishing error rate. Hence, such judgements will be more often wrong.
I had assumed you were interested in infinite exchangeability because we are generally interested in what the data tells us about future (not yet seen) events. Analysis which is only meaningful with respect to known labeled examples simply doesn’t interest me, in the same way that training error rate doesn’t interest me.
Why bother to make a paper, at all? Why don’t you code stuff and throw it into e-market? There are forums, newsgroups, and selected “peers” for things that are incomplete and require some discussion.
No, 0.99(0.03) means 0.99 classification error across 90:10 training-test splits on a single data set. It is quite meaningless to try to assume any kind of average classification error across different data sets.
Regarding probability of error, if it’s easy to acquire this kind of information, why not do it?
Infinite exchangeability does not apply to a finite population. What do you do when I gather *all* the 25 cows from the farm and measure them? You cannot pretend that there are infinitely many cows in the farm. You can, however, wonder about the number of cows (2, 5, 10, 25?) you really need to measure to characterize all the 25 with reasonable precision.
I maintain that future is unknowable. Any kind of a statement regarding the performance of a particular classifier trained from data should always be seen as relative to the data set.
This still isn’t answering my question: Why is 0.03 useful? I can imagine using an error rate in decision making. I can imagine using a confidence interval on the error rate in decision making. But, I do not know how to use 0.03 in any useful way.
Note that 0.99 means 0.99 average classification error across multiple 90:10 splits. 0.99(0.03) should mean something else if 0.03 is useful.
Your comment on exchangeability makes more sense now. In this situation, what happens is that (basically) you trade using a Binomial distribution for a Hypergeometric distribution to analyze the number of errors on the portion of the set you haven’t seen. The trade Binomial->Hypergeometric doesn’t alter intuitions very much because the distributions are fairly similar (Binomial is a particular limit of the Hypergeometric, etc…)
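A quick way to see how similar the two distributions are is to compute both directly. The sketch below measures the total variation distance between the number of errors in a sample of 20 drawn with replacement (Binomial) and without replacement (Hypergeometric) from a finite population; the sample size, the 10% error rate, and the two population sizes are hypothetical values picked for illustration.

```python
import math

def binom_pmf(k, n, p):
    """P(k errors in n draws with replacement at error rate p)."""
    return math.comb(n, k) * p**k * (1 - p)**(n - k)

def hypergeom_pmf(k, n, bad, total):
    """P(k errors in n draws without replacement from `total` items, `bad` of them errors)."""
    return math.comb(bad, k) * math.comb(total - bad, n - k) / math.comb(total, n)

def total_variation(n, error_rate, total):
    """Total variation distance between the two models for the error count."""
    bad = round(error_rate * total)
    p = bad / total
    return 0.5 * sum(
        abs(binom_pmf(k, n, p) - hypergeom_pmf(k, n, bad, total))
        for k in range(n + 1)
    )

tv_small = total_variation(n=20, error_rate=0.1, total=50)    # e.g. a small farm
tv_large = total_variation(n=20, error_rate=0.1, total=5000)  # a large population
print(tv_small, tv_large)
```

The gap shrinks as the population grows relative to the sample, which is the limiting relationship mentioned above; for a population only slightly larger than the sample, the difference is no longer negligible.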
0.03 gives you an indication of reliability, stability of a classifier. This relates to the old bias/variance tradeoff. A short bias/variance reading list:
Neural networks and the bias/variance dilemma
Bias, Variance, and Arcing Classifiers
A Unified Bias-Variance Decomposition for Zero-One and Squared Loss
This still isn’t the answer I want. How is 0.03 useful? How do you use it?
The meaning of “stability” here seems odd. It seems to imply nothing about how the algorithm would perform for new problems or even for a new draw of the process generating the current training examples. Why do we care about this very limited notion of stability?
If you don’t mind a somewhat philosophical argument, examine Figure 5 in Modelling Modelled. The NBC becomes highly stable beyond 150 instances. On the other hand, C4.5 has a higher average utility, but also a greater variation in its utility on the test set. Is it meaningful to compare both methods when the training set consists of ~100 instances? The difference in expected utility is negligible in comparison to the total amount of variation in performance.
This still isn’t answering my question. How and why do you use 0.03? There should be a simple answer to this, just like there are simple answers for 0.99 and for confidence intervals about 0.99.
(I don’t want to spend time debating what is and is not “meaningful”, because that seems too vague.)
(0.03) indicates how much the classification accuracy is affected by the choice of the training data across the experiments. It quantifies the variance of the learned model. It describes that the estimate of classification accuracy across test sets of a certain size is not a number, it is a distribution.
I get my distribution of expected classification accuracy through sampling, and the only assumption is the fixed choice of the relative size of the training and test set. The purpose of (0.03) is to stress that the classification accuracy estimate depends on the assignment of instances to training or test set. You get your confidence interval starting from an arbitrary point estimate “0.99” along with a very strong binomial assumption, one that is invalidated by the above sampling experiments. It’s a simple answer alright, but a very dubious set of assumptions.
By now, I’ve listed sufficiently many papers that attempt to justify the bias/variance problem, and the purpose of (0.03) should be apparent in the context of this problem. Do you have a good reason for disagreeing with the whole issue of bias/variance decomposition?
I know what (0.03) indicates, but this still doesn’t answer my question. How do we _use_ it? How is this information supposed to affect the choices that we make? The central question is whether or not (0.03) is relevant to decision making, and I don’t yet see that relevance.
“Binomial distribution” is not the assumption. Instead, it is the implication. The assumption is iid samples. This assumption is not always true, but none of the experiments in the ‘modelling modeled’ reference seem to be the sort which disprove the correctness of the assumption. In particular, cutting up the data in several different ways and learning different classifiers with different observed test error rates cannot disprove the independence assumption.
This reminds me of Lance’s post on calibrating weather prediction numbers. The weatherman tells us that the (subjective) probability of rain tomorrow is 0.8. How do (should) we use that? Now suppose we know something about the prior he used to come up with the 0.8 estimate. Does that change the way we use the number?
Re: Yaroslav – Yes, if the prior doesn’t match our own prior, we can squeeze out the update and update *our* prior.
Re: John – If you accept the bias/variance issue, then (0.03) is interesting and therefore intrinsically useful. I guess you don’t buy this. It concerns the estimation of risk, second-order probability (probability-of-probability), etc. The issue is that you cannot characterize the error rate reliably, and must therefore use a probability distribution. This is the same pattern as with introducing error rate because you cannot say whether a classifier is always correct or always wrong.
A more practical utility is comparing two classifiers in two cases. In one case, the classifier A gets the classification accuracy of 0.88(0.31) and B gets 0.90(0.40). What probability would you assign to the statement “A is better than B?” in the absence of any other information? Now consider another experiment, where you get 0.88(0.01) for A and 0.90(0.01) for B.
Why would I want to assign a probability to “A is better than B”? How would you even do that given this information? And what does “better” mean?
a) What is the definition you use to do model selection? b) Any assignment is based upon a particular data set. c) “better” – lower aggregate loss on the test set.
a) I am generally inclined to avoid model selection because it is a source of overfitting. I would generally rather make a weighted integration of predictions. If pressed for computational reasons, I might choose the classifier with the smallest cross validation or validation set error rate.
I still don’t understand why you want to assign a probability.
b) I don’t understand your response. You give examples of 0.88(0.01) and 0.90(0.01). How do you use the 0.01 to decide?
c) I agree with your definition of better, as long as the test set is not involved in the cross validation process.
Interesting! Now I understand: all the stuff I’ve been talking about in this thread is very much about the tools and tricks in order to do model selection. But you dislike model selection, so obviously these tools and tricks may indeed seem useless.
a) If you have to make a choice, how easy is it for you to then state that A is better than B? It’s very rare that A would always be better than B. Instead, it may usually be better. Probability captures the uncertainty inherent to making such a choice. The probability of 0.9 means that in 90% of the test batches, A will be better.
b) With A:0.88(0.01) vs B:0.90(0.01), B will almost always be better than A. With A:0.88(0.1) vs B:0.90(0.1), we can’t really say which one will be better, and a choice could be arbitrary.
c) OK, but assume you have a certain batch of the data. That’s all you have. What do you do? Create a single test/train split, or create a bunch of them and ‘integrate out’ the dependence of your estimate on the particular choice?
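For what it is worth, the probability in (a) and (b) can be sketched numerically if one is willing to treat the per-batch scores of A and B as independent Gaussians with the reported means and standard deviations. That modelling choice is an assumption made here purely for illustration; the thread does not specify how the probability would be computed, and the function name is hypothetical.

```python
from statistics import NormalDist

def prob_b_beats_a(mean_a, sd_a, mean_b, sd_b):
    """P(score_B > score_A) on a test batch.

    ASSUMPTION: per-batch scores of A and B are independent Gaussians.
    The difference B - A is then Gaussian with the variances added.
    """
    diff = NormalDist(mean_b - mean_a, (sd_a**2 + sd_b**2) ** 0.5)
    return 1.0 - diff.cdf(0.0)

tight = prob_b_beats_a(0.88, 0.01, 0.90, 0.01)  # A:0.88(0.01) vs B:0.90(0.01)
loose = prob_b_beats_a(0.88, 0.10, 0.90, 0.10)  # A:0.88(0.1)  vs B:0.90(0.1)
print(round(tight, 3), round(loose, 3))
```

Under these assumptions the first case gives roughly 0.92 and the second roughly 0.56, so the spread changes the answer from "B usually wins" to "close to a coin flip". If the scores come from overlapping cross-validation folds, the independence assumption is exactly the part that is in question.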
Regarding the purpose of model selection. I’m sometimes working with experts, e.g. MD’s, who gathered the data and want to see the model. I train SVM, I train classification trees, I train NBC, I train many other things. Eventually, I would like to give them a single nicely presented model. They cannot evaluate or teach this ensemble of models. They won’t get insights from an overly complex model, they need something simpler, something they can teach/give to their ambulance staff to make decisions. So the nitty-gritty reality of practical machine learning has quite an explicit model complexity cost.
And one way of dealing with model complexity is model selection. It’s cold and brutal, but it gets the job done. The above probability is a way of quantifying how unjustified or arbitrary it is in a particular case. If it’s too brutal and if the models are making independent errors, then one can think about how to approximate or present the ensemble. Of course, I’d want to hand the experts the full Bayesian posterior, but how do I print it out on an A4 sheet of paper so that the expert can compare it to her intuition and experience?
Of course, I’m not saying that everyone should be concerned about model complexity and presentability. I am just trying to justify its importance to applied data analysis.
I understand that some form of predictor simplification/model selection is sometimes necessary.
a) I still don’t understand why you want to assign a probability to one being better than another. If we accept that model selection/simplification must be done, then it seems like you must make a hard choice. Why are probabilities required?
b) The reasoning about B and A does not hold on future data in general (and I am not interested in examples where we have already measured the label). In particular, I can give you learning algorithm/problem pairs in which there is a very good chance you will observe something which looks like a significant difference over cross validation folds, but which is not significant. The extreme example mentioned in this post shows you can get 1.00(0.00) and 0.00(0.00) for two algorithms producing classifiers with the same error rate.
c) If I thought there was any chance of a time ordering in the data, I would use a single train/test split with later things in the test set. I might also be tempted to play with “progressive validation” (although that’s much less standard). If there was obviously no time dependence, I might use k-fold cross validation (with _small_ k) and consider the average error rate a reasonable predictor of future performance. If I wanted to know roughly how well I might reasonably expect to do in the future and thought the data was i.i.d. (or effectively so), I would use the test set bound.
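The first two options in (c) can be sketched as follows. The progressive validation loop reflects one common reading of the term (each held-out example is predicted by a hypothesis trained on everything that came before it), which is an assumption on my part; the majority-label learner and the drifting toy data are hypothetical stand-ins.

```python
def time_ordered_split(examples, n_test):
    """Hold out the last n_test examples; `examples` must already be sorted by time."""
    cut = len(examples) - n_test
    return examples[:cut], examples[cut:]

def majority_learner(train):
    """Toy learner (hypothetical): predict 1 iff 1 is the strict majority training label."""
    ones = sum(y for _, y in train)
    guess = 1 if 2 * ones > len(train) else 0
    return lambda x: guess

def error_rate(predict, test):
    """Fraction of test examples the predictor gets wrong."""
    return sum(predict(x) != y for x, y in test) / len(test)

def progressive_validation(examples, n_initial, learner):
    """Train on everything seen so far, test on the next example, then absorb it."""
    mistakes = 0
    for i in range(n_initial, len(examples)):
        predict = learner(examples[:i])
        x, y = examples[i]
        mistakes += predict(x) != y
    return mistakes / (len(examples) - n_initial)

# Toy time series: the label switches from 0 to 1 at time 30.
examples = [(t, 0 if t < 30 else 1) for t in range(100)]

train, test = time_ordered_split(examples, n_test=20)
holdout_error = error_rate(majority_learner(train), test)
pv_error = progressive_validation(examples, n_initial=40, learner=majority_learner)
print(holdout_error, pv_error)
```

Neither procedure ever lets a later example influence a prediction about an earlier one, which is the property that shuffled k-fold cross validation gives up.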
a) I consider 10-fold cross-validation to be a series of 10 experiments. For each of these experiments, we obtain a particular error rate. For a particular experiment, A might be better than B, but for a different experiment B would be better than A. Both probability and the standard deviations are ways of modelling the uncertainty that comes with this. If I cannot make a sure choice, and if modelling uncertainty is not too expensive, why not model it?
b) Any fixed method can be defeated by an adaptive adversary. I’m looking for a sensible evaluation protocol that will discount both overfitting and underfitting, and I realize that nothing is perfect.
c) I agree with your suggestions, especially with the choice of a small ‘k’. Still, I would stress that cross-validation is to be replicated multiple times, with several different permutations of the fold-assignment vector. Otherwise, the results are excessively dependent on a particular assignment to folds. If something affects your results, and if you are unsure about it, then you should not keep it fixed, but vary it.
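A minimal sketch of what replicating cross-validation over several fold assignments could look like is given below. The synthetic 1-D data, the nearest-centroid learner, the choice of 5 folds and 50 replications, and all function names are hypothetical choices for illustration only.

```python
import random
import statistics

def make_data(n, rng):
    """Synthetic 1-D, two-class data (hypothetical stand-in for a real data set)."""
    data = []
    for _ in range(n):
        y = rng.randint(0, 1)
        data.append((rng.gauss(2.0 * y, 1.0), y))
    return data

def fit_nearest_centroid(train):
    """Toy learner: assign x to the class whose training mean is closest."""
    means = {c: statistics.fmean(x for x, y in train if y == c) for c in (0, 1)}
    return lambda x: min(means, key=lambda c: abs(x - means[c]))

def kfold_accuracy(data, k, rng):
    """One k-fold run with a fresh random assignment of instances to folds."""
    idx = list(range(len(data)))
    rng.shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    correct = 0
    for fold in folds:
        held_out = set(fold)
        train = [data[i] for i in idx if i not in held_out]
        predict = fit_nearest_centroid(train)
        correct += sum(predict(data[i][0]) == data[i][1] for i in fold)
    return correct / len(data)

rng = random.Random(0)
data = make_data(200, rng)

# Replicate the whole cross-validation with 50 different fold assignments.
scores = sorted(kfold_accuracy(data, k=5, rng=rng) for _ in range(50))
low, median, high = scores[2], statistics.median(scores), scores[-3]
print(f"accuracy: median {median:.3f}, roughly 5th-95th percentile [{low:.3f}, {high:.3f}]")
```

The reported spread describes only the sensitivity to the fold assignment on this one data set; it says nothing by itself about a fresh draw of the data.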
a) I consider the notion that 10-fold cross validation is 10 experiments very misleading, because there can exist very strong dependencies between the 10 “experiments”. It’s like computing the average and standard deviations of the wheel locations of race car #1 and race car #2. These simply aren’t independent, and so the amount of evidence they provide towards “race car #1 is better than race car #2” is essentially the same as the amount of evidence given by “race car #1 is in front of race car #2”.
b) Pleading “but nothing works in general” is not convincing to me. In the extreme, this argument can be used to justify anything. There are some things which are more robust than other things, and it seems obvious that we should prefer the more robust things. If you use confidence intervals, this nasty example will not result in nonsense numbers, as it does with the empirical variance approach.
You may try to counterclaim that there are examples where confidence intervals fail, but the empirical variance approach works. If so, state them. If not, the confidence interval approach at least provides something reasonable subject to a fairly intuitive assumption. No such statement holds for the empirical variance approach.
c) I generally agree, as time allows.
I agree about b), but continue to disagree about a). The argument behind it is somewhat intricate. We’re estimating something random with a non-random set of experiments. Let me pose a small problem/analogy: if you wanted to use monte carlo sampling to estimate the area of a certain shape in 2D, but you can only take 10 samples, would you draw these samples purely at random? You would not, because you would risk the chance that you’d sample the same point twice, and would gain no information. Cross-validation is a bit like that: it tries to diversify the samples in order to get a better estimate with fewer samples. Does it make sense?
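The Monte Carlo half of this analogy is easy to check in isolation. The sketch below estimates the area of a disc inside the unit square with 10 samples, drawn either purely at random or one per cell of a 5 x 2 grid; the shape, the grid layout, and the number of repetitions are hypothetical choices. It says nothing about whether the analogy carries over to cross-validation.

```python
import random
import statistics

def inside(x, y):
    """Shape to measure: a disc of radius 0.5 centred in the unit square."""
    return (x - 0.5) ** 2 + (y - 0.5) ** 2 <= 0.25

def random_estimate(rng, n=10):
    """Area estimate from n purely random points in the unit square."""
    return sum(inside(rng.random(), rng.random()) for _ in range(n)) / n

def stratified_estimate(rng, cols=5, rows=2):
    """Area estimate from one jittered point per cell of a cols x rows grid."""
    hits = 0
    for i in range(cols):
        for j in range(rows):
            x = (i + rng.random()) / cols
            y = (j + rng.random()) / rows
            hits += inside(x, y)
    return hits / (cols * rows)

rng = random.Random(0)
plain = [random_estimate(rng) for _ in range(5000)]
strat = [stratified_estimate(rng) for _ in range(5000)]

var_plain = statistics.pvariance(plain)
var_strat = statistics.pvariance(strat)
print("true area 0.7854, means:", statistics.fmean(plain), statistics.fmean(strat))
print("variances:", var_plain, var_strat)
```

Both estimators are unbiased for the area, and spreading the points out lowers the variance of the estimate, which is the intuition being appealed to here.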
No, it does not. Cross validation makes samples which are (in analogy) more likely to be the same than independent samples. That’s why you can get the 1.00(0.00) or 0.00(0.00) behavior.
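One construction that produces this behavior under leave-one-out cross validation is sketched below. The labels are pure coin flips, so no learner can have a true accuracy other than 0.5, yet a learner that predicts the parity of the training labels (and its negation) gets identical scores on every fold. Whether this is exactly the example referred to above is an assumption; the learner names are hypothetical.

```python
import random
import statistics

def parity_learner(train_labels):
    """Predict the parity of the number of 1-labels seen in training."""
    return sum(train_labels) % 2

def anti_parity_learner(train_labels):
    """Predict the opposite of the training-label parity."""
    return 1 - sum(train_labels) % 2

def leave_one_out_scores(labels, learner):
    """Per-fold accuracy (0 or 1) for leave-one-out cross validation."""
    scores = []
    for i, y in enumerate(labels):
        train = labels[:i] + labels[i + 1:]
        scores.append(int(learner(train) == y))
    return scores

rng = random.Random(1)
labels = [rng.randint(0, 1) for _ in range(50)]  # coin flips: nothing to learn

a = leave_one_out_scores(labels, parity_learner)
b = leave_one_out_scores(labels, anti_parity_learner)
summary_a = (statistics.fmean(a), statistics.pstdev(a))
summary_b = (statistics.fmean(b), statistics.pstdev(b))
print("parity learner:      mean %.2f (sd %.2f)" % summary_a)
print("anti-parity learner: mean %.2f (sd %.2f)" % summary_b)
```

Which learner ends up with 1.00(0.00) and which with 0.00(0.00) depends only on whether the total number of 1-labels is even or odd. Removing one example flips the training parity exactly when that example's label is 1, so every fold is right or every fold is wrong together: the folds are perfectly dependent.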
Back to this tar baby. I understand your concern, but it is inherent to *sampling without replacement* of instances as contrasted to *sampling with replacement* of instances. I was not arguing bootstrap or iid versus training/test split or cross-validation. I was arguing for cross-validation compared to random splitting into the training and test set.
It’s quite clear that i.i.d. is often incompatible with sampling without replacement, and I can demonstrate this experimentally. In some cases, i.i.d. is appropriate (large populations, random sampling), and in other cases splitting is appropriate (finite populations, exhaustive or stratified sampling). These two stances should be kept apart and not mixed, as seems to be the fashion. What should be a challenge is to study learning in the latter case.
I don’t understand what is meant by “incompatible” here.
Assuming m independent samples, what we know (detailed here) is that K-fold cross validation has a smaller variance, skew, or other higher order moment than a random train/test split with the test set of size m/K. We do not and cannot (fully) know how much smaller this variance is. There exist examples where K-fold cross validation has the same behavior as a random train/test split.
If you want to argue that cross-validation is a good idea because it removes variance, I can understand that. If you want to argue that the individual runs with different held out folds are experiments, I disagree. This really is like averaging the position of wheels on a race car. It reduces variance (i.e. doesn’t let a race car with a missing wheel win), but it is still only one experiment (i.e. one race). If you want more experiments, you should not share examples between runs of the learning algorithm.
Incompatible means that assuming i.i.d. within the classifier will penalize you if the classifier is evaluated using cross-validation: the classifier is not as confident as it can afford to be. I’m not arguing that CV is better, I’m just arguing that it’s different. I try to be agnostic with respect to evaluation protocols, and adapt to the problem at hand. CV tests some things, bootstrap other things, each method has its pathologies, but advocating a single individual train/test split is complete rubbish unless you’re in a highly cost-constrained adversarial situation.
But now I’ll play the devil’s advocate again. Assume that I’m training on 10% and testing on 90% of data in “-10”-fold CV. Yes, the experiments are not independent. Why should they be? Why shouldn’t I exhaustively test all the tires of the car in four *dependent* experiments? Why shouldn’t I test the blood pressure of every patient just once, even if this makes my experiments dependent? Why shouldn’t I hold out for validation each and every choice of 10% of instances? Why is having this kind of dependence any less silly than sampling the *same* tire multiple times in order to keep the samplings “independent”? Would it be less silly than sampling just one tire and computing a bound based on that single measurement, as any additional measure could be dependent? Why is using a Gaussian to model the heights of *all* the players in a basketball team silly, even if the samples are not independent?
The notion that “advocating a single individual train/test split is complete rubbish except in a cost constrained adversarial situation” is rubbish. As an example, suppose you have data from wall street and are trying to predict stock performance. This data is cheap and plentiful, but the notion of using cross validation is simply insane due to the “survivor effect”: future nonzero stock price is a strong and unfair predictor of past stock price. If you try to use cross validation, you will simply solve the wrong problem.
What’s happening here is that cross validation relies upon identicality of data in a far more essential manner than just having a training set and a test set. It is essential to understand this in considering methods for looking at your performance.
For your second point, I agree with the idea of reducing variance via cross validation (see second paragraph of comment 42) when the data is IID. What I disagree with is making confidence interval-like statements about the error rate based upon these nonindependent tests. If you want to know that one race car is better than another, you run them both on different tracks and observe the outcome. You don’t average over their wheel positions in one race and pretend that each wheel position represents a different race.
Well, of course neither cross-validation nor bootstrap makes sense when the assumption of instance exchangeability is clearly not justified. It was very funny to see R. Kalman make this mistake in http://www.pnas.org/cgi/content/abstract/101/38/13709/ – a journalist noticed this and wrote a pretty devastating paper on why peer review is important. My comment on “rubbish” was in the context of the validity of instance exchangeability, of course.
Regarding your note on “reducing variance”: I believe that you’re trying to find some benefit of cross-validation in the context of IID. Although you might do that, the crux of my message is that finite exchangeability (FEX) exercised by CV is different from infinite exchangeability (iid) exercised by bootstrap. Finite exchangeability has value on its own, not just as an approximation to infinite exchangeability. In fact, I’d consider finite exchangeability as primary, and infinite exchangeability as an approximation to it. I guess that your definition of confidence interval is based upon IID, so if I do “confidence intervals” based on FEX, it may look wrong.
I hope that I understand you correctly. What I’m suggesting is to allow for and appreciate the assumption of finite exchangeability, and build theory that accommodates it. Until then, it would be unfair to dismiss empirical work assuming FEX in some places just because most theory work assumes IID.
I’ve worked on FEX confidence intervals here. The details change, but not the basic message w.r.t. the IID assumption.
The basic issue we seem to be debating, regardless of assumptions about the world, is whether we should think of the different runs of cross validation as “different” experiments. I know of no reasonable assumption under which the answer is “yes” and many reasonable assumptions under which the answer is “no”. For this conversation to be further constructive, I think you need to (a) state a theorem and (b) argue that it is relevant.
[...] Drug studies. Pharmaceutical companies make predictions about the effects of their drugs and then conduct blind clinical studies to determine their effect. Unfortunately, they have also been caught using some of the more advanced techniques for cheating here: including “reprobleming”, “data set selection”, and probably “overfitting by review”. It isn’t too surprising to observe this: when the testers of a drug have $10^9 or more riding on the outcome the temptation to make the outcome “right” is extreme. [...]
Useful list. Should be made required reading for students of ML.