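The data sampling widget we introduced last time was simple and direct as far as its data channels go. The widget was designed to receive data from one widget, process it, and send a token out through another channel, as in the schema below: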
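There is more to managing channels and tokens than that. Here we give an overview of the more complex cases; knowing them will help you build widgets that handle multiple outputs and multiple inputs.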
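Simply put, a "multi-input" channel is one through which a widget can be connected to the output channels of several other widgets. That way, data from several sources can be fed into a single widget for processing, much as a function can take several arguments.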
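Say we want to build a widget that takes a data set and tests various predictive models on it. The widget must have an input data channel, and we already know how to handle that. What is different is that we want to connect several widgets to it, as in the schema below: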
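We will see how to define the channels of the learning curve widget and how to manage its multiple input tokens. But first, a brief explanation: a learning curve is used to test how well a machine learning algorithm performs on a training set of a particular size. For this purpose, a subset of the data is sampled, a classifier is learned from it, and the classifier is then tested on the rest of the data. To do this (following Salzberg, 1997), we perform k-fold cross validation but use only a proportion of the data for training. The resulting widget looks like this: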
Now, back to channels and tokens. The channels for our widget are defined like this:
    inputs = [
        ("Data", Orange.data.Table, "set_dataset"),
        ("Learner", Orange.classification.Learner, "set_learner",
         widget.Multiple + widget.Default),
    ]
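You may have noticed that this is mostly the same as in the widgets we defined before, except for widget.Multiple + widget.Default (from the Orange.widgets.widget namespace), given as the last item in the definition of the Learner input channel. widget.Multiple + widget.Default says that this is a multi-input channel and that it is the default input for its type. If this item is not specified, widget.Single is used by default, which means that the widget can receive input from only one widget on that channel and that the channel is not the default one (more on default channels later).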
Note: the Default flag is used here for illustration. Since "Learner" is the only channel of the Orange.classification.Learner type, it is also the default.
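How such flags combine can be sketched in plain Python. The numeric values below are hypothetical and chosen only for illustration (the real constants live in Orange.widgets.widget and may differ); the point is that each flag occupies its own bit, so adding two flags gives the same result as OR-ing them.

```python
# Hypothetical flag values for illustration only; the real constants are
# defined in Orange.widgets.widget and their numeric values may differ.
SINGLE = 2
MULTIPLE = 4
DEFAULT = 8


def describe(flags=SINGLE):
    """Return a readable description of a channel flag combination."""
    kind = "multi-input" if flags & MULTIPLE else "single-input"
    default = "default" if flags & DEFAULT else "non-default"
    return f"{kind}, {default}"


# Because the flags use distinct bits, + and | are interchangeable here.
print(describe(MULTIPLE + DEFAULT))
print(describe())  # no flags given: falls back to SINGLE
```

With these assumed values, omitting the flags describes a single-input, non-default channel, which mirrors the behaviour described above.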
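In Orange, tokens are sent along with the id of the sending widget. Declaring a multi-input channel simply tells Orange to send the token together with the sending widget's id; these are the two arguments with which the receiving function is called. For our "Learner" channel, the receiving function is set_learner(), and it looks like this: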
    def set_learner(self, learner, id):
        """Set the input learner for channel id."""
        if id in self.learners:
            if learner is None:
                # remove a learner and corresponding results
                del self.learners[id]
                del self.results[id]
                del self.curves[id]
            else:
                # update/replace a learner on a previously connected link
                self.learners[id] = learner
                # invalidate the cross-validation results and curve scores
                # (will be computed/updated in `_update`)
                self.results[id] = None
                self.curves[id] = None
        else:
            if learner is not None:
                self.learners[id] = learner
                # initialize the cross-validation results and curve scores
                # (will be computed/updated in `_update`)
                self.results[id] = None
                self.curves[id] = None

        if len(self.learners):
            self.infob.setText("%d learners on input." % len(self.learners))
        else:
            self.infob.setText("No learners.")

        self.commitBtn.setEnabled(len(self.learners))
OK, this looks a bit long and complicated. But be patient! Learning curve is not the simplest of widgets.
The function contains some extra code for managing the information specific to this widget. To understand the signals, though, you only need to follow the mechanism described next. We store the learners (objects that learn from data) in an OrderedDict, self.learners. This dictionary maps each input id to the input value (the input learner itself). The reason this is an OrderedDict is that the order of the input learners matters: we want to maintain a consistent column order in the table view of the learning curve point scores.
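The function above first checks whether the channel id is already in self.learners. If it is, the function either deletes the corresponding entries when learner is None (remember: receiving None means that the link was removed or closed), or invalidates the cross validation results and curve points for that channel id, marking them for update in handleNewSignals(). The case where we receive a learner for a new channel id is handled similarly.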
Remember that sending an empty learner (None) essentially means that the link with the sending widget was removed, hence we need to remove that learner from our list. If a non-empty learner was sent, it is either a new learner (say, from a widget we have just linked to our learning curve widget) or an updated version of a previously sent learner. In the latter case, its id is already in the learners list, and we need to replace the previous information on that learner. If a new learner was sent, the case is somewhat simpler: we just add this learner and its learning curve to the corresponding variables that hold this information.

The function that handles learners, as shown above, is the most complicated function in our learning curve widget. The rest of the widget does some simple GUI management and calls learning curve routines from the testing and performance scoring functions in evaluation.
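To watch this bookkeeping without any GUI, here is a minimal sketch. LearnerInputs is a hypothetical stand-in for the widget, plain strings stand in for learner objects, and the curves dictionary is left out for brevity; it is not the widget's actual code.

```python
from collections import OrderedDict


class LearnerInputs:
    """Hypothetical, GUI-free stand-in for the widget's learner bookkeeping."""

    def __init__(self):
        self.learners = OrderedDict()  # link id -> learner
        self.results = OrderedDict()   # link id -> cached results, or None

    def set_learner(self, learner, id):
        if learner is None:
            # Link removed: forget everything stored for this id, if any.
            self.learners.pop(id, None)
            self.results.pop(id, None)
        else:
            # New or updated learner: store it and invalidate its results.
            self.learners[id] = learner
            self.results[id] = None


inputs = LearnerInputs()
inputs.set_learner("naive bayes", id=1)
inputs.set_learner("tree", id=2)
inputs.set_learner("tuned naive bayes", id=1)  # update: id 1 keeps its position
print(list(inputs.learners.items()))
inputs.set_learner(None, id=1)                 # link with id 1 was removed
print(list(inputs.learners.items()))
```

Assigning to an existing key of an OrderedDict keeps the key's original position, which is what keeps the table columns in a stable order when a connected learner is merely updated.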
Note that in this widget the evaluation (k-fold cross validation) is carried out just once for a given learner, data set and set of evaluation parameters, and the scores are then derived from the class probability estimates obtained from the evaluation procedure. This means that switching from one scoring function to another (and displaying the result in the table) takes only a split second. To see the rest of the widget, check out its code.
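The idea of evaluating once and scoring many times can be sketched as follows. The cached values are made up for illustration, and the two scoring functions (classification accuracy and Brier score) are assumptions, not necessarily the ones the widget offers.

```python
# Hypothetical cached output of a single cross-validation run:
# the true class index and the predicted class probabilities per instance.
actual = [0, 1, 1, 0]
probabilities = [
    [0.9, 0.1],
    [0.2, 0.8],
    [0.6, 0.4],
    [0.7, 0.3],
]


def accuracy(actual, probabilities):
    """Share of instances whose most probable class is the true class."""
    hits = sum(p.index(max(p)) == a for a, p in zip(actual, probabilities))
    return hits / len(actual)


def brier(actual, probabilities):
    """Mean squared difference between probabilities and the one-hot truth."""
    total = 0.0
    for a, p in zip(actual, probabilities):
        total += sum((pk - (1.0 if k == a else 0.0)) ** 2
                     for k, pk in enumerate(p))
    return total / len(actual)


# Switching the score only re-reads the cached probabilities;
# no learner is retrained.
scorers = {"CA": accuracy, "Brier": brier}
for name, scorer in scorers.items():
    print(name, round(scorer(actual, probabilities), 3))
```

Because every score is a cheap function of the stored probabilities, changing the displayed score does not require repeating the cross validation.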
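There is nothing new here; we only need a widget with several output channels in order to demonstrate default channels (used below). For this purpose, we modify the data sampling example we built earlier so that the sampled data goes out through one channel and the remaining data through another. The corresponding channel definition is: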
    outputs = [
        ("Sampled Data", Orange.data.Table),
        ("Other Data", Orange.data.Table),
    ]
We use this in the third variant of the data sampler widget. The changes are mainly in the selection() and commit() functions:
    def selection(self):
        if self.dataset is None:
            return

        n_selected = int(numpy.ceil(len(self.dataset) * self.proportion / 100.))
        indices = numpy.random.permutation(len(self.dataset))
        indices_sample = indices[:n_selected]
        indices_other = indices[n_selected:]
        self.sample = self.dataset[indices_sample]
        self.otherdata = self.dataset[indices_other]
        self.infob.setText('%d sampled instances' % len(self.sample))

    def commit(self):
        self.send("Sampled Data", self.sample)
        self.send("Other Data", self.otherdata)
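The split performed in selection() can be tried outside the widget with a small sketch. It assumes a plain Python list in place of an Orange.data.Table and the standard random module in place of numpy; split_sample is a hypothetical helper name.

```python
import math
import random


def split_sample(dataset, proportion, rng=random):
    """Split `dataset` into (sample, other), sampling `proportion` percent."""
    n_selected = int(math.ceil(len(dataset) * proportion / 100.0))
    indices = list(range(len(dataset)))
    rng.shuffle(indices)  # random permutation of the row indices
    sample = [dataset[i] for i in indices[:n_selected]]
    other = [dataset[i] for i in indices[n_selected:]]
    return sample, other


rows = list(range(10))
sample, other = split_sample(rows, 30, random.Random(0))
print(len(sample), len(other))
```

The two parts are disjoint and together cover the whole data set, which is why the widget can send them out on two separate channels.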
If a widget has several channels of the same type, Orange Canvas opens a window asking the user which channels to connect. So, if we connect the Data Sampler (C) widget to a Data Table widget, as in:
we get the following window, asking the user which channels to connect:
Now, let's say we want to extend our learning curve widget so that it does the learning the same way as before but can, provided that such a data set is defined, always test the learners on the same external data set. That is, besides the training data set, we need another channel of the same type, used for the test data set. Notice, however, that most often we will provide only the training data set, so we would not like to be bothered (in Orange Canvas) with a dialog asking which channel to connect to; the training data set channel should be the default one.
When listing input channels of the same type, the default channel is marked with a special flag in the channel specification list. So for our new learning curve widget, the channel specification is:
    inputs = [
        ("Data", Orange.data.Table, "set_dataset", widget.Default),
        ("Test Data", Orange.data.Table, "set_testdataset"),
        ("Learner", Orange.classification.Learner, "set_learner",
         widget.Multiple + widget.Default),
    ]
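The Train Data channel (named "Data" in the specification above) is a single-token channel and the default one, as indicated by the widget.Default flag given as the fourth item of its tuple. Note that flags can be added (or OR-d) together, so Default + Multiple is a valid flag. To test that this works, connect a File widget to the learning curve widget and, well, nothing special happens: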
That is, no window asking which channels to connect is opened, because the default "Train Data" channel is selected.
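The effect of the default flag can be illustrated with a small sketch of how a canvas might decide whether it needs to ask the user. The rule below, the flag values and the string type names are all assumptions made for illustration; this is not Orange Canvas's actual resolution algorithm.

```python
SINGLE, MULTIPLE, DEFAULT = 2, 4, 8  # hypothetical flag values

# Channel specifications shaped like the widget's `inputs` list, with
# plain strings standing in for the Orange types.
inputs = [
    ("Data", "Table", "set_dataset", DEFAULT),
    ("Test Data", "Table", "set_testdataset"),
    ("Learner", "Learner", "set_learner", MULTIPLE + DEFAULT),
]


def pick_channel(inputs, token_type):
    """Return the channel name to connect, or None if the user must choose."""
    candidates = [spec for spec in inputs if spec[1] == token_type]
    if len(candidates) == 1:
        return candidates[0][0]
    defaults = [s for s in candidates if len(s) > 3 and s[3] & DEFAULT]
    if len(defaults) == 1:
        return defaults[0][0]
    return None  # ambiguous: a canvas would have to open a dialog


print(pick_channel(inputs, "Table"))
print(pick_channel(inputs, "Learner"))
```

Under this assumed rule, removing the DEFAULT flag from the "Data" entry would make a "Table" connection ambiguous again, which corresponds to the case where the dialog appears.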