机器学习系统设计(一)

确定工作的优先级

以垃圾邮件分类器算法为例,要对垃圾邮件分类,我们首先要确定特征变量x,且其个数通常不超过100。例如:在实际开发过程中,我们通常选择训练集中出现频率为10000~50000的单词作为特征变量。

为了降低误差,我们通常的办法有:

  • 收集更多的数据;
  • 基于邮件的路由信息(标题信息)设计一系列复杂的特征;
  • 基于邮件的正文信息设计一系列复杂的特征,例如:discount与discounts是否为同一单词等;
  • 探测(刻意的)拼写错误设计一套复杂的算法。

尽管我们有上述方法来帮助我们降低误差,但我们只是随机的选择一种方法来尝试降低误差。因此,我们还必须学习误差分析来帮助我们如何更系统地选择相关方法来降低误差。

补充笔记
Prioritizing What to Work On

System Design Example:

Given a data set of emails, we could construct a vector for each email. Each entry in this vector represents a word. The vector normally contains 10,000 to 50,000 entries gathered by finding the most frequently used words in our data set. If a word is to be found in the email, we would assign its respective entry a 1, else if it is not found, that entry would be a 0. Once we have all our x vectors ready, we train our algorithm and finally, we could use it to classify if an email is a spam or not.

机器学习系统设计(一)_第1张图片

So how could you spend your time to improve the accuracy of this classifier?

  • Collect lots of data (for example "honeypot" project but doesn't always work)
  • Develop sophisticated features (for example: using email header data in spam emails)
  • Develop algorithms to process your input in different ways (recognizing misspellings in spam).

It is difficult to tell which of the options will be most helpful.

误差分析

在构建学习系统时,我们推荐如下步骤:

  1. 从一个简单的能快速实现的算法开始,实现该算法并使用交叉验证集测试该算法;
  2. 绘制学习曲线,决定优化方法;
  3. 进行误差分析:人工检查交叉验证集中算法做出的错误预测,找出这些错误样本中的某种特征。

以垃圾邮件分类器为例,观察垃圾邮件:

  • 是否可能按照某种类别分类,例如:广告类、钓鱼类等;
  • 是否存在某种特征,例如:邮件路由信息特殊、邮件内容存在刻意拼写错误等;
    ......

通过观察垃圾邮件,我们可以发现某种特征,进而帮助我们改进算法,从而降低误差。

但误差分析只能帮助我们找到算法可优化的地方,其并不提供具体优化的方法。因此,我们需要引入数值评估方法。

我们仍以垃圾邮件分类为例,现假设算法对邮件中存在“discount”这个单词即可判为垃圾邮件,那么邮件中存在“discounts”这个对于该算法而言,其判定为非垃圾邮件。因此,我们通过数值评估得出误差为5%。

现在,我们通过误差分析找到可优化的地方,并引入词干提取法对算法进行优化,通过数值评估得出误差为3%。

从上例中,我们可以看出当我们通过误差分析找到算法可优化的地方,我们可以立即采取优化措施,并通过数值评估立即观察到该优化措施的结果。因此,我们推荐在误差分析找到可优化的地方后,可立即采取相应的优化措施并使用数值评估观察优化结果。

注:我们推荐在交叉验证集上使用数值评估。

Question:
Why is the recommended approach to perform error analysis using the cross validation data used to compute JCV(θ) rather than the test data used to compute Jtest(θ)?

A. The cross validation data set is usually large.
B. This process will give a lower error on the test set.
C. If we develop new features by examining the test set, then we may end up choosing features that work well specifically for the test set, so Jtest(θ) is no longer a good estimate of how well we generalize to new examples.
D. Doing so is less likely to lead to choosing an excessive number of features.

结合之前所学的,我们不难选出C这个正确答案。

补充笔记
Error Analysis

The recommended approach to solving machine learning problems is to:

  • Start with a simple algorithm, implement it quickly, and test it early on your cross validation data.
  • Plot learning curves to decide if more data, more features, etc. are likely to help.
  • Manually examine the errors on examples in the cross validation set and try to spot a trend where most of the errors were made.

For example, assume that we have 500 emails and our algorithm misclassifies a 100 of them. We could manually analyze the 100 emails and categorize them based on what type of emails they are. We could then try to come up with new cues and features that would help us classify these 100 emails correctly. Hence, if most of our misclassified emails are those which try to steal passwords, then we could find some features that are particular to those emails and add them to our model. We could also see how classifying each word according to its root changes our error rate:

机器学习系统设计(一)_第2张图片

It is very important to get error results as a single, numerical value. Otherwise it is difficult to assess your algorithm's performance. For example if we use stemming, which is the process of treating the same word with different forms (fail/failing/failed) as one word (fail), and get a 3% error rate instead of 5%, then we should definitely add it to our model. However, if we try to distinguish between upper case and lower case letters and end up getting a 3.2% error rate instead of 3%, then we should avoid using this new feature. Hence, we should try new things, get a numerical value for our error rate, and based on our result decide whether we want to keep the new feature or not.

你可能感兴趣的:(机器学习系统设计(一))