《D o C P》Study Notes (3-1) Regular Expressions, Other Languages and Interpreters - Lesson 3

Note 1: Translating every video's English subtitles into Chinese takes too much time, so to speed up my progress I will pause that work and only add brief annotations to the English subtitles.
Note 2: Put the .flv video file and the matching .srt subtitle file from the Subtitles folder into the same directory, then open the video in Xunlei Kankan; the subtitles will load automatically.

Regular Expressions, other languages and interpreters

What you will learn:

  • Defining a language of regular expressions, and interpreting that language.
  • Defining the set of strings matched by a regular expression.
  • Other languages.

Lesson 3

Video link:
Lesson 3 - Udacity

There is some great supplementary material for this unit in the class wiki. For novice programmers, this unit may be the most challenging of the class. But if you can make it through this, you’ll know you can handle the rest of the class.

Course Syllabus

Lesson 3: Regular Expressions, Other Languages and Interpreters

  • Lesson 3 Course Notes (mainly web pages of the English subtitles for the course videos.)
  • Lesson 3 Code
  • Help with Regular Expressions

1. Introduction

Introduction - Design of Computer Programs - YouTube

Vocabulary from the video that I didn't know

spread: n. spreading out; expansion

homo: n. man, human

sapiens: adj. of or like modern humans

homo sapiens: Homo sapiens (the scientific name for modern humans)

fancy: vt. to imagine; to want

distinguish: vt. to differentiate; to set apart

screwdriver: a tool for turning screws

pliers: a gripping tool (spelled "plyers" in the subtitles)

specific: particular; concrete

specifically: particularly; explicitly

swell: great; wonderful

prize: n. a reward

surgeon's forceps: surgical forceps

refine: vt. to improve; to purify

trick: n. a trick; a knack; a ruse

mechanism: a mechanical structure or device

craftsman: an artisan; a skilled worker

blame: to blame; to attribute fault to

whine: to complain; to grumble

employ: vt. to hire; to use

malleable: ductile; adaptable

My translation of the core part of this video

In this lesson we will study software tools. We will look at some general-purpose tools and some very specialized tools.

In particular, we will discuss two topics.
First, languages. Language may be humanity's greatest invention, our greatest tool. In computer programming, a program is written in a language, but a program can also use a language as a tool to get its work done.
The other tool we will discuss is functions. We will look at new techniques for using functions and learn why they are more general and more flexible than other tools.

2. Regular Expressions Review - Part 1

Regular Expressions Review - Part 1 - Design of Computer Programs - YouTube

Vocabulary from the video that I didn't know

feel free: to do as one pleases

convention: a custom; common practice

by convention: according to custom

exclamation: a cry; a shout

an exclamation point: the "!" character

specify: to state precisely; to describe in detail

finite: limited; bounded

infinite: unlimited; boundless

subsequent: following; later

notation: a system of written symbols

asterisk: the "*" character

preceding: coming before

My translation of the core part of this video

This video briefly explains what problem regular expressions solve.

The string.find(substring) method can only look for a fixed, finite set of substrings. For example, with s = 'some long thing with words', s.find('word') can only look for that one specified, literal substring, whereas the pattern 'baa*' can match an unbounded number of different substrings.

In 'baa*', the asterisk * means any number of repetitions of the preceding character.
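This can be spot-checked with Python's built-in re module (a quick illustration of my own, not course code):

```python
import re

# 'baa*' matches a 'b', then an 'a', then zero or more further 'a's,
# so it matches 'ba!', 'baa!', 'baaaa!', and so on.
texts = ['ba!', 'baaaa!', 'b!', 'sheep']
results = [re.search('baa*', t) is not None for t in texts]
print(results)  # [True, True, False, False]
```

Note that 'b!' fails: 'baa*' requires at least one 'a' after the 'b'.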

3. Regular Expressions Review - Part 2

Regular Expressions Review - Part 2 - Design of Computer Programs - YouTube

Vocabulary from the video that I didn't know

occurrence: an instance of something happening; an event

asterisk: the "*" character

question mark: the "?" character

dot: the "." character

period: the "." that ends a sentence

exclamation: an exclamation mark

irregular: not regular

specific: concrete; particular

conjunction: joining; a connective

concatenation: a linked series of things

convention: a custom

strike out: to delete

My translation of the core part of this video

This video introduces the elements of regular expressions used for character pattern matching.

Redrawing the table would be too much trouble, so here I just add a few points the table leaves out.

Asterisk *: if a pattern is followed by the metacharacter *, the pattern repeats zero or more times. (Allowing a pattern to repeat zero times means the pattern does not have to appear at all for the match to succeed. The Python Standard Library by Example, Chapter 1 Text, Section 1.3 re, 1.3.4 Pattern Syntax, pp. 12-13.)

Question mark ?: the pattern appears zero or one time.

Dot .: the metacharacter . matches any single character at that position.

.*: any number of any characters.
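A small demonstration of these metacharacters with the standard re module (my own illustration):

```python
import re

# '?' makes the preceding pattern optional: zero or one occurrence.
print(re.match('colou?r', 'color') is not None)   # True
print(re.match('colou?r', 'colour') is not None)  # True

# '.' matches any single character, so 't.p' matches 'typ' in 'atypical'.
print(re.search('t.p', 'atypical') is not None)   # True

# '.*' matches any run of characters.
print(re.search('a.*z', 'abcxyz') is not None)    # True
```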

4. Exercise: Regular Expressions Review - Quiz 1

Regular Expressions Review - Quiz 1 - Design of Computer Programs - YouTube

Vocabulary from the video that I didn't know

exclamation: an exclamation mark

be up to: to be capable of

in terms of: by means of; expressed with

the other way around: in the opposite direction

reduce to: to reduce to; to bring down to

onward: adv. forward; from that point on

constrain: to restrict; to constrain

distinction: a difference

My translation of the core part of this video

The main content of this video:

Python's re module has two functions, search and match. Here we implement these two functions ourselves, with simplified behavior: they return only True or False.

def search(pattern, text):
    "Return true if pattern appears anywhere in text."

def match(pattern, text):
    "Return True if pattern appears at the start of text."

def test():
    assert search('baa*!', 'Sheep said baaaa!')
    assert search('baa*!', 'Sheep said baaaa humbug') == False
    assert match('baa*!', 'Sheep said baaaa!') == False
    assert match('baa*!', 'baaaaaaaaa! said the sheep')

In addition, there are many more test cases, covering the dollar sign $, the caret ^, the asterisk *, and the question mark ?, plus more details: testing a pattern against multiple possible words, at the end, at the start, in the middle, and checking the dot character.

assert all(match('aa*bb*cc*$', s)
           for s in words('abc aaabbccc aaaabcccc'))
assert not any(match('aa*bb*cc*$', s)
               for s in words('ac aaabbcccd aaaa-b-cccc'))
assert all(search('^ab.*aca.*a$', s)
           for s in words('abracadabra abacaa about-acacia-f'))
assert all(search('t.p', s)
           for s in words('tip top tap atypical tepid stop .'))
assert not any(search('t.p', s)
               for s in words('TYPE teepee tp'))

Peter says we will write search in terms of match, although it could also be done the other way around.
The real distinction is whether the pattern starts with a caret ^.
What I am going to do is check whether the pattern starts with ^. The only place a caret really means anything is as the first character of a pattern (for example, ^b matches any string starting with b; see the previous video and screenshot). If ^ appears anywhere else, we will simply treat it as an error and not handle it.

Try this coding problem:

def search(pattern, text):
    """Return true if pattern appears anywhere in text
       Please fill in the match(          , text) below.
       For example, match(your_code_here, text)"""
    if pattern.startswith('^'):
        return match(              , text) # fill this line
    else:
        return match(              , text) # fill this line

def match(pattern, text):
    "Return True if pattern appears at the start of text."
    pass

Personally I find this problem a bit abstract. I have not thought carefully about how to implement search; for now I have only thought through part of match, and my code is unfinished.

def match(pattern, text):
    if pattern.startswith('^'):
        if pattern[1:] == text[0: sth.]:
            return True

To explain, here is an example. Suppose pattern = '^b', which corresponds to strings such as ba, bach, bcmg, and so on. In this pattern, ^ is character 0. If the string formed by the pattern's characters from index 1 onward equals the text's characters from index 0 up to some position, we should return True.

I could not work out how to solve this, so let's look at Peter's answer:

def search(pattern, text):
    "Return True if pattern appears anywhere in text."
    if pattern.startswith('^'):
        return match(pattern[1:], text)
    else:
        return match('.*' + pattern, text)

def match(pattern, text):
    "Return True if pattern appears at the start of text."

Peter's explanation:

And the answer is a search with an up arrow constrains the pattern to match right at the beginning.
So that’s the same as just doing a match with the rest of the pattern.
And the rest of the pattern is just the pattern from position one, onward.

Now, if the pattern doesn’t start with an up arrow and we’re doing a search, that means the search could be anywhere within the text.
But that’s the same as asking for a pattern that has any number of characters at the beginning, a match of any character any number of times, which is the same as adding dot star .* to the start of the pattern.

My understanding:

An example helps.

For example, ^b means any string that starts with b.
search(pattern = '^b', text) looks in text for a string starting with b, anchored at the beginning.
match(pattern = 'b', text) also looks in text for a string starting with b.
Generalizing: if pattern starts with ^, then search(pattern, text) does the same thing as match(pattern[1:], text).

.* means a string made of any number of any characters.
If pattern does not start with ^, then search(pattern, text) does the same thing as match('.*' + pattern, text).
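The second equivalence can be spot-checked with the standard re module (my own illustration; re.match anchors at the start of the text, like our match):

```python
import re

pattern, text = 'baa*!', 'Sheep said baaaa!'

# A search without '^' behaves like a match with '.*' glued on the front.
by_search = re.search(pattern, text) is not None
by_match = re.match('.*' + pattern, text) is not None
print(by_search, by_match)  # True True
```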

5. Exercise: Regular Expressions Review - Quiz 2

Regular Expressions Review - Quiz 2 - Design of Computer Programs - YouTube

Vocabulary from the video that I didn't know

concatenation: a linked series of things

My translation of the core part of this video

def match(pattern, text):
    """
    Return True if pattern appears at the start of text

    For this quiz, please fill in the return values for:
        1) if pattern == '':
        2) elif pattern == '$':
    """

    if pattern == '':
        return # fill in this line
    elif pattern == '$':
        return # fill in this line
    # you can ignore the following elif and else conditions
    # We'll implement them in the next quiz
    elif len(pattern) > 1 and pattern[1] in '*?':
        return True
    else:
        return True

This section asks us to implement match for the cases where the pattern is the empty string '' or '$'.
I could not understand these two cases on my own, so let's look at Peter's answer.

Peter's answer:

def match(pattern, text):
    "Return True if pattern appears at the start of text."
    if pattern == '':
        return True
    elif pattern == '$':
        return (text == '')
    elif len(pattern) > 1 and pattern[1] in '*?':
        pass  # implemented in the next quiz
    else:
        pass  # implemented in the next quiz

The answer: if the pattern is the empty string '', then it matches anything, because the empty string appears in every text, even when the text itself is ''. So in every case we simply return True.

If the pattern is the dollar sign, it matches only at the very end of the text. So we want to return True only when text is the empty string ''. So we say: return "is text equal to the empty string?" If that is True, we return True; if it is False, we return False.
(A note of my own on the pattern-is-$ case. Normally, for example, a$ matches any string ending in a. A pattern that is just a single $ can be viewed as the empty string plus the dollar sign, ''$, which I read as "any string ending in ''"... not easy to grasp. I will just take Peter's conclusion and code and move on.)
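The same behavior shows up in the standard re module (my own illustration): '$' matches an empty string, but only at the end of the text.

```python
import re

# re.match anchors at the start, so the pattern '$' alone can only
# succeed when the start of the text is also its end, i.e. empty text.
print(re.match('$', '') is not None)     # True
print(re.match('$', 'abc') is not None)  # False
```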

6. Exercise: Regular Expressions Review - Quiz 3

Regular Expressions Review - Quiz 3 - Design of Computer Programs - YouTube

My translation of the core part of this video

def match(pattern, text):
    "Return True if pattern appears at the start of text."
    if pattern == '':
        return True
    elif pattern == '$':
        return (text == '')
    elif len(pattern) > 1 and pattern[1] in '*?':
        pass  # saved for last; see the next video
    else:
        return (match1(pattern[0], text) and
                match(             ,           ))

A character followed by a * or a ?, the hardest part, is saved for last.

First look at the final case, the else clause. By the time we reach it, we know the pattern is not empty (the first clause, if pattern == '', ruled that out), and we know we have already checked for all the special characters. So the first character is not a special character, except possibly a '.', which we have not handled yet. So we want to match that first character, and then we want to match the rest of the pattern, which may be empty or may be something more (this splits the problem into two parts).

So what I am going to do is invent a new function, which I call match1, which matches one single character. If that single character, the first character of the pattern, matches against the text, we want to return True, provided we can also recursively match the rest of the pattern against the rest of the text.

So the meaning here is: "at this point we have no special character, except possibly a dot." If we have, say, b followed by a, we want to match the b against the text, and then match the a against something else. So fill in here what I should be matching against, to match the rest of the pattern against the rest of the text.

Peter's answer:

    else:
        return (match1(pattern[0], text) and
                match(pattern[1:], text[1:]))

7. Regular Expressions Review - Part 3

Regular Expressions Review - Part 3 - Design of Computer Programs - YouTube

Vocabulary from the video that I didn't know

operator: a symbol for an operation

inline: inline; in the flow of the code

internal: inner; inside

representation: a way of representing something

My translation of the core part of this video

The last part: one character followed by an asterisk * or a question mark ?.

    elif len(pattern) > 1 and pattern[1] in '*?':
        p, op, pat = pattern[0], pattern[1], pattern[2:]
        if op == '*':
            return match_star(p, pat, text)
        elif op == '?':
            if match1(p, text) and match(pat, text[1:]):
                return True
            else:
                return match(pat, text)

The pattern is some string like 'A*!'. The first part, p, is the single character before the *. The second part is the operator op, either an asterisk * or a question mark ?. The third part, pat, is the remainder of the pattern: whatever comes after the * or ?.

Now there are two cases.
If the operator op is an asterisk *, we call another function, match_star. It will be recursive.
If the operator op is a question mark ?, then if we can match the first character p against the text, and match the rest of the pattern after the question mark against the rest of the text, we return True; otherwise we try matching the rest of the pattern against the unchanged text. (I won't type out the rest; to speed things up, watch the video.)

def match1(p, text):
    """Return true if first character of text matches
    pattern character p."""
    if not text: return False
    return p == '.' or p == text[0]

def match_star(p, pattern, text):
    """Return true if any number of char p,
    followed by pattern, matches text."""
    return (match(pattern, text) or                # p occurs zero times
            (match1(p, text) and                   # p occurs once...
             match_star(p, pattern, text[1:])))    # ...then zero or more times
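Putting the quiz answers and the snippets above together gives a complete, runnable version of this string-based matcher (the assembly into one program is mine; each piece comes from the course videos):

```python
def search(pattern, text):
    "Return True if pattern appears anywhere in text."
    if pattern.startswith('^'):
        return match(pattern[1:], text)   # '^' anchors the match at the start
    else:
        return match('.*' + pattern, text)

def match(pattern, text):
    "Return True if pattern appears at the start of text."
    if pattern == '':
        return True                       # '' matches everywhere
    elif pattern == '$':
        return text == ''                 # '$' matches only at the end
    elif len(pattern) > 1 and pattern[1] in '*?':
        p, op, pat = pattern[0], pattern[1], pattern[2:]
        if op == '*':
            return match_star(p, pat, text)
        else:                             # op == '?': p is optional
            return ((match1(p, text) and match(pat, text[1:]))
                    or match(pat, text))
    else:
        return (match1(pattern[0], text) and
                match(pattern[1:], text[1:]))

def match1(p, text):
    "Return True if first character of text matches pattern character p."
    if not text:
        return False
    return p == '.' or p == text[0]

def match_star(p, pattern, text):
    "Return True if any number of char p, followed by pattern, matches text."
    return (match(pattern, text) or
            (match1(p, text) and match_star(p, pattern, text[1:])))

print(search('baa*!', 'Sheep said baaaa!'))  # True
print(match('baa*!', 'Sheep said baaaa!'))   # False
```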

A Regular Expression Matcher: Rob Pike's article introducing what regular expressions are, how to write a program that processes them, and what they can do.

For now, in this program, a regular expression is just a string, and it is easy to process, because all we have to deal with is the string's first character, second character, and so on.

But regular expressions can get more complicated. They can have nested structure, and are then more like trees than strings.
In the rest of this unit we will work with an internal, tree-based representation of regular expressions. That will be different from this one. You can still take input in string form and translate it into a tree, but from now on we will work with a tree structure.

So this should give you a good enough introduction and review to handle the rest of the class.

8. Language

Language - Design of Computer Programs - YouTube

Vocabulary from the video that I didn't know

bizarre: strange; odd

reference: a reference; a reference book

customize: to tailor; to make to order

matrices: plural of matrix

concatenate: to link together in a series

optimization: making something as good as possible

octane: octane (a component of fuel)

fuels: fuels

component: a part; an element

cover: to include; to deal with

My translation of the core part of this video

Python

  • statements

  • expressions

  • format (string formatting)
    For example:

print '%(language)s has %(#)03d quote types.' % {
          'language': "Python", "#": 2}
  • class / operator overloading.
    For example, x + y can equal 5. But you can also say that x and y are your own data types. They might be matrices. They might be parts of some structure you are building, and x + y can mean putting them together. They might be two shapes that you concatenate together on the screen.

You can define your own language on top of the types of objects you want to work with. You can go further and define your own domain-specific language.

Here is an example.

[Figure 1: a domain-specific language describing an optimization problem]

This is a language for describing an optimization problem, involving octane prices and different types of fuel. You can describe these parameters and then build a language processor that takes the figure above as input and computes the output.

Of course, you could also handle this with ordinary Python statements, but what we have done here is design a language specific to this problem and then write the problem description in that language.

In this unit we will cover: what a language is, what a grammar is, the difference between a compiler and an interpreter, and how to use language as a design tool.

9. Regular Expressions

Regular Expressions - Design of Computer Programs - YouTube

Vocabulary from the video that I didn't know

in any event: whatever happens; in any case

overview: a general survey; a review

correspond to: to match; to agree with

compositional: built up from parts

literal: taken at face value

repetition: repeating

My translation of the core part of this video

We will start with the language of regular expressions. They can be expressed as strings.

A grammar is a description of a language.
Examples: a*b*c*, a+b?.

A language is a set of strings.
For the grammar a*b*c* above, the language contains abc, aaabcc, b, '', ccccc, and so on.

This form of grammar description, one long sequence of characters, is convenient when you are typing something quickly, but it is hard to work with. Grammar expressions get long.

So we are going to describe possible grammars in a more compositional format.
In other words, what I will describe is an API, which stands for application programming interface.
We will describe a series of function calls that can be used to describe the grammar of a regular expression. We will show that a regular expression can be built from these kinds of calls.

A literal of some string s:
lit(s)
lit('a') describes the language consisting of just the string 'a' and nothing else, i.e. {a}.

We have the API call sequence of x and y:
seq(x, y)
seq(lit('a'), lit('b')) consists of just the string 'ab', i.e. {ab}.

Then we can say alternatives of x and y:
alt(x, y)
alt(lit('a'), lit('b')) allows two possibilities, the string 'a' or the string 'b', i.e. {a, b}.

We use the standard notation for the name of this API call:
star(x), standing for any number of repetitions, zero or more.
star(lit('a')) is the empty string, or 'a', or 'aa', and so on, i.e. {'', a, aa, ......}.

oneof(c) takes a string of possible characters. That's the same as the alternation of all the individual characters.
oneof('abc') matches a or b or c, i.e. {a, b, c}. It is a restricted version of alt.

We use the symbol eol, standing for end of line. It matches only at the very end of the text, never anywhere else. What it matches is the empty string, but it matches only at the end.
eol
Example: eol matches the empty string '', i.e. {''}.
Example: seq(lit('a'), eol) matches exactly 'a' at the end, and nothing else.

Then we add dot, which matches any possible character.
dot
Example: dot matches any character of the alphabet, {a, b, c......}.
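As implemented later in these notes (section 13), each constructor returns a nested tuple of the form (op, x, y); a quick sketch of how a pattern like a*b gets built:

```python
# Minimal versions of three constructors defined later in the lesson.
def lit(string): return ('lit', string)
def seq(x, y):   return ('seq', x, y)
def star(x):     return ('star', x)

# a*b as a tree of tuples rather than the string 'a*b'.
pattern = seq(star(lit('a')), lit('b'))
print(pattern)  # ('seq', ('star', ('lit', 'a')), ('lit', 'b'))
```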

10. Specifications

Specifications - Design of Computer Programs - YouTube

Vocabulary from the video that I didn't know

processor: a machine or program that processes data

literal: at face value; literal

My translation of the core part of this video

Python has a regular expression module called re. The point here is not so much to use regular expressions heavily as to show you what building a language processor looks like.

I wrote a test function that defines some regular expressions using this API.

I defined two functions corresponding to the ones in the re module.

search(pattern, text) takes a pattern and a text. In the re module, the corresponding search function returns a match object. Here, though, we will make it return a string: the string that is the earliest match of the pattern in the text, and if there is more than one match at the same position, the longest of those.
For example, search('def', 'abcdef') returns 'def', because that is where it was found.

The next function is called match(pattern, text) and returns the string that matches.
But for the match function, the match must occur at the start of the text.
For example, match('def', 'abcdef') returns None, indicating no match.
match('def', 'def fn(x)') returns 'def'.

Supplementary material below this video on Udacity:

Again, see the wiki for help here. We define this language of patterns or regular expressions, where s is a string, c is also a string denoting a set of characters, and x and y are patterns formed from one of the entries in the chart.

[Figure 2: the chart of pattern constructors]

11. Concept Inventory

Concept Inventory - Design of Computer Programs - YouTube

Vocabulary from the video that I didn't know

start out: to begin; to set out

inventory: a list of items

notion: a concept; an idea

match against: to match with

partial: incomplete; in part

iteration: looping; iteration

literal: literal; at face value

notation: a system of symbols

tricky: complicated; requiring care

remainder: what is left over

auxiliary: helping; supplementary

consume: to use up

resolve: to decide; to break down

My translation of the core part of this video

I am now going to make an inventory of the concepts we will be using.

pattern: the notion of a pattern, or regular expression.

text: the notion of the text matched against the pattern.

result: a string of some kind.

partial result: the notion of a partial result.

control over the iteration: some notion of controlling the looping.
The usual examples are simple, e.g. the pattern 'def' and the text 'abcdef'.
Here is a more complex example. The pattern 'a*b+' means any number of a's followed by one or more b's.
In our API notation, we write it as

seq(star(lit('a')),
    plus(lit('b')))

Suppose it should match the string 'aaab'.
Now, if we had a control structure, say for sequence, that matches the first part and then the second part, and if the first part, star(lit('a')), had only one possible result, then it would say: I just matched at the start of 'aaab'; now look for something after that. Does it match plus(lit('b'))? No.
So I've got to have iteration over the possible results. I must allow that star(lit('a')) can match in more than one position. It can match 0, 1, 2, or 3 a's, and only after all three can we then match the second part, find the b, and thereby match the whole expression.
That is going to be a tricky part of this control: when a sequence fails to match, or similarly when we have an alternative between two possibilities a or b, alt(a, b).
The complexity here looks like it will be hard, but it decomposes completely once we make one good choice. Seeing what that choice might be takes some experience, but if we decide to represent these partial results as a set of remainders of the text, then everything falls into place.
What does remainder mean? I mean everything else after the match.
If we are matching star(lit('a')) and we match zero characters of the string 'aaab', the remainder will be 'aaab'.
If we match one 'a', the remainder will be 'aab'. And so on.
Now what I am going to do is define an auxiliary function, called matchset(pattern, text), that returns this set of remainders.
When we are given the pattern star(lit('a')) as input and 'aaab' as the text, the remainders will be the set {aaab, aab, ab, b}.
In other words, star(lit('a')) may consume zero, one, two, or three of the a's.
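The remainder set can be cross-checked by brute force with the standard re module (my own illustration, not course code):

```python
import re

text = 'aaab'
# a* can match the prefixes '', 'a', 'aa', 'aaa'; each choice of prefix
# leaves a different remainder of the text.
remainders = {text[i:] for i in range(len(text) + 1)
              if re.fullmatch('a*', text[:i])}
print(remainders == {'aaab', 'aab', 'ab', 'b'})  # True
```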

12. Exercise: Matchset

Matchset - Design of Computer Programs - YouTube

Vocabulary from the video that I didn't know

outcome: a result

accordingly: correspondingly; therefore

My translation of the core part of this video

Peter gives this problem:

#----------------
# User Instructions
#
# The function, matchset, takes a pattern and a text as input
# and returns a set of remainders. For example, if matchset 
# were called with the pattern star(lit(a)) and the text 
# 'aaab', matchset would return a set with elements 
# {'aaab', 'aab', 'ab', 'b'}, since a* can consume one, two
# or all three of the a's in the text.
#
# Your job is to complete this function by filling in the 
# 'dot' and 'oneof' operators to return the correct set of 
# remainders.
#
# dot:   matches any character.
# oneof: matches any of the characters in the string it is 
#        called with. oneof('abc') will match a or b or c.

def matchset(pattern, text):
    "Match pattern at start of text; return a set of remainders of text."
    op, x, y = components(pattern)
    if 'lit' == op:
        return set([text[len(x):]]) if text.startswith(x) else null
    elif 'seq' == op:
        return set(t2 for t1 in matchset(x, text) for t2 in matchset(y, t1))
    elif 'alt' == op:
        return matchset(x, text) | matchset(y, text)
    elif 'dot' == op:
        return # your code here
    elif 'oneof' == op:
        return # your code here
    elif 'eol' == op:
        return set(['']) if text == '' else null
    elif 'star' == op:
        return (set([text]) |
                set(t2 for t1 in matchset(x, text)
                    for t2 in matchset(pattern, t1) if t1 != text))
    else:
        raise ValueError('unknown pattern: %s' % pattern)

null = frozenset()

def components(pattern):
    "Return the op, x, and y arguments; x and y are None if missing."
    x = pattern[1] if len(pattern) > 1 else None
    y = pattern[2] if len(pattern) > 2 else None
    return pattern[0], x, y

def test():
    assert matchset(('lit', 'abc'), 'abcdef')            == set(['def'])
    assert matchset(('seq', ('lit', 'hi '),
                     ('lit', 'there ')), 
                   'hi there nice to meet you')          == set(['nice to meet you'])
    assert matchset(('alt', ('lit', 'dog'), 
                    ('lit', 'cat')), 'dog and cat')      == set([' and cat'])
    assert matchset(('dot',), 'am i missing something?') == set(['m i missing something?'])
    assert matchset(('oneof', 'a'), 'aabc123')           == set(['abc123'])
    assert matchset(('eol',),'')                         == set([''])
    assert matchset(('eol',),'not end of line')          == frozenset([])
    assert matchset(('star', ('lit', 'hey')), 'heyhey!') == set(['!', 'heyhey!', 'hey!'])

    return 'tests pass'

print test()

Explanation of the code above:
components(pattern) decomposes a pattern into three components: op, an operator; an x part; and a y part, depending on the definition. For example, a literal, 'lit', has only an x component; a sequence, 'seq', and an alternative, 'alt', have both an x and a y.

If the operator is asking for a literal string x ('lit' == op), we ask: does the text start with x? If so, then the remainder will be a singleton set, a set of just one element, which is the remainder of the text after we have broken off the first len(x) characters: set([text[len(x):]]). If we matched a three-character string for x, we return the text without the first three characters. If x was one character, we return the text without the first character. Otherwise, we return the null set: there is no match.

(My own note on the line return set([text[len(x):]]) if text.startswith(x) else null:
for example, with text = 'aaab' and x = 'a', if the operator op is 'lit', then the remainder is 'aab', i.e. set([text[len(x):]]).)

    elif 'alt' == op:
        return matchset(x, text) | matchset(y, text)

| means union: the union of the two sets.

    elif 'seq' == op:
        return set(t2 for t1 in matchset(x, text) for t2 in matchset(y, t1))

For example, if this is our pattern, seq(star(lit('a')), plus(lit('b'))), and we are matching 'aaab', then x is star(lit('a')) and y is plus(lit('b')), and the matchset for x is the set {aaab, aab, ab, b}. We try to match y, the plus(lit('b')), against each of these remainders, and it fails to match against the first three of them. It does match against the last one, 'b', so now we have a match that consumes the entire string. The result from the match of this sequence of x and y will be the set consisting of just the empty string, {''}, because we've matched off all the a's and one b, and there's no remainder left over.

(Let me explain the paragraph above a bit.

For the pattern seq(star(lit('a')), plus(lit('b'))), i.e. 'a*b+', and the text 'aaab': op is 'seq', x is 'a*', and y is 'b+'.

In for t1 in matchset(x, text): since x is 'a*', x can match zero or more a's, and since text is 'aaab', t1 ranges over {aaab, aab, ab, b}.

In for t2 in matchset(y, t1): since y is 'b+', y must match one or more b's, so among {aaab, aab, ab, b} the only t1 that y matches is 'b', and the resulting remainder t2 is the empty string ''.

That also explains the last sentence, "The result from the match of this sequence of x and y will be the set consisting of just the empty string ({''}) ... there's no remainder left over": after a* consumes 'aaa' and b+ consumes 'b', nothing of the text is left, so the single remainder is ''.)


Note that there’s a big difference between the outcome of saying here’s a result consisting of one string, the empty string {''}, versus the set that is the empty set, {}. The empty set is a failed match, and the set consisting of the empty string is a successful match.

Now let’s see if you can fill in your code for these two missing cases here.

Remember, you’re going to be returning a set of possible remainders if the match is successful.

Matchset (Answer)

If op is a dot ('.'), it can match any single character of the text. If the text is non-empty, i.e. has at least one character, then text[1:] is the remainder. x and y can be ignored.

Here’s my answer. A dot matches any character in the text. If the text has at least one character, then there’s going to be a match, and the match is going to have a remainder, which is all the rest of the text.

    elif 'dot' == op:
        return set([text[1:]]) if text else null

That remainder is a set, and it’s the set of one element, and the element is the text from the first character on. In other words, dropping off the 0th character. We’re going to return that if there is any text at all. That is, if the text is not empty. Otherwise, if the text is empty then we’re going to return the null set. We defined the null set down here.

    elif 'oneof' == op:
        return set([text[1:]]) if text.startswith(x) else null
        # or, equivalently:
        # return set([text[1:]]) if any(text.startswith(c) for c in x) else null

This case is also fairly easy to understand.
For oneof('abc'), op is 'oneof' and x is ('a', 'b', 'c').

Note: the prefix argument of startswith(prefix) can be a tuple.

>>> s = 'are'
>>> s.startswith('a')
True
>>> s.startswith(('a', 'b'))
True

如果textx开始,那么remainder

How about for oneof? Oneof takes a string of possible characters, and what it should return is similar to dot. It should return the remaining characters if the first character is equal to one of the characters in x. We’re going to return exactly the same thing, a set consisting of a single answer, which is the original text with the first character dropped off. What I’d like to be able to say is: if the text starts with any of the characters in x. It turns out that I can say exactly that, text.startswith(x), if I arrange to have x be a tuple of characters rather than a character string.

str.startswith(prefix[, start[, end]])
    Return True if string starts with the prefix, otherwise return False.
    prefix can also be a tuple of prefixes to look for.
    With optional start, test string beginning at that position.
    With optional end, stop comparing string at that position.

Here I have the documentation from the Python manual for the string startswith method, and it says it’s true if the string starts with a prefix, so we can ask: does some string start with a particular string, yes or no? But prefix can also be a tuple of prefixes to look for. All we have to do is arrange for that x to be a tuple, and then it automatically searches for all of them.

What I’m saying is I want the API function oneof('abc'). That should return some object, but we’re not yet specifying what that object is, such that the operator of that object is oneof, and the x for that object should be the tuple ('a', 'b', 'c'). It’s a little optimization here that I’ve defined the API to construct an object of just the right form so that I can use it efficiently. Just go ahead and check: does the text start with any one of the possible x’s? Otherwise, return no. If you didn’t know about this form of text.startswith, you could have just checked whether the text starts with any of the characters c for c in x: return the rest of the string if the text starts with any one of those characters; otherwise, return null.
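A tiny sketch of this design point: because oneof stores its characters as a tuple, one str.startswith call tests all the alternatives at once.

```python
def oneof(chars): return ('oneof', tuple(chars))

op, x = oneof('abc')
print(x)                       # ('a', 'b', 'c')
# startswith accepts a tuple of prefixes and tries each of them.
print('banana'.startswith(x))  # True: 'banana' starts with 'b'
print('zebra'.startswith(x))   # False
```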

13. Exercise: Filling Out The Api

Filling Out The Api - Design of Computer Programs - YouTube

Vocabulary from the video that I didn't know

constructor: a function that builds an object

My translation of the core part of this video

This looks like a good time to define the rest of the APIs for the constructors for patterns. Here I’ve defined literal and sequence to say they’re going to return tuples where the first element is the operator, the second, if it exists, is x, and then the final element, if it exists, is the y. Those are for the components that take an operand or two, but dot and end of line don’t, so I just defined dot to be a one-element tuple. See if you can go ahead and fill in the remainder of this API.

#---------------
# User Instructions
#
# Fill out the API by completing the entries for alt, 
# star, plus, and eol.


def lit(string):  return ('lit', string)
def seq(x, y):    return ('seq', x, y)
def alt(x, y):    # ??
def star(x):      # ??
def plus(x):      # ??
def opt(x):       return alt(lit(''), x) #opt(x) means that x is optional
def oneof(chars): return ('oneof', tuple(chars))
dot = ('dot',)
eol = #??

def test():
    assert lit('abc')         == ('lit', 'abc')
    assert seq(('lit', 'a'), 
               ('lit', 'b'))  == ('seq', ('lit', 'a'), ('lit', 'b'))
    assert alt(('lit', 'a'), 
               ('lit', 'b'))  == ('alt', ('lit', 'a'), ('lit', 'b'))
    assert star(('lit', 'a')) == ('star', ('lit', 'a'))
    assert plus(('lit', 'c')) == ('seq', ('lit', 'c'), 
                                  ('star', ('lit', 'c')))
    assert opt(('lit', 'x'))  == ('alt', ('lit', ''), ('lit', 'x'))
    assert oneof('abc')       == ('oneof', ('a', 'b', 'c'))
    return 'tests pass'

print test()

13. Filling Out The Api (Answer)

Filling Out The Api Solution - Design of Computer Programs - YouTube

Here are the definitions I came up with. Note that I decided here to define plus and opt in terms of things that we had already defined — sequence and alt and star and lit. You could’ve just defined them to be similar to the other ones if you prefer that representation. It’s just a question of whether you want more work to go on here in the constructors (plus and opt) for the patterns or here in the definition of match_set.

def lit(string):  return ('lit', string)
def seq(x, y):    return ('seq', x, y)
def alt(x, y):    return ('alt', x, y)
def star(x):      return ('star', x)
def plus(x):      return seq(x, star(x))
def opt(x):       return alt(lit(''), x) #opt(x) means that x is optional
def oneof(chars): return ('oneof', tuple(chars))
dot = ('dot',)
eol = ('eol',)

Recall: seq(lit('a'), lit('b')) consists of just the string 'ab', i.e. {ab}.
Let me explain the line def plus(x): return seq(x, star(x)). plus(x) is x+, meaning at least one x.
star(x) is x*, meaning zero or more x's.
Putting x and star(x) together in sequence, seq(x, star(x)), means one or more x's.

14. Exercise: Search And Match

Search And Match - Design of Computer Programs - YouTube

This section mainly reviews the API from earlier.

Top left is the API:
match(p, t) -> '...' | None
search(p, t) -> '...' | None
Each returns one single string, which is a match.

Top right is an API of functions to create patterns:
lit(s)
alt(x, y)
seq(x, y)
and so on.

Below is the internal function:
matchset(p, t) -> {'...', '...'} returns a set of strings, which are remainders.

For any remainder, match + remainder equals the original text.

Let’s just review what we’ve defined in terms of our API. We have a function match and a function search, and they both take a pattern and a text, and they both return a string representing the earliest longest match.

match(p, t) -> '...'  | None
search(p, t) -> '...' | None

But for match the string would only return if it’s at the start of the string. For search, it’ll be anywhere within the string. If they don’t match, then they return None.

We’ve also defined an API of functions to create patterns: a literal string (lit(s)), an alternative between two patterns x and y (alt(x, y)), a sequence of two patterns x and y (seq(x, y)), and so on. That’s the API that we expect the programmer to program to. You create a pattern on this side and then you use a pattern over here against a text to get some result. Then below the line of the API, as a sort of internal function, we’ve defined matchset, which is not really designed for the programmer to call; the programmer is meant to go through this interface (match and search), but the function is there. It also takes a pattern and a text. Instead of returning a single string, which is a match, it returns a set of strings, which are remainders. For any remainder we have the constraint that the match plus the remainder equals the original text. Here I’ve written versions of search and match. We already wrote matchset. The one part that we missed out was the components function that pulls the op, x, and y components out of a tuple. I’ve left out two pieces of code here that I want you to fill in.

#---------------
# User Instructions
#
# Complete the search and match functions. Match should
# match a pattern only at the start of the text. Search
# should match anywhere in the text.

def search(pattern, text):
    "Match pattern anywhere in text; return longest earliest match or None."
    for i in range(len(text)):
        m = match(pattern, text[i:])
        if # your code here
            return m

def match(pattern, text):
    "Match pattern against start of text; return longest match found or None."
    remainders = matchset(pattern, text)
    if remainders:
        shortest = min(remainders, key=len)
        return # your code here

def components(pattern):
    "Return the op, x, and y arguments; x and y are None if missing."
    x = pattern[1] if len(pattern) > 1 else None
    y = pattern[2] if len(pattern) > 2 else None
    return pattern[0], x, y

def matchset(pattern, text):
    "Match pattern at start of text; return a set of remainders of text."
    op, x, y = components(pattern)
    if 'lit' == op:
        return set([text[len(x):]]) if text.startswith(x) else null
    elif 'seq' == op:
        return set(t2 for t1 in matchset(x, text) for t2 in matchset(y, t1))
    elif 'alt' == op:
        return matchset(x, text) | matchset(y, text)
    elif 'dot' == op:
        return set([text[1:]]) if text else null
    elif 'oneof' == op:
        return set([text[1:]]) if text.startswith(x) else null
    elif 'eol' == op:
        return set(['']) if text == '' else null
    elif 'star' == op:
        return (set([text]) |
                set(t2 for t1 in matchset(x, text)
                    for t2 in matchset(pattern, t1) if t1 != text))
    else:
        raise ValueError('unknown pattern: %s' % pattern)

null = frozenset()

def lit(string):  return ('lit', string)
def seq(x, y):    return ('seq', x, y)
def alt(x, y):    return ('alt', x, y)
def star(x):      return ('star', x)
def plus(x):      return seq(x, star(x))
def opt(x):       return alt(lit(''), x)
def oneof(chars): return ('oneof', tuple(chars))
dot = ('dot',)
eol = ('eol',)

def test():
    assert match(('star', ('lit', 'a')),'aaabcd') == 'aaa'
    assert match(('alt', ('lit', 'b'), ('lit', 'c')), 'ab') == None
    assert match(('alt', ('lit', 'b'), ('lit', 'a')), 'ab') == 'a'
    assert search(('alt', ('lit', 'b'), ('lit', 'c')), 'ab') == 'b'
    return 'tests pass'

print test()

(Quoting the two skeletons again, for the discussion that follows:)
def search(pattern, text):
    "Match pattern anywhere in text; return longest earliest match or None."
    for i in range(len(text)):
        m = match(pattern, text[i:])
        if # your code here
            return m

def match(pattern, text):
    "Match pattern against start of text; return longest match found or None."
    remainders = matchset(pattern, text)
    if remainders:
        shortest = min(remainders, key=len)
        return # your code here

def matchset(pattern, text):
    "Match pattern at start of text; return a set of remainders of text."
    ...

As mentioned at the start of this section, for any remainder we have match + remainder == the original text. So what the match function should return is text minus shortest.

That's as far as I can get on this problem for now; in the interest of time, let's look at Peter's answer.

14. Search and Match (Answer)

def match(pattern, text):
    "Match pattern against start of text; return longest match found or None."
    remainders = matchset(pattern, text)
    if remainders:
        shortest = min(remainders, key=len)
        return text[:len(text)-len(shortest)] # your code here

(A couple of notes on the code above:
shortest = min(remainders, key=len): the shortest remainder corresponds to the longest matched text;
text[:len(text)-len(shortest)]: per the docstring, the pattern is matched against the start of text, so the match runs from character 0 up to the end of the longest possible match.)

Let’s do match first. Match interfaces with the matchset, finds the set of remainders. If there is a set of remainders, then it finds the shortest one.

The shortest remainder should be the longest text, and then we want to return that.

We want to return the text, and it’s a match not a search, so the match has to be at the beginning of the text, and it would go up to match. So we match from the beginning of the text. How far do we want to go? Well, everything except the remainder.

How much is that? Well, we can just subtract(减去) from the length of the text the length of the shortest, and that gives us the initial piece of the longest possible match.
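The length arithmetic can be checked with a tiny hand-worked case (the strings here are my own, not from the course):

```python
# Toy illustration of the length arithmetic: the shortest remainder
# corresponds to the longest match.
text = 'aaab'
# what matchset(('star', ('lit', 'a')), text) returns: one remainder per split point
remainders = set(['aaab', 'aab', 'ab', 'b'])
shortest = min(remainders, key=len)               # 'b': least text left over
longest_match = text[:len(text) - len(shortest)]  # drop the remainder from the end
assert shortest == 'b'
assert longest_match == 'aaa'
```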

Here search calls into match. What we do is we start iterating. We start at position number zero. If there is a match there for the text starting at position zero, then we want to return it. If not, we increment i to 1, and we say does the text from position 1 on — is there a match there and so on. We just keep on going through until we find one.

def search(pattern, text):
    "Match pattern anywhere in text; return longest earliest match or None."
    for i in range(len(text)):
        m = match(pattern, text[i:])
        if m is not None: # your code here
            return m

(A couple of notes on the code above:
m = match(pattern, text[i:]): i counts up from 0, so we use match to scan the text starting from character 0, 1, 2, …, and return as soon as a match appears.
if m is not None: we do not write if m: because the empty string '' counts as a successful match here, yet if m: would evaluate it as False.)

Here what we want to say is if the match is not None, then return m. Notice that it would be a bad idea to say if m return m. Normally it’s idiomatic(符合语言习惯的) in Python to say that if we’re looking for a true value. But the problem here is that the empty string we want to count as a true value. A pattern might match the empty string. That counts as a match, not as a failure. We can’t just say if m because the empty string counts as false. We have to say if m is not None.
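The truthiness pitfall can be shown in two lines (a minimal illustration of my own):

```python
# Why `if m:` would be a bug here: the empty string is a legitimate
# match result, but it is falsy in Python.
m = ''                   # e.g. a pattern like lit('') matches with an empty result
assert not m             # `if m:` would treat this success as a failure...
assert m is not None     # ...while `if m is not None:` correctly keeps it
```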

15. Compiling

Compiling - Design of Computer Programs - YouTube

(Supplementary material below the video:
Note: there is a typo in the definition of lit(s). The x in len(x) and text.startswith(x) should be an s.)

Let’s quickly summarize(总结;概述) how a language interpreter works.

For regular expressions we have patterns like (a|b)+, which define languages. A language is a set of strings like {a, b, ab, ba, ...} and so on, defined by that pattern.

(We have interpreters such as matchset, which take a pattern and a text and return a list or a set of strings.
We say matchset is an interpreter because it takes a description of the language, i.e. a pattern as a data structure, and operates over that pattern.)
Then we have interpreters like matchset, which in this case takes a pattern and a text and returns a list of strings or a set of strings.

So we saw that matchset is an interpreter because it takes a description of the language, namely a pattern as a data structure, and operates over that pattern. Here’s the definition of matchset.

(The next few paragraphs describe how the interpreter works.)
You see it looks at the pattern, breaks out its components, and then the first thing it does is this big case statement to figure out what type of operator we have and to do the appropriate(适当的) thing.

There’s an inherent(固有的) inefficiency(无效率) here in that the pattern is defined once, and it’s always the same pattern over a long string of text and maybe over many possible texts.

We want to apply the same pattern to many texts. Yet every time we get to that pattern, we’re doing this same process of trying to figure out what operator we have when, in fact, we should already know that, because the pattern is static, is constant(持续的).

So this is kind of repeated work. We’re doing this over and over again for no good reason.

There’s another kind of interpreter called a “compiler” which does that work all at once. The very first time, when the pattern is defined, we do the work of figuring out which parts of the pattern are which, so we don’t have to repeat that every time we apply the pattern to a text. (This is why the compiler's approach beats the interpreter's.)

(What a compiler does, compared with an interpreter that takes a pattern and a text and operates on those two arguments.)
Where an interpreter takes a pattern and a text and operates on those, a compiler has two steps. In the first step, there is a compilation function, which takes just the pattern and returns a compiled object, which we’ll call c. Then there’s the execution of that compiled object, where we take c and apply it to the text (i.e. c(text)) to get the result.

In the interpreter, work may be done repeatedly, every time we have a text.

Here the work is split up. Some of it is done in the compilation stage(阶段) to get this compiled object. Then the rest of it is done every time we get a new text. Let’s see how that works.

Here is the definition of the interpreter. Let’s focus just on this line here:

def matchset(pattern, text):
    "Match pattern at start of text; return a set of remainders of text."
    op, x, y = components(pattern)
    if 'lit' == op:
        return set([text[len(x):]]) if text.startswith(x) else null

This says if the op is a literal, then we return this result.

The way I’m going to change this interpreter into a compiler is I’m going to take the individual statements like this that were in the interpreter, and I’m going to throw them into various parts of a compiler, and each of those parts is going to live in the constructor for the individual type of pattern.

We have a constructor — literal takes a string as input and let’s return a tuple that just represents what we have, and then the interpreter deals with that.

Now, we’re going to have literal act as a compiler. What it’s going to do is return a function that’s going to do the work.

def lit(s): return lambda text: set([text[len(s):]]) if text.startswith(s) else null

What is this saying?

We have the exact same expression here as we had before, but what we’re saying is that as soon as we construct a literal rather than having that return a tuple, what it’s returning is a function from the text to the result that matchset would have given us.
(This is not so easy to grasp at first.)
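To make the jump concrete, here is the compiled lit assembled from the course code and run directly (the test strings preview the shell session in the next section):

```python
# The compiler version of lit: calling the constructor returns a function
# from text to the set of remainders, instead of a tuple.
null = frozenset()

def lit(s):
    return lambda text: set([text[len(s):]]) if text.startswith(s) else null

pat = lit('a')
assert callable(pat)                      # a pattern is now a function, not a tuple
assert pat('a string') == set([' string'])
assert pat('banana') == null              # no match: empty set of remainders
```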

16. Lower Level Compilers

Lower Level Compilers - Design of Computer Programs - YouTube

We can define a pattern — let’s say pattern is lit('a').

>>> pat = lit('a')

Now what is a pattern? Well, it’s a function. It’s no longer a tuple.

>>> pat
<function <lambda> at 0x101b7bd70>

We can apply that pattern to a string and we get back the set of the remainders.

>>> pat('a string')
set([' string'])

It says, yes, we were able to successfully parse 'a' off of 'a string', and the remainder is ' string'.

We can define another pattern. Let’s say pattern 2 equals plus(pat).

>>> pat2 = plus(pat)
>>> pat2
<function <lambda> at 0x101b7bcf8>

Pattern 2 is also a function, and we can call pattern 2 of let’s say the string of five a’s followed by a b.

>>> pat2('aaaaab')
set(['b', 'ab', 'aaab', 'aaaab', 'aab'])
Now we get back this set that says we can break off any number of a's, because we're asking for a and the plus of that, the closure of a. These are the possible remainders if we break off all of the a's, or all but one, or all but three, and so on. Essentially(本质上) we're doing the same computation that, in the previous incarnation(前身) with an interpreter, we would have done with:

matchset(pat2, 'aaaaab')

Now we don’t have to do that. Now we’re calling the pattern directly. So we don’t have matchset, which has to look at the pattern and figure out, yes, the top-level pattern is a plus and the embedded pattern is a lit. Instead the pattern is now a composition(构图,布置;妥协) of functions, and each function does directly what it wants to do. It doesn’t have to look up what it should do.

In the interpreter we have a way of writing patterns that describes the language the patterns belong to. In a compiler there are two sets of descriptions to deal with: a description of what the patterns look like, and a description of what the compiled code looks like.

(In our case, the compiled code consists of Python functions.
There are other possibilities.
A compiler for a language like C generates actual machine instructions.
Java and Python use a virtual machine (VM).)
Now, in our case — the compiler we just built — the compiled code consists of Python functions. They're a good target representation because they're so flexible: you can combine them in lots of different ways, they can call each other, and so on. Functions are the best unit we have in Python for building up compiled code.

There are other possibilities. Compilers for languages like C generate code that’s the actual machine instructions for the computer that you’re running on, but that’s a pretty complicated process to describe a compiler that can go all the way down to machine instructions. It’s much easier to target Python functions.

Now there’s an intermediate(中间的) level where we target a virtual machine, which has its own set of instructions, which are portable across different computers. Java uses that, and in fact Python also uses the virtual machine approach, although it’s a little bit more complicated to deal with. But it is a possibility, and we won’t cover it in this class, but I want you to be aware of the possibility.

Here is what the so-called byte code from the Python virtual machine looks like. I’ve loaded the module dis for disassemble(反汇编) and dis.dis takes a function as input and tells me what all the instructions are in that function.

Here’s a function that takes the square root of x-squared plus y-squared.

(What Python byte code looks like.)

>>> import dis
>>> import math
>>> sqrt = math.sqrt
>>> dis.dis(lambda x, y: sqrt(x ** 2 + y ** 2))
  1           0 LOAD_GLOBAL              0 (sqrt)
              3 LOAD_FAST                0 (x)
              6 LOAD_CONST               1 (2)
              9 BINARY_POWER
             10 LOAD_FAST                1 (y)
             13 LOAD_CONST               1 (2)
             16 BINARY_POWER
             17 BINARY_ADD
             18 CALL_FUNCTION            1
             21 RETURN_VALUE

This is how Python executes that. It loads the square root function. It loads the x and the 2, and then does a binary power, loads the y and the 2, does a binary power, adds the first two things off the top of the stack, and then calls the function, which is the square root function with that value, and then returns it.

This is a possible target language, but much more complicated to deal with this type of code than to deal with composition of functions.

17. 练习:Alt

Alt - Design of Computer Programs - YouTube

Let’s get back to our compiler.

def matchset(pattern, text):
    ...
    elif 'seq' == op:
        return set(t2 for t1 in matchset(x, text) for t2 in matchset(y, t1))

Again, in matchset I pulled out one more clause. This is a clause for sequence, and this is what we return. If I want to write the compiler for that sequence clause, I would say let’s define seq(x, y).

The statement after `def seq(x, y): return lambda text:` could simply be copied down from the matchset clause above, but it can also be rewritten as follows.

def seq(x, y): return lambda text: set().union(*map(y, x(text)))

It’s a compiler, so it’s going to return a function that operates on x and y, takes a text as input, and then returns a result. We could take exactly(恰恰;完全地;精确地) the same expression as before and have the function return that, and that would be fine. But while I’m moving everything to this more functional notation, let me show you a different way to do it.

What we’re really trying to do is form a union of sets(多个集合的1个并集,补充知识:set().union(set1, set2)). What are the sets we’re going to apply union to? First we apply x to the text, and that gives us a set of remainders. For each of the remainders, we want to apply y to it. What we’re saying is: map y over the set of remainders, then union all the results together.

Now, union, it turns out, doesn’t take a collection; it takes separate arguments, as in union(a, b, c). So we want to turn this collection into a list of arguments to union, which we do with the apply notation(标记法) of putting a star(星号) in there.

Now we’ve got our compiler for sequence. It’s the function from a text to the set that results from finding all the remainders for x, and then finding all the remainders from each of those after we apply y. Unioning all those together also eliminates(eliminate 排除,消除) duplicates(重复).
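The set().union(*map(...)) idiom can be exercised on its own (the names x, y, and seq_xy are mine):

```python
# x produces a set of remainders, y consumes each of them, and the star
# unpacks the resulting sets as separate arguments to union.
null = frozenset()

def lit(s):
    return lambda text: set([text[len(s):]]) if text.startswith(s) else null

x, y = lit('ab'), lit('c')
seq_xy = lambda text: set().union(*map(y, x(text)))

assert seq_xy('abcd') == set(['d'])  # x leaves 'cd', then y leaves 'd'
assert seq_xy('abxx') == set()       # y fails on x's only remainder
assert seq_xy('zzzz') == set()       # x fails, so the union of nothing is empty
```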

Now it’s your turn to do one. This was the definition of alt in the interpreter matchset. Now I want you to write the definition of the compiler for alt: take the two patterns x and y, and return the function that implements the alternation.

17. Alt (Answer)

Alt Solution - Design of Computer Programs - YouTube

Here is the answer.

'''
def matchset(pattern, text):
    op, x, y = components(pattern)
    if 'lit' == op:
        return set([text[len(x):]]) if text.startswith(x) else null
    elif 'seq' == op:
        return set(t2 for t1 in matchset(x, text) for t2 in matchset(y, t1))
    elif 'alt' == op:
        return matchset(x, text) | matchset(y, text)
'''

def lit(s): return lambda t: set([t[len(s):]]) if t.startswith(s) else null
def seq(x, y): return lambda t: set().union(*map(y, x(t)))
def alt(x, y): return lambda t: x(t) | y(t)

The structure is exactly(恰恰;完全地) the same. It’s the union of these two sets. The difference is that with a compiler the calling convention is that the pattern gets called with the text as its argument. In the interpreter, the calling convention is that matchset is called with the pattern and the text.
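A quick check of the compiled alt, using the pattern from the supplementary example later in this section:

```python
# The compiled alt is just the union of the two sub-patterns' remainder sets.
null = frozenset()
def lit(s):    return lambda t: set([t[len(s):]]) if t.startswith(s) else null
def alt(x, y): return lambda t: x(t) | y(t)

pat = alt(lit('the'), lit('them'))
assert pat('them there') == set(['m there', ' there'])  # both branches match
assert pat('eyes') == null                              # neither branch matches
```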

18. 练习:Simple Compilers

Simple Compilers - Design of Computer Programs - YouTube

Here’s the whole program.

def match(pattern, text):
    "Match pattern against start of text; return longest match found or None."
    remainders = pattern(text)
    if remainders:
        shortest = min(remainders, key = len)
        return text[:len(text)-len(shortest)]

def lit(s): return lambda t: set([t[len(s):]]) if t.startswith(s) else null
def seq(x, y): return lambda t: set().union(*map(y, x(t)))
def alt(x, y): return lambda t: x(t) | y(t)
def oneof(chars): return lambda t: set([t[1:]]) if (t and t[0] in chars) else null
dot = lambda t: set([t[1:]]) if t else null
eol = lambda t: set(['']) if t == '' else null
def star(x): return lambda t: (set([t]) |
                               set(t2 for t1 in x(t) if t1 != t
                                   for t2 in star(x)(t1)))

Now, compilers have a reputation(名气;声誉) for being difficult and more complicated than interpreters, but notice here that the compiler is actually in many ways simpler than the interpreter.

It’s fewer lines of code overall. One reason is that we didn’t have to duplicate(重复) effort, first writing constructors to build up a literal and then, within matchset, an interpreter for that literal. Rather, we did it just once. Just once! We said the constructor for literal returns a function, which is the implementation of the compiler for that type of pattern. It’s very concise(简洁的,简明的). Most of these are one-liners. Maybe I cheated a little bit: I replaced the word “text” with “t” to make each one a little shorter so it fits on one line.

There’s only one that’s complicated, and that’s star(x), because it’s recursive. The ones I haven’t listed here are all the same as before. Before we get into star(x), let me note that.

I didn’t have to put down search here, because search is exactly the same as before.

I didn’t have to put down plus, because plus is exactly the same as before. It’s defined in terms of star.

What is the definition of star? One thing we could return is the remainder could be the text itself. Star of something — you could choose not to take any of it and return the entire text as the remainder. That’s one possibility. The other possibility is we could apply the pattern x. From star(x) apply the pattern x to the text and take those sets as remainders. For every remainder that’s not the text itself — because we already took care of that. We don’t need to take care of it again. For all the remainders that are different from the whole text then we go through and we apply star(x) to that remainder. We get a new remainder and that’s the result. That’s all we need for the compiler result.
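The two possibilities described above, taking nothing or peeling off one x and recursing, can be seen directly (the star code is from the course; the inputs are mine):

```python
# Compiled star: either the whole text remains (zero repetitions), or we
# take one x match and apply star(x) to each remainder.
null = frozenset()
def lit(s):  return lambda t: set([t[len(s):]]) if t.startswith(s) else null
def star(x): return lambda t: (set([t]) |
                               set(t2 for t1 in x(t) if t1 != t
                                   for t2 in star(x)(t1)))

pat = star(lit('a'))
assert pat('aab') == set(['aab', 'ab', 'b'])  # zero, one, or two a's consumed
assert pat('b') == set(['b'])                 # only the zero-repetition case
```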

Oh, one piece that was missing is how to interface the match function, which takes a pattern and a text, with this compiler, where a pattern is applied to the text. That’s one line, which is slightly(轻微地) different. Before, we called matchset. In the previous implementation we had

def match(pattern, text):
    remainders = matchset(pattern, text)
    ...

Your job then is to replace that with the proper code for the implementation that calls the compiler.

Here is the programming exercise:

# --------------
# User Instructions
#
# Fill out the function match(pattern, text), so that 
# remainders is properly assigned. 

def match(pattern, text):
    "Match pattern against start of text; return longest match found or None."
    remainders = # your code here.
    if remainders:
        shortest = min(remainders, key=len)
        return text[:len(text)-len(shortest)]

def lit(s): return lambda t: set([t[len(s):]]) if t.startswith(s) else null
def seq(x, y): return lambda t: set().union(*map(y, x(t)))
def alt(x, y): return lambda t: x(t) | y(t)
def oneof(chars): return lambda t: set([t[1:]]) if (t and t[0] in chars) else null
dot = lambda t: set([t[1:]]) if t else null
eol = lambda t: set(['']) if t == '' else null
def star(x): return lambda t: (set([t]) | 
                               set(t2 for t1 in x(t) if t1 != t
                                   for t2 in star(x)(t1)))

null = frozenset([])

def test():
    assert match(star(lit('a')), 'aaaaabbbaa') == 'aaaaa'
    assert match(lit('hello'), 'hello how are you?') == 'hello'
    assert match(lit('x'), 'hello how are you?') == None
    assert match(oneof('xyz'), 'x**2 + y**2 = r**2') == 'x'
    assert match(oneof('xyz'), '   x is here!') == None
    return 'tests pass'

Supplementary material below the video (begin)

Remember, here a pattern is a function, such that

pattern(text) = set([remainder1, remainder2, ...])

For example,

pat = lit('the')
text = 'them there eyes'
pat(text)  == set(['m there eyes'])

That is, lit('the') is a pattern, compiled into a function. When you call that function with text as the argument, you get back a set of possible remainders. Since this is a literal, if the literal matches, there is only one way it can match, and thus one possible remainder. If it doesn’t match, there will be the empty set of remainders:

pat = lit('the')
text = 'whole lot of trouble'
pat(text)  == set()

If a pattern can match different things, then the set of possible remainders (the result of applying the pattern function to the text) can have more than one element:

pat = alt(lit('the'), lit('them'))
text = 'them there eyes'
pat(text)  == set([' there eyes', 'm there eyes'])

Supplementary material below the video (end)

18. Simple Compilers (Answer)

Simple Compilers Solution - Design of Computer Programs - YouTube

The answer is that to interface with the compiler, we just call the pattern with the text as the input. That’s all we need to do.

def match(pattern, text):
    remainders = pattern(text)
    ...
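Putting the answer together with the constructors above gives the whole compiled matcher, runnable end to end (assertions taken from the test() in this section):

```python
# The compiled matcher assembled from this section's pieces.
null = frozenset()
def lit(s):    return lambda t: set([t[len(s):]]) if t.startswith(s) else null
def seq(x, y): return lambda t: set().union(*map(y, x(t)))
def alt(x, y): return lambda t: x(t) | y(t)
def star(x):   return lambda t: (set([t]) |
                                 set(t2 for t1 in x(t) if t1 != t
                                     for t2 in star(x)(t1)))

def match(pattern, text):
    "Match pattern against start of text; return longest match found or None."
    remainders = pattern(text)
    if remainders:
        shortest = min(remainders, key=len)
        return text[:len(text) - len(shortest)]

assert match(star(lit('a')), 'aaaaabbbaa') == 'aaaaa'
assert match(lit('hello'), 'hello how are you?') == 'hello'
assert match(lit('x'), 'hello how are you?') is None
```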

19. Recognizers And Generators

Recognizers And Generators - Design of Computer Programs - YouTube

So far, what we’ve been doing is called the recognizer task. We have a function match which takes a pattern and a text, and returns a substring of the text if it matches, or None.

It’s called a recognizer, because we’re recognizing whether the prefix of text is in the language defined by the pattern.

There’s a whole other task called the generator in which we generate from a pattern a complete language defined by that pattern.

For example, the pattern a or b sequenced with a or b — (a|b)(a|b). That defines a language of four different strings — {aa, ab, ba, bb}, and we could define a function that takes a pattern and generates out that language. That all seems fine.

One problem, though. If we have a language like a* then the answer of that should be the empty string or a or aa or aaa and so on — {'', a, aa, aaa, ...}. It’s an infinite set. That’s a problem. How are we going to represent this infinite set?(怎样表示1个无限的集合?)

Now, it’s possible, we could have a generator function that generates the items one at a time. That’s a pretty good interface, but instead I’m going to have one where we limit the sizes of the strings we want. If we say we want all strings up to n characters in length, then that’s always going to be a finite set.

I’m going to take the compiler approach. Rather than write a function “generate,” I’m going to have the generator be compiled into the patterns. What we’re going to write is a pattern, which is a compiled function, and we’re going to apply that to a set of integers representing the possible range of lengths that we want to retrieve(取回;恢复;检索;重新得到). That’s going to return a set of strings.

pat({int}) --> {str}

So for example, if we define pattern to be a* — we did that appropriately(适当地) — and then we asked for pattern, and we gave it the set {1, 2, 3},

pat = a*
pat({1,2,3}) --> {a, aa, aaa}

then that should return all strings which are derived from the pattern that have a length 1, 2, or 3. So that should be the set {a, aa, aaa}. Now let’s go ahead and implement this.
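The protocol pat({int}) --> {str} can be previewed using the generator constructors defined in the next section (only finite patterns here; a* needs the genseq machinery covered below):

```python
# Generator protocol: a compiled pattern maps a set of allowed lengths
# to the set of strings of those lengths in the pattern's language.
null = frozenset()
def lit(s):    return lambda Ns: set([s]) if len(s) in Ns else null
def alt(x, y): return lambda Ns: x(Ns) | y(Ns)

pat = alt(lit('a'), lit('bb'))
assert pat(set([1, 2, 3])) == set(['a', 'bb'])  # both lengths are allowed
assert pat(set([2])) == set(['bb'])             # only length-2 strings
```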

20. 练习:Oneof And Alt

Oneof And Alt - Design of Computer Programs - YouTube

Okay. Here’s the code for the compiler.

(Note: Ns below stands for a set of Numbers, i.e. a list of the possible lengths we are looking for.)

def lit(s):         return lambda Ns: set([s]) if len(s) in Ns else null
def alt(x, y):      return lambda Ns: # your code here
def star(x):        return lambda Ns: opt(plus(x))(Ns)
def plus(x):        return lambda Ns: genseq(x, star(x), Ns, startx=1) #Tricky
def oneof(chars):   return lambda Ns: # your code here
def seq(x, y):      return lambda Ns: genseq(x, y, Ns)
def opt(x):         return alt(epsilon, x)
dot = oneof('?')    # You could expand the alphabet to more chars.
epsilon = lit('')   # The pattern that matches the empty string.

null = frozenset([])

def test():

    f = lit('hello')
    assert f(set([1, 2, 3, 4, 5])) == set(['hello'])
    assert f(set([1, 2, 3, 4]))    == null

    g = alt(lit('hi'), lit('bye'))
    assert g(set([1, 2, 3, 4, 5, 6])) == set(['bye', 'hi'])
    assert g(set([1, 3, 5])) == set(['bye'])

    h = oneof('theseletters')
    assert h(set([1, 2, 3])) == set(['t', 'h', 'e', 's', 'l', 'r'])
    assert h(set([2, 3, 4])) == null

    return 'tests pass'

Now, remember the way the compiler works is the constructor for each of the patterns takes some arguments — a string, and x and y pattern, or whatever — and it’s going to return a function that matches the protocol that we’ve defined for the compiler.

The protocol is that each pattern function will take a set of numbers where the set of numbers is a list of possible lengths that we’re looking for. Then it will return a set of strings.

What have I done for lit(s)? I’ve said we return a function which takes a set of numbers as input, and if the length of the string is in that set of numbers — if the literal string was "hello", hello has five letters, and 5 is one of the numbers we’re looking for — then we return the set consisting of a single element, the string itself. Otherwise, return the null set.

star I can define in terms of other things.

plus I’ve defined in terms of a function sequence that we’ll get to in a minute. It’s a little bit complicated. It’s really the only complicated one here. We can reduce all the other complications down to calling plus, which calls genseq(). seq does that too.

I’ve introduced epsilon(n. 希腊语的第五个字母), which is the standard name in language theory for the empty string. So it’s the empty string. It’s the same as just the literal of the empty string, which matches just itself if we’re looking for strings of length 0.
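Epsilon and opt can be checked with the constructors above (opt only needs alt, so no genseq is required for this little check):

```python
# epsilon matches itself only when length 0 is requested; opt(x) is
# "epsilon or x".
null = frozenset()
def lit(s):    return lambda Ns: set([s]) if len(s) in Ns else null
def alt(x, y): return lambda Ns: x(Ns) | y(Ns)
epsilon = lit('')             # the pattern that matches the empty string
def opt(x):    return alt(epsilon, x)

assert epsilon(set([0, 1])) == set([''])   # '' has length 0
assert epsilon(set([1, 2])) == null
assert opt(lit('ab'))(set([0, 1, 2])) == set(['', 'ab'])
```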

For dot — dot matches any character. I’ve decided to just return a question mark to indicate that. You could return all 256 characters or whatever you want. Your results would start to get bigger and bigger. You can change that if you want to.

I left space for you to do some work.

Give me the definitions for oneof(chars).

If we ask for oneof('abc'), what should that match?

What it should match is: if 1 is an element of Ns, then the result should be the set of the characters a, b, and c. Otherwise, it shouldn't match anything.

Similarly for alt. Give me the code for that.

20.Oneof and Alt (Answer)

Oneof And Alt Solution - Design of Computer Programs - YouTube

def lit(s):         return lambda Ns: set([s]) if len(s) in Ns else null
def alt(x, y):      return lambda Ns: x(Ns) | y(Ns)
def star(x):        return lambda Ns: opt(plus(x))(Ns)
def plus(x):        return lambda Ns: genseq(x, star(x), Ns, startx = 1) #Tricky
def oneof(chars):   return lambda Ns: set(chars) if 1 in Ns else null
def seq(x, y):      return lambda Ns: genseq(x, y, Ns)
def opt(x):         return alt(epsilon, x)
dot = oneof('?')    # You could expand the alphabet to more chars.
epsilon = lit('')   # The pattern that matches the empty string.

The answer: if we want an alternative of the patterns x and y, then we follow the protocol. We apply x to the set of numbers, which gives us one set of matching strings, and we simply union that with the set that comes back from applying y to the set of numbers.

Now, for oneof(chars): chars is the list of possible characters, one of which we're trying to match. If 1 is in our set of numbers, then we should return all of those characters, and we have to return them as a set, so set(chars). Otherwise, there are no matches at all, so return null.

21. Avoiding Repetition

Avoiding Repetition - Design of Computer Programs - YouTube

def lit(s):         return lambda Ns: set([s]) if len(s) in Ns else null
def alt(x, y):      return lambda Ns: x(Ns) | y(Ns)
def star(x):        return lambda Ns: opt(plus(x))(Ns)
def plus(x):        return lambda Ns: genseq(x, star(x), Ns, startx = 1) #Tricky
def oneof(chars):   return lambda Ns: set(chars) if 1 in Ns else null
def seq(x, y):      return lambda Ns: genseq(x, y, Ns)
def opt(x):         return alt(epsilon, x)
dot = oneof('?')    # You could expand the alphabet to more chars.
epsilon = lit('')   # The pattern that matches the empty string.

That’s the whole compiler. I want to show you just a little bit of the possibility of doing some compiler optimizations(optimization 最优化). Notice this sort of barrier(分界线) here where we introduce lambda, where we introduce a function. Remember I said that there’s two parts to a compiler. There’s the part where we’re first defining a language. When we call lit and give it a string, then we’re doing some work to build up this function that’s going to do the work every time we call it again. Anything that’s on the right of the lambda is stuff(原料;材料) that gets done every time. Anything that’s to the left is stuff that gets done only once.

Notice that there is a part here, building up set([s]), that gets done every time, but that's wasteful, because s doesn't depend on the input. s is always going to be the same.

I can pull this out and do it at compile time rather than do it every time we call the resulting function.

I’ll make this set of s and I’ll give that a name — set_s. Over here I’ll do set_s equals that value. It looks like I’d better break this up into multiple lines.
(as in the code below)

def lit(s): set_s = set([s]); return lambda Ns: set_s if len(s) in Ns else null

Now I pulled out that precomputation so it only gets done once rather than gets done every time. You could look around for other places to do that.
(as in the code below)

def lit(s):
    set_s = set([s])
    return lambda Ns: set_s if len(s) in Ns else null

I could pull out the computation of this set of characters and do that only once as well.

(This refers to the line:

def oneof(chars):   return lambda Ns: set(chars) if 1 in Ns else null

)
That’s a lifting operation that stops us from repeating over and over again what we only need to do once. That’s one of the advantages of having a compiler in the loop. There is a place to do something once rather than to have to repeat it every time.
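The lifting can be sketched side by side (the function names lit_every_time and lit_lifted are mine; behavior is identical, only when set([s]) is built differs):

```python
# Lifting: move set([s]) to the left of the lambda, so it is built once at
# "compile" time instead of on every call.
null = frozenset()

def lit_every_time(s):
    return lambda Ns: set([s]) if len(s) in Ns else null  # set([s]) rebuilt per call

def lit_lifted(s):
    set_s = set([s])                                      # built once, here
    return lambda Ns: set_s if len(s) in Ns else null

assert lit_every_time('hi')(set([2])) == lit_lifted('hi')(set([2])) == set(['hi'])
assert lit_lifted('hi')(set([9])) == null
```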

22. 练习:Genseq

Genseq - Design of Computer Programs - YouTube

Now there’s only one bit left — this generate sequence.

def seq(x, y): return lambda Ns: genseq(x, y, Ns)

(seq(x, y) is a function that takes two patterns, x and y, and returns a function; that returned function takes a list of numbers and returns a set of matching texts.)
Let's talk about that. Now sequence, in this formulation(构想,规划;公式化), is a function that takes x and y (i.e. seq(x, y)), two patterns, and what it returns is a function; that function takes a list of numbers (i.e. fn(Ns)) and returns a set of texts (i.e. {text}) that match.

(seq() delays the computation: it builds a function that can do the calculation later.)
So sequence is delaying the calculation. It's computing a function which can do the calculation later on.

(genseq(x, y, Ns) does the computation immediately: given x, y and a set of numbers, it computes the set of possible texts right away.)
Genseq does the calculation immediately. It takes x and y and a set of numbers (i.e. genseq(x, y, Ns)), and it immediately calculates the set of possible texts (i.e. {text}).

Now the question is: what do we know about genseq in terms of the patterns x and y and the set of possible numbers? We know at some point we're going to have to call the pattern x with some set of numbers (i.e. x(N)); we're not yet quite sure which. That's going to return a list of possible texts. Then we're going to have to call y with some other set of numbers (i.e. y(N)).

Then we're going to concatenate them together and check whether they make sense: whether the length of the concatenation of some x match and some y match is within the allowable set. Now, what do we know about what these Ns should be, in terms of the original set of possible numbers, regardless of what that set is?

This could be a dense(密集的) set: Ns could be 0, 1, 2, all the way up to 10 or something. Or it could be a sparse(稀疏的) set: it could be, say, only the number 10. Either way, the restriction(约束;限定,限制) on x and y is that their lengths have to add up to a number in Ns, here no more than 10.

But x could be anything. If the only total we want is 10, that doesn't constrain x at all, other than being no longer than 10. So the Ns for x should be everything up to the maximum of Ns.

Then what should y be? Well, we have two choices. One, for each x match that comes back, we could generate the matching y's. Or we could generate all the y's at once and then try to combine them together and see which pairs match up. I think that's actually easier. So the y's, too, can be any size up to the maximum.

Then we take the two together, add up the x match and the y match, and see if the total length is in Ns. In this example, if Ns equals {10}, we want the Ns for both parts to be everything from 0 up to 10 inclusive(包括…的). We get back some results on one side, say a, abb, acde, and so on, and some others on the other side, say ab, bcd. Then for each pair we concatenate them, say abb plus ab, and check whether the length is in Ns. If it is, we keep it. If it's not, we don't.

Here is a candidate solution(候补解) for genseq.

def genseq(x, y, Ns):
    Nss = range(max(Ns)+1)
    return set(m1 + m2
               for m1 in x(Nss) for m2 in y(Nss)
                  if len(m1 + m2) in Ns)

We take x, y, and a set of numbers, and then we define Ns as being everything up to the largest number, including the largest number. We have to add 1 to the maximum number in order to get a range going from 0 up to and including the largest number. Now that we know the possible values of the numbers that we’re looking for for the sizes of the two components-the x and the y components — then we can say m1 is all the possible matches for x, m2 is all the possible matches for y. If the length of m1 plus m2 is in the original set of numbers that we’re looking for, then return m1 plus m2. This seems reasonable. It looks like it’s doing about what we’re looking for to generate all sequences of x and y concatenated(concatenate 把(一系列事件、事情等)联系起来) together.
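To see the candidate genseq work on a simple, non-recursive case, here is a minimal runnable sketch. The lit constructor is assumed to follow the course’s convention: a compiled pattern is a function from a set of allowed lengths Ns to the set of matching strings.

```python
def lit(s):
    # lit(s) compiles to a function of Ns: the set {s} if s's length
    # is one of the requested lengths, else the empty set.
    return lambda Ns: set([s]) if len(s) in Ns else frozenset()

def genseq(x, y, Ns):
    # The candidate version from above: try every length up to max(Ns)
    # for both halves, then keep only concatenations whose length is in Ns.
    Nss = range(max(Ns) + 1)
    return set(m1 + m2
               for m1 in x(Nss) for m2 in y(Nss)
               if len(m1 + m2) in Ns)

print(genseq(lit('ab'), lit('c'), {3}))  # {'abc'}: 'ab'+'c' has length 3
print(genseq(lit('a'), lit('b'), {3}))   # set(): 'a'+'b' has length 2, not 3
```

On finite patterns like these, the candidate works fine; the trouble only shows up with recursive patterns, as the following sections explain.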

But I want you to think about it and say, have we really gotten this right? The choices are: is this function correct for all inputs? Or is it incorrect for some? Does it return incorrect results? Or is it correct when it returns, but doesn’t always return? Think about that. Think about whether there is any result that looks like it’s incorrect that’s being formed. Think about whether it infinite loops or not. Think about base cases on recursion and ask, is there any case where it looks like it might not return? This is a tricky(复杂的) question, so I want you to try it, but it may be difficult to get this one right.

[] correct for all inputs
[] incorrect for some
[] correct when returns; doesn't always return

Supplementary material below the video (begin)

In the forum, qlq3763 points out that this version of genseq doesn’t even handle plus(lit('a')).

Supplementary material below the video (end)

Genseq (Answer)

Genseq Solution - Design of Computer Programs - YouTube

The answer is that it is correct when it returns.

All the values it builds up are correct, but unfortunately it doesn’t always return. Let’s try to figure out why.

In thinking about this, we want to think about recursive patterns.

Let’s look at the pattern x+. We’ve defined x+ as being the sequence of x followed by x*(即xx*). And now for most instances of x that’s not a problem.

If we had plus(lit('a')), it’s not going to be a problem. That’s going to generate a, aa, aaa, and so on.

But consider this — let’s define a equals lit('a')(即a = lit('a')), pat equals plus(opt(a))(即pat = plus(opt(a))).

Now, this should be the same. This should also generate a, aa, aaa.

The way we can see that is we have a plus so that generates any number of these. If we pick a once, we get this. If we pick a twice we get this. If we pick a three times we get this. But the problem is there’s all these other choices in between.

opt(a) means we can either be picking a or the empty string. As we go through the loop for plus, we could pick empty string, empty string, empty string. We could pick empty string an infinite(无限的;无穷尽的) number of times. Even though our N is finite(有穷的;有限的) — at some point we’re going to ask for pattern of some N — let’s say the set {1, 2, 3} — we won’t have a problem with having an infinite number of a’s, but we will have a problem of choosing from the opt(a) the empty part. If an infinite number of times we choose the empty string rather than choosing a, then we’re never going to get past three as the highest value. We’re going to keep going forever. That’s the problem. We’ve got to somehow say I don’t want to keep choosing the empty string. I want to make progress and choose something each time through. So how can we make sure that happens?

23. Induction(归纳(法))

Induction - Design of Computer Programs - YouTube

What I have to do is I have to check all my cases where I have recursion and eliminate(排除;消除) any possibility for infinite recursion(递归).

Now, there are two possibilities in the star function and the plus function. Those are the two cases where regular expressions have recursion. But now star I defined in terms of plus, so all that’s left is to fix plus to not have an infinite(无限的) recursion.

Here’s how I define plus. Basically, I said that x+ is defined as x followed by x*(即xx*), and the x* is in turn defined in terms of x+. The problem was that I was going through and saying, okay, for x when I’m doing plus(opt(a)), for my x I want to choose opt(a). Okay, I think I’ll choose the empty string. So I chose that, and now I’m left in a recursion where I have an x*, which is defined in terms of x+, and I haven’t made any progress. I have a recursive call that’s defined in terms of itself. We know in order to make sure that a recursion terminates, we have to have some induction(归纳) where we’re reducing(换算;约束;使变为) something. It makes sense here that what we’re going to reduce is our set Ns. One way to guarantee to do that is to say when I’m generating the x followed by the x* let’s make sure that the x generates at least 1 character. If we can guarantee that x generates a character, then when we go to do the x* we’ve reduced our input. So we have this inductive(归纳的;感应的) property saying that now our set of Ns will be smaller. It’s smaller by 1 each time, and if it started out finite and we reduce by 1 every time, then eventually we’re going to terminate(结束).

Let’s see how we can implement that idea of saying every time we have an x+ we have a x and an x*, we have to choose at least 1 character for x. Note that that’s not limiting us in any way. That hasn’t stopped us from generating all possible expressions, because if we were going to generate something, it would have to come from somewhere — either from this x or this x — and we might as well make sure it comes from here rather than adding an infinite number of nothings before we generate something.

24. Testing Genseq

Testing Genseq - Design of Computer Programs - YouTube

def genseq(x, y, Ns, startx=0):
    "Set of matches to xy whose total len is in Ns, with x-match's len in Ns_"
    # Tricky part: x+ is defined as : x+ = x x*
    # To stop the recursion, the first x must generate at least 1 char,
    # and then the recursive x* has that many fewer characters.  We use
    # startx=1 to say that x must match at least 1 character
    if not Ns:
        return null  # null = frozenset(), the empty set of matches
    xmatches = x(set(range(startx, max(Ns)+1)))
    Ns_x = set(len(m) for m in xmatches)
    Ns_y = set(n-m for n in Ns for m in Ns_x if n-m >= 0)
    ymatches = y(Ns_y)
    return set(m1 + m2
               for m1 in xmatches for m2 in ymatches
               if len(m1+m2) in Ns)

Here’s what genseq looks like. We have a recursive base case(递归基例) that says, if there are no numbers that we’re looking for, we can’t generate anything of those lengths, and so return the empty set.

Then we say the xmatches we get by applying x to any number up to the maximum of Ns, including the maximum of Ns, but then we’ve got to do some computation to figure out what can be the allowable sizes for y, and we do that by saying, let’s take all the possible values that came back from the xmatches, and then for each of those values and for each of the original values for the lengths that we’re looking for, subtract those(减去) off and say, the total is going to be one of the things we got from x plus one of the things we got from y, and that had better add up to one of the things in Ns. Then we call y with that set of possible Ns for y, and then we do the same thing that we were going to do before. We go through those matches, but this is going to be with a reduced set of possibilities, and count those up. Now, the thing that makes it all work is this optional argument here, saying the number that we’re going to start at for the possible sizes of x. In the default case, that’s 0, and so we start the range at 0. But in the case where we’re calling from +, we’re going to set that to 1. Let’s see what that looks like.

def plus(x):    return lambda Ns: genseq(x, star(x), Ns, startx=1)
def seq(x, y):  return lambda Ns: genseq(x, y, Ns)

Here are the constructors, the compilers for sequence and plus. For a regular sequence, there is no constraint on this start for x. x can be any size up to the maximum of the Ns. But for plus, we’re going to always ask that the x part have a length of at least 1, and then the y part will be whatever is left over. That’s how we break the recursion, and we make sure that genseq will always terminate.
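Combining the fixed genseq with the remaining constructors from the course code (lit, alt, star, oneof, opt, and the epsilon/null values), the whole generator fits in one self-contained sketch, and the troublesome plus(opt(a)) pattern now terminates:

```python
def genseq(x, y, Ns, startx=0):
    "Set of matches to xy whose total len is in Ns; x must match >= startx chars."
    if not Ns:
        return null
    xmatches = x(set(range(startx, max(Ns) + 1)))
    Ns_x = set(len(m) for m in xmatches)
    Ns_y = set(n - m for n in Ns for m in Ns_x if n - m >= 0)
    ymatches = y(Ns_y)
    return set(m1 + m2
               for m1 in xmatches for m2 in ymatches
               if len(m1 + m2) in Ns)

def lit(s):       return lambda Ns: set([s]) if len(s) in Ns else null
def alt(x, y):    return lambda Ns: x(Ns) | y(Ns)
def star(x):      return lambda Ns: opt(plus(x))(Ns)
def plus(x):      return lambda Ns: genseq(x, star(x), Ns, startx=1)  # x eats >= 1 char
def oneof(chars): return lambda Ns: set(chars) if 1 in Ns else null
def seq(x, y):    return lambda Ns: genseq(x, y, Ns)
def opt(x):       return alt(epsilon, x)

null = frozenset()   # the empty set of matches
epsilon = lit('')    # matches only the empty string

a = lit('a')
print(sorted(plus(opt(a))({1, 2})))  # ['a', 'aa'] -- and it terminates
```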

def test_gen():
    def N(hi): return set(range(hi+1))
    a,b,c = map(lit, 'abc')
    assert star(oneof('ab'))(N(2)) == set(['', 'a', 'aa', 'ab', 'ba',
                                           'bb', 'b'])
    assert (seq(star(a), seq(star(b), star(c)))(set([4])) ==
            set(['aaaa', 'aaab', 'aaac', 'aabb', 'aabc', 'aacc', 'abbb',
                 'abbc', 'abcc', 'accc', 'bbbb', 'bbbc', 'bbcc', 'bccc', 'cccc']))
    assert (seq(plus(a), seq(plus(b), plus(c)))(set([5])) ==
            set(['aaabc', 'aabbc', 'aabcc', 'abbbc', 'abbcc', 'abccc']))
    assert (seq(oneof('bcfhrsm'), lit('at'))(N(3)) ==
            set(['bat', 'cat', 'fat', 'hat', 'mat', 'rat', 'sat']))
    assert (seq(star(alt(a, b)), opt(c))(set([3])) ==
            set(['aaa', 'aab', 'aac', 'aba', 'abb', 'abc', 'baa',
                 'bab', 'bac', 'bba', 'bbb', 'bbc']))
    assert lit('hello')(set([5])) == set(['hello'])
    assert lit('hello')(set([4])) == set()
    assert lit('hello')(set([6])) == set()
    return 'test_gen passes'

Now this language generation program is a little bit complex. So I wanted to make sure that I wrote a test suite for it to test the generation. So here I’ve just defined some helper functions and then wrote a whole bunch of statements here. If we check one of ‘ab’ and limit that to size 2, that should be equal to this set. It’s gone off the page. Let’s put it back where it belongs. One element of size 0, 2 elements of size 1, and 4 elements of size 2, just what you would expect. Here are sequences of a star, b star, c star of size exactly 4. Here they are and so on and so on. We’ve made all these tests. I should probably make more than these, but this will give you some confidence that the program is doing the right thing if it passes at least this minimal test suite(套件).

25. Theory And Practice

Theory And Practice - Design of Computer Programs - YouTube

This is a good time to pause and summarize what we’ve learned so far. We’ve learned some theory and some practice.

In theory, we’ve learned about patterns, which are grammars which describe languages, where a language is a set of strings. We’ve learned about interpreters over those languages, and about compilers, which can do the same thing only faster.

In terms of practice, we’ve learned that regular expressions are useful for all sorts of things, and they’re a concise(简明的;简洁的) language for getting work done. We’ve learned that interpreters, including compilers, can be valuable(有价值的) tools, and that they can be more expressive(有表现力的) and more natural to describe a problem in terms of a native language that makes sense for the problem rather than in terms of Python code that doesn’t necessarily make sense. We learned functions are more composable(组成的) than other things in Python. For example, in Python we have expressions, and we have statements. They can only be composed by the Python programmer whereas(然而) functions can be composed dynamically. We can take 2 functions and put them together. We can take f and call g with that and then apply that to some x(即f(g(x))). We can do that for any value of f and g. We can pass those into a function and manipulate(操作;处理) them and have different ones applying to x. We can’t do that with expressions and statements. We can do it with the values of expressions, but we can’t do it with expressions themselves. Functions provide a composability(可组合性) that we don’t get elsewhere. Functions also provide control over time, so we can divide up the work that we want to do into do some now and do some later. A function allows us to do that. Expressions and statements don’t do that because they just get done at 1 time when they’re executed. Functions allow us to package up computation that we want to do later.

26. Exercise: Changing Seq

Changing Seq - Design of Computer Programs - YouTube

Now one thing I noticed as I was writing all those test patterns is that functions like seq and alt are binary, which means if I want a sequence of 4 patterns, I have to have a sequence of (a, followed by the sequence of (b, followed by sequence of (c,d)(即seq(a, seq(b, seq(c, d)))), and then I have to count the number of parens(括号) and get them right. It seems like it’d be much easier if I could just write sequence of (a, b, c, d)(即seq(a, b, c, d)).

(At present the seq() function can only accept 2 arguments, which is not very convenient to use. Consider refactoring it so that it can accept any number of arguments.
seq() calls other components and is called by other components, so if we change seq(), we need to consider the effects on everything connected to it. There are 2 factors to consider.
The first is whether the change is backward compatible. That is, after we make some changes to seq(), we must guarantee that however it was used before, those uses are still good and don’t need to change. If so, then my change will be local to sequence, and I won’t have to go all over the program changing it everywhere else.
The second factor is whether the change is internal or external. If I am changing something on the inside of sequence that doesn’t affect the callers, then that’s okay. Otherwise the change is external, that is, to the interface.)

And we talked before about this idea of refactoring(refactor 重构), that is changing your code to come up with a better interface that makes the program easier to use, and this looks like a good example. This would be a really convenient thing to do. Why did I write seq this way? Well, it was really convenient to be able to define sequence of (x,y) and only have to worry about exactly 2 cases. If I had done it like this, and I had to define sequence of an arbitrary(随意的) number of arguments, then the definition of sequence would have been more complex. So it’s understandable that I did this.

I want to make a change, so let’s draw a picture. Imagine this is my whole program and then somewhere here is the sequence part of my program. Now, of course, this has connections to other parts of the program. Sequence is called by and calls other components, and if we make a change to sequence, then we have to consider the effects of those changes everywhere else in which it’s used. When we consider these changes, there are 2 factors we would like to break out. One is, is the change backward compatible(向后兼容)? That is, if I make some change to sequence, am I guaranteed that however it was used before, those uses are still good, and they don’t have to be changed? If so, then my change will be local to sequence, and I won’t have to go all over the program changing it everywhere else. So that’s a good property to have. So for example, in this case, if I changed sequence so that it still accepted the old binary calls, then that would be a backwards compatible change, as long as I didn’t break anything else.

And then the second factor(因素) is whether the change is internal(内部的) or external(外部的). If I am changing something on the inside of sequence that doesn’t affect the callers, then that’s okay. In general, that’s going to be backwards compatible. Or am I changing something on the outside — changing the interface to the rest of the world? In this case, going from the binary version to this n_ary version, I can make it backwards compatible if I’m careful. It’s definitely(明确地;一定地) going to be both an internal and external change. So I’m going to have to do something to the internal part of sequence. And then I’m also changing the signature(签名) of the function, so I’m affecting the outside as well. I can make that effect in a backwards compatible way.

Thinking about those 2 factors, what would be the better way to implement this call? Let’s say we’re dealing with the match-set version where we’re returning a tuple. Would it be better to return the flat tuple ('seq', a, b, c, d) or the nested tuple ('seq', a, ('seq', b, ('seq', c, d)))? Tell me which of these you prefer by these criteria(标准,准则).

[] the flat tuple ('seq', a, b, c, d)
[much better] ('seq', a, ('seq', b, ('seq', c, d)))

Changing seq (Answer)

Changing Seq Solution - Design of Computer Programs - YouTube

('seq', a, ('seq', b, ('seq', c, d)))
The answer is this approach is much better because now from the external part everybody else sees exactly the same thing. But internally, I can write the calls to the function in a convenient form and they still get returned in a way that the rest of the program can deal with, and I don’t have to change the rest of the program.

27. Changing Functions

Changing Functions - Design of Computer Programs - YouTube

Let’s go about implementing that. Let’s say I had my old definition of sequence of (x,y), and we say return the tuple consisting of a sequence x and y.

def seq(x, y):
    return ('seq', x, y)

Now I want to change that. How would I change that? Well, instead of x and y, I think I’m going to insist(坚持;强调) that we have at least 1 argument, so I can say x and then the rest of the args, and I can say, if the length of the rest of the args equals 1, then I’ve got this binary case and then I can take sequence of x. The second arg is now no longer called y. It’s called args at 0. Else: now I’ve got a recursive case with more than 2 args, and I can do something there.

def seq(x, *args):
    if len(args) == 1:
        return ('seq', x, args[0])
    else:
        pass  # recursive case with more than 2 args goes here

So it’s not that hard, but I had to do a lot of violence to this definition of sequence, and come to think of it, I may be repeating myself because I had to do this for sequence, and I’m also going to have to do it for alt, and if I expand(扩展) my program, let’s say I start wanting to take on arithmetic(算术,计算;算法) as well as regular expressions, then I may have functions for addition and multiplication and others, and I’m going to have to make exactly the same changes to all these binary functions. So that seems to violate(违反) the “Don’t Repeat Yourself” principle. I’m making the same changes over and over again(一再地). It’s more work for me. There’s a possibility of introducing bugs. Is there a better way?
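For completeness, the recursive case the transcript leaves open could be filled in like this (a hand-written sketch that the n_ary abstraction makes unnecessary):

```python
def seq(x, *args):
    "n-ary seq that still returns the old nested binary tuples."
    if len(args) == 1:
        return ('seq', x, args[0])
    else:
        # recursive case: fold the remaining args into a nested tuple
        return ('seq', x, seq(*args))

print(seq('a', 'b', 'c', 'd'))  # ('seq', 'a', ('seq', 'b', ('seq', 'c', 'd')))
```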

(From f(x, y) to f2(x, ...).)
So let’s back up and say, what are we doing in general? Well, we’re taking a binary function f of (x,y), and we want to transform that somehow into an n_ary function — f prime(即f′), which takes x and any number of arguments. The question is, can we come up with a way to do that automatically — change one function or modify or generate a new function from the definition of that original function.

28. Exercise: Function Mapping

Function Mapping - Design of Computer Programs - YouTube

What’s the best way to do that? How can I map from function f to a function f prime? One possibility would be to edit the bytecode of f. Another possibility would be to edit the source string of f and concatenate some strings together. Another possibility would be to use an assignment statement to say f = some function g of f to give us a new version of f. Which of these would be the best solution?

[] edit the bytecode of f.
[] edit the source string of f and concatenate some strings together.
[right] use an assignment statement to say f = some function g of f to give us a new version of f. `f = g(f)`

28. Function Mapping (Answer)

Function Mapping Solution - Design of Computer Programs - YouTube

I think it’s pretty clear that this one is the best because we know how to do this quite easily. We know how to compose functions together, and that’s simple, but editing the bytecode or the source code, that’s going to be much trickier and not quite as general, so let’s go for the solution.
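As a tiny illustration of the f = g(f) idea, with a hypothetical wrapper named twice (not from the course):

```python
def twice(f):
    # g builds a brand-new function out of f -- no bytecode or source editing
    return lambda x: f(f(x))

def inc(x): return x + 1

inc = twice(inc)   # f = g(f): rebind the name to the new function
print(inc(0))      # 2, because inc now applies the original inc twice
```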

29. Exercise: N Ary Function

N Ary Function - Design of Computer Programs - YouTube

What I want to do is define a function, and let’s call it n_ary, and it takes (f), which should be a binary function, that is a function that takes exactly 2 arguments, and n_ary should return a new function that can take any number of arguments. We’ll call this one f2, so that f2 of (a, b, c)(即f2(a, b, c)) is equal to f(a, f(b, c)), and that will be true for any number of arguments — 2 or more. It doesn’t have to just be a, b, c.

# ---------------
# User Instructions
#
# Write a function, n_ary(f), that takes a binary function (a function
# that takes 2 inputs) as input and returns an n_ary function. 

def n_ary(f):
    """Given binary function f(x, y), return an n_ary function such
    that f(x, y, z) = f(x, f(y,z)), etc. Also allow f(x) = x."""
    def n_ary_f(x, *args):
        pass  # your code here
    return n_ary_f

So let’s see if you can write this function n_ary. Here’s a description of what it should do. It takes a binary function (f) as input, and it should return this n_ary function, that when given more than 2 arguments returns this composition of arguments. When given 2 arguments, it should return exactly what (f) returns. We should also allow it to take a single argument and return just that argument. That makes sense for a lot of functions (f), say for sequence. The sequence of 1 item is the item. For alt, the alternative of 1 item is the item. I mentioned addition and multiplication makes sense to say the addition of a number by itself is that number or same with multiplication. So that’s a nice extension for n_ary. See if you can put your code here. So what we’re doing is, we’re passed in a function. We’re defining this new n_ary function, putting the code in there, and then we’re returning that n_ary function as the value of that call.

29. n_ary Function (Answer)

N Ary Function Solution - Design of Computer Programs - YouTube

Peter’s answer:

def n_ary(f):
    """Given binary function f(x, y), return an n_ary function such that
    f(x, y, z) = f(x, f(y, z)), etc. Also allow f(x) = x."""
    def n_ary_f(x, *args):
        return x if not args else f(x, n_ary_f(*args))
    return n_ary_f

Here’s the answer. It’s pretty straightforward. If there’s only 1 argument, you return it. Otherwise, you call the original f that was passed in, with the first argument as the first argument, and the result of the n-ary composition as the other argument.
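The same trick works for any binary function, for example the addition case mentioned earlier (add here is a hypothetical helper, not course code):

```python
def n_ary(f):
    def n_ary_f(x, *args):
        return x if not args else f(x, n_ary_f(*args))
    return n_ary_f

def add(x, y): return x + y

add = n_ary(add)
print(add(1, 2, 3, 4))  # 10: computed as add(1, add(2, add(3, 4)))
print(add(5))           # 5: a single argument is returned as-is
```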

30. Update Wrapper

Update Wrapper - Design of Computer Programs - YouTube

def seq(x, y): return ('seq', x, y)

seq = n_ary(seq)

Now how do we use this? Well, we take a function we define, say seq of x, y, and then we can say sequence is redefined as being an n_ary function of sequence. Oops — I guess I’ve got to fix this typo here. From now on, I can call sequence and pass in any number of arguments, and it will return a result that looks like that. So that looks good.

@n_ary
def seq(x, y): return ('seq', x, y)

In fact, this pattern is so common in Python that there’s a special notation for it. The notation is called the decorator notation. It looks like this. All we have to do is say, @ sign, then the name of a function, and then the definition. This is the same as saying sequence = n_ary of sequence. It’s just an easier way to write it.

(

>>> def n_ary(f):
...     def n_ary_f(x, *args):
...             return x if not args else f(x, n_ary_f(*args))
...     return n_ary_f
...
>>> @n_ary
... def seq(x, y): return ('seq', x, y)
...
>>> help(seq)
Help on function n_ary_f in module __main__:

n_ary_f(x, *args)

)
But there is one problem with the way we specified this particular decorator, which is if I’m in an interactive session, and I ask for help on sequence, I would like to see the argument list, and if there is a doc string, I want to see the documentation here. I didn’t happen to put in any documentation for sequence. But when I ask for help, what I get is this. I’m told that sequence is called n_ary_f. Well, why is that? Because this is what we returned when we defined sequence = n_ary of sequence. We returned this thing that has the name n_ary_f. So we would like to fix n_ary up so that the object that it returns has the same function name and the same function documentation — if there is any documentation — copied over into the n_ary_f function.

from functools import update_wrapper

Now it turns out that there is a function to do exactly that, and so I’m going to go get it. I’m going to say from functools — the functional tools package — I want to import the function called update_wrapper.

from functools import update_wrapper

def n_ary(f):
    def n_ary_f(x, *args):
        return x if not args else f(x, n_ary_f(*args))
    update_wrapper(n_ary_f, f) # this line was added
    return n_ary_f

@n_ary
def seq(x, y): return ('seq', x, y)

help(seq)

Help on function seq in module __main__:

seq(x, *args)

update_wrapper takes 2 functions, and it copies over the function name, the documentation, and several other attributes from the old function to the new function, and I can change n_ary to do that. So once I’ve defined the n_ary_f function, then I can go ahead and update the wrapper of the n_ary_f function — the thing I’m going to be returning — from the old function. So this will be the old sequence, which has the sequence name, a list of arguments, maybe some documentation string, and this will be the function that we were returning, and we’re copying over everything from f into n_ary_f. Now when I define an n_ary sequence and I ask for help on sequence, what I’ll see is the correct name for sequence, and if there was any documentation string for sequence, that would appear here as well.

So update_wrapper is a helpful tool. It helps us when we’re debugging. It doesn’t really help us in the execution of the program, but in debugging, it’s really helpful to know the correct names of your functions.

Notice that we may be violating(violate 违反) the Don’t Repeat Yourself principle here. So this n_ary function is a decorator that I’m using in this form to update the definition of sequence. I had to — within my definition of n_ary — I had to write down that I want to update the wrapper. But it seems like I’m going to want to update the wrapper for every decorator, not just for n_ary, and I don’t want to repeat myself on every decorator that I’m going to define.

31. Decorated Wrappers

Decorated Wrappers - Design of Computer Programs - YouTube

@decorator
def n_ary(f):
    def n_ary_f(x, *args):
        return x if not args else f(x, n_ary_f(*args))
    return n_ary_f

So here’s an idea. Let’s get rid of this line, and instead, let’s declare that n_ary is a decorator. We’ll write a definition of what it means to be a decorator in terms of updating wrappers. Then we’ll be done, and we’ve done it once and for all. We can apply it to n_ary, and we can apply it to any other decorator that we define. This is starting to get a little bit confusing because here we’re trying to define decorator, and decorator is a decorator. Have we gone too far into recursion? Is that going to bottom out?

Let’s draw some pictures and try to make sense of it. So we’ve defined n_ary, and we’ve declared that as being a decorator(即@decorator def n_ary(...):...), and that’s the same as saying n_ary = decorator of n_ary(即n_ary = decorator(n_ary)). Then we’ve used n_ary as a decorator(即@n_ary). We’ve defined sequence to be an n_ary function(即@n_ary def seq(...):...). That’s the same as saying sequence = n_ary of sequence(即seq = n_ary(seq)).

Now we wanted to make sure that there’s an update so that the documentation and the name of sequence gets copied over. We want to take it from this function, pass it over to this function because that’s the one we’re going to keep. While we’re at it, we might as well do it for n_ary as well. We want to have the name of n_ary be n_ary and not something arbitrary(随意的) that came out of decorator. So we’ve got 2 updates that we want to do for the function that we decorated and for the decorator itself.

def decorator(d):
    def _d(fn):
        return update_wrapper(d(fn), fn)
    update_wrapper(_d, d)
    return _d

Now let’s see if we can write decorator so that it does those 2 updates. So let’s define decorator. It takes an argument (d), which is a function. Then we’ll call the function we’re going to return _d, and that takes a function as input. So it returns the update wrapper from applying the decorator to the function and copying over onto that decorated function, the contents of the original function’s documentation and name, and then we also want to update the wrapper for the decorator itself. So from (d) the decorated function, we want to copy that over into _d and then return _d.

Now which update is which? Well, this one here is the update of _d with d(即update(_d, d)), and this one is the update of the decorated function from the function(即update(d(fn), fn)). So here we’re saying the new n_ary that we’re defining gets the name from the old n_ary, the name in the documentation string, and here we’re saying the new sequence, the new n_ary sequence, gets its name from the old sequence.

from functools import update_wrapper

def decorator(d):
    "Make function d a decorator: d wraps a function fn."
    def _d(fn):
        return update_wrapper(d(fn), fn)
    update_wrapper(_d, d)
    return _d

@decorator
def n_ary(f):
    """Given binary function f(x, y), return an n_ary function such that
    f(x, y, z) = f(x, f(y, z)), etc. Also allow f(x) = x."""
    def n_ary_f(x, *args):
        return x if not args else f(x, n_ary_f(*args))
    return n_ary_f

Here’s what it all looks like. If you didn’t quite follow that the first time, don’t worry about it. This is probably the most confusing thing in the entire class because we’ve got functions pointing to other functions, pointing to other functions. Try to follow the pictures. If you can’t follow the pictures, that’s okay. Just type it into the interpreter. Put these definitions in. Decorate some functions. Decorate some n_ary functions. Take a look at them and see how it works.
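Following the suggestion to type it into the interpreter, here is a quick sanity check (the doc string on seq is added here just to show that it gets copied):

```python
from functools import update_wrapper

def decorator(d):
    "Make function d a decorator: d wraps a function fn."
    def _d(fn):
        return update_wrapper(d(fn), fn)
    update_wrapper(_d, d)
    return _d

@decorator
def n_ary(f):
    def n_ary_f(x, *args):
        return x if not args else f(x, n_ary_f(*args))
    return n_ary_f

@n_ary
def seq(x, y):
    "Sequence of x then y."
    return ('seq', x, y)

print(n_ary.__name__)      # n_ary: the decorator kept its own name
print(seq.__name__)        # seq: the decorated function kept its name too
print(seq.__doc__)         # Sequence of x then y.
print(seq('a', 'b', 'c'))  # ('seq', 'a', ('seq', 'b', 'c'))
```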

32. Exercise: Decorated Decorators

Decorated Decorators - Design of Computer Programs - YouTube

def decorator(d):
    "Make function d a decorator: d wraps a function fn."
    def _d(fn):
        return update_wrapper(d(fn), fn)
    update_wrapper(_d, d)
    return _d

## QUIZ: DOES THIS WORK?

def decorator(d):
    "Make function d a decorator: d wraps a function fn. @author Darius Bacon"
    return lambda fn: update_wrapper(d(fn), fn)

decorator = decorator(decorator)

"""     () Yes
        () No, results in an error
        () No, updates decorator (n_ary) but not decorated (seq)
        () No, my brain hurts
"""

Okay, now for a quick quiz. We have this definition of decorator, and we’ve seen how that works. Here’s an alternative that was proposed(propose 提议) by Darius Bacon, which is this 1 line — return the function that updates the wrapper for the decorator applied to the function from the original function — and then 1 more line, which says, decorator = decorator of decorator. Can you get any more recursive(递归的) than that? The question is, does this work? I want you to give me the best answer. The possible answers are: yes, it does; no, it’s an error; no, it correctly updates decorators such as n_ary, but not the decorated function such as seq; or no, my brain hurts.

32. Decorated Decorators (Answer)

Decorated Decorators Solution - Design of Computer Programs - YouTube

Now if you answered, no, my brain hurts — well, who am I to argue with that? But I think the best answer is yes, it does in fact work. Even though there’s only 1 update wrapper call, both of them happen, and the reason is because decorator becomes a decorator. So this version of decorator updates (seq), and then this version that gets created when we make it a decorator updates the decorator.
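Running Darius Bacon’s 2-line version confirms the answer: both the decorator and the decorated function get their names updated.

```python
from functools import update_wrapper

def decorator(d):
    "Make function d a decorator: d wraps a function fn. @author Darius Bacon"
    return lambda fn: update_wrapper(d(fn), fn)

decorator = decorator(decorator)  # decorator is now itself decorated

@decorator
def n_ary(f):
    def n_ary_f(x, *args):
        return x if not args else f(x, n_ary_f(*args))
    return n_ary_f

@n_ary
def seq(x, y): return ('seq', x, y)

print(n_ary.__name__)  # n_ary: the decorator (n_ary) was updated
print(seq.__name__)    # seq: the decorated function was updated too
```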

33. Exercise: Cache Management

Cache Management - Design of Computer Programs - YouTube

def fib(n): ...fib(n-1)...

If you took CS 101, you saw the idea of memoization(备忘). If you haven’t seen it, the idea is that sometimes particularly with the recursive function, you will be making the same function calls over and over again. If the result of a function call is always the same and the computation took a long time, it’s better to store the results of each value of N with its result in a cache(快速缓冲贮存区), a table data structure, and then look it up each time rather than try to recompute it.

def fib(n):
    if n in cache:
        return cache[n]
    cache[n] = result = ...
    return result

We can make this function be cached very simply with a couple extra lines of code. We ask if the argument is already in the cache, then we just go ahead and return it. Otherwise, we compute it, store it, and then return it. So this part with the dot, dot, dot, is the body of the function. All the rest is just the boilerplate(样板代码) that you have to do to implement this idea of a cache. We’ve done this once, and that’s fine, but I’m worrying about the principle of Don’t Repeat Yourself. There’s probably going to be lots of functions in which I want to store intermediate(中间的) results in a cache, and I don’t want to have to repeat this code all of the time. So this is a great idea for a decorator. We can define a decorator called memo(备忘录), which will go ahead and do this cache management, and we can apply it to any function. The great thing about this pattern of using memoization is that it will speed up any function f that you pass to it because doing a table look-up is going to be faster than a computation as long as the computation is nontrivial(非平凡的), is more than just a look-up.

Now the hockey(曲棍球) player, Wayne Gretzky, once said that you miss 100% of the shots you don’t take. This is kind of the converse(反过来说). This is saying you speed up(加速) 100% of the computations that you don’t make. So here’s the memo decorator.

@decorator
def memo(f):
    """Decorator that caches the return value for each call to f(args).
    Then when called again with same args, we can just look it up."""
    cache = {}
    def _f(*args):
        try:
            return cache[args]
        except KeyError:
            cache[args] = result = f(*args)
            return result
        except TypeError:
            # some element of args can't be a dict key
            return f(*args)
    return _f

The guts(gut 勇气;内脏;直觉) of it is the same as what I sketched out(概述;草拟) previously.

If we haven’t computed the result already, we compute the result by applying the function f to the arguments. It gives us the result. We cache that result away, then we return it for this time. It’s ready for next time. Next time we come through(到达;穿过;传来;恢复), we try to look up the arguments in the cache to see if they’re there. If they are, we return the result.

And now I’ve decided to structure(构成,排列;安排) this one as a try-except statement rather than an if-then statement. In Python, you always have 2 choices. You can first ask for permission to say: are the args in the cache? If so, return cache[args]. Or you can use the try-except pattern to ask for forgiveness(宽恕;原谅) afterwards: I’m first going to try to say, if the args are in the cache, go ahead and return it. If I get a KeyError, then I have to fill in the cache by doing the computation and then returning the result.
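These two idioms, ask-permission (an if test) versus ask-forgiveness (try/except), can be contrasted in a small Python 3 sketch; the cache and helper names here are illustrative, not from the course code:

```python
cache = {0: 1, 1: 1}

def get_lbyl(n, compute):
    # "Look before you leap": ask permission with an explicit membership test
    if n in cache:
        return cache[n]
    cache[n] = compute(n)
    return cache[n]

def get_eafp(n, compute):
    # "Easier to ask forgiveness": try the lookup, recover on KeyError
    try:
        return cache[n]
    except KeyError:
        cache[n] = compute(n)
        return cache[n]
```

Both behave the same; the try version does a single dictionary lookup on the common hit path, and it extends naturally to the third, TypeError case that memo needs.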

The reason I use the try structure here rather than the if structure is because I knew I was going to need it anyways(不管怎样) for this third case. Either the args are in the cache, or they aren’t, but then there’s this third case which says that the args are not even hashable.

What does that mean?

d = {}
x = 42
x in d --> False
y = [1, 2, 3]
y in d --> not False, but a TypeError

Start out with a dictionary d being empty, and then I’m going to have a variable x, and let’s say x is a number. If I now ask, is x in d? That’s going to tell me false. It’s not in the dictionary yet. But now, let’s say I have another variable, which is y, which is the list [1, 2, 3], and now if I ask is y in d? You’d think that would tell me false, but in fact, it doesn’t. Instead, it gives me an error, and what it’s going to tell me is TypeError: unhashable type: 'list'.

What does that mean?

That means we were trying to look up y in the dictionary, and a dictionary is a hash table — implemented as a hash table. In order to do that, we have to compute the hash code for y and then look in that slot(位置;水沟) in the dictionary. But this error is telling us that there is no hash code for a list.

Why do you think that is?

Are lists unhashable because:

  • lists can be arbitrarily(任意地) long?
  • lists can hold any type of data as the elements, not just integers?
  • lists are mutable(可变的)?(this is the answer)

Now I recognize this might be a hard problem if you’re not up on hash tables. This might not be a question you can answer. But give it a shot and give me your one best response.

33.Cache Management (Answer)

Cache Management - Design of Computer Programs - YouTube

The answer is because lists are mutable(可变的). That makes them unlikely candidates(候选者) to put into hash tables. Here’s why. Let’s suppose we did allow lists to be hashable. Now we’re trying to compute the hash function for y, and let’s say we have a very simple hash function — not a very good one — that just says add up the values of the elements. Let’s also say that the hash of an integer is itself, so the hash code for this list would be equal to 6, the sum of the elements. But now the problem is, because these lists are mutable, I could go in, and I could say, y[0] = 10. y would be the list [10, 2, 3]. Now when we check and say, is y in d? We’re going to compute the hash value 10 + 2 + 3 = 15. That’s a different hash value than we had before. So if we stored y into the dictionary when it had the value 6, and now we’re trying to fetch it when it has the value 15, you can see how Python is going to be confused.
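The behavior described above is easy to check interactively (Python 3; the exact wording of the error message can vary slightly between versions):

```python
d = {}
x = 42
t = (1, 2, 3)    # a tuple: the immutable, hashable cousin of [1, 2, 3]

print(x in d)    # False: an int hashes fine, the lookup just misses
print(t in d)    # False: a tuple hashes fine too

try:
    [1, 2, 3] in d
except TypeError as e:
    print(e)     # unhashable type: 'list'
```

This is also why memo can cache on args (a tuple of arguments) but needs the TypeError branch for calls whose arguments contain a list.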

Now, there’s 2 ways you could handle that. One — the way that Python does take — is to disallow putting the list into the hash table in the first place because it potentially could lead to errors if it was modified. The other way is Python could allow you to put it in, but then recognize that it’s the programmer’s fault, and if you go ahead and modify it, then things are not going to work anymore, and Python does not take that approach, although some other languages do.

34. 练习:Save Time Now

Save Time Now - Design of Computer Programs - YouTube

Now to show off how insanely(疯狂地) great memo(备忘录) is, we’ll want to have before and after pictures, showing the amazing weight loss of a model that was fat before and was thin(瘦的) after applying memo. Oh! Wait a minute. It’s not weight loss. It’s time loss that we’re going to try to measure.

We want to show that before, we have a function f, and that’s going to run very slowly, making us sad. And after, we have a function memo of f(即memo(f)), and that’s going to run very quickly, making us happy.

Now we could do that with a timer(计时器) and say it took 20 seconds to do this and .001 seconds after, but instead of doing it with timing, I think it’s a little bit more dramatic(激动人心的;引人注目的) just to show the number of function calls, and I could go into my function and modify it to count the number of function calls, but you could probably guess a better way to do that. I’m going to define a decorator to count the calls to a function because I’m probably going to want to count the calls to more than 1 function as I’m trying to improve the speed of my programs. So it’s nice to have that decorator around.

@decorator
def countcalls(f):
    "Decorator that makes the function count calls to it, in callcounts[f]."
    def _f(*args):
        callcounts[_f] += 1
        return f(*args)
    callcounts[_f] = 0
    return _f

callcounts = {} 

So here’s the decorator countcalls, you pass it a function, and this is the function that it returns. It’s going to be a function that just increments the entry for this function in a dictionary callcounts. Increment that by 1 and then go ahead and apply the function to the arguments and return that result. We have to initialize the number of calls to each function to be 0, and that’s all there is to it.

@countcalls
def fib(n): return 1 if n <= 1 else fib(n-1) + fib(n-2)

So here I’ve defined the Fibonacci function. This recursive(递归的) function calls itself twice for each call, except for the base case.
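As a quick check of the counting machinery, here is a Python 3 sketch using a bare version of countcalls (without the decorator() metadata helper): fib(5) computes 8 and makes 15 calls in total, since every call above the base cases spawns two more.

```python
callcounts = {}

def countcalls(f):
    # bare version of countcalls, without the decorator() metadata helper
    def _f(*args):
        callcounts[_f] += 1
        return f(*args)
    callcounts[_f] = 0
    return _f

@countcalls
def fib(n):
    return 1 if n <= 1 else fib(n - 1) + fib(n - 2)

result = fib(5)        # fib(5) == 8, reached in 15 recursive calls
```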

@countcalls
@memo
def fib(n): return 1 if n <= 1 else fib(n-1) + fib(n-2)

I can count the calls both with and without the memoized version. So I’m going to do before and after pictures — before and after memoizing.
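The before-and-after numbers can be reproduced with a short Python 3 sketch. Here `counted` is a simplified stand-in for the lesson’s countcalls, and memo is the lesson’s version minus the update_wrapper bookkeeping: without memo, fib(20) takes 21,891 calls; with memo, just 39.

```python
calls = {'n': 0}

def counted(f):
    # minimal stand-in for the countcalls decorator
    def _f(*args):
        calls['n'] += 1
        return f(*args)
    return _f

def memo(f):
    # the lesson's memo, minus the update_wrapper bookkeeping
    cache = {}
    def _f(*args):
        try:
            return cache[args]
        except KeyError:
            cache[args] = result = f(*args)
            return result
    return _f

# Before: plain recursive Fibonacci
@counted
def fib(n): return 1 if n <= 1 else fib(n-1) + fib(n-2)

fib(20)
before = calls['n']    # 21891 calls without memoization

# After: the same function with memo underneath the counter
calls['n'] = 0
@counted
@memo
def fib(n): return 1 if n <= 1 else fib(n-1) + fib(n-2)

fib(20)
after = calls['n']     # only 39 calls with memoization
```

The 39 comes from 1 top-level call plus 2 calls for each of the 19 values (n = 2..20) that get computed exactly once.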

《D o C P》学习笔记(3 - 1)Regular Expressions, other languages and interpreters - Lesson 3_第3张图片

《D o C P》学习笔记(3 - 1)Regular Expressions, other languages and interpreters - Lesson 3_第4张图片

So here’s before. I have the values of n and a value computed for Fibonacci number of n, the number of calls created by countcalls, and then I have the ratio(比例;比率) of the number of calls to the previous number. We see the number of calls goes up by the time we get up to n = 20. We got 10,000 calls. We can scroll down, and by the time we’re up to n = 30, we have 2.6 million calls.

《D o C P》学习笔记(3 - 1)Regular Expressions, other languages and interpreters - Lesson 3_第5张图片

《D o C P》学习笔记(3 - 1)Regular Expressions, other languages and interpreters - Lesson 3_第6张图片

And here’s the after. Now we’ve applied the memo decorator. Now the number of calls is much more modest. Now at 20, we’re only at 39 calls, and at 30, we’re at 59 calls rather than 2.6 million. So that’s pretty amazing weight loss to go from 2.6 million down to 59, just by writing 1 little decorator and applying it to the function. Now just as an aside here, and for you math fans in the audience, I’m back to the before part. This is without the memoization. This number here in this column is the ratio of the number of calls for n = 30 to the number of calls for n = 29. You can see that it’s converging(趋同的,收敛的,收缩的) to this number 1.6180. Math fans out there, I want you to tell me if you recognize that number. Do you think it’s converging to 1 + square root of 5 / 2, or the square root of e?

34.Save Time Now (Answer)

The answer is 1 + square root of 5 over 2, otherwise known as the Golden Ratio. What I knew is that the ratio of successive elements of the Fibonacci sequence — the sequence 1, 1, 2, 3, 5, 8, and so on — converges(converge 收敛) to that ratio. But I didn’t know that the number of calls in the implementation also converges to that ratio. So that’s something new.

35. 练习:Trace Tool

I want to make the point there are different types of tools that you can use in your tool box. We just saw the count calls. I think I would classify(分类;归类) that as a debugging tool, and we saw a memo, and I’ll classify that as a performance tool. Earlier, we saw n_ary, another decorator, which I can classify as an expressiveness(表现,表示) tool. It gives you more power to say more about your language. This gives you no more power but makes it faster. This isn’t going to end up in your final program, but helps you develop the program faster. I want to add another tool here in debugging called trace, which can help you see the execution of your program.

《D o C P》学习笔记(3 - 1)Regular Expressions, other languages and interpreters - Lesson 3_第7张图片
(截图未截完,详见视频。)

So I’m going to define a decorator trace, which when we apply to fib, gives us this kind of output. When I ask here for what’s the 6th Fibonacci number? It says for each recursive call, we have an indented(缩进的) call with an arrow going to the right saying we’re making a call, and for each return, we have an arrow going to the left. When you ask for fib of 6, you keep on going down the list until we get near the end. When we ask for fib of 2, then that’s defined in terms of 1 and 0, and they both return 1, so that means fib of 2 returns 2 and so on. We can see the shape of the trace here as we go. It’s coming to the right and then returning back and coming to the right some more and returning back. The pattern takes a long time to reveal(揭露) itself and would take even longer for larger arguments other than 6. But it gives you some idea for the flow of control of the program.

# ---------------
# User Instructions
#
# Modify the function, trace, so that when it is used
# as a decorator it gives a trace as shown in the previous
# video. You can test your function by applying the decorator
# to the provided fibonacci function.
#
# Note: Running this in the browser's IDE will not display
# the indentations.

from functools import update_wrapper


def decorator(d):
    "Make function d a decorator: d wraps a function fn."
    def _d(fn):
        return update_wrapper(d(fn), fn)
    update_wrapper(_d, d)
    return _d

@decorator
def trace(f):
    indent = '   '
    def _f(*args):
        signature = '%s(%s)' % (f.__name__, ', '.join(map(repr, args)))
        print '%s--> %s' % (trace.level*indent, signature)
        trace.level += 1
        try:
            # your code here
            print '%s<-- %s == %s' % ((trace.level-1)*indent, 
                                      signature, result)
        finally:
            # your code here
        return # your code here
    trace.level = 0
    return _f

@trace
def fib(n):
    if n == 0 or n == 1:
        return 1
    else:
        return fib(n-1) + fib(n-2)

fib(6) #running this in the browser's IDE  will not display the indentations!

So that’s a useful tool have, and here’s an implementation. It follows the same pattern as usual. Decorator takes a function as input. We’re going to create another function, and this is what it’s going to look like. We’re going to figure out what it is that we’re going to print. We’re going to keep a variable, which we keep as an attribute of the trace function itself called the level. We’ll increment that as we come in, print out some results here. We initialize the trace level to 0 — the indentation level — and then finally, we return the function that we just built up. I’ve left out some bits here, and I want you to fill them in to make this function work properly to show the trace that I just showed.

35 Trace Tool (Answer)

@decorator
def trace(f):
    indent = '   '
    def _f(*args):
        signature = '%s(%s)' % (f.__name__, ', '.join(map(repr, args)))
        print '%s--> %s' % (trace.level*indent, signature)
        trace.level += 1
        try:
            result = f(*args)
            print '%s<-- %s == %s' % ((trace.level-1)*indent, 
                                      signature, result)
        finally:
            trace.level -= 1
        return result
    trace.level = 0
    return _f

So the code you had to write was pretty straightforward(直截了当的). Like most decorators, we compute the result here by applying the function to the args, then we return the result down here. Maybe a little bit tricky is what’s in this finally clause, which is we’re decreasing the trace level. So the indentation(缩格;凹进) level goes up by 1 every time we enter a new function and down by 1 every time we go back. The issue here is, we want to make sure that we do decrement this. If we increment this once, and then calling the function results in an error, and we get thrown out of that error, we want to make sure we put it back where it belongs. We don’t want to mess with(打扰) something and then not restore it. So that’s why we put this in a try-finally.
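The finished trace can be exercised end to end. This Python 3 rendering (print function instead of the Python 2 print statement) captures the output of fib(3) and counts the arrow lines:

```python
import io
from contextlib import redirect_stdout
from functools import update_wrapper

def decorator(d):
    "Make function d a decorator: d wraps a function fn."
    def _d(fn):
        return update_wrapper(d(fn), fn)
    update_wrapper(_d, d)
    return _d

@decorator
def trace(f):
    indent = '   '
    def _f(*args):
        signature = '%s(%s)' % (f.__name__, ', '.join(map(repr, args)))
        print('%s--> %s' % (trace.level * indent, signature))
        trace.level += 1
        try:
            result = f(*args)
            print('%s<-- %s == %s' % ((trace.level - 1) * indent,
                                      signature, result))
        finally:
            trace.level -= 1   # restore the level even if f raised
        return result
    trace.level = 0
    return _f

@trace
def fib(n):
    return 1 if n <= 1 else fib(n - 1) + fib(n - 2)

buf = io.StringIO()
with redirect_stdout(buf):
    value = fib(3)
lines = buf.getvalue().splitlines()
# fib(3) == 3, traced in 10 lines: 5 calls in, 5 returns out
```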

36. Disable Decorator

We’re coming to the end of what I want to say about decorators. I wanted to add one more debug tool. That’s one I’m going to call disabled. It’s very simple.

def disabled(f): return f

Disabled is another name for the identity(身份;恒等(式)) function — that is the function that returns its argument without doing any computation on it whatsoever. Why do I want it and why do I call it “disabled?”

@trace
def f...

Well, the idea is that if I’m using some of these debugging tools like trace or countcalls, I might have scattered(scatter (使)散开、分散) throughout my program @trace applied to f and some other traced functions. Then I might decide I think I’m okay now. I think I’ve got it debugged. I don’t want to trace any more. Then what I can do is I can just say “trace = disabled” and reload my program, and now the decorator trace will be applied to the function, but what it will do is return the function itself. Notice we don’t have to say that disabled is a decorator, even though we’re using it as if it were one, because it doesn’t create a new function. It just returns the original function. That way we won’t have the trace output cluttering up(clutter up 使散乱) our output, and the function will be efficient. There won’t even be a test to see if we are tracing or not. It’ll just use the exact function that passed in.
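Here is a sketch of the disable trick (Python 3; the trace here is a hypothetical stand-in for the real tracing decorator):

```python
def disabled(f): return f

def trace(f):
    # hypothetical tracing wrapper; stands in for the real trace decorator
    def _f(*args):
        print('--> %s%r' % (f.__name__, args))
        return f(*args)
    return _f

trace = disabled       # one assignment turns tracing off everywhere

@trace
def square(x): return x * x

result = square(7)     # no trace output; square IS the original function
```

Because disabled is the identity function, @trace now hands back the undecorated function itself, with zero wrapper overhead.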

37. 练习:Back To Languages

Now we’re done with decorators, and I want to go back to languages. The first thing I want to say is that our culture is full of stories of wishful thinking. We have the story of Pinocchio, who wishes to be a real boy, and the story of Dorothy, who wishes to return from Oz to her home in Kansas by clicking her shoes together. They find that they have the power within them to fulfill(完成;达到) their wishes.

def fib(n): ...fib(n-1)...

Programming is often like that. I’m writing the body of the definition of a function, and I wish I had the function “fib” defined already. If I just assume that I did, then eventually my wish will come true. In this case, it was good while writing the right-hand side to just assume I wish I had the function I want and proceed as if you did, and sometimes it’s a good idea to say I wish I had the language I want and proceed as if you did.

3 + x

Here’s an example. Suppose you had to deal with algebraic(代数的) expressions(3 + x) and not just compute them the way we can type this expression into Python and see its value if x is defined but manipulate(操作;操纵) them. Have the user type them in, modify them, and treat them as objects rather than as something to be evaluated.

Now, my question for you: is it possible to write a regular expression that could recognize valid expressions like this?

Is the answer, yes, it is possible to write that?

No, it’s not possible because our language we’re trying to define has the plus and asterisk(星号) symbols and those are special within regular expressions?

Or no, we can’t parse this language because this language includes parentheses(圆括号), and we can’t do balanced parentheses for the regular expressions.

y + (3 + x)

If you’re not familiar with language theory, this may be a hard question for you, but go ahead and give it a shot(give it a shot 试一试;试试看) anyways.

Back to Languages (Answer)

The answer is that the problem is the balanced parentheses(balanced parentheses 平衡圆括号). Regular expressions can handle a set number of nesting parentheses — one or two or three — but they can’t handle an arbitrary(随意的) number of nestings. It makes sure that all the left parentheses balance with the right parentheses. We’re going to need something else. The thing we traditionally look at is called context-free(上下文无关) languages, which are more general than the regular languages. We’ll see how to handle that.
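A small Python sketch of the distinction: a regular expression can be written for any fixed nesting depth, but checking arbitrary depth needs a counter, i.e. state that grows, which is exactly what regular languages lack. The depth-2 pattern below is illustrative, not from the course:

```python
import re

# A regular expression can check parentheses up to a FIXED depth.
# This pattern handles nesting of depth <= 2 only:
depth2 = re.compile(r'[^()]*(\([^()]*(\([^()]*\)[^()]*)*\)[^()]*)*$')

def balanced(text):
    "Arbitrary-depth check: needs a counter, which regular languages lack."
    depth = 0
    for ch in text:
        if ch == '(':
            depth += 1
        elif ch == ')':
            depth -= 1
            if depth < 0:
                return False     # a ')' closed with nothing open
    return depth == 0            # every '(' must be closed
```

depth2 accepts 'y + (3 + x)' and '(a(b)c)' but rejects '(((x)))', while the counter-based balanced handles any depth.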

38. Writing Grammar

The tricky(复杂的) part is that if I have an expression like (m * x) + b, I want to make sure that I parse(vt.从语法上描述或分析(语句等)) that as if m and x are together and not x and b are together.

Another way to say that is I want to see any expression as if it were the sequence of something else plus something else plus something else: (m * x) + b => … + … + … These pluses can also be minus signs: … - … - … But, anything with the times sign(我猜测是乘号*), I want to have that all be within the dotdotdot: (…) + (…) + (. * .) rather than encompassing(encompass 围绕,包围;完成) a plus sign in it.

So, what I want to do is write a grammar that defines the language of these expressions. Grammar is a description. It’s finite(有限的,有穷的) in length, and it should be easy to write. A language is a set of all possible strings that that grammar describes.

I want to be able to say that in a concise(简明的,简洁的) language, maybe something like saying I’m going to define expressions, and I want to define the expression as something that looks like this: … + … + … or … - … - … , and I need a name for the dotdotdots(即省略的点点点), so let’s call that a term(术语).

So, an expression can be a term, a minus or a plus: Exp => Term [-+] using the regular expression notation to describe a plus or minus, and then more terms, but I can just say recursively(递归地) that that’s an expression: Exp => Term [-+] Exp

That will give me any number of terms with plus or minuses in between them. So that’s one possibility, but that’s recursive, and I need a base case. So I can say that it can also be the case that an expression can consist of a single term: Exp => Term [-+] Exp | Term. So, a + b is an expression, a + b + c is an expression, and also just a.

That is the rule for expression. Then I would write the rule for term; it would be similar. It would take into account the multiply and divide (Term => Factor [*/] Term | Factor), and I would write the rest of the grammar in this format.

Notice here that I have just made up, I’ve used wishful thinking to say I wish I had a notation like this to describe what the grammar is.

39. Descriptionary

Exp     => Term [+-] Exp | Term
Term    => Factor [*/] Term | Factor
Factor  => Funcall | Var | Num | [(] Exp [)]
Funcall => Var [(] Exps [)]
Var     => [a-zA-Z_]\w*
Num     => [-+]?[0-9]+([.][0-9]*)?

Here I’ve continued my wishful thinking, and I’ve written out the complete grammar. I think I’ve got it right. Now all I’m missing is something that understands this format. Let’s continue the wishful thinking and wish that we had a function that could take this format and turn it into something that we can then use in a parser.

G = grammar(r"""
Exp     => Term [+-] Exp | Term
Term    => Factor [*/] Term | Factor
Factor  => Funcall | Var | Num | [(] Exp [)]
Funcall => Var [(] Exps [)]
Var     => [a-zA-Z_]\w*
Num     => [-+]?[0-9]+([.][0-9]*)?
""")

Here I have applied the wishful thinking and said I wish I had this function grammar that would give me this grammar G, and then I could use it to parse expressions.

Now it’s up to me to deliver on that promise or that wishful thinking.

G = {'Exp': (['Term', '[+-]', 'Exp'], ['Term']),
     'Term': (['Factor', '[*/]', 'Term'], ['Factor']),
    }

The first thing I should ask myself is what do I want this object G to look like?

One good idea is that G could be a dictionary. The keys of the dictionary would be these things on the left-hand side, and then the values of each key would be some object corresponding to this representation on the right-hand side. I would have expression–something like this. Now all I’ve got to do is figure out what goes on the right-hand side.

Here I’ve made a choice, and what I’ve said is, well, the right-hand side looks like it’s always going to be a set of alternatives. The order matters, so I’m not going to make it into a set.
I’m going to make it into a sequence of some kind, and let’s make that a tuple. The right-hand side is always going to be a tuple of alternatives. It might be a tuple of length 1, but it’s always going to be a tuple. Each element of that tuple is going to be a sequence–here a sequence of three items, here a sequence of one items. We’ll represent sequences as lists. Here are the three items, and here is the one item. Every right-hand side is a tuple of lists where the lists are a sequence and each element of the list is a atom. The atoms can either be the name of a category(类型,部门,种类) that’s defined on the left-hand side, or they can be a regular expression that’s going to match actual characters in the input when we go to parse.

Now, of course I could have typed in the whole grammar like this to begin with, but that would have been messy(凌乱的;复杂的,麻烦的) and error-prone(有…倾向的;易于…的) and harder to read. I did myself a favor(do a favor 帮忙) imagining the language I wanted and writing in that language, then figuring out how to get it into something the computer can understand.

Now all I have to do is write this function grammar that translates from this format(指G = grammar(...)这种形式) into this format(指G = {key: value}这种形式).

def grammar(description):
    """Convert a description to a grammar. """
    G = {}
    for line in split(description, '\n'):
        lhs, rhs = split(line, ' => ', 1)
        alternatives = split(rhs, ' | ')
        G[lhs] = tuple(map(split, alternatives))
    return G

Fortunately, it’s really easy to write this function. Here the whole thing is. Here’s what I’ve done. I’ve initialized the grammar that I’m going to produce to be the empty dictionary. Then I’ve gone through the whole description–the description is the input–and split it up by lines. For each line I’ve split up each line by the arrow into a left-hand side and a right-hand side. Then for each right-hand side I’ve split that up, using the or sign, into a list of alternatives of one or more. Then I’ve just said let’s assign to the left-hand side–expression or term–a tuple, because that’s what we’re using for alternatives of the map of splitting up each alternative. The alternative here would be a string. It would be the string of everything starting from term and ending with expression. Then I just mapped the split function to it to split it up into individual atoms. That would split it up into a three-element list–term, plus or minus, and expression. That’s all I need. This takes a description in the grammar format, immediately implements it, and builds up a dictionary for us.

def split(text, sep=None, maxsplit=-1):
    "Like str.split applied to text, but strips whitespace from each piece."
    return [t.strip() for t in text.strip().split(sep, maxsplit) if t]

I should say I’ve used here this utility function “split.” It’s just like string.split, except when it splits things up it strips the white space from each of the pieces.

40. White Space

Now, I should say that if we take this grammar here, and if I have this function to define grammars, and if I write a function to parse, then given an expression like m*x+b, I’m going to parse it just great.

Unfortunately, with this description just as it is, I would have trouble with m * x + b with spaces in between it, because nowhere here did I say that spaces can occur between tokens.

In some grammars you’re free to put spaces everywhere, and in other grammars you aren’t. If I want this grammar function to be useful, I should be able to tell which is which.

I’m going to show a new version of grammar that takes that into account.

Here it is. There are a few things you should notice about it right away.

def grammar(description, whitespace=r'\s*'):
    G = {' ': whitespace}
    description = description.replace('\t', ' ') # no tabs!
    for line in split(description, '\n'):
        lhs, rhs = split(line, ' => ', 1)
        alternatives = split(rhs, ' | ')
        G[lhs] = tuple(map(split, alternatives))
    return G

First is that I add an optional parameter to say you can specify what white space is allowed in between tokens. Here it’s saying \s* is allowed–any number of whitespace characters–zero or more. You could change that if you have a different type of grammar you’re defining.

Secondly, I have much longer documentation here–a documentation string describing what the language is.

Third, I’ve added one more statement here, which says if there are any tabs in the input, replace them with spaces. Why did I do that? Notice that I’m splitting here an arrow surrounded by spaces and an or bar surrounded by spaces. If somebody had written something and used a tab there rather than a space, that would be really hard to debug. I don’t want to have to force them to debug it. I want to do the debugging for them and make that not be an error at all.
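As a quick sanity check, here is the improved grammar function together with the split utility from the previous section, applied to a two-rule description (a runnable condensation of the code above; the ' ' key carries the whitespace pattern):

```python
def split(text, sep=None, maxsplit=-1):
    "Like str.split applied to text, but strips whitespace from each piece."
    return [t.strip() for t in text.strip().split(sep, maxsplit) if t]

def grammar(description, whitespace=r'\s*'):
    "Convert a description to a grammar, storing whitespace under ' '."
    G = {' ': whitespace}
    description = description.replace('\t', ' ')  # no tabs!
    for line in split(description, '\n'):
        lhs, rhs = split(line, ' => ', 1)
        G[lhs] = tuple(map(split, split(rhs, ' | ')))
    return G

G = grammar(r"""
Exp  => Term [+-] Exp | Term
Term => Factor [*/] Term | Factor
""")
```

Each right-hand side comes out as a tuple of alternatives, each alternative a list of atoms, exactly the shape sketched earlier: G['Exp'] is (['Term', '[+-]', 'Exp'], ['Term']).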

41. Parsing

parse(symbol, text, G)

Now I’m going to define the parser, and I want it to have this signature. I’m going to define a function parse, which takes a symbol, saying I want to parse something like an expression. It takes text that we’re going to parse, and it takes the grammar that defines that symbol and all the other symbols.

I want it to return–when we had regular expressions we looked at returning a set of remainders. I’m going to have this return a single remainder.

parse(symbol, text, G)  -->  remainder

The idea is that we want the author of the grammar to have some control over what’s going on and make this grammar be unambiguous(不含糊的;清楚的) so that there is a single result for each parse rather than having to keep a set of results.

Exp => Term [+-] Exp | Term

The idea is that for each symbol we’re going to consider the alternatives in left-to-right order. If we’re asked to parse an expression, we’ll say can we parse this alternative. If we can, if we can parse in succession a term and then a plus or minus then an expression and then we’ll commit to that. We’ll say that’s the result we’re going to return. That’s the single remainder, and we’re not going to even explore the other alternatives. Because we have this left-to-right choice, that’s why we decide to write the expressions in this order.

Term | Term [+-] Exp

We don’t write them in the opposite order because if we wrote the rule this way and we were trying to parse the expression a + 3, we’d be asked to parse this, we’d try to parse a term, we’d say, yes, a is a term, and then we’d stop, and we’d say that’s the end. We wouldn’t even consider parsing off the term plus the expression.

This is no good(指Term | Term [+-] Exp). It’s up to the author of the grammar to write rules in the correct order so that the longest thing gets tried first.

Our regular expression functions are what is known as recognizers. They told us–yes or no–is a string part of a language. Here what we’re doing is different. It’s a parser. It tells you–yes or no–is it part of the language, but then it also gives you an internal structure, a tree structure of the parts of the expression.

《D o C P》学习笔记(3 - 1)Regular Expressions, other languages and interpreters - Lesson 3_第8张图片

Here it would be a + 3. If we had m * x + b, then that would parse into a structure that had m * x + b. Here I said we return a remainder, but actually I want to return a two element tuple of the tree followed by the remainder(即(tree, remainder)).

>>> parse('Exp', 'a * x', G)
(['Exp', ['Term', ['Factor', ['Var', 'a']], '*',
          ['Term', ['Factor', ['Var', 'x']]]]], '')

Here’s what we’re going for. If I asked to parse the expression a * x with the grammar G, I want to get back this tree structure. It looks kind of complicated. All it’s really saying is that we have an a in the first element, then the multiply sign, and then an x in the third element, and there’s no remainder. We parsed the whole string. That’s what it says, but there’s all these intermediate(中间的) parts here because that’s the way the grammar is defined. I should say here that this is a (tree, remainder) result.

Fail = (None, None)

We’re going to use the convention that failure corresponds to Fail, the pair (None, None).

Let’s think then about what we have to do to be able to parse a text. I can think of four cases that we have to deal with. One, we want to be able to parse an expression or a symbol like the expression keyword(即'Exp'). To do that we’ve got to be able to look that up in the grammar G. We’ve also got to be able to parse a regular expression, plus or minus([+-]), and we’ve seen how to do that before, so we reduce that to a solved problem. We’re going to have to be able to parse a tuple of alternatives(即([...], [...])). Here one alternative or another alternative done in left-to-right order. Finally, we’re going to have to be able to parse a list of atoms representing a sequence–1, 2, 3(即[..., ..., ...]).

Now, I’m going to tell you the plan for how we’re going to implement this as code within the function parse. So this first case I’m going to handle with a subroutine called “parse_atom.” This is an atomic(原子的) expression. We should be able to handle that.

The second part, the regular expression, is a type of atom so it occurs within the parse atom routine. We’re going to define a variable called “tokenizer” to help us do that. Then we’re going to use the built-in re.match along with the tokenizer to recognize those regular expressions.

Then for the alternatives, a tuple of alternatives is what you get when you look up the symbol, like exp, in the grammar. I could have had a separate function called parse_alternatives but instead I just decided to have that be part of parse_atom, because it was so simple it was just an immediate step to go from the symbol to the list of alternatives.

Finally, to parse this sequence of atoms, I have a subroutine or function called “parse_sequence.”

42. Parse Function

import re

def parse(start_symbol, text, grammar):
    """Example call: parse('Exp', '3*x + b', G).
    Returns a (tree, remainder) pair. If remainder is '', it parsed the whole
    string. Failure iff remainder is None. This is a deterministic PEG parser,
    so rule order (left-to-right) matters. Do 'E => T op E | T', putting the
    longest parse first; don't do 'E => T | T op E'.
    Also, no left recursion allowed: don't do 'E => E op T'."""

    tokenizer = grammar[' '] + '(%s)'

    def parse_sequence(sequence, text):
        result = []
        for atom in sequence:
            tree, text = parse_atom(atom, text)
            if text is None: return Fail
            result.append(tree)
        return result, text

    def parse_atom(atom, text):
        if atom in grammar:  # Non-Terminal: tuple of alternatives
            for alternative in grammar[atom]:
                tree, rem = parse_sequence(alternative, text)
                if rem is not None: return [atom]+tree, rem
            return Fail
        else:  # Terminal: match characters against start of text
            m = re.match(tokenizer % atom, text)
            return Fail if (not m) else (m.group(1), text[m.end():])

    # Body of parse:
    return parse_atom(start_symbol, text)

Fail = (None, None)

Here we go. Here’s a function “parse.” It takes a start symbol like Exp. It takes a text like 3*x + b. It takes a grammar defined with our grammar function.

The first thing we do is define the tokenizer. The tokenizer has two jobs. First it has to handle white space that occurs before the token, and it does that by looking up in grammar under the key consisting of just a space. In the definition of grammar we arrange to store the white space parameter under that key. This says build up a regular expression that will parse off the appropriate amount of white space–some, all, none, whatever is appropriate for your grammar.

Its second job is to parse off whatever is defined by the atom that we’re looking at next; we’ll see how when the tokenizer is used. Then parse_sequence says we’re just going to go through a sequence. This is a list of atoms. We initialize our result to be the empty list. Then we go through and try to parse one atom at a time. If we get back None for the remainder, then Fail. Otherwise, append to the result the tree that we built up by doing that parse and continue on in the loop. Notice that we’re updating the text variable, so we’re taking the remainder each time and parsing the next atom from the remainder of the previous atom.

Now parse_atom handles two cases. If the atom is something that’s in the grammar, like Exp, we map it into a tuple of alternatives by looking it up in the grammar, getting that list of alternatives. For each alternative, we try to parse. If we have a successful match–that is, if the remainder is not None, indicating failure–if there’s some sort of remainder left over, then we want to return the result. We return the remainder we got, and we build up a tree structure consisting of the atom that we’re trying to parse plus the tree of whatever we got back. If we exhaust all the alternatives and we can’t come up with anything, then we return Fail, which says no tree and no remainder. Otherwise, if the atom is not in the grammar, then it must be a regular expression. We take the tokenizer that we built up before, insert the atom, which is a regular expression, into that tokenizer, and match it against the text. If there’s not a match, then we Fail. If there is a match, we pull out the matching part. That’s going to be the tree–the token that we matched. Then we take the rest of the text after the match and return that as the remainder. This is the only place where the text actually advances: this one spot where we’re matching tokens against the input text.

These two routines are internal routines inside the definition of parse, and here’s the body of parse that just says parse this atom–the start symbol against the text.
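To make the data shapes concrete, here is a runnable sketch: the parse function repeated verbatim so the example is self-contained, applied to a small hand-built grammar. The grammar dict G and the input '3 + x' below are my own illustration; in the lesson the grammar would come from the grammar function.

```python
import re

Fail = (None, None)

def parse(start_symbol, text, grammar):
    "The parse function from above, repeated so this example runs standalone."
    tokenizer = grammar[' '] + '(%s)'

    def parse_sequence(sequence, text):
        result = []
        for atom in sequence:
            tree, text = parse_atom(atom, text)
            if text is None: return Fail
            result.append(tree)
        return result, text

    def parse_atom(atom, text):
        if atom in grammar:   # non-terminal: try each alternative in order
            for alternative in grammar[atom]:
                tree, rem = parse_sequence(alternative, text)
                if rem is not None: return [atom] + tree, rem
            return Fail
        else:                 # terminal: a regular expression
            m = re.match(tokenizer % atom, text)
            return Fail if (not m) else (m.group(1), text[m.end():])

    return parse_atom(start_symbol, text)

# A hypothetical toy grammar: an Exp is a Term, optionally followed by +/- and another Exp
G = {
    ' ':    r'\s*',                              # whitespace allowed before each token
    'Exp':  (['Term', '[+-]', 'Exp'], ['Term']), # longest alternative first
    'Term': (['Var'], ['Num']),
    'Var':  (['[a-z]+'],),
    'Num':  (['[0-9]+'],),
}

tree, remainder = parse('Exp', '3 + x', G)
print(tree, repr(remainder))
```

Each non-terminal contributes one nested list headed by its own name, which is where all those intermediate parts in the tree come from; the empty remainder confirms the whole input was consumed.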

43. Quiz: Speedy Parsing

That’s the parser, and I know we went really quick. We didn’t have time to go over a lot of the details. If you hadn’t seen anything about parsing or languages before, I may have been too fast for you and I apologize. But part of the idea was to teach a little bit about language theory, but more the idea was just to give you a tool and to show you how these tools are built and how you can use them. Don’t worry if you didn’t get all the details of exactly how the parser works. If you are interested, there’s another course and there’s some other documentation that you can read to get up to speed on it.

But I just want to talk about one more thing. There’s something that bothers me here, which is: say we have an expression–one single expression that’s really long((x + y ··· ···))–and we start parsing it this way. We say let’s go left to right. We’ll do the first alternative first. Parse the term, and it parses this big, big long term and then tries to parse a plus or minus, and it says, oops, I hit the end. There’s no more plus or minus. This alternative fails.

Now I try the second alternative, and the second alternative says start with a term and we go back and we parse this whole big term all over again.

We’ve doubled the amount of work, and it can be worse than that, because inside each of these terms there are more little pieces, and we can be doubling the work for each of those pieces too. That seems really inefficient. What we’d like to do is fix up our parser so that once we’ve done this work, we don’t have to do it a second time in case this alternative fails. Don’t worry about handling the case where, say, 5 out of 6 of these got parsed; just worry about the individual atoms: if I did the work of computing the tree for an individual atom, I don’t want to have to repeat that work twice.

The quiz question is: can you go in here and modify this parse function so that it has the property that it doesn’t repeat work? Once it’s done a computation and is asked to do it a second time, it doesn’t repeat the computation; it just does the computation once. That may seem like a daunting(令人畏惧的) task, but there is a really simple solution that you have the tools to implement.

# ---------------
# User Instructions
#
# Modify the parse function so that it doesn't repeat computations.
# You have learned about a tool in this unit that prevents 
# repetitive computations. Try using that!
#
# For this question, the grader will be looking for a specific 
# solution. Hint: it should only involve adding one line of code
# (and that line should only contain 5 characters).

from functools import update_wrapper
import re

def parse(start_symbol, text, grammar):
    """Example call: parse('Exp', '3*x + b', G).
    Returns a (tree, remainder) pair. If remainder is '', it parsed the whole
    string. Failure iff remainder is None. This is a deterministic PEG parser,
    so rule order (left-to-right) matters. Do 'E => T op E | T', putting the
    longest parse first; don't do 'E => T | T op E'
    Also, no left recursion allowed: don't do 'E => E op T'"""

    tokenizer = grammar[' '] + '(%s)'

    def parse_sequence(sequence, text):
        result = []
        for atom in sequence:
            tree, text = parse_atom(atom, text)
            if text is None: return Fail
            result.append(tree)
        return result, text

    def parse_atom(atom, text):
        if atom in grammar:  # Non-Terminal: tuple of alternatives
            for alternative in grammar[atom]:
                tree, rem = parse_sequence(alternative, text)
                if rem is not None: return [atom]+tree, rem  
            return Fail
        else:  # Terminal: match characters against start of text
            m = re.match(tokenizer % atom, text)
            return Fail if (not m) else (m.group(1), text[m.end():])

    # Body of parse:
    return parse_atom(start_symbol, text)

Fail = (None, None)

# The following decorators may help you solve this question. HINT HINT!

def decorator(d):
    "Make function d a decorator: d wraps a function fn."
    def _d(fn):
        return update_wrapper(d(fn), fn)
    update_wrapper(_d, d)
    return _d

@decorator
def memo(f):
    """Decorator that caches the return value for each call to f(args).
    Then when called again with same args, we can just look it up."""
    cache = {}
    def _f(*args):
        try:
            return cache[args]
        except KeyError:
            cache[args] = result = f(*args)
            return result
        except TypeError:
            # some element of args can't be a dict key
            return f(*args)
    return _f

43. Speedy Parsing Solution

Here’s the answer to turn this parser from a slow one into a fast one that doesn’t repeat its work. All we need are these five characters: we apply the memo decorator(即@memo) to parse_atom, and then we’re done. The recursive call to parse_atom, with a term and the remainder as inputs, is the same call we already did once; we just look it up in the cache table, and we’ve got the result. Notice that in order to make that work, I had to make this function, parse_atom, be something whose arguments were hashable.

The atom and the text are both strings, and those are both hashable, but the original call to parse had a grammar as the third argument–and the grammar is a dictionary; dictionaries are mutable, so they’re not hashable. I couldn’t memoize the whole parse function and then have parse be called recursively here.

That’s why I had to have parse_atom and parse_sequence be internal functions inside the definition of parse, so that the grammar would be a variable that this function knows about, but it’s not one of the arguments so that the memo decorator works.
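To see the effect of memoization in isolation, here is the memo machinery from the listing above applied to an instrumented stand-in for parse_atom; the stand-in function and the call counter are my own illustration:

```python
from functools import update_wrapper

def decorator(d):
    "Make function d a decorator: d wraps a function fn."
    def _d(fn):
        return update_wrapper(d(fn), fn)
    update_wrapper(_d, d)
    return _d

@decorator
def memo(f):
    "Cache f's return value for each args; reuse it on repeated calls."
    cache = {}
    def _f(*args):
        try:
            return cache[args]
        except KeyError:
            cache[args] = result = f(*args)
            return result
        except TypeError:
            # some element of args can't be a dict key: fall back to calling f
            return f(*args)
    return _f

calls = []

@memo
def fake_parse_atom(atom, text):
    "Stand-in for parse_atom: just records that real work happened."
    calls.append((atom, text))
    return [atom, text[0]], text[1:]

fake_parse_atom('T', 'xyz')   # does the work
fake_parse_atom('T', 'xyz')   # cache hit: no second computation
print(len(calls))             # → 1: the underlying work ran only once
```

Both arguments are strings, hence hashable, so the (atom, text) tuple can serve as a cache key, exactly the property parse_atom has inside parse.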

44. Catching Typos

Now we’ve got a grammar, and we’ve got a parser. Let’s see what it’s good for. Here’s an example of parsing URLs. Here’s an official specification from the W3–the World Wide Web Consortium. There’s a definition on their page. It’s not quite in the right format that we use. But it was very easy to take their format and translate it into our format. I just sort of went down their rules one-by-one. Some of them you copy verbatim((完全)照字面的). Some of them I had to fix up a little bit. You can see here is the whole specification of the grammar. It’s 40 lines or so. It’s fairly involved and verbose(冗长的,啰嗦的). But it handles URLs. It was straightforward(直截了当的明确的) to go from the specification that they had on their website to one that could be parsed by the simple parser we built.

def verify(G):
    lhstokens = set(G) - set([' '])
    # Skip the ' ' (whitespace) entry: its value is a regex string, not a rule
    rhstokens = set(t for k, alts in G.items() if k != ' '
                    for alt in alts for t in alt)
    def show(title, tokens): print(title, '=', ' '.join(sorted(tokens)))
    show('Non-Terms', lhstokens)
    show('Terminals', rhstokens - lhstokens)
    show('Suspects ', [t for t in (rhstokens - lhstokens) if t.isalnum()])
    show('Orphans  ', lhstokens - rhstokens)

Now, I have to say that as I was doing this, I didn’t get it all right the first time. I made a couple of typos that made my grammar not work, and I found this function very useful. I wrote this function to verify a grammar. What it does is find all the tokens that are on the left-hand side and on the right-hand side, and then show them to me.

It shows me the nonterminals(nonterminal 非终结符号), the things that are on the left-hand side.

It shows me the terminals(终结符) that are the right-hand side but not on the left-hand side. Those should be the regular expressions.

It shows me some suspects. What’s a suspect(suspect 怀疑)? That’s something that looks like a name that should appear on the left-hand side–it consists of alphanumeric(文字数字的) characters–but doesn’t. If that’s happened, it’s probably because I tried to use the same word twice and spelled it wrong in one place.

Then there are the orphans(orphan 孤儿). Those are the things that appear on the left-hand side but don’t appear on the right-hand side. They’re useless. They’re defined, but they aren’t called anywhere. As I was writing this grammar, I would call this function and say, oops, there’s an error. I fixed it and called the function and say, oops, there’s another error. Then I got it right.
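Here is a sketch of that workflow on a grammar with a deliberate typo. The verify_sets variant below is my own adaptation: it returns the four token sets instead of printing them, and the grammar and the misspelled Nmb token are hypothetical.

```python
def verify_sets(G):
    "Like verify above, but returns the four token sets instead of printing."
    lhstokens = set(G) - set([' '])
    # Skip the ' ' (whitespace) entry: its value is a regex string, not a rule
    rhstokens = set(t for k, alts in G.items() if k != ' '
                    for alt in alts for t in alt)
    terminals = rhstokens - lhstokens
    return {
        'Non-Terms': lhstokens,
        'Terminals': terminals,
        'Suspects':  set(t for t in terminals if t.isalnum()),
        'Orphans':   lhstokens - rhstokens,
    }

# 'Nmb' is a typo for 'Num', so 'Num' ends up defined but never used
G = {
    ' ':    r'\s*',
    'Exp':  (['Term', '[+-]', 'Exp'], ['Term']),
    'Term': (['Nmb'],),
    'Num':  (['[0-9]+'],),
}

report = verify_sets(G)
print(report['Suspects'])  # {'Nmb'}: looks like a name, but never defined
print(report['Orphans'])   # {'Num'}: defined, but never referenced
```

One misspelling produces both a suspect and an orphan, which is exactly the pattern to look for when the checker's report comes back.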

45. Summary

Congratulations. We should be happy we’ve made it all the way through. Now let’s try to summarize what it is that we’ve covered in this unit.

To a large extent(长度;程度), this unit was about tools–how to build useful tools, how to recognize what they are and put them together, and how to apply them to components of a domain.

In particular, one of the important tools we looked at is language–being able to define the language that you want to deal with rather than the language that Python happens to give you.

We talked about grammars that describe a language, interpreters that interpret them, and compilers that do the same thing only faster.

The other main lesson in this unit was functions. Functions, too, are a very powerful tool. They’re powerful in a way that statements are not.

Although in the end we need statements to write our programs, statements have this flaw(缺点): they’re not as composable(组成的) as functions are. If you want to reuse a statement somewhere else, the only thing you can do is copy and paste it. That can be a problem, because then you’ve got multiple copies, and you update one but don’t update the other.

With functions you compose them rather than copy and paste them, which means there’s one function–one object–that gets reused automatically. When you change it, it gets changed everywhere, because there’s one object, not multiple copies from copy and paste. That’s the advantage of functions.

We talked about decorators as functions. We showed how to use functions as objects, and we showed some patterns for how they fit together and allow us to define tools.

I hope you found this all useful, and I think you’ll see in the homework assignments that we have for this unit that the difficulty has been stepped up a bit, but you should be able to handle them with the tools that we’ve covered.

References:

  1. Design of Computer Programs - English introduction - Udacity;
  2. Design of Computer Programs - Chinese introduction - Udacity;
  3. Design of Computer Programs - Chinese introduction - Guokr;
  4. Design of Computer Programs - video list - Udacity;
  5. Design of Computer Programs - video list - YouTube;
  6. Design of Computer Programs - class wiki - Udacity.
