Lecture 13 Formal Language Theory & Finite State Automata

Contents

      • What is a language?
      • Formal Language Theory
      • Motivation
      • Examples
      • Beyond the membership problem
      • Regular Languages
      • Finite State Acceptor (FSA)
      • Derivational Morphology
      • Weighted FSA
      • Finite State Transducer (FST)
      • FST for Inflectional Morphology
      • Non-Regular Languages
      • Center Embedding

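What is a language?

So far we have seen several methods for processing sequences of symbols such as words, sentences, and documents:

  • Language models
  • Hidden Markov models
  • Recurrent neural networks

However, none of these models addresses the nature of language itself, because they can be applied to any sequence of symbols, not only to words and sentences.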

Formal Language Theory

Formal language theory gives us a mathematical framework for defining languages.

  • Studies classes of languages and their computational properties


  • Language: a set of strings

  • String: a sequence of elements drawn from a finite alphabet

    • The alphabet can be thought of as a vocabulary
    • The elements can be thought of as words

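Motivation

Formal language theory studies classes of languages and their computational properties. In this course we mainly cover two classes of formal languages:

* Regular languages
* Context-free languages

These are the first two classes in the hierarchy of formal language theory. More complex classes follow, such as context-sensitive languages, but we will not go into them in this course.

The main goal is to solve the membership problem: does a given string belong to a given language?

How do we do that? We can define a grammar for the language and then check whether the string conforms to that grammar.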

Examples

  • Examples of languages:
    • Binary strings that start with 0 and end with 1
      • 01, 001, 011, 0001, ... belong to this language
      • 1, 0, 00, 11, 100, ... do not belong to this language
    • Even-length sequences over the alphabet {a, b}
      • aa, ab, ba, bb, aaaa, ... belong to this language
      • aaa, aba, bbb, ... do not belong to this language
    • English sentences that start with a wh-word and end with a question mark
      • what?, where my pants?, … belong to this language
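
Because a language is just a set of strings, the first two examples can be sketched as membership predicates. The function names below are illustrative and not from the lecture; the wh- example is left out because it depends on a word list.

```python
from itertools import product

def starts0_ends1(s: str) -> bool:
    """Binary strings that start with 0 and end with 1."""
    return len(s) >= 2 and set(s) <= {"0", "1"} and s[0] == "0" and s[-1] == "1"

def even_length_ab(s: str) -> bool:
    """Even-length strings over {a, b}; the empty string has length 0, so it counts."""
    return set(s) <= {"a", "b"} and len(s) % 2 == 0

def strings_up_to(alphabet, max_len):
    """Enumerate every string over `alphabet` of length 0..max_len, shortest first."""
    for n in range(max_len + 1):
        for chars in product(alphabet, repeat=n):
            yield "".join(chars)

# The language itself is infinite; here we list only its members of length <= 3.
members = [s for s in strings_up_to("01", 3) if starts0_ends1(s)]
print(members)  # ['01', '001', '011']
```

Enumerating and filtering only works for bounded lengths, which is one reason to look for a finite description of the language, such as a grammar.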

Beyond the membership problem

  • Membership
    • Does a string belong to the language? Yes or no
  • Beyond the membership problem:
    • Scoring
      • Graded membership: how acceptable is a string? (language models)
    • Transduction
      • Translate one string into another (e.g. stemming)

Regular Language

Regular Languages

  • Regular languages are the simplest class of languages

  • Any regular expression defines a regular language

    • It describes which strings are part of the language, e.g. 0(0|1)*1
  • Formally, a regular expression is built from the following operations/definitions:

    • A symbol drawn from the alphabet Σ
    • The empty string ε
    • Concatenation of two regular expressions: RS
    • Alternation of two regular expressions: R|S
    • Kleene star for 0 or more repeats: R*
    • Parentheses () to define the scope of operations
  • E.g.

    • Binary strings that start with 0 and end with 1: 0(0|1)*1
    • Even-length sequences from alphabet {a, b}: ((aa)|(ab)|(ba)|(bb))*
    • English sentences that start with a wh-word and end in ?: ((what)|(where)|(why)|(which)|(whose)|(whom))Σ*? (the final ? is a literal question mark)
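
The three expressions can be tried out with Python's `re` module. This is a sketch under two assumptions: `.` stands in for "any symbol of Σ", and the literal question mark is escaped because `?` is an operator in Python's syntax. Only concatenation, alternation and Kleene star are used; Python's engine also offers features such as backreferences that go beyond regular languages.

```python
import re

BINARY = re.compile(r"0(0|1)*1")
EVEN_AB = re.compile(r"((aa)|(ab)|(ba)|(bb))*")
WH_QUESTION = re.compile(r"((what)|(where)|(why)|(which)|(whose)|(whom)).*\?")

def in_language(pattern, s):
    """Membership test: the whole string must match, not just a prefix."""
    return pattern.fullmatch(s) is not None

print(in_language(BINARY, "0001"))                  # True
print(in_language(EVEN_AB, "aba"))                  # False
print(in_language(WH_QUESTION, "where my pants?"))  # True
```

`fullmatch` is used rather than `search` because membership is a property of the entire string.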
  • Properties of regular languages:

    • Closure: if we take regular languages L1 and L2 and combine them with some operation, is the resulting language still regular? If so, regular languages are said to be closed under that operation.
    • Regular languages are closed under the following operations:
      • Concatenation and union (directly from the definition of regular expressions)
      • Intersection: strings that are valid in both L1 and L2
      • Negation: strings that are not in L
    • Extremely versatile: we can define regular languages for different properties of a language and use them together.
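
One way to see why intersection and negation preserve regularity is to build the corresponding automata. The sketch below assumes each language is given as a complete deterministic finite automaton, stored as a tuple whose layout is chosen here for illustration and is not taken from the lecture. Negation swaps final and non-final states; intersection runs both automata in lockstep (the product construction).

```python
from itertools import product

# A DFA is (states, alphabet, delta, start, finals); delta maps
# (state, symbol) -> state and must be defined for every pair.

def accepts(dfa, s):
    states, alphabet, delta, start, finals = dfa
    q = start
    for ch in s:
        q = delta[(q, ch)]
    return q in finals

def complement(dfa):
    """Negation: same automaton, final and non-final states swapped."""
    states, alphabet, delta, start, finals = dfa
    return (states, alphabet, delta, start, states - finals)

def intersection(d1, d2):
    """Product construction: a state is a pair (state of d1, state of d2)."""
    s1, alphabet, t1, q1, f1 = d1
    s2, _, t2, q2, f2 = d2
    states = set(product(s1, s2))
    delta = {((a, b), ch): (t1[(a, ch)], t2[(b, ch)])
             for (a, b) in states for ch in alphabet}
    finals = {(a, b) for (a, b) in states if a in f1 and b in f2}
    return (states, alphabet, delta, (q1, q2), finals)

# L1: even-length strings over {a, b}. L2: strings over {a, b} that end in b.
even = ({0, 1}, {"a", "b"},
        {(0, "a"): 1, (0, "b"): 1, (1, "a"): 0, (1, "b"): 0}, 0, {0})
ends_b = ({"n", "y"}, {"a", "b"},
          {("n", "a"): "n", ("n", "b"): "y",
           ("y", "a"): "n", ("y", "b"): "y"}, "n", {"y"})

both = intersection(even, ends_b)   # even length AND ends in b
odd = complement(even)              # NOT even length

print(accepts(both, "ab"), accepts(both, "b"), accepts(odd, "a"))  # True False True
```

Since the result of each operation is again a finite automaton, the resulting language is regular.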

Finite State Acceptor

Finite State Acceptor (FSA)

  • A regular expression defines a regular language, but it does not give an algorithm for checking whether a string belongs to the language

  • A finite state acceptor (FSA) describes the computation involved in membership checking

  • An FSA consists of:

    • An alphabet of input symbols Σ
    • A set of states Q
    • A start state q0 ∈ Q
    • A set of final states F ⊆ Q
    • A transition function: (state, symbol) -> next state
  • It accepts a string if there is a path from q0 to a final state whose transitions match each symbol of the string

    • Such a path can be found with Dijkstra's shortest-path algorithm, complexity O(V log V + E) for a graph with V nodes and E edges
  • E.g.:

    • Input alphabet : {a, b}
    • States: {q0, q1}
    • Start state: q0; final states: {q1}
    • Transition function: {(q0, a) -> q0, (q0, b)-> q1, (q1, b) -> q1}
    • Regular expression defined by this FSA: a*bb*
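
The example acceptor can be simulated directly. This is a minimal sketch that assumes the transition table is deterministic and that a missing entry means "reject"; for this particular automaton a single left-to-right pass is then enough and no path search is needed.

```python
# Transition table of the example FSA: (state, symbol) -> next state.
TRANSITIONS = {("q0", "a"): "q0", ("q0", "b"): "q1", ("q1", "b"): "q1"}
START = "q0"
FINAL = {"q1"}

def accepts(s):
    """Return True if the FSA ends in a final state after reading all of s."""
    state = START
    for symbol in s:
        key = (state, symbol)
        if key not in TRANSITIONS:   # no matching transition: reject
            return False
        state = TRANSITIONS[key]
    return state in FINAL

for s in ["b", "aab", "abbb", "", "a", "aba"]:
    print(repr(s), accepts(s))
```

The first three strings are accepted and the last three are rejected, which agrees with the regular expression a*bb*.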

Derivational Morphology

  • Use of affixes to change a word into another grammatical category

  • E.g.:

    • grace -> graceful -> gracefully
    • grace -> disgrace -> disgracefully
    • allure -> alluring -> alluringly
    • allure -> *allureful
    • allure -> *disallure
  • FSA for Morphology:

    • Want to accept valid forms (grace -> graceful)
    • Reject invalid ones (allure -> *allureful)
    • Generalize to other words
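
A small acceptor over morphemes illustrates the idea. Everything in this sketch is an assumption made for illustration: the state names, the two stem classes, and the simplification that words arrive already segmented into morphemes (so spelling changes such as allure + ing -> alluring are ignored).

```python
FUL_NOUNS = {"grace"}    # stems that can take dis- and -ful
ING_STEMS = {"allure"}   # stems that can take -ing

# (state, morpheme) -> next state
ARCS = {
    ("start", "dis"): "neg",
    ("noun", "ful"): "adj",
    ("stem", "ing"): "adj",
    ("adj", "ly"): "adv",
}
for w in FUL_NOUNS:
    ARCS[("start", w)] = "noun"
    ARCS[("neg", w)] = "noun"
for w in ING_STEMS:
    ARCS[("start", w)] = "stem"

FINAL = {"noun", "stem", "adj", "adv"}

def accepts(morphemes):
    """Accept a sequence of morphemes if it spells out a valid derivation."""
    state = "start"
    for m in morphemes:
        state = ARCS.get((state, m))
        if state is None:
            return False
    return state in FINAL

print(accepts(["dis", "grace", "ful", "ly"]))  # True
print(accepts(["allure", "ful"]))              # False
print(accepts(["dis", "allure"]))              # False
```

Generalizing to other words then means adding a stem to the right class, not adding new states.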


Weighted FSA

  • Some words are more plausible than others:

    • fishful vs. disgracelyful
    • musicky vs. writey
  • Weighted FSA: graded measure of acceptability:

    • Start state weight function: λ: Q -> ℝ
    • Final state weight function: ρ: Q -> ℝ
    • Transition weight function: δ: (Q, Σ, Q) -> ℝ
  • Shortest-Path:

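    • Total score of a path π that starts in state q_0, reads symbols x_1 … x_N and ends in state q_N: λ(q_0) + Σ_{i=1..N} δ(q_{i-1}, x_i, q_i) + ρ(q_N)
    • Use a shortest-path algorithm to find the path π with minimum cost. Complexity: O(V log V + E)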
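
The following sketch scores a string with a weighted acceptor by searching for its cheapest accepting path, where a path costs its start weight plus its transition weights plus its final weight. It assumes all weights are non-negative costs (for example negative log probabilities), which is what Dijkstra's algorithm requires; the dictionary layout and the toy weights are made up for illustration.

```python
import heapq
import math

def best_path_cost(symbols, start_w, final_w, arcs):
    """Cheapest path that reads `symbols`; math.inf if there is none.

    start_w, final_w: dict state -> cost (a missing state is not allowed)
    arcs: dict (state, symbol) -> list of (next_state, cost)
    """
    n = len(symbols)
    # Search nodes are (position in the input, state); position n + 1 marks
    # a path whose final weight has already been added.
    heap = [(w, 0, q) for q, w in start_w.items()]
    heapq.heapify(heap)
    done = set()
    while heap:
        cost, i, q = heapq.heappop(heap)
        if (i, q) in done:
            continue
        done.add((i, q))
        if i == n + 1:
            return cost
        if i == n:
            if q in final_w:
                heapq.heappush(heap, (cost + final_w[q], n + 1, q))
            continue
        for nxt, w in arcs.get((q, symbols[i]), []):
            heapq.heappush(heap, (cost + w, i + 1, nxt))
    return math.inf

# Toy weighted version of the a*bb* acceptor with one extra, costly b-loop.
start_w = {"q0": 0.0}
final_w = {"q1": 0.5}
arcs = {
    ("q0", "a"): [("q0", 1.0)],
    ("q0", "b"): [("q1", 2.0), ("q0", 5.0)],
    ("q1", "b"): [("q1", 0.25)],
}

print(best_path_cost("aab", start_w, final_w, arcs))  # 4.5
print(best_path_cost("bb", start_w, final_w, arcs))   # 2.75
print(best_path_cost("ba", start_w, final_w, arcs))   # inf
```

For "bb" there are two accepting paths (costs 2.75 and 7.5), and the search returns the cheaper one.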

Finite State Transducer

Finite State Transducer (FST)

  • Often we do not just want to accept or score strings; we want to translate them into other strings.

  • An FST adds string output capability to an FSA

    • Includes an output alphabet
    • Transitions now take an input symbol and emit an output symbol: (Q, Σ, Σ, Q)
  • Can be weighted (WFST): graded scores for transitions

  • E.g. edit distance as a WFST: the distance needed to transform one string into another
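
Edit distance can be read as the cheapest path through a one-state weighted transducer whose arcs copy, substitute, insert or delete a symbol. The sketch below computes that minimum with dynamic programming, assuming unit costs for insertion, deletion and substitution and zero cost for a match (the usual Levenshtein setting, chosen here as an assumption).

```python
def edit_distance(src, tgt, sub=1, ins=1, dele=1):
    """Minimum total arc cost needed to transduce src into tgt."""
    # prev[j]: cheapest way to turn the first i-1 symbols of src
    # into the first j symbols of tgt.
    prev = [j * ins for j in range(len(tgt) + 1)]
    for i in range(1, len(src) + 1):
        cur = [i * dele]
        for j in range(1, len(tgt) + 1):
            same = src[i - 1] == tgt[j - 1]
            copy_or_sub = prev[j - 1] + (0 if same else sub)  # arc x:x or x:y
            delete = prev[j] + dele                           # arc x:ε
            insert = cur[j - 1] + ins                         # arc ε:y
            cur.append(min(copy_or_sub, delete, insert))
        prev = cur
    return prev[-1]

print(edit_distance("kitten", "sitting"))  # 3
print(edit_distance("flaw", "lawn"))       # 2
```

Each of the three candidates in the inner loop corresponds to taking one kind of arc, so the table entry is the cost of the best partial path.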


FST for Inflectional Morphology

  • Verb inflection in Spanish must match the subject in person and number
  • Goal of morphological analysis:
    • canto -> cantar + VERB + present + 1P + singular
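
A toy character-level transducer shows how such an analysis can be produced. It is a sketch under several assumptions: it covers a single regular -ar verb, the state names are simply the prefix read so far, and the forms other than canto (cantas, canta, cantamos, cantan) come from the standard present-tense paradigm, not from the lecture. The tag spelling follows the canto example.

```python
def build_ar_verb_fst(stem, lemma):
    """Return (arcs, final_out) for present-tense forms of one -ar verb.

    arcs: (state, input char) -> (next state, output string)
    final_out: accepting state -> output emitted when the input ends there
    """
    arcs, final_out = {}, {}

    def arc(src, ch, out=""):
        arcs[(src, ch)] = (src + ch, out)
        return src + ch

    state = ""                                   # start state: nothing read yet
    for i, ch in enumerate(stem):
        # Emit the lemma and the shared tags on the first stem character.
        state = arc(state, ch, lemma + " + VERB + present" if i == 0 else "")

    final_out[arc(state, "o", " + 1P + singular")] = ""
    a = arc(state, "a")
    final_out[a] = " + 3P + singular"            # decided only if input ends here
    final_out[arc(a, "s", " + 2P + singular")] = ""
    final_out[arc(a, "n", " + 3P + plural")] = ""
    final_out[arc(arc(arc(a, "m", " + 1P + plural"), "o"), "s")] = ""
    return arcs, final_out

def transduce(fst, word):
    """Map a surface form to its analysis, or None if it is not accepted."""
    arcs, final_out = fst
    state, output = "", []
    for ch in word:
        if (state, ch) not in arcs:
            return None
        state, out = arcs[(state, ch)]
        output.append(out)
    if state not in final_out:
        return None
    return "".join(output) + final_out[state]

CANTAR = build_ar_verb_fst("cant", "cantar")
print(transduce(CANTAR, "canto"))     # cantar + VERB + present + 1P + singular
print(transduce(CANTAR, "cantamos"))  # cantar + VERB + present + 1P + plural
```

The output for "canta" is delayed to the final-state output because, after reading the a, the machine cannot yet tell whether s, n or mos will follow.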


Non-Regular Languages

  • Arithmetic expressions with balanced parentheses
    • (a + (b * (c / d)))
    • Can have arbitrarily many opening parentheses
    • Need to remember how many open parentheses to produce the same number of closed parentheses
    • Cannot be done with a finite number of states
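
A short sketch makes the memory requirement concrete. It checks only the parentheses, not the arithmetic, and the function name is illustrative. The counter `depth` can grow without bound, which is exactly what a fixed, finite set of states cannot represent.

```python
def balanced(expr):
    """Return True if every '(' in expr is closed by a later ')'."""
    depth = 0                    # number of currently open parentheses
    for ch in expr:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:        # a ')' with nothing open
                return False
    return depth == 0

print(balanced("(a + (b * (c / d)))"))  # True
print(balanced("(a + (b * c)"))         # False
```

If the nesting depth were capped at some constant k, the counter would take only k + 1 values and a finite automaton would suffice; it is the unbounded depth that takes the language out of the regular class.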

Center Embedding

  • Center embedding of relative clauses

    • The cat loves Mozart
    • The cat the dog chased loves Mozart
    • The cat the dog the rat bit chased loves Mozart
    • The cat the dog the rat the elephant admired bit chased loves Mozart
  • Need to remember the n subject nouns, to ensure n verbs follow

  • Requires context-free grammar
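
The pattern can be captured by a small recursive grammar. The sketch below is a recursive-descent recognizer for a hypothetical grammar written only for these example sentences: S -> NP "loves" "Mozart", and NP -> "the" N or "the" N NP V. The word lists are taken from the examples above; the recursion stack does the remembering that a finite set of states cannot.

```python
NOUNS = {"cat", "dog", "rat", "elephant"}
VERBS = {"chased", "bit", "admired"}

def parse_np(tokens, i):
    """NP -> 'the' N | 'the' N NP V. Return the index after the NP, or None."""
    if i + 1 < len(tokens) and tokens[i] == "the" and tokens[i + 1] in NOUNS:
        j = i + 2
        k = parse_np(tokens, j)          # try an embedded relative clause
        if k is not None and k < len(tokens) and tokens[k] in VERBS:
            return k + 1                 # embedded NP followed by its verb
        return j                         # plain 'the' N
    return None

def accepts(sentence):
    tokens = sentence.lower().split()
    end = parse_np(tokens, 0)
    return end is not None and tokens[end:] == ["loves", "mozart"]

print(accepts("The cat the dog the rat bit chased loves Mozart"))  # True
print(accepts("The cat the dog the rat chased loves Mozart"))      # False
```

The second sentence is rejected because three subject nouns are followed by only one embedded verb.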
