A First Look at Semantic Role Labeling (SRL) (notes compiled from English tutorials)

I recently did a quick survey of semantic role labeling; my notes follow.

  • Structuring linguistic information so that computers can understand the semantics a sentence conveys.

    Semantic role labeling (SRL) is a shallow semantic analysis technique that labels certain phrases in a sentence as arguments (semantic roles) of a given predicate, e.g. agent, patient, time, and location. It can benefit applications such as question answering, information extraction, and machine translation.

  • Shortcomings of semantic role labeling

    • Arguments are labeled only for a single given predicate; interactions among multiple predicates are not handled.
    • Elided parts of a sentence are not recovered, so some semantic information is lost.
  • Core semantic roles: six of them, A0-A5. A0 usually denotes the agent of the action and A1 the patient (the entity affected); A2-A5 take on different meanings depending on the predicate.

  • Adjunct semantic roles (16 kinds):

    • ADV: adverbial (default tag)
    • BNE: beneficiary
    • CND: condition
    • DIR: direction
    • DGR: degree
    • EXT: extent
    • FRQ: frequency
    • LOC: locative
    • MNR: manner
    • PRP: purpose or reason
    • TMP: temporal
    • TPC: topic
    • CRD: coordinated arguments
    • PRD: predicate
    • PSR: possessor
    • PSE: possessee
  • Traditional approaches

    • They operate on the output of syntactic analysis. Since syntactic analysis covers constituent parsing, shallow (chunk) parsing, and dependency parsing, SRL methods can be categorized along the same lines:
    • SRL based on constituent parse trees
    • SRL based on shallow parsing
    • SRL based on dependency parses
    • Feature-vector-based SRL
    • SRL based on maximum-entropy classifiers
    • Kernel-based SRL
    • SRL based on conditional random fields
    • These methods differ mainly in how they detect candidate arguments.
  • The standard pipeline: syntactic parse -> candidate argument pruning -> argument identification -> argument classification -> SRL output

    • Argument pruning: among the many candidates, remove spans that definitely cannot be arguments
    • Argument identification: a binary classification problem (argument vs. non-argument)
    • Argument classification: a multi-class classification problem
# Phrase structure parse of 我 吃 肉 ("I eat meat")
        S
     ___|___
    |       |
    NN      VP
    我    __|___
         |      |
         Vt     NN
         吃     肉
  • How should the classifier's features be designed? (a sketch follows this list)

    • the predicate itself,
    • the parse tree path,
    • the phrase type,
    • the argument's position relative to the predicate,
    • the predicate's voice,
    • the argument's head word,
    • the subcategorization frame,
    • the argument's first and last words,
    • combination features.
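A minimal sketch of assembling such features for one candidate constituent. The `Constituent` class and the helper arguments are hypothetical stand-ins for whatever the parsing pipeline actually produces:

# Sketch: feature extraction for one candidate argument (hypothetical types)
from dataclasses import dataclass

@dataclass
class Constituent:
    label: str    # phrase type, e.g. "NP"
    words: list   # surface tokens covered by the constituent
    head: str     # head word of the constituent

def extract_features(cand: Constituent, predicate: str, path: str,
                     before_predicate: bool, passive: bool) -> dict:
    feats = {
        "predicate": predicate,
        "path": path,                    # parse tree path feature
        "phrase_type": cand.label,
        "position": "before" if before_predicate else "after",
        "voice": "passive" if passive else "active",
        "head_word": cand.head,
        "first_word": cand.words[0],
        "last_word": cand.words[-1],
    }
    # combination feature: position x voice (useful for linear classifiers)
    feats["position+voice"] = feats["position"] + "_" + feats["voice"]
    return feats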
  • Application areas

    • Digital library construction
    • Information retrieval
    • Information extraction
    • Knowledge extraction from scientific literature
  • Drawbacks of current labeling methods

    • They depend on the accuracy of syntactic parsing
    • They adapt poorly to new domains
    • How much potential is left in existing classifiers, and how many new features can still be designed? Not much, by now.
    • End-to-end approaches would remove the dependence on parser output
    • Could multilingual parallel corpora help compensate for the accuracy problem?

Tutorial at NAACL 2009

  • Linguistic Background, Resources, Annotation

    • Motivation: From Sentences to Propositions (extracting the core meaning of a sentence)
    • Capturing semantic roles

    • Case Theory

      • Case relations occur in deep-structure
        • Surface-structure cases are derived
      • A sentence is a verb + one or more NPs

        • Each NP has a deep-structure case
          • A(gentive)
          • I(nstrumental)
          • D(ative) - recipient
          • F(actitive) – result
          • L(ocative)
          • O(bjective) – affected object, theme
        • Subject is no more important than Object
          • Subject/Object are surface structure
      • Case Theory Benefits - Generalizations

        • Fewer tokens
          • Fewer verb senses
          • E.g. cook/bake [ __O(A)] covers
            • Mother is cooking/baking the potatoes
            • The potatoes are cooking/baking.
            • Mother is cooking/baking.
        • Fewer types
          • “Different” verbs may be the same semantically, but with different subject selection preferences
          • E.g. like and please are both [ __O+D]
      • Oops, problems with Cases/Thematic Roles

        • How many and what are they?
        • Fragmentation: 4 Agent subtypes? (Cruse, 1973)
          • The sun melted the ice./This clothes dryer doesn’t dry clothes well
        • Ambiguity: Andrews (1985)
          • Argument/adjunct distinctions – Extent?
          • The kitten licked my fingers. – Patient or Theme?
        • Θ-Criterion (GB Theory): each NP of predicate in lexicon assigned unique θ-role (Chomsky 1981).
      • Argument Selection Principle

        • Proto-Agent – the mother
          • Volitional involvement in the event or state
          • Sentience (and/or perception)
          • Causes an event or change of state in another participant
          • Movement (relative to the position of another participant)
          • (Exists independently of the event named)
          • *may be discourse-pragmatic
        • Proto-Patient – the cake

          • Undergoes change of state
          • Incremental theme
          • Causally affected by another participant
          • Stationary relative to the movement of another participant
          • (Does not exist independently of the event, or at all)
          • *may be discourse-pragmatic
        • Why numbered arguments?

          • Lack of consensus concerning semantic role labels
          • Numbers correspond to verb-specific labels
          • Arg0 – Proto-Agent, Arg1 – Proto-Patient (Dowty, 1991)
          • Args 2-5 are highly variable and overloaded – poor performance
        • Why do we need frameset IDs?

          • Because a single verb can have several distinct senses in different contexts
      • Annotation procedure, WSJ PropBank [Palmer et al., 2005]

        • PTB II – extraction of all sentences with a given verb
          • Create a Frame File for that verb (Paul Kingsbury)
            • 3,100+ lemmas, 4,400 framesets, 118K predicates
            • Over 300 created automatically via VerbNet
        • First pass: automatic tagging (Joseph Rosenzweig)
          • http://www.cis.upenn.edu/~josephr/TIDES/index.html#lexicon
        • Second pass: double-blind hand correction (Paul Kingsbury)
          • Tagging tool highlights discrepancies (Scott Cotton)
        • Third pass: “Solomonization” (adjudication)
          • Betsy Klipple, Olga Babko-Malaya
  • Supervised Semantic Role Labeling and Leveraging Parallel PropBanks

    • Basic knowledge

      • SRL on a constituency parse
        • A constituency parse tree breaks a text into sub-phrases. Non-terminals in the tree are phrase types, the terminals are the words of the sentence, and the edges are unlabeled. (A constituency parse of the simple sentence “John sees Bill” is shown further below.)
      • SRL on a dependency parse
        • A dependency parse connects words according to their relationships. Each vertex in the tree represents a word, child nodes are words that depend on the parent, and edges are labeled by the relationship.
      • A dependency tree can be derived from a constituency tree, but not the other way around; the conversion uses the head-finding rules of Zhang and Clark (2008).
      • The head word is the central word of a phrase in constituent structure.
    • SRL supervised ML pipeline

      1. Syntactic parse
      2. Prune constituents [Xue, Palmer 2004] (a sketch follows this list)
        • For the predicate and each of its ancestors, collect their sisters unless a sister is coordinated with the predicate
        • If a sister is a PP (prepositional phrase), also collect its immediate children
      3. Argument identification (ML)
        • Extract features from the sentence, the syntactic parse, and other sources for each candidate constituent
        • Train a statistical ML classifier to identify arguments
      4. Argument classification (ML)
        • Extract features
        • Train a statistical ML classifier to select the appropriate label
          • SVM, linear (MaxEnt, LibLinear, etc.), or structured (CRF) classifiers for arguments
          • One-vs-all, pairwise, or structured multi-label classification
      5. Structural inference (heuristic or ML optimization)
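A rough sketch of the pruning heuristic from step 2 on a toy tree. `Node` is a hypothetical tree class (a real system would use its parser's tree API), and the coordination check is simplified to skipping CC nodes:

# Sketch: Xue & Palmer (2004) candidate pruning (hypothetical Node class)
class Node:
    def __init__(self, label, children=()):
        self.label = label
        self.children = list(children)
        self.parent = None
        for c in self.children:
            c.parent = self

def prune_candidates(predicate):
    """Collect candidate argument constituents for a predicate node."""
    candidates = []
    node = predicate
    while node.parent is not None:
        for sister in node.parent.children:
            if sister is node or sister.label == "CC":
                continue  # skip the node itself and (simplified) coordination
            candidates.append(sister)
            if sister.label == "PP":
                candidates.extend(sister.children)  # also keep PP's children
        node = node.parent  # climb to the next ancestor
    return candidates

# For the tree of 我吃肉 above: the candidates for 吃 are the object NN (肉)
# and, one level up, the subject NN (我).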
    • Commonly Used Features: Phrase Type

      • Intuition: different roles tend to be realized by different
        syntactic categories
      • For dependency parse, the dependency label can serve similar function
      • Phrase Type indicates the syntactic category of the phrase
        expressing the semantic roles
      • Syntactic categories from the Penn Treebank
      • FrameNet distributions:
        • NP (47%) – noun phrase
        • PP (22%) – prepositional phrase
        • ADVP (4%) – adverbial phrase
        • PRT (2%) – particles (e.g. make something up)
        • SBAR (2%), S (2%) - clauses
    • Governing Category

      • Intuition: There is often a link between semantic roles and
        their syntactic realization as subject or direct object
      • He drove the car over the cliff
        • Subject NP more likely to fill the agent role
      • Approximating grammatical function from parse
        • Function tags in constituent parses (typically not recovered in automatic parses)
      • Dependency labels in dependency parses
    • Features: Parse Tree Path

      • Intuition: need a feature that factors in relation to the target word.
      • Feature representation: string of symbols indicating the up and down traversal to go from the target word to the constituent of interest
      • For dependency parses, use dependency path

      • Issues:

        • Parser quality (error rate)
        • Data sparseness
          • 2978 possible values, excluding frame elements with no matching parse constituent
            • Compress the path by removing consecutive phrases of the same type, retaining only clauses in the path, etc.
          • 4086 possible values when those are included; of the 35,138 frame elements identified as NPs, only 4% have a path feature without a VP or S ancestor [Gildea and Jurafsky, 2002]
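A sketch of computing the path feature, reusing the hypothetical `Node` class from the pruning sketch above (↑ = up, ↓ = down, as in Gildea and Jurafsky, 2002):

# Sketch: parse tree path from the predicate to a candidate constituent
def ancestors(n):
    chain = [n]
    while n.parent is not None:
        n = n.parent
        chain.append(n)
    return chain

def tree_path(source, target):
    src_chain, tgt_chain = ancestors(source), ancestors(target)
    lca = next(a for a in src_chain if a in tgt_chain)  # lowest common ancestor
    up = src_chain[: src_chain.index(lca) + 1]          # source .. LCA
    down = tgt_chain[: tgt_chain.index(lca)][::-1]      # just below LCA .. target
    return "↑".join(n.label for n in up) + "↓" + "↓".join(n.label for n in down)

# For 我吃肉 above: tree_path(吃_Vt, 我_NN) == "Vt↑VP↑S↓NN"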
      • Features: Subcategorization

        • List of child phrase types of the VP
          • highlighting the constituent under consideration
      • Intuition: Knowing the number of arguments to the verb constrains the possible set of semantic roles
      • For dependency parse, collect dependents of predicate

      • Features: Position

        • Intuition: grammatical function is highly correlated with position in the sentence
          • Subjects appear before a verb
          • Objects appear after a verb
        • Representation:
          • Binary value – does node appear before or after the predicate
      • Features: Voice

        • Direct object in an active clause <> subject in a passive clause
          • He slammed the door.
          • The door was slammed by him.
        • Approach (a sketch follows below):
          • Use passive-identifying patterns/templates (language-dependent)
            • Passive auxiliary (to be, to get) + past participle
            • The bei (被) construction in Chinese
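A sketch of such a template for English, assuming Penn Treebank POS tags; a real detector would also need to handle reduced relatives and other cases:

# Sketch: template-based passive detection for English (PTB tags assumed)
def is_passive(tagged_clause):
    """tagged_clause: list of (word, pos) pairs."""
    AUX = {"is", "are", "was", "were", "be", "been", "being", "get", "got"}
    for i, (word, _) in enumerate(tagged_clause):
        if word.lower() in AUX:
            for _, pos in tagged_clause[i + 1:]:
                if pos == "VBN":          # past participle after an auxiliary
                    return True
                if pos not in ("RB", "RBR", "RBS"):
                    break                 # only adverbs may intervene
    return False

print(is_passive([("The", "DT"), ("door", "NN"), ("was", "VBD"),
                  ("slammed", "VBN"), ("by", "IN"), ("him", "PRP")]))  # True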
      • Features: Tree kernel

        • Compute sub-trees and partial-trees similarities between training parses and decoding parse
        • Does not require exact feature match
          • Advantage when training data is small (less likely to have exact feature match)
        • Well suited for kernel space classifiers (SVM)
          • All possible sub-trees and partial trees do not have to be enumerated as individual features
          • Tree comparison can be made in polynomial time even when the number of possible sub/partial trees is exponential
      • More Features

        • Head word
          • Head of constituent
        • Named entities
        • Verb cluster
          • Similar verbs share similar argument sets
        • First/last word of constituent
        • Constituent order/distance
          • Whether certain phrase types appear before the argument
        • Argument set
          • Possible arguments in frame file
        • Previous role
          • Last found argument type
        • Argument order
          • Order of arguments from left to right
    • Nominal predicates

      • Verb predicate annotation doesn't always capture fine semantic details
      • Arguments of nominal predicates can be harder to classify because they are less constrained by syntax
      • Find the “supporting” verb predicate and its argument candidates
        • The nominal predicate usually sits under the VP headed by the supporting verb and is part of one of the verb's arguments
    • Structural Inference

      • Take advantage of predicate-argument structures to re-rank argument label set
        • Arguments should not overlap
        • Numbered arguments (arg0-5) should not repeat
        • R-arg[type] and C-arg[type] should have an associated arg[type]
      • Optimize the log probability of the label set (a greedy sketch follows below)
        • Beam search
        • Formulate as an integer linear programming (ILP) problem
      • Re-rank the top label sets that conform to the constraints
        • Choose the n-best label sets
        • Train a structural classifier (CRF, etc.)
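The tutorial casts this step as beam search or ILP; the following is only a greedy approximation that enforces the two hard constraints listed above (no overlaps, no duplicate core roles):

# Sketch: greedy structural inference over scored argument candidates
def greedy_inference(candidates):
    """candidates: list of (score, start, end, label); spans are [start, end)."""
    chosen, used_core = [], set()
    for score, start, end, label in sorted(candidates, reverse=True):
        if label in used_core:                     # core roles must not repeat
            continue
        if any(start < e and s < end for _, s, e, _ in chosen):
            continue                               # arguments must not overlap
        chosen.append((score, start, end, label))
        if label.startswith("A") and label[1:].isdigit():
            used_core.add(label)                   # A0-A5 are core roles
    return sorted(chosen, key=lambda c: c[1])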
    • SRL ML Notes

      • Syntactic parse input
        • Training parse accuracy needs to match decoding parse accuracy
          • Generate parses via cross-validation
        • Cross-validation folds need to be selected with low correlation
          • Training data from the same document source needs to be in the same fold
      • Separate stages of constituent pruning, argument identification and argument labeling
        • Constituent pruning and argument identification reduce training/decoding complexity, but usually incur a slight accuracy penalty
    • Linear Classifier Notes

      • Popular choices: LibLinear, MaxEnt, RRM
      • Perceptron model in feature space
        • each feature j contributes positively or negatively to a label i
      • How about position and voice features for classifying the agent?
        - He slammed the door.
        - The door was slammed by him.
        • Position (left): positive indicator since active construction is more frequent
        • Voice (active): weak positive indicator by itself (agent can be omitted in passive construction)
      • Combine the 2 features as a single feature
        • left-active and right-passive are strong positive indicators
        • left-passive and right-active are strong negative indicators
    • Support Vector Machine Notes

      • Popular choices: LibSVM, SVM light
      • Kernel space classification (linear kernel example)
        • The correlation c_j of the input sample's features with each training sample j contributes positively or negatively to a label i
      • Creates an n × n dense correlation matrix during training (n is the number of training samples)
        • Requires a lot of memory during training on a large corpus
          • Use a linear classifier for argument identification
          • Train base model with a small subset of samples, iteratively add a portion of incorrectly classified training samples and retrain
        • Decoding speed not as adversely affected
          • Trained model typically only has a small number of “support vectors”
      • Tend to perform better when training data is limited
    • Evaluation

      • Precision – the percentage of labels output by the system that are correct
      • Recall – the percentage of true (gold) labels that the system correctly identifies
      • F-measure (F_beta) – the weighted harmonic mean of precision and recall (F1 when beta = 1)
      • Lots of choices when evaluating SRL (a scoring sketch follows this list):
        • Arguments (score the full span, or only require the head word to be correct?)
          • Full span (CoNLL-2005)
          • Head word only (CoNLL-2008)
        • Predicates (are predicates given, or must the system identify them?)
          • Given (CoNLL-2005)
          • System identifies (CoNLL-2008)
          • Verb and nominal predicates (CoNLL-2008)
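A sketch of full-span scoring in the CoNLL-2005 style, comparing gold and predicted arguments as exact (predicate, span, label) tuples:

# Sketch: span-based SRL precision/recall/F-measure
def srl_prf(gold, pred, beta=1.0):
    """gold, pred: sets of (predicate_idx, start, end, label) tuples."""
    correct = len(gold & pred)
    p = correct / len(pred) if pred else 0.0
    r = correct / len(gold) if gold else 0.0
    f = (1 + beta**2) * p * r / (beta**2 * p + r) if p + r else 0.0
    return p, r, f

gold = {(2, 0, 1, "A0"), (2, 3, 5, "A1")}
pred = {(2, 0, 1, "A0"), (2, 3, 4, "A1")}   # A1 span is wrong
print(srl_prf(gold, pred))                   # (0.5, 0.5, 0.5)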
    • Applications

      • Question answering systems (structured information)
      • Machine translation generation/evaluation
      • Identifying/recovering implicit arguments across languages
        • Chinese dropped pronouns
# Constituency parse example: "John sees Bill"
                  Sentence
                     |
       +-------------+------------+
       |                          |
  Noun Phrase                Verb Phrase
       |                          |
     John                 +-------+--------+
                          |                |
                        Verb          Noun Phrase
                          |                |
                        sees              Bill

# Example: 在秋天的时候,陶喆爱吃苹果 (“In autumn, Tao Zhe loves eating apples”)

(ROOT (IP (PP (P 在) (NP (DNP (NP (NN 秋天)) (DEC 的)) (NP (NN 时候)))) (PU ,) (NP (NR 陶喆)) (VP (VV 爱) (IP (VP (VV 吃) (NP (NN 苹果)))))))


# Example: dependency parse of the same sentence

root(ROOT-0, 爱-7)
case(时候-4, 在-1)
nmod:assmod(时候-4, 秋天-2)
dep(秋天-2, 的-3)
nmod:prep(爱-7, 时候-4)
punct(爱-7, ,-5)
nsubj(爱-7, 陶喆-6)
ccomp(爱-7, 吃-8)
dobj(吃-8, 苹果-9)
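For reference, parses like the above can be produced with an off-the-shelf pipeline. A sketch using Stanza (note: Stanza outputs Universal Dependencies labels, which differ from the CoreNLP-style labels shown above):

# Sketch: Chinese dependency parsing with Stanza
import stanza

stanza.download("zh")   # first run only
nlp = stanza.Pipeline(lang="zh", processors="tokenize,pos,lemma,depparse")
doc = nlp("在秋天的时候,陶喆爱吃苹果")
for sent in doc.sentences:
    for word in sent.words:
        head = sent.words[word.head - 1].text if word.head > 0 else "ROOT"
        print(f"{word.deprel}({head}-{word.head}, {word.text}-{word.id})")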

  • Semi- , unsupervised and cross-lingual approaches

    • Shortcomings of supervised methods

      • They rely on large expert-annotated datasets (FrameNet and PropBank each annotate > 100k predicates)
      • Even then they do not provide high coverage (especially FrameNet)
        • ~50% oracle performance on new data [Palmer and Sporleder, 2010]
      • The resulting methods are domain-specific [Pradhan et al., 2008]
      • Such resources are not available for many languages
    • How can we reduce reliance of SRL methods on labeled data?

      • Transfer a model or annotation from a more resource-rich language (crosslingual transfer / projection)
      • Complement labeled data with unlabeled data (semi-supervised learning)
      • Induce SRL representations in an unsupervised fashion (unsupervised learning)
    • Outline

      • Crosslingual annotation and model transfer
        • Annotation projection
        • Direct transfer
      • Semi-supervised learning
        • Methods creating surrogate supervision
        • Parameter-sharing methods
      • Unsupervised learning
        • Agglomerative clustering
        • Generative modeling

    1. Crosslingual annotation and model transfer
      • Exploiting crosslingual correspondences: classes of methods

        • The setup:
          • Annotated resources or an SRL model are available for the source language (often English)
          • No or little annotated data is available for the target language
        • How can we build a semantic role labeller for the target language?
          • If we have parallel data, we can project annotations from the source language to the target language (annotation projection)
          • If there is no parallel data, we can directly apply a source SRL model to the target language (direct model transfer) [Kozhevnikov and Titov, 2013]
      • Crosslingual annotation projection: basic idea

        • Start with an aligned sentence pair
        • Label the source sentence
        • Check if a target predicate can evoke the same frame
        • Project roles from source to target sentence
      • Word-based projection (errors and omissions in the word alignment introduce noise)

        • For each source semantic role:
          • Follow alignment links
          • Target role spans all the projected words
          • Ensure contiguity
      • Syntax-based projection

        • Find alignment between constituents
        • For each source semantic role:
          • Identify a set of constituents in the source sentences
          • Label aligned constituents with the semantic role
        • Define semantic alignment as an optimization task on a graph
        • Graph for each sentence pair
        • Choose an optimal alignment graph, possibly subject to constraints (note how alignment is cast as an optimization problem):
          • Covers all target constituents (edge cover)
          • Edges in the alignment do not have common endpoints (matching)
      • Direct transfer of models

        • Is this realistic at all?
          • Requires a (maximally) language-independent feature representation, i.e. features designed to generalize across languages
          • Has been tried successfully for syntax
          • Performance depends on how different the languages are
      • Language-independent feature representations

        • Instead of words, use either
          • cross-lingual word clusters [Tackstrom et al., 2012] or
          • cross-lingual distributed word features [Klementiev et al., 2012]
        • Instead of fine-grained part-of-speech (PoS) tags, use coarse universal PoS tags [Petrov et al., 2012]
        • Instead of rich (constituent or dependency) syntax, use either
          • unlabeled dependencies, or
          • syntactic annotation transferred from the source language before transferring the semantic annotation

        • Evaluation setting: CoNLL-2009 data (dependency representation for semantics)
          • Target syntax is obtained using direct transfer
          • Only accuracy on labeling arguments is measured (not identification)
    2. Semi-supervised learning

      • There are three main groups of semi-supervised learning (SSL) methods considered for SRL:

        • methods creating surrogate supervision: automatically annotate unlabeled data and treat it as new labeled data (annotation projection / bootstrapping methods)
        • parameter-sharing methods: use unlabeled data to induce less sparse representations of words (clusters or distributed representations)
        • semi-unsupervised learning: add labeled data (and other forms of supervision) to guide unsupervised models

      • Creating surrogate supervision

        1. Choose examples (sentences) to label from an unlabeled dataset (how do we choose examples?)
        2. Automatically annotate the examples (how do we annotate examples?)
        3. Add them to the labeled training set
        4. Train a classifier on the expanded training set
        5. Optional: repeat (makes sense only if the classifier is used at stage 1 or 2)

        • Basic self-training
          • Use the classifier itself to annotate examples (and, often, its confidence to choose examples at stage 1)
          • Does not produce a noticeable improvement for SRL [He and Gildea, 2006]
          • A better method for choosing and annotating unlabeled examples is needed
      • Monolingual projection: the idea

        • Assumption: sentences similar in their lexical material and syntactic structure are likely to share the same frame-semantic structure
        • An example:
          • Labeled sentence: [His back]_Impactor [thudded]_Impact [against the wall]_Impactee
          • Unlabeled sentence: The rest of his body thumped against the front of the cage
        • An implementation (roughly):
          • Choose labeled examples that are similar to an unlabeled example (compute scored alignments between them, select pairs with high scores)
          • Use the alignments to project semantic role information onto the unlabeled sentences
          • How do we compute these alignments?
      • Monolingual projection: alignment

        • Start with an unlabeled sentence and a target predicate
        • Check labeled sentences one by one
        • Find the best alignment (a heuristic selects the alignment domain), with Score = Lexical Score + Syntactic Score (see the sketch below)
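A sketch of the combined score, with `similarity` (e.g. embedding cosine) and `deprel` (a dependency-label lookup) as hypothetical helpers standing in for whatever resources are available:

# Sketch: scoring one candidate alignment between a labeled and an unlabeled sentence
def alignment_score(alignment, labeled, unlabeled, similarity, deprel, w=0.5):
    """alignment: list of (i, j) word-index pairs into the two sentences."""
    lexical = sum(similarity(labeled[i], unlabeled[j]) for i, j in alignment)
    syntactic = sum(deprel(labeled, i) == deprel(unlabeled, j)
                    for i, j in alignment)
    return w * lexical + (1 - w) * syntactic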
      • Parameter-sharing methods

        • Reducing the sparsity of word representations
          • Lexical features are crucial for accurate semantic role labeling
            • However, they are problematic because they are sparse
          • Less sparse features capturing lexical information are needed
          • Representations can be learned from unlabeled data via a language-modeling task, for example:
            • Brown clusters [Brown et al., 1992]
            • Distributed word representations [Bengio et al., 2003], which are then used as features in SRL systems

        Challenge: they might not capture the phenomena relevant to SRL, or might lack the needed granularity.

        • Learning lexical representations

          • Share word representations across tasks and learn them simultaneously so that they are useful for both tasks

    3. Unsupervised learning (agglomerative clustering / generative modeling)
      • Defining unsupervised SRL

        • Semantic role labeling is typically divided into two sub-tasks:
          • Identification: identifying the predicate's arguments
            • Arguably the easier sub-task; it can be handled with heuristics, e.g. [Lang and Lapata, 2010]
          • Labeling: assigning their semantic roles
            • Equivalent to clustering argument occurrences (or “coloring” them)

        Goal: induce semantic roles automatically from unannotated text

  • Evaluating Unsupervised SRL

    • Before we begin, a note about evaluating unsupervised SRL
    • We do not have labels for clusters, so we use standard clustering metrics instead
      • Purity (PU) measures the degree to which each induced role contains arguments sharing the same gold (“true”) role
      • Collocation (CO) evaluates the degree to which arguments with the same gold roles are assigned to a single induced role
    • Report F1, the harmonic mean of PU and CO (a small sketch follows)
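A small sketch of these metrics over (induced_cluster, gold_role) pairs, following the definitions above:

# Sketch: purity, collocation, and their harmonic mean
from collections import Counter, defaultdict

def pu_co_f1(items):
    """items: list of (induced_cluster, gold_role) pairs, one per argument."""
    n = len(items)
    by_cluster, by_gold = defaultdict(Counter), defaultdict(Counter)
    for cluster, gold in items:
        by_cluster[cluster][gold] += 1
        by_gold[gold][cluster] += 1
    pu = sum(max(c.values()) for c in by_cluster.values()) / n
    co = sum(max(c.values()) for c in by_gold.values()) / n
    f1 = 2 * pu * co / (pu + co) if pu + co else 0.0
    return pu, co, f1

print(pu_co_f1([("c1", "A0"), ("c1", "A0"), ("c1", "A1"), ("c2", "A1")]))
# (0.75, 0.75, 0.75)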

3.1 Agglomerative clustering [Lang and Lapata, 2011b]

  • Role labeling as clustering of argument keys

    • Associate argument occurrences with syntactic signatures, or argument keys
      • These include simple syntactic cues such as verb voice and position relative to the predicate
    • Argument keys are designed to map to a single semantic role as much as possible (for an individual predicate)

      Instead of clustering argument occurrences, the method clusters their argument keys (occurrences sharing a key are grouped together)

    • Here, we would cluster ACTIVE:RIGHT:OBJ and ACTIVE:RIGHT:PMOD_up together (see the sketch below)
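A sketch of key construction from these cues (the paper's actual cue set is richer; this is a simplification):

# Sketch: building an argument key from simple syntactic cues
def argument_key(voice, position, deprel, preposition=None):
    parts = [voice.upper(), position.upper(), deprel.upper()]
    if preposition:            # prepositional arguments keep their preposition
        parts[-1] = "PMOD_" + preposition
    return ":".join(parts)

print(argument_key("active", "right", "obj"))          # ACTIVE:RIGHT:OBJ
print(argument_key("active", "right", "pmod", "up"))   # ACTIVE:RIGHT:PMOD_up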

  • Role labeling via “split-merge” clustering

    • Agglomerative clustering of arguments
      • Start with each argument key in its own cluster (high purity, low collocation)
      • Merge clusters together to improve collocation
    • For a pair of clusters, score:

      • whether the pair contains lexically similar arguments
      • whether the arguments have similar parts of speech
      • whether the constraint that arguments in a clause fill different roles is satisfied
    • Prioritization
      - Instead of greedily choosing the highest-scoring pair at each step, start with larger clusters and select the best match for each of them (less purely greedy, closer to a global optimum)

    3.2 Generative modeling [Titov and Klementiev, 2012a] [Titov and Klementiev, 2012b] [Titov and Klementiev, 2011]

  • A Bayesian model for role labeling

    • Idea: propose a generative model for inducing argument clusters
      • As before, the clusters are of argument keys, not argument occurrences
    • The learning signals are similar to Lang and Lapata (2011a, 2011b), e.g.
      • Selectional preferences (the distribution of argument fillers is sparse for every role)
      • Duplicate roles are unlikely to occur; e.g. assigning both objects below to one role is a bad clustering:

        John taught students math

    • How can we encode these signals in a generative story?

    • The approaches we discussed induce roles for each predicate independently

    • These clusterings define permissible alternations
    • But many alternations are shared across verbs
    • Can we share this information across verbs?

    • Idea: keep track of how likely a pair of argument keys is to be clustered together

      • Define a similarity matrix (or similarity graph)
    • A formal way to encode this: the dd-CRP

      • A CRP (Chinese restaurant process) can define a prior on the partition of argument keys:
        • The first customer (argument key) sits at the first table (role)
        • The m-th customer sits at a table according to the CRP probabilities below
      • An extension is the distance-dependent CRP (dd-CRP):
        • The m-th customer chooses a customer to sit with according to the dd-CRP probabilities below
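The formulas on the original slides did not survive extraction; what follows are the standard CRP and dd-CRP definitions (Blei and Frazier, 2011), which are presumably what the tutorial showed:

% Standard CRP: the m-th customer joins an existing table k with probability
% proportional to its size n_k, or starts a new table:
P(\text{table } k) = \frac{n_k}{m - 1 + \alpha}, \qquad
P(\text{new table}) = \frac{\alpha}{m - 1 + \alpha}

% dd-CRP: the m-th customer instead picks a customer j to sit with, with
% probability decaying in the distance d_{mj} between them:
P(c_m = j) \propto f(d_{mj}) \;\; (j \neq m), \qquad P(c_m = m) \propto \alpha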
    • Qualitative analysis

      • Looking into the induced graph encoding ‘priors’ over clustering argument keys, the most highly ranked pairs encode (or partially encode):
      • Passivization
      • Near-equivalence of subordinating conjunctions and prepositions
      • The benefactive alternation
      • Dativization
      • Recovery of unnecessary splits introduced by argument keys
    • Generalization of the role induction model

      • The model can be generalized for joint induction of predicate-argument structure of an entire sentence
        • start with a (transformed) syntactic dependency graph (~ argument identification)
        • predict decomposition and labeling of its parts
          • labels on nodes are frames (or semantic classes of arguments)
          • labels on edges are roles (frame elements)
  • Conclusions

    • We looked at examples of key directions in exploiting unlabeled data and cross-lingual correspondences
      • a lot of relevant recent work has not been covered
    • Still a new direction with a lot of ongoing work
      • research in the related area of information extraction should also be watched closely

  • NN for SRL (tutorial at EMNLP 2017)

    Outline: the fall and rise of syntax in SRL
    - Early SRL methods
    - Symbolic approaches + Neural networks (syntax-aware models)
    - Syntax-agnostic neural methods
    - Syntax-aware neural methods

    • Early SRL methods (pipeline)

      • Given a predicate:

        1. Argument identification
          • Hand-crafted rules on the full syntactic tree [Xue and Palmer, 2004]
          • A binary classifier [Pradhan et al., 2005; Toutanova et al., 2008]
          • Both [Punyakanok et al., 2008]
        2. Role labeling
          • Labeling is performed with a classifier (SVM, logistic regression)
          • For each argument we get a label distribution
          • An argmax over roles yields a local assignment
          • Disadvantage: no guarantee the labeling is well formed
            • overlapping arguments, duplicate core roles, etc.
        3. Global and/or constrained inference
          • Enforce linguistic and structural constraints (e.g., no overlaps, discontinuous arguments, reference arguments, …)
          • Viterbi decoding (k-best list with constraints) [Täckström et al., 2015]
          • Dynamic programming [Täckström et al., 2015; Toutanova et al., 2008]
          • Integer linear programming [Punyakanok et al., 2008]
          • Re-ranking [Toutanova et al., 2008; Björkelund et al., 2009]
      • Early symbolic models

        • A 3-step pipeline
        • Massive feature engineering for
          • argument identification
          • role labeling
          • re-ranking
        • Most of the features are syntactic [Gildea and Jurafsky, 2002]
    • Symbolic approaches + Neural networks (syntax-aware models)

      1. Fitzgerald et al., 2015

        • model
          • Rule based argument identification
            • as in [Xue and Palmer, 2004] but adapted to dependency parses
          • Neural network for local role labeling
          • Global structural inference based on dynamic programming
            • [Täckström et al., 2015]
        • innovation
          • Predicate-role composition
            • Predicate-specific role representation
            • Learning distributed predicate representation across different formalisms
            • State of the art on FrameNet dataset
          • Feature embeddings
            • Use “simple” span features
            • Let the network figure out how to compose them
            • Reduced feature engineering
      2. Roth and Lapata, 2016: Dependency path embeddings

        • model
          • Dependency-based SRL
            • Syntactic paths between predicates and arguments are an important feature
            • This feature may be extremely sparse
            • Creating a distributed representation alleviates the problem
            • Use an LSTM [Hochreiter and Schmidhuber, 1997] to encode paths
          • Neural network with dependency path embeddings as local classifier
            • Argument identification
            • Role labeling
          • Global re-ranking of k-best local assignments
        • innovation
          • Encode syntactic paths with LSTMs
            • Overcome sparsity
          • Combination of symbolic features and continuous syntactic paths
    • Syntax-agnostic neural methods (the fall)

      • SRL as a sequence labeling task
        • Argument identification and role labeling in one step (end-to-end)
      • General architecture
        • Word encoding
        • Sentence encoding (via LSTM)
        • Decoding
      • No use of any kind of treebank syntax (it is not trivial to encode)
      • Differentiable end-to-end
        • [Collobert et al., 2011]

      1. Zhou and Xu, 2015: sentence encoding
        • model
          • Pretrained word embedding
          • Distance from the predicate
          • Predicate context (for disambiguation)
          • Predicate region mark
          • Bidirectional LSTM
            • Forward (left context)
            • Backward (right context)
            • Snake BiLSTM
          • Conditional Random Field
            • [Lafferty et al., 2001]
            • Markov assumption between role labels
        • innovation

          • No syntax
          • Minimal word representation
          • Sentence encoding with “Snake” BiLSTM
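A minimal sketch of this architecture in PyTorch (assumed; hyperparameters are placeholders). Zhou and Xu stack layers in the "snake" pattern and decode with a CRF; both are omitted here:

# Sketch: BiLSTM sequence labeler with a predicate-region mark (PyTorch assumed)
import torch
import torch.nn as nn

class BiLSTMTagger(nn.Module):
    def __init__(self, vocab_size, n_roles, emb_dim=100, hidden=300):
        super().__init__()
        self.word_emb = nn.Embedding(vocab_size, emb_dim)
        self.mark_emb = nn.Embedding(2, 16)   # binary predicate-region mark
        self.lstm = nn.LSTM(emb_dim + 16, hidden, num_layers=2,
                            bidirectional=True, batch_first=True)
        self.out = nn.Linear(2 * hidden, n_roles)

    def forward(self, words, pred_mark):
        x = torch.cat([self.word_emb(words), self.mark_emb(pred_mark)], dim=-1)
        h, _ = self.lstm(x)           # (batch, seq, 2 * hidden)
        return self.out(h)            # per-token role logits

model = BiLSTMTagger(vocab_size=10000, n_roles=20)
words = torch.randint(0, 10000, (1, 6))
mark = torch.tensor([[0, 0, 1, 0, 0, 0]])   # token 2 is the predicate
print(model(words, mark).shape)              # torch.Size([1, 6, 20])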

      2. He et al., 2017: ‘What Works and What’s Next’

        • model
          • Word embeddings plus a binary predicate indicator (the "super minimal" word representation below)
          • Deep BiLSTM encoder with highway connections and recurrent dropout
          • Constrained decoding to keep label sequences well formed
        • innovation

          • No syntax
          • Super minimal word representation
          • Exploits the representational power of NNs to the fullest
            • Highway networks
            • Recurrent dropout

      3. Marcheggiani et al., 2017: ‘A Simple and Accurate Syntax-Agnostic Neural Model for Dependency-based Semantic Role Labeling’
      • model

        • Dependency-based SRL
        • Shallow syntactic information (POS tags)
        • Intuitions from syntactic dependency parsing
        • Local classifier
        • Word encoding
          • Pretrained word embedding
          • Randomly initialized embedding
          • Randomly initialized embedding of POS tags
          • Embeddings of the predicate lemmas
          • Predicate flag
        • A standard (non-snake) BiLSTM
          • the forward LSTM encodes left context
          • the backward LSTM encodes right context
          • forward and backward states are concatenated
      • innovation

        • Little bit of syntax (POS tags)
        • More sophisticated word representation
        • Fast local classifier conditioned on predicate representation
    • Syntax-aware neural methods (syntax strikes back!)

      • Is syntax important for semantics?

        • POS tags are beneficial [Marcheggiani et al., 2017]
        • Gold syntax is beneficial (but hard to encode) [He et al., 2017]
        • Encoding syntax with graph convolutional networks
          • [Marcheggiani and Titov, 2017]

      1. Marcheggiani and Titov, 2017: ‘Encoding Sentences with Graph Convolutional Networks for Semantic Role Labeling’
        • model
          • Word encoding [Marcheggiani et al., 2017]
          • Sentence encoding with a BiLSTM [Marcheggiani et al., 2017]
          • Syntax encoding with graph convolutional networks (GCNs)
            • Skip connections [Kipf and Welling, 2016]
            • Each word is enriched with the representation of its syntactic neighborhood (longer dependencies are captured); see the sketch below
          • Local classifier [Marcheggiani et al., 2017]
        • innovation
          • Encoding structured prior linguistic knowledge in NNs
            • Syntax
            • Semantics
            • Coreference
            • Discourse
          • Complementing LSTMs with skip connections for long dependencies
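A sketch of one syntactic GCN layer in this spirit (PyTorch assumed; the paper additionally uses label-specific biases and edge gates, omitted here):

# Sketch: one syntactic GCN layer over dependency arcs
import torch
import torch.nn as nn

class SyntacticGCNLayer(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.w_in = nn.Linear(dim, dim)    # messages along head -> dependent arcs
        self.w_out = nn.Linear(dim, dim)   # messages along dependent -> head arcs
        self.w_self = nn.Linear(dim, dim)  # self loop

    def forward(self, h, edges):
        """h: (n_words, dim) encoder states; edges: list of (head, dependent)."""
        out = self.w_self(h)
        for head, dep in edges:
            out[dep] = out[dep] + self.w_in(h[head])
            out[head] = out[head] + self.w_out(h[dep])
        return torch.relu(out)

h = torch.randn(6, 64)            # BiLSTM states for a 6-word sentence
edges = [(2, 1), (2, 4)]          # e.g. arcs from the predicate to subj and obj
print(SyntacticGCNLayer(64)(h, edges).shape)   # torch.Size([6, 64])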
      • We can live without syntax (out of domain)

      • But life with syntax is better

        • and the better the syntax (parsers) the better our semantic role labeler
      • What’s the (present) future?

        • Multi-task learning
        • Swayamdipta et al. (2017): frame-semantic parsing + syntax
        • Peng et al. (2017): multi-task learning over different semantic formalisms
      • Neural networks work (I kid you not) …
      • … but we do have (a lot of) linguistic prior knowledge…
      • … and it is time to use it again.
