The task of understanding participants and their relationship to events—being able to answer the question “Who did what to whom” (and perhaps also “when and where”)—is a central question of natural language understanding.
Semantic roles are representations that express the abstract role that arguments of a predicate can take in the event. Semantic role labeling is the task of assigning roles to the constituents or phrases in sentences.
Selectional restrictions are the semantic sortal restrictions or preferences that each individual predicate can express about its potential arguments, such as the fact that the theme of the verb eat is generally something edible.
Consider how in Chapter 14 we represented the meaning of arguments for sentences like these:
(18.1) Sasha broke the window.
(18.2) Pat opened the door.
A neo-Davidsonian event representation of these two sentences would be
$$\exists e,x,y\ Breaking(e) \land Breaker(e,Sasha) \land BrokenThing(e,y) \land Window(y)$$
$$\exists e,x,y\ Opening(e) \land Opener(e,Pat) \land OpenedThing(e,y) \land Door(y)$$
In this representation, the roles of the subjects of the verbs break and open are Breaker and Opener respectively. These deep roles are specific to each event; Breaking events have Breakers, Opening events have Openers, and so on.
Thematic roles are a way to capture the semantic commonality between Breakers and Openers. We say that the subjects of both these verbs are agents. Thus, AGENT is the thematic role that represents an abstract idea such as volitional causation. Similarly, the direct objects of both these verbs, the BrokenThing and OpenedThing, are both prototypically inanimate objects that are affected in some way by the action. The semantic role for these participants is theme.
Although there is no universally agreed-upon set of roles, Figs. 18.1 and 18.2 list some thematic roles. We’ll use the general term semantic roles for all sets of roles, whether small or large.
[Figs. 18.1 and 18.2: some commonly used thematic roles (images unavailable)]
Consider these possible realizations of the thematic arguments of the verb break:
[Example sentences showing realizations of the thematic arguments of break (image unavailable)]
These examples suggest that break has (at least) the possible arguments AGENT, THEME, and INSTRUMENT. The set of thematic role arguments taken by a verb is often called the thematic grid, $\theta$-grid, or case frame. We can see that there are (among others) the following possibilities for the realization of these arguments of break:
AGENT/Subject, THEME/Object
AGENT/Subject, THEME/Object, INSTRUMENT/PP with
INSTRUMENT/Subject, THEME/Object
THEME/Subject
It turns out that many verbs allow their thematic roles to be realized in various syntactic positions. For example, verbs like give can realize the THEME and GOAL arguments in two different ways:
(18.8) a. [AGENT Doris] gave [THEME the book] [GOAL to Cary].
b. [AGENT Doris] gave [GOAL Cary] [THEME the book].
These multiple argument structure realizations (the fact that break can take AGENT, INSTRUMENT, or THEME as subject, and give can realize its THEME and GOAL in either order) are called verb alternations or diathesis alternations. The alternation we showed above for give, the dative alternation, seems to occur with particular semantic classes of verbs, including “verbs of future having” (advance, allocate, offer, owe), “send verbs” (forward, hand, mail), “verbs of throwing” (kick, pass, throw), and so on. Levin (1993) lists for 3100 English verbs the semantic classes to which they belong (47 high-level classes, divided into 193 more specific classes) and the various alternations in which they participate. These lists of verb classes have been incorporated into the online resource VerbNet (Kipper et al., 2000), which links each verb to both WordNet and FrameNet entries.
Representing meaning at the thematic role level seems like it should be useful in dealing with complications like diathesis alternations. Yet it has proved quite difficult to come up with a standard set of roles, and equally difficult to produce a formal definition of roles like AGENT, THEME, or INSTRUMENT.
For example, researchers attempting to define role sets often find they need to fragment a role like AGENT or THEME into many specific roles. Levin and Rappaport Hovav (2005) summarize a number of such cases, such as the fact there seem to be at least two kinds of INSTRUMENTS, intermediary instruments that can appear as subjects and enabling instruments that cannot:
(18.9) a. The cook opened the jar with the new gadget.
b. The new gadget opened the jar.
(18.10) a. Shelly ate the sliced banana with a fork.
b. *The fork ate the sliced banana.
In addition to the fragmentation problem, there are cases in which we’d like to reason about and generalize across semantic roles, but the finite discrete lists of roles don’t let us do this.
Finally, it has proved difficult to formally define the thematic roles. Consider the AGENT role; most cases of AGENTS are animate, volitional, sentient, causal, but any individual noun phrase might not exhibit all of these properties.
These problems have led to alternative semantic role models that use either many fewer or many more roles.
The first of these options is to define generalized semantic roles that abstract over the specific thematic roles. For example, PROTO-AGENT and PROTO-PATIENT are generalized roles that express roughly agent-like and roughly patient-like meanings. These roles are defined, not by necessary and sufficient conditions, but rather by a set of heuristic features that accompany more agent-like or more patient-like meanings. Thus, the more an argument displays agent-like properties (being volitionally involved in the event, causing an event or a change of state in another participant, being sentient or intentionally involved, moving) the greater the likelihood that the argument can be labeled a PROTO-AGENT. The more patient-like the properties (undergoing change of state, causally affected by another participant, stationary relative to other participants, etc.), the greater the likelihood that the argument can be labeled a PROTO-PATIENT.
The second direction is instead to define semantic roles that are specific to a particular verb or a particular group of semantically related verbs or nouns.
In the next two sections we describe two commonly used lexical resources that make use of these alternative versions of semantic roles. PropBank uses both proto-roles and verb-specific semantic roles. FrameNet uses semantic roles that are specific to a general semantic idea called a frame.
The Proposition Bank, generally referred to as PropBank, is a resource of sentences annotated with semantic roles. The English PropBank labels all the sentences in the Penn TreeBank; the Chinese PropBank labels sentences in the Penn Chinese TreeBank. Because of the difficulty of defining a universal set of thematic roles, the semantic roles in PropBank are defined with respect to an individual verb sense. Each sense of each verb thus has a specific set of roles, which are given only numbers rather than names: Arg0, Arg1, Arg2, and so on. In general, Arg0 represents the PROTO-AGENT, and Arg1, the PROTO-PATIENT. The semantics of the other roles are less consistent, often being defined specifically for each verb. Nonetheless there are some generalizations; the Arg2 is often the benefactive, instrument, attribute, or end state, the Arg3 the start point, benefactive, instrument, or attribute, and the Arg4 the end point.
Here are some slightly simplified PropBank entries for one sense each of the verbs agree and fall. Such PropBank entries are called frame files; note that the definitions in the frame file for each role (“Other entity agreeing”, “Extent, amount fallen”) are informal glosses intended to be read by humans, rather than being formal definitions.
(18.11) agree.01
Arg0: Agreer
Arg1: Proposition
Arg2: Other entity agreeing
Ex1: [Arg0 The group] agreed [Arg1 it wouldn’t make an offer].
Ex2: [ArgM-TMP Usually] [Arg0 John] agrees [Arg2 with Mary] [Arg1 on everything].
(18.12) fall.01
Arg1: Logical subject, patient, thing falling
Arg2: Extent, amount fallen
Arg3: start point
Arg4: end point, end state of arg1
Ex1: [Arg1 Sales] fell [Arg4 to $25 million] [Arg3 from $27 million].
Ex2: [Arg1 The average junk bond] fell [Arg2 by 4.2%].
Note that there is no Arg0 role for fall, because the normal subject of fall is a PROTO-PATIENT.
The PropBank semantic roles can be useful in recovering shallow semantic information about verbal arguments. Consider the verb increase:
(18.13) increase.01 “go up incrementally”
Arg0: causer of increase
Arg1: thing increasing
Arg2: amount increased by, EXT, or MNR
Arg3: start point
Arg4: end point
A PropBank semantic role labeling would allow us to infer the commonality in the event structures of the following three examples, that is, that in each case Big Fruit Co. is the AGENT and the price of bananas is the THEME, despite the differing surface forms.
(18.14) [Arg0 Big Fruit Co. ] increased [Arg1 the price of bananas].
(18.15) [Arg1 The price of bananas] was increased again [Arg0 by Big Fruit Co. ]
(18.16) [Arg1 The price of bananas] increased [Arg2 5%].
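To make the inference above concrete, here is a minimal Python sketch. The dictionary encoding of the increase.01 roleset, the `(label, text)` pair format, and the helper name `fillers` are all assumptions made for illustration; they are not PropBank's actual file format or API. Tokens that the examples leave unlabeled are marked `"O"`.

```python
# Hypothetical, simplified encoding of the increase.01 roleset shown above.
INCREASE_01 = {
    "Arg0": "causer of increase",
    "Arg1": "thing increasing",
    "Arg2": "amount increased by, EXT, or MNR",
    "Arg3": "start point",
    "Arg4": "end point",
}

# Examples (18.14)-(18.16) as lists of (label, text) pairs.
# "V" marks the predicate; "O" marks material left unlabeled in the examples.
LABELED = [
    [("Arg0", "Big Fruit Co."), ("V", "increased"),
     ("Arg1", "the price of bananas")],
    [("Arg1", "The price of bananas"), ("O", "was"), ("V", "increased"),
     ("O", "again"), ("Arg0", "by Big Fruit Co.")],
    [("Arg1", "The price of bananas"), ("V", "increased"), ("Arg2", "5%")],
]


def fillers(sentence, role):
    """Return the text of every span labeled with `role` in one sentence."""
    return [text for label, text in sentence if label == role]


# The same participant fills Arg1 in all three surface forms.
things_increasing = {t.lower() for s in LABELED for t in fillers(s, "Arg1")}
print(INCREASE_01["Arg1"], "->", things_increasing)
```

Whatever the surface position (object, passive subject, or intransitive subject), looking up Arg1 recovers the same filler, which is the kind of shallow semantic generalization the role labels are meant to support.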
PropBank also has a number of non-numbered arguments called ArgMs (ArgM-TMP, ArgM-LOC, etc.), which represent modification or adjunct meanings. These are relatively stable across predicates, so they aren’t listed with each frame file. Data labeled with these modifiers can be helpful in training systems to detect temporal, location, or directional modification across predicates. Some of the ArgMs include:
ArgM | Meaning | Examples |
---|---|---|
TMP | when? | yesterday evening, now |
LOC | where? | at the museum, in San Francisco |
DIR | where to/from? | down, to Bangkok |
MNR | how? | clearly, with much enthusiasm |
PRP/CAU | why? | because … , in response to the ruling |
REC | | themselves, each other |
ADV | miscellaneous | |
PRD | secondary predication | …ate the meat raw |
While PropBank focuses on verbs, a related project, NomBank (Meyers et al., 2004) adds annotations to noun predicates. For example the noun agreement in Apple’s agreement with IBM would be labeled with Apple as the Arg0 and IBM as the Arg2. This allows semantic role labelers to assign labels to arguments of both verbal and nominal predicates.
While making inferences about the semantic commonalities across different sentences with increase is useful, it would be even more useful if we could make such inferences in many more situations, across different verbs, and also between verbs and nouns. For example, we’d like to extract the similarity among these three sentences:
(18.17) [Arg1 The price of bananas] increased [Arg2 5%].
(18.18) [Arg1 The price of bananas] rose [Arg2 5%].
(18.19) There has been a [Arg2 5%] rise [Arg1 in the price of bananas].
Note that the second example uses the different verb rise, and the third example uses the noun rather than the verb rise. We’d like a system to recognize that the price of bananas is what went up, and that 5% is the amount it went up, no matter whether the 5% appears as the object of the verb increased or as a nominal modifier of the noun rise.
The FrameNet project is another semantic-role-labeling project that attempts to address just these kinds of problems (Baker et al. 1998, Fillmore et al. 2003, Fillmore and Baker 2009, Ruppenhofer et al. 2016). Whereas roles in the PropBank project are specific to an individual verb, roles in the FrameNet project are specific to a frame.
What is a frame? Consider the following set of words:
reservation, flight, travel, buy, price, cost, fare, rates, meal, plane
There are many individual lexical relations of hyponymy, synonymy, and so on between many of the words in this list. The resulting set of relations does not, however, add up to a complete account of how these words are related. They are clearly all defined with respect to a coherent chunk of common-sense background information concerning air travel.
We call the holistic background knowledge that unites these words a frame (Fillmore, 1985). The idea that groups of words are defined with respect to some background information is widespread in artificial intelligence and cognitive science, where besides frame we see related terms like model (Johnson-Laird, 1983), or even script (Schank and Abelson, 1977).
A frame in FrameNet is a background knowledge structure that defines a set of frame-specific semantic roles, called frame elements, and includes a set of predicates that use these roles. Each word evokes a frame and profiles some aspect of the frame and its elements. The FrameNet dataset includes a set of frames and frame elements, the lexical units associated with each frame, and a set of labeled example sentences. For example, the change position on a scale frame is defined as follows:
This frame consists of words that indicate the change of an Item’s position on a scale (the Attribute) from a starting point (Initial value) to an end point (Final value).
Some of the semantic roles (frame elements) in the frame are defined as in Fig. 18.3. Note that these are separated into core roles, which are frame specific, and non-core roles, which are more like the Arg-M arguments in PropBank, expressing more general properties of time, location, and so on.
[Fig. 18.3: core and non-core frame elements of the frame (image unavailable)]
Here are some example sentences:
(18.20) [ITEM Oil] rose [ATTRIBUTE in price] [DIFFERENCE by 2%].
(18.21) [ITEM It] has increased [FINAL STATE to having them 1 day a month].
(18.22) [ITEM Microsoft shares] fell [FINAL VALUE to 7 5/8].
(18.23) [ITEM Colon cancer incidence] fell [DIFFERENCE by 50%] [GROUP among men].
(18.24) a steady increase [INITIAL VALUE from 9.5] [FINAL VALUE to 14.3] [ITEM in dividends]
(18.25) a [DIFFERENCE 5%] [ITEM dividend] increase…
Note from these example sentences that the frame includes target words like rise, fall, and increase. In fact, the complete frame consists of the following words:
[List of the words in the frame (image unavailable)]
FrameNet also codes relationships between frames, allowing frames to inherit from each other, or representing relations between frames like causation (and generalizations among frame elements in different frames can be represented by inheritance as well). Thus, there is a Cause_change_of_position_on_a_scale frame that is linked to the Change_of_position_on_a_scale frame by the cause relation, but that adds an AGENT role and is used for causative examples such as the following:
(18.26) [AGENT They] raised [ITEM the price of their soda] [DIFFERENCE by 2%].
Together, these two frames would allow an understanding system to extract the common event semantics of all the verbal and nominal causative and non-causative usages.
FrameNets have also been developed for many other languages including Spanish, German, Japanese, Portuguese, Italian, and Chinese.
Semantic role labeling (sometimes shortened as SRL) is the task of automatically finding the semantic roles of each argument of each predicate in a sentence. Current approaches to semantic role labeling are based on supervised machine learning, often using the FrameNet and PropBank resources to specify what counts as a predicate, define the set of roles used in the task, and provide training and test sets.
Recall that the difference between these two models of semantic roles is that FrameNet (18.27) employs many frame-specific frame elements as roles, while PropBank (18.28) uses a smaller number of numbered argument labels that can be interpreted as verb-specific labels, along with the more general ARGM labels. Some examples:
(18.27) [COGNIZER You] can’t [TARGET blame] [EVALUEE the program] [REASON for being unable to identify it]
(18.28) [ARG0 The San Francisco Examiner] [TARGET issued] [ARG1 a special edition] [ARGM-TMP yesterday]
A simplified feature-based semantic role labeling algorithm is sketched in Fig. 18.4. Feature-based algorithms—from the very earliest systems like (Simmons, 1973)—begin by parsing, using broad-coverage parsers to assign a parse to the input string. Figure 18.5 shows a parse of (18.28) above. The parse is then traversed to find all words that are predicates.
[Fig. 18.4: sketch of a simplified feature-based semantic role labeling algorithm (image unavailable)]
[Fig. 18.5: parse tree for (18.28) (image unavailable)]
For each of these predicates, the algorithm examines each node in the parse tree and uses supervised classification to decide the semantic role (if any) it plays for this predicate. Given a labeled training set such as PropBank or FrameNet, a feature vector is extracted for each node, using feature templates described in the next subsection. A 1-of-N classifier is then trained to predict a semantic role for each constituent given these features, where N is the number of potential semantic roles plus an extra NONE role for non-role constituents. Any standard classification algorithm can be used. Finally, for each test sentence to be labeled, the classifier is run on each relevant constituent.
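Instead of training a single-stage classifier as in Fig. 18.4, the node-level classification task can be broken down into multiple steps:
Pruning: simple heuristics remove constituents that are unlikely to be arguments of the predicate.
Identification: a binary classification of each remaining node as an argument to be labeled or NONE.
Classification: a 1-of-N classification of all the constituents that the previous stage labeled as arguments.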
The separation of identification and classification may lead to better use of features (different features may be useful for the two tasks) or to computational efficiency.
Global Optimization
The classification algorithm of Fig. 18.4 classifies each argument separately (‘locally’), making the simplifying assumption that each argument of a predicate can be labeled independently. This assumption is false; there are interactions between arguments that require a more ‘global’ assignment of labels to constituents. For example, constituents in FrameNet and PropBank are required to be non-overlapping. More significantly, the semantic roles of constituents are not independent. For example, PropBank does not allow multiple identical arguments; two constituents of the same verb cannot both be labeled ARG0.
Role labeling systems thus often add a fourth step to deal with global consistency across the labels in a sentence. For example, the local classifiers can return a list of possible labels associated with probabilities for each constituent, and a second-pass Viterbi decoding or re-ranking approach can be used to choose the best consensus label. Integer linear programming (ILP) is another common way to choose a solution that conforms best to multiple constraints.
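As a rough illustration of such a global step, the sketch below takes per-constituent label distributions from a hypothetical local classifier and picks the most probable joint labeling in which no numbered argument appears twice. It uses plain exhaustive search, not Viterbi decoding, re-ranking, or ILP, so it is only practical for a handful of constituents. The function name, the probabilities, and the treatment of `ARGM-*` labels as repeatable are assumptions for illustration, and the non-overlap constraint is not modeled.

```python
from itertools import product


def best_consistent_labeling(candidates):
    """Exhaustively search joint labelings of the constituents.

    candidates: one dict per constituent mapping label -> local probability.
    Returns (labels, score) for the highest-scoring labeling in which no
    numbered argument (ARG0, ARG1, ...) is used more than once.
    """
    best, best_score = None, -1.0
    for labels in product(*(c.keys() for c in candidates)):
        core = [l for l in labels if l != "NONE" and not l.startswith("ARGM")]
        if len(core) != len(set(core)):
            continue  # a duplicated numbered argument violates the constraint
        score = 1.0
        for dist, label in zip(candidates, labels):
            score *= dist[label]
        if score > best_score:
            best, best_score = list(labels), score
    return best, best_score


# Made-up local classifier outputs for three constituents.
local = [
    {"ARG0": 0.6, "ARG1": 0.3, "NONE": 0.1},
    {"ARG0": 0.5, "ARG1": 0.4, "NONE": 0.1},
    {"ARGM-TMP": 0.7, "NONE": 0.3},
]
local_argmax = [max(d, key=d.get) for d in local]
labels, score = best_consistent_labeling(local)
print(local_argmax)  # the purely local choice labels two constituents ARG0
print(labels)
```

Here the local argmax assigns ARG0 to both of the first two constituents, while the constrained search relabels the second one as ARG1, its next most probable option.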
Features for Semantic Role Labeling
Most systems use some generalization of the core set of features introduced by Gildea and Jurafsky (2000). Common basic feature templates (demonstrated on the NP-SBJ constituent The San Francisco Examiner in Fig. 18.5) include:
The following feature vector thus represents the first NP in our example (recall that most observations will have the value NONE rather than, for example, ARG0, since most constituents in the parse tree will not bear a semantic role):
ARG0: [issued, NP, Examiner, NNP, NP↑S↓VP↓VBD, active, before, VP → NP PP, ORG, The, Examiner]
Other features are often used in addition, such as sets of n-grams inside the constituent, or more complex versions of the path features (the upward or downward halves, or whether particular nodes occur in the path).
It’s also possible to use dependency parses instead of constituency parses as the basis of features, for example using dependency parse paths instead of constituency paths.
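The path feature in the vector above (NP↑S↓VP↓VBD) can be computed by walking up from the constituent to the lowest node that also dominates the predicate, then down to the predicate. The following is a minimal sketch under stated assumptions: the `Node` class, the function names, and the toy tree (S over NP and VP, with VP over VBD, NP, and PP) are hypothetical stand-ins, not the output of any particular parser.

```python
class Node:
    """A bare-bones parse-tree node with a parent pointer."""

    def __init__(self, label, children=()):
        self.label = label
        self.children = list(children)
        self.parent = None
        for child in self.children:
            child.parent = self


def ancestors(node):
    """Return [node, parent, grandparent, ..., root]."""
    chain = []
    while node is not None:
        chain.append(node)
        node = node.parent
    return chain


def path_feature(constituent, predicate):
    """Labels from the constituent up to the lowest common ancestor,
    then down to the predicate, joined with up and down arrows."""
    up = ancestors(constituent)
    down = ancestors(predicate)
    down_ids = {id(n) for n in down}
    # index of the first node above the constituent that dominates the predicate
    k = next(i for i, n in enumerate(up) if id(n) in down_ids)
    j = next(i for i, n in enumerate(down) if n is up[k])
    up_part = [n.label for n in up[:k + 1]]
    down_part = [n.label for n in reversed(down[:j])]
    return "↑".join(up_part) + "".join("↓" + label for label in down_part)


# Toy tree, assumed for illustration: S -> NP VP, VP -> VBD NP PP.
vbd, obj, pp = Node("VBD"), Node("NP"), Node("PP")
vp = Node("VP", [vbd, obj, pp])
subj = Node("NP")
s = Node("S", [subj, vp])

print(path_feature(subj, vbd))  # NP↑S↓VP↓VBD
print(path_feature(obj, vbd))   # NP↑VP↓VBD
```

The subject and the object get different paths to the same predicate, which is what makes this feature useful for separating ARG0-like from ARG1-like constituents.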
The standard neural algorithm for semantic role labeling is based on the bi-LSTM IOB tagger introduced in Chapter 9, which we’ve seen applied to part-of-speech tagging and named entity tagging, among other tasks. Recall that with IOB tagging, we have a begin and an inside tag for each possible role (B-ARG0, I-ARG0; B-ARG1, I-ARG1, and so on), plus an outside tag O.
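As a small illustration of the tag scheme, the sketch below converts labeled argument spans for (18.28) into one IOB tag per word. The span format `(start, end, role)` with an exclusive end index and the function name are assumptions; the predicate itself is simply tagged O here, although a system could equally give it its own tag.

```python
def spans_to_iob(n_tokens, spans):
    """Convert non-overlapping (start, end, role) spans, end exclusive,
    into a list of IOB tags with one tag per token."""
    tags = ["O"] * n_tokens
    for start, end, role in spans:
        tags[start] = "B-" + role
        for i in range(start + 1, end):
            tags[i] = "I-" + role
    return tags


tokens = "The San Francisco Examiner issued a special edition yesterday".split()
spans = [(0, 4, "ARG0"), (5, 8, "ARG1"), (8, 9, "ARGM-TMP")]
tags = spans_to_iob(len(tokens), spans)
for token, tag in zip(tokens, tags):
    print(f"{token}\t{tag}")
```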
As with all the taggers, the goal is to compute the highest probability tag sequence $\hat y$, given the input sequence of words $w$:
$$\hat y = \mathop{\arg\max}_{y\in T}P(y|w)$$
In algorithms like He et al. (2017), each input word is mapped to pre-trained embeddings, and also associated with an embedding for a flag (0/1) variable indicating whether that input word is the predicate. These concatenated embeddings are passed through multiple layers of bi-directional LSTM. State-of-the-art algorithms tend to be deeper than for POS or NER tagging, using 3 to 4 layers (6 to 8 total LSTMs). Highway layers can be used to connect these layers as well.
Output from the last bi-LSTM can then be turned into an IOB sequence as for POS or NER tagging. Tags can be locally optimized by taking the bi-LSTM output and passing it through a single layer into a softmax for each word, which creates a probability distribution over all SRL tags; the most likely tag for word $x_i$ is chosen as $t_i$, computing for each word essentially:
$$\hat y_i = \mathop{\arg\max}_{t\in tags}P(t|w_i)$$
However, just as in feature-based SRL tagging, this local approach to decoding doesn’t exploit the global constraints between tags; a tag I-ARG0, for example, must follow another I-ARG0 or B-ARG0.
As we saw for POS and NER tagging, there are many ways to take advantage of these global constraints. A CRF layer can be used instead of a softmax layer on top of the bi-LSTM output, and the Viterbi decoding algorithm can be used to decode from the CRF.
An even simpler Viterbi decoding algorithm that may perform equally well and doesn’t require adding CRF complexity to the training process is to start with the simple softmax. The softmax output (the entire probability distribution over tags) for each word is then treated as a lattice, and we can do Viterbi decoding through the lattice. The hard IOB constraints can act as the transition probabilities in the Viterbi decoding (thus the transition from state I-ARG0 to I-ARG1 would have probability 0). Alternatively, the training data can be used to learn bigram or trigram tag transition probabilities as if doing HMM decoding. Fig. 18.6 shows a sketch of the algorithm.
[Fig. 18.6: sketch of the neural semantic role labeling algorithm (image unavailable)]
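The softmax-lattice decoding described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of any published system: the per-word distributions are made-up numbers standing in for softmax outputs, transitions are either allowed (probability 1) or forbidden (probability 0) by the IOB constraint, and the function names are hypothetical.

```python
import math

NEG_INF = float("-inf")


def allowed(prev, cur):
    """Hard IOB constraint: I-X may only follow B-X or I-X."""
    if cur.startswith("I-"):
        return prev in ("B-" + cur[2:], "I-" + cur[2:])
    return True


def log_prob(p):
    return math.log(p) if p > 0 else NEG_INF


def constrained_viterbi(probs, tags):
    """probs[i][t] is the (softmax) probability of tag t at word i.
    Returns the most probable tag sequence that obeys `allowed`."""
    # A well-formed sequence cannot start with an I- tag.
    score = {t: NEG_INF if t.startswith("I-") else log_prob(probs[0][t])
             for t in tags}
    backpointers = []
    for i in range(1, len(probs)):
        new_score, pointers = {}, {}
        for cur in tags:
            best_prev, best = None, NEG_INF
            for prev in tags:
                if allowed(prev, cur) and score[prev] > best:
                    best_prev, best = prev, score[prev]
            pointers[cur] = best_prev
            new_score[cur] = (best + log_prob(probs[i][cur])
                              if best_prev is not None else NEG_INF)
        backpointers.append(pointers)
        score = new_score
    path = [max(tags, key=lambda t: score[t])]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    return path[::-1]


TAGS = ["O", "B-ARG0", "I-ARG0", "B-ARG1", "I-ARG1"]
probs = [  # made-up softmax outputs for a three-word input
    {"O": 0.10, "B-ARG0": 0.80, "I-ARG0": 0.05, "B-ARG1": 0.03, "I-ARG1": 0.02},
    {"O": 0.10, "B-ARG0": 0.05, "I-ARG0": 0.35, "B-ARG1": 0.10, "I-ARG1": 0.40},
    {"O": 0.70, "B-ARG0": 0.05, "I-ARG0": 0.10, "B-ARG1": 0.10, "I-ARG1": 0.05},
]
greedy = [max(p, key=p.get) for p in probs]
decoded = constrained_viterbi(probs, TAGS)
print(greedy)   # B-ARG0 followed by I-ARG1: not a well-formed sequence
print(decoded)
```

The greedy per-word choice produces I-ARG1 directly after B-ARG0, which the constraint forbids; the constrained decoder instead returns B-ARG0, I-ARG0, O, the best sequence among the well-formed ones.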
The standard evaluation for semantic role labeling is to require that each argument label must be assigned to the exactly correct word sequence or parse constituent, and then compute precision, recall, and F-measure. Identification and classification can also be evaluated separately. Two common datasets used for evaluation are CoNLL-2005 (Carreras and Marquez, 2005) and CoNLL-2012 (Pradhan et al., 2013).
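A minimal sketch of this exact-match scoring is shown below. It assumes arguments are represented as `(start, end, role)` spans for a single predicate and reports the balanced F-measure (F1); the function name and the toy gold and predicted spans are illustrative assumptions, and this is not the official CoNLL scoring script.

```python
def srl_prf(gold, predicted):
    """Exact-match precision, recall, and F1 over sets of
    (start, end, role) argument spans."""
    correct = len(gold & predicted)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


gold = {(0, 4, "ARG0"), (5, 8, "ARG1"), (8, 9, "ARGM-TMP")}
# The predicted ARG1 span misses its first word, so it earns no credit.
pred = {(0, 4, "ARG0"), (6, 8, "ARG1"), (8, 9, "ARGM-TMP")}
p, r, f = srl_prf(gold, pred)
print(f"P={p:.3f} R={r:.3f} F1={f:.3f}")
```

Because the match must be exact, a span with the right label but a wrong boundary counts both as a false positive and as a missed gold argument.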
We turn in this section to another way to represent facts about the relationship between predicates and arguments. A selectional restriction is a semantic type constraint that a verb imposes on the kind of concepts that are allowed to fill its argument roles. Consider the two meanings associated with the following example:
(18.29) I want to eat someplace nearby.
There are two possible parses and semantic interpretations for this sentence. In the sensible interpretation, eat is intransitive and the phrase someplace nearby is an adjunct that gives the location of the eating event. In the nonsensical speaker-as-Godzilla interpretation, eat is transitive and the phrase someplace nearby is the direct object and the THEME of the eating, like the NP Malaysian food in the following sentence:
(18.30) I want to eat Malaysian food.
How do we know that someplace nearby isn’t the direct object in this sentence? One useful cue is the semantic fact that the THEME of EATING events tends to be something that is edible. This restriction placed by the verb eat on the filler of its THEME argument is a selectional restriction.
Selectional restrictions are associated with senses, not entire lexemes. We can see this in the following examples of the lexeme serve:
(18.31) The restaurant serves green-lipped mussels.
(18.32) Which airlines serve Denver?
Example (18.31) illustrates the offering-food sense of serve, which ordinarily restricts its THEME to be some kind of food. Example (18.32) illustrates the “provides a commercial service to” sense of serve, which constrains its THEME to be some type of appropriate location.
Selectional restrictions vary widely in their specificity. The verb imagine, for example, imposes strict requirements on its AGENT role (restricting it to humans and other animate entities) but places very few semantic requirements on its THEME role. A verb like diagonalize, on the other hand, places a very specific constraint on the filler of its THEME role: it has to be a matrix, while the arguments of the adjective odorless are restricted to concepts that could possess an odor:
(18.33) In rehearsal, I often ask the musicians to imagine a tennis game.
(18.34) Radon is an odorless gas that can’t be detected by human senses.
(18.35) To diagonalize a matrix is to find its eigenvalues.
These examples illustrate that the set of concepts we need to represent selectional restrictions (being a matrix, being able to possess an odor, etc) is quite open ended. This distinguishes selectional restrictions from other features for representing lexical knowledge, like parts-of-speech, which are quite limited in number.
One way to capture the semantics of selectional restrictions is to use and extend the event representation of Chapter 14. Recall that the neo-Davidsonian representation of an event consists of a single variable that stands for the event, a predicate denoting the kind of event, and variables and relations for the event roles. Ignoring the issue of the $\lambda$-structures and using thematic roles rather than deep event roles, the semantic contribution of a verb like eat might look like the following:
$$\exists e,x,y\ Eating(e) \land Agent(e,x) \land Theme(e,y)$$
With this representation, all we know about y, the filler of the THEME role, is that it is associated with an Eating event through the Theme relation. To stipulate the selectional restriction that y must be something edible, we simply add a new term to that effect:
$$\exists e,x,y\ Eating(e) \land Agent(e,x) \land Theme(e,y) \land EdibleThing(y)$$
When a phrase like ate a hamburger is encountered, a semantic analyzer can form the following kind of representation:
$$\exists e,x,y\ Eating(e) \land Agent(e,x) \land Theme(e,y) \land EdibleThing(y) \land Hamburger(y)$$
This representation is perfectly reasonable since the membership of y in the category Hamburger is consistent with its membership in the category EdibleThing, assuming a reasonable set of facts in the knowledge base. Correspondingly, the representation for a phrase such as ate a takeoff would be ill-formed because membership in an event-like category such as Takeoff would be inconsistent with membership in the category EdibleThing.
While this approach adequately captures the semantics of selectional restrictions, there are two problems with its direct use. First, using FOL to perform the simple task of enforcing selectional restrictions is overkill. Other, far simpler, formalisms can do the job with far less computational cost. The second problem is that this approach presupposes a large, logical knowledge base of facts about the concepts that make up selectional restrictions. Unfortunately, although such common-sense knowledge bases are being developed, none currently have the kind of coverage necessary to the task.
A more practical approach is to state selectional restrictions in terms of WordNet synsets rather than as logical concepts. Each predicate simply specifies a WordNet synset as the selectional restriction on each of its arguments. A meaning representation is well-formed if the role filler word is a hyponym (subordinate) of this synset.
For our ate a hamburger example, for instance, we could set the selectional restriction on the THEME role of the verb eat to the synset {food, nutrient}, glossed as any substance that can be metabolized by an animal to give energy and build tissue. Luckily, the chain of hypernyms for hamburger shown in Fig. 18.7 reveals that hamburgers are indeed food. Again, the filler of a role need not match the restriction synset exactly; it just needs to have the synset as one of its superordinates.
[Fig. 18.7: the chain of hypernyms for hamburger (image unavailable)]
We can apply this approach to the THEME roles of the verbs imagine, lift, and diagonalize, discussed earlier. Let us restrict imagine’s THEME to the synset {entity}, lift’s THEME to {physical entity}, and diagonalize’s THEME to {matrix}. This arrangement correctly permits imagine a hamburger and lift a hamburger, while also correctly ruling out diagonalize a hamburger.
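The hyponym check can be sketched with a hand-built hypernym table. The table below is a simplified toy loosely modeled on WordNet; it is an assumption for illustration, not real WordNet data or the WordNet API, and it ignores word senses and multiple inheritance. The names `HYPERNYM`, `RESTRICTION`, `satisfies`, and `well_formed` are hypothetical.

```python
# Toy child -> parent hypernym links (assumed, heavily simplified).
HYPERNYM = {
    "hamburger": "sandwich",
    "sandwich": "snack food",
    "snack food": "dish",
    "dish": "nutriment",
    "nutriment": "food, nutrient",
    "food, nutrient": "substance",
    "substance": "physical entity",
    "physical entity": "entity",
    "matrix": "array",
    "array": "arrangement",
    "arrangement": "group",
    "group": "abstraction",
    "abstraction": "entity",
}

# Selectional restriction on the THEME role of each verb, as in the text.
RESTRICTION = {
    "eat": "food, nutrient",
    "imagine": "entity",
    "lift": "physical entity",
    "diagonalize": "matrix",
}


def satisfies(word, synset):
    """True if `synset` is `word` itself or one of its superordinates."""
    node = word
    while node is not None:
        if node == synset:
            return True
        node = HYPERNYM.get(node)
    return False


def well_formed(verb, theme):
    """Check the THEME filler against the verb's restriction synset."""
    return satisfies(theme, RESTRICTION[verb])


for verb in ("eat", "imagine", "lift", "diagonalize"):
    print(verb, "a hamburger:", well_formed(verb, "hamburger"))
```

With a real lexical resource the same test would walk the hypernym paths of the filler's synset instead of this dictionary; the logic of climbing until the restriction synset is found (or the root is passed) stays the same.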
In the earliest implementations, selectional restrictions were considered strict constraints on the kind of arguments a predicate could take (Katz and Fodor 1963, Hirst 1987). For example, the verb eat might require that its THEME argument be [+FOOD]. Early word sense disambiguation systems used this idea to rule out senses that violated the selectional restrictions of their governing predicates.
Very quickly, however, it became clear that these selectional restrictions were better represented as preferences rather than strict constraints (Wilks 1975c, Wilks 1975b). For example, selectional restriction violations (like inedible arguments of eat) often occur in well-formed sentences, for example because they are negated (18.36), or because selectional restrictions are overstated (18.37):
(18.36) But it fell apart in 1931, perhaps because people realized you can’t eat gold for lunch if you’re hungry.
(18.37) In his two championship trials, Mr. Kulkarni ate glass on an empty stomach, accompanied only by water and tea.
Modern systems for selectional preferences therefore specify the relation between a predicate and its possible arguments with soft constraints of some kind.
Selectional Association
One of the most influential has been the selectional association model of Resnik (1993). Resnik defines the idea of selectional preference strength as the general amount of information that a predicate tells us about the semantic class of its arguments. For example, the verb eat tells us a lot about the semantic class of its direct objects, since they tend to be edible. The verb be, by contrast, tells us less about its direct objects. The selectional preference strength can be defined by the difference in information between two distributions: the distribution of expected semantic classes $P(c)$ (how likely is it that a direct object will fall into class $c$) and the distribution of expected semantic classes for the particular verb $P(c|v)$ (how likely is it that the direct object of the specific verb $v$ will fall into semantic class $c$). The greater the difference between these distributions, the more information the verb is giving us about possible objects. The difference between these two distributions can be quantified by relative entropy, or the Kullback-Leibler divergence (Kullback and Leibler, 1951). The Kullback-Leibler or KL divergence $D(P||Q)$ expresses the difference between two probability distributions $P$ and $Q$.
$$D(P||Q) = \sum_x P(x)\log\frac{P(x)}{Q(x)}$$
The selectional preference $S_R(v)$ uses the KL divergence to express how much information, in bits, the verb $v$ expresses about the possible semantic class of its argument.
$$S_R(v) = D(P(c|v)||P(c)) = \sum_c P(c|v)\log \frac{P(c|v)}{P(c)}$$
Resnik then defines the selectional association of a particular class and verb as the relative contribution of that class to the general selectional preference of the verb:
$$A_R(v,c) =\frac{1}{S_R(v)}P(c|v)\log \frac{P(c|v)}{P(c)}$$
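To see how these two quantities behave, here is a small worked sketch. The three semantic classes and both distributions are invented toy numbers (chosen as powers of two so the arithmetic is easy to follow), not estimates from any corpus, and the function names are hypothetical. Logarithms are taken base 2 so that the strength is in bits.

```python
import math


def preference_strength(p_c_given_v, p_c):
    """S_R(v): KL divergence D(P(c|v) || P(c)) in bits."""
    return sum(p * math.log2(p / p_c[c])
               for c, p in p_c_given_v.items() if p > 0)


def selectional_association(p_c_given_v, p_c, c):
    """A_R(v, c): the share of S_R(v) contributed by class c."""
    strength = preference_strength(p_c_given_v, p_c)
    p = p_c_given_v.get(c, 0.0)
    if p == 0 or strength == 0:
        return 0.0
    return p * math.log2(p / p_c[c]) / strength


# Toy distributions over direct-object classes (assumed values).
p_c = {"FOOD": 0.25, "ARTIFACT": 0.5, "ABSTRACTION": 0.25}
p_c_eat = {"FOOD": 0.5, "ARTIFACT": 0.25, "ABSTRACTION": 0.25}

s_eat = preference_strength(p_c_eat, p_c)
assoc = {c: selectional_association(p_c_eat, p_c, c) for c in p_c}
print(s_eat)  # 0.5*log2(2) + 0.25*log2(0.5) + 0.25*log2(1) = 0.25 bits
print(assoc)
```

In this toy case FOOD is twice as likely after the verb as in general and gets a positive association, ARTIFACT is half as likely and gets a negative one, and ABSTRACTION is unchanged and contributes zero. By construction the associations over all classes sum to 1.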
The selectional association is thus a probabilistic measure of the strength of association between a predicate and a class dominating the argument to the predicate. Resnik estimates the probabilities for these associations by parsing a corpus, counting all the times each predicate occurs with each argument word, and assuming that each word is a partial observation of all the WordNet concepts containing the word. The following table from Resnik (1996) shows some sample high and low selectional associations for verbs and some WordNet semantic classes of their direct objects.
Verb | Direct Object Semantic Class | Assoc | Direct Object Semantic Class | Assoc |
---|---|---|---|---|
read | WRITING | 6.80 | ACTIVITY | -.20 |
write | WRITING | 7.26 | COMMERCE | 0 |
see | ENTITY | 5.79 | METHOD | -0.01 |
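The selectional preference strength $S_R(v)$ and the selectional association $A_R(v,c)$ defined above can be sketched in a few lines of Python. Everything concrete here is an assumption for illustration: the function names, the three semantic classes, and all of the probabilities are invented, and Resnik's actual estimates come from a parsed corpus and WordNet.

```python
import math


def preference_strength(p_c_given_v, p_c):
    """S_R(v): KL divergence between P(c|v) and P(c), in bits."""
    return sum(
        pcv * math.log2(pcv / p_c[c])
        for c, pcv in p_c_given_v.items()
        if pcv > 0
    )


def selectional_association(p_c_given_v, p_c, c):
    """A_R(v, c): the share of S_R(v) contributed by class c."""
    strength = preference_strength(p_c_given_v, p_c)
    pcv = p_c_given_v.get(c, 0.0)
    if pcv == 0.0 or strength == 0.0:
        return 0.0
    return pcv * math.log2(pcv / p_c[c]) / strength


# Invented prior over direct-object classes, P(c).
p_c = {"FOOD": 0.25, "ARTIFACT": 0.5, "PERSON": 0.25}
# Invented class distributions for two verbs, P(c|v).
p_c_given_eat = {"FOOD": 0.75, "ARTIFACT": 0.125, "PERSON": 0.125}
p_c_given_be = {"FOOD": 0.2, "ARTIFACT": 0.5, "PERSON": 0.3}

print(preference_strength(p_c_given_eat, p_c))  # large: eat is selective
print(preference_strength(p_c_given_be, p_c))   # small: be is not
for cls in p_c:
    print(cls, selectional_association(p_c_given_eat, p_c, cls))
```

Because each association is a term of the sum divided by the sum itself, the associations of a verb add up to 1, and a class that is less likely after the verb than in general gets a negative value.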
Selectional Preference via Conditional Probability
An alternative to using selectional association between a verb and the WordNet class of its arguments is to simply use the conditional probability of an argument word given a predicate verb. This simple model of selectional preferences can be used to directly model the strength of association of one verb (predicate) with one noun (argument).
The conditional probability model can be computed by parsing a very large corpus (billions of words), and computing co-occurrence counts: how often a given verb occurs with a given noun in a given relation. The conditional probability of an argument noun given a verb for a particular relation, $P(n|v,r)$, can then be used as a selectional preference metric for that pair of words (Brockmann and Lapata, 2003):
$$P(n|v,r) =\begin{cases}\frac{C(n,v,r)}{C(v,r)}& \textrm{if }C(n,v,r) > 0\\ 0 &\textrm{otherwise}\end{cases}$$
The inverse probability $P(v|n,r)$ was found to have better performance in some cases (Brockmann and Lapata, 2003):
$$P(v|n,r) =\begin{cases}\frac{C(n,v,r)}{C(n,r)}& \textrm{if }C(n,v,r) > 0\\ 0 &\textrm{otherwise}\end{cases}$$
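Both conditional probabilities can be sketched directly from co-occurrence counts. In this illustration the input is assumed to be a list of (noun, verb, relation) triples already extracted by a parser; the function names, the relation label `dobj`, and the toy triples are hypothetical.

```python
from collections import Counter


def build_counts(triples):
    """Count C(n,v,r), C(v,r) and C(n,r) from (noun, verb, relation) triples."""
    c_nvr = Counter(triples)
    c_vr = Counter((v, r) for _, v, r in triples)
    c_nr = Counter((n, r) for n, _, r in triples)
    return c_nvr, c_vr, c_nr


def p_noun_given_verb(n, v, r, c_nvr, c_vr):
    """P(n|v,r) = C(n,v,r) / C(v,r) if the triple was seen, else 0."""
    joint = c_nvr[(n, v, r)]
    return joint / c_vr[(v, r)] if joint > 0 else 0.0


def p_verb_given_noun(n, v, r, c_nvr, c_nr):
    """P(v|n,r) = C(n,v,r) / C(n,r) if the triple was seen, else 0."""
    joint = c_nvr[(n, v, r)]
    return joint / c_nr[(n, r)] if joint > 0 else 0.0


# Toy parsed data, invented for illustration only.
triples = [
    ("pizza", "eat", "dobj"),
    ("pizza", "eat", "dobj"),
    ("apple", "eat", "dobj"),
    ("pizza", "order", "dobj"),
]
c_nvr, c_vr, c_nr = build_counts(triples)

print(p_noun_given_verb("pizza", "eat", "dobj", c_nvr, c_vr))
print(p_verb_given_noun("apple", "eat", "dobj", c_nvr, c_nr))
print(p_noun_given_verb("door", "eat", "dobj", c_nvr, c_vr))  # unseen pair
```

Checking the joint count first keeps the unseen case from dividing by zero, which mirrors the "otherwise" branch of the two definitions.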
In cases where it’s not possible to get large amounts of parsed data, another option, at least for direct objects, is to get the counts from simple part-of-speech based approximations. For example, pairs can be extracted using the pattern “V Det N”, where V is any form of the verb, Det is the, a, or the empty string $\epsilon$, and N is the singular or plural form of the noun (Keller and Lapata, 2003).
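The “V Det N” pattern can be sketched over part-of-speech tagged text as follows. This is only an approximation of the idea: it assumes the input is a list of (word, tag) pairs with Penn Treebank style tags (`VB*` for verbs, `NN` and `NNS` for nouns), and it matches verb forms against a hand-supplied set rather than lemmatizing. The function name and the example sentence are hypothetical.

```python
def extract_verb_noun_pairs(tagged, verb_forms):
    """Collect (verb, noun) pairs matching V (the|a)? N in tagged text.

    tagged:      list of (word, tag) pairs, Penn Treebank style tags assumed
    verb_forms:  set of lowercase surface forms of the target verb
    """
    pairs = []
    for i, (word, tag) in enumerate(tagged):
        if not (tag.startswith("VB") and word.lower() in verb_forms):
            continue
        j = i + 1
        # Det is "the", "a", or empty, so the determiner is optional.
        if j < len(tagged) and tagged[j][0].lower() in ("the", "a"):
            j += 1
        # N is a singular (NN) or plural (NNS) common noun.
        if j < len(tagged) and tagged[j][1] in ("NN", "NNS"):
            pairs.append((word.lower(), tagged[j][0].lower()))
    return pairs


# Toy tagged input, invented for illustration only.
tagged = [
    ("She", "PRP"), ("ate", "VBD"), ("the", "DT"), ("apple", "NN"),
    ("and", "CC"), ("eats", "VBZ"), ("apples", "NNS"), ("quickly", "RB"),
    ("eat", "VB"), ("very", "RB"), ("pizza", "NN"),
]
print(extract_verb_noun_pairs(tagged, {"eat", "eats", "ate", "eaten", "eating"}))
```

The final "eat very pizza" sequence is not matched because the word after the verb is neither a determiner nor a noun, which shows how strictly adjacent the pattern is.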
An even simpler approach is to use the simple log co-occurrence frequency of the predicate with the argument, $\log \textrm{count}(v,n,r)$, instead of conditional probability. This seems to do better for extracting preferences for syntactic subjects than for objects (Brockmann and Lapata, 2003).
Evaluating Selectional Preferences
One way to evaluate models of selectional preferences is to use pseudowords (Gale et al. 1992c, Schütze 1992a). A pseudoword is an artificial word created by concatenating a test word in some context (say banana) with a confounder word (say door) to create banana-door. The task of the system is to identify which of the two words is the original word. To evaluate a selectional preference model (for example on the relationship between a verb and a direct object) we take a test corpus and select all verb tokens. For each verb token (say drive) we select the direct object (e.g., car) and concatenate it with a confounder word that is its nearest neighbor, the noun with the frequency closest to the original (say house), to make car/house. We then use the selectional preference model to choose which of car and house is the more preferred object of drive, and compute how often the model chooses the correct original object (e.g., car) (Chambers and Jurafsky, 2010).
Another evaluation method is to collect human judgments for a test set of verb-argument pairs by having subjects rate each pair's degree of plausibility. This is usually done by using magnitude estimation, a technique from psychophysics, in which subjects rate the plausibility of an argument proportional to a modulus item. A selectional preference model can then be evaluated by its correlation with the human preferences (Keller and Lapata, 2003).
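The pseudoword evaluation described above can be sketched as follows. The helper names, the noun frequencies, and the scoring table are all invented for illustration; `score` stands in for whatever selectional preference model is being evaluated. Counting a tie as an error is a design choice of this sketch, not something the text specifies.

```python
def nearest_frequency_confounder(noun, noun_freq):
    """Pick the other noun whose corpus frequency is closest to noun's."""
    target = noun_freq[noun]
    candidates = [n for n in noun_freq if n != noun]
    # Break frequency ties alphabetically so the choice is deterministic.
    return min(candidates, key=lambda n: (abs(noun_freq[n] - target), n))


def pseudoword_accuracy(test_pairs, noun_freq, score):
    """Fraction of (verb, object) pairs where the model prefers the original."""
    correct = 0
    for verb, obj in test_pairs:
        confounder = nearest_frequency_confounder(obj, noun_freq)
        if score(verb, obj) > score(verb, confounder):
            correct += 1  # ties are counted as errors
    return correct / len(test_pairs)


# Toy data, invented for illustration only.
noun_freq = {"car": 100, "house": 98, "banana": 20, "door": 22}
toy_scores = {("drive", "car"): 0.6, ("eat", "banana"): 0.5}


def score(verb, noun):
    """Hypothetical model: look up a preference score, defaulting to 0."""
    return toy_scores.get((verb, noun), 0.0)


test_pairs = [("drive", "car"), ("eat", "banana"), ("paint", "door")]
print(nearest_frequency_confounder("car", noun_freq))
print(pseudoword_accuracy(test_pairs, noun_freq, score))
```

In the toy run the model has no information about paint, so the car/house and banana/door cases are resolved correctly while the door/banana case is not.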
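As a rough sketch of the correlation-based evaluation, the function below computes a correlation coefficient between model scores and human plausibility ratings. The text only says "correlation", so the use of Pearson's r here is an assumption, and the function name and the ratings are invented for illustration.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length lists of numbers.

    Assumes neither list is constant, otherwise the denominator is zero.
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Invented model scores and human ratings for four verb-argument pairs.
model_scores = [0.9, 0.1, 0.6, 0.3]
human_ratings = [85.0, 10.0, 70.0, 20.0]
print(pearson_r(model_scores, human_ratings))
```

A value near 1 would indicate that the model ranks and spaces the pairs much as the human subjects do.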
One way of thinking about the semantic roles we have discussed through the chapter is that they help us define the roles that arguments play in a decompositional way, based on finite lists of thematic roles (agent, patient, instrument, proto-agent, proto-patient, etc.). This idea of decomposing meaning into sets of primitive semantic elements or features, called primitive decomposition or componential analysis, has been taken even further, and focused particularly on predicates.
Consider these examples of the verb kill:
(18.41) Jim killed his philodendron.
(18.42) Jim did something to cause his philodendron to become not alive.
There is a truth-conditional (‘propositional semantics’) perspective from which these two sentences have the same meaning. Assuming this equivalence, we could represent the meaning of kill as:
(18.43) KILL(x,y) $\Leftrightarrow$ CAUSE(x, BECOME(NOT(ALIVE(y))))
thus using semantic primitives like do, cause, become not, and alive.
Indeed, one such set of potential semantic primitives has been used to account for some of the verbal alternations discussed in Section 18.2 (Lakoff 1965, Dowty 1979). Consider the following examples.
(18.44) John opened the door. $\Rightarrow$ CAUSE(John, BECOME(OPEN(door)))
(18.45) The door opened. $\Rightarrow$ BECOME(OPEN(door))
(18.46) The door is open. $\Rightarrow$ OPEN(door)
The decompositional approach asserts that a single state-like predicate associated with open underlies all of these examples. The differences among the meanings of these examples arise from the combination of this single predicate with the primitives CAUSE and BECOME.
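One way to make this concrete is to build the three decompositions as nested structures and check that they share the same underlying state. The encoding as Python tuples and the helper `core_state` are assumptions made for illustration; they are not part of any of the cited formalisms.

```python
def OPEN(x):
    """The state-like predicate underlying all three sentences."""
    return ("OPEN", x)


def BECOME(state):
    """Wrap a state in the BECOME primitive (a change into that state)."""
    return ("BECOME", state)


def CAUSE(agent, event):
    """Wrap an event in the CAUSE primitive, with an explicit causer."""
    return ("CAUSE", agent, event)


def core_state(form):
    """Strip CAUSE and BECOME wrappers to reach the underlying state."""
    while form[0] in ("CAUSE", "BECOME"):
        form = form[-1]  # the embedded event or state is the last element
    return form


john_opened_the_door = CAUSE("John", BECOME(OPEN("door")))  # (18.44)
the_door_opened = BECOME(OPEN("door"))                      # (18.45)
the_door_is_open = OPEN("door")                             # (18.46)

for form in (john_opened_the_door, the_door_opened, the_door_is_open):
    print(form, "->", core_state(form))
```

All three forms reduce to the same `("OPEN", "door")` tuple, so the sentences differ only in which primitives wrap the shared state.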
While this approach to primitive decomposition can explain the similarity between states and actions or causative and non-causative predicates, it still relies on having a large number of predicates like open. More radical approaches choose to break down these predicates as well. One such approach to verbal predicate decomposition that played a role in early natural language understanding systems is conceptual dependency (CD), a set of ten primitive predicates, shown in Fig. 18.8.
[Figure 18.8: the conceptual dependency primitives (18.8.png; image failed to load)]
Below is an example sentence along with its CD representation. The verb brought is translated into the two primitives ATRANS and PTRANS to indicate that the waiter both physically conveyed the check to Mary and passed control of it to her. Note that CD also associates a fixed set of thematic roles with each primitive to represent the various participants in the action.
(18.47) The waiter brought Mary the check.
$$\exists x,y\ Atrans(x) \land Actor(x,Waiter) \land \ldots$$