(See the article on how to get a working copy of the repository.)
In the previous article, we presented an approach for capturing similarity between words that was concerned with the syntactic similarity of two strings. Today, we are back to discuss another approach that is more concerned with the meaning of words. Semantic similarity is a confidence score that reflects the semantic relation between the meanings of two sentences. It is difficult to achieve a high accuracy score because exact semantic meaning can only be fully understood in a particular context.
The goals of this article are to:
Before we go any further, let us start with a brief introduction to the groundwork.
WordNet is a lexical database which is available online, and provides a large repository of English lexical items. There is a multilingual WordNet for European languages which is structured in the same way as the English language WordNet.
WordNet was designed to establish the connections between four types of parts of speech (POS): noun, verb, adjective, and adverb. The smallest unit in WordNet is the synset, which represents a specific meaning of a word. It includes the word, its explanation, and its synonyms. The specific meaning of one word under one type of POS is called a sense. Each sense of a word is in a different synset. Synsets are equivalent to senses: structures containing sets of terms with synonymous meanings. Each synset has a gloss that defines the concept it represents. For example, the words night, nighttime, and dark constitute a single synset that has the following gloss: the time after sunset and before sunrise while it is dark outside. Synsets are connected to one another through explicit semantic relations. Some of these relations (hypernym and hyponym for nouns, hypernym and troponym for verbs) constitute is-a-kind-of hierarchies, while the holonym and meronym relations (for nouns) constitute is-a-part-of hierarchies.
For example, a tree is a kind of plant, so tree is a hyponym of plant and plant is a hypernym of tree. Analogously, a trunk is a part of a tree, so trunk is a meronym of tree and tree is a holonym of trunk. For one word and one type of POS, if there is more than one sense, WordNet organizes them from the most frequently used to the least frequently used (based on SemCor frequency counts).
Malcolm Crowe and Troy Simpson have developed an Open-Source .NET Framework library for WordNet, called WordNet.Net.
WordNet.Net was originally created by Malcolm Crowe, and it was known as a C# library for WordNet. It was created for WordNet 1.6, and stayed in its original form until after the release of WordNet 2.0 when Troy gained permission from Malcolm to use the code for freeware dictionary/thesaurus projects. Finally, after WordNet 2.1 was released, Troy released his version of Malcolm's library as an LGPL library known as WordNet.Net (with permission from Princeton and Malcolm Crowe, and in consultation with the Free Software Foundation), which was updated to work with the WordNet 2.1 database.
At the time of this writing, the WordNet.Net library has only been open source for a short time, but it is expected to mature as more projects such as this one spawn from its availability. Bug fixing and extensions to Malcolm's original library had been ongoing for over a year and a half prior to the release of the Open Source project. This is the project address of WordNet.Net.
Given two sentences, the measurement determines how similar the meanings of the two sentences are. The higher the score, the more similar the meanings.
Here are the steps for computing semantic similarity between two sentences:
Each sentence is partitioned into a list of words, and the stop words are removed. Stop words are frequently occurring, insignificant words that appear in a database record, article, web page, etc.
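As a rough illustration of this step (the Preprocessor class and the tiny stop-word list below are invented for the example and are not the project's actual code), one might write:

using System;
using System.Collections.Generic;
using System.Linq;

class Preprocessor
{
    // A tiny illustrative stop-word list; a real system would use a much larger one.
    static readonly HashSet<string> StopWords = new HashSet<string>(
        new[] { "a", "an", "the", "of", "in", "on", "and", "or", "is", "are", "to" });

    // Splits a sentence into lower-cased word tokens and drops the stop words.
    public static List<string> Tokenize(string sentence)
    {
        char[] separators = { ' ', ',', '.', ';', ':', '!', '?' };
        return sentence.ToLower()
                       .Split(separators, StringSplitOptions.RemoveEmptyEntries)
                       .Where(w => !StopWords.Contains(w))
                       .ToList();
    }
}

For instance, Tokenize("The Defense Ministry of the country") would return the tokens defense, ministry, and country.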
This task is to identify the correct part of speech (POS, such as noun, verb, pronoun, adverb, ...) of each word in the sentence. The algorithm takes a sentence and a specified tag set (a finite list of POS tags) as input. The output is a single best POS tag for each word. There are two types of taggers: the first attaches syntactic roles to each word (subject, object, ...), while the second attaches only functional roles (noun, verb, ...). A lot of work has been done on POS tagging. Taggers can be classified as rule-based or stochastic. Rule-based taggers use hand-written rules to resolve tag ambiguity; an example of rule-based tagging is Brill's tagger (the Eric Brill algorithm). Stochastic taggers resolve tagging ambiguities by using a training corpus to compute the probability of a given word having a given tag in a given context, for example, taggers based on Hidden Markov Models or maximum likelihood estimation.
There are two samples included for using the Brill Tagger from a C# application. The Brill Tagger tools, libraries, and samples can be found under the 3rd_Party_Tools_Data folder in the source repository.
One of the available ports is Steven Abbott's VB.NET port of the original Brill Tagger, which has in turn been ported to C# by Troy Simpson. The other is a VC++ port by Paul Maddox. The C# test program for Paul Maddox's port uses a wrapper to read stdout directly from the command line application. The wrapper was created using a template by Mike Mayer.
See the respective test applications for working examples on using the Brill Tagger from C#. The port of Steven Abbott's work is fairly new, but after some testing, it is likely that Paul's VC++ port will be deprecated and replaced with Troy's C# port of Steven's VB.NET work.
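As a toy illustration of the stochastic idea only (this is not Brill's algorithm and not the library's tagger; the ToyTagger class and its tiny frequency table are invented for the example), a most-frequent-tag tagger can be sketched like this:

using System;
using System.Collections.Generic;
using System.Linq;

class ToyTagger
{
    // Tiny hand-made word -> most frequent tag table (illustrative only).
    static readonly Dictionary<string, string> MostFrequentTag = new Dictionary<string, string>
    {
        { "the", "DT" }, { "cat", "NN" }, { "sat", "VBD" }, { "on", "IN" }, { "mat", "NN" }
    };

    // Tags each word with its most frequent tag, defaulting to NN for unknown words.
    public static string Tag(string sentence)
    {
        var tagged = sentence.ToLower()
            .Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries)
            .Select(w => w + "/" + (MostFrequentTag.ContainsKey(w) ? MostFrequentTag[w] : "NN"));
        return string.Join(" ", tagged);
    }
}

// ToyTagger.Tag("The cat sat on the mat") -> "the/DT cat/NN sat/VBD on/IN the/DT mat/NN"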
We use the Porter stemming algorithm. Porter stemming is a process of removing the common morphological and inflectional endings from words. It can be thought of as a lexicon finite state transducer with the following steps: surface form -> split the word into possible morphemes -> intermediate form -> map stems to categories and affixes to meanings -> underlying form. For example: foxes -> fox + s -> fox.
(+) Currently, these components are not used in the semantic similarity project, but they will soon be integrated. To get an idea of how they work, you can use the PorterStemmer class and the Brill Tagger sample in the repository.
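A hypothetical usage sketch of the stemmer follows; the method name StemWord is a placeholder, so check the PorterStemmer class in the repository for the actual API:

// Hypothetical sketch only: StemWord is a placeholder name, not the class's real method.
var stemmer = new PorterStemmer();
string stem1 = stemmer.StemWord("foxes");   // expected: "fox"
string stem2 = stemmer.StemWord("running"); // expected: "run"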
As you are already aware, a word can have more than one sense, which can lead to ambiguity. For example, the word "interest" has different meanings in the following two contexts:
Disambiguation is the process of finding the most appropriate sense of a word used in a given sentence. The Lesk algorithm [13] uses dictionary definitions (glosses) to disambiguate a polysemous word in a sentence context. Its main idea is to count the number of words shared between two glosses: the more overlapping words, the more related the senses are.
To disambiguate a word, the gloss of each of its senses is compared to the glosses of every other word in a phrase. A word is assigned to the sense whose gloss shares the largest number of words in common with the glosses of the other words.
For example, in performing disambiguation for the phrase "pine cone", according to the Oxford Advanced Learner's Dictionary, the word "pine" has two senses:
The word "cone" has three senses:
By comparing each of the two gloss senses of the word "pine" with each of the three senses of the word "cone", it is found that the words "evergreen tree" occur in one sense of each of the two words. These two senses are therefore declared to be the most appropriate senses when the words "pine" and "cone" are used together.
The original Lesk algorithm begins anew for each word and does not use the senses it has previously assigned. This greedy method does not always work effectively. Therefore, if computation time is not critical, we should look for an optimal sense combination by applying a local search technique such as beam search. The main idea behind such methods is to reduce the search space by applying heuristics. A beam searcher limits its attention to only the k most promising candidates at each stage of the search process, where k is a predefined number.
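A minimal sketch of the beam search idea follows (this is an assumption about how it could be structured, not the project's implementation; the score delegate stands for whatever gloss-overlap scoring is used and must accept partial assignments):

using System;
using System.Collections.Generic;
using System.Linq;

static class BeamSenseSearch
{
    // sensesPerWord[i] = number of senses of word i in the sentence.
    // score = placeholder delegate scoring a (possibly partial) sense assignment.
    public static int[] Search(int[] sensesPerWord, Func<int[], double> score, int k)
    {
        var beam = new List<int[]> { new int[0] };   // partial assignments: one sense index per word so far
        foreach (int senseCount in sensesPerWord)
        {
            var expanded = new List<int[]>();
            foreach (var candidate in beam)
                for (int s = 0; s < senseCount; s++)
                    expanded.Add(candidate.Concat(new[] { s }).ToArray());
            // Keep only the k most promising candidates at this stage.
            beam = expanded.OrderByDescending(score).Take(k).ToList();
        }
        return beam.First();                         // best full sense combination found
    }
}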
The original Lesk algorithm relied only on the gloss of a word and was restricted to its overlap scoring mechanism. In this section, we introduce an adapted version of the algorithm [16] with some improvements to overcome these limitations:
To disambiguate each word in a sentence that has N words, we call the word currently being disambiguated the target word. The algorithm is described in the following steps:
(*) All of them are applied with the same rule.
When computing the relatedness between two synsets s1 and s2, the pair hype-hype means the gloss for the hypernym of s1 is compared to the gloss for the hypernym of s2. The pair hype-hypo means that the gloss for the hypernym of s1 is compared to the gloss for the hyponym of s2.
OverallScore(s1, s2) = Score(hype(s1)-hypo(s2)) + Score(gloss(s1)-hypo(s2)) + Score(hype(s1)-gloss(s2)) + ... (OverallScore(s1, s2) is equivalent to OverallScore(s2, s1)).
In the example of "pine cone", there are three senses of pine and 6 senses of cone, so we can have a total of 18 possible combinations. One of them is the right one.
To score the overlaps, we use a new scoring mechanism that differentiates between single-word and N-consecutive-word overlaps and effectively treats each gloss as a bag of words. It is based on Zipf's law, which says that the length of a word is inversely proportional to its frequency of use: the shortest words are used most often, while the longest ones are used least often.
Measuring the overlap between two strings is reduced to the problem of finding the longest common substring with maximal consecutive words. Each overlap containing N consecutive words contributes N^2 to the score of the gloss sense combination. For example, an overlap "ABC" has a score of 3^2 = 9, whereas the two overlaps "AB" and "C" have a score of 2^2 + 1^2 = 5.
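A minimal sketch of this scoring mechanism (assuming whitespace-tokenized glosses and ignoring any normalization the real code may do) repeatedly finds the longest common run of consecutive words, adds N^2, and removes it:

using System;
using System.Collections.Generic;
using System.Linq;

static class GlossOverlap
{
    // Scores the overlap of two glosses: each shared run of N consecutive words adds N^2.
    public static int Score(string gloss1, string gloss2)
    {
        var a = gloss1.ToLower().Split(' ').ToList();
        var b = gloss2.ToLower().Split(' ').ToList();
        int score = 0;
        while (true)
        {
            // Find the longest common run of consecutive words still present in both glosses.
            int bestLen = 0, bestA = 0, bestB = 0;
            for (int i = 0; i < a.Count; i++)
                for (int j = 0; j < b.Count; j++)
                {
                    int len = 0;
                    while (i + len < a.Count && j + len < b.Count && a[i + len] == b[j + len]) len++;
                    if (len > bestLen) { bestLen = len; bestA = i; bestB = j; }
                }
            if (bestLen == 0) break;
            score += bestLen * bestLen;        // N consecutive words contribute N^2
            a.RemoveRange(bestA, bestLen);     // remove the matched run so it is not counted twice
            b.RemoveRange(bestB, bestLen);
        }
        return score;
    }
}

// Score("a b c x", "a b c y") = 3^2 = 9; Score("a b x c", "a b y c") = 2^2 + 1^2 = 5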
If you intend to work further on this topic, you should also refer to the Hirst-St.Onge measure, which is based on finding lexical chains between synsets.
The above method allows us to find the most appropriate sense for each word in a sentence. To compute the similarity between two sentences, we build on the semantic similarity between word senses, which we capture using path length similarity.
In WordNet, the words of each part of speech (nouns, verbs, ...) are organized into taxonomies, where each node is a set of synonyms (a synset) representing a single sense. If a word has more than one sense, it appears in multiple synsets at various locations in the taxonomy. WordNet defines relations between synsets and relations between word senses. A relation between synsets is a semantic relation, while a relation between word senses is a lexical relation. The difference is that lexical relations are relations between members of two different synsets, whereas semantic relations hold between two whole synsets. For instance:
Using the example, the antonym of the tenth sense of the noun light (light#n#10) in WordNet is the first sense of the noun dark (dark#n#1). The synset to which it belongs is {light#n#10, lighting#n#1}. Clearly, it makes sense that light#n#10 is an antonym of dark#n#1, but lighting#n#1 is not an antonym of dark#n#1; therefore, the antonym relation needs to be a lexical relation, not a semantic relation. Semantic similarity is a special case of semantic relatedness where we only consider the IS-A relationship.
To measure the semantic similarity between two synsets, we use the hyponym/hypernym (is-a) relations. Due to the limitations of the is-a hierarchies, we only work with the "noun-noun" and "verb-verb" parts of speech.
A simple way to measure the semantic similarity between two synsets is to treat the taxonomy as an undirected graph and measure the distance between them in WordNet. As P. Resnik put it: "The shorter the path from one node to another, the more similar they are." Note that the path length is measured in nodes/vertices rather than in links/edges. The length of the path between two members of the same synset is 1 (the synonym relation).
This figure shows an example of the hyponym taxonomy in WordNet used for path length similarity measurement:
In the above figure, we observe that the length between car and auto is 1, between car and truck is 3, between car and bicycle is 4, and between car and fork is 12.
A shared parent of two synsets is known as a subsumer. The least common subsumer (LCS) of two synsets is the subsumer that does not have any children which are also subsumers of the two synsets. In other words, the LCS is the most specific subsumer of the two synsets. Returning to the above example, the LCS of {car, auto, ...} and {truck, ...} is {automotive, motor vehicle}, since {automotive, motor vehicle} is more specific than the common subsumer {wheeled vehicle}.
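To make the node-counting idea concrete, here is a small self-contained sketch; the tiny taxonomy is hand-built from the figure's example and is not read from WordNet:

using System;
using System.Collections.Generic;
using System.Linq;

static class PathLength
{
    // Shortest path length between two concepts, counted in nodes (so synonyms give 1).
    public static int Distance(Dictionary<string, string[]> edges, string start, string goal)
    {
        // Breadth-first search over the undirected is-a graph.
        var dist = new Dictionary<string, int> { { start, 1 } };
        var queue = new Queue<string>();
        queue.Enqueue(start);
        while (queue.Count > 0)
        {
            var node = queue.Dequeue();
            if (node == goal) return dist[node];
            foreach (var next in Neighbours(edges, node).Where(n => !dist.ContainsKey(n)))
            {
                dist[next] = dist[node] + 1;
                queue.Enqueue(next);
            }
        }
        return int.MaxValue; // no path found
    }

    static IEnumerable<string> Neighbours(Dictionary<string, string[]> edges, string node)
    {
        string[] parents;
        var up = edges.TryGetValue(node, out parents) ? parents : new string[0];
        var down = edges.Where(e => e.Value.Contains(node)).Select(e => e.Key);
        return up.Concat(down);
    }
}

// Hand-built fragment of the figure's taxonomy (child -> hypernyms), for illustration only:
// var isA = new Dictionary<string, string[]>
// {
//     { "car", new[] { "motor vehicle" } },
//     { "truck", new[] { "motor vehicle" } },
//     { "motor vehicle", new[] { "wheeled vehicle" } },
//     { "bicycle", new[] { "wheeled vehicle" } }
// };
// PathLength.Distance(isA, "car", "truck") == 3, PathLength.Distance(isA, "car", "bicycle") == 4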
The path length gives us a simple way to compute the relatedness distance between two word senses. There are some issues that need to be addressed:
Using the Lexicon class, when considering a word, we first check whether it is a noun; if so, we treat it as a noun and disregard its verb or adjective senses. If it is not a noun, we check whether it is a verb, and so on.

There are many proposals for measuring the semantic similarity between two synsets: Wu & Palmer, Leacock and Chodorow, and P. Resnik. In this work, we experimented with two simple measurements:
Sim(s, t) = 1 / distance(s, t)

where distance(s, t) is the length of the path from s to t using node counting.

The second formula, which was used in the previous article, takes into account not only the length of the path but also the order of the senses involved in the path:
Sim(s, t) = SenseWeight(s) * SenseWeight(t) / PathLength

where:
s and t: denote the source and target words being compared.
SenseWeight: denotes a weight calculated from the frequency of use of this sense and the total frequency of use of all senses of the word.
PathLength: denotes the length of the connection path from s to t.
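The two measurements can be sketched in code as follows (the SenseWeight inputs are the sense frequency counts described above, for example from WordNet's SemCor-based tag counts):

// First measurement: similarity as the inverse of the node-counting distance.
static float Sim1(int pathLength)
{
    return 1.0f / pathLength;
}

// Weight of a sense: its frequency of use over the total frequency of all senses of the word.
static float SenseWeight(int senseFrequency, int totalFrequency)
{
    return (float)senseFrequency / totalFrequency;
}

// Second measurement: also weights each end by how frequently that sense is used.
static float Sim2(float senseWeightS, float senseWeightT, int pathLength)
{
    return senseWeightS * senseWeightT / pathLength;
}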
We will now describe the overall strategy for capturing semantic similarity between two sentences. Given two sentences X and Y, we denote by m the length of X and by n the length of Y. The major steps can be described as follows:
scoreSum <- 0;
foreach (X[i] in X)
{
    bestCandidate <- -1;
    bestScore <- -maxInt;
    foreach (Y[j] in Y)
    {
        if (Y[j] is still free && R[i, j] > bestScore)
        {
            bestScore <- R[i, j];
            bestCandidate <- j;
        }
    }
    if (bestCandidate != -1)
    {
        mark the bestCandidate as matched item;
        scoreSum <- scoreSum + bestScore;
    }
}
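The pseudo-code above can be turned into a small C# routine; this is only a sketch, assuming the pairwise word similarity scores have already been stored in a matrix R[m, n]:

// Greedy bipartite matching: each word of X is matched to the best still-free word of Y.
// R[i, j] holds the semantic similarity between X[i] and Y[j].
static float MatchScore(float[,] R)
{
    int m = R.GetLength(0), n = R.GetLength(1);
    var taken = new bool[n];               // has Y[j] already been matched?
    float scoreSum = 0;
    for (int i = 0; i < m; i++)
    {
        int bestCandidate = -1;
        float bestScore = float.MinValue;
        for (int j = 0; j < n; j++)
        {
            if (!taken[j] && R[i, j] > bestScore)
            {
                bestScore = R[i, j];
                bestCandidate = j;
            }
        }
        if (bestCandidate != -1)
        {
            taken[bestCandidate] = true;   // mark the best candidate as matched
            scoreSum += bestScore;
        }
    }
    return scoreSum;
}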
Here, match(X, Y) denotes the matching word tokens between X and Y. The similarity is computed by dividing the sum of the similarity values of all matched candidates of both sentences X and Y by the total number of tokens. An important point is that the result is built from the individual similarity values, so the overall similarity always reflects their influence. We apply this strategy with the MS1 formula.

For example, given two sentences X and Y with lengths of 3 and 2 respectively, suppose the bipartite matcher returns that X[1] has matched Y[1] with a score of 0.8, and X[2] has matched Y[2] with a score of 0.7:
The overall score is: 2 * (0.8 + 0.7) / (3 + 2) = 0.6.
To run this code, you should install WordNet 2.1. Currently, the source code is stored in the Google Code repository. Please read the article Using the WordNet.Net subversion repository before downloading the source code. The following code is used to test the semantic similarity function:
void Test()
{
    SemanticSimilarity semsim = new SemanticSimilarity();
    float score = semsim.GetScore("Defense Ministry", "Department of defence");
}
Time restrictions are a problem; whenever possible, we would like to:
In this article, you have seen a simple approach to capturing semantic similarity. This work may have many limitations, since we are not an NLP research group, and there are still some things that need to be improved. Once the final work is approved, we will move a copy to CodeProject. This process may take a few working days.
There is a Perl Open Source package for semantic similarity from T. Pedersen and his team. Unfortunately, we do not know Perl; it would be very helpful if someone could migrate it to .NET. We'll stop here for now and hope that others might be inspired to work on WordNet.Net to develop this open source library to make it more useful.
(*) is the co-author of this article.