

by Divya Godayal

An introduction to part-of-speech tagging and the Hidden Markov Model

by Sachin Malhotra and Divya Godayal

Let’s go back into the times when we had no language to communicate. The only way we had was sign language. That’s how we usually communicate with our dog at home, right? When we tell him, “We love you, Jimmy,” he responds by wagging his tail. This doesn’t mean he knows what we are actually saying. Instead, his response is simply because he understands the language of emotions and gestures more than words.

We as humans have developed an understanding of the nuances of natural language that goes beyond that of any other animal on this planet. That is why when we say “I LOVE you, honey” vs when we say “Let’s make LOVE, honey” we mean different things. Since we understand the basic difference between the two phrases, our responses are very different. It is these very intricacies in natural language understanding that we want to teach to a machine.

What this could mean is when your future robot dog hears “I love you, Jimmy”, he would know LOVE is a Verb. He would also realize that it’s an emotion that we are expressing, to which he would respond in a certain way. And maybe when you are telling your partner “Let’s make LOVE”, the dog would just stay out of your business.

This is just an example of how teaching a robot to communicate in a language known to us can make things easier.


The primary use case being highlighted in this example is how important it is to understand the difference in the usage of the word LOVE, in different contexts.


Part-of-Speech Tagging

From a very small age, we have been made accustomed to identifying part of speech tags. For example, reading a sentence and being able to identify what words act as nouns, pronouns, verbs, adverbs, and so on. All these are referred to as the part of speech tags.

Let’s look at the Wikipedia definition for them:


In corpus linguistics, part-of-speech tagging (POS tagging or PoS tagging or POST), also called grammatical tagging or word-category disambiguation, is the process of marking up a word in a text (corpus) as corresponding to a particular part of speech, based on both its definition and its context — i.e., its relationship with adjacent and related words in a phrase, sentence, or paragraph. A simplified form of this is commonly taught to school-age children, in the identification of words as nouns, verbs, adjectives, adverbs, etc.

Identifying part of speech tags is much more complicated than simply mapping words to their part of speech tags. This is because POS tagging is not something that is generic. It is quite possible for a single word to have a different part of speech tag in different sentences based on different contexts. That is why it is impossible to have a generic mapping for POS tags.

As you can see, it is not possible to manually find out different part-of-speech tags for a given corpus. New types of contexts and new words keep coming up in dictionaries in various languages, and manual POS tagging is not scalable in itself. That is why we rely on machine-based POS tagging.

Before proceeding further and looking at how part-of-speech tagging is done, we should look at why POS tagging is necessary and where it can be used.


Why Part-of-Speech tagging?

Part-of-Speech tagging in itself may not be the solution to any particular NLP problem. It is however something that is done as a pre-requisite to simplify a lot of different problems. Let us consider a few applications of POS tagging in various NLP tasks.

Text to Speech Conversion

Let us look at the following sentence:


They refuse to permit us to obtain the refuse permit.

The word refuse is being used twice in this sentence and has two different meanings here. refUSE (/rəˈfyo͞oz/) is a verb meaning “deny,” while REFuse (/ˈrefˌyo͞os/) is a noun meaning “trash” (that is, they are not homophones). Thus, we need to know which word is being used in order to pronounce the text correctly. (For this reason, text-to-speech systems usually perform POS-tagging.)

Have a look at the part-of-speech tags generated for this very sentence by the NLTK package.


>>> import nltk
>>> from nltk import word_tokenize
>>> text = word_tokenize("They refuse to permit us to obtain the refuse permit")
>>> nltk.pos_tag(text)
[('They', 'PRP'), ('refuse', 'VBP'), ('to', 'TO'), ('permit', 'VB'), ('us', 'PRP'),
 ('to', 'TO'), ('obtain', 'VB'), ('the', 'DT'), ('refuse', 'NN'), ('permit', 'NN')]

As we can see from the results provided by the NLTK package, the POS tags for refUSE and REFuse are different. Given these two different POS tags, our text-to-speech converter can produce a different set of sounds for each.

Similarly, let us look at yet another classical application of POS tagging: word sense disambiguation.


Word Sense Disambiguation

Let’s talk about this kid called Peter. Since his mother is a neurological scientist, she didn’t send him to school. His life was devoid of science and math.

One day she conducted an experiment, and made him sit for a math class. Even though he didn’t have any prior subject knowledge, Peter thought he aced his first test. His mother then took an example from the test and published it as below. (Kudos to her!)

Words often occur in different senses as different parts of speech. For example:

  • She saw a bear.


  • Your efforts will bear fruit.


The word bear in the above sentences has completely different senses, but more importantly one is a noun and the other is a verb. Rudimentary word sense disambiguation is possible if you can tag words with their POS tags.

Word-sense disambiguation (WSD) is identifying which sense of a word (that is, which meaning) is used in a sentence, when the word has multiple meanings.


Try to think of the multiple meanings for this sentence:


Time flies like an arrow


Here are the various interpretations of the given sentence. The meaning and hence the part-of-speech might vary for each word.

As we can clearly see, there are multiple interpretations possible for the given sentence. Different interpretations yield different part-of-speech tags for the words. This information, if available to us, can help us find out the exact interpretation of the sentence, and then we can proceed from there.

The above example shows us that a single sentence can have three different POS tag sequences assigned to it that are equally likely. That means that it is very important to know what specific meaning is being conveyed by the given sentence whenever it’s appearing. This is word sense disambiguation, as we are trying to find out THE sequence.

These are just two of the numerous applications where we would require POS tagging. There are other applications as well which require POS tagging, like Question Answering, Speech Recognition, Machine Translation, and so on.

Now that we have a basic knowledge of different applications of POS tagging, let us look at how we can go about actually assigning POS tags to all the words in our corpus.


Types of POS taggers

POS-tagging algorithms fall into two distinctive groups:


  • Rule-Based POS Taggers


  • Stochastic POS Taggers


E. Brill’s tagger, one of the first and most widely used English POS-taggers, employs rule-based algorithms. Let us first look at a very brief overview of what rule-based tagging is all about.

Rule-Based Tagging

Automatic part of speech tagging is an area of natural language processing where statistical techniques have been more successful than rule-based methods.


Typical rule-based approaches use contextual information to assign tags to unknown or ambiguous words. Disambiguation is done by analyzing the linguistic features of the word, its preceding word, its following word, and other aspects.

For example, if the preceding word is an article, then the word in question must be a noun. This information is coded in the form of rules.

Example of a rule:


If an ambiguous/unknown word X is preceded by a determiner and followed by a noun, tag it as an adjective.
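
To make the rule above concrete, here is a minimal Python sketch of how such a rule might be applied. It is only a toy illustration: the tag names (DET, NOUN, ADJ, VERB, UNK) and the helper apply_adjective_rule are hypothetical and not part of any library.

```python
def apply_adjective_rule(tagged):
    """Tag an unknown word as ADJ when it sits between a determiner and a noun.

    `tagged` is a list of (word, tag) pairs; 'UNK' marks an ambiguous/unknown word.
    """
    result = list(tagged)
    for i in range(1, len(result) - 1):
        word, tag = result[i]
        prev_tag = result[i - 1][1]
        next_tag = result[i + 1][1]
        if tag == "UNK" and prev_tag == "DET" and next_tag == "NOUN":
            result[i] = (word, "ADJ")
    return result


# Hypothetical input: "grumpy" has not been tagged yet.
sentence = [("the", "DET"), ("grumpy", "UNK"), ("dog", "NOUN"), ("barked", "VERB")]
print(apply_adjective_rule(sentence))
```

A real rule-based tagger would chain many such rules, which is exactly why writing them by hand gets out of control quickly.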

Defining a set of rules manually is an extremely cumbersome process and is not scalable at all. So we need some automatic way of doing this.

Brill’s tagger is a rule-based tagger that goes through the training data and finds the set of tagging rules that best describe the data and minimize POS tagging errors. The most important point to note here about Brill’s tagger is that the rules are not hand-crafted, but are instead learned from the corpus provided. The only feature engineering required is a set of rule templates that the model can use to come up with new features.

Let’s move ahead now and look at Stochastic POS tagging.


Stochastic Part-of-Speech Tagging

The term ‘stochastic tagger’ can refer to any number of different approaches to the problem of POS tagging. Any model which somehow incorporates frequency or probability may be properly labelled stochastic.

The simplest stochastic taggers disambiguate words based solely on the probability that a word occurs with a particular tag. In other words, the tag encountered most frequently in the training set with the word is the one assigned to an ambiguous instance of that word. The problem with this approach is that while it may yield a valid tag for a given word, it can also yield inadmissible sequences of tags.
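
As a rough sketch of this most-frequent-tag idea, the snippet below counts tags per word in a tiny hand-made training set. The training pairs and the helper most_frequent_tag are made up for illustration; a real tagger would be trained on a large annotated corpus.

```python
from collections import Counter, defaultdict

# Toy hand-tagged training data (hypothetical, for illustration only).
training = [
    ("the", "DT"), ("refuse", "NN"),
    ("they", "PRP"), ("refuse", "VBP"),
    ("we", "PRP"), ("refuse", "VBP"),
]

# For every word, count how often each tag was seen with it.
tag_counts = defaultdict(Counter)
for word, tag in training:
    tag_counts[word][tag] += 1


def most_frequent_tag(word):
    """Return the tag seen most often with `word`, or None for unseen words."""
    if word not in tag_counts:
        return None
    return tag_counts[word].most_common(1)[0][0]


# Every occurrence of "refuse" gets the same tag, whatever the context.
print([(w, most_frequent_tag(w)) for w in ["the", "refuse"]])
```

With this toy data, "the refuse" comes out as a determiner followed by a verb, which is the kind of inadmissible tag sequence mentioned above.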

An alternative to the word frequency approach is to calculate the probability of a given sequence of tags occurring. This is sometimes referred to as the n-gram approach, referring to the fact that the best tag for a given word is determined by the probability that it occurs with the n previous tags. This approach makes much more sense than the one defined before, because it considers the tags for individual words based on context.
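
Here is a small sketch of the n-gram idea for the simplest case, where only one previous tag is considered. The tag sequences and the helper transition_prob are assumptions made for this example; the probabilities are plain relative frequencies.

```python
from collections import Counter

# Toy tag sequences (hypothetical training data).
tag_sequences = [
    ["PRP", "VBP", "TO", "VB"],
    ["DT", "NN", "VBP"],
    ["DT", "NN", "NN"],
]

bigram_counts = Counter()   # how often tag `cur` follows tag `prev`
context_counts = Counter()  # how often tag `prev` is followed by anything
for tags in tag_sequences:
    for prev, cur in zip(tags, tags[1:]):
        bigram_counts[(prev, cur)] += 1
        context_counts[prev] += 1


def transition_prob(prev, cur):
    """Relative-frequency estimate of P(cur | prev) from the counts above."""
    if context_counts[prev] == 0:
        return 0.0
    return bigram_counts[(prev, cur)] / context_counts[prev]


print(transition_prob("DT", "NN"))
print(transition_prob("DT", "VBP"))
```

In this toy data a determiner is never followed by a verb, so that tag pair gets probability zero and a sequence containing it would be ruled out.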

The next level of complexity that can be introduced into a stochastic tagger combines the previous two approaches, using both tag sequence probabilities and word frequency measurements. This is known as the Hidden Markov Model (HMM).

Before looking at what a Hidden Markov Model is, let us first look at what a Markov Model is. That will help us better understand the meaning of the term Hidden in HMMs.

Markov Model

Say that there are only three kinds of weather conditions, namely


  • Rainy

  • Sunny

  • Cloudy


Now, since our young friend we introduced above, Peter, is a small kid, he loves to play outside. He loves it when the weather is sunny, because all his friends come out to play in the sunny conditions.

He hates the rainy weather for obvious reasons.


Every day, his mother observes the weather in the morning (that is when he usually goes out to play) and, like always, Peter comes up to her right after getting up and asks her to tell him what the weather is going to be like. Since she is a responsible parent, she wants to answer that question as accurately as possible. But the only thing she has is a set of observations taken over multiple days as to how the weather has been.

How does she make a prediction of the weather for today based on what the weather has been for the past N days?


Say you have a sequence. Something like this:

Sunny, Rainy, Cloudy, Cloudy, Sunny, Sunny, Sunny, Rainy

So, the weather for any given day can be in any of the three states.


Let’s say we decide to use a Markov Chain Model to solve this problem. Now using the data that we have, we can construct the following state diagram with the labelled probabilities.
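
As a rough illustration of where such labelled probabilities can come from, the sketch below estimates them from the eight-day sequence shown earlier by counting which weather follows which. With so few days the estimates are crude, so they should not be expected to match the numbers in the diagram.

```python
from collections import Counter, defaultdict

observed = ["Sunny", "Rainy", "Cloudy", "Cloudy", "Sunny", "Sunny", "Sunny", "Rainy"]

# Count how often each kind of day is followed by each other kind of day.
counts = defaultdict(Counter)
for today, tomorrow in zip(observed, observed[1:]):
    counts[today][tomorrow] += 1

# Turn the counts into relative frequencies, one row per current state.
transitions = {
    state: {nxt: n / sum(following.values()) for nxt, n in following.items()}
    for state, following in counts.items()
}

for state, row in transitions.items():
    print(state, row)
```

Each row of transitions plays the role of the arrows leaving one state in a state diagram.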

In order to compute the probability of today’s weather given N previous observations, we will use the Markovian Property.


A Markov chain is essentially the simplest known Markov model; that is, it obeys the Markov property.


The Markov property states that the distribution of a random variable in the future depends solely on its distribution in the current state, and none of the previous states have any impact on the future states.
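
Written as a formula, the property looks like this. The notation is an assumption made here for illustration: q_t stands for the state (the weather, in our example) on day t.

```latex
P(q_t \mid q_{t-1}, q_{t-2}, \dots, q_1) = P(q_t \mid q_{t-1})
```

Under the same assumption, the probability of a whole sequence of states factors into a product of one-step transitions:

```latex
P(q_1, q_2, \dots, q_n) = P(q_1) \prod_{i=2}^{n} P(q_i \mid q_{i-1})
```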


For a much more detailed explanation of the working of Markov chains, refer to this link.


Also, have a look at the following example just to see how probability of the current state can be computed using the formula above, taking into account the Markovian Property.


Apply the Markov property in the following example.


We can clearly see that as per the Markov property, the probability of tomorrow's weather being Sunny depends solely on today's weather and not on yesterday's.

Let us now proceed and see what is hidden in the Hidden Markov Models.


Hidden Markov Model

It’s the small kid Peter again, and this time he’s gonna pester his new caretaker — which is you. (Ooopsy!!)

As a caretaker, one of the most important tasks for you is to tuck Peter into bed and make sure he is sound asleep. Once you’ve tucked him in, you want to make sure he’s actually asleep and not up to some mischief.

You cannot, however, enter the room again, as that would surely wake Peter up. So all you have to go on are the noises that might come from the room. Either the room is quiet or there is noise coming from the room. These are your observations.

Peter’s mother, before leaving you to this nightmare, said:


May the sound be with you :)

His mother has given you the following state diagram. The diagram has some states, observations, and probabilities.

Note that there is no direct correlation between sound from the room and Peter being asleep.


There are two kinds of probabilities that we can see from the state diagram.


  • One is the emission probabilities, which represent the probabilities of making certain observations given a particular state. For example, we have P(noise | awake) = 0.5 . This is an emission probability.

  • The other is the transition probabilities, which represent the probability of transitioning to another state given a particular state. For example, we have P(asleep | awake) = 0.4 . This is a transition probability.

The Markovian property applies in this model as well. So do not complicate things too much. Markov, your savior, said:

Don’t go too much into the history…

The Markov property, as would be applicable to the example we have considered here, would be that the probability of Peter being in a state depends ONLY on the previous state.


But there is a clear flaw in the Markov property. If Peter has been awake for an hour, then the probability of him falling asleep is higher than if he has been awake for just 5 minutes. So, history matters. Therefore, the Markov state machine-based model is not completely correct. It’s merely a simplification.

The Markov property, although wrong, makes this problem very tractable.


We usually observe longer stretches of the child being awake and being asleep. If Peter is awake now, the probability of him staying awake is higher than of him going to sleep. Hence the 0.6 and 0.4 in the above diagram: P(awake | awake) = 0.6 and P(asleep | awake) = 0.4.
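
One way to hold these numbers in code is a pair of nested dictionaries, sketched below. Only P(awake | awake) = 0.6, P(asleep | awake) = 0.4 and P(noise | awake) = 0.5 come from the text; P(quiet | awake) = 0.5 follows because there are only two observations, and every value in the asleep rows is a placeholder assumed for illustration.

```python
# Transition probabilities: P(next state | current state).
transition = {
    "awake": {"awake": 0.6, "asleep": 0.4},   # stated in the text
    "asleep": {"awake": 0.2, "asleep": 0.8},  # placeholder values, NOT from the text
}

# Emission probabilities: P(observation | state).
emission = {
    "awake": {"noise": 0.5, "quiet": 0.5},    # P(noise | awake) = 0.5 from the text
    "asleep": {"noise": 0.1, "quiet": 0.9},   # placeholder values, NOT from the text
}


def rows_sum_to_one(table, tol=1e-9):
    """Check that every row is a valid probability distribution."""
    return all(abs(sum(row.values()) - 1.0) < tol for row in table.values())


print(rows_sum_to_one(transition), rows_sum_to_one(emission))
```

The sanity check matters because each row describes everything that can happen from one state, so its probabilities have to add up to 1.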

Before actually trying to solve the problem at hand using HMMs, let’s relate this model to the task of Part of Speech Tagging.


HMMs for Part of Speech Tagging

We know that to model any problem using a Hidden Markov Model we need a set of observations and a set of possible states. The states in an HMM are hidden.

In the part of speech tagging problem, the observations are the words themselves in the given sequence.

As for the states, which are hidden, these would be the POS tags for the words.

The transition probabilities would be something like P(VP | NP), that is, the probability of the current word having a tag of Verb Phrase given that the previous tag was a Noun Phrase.

Emission probabilities would be P(john | NP) or P(will | VP), that is, the probability that the word is, say, John given that the tag is a Noun Phrase.
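
To see how the two kinds of probabilities fit together, here is a hedged sketch that scores one candidate tag sequence for a two-word sentence. All the numbers are made up, and the start marker "<s>" and the helper joint_probability are assumptions of this example, not something defined in the article.

```python
# All probabilities below are made up for illustration.
transition = {("<s>", "NP"): 0.7, ("NP", "VP"): 0.6}      # P(tag | previous tag)
emission = {("john", "NP"): 0.01, ("will", "VP"): 0.05}   # P(word | tag)


def joint_probability(words, tags):
    """Multiply P(tag_i | tag_{i-1}) * P(word_i | tag_i) over the sentence."""
    prob = 1.0
    prev = "<s>"  # hypothetical start-of-sentence marker
    for word, tag in zip(words, tags):
        # Unseen pairs get probability 0 in this toy version.
        prob *= transition.get((prev, tag), 0.0) * emission.get((word, tag), 0.0)
        prev = tag
    return prob


print(joint_probability(["john", "will"], ["NP", "VP"]))
print(joint_probability(["john", "will"], ["VP", "NP"]))
```

Tagging a sentence then amounts to finding the tag sequence with the highest such score.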

Note that this is just an informal modeling of the problem to provide a very basic understanding of how the Part of Speech tagging problem can be modeled using an HMM.


How do we solve this?

Coming back to our problem of taking care of Peter.


Irritated, are we?

Our problem here was that we have an initial state: Peter was awake when you tucked him into bed. After that, you recorded a sequence of observations, namely noise or quiet, at different time-steps. Using this set of observations and the initial state, you want to find out whether Peter would be awake or asleep after, say, N time steps.

We draw all possible transitions starting from the initial state. There’s an exponential number of branches that come out as we keep moving forward. So the model grows exponentially after a few time steps, even without considering any observations. Have a look at the model expanding exponentially below.
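
A quick way to convince yourself of this growth is to enumerate the paths directly. The sketch below assumes just the two states awake and asleep and a known awake start; the helper paths_from_awake is hypothetical.

```python
from itertools import product

states = ["awake", "asleep"]


def paths_from_awake(n_steps):
    """All state sequences of n_steps that can follow the initial 'awake' state."""
    return [("awake",) + path for path in product(states, repeat=n_steps)]


# The number of paths doubles with every extra time step.
for n in range(1, 6):
    print(n, len(paths_from_awake(n)))
```

With two states there are 2^N paths after N steps, so checking them one by one stops being practical very quickly.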

If we had a sequence of states, we could calculate the probability of the sequence. But we don’t have the states. All we have is a sequence of observations. This is why this model is referred to as the Hidden Markov Model: the actual states over time are hidden.
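
For comparison, this is roughly what the easy case looks like when the states are known. The awake row uses the probabilities from the text; the asleep row is a placeholder assumed for illustration, and sequence_probability is a hypothetical helper.

```python
transition = {
    "awake": {"awake": 0.6, "asleep": 0.4},   # from the text
    "asleep": {"awake": 0.2, "asleep": 0.8},  # placeholder values, NOT from the text
}


def sequence_probability(path):
    """Probability of a state path whose first state is the known starting state."""
    prob = 1.0
    for prev, cur in zip(path, path[1:]):
        prob *= transition[prev][cur]
    return prob


# Peter starts awake, stays awake for one step, then falls asleep: 0.6 * 0.4.
print(sequence_probability(["awake", "awake", "asleep"]))
```

In the hidden case we cannot pick the path ourselves, because the noises are all we get to see.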

So, caretaker, if you’ve come this far it means that you have at least a fairly good understanding of how the problem is to be structured. All that is left now is to use some algorithm / technique to actually solve the problem. For now, Congratulations on Leveling up!

In the next article of this two-part series, we will see how we can use a well defined algorithm known as the Viterbi Algorithm to decode the given sequence of observations given the model. See you there!

Translated from: https://www.freecodecamp.org/news/an-introduction-to-part-of-speech-tagging-and-the-hidden-markov-model-953d45338f24/

