

Lexical Simplification (LS) is the task of replacing complex words with simpler alternatives. It can help various groups of people, such as children and non-native speakers, to better understand a given text.

Introduction

In this article I'll show how to build a lexical simplifier using NLTK, BERT and Python. The main ideas are taken from this paper.


This project is addressed in three steps:

  1. Identify complex words in a given sentence: train a model that detects possible complex words. This task is called complex word identification (CWI).

  2. Generate candidates: use BERT's masked language model to get possible replacement words.

  3. Select the best candidate based on Zipf values: compute the Zipf value of each candidate and select the simplest one.

The diagram of the work is as follows:


1. Identify complex words

The first goal is to select the words in a given sentence that should be simplified. For that, we need to train a model that can detect complex words in sentences. To train this model we are going to use the labeled dataset you can find on this page, with a sequential architecture based on a BiLSTM, which provides contextual information from both the left and the right of a target word.

After loading the dataset you'll end up with the following length:



And the structure:



You can see that the dataset has sentences with binary-labeled words (1 if the word is considered complex). We will use this to train the CWI model.


As usual in NLP, we need to apply some preprocessing to the text in order to feed the data into a deep learning model:


  1. Clean the text: delete non-alphanumeric characters, lowercase all words, etc.

  2. Create the vocabulary.

  3. Get the embedding vectors (for this article we are using GloVe).
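
The three steps above can be sketched in a few lines of plain Python. This is only an illustration, not the notebook's code: the function names, the `<pad>`/`<unk>` ids and the tiny `toy_vectors` dictionary (standing in for real GloVe vectors, which would be read from a GloVe text file) are all assumptions.

```python
import re

def clean_and_tokenize(sentence):
    """Lowercase the sentence, replace non-alphanumeric characters with spaces, split on whitespace."""
    return re.sub(r"[^a-z0-9\s]", " ", sentence.lower()).split()

def build_vocabulary(tokenized_sentences):
    """Map each word to an integer id; 0 is reserved for padding and 1 for unknown words."""
    vocab = {"<pad>": 0, "<unk>": 1}
    for tokens in tokenized_sentences:
        for token in tokens:
            vocab.setdefault(token, len(vocab))
    return vocab

def build_embedding_matrix(vocab, vectors, dim):
    """One row per vocabulary id; words without a pretrained vector keep an all-zero row."""
    matrix = [[0.0] * dim for _ in range(len(vocab))]
    for word, idx in vocab.items():
        if word in vectors:
            matrix[idx] = list(vectors[word])
    return matrix

# Hypothetical 3-dimensional vectors standing in for 300-dimensional GloVe vectors
toy_vectors = {"cat": [0.1, 0.2, 0.3], "sat": [0.4, 0.5, 0.6]}

sentences = ["The cat sat!", "A cat, again."]
tokens = [clean_and_tokenize(s) for s in sentences]
vocab = build_vocabulary(tokens)
embedding_matrix = build_embedding_matrix(vocab, toy_vectors, dim=3)
```

With the real data, the same procedure would yield one row per vocabulary word and one column per embedding dimension.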


After preprocessing the dataset, we got the following:



i.e. a vocabulary with 52K words, and an embedding matrix of shape 52K x 300

Create the model for CWI

Next, we create the following model using Keras:



It is a simple BiLSTM model. We chose it for this post just to give you an idea of how to approach the problem. This model is used only to identify possible complex words in a given sentence.
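
A BiLSTM tagger of this kind might look like the following sketch. It is not the exact model from the notebook: the number of LSTM units, the frozen embeddings, the use of id 0 for padding and the function name `build_cwi_model` are assumptions. The two-way softmax per token matches the way predictions are later decoded with an argmax over the last axis.

```python
def build_cwi_model(vocab_size, embedding_matrix, embedding_dim=300, lstm_units=64):
    """Token-level tagger: embedding -> BiLSTM -> per-token softmax over {simple, complex}."""
    # Imported inside the function so the sketch can be loaded without TensorFlow installed
    from tensorflow import keras

    model = keras.Sequential([
        keras.layers.Embedding(
            input_dim=vocab_size,
            output_dim=embedding_dim,
            embeddings_initializer=keras.initializers.Constant(embedding_matrix),
            trainable=False,  # keep the pretrained vectors frozen (an assumption)
            mask_zero=True,   # id 0 is assumed to be the padding token
        ),
        # return_sequences=True keeps one output per token instead of one per sentence
        keras.layers.Bidirectional(keras.layers.LSTM(lstm_units, return_sequences=True)),
        keras.layers.Dense(2, activation="softmax"),
    ])
    model.compile(
        optimizer="adam",
        loss="sparse_categorical_crossentropy",
        metrics=["accuracy"],
    )
    return model
```

Training would then be a call to `model.fit` on the padded id sequences and their 0/1 label sequences.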

And then train it:


After training the CWI model, we will use BERT to generate candidates for the words that the CWI model identified as complex. The candidates are based not only on synonyms of the word but also on its context.


2. Generate candidates using BERT

BERT is a powerful pretrained model from Google. However, this article is not about how BERT works; if you need some basic knowledge about it, please go to this blog.

We are using one of BERT's training tasks, masked language modeling (MLM), which predicts missing tokens in a sequence given their left and right context.

For example, given the following sentence:


*  “the cat sat on the table”

The sentence is modified (masked) before being fed into BERT:


*  “[CLS] the cat [MASK] on the table [SEP]”

As we said, one of the tasks BERT was trained for is predicting the [MASK] word, so if we feed this input into a BERT model, it will output the most probable word given the context, for example:



We mask the complex words of each sentence and get the probability distribution over the vocabulary for the masked word.


As the paper suggests, we'll concatenate the original sequence and the sequence in which the complex word is replaced with the mask token as a sentence pair, and feed the sentence pair into BERT to obtain the probability distribution over the vocabulary for the masked word.

By using the sentence pair approach, we consider not only the complex word itself but also the context in which it appears.


Consider the sentence:


*  “an attempt to thwart them”

The complex word in this sentence is “thwart”. To get the simplest replacement candidates we will feed it into BERT in this way:


As we see, we feed the sentence pair into BERT to obtain the masked word. In this case BERT gives us “stop”, which is an appropriate, simpler replacement.

To do that, we'll use PyTorch and the BertForMaskedLM class from the very useful Hugging Face transformers library. With that, using BERT's masked language modeling is as simple as:
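
A minimal sketch of the sentence pair approach with `BertForMaskedLM`, under a few assumptions: the `bert-base-uncased` checkpoint, whitespace tokenization to locate the complex word, and the helper names `build_sentence_pair` and `bert_candidates` are illustrative choices, not necessarily what the notebook does.

```python
def build_sentence_pair(sentence, complex_word, mask_token="[MASK]"):
    """Return (original, masked); the masked copy hides the first occurrence of complex_word."""
    words = sentence.split()
    position = words.index(complex_word)  # raises ValueError if the word is absent
    masked = words[:position] + [mask_token] + words[position + 1:]
    return sentence, " ".join(masked)

def bert_candidates(sentence, complex_word, top_k=10, model_name="bert-base-uncased"):
    """Return the top_k tokens BERT predicts for the masked position."""
    # Heavy dependencies are imported lazily; the checkpoint is downloaded on first use
    import torch
    from transformers import BertForMaskedLM, BertTokenizer

    tokenizer = BertTokenizer.from_pretrained(model_name)
    model = BertForMaskedLM.from_pretrained(model_name)
    model.eval()

    original, masked = build_sentence_pair(sentence, complex_word, tokenizer.mask_token)
    # Passing two texts makes the tokenizer build "[CLS] original [SEP] masked [SEP]"
    inputs = tokenizer(original, masked, return_tensors="pt")
    is_mask = inputs["input_ids"][0] == tokenizer.mask_token_id
    mask_index = is_mask.nonzero(as_tuple=True)[0][0]

    with torch.no_grad():
        logits = model(**inputs).logits
    # Highest-scoring vocabulary entries at the masked position
    top_ids = torch.topk(logits[0, mask_index], k=top_k).indices.tolist()
    return tokenizer.convert_ids_to_tokens(top_ids)
```

Calling `bert_candidates("an attempt to thwart them", "thwart")` returns a list of ten tokens. Some of them may be punctuation or word pieces, which is why candidates are filtered before ranking.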


3. Select the best candidates based on Zipf values

We adopt the Zipf frequency of a word, which is the base-10 logarithm of the number of times it appears per billion words, to rank the replacement candidates. The greater the value, the more common or familiar the word is to a person.
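
To make the definition concrete, here is a small sketch that computes a Zipf value from raw counts. The function name and the counts are hypothetical and only illustrate the scale; in practice the values come from a frequency list rather than from counting by hand.

```python
import math

def zipf_from_counts(word_count, corpus_size):
    """Zipf frequency: base-10 logarithm of the number of occurrences per billion words."""
    per_billion = word_count * 1e9 / corpus_size
    return math.log10(per_billion)

# Hypothetical counts in a one-billion-word corpus (illustrative, not measured)
common = zipf_from_counts(100_000, 1_000_000_000)  # 100,000 occurrences per billion
rare = zipf_from_counts(1_000, 1_000_000_000)      # 1,000 occurrences per billion
print(common, rare)
```

A word that is 100 times more frequent than another sits two points higher on the Zipf scale, so small differences in the score correspond to large differences in how often a word is seen.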

To give you an idea, let's get the Zipf value for the simple word “stop”:


Now a more complex/uncommon word, “thwart”:


You can see that the word “stop” is more common than “thwart”, so the chances that people are more familiar with the word “stop” are higher.


Now you get the idea: for each candidate generated by BERT we compute the Zipf score using the Python package wordfreq.


In summary, for our lexical simplifier, we:

  1. Get 10 context-aware candidate words from BERT.

  2. Compute their Zipf values.

  3. Sort the candidates by Zipf value, highest first.

  4. Replace each complex word with its simpler counterpart.


The code snippet for these steps:


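import re
import numpy as np
from wordfreq import zipf_frequency

# model_cwi, process_input, complete_missing_word and get_bert_candidates
# come from the complete notebook.
for input_text in list_texts:
  new_text = input_text
  # Run the CWI model and map its per-token predictions back onto the words
  input_padded, index_list, len_list = process_input(input_text)
  pred_cwi = model_cwi.predict(input_padded)
  pred_cwi_binary = np.argmax(pred_cwi, axis=2)
  complete_cwi_predictions = complete_missing_word(pred_cwi_binary, index_list, len_list)
  bert_candidates = get_bert_candidates(input_text, complete_cwi_predictions)
  for word_to_replace, l_candidates in bert_candidates:
    # Keep purely alphabetic candidates and rank them from most to least frequent
    tuples_word_zipf = [(w, zipf_frequency(w, 'en')) for w in l_candidates if w.isalpha()]
    tuples_word_zipf.sort(key=lambda x: x[1], reverse=True)
    if not tuples_word_zipf:
      continue  # no usable candidate: leave the word unchanged
    # Escape the word and match whole words only, so substrings are not replaced
    pattern = r'\b' + re.escape(word_to_replace) + r'\b'
    new_text = re.sub(pattern, tuples_word_zipf[0][0], new_text)
  print("Original text: ", input_text)
  print("Simplified text:", new_text, "\n")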

And finally some results:



For example the first original sentence:


“The Risk That Students Could Arrive at School With the Coronavirus As schools grapple with how to reopen, new estimates show that large parts of the country would probably see infected students if classrooms opened now.”


And the simplified one, where the replaced words are “disease”, “deal”, “open”, “numbers”, “maybe” and “they”:


“The Risk That Students Could Arrive at School With the disease As schools deal with how to open, new numbers show that large parts of the country would maybe see infected students if they opened now.”


Let's see one more

The original:


“The research does not prove that infected children are contagious, but it should influence the debate about reopening schools, some experts said.”


The simplified:


“The work does not show that infected children are sick, but it should change the question about open schools, some experts said.”


Final Words

The technique for lexical simplification of sentences and documents presented in this article leverages BERT's masked language model. It focuses on the context of the complex word.

The experimental results show that this approach produces pretty good replacement candidates and makes the sentences simpler.


If you want to improve it, you can try another method or model for CWI, and use a score other than Zipf values for candidate ranking.


The complete code can be found in this Jupyter notebook, and you can browse more projects on my GitHub.


