NLP Preprocessing: A Useful and Important Step


Introduction

The GPT-3 model has, for now, become a hot topic in the natural language processing field due to its performance. It has nearly 175 billion parameters, compared to GPT-2's roughly 1.5 billion. It is a major breakthrough in the field of NLP. But the preprocessing steps required before training any model are of utmost importance. Therefore, in this article, I will explain all the major steps used in preprocessing data before training any NLP model.


First, I will list the preprocessing steps, and then explain each of them in detail:


  1. Removing HTML tags
  2. Removing stop-words
  3. Removing extra spaces
  4. Converting numbers to their textual representations
  5. Lowercasing the text
  6. Tokenization
  7. Stemming
  8. Lemmatization
  9. Spell-checking

Now let’s go through them one by one.


Removing HTML tags

If the data has been web-scraped from the internet, the text may contain HTML tags mixed in with the normal text. Since these tags are of no use, they can be removed with python’s BeautifulSoup library, or with a regex. The code is explained below:

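Here is a minimal sketch of both approaches (the `sample_html` string and the choice of the built-in `html.parser` are just illustrative):

```python
import re
from bs4 import BeautifulSoup

sample_html = "<p>This is <b>web-scraped</b> text.</p>"

# BeautifulSoup parses the markup, and get_text() keeps only the text nodes
clean_text = BeautifulSoup(sample_html, "html.parser").get_text()

# A regex alternative: strip anything that looks like a tag
clean_text_regex = re.sub(r"<[^>]+>", "", sample_html)

print(clean_text)        # This is web-scraped text.
print(clean_text_regex)  # This is web-scraped text.
```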

Removing stop-words

The data often contains a large number of stop-words: common words such as “the”, “is”, and “an” that make no significant impact on the meaning of the data. They can be removed using the nltk or spacy library. The code is shown below:

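A minimal sketch using NLTK (the sample sentence is my own; spacy offers an equivalent stop-word list):

```python
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

nltk.download("stopwords")
nltk.download("punkt")

sentence = "This is an example showing how the stop-words are removed"

# Keep only the tokens that are not in NLTK's English stop-word list
stop_words = set(stopwords.words("english"))
filtered = [w for w in word_tokenize(sentence) if w.lower() not in stop_words]

print(filtered)  # ['example', 'showing', 'stop-words', 'removed']
```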

Removing extra spaces

There might be situations where the data contains extra spaces within the sentences. These can be easily removed with python’s split() and join() functions.

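A one-line sketch (the sample string is illustrative):

```python
text = "This   sentence  contains    extra spaces"

# split() with no argument splits on any run of whitespace,
# and join() stitches the pieces back together with single spaces
clean = " ".join(text.split())

print(clean)  # This sentence contains extra spaces
```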

Converting numbers to their textual representations

Converting numbers to their textual form can also be very useful in NLP preprocessing. For this purpose, the num2words library can be used.

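A small sketch (the sample sentence and the digit-matching regex are my own choices):

```python
import re
from num2words import num2words

text = "The dataset has 3 splits with 175 examples each"

# Replace every run of digits with its textual (English) form
converted = re.sub(r"\d+", lambda m: num2words(int(m.group())), text)

print(converted)
# The dataset has three splits with one hundred and seventy-five examples each
```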

Lowercasing the text

Converting all the words in the data to lowercase is good practice for removing redundancies. The same word may appear more than once in the text, once in lowercase form and once in uppercase form, and would otherwise be treated as two different tokens.

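This one needs nothing beyond Python’s built-in str.lower() (the sample string is illustrative):

```python
text = "Apple and apple would otherwise count as two different tokens"

# lower() maps every character to lowercase, collapsing the two variants
print(text.lower())  # apple and apple would otherwise count as two different tokens
```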

Tokenization

Tokenization involves converting the sentences into tokens; by tokens, I mean splitting the sentences into words. It is also useful to separate punctuation from the words because the model’s embedding layer may well have no embedding for a word fused with punctuation. For example, ‘thanks.’ is a word with a full stop attached; tokenization will split it into [‘thanks’, ‘.’]. The code for doing this with NLTK’s word_tokenize is shown below:

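A minimal sketch (the sample sentence is illustrative):

```python
import nltk
from nltk.tokenize import word_tokenize

nltk.download("punkt")

sentence = "Thanks. Tokenization separates punctuation from words."

tokens = word_tokenize(sentence)

# Punctuation marks become their own tokens, separate from the words
print(tokens)
# ['Thanks', '.', 'Tokenization', 'separates', 'punctuation', 'from', 'words', '.']
```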

Stemming

Stemming is the process of reducing any word in the data to its root form. For example, ‘sitting’ will be converted to ‘sit’, ‘thinking’ will be converted to ‘think’, etc. NLTK’s PorterStemmer can be used for this purpose.

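A minimal sketch with NLTK’s PorterStemmer (the word list is my own):

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

for word in ["sitting", "thinking", "studies", "running"]:
    # The Porter algorithm strips common suffixes to approximate the root
    print(word, "->", stemmer.stem(word))

# sitting -> sit
# thinking -> think
# studies -> studi
# running -> run
```

Note that ‘studies’ becomes ‘studi’, which is not a dictionary word; this is exactly the gap that lemmatization, discussed next, addresses.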

Lemmatization

Many people consider lemmatization similar to stemming, but they are actually different. Lemmatization performs a morphological analysis of words that stemming does not, returning a valid dictionary word (the lemma) rather than a chopped-off stem. NLTK has an implementation of lemmatization (WordNetLemmatizer) which can be used.

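A minimal sketch with NLTK’s WordNetLemmatizer (the words and part-of-speech tags are my own examples):

```python
import nltk
from nltk.stem import WordNetLemmatizer

nltk.download("wordnet")

lemmatizer = WordNetLemmatizer()

# Passing the part of speech (pos="v" for verb, "a" for adjective)
# guides the morphological analysis toward the right lemma
print(lemmatizer.lemmatize("studies", pos="v"))   # study
print(lemmatizer.lemmatize("thinking", pos="v"))  # think
print(lemmatizer.lemmatize("better", pos="a"))    # good
```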

Spell-checking

The data being used may well contain spelling mistakes, so spell-checking becomes an important step in NLP preprocessing. I will be using the TextBlob library for this purpose.

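A minimal sketch with TextBlob’s correct() method (the misspelled sentence is a commonly used example from TextBlob’s documentation):

```python
from textblob import TextBlob

# correct() picks the most likely spelling for each word
# based on word frequencies
blob = TextBlob("I havv goood speling!")

print(blob.correct())  # I have good spelling!
```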

Although the above spell-checker may not be perfect, it will still be of good use.


Each of the methods depicted above is just one possible technique for that step; other methods are available as well.


I have also created a GitHub repo collecting all the above methods in one file. You can check it out via the link below:


That’s all from my side this time. Keep reading. If you want to read more, you can check out my other stories below:


Translated from: https://medium.com/analytics-vidhya/nlp-preprocessing-a-useful-and-important-step-e79895c65a89

