Original article: https://towardsdatascience.com/an-easy-introduction-to-natural-language-processing-b1e2801291c1
Understanding human language with computers.
Computers are great at working with standardized and structured data such as database tables and financial records. They can process that data far faster than we humans can. But we don't communicate in structured data, and we certainly don't speak binary! We communicate using words, a form of unstructured data.
Unfortunately, computers struggle with unstructured data because there are no standardized techniques for processing it. When we program with languages like C++, Java, or Python, we are essentially giving the computer a set of rules to follow. With unstructured data, those rules are quite abstract and challenging to define concretely.
Human vs. computer understanding of language
Humans have been writing things down for thousands of years. Over that time, our brains have gained a huge amount of experience in understanding natural language. When we read something written on paper or in a blog post online, we understand what it really means in the real world. We feel the emotions the text conveys, and we often picture how the thing it describes would look in real life.
Natural Language Processing (NLP) is a subfield of Artificial Intelligence that focuses on enabling computers to understand and process human language, bringing them closer to a human-level understanding of language. Computers don't yet have the same intuition for natural language that humans do: they can't really grasp what a piece of text is actually trying to say. In short, computers can't read between the lines.
That being said, recent advances in Machine Learning (ML) have enabled computers to do quite a lot of useful things with natural language! Deep learning lets us write programs that perform language translation, semantic understanding, and text summarization. All of these add real-world value by letting you understand and compute over large blocks of text without the manual effort.
Let's start with a quick primer on the core concepts behind NLP. After that, we'll dive into some Python code so you can start doing NLP yourself.
The real reason NLP is hard
The ability to read and understand language is far more complex than it appears at first glance. There's a lot that goes into truly understanding what a piece of text means in the real world. For example, what do you think the following text means?
"Steph Curry was on fire last night. He totally destroyed the other team"
To a human, the meaning of this sentence is probably quite obvious. We know Steph Curry is a basketball player; even if you didn't know that, you could infer that he plays for some kind of sports team. When we see "on fire" and "destroyed", we know it means that Steph Curry played really well last night and beat the other team.
Computers tend to take things very literally. Looking at this the way a computer would, it sees "Steph Curry" and, based on the capitalization, assumes it's a person, a place, or perhaps something else important. But then it sees "was on fire"… a computer reading that literally might tell you that someone caught fire last night! Going further, the computer might conclude that Mr. Curry physically destroyed the other team… as far as this computer understands, they no longer exist…
But not all is grim! Thanks to machine learning, we can actually do some very clever things to quickly extract and understand information from natural language. Let's see how that works in a few lines of Python code.
Doing NLP with Python code
To see how an NLP pipeline works, we'll use the following text from Wikipedia:
Amazon.com, Inc., doing business as Amazon, is an American electronic commerce and cloud computing company based in Seattle, Washington, that was founded by Jeff Bezos on July 5, 1994. The tech giant is the largest Internet retailer in the world as measured by revenue and market capitalization, and second largest after Alibaba Group in terms of total sales. The amazon.com website started as an online bookstore and later diversified to sell video downloads/streaming, MP3 downloads/streaming, audiobook downloads/streaming, software, video games, electronics, apparel, furniture, food, toys, and jewelry. The company also produces consumer electronics — Kindle e-readers, Fire tablets, Fire TV, and Echo — and is the world’s largest provider of cloud infrastructure services (IaaS and PaaS). Amazon also sells certain low-end products under its in-house brand AmazonBasics.
Some dependencies
First, we'll install a few useful Python NLP libraries that will help us analyze this text.
### Installing spaCy, general Python NLP lib
pip3 install spacy
### Downloading the English dictionary model for spaCy
python3 -m spacy download en_core_web_lg
### Installing textacy, basically a useful add-on to spaCy
pip3 install textacy
Entity analysis
Now that everything is installed, we can do a quick entity analysis of our text. Entity analysis goes through your text and identifies the important words, or "entities". When we say "important", we really mean words that carry some kind of real-world semantic meaning or significance.
Check out the code below, which does all of our entity analysis for us:
# coding: utf-8
import spacy
### Load spaCy's English NLP model
nlp = spacy.load('en_core_web_lg')
### The text we want to examine
text = "Amazon.com, Inc., doing business as Amazon, is an American electronic commerce and cloud computing company based in Seattle, Washington, that was founded by Jeff Bezos on July 5, 1994. The tech giant is the largest Internet retailer in the world as measured by revenue and market capitalization, and second largest after Alibaba Group in terms of total sales. The amazon.com website started as an online bookstore and later diversified to sell video downloads/streaming, MP3 downloads/streaming, audiobook downloads/streaming, software, video games, electronics, apparel, furniture, food, toys, and jewelry. The company also produces consumer electronics - Kindle e-readers, Fire tablets, Fire TV, and Echo - and is the world's largest provider of cloud infrastructure services (IaaS and PaaS). Amazon also sells certain low-end products under its in-house brand AmazonBasics."
### Parse the text with spaCy
### Our 'document' variable now contains a parsed version of text.
document = nlp(text)
### print out all the named entities that were detected
for entity in document.ents:
    print(entity.text, entity.label_)
We first load spaCy's pre-trained ML model and set up the text we want to process. We then run the model on the text to extract the entities. When you run this code you'll get the following output:
Amazon.com, Inc. ORG
Amazon ORG
American NORP
Seattle GPE
Washington GPE
Jeff Bezos PERSON
July 5, 1994 DATE
second ORDINAL
Alibaba Group ORG
amazon.com ORG
Fire TV ORG
Echo - LOC
PaaS ORG
Amazon ORG
AmazonBasics ORG
The three-letter codes next to each piece of text are labels that indicate what kind of entity we're looking at. It looks like our model did a pretty good job! Jeff Bezos is a person, the date is labeled correctly, Amazon is an organization, and Seattle and Washington are geopolitical entities (GPE). The only tricky bits are that things like Fire TV and Echo are actually products, not organizations, and that it missed the list of things Amazon sells: "video downloads/streaming, MP3 downloads/streaming, audiobook downloads/streaming, software, video games, electronics, apparel, furniture, food, toys, and jewelry", probably because it's a long, lowercase list that looks fairly unimportant.
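If you're not sure what a label code like NORP or GPE stands for, spaCy ships with a small spacy.explain() helper that returns a short description for each label. Here is a minimal sketch (the exact description strings may vary slightly between spaCy versions, but GPE comes back as something like "Countries, cities, states"):
# coding: utf-8
import spacy
### Print a human-readable description for each entity label code we saw above
for label in ["ORG", "NORP", "GPE", "PERSON", "DATE", "ORDINAL"]:
    print(label, "-", spacy.explain(label))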
Overall, our model has done what we wanted. Imagine you had a huge document, hundreds of pages long: this kind of NLP model lets you quickly get an idea of what the document is about and what the key entities in it are.
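As a rough sketch of that idea (our own addition, not part of the original example), you could count how often each entity appears and look at the most frequent ones. This reuses the document variable we parsed above; collections.Counter comes from Python's standard library:
from collections import Counter
### Count how often each (entity text, label) pair occurs in the parsed document
entity_counts = Counter((entity.text, entity.label_) for entity in document.ents)
### Show the ten most frequent entities - a quick overview of what the text is about
for (entity_text, label), frequency in entity_counts.most_common(10):
    print(entity_text, label, frequency)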
Operating on entities
Let's try something a bit more applied. Say you have the same block of text as above, but for privacy reasons you want to automatically remove the names of all people and organizations. With spaCy we can write a handy little scrub function that filters out any entities we don't want to keep. It looks like this:
# coding: utf-8
import spacy
### Load spaCy's English NLP model
nlp = spacy.load('en_core_web_lg')
### The text we want to examine
text = "Amazon.com, Inc., doing business as Amazon, is an American electronic commerce and cloud computing company based in Seattle, Washington, that was founded by Jeff Bezos on July 5, 1994. The tech giant is the largest Internet retailer in the world as measured by revenue and market capitalization, and second largest after Alibaba Group in terms of total sales. The amazon.com website started as an online bookstore and later diversified to sell video downloads/streaming, MP3 downloads/streaming, audiobook downloads/streaming, software, video games, electronics, apparel, furniture, food, toys, and jewelry. The company also produces consumer electronics - Kindle e-readers, Fire tablets, Fire TV, and Echo - and is the world's largest provider of cloud infrastructure services (IaaS and PaaS). Amazon also sells certain low-end products under its in-house brand AmazonBasics."
### Replace a specific entity with the word "PRIVATE"
def replace_entity_with_placeholder(token):
    if token.ent_iob != 0 and (token.ent_type_ == "PERSON" or token.ent_type_ == "ORG"):
        return "[PRIVATE] "
    else:
        return token.string
### Loop through all the entities in a piece of text and apply entity replacement
def scrub(text):
    doc = nlp(text)
    for ent in doc.ents:
        ent.merge()
    tokens = map(replace_entity_with_placeholder, doc)
    return "".join(tokens)
print(scrub(text))
And here's the result:
[PRIVATE] , doing business as [PRIVATE] , is an American electronic commerce and cloud computing company based in Seattle, Washington, that was founded by [PRIVATE] on July 5, 1994. The tech giant is the largest Internet retailer in the world as measured by revenue and market capitalization, and second largest after [PRIVATE] in terms of total sales. The [PRIVATE] website started as an online bookstore and later diversified to sell video downloads/streaming, MP3 downloads/streaming, audiobook downloads/streaming, software, video games, electronics, apparel, furniture, food, toys, and jewelry. The company also produces consumer electronics - Kindle e-readers, Fire tablets, [PRIVATE] , and Echo - and is the world's largest provider of cloud infrastructure services (IaaS and [PRIVATE] ). [PRIVATE] also sells certain low-end products under its in-house brand [PRIVATE] .
Looks great! This is a genuinely powerful technique. People constantly use ctrl + f to find and replace text in their documents, but with NLP we can find and replace entities while actually taking their semantics into account, rather than just matching characters.
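For example, if you also wanted to hide place names, you could apply the same idea to GPE entities. The snippet below is a small variation of our own on the scrub code above, not part of the original example: it assumes nlp, scrub, and text are already defined as in the previous block, and the set of labels to hide is simply our own choice.
### Entity labels we want to hide - PERSON/ORG/GPE is just one possible choice
LABELS_TO_HIDE = {"PERSON", "ORG", "GPE"}
### Same idea as before, but driven by the LABELS_TO_HIDE set
def replace_entity_with_placeholder(token):
    if token.ent_iob != 0 and token.ent_type_ in LABELS_TO_HIDE:
        return "[PRIVATE] "
    else:
        return token.string
### scrub() looks this function up by name, so redefining it above is enough
print(scrub(text))
With GPE included, entities like Seattle and Washington from the output above would be scrubbed as well.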
Extracting information from text
The textacy library we installed earlier implements several common NLP information-extraction algorithms on top of spaCy. It lets us go a step further than basic entity extraction.
One of those algorithms is semi-structured statement extraction. This algorithm essentially builds on the information that spaCy's NLP model is able to extract, which lets us pull out more specific information about particular entities. In short, we can extract the "facts" stated about an entity of our choice.
Let's look at the code. For this example, we'll take the full summary from the Washington, D.C. Wikipedia page and extract the facts it contains.
# coding: utf-8
import spacy
import textacy.extract
### Load spaCy's English NLP model
nlp = spacy.load('en_core_web_lg')
### The text we want to examine
text = """Washington, D.C., formally the District of Columbia and commonly referred to as Washington or D.C., is the capital of the United States of America.[4] Founded after the American Revolution as the seat of government of the newly independent country, Washington was named after George Washington, first President of the United States and Founding Father.[5] Washington is the principal city of the Washington metropolitan area, which has a population of 6,131,977.[6] As the seat of the United States federal government and several international organizations, the city is an important world political capital.[7] Washington is one of the most visited cities in the world, with more than 20 million annual tourists.[8][9]
The signing of the Residence Act on July 16, 1790, approved the creation of a capital district located along the Potomac River on the country's East Coast. The U.S. Constitution provided for a federal district under the exclusive jurisdiction of the Congress and the District is therefore not a part of any state. The states of Maryland and Virginia each donated land to form the federal district, which included the pre-existing settlements of Georgetown and Alexandria. Named in honor of President George Washington, the City of Washington was founded in 1791 to serve as the new national capital. In 1846, Congress returned the land originally ceded by Virginia; in 1871, it created a single municipal government for the remaining portion of the District.
Washington had an estimated population of 693,972 as of July 2017, making it the 20th largest American city by population. Commuters from the surrounding Maryland and Virginia suburbs raise the city's daytime population to more than one million during the workweek. The Washington metropolitan area, of which the District is the principal city, has a population of over 6 million, the sixth-largest metropolitan statistical area in the country.
All three branches of the U.S. federal government are centered in the District: U.S. Congress (legislative), President (executive), and the U.S. Supreme Court (judicial). Washington is home to many national monuments and museums, which are primarily situated on or around the National Mall. The city hosts 177 foreign embassies as well as the headquarters of many international organizations, trade unions, non-profit, lobbying groups, and professional associations, including the Organization of American States, AARP, the National Geographic Society, the Human Rights Campaign, the International Finance Corporation, and the American Red Cross.
A locally elected mayor and a 13‑member council have governed the District since 1973. However, Congress maintains supreme authority over the city and may overturn local laws. D.C. residents elect a non-voting, at-large congressional delegate to the House of Representatives, but the District has no representation in the Senate. The District receives three electoral votes in presidential elections as permitted by the Twenty-third Amendment to the United States Constitution, ratified in 1961."""
### Parse the text with spaCy
### Our 'document' variable now contains a parsed version of text.
document = nlp(text)
### Extracting semi-structured statements
statements = textacy.extract.semistructured_statements(document, "Washington")
print("**** Information from Washington's Wikipedia page ****")
count = 1
for statement in statements:
    subject, verb, fact = statement
    print(str(count) + " - Statement: ", statement)
    print(str(count) + " - Fact: ", fact)
    count += 1
And here's what it prints:
**** Information from Washington's Wikipedia page ****
1 - Statement: (Washington, is, the capital of the United States of America.[4)
1 - Fact: the capital of the United States of America.[4
2 - Statement: (Washington, is, the principal city of the Washington metropolitan area, which has a population of 6,131,977.[6)
2 - Fact: the principal city of the Washington metropolitan area, which has a population of 6,131,977.[6
3 - Statement: (Washington, is, home to many national monuments and museums, which are primarily situated on or around the National Mall)
3 - Fact: home to many national monuments and museums, which are primarily situated on or around the National Mall
Our NLP model found three useful facts about Washington, D.C.:
Washington is the capital of the United States of America
Washington is the principal city of the Washington metropolitan area
Washington is home to many national monuments and museums
And the best part is that these are all the most important pieces of information in that block of text!
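If you want to experiment further, you can point the same statement-extraction call at a different entity. The sketch below is our own variation, not part of the original article: it reuses the opening sentence of the Amazon text from earlier and asks for statements about "Amazon", using the same semistructured_statements call as above (the exact signature may differ in newer textacy versions, and what it finds depends on how the sentences are phrased).
# coding: utf-8
import spacy
import textacy.extract
### Load spaCy's English NLP model
nlp = spacy.load('en_core_web_lg')
### The first sentence of the Amazon text from the entity-analysis example
text = "Amazon.com, Inc., doing business as Amazon, is an American electronic commerce and cloud computing company based in Seattle, Washington, that was founded by Jeff Bezos on July 5, 1994."
document = nlp(text)
### Ask for "Amazon is ..." style statements, if the model finds any
statements = textacy.extract.semistructured_statements(document, "Amazon")
for subject, verb, fact in statements:
    print(subject, "|", verb, "|", fact)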
Going deeper with NLP
That wraps up our easy introduction to NLP! We've learned a lot, but it's still only a small taste of what NLP can do…
There are many more great applications built on NLP, such as language translation, chatbots, and more specialized, complex analysis of text documents. Most of this is done today with deep learning, in particular Recurrent Neural Networks (RNNs) and Long Short-Term Memory (LSTM) networks.
If you'd like to play around with more NLP yourself, the spaCy docs and textacy docs are great places to start! You'll find plenty of examples of how to parse text and extract useful information from it. Everything in spaCy is fast and easy to use, and you can get a lot of value out of it. Once you've got that down, it's time to do bigger and better things with deep learning!