使用NLTK将自然语言查询转化为SQL查询

文章目录


假设我们有一个sql表,如下所示:

City Country Population
athens greece 1368
bangkok thailand 1178
barcelona spain 1280
berlin east_germany 3481
birmingham united_kingdom 1112

当我们要查找某个athens在哪个国家时,可以使用如下的sql语句:

SELECT Country FROM city_table WHERE City = 'athens'

如果用户使用了自然语言进行查询

What cities are located in China

我们可以通过句法分析,将该自然语言转化为对应的sql语句。
首先,我们需要编写符合乔姆斯基范式的上下文无关文法:

% start S
S[SEM=(?np + WHERE + ?vp)] -> NP[SEM=?np] VP[SEM=?vp]
VP[SEM=(?v + ?pp)] -> IV[SEM=?v] PP[SEM=?pp]
VP[SEM=(?v + ?ap)] -> IV[SEM=?v] AP[SEM=?ap]
NP[SEM=(?det + ?n)] -> Det[SEM=?det] N[SEM=?n]
PP[SEM=(?p + ?np)] -> P[SEM=?p] NP[SEM=?np]
AP[SEM=?pp] -> A[SEM=?a] PP[SEM=?pp]
NP[SEM='Country="greece"'] -> 'Greece'
NP[SEM='Country="china"'] -> 'China'
Det[SEM='SELECT'] -> 'Which' | 'What'
N[SEM='City FROM city_table'] -> 'cities'
IV[SEM=''] -> 'are'
A[SEM=''] -> 'located'
P[SEM=''] -> 'in'

我们将该文法保存在文件sql0.fcfg文件中,然后使用NLTK来根据这个文法规则来解析用户输入的自然语言。

from nltk import load_parser
cfg_path='grammars/book_grammars/sql0.fcfg'
cp=load_parser(cfg_path,trace=3)
query = 'What cities are located in China'
trees = list(cp.parse(query.split()))
answer = trees[0].label()['SEM']
answer = [s for s in answer if s]
q = ' '.join(answer)

输出如下:

SELECT City FROM city_table WHERE Country="china"

因为在load_parser指定了参数trace=3,语法树的构建过程也会被打印出来

Leaf Init Rule:
|[-] . . . . .| [0:1] 'What'
|. [-] . . . .| [1:2] 'cities'
|. . [-] . . .| [2:3] 'are'
|. . . [-] . .| [3:4] 'located'
|. . . . [-] .| [4:5] 'in'
|. . . . . [-]| [5:6] 'China'
Feature Bottom Up Predict Combine Rule:
|[-] . . . . .| [0:1] Det[SEM='SELECT'] -> 'What' *
Feature Bottom Up Predict Combine Rule:
|[-> . . . . .| [0:1] NP[SEM=(?det+?n)] -> Det[SEM=?det] * N[SEM=?n] {?det: 'SELECT'}
Feature Bottom Up Predict Combine Rule:
|. [-] . . . .| [1:2] N[SEM='City FROM city_table'] -> 'cities' *
Feature Single Edge Fundamental Rule:
|[---] . . . .| [0:2] NP[SEM=(SELECT, City FROM city_table)] -> Det[SEM='SELECT'] N[SEM='City FROM city_table'] *
Feature Bottom Up Predict Combine Rule:
|[---> . . . .| [0:2] S[SEM=(?np+WHERE+?vp)] -> NP[SEM=?np] * VP[SEM=?vp] {?np: (SELECT, City FROM city_table)}
Feature Bottom Up Predict Combine Rule:
|. . [-] . . .| [2:3] IV[SEM=''] -> 'are' *
Feature Bottom Up Predict Combine Rule:
|. . [-> . . .| [2:3] VP[SEM=(?v+?pp)] -> IV[SEM=?v] * PP[SEM=?pp] {?v: ''}
|. . [-> . . .| [2:3] VP[SEM=(?v+?ap)] -> IV[SEM=?v] * AP[SEM=?ap] {?v: ''}
Feature Bottom Up Predict Combine Rule:
|. . . [-] . .| [3:4] A[SEM=''] -> 'located' *
Feature Bottom Up Predict Combine Rule:
|. . . [-> . .| [3:4] AP[SEM=?pp] -> A[SEM=?a] * PP[SEM=?pp] {?a: ''}
Feature Bottom Up Predict Combine Rule:
|. . . . [-] .| [4:5] P[SEM=''] -> 'in' *
Feature Bottom Up Predict Combine Rule:
|. . . . [-> .| [4:5] PP[SEM=(?p+?np)] -> P[SEM=?p] * NP[SEM=?np] {?p: ''}
Feature Bottom Up Predict Combine Rule:
|. . . . . [-]| [5:6] NP[SEM='Country="china"'] -> 'China' *
Feature Bottom Up Predict Combine Rule:
|. . . . . [->| [5:6] S[SEM=(?np+WHERE+?vp)] -> NP[SEM=?np] * VP[SEM=?vp] {?np: 'Country="china"'}
Feature Single Edge Fundamental Rule:
|. . . . [---]| [4:6] PP[SEM=(, Country="china")] -> P[SEM=''] NP[SEM='Country="china"'] *
Feature Single Edge Fundamental Rule:
|. . . [-----]| [3:6] AP[SEM=(, Country="china")] -> A[SEM=''] PP[SEM=(, Country="china")] *
Feature Single Edge Fundamental Rule:
|. . [-------]| [2:6] VP[SEM=(, , Country="china")] -> IV[SEM=''] AP[SEM=(, Country="china")] *
Feature Single Edge Fundamental Rule:
|[===========]| [0:6] S[SEM=(SELECT, City FROM city_table, WHERE, , , Country="china")] -> NP[SEM=(SELECT, City FROM city_table)] VP[SEM=(, , Country="china")] *

值得注意的是,这里使用的线图分析器自底向上地构建整棵语法树,生成(i,j)位置的树结构,是通过合并(i,k)和(k,j)的结果来实现的,直到规约到句子 S S S时结束。

为了更清楚的看到这棵语法树,我们可以将它打印出来

print(trees)

输出如下(为了看的更清楚,我对输出进行了缩进处理,下载地址):

[
	Tree(
		S[
			SEM=(SELECT, City FROM city_table, WHERE, , , Country="china")
		], 
		[
			Tree(
				NP[SEM=(SELECT, City FROM city_table)], 
				[
					Tree(
						Det[SEM='SELECT'], 
						['What']
					), 
					Tree(
						N[SEM='City FROM city_table'], 
						['cities']
					)
				]
			), 
			Tree(
				VP[SEM=(, , Country="china")],
				[
					Tree(
						IV[SEM=''], 
						['are']
					), 
					Tree(
						AP[SEM=(, Country="china")], 
						[
							Tree(
								A[SEM=''], 
								['located']
							),
							Tree(
								PP[SEM=(, Country="china")], 
								[
									Tree(
										P[SEM=''], 
										['in']
									), 
									Tree(
										NP[SEM='Country="china"'], 
										['China']
									)
								]
							)
						]
					 )
				]
			)
		]
	)
 ]

完整代码可以从我的github上下载
参考文献:

[1] Natural Language Processing with Python(http://www.nltk.org/book/ch10.html)

你可能感兴趣的:(NLP,AI,IR)