City | Country | Population |
---|---|---|
athens | greece | 1368 |
bangkok | thailand | 1178 |
barcelona | spain | 1280 |
berlin | east_germany | 3481 |
birmingham | united_kingdom | 1112 |
当我们要查找某个athens在哪个国家时,可以使用如下的sql语句:
SELECT Country FROM city_table WHERE City = 'athens'
如果用户使用了自然语言进行查询
What cities are located in China
我们可以通过句法分析,将该自然语言转化为对应的sql语句。
首先,我们需要编写符合乔姆斯基范式的上下文无关文法:
% start S
S[SEM=(?np + WHERE + ?vp)] -> NP[SEM=?np] VP[SEM=?vp]
VP[SEM=(?v + ?pp)] -> IV[SEM=?v] PP[SEM=?pp]
VP[SEM=(?v + ?ap)] -> IV[SEM=?v] AP[SEM=?ap]
NP[SEM=(?det + ?n)] -> Det[SEM=?det] N[SEM=?n]
PP[SEM=(?p + ?np)] -> P[SEM=?p] NP[SEM=?np]
AP[SEM=?pp] -> A[SEM=?a] PP[SEM=?pp]
NP[SEM='Country="greece"'] -> 'Greece'
NP[SEM='Country="china"'] -> 'China'
Det[SEM='SELECT'] -> 'Which' | 'What'
N[SEM='City FROM city_table'] -> 'cities'
IV[SEM=''] -> 'are'
A[SEM=''] -> 'located'
P[SEM=''] -> 'in'
我们将该文法保存在文件sql0.fcfg文件中,然后使用NLTK来根据这个文法规则来解析用户输入的自然语言。
from nltk import load_parser
cfg_path='grammars/book_grammars/sql0.fcfg'
cp=load_parser(cfg_path,trace=3)
query = 'What cities are located in China'
trees = list(cp.parse(query.split()))
answer = trees[0].label()['SEM']
answer = [s for s in answer if s]
q = ' '.join(answer)
输出如下:
SELECT City FROM city_table WHERE Country="china"
因为在load_parser
指定了参数trace=3
,语法树的构建过程也会被打印出来
Leaf Init Rule:
|[-] . . . . .| [0:1] 'What'
|. [-] . . . .| [1:2] 'cities'
|. . [-] . . .| [2:3] 'are'
|. . . [-] . .| [3:4] 'located'
|. . . . [-] .| [4:5] 'in'
|. . . . . [-]| [5:6] 'China'
Feature Bottom Up Predict Combine Rule:
|[-] . . . . .| [0:1] Det[SEM='SELECT'] -> 'What' *
Feature Bottom Up Predict Combine Rule:
|[-> . . . . .| [0:1] NP[SEM=(?det+?n)] -> Det[SEM=?det] * N[SEM=?n] {?det: 'SELECT'}
Feature Bottom Up Predict Combine Rule:
|. [-] . . . .| [1:2] N[SEM='City FROM city_table'] -> 'cities' *
Feature Single Edge Fundamental Rule:
|[---] . . . .| [0:2] NP[SEM=(SELECT, City FROM city_table)] -> Det[SEM='SELECT'] N[SEM='City FROM city_table'] *
Feature Bottom Up Predict Combine Rule:
|[---> . . . .| [0:2] S[SEM=(?np+WHERE+?vp)] -> NP[SEM=?np] * VP[SEM=?vp] {?np: (SELECT, City FROM city_table)}
Feature Bottom Up Predict Combine Rule:
|. . [-] . . .| [2:3] IV[SEM=''] -> 'are' *
Feature Bottom Up Predict Combine Rule:
|. . [-> . . .| [2:3] VP[SEM=(?v+?pp)] -> IV[SEM=?v] * PP[SEM=?pp] {?v: ''}
|. . [-> . . .| [2:3] VP[SEM=(?v+?ap)] -> IV[SEM=?v] * AP[SEM=?ap] {?v: ''}
Feature Bottom Up Predict Combine Rule:
|. . . [-] . .| [3:4] A[SEM=''] -> 'located' *
Feature Bottom Up Predict Combine Rule:
|. . . [-> . .| [3:4] AP[SEM=?pp] -> A[SEM=?a] * PP[SEM=?pp] {?a: ''}
Feature Bottom Up Predict Combine Rule:
|. . . . [-] .| [4:5] P[SEM=''] -> 'in' *
Feature Bottom Up Predict Combine Rule:
|. . . . [-> .| [4:5] PP[SEM=(?p+?np)] -> P[SEM=?p] * NP[SEM=?np] {?p: ''}
Feature Bottom Up Predict Combine Rule:
|. . . . . [-]| [5:6] NP[SEM='Country="china"'] -> 'China' *
Feature Bottom Up Predict Combine Rule:
|. . . . . [->| [5:6] S[SEM=(?np+WHERE+?vp)] -> NP[SEM=?np] * VP[SEM=?vp] {?np: 'Country="china"'}
Feature Single Edge Fundamental Rule:
|. . . . [---]| [4:6] PP[SEM=(, Country="china")] -> P[SEM=''] NP[SEM='Country="china"'] *
Feature Single Edge Fundamental Rule:
|. . . [-----]| [3:6] AP[SEM=(, Country="china")] -> A[SEM=''] PP[SEM=(, Country="china")] *
Feature Single Edge Fundamental Rule:
|. . [-------]| [2:6] VP[SEM=(, , Country="china")] -> IV[SEM=''] AP[SEM=(, Country="china")] *
Feature Single Edge Fundamental Rule:
|[===========]| [0:6] S[SEM=(SELECT, City FROM city_table, WHERE, , , Country="china")] -> NP[SEM=(SELECT, City FROM city_table)] VP[SEM=(, , Country="china")] *
值得注意的是,这里使用的线图分析器自底向上地构建整棵语法树,生成(i,j)位置的树结构,是通过合并(i,k)和(k,j)的结果来实现的,直到规约到句子 S S S时结束。
为了更清楚的看到这棵语法树,我们可以将它打印出来
print(trees)
输出如下(为了看的更清楚,我对输出进行了缩进处理,下载地址):
[
Tree(
S[
SEM=(SELECT, City FROM city_table, WHERE, , , Country="china")
],
[
Tree(
NP[SEM=(SELECT, City FROM city_table)],
[
Tree(
Det[SEM='SELECT'],
['What']
),
Tree(
N[SEM='City FROM city_table'],
['cities']
)
]
),
Tree(
VP[SEM=(, , Country="china")],
[
Tree(
IV[SEM=''],
['are']
),
Tree(
AP[SEM=(, Country="china")],
[
Tree(
A[SEM=''],
['located']
),
Tree(
PP[SEM=(, Country="china")],
[
Tree(
P[SEM=''],
['in']
),
Tree(
NP[SEM='Country="china"'],
['China']
)
]
)
]
)
]
)
]
)
]
完整代码可以从我的github上下载
参考文献:
[1] Natural Language Processing with Python(http://www.nltk.org/book/ch10.html)