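This post is a translation of the README at https://github.com/salesforce/WikiSQL, written only to make the format of this dataset easier to understand.
Format of the .jsonl files: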
{
"phase":1,
"question":"who is the manufacturer for the order year 1998?",
"sql":{
"conds":[
[
0,
0,
"1998"
]
],
"sel":1,
"agg":0
},
"table_id":"1-10007452-3"
}
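phase: the phase in which the data was collected. WikiSQL was collected in two phases.
question: the natural-language question.
table_id: the id of the table that the question is asked about.
sql: the SQL query that corresponds to the question, with these fields:
- sel: the index of the column that is selected.
- agg: the index of the aggregation operator. In lib/query.py the operators are listed as:
agg_ops = ['', 'MAX', 'MIN', 'COUNT', 'SUM', 'AVG']
- conds: a list of triples (column_index, operator_index, condition), each standing for a clause of the form where column = "value".
  - column_index: the index of the column the condition applies to.
  - operator_index: the index of the condition operator, taken from this list:
cond_ops = ['=', '>', '<', 'OP']
  - condition: the value to compare against, given as a string or a float. This is the "value" in the clause above.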
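To make the index encoding concrete, here is a minimal sketch that turns a `sql` object into a readable string. The two operator lists are copied from above; the function name `describe_query` and the `col<N>` placeholders are assumptions made for illustration and are not part of the WikiSQL library. Multiple conditions are joined with AND.

```python
AGG_OPS = ['', 'MAX', 'MIN', 'COUNT', 'SUM', 'AVG']
COND_OPS = ['=', '>', '<', 'OP']

def describe_query(sql):
    """Render a WikiSQL `sql` dict as text, naming columns col0, col1, ..."""
    column = f"col{sql['sel']}"
    agg = AGG_OPS[sql['agg']]
    # Index 0 is the empty string, meaning "no aggregation".
    select = f"{agg}({column})" if agg else column
    conds = [f"col{c} {COND_OPS[o]} {v!r}" for c, o, v in sql['conds']]
    text = f"SELECT {select} FROM table"
    if conds:
        text += " WHERE " + " AND ".join(conds)
    return text

# The `sql` object from the sample record above.
example = {"conds": [[0, 0, "1998"]], "sel": 1, "agg": 0}
print(describe_query(example))  # SELECT col1 FROM table WHERE col0 = '1998'
```

Replacing the `col<N>` placeholders with real column names requires the matching table's header, which lives in the corresponding .tables.jsonl file.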
Format of the .tables.jsonl files:
{
"id":"1-1000181-1",
"header":[
"State/territory",
"Text/background colour",
"Format",
"Current slogan",
"Current series",
"Notes"
],
"types":[
"text",
"text",
"text",
"text",
"text",
"text"
],
"rows":[
[
"Australian Capital Territory",
"blue/white",
"Yaa\u00b7nna",
"ACT \u00b7 CELEBRATION OF A CENTURY 2013",
"YIL\u00b700A",
"Slogan screenprinted on plate"
],
[
"New South Wales",
"black/yellow",
"aa\u00b7nn\u00b7aa",
"NEW SOUTH WALES",
"BX\u00b799\u00b7HI",
"No slogan on current series"
],
[
"New South Wales",
"black/white",
"aaa\u00b7nna",
"NSW",
"CPX\u00b712A",
"Optional white slimline series"
],
[
"Northern Territory",
"ochre/white",
"Ca\u00b7nn\u00b7aa",
"NT \u00b7 OUTBACK AUSTRALIA",
"CB\u00b706\u00b7ZZ",
"New series began in June 2011"
],
[
"Queensland",
"maroon/white",
"nnn\u00b7aaa",
"QUEENSLAND \u00b7 SUNSHINE STATE",
"999\u00b7TLG",
"Slogan embossed on plate"
],
[
"South Australia",
"black/white",
"Snnn\u00b7aaa",
"SOUTH AUSTRALIA",
"S000\u00b7AZD",
"No slogan on current series"
],
[
"Victoria",
"blue/white",
"aaa\u00b7nnn",
"VICTORIA - THE PLACE TO BE",
"ZZZ\u00b7562",
"Current series will be exhausted this year"
]
]
}
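id: the id of the table.
header: a list of the column names in the table.
types: the data type of each column, in the same order as header (every column in this example is "text").
rows: a list of rows. Each row is a list of the cell values in that row.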
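Both kinds of file hold one JSON object per line, so they can be read line by line with the standard `json` module. The sketch below is an assumption about how one might load them, not code from the WikiSQL repository: `read_jsonl` is a hypothetical helper, and an in-memory `io.StringIO` holding a trimmed copy of the table above stands in for an opened file.

```python
import io
import json

def read_jsonl(handle):
    """Parse one JSON object per non-empty line of a file-like object."""
    return [json.loads(line) for line in handle if line.strip()]

# Stand-in for an opened .tables.jsonl file; the table is trimmed to two
# columns and one row to keep the example short.
sample = io.StringIO(
    '{"id": "1-1000181-1", "header": ["State/territory", "Notes"], '
    '"types": ["text", "text"], '
    '"rows": [["Victoria", "Current series will be exhausted this year"]]}\n'
)

# Index tables by id so a question's table_id can be looked up directly.
tables = {table["id"]: table for table in read_jsonl(sample)}
print(tables["1-1000181-1"]["header"])  # ['State/territory', 'Notes']
```

With real data, the `io.StringIO` object would be replaced by a file opened in text mode with UTF-8 encoding.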
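Now let's look at the actual data in the downloaded data folder:
This is the first line of train.jsonl:
This is the first line of train.tables.jsonl:
We can see the sel, conds, and agg of the annotated SQL, and then check them against train.tables.jsonl:
sel: index 5 is Notes. conds: 3 is Current slogan, 0 is "=", and "SOUTH AUSTRALIA" is the value. So the SQL is: select notes where current slogan = "SOUTH AUSTRALIA".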
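The lookup just described can be checked by hand with a few lines of Python. This is a simplified sketch under stated assumptions: `run_query` is a hypothetical helper, only two rows of the table are copied in, the agg field is ignored, and values are compared by exact match, which may differ from how the official WikiSQL tooling executes queries.

```python
import operator

HEADER = ["State/territory", "Text/background colour", "Format",
          "Current slogan", "Current series", "Notes"]
# Two rows copied from the sample table shown earlier.
ROWS = [
    ["Queensland", "maroon/white", "nnn\u00b7aaa",
     "QUEENSLAND \u00b7 SUNSHINE STATE", "999\u00b7TLG",
     "Slogan embossed on plate"],
    ["South Australia", "black/white", "Snnn\u00b7aaa",
     "SOUTH AUSTRALIA", "S000\u00b7AZD",
     "No slogan on current series"],
]
# Functions for operator indexes 0, 1, 2, i.e. '=', '>', '<'.
COMPARE = [operator.eq, operator.gt, operator.lt]

def run_query(sql, rows):
    """Return the selected column for rows that satisfy every condition."""
    matches = [
        row for row in rows
        if all(COMPARE[op](row[col], value) for col, op, value in sql["conds"])
    ]
    return [row[sql["sel"]] for row in matches]

sql = {"sel": 5, "conds": [[3, 0, "SOUTH AUSTRALIA"]], "agg": 0}
print(HEADER[sql["sel"]], HEADER[sql["conds"][0][0]])  # Notes Current slogan
print(run_query(sql, ROWS))  # ['No slogan on current series']
```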
That's all.