WikiSQL: Data Format

This post is a translation of https://github.com/salesforce/WikiSQL, written only to make the format of this dataset easier to understand.

Format of the .jsonl files:

{
   "phase":1,
   "question":"who is the manufacturer for the order year 1998?",
   "sql":{
      "conds":[
         [
            0,
            0,
            "1998"
         ]
      ],
      "sel":1,
      "agg":0
   },
   "table_id":"1-10007452-3"
}

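phase: the phase in which the example was collected. WikiSQL was collected in two phases.
question: the natural-language question.
table_id: the id of the table the question is asked about.
sql: the annotated query, with three fields:

  • sel: the index of the column that is selected.
  • agg: the index of the aggregation operator. lib/query.py lists the operators: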

agg_ops = ['', 'MAX', 'MIN', 'COUNT', 'SUM', 'AVG']

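  • conds: a list of triples (column_index, operator_index, condition), one per WHERE condition of the form column = "value":

    • column_index: the index of the column the condition applies to.
    • operator_index: the index of the comparison operator, taken from this list:

    cond_ops = ['=', '>', '<', 'OP']

    • condition: the value to compare against, given as a string or a float. This is the "value" above.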
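
To make the index encoding concrete, here is a minimal Python sketch that turns the sql field into a readable SQL-like string. The two operator lists are the ones quoted above. The function name render_query, the placeholder column names col0, col1, ... and the literal "FROM table" are assumptions made for this illustration, not part of the WikiSQL code. Joining several conditions with AND follows how WikiSQL queries are usually read.

```python
import json

# Operator lists as quoted above from lib/query.py.
agg_ops = ['', 'MAX', 'MIN', 'COUNT', 'SUM', 'AVG']
cond_ops = ['=', '>', '<', 'OP']


def render_query(sql, header=None):
    """Render a WikiSQL `sql` dict as a readable SQL-like string.

    `header` is the list of column names of the table. When it is not
    given, columns are shown as col0, col1, ... (placeholder names
    invented for this sketch).
    """
    def col(i):
        return header[i] if header else f"col{i}"

    selected = col(sql["sel"])
    agg = agg_ops[sql["agg"]]
    if agg:  # index 0 is the empty string, i.e. no aggregation
        selected = f"{agg}({selected})"

    text = f"SELECT {selected} FROM table"
    conds = [
        f"{col(c)} {cond_ops[op]} {json.dumps(value)}"
        for c, op, value in sql["conds"]
    ]
    if conds:
        text += " WHERE " + " AND ".join(conds)
    return text


# The example record shown above, as one line of a .jsonl file.
line = ('{"phase":1,"question":"who is the manufacturer for the order year 1998?",'
        '"sql":{"conds":[[0,0,"1998"]],"sel":1,"agg":0},"table_id":"1-10007452-3"}')
record = json.loads(line)
print(render_query(record["sql"]))
```

For the example record this prints SELECT col1 FROM table WHERE col0 = "1998". The real column names live in the table with id 1-10007452-3, which is not shown in this post, so placeholders are used here.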

Format of the .tables.jsonl files:

{
   "id":"1-1000181-1",
   "header":[
      "State/territory",
      "Text/background colour",
      "Format",
      "Current slogan",
      "Current series",
      "Notes"
   ],
   "types":[
      "text",
      "text",
      "text",
      "text",
      "text",
      "text"
   ],
   "rows":[
      [
         "Australian Capital Territory",
         "blue/white",
         "Yaa\u00b7nna",
         "ACT \u00b7 CELEBRATION OF A CENTURY 2013",
         "YIL\u00b700A",
         "Slogan screenprinted on plate"
      ],
      [
         "New South Wales",
         "black/yellow",
         "aa\u00b7nn\u00b7aa",
         "NEW SOUTH WALES",
         "BX\u00b799\u00b7HI",
         "No slogan on current series"
      ],
      [
         "New South Wales",
         "black/white",
         "aaa\u00b7nna",
         "NSW",
         "CPX\u00b712A",
         "Optional white slimline series"
      ],
      [
         "Northern Territory",
         "ochre/white",
         "Ca\u00b7nn\u00b7aa",
         "NT \u00b7 OUTBACK AUSTRALIA",
         "CB\u00b706\u00b7ZZ",
         "New series began in June 2011"
      ],
      [
         "Queensland",
         "maroon/white",
         "nnn\u00b7aaa",
         "QUEENSLAND \u00b7 SUNSHINE STATE",
         "999\u00b7TLG",
         "Slogan embossed on plate"
      ],
      [
         "South Australia",
         "black/white",
         "Snnn\u00b7aaa",
         "SOUTH AUSTRALIA",
         "S000\u00b7AZD",
         "No slogan on current series"
      ],
      [
         "Victoria",
         "blue/white",
         "aaa\u00b7nnn",
         "VICTORIA - THE PLACE TO BE",
         "ZZZ\u00b7562",
         "Current series will be exhausted this year"
      ]
   ]
}

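id: the id of the table.
header: the column names of the table.
types: the type of each column, in the same order as header.
rows: the values of each row of the table.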
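
Since every line of a .tables.jsonl file is one table, a natural first step is to index the tables by id so that a question's table_id can be looked up. The following is a rough sketch only: load_tables is a hypothetical helper, the first sample table is the one above cut down to two columns and two rows, and the second table ("demo-2") is entirely made up. The data is read from an in-memory string so the snippet runs without the dataset.

```python
import io
import json

# Two lines in the .tables.jsonl layout. The first is a shortened copy of
# the example above; the second ("demo-2") is invented for illustration.
tables_jsonl = io.StringIO(
    '{"id":"1-1000181-1","header":["State/territory","Notes"],'
    '"types":["text","text"],'
    '"rows":[["Queensland","Slogan embossed on plate"],'
    '["South Australia","No slogan on current series"]]}\n'
    '{"id":"demo-2","header":["Year"],"types":["text"],"rows":[["1998"]]}\n'
)


def load_tables(fh):
    """Read one JSON object per line and return a dict keyed by table id."""
    tables = {}
    for line in fh:
        line = line.strip()
        if not line:  # tolerate blank lines
            continue
        table = json.loads(line)
        width = len(table["header"])
        # header, types and every row should all have one entry per column
        if len(table["types"]) != width or any(len(r) != width for r in table["rows"]):
            raise ValueError(f"inconsistent column count in table {table['id']}")
        tables[table["id"]] = table
    return tables


tables = load_tables(tables_jsonl)
print(tables["1-1000181-1"]["header"])
```

With the real file, the same function could be called on open("data/train.tables.jsonl", encoding="utf-8"), assuming the data folder is laid out that way.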

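Now let's look at the actual data in the downloaded data folder.
This is the first line of train.jsonl:

[screenshot: train.jsonl]

This is the first line of train.tables.jsonl:
[screenshot: train.tables.jsonl]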

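We can see the sel, conds and agg values of the annotated sql. Now compare them against train.tables.jsonl:
sel: index 5 is the Notes column. conds: 3 is the Current slogan column, 0 is "=", and "SOUTH AUSTRALIA" is the value. So the SQL is: SELECT Notes WHERE Current slogan = "SOUTH AUSTRALIA".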
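
The same lookup can be reproduced in a few lines of Python. This is only a sketch under stated assumptions: run_query is a hypothetical helper that handles "=" conditions and no aggregation, "agg": 0 is assumed because the text above does not state the agg value, and only two rows of the table are copied in. It is not the executor shipped with WikiSQL.

```python
# Column names of table 1-1000181-1, copied from the example above.
header = ["State/territory", "Text/background colour", "Format",
          "Current slogan", "Current series", "Notes"]

# Two of its rows, also copied from the example above.
rows = [
    ["Queensland", "maroon/white", "nnn\u00b7aaa",
     "QUEENSLAND \u00b7 SUNSHINE STATE", "999\u00b7TLG",
     "Slogan embossed on plate"],
    ["South Australia", "black/white", "Snnn\u00b7aaa",
     "SOUTH AUSTRALIA", "S000\u00b7AZD",
     "No slogan on current series"],
]

# The annotation described in the text; "agg": 0 is an assumption.
sql = {"sel": 5, "conds": [[3, 0, "SOUTH AUSTRALIA"]], "agg": 0}


def run_query(sql, rows):
    """Evaluate a query in plain Python ('=' conditions, no aggregation)."""
    if sql["agg"] != 0:
        raise NotImplementedError("aggregation is left out of this sketch")
    for _, op, _ in sql["conds"]:
        if op != 0:
            raise NotImplementedError("only '=' is handled in this sketch")
    # Keep the selected column of every row that satisfies all conditions.
    return [
        row[sql["sel"]]
        for row in rows
        if all(row[col] == value for col, _, value in sql["conds"])
    ]


print(header[sql["sel"]], "->", run_query(sql, rows))
```

This prints Notes -> ['No slogan on current series'], which is the Notes entry of the row whose Current slogan is "SOUTH AUSTRALIA".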

That's all~
