[github]word_ordering

Repo: https://github.com/allenschmaltz/word_ordering

The task of word ordering, or linearization, is to recover the original order of a shuffled sentence.
Replicating our results can be broken down into two main steps:

  1. Preprocess Penn Treebank with the splits and tokenization used in our experiments. Instructions are available in data/preprocessing/README_DATASET_CREATION.txt.
  2. Train, run, and evaluate the NGram and LSTM models of interest. Instructions are available in Usage.txt
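
To make the task concrete, here is a toy illustration (mine, not from the repo) of what the models must undo:

import random

gold = 'the cat sat on the mat .'.split()
bag = gold[:]
random.shuffle(bag)  # e.g. ['on', 'mat', 'the', 'sat', '.', 'the', 'cat']
# Word ordering / linearization: given only the shuffled bag of tokens,
# recover the original sequence in `gold`.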

1. Preprocessing

A treebank is a corpus annotated with syntactic (and semantic) structure, and the PTB (Penn Treebank) is one such corpus.
Preprocessing splits and tokenizes this corpus. The instructions are as follows:

  1. Download the treebank_3 corpus and place it in the datasets/ directory.
  2. Run cd data/preprocessing
  3. Run bash create_dependency_files.sh. This will:
    copy the WSJ constituency trees
    patch the WSJ part of the Penn Treebank using the NP bracketing script
    convert the NP-bracketed constituency trees to dependency trees.
  4. Run bash create_dataset.sh. This generates the core files:
    the ordered PTB files with and without base NP annotations
    the exact shuffling of the word multisets used to generate the point estimates in the paper
    versions formatted for use with ZGen and the Yara parser.

After following these steps, the folder zgen_data_gold will contain the gold, ordered sentences.


Files ending in _ref_npsyms.txt include the BNP annotations.

In the folder zgen_data_npsyms_freq3_unkUNK, infrequent tokens have been replaced with special symbols; it also contains the shuffled files. In the versions with BNP annotations, the base NPs themselves are not shuffled internally (e.g., Black Monday is treated as a single unit).
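
Given that description, a quick sanity check (my own; the file names are taken from the decoder commands in section 2) is to confirm that each shuffled line is an exact permutation of the corresponding unshuffled line:

from collections import Counter

shuf = 'valid_words_with_np_symbols_shuffled_no_eos.txt'
gold = 'valid_words_with_np_symbols_no_eos.txt'
with open(shuf) as f_shuf, open(gold) as f_gold:
    for s, g in zip(f_shuf, f_gold):
        # same word multiset, different order
        assert Counter(s.split()) == Counter(g.split())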

Required environment:
Python: Most recently tested with 2.7.9
NLTK (Python package): Most recently tested with 3.0.4
Java: Most recently tested with 1.8.0_31
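
A quick way to confirm the interpreter and NLTK versions match the ones above (my own check):

import sys
import nltk

print(sys.version)       # tested with 2.7.9 per the list above
print(nltk.__version__)  # tested with 3.0.4 per the list above
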
Installing NLTK:

  1. Install NLTK: sudo pip install -U nltk
  2. Download nltk_data and place it in any one of the directories NLTK searches.
    An error may come up here:
IOError: Zipfile 'nltk_data/corpora/ptb.zip' does not contain ptb/datasets/treebank_3_original/wsj/02/wsj_0200.mrg

Solution
Go to /usr/share/nltk_data/, unzip ptb.zip, and put the wsj folder inside it. (Converting the names to uppercase does not appear to be necessary.)
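
To verify the fix took effect, you can ask NLTK where it resolves the corpus (my own check; the resource name mirrors the error message above):

import nltk

# Raises LookupError if the corpus directory still cannot be found.
print(nltk.data.find('corpora/ptb'))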

2. Train, run, and evaluate the NGram and LSTM models

N-gram

Environment setup
Install KenLM.
Python module: pip install https://github.com/kpu/kenlm/archive/master.zip
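
Once installed, the module can load an ARPA file directly; a minimal smoke test (mine; it uses the 5-gram model built in step 1 below, with bos/eos disabled to mirror the no_eos data):

import kenlm

model = kenlm.Model('/home/yingtao/output/LM5_noeos_withnpsyms_freq3_unkUNK.arpa')
# Log10 probability of a tokenized sentence under the 5-gram model.
print(model.score('black monday was a shock', bos=False, eos=False))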

Commands (make sure the VM has enough memory allocated):

There are two configurations: N-Gram (Words+BNPs) and N-Gram (Words, without future costs).
  1. Build the 5-gram model (order 5). lmplz reads the training text from stdin and writes the ARPA file to stdout; the training-file path below is my assumption, following the naming of the valid split:
bin/lmplz -o 5 -S 3145728K -T /tmp < /mnt/shared/order/datasets/zgen_data_npsyms_freq3_unkUNK/npsyms/train_words_with_np_symbols_no_eos.txt > /home/yingtao/output/LM5_noeos_withnpsyms_freq3_unkUNK.arpa
  2. Build the unigram model (order 1), used later for future costs; same assumed training file:
bin/lmplz -o 1 -S 3145728K -T /tmp --discount_fallback < /mnt/shared/order/datasets/zgen_data_npsyms_freq3_unkUNK/npsyms/train_words_with_np_symbols_no_eos.txt > /home/yingtao/output/LM1_noeos_withnpsyms_freq3_unkUNK.arpa
  3. Run the N-gram decoder. It takes the previously prepared shuffled_no_eos.txt file as input, consults the two language models just built, and writes the reordered result (a minimal sketch of the decoding appears after this list):
python ngram_decoder.py /home/yingtao/output/LM5_noeos_withnpsyms_freq3_unkUNK.arpa /mnt/shared/order/datasets/zgen_data_npsyms_freq3_unkUNK/npsyms/valid_words_with_np_symbols_shuffled_no_eos.txt 64 --future /home/yingtao/output/LM1_noeos_withnpsyms_freq3_unkUNK.arpa > /home/yingtao/output/output_valid_with_npsyms_futurecosts_beam64_lm5.txt
  4. Randomly replace unk/UNK tokens with low-frequency words and strip the BNP symbols:
python randomly_replace_unkUNK.py \
--generated_reordering_with_unk /home/yingtao/output/output_valid_with_npsyms_futurecosts_beam64_lm5.txt \
--gold_unprocessed /mnt/shared/order/datasets/zgen_data_gold/valid_words_ref_npsyms.txt \
--gold_processed /mnt/shared/order/datasets/zgen_data_npsyms_freq3_unkUNK/npsyms/valid_words_with_np_symbols_no_eos.txt \
--out_file /home/yingtao/output/output_valid_with_npsyms_futurecosts_beam64_lm5_removed_unk.txt \
--remove_npsyms
The paths on the right of the --gold_* flags are the correct (reference) results.
  5. Compute the BLEU score:
./ScoreBLEU.sh -t /home/yingtao/output/output_valid_with_npsyms_futurecosts_beam64_lm5_removed_unk.txt -r /mnt/shared/order/datasets/zgen_data_gold/valid_words_ref.txt -odir /home/yingtao/output/test
In the scoring script, replace use UNIVERSAL 'isa'; with use Scalar::Util 'reftype';
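
For reference, here is a minimal sketch of what step 3's decoder does, assuming a beam search over prefixes scored by the 5-gram model plus unigram future costs for the unplaced words; this is my own simplification, not the repo's ngram_decoder.py:

import kenlm
from collections import Counter

lm5 = kenlm.Model('/home/yingtao/output/LM5_noeos_withnpsyms_freq3_unkUNK.arpa')
lm1 = kenlm.Model('/home/yingtao/output/LM1_noeos_withnpsyms_freq3_unkUNK.arpa')

def order(words, beam_size=64):
    # Each hypothesis: (prefix placed so far, multiset of words still unplaced).
    beam = [((), Counter(words))]
    for _ in range(len(words)):
        candidates = []
        for prefix, remaining in beam:
            for w in remaining:
                new_prefix = prefix + (w,)
                rest = remaining - Counter([w])
                # Prefix score under the 5-gram LM (bos/eos off: no_eos data).
                g = lm5.score(' '.join(new_prefix), bos=False, eos=False)
                # Unigram "future cost" estimate for the unplaced words.
                h = sum(lm1.score(v, bos=False, eos=False) * c
                        for v, c in rest.items())
                candidates.append((g + h, new_prefix, rest))
        candidates.sort(key=lambda c: c[0], reverse=True)
        beam = [(p, r) for _, p, r in candidates[:beam_size]]
    return ' '.join(beam[0][0])

print(order('monday black was shock a'.split()))

With beam_size=64 this mirrors the 64 passed to ngram_decoder.py above; the real decoder handles incremental LM state and tie-breaking more carefully.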

Without the BNP annotations, the score drops.
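
As a rough cross-check of ScoreBLEU.sh (my own snippet, not the paper's official scorer; corpus_bleu requires a newer NLTK than the 3.0.4 listed above):

from nltk.translate.bleu_score import corpus_bleu

hyp_path = '/home/yingtao/output/output_valid_with_npsyms_futurecosts_beam64_lm5_removed_unk.txt'
ref_path = '/mnt/shared/order/datasets/zgen_data_gold/valid_words_ref.txt'
with open(hyp_path) as f_hyp, open(ref_path) as f_ref:
    hyps = [line.split() for line in f_hyp]
    refs = [[line.split()] for line in f_ref]  # single reference per sentence
print(corpus_bleu(refs, hyps))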

