Reproducing a Data Augmentation Experiment (1): Contextual Augmentation: Data Augmentation by Words with Paradigmatic Relations

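I first learned of the paper in the title while reading a paper on data augmentation with BERT, Conditional BERT Contextual Augmentation. An analysis of that paper by another reader (see https://zhuanlan.zhihu.com/p/53141568) mentions this paper by Kobayashi, which happens to come with open-source code. As an English major who graduated several years ago and a not-quite-qualified programmer, writing the code myself was not realistic, so I decided to reproduce the experimental results with the open-source code and record the process here for reference.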

Address of the original code on GitHub:

Following the order in the README, run the commands one by one from top to bottom:

1. Prepare a label-conditional bi-directional language model

(1) # download wikitext

sh prepare_rawwikitext.sh

(2) # install chainer and spacy

Command: the README says pip install cupy; I ran pip install cupy-cuda90 instead (the prebuilt wheel for CUDA 9.0).

Output:

Collecting cupy-cuda90

  Downloading https://files.pythonhosted.org/packages/30/a5/89d64c99a8b17c1ed64fcc0c9207ff6bc70efe90a9c567d616eb910aee34/cupy_cuda90-6.2.0-cp36-cp36m-manylinux1_x86_64.whl (270.4MB)

     |████████████████████████████████| 270.4MB 17kB/s

Collecting fastrlock>=0.3 (from cupy-cuda90)

  Downloading https://files.pythonhosted.org/packages/b5/93/a7efbd39eac46c137500b37570c31dedc2d31a8ff4949fcb90bda5bc5f16/fastrlock-0.4-cp36-cp36m-manylinux1_x86_64.whl

Requirement already satisfied: numpy>=1.9.0 in /dnn4_added/wanglina/wln_install/python-3.6/lib/python3.6/site-packages (from cupy-cuda90) (1.16.4)

Requirement already satisfied: six>=1.9.0 in /dnn4_added/wanglina/wln_install/python-3.6/lib/python3.6/site-packages (from cupy-cuda90) (1.12.0)

Installing collected packages: fastrlock, cupy-cuda90

Successfully installed cupy-cuda90-6.2.0 fastrlock-0.4

Command: pip install chainer

Output:

Collecting chainer

  Downloading https://files.pythonhosted.org/packages/2c/5a/86c50a0119a560a39d782c4cdd9b72927c090cc2e3f70336e01b19a5f97a/chainer-6.2.0.tar.gz (873kB)

     |████████████████████████████████| 880kB 174kB/s

Requirement already satisfied: setuptools in /dnn4_added/wanglina/wln_install/python-3.6/lib/python3.6/site-packages (from chainer) (28.8.0)

Collecting typing<=3.6.6 (from chainer)

  Downloading https://files.pythonhosted.org/packages/4a/bd/eee1157fc2d8514970b345d69cb9975dcd1e42cd7e61146ed841f6e68309/typing-3.6.6-py3-none-any.whl

Collecting typing_extensions<=3.6.6 (from chainer)

  Downloading https://files.pythonhosted.org/packages/62/4f/392a1fa2873e646f5990eb6f956e662d8a235ab474450c72487745f67276/typing_extensions-3.6.6-py3-none-any.whl

Collecting filelock (from chainer)

  Downloading https://files.pythonhosted.org/packages/93/83/71a2ee6158bb9f39a90c0dea1637f81d5eef866e188e1971a1b1ab01a35a/filelock-3.0.12-py3-none-any.whl

Requirement already satisfied: numpy>=1.9.0 in /dnn4_added/wanglina/wln_install/python-3.6/lib/python3.6/site-packages (from chainer) (1.16.4)

Collecting protobuf<3.8.0rc1,>=3.0.0 (from chainer)

  Downloading https://files.pythonhosted.org/packages/5a/aa/a858df367b464f5e9452e1c538aa47754d467023850c00b000287750fa77/protobuf-3.7.1-cp36-cp36m-manylinux1_x86_64.whl (1.2MB)

     |████████████████████████████████| 1.2MB 153kB/s

Requirement already satisfied: six>=1.9.0 in /dnn4_added/wanglina/wln_install/python-3.6/lib/python3.6/site-packages (from chainer) (1.12.0)

Building wheels for collected packages: chainer

  Building wheel for chainer (setup.py) ... done

  Stored in directory: /sunj/wanglina/.cache/pip/wheels/2e/be/c5/6ee506abcaa4a53106f7d7671bbee8b4e5243bc562a9d32ad1

Successfully built chainer

Installing collected packages: typing, typing-extensions, filelock, protobuf, chainer

  Found existing installation: protobuf 3.9.0

    Uninstalling protobuf-3.9.0:

      Successfully uninstalled protobuf-3.9.0

Successfully installed chainer-6.2.0 filelock-3.0.12 protobuf-3.7.1 typing-3.6.6 typing-extensions-3.6.6

Command: pip install spacy

Output:

Collecting spacy

  Downloading https://files.pythonhosted.org/packages/4e/f4/3d79c0eeec5d45046d0b1f00b3b78de00f8ce389f56d8b53fbbdd198d90e/spacy-2.1.6-cp36-cp36m-manylinux1_x86_64.whl (30.8MB)

     |████████████████████████████████| 30.8MB 76kB/s

Collecting murmurhash<1.1.0,>=0.28.0 (from spacy)

  Downloading https://files.pythonhosted.org/packages/a6/e6/63f160a4fdf0e875d16b28f972083606d8d54f56cd30cb8929f9a1ee700e/murmurhash-1.0.2-cp36-cp36m-manylinux1_x86_64.whl

Collecting srsly<1.1.0,>=0.0.6 (from spacy)

  Downloading https://files.pythonhosted.org/packages/aa/6c/2ef2d6f4c63a197981f4ac01bb17560c857c6721213c7c99998e48cdda2a/srsly-0.0.7-cp36-cp36m-manylinux1_x86_64.whl (180kB)

     |████████████████████████████████| 184kB 1.1MB/s

Collecting thinc<7.1.0,>=7.0.8 (from spacy)

  Downloading https://files.pythonhosted.org/packages/18/a5/9ace20422e7bb1bdcad31832ea85c52a09900cd4a7ce711246bfb92206ba/thinc-7.0.8-cp36-cp36m-manylinux1_x86_64.whl (2.1MB)

     |████████████████████████████████| 2.1MB 1.2MB/s

Collecting cymem<2.1.0,>=2.0.2 (from spacy)

  Downloading https://files.pythonhosted.org/packages/3d/61/9b0520c28eb199a4b1ca667d96dd625bba003c14c75230195f9691975f85/cymem-2.0.2-cp36-cp36m-manylinux1_x86_64.whl

Requirement already satisfied: numpy>=1.15.0 in /dnn4_added/wanglina/wln_install/python-3.6/lib/python3.6/site-packages (from spacy) (1.16.4)

Collecting requests<3.0.0,>=2.13.0 (from spacy)

  Downloading https://files.pythonhosted.org/packages/51/bd/23c926cd341ea6b7dd0b2a00aba99ae0f828be89d72b2190f27c11d4b7fb/requests-2.22.0-py2.py3-none-any.whl (57kB)

     |████████████████████████████████| 61kB 318kB/s

Collecting wasabi<1.1.0,>=0.2.0 (from spacy)

  Downloading https://files.pythonhosted.org/packages/f4/c1/d76ccdd12c716be79162d934fe7de4ac8a318b9302864716dde940641a79/wasabi-0.2.2-py3-none-any.whl

Collecting preshed<2.1.0,>=2.0.1 (from spacy)

  Downloading https://files.pythonhosted.org/packages/20/93/f222fb957764a283203525ef20e62008675fd0a14ffff8cc1b1490147c63/preshed-2.0.1-cp36-cp36m-manylinux1_x86_64.whl (83kB)

     |████████████████████████████████| 92kB 453kB/s

Collecting blis<0.3.0,>=0.2.2 (from spacy)

  Downloading https://files.pythonhosted.org/packages/34/46/b1d0bb71d308e820ed30316c5f0a017cb5ef5f4324bcbc7da3cf9d3b075c/blis-0.2.4-cp36-cp36m-manylinux1_x86_64.whl (3.2MB)

     |████████████████████████████████| 3.2MB 1.0MB/s

Collecting plac<1.0.0,>=0.9.6 (from spacy)

  Downloading https://files.pythonhosted.org/packages/9e/9b/62c60d2f5bc135d2aa1d8c8a86aaf84edb719a59c7f11a4316259e61a298/plac-0.9.6-py2.py3-none-any.whl

Collecting tqdm<5.0.0,>=4.10.0 (from thinc<7.1.0,>=7.0.8->spacy)

  Downloading https://files.pythonhosted.org/packages/9f/3d/7a6b68b631d2ab54975f3a4863f3c4e9b26445353264ef01f465dc9b0208/tqdm-4.32.2-py2.py3-none-any.whl (50kB)

     |████████████████████████████████| 51kB 271kB/s

Collecting urllib3!=1.25.0,!=1.25.1,<1.26,>=1.21.1 (from requests<3.0.0,>=2.13.0->spacy)

  Downloading https://files.pythonhosted.org/packages/e6/60/247f23a7121ae632d62811ba7f273d0e58972d75e58a94d329d51550a47d/urllib3-1.25.3-py2.py3-none-any.whl (150kB)

     |████████████████████████████████| 153kB 1.2MB/s

Collecting certifi>=2017.4.17 (from requests<3.0.0,>=2.13.0->spacy)

  Downloading https://files.pythonhosted.org/packages/69/1b/b853c7a9d4f6a6d00749e94eb6f3a041e342a885b87340b79c1ef73e3a78/certifi-2019.6.16-py2.py3-none-any.whl (157kB)

     |████████████████████████████████| 163kB 1.2MB/s

Collecting chardet<3.1.0,>=3.0.2 (from requests<3.0.0,>=2.13.0->spacy)

  Downloading https://files.pythonhosted.org/packages/bc/a9/01ffebfb562e4274b6487b4bb1ddec7ca55ec7510b22e4c51f14098443b8/chardet-3.0.4-py2.py3-none-any.whl (133kB)

     |████████████████████████████████| 143kB 1.2MB/s

Collecting idna<2.9,>=2.5 (from requests<3.0.0,>=2.13.0->spacy)

  Downloading https://files.pythonhosted.org/packages/14/2c/cd551d81dbe15200be1cf41cd03869a46fe7226e7450af7a6545bfc474c9/idna-2.8-py2.py3-none-any.whl (58kB)

     |████████████████████████████████| 61kB 343kB/s

Installing collected packages: murmurhash, srsly, tqdm, cymem, plac, wasabi, preshed, blis, thinc, urllib3, certifi, chardet, idna, requests, spacy

Successfully installed blis-0.2.4 certifi-2019.6.16 chardet-3.0.4 cymem-2.0.2 idna-2.8 murmurhash-1.0.2 plac-0.9.6 preshed-2.0.1 requests-2.22.0 spacy-2.1.6 srsly-0.0.7 thinc-7.0.8 tqdm-4.32.2 urllib3-1.25.3 wasabi-0.2.2

Command: python -m spacy download en_core_web_sm

Output:

Collecting en_core_web_sm==2.1.0 from https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-2.1.0/en_core_web_sm-2.1.0.tar.gz#egg=en_core_web_sm==2.1.0

  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-2.1.0/en_core_web_sm-2.1.0.tar.gz (11.1MB)

     |████████████████████████████████| 11.1MB 403kB/s

Building wheels for collected packages: en-core-web-sm

  Building wheel for en-core-web-sm (setup.py) ... done

  Stored in directory: /tmp/pip-ephem-wheel-cache-6kka_3pd/wheels/39/ea/3b/507f7df78be8631a7a3d7090962194cf55bc1158572c0be77f

Successfully built en-core-web-sm

Installing collected packages: en-core-web-sm

Successfully installed en-core-web-sm-2.1.0

WARNING: You are using pip version 19.1.1, however version 19.2.1 is available.

You should consider upgrading via the 'pip install --upgrade pip' command.

✔ Download and installation successful

You can now load the model via spacy.load('en_core_web_sm')
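
Before moving on to the slow preprocessing step, it can be worth confirming that everything installed above is visible to the Python interpreter that will run the scripts. The following is a minimal sketch, not part of the original repository; the helper name check_packages is my own, and it only tests whether each module can be located, not whether the GPU actually works.

```python
import importlib.util


def check_packages(names):
    """Return {name: bool} telling whether each top-level module can be located."""
    return {name: importlib.util.find_spec(name) is not None for name in names}


if __name__ == "__main__":
    # Module names installed in step (2); en_core_web_sm is installed as a package too.
    wanted = ["cupy", "chainer", "spacy", "en_core_web_sm"]
    for name, found in check_packages(wanted).items():
        print(f"{name}: {'found' if found else 'MISSING'}")
```

If any entry prints MISSING, the most likely cause is that pip installed into a different Python environment than the one being used to run the check.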

(3) # segment text by sentence boundaries (very slowly). I started at 15:15 and the job was about half done by 18:30, so the whole run should take roughly six hours.

Command: PYTHONIOENCODING=utf-8 python preprocess_spacy.py -d datasets/wikitext-103-raw/wiki.train.raw > datasets/wikitext-103-raw/spacy_wikitext-103-raw.train

Output:

0 lines end

100000 lines end

200000 lines end

300000 lines end

400000 lines end

500000 lines end

600000 lines end

700000 lines end

800000 lines end

900000 lines end

1000000 lines end

1100000 lines end

1200000 lines end

1300000 lines end

1400000 lines end

1500000 lines end

1600000 lines end

1700000 lines end

1800000 lines end

Command: PYTHONIOENCODING=utf-8 python preprocess_spacy.py -d datasets/wikitext-103-raw/wiki.valid.raw > datasets/wikitext-103-raw/spacy_wikitext-103-raw.valid

Output:

0 lines end
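
The running time noted for step (3) is a simple linear extrapolation: if half the work took from 15:15 to 18:30, the whole job should take about twice that. A small sketch of the arithmetic follows; the function name estimate_finish is hypothetical, and the calendar date is a placeholder because only the clock times are known.

```python
from datetime import datetime, timedelta


def estimate_finish(start, now, fraction_done):
    """Linearly extrapolate total duration and finish time from progress so far."""
    if not 0 < fraction_done <= 1:
        raise ValueError("fraction_done must be in (0, 1]")
    elapsed = now - start
    total = elapsed / fraction_done  # timedelta divided by a float
    return start + total, total


# Placeholder date; only 15:15 and 18:30 come from the run described above.
start = datetime(2019, 1, 1, 15, 15)
now = datetime(2019, 1, 1, 18, 30)
finish, total = estimate_finish(start, now, 0.5)
print(total)          # 6:30:00
print(finish.time())  # 21:45:00
```

The extrapolation gives six and a half hours, in line with the rough estimate. The "N lines end" messages printed every 100000 lines can serve as the progress measure if the total number of lines in the input file is known.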

(4) # construct vocabulary on wikitext
Command: python construct_vocab.py --data datasets/wikitext-103-raw/spacy_wikitext-103-raw.train -t 50 --save datasets/wikitext-103-raw/spacy_wikitext-103-raw.train.vocab.t50

Output:

# of words: 49873
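
To make the vocabulary step more concrete, here is a toy sketch of frequency-threshold vocabulary construction. It is not the code in construct_vocab.py: I am assuming that -t is a minimum word-frequency threshold (as the t50 suffix in the output file name suggests), and the special tokens, the use of >= rather than >, and the ordering of ids are my own choices for illustration.

```python
from collections import Counter


def build_vocab(sentences, threshold, specials=("<eos>", "<unk>")):
    """Map each word seen at least `threshold` times to an integer id."""
    # Count whitespace-separated tokens over all sentences.
    counts = Counter(word for sent in sentences for word in sent.split())
    # Hypothetical special tokens take the first ids.
    vocab = {tok: i for i, tok in enumerate(specials)}
    # Most frequent words first; ties broken alphabetically for determinism.
    for word, c in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
        if c >= threshold and word not in vocab:
            vocab[word] = len(vocab)
    return vocab


toy = ["the cat sat", "the dog sat", "a bird flew"]
vocab = build_vocab(toy, threshold=2)
print(vocab)  # {'<eos>': 0, '<unk>': 1, 'sat': 2, 'the': 3}
print("# of words:", len(vocab))
```

With a threshold of 2, the five words that occur only once are dropped; at lookup time they would be mapped to the unknown token. On the real corpus the same idea, with a threshold of 50, is what keeps the vocabulary down to roughly fifty thousand words.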

Space is limited, so the record continues in the next post.
