Text data augmentation: using textattack and nlpaug

nlpaug

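nlpaug is a library for text data augmentation. It supports synonym replacement, TF-IDF-based word replacement, spelling-error injection, random deletion and insertion, back-translation, and more.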
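To make two of these operations concrete, here is a minimal plain-Python sketch of random deletion and random insertion. It is not nlpaug's implementation: the helper names `random_delete` and `random_insert` and the small `vocab` list are assumptions made for illustration only.

```python
import random

def random_delete(tokens, p, rng):
    """Drop each token with probability p, but never return an empty list."""
    kept = [t for t in tokens if rng.random() >= p]
    return kept if kept else [rng.choice(tokens)]

def random_insert(tokens, vocab, rng):
    """Insert one word drawn from vocab at a random position."""
    out = list(tokens)
    pos = rng.randint(0, len(out))  # inclusive on both ends
    out.insert(pos, rng.choice(vocab))
    return out

rng = random.Random(0)  # fixed seed so repeated runs give the same result
tokens = "i love apple and i was born in 2000".split()
vocab = ["really", "also", "today"]  # hypothetical insertion vocabulary

deleted = random_delete(tokens, p=0.2, rng=rng)
inserted = random_insert(tokens, vocab, rng=rng)
print(" ".join(deleted))
print(" ".join(inserted))
```

A library such as nlpaug adds the pieces this sketch leaves out, for example tokenization that handles punctuation and word sources such as WordNet or a language model.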

Environment
python==3.7
nlpaug==1.1.7

Documentation

https://nlpaug.readthedocs.io/en/latest/overview/overview.html
https://github.com/makcedward/nlpaug

Installation

pip install numpy requests nlpaug

Common augmentation methods

API documentation

https://nlpaug.readthedocs.io/en/latest/augmenter/augmenter.html
https://pypi.org/project/nlpaug/

Example

import nlpaug.augmenter.word as naw
from nlpaug.flow import Sometimes

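# Words in this list are left unchanged during augmentation.
stopwords = ["love", "i"]
# SynonymAug defaults to WordNet, so the NLTK wordnet data must be available.
synonym_aug = naw.SynonymAug(stopwords=stopwords)
spelling_aug = naw.SpellingAug(stopwords=stopwords, aug_p=0.1)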
# Combine several augmenters into one pipeline.
aug = Sometimes([synonym_aug, spelling_aug])
text = "i love apple. i was born in 2000. how are you?"
# Generate 2 augmented variants of the text.
r = aug.augment(text, n=2)
print(r)
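The idea of combining augmenters while protecting a word list can be sketched without the library. The code below is an illustration under stated assumptions, not nlpaug's actual code: `make_word_augmenter`, `sometimes`, and `pipeline_p` are hypothetical names, and the two toy transforms (upper-casing and reversing a word) stand in for real synonym and spelling augmenters.

```python
import random

def make_word_augmenter(transform, stopwords, aug_p, rng):
    """Build a function that applies `transform` to each non-stopword with probability aug_p."""
    protected = {w.lower() for w in stopwords}

    def augment(text):
        out = []
        for word in text.split():
            if word.lower() not in protected and rng.random() < aug_p:
                out.append(transform(word))
            else:
                out.append(word)  # stopwords and unselected words pass through
        return " ".join(out)

    return augment

def sometimes(augmenters, pipeline_p, rng):
    """Run the augmenters in order, applying each one only with probability pipeline_p."""
    def augment(text):
        for aug in augmenters:
            if rng.random() < pipeline_p:
                text = aug(text)
        return text

    return augment

rng = random.Random(42)
stopwords = ["love", "i"]
upper_aug = make_word_augmenter(str.upper, stopwords, aug_p=0.5, rng=rng)
reverse_aug = make_word_augmenter(lambda w: w[::-1], stopwords, aug_p=0.5, rng=rng)
pipeline = sometimes([upper_aug, reverse_aug], pipeline_p=0.8, rng=rng)

text = "i love apple. i was born in 2000. how are you?"
results = [pipeline(text) for _ in range(2)]
for r in results:
    print(r)
```

Whatever the random draws are, the words "i" and "love" keep their original form and position, which is the behavior the `stopwords` argument is meant to provide.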

textattack

Environment
python==3.7
textattack==0.3.3

Documentation

https://textattack.readthedocs.io/en/latest/apidoc/textattack.transformations.word_swaps.html#word-swap
https://pypi.org/project/textattack/

Example

import nltk
from textattack.transformations import CompositeTransformation
# WordSwapWordNet is used in the transformation below, so it must be imported here.
from textattack.transformations import (
    WordInsertionRandomSynonym,
    WordSwapChangeLocation,
    WordSwapChangeNumber,
    WordSwapEmbedding,
    WordSwapQWERTY,
    WordSwapRandomCharacterDeletion,
    WordSwapWordNet,
)

from textattack.constraints.pre_transformation import RepeatModification
# from textattack.constraints.pre_transformation import StopwordModification
from textattack.constraints import PreTransformationConstraint

from textattack.augmentation import Augmenter
from textattack.shared.validators import transformation_consists_of_word_swaps


class StopwordModification(PreTransformationConstraint):
    """A constraint disallowing the modification of stopwords.
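    Uses the NLTK stopword list by default. Pass a custom ``stopwords`` list
    to override it. Stopwords are left unchanged during augmentation, while
    the other words in the sentence may be replaced, deleted, and so on, so
    add any word you want to keep unchanged to this list.
    """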

    def __init__(self, stopwords=None, language="english"):
        if stopwords is not None:
            self.stopwords = set(stopwords)
        else:
            self.stopwords = set(nltk.corpus.stopwords.words(language))

    def _get_modifiable_indices(self, current_text):
        """Returns the word indices in ``current_text`` which are able to be
        modified."""
        non_stopword_indices = set()
        for i, word in enumerate(current_text.words):
            if word not in self.stopwords:
                non_stopword_indices.add(i)
        return non_stopword_indices

    def check_compatibility(self, transformation):
        """The stopword constraint only is concerned with word swaps since
        paraphrasing phrases containing stopwords is OK.

        Args:
            transformation: The ``Transformation`` to check compatibility with.
        """
        return transformation_consists_of_word_swaps(transformation)


# Set up transformation using CompositeTransformation()
transformation = CompositeTransformation([WordSwapEmbedding(), WordSwapWordNet(language='eng'), WordSwapChangeNumber(), WordSwapChangeLocation()])
# Set up constraints
constraints = [RepeatModification(), StopwordModification()]
# Create augmenter with specified parameters
augmenter = Augmenter(transformation=transformation, constraints=constraints, pct_words_to_swap=0.1, transformations_per_example=10)
s = 'what is this? it is a apple. my name is tom. i am china. i am 19 years old. i cannot speak english'
# Augment!
augmenter.augment(s)

Output:

['what is this? it is a abel. my christening is tenor. i am chinese. i am 14 ageing longtime. i noteworthy speech briton',
 'what is this? it is a abel. my denominations is randy. i am hoa. i am 4 anni antique. i discernible tell briton',
 'what is this? it is a apples. my appoint is tone. i am hua. i am 25 elderly longtime. i conspicuous speaks britons',
 'what is this? it is a apples. my behalf is tenor. i am wah. i am 10 olds longtime. i noteworthy speech england',
 'what is this? it is a apples. my behalf is thom. i am wah. i am 1 ageing seniors. i substantial discussing anglais',
 'what is this? it is a cake. my appointment is tony. i am chinaman. i am 38 age vecchio. i important talking frenchman',
 'what is this? it is a cheesecake. my appoints is tum. i am chinese. i am 1 ages antiquated. i momentous conversations brits',
 'what is this? it is a cheesecake. my naming is empty. i am wah. i am 11 year archaic. i sizable discussing francais',
 'what is this? it is a cobbler. my surname is tonda. i am wah. i am 7 decades viejo. i notable discussing spanish',
 'what is this? it is a quiche. my appointments is tono. i am hoa. i am 24 aged vecchio. i substantial talk briton']
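Every variant above keeps "what is this? it is a" untouched, which is consistent with the stopword constraint: only non-stopword positions are offered to the transformations. The sketch below reproduces that filtering step in plain Python. It is a simplified illustration, not textattack's code: `modifiable_indices`, the regex tokenizer, and the hand-written `STOPWORDS` subset are assumptions (the class above falls back to the NLTK list when no stopwords are passed).

```python
import re

# Hypothetical hand-written subset of English stopwords, used here so the
# example runs without downloading the NLTK corpus.
STOPWORDS = {"what", "is", "this", "it", "a", "my", "i", "am"}

def modifiable_indices(sentence, stopwords):
    """Return the words of the sentence and the indices of its non-stopwords."""
    words = re.findall(r"[A-Za-z0-9']+", sentence)  # crude tokenizer, drops punctuation
    indices = {i for i, w in enumerate(words) if w not in stopwords}
    return words, indices

s = "what is this? it is a apple. my name is tom."
words, indices = modifiable_indices(s, STOPWORDS)
print([words[i] for i in sorted(indices)])  # → ['apple', 'name', 'tom']
```

Like the class above, the membership check is case-sensitive, so a capitalized "What" would not be protected unless it is added to the list or the text is lower-cased first.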
