About #今日arXiv精选 (Today's arXiv Picks)
This is a column from 「AI 学术前沿」 (AI Academic Frontiers): every day the editors select high-quality papers from arXiv and deliver them to readers.
Neural Machine Translation Quality and Post-Editing Performance
Comment: 9 pages, 1 page appendix. To be presented at EMNLP2021
Link: http://arxiv.org/abs/2109.05016
Abstract
We test the natural expectation that using MT in professional translation saves human processing time. The last such study was carried out by Sanchez-Torron and Koehn (2016) with phrase-based MT, artificially reducing the translation quality. In contrast, we focus on neural MT (NMT) of high quality, which has become the state-of-the-art approach since then and also got adopted by most translation companies. Through an experimental study involving over 30 professional translators for English->Czech translation, we examine the relationship between NMT performance and post-editing time and quality. Across all models, we found that better MT systems indeed lead to fewer changes in the sentences in this industry setting. The relation between system quality and post-editing time is however not straightforward and, contrary to the results on phrase-based MT, BLEU is definitely not a stable predictor of the time or final output quality.
BiSECT: Learning to Split and Rephrase Sentences with Bitexts
Comment: 9 pages, 9 figures. Long paper to appear in Empirical Methods in Natural Language Processing 2021 (EMNLP 2021)
Link: http://arxiv.org/abs/2109.05006
Abstract
An important task in NLP applications such as sentence simplification is the ability to take a long, complex sentence and split it into shorter sentences, rephrasing as necessary. We introduce a novel dataset and a new model for this 'split and rephrase' task. Our BiSECT training data consists of 1 million long English sentences paired with shorter, meaning-equivalent English sentences. We obtain these by extracting 1-2 sentence alignments in bilingual parallel corpora and then using machine translation to convert both sides of the corpus into the same language. BiSECT contains higher quality training examples than previous Split and Rephrase corpora, with sentence splits that require more significant modifications. We categorize examples in our corpus, and use these categories in a novel model that allows us to target specific regions of the input sentence to be split and edited. Moreover, we show that models trained on BiSECT can perform a wider variety of split operations and improve upon previous state-of-the-art approaches in automatic and human evaluations.
Distantly-Supervised Named Entity Recognition with Noise-Robust Learning and Language Model Augmented Self-Training
Comment: EMNLP 2021. (Code: https://github.com/yumeng5/RoSTER)
Link: http://arxiv.org/abs/2109.05003
Abstract
We study the problem of training named entity recognition (NER) models using only distantly-labeled data, which can be automatically obtained by matching entity mentions in the raw text with entity types in a knowledge base. The biggest challenge of distantly-supervised NER is that the distant supervision may induce incomplete and noisy labels, rendering the straightforward application of supervised learning ineffective. In this paper, we propose (1) a noise-robust learning scheme comprised of a new loss function and a noisy label removal step, for training NER models on distantly-labeled data, and (2) a self-training method that uses contextualized augmentations created by pre-trained language models to improve the generalization ability of the NER model. On three benchmark datasets, our method achieves superior performance, outperforming existing distantly-supervised NER models by significant margins.
Topic-Aware Contrastive Learning for Abstractive Dialogue Summarization
Comment: EMNLP 2021
Link: http://arxiv.org/abs/2109.04994
Abstract
Unlike well-structured text, such as news reports and encyclopedia articles, dialogue content often comes from two or more interlocutors, exchanging information with each other. In such a scenario, the topic of a conversation can vary upon progression and the key information for a certain topic is often scattered across multiple utterances of different speakers, which poses challenges to abstractly summarize dialogues. To capture the various topic information of a conversation and outline salient facts for the captured topics, this work proposes two topic-aware contrastive learning objectives, namely coherence detection and sub-summary generation objectives, which are expected to implicitly model the topic change and handle information scattering challenges for the dialogue summarization task. The proposed contrastive objectives are framed as auxiliary tasks for the primary dialogue summarization task, united via an alternative parameter updating strategy. Extensive experiments on benchmark datasets demonstrate that the proposed simple method significantly outperforms strong baselines and achieves new state-of-the-art performance. The code and trained models are publicly available at https://github.com/Junpliu/ConDigSum.
Does Pretraining for Summarization Require Knowledge Transfer?
Comment: Camera-ready for Findings of EMNLP 2021
Link: http://arxiv.org/abs/2109.04953
Abstract
Pretraining techniques leveraging enormous datasets have driven recent advances in text summarization. While folk explanations suggest that knowledge transfer accounts for pretraining's benefits, little is known about why it works or what makes a pretraining task or dataset suitable. In this paper, we challenge the knowledge transfer story, showing that, by pretraining on documents consisting of character n-grams selected at random, we can nearly match the performance of models pretrained on real corpora. This work holds the promise of eliminating upstream corpora, which may alleviate some concerns over offensive language, bias, and copyright issues. To see whether the small residual benefit of using real data could be accounted for by the structure of the pretraining task, we design several tasks motivated by a qualitative study of summarization corpora. However, these tasks confer no appreciable benefit, leaving open the possibility of a small role for knowledge transfer.
Tiered Reasoning for Intuitive Physics: Toward Verifiable Commonsense Language Understanding
Comment: Accepted to Findings of EMNLP 2021
Link: http://arxiv.org/abs/2109.04947
Abstract
Large-scale, pre-trained language models (LMs) have achieved human-level performance on a breadth of language understanding tasks. However, evaluations only based on end task performance shed little light on machines' true ability in language understanding and reasoning. In this paper, we highlight the importance of evaluating the underlying reasoning process in addition to end performance. Toward this goal, we introduce Tiered Reasoning for Intuitive Physics (TRIP), a novel commonsense reasoning dataset with dense annotations that enable multi-tiered evaluation of machines' reasoning process. Our empirical results show that while large LMs can achieve high end performance, they struggle to support their predictions with valid supporting evidence. The TRIP dataset and our baseline results will motivate verifiable evaluation of commonsense reasoning and facilitate future research toward developing better language understanding and reasoning models.
Modeling Human Sentence Processing with Left-Corner Recurrent Neural Network Grammars
Comment: Accepted by EMNLP 2021
Link: http://arxiv.org/abs/2109.04939
Abstract
In computational linguistics, it has been shown that hierarchical structures make language models (LMs) more human-like. However, the previous literature has been agnostic about a parsing strategy of the hierarchical models. In this paper, we investigated whether hierarchical structures make LMs more human-like, and if so, which parsing strategy is most cognitively plausible. In order to address this question, we evaluated three LMs against human reading times in Japanese with head-final left-branching structures: Long Short-Term Memory (LSTM) as a sequential model and Recurrent Neural Network Grammars (RNNGs) with top-down and left-corner parsing strategies as hierarchical models. Our computational modeling demonstrated that left-corner RNNGs outperformed top-down RNNGs and LSTM, suggesting that hierarchical and left-corner architectures are more cognitively plausible than top-down or sequential architectures. In addition, the relationships between the cognitive plausibility and (i) perplexity, (ii) parsing, and (iii) beam size will also be discussed.
Beyond the Tip of the Iceberg: Assessing Coherence of Text Classifiers
Comment: Accepted to Findings of EMNLP 2021
Link: http://arxiv.org/abs/2109.04922
Abstract
As large-scale, pre-trained language models achieve human-level and superhuman accuracy on existing language understanding tasks, statistical bias in benchmark data and probing studies have recently called into question their true capabilities. For a more informative evaluation than accuracy on text classification tasks can offer, we propose evaluating systems through a novel measure of prediction coherence. We apply our framework to two existing language understanding benchmarks with different properties to demonstrate its versatility. Our experimental results show that this evaluation framework, although simple in ideas and implementation, is a quick, effective, and versatile measure to provide insight into the coherence of machines' predictions.
Examining Cross-lingual Contextual Embeddings with Orthogonal Structural Probes
Comment: EMNLP 2021 Main Conference
Link: http://arxiv.org/abs/2109.04921
Abstract
State-of-the-art contextual embeddings are obtained from large language models available only for a few languages. For others, we need to learn representations using a multilingual model. There is an ongoing debate on whether multilingual embeddings can be aligned in a space shared across many languages. The novel Orthogonal Structural Probe (Limisiewicz and Mareček, 2021) allows us to answer this question for specific linguistic features and learn a projection based only on mono-lingual annotated datasets. We evaluate syntactic (UD) and lexical (WordNet) structural information encoded in mBERT's contextual representations for nine diverse languages. We observe that for languages closely related to English, no transformation is needed. The evaluated information is encoded in a shared cross-lingual embedding space. For other languages, it is beneficial to apply orthogonal transformation learned separately for each language. We successfully apply our findings to zero-shot and few-shot cross-lingual parsing.
ReasonBERT: Pre-trained to Reason with Distant Supervision
Comment: Accepted to EMNLP'2021. Our code and pre-trained models are available at https://github.com/sunlab-osu/ReasonBERT
Link: http://arxiv.org/abs/2109.04912
Abstract
We present ReasonBert, a pre-training method that augments language models with the ability to reason over long-range relations and multiple, possibly hybrid contexts. Unlike existing pre-training methods that only harvest learning signals from local contexts of naturally occurring texts, we propose a generalized notion of distant supervision to automatically connect multiple pieces of text and tables to create pre-training examples that require long-range reasoning. Different types of reasoning are simulated, including intersecting multiple pieces of evidence, bridging from one piece of evidence to another, and detecting unanswerable cases. We conduct a comprehensive evaluation on a variety of extractive question answering datasets ranging from single-hop to multi-hop and from text-only to table-only to hybrid that require various reasoning capabilities and show that ReasonBert achieves remarkable improvement over an array of strong baselines. Few-shot experiments further demonstrate that our pre-training method substantially improves sample efficiency.
Document-level Entity-based Extraction as Template Generation
Comment: 13 pages. EMNLP 2021
Link: http://arxiv.org/abs/2109.04901
Abstract
Document-level entity-based extraction (EE), aiming at extracting entity-centric information such as entity roles and entity relations, is key to automatic knowledge acquisition from text corpora for various domains. Most document-level EE systems build extractive models, which struggle to model long-term dependencies among entities at the document level. To address this issue, we propose a generative framework for two document-level EE tasks: role-filler entity extraction (REE) and relation extraction (RE). We first formulate them as a template generation problem, allowing models to efficiently capture cross-entity dependencies, exploit label semantics, and avoid the exponential computation complexity of identifying N-ary relations. A novel cross-attention guided copy mechanism, TopK Copy, is incorporated into a pre-trained sequence-to-sequence model to enhance the capabilities of identifying key information in the input document. Experiments done on the MUC-4 and SciREX datasets show new state-of-the-art results on REE (+3.26%), binary RE (+4.8%), and 4-ary RE (+2.7%) in F1 score.
Efficient Test Time Adapter Ensembling for Low-resource Language Varieties
Comment: EMNLP 2021 Findings
Link: http://arxiv.org/abs/2109.04877
Abstract
Adapters are light-weight modules that allow parameter-efficient fine-tuning of pretrained models. Specialized language and task adapters have recently been proposed to facilitate cross-lingual transfer of multilingual pretrained models (Pfeiffer et al., 2020b). However, this approach requires training a separate language adapter for every language one wishes to support, which can be impractical for languages with limited data. An intuitive solution is to use a related language adapter for the new language variety, but we observe that this solution can lead to sub-optimal performance. In this paper, we aim to improve the robustness of language adapters to uncovered languages without training new adapters. We find that ensembling multiple existing language adapters makes the fine-tuned model significantly more robust to other language varieties not included in these adapters. Building upon this observation, we propose Entropy Minimized Ensemble of Adapters (EMEA), a method that optimizes the ensemble weights of the pretrained language adapters for each test sentence by minimizing the entropy of its predictions. Experiments on three diverse groups of language varieties show that our method leads to significant improvements on both named entity recognition and part-of-speech tagging across all languages.
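The test-time entropy-minimization step is easy to illustrate. The sketch below is ours, not the authors' code: it mixes per-adapter output logits with weights obtained by a few gradient steps on the prediction entropy, whereas the actual EMEA mixes adapter activations inside the transformer, so treat it only as a minimal picture of the idea.

import torch

def emea_weights(adapter_logits, steps=10, lr=0.1):
    """Minimal sketch of test-time entropy minimization over adapter outputs.

    adapter_logits: tensor of shape (num_adapters, seq_len, num_labels),
    e.g. token-level NER/POS logits produced by running the same sentence
    through each language adapter.  Simplification: we only mix output
    logits, not the adapters' internal activations.
    """
    num_adapters = adapter_logits.shape[0]
    alpha = torch.zeros(num_adapters, requires_grad=True)   # ensemble logits
    opt = torch.optim.SGD([alpha], lr=lr)
    for _ in range(steps):
        w = torch.softmax(alpha, dim=0)                      # ensemble weights
        mixed = (w[:, None, None] * adapter_logits).sum(0)   # weighted logits
        probs = torch.softmax(mixed, dim=-1)
        entropy = -(probs * probs.clamp_min(1e-9).log()).sum(-1).mean()
        opt.zero_grad()
        entropy.backward()
        opt.step()
    return torch.softmax(alpha.detach(), dim=0)

# toy usage: 3 adapters, 5 tokens, 7 labels
print(emea_weights(torch.randn(3, 5, 7)))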
Studying word order through iterative shuffling
Comment: EMNLP 2021
Link: http://arxiv.org/abs/2109.04867
Abstract
As neural language models approach human performance on NLP benchmark tasks, their advances are widely seen as evidence of an increasingly complex understanding of syntax. This view rests upon a hypothesis that has not yet been empirically tested: that word order encodes meaning essential to performing these tasks. We refute this hypothesis in many cases: in the GLUE suite and in various genres of English text, the words in a sentence or phrase can rarely be permuted to form a phrase carrying substantially different information. Our surprising result relies on inference by iterative shuffling (IBIS), a novel, efficient procedure that finds the ordering of a bag of words having the highest likelihood under a fixed language model. IBIS can use any black-box model without additional training and is superior to existing word ordering algorithms. Coalescing our findings, we discuss how shuffling inference procedures such as IBIS can benefit language modeling and constrained generation.
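The abstract does not spell out IBIS itself, but the general idea of searching for a high-likelihood ordering of a bag of words under a fixed, black-box scorer can be sketched as follows. The hill-climbing loop and the toy scorer are our own illustration, not the paper's algorithm.

import random

def order_search(bag, log_prob, iters=200, seed=0):
    """Greedy hill-climbing sketch of black-box word ordering.

    `log_prob` is any callable scoring a candidate ordering (list of words)
    under a fixed language model; IBIS is a more sophisticated shuffling
    procedure, this only shows the propose-and-keep-best idea.
    """
    rng = random.Random(seed)
    best = list(bag)
    rng.shuffle(best)
    best_score = log_prob(best)
    for _ in range(iters):
        cand = best[:]
        i, j = rng.sample(range(len(cand)), 2)   # propose a transposition
        cand[i], cand[j] = cand[j], cand[i]
        score = log_prob(cand)
        if score > best_score:                   # keep only improving moves
            best, best_score = cand, score
    return best, best_score

# toy "language model": count adjacent pairs from a preferred bigram set
PREFERRED = {("the", "cat"), ("cat", "sat"), ("sat", "on"),
             ("on", "the"), ("the", "mat")}
toy_score = lambda ws: sum((a, b) in PREFERRED for a, b in zip(ws, ws[1:]))
print(order_search(["mat", "the", "on", "sat", "cat", "the"], toy_score))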
CoPHE: A Count-Preserving Hierarchical Evaluation Metric in Large-Scale Multi-Label Text Classification
Comment: 5 pages, 2 figures, EMNLP 2021
Link: http://arxiv.org/abs/2109.04853
Abstract
Large-Scale Multi-Label Text Classification (LMTC) includes tasks with hierarchical label spaces, such as automatic assignment of ICD-9 codes to discharge summaries. Performance of models in prior art is evaluated with standard precision, recall, and F1 measures without regard for the rich hierarchical structure. In this work we argue for hierarchical evaluation of the predictions of neural LMTC models. With the example of the ICD-9 ontology we describe a structural issue in the representation of the structured label space in prior art, and propose an alternative representation based on the depth of the ontology. We propose a set of metrics for hierarchical evaluation using the depth-based representation. We compare the evaluation scores from the proposed metrics with previously used metrics on prior art LMTC models for ICD-9 coding in MIMIC-III. We also propose further avenues of research involving the proposed ontological representation.
Block Pruning For Faster Transformers
Comment: EMNLP 2021. Code, hyper-parameters, evaluation results and checkpoints available at https://github.com/huggingface/nn_pruning
Link: http://arxiv.org/abs/2109.04838
Abstract
Pre-training has improved model accuracy for both classification and generation tasks at the cost of introducing much larger and slower models. Pruning methods have proven to be an effective way of reducing model size, whereas distillation methods are proven for speeding up inference. We introduce a block pruning approach targeting both small and fast models. Our approach extends structured methods by considering blocks of any size and integrates this structure into the movement pruning paradigm for fine-tuning. We find that this approach learns to prune out full components of the underlying model, such as attention heads. Experiments consider classification and generation tasks, yielding among other results a pruned model that is a 2.4x faster, 74% smaller BERT on SQuAD v1, with a 1% drop on F1, competitive both with distilled models in speed and pruned models in size.
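To make the block-level idea concrete, here is a small sketch (ours, under simplifying assumptions): each fixed-size tile of a weight matrix gets one importance score, and the lowest-scoring tiles are zeroed out. In the actual method the scores are learned jointly with fine-tuning in the movement-pruning framework; here they are just given.

import torch

def block_mask(weight, scores, block=(32, 32), keep_ratio=0.5):
    """Keep the highest-scoring (block x block) tiles of `weight`, zero the rest.

    `scores` holds one importance score per tile; shape (rows//br, cols//bc).
    """
    rows, cols = weight.shape
    br, bc = block
    k = max(1, int(keep_ratio * scores.numel()))
    thresh = scores.flatten().topk(k).values.min()
    mask = (scores >= thresh).float()
    full = mask.repeat_interleave(br, 0).repeat_interleave(bc, 1)
    return weight * full

w = torch.randn(128, 128)
s = torch.randn(128 // 32, 128 // 32)           # one score per 32x32 block
pruned = block_mask(w, s)
print((pruned == 0).float().mean())             # fraction of zeroed weights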
An Evaluation Dataset and Strategy for Building Robust Multi-turn Response Selection Model
Comment: EMNLP 2021
Link: http://arxiv.org/abs/2109.04834
Abstract
Multi-turn response selection models have recently shown comparable performance to humans in several benchmark datasets. However, in the real environment, these models often have weaknesses, such as making incorrect predictions based heavily on superficial patterns without a comprehensive understanding of the context. For example, these models often give a high score to the wrong response candidate containing several keywords related to the context but using an inconsistent tense. In this study, we analyze the weaknesses of open-domain Korean multi-turn response selection models and publish an adversarial dataset to evaluate these weaknesses. We also suggest a strategy to build a robust model in this adversarial environment.
Asking It All: Generating Contextualized Questions for any Semantic Role
Comment: Accepted as a long paper to EMNLP 2021, Main Conference
Link: http://arxiv.org/abs/2109.04832
Abstract
Asking questions about a situation is an inherent step towards understanding it. To this end, we introduce the task of role question generation, which, given a predicate mention and a passage, requires producing a set of questions asking about all possible semantic roles of the predicate. We develop a two-stage model for this task, which first produces a context-independent question prototype for each role and then revises it to be contextually appropriate for the passage. Unlike most existing approaches to question generation, our approach does not require conditioning on existing answers in the text. Instead, we condition on the type of information to inquire about, regardless of whether the answer appears explicitly in the text, could be inferred from it, or should be sought elsewhere. Our evaluation demonstrates that we generate diverse and well-formed questions for a large, broad-coverage ontology of predicates and roles.
Artificial Text Detection via Examining the Topology of Attention Maps
Comment: Accepted to EMNLP 2021
Link: http://arxiv.org/abs/2109.04825
Abstract
The impressive capabilities of recent generative models to create texts that are challenging to distinguish from human-written ones can be misused for generating fake news, product reviews, and even abusive content. Despite the prominent performance of existing methods for artificial text detection, they still lack interpretability and robustness towards unseen models. To this end, we propose three novel types of interpretable topological features for this task based on Topological Data Analysis (TDA), which is currently understudied in the field of NLP. We empirically show that the features derived from the BERT model outperform count- and neural-based baselines by up to 10% on three common datasets, and tend to be the most robust towards unseen GPT-style generation models as opposed to existing methods. The probing analysis of the features reveals their sensitivity to surface and syntactic properties. The results demonstrate that TDA is a promising line with respect to NLP tasks, specifically the ones that incorporate surface and structural information.
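The "attention map -> graph -> topology" pipeline can be illustrated with a toy example. The features below (edge counts and connected components at a few thresholds) are only a crude, zero-dimensional stand-in of our own making; the paper's actual TDA features are richer.

import numpy as np
import networkx as nx

def attention_graph_features(attn, thresholds=(0.05, 0.1, 0.2)):
    """Toy topological summaries of one (seq_len x seq_len) attention map.

    For each threshold, build an undirected graph over token positions with
    an edge wherever attention exceeds the threshold, then record the edge
    count and the number of connected components.
    """
    n = attn.shape[0]
    feats = []
    for t in thresholds:
        g = nx.Graph()
        g.add_nodes_from(range(n))
        for i in range(n):
            for j in range(i + 1, n):
                if max(attn[i, j], attn[j, i]) > t:
                    g.add_edge(i, j)
        feats += [g.number_of_edges(), nx.number_connected_components(g)]
    return feats

rng = np.random.default_rng(0)
fake_attn = rng.dirichlet(np.ones(12), size=12)   # rows sum to 1, like softmax
print(attention_graph_features(fake_attn))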
Does It Capture STEL? A Modular, Similarity-based Linguistic Style Evaluation Framework
Comment: Accepted at EMNLP2021
Link: http://arxiv.org/abs/2109.04817
Abstract
Style is an integral part of natural language. However, evaluation methods for style measures are rare, often task-specific and usually do not control for content. We propose the modular, fine-grained and content-controlled similarity-based STyle EvaLuation framework (STEL) to test the performance of any model that can compare two sentences on style. We illustrate STEL with two general dimensions of style (formal/informal and simple/complex) as well as two specific characteristics of style (contrac'tion and numb3r substitution). We find that BERT-based methods outperform simple versions of commonly used style measures like 3-grams, punctuation frequency and LIWC-based approaches. We invite the addition of further tasks and task instances to STEL and hope to facilitate the improvement of style-sensitive measures.
Mixture-of-Partitions: Infusing Large Biomedical Knowledge Graphs into BERT
Comment: EMNLP 2021 camera-ready version
Link: http://arxiv.org/abs/2109.04810
Abstract
Infusing factual knowledge into pre-trained models is fundamental for many knowledge-intensive tasks. In this paper, we propose Mixture-of-Partitions (MoP), an infusion approach that can handle a very large knowledge graph (KG) by partitioning it into smaller sub-graphs and infusing their specific knowledge into various BERT models using lightweight adapters. To leverage the overall factual knowledge for a target task, these sub-graph adapters are further fine-tuned along with the underlying BERT through a mixture layer. We evaluate our MoP with three biomedical BERTs (SciBERT, BioBERT, PubmedBERT) on six downstream tasks (inc. NLI, QA, Classification), and the results show that our MoP consistently enhances the underlying BERTs in task performance, and achieves new SOTA performances on five evaluated datasets.
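A mixture layer over several bottleneck adapters can be sketched as below. This is our own minimal illustration of the MoP-style idea; the paper's layer placement, gating details and training schedule may differ.

import torch
import torch.nn as nn

class AdapterMixture(nn.Module):
    """Gated mixture over sub-graph adapters, added residually to the hidden state.

    Each adapter is a small bottleneck MLP; a gating network produces per-token
    mixture weights over the adapters.
    """
    def __init__(self, hidden=768, bottleneck=64, num_adapters=4):
        super().__init__()
        self.adapters = nn.ModuleList(
            nn.Sequential(nn.Linear(hidden, bottleneck), nn.ReLU(),
                          nn.Linear(bottleneck, hidden))
            for _ in range(num_adapters))
        self.gate = nn.Linear(hidden, num_adapters)

    def forward(self, h):                                    # h: (batch, seq, hidden)
        weights = torch.softmax(self.gate(h), dim=-1)        # per-token gate
        outs = torch.stack([a(h) for a in self.adapters], dim=-1)
        mixed = (outs * weights.unsqueeze(-2)).sum(-1)
        return h + mixed                                     # residual connection

layer = AdapterMixture()
print(layer(torch.randn(2, 5, 768)).shape)                   # torch.Size([2, 5, 768])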
Exophoric Pronoun Resolution in Dialogues with Topic Regularization
Comment: EMNLP 2021 main conference
Link: http://arxiv.org/abs/2109.04787
Abstract
Resolving pronouns to their referents has long been studied as a fundamental natural language understanding problem. Previous works on pronoun coreference resolution (PCR) mostly focus on resolving pronouns to mentions in text while ignoring the exophoric scenario. Exophoric pronouns are common in daily communications, where speakers may directly use pronouns to refer to some objects present in the environment without introducing the objects first. Although such objects are not mentioned in the dialogue text, they can often be disambiguated by the general topics of the dialogue. Motivated by this, we propose to jointly leverage the local context and global topics of dialogues to solve the out-of-text PCR problem. Extensive experiments demonstrate the effectiveness of adding topic regularization for resolving exophoric pronouns.
RoR: Read-over-Read for Long Document Machine Reading Comprehension
Comment: Accepted as findings of EMNLP2021
Link: http://arxiv.org/abs/2109.04780
Abstract
Transformer-based pre-trained models, such as BERT, have achieved remarkable results on machine reading comprehension. However, due to the constraint of encoding length (e.g., 512 WordPiece tokens), a long document is usually split into multiple chunks that are independently read. This limits the reading field to individual chunks without information collaboration for long document machine reading comprehension. To address this problem, we propose RoR, a read-over-read method, which expands the reading field from chunk to document. Specifically, RoR includes a chunk reader and a document reader. The former first predicts a set of regional answers for each chunk, which are then compacted into a highly-condensed version of the original document, guaranteed to be encoded once. The latter further predicts the global answers from this condensed document. Eventually, a voting strategy is utilized to aggregate and rerank the regional and global answers for the final prediction. Extensive experiments on two benchmarks, QuAC and TriviaQA, demonstrate the effectiveness of RoR for long document reading. Notably, RoR ranked 1st place on the QuAC leaderboard (https://quac.ai/) at the time of submission (May 17th, 2021).
Improving Multilingual Translation by Representation and Gradient Regularization
Comment: EMNLP 2021 (Long)
Link: http://arxiv.org/abs/2109.04778
Abstract
Multilingual Neural Machine Translation (NMT) enables one model to serve all translation directions, including ones that are unseen during training, i.e. zero-shot translation. Despite being theoretically attractive, current models often produce low quality translations -- commonly failing to even produce outputs in the right target language. In this work, we observe that off-target translation is dominant even in strong multilingual systems, trained on massive multilingual corpora. To address this issue, we propose a joint approach to regularize NMT models at both representation-level and gradient-level. At the representation level, we leverage an auxiliary target language prediction task to regularize decoder outputs to retain information about the target language. At the gradient level, we leverage a small amount of direct data (in thousands of sentence pairs) to regularize model gradients. Our results demonstrate that our approach is highly effective in both reducing off-target translation occurrences and improving zero-shot translation performance by +5.59 and +10.38 BLEU on WMT and OPUS datasets respectively. Moreover, experiments show that our method also works well when the small amount of direct data is not available.
A Strong Baseline for Query Efficient Attacks in a Black Box Setting
Comment: EMNLP 2021 - Main Conference
Link: http://arxiv.org/abs/2109.04775
Abstract
Existing black box search methods have achieved a high success rate in generating adversarial attacks against NLP models. However, such search methods are inefficient as they do not consider the amount of queries required to generate adversarial attacks. Also, prior attacks do not maintain a consistent search space while comparing different search methods. In this paper, we propose a query efficient attack strategy to generate plausible adversarial examples on text classification and entailment tasks. Our attack jointly leverages attention mechanism and locality sensitive hashing (LSH) to reduce the query count. We demonstrate the efficacy of our approach by comparing our attack with four baselines across three different search spaces. Further, we benchmark our results across the same search space used in prior attacks. In comparison to previously proposed attacks, on average, we are able to reduce the query count by 75% across all datasets and target models. We also demonstrate that our attack achieves a higher success rate when compared to prior attacks in a limited query setting.
How Does Fine-tuning Affect the Geometry of Embedding Space: A Case Study on Isotropy
Comment: To appear in Findings of EMNLP 2021
Link: http://arxiv.org/abs/2109.04740
Abstract
It is widely accepted that fine-tuning pre-trained language models usually brings about performance improvements in downstream tasks. However, there are limited studies on the reasons behind this effectiveness, particularly from the viewpoint of structural changes in the embedding space. Trying to fill this gap, in this paper, we analyze the extent to which the isotropy of the embedding space changes after fine-tuning. We demonstrate that, even though isotropy is a desirable geometrical property, fine-tuning does not necessarily result in isotropy enhancements. Moreover, local structures in pre-trained contextual word representations (CWRs), such as those encoding token types or frequency, undergo a massive change during fine-tuning. Our experiments show dramatic growth in the number of elongated directions in the embedding space, which, in contrast to pre-trained CWRs, carry the essential linguistic knowledge in the fine-tuned embedding space, making existing isotropy enhancement methods ineffective.
Genre as Weak Supervision for Cross-lingual Dependency Parsing
Comment: Accepted to EMNLP 2021 (Main Conference)
Link: http://arxiv.org/abs/2109.04733
Abstract
Recent work has shown that monolingual masked language models learn to represent data-driven notions of language variation which can be used for domain-targeted training data selection. Dataset genre labels are already frequently available, yet remain largely unexplored in cross-lingual setups. We harness this genre metadata as a weak supervision signal for targeted data selection in zero-shot dependency parsing. Specifically, we project treebank-level genre information to the finer-grained sentence level, with the goal to amplify information implicitly stored in unsupervised contextualized representations. We demonstrate that genre is recoverable from multilingual contextual embeddings and that it provides an effective signal for training data selection in cross-lingual, zero-shot scenarios. For 12 low-resource language treebanks, six of which are test-only, our genre-specific methods significantly outperform competitive baselines as well as recent embedding-based methods for data selection. Moreover, genre-based data selection provides new state-of-the-art results for three of these target languages.
Assessing the Reliability of Word Embedding Gender Bias Measures
Comment: 23 pages, 24 figures, 3 tables. Accepted to EMNLP 2021
Link: http://arxiv.org/abs/2109.04732
Abstract
Various measures have been proposed to quantify human-like social biases in word embeddings. However, bias scores based on these measures can suffer from measurement error. One indication of measurement quality is reliability, concerning the extent to which a measure produces consistent results. In this paper, we assess three types of reliability of word embedding gender bias measures, namely test-retest reliability, inter-rater consistency and internal consistency. Specifically, we investigate the consistency of bias scores across different choices of random seeds, scoring rules and words. Furthermore, we analyse the effects of various factors on these measures' reliability scores. Our findings inform better design of word embedding gender bias measures. Moreover, we urge researchers to be more critical about the application of such measures.
AfroMT: Pretraining Strategies and Reproducible Benchmarks for Translation of 8 African Languages
Comment: EMNLP 2021
Link: http://arxiv.org/abs/2109.04715
Abstract
Reproducible benchmarks are crucial in driving progress of machine translation research. However, existing machine translation benchmarks have been mostly limited to high-resource or well-represented languages. Despite an increasing interest in low-resource machine translation, there are no standardized reproducible benchmarks for many African languages, many of which are used by millions of speakers but have less digitized textual data. To tackle these challenges, we propose AfroMT, a standardized, clean, and reproducible machine translation benchmark for eight widely spoken African languages. We also develop a suite of analysis tools for system diagnosis taking into account the unique properties of these languages. Furthermore, we explore the newly considered case of low-resource focused pretraining and develop two novel data augmentation-based strategies, leveraging word-level alignment information and pseudo-monolingual data for pretraining multilingual sequence-to-sequence models. We demonstrate significant improvements when pretraining on 11 languages, with gains of up to 2 BLEU points over strong baselines. We also show gains of up to 12 BLEU points over cross-lingual transfer baselines in data-constrained scenarios. All code and pretrained models will be released as further steps towards larger reproducible benchmarks for African languages.
Balancing Methods for Multi-label Text Classification with Long-Tailed Class Distribution
Comment: EMNLP 2021
Link: http://arxiv.org/abs/2109.04712
Abstract
Multi-label text classification is a challenging task because it requires capturing label dependencies. It becomes even more challenging when class distribution is long-tailed. Resampling and re-weighting are common approaches used for addressing the class imbalance problem, however, they are not effective when there is label dependency besides class imbalance because they result in oversampling of common labels. Here, we introduce the application of balancing loss functions for multi-label text classification. We perform experiments on a general domain dataset with 90 labels (Reuters-21578) and a domain-specific dataset from PubMed with 18211 labels. We find that a distribution-balanced loss function, which inherently addresses both the class imbalance and label linkage problems, outperforms commonly used loss functions. Distribution balancing methods have been successfully used in the image recognition field. Here, we show their effectiveness in natural language processing. Source code is available at https://github.com/blessu/BalancedLossNLP.
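For intuition, the simplest member of the balancing-loss family is a class-rebalanced binary cross-entropy; the sketch below is ours and only covers frequency rebalancing, whereas the paper's distribution-balanced loss additionally handles label co-occurrence and negative-tolerant regularization.

import torch
import torch.nn.functional as F

def reweighted_bce(logits, targets, label_freq, beta=0.999):
    """Class-balanced BCE: weight each label by (1 - beta) / (1 - beta**n_c),
    the "effective number of samples" scheme, so rare labels count more."""
    n_c = label_freq.float()                                  # (num_labels,)
    weights = (1.0 - beta) / (1.0 - torch.pow(beta, n_c))
    weights = weights / weights.mean()                        # normalize scale
    return F.binary_cross_entropy_with_logits(
        logits, targets, weight=weights.expand_as(logits))

logits = torch.randn(4, 6)                    # batch of 4 documents, 6 labels
targets = torch.randint(0, 2, (4, 6)).float()
freq = torch.tensor([500, 200, 50, 10, 5, 1]) # long-tailed label counts
print(reweighted_bce(logits, targets, freq))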
Pre-train or Annotate? Domain Adaptation with a Constrained Budget
Comment: Accepted to EMNLP 2021
Link: http://arxiv.org/abs/2109.04711
Abstract
Recent work has demonstrated that pre-training in-domain language models can boost performance when adapting to a new domain. However, the costs associated with pre-training raise an important question: given a fixed budget, what steps should an NLP practitioner take to maximize performance? In this paper, we study domain adaptation under budget constraints, and approach it as a customer choice problem between data annotation and pre-training. Specifically, we measure the annotation cost of three procedural text datasets and the pre-training cost of three in-domain language models. Then we evaluate the utility of different combinations of pre-training and data annotation under varying budget constraints to assess which combination strategy works best. We find that, for small budgets, spending all funds on annotation leads to the best performance; once the budget becomes large enough, a combination of data annotation and in-domain pre-training works more optimally. We therefore suggest that task-specific data annotation should be part of an economical strategy when adapting an NLP model to a new domain.
Knowledge-Aware Meta-learning for Low-Resource Text Classification
Comment: Accepted by EMNLP 2021
Link: http://arxiv.org/abs/2109.04707
Abstract
Meta-learning has achieved great success in leveraging the historical learned knowledge to facilitate the learning process of the new task. However, merely learning the knowledge from the historical tasks, adopted by current meta-learning algorithms, may not generalize well to testing tasks when they are not well-supported by training tasks. This paper studies a low-resource text classification problem and bridges the gap between meta-training and meta-testing tasks by leveraging the external knowledge bases. Specifically, we propose KGML to introduce additional representation for each sentence learned from the extracted sentence-specific knowledge graph. The extensive experiments on three datasets demonstrate the effectiveness of KGML under both supervised adaptation and unsupervised adaptation settings.
Rethinking Zero-shot Neural Machine Translation: From a Perspective of Latent Variables
Comment: EMNLP Findings 2021
Link: http://arxiv.org/abs/2109.04705
Abstract
Zero-shot translation, directly translating between language pairs unseen in training, is a promising capability of multilingual neural machine translation (NMT). However, it usually suffers from capturing spurious correlations between the output language and language invariant semantics due to the maximum likelihood training objective, leading to poor transfer performance on zero-shot translation. In this paper, we introduce a denoising autoencoder objective based on pivot language into the traditional training objective to improve the translation accuracy on zero-shot directions. The theoretical analysis from the perspective of latent variables shows that our approach actually implicitly maximizes the probability distributions for zero-shot directions. On two benchmark machine translation datasets, we demonstrate that the proposed method is able to effectively eliminate the spurious correlations and significantly outperforms state-of-the-art methods with a remarkable performance. Our code is available at https://github.com/Victorwz/zs-nmt-dae.
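At a high level, the training objective adds a reconstruction term on noised pivot-language sentences to the usual NMT loss. The sketch below is our own illustration of that combination: the noising parameters are illustrative, and `nmt_loss_fn` / `dae_loss_fn` stand in for the model's actual loss computations.

import random

def add_noise(tokens, drop_prob=0.1, shuffle_window=3, seed=0):
    """Typical DAE-style noising: token dropout plus a bounded local shuffle."""
    rng = random.Random(seed)
    kept = [t for t in tokens if rng.random() > drop_prob]
    noisy = kept[:]
    for i in range(len(noisy)):
        j = min(len(noisy) - 1, i + rng.randint(0, shuffle_window - 1))
        noisy[i], noisy[j] = noisy[j], noisy[i]
    return noisy

def combined_loss(nmt_loss_fn, dae_loss_fn, parallel_batch, pivot_batch, lam=1.0):
    """Standard multilingual NMT loss on parallel data plus a DAE term that
    reconstructs pivot-language sentences from their noised versions."""
    return nmt_loss_fn(parallel_batch) + lam * dae_loss_fn(
        [(add_noise(s), s) for s in pivot_batch])

print(add_noise("we introduce a denoising autoencoder objective".split()))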
Heterogeneous Graph Neural Networks for Keyphrase Generation
Comment: Accepted by EMNLP 2021
Link: http://arxiv.org/abs/2109.04703
Abstract
The encoder-decoder framework achieves state-of-the-art results in keyphrase generation (KG) tasks by predicting both present keyphrases that appear in the source document and absent keyphrases that do not. However, relying solely on the source document can result in generating uncontrollable and inaccurate absent keyphrases. To address these problems, we propose a novel graph-based method that can capture explicit knowledge from related references. Our model first retrieves some document-keyphrase pairs similar to the source document from a pre-defined index as references. Then a heterogeneous graph is constructed to capture relationships of different granularities between the source document and its references. To guide the decoding process, a hierarchical attention and copy mechanism is introduced, which directly copies appropriate words from both the source document and its references based on their relevance and significance. The experimental results on multiple KG benchmarks show that the proposed model achieves significant improvements against other baseline models, especially with regard to absent keyphrase prediction.
Generating Self-Contained and Summary-Centric Question Answer Pairs via Differentiable Reward Imitation Learning
Comment: To appear in Proceedings of EMNLP 2021
Link: http://arxiv.org/abs/2109.04689
Abstract
Motivated by suggested question generation in conversational news recommendation systems, we propose a model for generating question-answer pairs (QA pairs) with self-contained, summary-centric questions and length-constrained, article-summarizing answers. We begin by collecting a new dataset of news articles with questions as titles and pairing them with summaries of varying length. This dataset is used to learn a QA pair generation model producing summaries as answers that balance brevity with sufficiency jointly with their corresponding questions. We then reinforce the QA pair generation process with a differentiable reward function to mitigate exposure bias, a common problem in natural language generation. Both automatic metrics and human evaluation demonstrate these QA pairs successfully capture the central gists of the articles and achieve high answer accuracy.
DIALKI: Knowledge Identification in Conversational Systems through Dialogue-Document Contextualization
Comment: EMNLP 2021 camera-ready
Link: http://arxiv.org/abs/2109.04673
Abstract
Identifying relevant knowledge to be used in conversational systems that are grounded in long documents is critical to effective response generation. We introduce a knowledge identification model that leverages the document structure to provide dialogue-contextualized passage encodings and better locate knowledge relevant to the conversation. An auxiliary loss captures the history of dialogue-document connections. We demonstrate the effectiveness of our model on two document-grounded conversational datasets and provide analyses showing generalization to unseen documents and long dialogue contexts.
Investigating Numeracy Learning Ability of a Text-to-Text Transfer Model
Comment: 7 pages, 10 figures, 5 tables, Accepted in the Findings of EMNLP 2021
Link: http://arxiv.org/abs/2109.04672
Abstract
Transformer-based pre-trained language models have been tremendously successful in most of the conventional NLP tasks. But they often struggle in those tasks where numerical understanding is required. Some possible reasons can be the tokenizers and pre-training objectives, which are not specifically designed to learn and preserve numeracy. Here we investigate the ability of the text-to-text transfer learning model (T5), which has outperformed its predecessors in the conventional NLP tasks, to learn numeracy. We consider four numeracy tasks: numeration, magnitude order prediction, finding minimum and maximum in a series, and sorting. We find that, although T5 models perform reasonably well in the interpolation setting, they struggle considerably in the extrapolation setting across all four tasks.
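The four probe types named in the abstract are easy to materialize as text-to-text examples; the prompt wording below is illustrative (ours), not the paper's exact templates.

import random

DIGITS = ["zero", "one", "two", "three", "four",
          "five", "six", "seven", "eight", "nine"]

def numeracy_probes(n=1, lo=0, hi=999, seed=0):
    """Toy text-to-text probes for numeration, magnitude (digit count as a
    proxy for order of magnitude), min/max, and sorting."""
    rng = random.Random(seed)
    probes = []
    for _ in range(n):
        nums = [rng.randint(lo, hi) for _ in range(5)]
        x = rng.choice(nums)
        spelled = " ".join(DIGITS[int(d)] for d in str(x))
        probes.append((f"numeration: write '{spelled}' in digits", str(x)))
        probes.append((f"magnitude: how many digits does {x} have?", str(len(str(x)))))
        probes.append((f"minmax: maximum of {nums}", str(max(nums))))
        probes.append((f"sort ascending: {nums}", str(sorted(nums))))
    return probes

for src, tgt in numeracy_probes():
    print(src, "->", tgt)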
Zero-Shot Dialogue State Tracking via Cross-Task Transfer
Comment: EMNLP 2021
Link: http://arxiv.org/abs/2109.04655
Abstract
Zero-shot transfer learning for dialogue state tracking (DST) enables us to handle a variety of task-oriented dialogue domains without the expense of collecting in-domain data. In this work, we propose to transfer the cross-task knowledge from general question answering (QA) corpora for the zero-shot DST task. Specifically, we propose TransferQA, a transferable generative QA model that seamlessly combines extractive QA and multi-choice QA via a text-to-text transformer framework, and tracks both categorical slots and non-categorical slots in DST. In addition, we introduce two effective ways to construct unanswerable questions, namely, negative question sampling and context truncation, which enable our model to handle "none" value slots in the zero-shot DST setting. The extensive experiments show that our approaches substantially improve the existing zero-shot and few-shot results on MultiWoz. Moreover, compared to the fully trained baseline on the Schema-Guided Dialogue dataset, our approach shows better generalization ability in unseen domains.
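The two unanswerable-question constructions mentioned in the abstract can be sketched as simple data transformations. The field names and the "none" target string below are illustrative assumptions, not the paper's exact preprocessing.

import random

def make_unanswerable(examples, seed=0):
    """Negative question sampling: pair a context with a question from another
    example.  Context truncation: cut the context before the gold answer span.
    Both yield examples whose target is "none"."""
    rng = random.Random(seed)
    out = []
    for i, ex in enumerate(examples):
        # negative question sampling
        other = examples[rng.choice([j for j in range(len(examples)) if j != i])]
        out.append({"question": other["question"], "context": ex["context"],
                    "answer": "none"})
        # context truncation: keep only text before the gold answer
        start = ex["context"].find(ex["answer"])
        if start > 0:
            out.append({"question": ex["question"],
                        "context": ex["context"][:start], "answer": "none"})
    return out

toy = [{"question": "what time is the reservation?",
        "context": "the table is booked for 7 pm at the north cafe.",
        "answer": "7 pm"},
       {"question": "which area is the hotel in?",
        "context": "the hotel is in the east part of town.",
        "answer": "east"}]
print(make_unanswerable(toy))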
Towards Developing a Multilingual and Code-Mixed Visual Question Answering System by Knowledge Distillation
Comment: Accepted in EMNLP-Findings (2021)
Link: http://arxiv.org/abs/2109.04653
Abstract
Pre-trained language-vision models have shown remarkable performance on the visual question answering (VQA) task. However, most pre-trained models are trained by only considering monolingual learning, especially for resource-rich languages like English. Training such models for multilingual setups demands high computing resources and a multilingual language-vision dataset, which hinders their application in practice. To alleviate these challenges, we propose a knowledge distillation approach to extend an English language-vision model (teacher) into an equally effective multilingual and code-mixed model (student). Unlike existing knowledge distillation methods, which only use the output from the last layer of the teacher network for distillation, our student model learns and imitates the teacher from multiple intermediate layers (language and vision encoders) with appropriately designed distillation objectives for incremental knowledge extraction. We also create a large-scale multilingual and code-mixed VQA dataset in eleven different language setups considering multiple Indian and European languages. Experimental results and in-depth analysis show the effectiveness of the proposed VQA model over the pre-trained language-vision models on eleven diverse language setups.
What Changes Can Large-scale Language Models Bring? Intensive Study on HyperCLOVA: Billions-scale Korean Generative Pretrained Transformers
Comment: Accepted to EMNLP2021 as a long paper
Link: http://arxiv.org/abs/2109.04650
Abstract
GPT-3 shows the remarkable in-context learning ability of large-scale language models (LMs) trained on hundreds-of-billions-scale data. Here we address some remaining issues less reported by the GPT-3 paper, such as a non-English LM, the performances of different sized models, and the effect of recently introduced prompt optimization on in-context learning. To achieve this, we introduce HyperCLOVA, a Korean variant of 82B GPT-3 trained on a Korean-centric corpus of 560B tokens. Enhanced by our Korean-specific tokenization, HyperCLOVA with our training configuration shows state-of-the-art in-context zero-shot and few-shot learning performances on various downstream tasks in Korean. Also, we show the performance benefits of prompt-based learning and demonstrate how it can be integrated into the prompt engineering pipeline. Then we discuss the possibility of materializing the No Code AI paradigm by providing AI prototyping capabilities to non-experts of ML by introducing HyperCLOVA Studio, an interactive prompt engineering interface. Lastly, we demonstrate the potential of our methods with three successful in-house applications.
Rule-based Morphological Inflection Improves Neural Terminology Translation
Comment: EMNLP 2021
Link: http://arxiv.org/abs/2109.04620
Abstract
Current approaches to incorporating terminology constraints in machine translation (MT) typically assume that the constraint terms are provided in their correct morphological forms. This limits their application to real-world scenarios where constraint terms are provided as lemmas. In this paper, we introduce a modular framework for incorporating lemma constraints in neural MT (NMT) in which linguistic knowledge and diverse types of NMT models can be flexibly applied. It is based on a novel cross-lingual inflection module that inflects the target lemma constraints based on the source context. We explore linguistically motivated rule-based and data-driven neural-based inflection modules and design English-German health and English-Lithuanian news test suites to evaluate them in domain adaptation and low-resource MT settings. Results show that our rule-based inflection module helps NMT models incorporate lemma constraints more accurately than a neural module and outperforms the existing end-to-end approach with lower training costs.
An Exploratory Study on Long Dialogue Summarization: What Works and What's Next
Comment: Findings of EMNLP 2021
Link: http://arxiv.org/abs/2109.04609
Abstract
Dialogue summarization helps readers capture salient information from long conversations in meetings, interviews, and TV series. However, real-world dialogues pose a great challenge to current summarization models, as the dialogue length typically exceeds the input limits imposed by recent transformer-based pre-trained models, and the interactive nature of dialogues makes relevant information more context-dependent and sparsely distributed than news articles. In this work, we perform a comprehensive study on long dialogue summarization by investigating three strategies to deal with the lengthy input problem and locate relevant information: (1) extended transformer models such as Longformer, (2) retrieve-then-summarize pipeline models with several dialogue utterance retrieval methods, and (3) hierarchical dialogue encoding models such as HMNet. Our experimental results on three long dialogue datasets (QMSum, MediaSum, SummScreen) show that the retrieve-then-summarize pipeline models yield the best performance. We also demonstrate that the summary quality can be further improved with a stronger retrieval model and pretraining on proper external summarization datasets.
IndoBERTweet: A Pretrained Language Model for Indonesian Twitter with Effective Domain-Specific Vocabulary Initialization
Comment: Accepted at EMNLP 2021
Link: http://arxiv.org/abs/2109.04607
Abstract
We present IndoBERTweet, the first large-scale pretrained model for Indonesian Twitter that is trained by extending a monolingually-trained Indonesian BERT model with additive domain-specific vocabulary. We focus in particular on efficient model adaptation under vocabulary mismatch, and benchmark different ways of initializing the BERT embedding layer for new word types. We find that initializing with the average BERT subword embedding makes pretraining five times faster, and is more effective than proposed methods for vocabulary adaptation in terms of extrinsic evaluation over seven Twitter-based datasets.
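The "average subword embedding" initialization is straightforward to sketch with the Hugging Face transformers library; the base model name and the example new tokens below are placeholders, not necessarily the ones used in the paper.

import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

model_name = "indobenchmark/indobert-base-p1"    # assumed base model
new_tokens = ["wkwk", "gaes"]                    # hypothetical Twitter words

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)

# 1) record how the *original* tokenizer splits each new word into subwords
subword_ids = {w: tokenizer(w, add_special_tokens=False)["input_ids"]
               for w in new_tokens}

# 2) extend the vocabulary and the embedding matrix
num_added = tokenizer.add_tokens(new_tokens)
model.resize_token_embeddings(len(tokenizer))
print(num_added, "tokens added")

# 3) initialize each new row with the mean of its old subword embeddings
emb = model.get_input_embeddings().weight
with torch.no_grad():
    for w in new_tokens:
        new_id = tokenizer.convert_tokens_to_ids(w)
        emb[new_id] = emb[subword_ids[w]].mean(dim=0)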
Augmenting BERT-style Models with Predictive Coding to Improve Discourse-level Representations
Comment: Accepted paper EMNLP2021
Link: http://arxiv.org/abs/2109.04602
Abstract
Current language models are usually trained using a self-supervised scheme, where the main focus is learning representations at the word or sentence level. However, there has been limited progress in generating useful discourse-level representations. In this work, we propose to use ideas from predictive coding theory to augment BERT-style language models with a mechanism that allows them to learn suitable discourse-level representations. As a result, our proposed approach is able to predict future sentences using explicit top-down connections that operate at the intermediate layers of the network. By experimenting with benchmarks designed to evaluate discourse-related knowledge using pre-trained sentence representations, we demonstrate that our approach improves performance in 6 out of 11 tasks by excelling in discourse relationship detection.
Cross-lingual Transfer for Text Classification with Dictionary-based Heterogeneous Graph
Comment: Published in Findings of EMNLP 2021
Link: http://arxiv.org/abs/2109.04400
Abstract
In cross-lingual text classification, it is required that task-specific training data in high-resource source languages are available, where the task is identical to that of a low-resource target language. However, collecting such training data can be infeasible because of the labeling cost, task characteristics, and privacy concerns. This paper proposes an alternative solution that uses only task-independent word embeddings of high-resource languages and bilingual dictionaries. First, we construct a dictionary-based heterogeneous graph (DHG) from bilingual dictionaries. This opens the possibility to use graph neural networks for cross-lingual transfer. The remaining challenge is the heterogeneity of DHG because multiple languages are considered. To address this challenge, we propose a dictionary-based heterogeneous graph neural network (DHGNet) that effectively handles the heterogeneity of DHG by two-step aggregations, which are word-level and language-level aggregations. Experimental results demonstrate that our method outperforms pretrained models even though it does not access large corpora. Furthermore, it can perform well even though dictionaries contain many incorrect translations. Its robustness allows the usage of a wider range of dictionaries, such as an automatically constructed dictionary and a crowdsourced dictionary, which are convenient for real-world applications.
Counterfactual Adversarial Learning with Representation Interpolation
Comment: Accepted to Findings of EMNLP 2021
Link: http://arxiv.org/abs/2109.04746
Abstract
Deep learning models exhibit a preference for statistical fitting over logical reasoning. Spurious correlations might be memorized when there exists statistical bias in training data, which severely limits the model performance especially in small data scenarios. In this work, we introduce Counterfactual Adversarial Training framework (CAT) to tackle the problem from a causality perspective. Particularly, for a specific sample, CAT first generates a counterfactual representation through latent space interpolation in an adversarial manner, and then performs Counterfactual Risk Minimization (CRM) on each original-counterfactual pair to adjust sample-wise loss weight dynamically, which encourages the model to explore the true causal effect. Extensive experiments demonstrate that CAT achieves substantial performance improvement over SOTA across different downstream tasks, including sentence classification, natural language inference and question answering.
Style Pooling: Automatic Text Style Obfuscation for Improved Classification Fairness
Comment: EMNLP 2021
Link: http://arxiv.org/abs/2109.04624
Abstract
Text style can reveal sensitive attributes of the author (e.g. race or age) to the reader, which can, in turn, lead to privacy violations and bias in both human and algorithmic decisions based on text. For example, the style of writing in job applications might reveal protected attributes of the candidate which could lead to bias in hiring decisions, regardless of whether hiring decisions are made algorithmically or by humans. We propose a VAE-based framework that obfuscates stylistic features of human-generated text through style transfer by automatically re-writing the text itself. Our framework operationalizes the notion of obfuscated style in a flexible way that enables two distinct notions of obfuscated style: (1) a minimal notion that effectively intersects the various styles seen in training, and (2) a maximal notion that seeks to obfuscate by adding stylistic features of all sensitive attributes to text, in effect, computing a union of styles. Our style-obfuscation framework can be used for multiple purposes, however, we demonstrate its effectiveness in improving the fairness of downstream classifiers. We also conduct a comprehensive study on style pooling's effect on fluency, semantic consistency, and attribute removal from text, in two and three domain style obfuscation.