About #Today's arXiv Picks
This is a column from 「AI 学术前沿」 (AI Academic Frontier): every day, the editors select high-quality papers from arXiv and deliver them to readers.
Broaden the Vision: Geo-Diverse Visual Commonsense Reasoning
Comment: EMNLP 2021. Code and data are available at https://github.com/WadeYin9712/GD-VCR
Link: http://arxiv.org/abs/2109.06860
Abstract
Commonsense is defined as the knowledge that is shared by everyone. However, certain types of commonsense knowledge are correlated with culture and geographic locations, and they are only shared locally. For example, the scenarios of wedding ceremonies vary across regions due to different customs influenced by historical and religious factors. Such regional characteristics, however, are generally omitted in prior work. In this paper, we construct a Geo-Diverse Visual Commonsense Reasoning dataset (GD-VCR) to test vision-and-language models' ability to understand cultural and geo-location-specific commonsense. In particular, we study two state-of-the-art vision-and-language models, VisualBERT and ViLBERT, trained on VCR, a standard multimodal commonsense benchmark with images primarily from Western regions. We then evaluate how well the trained models can generalize to answering the questions in GD-VCR. We find that the performance of both models for non-Western regions, including East Asia, South Asia, and Africa, is significantly lower than that for the Western region. We analyze the reasons behind the performance disparity and find that the performance gap is larger on QA pairs that: 1) concern culture-related scenarios, e.g., weddings, religious activities, and festivals; 2) require high-level geo-diverse commonsense reasoning rather than low-order perception and recognition. Dataset and code are released at https://github.com/WadeYin9712/GD-VCR.
Summarize-then-Answer: Generating Concise Explanations for Multi-hop Reading Comprehension
Comment: Accepted to EMNLP2021 Long Paper (Main Track)
Link: http://arxiv.org/abs/2109.06853
Abstract
How can we generate concise explanations for multi-hop Reading Comprehension (RC)? The current strategies of identifying supporting sentences can be seen as an extractive question-focused summarization of the input text. However, these extractive explanations are not necessarily concise, i.e., not minimally sufficient for answering a question. Instead, we advocate an abstractive approach, where we propose to generate a question-focused, abstractive summary of the input paragraphs and then feed it to an RC system. Given a limited amount of human-annotated abstractive explanations, we train the abstractive explainer in a semi-supervised manner: we start from the supervised model and then train it further, through trial and error, to maximize a conciseness-promoting reward function. Our experiments demonstrate that the proposed abstractive explainer can generate more compact explanations than an extractive explainer with limited supervision (only 2k instances) while maintaining sufficiency.
The Perils of Using Mechanical Turk to Evaluate Open-Ended Text Generation
Comment: EMNLP 2021 (20 pages)
Link: http://arxiv.org/abs/2109.06835
Abstract
Recent text generation research has increasingly focused on open-ended domains such as story and poetry generation. Because models built for such tasks are difficult to evaluate automatically, most researchers in the space justify their modeling choices by collecting crowdsourced human judgments of text quality (e.g., Likert scores of coherence or grammaticality) from Amazon Mechanical Turk (AMT). In this paper, we first conduct a survey of 45 open-ended text generation papers and find that the vast majority of them fail to report crucial details about their AMT tasks, hindering reproducibility. We then run a series of story evaluation experiments with both AMT workers and English teachers and discover that even with strict qualification filters, AMT workers (unlike teachers) fail to distinguish between model-generated text and human-generated references. We show that AMT worker judgments improve when they are shown model-generated output alongside human-generated references, which enables the workers to better calibrate their ratings. Finally, interviews with the English teachers provide deeper insights into the challenges of the evaluation process, particularly when rating model-generated text.
Types of Out-of-Distribution Texts and How to Detect Them
Comment: EMNLP 2021
Link: http://arxiv.org/abs/2109.06827
Abstract
Despite agreement on the importance of detecting out-of-distribution (OOD) examples, there is little consensus on the formal definition of OOD examples and how to best detect them. We categorize these examples by whether they exhibit a background shift or a semantic shift, and find that the two major approaches to OOD detection, model calibration and density estimation (language modeling for text), have distinct behavior on these types of OOD data. Across 14 pairs of in-distribution and OOD English natural language understanding datasets, we find that density estimation methods consistently beat calibration methods in background shift settings, while performing worse in semantic shift settings. In addition, we find that both methods generally fail to detect examples from challenge data, highlighting a weak spot for current methods. Since no single method works well across all settings, our results call for an explicit definition of OOD examples when evaluating different detection methods.
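Editor's note: as a rough, runnable illustration of the two detection families the paper contrasts, the sketch below scores a text with a task classifier's maximum softmax probability (the calibration view) and with a language model's per-token log-likelihood (the density view). The specific checkpoints are arbitrary stand-ins, not the paper's setup.

```python
import torch
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          GPT2LMHeadModel, GPT2TokenizerFast)

CLF_NAME = "distilbert-base-uncased-finetuned-sst-2-english"  # illustrative choice
clf_tok = AutoTokenizer.from_pretrained(CLF_NAME)
clf = AutoModelForSequenceClassification.from_pretrained(CLF_NAME).eval()
lm_tok = GPT2TokenizerFast.from_pretrained("gpt2")
lm = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def calibration_score(text: str) -> float:
    """Calibration view: maximum softmax probability of a task classifier."""
    with torch.no_grad():
        logits = clf(**clf_tok(text, return_tensors="pt")).logits
    return torch.softmax(logits, dim=-1).max().item()

def density_score(text: str) -> float:
    """Density view: per-token log-likelihood under a language model."""
    ids = lm_tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        loss = lm(ids, labels=ids).loss  # mean negative log-likelihood
    return -loss.item()

# An example is flagged OOD when its score falls below a threshold
# fit on in-distribution validation data (not shown here).
text = "The movie was great!"
print(calibration_score(text), density_score(text))
```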
LM-Critic: Language Models for Unsupervised Grammatical Error Correction
Comment: EMNLP 2021. Code & data available at https://github.com/michiyasunaga/LM-Critic
Link: http://arxiv.org/abs/2109.06822
Abstract
Training a model for grammatical error correction (GEC) requires a set of labeled ungrammatical / grammatical sentence pairs, but manually annotating such pairs can be expensive. Recently, the Break-It-Fix-It (BIFI) framework has demonstrated strong results on learning to repair a broken program without any labeled examples, but this relies on a perfect critic (e.g., a compiler) that returns whether an example is valid or not, which does not exist for the GEC task. In this work, we show how to leverage a pretrained language model (LM) in defining an LM-Critic, which judges a sentence to be grammatical if the LM assigns it a higher probability than its local perturbations. We apply this LM-Critic and BIFI along with a large set of unlabeled sentences to bootstrap realistic ungrammatical / grammatical pairs for training a corrector. We evaluate our approach on GEC datasets across multiple domains (CoNLL-2014, BEA-2019, GMEG-wiki and GMEG-yahoo) and show that it outperforms existing methods in both the unsupervised setting (+7.7 F0.5) and the supervised setting (+0.5 F0.5).
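Editor's note: the LM-Critic criterion is simple enough to sketch. Below is a minimal illustration (not the authors' implementation): a sentence is judged grammatical if a pretrained LM assigns it a higher probability than every sentence in a small neighborhood of local edits. Here the perturbation set is just adjacent-word swaps; the paper uses richer word- and character-level edits.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def log_prob(sentence: str) -> float:
    """Total log-probability of a sentence under the LM."""
    ids = tokenizer(sentence, return_tensors="pt").input_ids
    with torch.no_grad():
        out = model(ids, labels=ids)
    # out.loss is the mean negative log-likelihood over predicted tokens
    return -out.loss.item() * (ids.size(1) - 1)

def perturbations(sentence: str):
    """Toy local edits: adjacent-word swaps (the paper uses richer
    word- and character-level perturbation sets)."""
    words = sentence.split()
    for i in range(len(words) - 1):
        yield " ".join(words[:i] + [words[i + 1], words[i]] + words[i + 2:])

def lm_critic(sentence: str) -> bool:
    """Grammatical iff the sentence beats all of its local perturbations."""
    base = log_prob(sentence)
    return all(log_prob(p) < base for p in perturbations(sentence))

print(lm_critic("The cat sat on the mat."))
```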
Everything Is All It Takes: A Multipronged Strategy for Zero-Shot Cross-Lingual Information Extraction
Comment: EMNLP 2021
Link: http://arxiv.org/abs/2109.06798
Abstract
Zero-shot cross-lingual information extraction (IE) describes the construction of an IE model for some target language, given existing annotations exclusively in some other language, typically English. While the advance of pretrained multilingual encoders suggests an easy optimism of "train on English, run on any language", we find through a thorough exploration and extension of techniques that a combination of approaches, both new and old, leads to better performance than any one cross-lingual strategy in particular. We explore techniques including data projection and self-training, and how different pretrained encoders impact them. We use English-to-Arabic IE as our initial example, demonstrating strong performance in this setting for event extraction, named entity recognition, part-of-speech tagging, and dependency parsing. We then apply data projection and self-training to three tasks across eight target languages. Because no single set of techniques performs the best across all tasks, we encourage practitioners to explore various configurations of the techniques described in this work when seeking to improve on zero-shot training.
Adaptive Information Seeking for Open-Domain Question Answering
Comment: Accepted at EMNLP 2021
Link: http://arxiv.org/abs/2109.06747
Abstract
Information seeking is an essential step for open-domain question answering to efficiently gather evidence from a large corpus. Recently, iterative approaches have proven effective for complex questions, recursively retrieving new evidence at each step. However, almost all existing iterative approaches use predefined strategies, either applying the same retrieval function multiple times or fixing the order of different retrieval functions, which cannot fulfill the diverse requirements of various questions. In this paper, we propose a novel adaptive information-seeking strategy for open-domain question answering, namely AISO. Specifically, the whole retrieval and answer process is modeled as a partially observed Markov decision process, where three types of retrieval operations (e.g., BM25, DPR, and hyperlink) and one answer operation are defined as actions. According to the learned policy, AISO can adaptively select a proper retrieval action to seek the missing evidence at each step, based on the collected evidence and the reformulated query, or directly output the answer when the evidence set is sufficient for the question. Experiments on SQuAD Open and HotpotQA fullwiki, which serve as single-hop and multi-hop open-domain QA benchmarks, show that AISO outperforms all baseline methods with predefined strategies in terms of both retrieval and answer evaluations.
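Editor's note: the control loop implied by the abstract can be sketched as follows. This is a heavily simplified, hypothetical skeleton: `policy`, the retriever callables, and `read_answer` are placeholders for the learned components, not AISO's actual interfaces.

```python
def aiso_loop(question, policy, retrievers, read_answer, max_steps=8):
    """One episode: alternate retrieval actions until the policy answers.

    policy(question, evidence)      -> (action, query), action in
        {"bm25", "dpr", "link", "answer"}   # hypothetical interface
    retrievers[action](query)       -> list of evidence passages
    read_answer(question, evidence) -> answer string
    """
    evidence = []
    for _ in range(max_steps):
        action, query = policy(question, evidence)
        if action == "answer":                      # evidence judged sufficient
            break
        evidence.extend(retrievers[action](query))  # seek the missing evidence
    return read_answer(question, evidence)
```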
A Novel Global Feature-Oriented Relational Triple Extraction Model based on Table Filling
Comment: EMNLP2021
Link: http://arxiv.org/abs/2109.06705
Abstract
Table-filling-based relational triple extraction methods are attracting growing research interest due to their promising performance and their ability to extract triples from complex sentences. However, such methods remain far from their full potential because most of them focus only on local features and ignore the global associations of relations and of token pairs, which increases the possibility of overlooking important information during triple extraction. To overcome this deficiency, we propose a global feature-oriented triple extraction model that makes full use of these two kinds of global associations. Specifically, we first generate a table feature for each relation. Then two kinds of global associations are mined from the generated table features. Next, the mined global associations are integrated into the table feature of each relation. This "generate-mine-integrate" process is performed multiple times so that the table feature of each relation is refined step by step. Finally, each relation's table is filled based on its refined table feature, and all triples linked to this relation are extracted based on its filled table. We evaluate the proposed model on three benchmark datasets. Experimental results show our model is effective and achieves state-of-the-art results on all of these datasets. The source code of our work is available at: https://github.com/neukg/GRTE.
KFCNet: Knowledge Filtering and Contrastive Learning Network for Generative Commonsense Reasoning
Comment: Accepted to EMNLP 2021 Findings
Link: http://arxiv.org/abs/2109.06704
Abstract
Pre-trained language models have led to substantial gains over a broad range of natural language processing (NLP) tasks, but have been shown to have limitations for natural language generation tasks with high-quality requirements on the output, such as commonsense generation and ad keyword generation. In this work, we present a novel Knowledge Filtering and Contrastive learning Network (KFCNet) which references external knowledge and achieves better generation performance. Specifically, we propose a BERT-based filter model to remove low-quality candidates, and apply contrastive learning separately to each of the encoder and decoder, within a general encoder-decoder architecture. The encoder contrastive module helps to capture global target semantics during encoding, and the decoder contrastive module enhances the utility of retrieved prototypes while learning general features. Extensive experiments on the CommonGen benchmark show that our model outperforms the previous state of the art by a large margin: +6.6 points (42.5 vs. 35.9) for BLEU-4, +3.7 points (33.3 vs. 29.6) for SPICE, and +1.3 points (18.3 vs. 17.0) for CIDEr. We further verify the effectiveness of the proposed contrastive module on ad keyword generation, and show that our model has potential commercial value.
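Editor's note: the abstract does not spell out the contrastive objective, but a generic in-batch InfoNCE loss over pooled encoder (or decoder) representations, sketched below, conveys the flavor of such a contrastive module. This is an assumption-laden stand-in, not KFCNet's code.

```python
import torch
import torch.nn.functional as F

def info_nce(anchors: torch.Tensor, positives: torch.Tensor,
             temp: float = 0.1) -> torch.Tensor:
    """In-batch InfoNCE: row i of `anchors` should match row i of
    `positives`; all other rows serve as negatives."""
    a = F.normalize(anchors, dim=-1)
    p = F.normalize(positives, dim=-1)
    logits = a @ p.T / temp                 # (batch, batch) similarities
    labels = torch.arange(a.size(0))        # positives on the diagonal
    return F.cross_entropy(logits, labels)

# e.g. pooled states of a sentence vs. its retrieved prototype
print(info_nce(torch.randn(4, 16), torch.randn(4, 16)))
```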
Efficient Inference for Multilingual Neural Machine Translation
Comment: Accepted as a long paper to EMNLP 2021
Link: http://arxiv.org/abs/2109.06679
Abstract
Multilingual NMT has become an attractive solution for MT deployment in production. But to match bilingual quality, it comes at the cost of larger and slower models. In this work, we consider several ways to make multilingual NMT faster at inference without degrading its quality. We experiment with several "light decoder" architectures in two 20-language multi-parallel settings: small-scale on TED Talks and large-scale on ParaCrawl. Our experiments demonstrate that combining a shallow decoder with vocabulary filtering makes inference more than twice as fast with no loss in translation quality. We validate our findings with BLEU and chrF (on 380 language pairs), robustness evaluation, and human evaluation.
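Editor's note: vocabulary filtering is easy to picture: at inference the output projection is sliced down to the subwords plausible for the target language, shrinking the softmax. The sketch below is a toy illustration with made-up shapes and a random `allowed_ids` set, not the paper's implementation.

```python
import torch

def filter_output_layer(weight, bias, allowed_ids):
    """Slice the (vocab, hidden) output projection down to allowed subwords."""
    return weight[allowed_ids], bias[allowed_ids]

def greedy_step(hidden, w_sub, b_sub, allowed_ids):
    """Argmax over the filtered vocabulary, mapped back to full-vocab ids."""
    logits = hidden @ w_sub.T + b_sub       # (batch, |allowed|)
    return allowed_ids[logits.argmax(dim=-1)]

vocab, dim = 32000, 512
weight, bias = torch.randn(vocab, dim), torch.zeros(vocab)
# e.g. the subwords that actually occur in target-language training data
allowed_ids = torch.unique(torch.randint(0, vocab, (8000,)))
w_sub, b_sub = filter_output_layer(weight, bias, allowed_ids)
print(greedy_step(torch.randn(2, dim), w_sub, b_sub, allowed_ids).shape)
```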
MDAPT: Multilingual Domain Adaptive Pretraining in a Single Model
Comment: Findings of EMNLP 2021
Link: http://arxiv.org/abs/2109.06605
Abstract
Domain adaptive pretraining, i.e. the continued unsupervised pretraining of a language model on domain-specific text, improves the modelling of text for downstream tasks within the domain. Numerous real-world applications are based on domain-specific text, e.g. working with financial or biomedical documents, and these applications often need to support multiple languages. However, large-scale domain-specific multilingual pretraining data for such scenarios can be difficult to obtain, due to regulations, legislation, or simply a lack of language- and domain-specific text. One solution is to train a single multilingual model, taking advantage of the data available in as many languages as possible. In this work, we explore the benefits of domain adaptive pretraining with a focus on adapting to multiple languages within a specific domain. We propose different techniques to compose pretraining corpora that enable a language model to become both domain-specific and multilingual. Evaluation on nine domain-specific datasets (for biomedical named entity recognition and financial sentence classification) covering seven different languages shows that a single multilingual domain-specific model can outperform the general multilingual model, and performs close to its monolingual counterpart. This finding holds across two different pretraining methods, adapter-based pretraining and full model pretraining.
Non-Parametric Unsupervised Domain Adaptation for Neural Machine Translation
Comment: Findings of EMNLP 2021
Link: http://arxiv.org/abs/2109.06604
Abstract
Recently, $k$NN-MT has shown the promising capability of directly incorporating a pre-trained neural machine translation (NMT) model with domain-specific token-level $k$-nearest-neighbor ($k$NN) retrieval to achieve domain adaptation without retraining. Despite being conceptually attractive, it heavily relies on high-quality in-domain parallel corpora, limiting its capability for unsupervised domain adaptation, where in-domain parallel corpora are scarce or nonexistent. In this paper, we propose a novel framework that directly uses in-domain monolingual sentences in the target language to construct an effective datastore for $k$-nearest-neighbor retrieval. To this end, we first introduce an autoencoder task based on the target language, and then insert lightweight adapters into the original NMT model to map the token-level representations of this task to the ideal representations of the translation task. Experiments on multi-domain datasets demonstrate that our proposed approach significantly improves translation accuracy with target-side monolingual data, while achieving performance comparable to back-translation.
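Editor's note: for readers unfamiliar with $k$NN-MT, the sketch below shows the standard token-level retrieval step that the paper builds on: nearest neighbors of the decoder's hidden state vote for the next token, and the resulting distribution is interpolated with the NMT model's. All tensors are toy stand-ins; the paper's contribution is constructing the datastore from target-side monolingual text.

```python
import numpy as np

def knn_next_token_probs(query, keys, values, vocab, k=4, temp=10.0):
    """Turn the k nearest (hidden state -> next token) datastore entries
    into a distribution over the vocabulary."""
    dists = ((keys - query) ** 2).sum(axis=1)     # squared L2 distances
    nearest = np.argsort(dists)[:k]
    weights = np.exp(-dists[nearest] / temp)
    probs = np.zeros(vocab)
    for idx, w in zip(nearest, weights):
        probs[values[idx]] += w
    return probs / probs.sum()

vocab, dim = 100, 16
keys = np.random.randn(1000, dim)                 # datastore keys
values = np.random.randint(0, vocab, 1000)        # datastore values
p_knn = knn_next_token_probs(np.random.randn(dim), keys, values, vocab)
p_nmt = np.full(vocab, 1.0 / vocab)               # stand-in NMT distribution
p_final = 0.5 * p_knn + 0.5 * p_nmt               # standard kNN-MT interpolation
print(p_final.argmax())
```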
'Just What do You Think You're Doing, Dave?' A Checklist for Responsible Data Use in NLP
Comment: Findings of EMNLP 2021
Link: http://arxiv.org/abs/2109.06598
Abstract
A key part of the NLP ethics movement is responsible use of data, but exactly what that means or how it can best be achieved remains unclear. This position paper discusses the core legal and ethical principles for the collection and sharing of textual data, and the tensions between them. We propose a potential checklist for responsible data (re-)use that could both standardise the peer review of conference submissions and enable a more in-depth view of published research across the community. Our proposal aims to contribute to the development of a consistent standard for data (re-)use, embraced across NLP conferences.
Learning Bill Similarity with Annotated and Augmented Corpora of Bills
Comment: Accepted at EMNLP 2021 (long paper)
Link: http://arxiv.org/abs/2109.06527
Abstract
Bill writing is a critical element of representative democracy. However, it is often overlooked that most legislative bills are derived, or even directly copied, from other bills. Despite the significance of bill-to-bill linkages for understanding the legislative process, existing approaches fail to address semantic similarities across bills, let alone the reordering or paraphrasing that is prevalent in legal document writing. In this paper, we overcome these limitations by proposing a 5-class classification task that closely reflects the nature of the bill generation process. In doing so, we construct a human-labeled dataset of 4,721 bill-to-bill relationships at the subsection level and release this annotated dataset to the research community. To augment the dataset, we generate synthetic data with varying degrees of similarity, mimicking the complex bill writing process. We use BERT variants and apply multi-stage training, sequentially fine-tuning our models with synthetic and human-labeled datasets. We find that predictive performance significantly improves when training with both human-labeled and synthetic data. Finally, we apply our trained model to infer section- and bill-level similarities. Our analysis shows that the proposed methodology successfully captures the similarities across legal documents at various levels of aggregation.
Different Strokes for Different Folks: Investigating Appropriate Further Pre-training Approaches for Diverse Dialogue Tasks
Comment: Accepted as a long paper at EMNLP 2021 (Main Conference)
Link: http://arxiv.org/abs/2109.06524
Abstract
Loading models pre-trained on large-scale general-domain corpora and fine-tuning them on specific downstream tasks is gradually becoming a paradigm in Natural Language Processing. Previous investigations show that introducing a further pre-training phase between the pre-training and fine-tuning phases, to adapt the model on domain-specific unlabeled data, can bring positive effects. However, most of these further pre-training works simply keep running the conventional pre-training task, e.g., masked language modeling, which can be regarded as domain adaptation to bridge the data distribution gap. After observing diverse downstream tasks, we suggest that different tasks may also need a further pre-training phase with appropriate training tasks to bridge the task formulation gap. To investigate this, we carry out a study on improving multiple task-oriented dialogue downstream tasks by designing various tasks for the further pre-training phase. The experiments show that different downstream tasks prefer different further pre-training tasks, that these tasks are intrinsically correlated, and that most further pre-training tasks significantly improve certain target tasks rather than all of them. Our investigation indicates that it is both important and effective to design appropriate further pre-training tasks that model the specific information benefiting downstream tasks. Besides, we present multiple constructive empirical conclusions for enhancing task-oriented dialogues.
Netmarble AI Center's WMT21 Automatic Post-Editing Shared Task Submission
Comment: WMT21 Automatic Post-Editing Shared Task system paper (at the EMNLP 2021 Workshop)
Link: http://arxiv.org/abs/2109.06515
Abstract
This paper describes Netmarble's submission to the WMT21 Automatic Post-Editing (APE) Shared Task for the English-German language pair. First, we propose a curriculum training strategy across training stages. Facebook FAIR's WMT19 news translation model was chosen as a large, powerful pre-trained neural network. We then post-train the translation model with different levels of data at each training stage. As the training stages progress, we gradually add extra information so that the system learns to solve multiple tasks. We also show a way to utilize large volumes of additional data for APE tasks. For further improvement, we apply a multi-task learning strategy with dynamic weight averaging during the fine-tuning stage. To fine-tune on the limited APE corpus, we add related subtasks to learn a unified representation. Finally, for better performance, we leverage external translations as augmented machine translation (MT) during post-training and fine-tuning. As the experimental results show, our APE system significantly improves the provided MT translations by -2.848 TER and +3.74 BLEU on the development dataset. It also demonstrates its effectiveness on the test dataset, achieving higher quality than on the development dataset.
Tribrid: Stance Classification with Neural Inconsistency Detection
Comment: Accepted at EMNLP 2021
Link: http://arxiv.org/abs/2109.06508
Abstract
We study the problem of performing automatic stance classification on social media with neural architectures such as BERT. Although these architectures deliver impressive results, their performance is not yet comparable to that of humans, and they may produce errors that have a significant impact on downstream tasks (e.g., fact-checking). To improve performance, we present a new neural architecture whose input also includes automatically generated negated perspectives over a given claim. The model is jointly trained to make multiple predictions simultaneously, which can be used either to improve the classification of the original perspective or to filter out doubtful predictions. In the first case, we propose a weakly supervised method for combining the predictions into a final one. In the second case, we show that using the confidence scores to remove doubtful predictions allows our method to achieve human-like performance over the retained information, which is still a sizable part of the original input.
AligNART: Non-autoregressive Neural Machine Translation by Jointly Learning to Estimate Alignment and Translate
Comment: Accepted by EMNLP 2021
Link: http://arxiv.org/abs/2109.06481
Abstract
Non-autoregressive neural machine translation (NART) models suffer from the multi-modality problem which causes translation inconsistency such as token repetition. Most recent approaches have attempted to solve this problem by implicitly modeling dependencies between outputs. In this paper, we introduce AligNART, which leverages full alignment information to explicitly reduce the modality of the target distribution. AligNART divides the machine translation task into $(i)$ alignment estimation and $(ii)$ translation with aligned decoder inputs, guiding the decoder to focus on simplified one-to-one translation. To alleviate the alignment estimation problem, we further propose a novel alignment decomposition method. Our experiments show that AligNART outperforms previous non-iterative NART models that focus on explicit modality reduction on WMT14 En$\leftrightarrow$De and WMT16 Ro$\rightarrow$En. Furthermore, AligNART achieves BLEU scores comparable to those of the state-of-the-art connectionist temporal classification based models on WMT14 En$\leftrightarrow$De. We also observe that AligNART effectively addresses the token repetition problem even without sequence-level knowledge distillation.
Logic-level Evidence Retrieval and Graph-based Verification Network for Table-based Fact Verification
Comment: EMNLP 2021
Link: http://arxiv.org/abs/2109.06480
Abstract
The table-based fact verification task aims to verify whether a given statement is supported by a given semi-structured table. Symbolic reasoning with logical operations plays a crucial role in this task. Existing methods leverage programs that contain rich logical information to enhance the verification process. However, due to the lack of fully supervised signals in the program generation process, spurious programs can be derived and employed, which leads to the model's inability to catch helpful logical operations. To address these problems, in this work we formulate the table-based fact verification task as an evidence retrieval and reasoning framework, proposing the Logic-level Evidence Retrieval and Graph-based Verification network (LERGV). Specifically, we first retrieve logic-level program-like evidence from the given table and statement as supplementary evidence for the table. After that, we construct a logic-level graph to capture the logical relations between entities and functions in the retrieved evidence, and design a graph-based verification network to perform logic-level graph-based reasoning over the constructed graph to classify the final entailment relation. Experimental results on the large-scale benchmark TABFACT show the effectiveness of the proposed approach.
Task-adaptive Pre-training and Self-training are Complementary for Natural Language Understanding
Comment: Findings of EMNLP 2021
Link: http://arxiv.org/abs/2109.06466
Abstract
Task-adaptive pre-training (TAPT) and self-training (ST) have emerged as the major semi-supervised approaches to improving natural language understanding (NLU) tasks with massive amounts of unlabeled data. However, it is unclear whether they learn similar representations or whether they can be effectively combined. In this paper, we show that TAPT and ST can be complementary under a simple protocol that follows the TAPT -> Fine-tuning -> Self-training (TFS) process. Experimental results show that the TFS protocol can effectively utilize unlabeled data to achieve strong combined gains consistently across six datasets covering sentiment classification, paraphrase identification, natural language inference, named entity recognition, and dialogue slot classification. We investigate various semi-supervised settings and consistently show that the gains from TAPT and ST can be strongly additive when following the TFS procedure. We hope that TFS can serve as an important semi-supervised baseline for future NLP studies.
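Editor's note: the self-training stage of TFS can be illustrated with a toy classifier, as below; the TAPT stage (continued masked-LM pre-training on task text) and the transformer fine-tuning are omitted. The data, model, and confidence threshold are all illustrative.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X_lab, y_lab = rng.normal(size=(100, 5)), rng.integers(0, 2, 100)
X_unlab = rng.normal(size=(500, 5))               # unlabeled pool

model = LogisticRegression().fit(X_lab, y_lab)    # the fine-tuning (F) stage
probs = model.predict_proba(X_unlab)
keep = probs.max(axis=1) >= 0.8                   # confident pseudo-labels only
X_aug = np.vstack([X_lab, X_unlab[keep]])
y_aug = np.concatenate([y_lab, probs.argmax(axis=1)[keep]])
model = LogisticRegression().fit(X_aug, y_aug)    # the self-training (ST) stage
```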
Uncovering Implicit Gender Bias in Narratives through Commonsense Inference
Comment: Accepted at Findings of EMNLP 2021
Link: http://arxiv.org/abs/2109.06437
Abstract
Pre-trained language models learn socially harmful biases from their training corpora, and may repeat these biases when used for generation. We study gender biases associated with the protagonist in model-generated stories. Such biases may be expressed either explicitly ("women can't park") or implicitly (e.g., an unsolicited male character guides her into a parking space). We focus on implicit biases, and use a commonsense reasoning engine to uncover them. Specifically, we infer and analyze the protagonist's motivations, attributes, mental states, and implications on others. Our findings regarding implicit biases are in line with prior work that studied explicit biases, for example showing that female characters' portrayal is centered around appearance, while male figures' portrayal focuses on intellect.
Gradient Imitation Reinforcement Learning for Low Resource Relation Extraction
Comment: In EMNLP 2021 as a long paper. Code and data available at https://github.com/THU-BPM/GradLRE
Link: http://arxiv.org/abs/2109.06415
Abstract
Low-resource Relation Extraction (LRE) aims to extract relation facts from limited labeled corpora when human annotation is scarce. Existing works either utilize a self-training scheme to generate pseudo labels, which causes a gradual drift problem, or leverage a meta-learning scheme, which does not solicit feedback explicitly. To alleviate the selection bias due to the lack of feedback loops in existing LRE learning paradigms, we develop a Gradient Imitation Reinforcement Learning method that encourages pseudo-labeled data to imitate the gradient descent direction on labeled data, and bootstraps its optimization capability through trial and error. We also propose a framework called GradLRE, which handles two major scenarios in low-resource relation extraction. Besides the scenario where unlabeled data is sufficient, GradLRE handles the situation where no unlabeled data is available by exploiting a contextualized augmentation method to generate data. Experimental results on two public datasets demonstrate the effectiveness of GradLRE on low-resource relation extraction compared with baselines.
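Editor's note: the gradient-imitation signal described in the abstract can be sketched in a few lines: compute the gradient of the loss on a labeled batch and on a pseudo-labeled batch, and use their cosine similarity as the reward. The tiny model and random data below are stand-ins, not the GradLRE implementation.

```python
import torch

model = torch.nn.Linear(10, 3)                    # stand-in relation classifier
loss_fn = torch.nn.CrossEntropyLoss()

def flat_grad(x, y):
    """Flattened gradient of the batch loss w.r.t. all parameters."""
    model.zero_grad()
    loss_fn(model(x), y).backward()
    return torch.cat([p.grad.flatten() for p in model.parameters()])

x_lab, y_lab = torch.randn(8, 10), torch.randint(0, 3, (8,))
x_pse, y_pse = torch.randn(8, 10), torch.randint(0, 3, (8,))   # pseudo labels

g_lab, g_pse = flat_grad(x_lab, y_lab), flat_grad(x_pse, y_pse)
reward = torch.nn.functional.cosine_similarity(g_lab, g_pse, dim=0)
# High reward: the pseudo-labeled batch pushes the model in the same
# direction as real supervision, so those pseudo labels are reinforced.
print(reward.item())
```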
Progressively Guide to Attend: An Iterative Alignment Framework for Temporal Sentence Grounding
Comment: Accepted as a long paper in the main conference of EMNLP 2021
Link: http://arxiv.org/abs/2109.06400
Abstract
A key to temporal sentence grounding (TSG) lies in learning effective alignment between vision and language features extracted from an untrimmed video and a sentence description. Existing methods mainly leverage vanilla soft attention to perform the alignment in a single-step process. However, such single-step attention is insufficient in practice, since complicated inter- and intra-modality relations are usually obtained through multi-step reasoning. In this paper, we propose an Iterative Alignment Network (IA-Net) for the TSG task, which iteratively interacts inter- and intra-modal features within multiple steps for more accurate grounding. Specifically, during the iterative reasoning process, we pad multi-modal features with learnable parameters to alleviate the nowhere-to-attend problem of non-matched frame-word pairs, and enhance the basic co-attention mechanism in a parallel manner. To further calibrate the misaligned attention caused by each reasoning step, we also devise a calibration module following each attention module to refine the alignment knowledge. With such an iterative alignment scheme, our IA-Net can robustly capture the fine-grained relations between the vision and language domains step by step, progressively reasoning about the temporal boundaries. Extensive experiments conducted on three challenging benchmarks demonstrate that our proposed model performs better than the state of the art.
Adaptive Proposal Generation Network for Temporal Sentence Localization in Videos
Comment: Accepted as a long paper in the main conference of EMNLP 2021
Link: http://arxiv.org/abs/2109.06398
Abstract
We address the problem of temporal sentence localization in videos (TSLV). Traditional methods follow a top-down framework, which localizes the target segment with pre-defined segment proposals. Although they have achieved decent performance, the proposals are handcrafted and redundant. Recently, the bottom-up framework has attracted increasing attention due to its superior efficiency: it directly predicts, for each frame, the probability of being a boundary. However, the performance of the bottom-up model is inferior to its top-down counterpart, as it fails to exploit segment-level interaction. In this paper, we propose an Adaptive Proposal Generation Network (APGN) to maintain segment-level interaction while improving efficiency. Specifically, we first perform foreground-background classification over the video and regress on the foreground frames to adaptively generate proposals. In this way, the handcrafted proposal design is discarded and redundant proposals are reduced. Then, a proposal consolidation module is further developed to enhance the semantics of the generated proposals. Finally, we locate the target moments with these generated proposals following the top-down framework. Extensive experiments on three challenging benchmarks show that our proposed APGN significantly outperforms previous state-of-the-art methods.
Rationales for Sequential Predictions
Comment: To appear in the 2021 Conference on Empirical Methods in Natural Language Processing (EMNLP 2021)
Link: http://arxiv.org/abs/2109.06387
Abstract
Sequence models are a critical component of modern NLP systems, but their predictions are difficult to explain. We consider model explanations through rationales, subsets of context that can explain individual model predictions. We find sequential rationales by solving a combinatorial optimization: the best rationale is the smallest subset of input tokens that would predict the same output as the full sequence. Enumerating all subsets is intractable, so we propose an efficient greedy algorithm to approximate this objective. The algorithm, called greedy rationalization, applies to any model. For this approach to be effective, the model should form compatible conditional distributions when making predictions on incomplete subsets of the context. This condition can be enforced with a short fine-tuning step. We study greedy rationalization on language modeling and machine translation. Compared to existing baselines, greedy rationalization is best at optimizing the combinatorial objective and provides the most faithful rationales. On a new dataset of annotated sequential rationales, greedy rationales are most similar to human rationales.
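Editor's note: greedy rationalization as described in the abstract admits a compact sketch: grow the rationale one context token at a time, always adding the token that most increases the probability of the model's original prediction, and stop once the subset alone yields the same output. `predict_proba` below is a hypothetical model interface, and the toy model is only for demonstration.

```python
def greedy_rationalize(context, target, predict_proba):
    """context: list of tokens; predict_proba(tokens) -> {token: prob}.
    Greedily grow the rationale until it reproduces the target prediction."""
    rationale, remaining = [], list(range(len(context)))
    while remaining:
        best = max(remaining, key=lambda i: predict_proba(
            [context[j] for j in sorted(rationale + [i])]).get(target, 0.0))
        rationale.append(best)
        remaining.remove(best)
        probs = predict_proba([context[j] for j in sorted(rationale)])
        if max(probs, key=probs.get) == target:   # same prediction as full input
            break
    return sorted(rationale)

# Toy "model": predicts the most frequent token in its input.
def toy_predict(tokens):
    return {t: tokens.count(t) / len(tokens) for t in set(tokens)}

print(greedy_rationalize(["a", "b", "a", "c"], "a", toy_predict))
```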
Compression, Transduction, and Creation: A Unified Framework for Evaluating Natural Language Generation
Comment: EMNLP 2021. Code available at https://github.com/tanyuqian/ctc-gen-eval
Link: http://arxiv.org/abs/2109.06379
Abstract
Natural language generation (NLG) spans a broad range of tasks, each of which serves specific objectives and calls for different properties of the generated text. This complexity makes automatic evaluation of NLG particularly challenging. Previous work has typically focused on a single task and developed individual evaluation metrics based on specific intuitions. In this paper, we propose a unifying perspective based on the nature of information change in NLG tasks, including compression (e.g., summarization), transduction (e.g., text rewriting), and creation (e.g., dialog). Information alignment between the input, context, and output text plays a common central role in characterizing the generation. With automatic alignment prediction models, we develop a family of interpretable metrics that are suitable for evaluating key aspects of different NLG tasks, often without the need for gold reference data. Experiments show that the uniformly designed metrics achieve stronger or comparable correlations with human judgment compared to state-of-the-art metrics in each of several diverse tasks, including text summarization, style transfer, and knowledge-grounded dialog.
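Editor's note: the information-alignment idea behind this metric family can be sketched crudely: score each output token by how well it aligns to the input, then aggregate. The paper trains dedicated alignment prediction models; the stand-in below just uses cosine similarity over random "embeddings".

```python
import numpy as np

def alignment_score(out_emb, in_emb):
    """Mean over output tokens of the max cosine similarity to any input
    token: e.g. a crude 'consistency' score for summarization."""
    out_n = out_emb / np.linalg.norm(out_emb, axis=1, keepdims=True)
    in_n = in_emb / np.linalg.norm(in_emb, axis=1, keepdims=True)
    sim = out_n @ in_n.T                  # (out_len, in_len) alignment matrix
    return float(sim.max(axis=1).mean())

rng = np.random.default_rng(0)
summary_emb = rng.normal(size=(12, 32))   # stand-in token embeddings
source_emb = rng.normal(size=(80, 32))
print(alignment_score(summary_emb, source_emb))
```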
Question Answering over Electronic Devices: A New Benchmark Dataset and a Multi-Task Learning based QA Framework
Comment: EMNLP Findings 2021, Long
Link: http://arxiv.org/abs/2109.05897
Abstract
Answering questions asked from instructional corpora such as E-manuals, recipe books, etc., has been far less studied than open-domain factoid context-based question answering. This can be primarily attributed to the absence of standard benchmark datasets. In this paper, we meticulously create a large amount of data connected with E-manuals and develop a suitable algorithm to exploit it. We collect the E-Manual Corpus, a huge corpus of 307,957 E-manuals, and pretrain RoBERTa on this large corpus. We create various benchmark QA datasets, including question-answer pairs curated by experts based upon two E-manuals, real user questions from a community question answering forum pertaining to E-manuals, etc. We introduce EMQAP (E-Manual Question Answering Pipeline), which answers questions pertaining to electronic devices. Built upon the pretrained RoBERTa, it harbors a supervised multi-task learning framework which efficiently performs the dual tasks of identifying the section in the E-manual where the answer can be found and the exact answer span within that section. For E-Manual annotated question-answer pairs, we show an improvement of about 40% in ROUGE-L F1 scores over the most competitive baseline. We perform a detailed ablation study and establish the versatility of EMQAP across different circumstances. The code and datasets are shared at https://github.com/abhi1nandy2/EMNLP-2021-Findings, and the corresponding project website is https://sites.google.com/view/emanualqa/home.
Mitigating Language-Dependent Ethnic Bias in BERT
Comment: 17 pages including references and appendix. To appear in EMNLP 2021 (camera-ready ver.)
Link: http://arxiv.org/abs/2109.05704
Abstract
BERT and other large-scale language models (LMs) contain gender and racial bias. They also exhibit other dimensions of social bias, most of which have not been studied in depth, and some of which vary depending on the language. In this paper, we study ethnic bias and how it varies across languages by analyzing and mitigating ethnic bias in monolingual BERT for English, German, Spanish, Korean, Turkish, and Chinese. To observe and quantify ethnic bias, we develop a novel metric called Categorical Bias score. Then we propose two methods for mitigation; first using a multilingual model, and second using contextual word alignment of two monolingual models. We compare our proposed methods with monolingual BERT and show that these methods effectively alleviate the ethnic bias. Which of the two methods works better depends on the amount of NLP resources available for that language. We additionally experiment with Arabic and Greek to verify that our proposed methods work for a wider variety of languages.