About #今日arXiv精选
This is a column run by 「AI 学术前沿」: every day, the editors hand-pick high-quality papers from arXiv and deliver them to readers.
Does Vision-and-Language Pretraining Improve Lexical Grounding?
Comment: Camera ready for Findings of EMNLP 2021
Link: http://arxiv.org/abs/2109.10246
Abstract
Linguistic representations derived from text alone have been criticized for their lack of grounding, i.e., connecting words to their meanings in the physical world. Vision-and-Language (VL) models, trained jointly on text and image or video data, have been offered as a response to such criticisms. However, while VL pretraining has shown success on multimodal tasks such as visual question answering, it is not yet known how the internal linguistic representations themselves compare to their text-only counterparts. This paper compares the semantic representations learned via VL vs. text-only pretraining for two recent VL models using a suite of analyses (clustering, probing, and performance on a commonsense question answering task) in a language-only setting. We find that the multimodal models fail to significantly outperform the text-only variants, suggesting that future work is required if multimodal pretraining is to be pursued as a means of improving NLP in general.
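To make the probing part of the analysis concrete, here is a minimal sketch of how such a comparison might look: identical linear probes trained on frozen embeddings from each model, with cross-validated accuracy as the yardstick. The random vectors and the binary "semantic property" labels are stand-ins for real text-only vs. VL embeddings and a real probing dataset, not the paper's actual setup.

```python
# Minimal probing-comparison sketch: train the same linear probe on frozen
# embeddings from two models and compare accuracy. Random vectors stand in
# for real text-only vs. VL embeddings; labels are a hypothetical property.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n_words, dim = 500, 128
labels = rng.integers(0, 2, size=n_words)        # e.g., concrete vs. abstract

def probe_accuracy(embeddings, labels):
    """Mean cross-validated accuracy of a linear probe on frozen embeddings."""
    probe = LogisticRegression(max_iter=1000)
    return cross_val_score(probe, embeddings, labels, cv=5).mean()

text_only_emb = rng.normal(size=(n_words, dim))  # stand-in: text-only model
vl_emb = rng.normal(size=(n_words, dim))         # stand-in: VL-pretrained model

print(f"text-only probe acc: {probe_accuracy(text_only_emb, labels):.3f}")
print(f"VL probe acc:        {probe_accuracy(vl_emb, labels):.3f}")
```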
Blindness to Modality Helps Entailment Graph Mining
Comment: To appear at the Workshop on Insights from Negative Results in NLP at EMNLP 2021
Link: http://arxiv.org/abs/2109.10227
Abstract
Understanding linguistic modality is widely seen as important for downstream tasks such as Question Answering and Knowledge Graph Population. Entailment Graph learning might also be expected to benefit from attention to modality. We build Entailment Graphs using a news corpus filtered with a modality parser, and show that stripping modal modifiers from predicates in fact increases performance. This suggests that for some tasks, the pragmatics of modal modification of predicates allows them to contribute as evidence of entailment.
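As a toy illustration of what "stripping modal modifiers from predicates" means in practice, the snippet below removes modal auxiliaries so that modalized and plain variants of a predicate collapse to the same form. The modal list and token format are illustrative assumptions, not the paper's modality parser.

```python
# Toy version of stripping modal modifiers before entailment-graph mining,
# so 'may acquire' and 'acquire' count as the same predicate.
# MODALS and the tokenized predicate format are illustrative assumptions.
MODALS = {"may", "might", "must", "can", "could", "should", "would", "will"}

def strip_modality(predicate_tokens):
    """Drop modal auxiliaries from a tokenized predicate."""
    return [t for t in predicate_tokens if t.lower() not in MODALS]

print(strip_modality(["may", "acquire"]))   # ['acquire']
print(strip_modality(["acquire"]))          # ['acquire']
```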
One Source, Two Targets: Challenges and Rewards of Dual Decoding
Comment: Accepted at EMNLP 2021
Link: http://arxiv.org/abs/2109.10197
Abstract
Machine translation is generally understood as generating one target text from an input source document. In this paper, we consider a stronger requirement: to jointly generate two texts so that each output side effectively depends on the other. As we discuss, such a device serves several practical purposes, from multi-target machine translation to the generation of controlled variations of the target text. We present an analysis of possible implementations of dual decoding, and experiment with four applications. Viewing the problem from multiple angles allows us to better highlight the challenges of dual decoding and to also thoroughly analyze the benefits of generating matched, rather than independent, translations.
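A minimal sketch of what synchronous dual decoding could look like: at each step, each output side conditions on the source and on the other side's prefix so far. The next_token_a / next_token_b functions are hypothetical stand-ins for real decoder calls, not the paper's implementation.

```python
# Sketch of synchronous dual decoding: two outputs are grown in lockstep,
# and each side is passed the other side's prefix so it can condition on it.
# The two next-token functions are trivial stand-ins for real decoders.
def next_token_a(source, prefix_a, prefix_b):
    return f"a{len(prefix_a)}" if len(prefix_a) < 5 else "<eos>"

def next_token_b(source, prefix_b, prefix_a):
    return f"b{len(prefix_b)}" if len(prefix_b) < 5 else "<eos>"

def dual_greedy_decode(source, max_len=32):
    out_a, out_b = [], []
    for _ in range(max_len):
        tok_a = next_token_a(source, out_a, out_b)  # A receives B's prefix
        tok_b = next_token_b(source, out_b, out_a)  # B receives A's prefix
        if tok_a == "<eos>" and tok_b == "<eos>":
            break
        if tok_a != "<eos>":
            out_a.append(tok_a)
        if tok_b != "<eos>":
            out_b.append(tok_b)
    return out_a, out_b

print(dual_greedy_decode("un exemple"))
```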
TranslateLocally: Blazing-fast translation running on the local CPU
Comment: Accepted at EMNLP 2021 demo track; https://translatelocally.com
Link: http://arxiv.org/abs/2109.10194
Abstract
Every day, millions of people sacrifice their privacy and browsing habits in exchange for online machine translation. Companies and governments with confidentiality requirements often ban online translation or pay a premium to disable logging. To bring control back to the end user and demonstrate speed, we developed translateLocally. Running locally on a desktop or laptop CPU, translateLocally delivers cloud-like translation speed and quality even on 10-year-old hardware. The open-source software is based on Marian and runs on Linux, Windows, and macOS.
Are Transformers a Modern Version of ELIZA? Observations on French Object Verb Agreement
Comment: Camera-ready for EMNLP'21
Link: http://arxiv.org/abs/2109.10133
Abstract
Many recent works have demonstrated that unsupervised sentence representations of neural networks encode syntactic information by observing that neural language models are able to predict the agreement between a verb and its subject. We take a critical look at this line of research by showing that it is possible to achieve high accuracy on this agreement task with simple surface heuristics, indicating a possible flaw in our assessment of neural networks' syntactic ability. Our fine-grained analyses of results on the long-range French object-verb agreement show that contrary to LSTMs, Transformers are able to capture a non-trivial amount of grammatical structure.
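The kind of "simple surface heuristic" at issue can be sketched in a few lines, assuming pre-tagged tokens: predict the verb's number from the nearest preceding noun-like token. The token format below is made up for illustration; note how, on a long-range object-verb case, the heuristic latches onto the intervening noun "Jean" rather than the plural object "lettres" with which "écrites" actually agrees.

```python
# Surface-heuristic baseline sketch for number agreement: copy the number
# of the nearest preceding noun-like token. Token triples (word, POS,
# number) are an assumed format standing in for a parsed corpus.
def nearest_noun_heuristic(tokens, verb_index):
    """Predict the verb's number from the nearest preceding noun or proper noun."""
    for word, pos, number in reversed(tokens[:verb_index]):
        if pos in {"NOUN", "PROPN"}:
            return number
    return "sg"  # arbitrary fallback when no noun precedes the verb

sent = [("les", "DET", "pl"), ("lettres", "NOUN", "pl"),
        ("que", "PRON", None), ("Jean", "PROPN", "sg"),
        ("a", "AUX", "sg"), ("écrites", "VERB", "?")]
# Returns 'sg' (from 'Jean'), but 'écrites' agrees with the plural 'lettres'.
print(nearest_noun_heuristic(sent, verb_index=5))
```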
ConvFiT: Conversational Fine-Tuning of Pretrained Language Models
Comment: EMNLP 2021 (long paper)
Link: http://arxiv.org/abs/2109.10126
Abstract
Transformer-based language models (LMs) pretrained on large text collections are proven to store a wealth of semantic knowledge. However, 1) they are not effective as sentence encoders when used off-the-shelf, and 2) thus typically lag behind conversationally pretrained (e.g., via response selection) encoders on conversational tasks such as intent detection (ID). In this work, we propose ConvFiT, a simple and efficient two-stage procedure which turns any pretrained LM into a universal conversational encoder (after Stage 1 ConvFiT-ing) and task-specialised sentence encoder (after Stage 2). We demonstrate that 1) full-blown conversational pretraining is not required, and that LMs can be quickly transformed into effective conversational encoders with much smaller amounts of unannotated data; 2) pretrained LMs can be fine-tuned into task-specialised sentence encoders, optimised for the fine-grained semantics of a particular task. Consequently, such specialised sentence encoders allow for treating ID as a simple semantic similarity task based on interpretable nearest neighbours retrieval. We validate the robustness and versatility of the ConvFiT framework with such similarity-based inference on the standard ID evaluation sets: ConvFiT-ed LMs achieve state-of-the-art ID performance across the board, with particular gains in the most challenging, few-shot setups.
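Once utterances are encoded by such a task-specialised sentence encoder, intent detection reduces to nearest-neighbour retrieval over the training set. A minimal sketch, with random vectors standing in for ConvFiT-ed sentence embeddings:

```python
# Similarity-based intent detection sketch: 1-nearest-neighbour over cosine
# similarity. Random vectors stand in for encoded utterances; the intent
# labels are placeholders.
import numpy as np

rng = np.random.default_rng(0)
train_emb = rng.normal(size=(20, 64))            # stand-in encoded training utterances
train_intents = [f"intent_{i % 4}" for i in range(20)]

def knn_intent(query_emb, train_emb, train_intents):
    """Predict the intent of the single most cosine-similar training utterance."""
    sims = train_emb @ query_emb / (
        np.linalg.norm(train_emb, axis=1) * np.linalg.norm(query_emb))
    return train_intents[int(np.argmax(sims))]

query = rng.normal(size=64)                      # stand-in encoded query
print(knn_intent(query, train_emb, train_intents))
```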
On the Difficulty of Segmenting Words with Attention
Comment: Accepted at the "Workshop on Insights from Negative Results in NLP" (EMNLP 2021)
Link: http://arxiv.org/abs/2109.10107
Abstract
Word segmentation, the problem of finding word boundaries in speech, is of interest for a range of tasks. Previous papers have suggested that for sequence-to-sequence models trained on tasks such as speech translation or speech recognition, attention can be used to locate and segment the words. We show, however, that even on monolingual data this approach is brittle. In our experiments with different input types, data sizes, and segmentation algorithms, only models trained to predict phones from words succeed in the task. Models trained to predict words from either phones or speech (i.e., the opposite direction, which is needed to generalize to new data) yield much worse results, suggesting that attention-based segmentation is only useful in limited scenarios.
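The attention-based segmentation idea being tested can be sketched as follows: align each input unit to the output word that attends to it most, and place a boundary wherever the alignment changes. The attention matrix below is hand-made for illustration, not taken from a trained model.

```python
# Attention-based segmentation sketch: rows = output words, columns = input
# phones. Each phone is assigned to its argmax-attending word; a boundary
# is placed where the assigned word changes.
import numpy as np

attn = np.array([            # 2 output words attending over 5 input phones
    [0.6, 0.7, 0.1, 0.1, 0.2],
    [0.4, 0.3, 0.9, 0.9, 0.8],
])

def segment_from_attention(attn):
    """Return input positions where the argmax-aligned output word changes."""
    aligned = attn.argmax(axis=0)    # word index for each input position
    return [i for i in range(1, len(aligned)) if aligned[i] != aligned[i - 1]]

print(segment_from_attention(attn))  # [2]: boundary before the third phone
```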
Learning Kernel-Smoothed Machine Translation with Retrieved Examples
Comment: EMNLP 2021
Link: http://arxiv.org/abs/2109.09991
Abstract
How can neural machine translation (NMT) models be effectively adapted to emerging cases without retraining? Despite the great success of neural machine translation, updating deployed models online remains a challenge. Existing non-parametric approaches that retrieve similar examples from a database to guide the translation process are promising, but are prone to overfitting the retrieved examples. In this work, we propose to learn Kernel-Smoothed Translation with Example Retrieval (KSTER), an effective approach to adapt neural machine translation models online. Experiments on domain adaptation and multi-domain machine translation datasets show that even without expensive retraining, KSTER is able to achieve improvements of 1.1 to 1.5 BLEU scores over the best existing online adaptation methods. The code and trained models are released at https://github.com/jiangqn/KSTER.
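A minimal sketch of the kernel-smoothing idea, under assumed tensor shapes and a fixed mixing weight (in KSTER itself, the bandwidth and mixing are learned): mix the model's next-token distribution with a retrieval distribution built from a Gaussian kernel over distances to retrieved examples.

```python
# Kernel-smoothed prediction sketch: p = mix * p_model + (1 - mix) * p_retrieval,
# where p_retrieval weights each retrieved example's target token by a
# Gaussian kernel over its distance to the current decoder state.
# All tensors below are random stand-ins; bandwidth/mix are fixed here.
import numpy as np

def kernel_smoothed_probs(model_probs, query, example_keys, example_token_ids,
                          vocab_size, bandwidth=1.0, mix=0.5):
    """Return the mixture of the model distribution and the retrieval distribution."""
    dists = np.linalg.norm(example_keys - query, axis=1)    # distance to each example
    weights = np.exp(-dists**2 / bandwidth)                 # Gaussian kernel
    weights /= weights.sum()
    retrieval_probs = np.zeros(vocab_size)
    np.add.at(retrieval_probs, example_token_ids, weights)  # scatter weights onto vocab
    return mix * model_probs + (1 - mix) * retrieval_probs

rng = np.random.default_rng(0)
V = 10
model_probs = np.full(V, 1.0 / V)           # stand-in NMT next-token distribution
query = rng.normal(size=8)                  # stand-in current decoder state
keys = rng.normal(size=(4, 8))              # stand-in retrieved example states
tokens = np.array([3, 3, 7, 1])             # target tokens of retrieved examples
print(kernel_smoothed_probs(model_probs, query, keys, tokens, V).round(3))
```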
Generalization in Text-based Games via Hierarchical Reinforcement Learning
Comment: 41 pages, 11 figures, EMNLP 2021 Findings
Link: http://arxiv.org/abs/2109.09968
Abstract
Deep reinforcement learning provides a promising approach for text-based games in studying natural language communication between humans and artificial agents. However, generalization remains a big challenge, as the agents depend critically on the complexity and variety of training tasks. In this paper, we address this problem by introducing a hierarchical framework built upon a knowledge graph-based RL agent. At the high level, a meta-policy is executed to decompose the whole game into a set of subtasks specified by textual goals, and to select one of them based on the KG. A sub-policy at the low level is then executed to conduct goal-conditioned reinforcement learning. We carry out experiments on games with various difficulty levels and show that the proposed method enjoys favorable generalizability.
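The two-level control loop can be sketched as follows. Both policies here are trivial stand-ins (the goal candidates come from a plain list rather than a knowledge graph), meant only to show how the meta-policy and the goal-conditioned sub-policy interact.

```python
# Toy hierarchical control loop: a meta-policy picks a textual subtask goal,
# and a goal-conditioned sub-policy picks a low-level action toward it.
# Both policies are hypothetical stand-ins, not the paper's learned agents.
def meta_policy(goal_candidates, step):
    """Stand-in goal selection (a real agent would score candidates via the KG)."""
    return goal_candidates[step % len(goal_candidates)]

def sub_policy(observation, goal):
    """Stand-in goal-conditioned action selection."""
    return f"do({goal})"

goal_candidates = ["find key", "open chest", "light lamp"]
obs = "You are in a dark room."
for step in range(3):
    goal = meta_policy(goal_candidates, step)  # high level: choose subtask
    action = sub_policy(obs, goal)             # low level: act toward the goal
    print(f"step {step}: goal='{goal}' -> action='{action}'")
```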
What BERT Based Language Models Learn in Spoken Transcripts: An Empirical Study
Comment: BlackboxNLP @ EMNLP 2021 (15 pages, includes Appendix)
Link: http://arxiv.org/abs/2109.09105
Abstract
Language Models (LMs) have been ubiquitously leveraged in various tasks including spoken language understanding (SLU). Spoken language requires careful understanding of speaker interactions, dialog states and speech-induced multimodal behaviors to generate a meaningful representation of the conversation. In this work, we propose to dissect SLU into three representative properties: conversational (disfluency, pause, overtalk), channel (speaker-type, turn-tasks) and ASR (insertion, deletion, substitution). We probe BERT-based language models (BERT, RoBERTa) trained on spoken transcripts to investigate their ability to understand these multifarious properties in the absence of any speech cues. Empirical results indicate that LMs are surprisingly good at capturing conversational properties such as pause prediction and overtalk detection from lexical tokens. On the downside, the LMs score low on turn-tasks and ASR error prediction. Additionally, pre-training the LM on spoken transcripts restrains its linguistic understanding. Finally, we establish the efficacy and transferability of the mentioned properties on two benchmark datasets: the Switchboard Dialog Act and Disfluency datasets.
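One of the probing tasks, pause prediction, can be framed as a token-level binary probe over frozen LM features. A minimal sketch, with random vectors standing in for per-token BERT/RoBERTa representations and made-up labels marking whether a pause follows the token:

```python
# Pause-prediction probe sketch: a logistic classifier over frozen per-token
# LM features. Features and labels here are random stand-ins, not real
# Switchboard-style annotations.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
token_feats = rng.normal(size=(2000, 96))      # stand-in frozen token embeddings
pause_follows = rng.integers(0, 2, size=2000)  # stand-in pause labels

split = 1500                                   # simple train/test split
probe = LogisticRegression(max_iter=1000).fit(token_feats[:split], pause_follows[:split])
print("pause-probe acc:", probe.score(token_feats[split:], pause_follows[split:]))
```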
Perspective-taking and Pragmatics for Generating Empathetic Responses Focused on Emotion Causes
Comment: Accepted at EMNLP 2021 main conference. For the code and dataset, see  https://github.com/skywalker023/focused-empathy
Link: http://arxiv.org/abs/2109.08828
Abstract
Empathy is a complex cognitive ability based on reasoning about others' affective states. In order to better understand others and express stronger empathy in dialogues, we argue that two issues must be tackled at the same time: (i) identifying which word in the other's utterance is the cause of their emotion, and (ii) reflecting those specific words in the response generation. However, previous approaches for recognizing emotion cause words in text require sub-utterance-level annotations, which can be demanding. Taking inspiration from social cognition, we leverage a generative estimator to infer emotion cause words from utterances with no word-level labels. We also introduce a novel method based on pragmatics to make dialogue models focus on targeted words in the input during generation. Our method is applicable to any dialogue model, with no additional training on the fly. We show our approach improves multiple best-performing dialogue agents at generating more focused empathetic responses, in terms of both automatic and human evaluation.
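The "focus on targeted words" idea can be illustrated as a simple reranking rule: among candidate responses, boost those that cover the inferred emotion-cause words. The scoring rule below is an illustrative stand-in, not the paper's pragmatic decoding method.

```python
# Toy pragmatics-style reranking: prefer candidate responses that reflect
# the inferred emotion-cause words. Candidates are (text, base_score) pairs;
# the additive scoring rule is an illustrative assumption.
def rerank(candidates, cause_words, alpha=0.5):
    """Pick the candidate maximizing base score + alpha * cause-word coverage."""
    def score(text, base_score):
        covered = sum(w in text.lower() for w in cause_words) / len(cause_words)
        return base_score + alpha * covered
    return max(candidates, key=lambda c: score(c[0], c[1]))

cause_words = ["exam", "failed"]
candidates = [("That sounds hard.", -1.0),
              ("I'm sorry you failed the exam; that must hurt.", -1.2)]
print(rerank(candidates, cause_words)[0])  # the cause-focused response wins
```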