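This session focuses on two frontier topics, language models and text summarization. Eight first authors of NAACL papers on these topics have been invited to give dedicated talks and join a roundtable discussion.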

Yusheng Su
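Yusheng Su (苏裕胜) is a third-year PhD student in computer science at Tsinghua University whose main research area is natural language processing (pre-trained language models). Su has published multiple papers at venues including WWW, NAACL, ACL, and IEEE/TASLP, and has served as a reviewer for COLING, EMNLP, ACL, NAACL, and ICML.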
On Transferability of Prompt Tuning for Natural Language Processing
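Prompt tuning (PT) achieves performance comparable to full-parameter fine-tuning while tuning only a small number of parameters, which makes it a parameter-efficient way to use very large pre-trained language models. Compared with fine-tuning, however, PT requires more training time. We therefore explore whether PT can be improved through prompt transfer. In this work we empirically study the transferability of prompts across different downstream tasks and across pre-trained language models of different types and scales.
(2) In addition, these trained prompts can be used directly to initialize the prompts for similar tasks, which speeds up PT training.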
Xuandong Zhao
Provably Confidential Language Modelling
Large language models are shown to memorize privacy information such as social security numbers in training data. Given the sheer scale of the training corpus, it is challenging to screen and filter all privacy data, either manually or automatically. In this paper, we propose Confidentially Redacted Training (CRT), a method to train language generation models while protecting the confidential segments. We borrow ideas from differential privacy (which solves a related but distinct problem) and show that our method is able to provably prevent unintended memorization by randomizing parts of the training process. Moreover, we show that redaction with an approximately correct screening policy amplifies the confidentiality guarantee. We implement the method for both LSTM and GPT language models. Our experimental results show that the models trained by CRT obtain almost the same perplexity while preserving strong confidentiality.
Weiyan Shi
Selective Differential Privacy for Language Modeling
With the increasing applications of language models, it has become crucial to protect these models from leaking private information. Previous work has attempted to tackle this challenge by training RNN-based language models with differential privacy guarantees. However, applying classical differential privacy to language models leads to poor model performance as the underlying privacy notion is over-pessimistic and provides undifferentiated protection for all tokens in the data. Given that the private information in natural language is sparse (for example, the bulk of an email might not carry personally identifiable information), we propose a new privacy notion, selective differential privacy, to provide rigorous privacy guarantees on the sensitive portion of the data to improve model utility. To realize such a new notion, we develop a corresponding privacy mechanism, Selective-DPSGD, for RNN-based language models. Besides language modeling, we also apply the method to a more concrete application--dialog systems. Experiments on both language modeling and dialog system building show that the proposed privacy-preserving mechanism achieves better utilities while remaining safe under various privacy attacks compared to the baselines.
Jingfeng Yang
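Jingfeng Yang is currently a research scientist at Amazon (having temporarily set aside a PhD offer in natural language processing from the University of Washington's computer science department). Yang received a master's degree from the Georgia Institute of Technology, advised by Prof. Diyi Yang (杨笛一), and a dual bachelor's degree in biology and computer science from Peking University. Yang's main research interests include semantic parsing, text generation, and multilingual natural language processing. Yang has published several first-author papers at ACL, EMNLP, and NAACL, has served as a reviewer for ACL, EMNLP, NAACL, NeurIPS, and AAAI, and has held research internships at Google, Amazon, Microsoft, and the University of Edinburgh.
Compositional Generalization in Large Language Model Era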
Jiacheng Xu
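Jiacheng Xu (徐嘉诚) is a research scientist at Salesforce Research. He works on natural language processing, in particular frontier research on natural language generation and text summarization. He received his PhD in 2022 from the University of Texas at Austin, advised by Greg Durrett. He received his bachelor's degree from Fudan University in 2017, where he studied under Profs. Xipeng Qiu (邱锡鹏) and Xuanjing Huang (黄萱菁). He previously interned at Google (2020) and Microsoft (2019).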
Massive-scale Decoding for Text Generation using Lattices
Conditional neural text generation models generate high-quality outputs, but often concentrate around a mode when what we really want is a diverse set of options. We present a search algorithm to construct lattices encoding a massive number of generation options. First, we restructure decoding as a best-first search, which explores the space differently than beam search and improves efficiency by avoiding pruning paths. Second, we revisit the idea of hypothesis recombination: we can identify pairs of similar generation candidates during search and merge them as an approximation. On both summarization and machine translation, we show that our algorithm encodes thousands of diverse options that remain grammatical and high-quality into one lattice. This algorithm provides a foundation for building downstream generation applications on top of massive-scale diverse outputs.
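To make the first idea in the abstract concrete, here is a minimal sketch of best-first decoding, assuming a hypothetical toy next-token table (`TOY_MODEL`) in place of a neural model. It is not the authors' implementation, and it leaves out hypothesis recombination and lattice construction. The names `TOY_MODEL`, `best_first_search`, `max_expansions`, and `max_len` are made up for illustration.

```python
import heapq
import math

# Hypothetical toy "model": next-token probabilities depend only on the last token.
TOY_MODEL = {
    "<s>": {"a": 0.6, "b": 0.4},
    "a": {"b": 0.5, "</s>": 0.5},
    "b": {"a": 0.3, "</s>": 0.7},
}


def best_first_search(max_expansions=50, max_len=4):
    """Always expand the most probable partial hypothesis on the frontier.

    Unlike beam search, no hypothesis is pruned at a fixed step; the search
    stops when the expansion budget is used up or the frontier is empty.
    """
    # Heap entries are (negative log-probability, token tuple).
    frontier = [(0.0, ("<s>",))]
    finished = []
    expansions = 0
    while frontier and expansions < max_expansions:
        neg_logp, tokens = heapq.heappop(frontier)
        if tokens[-1] == "</s>":
            finished.append((math.exp(-neg_logp), tokens))
            continue
        if len(tokens) > max_len:
            continue  # drop unfinished hypotheses that grew too long
        expansions += 1
        for token, prob in TOY_MODEL[tokens[-1]].items():
            heapq.heappush(frontier, (neg_logp - math.log(prob), tokens + (token,)))
    return finished


if __name__ == "__main__":
    for prob, tokens in best_first_search():
        print(f"{prob:.3f}", " ".join(tokens))
```

Because every extension can only lower a hypothesis's probability, completed hypotheses come off the heap in order of decreasing probability. A real decoder would replace the lookup table with model scores.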
Xiangru Tang
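Xiangru Tang (唐相儒) is a first-year PhD student in computer science at Yale University, advised by Mark Gerstein. He previously received a master's degree in computer science from Yale, where he worked with Dragomir Radev. His main research interests are pre-trained language models, text generation, and computational biology.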
CONFIT: Toward Faithful Dialogue Summarization with Linguistically-Informed Contrastive Fine-tuning
Factual inconsistencies in generated summaries severely limit the practical applications of abstractive dialogue summarization. Although significant progress has been achieved by using pre-trained neural language models, substantial amounts of hallucinated content are found during human evaluation. In this work, we first devise a typology of factual errors to better understand the types of hallucinations generated by current models and conduct a human evaluation on a popular dialogue summarization dataset. We further propose a training strategy, called CONFIT, that improves the factual consistency and overall quality of summaries via a novel contrastive fine-tuning. To tackle the top factual errors from our annotation, we introduce an additional contrastive loss with carefully designed hard negative samples and a self-supervised dialogue-specific loss to capture the key information between speakers. We show that our model significantly reduces all kinds of factual errors on both SAMSum dialogue summarization and AMI meeting summarization. On both datasets, we achieve significant improvements over state-of-the-art baselines using both automatic metrics, ROUGE and BARTScore, and human evaluation.
Yue Fang
From spoken dialogue to formal summary: An utterance rewriting for dialogue summarization
Due to the characteristics of dialogue, such as unstructured contexts and multiple parties speaking from a first-person perspective, many successful text summarization approaches fail on dialogue summarization. In the dialogue summarization task, the input dialogue is usually in a spoken style with ellipsis and co-references, whereas the output summaries are more formal and complete. Therefore, a dialogue summarization model should be able to recover the elided content and the co-reference information and then produce a suitable summary accordingly. However, the current state-of-the-art models pay more attention to the topic or structure of the summary than to the consistency of the dialogue summary with its input dialogue context, and may therefore suffer from personal and logical inconsistency problems. In this paper, we propose a new model, named ReWriteSum, to tackle this problem. First, an utterance rewriter completes the elided content of the dialogue and produces the rewritten utterances. Then, a co-reference data augmentation mechanism replaces each referring expression for a person with that person's specific name to enhance the personal information.
Xiangci Li
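PhD student in computer science, UT Dallas
Xiangci Li (李向磁) is a second-year PhD student at UT Dallas, advised by Prof. Jessica Ouyang. Li's main research area is scientific literature processing (information extraction and related work summary generation). Li received a master's degree from the University of Southern California, advised by Nanyun Peng (彭楠贇), and has interned at the Chan Zuckerberg Initiative, Baidu, and Tencent AI Lab in North America.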
CORWA: A Citation-Oriented Related Work Annotation Dataset
Academic research is an exploratory activity to discover new solutions to problems. By this nature, academic research works perform literature reviews to distinguish their novelties from prior work. In natural language processing, this literature review is usually conducted under the “Related Work” section. The task of related work generation aims to automatically generate the related work section given the rest of the research paper and a list of papers to cite. Prior work on this task has focused on the sentence as the basic unit of generation, neglecting the fact that related work sections consist of variable length text fragments derived from different information sources. As a first step toward a linguistically-motivated related work generation framework, we present a Citation Oriented Related Work Annotation (CORWA) dataset that labels different types of citation text fragments from different information sources. We train a strong baseline model that automatically tags the CORWA labels on massive unlabeled related work section texts. We further suggest a novel framework for human-in-the-loop, iterative, abstractive related work generation.