Frontier Topics in Language Models and Text Summarization: Eight First Authors of NAACL Papers in One Session

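NAACL is a top-tier academic conference in natural language processing. To further promote international academic exchange, the Qingyuan Society (青源会) will hold the "Qingyuan Seminar | NAACL Special Online Session" on August 4 from 09:00 to 12:20. The session is convened by Xiangru Tang (唐相儒), a member of the Qingyuan research group and a PhD student at Yale University.

The session focuses on two frontier topics, language models and text summarization, and brings together eight first authors of NAACL papers on these topics for dedicated talks and a roundtable discussion.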


Yusheng Su

苏裕胜

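PhD student in Computer Science, Tsinghua University

Yusheng Su is a third-year PhD student in computer science at Tsinghua University whose main research area is natural language processing, specifically pre-trained language models. Su has published multiple papers at venues including WWW, NAACL, ACL, and IEEE/TASLP, and has served as a reviewer for COLING, EMNLP, ACL, NAACL, and ICML.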

On Transferability of Prompt Tuning for Natural Language Processing

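Prompt tuning (PT) achieves performance comparable to full-parameter fine-tuning while tuning only a small number of parameters, making it a parameter-efficient way to use very large pre-trained language models. Compared with fine-tuning, however, PT requires more training time. We therefore explore whether PT can be improved through prompt transfer, and in this work we empirically study the transferability of prompts across different downstream tasks and across pre-trained language models of different types and sizes.

We find that:

(1) In the zero-shot setting, trained prompts transfer effectively to similar tasks on the same pre-trained language model, and can also be transferred to other pre-trained language models to perform similar tasks.

(2) Trained prompts can also be used directly to initialize prompts for similar tasks, which speeds up PT training.

(3) To explore what affects transferability, we study a range of transferability indicators and find that the overlap rate of the neurons activated by prompts correlates strongly with transferability. Our results suggest that prompt transfer is a promising way to improve PT, and we encourage further research into how prompts activate pre-trained language models to perform various tasks.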
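As a rough illustration of the neuron-overlap idea in finding (3), the sketch below compares which neurons two prompts switch on. It is a minimal sketch under stated assumptions, not the paper's implementation: the function names are hypothetical, a neuron is assumed to count as "activated" when its value is positive, and the overlap is assumed to be measured as the cosine similarity of the two binary activation patterns.

```python
import math


def activation_state(values):
    """Binarize activations: 1 if the neuron fires (value > 0), else 0.

    The "> 0" threshold is an assumption made for this sketch.
    """
    return [1 if v > 0 else 0 for v in values]


def overlap_rate(acts_a, acts_b):
    """Cosine similarity between the binary activation states of two prompts."""
    a = activation_state(acts_a)
    b = activation_state(acts_b)
    if len(a) != len(b):
        raise ValueError("both activation vectors must cover the same neurons")
    shared = sum(x * y for x, y in zip(a, b))  # neurons active under both prompts
    norm = math.sqrt(sum(a) * sum(b))          # product of the two vector norms
    return shared / norm if norm else 0.0


# Hypothetical activations of the same four neurons under two prompts.
prompt_a = [1.3, -0.2, 0.7, 0.0]
prompt_b = [2.1, -1.0, -0.4, 0.0]
print(overlap_rate(prompt_a, prompt_b))
```

A value close to 1 means the two prompts activate nearly the same neurons; a value close to 0 means they activate largely disjoint sets.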

Xuandong Zhao

赵宣栋

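PhD student in Computer Science, UCSB

Xuandong Zhao is a third-year PhD student in computer science at UCSB, advised by Lei Li and Yu-Xiang Wang. Zhao has interned at companies including Alibaba and Microsoft, with research interests in machine learning and natural language processing (model protection and privacy protection).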

Provably Confidential Language Modelling

Large language models are shown to memorize privacy information such as social security numbers in training data. Given the sheer scale of the training corpus, it is challenging to screen and filter all privacy data, either manually or automatically. In this paper, we propose Confidentially Redacted Training (CRT), a method to train language generation models while protecting the confidential segments. We borrow ideas from differential privacy (which solves a related but distinct problem) and show that our method is able to provably prevent unintended memorization by randomizing parts of the training process. Moreover, we show that redaction with an approximately correct screening policy amplifies the confidentiality guarantee. We implement the method for both LSTM and GPT language models. Our experimental results show that the models trained by CRT obtain almost the same perplexity while preserving strong confidentiality.
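The redaction step mentioned in the abstract can be pictured with a small sketch. This is only an illustration under assumptions, not the CRT implementation: the screening policy is assumed to be a single regular expression for US social security numbers, the `<MASK>` token and the function name `redact` are hypothetical, and the randomized training component that the abstract describes is not shown.

```python
import re

# Assumed screening policy: flag anything shaped like a US social security number.
SSN_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")
MASK = "<MASK>"  # hypothetical placeholder token


def redact(text, pattern=SSN_PATTERN, mask=MASK):
    """Replace every span flagged by the screening policy with the mask token."""
    return pattern.sub(mask, text)


print(redact("My SSN is 123-45-6789."))
```

A regular expression like this is only approximately correct: it will miss confidential spans with unusual formatting. That is the situation the abstract refers to when it discusses redaction with an approximately correct screening policy.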

Weiyan Shi

史唯艳

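PhD student, Columbia University

My main research area is dialogue systems, especially strategic and influential dialogue systems (for example, persuasive dialogue systems). My other research interests include dialogue generation and privacy-preserving NLP models.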

Selective Differential Privacy for Language Modeling

With the increasing applications of language models, it has become crucial to protect these models from leaking private information. Previous work has attempted to tackle this challenge by training RNN-based language models with differential privacy guarantees. However, applying classical differential privacy to language models leads to poor model performance as the underlying privacy notion is over-pessimistic and provides undifferentiated protection for all tokens in the data. Given that the private information in natural language is sparse (for example, the bulk of an email might not carry personally identifiable information), we propose a new privacy notion, selective differential privacy, to provide rigorous privacy guarantees on the sensitive portion of the data to improve model utility. To realize such a new notion, we develop a corresponding privacy mechanism, Selective-DPSGD, for RNN-based language models. Besides language modeling, we also apply the method to a more concrete application--dialog systems. Experiments on both language modeling and dialog system building show that the proposed privacy-preserving mechanism achieves better utilities while remaining safe under various privacy attacks compared to the baselines.
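The intuition behind treating sensitive and non-sensitive tokens differently can be sketched as follows. This is a simplified illustration, not the paper's Selective-DPSGD mechanism and it carries no formal privacy guarantee: it assumes per-token gradient vectors and a boolean sensitivity mask are already available, and the function names and parameters are hypothetical. Gradients from sensitive tokens are clipped and perturbed with Gaussian noise, while gradients from the remaining tokens are used as they are.

```python
import math
import random


def clip(grad, max_norm):
    """Scale a gradient vector down so its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    scale = min(1.0, max_norm / norm) if norm > 0 else 1.0
    return [g * scale for g in grad]


def selective_noisy_gradient(token_grads, sensitive, max_norm=1.0,
                             noise_multiplier=1.0, rng=None):
    """Average per-token gradients, clipping and noising only sensitive ones."""
    rng = rng or random.Random()
    dim = len(token_grads[0])
    public = [0.0] * dim   # sum of gradients from non-sensitive tokens
    private = [0.0] * dim  # sum of clipped gradients from sensitive tokens
    for grad, is_sensitive in zip(token_grads, sensitive):
        if is_sensitive:
            private = [p + c for p, c in zip(private, clip(grad, max_norm))]
        else:
            public = [p + g for p, g in zip(public, grad)]
    if any(sensitive):
        # Gaussian noise scaled to the clipping bound, added once per step.
        sigma = noise_multiplier * max_norm
        private = [p + rng.gauss(0.0, sigma) for p in private]
    return [(p + q) / len(token_grads) for p, q in zip(public, private)]


# Two hypothetical tokens: the first is flagged as sensitive, the second is not.
grads = [[3.0, 4.0], [1.0, 0.0]]
print(selective_noisy_gradient(grads, [True, False], rng=random.Random(0)))
```

Because noise is added only where the mask says the data is sensitive, the non-sensitive majority of tokens contributes clean gradients, which is the source of the utility gain the abstract describes.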

Jingfeng Yang

杨靖锋

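Research Scientist, Amazon

Jingfeng Yang is currently a research scientist at Amazon (having set aside, for now, a PhD offer in natural language processing from the computer science department at the University of Washington). Yang received a master's degree from the Georgia Institute of Technology, advised by Prof. Diyi Yang, and a dual bachelor's degree in biology and computer science from Peking University. Yang's main research areas include semantic parsing, text generation, and multilingual natural language processing. Yang has published multiple first-author papers at ACL, EMNLP, and NAACL, has served as a reviewer for ACL, EMNLP, NAACL, NeurIPS, and AAAI, and has held research internships at Google, Amazon, Microsoft, and the University of Edinburgh.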

Compositional Generalization in Large Language Model Era

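Compositional generalization remains one of the most important challenges for large models. It is key to reasoning, to out-of-distribution generalization, and to the ultimate goal of artificial general intelligence. Our two NAACL papers propose two ways to strengthen compositional generalization, each from a different perspective. From the model perspective, sequential prompt filling, together with ensembling a pre-trained model and a fine-tuned model, improves out-of-distribution generalization while preserving in-distribution generalization; we find that constrained decoding of the pre-trained model and renormalizing probabilities over the restricted vocabulary are key to making this technique work. From the data perspective, we propose data augmentation by substituting subtrees of semantic trees, and then use the augmented data to train a seq2seq generation model. Both methods yield clear improvements on a series of compositional semantic parsing benchmarks.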
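Renormalizing probabilities over a restricted vocabulary can be illustrated with a few lines of code. This is a minimal sketch under assumptions rather than the authors' implementation: the next-token distribution is assumed to be a plain dictionary, and the function name `renormalize` and the toy vocabulary are hypothetical.

```python
def renormalize(probs, allowed):
    """Drop tokens outside `allowed` and rescale the rest to sum to 1."""
    kept = {tok: p for tok, p in probs.items() if tok in allowed}
    total = sum(kept.values())
    if total == 0:
        raise ValueError("no allowed token has non-zero probability")
    return {tok: p / total for tok, p in kept.items()}


# Hypothetical next-token distribution; only "SELECT" and "FROM" are valid here.
next_token_probs = {"SELECT": 0.5, "FROM": 0.25, "banana": 0.25}
print(renormalize(next_token_probs, {"SELECT", "FROM"}))
```

Without the rescaling step, probability mass assigned to invalid tokens would be lost, and scores from different decoding steps or different models would not be comparable.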

Jiacheng Xu

徐嘉诚

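Research Scientist, Salesforce Research

Jiacheng Xu is a research scientist at Salesforce Research, focusing on natural language processing and in particular on frontier research in natural language generation and text summarization. He received his PhD in 2022 from the University of Texas at Austin, advised by Greg Durrett. He graduated from Fudan University with a bachelor's degree in 2017, where he worked with Prof. Xipeng Qiu and Prof. Xuanjing Huang. He previously interned at Google (2020) and Microsoft (2019).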

Massive-scale Decoding for Text Generation using Lattices

Conditional neural text generation models generate high-quality outputs, but often concentrate around a mode when what we really want is a diverse set of options. We present a search algorithm to construct lattices encoding a massive number of generation options. First, we restructure decoding as a best-first search, which explores the space differently than beam search and improves efficiency by avoiding pruning paths. Second, we revisit the idea of hypothesis recombination: we can identify pairs of similar generation candidates during search and merge them as an approximation. On both summarization and machine translation, we show that our algorithm encodes thousands of diverse options that remain grammatical and high-quality into one lattice. This algorithm provides a foundation for building downstream generation applications on top of massive-scale diverse outputs.
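The first ingredient in the abstract, decoding as a best-first search, can be sketched with a priority queue. This is a toy illustration under assumptions, not the paper's algorithm: it does not build a lattice or merge similar hypotheses, the `toy_model` next-token function and the `budget` on expansions are hypothetical, and scores are plain cumulative log-probabilities.

```python
import heapq
import math


def best_first_decode(next_logprobs, eos="</s>", max_len=4, budget=20):
    """Always expand the highest-scoring partial hypothesis; never prune."""
    heap = [(0.0, ())]  # entries are (negative cumulative log-prob, tokens)
    finished = []
    expansions = 0
    while heap and expansions < budget:
        neg_score, prefix = heapq.heappop(heap)
        is_complete = bool(prefix) and prefix[-1] == eos
        if is_complete or len(prefix) >= max_len:
            finished.append((-neg_score, prefix))
            continue
        expansions += 1
        for token, logp in next_logprobs(prefix).items():
            heapq.heappush(heap, (neg_score - logp, prefix + (token,)))
    return sorted(finished, reverse=True)  # best hypothesis first


def toy_model(prefix):
    """Hypothetical next-token log-probabilities for a two-word vocabulary."""
    if not prefix:
        return {"a": math.log(0.6), "b": math.log(0.4)}
    return {"</s>": math.log(0.9), "a": math.log(0.1)}


for score, tokens in best_first_decode(toy_model):
    print(round(math.exp(score), 4), " ".join(tokens))
```

Unlike beam search, which keeps a fixed number of hypotheses per step and discards the rest, this loop keeps every partial hypothesis in the queue and stops only when the expansion budget or the queue is exhausted.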

Xiangru Tang

唐相儒

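PhD student, Yale University

Xiangru Tang is a first-year PhD student in the Department of Computer Science at Yale University, advised by Mark Gerstein. He previously received a master's degree in computer science from Yale University, where he worked with Dragomir Radev. His main research areas are pre-trained language models, text generation, and computational biology.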

CONFIT: Toward Faithful Dialogue Summarization with Linguistically-Informed Contrastive Fine-tuning

Factual inconsistencies in generated summaries severely limit the practical applications of abstractive dialogue summarization. Although significant progress has been achieved by using pre-trained neural language models, substantial amounts of hallucinated content are found during human evaluation. In this work, we first devise a typology of factual errors to better understand the types of hallucinations generated by current models, and conduct a human evaluation on a popular dialogue summarization dataset. We further propose a training strategy that improves the factual consistency and overall quality of summaries via a novel contrastive fine-tuning, called CONFIT. To tackle the top factual errors from our annotation, we introduce an additional contrastive loss with carefully designed hard negative samples and a self-supervised dialogue-specific loss to capture the key information between speakers. We show that our model significantly reduces all kinds of factual errors on both SAMSum dialogue summarization and AMI meeting summarization. On both datasets, we achieve significant improvements over state-of-the-art baselines using both automatic metrics, ROUGE and BARTScore, and human evaluation.

Yue Fang

房越

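Master's student, Beijing University of Posts and Telecommunications

Yue Fang is a second-year master's student at the School of Artificial Intelligence, Beijing University of Posts and Telecommunications, working on dialogue summarization.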

From spoken dialogue to formal summary: An utterance rewriting for dialogue summarization

Because dialogues have unstructured contexts and involve multiple parties speaking from a first-person perspective, many successful text summarization methods fail on dialogue summarization. In the dialogue summarization task, the input dialogue is usually in a spoken style with ellipsis and co-references, but the output summaries are more formal and complete. Therefore, a dialogue summarization model should be able to complete the elided content and co-reference information and then produce a suitable summary accordingly. However, the current state-of-the-art models pay more attention to the topic or structure of the summary than to the consistency of the summary with its input dialogue context, and may therefore suffer from personal and logical inconsistency. In this paper, we propose a new model, named ReWriteSum, to tackle this problem. First, an utterance rewriter completes the elided content of the dialogue and produces the rewritten utterances. Then, a co-reference data augmentation mechanism replaces referential mentions of a person with that person's specific name to enhance the personal information.

Xiangci Li

李向磁

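PhD student in Computer Science, UT Dallas

Xiangci Li is a second-year PhD student at UT Dallas, advised by Prof. Jessica Ouyang. Li's main research area is scientific literature processing (information extraction and related work generation). Li received a master's degree from the University of Southern California, advised by Nanyun Peng, and has interned at the Chan Zuckerberg Initiative, Baidu, and Tencent AI Lab in North America.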

CORWA: A Citation-Oriented Related Work Annotation Dataset

Academic research is an exploratory activity to discover new solutions to problems. By this nature, academic research works perform literature reviews to distinguish their novelties from prior work. In natural language processing, this literature review is usually conducted under the “Related Work” section. The task of related work generation aims to automatically generate the related work section given the rest of the research paper and a list of papers to cite. Prior work on this task has focused on the sentence as the basic unit of generation, neglecting the fact that related work sections consist of variable length text fragments derived from different information sources. As a first step toward a linguistically-motivated related work generation framework, we present a Citation Oriented Related Work Annotation (CORWA) dataset that labels different types of citation text fragments from different information sources. We train a strong baseline model that automatically tags the CORWA labels on massive unlabeled related work section texts. We further suggest a novel framework for human-in-the-loop, iterative, abstractive related work generation.
