Contrastive Learning for Sentence Representations: Research Progress as Seen at the Top Conferences


Author | 上杉翔二
悠闲会 · Information Retrieval
Compiled by | NewBeeNLP

This post continues the series by collecting several representative papers on contrastive learning for sentence representations.

SimCSE

SimCSE: Simple Contrastive Learning of Sentence Embeddings
paper: https://arxiv.org/abs/2104.08821
code: https://github.com/princeton-nlp/SimCSE

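EMNLP 2021. A simple method with a big payoff: the model predicts the input sentence itself in a contrastive objective, using nothing but standard dropout as noise.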

As shown in the figure above, there are two variants:

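Unsupervised SimCSE. The same input sentence is passed through the pretrained encoder twice, and the two embeddings obtained under independently sampled dropout masks form a "positive pair". Through careful analysis, the authors find that dropout essentially acts as data augmentation, and that removing it leads to representation collapse.
Supervised SimCSE. Natural language inference (NLI) datasets are used to learn sentence embeddings, with the annotated sentence pairs incorporated into contrastive learning.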
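To make the unsupervised variant concrete, here is a minimal sketch in plain Python of how dropout alone can produce a positive pair. The toy `encode` function (dropout followed by mean pooling), the 16-dimensional random token vectors, and the rate `p=0.1` are illustrative assumptions, not SimCSE's actual encoder.

```python
import random

def dropout(vec, p, rng):
    """Inverted dropout: zero each value with probability p, rescale the rest."""
    return [0.0 if rng.random() < p else v / (1.0 - p) for v in vec]

def encode(token_vectors, p, rng):
    """Toy encoder (hypothetical): dropout on every token vector, then mean pooling."""
    dropped = [dropout(v, p, rng) for v in token_vectors]
    dim = len(token_vectors[0])
    return [sum(v[i] for v in dropped) / len(dropped) for i in range(dim)]

rng = random.Random(0)
# One "sentence": 5 tokens, each a 16-dimensional vector.
sentence = [[rng.gauss(0, 1) for _ in range(16)] for _ in range(5)]

# Two passes over the same sentence sample two different dropout masks,
# so the two embeddings differ slightly: this is the positive pair.
z1 = encode(sentence, p=0.1, rng=rng)
z2 = encode(sentence, p=0.1, rng=rng)

# With dropout switched off, both passes give exactly the same embedding,
# so the "pair" carries no augmentation signal at all.
same_a = encode(sentence, p=0.0, rng=rng)
same_b = encode(sentence, p=0.0, rng=rng)

print(z1 != z2, same_a == same_b)
```

The only difference between the two views is the dropout mask, which is why the method needs no explicit text augmentation.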

The contrastive-learning part of the implementation looks like this:

import torch
import torch.nn as nn

def cl_forward(cls, ...):  # contrastive-learning part of the code (arguments elided)
    return_dict = return_dict if return_dict is not None else cls.config.use_return_dict
    ori_input_ids = input_ids
    batch_size = input_ids.size(0)
    # Number of sentences in one instance
    # 2: pair instance; 3: pair instance with a hard negative
    num_sent = input_ids.size(1)

    mlm_outputs = None
    # Flatten input for encoding
    input_ids = input_ids.view((-1, input_ids.size(-1)))  # (bs * num_sent, len)
    attention_mask = attention_mask.view((-1, attention_mask.size(-1)))  # (bs * num_sent, len)
    if token_type_ids is not None:
        token_type_ids = token_type_ids.view((-1, token_type_ids.size(-1)))  # (bs * num_sent, len)

    # Get raw embeddings, i.e. the features of the original sentences
    outputs = encoder(
        input_ids,
        attention_mask=attention_mask,
        token_type_ids=token_type_ids,
        position_ids=position_ids,
        head_mask=head_mask,
        inputs_embeds=inputs_embeds,
        output_attentions=output_attentions,
        output_hidden_states=True if cls.model_args.pooler_type in ['avg_top2', 'avg_first_last'] else False,
        return_dict=True,
    )

    # MLM auxiliary objective: run the MLM task on the masked inputs
    if mlm_input_ids is not None:
        mlm_input_ids = mlm_input_ids.view((-1, mlm_input_ids.size(-1)))
        mlm_outputs = encoder(  # features for the masked inputs
            mlm_input_ids,
            attention_mask=attention_mask,
            token_type_ids=token_type_ids,
            position_ids=position_ids,
            head_mask=head_mask,
            inputs_embeds=inputs_embeds,
            output_attentions=output_attentions,
            output_hidden_states=True if cls.model_args.pooler_type in ['avg_top2', 'avg_first_last'] else False,
            return_dict=True,
        )

    # Pooling
    pooler_output = cls.pooler(attention_mask, outputs)
    pooler_output = pooler_output.view((batch_size, num_sent, pooler_output.size(-1)))  # (bs, num_sent, hidden)

    # If using "cls", we add an extra MLP layer
    # (same as BERT's original implementation) over the representation.
    if cls.pooler_type == "cls":
        pooler_output = cls.mlp(pooler_output)

    # Separate representation: the two views z1 and z2 of each sentence
    z1, z2 = pooler_output[:, 0], pooler_output[:, 1]

    # Pairwise similarity matrix of shape (bs, bs), used for the contrastive loss
    cos_sim = cls.sim(z1.unsqueeze(1), z2.unsqueeze(0))

    # The positive for row i sits in column i, so the targets are 0..bs-1
    labels = torch.arange(cos_sim.size(0)).long().to(cls.device)
    loss_fct = nn.CrossEntropyLoss()
    loss = loss_fct(cos_sim, labels)
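The last few lines of the excerpt amount to a cross-entropy over the rows of the similarity matrix, with the diagonal as the correct class. The following is a rough, dependency-free sketch of that computation, not the repository's code; the helper names `cosine` and `info_nce` and the temperature value `0.05` are assumptions made for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def info_nce(z1, z2, temperature=0.05):
    """Mean cross-entropy over the rows of the (batch x batch) similarity
    matrix, where the correct "class" for row i is column i."""
    losses = []
    for i, anchor in enumerate(z1):
        logits = [cosine(anchor, other) / temperature for other in z2]
        m = max(logits)  # subtract the max for numerical stability
        log_sum_exp = m + math.log(sum(math.exp(l - m) for l in logits))
        losses.append(log_sum_exp - logits[i])
    return sum(losses) / len(losses)

z1 = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[0.9, 0.1], [0.1, 0.9]]   # view i is close to anchor i
swapped = [[0.1, 0.9], [0.9, 0.1]]   # view i is close to the wrong anchor

loss_aligned = info_nce(z1, aligned)
loss_swapped = info_nce(z1, swapped)
print(loss_aligned < loss_swapped)
```

The loss is small when each sentence is most similar to its own second view and large when another sentence in the batch is closer, which is exactly the pull-together, push-apart behaviour the objective is after.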

CLEAR

CLEAR: Contrastive Learning for Sentence Representation
paper: https://arxiv.org/abs/2012.15466
code: no code

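A sentence-level representation learning task. CLEAR's model structure is similar to SimCLR's, so the paper's main contribution is four data augmentation methods for constructing augmented sentences: word deletion, span deletion, reordering, and synonym substitution, as shown in the figure below. As in SimCLR, two augmented views of the same sentence form a positive pair, while the other sentences in the batch act as negatives.

Word deletion randomly deletes some words from the sentence; a run of consecutive deleted words is replaced by a single [del] token, so the sentence becomes [Tok[del], Tok3, Tok[del], Tok5, ..., TokN].
Span deletion is a special case of word deletion that removes a contiguous span of words, giving [Tok[del], Tok5, ..., TokN].
Reordering is similar to sentence permutation in BART: it swaps the positions of some pairs of spans, giving [Tok4, Tok3, Tok1, Tok2, Tok5, ..., TokN].
Synonym substitution randomly selects some words and replaces them with synonyms, e.g. [Tok1, Tok'2, Tok'3, Tok4, Tok5, ..., Tok'N].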

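There are in fact many more ways to construct augmented examples; the self-supervised learning literature is a good place to look.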
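The four augmentations are easy to sketch on a tokenized sentence. The code below is an illustrative approximation only (CLEAR has no released code): the function names, the deletion and replacement ratios, and the toy `synonyms` dictionary are all assumptions.

```python
import random

DEL = "[del]"

def word_deletion(tokens, ratio, rng):
    """Delete random tokens; each run of consecutive deletions becomes one [del]."""
    n_delete = max(1, int(len(tokens) * ratio))
    to_delete = set(rng.sample(range(len(tokens)), n_delete))
    out = []
    for i, tok in enumerate(tokens):
        if i in to_delete:
            if not out or out[-1] != DEL:
                out.append(DEL)
        else:
            out.append(tok)
    return out

def span_deletion(tokens, ratio, rng):
    """Delete one contiguous span and replace it with a single [del]."""
    span_len = max(1, int(len(tokens) * ratio))
    start = rng.randrange(len(tokens) - span_len + 1)
    return tokens[:start] + [DEL] + tokens[start + span_len:]

def swap_spans(tokens, a, b, c, d):
    """Swap spans tokens[a:b] and tokens[c:d], assuming a < b <= c < d."""
    return tokens[:a] + tokens[c:d] + tokens[b:c] + tokens[a:b] + tokens[d:]

def reordering(tokens, rng):
    """Swap two randomly chosen, non-overlapping spans (needs >= 3 tokens)."""
    a, b, c, d = sorted(rng.sample(range(len(tokens) + 1), 4))
    return swap_spans(tokens, a, b, c, d)

def synonym_substitution(tokens, synonyms, ratio, rng):
    """Replace a random subset of the tokens that have an entry in `synonyms`."""
    candidates = [i for i, t in enumerate(tokens) if t in synonyms]
    n_replace = min(len(candidates), max(1, int(len(tokens) * ratio)))
    chosen = set(rng.sample(candidates, n_replace))
    return [synonyms[t] if i in chosen else t for i, t in enumerate(tokens)]

rng = random.Random(0)
sentence = "the quick brown fox jumps over the lazy dog".split()
synonyms = {"quick": "fast", "jumps": "leaps", "lazy": "idle"}  # toy thesaurus
views = {
    "word_deletion": word_deletion(sentence, 0.3, rng),
    "span_deletion": span_deletion(sentence, 0.3, rng),
    "reordering": reordering(sentence, rng),
    "synonym_substitution": synonym_substitution(sentence, synonyms, 0.2, rng),
}
for name, view in views.items():
    print(name, view)
```

Calling `swap_spans` on `[Tok1, ..., Tok5]` with cut points `0, 2, 3, 4` reproduces the reordering example from the list above.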

DeCLUTR

DeCLUTR: Deep Contrastive Learning for Unsupervised Textual Representations
paper: https://aclanthology.org/2021.acl-long.72/
code: https://github.com/JohnGiorgi/DeCLUTR

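ACL 2021. Unsupervised sentence-level representation learning. This paper is also about how to construct the training examples; its architecture is shown in the figure above.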

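It uses contrastive learning to pull together the embeddings of sentences from the same document and push apart the embeddings of sentences from different documents.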

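Concretely, it samples text spans from other parts of the same document and uses a contrastive loss to maximize the similarity between spans that share a context, thereby learning contextual sentence representations.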

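As shown in the figure below, there are three types of positives: spans that partially overlap the anchor, spans adjacent to the anchor, and spans subsumed by the anchor. There are two types of negatives: easy negatives from other documents and hard negatives from the same document.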
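The three kinds of positives can be made concrete with token offsets. The sketch below is a simplified illustration rather than DeCLUTR's actual sampling procedure: spans are `(start, end)` pairs with an exclusive end, and the helper names `relation` and `sample_pair`, the fixed span lengths, and the uniform sampling are all assumptions.

```python
import random

def relation(anchor, positive):
    """Classify a span relative to the anchor; spans are (start, end), end exclusive."""
    a_start, a_end = anchor
    p_start, p_end = positive
    if a_start <= p_start and p_end <= a_end:
        return "subsumed"      # lies entirely inside the anchor
    if p_end == a_start or p_start == a_end:
        return "adjacent"      # touches the anchor without sharing tokens
    if p_start < a_end and a_start < p_end:
        return "overlapping"   # shares some, but not all, of its tokens
    return "disjoint"          # not a valid positive in this sketch

def sample_pair(doc_len, anchor_len, pos_len, rng):
    """Sample an anchor, then a positive that starts close enough to overlap,
    touch, or sit inside it. Assumes pos_len <= anchor_len <= doc_len."""
    a_start = rng.randrange(doc_len - anchor_len + 1)
    a_end = a_start + anchor_len
    lo = max(0, a_start - pos_len)        # earliest start that still touches the anchor
    hi = min(doc_len - pos_len, a_end)    # latest start that still touches the anchor
    p_start = rng.randint(lo, hi)
    return (a_start, a_end), (p_start, p_start + pos_len)

rng = random.Random(0)
counts = {}
for _ in range(1000):
    anchor, positive = sample_pair(doc_len=200, anchor_len=32, pos_len=16, rng=rng)
    kind = relation(anchor, positive)
    counts[kind] = counts.get(kind, 0) + 1
print(counts)
```

In this toy setup, spans drawn in the same way from a different document would play the role of easy negatives, and spans from elsewhere in the same document that fall outside the anchor's neighbourhood would be the hard negatives.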

DiffCSE

DiffCSE: Difference-based Contrastive Learning for Sentence Embeddings
paper: https://arxiv.org/pdf/2204.10298.pdf
code: https://github.com/voidism/DiffCSE

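From NAACL 2022. The method treats dropout masks as the data augmentation for an "insensitive" transformation, learned with a contrastive loss, and MLM-based word replacement as a "sensitive" transformation, learned by predicting the difference between the original sentence and the edited sentence. The two objectives are optimized jointly to learn the sentence embedding.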

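The architecture is shown in the figure above: on the left is a standard SimCSE model, and on the right is a conditional sentence-difference prediction model. The left side needs no further explanation; the right side consists of a generator and a discriminator.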

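Generator. Given a sentence of length T, a pretrained MLM serves as the generator G: it takes the masked sequence and fills in the masked tokens, producing the generated sequence.
Discriminator. The discriminator performs replaced token detection, i.e. it predicts which tokens have been replaced.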
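The data flow on the right-hand side (mask, generate, label) can be sketched as follows. This only illustrates how replaced-token-detection targets are built: `toy_generator` is a hypothetical stand-in that fills masks with random vocabulary words, whereas the paper uses a pretrained MLM, and the masking ratio and vocabulary are assumptions.

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, ratio, rng):
    """Replace a random subset of positions with [MASK]."""
    n_mask = max(1, int(len(tokens) * ratio))
    positions = set(rng.sample(range(len(tokens)), n_mask))
    return [MASK if i in positions else t for i, t in enumerate(tokens)]

def toy_generator(masked, vocab, rng):
    """Hypothetical stand-in for the MLM generator G: fills each [MASK] with a
    random vocabulary word. A real generator would predict from context."""
    return [rng.choice(vocab) if t == MASK else t for t in masked]

def rtd_labels(original, edited):
    """Replaced-token-detection targets: 1 where the token changed, else 0."""
    return [int(o != e) for o, e in zip(original, edited)]

rng = random.Random(0)
x = "contrastive learning improves sentence embeddings".split()
vocab = x + ["degrades", "word"]  # toy vocabulary

x_masked = mask_tokens(x, ratio=0.3, rng=rng)   # masked sequence
x_edited = toy_generator(x_masked, vocab, rng)  # generated (edited) sequence
labels = rtd_labels(x, x_edited)                # what the discriminator must predict

print(x_masked)
print(x_edited)
print(labels)
```

If the generator happens to restore the original token, its label is 0, so the discriminator is only asked to flag tokens that actually differ from the original sentence.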
