放大招！单细胞筛分子标志物，让你的课题逼格up！ up！ up！你一定要试试！（附详细代码）

识别biomarkers和Cluster注释

大家好，我是风。欢迎来到风风的从零开始单细胞系列一。在前面的推文中，我们已经学习了数据下载、构建对象、数据质控、数据过滤、降维分析和聚类分析，按照思路往下走，获得了不同的Cluster之后，我们一般想看看不同Cluster之间的特征差异。那该怎么去识别Cluster的标志物呢？又如何给不同的Cluster进行注释呢？请看下回分解……ㄟ( ▔, ▔ )ㄏ………不了不了，我们还是马上分解吧！

识别不同Cluster的标志物

识别不同Cluster的标志物，简单来说，就是要找到在不同Cluster中差异表达的基因。这算是大部分单细胞分析的必备流程，可以参考之前推文分享过的这篇文章——Dissecting the Single-Cell Transcriptome Network Underlying Gastric Premalignant Lesions and Early Gastric Cancer（超链接“8+单细胞文章从解读到套路总结全部送给你”），以及上周菠小萝老师刚分享的这篇文章——Single-cell RNA sequencing reveals cellular and molecular immune profile in a Pembrolizumab-responsive PD-L1-negative lung cancer patient（超链接“只用1例患者的临床个案也能5+？这个生信研究思路值得你学习！”）。

Seurat通过差异表达找到特定集群的标志物，一般默认与其他所有细胞进行比较，可以识别单个集群的上调和下调标志物。使用FindAllMarkers函数进行分析，也可以寻找特定Cluster之间的差异，或者寻找在所有细胞中都有差异的标志物：

rm(list = ls())options(stringsAsFactors = FALSE)set.seed(123456) # 设置随机种子，方便重复结果# 安装R包# install.packages("Seurat")# install.packages("dplyr")library(Seurat)library(dplyr)library(cowplot)library(ggsci)# 使用readRDS读取rds格式数据pbmc <- readRDS("pbmc_tutorial.rds")pbmc## An object of class Seurat ## 13714 features across 2695 samples within 1 assay ## Active assay: RNA (13714 features, 2000 variable features)## 3 dimensional reductions calculated: pca, tsne, umap# 识别不同Cluster的特征基因# 获取Cluster类别pbmc.cluster <- [email protected][["seurat_clusters"]]table(pbmc.cluster)## pbmc.cluster## 0 1 2 3 4 5 6 7 ## 1215 481 349 292 169 143 32 14# 识别Cluster 1的特征基因cluster1.markers <- FindMarkers(object = pbmc, ident.1 = 1, # 设置Cluster的组别 slot = "data", logfc.threshold = 0.25, test.use = "wilcox", verbose = TRUE, only.pos = FALSE, min.cells.feature = 3, min.cells.group = 3, pseudocount.use = 1, thresh.use = 0.25,

 min.pct = 0.30 # 设置在所有细胞中的任何一组中检测到一个基因的最低百分比，用来过滤不相关基因加快运行速度

)print(x = head(x = cluster1.markers, n = 6))## p_val avg_logFC pct.1 pct.2 p_val_adj## S100A9 0.000000e+00 3.829911 0.994 0.217 0.000000e+00## S100A8 0.000000e+00 3.804527 0.977 0.122 0.000000e+00## LGALS2 0.000000e+00 2.603264 0.894 0.065 0.000000e+00## FCN1 0.000000e+00 2.344164 0.950 0.151 0.000000e+00## CD14 1.261959e-299 1.960834 0.672 0.029 1.730651e-295## TYROBP 1.901113e-285 2.096429 0.994 0.266 2.607187e-281# 同样的方法可以识别Cluster 2的特征基因cluster2.markers <- FindMarkers(object = pbmc, ident.1 = 2, # 设置Cluster的组别 slot = "data", logfc.threshold = 0.25, test.use = "wilcox", verbose = TRUE, only.pos = FALSE, min.cells.feature = 3, min.cells.group = 3, pseudocount.use = 1, thresh.use = 0.25,

 min.pct = 0.30 # 设置在所有细胞中的任何一组中检测到一个基因的最低百分比，用来过滤不相关基因加快运行速度

)print(x = head(x = cluster2.markers, n = 6))## p_val avg_logFC pct.1 pct.2 p_val_adj## CD79A 0.000000e+00 2.974308 0.931 0.042 0.000000e+00## MS4A1 0.000000e+00 2.346764 0.851 0.053 0.000000e+00## TCL1A 1.396390e-273 2.492401 0.619 0.022 1.915009e-269## CD79B 5.094389e-271 2.389632 0.905 0.143 6.986445e-267## HLA-DQA1 2.927694e-270 2.127028 0.885 0.116 4.015039e-266## LINC00926 2.461153e-266 1.949258 0.553 0.011 3.375226e-262# 也可以进行Cluster 1和Cluster 2的比较cluster1vs2.markers <- FindMarkers(object = pbmc, ident.1 = 1, # 设置Cluster的组别 ident.2 = 2, slot = "data", logfc.threshold = 0.25, test.use = "wilcox", verbose = TRUE, only.pos = FALSE, min.cells.feature = 3, min.cells.group = 3, pseudocount.use = 1, thresh.use = 0.25,

 min.pct = 0.30 # 设置在所有细胞中的任何一组中检测到一个基因的最低百分比，用来过滤不相关基因加快运行速度

)print(x = head(x = cluster1vs2.markers, n = 6))## p_val avg_logFC pct.1 pct.2 p_val_adj## CD79A 4.476261e-143 -2.974024 0.040 0.931 6.138745e-139## TYROBP 1.804735e-139 3.459270 0.994 0.106 2.475013e-135## S100A9 9.709182e-138 4.297518 0.994 0.138 1.331517e-133## CST3 2.729147e-136 3.218190 0.992 0.186 3.742752e-132## S100A4 2.862468e-136 2.889772 1.000 0.355 3.925589e-132## S100A8 2.432489e-135 4.077856 0.977 0.074 3.335915e-131

# 通过FindMarkers我们可以得到特定Clsuter的标志物，也可以设定特定两组之间的标志物，可操作性更强。也可以一次性获取所有Cluster的标志物

pbmc.markers <- FindAllMarkers(object = pbmc, only.pos = TRUE, min.pct = 0.30, thresh.use = 0.25)## Calculating cluster 0## Calculating cluster 1## Calculating cluster 2## Calculating cluster 3## Calculating cluster 4## Calculating cluster 5## Calculating cluster 6## Calculating cluster 7# 分别可视化每个Cluster中上下调差异最大的各两个结果pbmc.markers %>% group_by(cluster) %>% slice_max(avg_logFC, n = 2)## Registered S3 method overwritten by 'cli':## method from ## print.boxx spatstat## # A tibble: 16 x 7## # Groups: cluster [8]## p_val avg_logFC pct.1 pct.2 p_val_adj cluster gene ## <dbl> <dbl> <dbl> <dbl> <dbl> <fct> <chr> ## 1 1.37e-238 1.18 0.911 0.472 1.87e-234 0 LDHB ## 2 2.41e-200 1.09 0.865 0.237 3.31e-196 0 CD3D ## 3 0. 3.83 0.994 0.217 0. 1 S100A9 ## 4 0. 3.80 0.977 0.122 0. 1 S100A8 ## 5 0. 2.97 0.931 0.042 0. 2 CD79A ## 6 1.40e-273 2.49 0.619 0.022 1.92e-269 2 TCL1A ## 7 1.42e-186 2.06 0.959 0.241 1.95e-182 3 CCL5 ## 8 5.16e-183 2.05 0.616 0.057 7.08e-179 3 GZMK ## 9 2.62e-192 2.27 0.976 0.131 3.59e-188 4 FCGR3A ## 10 9.87e-130 2.08 1 0.312 1.35e-125 4 LST1 ## 11 3.28e-175 3.41 0.958 0.135 4.50e-171 5 GNLY ## 12 5.65e-269 3.37 0.993 0.071 7.75e-265 5 GZMB ## 13 2.44e-262 2.76 0.875 0.01 3.35e-258 6 FCER1A ## 14 5.06e- 35 2.12 1 0.207 6.94e- 31 6 HLA-DQA1## 15 1.11e-109 5.93 1 0.024 1.53e-105 7 PPBP ## 16 8.49e-195 5.01 1 0.011 1.16e-190 7 PF4pbmc.markers %>% group_by(cluster) %>% slice_min(avg_logFC, n = 2)## # A tibble: 16 x 7## # Groups: cluster [8]## p_val avg_logFC pct.1 pct.2 p_val_adj cluster gene ## <dbl> <dbl> <dbl> <dbl> <dbl> <fct> <chr> ## 1 1.66e- 5 0.255 0.309 0.244 2.28e- 1 0 RPL39 ## 2 2.82e-13 0.255 0.607 0.481 3.87e- 9 0 RPL41 ## 3 3.52e-11 0.253 0.439 0.281 4.83e- 7 1 BLOC1S1## 4 1.85e- 8 0.254 0.751 0.621 2.54e- 4 1 KLF6 ## 5 8.13e-24 0.253 0.997 0.993 1.12e-19 2 RPS27 ## 6 2.96e- 3 0.253 0.513 0.491 1.00e+ 0 2 RNASET2## 7 2.89e- 3 0.252 0.356 0.294 1.00e+ 0 3 SPCS2 ## 8 4.72e- 4 0.253 0.301 0.226 1.00e+ 0 3 P4HB ## 9 9.38e- 5 0.250 0.722 0.474 1.00e+ 0 4 LDHA ## 10 1.32e- 8 0.251 0.562 0.289 1.81e- 4 4 ATP5C1 ## 11 7.50e- 4 0.257 0.462 0.373 1.00e+ 0 5 HNRNPF ## 12 4.05e- 3 0.259 0.476 0.408 1.00e+ 0 5 H2AFZ ## 13 2.88e- 5 0.250 0.594 0.211 3.95e- 1 6 RNF181 ## 14 6.61e- 5 0.250 0.5 0.169 9.06e- 1 6 CHMP4B ## 15 3.56e- 3 0.268 1 0.988 1.00e+ 0 7 FTH1 ## 16 3.56e- 3 0.680 0.929 0.841 1.00e+ 0 7 GAPDH# 对结果进行可视化VlnPlot(object = pbmc, features =c("LDHB", "S100A9", "FCGR3A", "HLA-DQA1", "RPS27", "P4HB"))

FeaturePlot(object = pbmc, features = c("LDHB", "S100A9", "FCGR3A", "HLA-DQA1", "RPS27", "P4HB", "GNLY", "PPBP"), cols = c("grey", "purple"), reduction = "umap")

# DoHeatmap函数可以绘制特定细胞和基因的表达热图，我们这里展示各组差异最显著的前5个基因。

top10<- pbmc.markers %>% group_by(cluster)%>% top_n(5,avg_logFC)Heatmap.plot<- DoHeatmap(object = pbmc,features = top10$gene, label = TRUE,group.by = "ident",group.bar = TRUE,disp.min = -2.5,slot = "scale.data",size = 4.5,hjust = 0,angle = 45,raster = FALSE,draw.lines = TRUE,group.bar.height = 0.02,combine = TRUE)## Warning in DoHeatmap(object = pbmc, features = top10$gene, label = TRUE, : The## following features were omitted as they were not found in the scale.data slot## for the RNA assay: NOSIP, CD3E, CD3D, LDHBHeatmap.plot2<- Heatmap.plot +scale_fill_gsea(alpha = 0.5)## Scale for 'fill' is already present. Adding another scale for 'fill', which## will replace the existing scale.Heatmap.plot2

需要注意，单细胞的数据较多，建议大家绘图的时候只展示结果最显著的特征基因，减少内存，也使图片看起来更加美观。

细胞注释

接下来我们要根据特征基因给不同Cluster进行命名，这一个步骤考验我们的生物学背景知识，当然也有相关R包可以对细胞类型进行自动注释，但是相对来说，根据生物学背景知识来注释更具有讨论价值，在写作的时候也更加容易讨论。在这里，PBMC是一个经典的数据集，可以直接将无监督聚类的结果匹配到已知的细胞类型中，我们来看一下：

# 转换亚群名称current.cluster.ids <- c(0, 1, 2, 3, 4, 5, 6, 7)new.cluster.ids <- c("CD4 T cells", "CD14+ Monocytes", "B cells","CD8 T cells", "FCGR3A+ Monocytes","NK cells","Dendritic cells", "Megakaryocytes")pbmc@active.ident <- plyr::mapvalues(x = pbmc@active.ident,from = current.cluster.ids, to = new.cluster.ids)DimPlot(object = pbmc, reduction = "tsne", pt.size = 0.5)

DimPlot(object = pbmc, reduction = "umap", pt.size = 0.5)

DimPlot(object = pbmc, reduction = "pca", pt.size = 0.5)

# 有时候你会发现并不是所有的亚群都按照我们注释的结果，相同细胞聚在一起，不同细胞明显分开，这时候我们可以对细胞类型进行进一步细分，比如下面这样的结果

pbmc[["ClusterNames_0.6"]] <- Idents(object = pbmc) # 保存前面的分析结果，方便进行对比pbmc <- FindClusters(object = pbmc, reduction.type = "pca", dims.use = 1:10, resolution = 0.8, print.output = TRUE)## Warning: The following arguments are not used: reduction.type, dims.use,## print.output## Suggested parameter: reduction instead of reduction.type; dims instead of dims.use; verbose instead of print.output## Warning: The following arguments are not used: reduction.type, dims.use,## print.output## Suggested parameter: reduction instead of reduction.type; dims instead of dims.use; verbose instead of print.output## Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck## ## Number of nodes: 2695## Number of edges: 122081## ## Running Louvain algorithm...## Maximum modularity in 10 random starts: 0.8071## Number of communities: 9## Elapsed time: 0 secondspbmc## An object of class Seurat ## 13714 features across 2695 samples within 1 assay ## Active assay: RNA (13714 features, 2000 variable features)## 3 dimensional reductions calculated: pca, tsne, umapplot1 <- DimPlot(object = pbmc, reduction = "umap", pt.size = 0.5)plot2 <- DimPlot(object = pbmc, reduction = "umap", pt.size = 0.5, group.by = "ClusterNames_0.6")plot_grid(plot1, plot2)

# 对比发现多了一个Cluster，我们对细胞类型进行再细分# 识别两种Cluster的特征基因tcell.markers <- FindMarkers(object = pbmc, ident.1 = 0, ident.2 = 1)tcell.markers %>% top_n(2, avg_logFC)## p_val avg_logFC pct.1 pct.2 p_val_adj## CCR7 7.902771e-18 0.5937806 0.447 0.244 1.083786e-13## CD8B 2.995009e-11 0.7187127 0.235 0.096 4.107356e-07tcell.markers %>% top_n(-2, avg_logFC)## p_val avg_logFC pct.1 pct.2 p_val_adj## S100A4 8.412059e-80 -0.9929538 0.656 0.965 1.153630e-75## S100A11 6.258484e-41 -0.8279185 0.244 0.633 8.582885e-37

# 我们发现S100A4和S100A11主要在Cluster 1中高表达，而CCR7和CD8B主要在Cluster 0中高表达，以此来对这两种T细胞亚群进行命名，结合文献调研，我们将其区分为记忆T细胞和幼稚T细胞

FeaturePlot(object = pbmc, features = c("S100A4", "CCR7"), cols = c("lightblue", "orange"))

# 最后，我们将前面比较稳定的分析结果恢复回来pbmc <- SetIdent(object = pbmc, value = "ClusterNames_0.6")pbmc## An object of class Seurat ## 13714 features across 2695 samples within 1 assay ## Active assay: RNA (13714 features, 2000 variable features)## 3 dimensional reductions calculated: pca, tsne, umap

好了，这样我们识别亚群的特征标志物以及对细胞进行注释的内容也就到此完成，而Seurat的内容也基本完结。到了这里，单细胞分析一些常见的步骤已经函数我们也已经学习完成啦，是时候开始自己课题的分析了。最后，我们再留个问题：如何将scRNA-seq分析和bulk-seq分析结合起来呢？

后台回复“feng46”获取本期数据和代码，我们下期见啦！

单细胞分析专栏传送门

大年初一，单细胞分析从nature communication开始

8+单细胞文章从解读到套路总结全部送给你

30分多的单细胞分析文章是怎么做的？

不用R，点点点就能get的单细胞公共数据下载，你“学废”了吗

单细胞都有两个“对象”，你一个都没有

8成生信审稿人必问！单细胞第一张可视化图，你漏了吗？

我从Science中，“偷”到了这个实用的聚类分析技能！真舍不得分享！（附代码）

不骗你！学了这个R包，搞定九成高大上的单细胞分析！包治百病！一包封神！小白赶紧收起来！

Cell的生信分析能有多高级？单细胞研究必备分析！不会做么？简单，我来教你！（附代码）