疯了!想白嫖10+单细胞SCI?只要解决这一步,你真的可以!
多数据集整合
大家好,我是风风。欢迎来到风风的从零开始单细胞系列。在之前的单细胞分析流程中,我们都是基于单个数据集来进行分析。和bulk seq数据和芯片数据一样,随着单细胞测序数据越来越多,单细胞公共数据挖掘已经就在我们身边。但是一个最实际的问题就是:源数据集已经被作者发过了文章,并且可能比你自己做的还要好(毕竟大部分单细胞测序的公司还提供了相应的分析服务),那我们是不是没有机会了呢?当然不是!
我们前面只是说了针对单个单细胞数据集的分析,如果检索到了多个数据集,那么我们可以仿照bulk seq和芯片数据一样,对多个数据集进行合并整合,得到一个新的样本较大的数据集,接着再进行相应的分析。并且,由于目前单细胞文章相对较少,多数据集整合的文章也相对较少,发文章的机会还比较多。由于单细胞测序数据的客观限制,包括经费、价格等等,往往每个单细胞测序项目的样本量都较少,几个到十几个不等,因此多数据集整合显得尤为重要。如何整合?如何去除批次效应》?如何使结果更加可靠?今天,我们就来一起看看,到底多数据集合并的单细胞数据应该怎么处理?
这里我们将使用来自10x Genomics公司的两个PBMC数据集进行演示,数据放在后台,大家自行回复关键词获取:
rm(list = ls())
安装并加载R包
if(!require("batchelor")) install.packages("batchelor", update = F, ask = F)
if(!require("scran")) install.packages("scran", update = F, ask = F)
if(!require("scater")) install.packages("scater", update = F, ask = F)
if(!require("pheatmap")) install.packages("scater", update = F, ask = F)
if(!require("bluster")) install.packages("scater", update = F, ask = F)
加载分析数据,这里使用2份不同的10x GEnomics的外周血数据进行演示
load(file = "pbmc.Rda")
load(file = "dec.Rda")
load(file = "pbmc2.Rda")
load(file = "dec2.Rda")
查看数据基本情况
pbmc3k
# class: SingleCellExperiment
# dim: 32738 2609
# metadata(0):
# assays(2): counts logcounts
# rownames(32738): ENSG00000243485 ENSG00000237613 ... ENSG00000215616
# ENSG00000215611
# rowData names(3): ENSEMBL_ID Symbol_TENx Symbol
# colnames: NULL
# colData names(13): Sample Barcode ... sizeFactor label
# reducedDimNames(3): PCA TSNE UMAP
# mainExpName: NULL
# altExpNames(0):
head(dec3k)
# DataFrame with 6 rows and 6 columns
# mean total tech bio p.value FDR
# <numeric> <numeric> <numeric> <numeric> <numeric> <numeric>
# ENSG00000243485 0.00000000 0.00000000 0.00000000 0.0000000 NaN NaN
# ENSG00000237613 0.00000000 0.00000000 0.00000000 0.0000000 NaN NaN
# ENSG00000186092 0.00000000 0.00000000 0.00000000 0.0000000 NaN NaN
# ENSG00000238009 0.00000000 0.00000000 0.00000000 0.0000000 NaN NaN
# ENSG00000239945 0.00000000 0.00000000 0.00000000 0.0000000 NaN NaN
# ENSG00000237683 0.00331643 0.00326512 0.00343602 -0.0001709 0.618501 0.777697
pbmc4k
# class: SingleCellExperiment
# dim: 33694 4182
# metadata(0):
# assays(2): counts logcounts
# rownames(33694): ENSG00000243485 ENSG00000237613 ... ENSG00000277475
# ENSG00000268674
# rowData names(3): ENSEMBL_ID Symbol_TENx Symbol
# colnames: NULL
# colData names(13): Sample Barcode ... sizeFactor label
# reducedDimNames(3): PCA TSNE UMAP
# mainExpName: NULL
# altExpNames(0):
head(dec4k)
# DataFrame with 6 rows and 6 columns
# mean total tech bio p.value
# <numeric> <numeric> <numeric> <numeric> <numeric>
# ENSG00000243485 0.000000000 0.000000000 0.000000000 0.00000e+00 NaN
# ENSG00000237613 0.000000000 0.000000000 0.000000000 0.00000e+00 NaN
# ENSG00000186092 0.000000000 0.000000000 0.000000000 0.00000e+00 NaN
# ENSG00000238009 0.002087740 0.002122761 0.002131803 -9.04241e-06 0.512246
# ENSG00000239945 0.000517597 0.000563898 0.000528521 3.53768e-05 0.314021
# ENSG00000239906 0.000000000 0.000000000 0.000000000 0.00000e+00 NaN
# FDR
# <numeric>
# ENSG00000243485 NaN
# ENSG00000237613 NaN
# ENSG00000186092 NaN
# ENSG00000238009 0.721073
# ENSG00000239945 0.721073
# ENSG00000239906 NaN
和芯片数据批次处理一样,首先我们要对数据进行一定的预处理
获取两个数据集共同的基因
universe <- intersect(rownames(pbmc3k),
rownames(pbmc4k))
length(universe)
# [1] 31232
pbmc3k <- pbmc3k[universe,]
pbmc4k <- pbmc4k[universe,]
dec3k <- dec3k[universe,]
dec4k <- dec4k[universe,]
使用multiBatchNorm()函数对SingleCellExperiment对象表达量进行调整
rescaled <- multiBatchNorm(pbmc3k,
pbmc4k)
pbmc3k <- rescaled[[1]]
pbmc4k <- rescaled[[2]]
使用 combineVar() 函数计算所有基因关于批次的方差来进行特征选择
combined.dec <- combineVar(dec3k,
dec4k)
chosen.hvgs <- combined.dec$bio > 0
sum(chosen.hvgs)
# [1] 13431
在进行批次效应矫正之前,我们一样首先要判断数据之间是否存在批次效应,和芯片数据一样,我们可以使用PCA等分析:
rowData(pbmc3k) <- rowData(pbmc4k)
"3k" batch <-
"4k" batch <-
uncorrected <- cbind(pbmc3k,
pbmc4k)
set.seed(0000)
uncorrected <- runPCA(uncorrected,
subset_row = chosen.hvgs,
BSPARAM = BiocSingular::RandomParam())
进行PCA分析
snn.gr <- buildSNNGraph(uncorrected,
use.dimred = "PCA")
clusters <- igraph::cluster_walktrap(snn.gr)$membership
tab <- table(Cluster = clusters,
Batch = uncorrected$batch)
tab
# Batch
# Cluster 3k 4k
# 1 1 829
# 2 482 0
# 3 15 41
# 4 0 605
# 5 1288 0
# 6 0 181
# 7 0 521
# 8 0 497
# 9 34 0
# 10 0 79
# 11 150 0
# 12 0 211
# 13 144 0
# 14 0 91
# 15 0 1041
# 16 148 0
# 17 0 46
# 18 334 1
# 19 11 3
# 20 2 36
我们还可以用t-SNE图直观地看到校正后的结果
set.seed(1111)
uncorrected <- runTSNE(uncorrected,
dimred = "PCA")
plotTSNE(uncorrected,
colour_by = "batch")
我们常用的矫正批次效应的方法是来自limma软件包中的removeBatchEffect()函数以及sva软件包中的comBat()函数,这两种方法的基础都是使用线性回归来消除批次效应,这里我们一样使用线性回归去除批次效应,这里使用rescaleBatches()函数和
regressBatches()函数
rescaled <- rescaleBatches(pbmc3k,
pbmc4k)
rescaled
# class: SingleCellExperiment
# dim: 31232 6791
# metadata(0):
# assays(1): corrected
# rownames(31232): ENSG00000243485 ENSG00000237613 ... ENSG00000198695
# ENSG00000198727
# rowData names(0):
# colnames: NULL
# colData names(1): batch
# reducedDimNames(0):
# mainExpName: NULL
# altExpNames(0):
在聚类后,我们观察到大多数聚类是由两个批次的细胞的混合组成,这与消除批次效应一致。然而,至少有一个批次仍然存在特定的集群,这表明校正并不完全,我们接着往下做:
rescaled <- runPCA(rescaled,
subset_row = chosen.hvgs,
exprs_values = "corrected",
BSPARAM = BiocSingular::RandomParam())
snn.gr <- buildSNNGraph(rescaled,
use.dimred = "PCA")
clusters.resc <- igraph::cluster_walktrap(snn.gr)$membership
tab.resc <- table(Cluster = clusters.resc,
Batch = rescaled$batch)
tab.resc
# Batch
# Cluster 1 2
# 1 152 93
# 2 144 179
# 3 252 507
# 4 157 171
# 5 336 606
# 6 99 609
# 7 19 34
# 8 497 483
# 9 8 52
# 10 4 9
# 11 13 64
# 12 19 57
# 13 567 1062
# 14 217 0
# 15 111 217
# 16 3 36
# 17 11 3
rescaled <- runTSNE(rescaled,
dimred = "PCA")
$batch) batch <- factor(rescaled
plotTSNE(rescaled,
colour_by = "batch")
我们也可以使用regressBatches()来进行线性回归矫正
residuals <- regressBatches(pbmc3k,
pbmc4k,
d = 50,
subset.row = chosen.hvgs,
correct.all = TRUE,
BSPARAM = BiocSingular::RandomParam())
snn.gr <- buildSNNGraph(residuals,
use.dimred = "corrected")
clusters.resid <- igraph::cluster_walktrap(snn.gr)$membership
tab.resid <- table(Cluster = clusters.resid,
Batch = residuals$batch)
tab.resid
# Batch
# Cluster 1 2
# 1 152 184
# 2 476 2
# 3 257 514
# 4 343 606
# 5 7 10
# 6 11 4
# 7 539 501
# 8 146 92
# 9 14 32
# 10 0 252
# 11 22 71
# 12 8 51
# 13 12 62
# 14 128 233
# 15 1 515
# 16 491 1017
# 17 2 36
residuals <- runTSNE(residuals,
dimred = "corrected")
$batch) batch <- factor(residuals
plotTSNE(residuals,
colour_by = "batch")
MNN法矫正
batchelor软件包通过fastMNN()函数提供了MNN方法的计算fastMNN()函数会首先进行行PCA分析以减少数据维度,并加快下游的最近邻居检测过程,这里我们将其应用于两个PBMC批次,以消除跨越selected.hvgs中高度可变基因的批次效应,为了减少计算量和技术噪音,所有批次的所有细胞都被投射到由前d个主成分定义的低维空间,然后在这个低维空间中进行MNNs的识别和校正向量的计算。
set.seed(2222)
mnn.out <- fastMNN(pbmc3k,
pbmc4k,
d = 50,
k = 20,
subset.row = chosen.hvgs,
BSPARAM = BiocSingular::RandomParam(deferred = TRUE))
mnn.out
# class: SingleCellExperiment
# dim: 13431 6791
# metadata(2): merge.info pca.info
# assays(1): reconstructed
# rownames(13431): ENSG00000239945 ENSG00000228463 ... ENSG00000198695
# ENSG00000198727
# rowData names(1): rotation
# colnames: NULL
# colData names(1): batch
# reducedDimNames(1): corrected
# mainExpName: NULL
# altExpNames(0):
fastMNN函数可以返回一个SingleCellExperiment对象,其中包含用于聚类或可视化等下游分析的校正值,mnn.out的每一列对应于一个批次中的一个细胞,而每一行对应于selected.hvgs中的一个输入基因
head(mnn.out$batch)
# [1] 1 1 1 1 1 1
dim(reducedDim(mnn.out,
"corrected"))
# [1] 6791 50
assay(mnn.out,
"reconstructed")
# <13431 x 6791> matrix of class LowRankMatrix and type "double":
# [,1] [,2] [,3] ... [,6790]
# ENSG00000239945 -3.395054e-06 -2.226848e-05 -5.033462e-06 . -7.262104e-06
# ENSG00000228463 -9.121786e-04 -3.790492e-04 -2.315338e-04 . -5.851371e-04
# ENSG00000237094 -4.000722e-05 -8.257147e-05 -1.124062e-06 . -2.469250e-05
# ENSG00000229905 3.373074e-05 2.610168e-05 -3.545071e-05 . 6.469330e-06
# ENSG00000237491 -3.383264e-04 -1.308717e-04 -1.611619e-04 . -3.086475e-04
# ... . . . . .
# ENSG00000198840 -0.0343820801 -0.0372592740 -0.0540644294 . -0.038165082
# ENSG00000212907 -0.0072325025 -0.0044983325 -0.0144663080 . -0.007723571
# ENSG00000198886 0.0249269459 0.0148135863 -0.0372499351 . -0.017291410
# ENSG00000198695 0.0014135224 0.0001795418 0.0003422022 . -0.001723150
# ENSG00000198727 0.0105042342 0.0286676618 -0.0235005178 . -0.011459773
# [,6791]
# ENSG00000239945 -2.065620e-05
# ENSG00000228463 -4.153473e-04
# ENSG00000237094 -5.201251e-05
# ENSG00000229905 -8.245027e-06
# ENSG00000237491 -1.066188e-04
# ... .
# ENSG00000198840 -0.019770576
# ENSG00000212907 -0.001511311
# ENSG00000198886 -0.008230620
# ENSG00000198695 -0.001232065
# ENSG00000198727 0.005486359
snn.gr <- buildSNNGraph(mnn.out,
use.dimred = "corrected")
clusters.mnn <- igraph::cluster_walktrap(snn.gr)$membership
tab.mnn <- table(Cluster = clusters.mnn,
Batch = mnn.out$batch)
tab.mnn
# Batch
# Cluster 1 2
# 1 273 521
# 2 330 588
# 3 301 670
# 4 515 433
# 5 18 58
# 6 18 21
# 7 175 115
# 8 568 1161
# 9 138 152
# 10 12 54
# 11 16 57
# 12 6 16
# 13 143 92
# 14 64 169
# 15 14 28
# 16 4 8
# 17 3 36
# 18 11 3
在数据矫正完成后,我们要确定是否矫正完成,这时候可以再进行PCA分析,对比前后的结果,判断批次效应矫正是否成功:
chi.prop <- colSums(tab.mnn)/sum(tab.mnn)
chi.results <- apply(tab.mnn, 1,
FUN = chisq.test,
p = chi.prop)
# Warning in FUN(newX[, i], ...): Chi-squared approximation may be incorrect
p.values <- vapply(chi.results,
"[[", i = "p.value",
0)
p.values
# 1 2 3 4 5 6
# 1.939254e-02 1.237853e-01 2.001755e-06 7.584451e-24 8.270309e-03 3.206345e-01
# 7 8 9 10 11 12
# 1.633494e-14 1.943613e-06 1.328729e-03 7.248759e-04 3.749936e-03 2.824661e-01
# 13 14 15 16 17 18
# 1.549664e-12 5.891527e-04 4.980642e-01 7.172327e-01 7.980407e-05 2.009850e-03
tab.mnn <- unclass(tab.mnn)
norm <- normalizeCounts(tab.mnn, pseudo.count=10)
rv <- rowVars(norm)
DataFrame(Batch = tab.mnn,
var = rv)[order(rv,
decreasing = TRUE),]
# DataFrame with 18 rows and 3 columns
# Batch.1 Batch.2 var
# <integer> <integer> <numeric>
# 17 3 36 1.119610
# 13 143 92 0.733569
# 7 175 115 0.721954
# 10 12 54 0.574234
# 18 11 3 0.467939
# ... ... ... ...
# 6 18 21 0.04661042
# 1 273 521 0.03009832
# 15 14 28 0.02291004
# 2 330 588 0.01115334
# 16 4 8 0.00689661
p.values
# 1 2 3 4 5 6
# 1.939254e-02 1.237853e-01 2.001755e-06 7.584451e-24 8.270309e-03 3.206345e-01
# 7 8 9 10 11 12
# 1.633494e-14 1.943613e-06 1.328729e-03 7.248759e-04 3.749936e-03 2.824661e-01
# 13 14 15 16 17 18
# 1.549664e-12 5.891527e-04 4.980642e-01 7.172327e-01 7.980407e-05 2.009850e-03
mnn.out <- runTSNE(mnn.out,
dimred = "corrected")
mnn.out$batch <- factor(mnn.out$batch)
plotTSNE(mnn.out,
colour_by = "batch")
metadata(mnn.out)$merge.info$lost.var
## [,1] [,2]
## [1,] 0.006460144 0.003301332
tab <- table(paste("after",
clusters.mnn[rescaled$batch == 1]),
paste("before",
colLabels(pbmc3k)))
heat3k <- pheatmap(log10(tab + 10),
cluster_row = FALSE,
cluster_col = FALSE,
main = "PBMC 3K comparison",
silent=TRUE)
# 再次矫正
tab <- table(paste("after",
clusters.mnn[rescaled$batch == 2]),
paste("before",
colLabels(pbmc4k)))
heat4k <- pheatmap(log10(tab+10),
cluster_row = FALSE,
cluster_col = FALSE,
main = "PBMC 4K comparison",
silent = TRUE)
gridExtra::grid.arrange(heat3k[[4]],
heat4k[[4]])
ri3k<- pairwiseRand(clusters.mnn[rescaled$batch == 1],
colLabels(pbmc3k),
mode = "index")
ri3k
## [1] 0.7147989
ri4k<- pairwiseRand(clusters.mnn[rescaled$batch == 2],
colLabels(pbmc4k),
mode = "index")
ri4k
## [1] 0.7866939
tab<- pairwiseRand(colLabels(pbmc3k),
= 1]) =
heat3k<- pheatmap(tab,
cluster_row = FALSE,
cluster_col = FALSE,
col = rev(viridis::magma(100)),
main = "PBMC 3K probabilities",
TRUE) =
tab<- pairwiseRand(colLabels(pbmc4k),
= 2]) =
heat4k<- pheatmap(tab,
cluster_row = FALSE,
cluster_col = FALSE,
col = rev(viridis::magma(100)),
main = "PBMC 4K probabilities",
TRUE) =
gridExtra::grid.arrange(heat3k[[4]],
heat4k[[4]])
# 结合标记基因
stats3<- pairwiseWilcox(pbmc3k,
direction = "up")
markers3<- getTopMarkers(stats3[[1]],
stats3[[2]],
10) =
stats4<- pairwiseWilcox(pbmc4k,
direction = "up")
markers4<- getTopMarkers(stats4[[1]],
stats4[[2]],
10) =
<- unique(unlist(c(unlist(markers3),
unlist(markers4))))
length(marker.set)
## [1] 314
<- fastMNN(pbmc3k,
pbmc4k,
marker.set, =
BiocSingular::RandomParam(deferred = TRUE)) =
<- runTSNE(mnn.out2,
dimred = "corrected")
gridExtra::grid.arrange(
= 1], =
colour_by = I(colLabels(pbmc3k))),
= 2], =
colour_by = I(colLabels(pbmc4k))),
ncol = 2 )
使用矫正后结果作图
m.out <- findMarkers(uncorrected,
clusters.mnn,
block = uncorrected$batch,
direction = "up",
lfc = 1,
row.data = rowData(uncorrected)[,3,
drop = FALSE])
demo <- m.out[["10"]]
as.data.frame(demo[1:20,c("Symbol",
"Top",
"p.value",
"FDR")])
# Symbol Top p.value FDR
# ENSG00000142676 RPL11 1 1.225834e-30 3.828524e-27
# ENSG00000163220 S100A9 1 1.657387e-47 2.588176e-43
# ENSG00000143546 S100A8 1 1.857986e-39 1.160572e-35
# ENSG00000227507 LTB 1 1.201694e-01 1.000000e+00
# ENSG00000090382 LYZ 1 7.224401e-72 2.256325e-67
# ENSG00000197956 S100A6 2 5.204501e-41 4.063675e-37
# ENSG00000008517 IL32 2 1.000000e+00 1.000000e+00
# ENSG00000011600 TYROBP 2 3.080759e-39 1.603638e-35
# ENSG00000163131 CTSS 3 5.789844e-36 2.260355e-32
# ENSG00000168028 RPSA 3 2.258525e-12 2.612527e-09
# ENSG00000198242 RPL23A 3 2.170984e-23 5.650348e-20
# ENSG00000087086 FTL 3 1.155375e-44 1.202823e-40
# ENSG00000163221 S100A12 4 2.338307e-04 7.607292e-02
# ENSG00000257764 <NA> 4 3.603828e-08 2.617552e-05
# ENSG00000101439 CST3 4 9.311694e-24 2.643844e-20
# ENSG00000105374 NKG7 4 1.000000e+00 1.000000e+00
# ENSG00000083845 RPS5 4 1.392542e-20 3.106563e-17
# ENSG00000163682 RPL9 5 1.469368e-19 2.868207e-16
# ENSG00000251562 MALAT1 5 6.948155e-07 3.144997e-04
# ENSG00000167286 CD3D 5 1.000000e+00 1.000000e+00
plotExpression(uncorrected,
x = I(factor(clusters.mnn)),
features = "ENSG00000177954",
colour_by = "batch") +
facet_wrap(~ colour_by)
好了,本期的内容到此结束,多数据集批次矫正大家可能在处理芯片数据的时候比较熟悉,对单细胞的数据而言,目前并没有非常好的解决批次效应的方法,不过话说回来,即使是芯片数据,我们往往也说最好是同平台的芯片数据再进行合并,那单细胞的数据是否有这种顾虑呢?这个还是要看大家的数据集啦,至少如果是不同方法测到的单细胞数据,比如smart seq2和10x Genomics,这样的两个数据集是不建议合并的!
后台回复“feng 54”获取代码和数据,我们下期见!
单细胞分析专栏传送门
碎碎念专栏传送门(完结)
风之美图系列传送门(完结)
—
END—
撰文丨风 风
排版丨四金兄
主编丨小雪球
欢迎大家关注解螺旋生信频道-挑圈联靠公号~
关键词
细胞
结果
数据集
方法
文章
最新评论
推荐文章
作者最新文章
你可能感兴趣的文章
Copyright Disclaimer: The copyright of contents (including texts, images, videos and audios) posted above belong to the User who shared or the third-party website which the User shared from. If you found your copyright have been infringed, please send a DMCA takedown notice to [email protected]. For more detail of the source, please click on the button "Read Original Post" below. For other communications, please send to [email protected].
版权声明:以上内容为用户推荐收藏至CareerEngine平台,其内容(含文字、图片、视频、音频等)及知识版权均属用户或用户转发自的第三方网站,如涉嫌侵权,请通知[email protected]进行信息删除。如需查看信息来源,请点击“查看原文”。如需洽谈其它事宜,请联系[email protected]。
版权声明:以上内容为用户推荐收藏至CareerEngine平台,其内容(含文字、图片、视频、音频等)及知识版权均属用户或用户转发自的第三方网站,如涉嫌侵权,请通知[email protected]进行信息删除。如需查看信息来源,请点击“查看原文”。如需洽谈其它事宜,请联系[email protected]。