全网首发!万字长文详解Smart-seq2超详细单细胞教程!没别的,就是教你发CNS级别SCI!
Smart_Seq2数据处理
大家好,我是风风。欢迎来到风风的从零开始单细胞系列一。前面我们基本已经把单细胞分析的方方面面学习完了,但是这些流程大部分只属于10x Genomics的数据,关于其他类型的数据,我们又该怎么入手呢?
目前来说,单细胞转录组测序主要分为10x Genomics、Smart_seq2、CEL_seq2、CEL_seq、SMARTer、STRT等几种类型,在对GEO的数据进行检索分析,发现目前GEO数据库收录的内容大部分是10x Genomics、Smart_seq2、CEL_seq2、STRT这几种。
今天,我们就来看看,拿到一份Smart_seq2数据之后,该怎么对其进行初步分析。注意:今天的内容基本都是由前面的模块内容组合而成,有不理解的地方,例如:如何去除线粒体基因?请返回之前的系列推文看相应的内容。
我们将使用scRNA包及其数据进行分析,在正式分析之前,需要先准备数据并且把数据的行名进行转换为Ensemble ID,使用的R包是AnnotationHub包和scater包
"scRNAseq") BiocManager::install(
"scater") BiocManager::install(
library(scRNAseq) # 加载R包
# 载入需要的程辑包:SingleCellExperiment
# 载入需要的程辑包:SummarizedExperiment
# 载入需要的程辑包:MatrixGenerics
# 载入需要的程辑包:matrixStats
#
# 载入程辑包:'MatrixGenerics'
# The following objects are masked from 'package:matrixStats':
#
# colAlls, colAnyNAs, colAnys, colAvgsPerRowSet, colCollapse,
# colCounts, colCummaxs, colCummins, colCumprods, colCumsums,
# colDiffs, colIQRDiffs, colIQRs, colLogSumExps, colMadDiffs,
# colMads, colMaxs, colMeans2, colMedians, colMins, colOrderStats,
# colProds, colQuantiles, colRanges, colRanks, colSdDiffs, colSds,
# colSums2, colTabulates, colVarDiffs, colVars, colWeightedMads,
# colWeightedMeans, colWeightedMedians, colWeightedSds,
# colWeightedVars, rowAlls, rowAnyNAs, rowAnys, rowAvgsPerColSet,
# rowCollapse, rowCounts, rowCummaxs, rowCummins, rowCumprods,
# rowCumsums, rowDiffs, rowIQRDiffs, rowIQRs, rowLogSumExps,
# rowMadDiffs, rowMads, rowMaxs, rowMeans2, rowMedians, rowMins,
# rowOrderStats, rowProds, rowQuantiles, rowRanges, rowRanks,
# rowSdDiffs, rowSds, rowSums2, rowTabulates, rowVarDiffs, rowVars,
# rowWeightedMads, rowWeightedMeans, rowWeightedMedians,
# rowWeightedSds, rowWeightedVars
# 载入需要的程辑包:GenomicRanges
# 载入需要的程辑包:stats4
# 载入需要的程辑包:BiocGenerics
# 载入需要的程辑包:parallel
#
# 载入程辑包:'BiocGenerics'
# The following objects are masked from 'package:parallel':
#
# clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
# clusterExport, clusterMap, parApply, parCapply, parLapply,
# parLapplyLB, parRapply, parSapply, parSapplyLB
# The following objects are masked from 'package:stats':
#
# IQR, mad, sd, var, xtabs
# The following objects are masked from 'package:base':
#
# anyDuplicated, append, as.data.frame, basename, cbind, colnames,
# dirname, do.call, duplicated, eval, evalq, Filter, Find, get, grep,
# grepl, intersect, is.unsorted, lapply, Map, mapply, match, mget,
# order, paste, pmax, pmax.int, pmin, pmin.int, Position, rank,
# rbind, Reduce, rownames, sapply, setdiff, sort, table, tapply,
# union, unique, unsplit, which.max, which.min
# 载入需要的程辑包:S4Vectors
#
# 载入程辑包:'S4Vectors'
# The following objects are masked from 'package:base':
#
# expand.grid, I, unname
# 载入需要的程辑包:IRanges
#
# 载入程辑包:'IRanges'
# The following object is masked from 'package:grDevices':
#
# windows
# 载入需要的程辑包:GenomeInfoDb
# 载入需要的程辑包:Biobase
# Welcome to Bioconductor
#
# Vignettes contain introductory material; view with
# 'browseVignettes()'. To cite Bioconductor, see
# 'citation("Biobase")', and for packages 'citation("pkgname")'.
#
# 载入程辑包:'Biobase'
# The following object is masked from 'package:MatrixGenerics':
#
# rowMedians
# The following objects are masked from 'package:matrixStats':
#
# anyMissing, rowMedians
library(AnnotationHub)
# 载入需要的程辑包:BiocFileCache
# 载入需要的程辑包:dbplyr
#
# 载入程辑包:'AnnotationHub'
# The following object is masked from 'package:Biobase':
#
# cache
library(scater)
# 载入需要的程辑包:scuttle
# 载入需要的程辑包:ggplot2
library(AnnotationDbi)
#号) 运行下面三行代码下载数据(去掉
which="416b") sce.416b <- LunSpikeInData(
$block <- factor(sce.416b$block) sce.416b
"sce.416b.Rda") # 保存数据 save(sce.416b, file =
下载数据后,把数据加载进来
load("sce.416b.Rda")
sce.416b[1:3,1:3] # 由singlecellexperexperiment构建的分析对象
# class: SingleCellExperiment
# dim: 3 3
# metadata(0):
# assays(1): counts
# rownames(3): ENSMUSG00000102693 ENSMUSG00000064842 ENSMUSG00000051951
# rowData names(1): Length
# colnames(3): SLX-9555.N701_S502.C89V9ANXX.s_1.r_1
# SLX-9555.N701_S503.C89V9ANXX.s_1.r_1
# SLX-9555.N701_S504.C89V9ANXX.s_1.r_1
# colData names(9): Source Name cell line ... spike-in addition block
# reducedDimNames(0):
# mainExpName: endogenous
# altExpNames(2): ERCC SIRV
我们用符号重命名singlecellexperexperiment的行,为丢失或重复的符号还原为Ensemble ID
下载注释信息,同样速度比较慢,下载完毕后可以先保存为相应Rda格式
ens.mm.v97 <- AnnotationHub()[["AH73905"]]
# snapshotDate(): 2021-05-18
# loading from cache
# require("ensembldb")
save(ens.mm.v97, file = "ens.mm.v97.Rda")
"ens.mm.v97.Rda") load(
rowData(sce.416b)$ENSEMBL <- rownames(sce.416b)
rowData(sce.416b)$SYMBOL <- mapIds(ens.mm.v97, keys=rownames(sce.416b),
keytype="GENEID", column="SYMBOL")
# Warning: Unable to map 563 of 46604 requested IDs.
rowData(sce.416b)$SEQNAME <- mapIds(ens.mm.v97, keys=rownames(sce.416b),
keytype="GENEID", column="SEQNAME")
# Warning: Unable to map 563 of 46604 requested IDs.
rownames(sce.416b) <- uniquifyFeatureNames(rowData(sce.416b)$ENSEMBL,
rowData(sce.416b)$SYMBOL)
unfiltered <- sce.416b
save(unfiltered, file = "unfiltered.sce.416b2.Rda") # 再保存一份数据备用
在完成了数据的预处理,我们接着往下,那就要对数据进行质控(Quality Control):我们前面已经说过,我们一般不需要线粒体基因,所以我们应该去除线粒体基因:
# 识别并去除线粒体基因(MT基因)
mito <- which(rowData(sce.416b)$SEQNAME == "MT")
stats <- perCellQCMetrics(sce.416b, subsets = list(Mt=mito))
qc <- quickPerCellQC(stats, percent_subsets = c("subsets_Mt_percent",
"altexps_ERCC_percent"),
batch=sce.416b$block)
sce.416b <- sce.416b[,!qc$discard]
colData(unfiltered) <- cbind(colData(unfiltered),
stats)
unfiltered$block <- factor(unfiltered$block)
unfiltered$discard <- qc$discard
# gridExtra进行绘图可视化,下图为每个QC度量在数据集的细胞之间的分布,按样板分类,每个点代表一个细胞,并根据该细胞是否被去除而添加颜色。
gridExtra::grid.arrange(
plotColData(unfiltered,
x = "block",
y = "sum",
colour_by = "discard") +
scale_y_log10() +
ggtitle("Total count"),
plotColData(unfiltered,
x = "block",
y = "detected",
colour_by = "discard") +
scale_y_log10() +
ggtitle("Detected features"),
plotColData(unfiltered,
x = "block",
y = "subsets_Mt_percent",
colour_by = "discard") +
ggtitle("Mito percent"),
plotColData(unfiltered,
x = "block",
y = "altexps_ERCC_percent",
colour_by = "discard") +
ggtitle("ERCC percent"),
nrow=2, ncol=2) # 按2行2列排列
# 下图为在数据集中每个细胞中线粒体读取的百分比,与总计数(左)或峰值读取的百分比(右)相比,每个点代表一个细胞,并根据该细胞是否被去除而添加颜色。
gridExtra::grid.arrange(
x = "sum",
y = "subsets_Mt_percent",
colour_by = "discard") +
scale_x_log10(),
plotColData(unfiltered,
x = "altexps_ERCC_percent",
y = "subsets_Mt_percent",
colour_by = "discard"),
nrow = 2)
还可以检查每种原因所去除的细胞数量
colSums(as.matrix(qc))
# low_lib_size low_n_features high_subsets_Mt_percent
# 5 0 2
# high_altexps_ERCC_percent discard
# 2 7
由于数据集较小,而且所有的细胞都来自同一细胞系,这里我们就不再执行聚类:
# BiocManager::install("scran")
library(scran)
sce.416b <- computeSumFactors(sce.416b)
sce.416b <- logNormCounts(sce.416b)
summary(sizeFactors(sce.416b))
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.3468 0.7111 0.9207 1.0000 1.1520 3.6037
# 进行可视化绘图
plot(librarySizeFactors(sce.416b),
sizeFactors(sce.416b),
pch = 16,
xlab = "Library size factors",
ylab = "Deconvolution factors",
col = c("black", "red")[grepl("induced",
sce.416b$phenotype)+1],
log="xy")
# 从图片上看到,诱导细胞的大小因子从非诱导细胞中系统性地偏移,这与构成偏倚的存在相一致
这里我们取前10%的基因:
dec.416b <- modelGeneVarWithSpikes(sce.416b,
"ERCC",
block = sce.416b$block)
chosen.hvgs <- getTopHVGs(dec.416b,
prop = 0.1)
par(mfrow=c(1,2))
blocked.stats <- dec.416b$per.block
for (i in colnames(blocked.stats)) {
current <- blocked.stats[[i]]
plot(current$mean,
current$total,
main=i,
pch=16,
cex=0.5,
xlab="Mean of log-expression",
ylab="Variance of log-expression")
curfit <- metadata(current)
points(curfit$mean, curfit$var, col="red", pch=16)
curve(curfit$trend(x), col='dodgerblue', add=TRUE, lwd=2)
}
每个点代表一个基因(黑色),其均值-方差趋势(蓝色)与尖峰转录本(红色)相吻合
去除批间差
小样本建议使用removeBatchEffect(),对于较大的数据集,考虑使用batchelor包中的regressBatches()
library(limma)
#
# 载入程辑包:'limma'
# The following object is masked from 'package:scater':
#
# plotMDS
# The following object is masked from 'package:BiocGenerics':
#
# plotMA
assay(sce.416b, "corrected") <- removeBatchEffect(logcounts(sce.416b),
design = model.matrix(~sce.416b$phenotype),
batch=sce.416b$block)
数据降维
sce.416b <- runPCA(sce.416b,
ncomponents = 10,
subset_row = chosen.hvgs,
exprs_values = "corrected",
BSPARAM = BiocSingular::ExactParam())
set.seed(1010)
sce.416b <- runTSNE(sce.416b,
dimred = "PCA",
perplexity = 10)
聚类
my.dist <- dist(reducedDim(sce.416b,
"PCA"))
my.tree <- hclust(my.dist,
method="ward.D2")
library(dynamicTreeCut)
my.clusters <- unname(cutreeDynamic(my.tree,
distM=as.matrix(my.dist),
minClusterSize=10,
verbose=0))
colLabels(sce.416b) <- factor(my.clusters)
table(Cluster=colLabels(sce.416b),
Plate=sce.416b$block)
# Plate
# Cluster 20160113 20160325
# 1 40 38
# 2 37 32
# 3 10 14
# 4 6 8
将cluster与致癌基因诱导状态进行比较,结果发现,在每个cluster的组成上都可以观察到差异
table(Cluster = colLabels(sce.416b),
Oncogene = sce.416b$phenotype)
# Oncogene
# Cluster induced CBFB-MYH11 oncogene expression wild type phenotype
# 1 78 0
# 2 0 69
# 3 1 23
# 4 14 0
plotTSNE(sce.416b,
colour_by = "label")
library(cluster)
clust.col <- scater:::.get_palette("tableau10medium")
sil <- silhouette(my.clusters,
dist = my.dist)
sil.cols <- clust.col[ifelse(sil[,3] > 0,
sil[,1],
sil[,2])]
sil.cols <- sil.cols[order(-sil[,1],
sil[,3])]
plot(sil, main = paste(length(unique(my.clusters)),
"clusters"),
border = sil.cols,
col=sil.cols,
do.col.sort=FALSE)
310 -7.13310 -2.20632
## Cdca8 1 1.01449e-4# 鉴定markers
markers <- findMarkers(sce.416b,
my.clusters,
block = sce.416b$block)
marker.set <- markers[["1"]]
head(marker.set, 10)
# DataFrame with 10 rows and 7 columns
# Top p.value FDR summary.logFC logFC.2 logFC.3
# <integer> <numeric> <numeric> <numeric> <numeric> <numeric>
## Ccna2 1 9.85422e-67 4.59246e-62 -7.131 1.52514e-38 -7.26175 -6.00378 -2.03841
# Pirb 1 4.16555e-33 1.95516e-30 5.87820 5.28149 5.87820
# Cks1b 2 2.98233e-40 3.23229e-37 -6.43381 -6.43381 -4.15385
# Aurkb 2 2.41436e-64 5.62593e-60 -6.94063 -6.94063 -1.65534
# Myh11 2 1.28865e-46 3.75353e-43 4.38182 4.38182 4.29290
# Mcm6 3 1.15877e-28 3.69887e-26 -5.44558 -5.44558 -5.82130
# Cdca3 3 5.02047e-45 1.23144e-41 -6.22179 -6.22179 -2.10502
# Top2a 3 7.25965e-61 1.12776e-56 -7.07811 -7.07811 -2.39123
# Mcm2 4 1.50854e-33 7.98908e-31 -5.54197 -5.54197 -6.09178
# logFC.4
# <numeric>
# Ccna2 -7.3451052
# Cdca8 -7.2617478
# Pirb 0.0352849
# Cks1b -6.4385323
# Aurkb -6.4162126
# Myh11 0.9410499
# Mcm6 -3.5804973
# Cdca3 -7.0539510
# Top2a -6.8297343
# Mcm2 -3.8238103
联合聚类对热图进行可视化
top.markers <- rownames(marker.set)[marker.set$Top <= 10]
plotHeatmap(sce.416b,
features = top.markers,
order_columns_by = "label",
colour_columns_by = c("label",
"block",
"phenotype"),
center=TRUE,
symmetric=TRUE,
zlim=c(-5, 5))
# Warning: 'symmetric=' is deprecated and is automatically set to TRUE when
# 'center=TRUE'
这样我们Smart-seq2地单细胞数据处理流程也就基本完成,再往下就可以做一些个性化的分析啦!其实上面这些步骤在我们前面的系列推文中都能够找到相应的内容,大家可以比对学习下,也可以用前面讲到的其他一些方法再尝试分析。
好啦,今天内容就到这里,后台回复“feng52”获取代码和数据,我们下期再见!
单细胞分析专栏传送门
碎碎念专栏传送门(完结)
风之美图系列传送门(完结)
—
END—
撰文丨风 风
排版丨四金兄
主编丨小雪球
欢迎大家关注解螺旋生信频道-挑圈联靠公号~
关键词
数据
细胞
单细胞
基因
代码
最新评论
推荐文章
作者最新文章
你可能感兴趣的文章
Copyright Disclaimer: The copyright of contents (including texts, images, videos and audios) posted above belong to the User who shared or the third-party website which the User shared from. If you found your copyright have been infringed, please send a DMCA takedown notice to [email protected]. For more detail of the source, please click on the button "Read Original Post" below. For other communications, please send to [email protected].
版权声明:以上内容为用户推荐收藏至CareerEngine平台,其内容(含文字、图片、视频、音频等)及知识版权均属用户或用户转发自的第三方网站,如涉嫌侵权,请通知[email protected]进行信息删除。如需查看信息来源,请点击“查看原文”。如需洽谈其它事宜,请联系[email protected]。
版权声明:以上内容为用户推荐收藏至CareerEngine平台,其内容(含文字、图片、视频、音频等)及知识版权均属用户或用户转发自的第三方网站,如涉嫌侵权,请通知[email protected]进行信息删除。如需查看信息来源,请点击“查看原文”。如需洽谈其它事宜,请联系[email protected]。