Smart_Seq2数据处理
大家好,我是风风。欢迎来到风风的从零开始单细胞系列一。前面我们基本已经把单细胞分析的方方面面学习完了,但是这些流程大部分只属于10x Genomics的数据,关于其他类型的数据,我们又该怎么入手呢?
目前来说,单细胞转录组测序主要分为10x Genomics、Smart_seq2、CEL_seq2、CEL_seq、SMARTer、STRT等几种类型,在对GEO的数据进行检索分析,发现目前GEO数据库收录的内容大部分是10x Genomics、Smart_seq2、CEL_seq2、STRT这几种。
今天,我们就来看看,拿到一份Smart_seq2数据之后,该怎么对其进行初步分析。注意:今天的内容基本都是由前面的模块内容组合而成,有不理解的地方,例如:如何去除线粒体基因?请返回之前的系列推文看相应的内容。
数据预处理
我们将使用scRNA包及其数据进行分析,在正式分析之前,需要先准备数据并且把数据的行名进行转换为Ensemble ID,使用的R包是AnnotationHub包和scater包
# BiocManager::install("scRNAseq")# BiocManager::install("scater")library(scRNAseq) # 加载R包## 载入需要的程辑包:SingleCellExperiment## 载入需要的程辑包:SummarizedExperiment## 载入需要的程辑包:MatrixGenerics## 载入需要的程辑包:matrixStats## ## 载入程辑包:'MatrixGenerics'## The following objects are masked from 'package:matrixStats':## ## colAlls, colAnyNAs, colAnys, colAvgsPerRowSet, colCollapse,## colCounts, colCummaxs, colCummins, colCumprods, colCumsums,## colDiffs, colIQRDiffs, colIQRs, colLogSumExps, colMadDiffs,## colMads, colMaxs, colMeans2, colMedians, colMins, colOrderStats,## colProds, colQuantiles, colRanges, colRanks, colSdDiffs, colSds,## colSums2, colTabulates, colVarDiffs, colVars, colWeightedMads,## colWeightedMeans, colWeightedMedians, colWeightedSds,## colWeightedVars, rowAlls, rowAnyNAs, rowAnys, rowAvgsPerColSet,## rowCollapse, rowCounts, rowCummaxs, rowCummins, rowCumprods,## rowCumsums, rowDiffs, rowIQRDiffs, rowIQRs, rowLogSumExps,## rowMadDiffs, rowMads, rowMaxs, rowMeans2, rowMedians, rowMins,## rowOrderStats, rowProds, rowQuantiles, rowRanges, rowRanks,## rowSdDiffs, rowSds, rowSums2, rowTabulates, rowVarDiffs, rowVars,## rowWeightedMads, rowWeightedMeans, rowWeightedMedians,## rowWeightedSds, rowWeightedVars## 载入需要的程辑包:GenomicRanges## 载入需要的程辑包:stats4## 载入需要的程辑包:BiocGenerics## 载入需要的程辑包:parallel## ## 载入程辑包:'BiocGenerics'## The following objects are masked from 'package:parallel':## ## clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,## clusterExport, clusterMap, parApply, parCapply, parLapply,## parLapplyLB, parRapply, parSapply, parSapplyLB## The following objects are masked from 'package:stats':## ## IQR, mad, sd, var, xtabs## The following objects are masked from 'package:base':## ## anyDuplicated, append, as.data.frame, basename, cbind, colnames,## dirname, do.call, duplicated, eval, evalq, Filter, Find, get, grep,## grepl, intersect, is.unsorted, lapply, Map, mapply, match, mget,## order, paste, pmax, pmax.int, pmin, pmin.int, Position, rank,## rbind, Reduce, rownames, sapply, setdiff, sort, table, tapply,## union, unique, unsplit, which.max, which.min## 载入需要的程辑包:S4Vectors## ## 载入程辑包:'S4Vectors'## The following objects are masked from 'package:base':## ## expand.grid, I, unname## 载入需要的程辑包:IRanges## ## 载入程辑包:'IRanges'## The following object is masked from 'package:grDevices':## ## windows## 载入需要的程辑包:GenomeInfoDb## 载入需要的程辑包:Biobase## Welcome to Bioconductor## ## Vignettes contain introductory material; view with## 'browseVignettes()'. To cite Bioconductor, see## 'citation("Biobase")', and for packages 'citation("pkgname")'.## ## 载入程辑包:'Biobase'## The following object is masked from 'package:MatrixGenerics':## ## rowMedians## The following objects are masked from 'package:matrixStats':## ## anyMissing, rowMedianslibrary(AnnotationHub)## 载入需要的程辑包:BiocFileCache## 载入需要的程辑包:dbplyr## ## 载入程辑包:'AnnotationHub'## The following object is masked from 'package:Biobase':## ## cachelibrary(scater)## 载入需要的程辑包:scuttle## 载入需要的程辑包:ggplot2library(AnnotationDbi)# 运行下面三行代码下载数据(去掉#号)# sce.416b <- LunSpikeInData(which="416b") # sce.416b$block <- factor(sce.416b$block)# save(sce.416b, file = "sce.416b.Rda"# 保存数据# 下载数据后,把数据加载进来load("sce.416b.Rda")sce.416b[1:3,1:3] # 由singlecellexperexperiment构建的分析对象## class: SingleCellExperiment ## dim: 3 3 ## metadata(0):## assays(1): counts## rownames(3): ENSMUSG00000102693 ENSMUSG00000064842 ENSMUSG00000051951## rowData names(1): Length## colnames(3): SLX-9555.N701_S502.C89V9ANXX.s_1.r_1## SLX-9555.N701_S503.C89V9ANXX.s_1.r_1## SLX-9555.N701_S504.C89V9ANXX.s_1.r_1## colData names(9): Source Name cell line ... spike-in addition block## reducedDimNames(0):## mainExpName: endogenous## altExpNames(2): ERCC SIRV# 我们用符号重命名singlecellexperexperiment的行,为丢失或重复的符号还原为Ensemble ID# 下载注释信息,同样速度比较慢,下载完毕后可以先保存为相应Rda格式ens.mm.v97 <- AnnotationHub()[["AH73905"]]## snapshotDate(): 2021-05-18## loading from cache## require("ensembldb")save(ens.mm.v97, file = "ens.mm.v97.Rda")# load("ens.mm.v97.Rda")rowData(sce.416b)$ENSEMBL <- rownames(sce.416b)rowData(sce.416b)$SYMBOL <- mapIds(ens.mm.v97, keys=rownames(sce.416b), keytype="GENEID", column="SYMBOL")## Warning: Unable to map 563 of 46604 requested IDs.rowData(sce.416b)$SEQNAME <- mapIds(ens.mm.v97, keys=rownames(sce.416b), keytype="GENEID", column="SEQNAME")## Warning: Unable to map 563 of 46604 requested IDs.rownames(sce.416b) <- uniquifyFeatureNames(rowData(sce.416b)$ENSEMBL, rowData(sce.416b)$SYMBOL)unfiltered <- sce.416bsave(unfiltered, file = "unfiltered.sce.416b2.Rda") # 再保存一份数据备用
数据质控QC
在完成了数据的预处理,我们接着往下,那就要对数据进行质控(Quality Control):我们前面已经说过,我们一般不需要线粒体基因,所以我们应该去除线粒体基因:
# 识别并去除线粒体基因(MT基因)mito <- which(rowData(sce.416b)$SEQNAME == "MT")stats <- perCellQCMetrics(sce.416b, subsets = list(Mt=mito))qc <- quickPerCellQC(stats, percent_subsets = c("subsets_Mt_percent","altexps_ERCC_percent"), batch=sce.416b$block)sce.416b <- sce.416b[,!qc$discard]colData(unfiltered) <- cbind(colData(unfiltered), stats)unfiltered$block <- factor(unfiltered$block)unfiltered$discard <- qc$discard# gridExtra进行绘图可视化,下图为每个QC度量在数据集的细胞之间的分布,按样板分类,每个点代表一个细胞,并根据该细胞是否被去除而添加颜色。gridExtra::grid.arrange( plotColData(unfiltered, x = "block", y = "sum", colour_by = "discard") + scale_y_log10() + ggtitle("Total count"), plotColData(unfiltered, x = "block", y = "detected", colour_by = "discard") + scale_y_log10() + ggtitle("Detected features"), plotColData(unfiltered, x = "block", y = "subsets_Mt_percent", colour_by = "discard") + ggtitle("Mito percent"), plotColData(unfiltered, x = "block", y = "altexps_ERCC_percent", colour_by = "discard") + ggtitle("ERCC percent"), nrow=2, ncol=2) # 按2行2列排列
# 下图为在数据集中每个细胞中线粒体读取的百分比,与总计数(左)或峰值读取的百分比(右)相比,每个点代表一个细胞,并根据该细胞是否被去除而添加颜色。gridExtra::grid.arrange(plotColData(unfiltered,x = "sum",y = "subsets_Mt_percent", colour_by = "discard") +scale_x_log10(),plotColData(unfiltered,x = "altexps_ERCC_percent",y = "subsets_Mt_percent",colour_by = "discard"),nrow = 2)
# 还可以检查每种原因所去除的细胞数量colSums(as.matrix(qc))## low_lib_size low_n_features high_subsets_Mt_percent ## 5 0 2 ## high_altexps_ERCC_percent discard ## 2 7
标准化
由于数据集较小,而且所有的细胞都来自同一细胞系,这里我们就不再执行聚类:
# BiocManager::install("scran")library(scran)sce.416b <- computeSumFactors(sce.416b)sce.416b <- logNormCounts(sce.416b)summary(sizeFactors(sce.416b))## Min. 1st Qu. Median Mean 3rd Qu. Max. ## 0.3468 0.7111 0.9207 1.0000 1.1520 3.6037# 进行可视化绘图plot(librarySizeFactors(sce.416b), sizeFactors(sce.416b), pch = 16, xlab = "Library size factors", ylab = "Deconvolution factors", col = c("black", "red")[grepl("induced", sce.416b$phenotype)+1],log="xy")
# 从图片上看到,诱导细胞的大小因子从非诱导细胞中系统性地偏移,这与构成偏倚的存在相一致
构建方差模型
这里我们取前10%的基因:
dec.416b <- modelGeneVarWithSpikes(sce.416b,"ERCC", block = sce.416b$block)chosen.hvgs <- getTopHVGs(dec.416b, prop = 0.1)par(mfrow=c(1,2))blocked.stats <- dec.416b$per.blockfor (i in colnames(blocked.stats)) { current <- blocked.stats[[i]] plot(current$mean, current$total, main=i, pch=16, cex=0.5, xlab="Mean of log-expression", ylab="Variance of log-expression") curfit <- metadata(current) points(curfit$mean, curfit$var, col="red", pch=16) curve(curfit$trend(x), col='dodgerblue', add=TRUE, lwd=2)}
# 每个点代表一个基因(黑色),其均值-方差趋势(蓝色)与尖峰转录本(红色)相吻合# 去除批间差# 小样本建议使用removeBatchEffect(),对于较大的数据集,考虑使用batchelor包中的regressBatches()library(limma)## ## 载入程辑包:'limma'## The following object is masked from 'package:scater':## ## plotMDS## The following object is masked from 'package:BiocGenerics':## ## plotMAassay(sce.416b, "corrected") <- removeBatchEffect(logcounts(sce.416b), design = model.matrix(~sce.416b$phenotype), batch=sce.416b$block)# 数据降维sce.416b <- runPCA(sce.416b, ncomponents = 10, subset_row = chosen.hvgs, exprs_values = "corrected", BSPARAM = BiocSingular::ExactParam())set.seed(1010)sce.416b <- runTSNE(sce.416b, dimred = "PCA", perplexity = 10)
聚类和鉴定markers
# 聚类my.dist <- dist(reducedDim(sce.416b, "PCA"))my.tree <- hclust(my.dist,                  method="ward.D2")library(dynamicTreeCut)my.clusters <- unname(cutreeDynamic(my.tree, distM=as.matrix(my.dist), minClusterSize=10, verbose=0))colLabels(sce.416b) <- factor(my.clusters)table(Cluster=colLabels(sce.416b), Plate=sce.416b$block)## Plate## Cluster 20160113 20160325## 1 40 38## 2 37 32## 3 10 14## 4 6 8# 将cluster与致癌基因诱导状态进行比较,结果发现,在每个cluster的组成上都可以观察到差异table(Cluster = colLabels(sce.416b), Oncogene = sce.416b$phenotype)## Oncogene## Cluster induced CBFB-MYH11 oncogene expression wild type phenotype## 1 78 0## 2 0 69## 3 1 23## 4 14 0plotTSNE(sce.416b, colour_by = "label")
library(cluster)clust.col <- scater:::.get_palette("tableau10medium") sil <- silhouette(my.clusters, dist = my.dist)sil.cols <- clust.col[ifelse(sil[,3] > 0, sil[,1], sil[,2])]sil.cols <- sil.cols[order(-sil[,1], sil[,3])]plot(sil, main = paste(length(unique(my.clusters)), "clusters"), border = sil.cols, col=sil.cols, do.col.sort=FALSE)
310 -7.13310 -2.20632## Cdca8 1 1.01449e-4# 鉴定markersmarkers <- findMarkers(sce.416b, my.clusters, block = sce.416b$block)marker.set <- markers[["1"]]head(marker.set, 10)## DataFrame with 10 rows and 7 columns## Top p.value FDR summary.logFC logFC.2 logFC.3## <integer> <numeric> <numeric> <numeric> <numeric> <numeric>## Ccna2 1 9.85422e-67 4.59246e-62 -7.131 1.52514e-38 -7.26175 -6.00378 -2.03841## Pirb 1 4.16555e-33 1.95516e-30 5.87820 5.28149 5.87820## Cks1b 2 2.98233e-40 3.23229e-37 -6.43381 -6.43381 -4.15385## Aurkb 2 2.41436e-64 5.62593e-60 -6.94063 -6.94063 -1.65534## Myh11 2 1.28865e-46 3.75353e-43 4.38182 4.38182 4.29290## Mcm6 3 1.15877e-28 3.69887e-26 -5.44558 -5.44558 -5.82130## Cdca3 3 5.02047e-45 1.23144e-41 -6.22179 -6.22179 -2.10502## Top2a 3 7.25965e-61 1.12776e-56 -7.07811 -7.07811 -2.39123## Mcm2 4 1.50854e-33 7.98908e-31 -5.54197 -5.54197 -6.09178## logFC.4## <numeric>## Ccna2 -7.3451052## Cdca8 -7.2617478## Pirb 0.0352849## Cks1b -6.4385323## Aurkb -6.4162126## Myh11 0.9410499## Mcm6 -3.5804973## Cdca3 -7.0539510## Top2a -6.8297343## Mcm2 -3.8238103# 联合聚类对热图进行可视化top.markers <- rownames(marker.set)[marker.set$Top <= 10]plotHeatmap(sce.416b, features = top.markers, order_columns_by = "label", colour_columns_by = c("label", "block", "phenotype"), center=TRUE, symmetric=TRUE, zlim=c(-5, 5))## Warning: 'symmetric=' is deprecated and is automatically set to TRUE when## 'center=TRUE'

这样我们Smart-seq2地单细胞数据处理流程也就基本完成,再往下就可以做一些个性化的分析啦!其实上面这些步骤在我们前面的系列推文中都能够找到相应的内容,大家可以比对学习下,也可以用前面讲到的其他一些方法再尝试分析。
好啦,今天内容就到这里,后台回复“feng52”获取代码和数据,我们下期再见!
单细胞分析专栏传送门
碎碎念专栏传送门(完结)
风之美图系列传送门(完结)
END

撰文丨风   风
排版丨四金兄
主编丨小雪球
欢迎大家关注解螺旋生信频道-挑圈联靠公号~

继续阅读
阅读原文