又见Nature，这次的分析还能举一反三

大家好，我是风。欢迎来到风风的从零开始单细胞系列。终于我们前面说完了单细胞分析的简单分析流程。其实如果你要分析的数据是10x Genomics的数据，并且格式非常标准，那你只需要掌握Seurat包的分析流程即可。然而，事实上我们大多数人并没有选择数据平台的权利，往往我们要分析的特定疾病的公共数据库数据集就只有那么两三个，再刚好都是10x Genomics并且原作者也刚好上传了非常标准的格式，可能性很小了对吧？所以前面兜兜转转相同的分析目的都提供了不同的分析过程。对了对了，有学员留言想看细胞自动注释的R包，也有学员私聊能不能按照测序方法（比如10x Genomics、smart-seq2等）分别聊一下分析流程？尽量满足大家，但是推文毕竟篇幅有限，所以不敢贸然答应，见谅见谅/(ㄒoㄒ)/~~

今天我们要聊的是单细胞分析的文章中，常见却高级的一种分析方法——拟时序分析，也有人称之为“伪时间分析”。正如前面的推文所说，不管是“拟”还是“伪”，其实都是假的时序分析。简单来说，基于单细胞测序数据，通过特定算法来模拟研究细胞不断变化的过程，就称之为“拟时序分析”。

你可能要问了，既然有假的时序分析，那肯定有真的时序分析吧？为什么不在实验设计和测序时候直接监测单个细胞的分化等变化过程，做到“真”时序分析呢？好吧，你说的对，但是不是不想，而是做不到！原因在于在提取RNA时，细胞就会被裂解破坏，这样就没办法观察到单个细胞变化的过程（包括发育过程、分化过程等）并测得每个过程的scRNA-seq数据。

但是，scRNA-seq测序时会在多个时间点进行采样，并获得基因表达谱的 snapshots。而由于一些细胞在分化过程中会比其他细胞发展得更快，每一张snapshots可能包含了发育过程中不同时刻的细胞，这样我们就可以使用统计方法和算法来对这些细胞进行排列，使细胞沿着一个或多个潜在发育轨迹的顺序进行排列，这种顺序被称为拟时序（模拟时间顺序）。

Cannoodt等人在2016年曾经总结过拟时序分析的方法，有：SCUBA、Wanderlust、Wishbone、SLICER、SCOUP、Waterfall、Mpath、Monocle、TSCAN。这些方法有的使用python实现，有的使用matlab实现，有的使用R实现：

就我个人尝试的结果来说，matlab实现SCUBA的时候总会遇到bug，Wanderlust传闻需要注册，Monocle、TSCAN相对稳定便捷，ouija对电脑有要求，耗时长。可能这也是大多数人用Monocle的原因吧，稳定、方便、快捷。

接下来，我们将用两期的内容，跟大家一起从范文到代码，聊一聊R中实现拟时序分析的方法，以及相关论文应用，包括：Monocle, TSCAN, destiny, ouija和SLICER。尽管目前也有一些其他的拟时序分析方法，但是用的较多的还是上面5种。

数据初探

在进行拟时序分析之前，我们要对手上的单细胞数据先进行初步探索，使用聚类分析看一下数据的类别：

rm(list = ls())options(stringsAsFactors = FALSE)set.seed(123456) # 设置随机种子，方便重复结果# 安装R包# install.packages("devtools")# install.packages("SLICER")# BiocManager::install("SingleCellExperiment")# BiocManager::install("M3Drop")# BiocManager::install("TSCAN")# BiocManager::install("monocle")# BiocManager::install("destiny")# devtools::install_github("kieranrcampbell/ouija")# install.packages("viridis")library(SingleCellExperiment)## Loading required package: SummarizedExperiment## Loading required package: GenomicRanges## Loading required package: stats4## Loading required package: BiocGenerics## Loading required package: parallel## ## Attaching package: 'BiocGenerics'## The following objects are masked from 'package:parallel':## ## clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,## clusterExport, clusterMap, parApply, parCapply, parLapply,## parLapplyLB, parRapply, parSapply, parSapplyLB## The following objects are masked from 'package:stats':## ## IQR, mad, sd, var, xtabs## The following objects are masked from 'package:base':## ## anyDuplicated, append, as.data.frame, basename, cbind, colnames,## dirname, do.call, duplicated, eval, evalq, Filter, Find, get, grep,## grepl, intersect, is.unsorted, lapply, Map, mapply, match, mget,## order, paste, pmax, pmax.int, pmin, pmin.int, Position, rank,## rbind, Reduce, rownames, sapply, setdiff, sort, table, tapply,## union, unique, unsplit, which, which.max, which.min## Loading required package: S4Vectors## ## Attaching package: 'S4Vectors'## The following object is masked from 'package:base':## ## expand.grid## Loading required package: IRanges## ## Attaching package: 'IRanges'## The following object is masked from 'package:grDevices':## ## windows## Loading required package: GenomeInfoDb## Loading required package: Biobase## Welcome to Bioconductor## ## Vignettes contain introductory material; view with## 'browseVignettes()'. To cite Bioconductor, see## 'citation("Biobase")', and for packages 'citation("pkgname")'.## Loading required package: DelayedArray## Loading required package: matrixStats## ## Attaching package: 'matrixStats'## The following objects are masked from 'package:Biobase':## ## anyMissing, rowMedians## Loading required package: BiocParallel## ## Attaching package: 'DelayedArray'## The following objects are masked from 'package:matrixStats':## ## colMaxs, colMins, colRanges, rowMaxs, rowMins, rowRanges## The following objects are masked from 'package:base':## ## aperm, apply, rowsumlibrary(TSCAN)library(M3Drop)## Loading required package: numDerivlibrary(monocle)## Loading required package: Matrix## ## Attaching package: 'Matrix'## The following object is masked from 'package:S4Vectors':## ## expand## Loading required package: ggplot2## Loading required package: VGAM## Loading required package: splines## Loading required package: DDRTree## Loading required package: irlba## Warning: package 'irlba' was built under R version 4.0.2library(destiny)## ## Attaching package: 'destiny'## The following object is masked from 'package:SummarizedExperiment':## ## distance## The following object is masked from 'package:GenomicRanges':## ## distance## The following object is masked from 'package:IRanges':## ## distancelibrary(SLICER)library(ouija)## Loading required package: Rcpp## Registered S3 method overwritten by 'cli':## method from ## print.boxx spatstat.geomlibrary(scater)library(ggplot2)library(ggthemes)library(ggbeeswarm)library(corrplot)## corrplot 0.84 loaded## ## Attaching package: 'corrplot'## The following object is masked from 'package:destiny':## ## colorlegend# 读取数据PT_SCE <- readRDS("Features.rds")PT_SCE$cell_type2 <- factor( PT_SCE$cell_type2, levels = c("zy", "early2cell", "mid2cell", "late2cell", "4cell", "8cell", "16cell", "earlyblast", "midblast", "lateblast"))cellLabels <- PT_SCE$cell_type2PT <- counts(PT_SCE)colnames(PT) <- cellLabelsPT_SCE <- runPCA(PT_SCE)plotPCA(PT_SCE, colour_by = "cell_type2")

# 简单的拟时序分析可以直接使用PCA的主成分来进行分析# 例如从上图PCA结果显示PC1为34%，PC2为16%，因此我们可以考虑使用PC1来简单进行拟时序分析PT_SCE$PC1 <- reducedDim(PT_SCE, "PCA")[,1] # 提取主成分1ggplot(as.data.frame(colData(PT_SCE)), aes(x = PC1, y = cell_type2, colour = cell_type2)) + geom_quasirandom(groupOnX = FALSE) + scale_color_tableau() + theme_classic() + xlab("Principal Component 1") + ylab("Timepoint") + ggtitle("Cells ordered by first principal component")

# 如上图所示，PC1在发育过程的早期和晚期都很难正确排序细胞，但总的来说，它在按发育时间排序细胞方面结果还可以

monocle

接下来我们来尝试下monocle方法。monocle不会对数据进行聚类，而是直接对细胞进行降维并建立最小生成树以连接所有细胞。然后，monocle识别了该树中的最长路径，以确定拟时序过程。如果数据包含不同的轨迹（即一种细胞类型分化为两种不同的细胞类型），，monocle可以识别这些轨迹，并将每个生成的分支路径都定义为单独的细胞类型。

关于monocle方法的运用，可以参考Nature的这篇文章：

其中用到了几个图：

都是使用了monocle进行拟时序分析，可以参考文章的写法和结果分析。

接下来我们看看R中monocle法实现拟时序分析的代码实现过程：

# monocle之前需要先进行特征筛选，这里使用M3Drop进行特征筛选：m3dGenes <- as.character( M3DropFeatureSelection(PT)$Gene)## Warning in bg__calc_variables(expr_mat): Warning: Removing 1134 undetected## genes.

length(m3dGenes) # 得到973个基因，接着提取矩阵## [1] 973intergene <- which(rownames(PT) %in% m3dGenes)d <- PT[intergene, ]d <- d[!duplicated(rownames(d)), ] # 去除重复项# 开始monocle分析colnames(d) <- 1:ncol(d)geneNames <- rownames(d)rownames(d) <- 1:nrow(d)pd <- data.frame(timepoint = cellLabels)pd## timepoint## 1 16cell## 2 16cell## 3 16cell## 4 16cell## 5 16cell## 6 16cell## 7 16cell## 8 16cell## 9 16cell## 10 16cell## 11 16cell## 12 16cell## 13 16cell## 14 16cell## 15 16cell## 16 16cell## 17 16cell## 18 16cell## 19 16cell## 20 16cell## 21 16cell## 22 16cell## 23 16cell## 24 16cell## 25 16cell## 26 16cell## 27 16cell## 28 16cell## 29 16cell## 30 16cell## 31 16cell## 32 16cell## 33 16cell## 34 16cell## 35 16cell## 36 16cell## 37 16cell## 38 16cell## 39 16cell## 40 16cell## 41 16cell## 42 16cell## 43 16cell## 44 16cell## 45 16cell## 46 16cell## 47 16cell## 48 16cell## 49 16cell## 50 16cell## 51 4cell## 52 4cell## 53 4cell## 54 4cell## 55 4cell## 56 4cell## 57 4cell## 58 4cell## 59 4cell## 60 4cell## 61 4cell## 62 4cell## 63 4cell## 64 4cell## 65 8cell## 66 8cell## 67 8cell## 68 8cell## 69 8cell## 70 8cell## 71 8cell## 72 8cell## 73 8cell## 74 8cell## 75 8cell## 76 8cell## 77 8cell## 78 8cell## 79 8cell## 80 8cell## 81 8cell## 82 8cell## 83 8cell## 84 8cell## 85 8cell## 86 8cell## 87 8cell## 88 8cell## 89 8cell## 90 8cell## 91 8cell## 92 8cell## 93 8cell## 94 8cell## 95 8cell## 96 8cell## 97 8cell## 98 8cell## 99 8cell## 100 8cell## 101 8cell## 102 early2cell## 103 early2cell## 104 early2cell## 105 early2cell## 106 early2cell## 107 early2cell## 108 early2cell## 109 early2cell## 110 earlyblast## 111 earlyblast## 112 earlyblast## 113 earlyblast## 114 earlyblast## 115 earlyblast## 116 earlyblast## 117 earlyblast## 118 earlyblast## 119 earlyblast## 120 earlyblast## 121 earlyblast## 122 earlyblast## 123 earlyblast## 124 earlyblast## 125 earlyblast## 126 earlyblast## 127 earlyblast## 128 earlyblast## 129 earlyblast## 130 earlyblast## 131 earlyblast## 132 earlyblast## 133 earlyblast## 134 earlyblast## 135 earlyblast## 136 earlyblast## 137 earlyblast## 138 earlyblast## 139 earlyblast## 140 earlyblast## 141 earlyblast## 142 earlyblast## 143 earlyblast## 144 earlyblast## 145 earlyblast## 146 earlyblast## 147 earlyblast## 148 earlyblast## 149 earlyblast## 150 earlyblast## 151 earlyblast## 152 earlyblast## 153 late2cell## 154 late2cell## 155 late2cell## 156 late2cell## 157 late2cell## 158 late2cell## 159 late2cell## 160 late2cell## 161 late2cell## 162 late2cell## 163 lateblast## 164 lateblast## 165 lateblast## 166 lateblast## 167 lateblast## 168 lateblast## 169 lateblast## 170 lateblast## 171 lateblast## 172 lateblast## 173 lateblast## 174 lateblast## 175 lateblast## 176 lateblast## 177 lateblast## 178 lateblast## 179 lateblast## 180 lateblast## 181 lateblast## 182 lateblast## 183 lateblast## 184 lateblast## 185 lateblast## 186 lateblast## 187 lateblast## 188 lateblast## 189 lateblast## 190 lateblast## 191 lateblast## 192 lateblast## 193 mid2cell## 194 mid2cell## 195 mid2cell## 196 mid2cell## 197 mid2cell## 198 mid2cell## 199 mid2cell## 200 mid2cell## 201 mid2cell## 202 mid2cell## 203 mid2cell## 204 mid2cell## 205 midblast## 206 midblast## 207 midblast## 208 midblast## 209 midblast## 210 midblast## 211 midblast## 212 midblast## 213 midblast## 214 midblast## 215 midblast## 216 midblast## 217 midblast## 218 midblast## 219 midblast## 220 midblast## 221 midblast## 222 midblast## 223 midblast## 224 midblast## 225 midblast## 226 midblast## 227 midblast## 228 midblast## 229 midblast## 230 midblast## 231 midblast## 232 midblast## 233 midblast## 234 midblast## 235 midblast## 236 midblast## 237 midblast## 238 midblast## 239 midblast## 240 midblast## 241 midblast## 242 midblast## 243 midblast## 244 midblast## 245 midblast## 246 midblast## 247 midblast## 248 midblast## 249 midblast## 250 midblast## 251 midblast## 252 midblast## 253 midblast## 254 midblast## 255 midblast## 256 midblast## 257 midblast## 258 midblast## 259 midblast## 260 midblast## 261 midblast## 262 midblast## 263 midblast## 264 midblast## 265 zy## 266 zy## 267 zy## 268 zypd <- new("AnnotatedDataFrame", data=pd)pd## An object of class 'AnnotatedDataFrame'## rowNames: 1 2 ... 268 (268 total)## varLabels: timepoint## varMetadata: labelDescriptionfd <- data.frame(gene_short_name = geneNames)head(fd)## gene_short_name## 1 Arrdc1## 2 Stac2## 3 Rab3d## 4 Gm13363## 5 Btnl3## 6 4930562C15Rikfd <- new("AnnotatedDataFrame", data=fd)fd## An object of class 'AnnotatedDataFrame'## rowNames: 1 2 ... 973 (973 total)## varLabels: gene_short_name## varMetadata: labelDescriptiondCellData <- newCellDataSet(d, phenoData = pd, featureData = fd, expressionFamily = tobit())dCellData## CellDataSet (storageMode: environment)## assayData: 973 features, 268 samples ## element names: exprs ## protocolData: none## phenoData## sampleNames: 1 2 ... 268 (268 total)## varLabels: timepoint Size_Factor## varMetadata: labelDescription## featureData## featureNames: 1 2 ... 973 (973 total)## fvarLabels: gene_short_name## fvarMetadata: labelDescription## experimentData: use 'experimentData(object)'## Annotation:dCellData <- setOrderingFilter(dCellData, which(geneNames %in% m3dGenes))dCellData <- estimateSizeFactors(dCellData)dCellDataSet <- reduceDimension(dCellData, pseudo_expr = 1)dCellDataSet## CellDataSet (storageMode: environment)## assayData: 973 features, 268 samples ## element names: exprs ## protocolData: none## phenoData## sampleNames: 1 2 ... 268 (268 total)## varLabels: timepoint Size_Factor## varMetadata: labelDescription## featureData## featureNames: 1 2 ... 973 (973 total)## fvarLabels: gene_short_name use_for_ordering## fvarMetadata: labelDescription## experimentData: use 'experimentData(object)'## Annotation:dCellDataSet <- orderCells(dCellDataSet, reverse = FALSE)plot_cell_trajectory(dCellDataSet)## Warning: `select_()` was deprecated in dplyr 0.7.0.## Please use `select()` instead.

# 从上图我们已经使用monocle把拟时序分析完成，但是我们还需要把结果保存下来：pseudotime_monocle <- data.frame( Timepoint = phenoData(dCellDataSet)$timepoint, pseudotime = phenoData(dCellDataSet)$Pseudotime, State = phenoData(dCellDataSet)$State )rownames(pseudotime_monocle) <- 1:ncol(d)head(pseudotime_monocle, 10)## Timepoint pseudotime State## 1 16cell 14.50536 2## 2 16cell 13.16401 3## 3 16cell 13.39352 2## 4 16cell 11.56981 1## 5 16cell 16.68029 2## 6 16cell 17.19478 2## 7 16cell 13.75265 2## 8 16cell 18.07869 2## 9 16cell 13.38970 2## 10 16cell 15.43145 2pseudotime_order_monocle <- rownames(pseudotime_monocle[order(pseudotime_monocle$pseudotime), ])# 接下来将推断的伪时间与已知的采样时间点进行比较PT_SCE$pseudotime_monocle <- pseudotime_monocle$pseudotimetable(PT_SCE$cell_type2)## ## zy early2cell mid2cell late2cell 4cell 8cell 16cell ## 4 8 12 10 14 37 50 ## earlyblast midblast lateblast ## 43 60 30ggplot(as.data.frame(colData(PT_SCE)), aes(x = pseudotime_monocle, y = cell_type2, colour = cell_type2)) + geom_quasirandom(groupOnX = FALSE) + scale_color_tableau() + theme_classic() + xlab("monocle pseudotime") + ylab("Timepoint") + ggtitle("Cells ordered by monocle pseudotime")

# 从图片上可以发现，monocle结果中，late2cell组与zy、early2cell和mid2cell细胞完全分离，但是顺序正确，而4cell、8cell、16cell、earlyblast、midblast和lateblast则都没有分离

# 我们还可以跟踪识别在分化过程中起重要作用的基因,我们用Hif1a基因来说明这个过程。我们可以将拟时序分析结果添加到SCE对象中，这样就可以使用scater包的全部绘图函数来研究基因表达、细胞群和拟时序之间的关系

# 特定主成分进行拟时序分析的基因随时间表达情况plotExpression(PT_SCE,"Hif1a", x = "PC1", colour_by = "cell_type2", show_violin = FALSE, show_smooth = TRUE)## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

# monocle进行拟时序分析的基因随时间表达情况plotExpression(PT_SCE, "Hif1a", x = "pseudotime_monocle", colour_by = "cell_type2", show_violin = FALSE, show_smooth = TRUE)## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

TSCAN

TSCAN和monocle不同，它将聚类与拟时序分析结合在一起。首先，它使用基于正态分布混合的mclust对细胞进行聚类；然后，构建最小的生成树以连接群集。连接最多簇数的该树的分支作为确定拟时序分析的主分支。关于TSCAN的方法和使用可以参考Nucleic Acids Research的这篇文献：

当然TSCAN也有交互式版本：

这篇文章也同样提到了几种方法的对比：

接下来，我们看看TSCAN分析的代码过程：

# 使用所有的基因对细胞进行排序。procPT<- TSCAN::preprocess(PT)colnames(procPT)<- 1:ncol(PT)PTclust<- TSCAN::exprmclust(procPT, clusternum = 10)TSCAN::plotmclust(PTclust)

PTorderTSCAN <- TSCAN::TSCANorder(PTclust, orderonly = FALSE)head(PTorderTSCAN,10)## sample_name State Pseudotime## 88 88 5 1## 115 115 5 2## 173 173 5 3## 175 175 5 4## 229 229 5 5## 125 125 5 6## 237 237 5 7## 151 151 5 8## 249 249 5 9## 241 241 5 10dim(PTorderTSCAN)## [1] 221 3pseudotime_order_tscan <- as.character(PTorderTSCAN$sample_name)PT_SCE$pseudotime_order_tscan <- NAPT_SCE$pseudotime_order_tscan[as.numeric(PTorderTSCAN$sample_name)] <- PTorderTSCAN$Pseudotime# 结果发现，TSCAN仅提供268个细胞中的221个的拟时序值cellLabels[PTclust$clusterid == 10]## [1] late2cell late2cell late2cell late2cell late2cell late2cell late2cell## [8] late2cell late2cell late2cell## 10 Levels: zy early2cell mid2cell late2cell 4cell 8cell 16cell ... lateblastggplot(as.data.frame(colData(PT_SCE)), aes(x = pseudotime_order_tscan, y = cell_type2, colour = cell_type2)) + geom_quasirandom(groupOnX = FALSE) + scale_color_tableau() + theme_classic() + xlab("TSCAN pseudotime") + ylab("Timepoint") + ggtitle("Cells ordered by TSCAN pseudotime")## Warning: Removed 47 rows containing missing values (position_quasirandom).

# 同样展示一下Hif1a基因随时间变化的表达量plotExpression(PT_SCE, "Hif1a", x = "pseudotime_order_tscan", colour_by = "cell_type2", show_violin = FALSE, show_smooth = TRUE)## `geom_smooth()` using method = 'loess' and formula 'y ~ x'## Warning: Removed 47 rows containing non-finite values (stat_smooth).## Warning: Removed 47 rows containing missing values (geom_point).