人人都应该学会的单细胞特征选择
大家好,我是风。欢迎来到风风的从零开始单细胞系列。本期我们继续单细胞分析——特征选择。
一般来说,单细胞scRNA-Seq能够测量每个细胞中成千上万个基因的表达量。然而,在大多数情况下,只有一小部分基因会对我们感兴趣的表型或者生物学特征有帮助,比如我们找能够代表细胞类型差异的markers,或者对某种治疗方式有反应的markers。但是,由于技术噪音或者批间差等原因,可能会对我们的结果造成影响。因此,我们需要通过特征选择去除下游分析中仅表现出技术噪音的基因,同时减少要处理的数据总量,并降低了分析计算的复杂度。
对于scRNA-Seq数据,我们侧重于使用无监督聚类的方法,这些方法不需要预先设置特定分类或者人为设置类别数目(例如kmeans等),这样也使得这种方法有了更多的适用性和可操控性。2013年发表在Nature Method上的“Accounting for technical noise in single-cell RNA-seq experiments”一文中,单细胞数据中出现的技术噪音和相应处理方法做了详细的解释。
大致思路是使用二次方曲线等方法拟合均值和方差,然后使用卡方检验找到异常基因。听起来好像很复杂,操作起来一样不难,有现成的R包帮我们完成上述操作!
然而,上面这篇文章只能让大家了解这种方法的原理的目的,到底应该如何用到我们自己的单细胞分析文章中呢?每次都来CNS级别的文章大家可能觉得望尘莫及,所以这次我找了一篇5分多的文章——2020年发表在Stem Cell Research & Therapy上的一篇单细胞分析文章,题目为:Single-cell RNA-seq highlights heterogeneity in human primary Wharton’s jelly mesenchymal stem/stromal cells cultured in vitro
这篇文章思路清晰,内容简单易上手,可以说,只要你看完了前面的推文,加上既往“风风打赏营”的内容,就可以复现这篇文章。文章的主要内容如下所示:
简单来说,作者就是识别了单细胞数据中的highly variable genes (HVGs),对这些基因进行GO和KEGG分析之后发现这些基因参与了某种生物学功能,然后再进行降维、聚类、确定细胞亚群等分析,思路非常清晰,图片呢按照前面推文的代码你也可以画得更美观。关键在于到底怎么识别highly variable genes呢?别着急,这次不仅来聊聊如何识别highly variable genes,把high dropout genes也一起打包给你!
接下来,我们来看看具体的操作:
识别高突变基因
Highly Variable Genes
一般我们在数据的质量控制之后进行特征选择,所以我们选择我们前面推文中已经QC过的数据进行分析,也可以直接在后台回复获取相应数据。我们主要使用M3Drop的“M3DropFeatureSelection”函数和“NBumiFeatureSelectionCombinedDrop”函数进行演示:
rm(list = ls())options(stringsAsFactors = FALSE)set.seed(123456) # 设置随机种子,方便重复结果# 安装R包# install.packages("devtools")# devtools::install_github("hemberg-lab/scRNA.seq.funcs")# install.packages("matrixStats")# BiocManager::install("SingleCellExperiment")# BiocManager::install("M3Drop")library(scRNA.seq.funcs)library(matrixStats)library(M3Drop)## Loading required package: numDerivlibrary(RColorBrewer)library(SingleCellExperiment)## Loading required package: SummarizedExperiment## Loading required package: GenomicRanges## Loading required package: stats4## Loading required package: BiocGenerics## Loading required package: parallel## ## Attaching package: 'BiocGenerics'## The following objects are masked from 'package:parallel':## ## clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,## clusterExport, clusterMap, parApply, parCapply, parLapply,## parLapplyLB, parRapply, parSapply, parSapplyLB## The following objects are masked from 'package:stats':## ## IQR, mad, sd, var, xtabs## The following objects are masked from 'package:base':## ## anyDuplicated, append, as.data.frame, basename, cbind, colnames,## dirname, do.call, duplicated, eval, evalq, Filter, Find, get, grep,## grepl, intersect, is.unsorted, lapply, Map, mapply, match, mget,## order, paste, pmax, pmax.int, pmin, pmin.int, Position, rank,## rbind, Reduce, rownames, sapply, setdiff, sort, table, tapply,## union, unique, unsplit, which, which.max, which.min## Loading required package: S4Vectors## ## Attaching package: 'S4Vectors'## The following object is masked from 'package:base':## ## expand.grid## Loading required package: IRanges## ## Attaching package: 'IRanges'## The following object is masked from 'package:grDevices':## ## windows## Loading required package: GenomeInfoDb## Loading required package: Biobase## Welcome to Bioconductor## ## Vignettes contain introductory material; view with## 'browseVignettes()'. To cite Bioconductor, see## 'citation("Biobase")', and for packages 'citation("pkgname")'.## ## Attaching package: 'Biobase'## The following objects are masked from 'package:matrixStats':## ## anyMissing, rowMedians## Loading required package: DelayedArray## Loading required package: BiocParallel## ## Attaching package: 'DelayedArray'## The following objects are masked from 'package:matrixStats':## ## colMaxs, colMins, colRanges, rowMaxs, rowMins, rowRanges## The following objects are masked from 'package:base':## ## aperm, apply, rowsumlibrary(cowplot)library(ggsci)library(tidyverse)## -- Attaching packages --------------------------------------- tidyverse 1.3.0 --## √ ggplot2 3.3.3 √ purrr 0.3.4## √ tibble 3.1.0 √ dplyr 1.0.5## √ tidyr 1.1.3 √ stringr 1.4.0## √ readr 1.4.0 √ forcats 0.5.1## -- Conflicts ------------------------------------------ tidyverse_conflicts() --## x dplyr::collapse() masks IRanges::collapse()## x dplyr::combine() masks Biobase::combine(), BiocGenerics::combine()## x dplyr::count() masks matrixStats::count()## x dplyr::desc() masks IRanges::desc()## x tidyr::expand() masks S4Vectors::expand()## x dplyr::filter() masks stats::filter()## x dplyr::first() masks S4Vectors::first()## x dplyr::lag() masks stats::lag()## x ggplot2::Position() masks BiocGenerics::Position(), base::Position()## x purrr::reduce() masks GenomicRanges::reduce(), IRanges::reduce()## x dplyr::rename() masks S4Vectors::rename()## x purrr::simplify() masks DelayedArray::simplify()## x dplyr::slice() masks IRanges::slice()# 读取数据scRNA_data <- readRDS("Features.rds")scRNA_data # SingleCellExperiment读取后的数据格式## class: SingleCellExperiment ## dim: 22431 268 ## metadata(0):## assays(2): counts logcounts## rownames(22431): Hvcn1 Gbp7 ... Sox5 Alg11## rowData names(10): feature_symbol is_feature_control ... total_counts## log10_total_counts## colnames(268): 16cell 16cell.1 ... zy.2 zy.3## colData names(30): cell_type2 cell_type1 ... pct_counts_ERCC## is_cell_control## reducedDimNames(0):## spikeNames(1): ERCC## altExpNames(0):print(head(scRNA_data@assays[["data"]]@listData[["logcounts"]])[,1:6]) # 查看数据## 16cell 16cell.1 16cell.2 16cell.3 16cell.4 16cell.5## Hvcn1 1.761892 7.659105 0.00000 7.431404 0.000000 0.00000## Gbp7 0.000000 0.000000 0.00000 0.000000 0.000000 0.00000## Arrdc1 11.391550 9.996722 10.84872 10.243523 10.811142 0.00000## Ercc5 1.134689 1.243790 0.00000 0.000000 7.148973 0.00000## Mrpl15 11.399873 11.109356 12.34129 12.573202 12.518002 12.47322## Dclk1 0.000000 0.000000 0.00000 0.000000 0.000000 0.00000# 提取细胞类型celltype_labs <- colData(scRNA_data)$cell_type2table(celltype_labs)## celltype_labs## 16cell 4cell 8cell early2cell earlyblast late2cell lateblast ## 50 14 37 8 43 10 30 ## mid2cell midblast zy ## 12 60 4# 自定义颜色,方便对图形进行美化display.brewer.all()
cell_colors <- brewer.pal(max(3, length(unique(celltype_labs))),"Paired") # 选择Paired面板的颜色参数cell_colors## [1] "#A6CEE3""#1F78B4""#B2DF8A""#33A02C""#FB9A99""#E31A1C""#FDBF6F"## [8] "#FF7F00""#CAB2D6""#6A3D9A"# 识别Highly Variable Genes# 使用M3DropConvertData函数从SingleCellExperiment对象中提取出特征筛选的矩阵expr_matrix <- M3Drop::M3DropConvertData(scRNA_data)## [1] "Removing 1134 undetected genes."print(head(expr_matrix[,1:6]))## 16cell 16cell.1 16cell.2 16cell.3 16cell.4 16cell.5## Hvcn1 2.391425 201.125172 0.000 171.6137 0.0000 0.000## Gbp7 0.000000 0.000000 0.000 0.0000 0.0000 0.000## Arrdc1 2685.570474 1020.676044 1843.121 1211.2932 1795.7108 0.000## Ercc5 1.195713 1.368198 0.000 0.0000 140.9238 0.000## Mrpl15 2701.114737 2208.272299 5188.189 6093.1212 5864.3516 5685.091## Dclk1 0.000000 0.000000 0.000 0.0000 0.0000 0.000# 使用BrenneckeGetVariableGenes函数对整个数据集估计技术噪声Brennecke_HVG <- BrenneckeGetVariableGenes( expr_matrix, fdr = 0.01, # 设置P值的阈值 minBiolDisp = 0.5)
# 图片中红色曲线为拟合的技术噪声模型,虚线为95% CI,粉色点是经多次试验校正后具有显著生物学变异的基因head(Brennecke_HVG) # 查看结果## Gene effect.size p.value q.value## Rac2 Rac2 236.5316 0 0## D630029K05Rik D630029K05Rik 230.9111 0 0## Car2 Car2 225.0898 0 0## Eltd1 Eltd1 210.5842 0 0## Cd34 Cd34 209.6378 0 0## Samsn1 Samsn1 208.1645 0 0# 可以发现这是一个类似于我们差异分析后的表格,我们主要看P值# 提取结果基因HVG_genes <- Brennecke_HVG$Genelength(HVG_genes)## [1] 1303# 总共有1303个基因
High Dropout Genes
High Dropout Genes指的是大部分样本中表达量都为0的基因,0的出现频率被称为dropout rate,与scRNA-Seq数据的表达水平密切相关。前面推文中我们构建稀疏矩阵时候就发现:在单细胞表达矩阵中,大部分表达量都为0,所以才需要构建稀疏矩阵。因此,0是单细胞RNASeq数据的主要特征,通常占最终表达矩阵的一半以上。0主要是由于在单细胞测序时,mRNA未能被逆转录而导致的,详细解释可参考这篇文章:
Modelling dropouts for feature selection in scRNASeq experiments.doi: 10.1101/065094。逆转录是一种酶反应,因此可以使用Michaelis-Menten方程(非线性函数)来模拟:
# 自构建数据探索M3Drop的分析过程K<- 49 # Michaelis-Menten常数S_sim<- 10^seq(from = -3, to = 4,by = 0.05) MM<- 1 - S_sim / (K + S_sim) # Michaelis-Menten公式plot(S_sim,MM,type = "l", lwd = 3, log = "x",xlab = "Expression", ylab = "Dropout Rate", xlim = c(1,1000))S1<- 10 # 设置均值S1P1<- 1 - S1 / (K + S1) # S1的Michaelis-MentenS2<- 750 # 设置均值S2P2<- 1 - S2 / (K + S2) # S2的Michaelis-Mentenpoints(c(S1,S2),c(P1,P2), pch = 16, col = "grey85", cex = 3)mix<- 0.5 points(S1* mix + S2 * (1 - mix), P1* mix + P2 * (1 - mix), pch = 16, col = "grey35", cex = 3)# 上面就是M3Drop的基本思想,得到的结果图片可以看成是M3Drop结果的简版# 接下来使用M3DropFeatureSelection函数识别异常基因,矫正阈值为FDR = 0.01M3Drop_genes<- M3DropFeatureSelection(expr_matrix,mt_method = "fdr",mt_threshold = 0.01)head(M3Drop_genes)## Gene effect.size p.value q.value## Btg4 Btg4 88.85860 9.277494e-88 2.822611e-84## LOC100502936 LOC100502936 77.55832 2.374005e-41 7.546147e-39## C86187 C86187 69.83055 5.788963e-84 9.483658e-81## Oog3 Oog3 69.31695 6.400013e-78 9.086738e-75## Accsl Accsl 63.76541 9.515726e-56 5.629345e-53## E330034G19Rik E330034G19Rik 51.76511 1.297491e-41 4.386139e-39M3Drop_genes<- M3Drop_genes$Genelength(M3Drop_genes)## [1] 1635# 总共得到1635个基因,与上面识别高变基因的结果对比,存在一定差异# 如果你的数据是UMI数据,则需要调整一下使用的方法,可以使用M3Drop的另外一种方法——NBumiFeatureSelectionCombinedDrop函数,这种方法主要是基于负二项式模型:scRNA_data_int<- NBumiConvertData(scRNA_data) # 只适合UMI数据,这里只是展示方法## [1] "Removing 1134 undetected genes."DANB_fit<- NBumiFitModel(scRNA_data_int) # 执行DANB特征选择DropFS<- NBumiFeatureSelectionCombinedDrop(DANB_fit, method="fdr",qval.thresh=0.01,suppress.plot=FALSE)
head(DropFS)## Gene effect_size p.value q.value## Rfpl4b Rfpl4b 0.8153876 2.123150e-252 4.521673e-248## Gm4850 Gm4850 0.7798507 2.295216e-210 2.444061e-206## 1700013H16Rik 1700013H16Rik 0.7980667 1.469433e-207 1.043150e-203## Duoxa2 Duoxa2 0.7920190 6.824657e-205 3.633618e-201## 7420426K07Rik 7420426K07Rik 0.7852554 1.964136e-204 8.366039e-201## 1600025M17Rik 1600025M17Rik 0.7761420 4.691205e-199 1.665143e-195DANB_genes <- DropFS[1:1500,]$Gene length(DropFS$Gene)## [1] 10694
基于基因之间的相关性进行特征筛选
这种方法主要考虑基因之间相关性的大小,而由于技术噪音、批间差等原因,不考虑基因之间相关性的P值:
# 计算基因间的相关性# cor_feat <- M3Drop::corFS(expr_matrix) # 运行时间较长,可以保存下来,下次直接加载,减少运行时间#print(head(cor_feat))# saveRDS(cor_feat, file = "cor_feat.rds")cor_feat <- readRDS("cor_feat.rds")length(cor_feat)## [1] 21297Cor_genes <- names(cor_feat)[1:1500]# PCA分析# 一般认为,具有高PCA负荷的基因更可能是高度可变基因,可能与潜在的生物学效应相关。但是,PCA负荷容易因为批次效应而出现假阳性。因此,建议绘制PCA结果以确定哪些成分对应了生物变异# 进行对数转换pca <- prcomp(log(expr_matrix + 1) / log(2))pca$x[1:8,1:8]## PC1 PC2 PC3 PC4 PC5 PC6## Hvcn1 18.33134 -28.2126253 -6.1775477 -3.9040348 -5.666065 1.24967145## Gbp7 -34.76186 -32.1369010 -8.2182685 -2.0702401 7.149251 0.57501272## Arrdc1 47.10214 -31.4578329 34.8947461 2.2857984 -6.646687 -23.56904257## Ercc5 12.09302 -8.7114346 -9.5579975 -8.7309701 -9.004727 2.65100637## Mrpl15 121.25563 33.7733148 18.0089107 0.8887331 5.879710 3.79956578## Dclk1 -50.57560 5.4423929 1.8064698 1.0922085 2.915070 -0.01293841## Tssc4 19.50557 -5.6356208 18.0823213 1.7941776 -14.162350 -4.96608262## Gm101 -49.96462 0.8869821 -0.1458661 -2.3477926 2.998759 -0.48755628## PC7 PC8## Hvcn1 0.7164300 3.6902055## Gbp7 0.3369409 -0.3003790## Arrdc1 -3.3569123 -4.8229241## Ercc5 -0.8493427 -0.7753090## Mrpl15 -3.6368224 -0.5439466## Dclk1 0.4125416 -0.1754502## Tssc4 -2.1362989 1.4660340## Gm101 0.2148508 -0.1400523pca$rotation[1:8,1:8]## PC1 PC2 PC3 PC4 PC5 PC6## 16cell 0.06319658 0.007538477 0.08275729 -0.03118525 0.032116521 -0.03390935## 16cell.1 0.06306832 0.001696112 0.07430445 -0.03109429 0.035836284 -0.04532950## 16cell.2 0.06230193 0.006977916 0.08491574 -0.02545638 0.038851443 -0.03097586## 16cell.3 0.06225635 0.001292008 0.08486632 -0.01954077 0.032155960 -0.02026838## 16cell.4 0.06517881 0.005370336 0.07716562 -0.03491017 0.004795291 -0.03846531## 16cell.5 0.06219640 0.008454293 0.07865085 -0.03633414 0.052710959 -0.04116955## 16cell.6 0.06210753 0.005208071 0.08727395 -0.02767458 0.049595669 -0.03246626## 16cell.7 0.06403473 0.001878939 0.07746825 -0.01782858 0.010800429 -0.03050787## PC7 PC8## 16cell -0.012491854 0.03056545## 16cell.1 -0.002931519 0.07198860## 16cell.2 -0.002714431 0.03169307## 16cell.3 -0.010152846 0.02447489## 16cell.4 0.005762919 0.03034525## 16cell.5 0.014833785 0.01962836## 16cell.6 -0.005064821 0.03104576## 16cell.7 -0.005788153 0.03876841# 绘图plot( pca$rotation[,1], pca$rotation[,2], pch = 16, col = cell_colors[as.factor(celltype_labs)])

# 计算PCA负荷score<- rowSums(abs(pca$x[,c(1,2)])) names(score)<- rownames(expr_matrix)score<- score[order(-score)]PCA_genes<- names(score[1:1500])# 检验所选择的特征是否真的代表了细胞类型之间的基因表达差异J<- sum(M3Drop_genes %in% HVG_genes)/length(unique(c(M3Drop_genes, HVG_genes)))J## [1] 0.3005755M3DropExpressionHeatmap(M3Drop_genes,expr_matrix,cell_labels = celltype_labs)
M3DropExpressionHeatmap(HVG_genes,expr_matrix,cell_labels = celltype_labs)
M3DropExpressionHeatmap(PCA_genes,expr_matrix,cell_labels = celltype_labs)
M3DropExpressionHeatmap(DANB_genes,expr_matrix,cell_labels = celltype_labs)
好了,这样我们特征选择的内容也就到此完成,关于单细胞分析一些基础分析步骤以及注意事项我们就全部学习完成啦,下一期开始,我们将进入单细胞分析的高级分析,首先肯定是假时序...不对,是拟时序分析啦!
后台回复“feng47”获取本期数据和代码,我们下期见啦!
单细胞分析专栏传送门
碎碎念专栏传送门(完结)
风之美图系列传送门(完结)
END

撰文丨风   风
排版丨四金兄
值班 | 阿   琛
主编丨小雪球
欢迎大家关注解螺旋生信频道-挑圈联靠公号~

继续阅读
阅读原文