PCA降维和识别Cell Cluster
大家好,我是风。欢迎来到风风的从零开始单细胞系列一。上一讲,我们已经学习了Seurat包进行数据质控和可视化,Seurat包提供了一系列函数和流程化分析过程,使得单细胞分析变得容易上手和入门。既然如此,Seurat包不可能只有数据质控和标准化是吧?没错,包括我们前面介绍的PCA分析和降维分析,都可以直接用Seurat包完成,今天我们一起来看一下。
我们先来看一下Cell的这篇文章:"Data-driven phenotypic dissection of AML reveals progenitor like cells that correlate with prognosis":
这篇文章中有这么两张图:


这两张图片分别使用了 Manuel Gates和PhenoGraph两种方法,并且对于PhenoGraph,在方法部分,文章是这么介绍的
it finds the k nearest neighbors for each cell (using Euclidean distance), resulting in N sets of k-neighborhoods. Second, it operates on these sets to build a weighted graph such that the weight between nodes scales with the number of neighbors they share
没错,聪明的你可能猜到了,这种区分不同Cluster的方法大致类似Seurat识别不同亚型的方法,简单来说,就是利用KNN法,基于欧式距离找出每一小部分区域的k个邻近区域,从而得到N组k-邻域,然后再对这些集合构建加权图,获得每个节点之间的权重。结合单细胞数据,那就是使用KNN法在在具有相似基因表达模式的细胞之间画边,然后划分为高度相互关联的“细胞群体”,并基于KNN结果优化任意两个细胞之间的权重。Cell原文是使用MATLAB和Python完成,我们知道了原理之后,又有了Seurat包,就可以直接调用R包来完成相关操作。
小Tips:学习一项新技能时,对于一些关键步骤还是有必要简单了解下相关原理,说不定哪天看到高级的方法和技术的时候,只是名称不同,内核一致呢?(*^_^*)

接下来我们开始使用Seurat包进行操作,使用上一期推文的代码处理后的结果,你们可以用自己的处理结果,也可以直接后台回复关键词获取数据和代码。
期刊信息
线性降维
读取上一讲中得到的结果进行PCA分析降维:
rm(list = ls())options(stringsAsFactors = FALSE)set.seed(20210330) # 设置随机种子,方便重复结果# 安装R包# install.packages("Seurat")# install.packages("dplyr")library(Seurat)library(dplyr)## ## Attaching package: 'dplyr'## The following objects are masked from 'package:stats':## ## filter, lag## The following objects are masked from 'package:base':## ## intersect, setdiff, setequal, unionlibrary(cowplot)# 加载数据load(file = "pbmc.rda")pbmc## An object of class Seurat ## 13714 features across 2695 samples within 1 assay ## Active assay: RNA (13714 features, 2000 variable features)# 使用RunPCA进行降维,也可以自己指定特征基因进行降维pbmc <- RunPCA(object = pbmc, npcs = 50, rev.pca = FALSE, weight.by.var = TRUE, verbose = TRUE, # 输出特定PC值的特征基因 ndims.print = 1:5, # 输出前5个PC值的特征基因 nfeatures.print = 30, # 输出每个PC值的30个特征基因 reduction.key = "PC_")## PC_ 1 ## Positive: MALAT1, IL32, LTB, IL7R, CD2, B2M, ACAP1, CTSW, GZMM, STK17A ## CD247, CCL5, GIMAP5, GZMA, AQP3, CST7, TRAF3IP3, GZMK, MAL, HOPX ## ITM2A, ETS1, MYC, LYAR, NKG7, GIMAP7, BEX2, KLRG1, LDLRAP1, ZAP70 ## Negative: CST3, TYROBP, LST1, AIF1, FTL, FCN1, LYZ, FTH1, S100A9, FCER1G ## TYMP, CFD, LGALS1, CTSS, S100A8, SERPINA1, LGALS2, SPI1, IFITM3, PSAP ## IFI30, CFP, SAT1, COTL1, S100A11, NPC2, GRN, LGALS3, GSTP1, NCF2 ## PC_ 2 ## Positive: CD79A, MS4A1, TCL1A, HLA-DQA1, HLA-DQB1, HLA-DRA, LINC00926, CD79B, HLA-DRB1, CD74 ## HLA-DPB1, HLA-DQA2, HLA-DMA, CD37, HLA-DRB5, HLA-DPA1, HLA-DMB, FCRLA, HVCN1, LTB ## BLNK, P2RX5, IGLL5, IRF8, SWAP70, ARHGAP24, FCGR2B, SMIM14, PPP1R14A, C16orf74 ## Negative: NKG7, CST7, PRF1, GZMA, GZMB, FGFBP2, CTSW, GNLY, B2M, SPON2 ## GZMH, CCL4, FCGR3A, CCL5, GZMM, CD247, XCL2, CLIC3, AKR1C3, SRGN ## HOPX, S100A4, CTSC, TTC38, ANXA1, IGFBP7, ID2, IL32, XCL1, ACTB ## PC_ 3 ## Positive: HLA-DQA1, CD79A, CD79B, HLA-DQB1, HLA-DPB1, CD74, HLA-DPA1, MS4A1, HLA-DRB1, HLA-DRB5 ## HLA-DRA, HLA-DQA2, TCL1A, LINC00926, HLA-DMB, HLA-DMA, CD37, HVCN1, FCRLA, IRF8 ## PLAC8, BLNK, SMIM14, PLD4, LAT2, SWAP70, IGLL5, P2RX5, FCGR3A, FGFBP2 ## Negative: PPBP, PF4, SDPR, SPARC, GNG11, NRGN, GP9, RGS18, TUBB1, CLU ## AP001189.4, HIST1H2AC, ITGA2B, CD9, TMEM40, PTCRA, CA2, ACRBP, NGFRAP1, MMD ## TREML1, F13A1, SEPT5, RUFY1, MPP1, CMTM5, RP11-367G6.3, TSC22D1, MYL9, GP1BA ## PC_ 4 ## Positive: HLA-DQA1, HIST1H2AC, PF4, CD79B, SDPR, CD79A, PPBP, GNG11, SPARC, MS4A1 ## HLA-DQB1, CD74, GP9, HLA-DPB1, NRGN, RGS18, AP001189.4, CD9, PTCRA, CLU ## CA2, TCL1A, HLA-DQA2, HLA-DRB1, TUBB1, HLA-DPA1, TMEM40, HLA-DRA, LINC00926, ITGA2B ## Negative: VIM, IL7R, S100A6, S100A8, S100A4, IL32, S100A9, S100A10, GIMAP7, MAL ## CD14, LGALS2, AQP3, CD2, FYB, RBP7, FCN1, GIMAP4, LYZ, ANXA1 ## MS4A6A, S100A12, S100A11, FOLR3, GIMAP5, AIF1, TRABD2A, IL8, IFI6, MALAT1 ## PC_ 5 ## Positive: GZMB, FGFBP2, S100A8, NKG7, GNLY, CCL4, PRF1, CST7, SPON2, GZMA ## GZMH, LGALS2, CCL3, S100A9, XCL2, CLIC3, CD14, CTSW, RBP7, GSTP1 ## MS4A6A, AKR1C3, IGFBP7, S100A12, TTC38, TYROBP, XCL1, FOLR3, CCL5, GZMM ## Negative: LTB, IL7R, CKB, VIM, AQP3, MS4A7, RP11-290F20.3, CYTIP, SIGLEC10, HMOX1 ## PTGES3, LILRB2, CD2, MAL, HN1, ANXA5, PPA1, TUBA1B, CORO1B, ATP1A1 ## IL32, TRADD, CTD-2006K23.1, WARS, FAM110A, ABRACL, VMO1, LILRA3, GPR183, TRAF3IP3# 降维结果可视化# 二维PCA分布图DimPlot(object = pbmc, reduction = "pca", split.by = 'ident', dims = c(1,2), shuffle = TRUE, label = TRUE, label.size = 4, label.color = "black", label.box = TRUE, cols.highlight = "#DE2D26", sizes.highlight = 1)## Warning: Using `as.character()` on a quosure is deprecated as of rlang 0.3.0.## Please use `as_label()` or `as_name()` instead.## This warning is displayed once per session.
# 展示特定维度FeaturePlot(object = pbmc, features = "PC_1")
# 展示特征基因FeaturePlot(object = pbmc, features = "MS4A1")
# 展示不同特征之间的关系FeatureScatter(object = pbmc, feature1 = "MS4A1", feature2 = "PC_1")
FeatureScatter(object = pbmc, feature1 = "MS4A1", feature2 = "CD3D")
# 展示PC特征及其特征基因的热图DimHeatmap(object = pbmc, reduction = "pca", cells = 200,dims = c(1:5),nfeatures = 30,disp.min = -2.5,balanced = TRUE,projected = FALSE,fast = TRUE,raster = TRUE,slot = "scale.data",combine = TRUE)
# 确定显著的主成分pbmc<- JackStraw(object = pbmc, reduction = "pca",dims = 20,num.replicate = 100,prop.freq = 0.1, verbose = FALSE)## Warning: UNRELIABLE VALUE: Future ('future_lapply-1') unexpectedly generated## random numbers without specifying argument 'future.seed'. There is a risk that## those random numbers are not statistically sound and the overall results might## be invalid. To fix this, specify 'future.seed=TRUE'. This ensures that proper,## parallel-safe random numbers are produced via the L'Ecuyer-CMRG method. To## disable this check, use 'future.seed=NULL', or set option 'future.rng.onMisuse'## to "ignore".# JackStrawPlot函数可以将每个PC的p值分布与均匀分布进行比较,显著的PCs将表现出较强的p值较低的基因富集(虚线以上的实线)pbmc<- ScoreJackStraw(object = pbmc, dims = 1:20, reduction = "pca")JackStrawPlot(object = pbmc,dims = 1:20, reduction = "pca")## Warning: Removed 32844 rows containing missing values (geom_point).
# PC成分偏差图ElbowPlot(object = pbmc)
我们需要注意,从PC成分选择到识别数据集的真实维度是Seurat分析的一个重要步骤,但是这个过程可能由于主观原因或者其他原因造成结果有偏差。因此,可以考虑以下两种方法:
1、使用更加更严格的方法,探索重要PC成分以确定相关的异质性来源,例如,可以与GSEA联合使用;
2、多次随机抽样然后统计总的结果,但是这种方法运算量大,并且可能不会返回一个明确的PC阈值。接下来我们看如何识别不同的细胞亚群。
识别Cell Cluster
就如前面所说,Seurat是基于KNN法识别不同Cluster,这个过程由FindNeighbors函数和FindClusters函数完成,我们来看看:
# 使用FindNeighbors函数进行KNN分析pbmc <- FindNeighbors(pbmc, reduction = "pca", dims = 1:20)## Computing nearest neighbor graph## Computing SNNpbmc## An object of class Seurat ## 13714 features across 2695 samples within 1 assay ## Active assay: RNA (13714 features, 2000 variable features)## 1 dimensional reduction calculated: pca# 然后使用FindClusters函数识别亚群pbmc <- FindClusters(pbmc, resolution = 0.5, granularity = 0.8, algorithm = 1)## Warning: The following arguments are not used: granularity## Warning: The following arguments are not used: granularity## Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck## ## Number of nodes: 2695## Number of edges: 122081## ## Running Louvain algorithm...## Maximum modularity in 10 random starts: 0.8591## Number of communities: 8## Elapsed time: 0 seconds# 这里有个小技巧,大部分人做FindClusters时候都会忽略granularity,当数据的细胞数目在3000或以下时,设置granularity的区间在0.6-1.2之间,结果相对更准确# 进行非线性降维tSNE分析pbmc <- RunTSNE(object = pbmc, dims.use = 1:20, do.fast = TRUE)?DimPlot## starting httpd help server ...## doneDimPlot(object = pbmc, reduction = "tsne", shuffle = TRUE, label = TRUE, label.size = 4, label.color = "navy", label.box = TRUE, repel = TRUE, cols.highlight = "#DE2D26", sizes.highlight = 2, combine = TRUE)
# UMAP分析pbmc <- RunUMAP(pbmc, reduction = "pca", dims = 1:20)## Warning: The default method for RunUMAP has changed from calling Python UMAP via reticulate to the R-native UWOT using the cosine metric## To use Python UMAP via reticulate, set umap.method to 'umap-learn' and metric to 'correlation'## This message will be shown once per session## 19:44:23 UMAP embedding parameters a = 0.9922 b = 1.112## 19:44:23 Read 2695 rows and found 20 numeric columns## 19:44:23 Using Annoy for neighbor search, n_neighbors = 30## 19:44:23 Building Annoy index with metric = cosine, n_trees = 50## 0% 10 20 30 40 50 60 70 80 90 100%## [----|----|----|----|----|----|----|----|----|----|## **************************************************|## 19:44:23 Writing NN index file to temp file C:\Users\Public\Documents\Wondershare\CreatorTemp\RtmpINy5Lq\file217c301c20f8## 19:44:23 Searching Annoy index using 1 thread, search_k = 3000## 19:44:24 Annoy recall = 100%## 19:44:25 Commencing smooth kNN distance calibration using 1 thread## 19:44:25 Initializing from normalized Laplacian + noise## 19:44:25 Commencing optimization for 500 epochs, with 111764 positive edges## 19:44:37 Optimization finishedDimPlot(pbmc, reduction = "umap", split.by = "seurat_clusters")
# 保存结果,方便下次调用进行下一步分析吧saveRDS(pbmc, file = "pbmc_tutorial.rds")
好了,到了这里,我们的Seurat分析流程就和scater分析流程再次到达同一起跑线啦!不同的Cluster已经识别出来,接下来就是寻找Marker基因以及拟时序分析相关的内容了,一篇简单的单细胞分析文章的大致流程即将走完,不知道你又到了哪一步了呢?
当然这只是最简单的入门分析过程,在实际分析过程中,根据数据的不同、测序平台的不同等等,具体步骤还有一些小的差异,比如10x Genomics、STRT seq、Smart seq2等等,不过都是些小差异啦,我们后面再一起学习。
后台回复“feng45”获取数据和代码,我们下期见啦!
单细胞分析专栏传送门
碎碎念专栏传送门(完结)
风之美图系列传送门(完结)
END

撰文丨风   风
排版丨四金兄
值班 | 王美丽
主编丨小雪球
欢迎大家关注解螺旋生信频道-挑圈联靠公号~
SIMPLE
酸谈直播
搞科研的小伙伴都想发SCI,

发文章流程你真的清楚吗?

前期筹备?文章写作?

选刊投稿?正式发表?

坑无处不在,小心别踩奥!




本周直播酸谈来帮你排排雷!

避雷指南:发文章流程必知的6个技巧

酸菜老谈在线给你实用小tips,

想发文章再也不是难题啦!

直播预告

直播平台:Bilibili微信视频号
直播时间:4
月10日(周六)晚18:00-20:00
直播主题:
避雷指南:发文章流程必知的6个技巧

带着你的小伙伴们

一起相约解螺旋直播间吧!

b站直播间
领悟科研 优人一步

扫码预约直播


继续阅读
阅读原文