Cell的生信分析能有多高级?单细胞研究必备分析!不会做么?简单,我来教你!(附代码)
PCA降维和识别Cell Cluster
大家好,我是风。欢迎来到风风的从零开始单细胞系列一。上一讲,我们已经学习了Seurat包进行数据质控和可视化,Seurat包提供了一系列函数和流程化分析过程,使得单细胞分析变得容易上手和入门。既然如此,Seurat包不可能只有数据质控和标准化是吧?没错,包括我们前面介绍的PCA分析和降维分析,都可以直接用Seurat包完成,今天我们一起来看一下。
我们先来看一下Cell的这篇文章:"Data-driven phenotypic dissection of AML reveals progenitor like cells that correlate with prognosis":
这篇文章中有这么两张图:
这两张图片分别使用了 Manuel Gates和PhenoGraph两种方法,并且对于PhenoGraph,在方法部分,文章是这么介绍的
it finds the k nearest neighbors for each cell (using Euclidean distance), resulting in N sets of k-neighborhoods. Second, it operates on these sets to build a weighted graph such that the weight between nodes scales with the number of neighbors they share
没错,聪明的你可能猜到了,这种区分不同Cluster的方法大致类似Seurat识别不同亚型的方法,简单来说,就是利用KNN法,基于欧式距离找出每一小部分区域的k个邻近区域,从而得到N组k-邻域,然后再对这些集合构建加权图,获得每个节点之间的权重。结合单细胞数据,那就是使用KNN法在在具有相似基因表达模式的细胞之间画边,然后划分为高度相互关联的“细胞群体”,并基于KNN结果优化任意两个细胞之间的权重。Cell原文是使用MATLAB和Python完成,我们知道了原理之后,又有了Seurat包,就可以直接调用R包来完成相关操作。
(小Tips:学习一项新技能时,对于一些关键步骤还是有必要简单了解下相关原理,说不定哪天看到高级的方法和技术的时候,只是名称不同,内核一致呢?(*^_^*)
)
接下来我们开始使用Seurat包进行操作,使用上一期推文的代码处理后的结果,你们可以用自己的处理结果,也可以直接后台回复关键词获取数据和代码。
读取上一讲中得到的结果进行PCA分析降维:
rm(list = ls())
options(stringsAsFactors = FALSE)
set.seed(20210330) # 设置随机种子,方便重复结果
安装R包
"Seurat") install.packages(
"dplyr") install.packages(
library(Seurat)
library(dplyr)
#
# Attaching package: 'dplyr'
# The following objects are masked from 'package:stats':
#
# filter, lag
# The following objects are masked from 'package:base':
#
# intersect, setdiff, setequal, union
library(cowplot)
# 加载数据
load(file = "pbmc.rda")
pbmc
# An object of class Seurat
# 13714 features across 2695 samples within 1 assay
# Active assay: RNA (13714 features, 2000 variable features)
使用RunPCA进行降维,也可以自己指定特征基因进行降维
pbmc <- RunPCA(object = pbmc,
npcs = 50,
rev.pca = FALSE,
weight.by.var = TRUE,
verbose = TRUE, # 输出特定PC值的特征基因
ndims.print = 1:5, # 输出前5个PC值的特征基因
nfeatures.print = 30, # 输出每个PC值的30个特征基因
reduction.key = "PC_")
# PC_ 1
# Positive: MALAT1, IL32, LTB, IL7R, CD2, B2M, ACAP1, CTSW, GZMM, STK17A
# CD247, CCL5, GIMAP5, GZMA, AQP3, CST7, TRAF3IP3, GZMK, MAL, HOPX
# ITM2A, ETS1, MYC, LYAR, NKG7, GIMAP7, BEX2, KLRG1, LDLRAP1, ZAP70
# Negative: CST3, TYROBP, LST1, AIF1, FTL, FCN1, LYZ, FTH1, S100A9, FCER1G
# TYMP, CFD, LGALS1, CTSS, S100A8, SERPINA1, LGALS2, SPI1, IFITM3, PSAP
# IFI30, CFP, SAT1, COTL1, S100A11, NPC2, GRN, LGALS3, GSTP1, NCF2
# PC_ 2
# Positive: CD79A, MS4A1, TCL1A, HLA-DQA1, HLA-DQB1, HLA-DRA, LINC00926, CD79B, HLA-DRB1, CD74
# HLA-DPB1, HLA-DQA2, HLA-DMA, CD37, HLA-DRB5, HLA-DPA1, HLA-DMB, FCRLA, HVCN1, LTB
# BLNK, P2RX5, IGLL5, IRF8, SWAP70, ARHGAP24, FCGR2B, SMIM14, PPP1R14A, C16orf74
# Negative: NKG7, CST7, PRF1, GZMA, GZMB, FGFBP2, CTSW, GNLY, B2M, SPON2
# GZMH, CCL4, FCGR3A, CCL5, GZMM, CD247, XCL2, CLIC3, AKR1C3, SRGN
# HOPX, S100A4, CTSC, TTC38, ANXA1, IGFBP7, ID2, IL32, XCL1, ACTB
# PC_ 3
# Positive: HLA-DQA1, CD79A, CD79B, HLA-DQB1, HLA-DPB1, CD74, HLA-DPA1, MS4A1, HLA-DRB1, HLA-DRB5
# HLA-DRA, HLA-DQA2, TCL1A, LINC00926, HLA-DMB, HLA-DMA, CD37, HVCN1, FCRLA, IRF8
# PLAC8, BLNK, SMIM14, PLD4, LAT2, SWAP70, IGLL5, P2RX5, FCGR3A, FGFBP2
# Negative: PPBP, PF4, SDPR, SPARC, GNG11, NRGN, GP9, RGS18, TUBB1, CLU
# AP001189.4, HIST1H2AC, ITGA2B, CD9, TMEM40, PTCRA, CA2, ACRBP, NGFRAP1, MMD
# TREML1, F13A1, SEPT5, RUFY1, MPP1, CMTM5, RP11-367G6.3, TSC22D1, MYL9, GP1BA
# PC_ 4
# Positive: HLA-DQA1, HIST1H2AC, PF4, CD79B, SDPR, CD79A, PPBP, GNG11, SPARC, MS4A1
# HLA-DQB1, CD74, GP9, HLA-DPB1, NRGN, RGS18, AP001189.4, CD9, PTCRA, CLU
# CA2, TCL1A, HLA-DQA2, HLA-DRB1, TUBB1, HLA-DPA1, TMEM40, HLA-DRA, LINC00926, ITGA2B
# Negative: VIM, IL7R, S100A6, S100A8, S100A4, IL32, S100A9, S100A10, GIMAP7, MAL
# CD14, LGALS2, AQP3, CD2, FYB, RBP7, FCN1, GIMAP4, LYZ, ANXA1
# MS4A6A, S100A12, S100A11, FOLR3, GIMAP5, AIF1, TRABD2A, IL8, IFI6, MALAT1
# PC_ 5
# Positive: GZMB, FGFBP2, S100A8, NKG7, GNLY, CCL4, PRF1, CST7, SPON2, GZMA
# GZMH, LGALS2, CCL3, S100A9, XCL2, CLIC3, CD14, CTSW, RBP7, GSTP1
# MS4A6A, AKR1C3, IGFBP7, S100A12, TTC38, TYROBP, XCL1, FOLR3, CCL5, GZMM
# Negative: LTB, IL7R, CKB, VIM, AQP3, MS4A7, RP11-290F20.3, CYTIP, SIGLEC10, HMOX1
# PTGES3, LILRB2, CD2, MAL, HN1, ANXA5, PPA1, TUBA1B, CORO1B, ATP1A1
# IL32, TRADD, CTD-2006K23.1, WARS, FAM110A, ABRACL, VMO1, LILRA3, GPR183, TRAF3IP3
降维结果可视化
二维PCA分布图
DimPlot(object = pbmc,
reduction = "pca",
split.by = 'ident',
dims = c(1,2),
shuffle = TRUE,
label = TRUE,
label.size = 4,
label.color = "black",
label.box = TRUE,
cols.highlight = "#DE2D26",
sizes.highlight = 1)
# Warning: Using `as.character()` on a quosure is deprecated as of rlang 0.3.0.
# Please use `as_label()` or `as_name()` instead.
# This warning is displayed once per session.
FeaturePlot(object = pbmc,
features = "PC_1")
FeaturePlot(object = pbmc,
features = "MS4A1")
FeatureScatter(object = pbmc,
feature1 = "MS4A1",
feature2 = "PC_1")
FeatureScatter(object = pbmc,
feature1 = "MS4A1",
feature2 = "CD3D")
# 展示PC特征及其特征基因的热图
pbmc, =
reduction = "pca",
cells = 200,
dims = c(1:5),
nfeatures = 30,
-2.5, =
balanced = TRUE,
projected = FALSE,
fast = TRUE,
raster = TRUE,
slot = "scale.data",
combine = TRUE)
# 确定显著的主成分
pbmc<- JackStraw(object = pbmc,
reduction = "pca",
dims = 20,
100, =
0.1, =
verbose = FALSE)
## Warning: UNRELIABLE VALUE: Future ('future_lapply-1') unexpectedly generated
## random numbers without specifying argument 'future.seed'. There is a risk that
## those random numbers are not statistically sound and the overall results might
## be invalid. To fix this, specify 'future.seed=TRUE'. This ensures that proper,
## parallel-safe random numbers are produced via the L'Ecuyer-CMRG method. To
## disable this check, use 'future.seed=NULL', or set option 'future.rng.onMisuse'
## to "ignore".
# JackStrawPlot函数可以将每个PC的p值分布与均匀分布进行比较,显著的PCs将表现出较强的p值较低的基因富集(虚线以上的实线)
pbmc<- ScoreJackStraw(object = pbmc,
dims = 1:20,
reduction = "pca")
pbmc, =
dims = 1:20,
reduction = "pca")
## Warning: Removed 32844 rows containing missing values (geom_point).
ElbowPlot(object = pbmc)
我们需要注意,从PC成分选择到识别数据集的真实维度是Seurat分析的一个重要步骤,但是这个过程可能由于主观原因或者其他原因造成结果有偏差。因此,可以考虑以下两种方法:
1、使用更加更严格的方法,探索重要PC成分以确定相关的异质性来源,例如,可以与GSEA联合使用;
2、多次随机抽样然后统计总的结果,但是这种方法运算量大,并且可能不会返回一个明确的PC阈值。接下来我们看如何识别不同的细胞亚群。
就如前面所说,Seurat是基于KNN法识别不同Cluster,这个过程由FindNeighbors函数和FindClusters函数完成,我们来看看:
使用FindNeighbors函数进行KNN分析
pbmc <- FindNeighbors(pbmc,
reduction = "pca",
dims = 1:20)
# Computing nearest neighbor graph
# Computing SNN
pbmc
# An object of class Seurat
# 13714 features across 2695 samples within 1 assay
# Active assay: RNA (13714 features, 2000 variable features)
# 1 dimensional reduction calculated: pca
然后使用FindClusters函数识别亚群
pbmc <- FindClusters(pbmc,
resolution = 0.5,
granularity = 0.8,
algorithm = 1)
# Warning: The following arguments are not used: granularity
## Warning: The following arguments are not used: granularity
# Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck
#
# Number of nodes: 2695
# Number of edges: 122081
#
# Running Louvain algorithm...
# Maximum modularity in 10 random starts: 0.8591
# Number of communities: 8
# Elapsed time: 0 seconds
这里有个小技巧,大部分人做FindClusters时候都会忽略granularity,当数据的细胞数目在3000或以下时,设置granularity的区间在0.6-1.2之间,结果相对更准确
进行非线性降维tSNE分析
pbmc <- RunTSNE(object = pbmc,
dims.use = 1:20,
do.fast = TRUE)
?DimPlot
# starting httpd help server ...
# done
DimPlot(object = pbmc,
reduction = "tsne",
shuffle = TRUE,
label = TRUE,
label.size = 4,
label.color = "navy",
label.box = TRUE,
repel = TRUE,
cols.highlight = "#DE2D26",
sizes.highlight = 2,
combine = TRUE)
UMAP分析
pbmc <- RunUMAP(pbmc,
reduction = "pca",
dims = 1:20)
# Warning: The default method for RunUMAP has changed from calling Python UMAP via reticulate to the R-native UWOT using the cosine metric
# To use Python UMAP via reticulate, set umap.method to 'umap-learn' and metric to 'correlation'
# This message will be shown once per session
# 19:44:23 UMAP embedding parameters a = 0.9922 b = 1.112
# 19:44:23 Read 2695 rows and found 20 numeric columns
# 19:44:23 Using Annoy for neighbor search, n_neighbors = 30
# 19:44:23 Building Annoy index with metric = cosine, n_trees = 50
# 0% 10 20 30 40 50 60 70 80 90 100%
# [----|----|----|----|----|----|----|----|----|----|
# **************************************************|
# 19:44:23 Writing NN index file to temp file C:\Users\Public\Documents\Wondershare\CreatorTemp\RtmpINy5Lq\file217c301c20f8
# 19:44:23 Searching Annoy index using 1 thread, search_k = 3000
# 19:44:24 Annoy recall = 100%
# 19:44:25 Commencing smooth kNN distance calibration using 1 thread
# 19:44:25 Initializing from normalized Laplacian + noise
# 19:44:25 Commencing optimization for 500 epochs, with 111764 positive edges
# 19:44:37 Optimization finished
DimPlot(pbmc,
reduction = "umap",
split.by = "seurat_clusters")
# 保存结果,方便下次调用进行下一步分析吧
saveRDS(pbmc, file = "pbmc_tutorial.rds")
好了,到了这里,我们的Seurat分析流程就和scater分析流程再次到达同一起跑线啦!不同的Cluster已经识别出来,接下来就是寻找Marker基因以及拟时序分析相关的内容了,一篇简单的单细胞分析文章的大致流程即将走完,不知道你又到了哪一步了呢?
当然这只是最简单的入门分析过程,在实际分析过程中,根据数据的不同、测序平台的不同等等,具体步骤还有一些小的差异,比如10x Genomics、STRT seq、Smart seq2等等,不过都是些小差异啦,我们后面再一起学习。
后台回复“feng45”获取数据和代码,我们下期见啦!
单细胞分析专栏传送门
碎碎念专栏传送门(完结)
风之美图系列传送门(完结)
—
END—
撰文丨风 风
排版丨四金兄
值班 | 王美丽
主编丨小雪球
欢迎大家关注解螺旋生信频道-挑圈联靠公号~
酸谈直播
搞科研的小伙伴都想发SCI,
发文章流程你真的清楚吗?
前期筹备?文章写作?
选刊投稿?正式发表?
坑无处不在,小心别踩奥!
本周直播酸谈来帮你排排雷!
避雷指南:发文章流程必知的6个技巧
酸菜老谈在线给你实用小tips,
想发文章再也不是难题啦!
直播预告
直播平台:Bilibili、微信视频号
直播时间:4月10日(周六)晚18:00-20:00
直播主题:《避雷指南:发文章流程必知的6个技巧》
直播平台:Bilibili、微信视频号
直播时间:4月10日(周六)晚18:00-20:00
直播主题:《避雷指南:发文章流程必知的6个技巧》
带着你的小伙伴们
一起相约解螺旋直播间吧!
领悟科研 优人一步
扫码预约直播
最新评论
推荐文章
作者最新文章
你可能感兴趣的文章
Copyright Disclaimer: The copyright of contents (including texts, images, videos and audios) posted above belong to the User who shared or the third-party website which the User shared from. If you found your copyright have been infringed, please send a DMCA takedown notice to [email protected]. For more detail of the source, please click on the button "Read Original Post" below. For other communications, please send to [email protected].
版权声明:以上内容为用户推荐收藏至CareerEngine平台,其内容(含文字、图片、视频、音频等)及知识版权均属用户或用户转发自的第三方网站,如涉嫌侵权,请通知[email protected]进行信息删除。如需查看信息来源,请点击“查看原文”。如需洽谈其它事宜,请联系[email protected]。
版权声明:以上内容为用户推荐收藏至CareerEngine平台,其内容(含文字、图片、视频、音频等)及知识版权均属用户或用户转发自的第三方网站,如涉嫌侵权,请通知[email protected]进行信息删除。如需查看信息来源,请点击“查看原文”。如需洽谈其它事宜,请联系[email protected]。