不骗你!学了这个R包,搞定九成高大上的单细胞分析!包治百病!一包封神!小白赶紧收起来!
Seurat数据质控与可视化
大家好,我是风。欢迎来到风风的从零开始单细胞系列。单细胞的分析我们已经学习了一段时间了,不知道你掌握得如何?是否可以自己串成一篇文章了呢?技能永远只是技能,而思路应该是来源于文献和专业背景的结合,这是我们的优势!高深的代码知识只适合专业的选手,我们只需要把技能拿过来运用,达到自己的目的就行了。
今天我们来学习一下Seurat包在单细胞分析中的应用。Seurat包是单细胞分析最常用、功能最强大的R包,有多强大呢?Seurat有专门一个网站来介绍这个R包的各种功能,并且目前大部分单细胞分析的文章都绕不开这个R包。我们来看一下Seurat网站的内容有多丰富:
对于Seurat的新用户,官网建议从10X Genomics公司公开的2700个外周血单核细胞(PBMC)数据集开始学习分析。可以说,任何的教程都没有Seurat网站的介绍详细和清晰。网站也提供了相应的练习数据供大家学习。今天我们就来抛砖引玉,给大家先简单介绍这个R包的用法。我们将先使用10x Genomics公司提供的外周血单核细胞数据(PBMC)数据进行展示,PBMC数据也是目前较适合新手入门掌握分析流程的练习数据。然后,再使用一份之前推文的介绍过的来自GEO数据库的GSE166326 10x Genomics数据进行分析,让大家了解怎样替换成自己的数据进行相应的分析。
首先读取PBMC的数据介绍分析流程:
rm(list = ls())
options(stringsAsFactors = FALSE)
set.seed(20210330) # 设置随机种子,方便重复结果
安装R包
"Seurat") install.packages(
"dplyr") install.packages(
library(Seurat)
library(dplyr)
#
# Attaching package: 'dplyr'
# The following objects are masked from 'package:stats':
#
# filter, lag
# The following objects are masked from 'package:base':
#
# intersect, setdiff, setequal, union
library(cowplot)
构建Seurat对象
pbmc_data <- Read10X(data.dir = "PBMC/")
构建Seurat对象不知道大家是否还记得?在前面的推文中我们介绍过,只需要将barcodes.tsv、genes.tsv、matrix.mtx三份数据放在同一个文件夹下,然后使用Seurat包的Read10X函数就可以构建Seurat对象
c查看数据类型
typeof(pbmc_data) # "S4"
# [1] "S4"
class(pbmc_data) # "dgCMatrix"
# [1] "dgCMatrix"
# attr(,"package")
# [1] "Matrix"
typeof告诉我们这是一个S4类型的数据,什么是S4类型的数据呢?R语言的编程方式有S3、S4和R6等,其中,S3和S4都是R语言中基于泛型函数的函数编程方式,但是S4比S3更具有层次感。比较晦涩,但没关系,这个有了解就行,就记住是一种编程方式。
再使用class函数,发现R告诉我们Seurat对象是一个稀疏矩阵dgCMatrix。新知识又来了,什么叫稀疏矩阵呢?在矩阵中,如果数值为0的元素数目远远多于非0元素的数目,并且非0元素分布无规律时,则称该矩阵称为稀疏矩阵;与之对应的是稠密矩阵,那自然就是非0元素更多则称为稠密矩阵咯。稀疏矩阵常用于数据量较大的数据,例如单细胞测序数据。我们也不用特别对待这种类型,把它当成是常规矩阵matrix的一种特殊类型就行,matrix和dgCMatrix之间也可以互相转换。那为什么要构建稀疏矩阵呢?一个最简单的原因就是节省空间和内存,我们来看看稀疏矩阵和常规矩阵之间的内存相差多大?
pbmc_matrix <- as.matrix(x = pbmc_data)
class(pbmc_matrix)
dense.size <- object.size(x = pbmc_matrix)
dense.size
class(pbmc_data)
sparse.size <- object.size(x = pbmc_data)
sparse.size
pbmc_data@Dimnames[[1]][c(1:10)]
pbmc_data@Dimnames[[2]][c(1:10)]
构建了Seurat对象后,我们就要对数据进行质控和分析,Seurat包的数据质控比scater包方便很多,只需要一个函数:
pbmc <- CreateSeuratObject(counts = pbmc_data,
min.cells = 3,
min.features = 200,
project = "PBMC",
assay = "RNA")
mito.genes <- grep(pattern = "^MT-",
x = rownames(pbmc@assays[["RNA"]]),
value = TRUE)
mito.genes
percent.mito <- Matrix::colSums(pbmc@assays[["RNA"]][mito.genes, ])/Matrix::colSums(pbmc@assays[["RNA"]])
pbmc <- AddMetaData(object = pbmc,
metadata = percent.mito,
col.name = "percent.mito"
)
pbmc
identical(pbmc$percent.mito,percent.mito)
pbmc$percent.mito <- percent.mito
VlnPlot(object = pbmc,
features = c("nFeature_RNA",
"nCount_RNA",
"percent.mito"),
ncol = 3)
nFeature_RNA代表每个细胞测到的基因数目,nCount_RNA代表每个细胞测到的所有基因表达量总和,percent.mito代表检测到的线粒体基因的比例,也可以直接看两者的关系。
par(mfrow = c(1, 2))
FeatureScatter(object = pbmc,
feature1 = "nCount_RNA",
feature2 = "percent.mito",
pt.size = 1,
smooth = FALSE,
combine = TRUE,
slot = "data")
FeatureScatter(object = pbmc,
feature1 = "nCount_RNA",
feature2 = "nFeature_RNA",
pt.size = 1,
smooth = FALSE,
combine = TRUE,
slot = "data")
去除线粒体基因和极端值基因
pbmc <- subset(x = pbmc,
subset = nFeature_RNA > 200 & nFeature_RNA < 2500 & percent.mito > -Inf & percent.mito < 5 )
pbmc
# An object of class Seurat
# 13714 features across 2695 samples within 1 assay
# Active assay: RNA (13714 features, 0 variable features)
接下来再对数据进行标准化:
pbmc <- NormalizeData(object = pbmc,
normalization.method = "LogNormalize",
scale.factor = 10000)
pbmc
# An object of class Seurat
# 13714 features across 2695 samples within 1 assay
# Active assay: RNA (13714 features, 0 variable features)
计算高度可变基因。FindVariableGenes的参数设置需要根据具体的数据类型、标准化的方法等的不同而修改,也可以直接试试下边的参数看看结果,然后再进行微调,这里识别2000个高变异基因
pbmc <- FindVariableFeatures(object = pbmc,
selection.method = "vst",
mean.function = ExpMean,
dispersion.function = LogVMR,
x.low.cutoff = 0.0125,
x.high.cutoff = 3,
y.cutoff = 0.5,
nfeatures = 2000)
# Warning: The following arguments are not used: x.low.cutoff, x.high.cutoff,
# y.cutoff
pbmc
# An object of class Seurat
# 13714 features across 2695 samples within 1 assay
# Active assay: RNA (13714 features, 2000 variable features)
查看计算结果
HVFInfo(object = pbmc)[1:15,1:3]
# mean variance variance.standardized
# AL627309.1 0.003339518 0.003329601 0.9335292
# AP006222.2 0.001113173 0.001112346 0.9931409
# RP11-206L10.2 0.001855288 0.001852533 0.9632153
# RP11-206L10.9 0.001113173 0.001112346 0.9931409
# LINC00115 0.006679035 0.006636888 0.9077796
# NOC2L 0.105751391 0.156963940 0.7836793
# KLHL17 0.003339518 0.003329601 0.9335292
# PLEKHN1 0.002597403 0.002591618 0.9448956
# RP11-54O7.17 0.001113173 0.001112346 0.9931409
# HES4 0.079035250 0.145569967 1.0587329
# RP11-54O7.11 0.001484230 0.001482577 0.9766675
# ISG15 1.187384045 6.971184505 1.7096867
# AGRN 0.002968460 0.002960747 0.9384280
# C1orf159 0.008905380 0.008829351 0.8972435
# TNFRSF18 0.041558442 0.058405885 0.9741060
可以看到2000个基因的计算结果列表
提取top10的基因
top10 <- head(VariableFeatures(pbmc), 10)
进行可视化
plot1 <- VariableFeaturePlot(pbmc)
LabelPoints(plot = plot1,
points = top10,
repel = TRUE)
# When using repel, set xnudge and ynudge to 0 for optimal results
# Warning: Transformation introduced infinite values in continuous x-axis
进行z score转换
pbmc <- ScaleData(object = pbmc,
vars.to.regress = c("nCounts_RNA", "percent.mito"))
# Warning: Requested variables to regress not in object: nCounts_RNA
# Regressing out percent.mito
# Centering and scaling data matrix
pbmc
# An object of class Seurat
# 13714 features across 2695 samples within 1 assay
# Active assay: RNA (13714 features, 2000 variable features)
查看数据
pbmc@assays[["RNA"]]@scale.data[1:10,1:5]
# AAACATACAACCAC-1 AAACATTGAGCTAC-1 AAACATTGATCAGC-1
# ISG15 -0.83409518 -0.83611644 0.39836433
# CPSF3L -0.26051269 -0.24957333 -0.29051889
# MRPL20 1.54929805 -0.50901239 1.22331701
# ATAD3C -0.04470286 -0.04037045 -0.05658646
# C1orf86 -0.45124295 -0.44233736 -0.47567058
# RER1 -0.50312709 1.00993794 1.38617002
# RP3-395M20.9 -0.05735853 -0.04897515 -0.08035375
# LRRC47 -0.19558459 -0.18258824 3.79394938
# GPR153 -0.03527426 -0.03224285 -0.04358929
# TNFRSF25 4.92103466 -0.22742025 -0.27225815
# AAACCGTGCTTCCG-1 AAACCGTGTATGCG-1
# ISG15 2.22274110 -0.82942309
# CPSF3L -0.27848634 -0.28579873
# MRPL20 -0.57038796 -0.58591046
# ATAD3C -0.05182111 -0.05471710
# C1orf86 -0.46587505 -0.47182797
# RER1 1.58587184 -0.53560503
# RP3-395M20.9 -0.07113262 -0.07673646
# LRRC47 -0.21693791 -0.22562529
# GPR153 -0.04025495 -0.04228129
# TNFRSF25 -0.25908177 -0.26708929
pbmc@assays[["RNA"]]@data[1:10,1:5]
# 10 x 5 sparse Matrix of class "dgCMatrix"
# AAACATACAACCAC-1 AAACATTGAGCTAC-1 AAACATTGATCAGC-1
# AL627309.1 . . .
# AP006222.2 . . .
# RP11-206L10.2 . . .
# RP11-206L10.9 . . .
# LINC00115 . . .
# NOC2L . . .
# KLHL17 . . .
# PLEKHN1 . . .
# RP11-54O7.17 . . .
# HES4 . . .
# AAACCGTGCTTCCG-1 AAACCGTGTATGCG-1
# AL627309.1 . .
# AP006222.2 . .
# RP11-206L10.2 . .
# RP11-206L10.9 . .
# LINC00115 . .
# NOC2L . .
# KLHL17 . .
# PLEKHN1 . .
# RP11-54O7.17 . .
# HES4 . .
这样一份10x Gonomics的数据就处理完成,包括:数据质控、数据过滤、数据标准化等重要操作。
接下来我们使用GEO数据库上的GSE166326来进行上面的流程,给大家展示如何运用Seurat包处理自己的数据:
GEO_data <- Read10X(data.dir = "GEO/")
GEO <- CreateSeuratObject(counts = GEO_data,
min.cells = 3,
min.features = 200,
project = "GEO",
assay = "RNA")
GEO
mito.genes <- grep(pattern = "^MT-",
x = rownames(GEO@assays[["RNA"]]),
value = TRUE)
mito.genes
VlnPlot(object = GEO,
features = c("nFeature_RNA",
"nCount_RNA"),
ncol = 2)
FeatureScatter(object = GEO,
feature1 = "nCount_RNA",
feature2 = "nFeature_RNA")
计算高变基因
GEO <- subset(x = GEO,
subset = nFeature_RNA > 200 & nFeature_RNA < 2500)
GEO
# An object of class Seurat
# 14627 features across 836 samples within 1 assay
# Active assay: RNA (14627 features, 0 variable features)
进行LogNormalize标准化
GEO <- NormalizeData(object = GEO,
normalization.method = "LogNormalize",
scale.factor = 10000)
GEO
# An object of class Seurat
# 14627 features across 836 samples within 1 assay
# Active assay: RNA (14627 features, 0 variable features)
进行vst转换
GEO <- FindVariableFeatures(object = GEO,
selection.method = "vst",
mean.function = ExpMean,
dispersion.function = LogVMR,
x.low.cutoff = 0.0125,
x.high.cutoff = 3,
y.cutoff = 0.5,
nfeatures = 2000)
# Warning: The following arguments are not used: x.low.cutoff, x.high.cutoff,
# y.cutoff
GEO
# An object of class Seurat
# 14627 features across 836 samples within 1 assay
# Active assay: RNA (14627 features, 2000 variable features)
HVFInfo(object = GEO)[1:15,1:3]
# mean variance variance.standardized
# Sox17 0.264354067 1.828236828 5.0337086
# Mrpl15 0.234449761 0.265925565 0.8436649
# Lypla1 0.632775120 0.917678996 0.8120245
# Tcea1 0.379186603 0.420117182 0.7425375
# Atp6v1h 0.276315789 0.300803656 0.7854357
# Rb1cc1 0.289473684 0.325685471 0.8039206
# Pcmtd1 0.434210526 0.490277340 0.7280056
# Rrs1 0.137559809 0.145126493 0.8416030
# Adhfe1 0.443779904 0.649530126 0.9373595
# 2610203C22Rik 0.007177033 0.007134057 0.8513895
# Mybl1 0.004784689 0.004767498 0.8833650
# Vcpip1 0.092105263 0.102883706 0.9125010
# 1700034P13Rik 0.011961722 0.019018422 1.3252637
# Sgk3 0.108851675 0.118676045 0.8844618
# 6030422M02Rik 0.004784689 0.004767498 0.8833650
这样自己的数据也处理完成了
得到处理完成的数据之后,就可以对数据进行PCA降维分析、细胞分类、新Marker鉴定等内容啦!关于Seurat的内容,如果自己理解能力和动手能力比较强的话,我更建议大家在Seurat官网自己探索,如果嫌全英文太麻烦的话,那就看咱们的推文吧,推文也会演示替换自己的数据该该如何分析。好了,今天的内容就到这里,内容承接前面的Seurat对象。我们的系列推文内容前后呼应,如果有哪一个步骤忘了的话,记得回去看看前面的推文哦!
后台回复“feng44”获取数据和代码,我们下期见啦!
单细胞分析专栏传送门
碎碎念专栏传送门(完结)
风之美图系列传送门(完结)
—
END—
撰文丨风 风
排版丨四金兄
值班 | 菠小萝
主编丨小雪球
欢迎大家关注解螺旋生信频道-挑圈联靠公号~
最新评论
推荐文章
作者最新文章
你可能感兴趣的文章
Copyright Disclaimer: The copyright of contents (including texts, images, videos and audios) posted above belong to the User who shared or the third-party website which the User shared from. If you found your copyright have been infringed, please send a DMCA takedown notice to [email protected]. For more detail of the source, please click on the button "Read Original Post" below. For other communications, please send to [email protected].
版权声明:以上内容为用户推荐收藏至CareerEngine平台,其内容(含文字、图片、视频、音频等)及知识版权均属用户或用户转发自的第三方网站,如涉嫌侵权,请通知[email protected]进行信息删除。如需查看信息来源,请点击“查看原文”。如需洽谈其它事宜,请联系[email protected]。
版权声明:以上内容为用户推荐收藏至CareerEngine平台,其内容(含文字、图片、视频、音频等)及知识版权均属用户或用户转发自的第三方网站,如涉嫌侵权,请通知[email protected]进行信息删除。如需查看信息来源,请点击“查看原文”。如需洽谈其它事宜,请联系[email protected]。