单细胞混样品测序后数据拆分（Cell Hashing技术）

最近有学徒提到她在复现文献：《utative regulators for the continuum of erythroid differentiation revealed by single-cell transcriptome of human BM and UCB cells.》的单细胞数据分析的时候.

她发现作者明明是提到了6个样品，但是其数据集是：https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE150774,可以看到是5个数据：

GSM4558614 CD235a+ cells - umbilical cord blood 2 and umbilical cord blood 3,UCB3

GSM4558615 CD235a+ cells - umbilical cord blood 1

GSM4558616 CD235a+ cells - Bone Marrow 2

GSM4558617 CD235a+ cells - Bone Marrow 3

GSM4558618 CD235a+ cells - Bone Marrow 4

仔细看了看文章里面的描述，确实是 10x Genomics platform 这个技术，是 3 adult bone marrow and 3 umbilical cord blood samples ，合起来是6个样品，而且提前做了细胞分选，仅仅是关注 CD235a+ cells

学徒以为是作者数据整理上传失败，其实是cell hashing技术，大家可以先去了解 CITE-seq技术，它可以同时拿到普通基因的表达量矩阵，以及几十个蛋白质（通过antibody-derived tags (ADT)）的表达量矩阵，该技术的全称为cellular indexing of transcriptomes and epitopes by sequencing。而Cell Hashing是在CITE-seq的基础上改进，是给需要混合的样品提前加上HTO (A distinct Hashtag oligonucleotide) 标签，这样即使混合后也可以提供不同的HTO标签进行区分。

可以看到文件UCB2UCB3确实是混合在同一个单细胞样品里面了：

GSM4558614_UCB2UCB3_filtered_gene_bc_matrices.tar.gz 95.9 Mb

GSM4558615_UCB1_filtered_gene_bc_matrices.tar.gz 4.4 Mb

GSM4558616_BM2_filtered_gene_bc_matrices.tar.gz 7.4 Mb

GSM4558617_BM3_filtered_gene_bc_matrices.tar.gz 6.3 Mb

GSM4558618_BM4_filtered_gene_bc_matrices.tar.gz 8.3 Mb

下面就让我们来看看如何把这个95.9 Mb的矩阵拆分成为两个样品

首先导入数据并查看数据分布

rm(list=ls())

options(stringsAsFactors = 
F) 

library(Seurat)

library(ggplot2)

# 需要自己去 GSE150774 数据集主页下载 GSE150774_RAW 哦 
# 把这个 GSM4558614_UCB2UCB3_filtered_gene_bc_matrices.tar.gz 95.9 Mb 解压即可：
HP <- Read10X(
"GSE150774_RAW/HP0_filtered_gene_bc_matrices/")

HP[[
1]] 
# RNA：Gene Expression
HP[[
2]] 
# HTO：Antibody Capture

它这个文件夹里面，有表达量矩阵，也有 HTO (A distinct Hashtag oligonucleotide) 标签，这样即使混合后也可以提供不同的HTO标签进行区分。

表达量矩阵和HTO标签需要取交集

#先读hto数据 
yhto <- as.data.frame(HP$`Antibody Capture`)

rownames(yhto)

# 可以看到是两个样品
table(colSums(yhto)== 
0)

pbmc.htos <- yhto[colSums(yhto)>
0]

dim(pbmc.htos)


# 然后看表达量矩阵 
pbmc.umis <- HP$`Gene Expression`


#  取交集
joint.bcs <- intersect(colnames(pbmc.umis), colnames(pbmc.htos))

length(joint.bcs)


# 前面的抗体信息和表达量信息，都需要过滤
# Subset RNA and HTO counts by joint cell barcodes
pbmc.umis <- pbmc.umis[, joint.bcs]

pbmc.htos <- as.matrix(pbmc.htos[, joint.bcs])

如果你的表达量矩阵跟HTO标签矩阵确实是同一个实验产出的，两者交集应该是非常好。

创建Seurat并将HTO置入对象中

取交集后，就可以进行seurat标准流程啦

# Setup Seurat object
pbmc.hashtag <- CreateSeuratObject(counts = pbmc.umis)


# Normalize RNA data with log normalization
pbmc.hashtag <- NormalizeData(pbmc.hashtag)

# Find and scale variable features
pbmc.hashtag <- FindVariableFeatures(pbmc.hashtag, selection.method = 
"mean.var.plot")

pbmc.hashtag <- ScaleData(pbmc.hashtag, features = VariableFeatures(pbmc.hashtag))

pbmc.hashtag


# Add HTO data as a new assay independent from RNA
pbmc.hashtag[[
"HTO"]] <- CreateAssayObject(counts = pbmc.htos)

names(pbmc.hashtag)

# Normalize HTO data, here we use centered log-ratio (CLR) transformation
pbmc.hashtag <- NormalizeData(pbmc.hashtag, 

                              assay = 
"HTO", 

                              normalization.method = 
"CLR")



head([email protected])


VlnPlot(pbmc.hashtag, features = c(
"nFeature_RNA", 
"nCount_RNA",

"nCount_HTO",
"nFeature_HTO"), 

        ncol = 
2,pt.size = 
0)

可以看到，这个时候其实是两个矩阵都进入了seurat对象里面哦，

利用`HTODemux函数`拆分数据

前面的seurat对象比较特殊，它有两个assay，如下所示：

> pbmc.hashtag

An object of class Seurat 

32740 features across 18777 samples within 2 assays 

Active assay: RNA (32738 features, 1749 variable features)

 1 other assay present: HTO

这样的，有两个 assay的 seurat对象，就可以被HTODemux函数拆分数据，代码如下所示：

pbmc.hashtag <- HTODemux(pbmc.hashtag,

                         assay = 
"HTO", 

                         positive.quantile = 
0.99)

pbmc.hashtag

table(pbmc.hashtag$HTO_classification.global)

pbmc.hashtag@assays

table(pbmc.hashtag$hash.ID)


# ## Doublet Negative  Singlet 
#     152    14650     3975 

# 可视化
Idents(pbmc.hashtag) <- 
"hash.ID"
VlnPlot(pbmc.hashtag, features = 
"nCount_RNA", pt.size = 
0, log = 
TRUE)

FeatureScatter(pbmc.hashtag, feature1 = 
"rna_HBE1", feature2 = 
"rna_S100A9")

可以看到，这个数据集质量有点问题，绝大部分的细胞都是阴性，有点意思。

数据提取

混合样品，拆开成为不同的seurat对象：


# First, we will remove negative cells from the object
table(Idents(pbmc.hashtag))

pbmc.hashtag.subset <- subset(pbmc.hashtag, 

                              idents = c(
"Negative",
"Doublet"), 

                              invert = 
TRUE)

table(Idents(pbmc.hashtag.subset)) 

pbmc.hashtag.subset <- RunPCA(pbmc.hashtag.subset)

pbmc.hashtag.subset <- RunUMAP(pbmc.hashtag.subset,dims=
1:
10)

DimPlot(pbmc.hashtag.subset) 


#提取B0251：
B0251 <- subset(pbmc.hashtag, 

                idents = c(
"B0251 anti-human Hashtag1"))

#提取B0252：
B0252 <- subset(pbmc.hashtag, 

                idents = c(
"B0252 anti-human Hashtag2"))

多个单细胞对象需要合并后走harmony流程


sce.all <- merge( B0251, y=  B0252 ,add.cell.ids = c(
'B0251',
'B0252')) 


as.data.frame(sce.all@assays$RNA@counts[
1:
10, 
1:
2])

head([email protected], 
10)

table([email protected]$orig.ident) 

library(stringr)

phe=as.data.frame(str_split(colnames(sce.all),
'_',simplify = 
T))

head(phe)

[email protected]$orig.ident = phe$V1

table([email protected]$orig.ident )

head([email protected])

# 下面是标准流程
sce=sce.all

sce <- NormalizeData(sce, 

                         normalization.method = 
"LogNormalize",

                         scale.factor = 
1e4) 

sce <- FindVariableFeatures(sce)

sce <- ScaleData(sce)

sce <- RunPCA(sce, features = VariableFeatures(object = sce))


library(harmony)

seuratObj <- RunHarmony(sce, 
"orig.ident")

names(seuratObj@reductions)

seuratObj <- RunUMAP(seuratObj,  dims = 
1:
15, 

                     reduction = 
"harmony")

DimPlot(seuratObj,reduction = 
"umap",label=
T ) 


sce=seuratObj

sce <- FindNeighbors(sce, reduction = 
"harmony",

                     dims = 
1:
15) 

sce.all=sce

后面的结果就顺理成章了，分群注释，如下所示的结果：

唯一美中不足的就是最开始的两万多个细胞，居然过滤后仅仅是剩下三千左右细胞数量了。

降维聚类分群，参考前面的例子：人人都能学会的单细胞聚类分群注释

写在文末

我在《生信技能树》，《生信菜鸟团》，《单细胞天地》的大量推文教程里面共享的代码都是复制粘贴即可使用的，有任何疑问欢迎留言讨论，也可以发邮件给我，详细描述你遇到的困难的前因后果给我，我的邮箱地址是 [email protected]

如果你确实觉得我的教程对你的科研课题有帮助，让你茅塞顿开，或者说你的课题大量使用我的技能，烦请日后在发表自己的成果的时候，加上一个简短的致谢，如下所示：

We thank Dr.Jianming Zeng(University of Macau), and all the members of his bioinformatics team, biotrainee, 
for generously sharing their experience and codes.

十年后我环游世界各地的高校以及科研院所（当然包括中国大陆）的时候，如果有这样的情谊，我会优先见你。

继续阅读

阅读原文