可繁可简,可盐可甜,这才是upset的真面目吧!
一文学会简单版及高级版upset图画法
大家好,我是风。这里是风风的CNS美图复现系列专栏。ComplexHeatmap系列到了这里其实已经接近了尾声,接下来的都是扩展内容,目的还是让大家更好地了解可视化的奥妙。很多学员在跟完打赏营之后都会发出一声感叹:“数据清洗比可视化重要多了!”其实我不太同意,科研的本质是去发现事物背后所存在的潜在规律,这种规律看不到摸不着,需要我们去探索,而可视化则是为了让我们发现数据背后隐藏的规律,这难道不是某种程度上的相通之处吗?所以数据清洗和可视化到底谁更重要呢?都重要,二者相辅相成,没有数据清洗,可视化无法得到合适的输入格式,而没有可视化,则可能会忽略很多本质的真相,今天,我们就用ComplexHeatmap绘制一种特别的图形来发现多数据集取交集之间的真相——upset图!(简单到三行代码出图,真正的三行情诗啊!) 传送门
我们先来构建一个数据绘制常见的简单upset图:
library(ComplexHeatmap)
library(GenomicRanges)
library(RColorBrewer)
display.brewer.all()
brewer.pal(n=12,"Set3")
set.seed(10086)
lt <- list(a = sample(letters, 10),
b = sample(letters, 15),
c = sample(letters, 20))
head(lt)
make_comb_mat()生成组合矩阵,并计算集合和组合集合的大小:
m <- make_comb_mat(lt)
m # 查看矩阵结果,出现的参数均可以进行调用,也可以作为参数设置图片
# A combination matrix with 3 sets and 7 combinations.
# ranges of combination set size: c(1, 8).
# mode for the combination size: distinct.
# sets are on rows.
#
# Utility functions that can be applied:
# - set_name(): name of the sets.
# - set_size(): size of the sets.
# - comb_name(): name of the combination sets.
# - comb_size(): size of the combination sets.
# - comb_degree(): degree of the combination sets.
# - extract_comb(): extract elements in the specific combination set.
# - t(): transpose the combination matrix on the UpSet plot.
# - '[': subset the combination matrix.
set_size(m) # 展示各集合大小
# a b c
# 10 15 20
comb_size(m) # 展示组合的集合的大小
# 100 010 001 110 101 011 111
# 1 1 5 2 3 8 4
comb_degree(m) # 展示组合集的来源,100表示只有第一个集合,010表示只含有第二个集合
# 100 010 001 110 101 011 111
# 1 1 1 2 2 2 3
# 绘制图片
UpSet(m) # 这样一个upset图就绘制完成
# 使用set_order和comb_order控制排列顺序
UpSet(m,
set_order = c("a", "b", "c"), #下方点线图的集合展现顺序
comb_order = order(comb_size(m))) # 从小到大进行排序
# 由pt_size、comb_col和lwd控制点的大小、颜色和线段的宽度
comb_degree(m) # 查看组合类型,方便设置颜色类型
## 100 010 001 110 101 011 111
## 1 1 1 2 2 2 3
UpSet(m,
set_order = c("a", "b", "c"),
comb_order = order(comb_degree(m)),
pt_size = unit(5, "mm"),
lwd = 3,
comb_col = c("#8DD3C7", "#80B1D3", "#BC80BD")[comb_degree(m)])
# 由bg_col 、bg_pt_col设置背景颜色,我们设置一个奶酪色
UpSet(m,
set_order = c("a", "b", "c"),
comb_order = order(comb_degree(m)),
pt_size = unit(5, "mm"),
lwd = 3,
bg_col = c("#FFFFB3", "#FDB462"),
bg_pt_col = "#FFED6F",
comb_col = c("#8DD3C7", "#80B1D3", "#BC80BD")[comb_degree(m)])
# 也可以转换一下矩阵看看效果
UpSet(t(m),
set_order = c("a", "b", "c"),
pt_size = unit(5, "mm"),
lwd = 3,
comb_col = c("#8DD3C7", "#80B1D3", "#BC80BD")[comb_degree(t(m))])
# 也可以只绘制矩阵的特定部分突出重点
UpSet(m[comb_size(m) >= 3])
UpSet(m[comb_degree(m) == 2])
# 使用make_comb_mat()中的不同模式:
m1 <- make_comb_mat(lt, mode = "distinct")
m2 <- make_comb_mat(lt, mode = "intersect")
m3 <- make_comb_mat(lt, mode = "union")
UpSet(m1)
UpSet(m2)
UpSet(m3)
# 也可以把三个结果展示在一个图片中
UpSet(m1, column_title = "distinct mode") +
UpSet(m2, column_title = "intersect mode") +
UpSet(m3, column_title = "union mode")
# 后面两个图片其实只有上半部分有差别,其他部分都一样,我们可以利用热图注释把上半部分合并
top_ha = HeatmapAnnotation(
"intersect" = anno_barplot(comb_size(m2),
gp = gpar(fill = "#8DD3C7"), height = unit(2, "cm")),
"union" = anno_barplot(comb_size(m3),
gp = gpar(fill = "#8DD3C7"), height = unit(2, "cm")),
gap = unit(2, "mm"),
annotation_name_side = "right",
annotation_name_rot = 0)
UpSet(m2,
top_annotation = top_ha,
comb_order = order(comb_degree(m2)),
pt_size = unit(5, "mm"),
lwd = 3,
bg_col = c("#FFFFB3", "#FDB462"),
bg_pt_col = "#FFED6F",
comb_col = c("#8DD3C7", "#80B1D3", "#BC80BD")[comb_degree(m2)])
# 奶酪配色实在是把我看饿了(T_T)
# 当然也是可以反转后进行行注释
right_ha = rowAnnotation(
"intersect" = anno_barplot(comb_size(m2),
gp = gpar(fill = "#BC80BD"), width = unit(2, "cm")),
"union" = anno_barplot(comb_size(m3),
gp = gpar(fill = "#BC80BD"), width = unit(2, "cm")),
gap = unit(2, "mm"), annotation_name_side = "bottom")
# the same for using m2 or m3
UpSet(t(m2),
right_annotation = right_ha,
comb_order = order(comb_degree(m2)),
pt_size = unit(5, "mm"),
lwd = 3,
comb_col = c("#8DD3C7", "#80B1D3", "#BC80BD")[comb_degree(m2)])
# 如上所所述,善用各种模式和参数会使得你的upset图更加突出重点
前面是我们自己构建的list,并且数据集也比较少,那如果我们现在有了自己的数据怎么办呢?接下来我们就来看看如何导入数据绘制upset图,并且进行修饰:
movies <- read.csv(system.file("extdata", "movies.csv", package = "UpSetR"),
header = TRUE, sep = ";") # 使用UpSetR的矩阵进行绘图
head(movies)
# Name ReleaseDate Action Adventure Children
# 1 Toy Story (1995) 1995 0 0 1
# 2 Jumanji (1995) 1995 0 1 1
# 3 Grumpier Old Men (1995) 1995 0 0 0
# 4 Waiting to Exhale (1995) 1995 0 0 0
# 5 Father of the Bride Part II (1995) 1995 0 0 0
# 6 Heat (1995) 1995 1 0 0
# Comedy Crime Documentary Drama Fantasy Noir Horror Musical Mystery Romance
# 1 1 0 0 0 0 0 0 0 0 0
# 2 0 0 0 0 1 0 0 0 0 0
# 3 1 0 0 0 0 0 0 0 0 1
# 4 1 0 0 1 0 0 0 0 0 0
# 5 1 0 0 0 0 0 0 0 0 0
# 6 0 1 0 0 0 0 0 0 0 0
# SciFi Thriller War Western AvgRating Watches
# 1 0 0 0 0 4.15 2077
# 2 0 0 0 0 3.20 701
# 3 0 0 0 0 3.02 478
# 4 0 0 0 0 2.73 170
# 5 0 0 0 0 3.01 296
# 6 0 1 0 0 3.88 940
先仿照上面的内容来个简单版的图片
m <- make_comb_mat(movies, top_n_sets = 6)
m
# A combination matrix with 6 sets and 39 combinations.
# ranges of combination set size: c(1, 1028).
# mode for the combination size: distinct.
# sets are on rows.
# universal set is set.
#
# Utility functions that can be applied:
# - set_name(): name of the sets.
# - set_size(): size of the sets.
# - comb_name(): name of the combination sets.
# - comb_size(): size of the combination sets.
# - comb_degree(): degree of the combination sets.
# - extract_comb(): extract elements in the specific combination set.
# - t(): transpose the combination matrix on the UpSet plot.
# - '[': subset the combination matrix.
m1 <- m[comb_degree(m) > 0] # 提取comb_degree大于0的交集数据
UpSet(m1,
pt_size = unit(5, "mm"),
lwd = 3,
comb_col = c("#8DD3C7", "#80B1D3", "#BC80BD","#FFED6F")[comb_degree(m1)]) #绘图
# 我们来加上箱式图展示更多信息
genre <- c("Action", "Romance", "Horror", "Children", "SciFi", "Documentary")
m <- make_comb_mat(movies[, genre])
m <- m[comb_degree(m) > 0]
comb_elements <- lapply(comb_name(m), function(nm)extract_comb(m, nm))
years <- lapply(comb_elements, function(ind) movies$ReleaseDate[ind])
rating <- lapply(comb_elements, function(ind) movies$AvgRating[ind])
watches <- lapply(comb_elements, function(ind) movies$Watches[ind])
ht <- UpSet(t(m)) + rowAnnotation(years = anno_boxplot(years),
rating = anno_boxplot(rating),
watches = anno_boxplot(watches),
gap = unit(2, "mm"))
ht
# 在上面基础上再加个热图
ht + Heatmap(1:19, name = "mean") +
rowAnnotation(bar = anno_points(1:19))
# ComplexHeatmap所有基础函数都可以在上面继续无限叠加,用法就如上面所示,加个+号连接即可
接下来我们来到实战环节,对基因组学数据进行upset图分析,我们利用6个H3K4me3 ChIP-seq数据进行分析,数据已经帮大家下载好了放在后台,回复关键词即可获取
首先是读取数据:
file_list <- c(
"ESC" = "E016-H3K4me3.narrowPeak.gz",
"ES-deriv1" = "E004-H3K4me3.narrowPeak.gz",
"ES-deriv2" = "E006-H3K4me3.narrowPeak.gz",
"Brain" = "E071-H3K4me3.narrowPeak.gz",
"Muscle" = "E100-H3K4me3.narrowPeak.gz",
"Heart" = "E104-H3K4me3.narrowPeak.gz"
)
依次读取压缩文件
peak_list <- lapply(file_list, function(f) {
df <- read.table(f)
GRanges(seqnames = df[, 1], ranges = IRanges(df[, 2], df [, 3]))
})
peak_list
# $ESC
# GRanges object with 59526 ranges and 0 metadata columns:
# seqnames ranges strand
# <Rle> <IRanges> <Rle>
# [1] chr4 152020035-152023748 *
# [2] chr14 102604285-102609086 *
# [3] chr2 153571839-153577202 *
# [4] chr17 42975620-42977899 *
# [5] chr9 33262735-33266051 *
# ... ... ... ...
# [59522] chr1 180208717-180208915 *
# [59523] chr11 85430326-85430575 *
# [59524] chr12 38123979-38124189 *
# [59525] chr4 64145-64386 *
# [59526] chr4 190200451-190200831 *
# -------
# seqinfo: 24 sequences from an unspecified genome; no seqlengths
#
# $`ES-deriv1`
# GRanges object with 41041 ranges and 0 metadata columns:
# seqnames ranges strand
# <Rle> <IRanges> <Rle>
# [1] chr7 102987916-102988841 *
# [2] chr2 38976591-38978502 *
# [3] chr5 137513371-137514654 *
# [4] chr10 27441927-27444117 *
# [5] chr17 4699460-4700239 *
# ... ... ... ...
# [41037] chr6 53661194-53661403 *
# [41038] chr8 102093297-102093424 *
# [41039] chr9 129244858-129244982 *
# [41040] chr9 133309101-133309246 *
# [41041] chr9 96108967-96109104 *
# -------
# seqinfo: 24 sequences from an unspecified genome; no seqlengths
#
# $`ES-deriv2`
# GRanges object with 112065 ranges and 0 metadata columns:
# seqnames ranges strand
# <Rle> <IRanges> <Rle>
# [1] chrX 1509657-1511092 *
# [2] chr6 170893078-170893758 *
# [3] chr1 110090647-110091770 *
# [4] chr11 10829234-10830492 *
# [5] chr16 20752186-20753195 *
# ... ... ... ...
# [112061] chr2 64602704-64602883 *
# [112062] chr3 52008235-52008394 *
# [112063] chr4 7418205-7418399 *
# [112064] chr5 171911609-171911804 *
# [112065] chr8 2282843-2283018 *
# -------
# seqinfo: 25 sequences from an unspecified genome; no seqlengths
#
# $Brain
# GRanges object with 77143 ranges and 0 metadata columns:
# seqnames ranges strand
# <Rle> <IRanges> <Rle>
# [1] chr6 149865499-149868052 *
# [2] chr6 24665139-24668461 *
# [3] chr5 74061835-74064221 *
# [4] chr3 57540975-57544094 *
# [5] chr2 29091196-29094080 *
# ... ... ... ...
# [77139] chr7 7176032-7176362 *
# [77140] chr9 127054066-127054617 *
# [77141] chr9 19775575-19775856 *
# [77142] chr9 2412628-2412962 *
# [77143] chr10 39119075-39119328 *
# -------
# seqinfo: 24 sequences from an unspecified genome; no seqlengths
#
# $Muscle
# GRanges object with 39277 ranges and 0 metadata columns:
# seqnames ranges strand
# <Rle> <IRanges> <Rle>
# [1] chr15 93425914-93428748 *
# [2] chr19 2269463-2271765 *
# [3] chr13 108869901-108872314 *
# [4] chr1 8481509-8484943 *
# [5] chr6 163833515-163838099 *
# ... ... ... ...
# [39273] chr7 121940460-121940656 *
# [39274] chr7 155049405-155049597 *
# [39275] chr9 23850337-23850517 *
# [39276] chr16 1382209-1382542 *
# [39277] chr17 59572803-59572983 *
# -------
# seqinfo: 25 sequences from an unspecified genome; no seqlengths
#
# $Heart
# GRanges object with 42291 ranges and 0 metadata columns:
# seqnames ranges strand
# <Rle> <IRanges> <Rle>
# [1] chr12 56400660-56402264 *
# [2] chr19 45753661-45755905 *
# [3] chr9 116637634-116639769 *
# [4] chr7 139474702-139478904 *
# [5] chr8 117884832-117887652 *
# ... ... ... ...
# [42287] chr18 20818035-20818317 *
# [42288] chr2 20306661-20306896 *
# [42289] chr2 230578001-230578314 *
# [42290] chr5 175843764-175843954 *
# [42291] chr16 2086429-2086705 *
# -------
# seqinfo: 24 sequences from an unspecified genome; no seqlengths
接下来是构建绘图矩阵并绘图,还是使用make_comb_mat,就是make combination matrix的缩写嘛,非常好记,集合和组合集合的大小是总碱基对或区域宽度的总和,我们只保留500kb以上的组合集:
m <- make_comb_mat(peak_list) # 构建矩阵
m <- m[comb_size(m) > 500000] # 保留500kb以上的组合集
UpSet(m) # 绘图
# 使用axis_param格式化坐标轴,将科学计数法转为以MB为单位
UpSet(m,
top_annotation = upset_top_annotation(
m,
axis_param = list(at = c(0, 1e7, 2e7),
labels = c("0MB", "10MB", "20MB")),
height = unit(4, "cm")
),
right_annotation = upset_right_annotation(
m,
axis_param = list(at = c(0, 2e7, 4e7, 6e7),
labels = c("0MB", "20MB", "40MB", "60MB"),
labels_rot = 0),
width = unit(4, "cm")
))
有时候我们想展示更多的信息互为关联,就像每组基因组区域,我们可以关联如平均甲基化或者最近的TSS的距离等内容,我们就可以进行图形注释来展示,这就要用到我们前面讲过的注释内容了,来看一下:
subgroup <- c("ESC" = "group1",
"ES-deriv1" = "group1",
"ES-deriv2" = "group1",
"Brain" = "group2",
"Muscle" = "group2",
"Heart" = "group2"
)
comb_sets <- lapply(comb_name(m), function(nm)extract_comb(m, nm))
comb_sets <- lapply(comb_sets, function(gr) {
# 随机生成dist_to_tss和mean_meth
gr$dist_to_tss = abs(rnorm(length(gr), mean = runif(1, min = 500, max = 2000), sd = 1000))
gr$mean_meth = abs(rnorm(length(gr), mean = 0.1, sd = 0.1))
gr
})
UpSet(m,
top_annotation = upset_top_annotation(
m,
axis_param = list(at = c(0, 1e7, 2e7),
labels = c("0MB", "10MB", "20MB")),
height = unit(4, "cm")
),
right_annotation = upset_right_annotation(
m,
axis_param = list(at = c(0, 2e7, 4e7, 6e7),
labels = c("0MB", "20MB", "40MB", "60MB"),
labels_rot = 0),
width = unit(4, "cm")
),
left_annotation = rowAnnotation(group = subgroup[set_name(m)], show_annotation_name = FALSE),
bottom_annotation = HeatmapAnnotation(
dist_to_tss = anno_boxplot(lapply(comb_sets, function(gr) gr$dist_to_tss), outline = FALSE),
mean_meth = sapply(comb_sets, function(gr) mean(gr$mean_meth)),
annotation_name_side = "left"
)
)
# 这样整个图形就绘制完成,可以看到用到的注释函数还是我们上回讲过的内容,忘了的话记得复习哦!
好了,今天的内容又结束啦。今天我们学习了upset图的画法,分别从构建数据、读取数据、chip数据实战三个方面逐步递进进行介绍,内容其实并不复杂,都是建立在前面系列的基础之上,如果有哪个函数忘记了,记得通过传送门进行复习哈!
最后最后,新版的R studio来了喔,千呼万唤始出来,自从上次听到谢益辉大佬的公开介绍之后,就一直念念不忘,我在打赏营时候还跟大家聊起来说这一版本号称可以替代word,新版虽然用起来还没word那么顺手,但是优势也很明显了。比如处理复杂图形的速度更快,基本几秒钟就能够出来一个以往需要加载很久的图片,这给单细胞分析或者是甲基化分析带来了极大的便利,此外Rmarkdown的反应速度也更快,在使用“无限月读(阅读)”查件时候基本可以做到实时反应,不过也可能跟我目前尝试的内容都比较简单的关系,推荐大家去尝鲜,不需要更换R版本,只需要更新Rstudio的版本,并且也不需要重装R包,并且Rstudio还可以跑python代码哦,超级方便没错了,我想,如果office不更新换代的话,按照目前全民代码的发展趋势下去,真的有一天要被这类编译器完全取代也不一定,好啦,公众号回复“风28”就可以获得本次代码和示例数据,我们下次再见吧!*^_^*
—
END—
撰文丨风 风
排版丨四金兄
值班 | 弘 毅
主编丨小雪球
欢迎大家关注解螺旋生信频道-挑圈联靠公号~
最新评论
推荐文章
作者最新文章
你可能感兴趣的文章
Copyright Disclaimer: The copyright of contents (including texts, images, videos and audios) posted above belong to the User who shared or the third-party website which the User shared from. If you found your copyright have been infringed, please send a DMCA takedown notice to [email protected]. For more detail of the source, please click on the button "Read Original Post" below. For other communications, please send to [email protected].
版权声明:以上内容为用户推荐收藏至CareerEngine平台,其内容(含文字、图片、视频、音频等)及知识版权均属用户或用户转发自的第三方网站,如涉嫌侵权,请通知[email protected]进行信息删除。如需查看信息来源,请点击“查看原文”。如需洽谈其它事宜,请联系[email protected]。
版权声明:以上内容为用户推荐收藏至CareerEngine平台,其内容(含文字、图片、视频、音频等)及知识版权均属用户或用户转发自的第三方网站,如涉嫌侵权,请通知[email protected]进行信息删除。如需查看信息来源,请点击“查看原文”。如需洽谈其它事宜,请联系[email protected]。