一文学会简单版及高级版upset图画法
大家好,我是风。这里是风风的CNS美图复现系列专栏。ComplexHeatmap系列到了这里其实已经接近了尾声,接下来的都是扩展内容,目的还是让大家更好地了解可视化的奥妙。很多学员在跟完打赏营之后都会发出一声感叹:“数据清洗比可视化重要多了!”其实我不太同意,科研的本质是去发现事物背后所存在的潜在规律,这种规律看不到摸不着,需要我们去探索,而可视化则是为了让我们发现数据背后隐藏的规律,这难道不是某种程度上的相通之处吗?所以数据清洗和可视化到底谁更重要呢?都重要,二者相辅相成,没有数据清洗,可视化无法得到合适的输入格式,而没有可视化,则可能会忽略很多本质的真相,今天,我们就用ComplexHeatmap绘制一种特别的图形来发现多数据集取交集之间的真相——upset图!(简单到三行代码出图,真正的三行情诗啊!
传送门
简单upset图
我们先来构建一个数据绘制常见的简单upset图:
# 加载R包library(ComplexHeatmap)library(GenomicRanges)library(RColorBrewer)display.brewer.all()brewer.pal(n=12,"Set3")## [1] "#8DD3C7""#FFFFB3""#BEBADA""#FB8072""#80B1D3""#FDB462""#B3DE69"## [8] "#FCCDE5""#D9D9D9""#BC80BD""#CCEBC5""#FFED6F"# 构建数据set.seed(10086)lt <- list(a = sample(letters, 10), b = sample(letters, 15), c = sample(letters, 20)) # 构建三个数据作为一个listhead(lt)## $a## [1] "r""b""g""e""x""n""t""u""s""i"## ## $b## [1] "h""s""d""a""x""c""v""b""n""e""z""y""r""l""m"## ## $c## [1] "w""a""x""p""k""h""l""i""d""z""s""q""g""m""r""f""n""c""u"## [20] "v"
# make_comb_mat()生成组合矩阵,并计算集合和组合集合的大小:m <- make_comb_mat(lt)m # 查看矩阵结果,出现的参数均可以进行调用,也可以作为参数设置图片## A combination matrix with 3 sets and 7 combinations.## ranges of combination set size: c(1, 8).## mode for the combination size: distinct.## sets are on rows.## ## Utility functions that can be applied:## - set_name(): name of the sets.## - set_size(): size of the sets.## - comb_name(): name of the combination sets.## - comb_size(): size of the combination sets.## - comb_degree(): degree of the combination sets.## - extract_comb(): extract elements in the specific combination set.## - t(): transpose the combination matrix on the UpSet plot.## - '[': subset the combination matrix.set_size(m) # 展示各集合大小## a b c ## 10 15 20comb_size(m) # 展示组合的集合的大小## 100 010 001 110 101 011 111 ## 1 1 5 2 3 8 4comb_degree(m) # 展示组合集的来源,100表示只有第一个集合,010表示只含有第二个集合## 100 010 001 110 101 011 111 ## 1 1 1 2 2 2 3
# 绘制图片UpSet(m) # 这样一个upset图就绘制完成
# 使用set_order和comb_order控制排列顺序UpSet(m, set_order = c("a", "b", "c"), #下方点线图的集合展现顺序 comb_order = order(comb_size(m))) # 从小到大进行排序
# 由pt_size、comb_col和lwd控制点的大小、颜色和线段的宽度comb_degree(m) # 查看组合类型,方便设置颜色类型## 100 010 001 110 101 011 111 ## 1 1 1 2 2 2 3UpSet(m, set_order = c("a", "b", "c"), comb_order = order(comb_degree(m)), pt_size = unit(5, "mm"), lwd = 3, comb_col = c("#8DD3C7", "#80B1D3", "#BC80BD")[comb_degree(m)])
# 由bg_col 、bg_pt_col设置背景颜色,我们设置一个奶酪色UpSet(m, set_order = c("a", "b", "c"), comb_order = order(comb_degree(m)), pt_size = unit(5, "mm"), lwd = 3, bg_col = c("#FFFFB3", "#FDB462"), bg_pt_col = "#FFED6F", comb_col = c("#8DD3C7", "#80B1D3", "#BC80BD")[comb_degree(m)])
# 也可以转换一下矩阵看看效果UpSet(t(m), set_order = c("a", "b", "c"), pt_size = unit(5, "mm"), lwd = 3, comb_col = c("#8DD3C7", "#80B1D3", "#BC80BD")[comb_degree(t(m))])
# 也可以只绘制矩阵的特定部分突出重点UpSet(m[comb_size(m) >= 3])
UpSet(m[comb_degree(m) == 2])
# 使用make_comb_mat()中的不同模式:m1 <- make_comb_mat(lt, mode = "distinct") m2 <- make_comb_mat(lt, mode = "intersect")m3 <- make_comb_mat(lt, mode = "union")UpSet(m1)
UpSet(m2)
UpSet(m3)
# 也可以把三个结果展示在一个图片中UpSet(m1, column_title = "distinct mode") +UpSet(m2, column_title = "intersect mode") +UpSet(m3, column_title = "union mode")
# 后面两个图片其实只有上半部分有差别,其他部分都一样,我们可以利用热图注释把上半部分合并top_ha = HeatmapAnnotation("intersect" = anno_barplot(comb_size(m2), gp = gpar(fill = "#8DD3C7"), height = unit(2, "cm")), "union" = anno_barplot(comb_size(m3), gp = gpar(fill = "#8DD3C7"), height = unit(2, "cm")), gap = unit(2, "mm"), annotation_name_side = "right", annotation_name_rot = 0)UpSet(m2, top_annotation = top_ha, comb_order = order(comb_degree(m2)), pt_size = unit(5, "mm"), lwd = 3, bg_col = c("#FFFFB3", "#FDB462"), bg_pt_col = "#FFED6F", comb_col = c("#8DD3C7", "#80B1D3", "#BC80BD")[comb_degree(m2)])# 奶酪配色实在是把我看饿了(T_T)
# 当然也是可以反转后进行行注释right_ha = rowAnnotation("intersect" = anno_barplot(comb_size(m2), gp = gpar(fill = "#BC80BD"), width = unit(2, "cm")), "union" = anno_barplot(comb_size(m3), gp = gpar(fill = "#BC80BD"), width = unit(2, "cm")), gap = unit(2, "mm"), annotation_name_side = "bottom")# the same for using m2 or m3UpSet(t(m2), right_annotation = right_ha, comb_order = order(comb_degree(m2)), pt_size = unit(5, "mm"), lwd = 3, comb_col = c("#8DD3C7", "#80B1D3", "#BC80BD")[comb_degree(m2)])
# 如上所所述,善用各种模式和参数会使得你的upset图更加突出重点
高级upset图
前面是我们自己构建的list,并且数据集也比较少,那如果我们现在有了自己的数据怎么办呢?接下来我们就来看看如何导入数据绘制upset图,并且进行修饰:
movies <- read.csv(system.file("extdata", "movies.csv", package = "UpSetR"), header = TRUE, sep = ";") # 使用UpSetR的矩阵进行绘图head(movies) ## Name ReleaseDate Action Adventure Children## 1 Toy Story (1995) 1995 0 0 1## 2 Jumanji (1995) 1995 0 1 1## 3 Grumpier Old Men (1995) 1995 0 0 0## 4 Waiting to Exhale (1995) 1995 0 0 0## 5 Father of the Bride Part II (1995) 1995 0 0 0## 6 Heat (1995) 1995 1 0 0## Comedy Crime Documentary Drama Fantasy Noir Horror Musical Mystery Romance## 1 1 0 0 0 0 0 0 0 0 0## 2 0 0 0 0 1 0 0 0 0 0## 3 1 0 0 0 0 0 0 0 0 1## 4 1 0 0 1 0 0 0 0 0 0## 5 1 0 0 0 0 0 0 0 0 0## 6 0 1 0 0 0 0 0 0 0 0## SciFi Thriller War Western AvgRating Watches## 1 0 0 0 0 4.15 2077## 2 0 0 0 0 3.20 701## 3 0 0 0 0 3.02 478## 4 0 0 0 0 2.73 170## 5 0 0 0 0 3.01 296## 6 0 1 0 0 3.88 940
# 先仿照上面的内容来个简单版的图片m <- make_comb_mat(movies, top_n_sets = 6)m## A combination matrix with 6 sets and 39 combinations.## ranges of combination set size: c(1, 1028).## mode for the combination size: distinct.## sets are on rows.## universal set is set.## ## Utility functions that can be applied:## - set_name(): name of the sets.## - set_size(): size of the sets.## - comb_name(): name of the combination sets.## - comb_size(): size of the combination sets.## - comb_degree(): degree of the combination sets.## - extract_comb(): extract elements in the specific combination set.## - t(): transpose the combination matrix on the UpSet plot.## - '[': subset the combination matrix.m1 <- m[comb_degree(m) > 0] # 提取comb_degree大于0的交集数据UpSet(m1, pt_size = unit(5, "mm"), lwd = 3, comb_col = c("#8DD3C7", "#80B1D3", "#BC80BD","#FFED6F")[comb_degree(m1)]) #绘图

# 我们来加上箱式图展示更多信息genre <- c("Action", "Romance", "Horror", "Children", "SciFi", "Documentary")m <- make_comb_mat(movies[, genre])m <- m[comb_degree(m) > 0]comb_elements <- lapply(comb_name(m), function(nm)extract_comb(m, nm))years <- lapply(comb_elements, function(ind) movies$ReleaseDate[ind])rating <- lapply(comb_elements, function(ind) movies$AvgRating[ind])watches <- lapply(comb_elements, function(ind) movies$Watches[ind])ht <- UpSet(t(m)) + rowAnnotation(years = anno_boxplot(years), rating = anno_boxplot(rating), watches = anno_boxplot(watches), gap = unit(2, "mm"))ht
# 在上面基础上再加个热图ht + Heatmap(1:19, name = "mean") + rowAnnotation(bar = anno_points(1:19))
# ComplexHeatmap所有基础函数都可以在上面继续无限叠加,用法就如上面所示,加个+号连接即可
实战:利用ChIP-seq数据绘制upset图
接下来我们来到实战环节,对基因组学数据进行upset图分析,我们利用6个H3K4me3 ChIP-seq数据进行分析,数据已经帮大家下载好了放在后台,回复关键词即可获取
首先是读取数据:
file_list <- c( "ESC" = "E016-H3K4me3.narrowPeak.gz", "ES-deriv1" = "E004-H3K4me3.narrowPeak.gz", "ES-deriv2" = "E006-H3K4me3.narrowPeak.gz", "Brain" = "E071-H3K4me3.narrowPeak.gz", "Muscle" = "E100-H3K4me3.narrowPeak.gz", "Heart" = "E104-H3K4me3.narrowPeak.gz")# 依次读取压缩文件peak_list <- lapply(file_list, function(f) { df <- read.table(f) GRanges(seqnames = df[, 1], ranges = IRanges(df[, 2], df [, 3]))})peak_list## $ESC## GRanges object with 59526 ranges and 0 metadata columns:## seqnames ranges strand## <Rle> <IRanges> <Rle>## [1] chr4 152020035-152023748 *## [2] chr14 102604285-102609086 *## [3] chr2 153571839-153577202 *## [4] chr17 42975620-42977899 *## [5] chr9 33262735-33266051 *## ... ... ... ...## [59522] chr1 180208717-180208915 *## [59523] chr11 85430326-85430575 *## [59524] chr12 38123979-38124189 *## [59525] chr4 64145-64386 *## [59526] chr4 190200451-190200831 *## -------## seqinfo: 24 sequences from an unspecified genome; no seqlengths## ## $`ES-deriv1`## GRanges object with 41041 ranges and 0 metadata columns:## seqnames ranges strand## <Rle> <IRanges> <Rle>## [1] chr7 102987916-102988841 *## [2] chr2 38976591-38978502 *## [3] chr5 137513371-137514654 *## [4] chr10 27441927-27444117 *## [5] chr17 4699460-4700239 *## ... ... ... ...## [41037] chr6 53661194-53661403 *## [41038] chr8 102093297-102093424 *## [41039] chr9 129244858-129244982 *## [41040] chr9 133309101-133309246 *## [41041] chr9 96108967-96109104 *## -------## seqinfo: 24 sequences from an unspecified genome; no seqlengths## ## $`ES-deriv2`## GRanges object with 112065 ranges and 0 metadata columns:## seqnames ranges strand## <Rle> <IRanges> <Rle>## [1] chrX 1509657-1511092 *## [2] chr6 170893078-170893758 *## [3] chr1 110090647-110091770 *## [4] chr11 10829234-10830492 *## [5] chr16 20752186-20753195 *## ... ... ... ...## [112061] chr2 64602704-64602883 *## [112062] chr3 52008235-52008394 *## [112063] chr4 7418205-7418399 *## [112064] chr5 171911609-171911804 *## [112065] chr8 2282843-2283018 *## -------## seqinfo: 25 sequences from an unspecified genome; no seqlengths## ## $Brain## GRanges object with 77143 ranges and 0 metadata columns:## seqnames ranges strand## <Rle> <IRanges> <Rle>## [1] chr6 149865499-149868052 *## [2] chr6 24665139-24668461 *## [3] chr5 74061835-74064221 *## [4] chr3 57540975-57544094 *## [5] chr2 29091196-29094080 *## ... ... ... ...## [77139] chr7 7176032-7176362 *## [77140] chr9 127054066-127054617 *## [77141] chr9 19775575-19775856 *## [77142] chr9 2412628-2412962 *## [77143] chr10 39119075-39119328 *## -------## seqinfo: 24 sequences from an unspecified genome; no seqlengths## ## $Muscle## GRanges object with 39277 ranges and 0 metadata columns:## seqnames ranges strand## <Rle> <IRanges> <Rle>## [1] chr15 93425914-93428748 *## [2] chr19 2269463-2271765 *## [3] chr13 108869901-108872314 *## [4] chr1 8481509-8484943 *## [5] chr6 163833515-163838099 *## ... ... ... ...## [39273] chr7 121940460-121940656 *## [39274] chr7 155049405-155049597 *## [39275] chr9 23850337-23850517 *## [39276] chr16 1382209-1382542 *## [39277] chr17 59572803-59572983 *## -------## seqinfo: 25 sequences from an unspecified genome; no seqlengths## ## $Heart## GRanges object with 42291 ranges and 0 metadata columns:## seqnames ranges strand## <Rle> <IRanges> <Rle>## [1] chr12 56400660-56402264 *## [2] chr19 45753661-45755905 *## [3] chr9 116637634-116639769 *## [4] chr7 139474702-139478904 *## [5] chr8 117884832-117887652 *## ... ... ... ...## [42287] chr18 20818035-20818317 *## [42288] chr2 20306661-20306896 *## [42289] chr2 230578001-230578314 *## [42290] chr5 175843764-175843954 *## [42291] chr16 2086429-2086705 *## -------## seqinfo: 24 sequences from an unspecified genome; no seqlengths
接下来是构建绘图矩阵并绘图,还是使用make_comb_mat,就是make combination matrix的缩写嘛,非常好记,集合和组合集合的大小是总碱基对或区域宽度的总和,我们只保留500kb以上的组合集:
m <- make_comb_mat(peak_list) # 构建矩阵m <- m[comb_size(m) > 500000] # 保留500kb以上的组合集UpSet(m) # 绘图
# 使用axis_param格式化坐标轴,将科学计数法转为以MB为单位UpSet(m, top_annotation = upset_top_annotation( m, axis_param = list(at = c(0, 1e7, 2e7), labels = c("0MB", "10MB", "20MB")), height = unit(4, "cm") ), right_annotation = upset_right_annotation( m, axis_param = list(at = c(0, 2e7, 4e7, 6e7), labels = c("0MB", "20MB", "40MB", "60MB"), labels_rot = 0), width = unit(4, "cm") ))
有时候我们想展示更多的信息互为关联,就像每组基因组区域,我们可以关联如平均甲基化或者最近的TSS的距离等内容,我们就可以进行图形注释来展示,这就要用到我们前面讲过的注释内容了,来看一下:
subgroup <- c("ESC" = "group1","ES-deriv1" = "group1","ES-deriv2" = "group1","Brain" = "group2","Muscle" = "group2","Heart" = "group2")comb_sets <- lapply(comb_name(m), function(nm)extract_comb(m, nm))comb_sets <- lapply(comb_sets, function(gr) { # 随机生成dist_to_tss和mean_meth gr$dist_to_tss = abs(rnorm(length(gr), mean = runif(1, min = 500, max = 2000), sd = 1000)) gr$mean_meth = abs(rnorm(length(gr), mean = 0.1, sd = 0.1)) gr})UpSet(m, top_annotation = upset_top_annotation( m, axis_param = list(at = c(0, 1e7, 2e7), labels = c("0MB", "10MB", "20MB")), height = unit(4, "cm") ), right_annotation = upset_right_annotation( m, axis_param = list(at = c(0, 2e7, 4e7, 6e7), labels = c("0MB", "20MB", "40MB", "60MB"), labels_rot = 0), width = unit(4, "cm") ), left_annotation = rowAnnotation(group = subgroup[set_name(m)], show_annotation_name = FALSE), bottom_annotation = HeatmapAnnotation( dist_to_tss = anno_boxplot(lapply(comb_sets, function(gr) gr$dist_to_tss), outline = FALSE), mean_meth = sapply(comb_sets, function(gr) mean(gr$mean_meth)), annotation_name_side = "left" ))
# 这样整个图形就绘制完成,可以看到用到的注释函数还是我们上回讲过的内容,忘了的话记得复习哦!
小结
好了,今天的内容又结束啦。今天我们学习了upset图的画法,分别从构建数据、读取数据、chip数据实战三个方面逐步递进进行介绍,内容其实并不复杂,都是建立在前面系列的基础之上,如果有哪个函数忘记了,记得通过传送门进行复习哈!
最后最后,新版的R studio来了喔,千呼万唤始出来,自从上次听到谢益辉大佬的公开介绍之后,就一直念念不忘,我在打赏营时候还跟大家聊起来说这一版本号称可以替代word,新版虽然用起来还没word那么顺手,但是优势也很明显了。比如处理复杂图形的速度更快,基本几秒钟就能够出来一个以往需要加载很久的图片,这给单细胞分析或者是甲基化分析带来了极大的便利,此外Rmarkdown的反应速度也更快,在使用“无限月读(阅读)”查件时候基本可以做到实时反应,不过也可能跟我目前尝试的内容都比较简单的关系,推荐大家去尝鲜,不需要更换R版本,只需要更新Rstudio的版本,并且也不需要重装R包,并且Rstudio还可以跑python代码哦,超级方便没错了,我想,如果office不更新换代的话,按照目前全民代码的发展趋势下去,真的有一天要被这类编译器完全取代也不一定,好啦,公众号回复“风28”就可以获得本次代码和示例数据,我们下次再见吧!*^_^*
END

撰文丨风   风
排版丨四金兄
值班 | 弘   毅
主编丨小雪球
欢迎大家关注解螺旋生信频道-挑圈联靠公号~
继续阅读
阅读原文