仅凭一个分析就能发3.5+SCI!万字长文教你彻底学会WGCNA!不分享说不过去吧?
WGCNA懒人秘籍教程——轻松看懂
大家好,我是风风。今天开始,我们要进入一个新的系列——WGCNA系列。WGCNA,中文全称是加权基因共表达网络分析,是一种比较高级的生物信息学分析技术,并且理解起来有一定难度。如果你愿意看帮助文档的话,建议直接观看帮助文档,因为没有什么教程是比官方文档更加权威的了。如果你嫌弃WGCNA的帮助文档是全英文比较难读懂的话,也可以跟着我们的系列推文,我的计划是先和大家一起走完WGCNA的官方文档,然后再根据学习的内容,进行文献复现,挑1-2篇应用了WGCNA的文章,对相应的内容进行复现。
首先我们要明白什么是WGCNA?
这里我们利用风风讲席营的例子简单说一下:假设现在新开了一个菜市场,菜市场里面有很多小摊贩,分别有1、卖肉类;2、卖海鲜;3、卖蔬菜。这些小摊贩摆摊没有任何规律,东西南北四个方向都有这三类摊贩,这时候的菜市场不方便管理对吧?作为菜市场的管理人员,我们可以把执行“类似功能”的摊贩聚在一起,让卖特定东西的摊贩在特定的区域进行贩卖,比如卖肉类的都在东边,卖海鲜的都在南边,卖蔬菜的都在北边。这样的排布可以让顾客一眼找到自己需要买的东西对应的区域。假如顾客今天要买牛肉,他只需要从东边的市场大门进去就可以找到相应的区域,在这个区域挑选自己想要的牛肉就可以了。
上面这个过程就是WGCNA的过程。我们把基因类比为摊贩,执行相同功能的基因表示卖同一类东西的摊贩;对基因进行聚类的过程就是“把执行‘类似功能’的摊贩聚在一起”;而买菜,就是关联表型和基因模块,找到和特定表型相关的基因模块。
由上述例子得知,我们可以将WGCNA简化为三个过程:
1、计算距离(相关性);
2、聚类;
实际上这是简化的三个步骤,按照官方文档的介绍,WGCNA的流程应该是下图展示的这样分为5个步骤:
关于原理部分的官方文档我放在了资料下载区,大家可以后台回复关键词后下载,建议大家不管能不能理解,都读一遍,然后结合数据实操进一步加深印象。
接下来,我们看看WGCNA的实现过程:
我们至少需要两份数据:表达矩阵和表型数据,如果你的表达矩阵还没进行注释,则还需要一份注释文件,在进行WGCNA之前,我们首先要进行数据清洗:
library(WGCNA)
options(stringsAsFactors = FALSE)
femData <- read.csv("data//LiverFemale3600.csv");
dim(femData)
names(femData)
datExpr0 <- as.data.frame(t(femData[, -c(1:8)]))
names(datExpr0) <- femData$substanceBXH
rownames(datExpr0) <- names(femData)[-c(1:8)]
gsg <- goodSamplesGenes(datExpr0, verbose = 3);
gsg$allOK
if (!gsg$allOK)
{
if (sum(!gsg$goodGenes)>0)
printFlush(paste("Removing genes:", paste(names(datExpr0)[!gsg$goodGenes], collapse = ", ")));
if (sum(!gsg$goodSamples)>0)
printFlush(paste("Removing samples:", paste(rownames(datExpr0)[!gsg$goodSamples], collapse = ", ")));
datExpr0 = datExpr0[gsg$goodSamples, gsg$goodGenes]
}
sampleTree <- hclust(dist(datExpr0), method = "average")
sizeGrWindow(12,9)
par(cex = 0.6);
par(mar = c(0,4,2,0))
plot(sampleTree,
main = "Sample clustering to detect outliers",
sub="", xlab="", cex.lab = 1.5,
cex.axis = 1.5, cex.main = 2)
abline(h = 15, col = "red")
clust <- cutreeStatic(sampleTree, cutHeight = 15, minSize = 10)
table(clust)
keepSamples <- (clust==1)
datExpr <- datExpr0[keepSamples, ]
nGenes <- ncol(datExpr)
nSamples <- nrow(datExpr)
traitData <- read.csv("data//ClinicalTraits.csv");
dim(traitData)
names(traitData)
allTraits <- traitData[, -c(31, 16)];
allTraits <- allTraits[, c(2, 11:36) ];
dim(allTraits)
names(allTraits)
femaleSamples <- rownames(datExpr);
traitRows <- match(femaleSamples, allTraits$Mice);
datTraits <- allTraits[traitRows, -1];
rownames(datTraits) = allTraits[traitRows, 1];
collectGarbage();
sampleTree2 <- hclust(dist(datExpr), method = "average")
traitColors <- numbers2colors(datTraits, signed = FALSE);
plotDendroAndColors(sampleTree2, traitColors,
groupLabels = names(datTraits),
main = "Sample dendrogram and trait heatmap")
# 保存清洗后的数据
save(datExpr, datTraits, file = "FemaleLiver-01-dataInput.RData")
对于WGCNA来说,最重要的就是构建网络并识别出相应的模块,将基因按照相似性聚集成相应的模块,就像我们前面说的菜市场的例子,把卖肉的摊贩据集在一起一样。模块识别分为两种,一种是自动识别,另外一种是逐步法构建网络并识别,这里我们先看第一种:
设置多线程进行WGCNA分析
enableWGCNAThreads()
# Allowing parallel execution with up to 11 working processes.
加载前面清洗完成的数据
lnames <- load(file = "FemaleLiver-01-dataInput.RData");
lnames
# [1] "datExpr" "datTraits"
设置软阈值区间
powers = c(c(1:10),
seq(from = 12, to = 20, by = 2))
进行拓扑网络分析识别最佳软阈值
sft = pickSoftThreshold(datExpr,
powerVector = powers,
verbose = 5)
# pickSoftThreshold: will use block size 3600.
# pickSoftThreshold: calculating connectivity for given powers...
# ..working on genes 1 through 3600 of 3600
# Power SFT.R.sq slope truncated.R.sq mean.k. median.k. max.k.
# 1 1 0.0278 0.345 0.456 747.00 762.0000 1210.0
# 2 2 0.1260 -0.597 0.843 254.00 251.0000 574.0
# 3 3 0.3400 -1.030 0.972 111.00 102.0000 324.0
# 4 4 0.5060 -1.420 0.973 56.50 47.2000 202.0
# 5 5 0.6810 -1.720 0.940 32.20 25.1000 134.0
# 6 6 0.9020 -1.500 0.962 19.90 14.5000 94.8
# 7 7 0.9210 -1.670 0.917 13.20 8.6800 84.1
# 8 8 0.9040 -1.720 0.876 9.25 5.3900 76.3
# 9 9 0.8590 -1.700 0.836 6.80 3.5600 70.5
# 10 10 0.8330 -1.660 0.831 5.19 2.3800 65.8
# 11 12 0.8530 -1.480 0.911 3.33 1.1500 58.1
# 12 14 0.8760 -1.380 0.949 2.35 0.5740 51.9
# 13 16 0.9070 -1.300 0.970 1.77 0.3090 46.8
# 14 18 0.9120 -1.240 0.973 1.39 0.1670 42.5
# 15 20 0.9310 -1.210 0.977 1.14 0.0951 38.7
powerEstimates <- sft$powerEstimate # 查看自动识别的最佳软阈值
如果上述没有自动识别最佳软阈值,则可以将结果进行可视化后尽心选取
sizeGrWindow(9, 5)
par(mfrow = c(1,2));
cex1 = 0.9;
$fitIndices[,3])*sft$fitIndices[,2], fitIndices[,1], -sign(sft
xlab="Soft Threshold (power)",ylab="Scale Free Topology Model Fit,signed R^2",type="n",
main = paste("Scale independence"));
$fitIndices[,3])*sft$fitIndices[,2], fitIndices[,1], -sign(sft
labels=powers,cex=cex1,col="red");
abline(h=0.90,col="red")
$fitIndices[,5], fitIndices[,1], sft
xlab="Soft Threshold (power)",ylab="Mean Connectivity", type="n",
main = paste("Mean connectivity"))
$fitIndices[,5], labels=powers, cex=cex1,col="red") fitIndices[,1], sft
从上图可以看到,当power选取为6的时候,左边的R2值达到了0.9,右边的Mean Connectivity也趋向于平缓,因此我们选择6为最佳软阈值
构建网络(很关键)
cor <- WGCNA::cor
net <- blockwiseModules(datExpr,
power = powerEstimates,
TOMType = "unsigned",
minModuleSize = 30,
reassignThreshold = 0,
mergeCutHeight = 0.25,
numericLabels = TRUE,
pamRespectsDendro = FALSE,
saveTOMs = TRUE,
saveTOMFileBase = "femaleMouseTOM",
verbose = 3)
# Calculating module eigengenes block-wise from all genes
# Flagging genes and samples with too many missing values...
# ..step 1
# Cluster size 3600 broken into 2108 1492
# Cluster size 2108 broken into 1126 982
# Done cluster 1126
# Done cluster 982
# Done cluster 2108
# Done cluster 1492
# ..Working on block 1 .
# TOM calculation: adjacency..
# ..will not use multithreading.
# Fraction of slow calculations: 0.396405
# ..connectivity..
# ..matrix multiplication (system BLAS)..
# ..normalization..
# ..done.
# ..saving TOM for block 1 into file femaleMouseTOM-block.1.RData
# ....clustering..
# ....detecting modules..
# ....calculating module eigengenes..
# ....checking kME in modules..
# ..removing 1 genes from module 1 because their KME is too low.
# ..removing 1 genes from module 7 because their KME is too low.
# ..removing 1 genes from module 8 because their KME is too low.
# ..removing 1 genes from module 21 because their KME is too low.
# ..merging modules that are too close..
# mergeCloseModules: Merging modules whose distance is less than 0.25
# Calculating new MEs...
cor<-stats::cor
网络可视化
sizeGrWindow(12, 9)
mergedColors = labels2colors(net$colors)
dendrograms[[1]],
mergedColors[net$blockGenes[[1]]],
"Module colors",
dendroLabels = FALSE, hang = 0.03,
addGuide = TRUE, guideHang = 0.05)
# 提取网络特征
moduleLabels = net$colors
moduleColors = labels2colors(net$colors)
MEs = net$MEs;
geneTree = net$dendrograms[[1]];
save(MEs, moduleLabels, moduleColors, geneTree,
file = "FemaleLiver-02-networkConstruction-auto.RData")
# 网络构建是整个WGCNA最关键的一步,关于逐步法和对大数据集的网络构建方法我们在后续演示,今天先完整进行WGCNA的流程
接下来我们需要关联表型信息和模块,以便找到特定表型最关键的模块:
# 加载数据
lnames = load(file = "FemaleLiver-02-networkConstruction-auto.RData");
lnames
## [1] "MEs" "moduleLabels" "moduleColors" "geneTree"
nGenes = ncol(datExpr);
nSamples = nrow(datExpr);
# 重新赋予每个ME模块颜色标签
MEs0 = moduleEigengenes(datExpr, moduleColors)$eigengenes
MEs = orderMEs(MEs0)
moduleTraitCor = cor(MEs, datTraits, use = "p");
moduleTraitPvalue = corPvalueStudent(moduleTraitCor, nSamples);
# 关联结果可视化
sizeGrWindow(10,6)
# Will display correlations and their p-values
textMatrix = paste(signif(moduleTraitCor, 2), "\n(",
signif(moduleTraitPvalue, 1), ")", sep = "");
dim(textMatrix) = dim(moduleTraitCor)
par(mar = c(6, 8.5, 3, 3));
# Display the correlation values within a heatmap plot
labeledHeatmap(Matrix = moduleTraitCor,
xLabels = names(datTraits),
yLabels = names(MEs),
ySymbols = names(MEs),
colorLabels = FALSE,
colors = blueWhiteRed(100),
textMatrix = textMatrix,
setStdMargins = FALSE,
cex.text = 0.5,
zlim = c(-1,1),
main = paste("Module-trait relationships"))
# 定义权重weight
weight = as.data.frame(datTraits$weight_g);
names(weight) = "weight"
# 模块对应的颜色标签
modNames = substring(names(MEs), 3)
# 计算模块ME与基因的相关性和p值(ModuleMembership, MM)
geneModuleMembership = as.data.frame(cor(datExpr, MEs, use = "p"));
MMPvalue = as.data.frame(corPvalueStudent(as.matrix(geneModuleMembership), nSamples));
names(geneModuleMembership) = paste("MM", modNames, sep="");
names(MMPvalue) = paste("p.MM", modNames, sep="");
# 计算基因和表型的相关性(geneTraitSignificance, GS)
geneTraitSignificance = as.data.frame(cor(datExpr, weight, use = "p"));
GSPvalue = as.data.frame(corPvalueStudent(as.matrix(geneTraitSignificance), nSamples));
names(geneTraitSignificance) = paste("GS.", names(weight), sep="");
names(GSPvalue) = paste("p.GS.", names(weight), sep="");
# 获取感兴趣模块,这里以brown模块为例
module = "brown"
column = match(module, modNames);
moduleGenes = moduleColors==module;
# 绘制GS和MM的相关性图片,也可以用ggplot2进行绘图
sizeGrWindow(7, 7);
par(mfrow = c(1,1));
verboseScatterplot(abs(geneModuleMembership[moduleGenes, column]),
abs(geneTraitSignificance[moduleGenes, 1]),
xlab = paste("Module Membership in", module, "module"),
ylab = "Gene significance for body weight",
main = paste("Module membership vs. gene significance\n"),
cex.main = 1.2, cex.lab = 1.2, cex.axis = 1.2, col = module)
annot <- read.csv(file = "data//GeneAnnotation.csv");
dim(annot)
names(annot)
probes <- names(datExpr)
probes2annot <- match(probes, annot$substanceBXH)
sum(is.na(probes2annot))
geneInfo0 <- data.frame(substanceBXH = probes,
geneSymbol = annot$gene_symbol[probes2annot],
LocusLinkID = annot$LocusLinkID[probes2annot],
moduleColor = moduleColors,
geneTraitSignificance,
GSPvalue)
modOrder <- order(-abs(cor(MEs, weight, use = "p")));
for (mod in1:ncol(geneModuleMembership))
{
oldNames <- names(geneInfo0)
geneInfo0 <- data.frame(geneInfo0, geneModuleMembership[, modOrder[mod]],
MMPvalue[, modOrder[mod]]);
names(geneInfo0) <- c(oldNames, paste("MM.", modNames[modOrder[mod]], sep=""),
paste("p.MM.", modNames[modOrder[mod]], sep=""))
}
geneOrder <- order(geneInfo0$moduleColor, -abs(geneInfo0$GS.weight));
geneInfo <- geneInfo0[geneOrder, ]
write.csv(geneInfo, file = "geneInfo.csv")
probes <- names(datExpr)
probes2annot <- match(probes, annot$substanceBXH)
allLLIDs <- annot$LocusLinkID[probes2annot]
intModules = c("brown", "red", "salmon")
for (module in intModules)
{
modGenes = (moduleColors==module)
modLLIDs = allLLIDs[modGenes];
fileName = paste("LocusLinkIDs-", module, ".txt", sep="");
write.table(as.data.frame(modLLIDs), file = fileName,
row.names = FALSE, col.names = FALSE)
}
fileName = paste("LocusLinkIDs-all.txt", sep="");
write.table(as.data.frame(allLLIDs), file = fileName,
row.names = FALSE, col.names = FALSE)
lnames = load(file = "FemaleLiver-02-networkConstruction-auto.RData");
lnames
## [1] "MEs" "moduleLabels" "moduleColors" "geneTree"
nGenes = ncol(datExpr)
nSamples = nrow(datExpr)
# 计算拓扑网络
dissTOM <- 1-TOMsimilarityFromExpr(datExpr, power = 6);
## TOM calculation: adjacency..
## ..will not use multithreading.
## Fraction of slow calculations: 0.396405
## ..connectivity..
## ..matrix multiplication (system BLAS)..
## ..normalization..
## ..done.
# 选择100个分子构建网络
nSelect = 400
# 设置随机种子
set.seed(10)
select <- sample(nGenes, size = nSelect);
selectTOM <- dissTOM[select, select];
# 重聚类
selectTree <- hclust(as.dist(selectTOM), method = "average")
selectColors <- moduleColors[select]
# 可视化
sizeGrWindow(9,9)
plotDiss = selectTOM^7;
diag(plotDiss) = NA;
TOMplot(plotDiss, selectTree, selectColors, main = "Network heatmap plot, selected genes")
# 重新计算模块eigengenes,MEs
MEs <- moduleEigengenes(datExpr, moduleColors)$eigengenes
# 从表型特征中提取体重信息
weight <- as.data.frame(datTraits$weight_g);
names(weight) = "weight"
# 添加模块特征
MET <- orderMEs(cbind(MEs, weight))
# 可视化
sizeGrWindow(5,7.5);
par(cex = 0.9)
plotEigengeneNetworks(MET, "",
marDendro = c(0,4,1,2),
marHeatmap = c(3,4,1,2),
cex.lab = 0.8, xLabelsAngle
= 90)
# 也可以去除上面的树状图
1.0) =
plotEigengeneNetworks(MET,
adjacency heatmap",
marHeatmap = c(3,4,2,2),
plotDendrograms = FALSE,
xLabelsAngle = 90)
# 有时候我们想要将数据导出后使用cytoscape进行绘制更好看的图,可以用下面的方面
# 重新计算拓扑矩阵
TOM <- TOMsimilarityFromExpr(datExpr, power = 6);
## TOM calculation: adjacency..
## ..will not use multithreading.
## Fraction of slow calculations: 0.396405
## ..connectivity..
## ..matrix multiplication (system BLAS)..
## ..normalization..
## ..done.
# 读取注释文件
annot <- read.csv(file = "data//GeneAnnotation.csv");
# 选择感兴趣的模块
modules = c("brown", "red");
# 选择模块探针
probes = names(datExpr)
inModule <- is.finite(match(moduleColors, modules));
modProbes <- probes[inModule];
modGenes <- annot$gene_symbol[match(modProbes, annot$substanceBXH)]
# 选择对应的拓扑重叠结构
modTOM = TOM[inModule, inModule];
dimnames(modTOM) = list(modProbes, modProbes)
# 导出数据
cyt <- exportNetworkToCytoscape(modTOM,
edgeFile = paste("CytoscapeInput-edges-",
paste(modules, collapse="-"), ".txt", sep=""),
nodeFile = paste("CytoscapeInput-nodes-",
paste(modules, collapse="-"), ".txt", sep=""),
weighted = TRUE,
threshold = 0.02,
nodeNames = modProbes,
altNodeNames = modGenes,
nodeAttr = moduleColors[inModule])
names(cyt)
## [1] "edgeData" "nodeData"
# 有了这两个文件,就可以使用cytoscape进行个性化绘图啦
好了,今天的内容到此结束!今天我们主要对WGCNA进行了初探,当然大家有空的话还是推荐大家阅读官方帮助文档,看看官方英文解释.关于WGCNA的其他一些内容,包括逐步构建网络和保守性分析等等内容,我们也将在后续的推文中一步一步展开。
后台回复“feng 60”获取代码和数据,我们下期见!
单细胞分析专栏传送门
碎碎念专栏传送门(完结)
风之美图系列传送门(完结)
—
END—
撰文丨风 风
排版丨四金兄
主编丨小雪球
欢迎大家关注解螺旋生信频道-挑圈联靠公号~
关键词
数据
网络
信息
模块
data.frame
最新评论
推荐文章
作者最新文章
你可能感兴趣的文章
Copyright Disclaimer: The copyright of contents (including texts, images, videos and audios) posted above belong to the User who shared or the third-party website which the User shared from. If you found your copyright have been infringed, please send a DMCA takedown notice to [email protected]. For more detail of the source, please click on the button "Read Original Post" below. For other communications, please send to [email protected].
版权声明:以上内容为用户推荐收藏至CareerEngine平台,其内容(含文字、图片、视频、音频等)及知识版权均属用户或用户转发自的第三方网站,如涉嫌侵权,请通知[email protected]进行信息删除。如需查看信息来源,请点击“查看原文”。如需洽谈其它事宜,请联系[email protected]。
版权声明:以上内容为用户推荐收藏至CareerEngine平台,其内容(含文字、图片、视频、音频等)及知识版权均属用户或用户转发自的第三方网站,如涉嫌侵权,请通知[email protected]进行信息删除。如需查看信息来源,请点击“查看原文”。如需洽谈其它事宜,请联系[email protected]。