WGCNA懒人秘籍教程——轻松看懂
大家好,我是风风。今天开始,我们要进入一个新的系列——WGCNA系列。WGCNA,中文全称是加权基因共表达网络分析,是一种比较高级的生物信息学分析技术,并且理解起来有一定难度。如果你愿意看帮助文档的话,建议直接观看帮助文档,因为没有什么教程是比官方文档更加权威的了。如果你嫌弃WGCNA的帮助文档是全英文比较难读懂的话,也可以跟着我们的系列推文,我的计划是先和大家一起走完WGCNA的官方文档,然后再根据学习的内容,进行文献复现,挑1-2篇应用了WGCNA的文章,对相应的内容进行复现。
首先我们要明白什么是WGCNA?
这里我们利用风风讲席营的例子简单说一下:假设现在新开了一个菜市场,菜市场里面有很多小摊贩,分别有1、卖肉类;2、卖海鲜;3、卖蔬菜。这些小摊贩摆摊没有任何规律,东西南北四个方向都有这三类摊贩,这时候的菜市场不方便管理对吧?作为菜市场的管理人员,我们可以把执行“类似功能”的摊贩聚在一起,让卖特定东西的摊贩在特定的区域进行贩卖,比如卖肉类的都在东边,卖海鲜的都在南边,卖蔬菜的都在北边。这样的排布可以让顾客一眼找到自己需要买的东西对应的区域。假如顾客今天要买牛肉,他只需要从东边的市场大门进去就可以找到相应的区域,在这个区域挑选自己想要的牛肉就可以了。
上面这个过程就是WGCNA的过程。我们把基因类比为摊贩,执行相同功能的基因表示卖同一类东西的摊贩;对基因进行聚类的过程就是“把执行‘类似功能’的摊贩聚在一起”;而买菜,就是关联表型和基因模块,找到和特定表型相关的基因模块。
由上述例子得知,我们可以将WGCNA简化为三个过程:
1、计算距离(相关性);
2、聚类;
3、关联表型和基因模块。
实际上这是简化的三个步骤,按照官方文档的介绍,WGCNA的流程应该是下图展示的这样分为5个步骤:
关于原理部分的官方文档我放在了资料下载区,大家可以后台回复关键词后下载,建议大家不管能不能理解,都读一遍,然后结合数据实操进一步加深印象。
接下来,我们看看WGCNA的实现过程:
数据清洗
我们至少需要两份数据:表达矩阵和表型数据,如果你的表达矩阵还没进行注释,则还需要一份注释文件,在进行WGCNA之前,我们首先要进行数据清洗:
library(WGCNA)options(stringsAsFactors = FALSE)# 加载LiverFemale3600数据集femData <- read.csv("data//LiverFemale3600.csv");# 查看数据dim(femData)## [1] 3600 143names(femData)## [1] "substanceBXH""gene_symbol""LocusLinkID""ProteomeID"## [5] "cytogeneticLoc""CHROMOSOME""StartPosition""EndPosition"## [9] "F2_2""F2_3""F2_14""F2_15"## [13] "F2_19""F2_20""F2_23""F2_24"## [17] "F2_26""F2_37""F2_42""F2_43"## [21] "F2_45""F2_46""F2_47""F2_48"## [25] "F2_51""F2_52""F2_54""F2_63"## [29] "F2_65""F2_66""F2_68""F2_69"## [33] "F2_70""F2_71""F2_72""F2_78"## [37] "F2_79""F2_80""F2_81""F2_83"## [41] "F2_86""F2_87""F2_88""F2_89"## [45] "F2_107""F2_108""F2_109""F2_110"## [49] "F2_111""F2_112""F2_117""F2_119"## [53] "F2_125""F2_126""F2_127""F2_141"## [57] "F2_142""F2_143""F2_144""F2_145"## [61] "F2_154""F2_155""F2_156""F2_157"## [65] "F2_162""F2_163""F2_164""F2_165"## [69] "F2_166""F2_167""F2_169""F2_180"## [73] "F2_181""F2_182""F2_187""F2_188"## [77] "F2_189""F2_190""F2_191""F2_192"## [81] "F2_194""F2_195""F2_200""F2_201"## [85] "F2_212""F2_213""F2_214""F2_215"## [89] "F2_221""F2_222""F2_223""F2_224"## [93] "F2_225""F2_226""F2_227""F2_228"## [97] "F2_241""F2_242""F2_243""F2_244"## [101] "F2_245""F2_247""F2_248""F2_261"## [105] "F2_263""F2_264""F2_270""F2_271"## [109] "F2_272""F2_278""F2_287""F2_288"## [113] "F2_289""F2_290""F2_291""F2_296"## [117] "F2_298""F2_299""F2_300""F2_302"## [121] "F2_303""F2_304""F2_305""F2_306"## [125] "F2_307""F2_308""F2_309""F2_310"## [129] "F2_311""F2_312""F2_320""F2_321"## [133] "F2_323""F2_324""F2_325""F2_326"## [137] "F2_327""F2_328""F2_329""F2_330"## [141] "F2_332""F2_355""F2_357"# 提取表达矩阵并行列转置,将行名变为样本名datExpr0 <- as.data.frame(t(femData[, -c(1:8)]))names(datExpr0) <- femData$substanceBXHrownames(datExpr0) <- names(femData)[-c(1:8)]# 判断矩阵的样本是否都合格?gsg <- goodSamplesGenes(datExpr0, verbose = 3);## Flagging genes and samples with too many missing values...## ..step 1gsg$allOK # 显示TURE则为都合格,不需要进行额外处理,如果是FALSE,则运行下面这部分代码## [1] TRUEif (!gsg$allOK){# Optionally, print the gene and sample names that were removed:if (sum(!gsg$goodGenes)>0) printFlush(paste("Removing genes:", paste(names(datExpr0)[!gsg$goodGenes], collapse = ", ")));if (sum(!gsg$goodSamples)>0) printFlush(paste("Removing samples:", paste(rownames(datExpr0)[!gsg$goodSamples], collapse = ", ")));# Remove the offending genes and samples from the data: datExpr0 = datExpr0[gsg$goodSamples, gsg$goodGenes]}# 对样本进行聚类,dist为计算距离,hclust为聚类sampleTree <- hclust(dist(datExpr0), method = "average")# 聚类结果可视化sizeGrWindow(12,9)par(cex = 0.6);par(mar = c(0,4,2,0))plot(sampleTree, main = "Sample clustering to detect outliers", sub="", xlab="", cex.lab = 1.5,     cex.axis = 1.5, cex.main = 2)# 根据聚类结果,发现最左边的F2_221游离于其他样本之外,因此对其进行剪切去除# 绘制切割水平线abline(h = 15, col = "red")

# 确定集群clust <- cutreeStatic(sampleTree, cutHeight = 15, minSize = 10)table(clust)## clust## 0 1 ## 1 134# 我们需要保留clust 1的样本,去除clust 0的样本keepSamples <- (clust==1)datExpr <- datExpr0[keepSamples, ]nGenes <- ncol(datExpr)nSamples <- nrow(datExpr)# 读取表型数据traitData <- read.csv("data//ClinicalTraits.csv");dim(traitData)## [1] 361 38names(traitData)## [1] "X""Mice""Number"## [4] "Mouse_ID""Strain""sex"## [7] "DOB""parents""Western_Diet"## [10] "Sac_Date""weight_g""length_cm"## [13] "ab_fat""other_fat""total_fat"## [16] "comments""X100xfat_weight""Trigly"## [19] "Total_Chol""HDL_Chol""UC"## [22] "FFA""Glucose""LDL_plus_VLDL"## [25] "MCP_1_phys""Insulin_ug_l""Glucose_Insulin"## [28] "Leptin_pg_ml""Adiponectin""Aortic.lesions"## [31] "Note""Aneurysm""Aortic_cal_M"## [34] "Aortic_cal_L""CoronaryArtery_Cal""Myocardial_cal"## [37] "BMD_all_limbs""BMD_femurs_only"# 去掉多余的信息(不要的信息)allTraits <- traitData[, -c(31, 16)];allTraits <- allTraits[, c(2, 11:36) ];dim(allTraits)## [1] 361 27names(allTraits)## [1] "Mice""weight_g""length_cm"## [4] "ab_fat""other_fat""total_fat"## [7] "X100xfat_weight""Trigly""Total_Chol"## [10] "HDL_Chol""UC""FFA"## [13] "Glucose""LDL_plus_VLDL""MCP_1_phys"## [16] "Insulin_ug_l""Glucose_Insulin""Leptin_pg_ml"## [19] "Adiponectin""Aortic.lesions""Aneurysm"## [22] "Aortic_cal_M""Aortic_cal_L""CoronaryArtery_Cal"## [25] "Myocardial_cal""BMD_all_limbs""BMD_femurs_only"# 合并表达矩阵和表型数据femaleSamples <- rownames(datExpr);traitRows <- match(femaleSamples, allTraits$Mice);datTraits <- allTraits[traitRows, -1];rownames(datTraits) = allTraits[traitRows, 1];collectGarbage();# 对样本进行重聚类sampleTree2 <- hclust(dist(datExpr), method = "average")# 设置颜色traitColors <- numbers2colors(datTraits, signed = FALSE);# 聚类结果可视化plotDendroAndColors(sampleTree2, traitColors, groupLabels = names(datTraits), main = "Sample dendrogram and trait heatmap")

# 保存清洗后的数据save(datExpr, datTraits, file = "FemaleLiver-01-dataInput.RData")
构建网络并进行模块识别
对于WGCNA来说,最重要的就是构建网络并识别出相应的模块,将基因按照相似性聚集成相应的模块,就像我们前面说的菜市场的例子,把卖肉的摊贩据集在一起一样。模块识别分为两种,一种是自动识别,另外一种是逐步法构建网络并识别,这里我们先看第一种:
# 设置多线程进行WGCNA分析enableWGCNAThreads()## Allowing parallel execution with up to 11 working processes.# 加载前面清洗完成的数据lnames <- load(file = "FemaleLiver-01-dataInput.RData");lnames## [1] "datExpr" "datTraits"# 设置软阈值区间powers = c(c(1:10), seq(from = 12, to = 20, by = 2))# 进行拓扑网络分析识别最佳软阈值sft = pickSoftThreshold(datExpr, powerVector = powers, verbose = 5)## pickSoftThreshold: will use block size 3600.## pickSoftThreshold: calculating connectivity for given powers...## ..working on genes 1 through 3600 of 3600## Power SFT.R.sq slope truncated.R.sq mean.k. median.k. max.k.## 1 1 0.0278 0.345 0.456 747.00 762.0000 1210.0## 2 2 0.1260 -0.597 0.843 254.00 251.0000 574.0## 3 3 0.3400 -1.030 0.972 111.00 102.0000 324.0## 4 4 0.5060 -1.420 0.973 56.50 47.2000 202.0## 5 5 0.6810 -1.720 0.940 32.20 25.1000 134.0## 6 6 0.9020 -1.500 0.962 19.90 14.5000 94.8## 7 7 0.9210 -1.670 0.917 13.20 8.6800 84.1## 8 8 0.9040 -1.720 0.876 9.25 5.3900 76.3## 9 9 0.8590 -1.700 0.836 6.80 3.5600 70.5## 10 10 0.8330 -1.660 0.831 5.19 2.3800 65.8## 11 12 0.8530 -1.480 0.911 3.33 1.1500 58.1## 12 14 0.8760 -1.380 0.949 2.35 0.5740 51.9## 13 16 0.9070 -1.300 0.970 1.77 0.3090 46.8## 14 18 0.9120 -1.240 0.973 1.39 0.1670 42.5## 15 20 0.9310 -1.210 0.977 1.14 0.0951 38.7powerEstimates <- sft$powerEstimate # 查看自动识别的最佳软阈值# 如果上述没有自动识别最佳软阈值,则可以将结果进行可视化后尽心选取sizeGrWindow(9, 5)par(mfrow = c(1,2));cex1 = 0.9;plot(sft$fitIndices[,1], -sign(sft$fitIndices[,3])*sft$fitIndices[,2], xlab="Soft Threshold (power)",ylab="Scale Free Topology Model Fit,signed R^2",type="n", main = paste("Scale independence"));text(sft$fitIndices[,1], -sign(sft$fitIndices[,3])*sft$fitIndices[,2], labels=powers,cex=cex1,col="red");abline(h=0.90,col="red")plot(sft$fitIndices[,1], sft$fitIndices[,5], xlab="Soft Threshold (power)",ylab="Mean Connectivity", type="n", main = paste("Mean connectivity"))text(sft$fitIndices[,1], sft$fitIndices[,5], labels=powers, cex=cex1,col="red")

# 从上图可以看到,当power选取为6的时候,左边的R2值达到了0.9,右边的Mean Connectivity也趋向于平缓,因此我们选择6为最佳软阈值# 构建网络(很关键)cor <- WGCNA::cornet <- blockwiseModules(datExpr, power = powerEstimates, TOMType = "unsigned", minModuleSize = 30, reassignThreshold = 0, mergeCutHeight = 0.25, numericLabels = TRUE, pamRespectsDendro = FALSE, saveTOMs = TRUE, saveTOMFileBase = "femaleMouseTOM", verbose = 3)## Calculating module eigengenes block-wise from all genes## Flagging genes and samples with too many missing values...## ..step 1## Cluster size 3600 broken into 2108 1492 ## Cluster size 2108 broken into 1126 982 ## Done cluster 1126 ## Done cluster 982 ## Done cluster 2108 ## Done cluster 1492 ## ..Working on block 1 .## TOM calculation: adjacency..## ..will not use multithreading.## Fraction of slow calculations: 0.396405## ..connectivity..## ..matrix multiplication (system BLAS)..## ..normalization..## ..done.## ..saving TOM for block 1 into file femaleMouseTOM-block.1.RData## ....clustering..## ....detecting modules..## ....calculating module eigengenes..## ....checking kME in modules..## ..removing 1 genes from module 1 because their KME is too low.## ..removing 1 genes from module 7 because their KME is too low.## ..removing 1 genes from module 8 because their KME is too low.## ..removing 1 genes from module 21 because their KME is too low.## ..merging modules that are too close..## mergeCloseModules: Merging modules whose distance is less than 0.25## Calculating new MEs...cor<-stats::cor# 网络可视化sizeGrWindow(12, 9)mergedColors = labels2colors(net$colors)plotDendroAndColors(net$dendrograms[[1]], mergedColors[net$blockGenes[[1]]], "Module colors", dendroLabels = FALSE, hang = 0.03, addGuide = TRUE, guideHang = 0.05)


# 提取网络特征moduleLabels = net$colorsmoduleColors = labels2colors(net$colors)MEs = net$MEs;geneTree = net$dendrograms[[1]];save(MEs, moduleLabels, moduleColors, geneTree, file = "FemaleLiver-02-networkConstruction-auto.RData")# 网络构建是整个WGCNA最关键的一步,关于逐步法和对大数据集的网络构建方法我们在后续演示,今天先完整进行WGCNA的流程
关联表型信息和模块
接下来我们需要关联表型信息和模块,以便找到特定表型最关键的模块:
# 加载数据lnames = load(file = "FemaleLiver-02-networkConstruction-auto.RData");lnames## [1] "MEs" "moduleLabels" "moduleColors" "geneTree"nGenes = ncol(datExpr);nSamples = nrow(datExpr);# 重新赋予每个ME模块颜色标签MEs0 = moduleEigengenes(datExpr, moduleColors)$eigengenesMEs = orderMEs(MEs0)moduleTraitCor = cor(MEs, datTraits, use = "p");moduleTraitPvalue = corPvalueStudent(moduleTraitCor, nSamples);# 关联结果可视化sizeGrWindow(10,6)# Will display correlations and their p-valuestextMatrix = paste(signif(moduleTraitCor, 2), "\n(", signif(moduleTraitPvalue, 1), ")", sep = "");dim(textMatrix) = dim(moduleTraitCor)par(mar = c(6, 8.5, 3, 3));# Display the correlation values within a heatmap plotlabeledHeatmap(Matrix = moduleTraitCor, xLabels = names(datTraits), yLabels = names(MEs), ySymbols = names(MEs), colorLabels = FALSE, colors = blueWhiteRed(100), textMatrix = textMatrix, setStdMargins = FALSE, cex.text = 0.5, zlim = c(-1,1), main = paste("Module-trait relationships"))

# 定义权重weightweight = as.data.frame(datTraits$weight_g);names(weight) = "weight"# 模块对应的颜色标签modNames = substring(names(MEs), 3)# 计算模块ME与基因的相关性和p值(ModuleMembership, MM)geneModuleMembership = as.data.frame(cor(datExpr, MEs, use = "p"));MMPvalue = as.data.frame(corPvalueStudent(as.matrix(geneModuleMembership), nSamples));names(geneModuleMembership) = paste("MM", modNames, sep="");names(MMPvalue) = paste("p.MM", modNames, sep="");# 计算基因和表型的相关性(geneTraitSignificance, GS)geneTraitSignificance = as.data.frame(cor(datExpr, weight, use = "p"));GSPvalue = as.data.frame(corPvalueStudent(as.matrix(geneTraitSignificance), nSamples));names(geneTraitSignificance) = paste("GS.", names(weight), sep="");names(GSPvalue) = paste("p.GS.", names(weight), sep="");# 获取感兴趣模块,这里以brown模块为例module = "brown"column = match(module, modNames);moduleGenes = moduleColors==module;# 绘制GS和MM的相关性图片,也可以用ggplot2进行绘图sizeGrWindow(7, 7);par(mfrow = c(1,1));verboseScatterplot(abs(geneModuleMembership[moduleGenes, column]), abs(geneTraitSignificance[moduleGenes, 1]), xlab = paste("Module Membership in", module, "module"), ylab = "Gene significance for body weight", main = paste("Module membership vs. gene significance\n"), cex.main = 1.2, cex.lab = 1.2, cex.axis = 1.2, col = module)
# 提取基因注释文件,对基因进行注释annot <- read.csv(file = "data//GeneAnnotation.csv");dim(annot)## [1] 23388 34names(annot)## [1] "X""ID"## [3] "arrayname""substanceBXH"## [5] "gene_symbol""LocusLinkID"## [7] "OfficialGeneSymbol""OfficialGeneName"## [9] "LocusLinkSymbol""LocusLinkName"## [11] "ProteomeShortDescription""UnigeneCluster"## [13] "LocusLinkCode""ProteomeID"## [15] "ProteomeCode""SwissprotID"## [17] "OMIMCode""DirectedTilingPriority"## [19] "AlternateSymbols""AlternateNames"## [21] "SpeciesID""cytogeneticLoc"## [23] "Organism""clustername"## [25] "reporterid""probeid"## [27] "sequenceid""clusterid"## [29] "chromosome""startcoordinate"## [31] "endcoordinate""strand"## [33] "sequence_3_to_5_prime""sequence_5_to_3_prime"probes <- names(datExpr)probes2annot <- match(probes, annot$substanceBXH)# 没有探针注释的分子数目sum(is.na(probes2annot))## [1] 0# 说明每一个探针都可以被注释# 进行注释geneInfo0 <- data.frame(substanceBXH = probes, geneSymbol = annot$gene_symbol[probes2annot], LocusLinkID = annot$LocusLinkID[probes2annot], moduleColor = moduleColors, geneTraitSignificance, GSPvalue)# 按照权重排序modOrder <- order(-abs(cor(MEs, weight, use = "p")));for (mod in1:ncol(geneModuleMembership)){ oldNames <- names(geneInfo0) geneInfo0 <- data.frame(geneInfo0, geneModuleMembership[, modOrder[mod]], MMPvalue[, modOrder[mod]]); names(geneInfo0) <- c(oldNames, paste("MM.", modNames[modOrder[mod]], sep=""), paste("p.MM.", modNames[modOrder[mod]], sep=""))}# 首先按模块颜色对geneInfo变量中的基因排序,然后再按geneTraitSignificance排序geneOrder <- order(geneInfo0$moduleColor, -abs(geneInfo0$GS.weight));geneInfo <- geneInfo0[geneOrder, ]write.csv(geneInfo, file = "geneInfo.csv")# 对探针进行注释probes <- names(datExpr)probes2annot <- match(probes, annot$substanceBXH)allLLIDs <- annot$LocusLinkID[probes2annot]# 选择感兴趣的模块并提取相应矩阵intModules = c("brown", "red", "salmon")for (module in intModules){# Select module probes modGenes = (moduleColors==module)# Get their entrez ID codes modLLIDs = allLLIDs[modGenes];# Write them into a file fileName = paste("LocusLinkIDs-", module, ".txt", sep=""); write.table(as.data.frame(modLLIDs), file = fileName, row.names = FALSE, col.names = FALSE)}fileName = paste("LocusLinkIDs-all.txt", sep="");write.table(as.data.frame(allLLIDs), file = fileName, row.names = FALSE, col.names = FALSE)# 提取了感兴趣的模块基因之后,就可以进行相应的GO和KEGG等分析内容了,这里我们先不做展示
进行网络可视化
lnames = load(file = "FemaleLiver-02-networkConstruction-auto.RData");lnames## [1] "MEs" "moduleLabels" "moduleColors" "geneTree"nGenes = ncol(datExpr)nSamples = nrow(datExpr)# 计算拓扑网络dissTOM <- 1-TOMsimilarityFromExpr(datExpr, power = 6);## TOM calculation: adjacency..## ..will not use multithreading.## Fraction of slow calculations: 0.396405## ..connectivity..## ..matrix multiplication (system BLAS)..## ..normalization..## ..done.# 选择100个分子构建网络nSelect = 400# 设置随机种子set.seed(10)select <- sample(nGenes, size = nSelect);selectTOM <- dissTOM[select, select];# 重聚类selectTree <- hclust(as.dist(selectTOM), method = "average")selectColors <- moduleColors[select]# 可视化sizeGrWindow(9,9)plotDiss = selectTOM^7;diag(plotDiss) = NA;TOMplot(plotDiss, selectTree, selectColors, main = "Network heatmap plot, selected genes")

# 重新计算模块eigengenes,MEsMEs <- moduleEigengenes(datExpr, moduleColors)$eigengenes# 从表型特征中提取体重信息weight <- as.data.frame(datTraits$weight_g);names(weight) = "weight"# 添加模块特征MET <- orderMEs(cbind(MEs, weight))# 可视化sizeGrWindow(5,7.5);par(cex = 0.9)plotEigengeneNetworks(MET, "", marDendro = c(0,4,1,2), marHeatmap = c(3,4,1,2), cex.lab = 0.8, xLabelsAngle                     = 90)

# 也可以去除上面的树状图par(cex = 1.0)plotEigengeneNetworks(MET,"Eigengeneadjacency heatmap", marHeatmap = c(3,4,2,2),plotDendrograms = FALSE, xLabelsAngle = 90)

# 有时候我们想要将数据导出后使用cytoscape进行绘制更好看的图,可以用下面的方面# 重新计算拓扑矩阵TOM <- TOMsimilarityFromExpr(datExpr, power = 6);## TOM calculation: adjacency..## ..will not use multithreading.## Fraction of slow calculations: 0.396405## ..connectivity..## ..matrix multiplication (system BLAS)..## ..normalization..## ..done.# 读取注释文件annot <- read.csv(file = "data//GeneAnnotation.csv");# 选择感兴趣的模块modules = c("brown", "red");# 选择模块探针probes = names(datExpr)inModule <- is.finite(match(moduleColors, modules));modProbes <- probes[inModule];modGenes <- annot$gene_symbol[match(modProbes, annot$substanceBXH)]# 选择对应的拓扑重叠结构modTOM = TOM[inModule, inModule];dimnames(modTOM) = list(modProbes, modProbes)# 导出数据cyt <- exportNetworkToCytoscape(modTOM, edgeFile = paste("CytoscapeInput-edges-", paste(modules, collapse="-"), ".txt", sep=""), nodeFile = paste("CytoscapeInput-nodes-", paste(modules, collapse="-"), ".txt", sep=""), weighted = TRUE, threshold = 0.02, nodeNames = modProbes, altNodeNames = modGenes, nodeAttr = moduleColors[inModule])names(cyt)## [1] "edgeData" "nodeData"# 有了这两个文件,就可以使用cytoscape进行个性化绘图啦
好了,今天的内容到此结束!今天我们主要对WGCNA进行了初探,当然大家有空的话还是推荐大家阅读官方帮助文档,看看官方英文解释.关于WGCNA的其他一些内容,包括逐步构建网络和保守性分析等等内容,我们也将在后续的推文中一步一步展开。
后台回复“feng 60”获取代码和数据,我们下期见!
单细胞分析专栏传送门
碎碎念专栏传送门(完结)
风之美图系列传送门(完结)
END

撰文丨风   风
排版丨四金兄
主编丨小雪球
欢迎大家关注解螺旋生信频道-挑圈联靠公号~

继续阅读
阅读原文