碎碎念专栏之决策树
大家好,我是风。这里是风风的碎碎念专栏。过了腊八,马上就是“年”了,我在腊八的0点开始写这篇推文,不知道今年春节能不能回家呢?有人因为疫情无所事事,但是也有的人因为疫情空闲时间发了几篇SCI,比如某隔壁宿舍的某师兄,趁着疫情放假,竟然连发5篇论文!丧心病狂啊!为了打探这种绝世武功秘籍,我把他的文章都下了下来,仔细比对,好家伙,终于发现了他的发文秘籍,原来靠的是临床数据+决策树啊!
什么是决策树?
作为临床医生或者研究生,大家最熟悉的就是自己手上的临床数据了,要是手上的数据有随访带来的预后数据还好,可以上预后分析嘛,万一要是没有,难道就没办法了吗?当然不是!预后分析有预后分析的套路,诊断研究当然也有诊断研究的模型,比如今天说的“决策树”模型!

 什么是“决策树”呢?决策树是数据挖掘领域中的常用模型,其基本思想是对预测变量进行二元分离,从而构建一颗可用于预测新样本单元所属类别的树,常用决策树包括经典决策树和条件推断树。
经典决策树
经典决策树,以一个二元输出变量和一组预测变量为基础。这一个二元输出变量可以是良性恶性,或者生存死亡,或者复发不复发等等二分类数据,预测变量可以包括年龄性别等临床变量,或者是抽血检查结果,或者是病理上的细胞特征,我们以预测肿瘤的良恶性为例来进行讲解,具体决策树的算法过程我们不需要了解,只需要如何利用决策树对我们的数据进行分析即可。
经典决策树的构建
我们将使用rpart包的rpart()函数构建决策树,一般经典决策树的分支都比较多,我们需要使用十折交叉验证法选择预测误差最小的树,因此就需要对rpart结果进行剪枝,也就是使用prune()函数对决策树进行剪枝。下面我们以一份乳腺癌临床数据为例,通过各个变量构建决策树判别乳腺肿块的良恶性:
library(rpart)library(tidyverse)# 数据准备breast <- read.table("breast_cancer.data", sep = ",", header = FALSE, na.strings = "?") # 加载数据,数据在后台获取names(breast) <- c("ID", "Tumor_Thickness", "SizeUniformity", "shapeUniformity", "maginalAdhesion", "singleEpiCellSize", "bareNuc", "blandChr", "normalNuc", "mitosis", "class") #对数据进行重命名df <- breast[-1]df$class <- factor(df$class,levels = c(2,4), labels = c("benign", "malignant")) # 构建因子,区分良恶性train <- sample(nrow(df), 0.7*nrow(df)) # 选取总数据70%的样本构建模型训练集set.seed(2021)df.train <- df[train,]df.validata <- df[-train,] # 将剩下30%样本作为验证集table(df.train$class) # 查看训练集数据## ## benign malignant ## 330 159table(df.validata$class) # 查看验证集数据## ## benign malignant ## 128 82# 经典决策树set.seed(202101292)# 生成决策树dtree <- rpart(class ~ ., # class为预测结果,使用.代表所有变量 data = df.train, # 利用验证集构建模型 method = "class", parms = list(split = "information"))print(dtree) # 查看决策树信息## n= 489 ## ## node), split, n, loss, yval, (yprob)## * denotes terminal node## ## 1) root 489 159 benign (0.67484663 0.32515337) ## 2) SizeUniformity< 2.5 309 9 benign (0.97087379 0.02912621) *## 3) SizeUniformity>=2.5 180 30 malignant (0.16666667 0.83333333) ## 6) SizeUniformity< 4.5 68 27 malignant (0.39705882 0.60294118) ## 12) bareNuc< 3.5 26 4 benign (0.84615385 0.15384615) *## 13) bareNuc>=3.5 42 5 malignant (0.11904762 0.88095238) *## 7) SizeUniformity>=4.5 112 3 malignant (0.02678571 0.97321429) *summary(dtree) # 查看决策树具体信息## Call:## rpart(formula = class ~ ., data = df.train, method = "class", ## parms = list(split = "information"))## n= 489 ## ## CP nsplit rel error xerror xstd## 1 0.75471698 0 1.0000000 1.0000000 0.06514843## 2 0.05660377 1 0.2452830 0.3081761 0.04176119## 3 0.01000000 3 0.1320755 0.1761006 0.03231305## ## Variable importance## SizeUniformity shapeUniformity bareNuc singleEpiCellSize ## 23 17 16 15 ## blandChr normalNuc maginalAdhesion Tumor_Thickness ## 15 14 1 1 ## ## Node number 1: 489 observations, complexity param=0.754717## predicted class=benign expected loss=0.3251534 P(node) =1## class counts: 330 159## probabilities: 0.675 0.325 ## left son=2 (309 obs) right son=3 (180 obs)## Primary splits:## SizeUniformity < 2.5 to the left, improve=186.6152, (0 missing)## shapeUniformity < 2.5 to the left, improve=177.5238, (0 missing)## bareNuc < 2.5 to the left, improve=176.0200, (11 missing)## blandChr < 3.5 to the left, improve=172.9661, (0 missing)## normalNuc < 2.5 to the left, improve=149.5745, (0 missing)## Surrogate splits:## shapeUniformity < 3.5 to the left, agree=0.914, adj=0.767, (0 split)## singleEpiCellSize < 2.5 to the left, agree=0.885, adj=0.689, (0 split)## blandChr < 3.5 to the left, agree=0.881, adj=0.678, (0 split)## normalNuc < 2.5 to the left, agree=0.877, adj=0.667, (0 split)## bareNuc < 2.5 to the left, agree=0.871, adj=0.650, (0 split)## ## Node number 2: 309 observations## predicted class=benign expected loss=0.02912621 P(node) =0.6319018## class counts: 300 9## probabilities: 0.971 0.029 ## ## Node number 3: 180 observations, complexity param=0.05660377## predicted class=malignant expected loss=0.1666667 P(node) =0.3680982## class counts: 30 150## probabilities: 0.167 0.833 ## left son=6 (68 obs) right son=7 (112 obs)## Primary splits:## SizeUniformity < 4.5 to the left, improve=21.59943, (0 missing)## bareNuc < 3.5 to the left, improve=21.24393, (2 missing)## blandChr < 3.5 to the left, improve=20.17688, (0 missing)## shapeUniformity < 3.5 to the left, improve=16.80110, (0 missing)## Tumor_Thickness < 6.5 to the left, improve=15.72829, (0 missing)## Surrogate splits:## shapeUniformity < 5.5 to the left, agree=0.789, adj=0.441, (0 split)## singleEpiCellSize < 3.5 to the left, agree=0.750, adj=0.338, (0 split)## blandChr < 3.5 to the left, agree=0.711, adj=0.235, (0 split)## maginalAdhesion < 2.5 to the left, agree=0.706, adj=0.221, (0 split)## bareNuc < 1.5 to the left, agree=0.672, adj=0.132, (0 split)## ## Node number 6: 68 observations, complexity param=0.05660377## predicted class=malignant expected loss=0.3970588 P(node) =0.1390593## class counts: 27 41## probabilities: 0.397 0.603 ## left son=12 (26 obs) right son=13 (42 obs)## Primary splits:## bareNuc < 3.5 to the left, improve=18.806650, (1 missing)## Tumor_Thickness < 6.5 to the left, improve=10.921810, (0 missing)## blandChr < 3.5 to the left, improve= 8.433749, (0 missing)## shapeUniformity < 2.5 to the left, improve= 8.035364, (0 missing)## maginalAdhesion < 6.5 to the left, improve= 6.901954, (0 missing)## Surrogate splits:## shapeUniformity < 2.5 to the left, agree=0.731, adj=0.308, (1 split)## Tumor_Thickness < 4.5 to the left, agree=0.716, adj=0.269, (0 split)## maginalAdhesion < 2.5 to the left, agree=0.687, adj=0.192, (0 split)## blandChr < 2.5 to the left, agree=0.672, adj=0.154, (0 split)## normalNuc < 2.5 to the left, agree=0.672, adj=0.154, (0 split)## ## Node number 7: 112 observations## predicted class=malignant expected loss=0.02678571 P(node) =0.2290389## class counts: 3 109## probabilities: 0.027 0.973 ## ## Node number 12: 26 observations## predicted class=benign expected loss=0.1538462 P(node) =0.05316973## class counts: 22 4## probabilities: 0.846 0.154 ## ## Node number 13: 42 observations## predicted class=malignant expected loss=0.1190476 P(node) =0.08588957## class counts: 5 37## probabilities: 0.119 0.881dtree$cptable # 查看cptable选择最佳决策树## CP nsplit rel error xerror xstd## 1 0.75471698 0 1.0000000 1.0000000 0.06514843## 2 0.05660377 1 0.2452830 0.3081761 0.04176119## 3 0.01000000 3 0.1320755 0.1761006 0.03231305

现在我们已经构建了一个决策树模型,并且知道了这个决策树的信息,但是该怎么根据结果对原始决策树进行剪枝呢?我们只需要看cptable中的结果,以我们这个模型为例,CP是复杂度参数,用于惩罚过大的树;树的大小就是分支数nsplit,有n个分支的树就有n+1个终端节点;rel error即真实误差,也可以理解为原始误差;xerror就是进行十折交叉验证后的误差,可以理解为矫正后的误差;xstd就是交叉验证误差的标准差。

   知道了每个参数后,我们对得到的模型进行解读,在我们的模型中,最小的交叉验证误差是0.1358025,标准误差是0.02829439,那么我们的决策树模型的最优的树应该是交叉验证误差在0.1358025±0.02829439之间的树,也就是0.1075081-0.1640969之间,那回到cptable中,找到这个区间的值就是0.1358025,对应的CP值是0.0100000,nsplit为4,也就是我们的最优树应该是有4个分支,5个终端节点,得到这些信息后我们就要可视化我们的最优树了:
# 绘制一个十折交叉图形验证我们上面判断的结果plotcp(dtree) #发现结果一致,选择CP值为0.0100000

dtree.pruned <-prune(dtree, cp = .0100)# 利用rpart.plot包的prp()函数展示决策树library(rpart.plot)prp(dtree.pruned, type = 2,extra = 104,fallen.leaves = TRUE,main = "Decision Tree")

# 上面的图片风格不喜欢的话,可以接着用下面这种美化后的风格library(partykit)plot(as.party(dtree.pruned))

# 接着利用外部验证集结果验证决策树模型的可靠性dtree.pred <- predict(dtree.pruned, df.validata,type = "class")dtree.perf <- table(df.validata$class, dtree.pred, dnn = c("Actual", "Predicted"))dtree.perf## Predicted## Actual benign malignant## benign 124 4## malignant 9 73

这样经典决策树模型就构建完成,难度不大,但是竞争力很强,看多了风险预测模型,来个决策树模型在最后给文章提亮一下好像也还不错是吧?那能不能再高端一点呢?当然可以,我们接着看条件推断树.
条件推断树的构建
条件推断树和经典决策树类似,区别在于算法不同,条件推断树的变量和分割的选取是基于显著性检验,而不是同质性或者纯净度这一类的变量,我们同样直接看如何操作,这一部分代码非常简单:
library(party)fit.ctree <- ctree(class ~., data = df.train) # 使用ctree构建模型print(fit.ctree)## ## Conditional inference tree with 7 terminal nodes## ## Response: class ## Inputs: Tumor_Thickness, SizeUniformity, shapeUniformity, maginalAdhesion, singleEpiCellSize, bareNuc, blandChr, normalNuc, mitosis ## Number of observations: 489 ## ## 1) bareNuc <= 2; criterion = 1, statistic = 333.722## 2) SizeUniformity <= 3; criterion = 1, statistic = 226.573## 3) mitosis <= 1; criterion = 1, statistic = 79.262## 4)* weights = 295 ## 3) mitosis > 1## 5)* weights = 8 ## 2) SizeUniformity > 3## 6)* weights = 16 ## 1) bareNuc > 2## 7) blandChr <= 3; criterion = 1, statistic = 42.137## 8) Tumor_Thickness <= 6; criterion = 1, statistic = 20.794## 9)* weights = 26 ## 8) Tumor_Thickness > 6## 10)* weights = 15 ## 7) blandChr > 3## 11) bareNuc <= 8; criterion = 0.991, statistic = 10.744## 12)* weights = 40 ## 11) bareNuc > 8## 13)* weights = 89summary(fit.ctree)## Length Class Mode ## 1 BinaryTree S4plot(fit.ctree) # 绘制条件推断树

# 利用验证集对数据进行验证ctree.pred <- predict(fit.ctree, df.validata,type = "response")ctree.perf <- table(df.validata$class, ctree.pred, dnn = c("Actual","Predicted"))ctree.perf## Predicted## Actual benign malignant## benign 124 4## malignant 7 75

凌晨两点,我写完了这个推文,都说深夜容易引发人的感慨,此言诚不我欺!我竟不禁想到几年前我另一位师兄问我会不会做决策树的时候,我还只能问他决策树是什么东东/(ㄒoㄒ)/~~ 那可能今天我不会的东西,过完春节我就会了?哎深夜发梦胡思乱想胡言乱语语无伦次想得美了!今天就到这里,大家晚安,后台回复“feng34”获得本次代码和数据,那我们下次再见吧!*^_^*
碎碎念专栏传送门
风之美图系列传送门
END

撰文丨风   风
排版丨四金兄
值班 | 先锋宇
主编丨小雪球
欢迎大家关注解螺旋生信频道-挑圈联靠公号~

继续阅读
阅读原文