How did a senior publish five papers in one year from clinical data alone? It turns out this is the trick! (Worth bookmarking!)
Ramblings Column: Decision Trees
Hi everyone, I'm Feng, and this is Feng Feng's ramblings column. Laba has passed, so the Spring Festival is just around the corner; I started writing this post at midnight on Laba, wondering whether I'll make it home this year. The pandemic has left some people idle, but others have used the spare time to publish several SCI papers. A senior in the dorm next door, for instance, managed to publish five papers during the pandemic break! Outrageous! To uncover this legendary secret technique, I downloaded all of his papers and compared them carefully. Sure enough, I found his recipe: clinical data plus decision trees!
As clinicians or graduate students, the data we know best is our own clinical data. If it comes with follow-up outcomes, great: we can run a prognosis analysis. But what if it doesn't? Are we stuck? Of course not! Prognosis studies have their own playbook, and diagnostic studies have their own models too, such as today's topic: the "decision tree" model!
So what is a "decision tree"? The decision tree is a widely used model in data mining. Its basic idea is to apply successive binary splits on the predictor variables, building a tree that can predict which class a new sample belongs to. Common decision trees include the classic decision tree and the conditional inference tree.
A classic decision tree is built from one binary outcome variable and a set of predictors. The binary outcome can be any two-class label, such as benign vs. malignant, alive vs. dead, or recurrence vs. no recurrence; the predictors can include clinical variables like age and sex, blood-test results, or cytological features from pathology. We will use predicting whether a tumor is benign or malignant as our worked example. We don't need to understand the internals of the tree-growing algorithm, only how to apply decision trees to our own data.
We will build the tree with the rpart() function from the rpart package. A classic decision tree usually grows many branches, so we use ten-fold cross-validation to pick the tree with the smallest prediction error, then prune the rpart result with the prune() function. Below we use a breast-cancer clinical dataset to build a decision tree that classifies breast masses as benign or malignant from the available variables:
library(rpart)
library(tidyverse)
# Data preparation
breast <- read.table("breast_cancer.data",
sep = ",",
header = FALSE,
na.strings = "?") # load the data (dataset available from the official account backend)
names(breast) <- c("ID",
"Tumor_Thickness",
"SizeUniformity",
"shapeUniformity",
"maginalAdhesion",
"singleEpiCellSize",
"bareNuc",
"blandChr",
"normalNuc",
"mitosis",
"class") # rename the variables
df <- breast[-1] # drop the ID column
df$class <- factor(df$class,
                   levels = c(2, 4),
                   labels = c("benign", "malignant")) # recode class as a factor: benign vs. malignant
set.seed(2021)
train <- sample(nrow(df), 0.7*nrow(df)) # sample 70% of the data as the training set
df.train <- df[train,]
df.validata <- df[-train,] # keep the remaining 30% as the validation set
table(df.train$class) # inspect the training set
#
# benign malignant
# 330 159
table(df.validata$class) # inspect the validation set
#
# benign malignant
# 128 82
# Classic decision tree
set.seed(202101292)
# grow the decision tree
dtree <- rpart(class ~ ., # class is the outcome; . stands for all other variables
               data = df.train, # fit the model on the training set
               method = "class",
               parms = list(split = "information"))
print(dtree) # print the tree
# n= 489
#
# node), split, n, loss, yval, (yprob)
# * denotes terminal node
#
# 1) root 489 159 benign (0.67484663 0.32515337)
# 2) SizeUniformity< 2.5 309 9 benign (0.97087379 0.02912621) *
# 3) SizeUniformity>=2.5 180 30 malignant (0.16666667 0.83333333)
# 6) SizeUniformity< 4.5 68 27 malignant (0.39705882 0.60294118)
# 12) bareNuc< 3.5 26 4 benign (0.84615385 0.15384615) *
# 13) bareNuc>=3.5 42 5 malignant (0.11904762 0.88095238) *
# 7) SizeUniformity>=4.5 112 3 malignant (0.02678571 0.97321429) *
summary(dtree) # detailed information on the tree
# Call:
# rpart(formula = class ~ ., data = df.train, method = "class",
# parms = list(split = "information"))
# n= 489
#
# CP nsplit rel error xerror xstd
# 1 0.75471698 0 1.0000000 1.0000000 0.06514843
# 2 0.05660377 1 0.2452830 0.3081761 0.04176119
# 3 0.01000000 3 0.1320755 0.1761006 0.03231305
#
# Variable importance
# SizeUniformity shapeUniformity bareNuc singleEpiCellSize
# 23 17 16 15
# blandChr normalNuc maginalAdhesion Tumor_Thickness
# 15 14 1 1
#
# Node number 1: 489 observations, complexity param=0.754717
# predicted class=benign expected loss=0.3251534 P(node) =1
# class counts: 330 159
# probabilities: 0.675 0.325
# left son=2 (309 obs) right son=3 (180 obs)
# Primary splits:
# SizeUniformity < 2.5 to the left, improve=186.6152, (0 missing)
# shapeUniformity < 2.5 to the left, improve=177.5238, (0 missing)
# bareNuc < 2.5 to the left, improve=176.0200, (11 missing)
# blandChr < 3.5 to the left, improve=172.9661, (0 missing)
# normalNuc < 2.5 to the left, improve=149.5745, (0 missing)
# Surrogate splits:
# shapeUniformity < 3.5 to the left, agree=0.914, adj=0.767, (0 split)
# singleEpiCellSize < 2.5 to the left, agree=0.885, adj=0.689, (0 split)
# blandChr < 3.5 to the left, agree=0.881, adj=0.678, (0 split)
# normalNuc < 2.5 to the left, agree=0.877, adj=0.667, (0 split)
# bareNuc < 2.5 to the left, agree=0.871, adj=0.650, (0 split)
#
# Node number 2: 309 observations
# predicted class=benign expected loss=0.02912621 P(node) =0.6319018
# class counts: 300 9
# probabilities: 0.971 0.029
#
# Node number 3: 180 observations, complexity param=0.05660377
# predicted class=malignant expected loss=0.1666667 P(node) =0.3680982
# class counts: 30 150
# probabilities: 0.167 0.833
# left son=6 (68 obs) right son=7 (112 obs)
# Primary splits:
# SizeUniformity < 4.5 to the left, improve=21.59943, (0 missing)
# bareNuc < 3.5 to the left, improve=21.24393, (2 missing)
# blandChr < 3.5 to the left, improve=20.17688, (0 missing)
# shapeUniformity < 3.5 to the left, improve=16.80110, (0 missing)
# Tumor_Thickness < 6.5 to the left, improve=15.72829, (0 missing)
# Surrogate splits:
# shapeUniformity < 5.5 to the left, agree=0.789, adj=0.441, (0 split)
# singleEpiCellSize < 3.5 to the left, agree=0.750, adj=0.338, (0 split)
# blandChr < 3.5 to the left, agree=0.711, adj=0.235, (0 split)
# maginalAdhesion < 2.5 to the left, agree=0.706, adj=0.221, (0 split)
# bareNuc < 1.5 to the left, agree=0.672, adj=0.132, (0 split)
#
# Node number 6: 68 observations, complexity param=0.05660377
# predicted class=malignant expected loss=0.3970588 P(node) =0.1390593
# class counts: 27 41
# probabilities: 0.397 0.603
# left son=12 (26 obs) right son=13 (42 obs)
# Primary splits:
# bareNuc < 3.5 to the left, improve=18.806650, (1 missing)
# Tumor_Thickness < 6.5 to the left, improve=10.921810, (0 missing)
# blandChr < 3.5 to the left, improve= 8.433749, (0 missing)
# shapeUniformity < 2.5 to the left, improve= 8.035364, (0 missing)
# maginalAdhesion < 6.5 to the left, improve= 6.901954, (0 missing)
# Surrogate splits:
# shapeUniformity < 2.5 to the left, agree=0.731, adj=0.308, (1 split)
# Tumor_Thickness < 4.5 to the left, agree=0.716, adj=0.269, (0 split)
# maginalAdhesion < 2.5 to the left, agree=0.687, adj=0.192, (0 split)
# blandChr < 2.5 to the left, agree=0.672, adj=0.154, (0 split)
# normalNuc < 2.5 to the left, agree=0.672, adj=0.154, (0 split)
#
# Node number 7: 112 observations
# predicted class=malignant expected loss=0.02678571 P(node) =0.2290389
# class counts: 3 109
# probabilities: 0.027 0.973
#
# Node number 12: 26 observations
# predicted class=benign expected loss=0.1538462 P(node) =0.05316973
# class counts: 22 4
# probabilities: 0.846 0.154
#
# Node number 13: 42 observations
# predicted class=malignant expected loss=0.1190476 P(node) =0.08588957
# class counts: 5 37
# probabilities: 0.119 0.881
dtree$cptable # inspect the cptable to choose the best tree
# CP nsplit rel error xerror xstd
# 1 0.75471698 0 1.0000000 1.0000000 0.06514843
# 2 0.05660377 1 0.2452830 0.3081761 0.04176119
# 3 0.01000000 3 0.1320755 0.1761006 0.03231305
Now we have built a decision tree model and printed its details, but how do we use these results to prune the original tree? We only need to look at the cptable. Taking our model as the example: CP is the complexity parameter, which penalizes overly large trees; tree size is the number of splits, nsplit, and a tree with n splits has n+1 terminal nodes; rel error is the error on the training data itself; xerror is the error after ten-fold cross-validation, which you can think of as the corrected error; and xstd is the standard deviation of the cross-validation error.
With the parameters defined, we can interpret the model. In our cptable the smallest cross-validated error is 0.1761006, with a standard error of 0.03231305, so the optimal tree is the smallest one whose cross-validated error falls within 0.1761006 ± 0.03231305, i.e. between 0.1437876 and 0.2084137. Going back to the cptable, the only xerror in this interval is 0.1761006 itself, which corresponds to CP = 0.01 and nsplit = 3. In other words, our optimal tree has 3 splits and 4 terminal nodes. With this information we can prune and visualize the optimal tree:
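The selection above can also be done programmatically with one common form of the 1-SE rule (pick the simplest tree whose xerror is within one standard error of the minimum). Below is a minimal sketch; the matrix is hard-coded from the cptable printed earlier, and pick_cp_1se() is a hypothetical helper, not part of rpart:

```r
# cptable reproduced as a plain matrix from the rpart output above
cptable <- matrix(c(
  0.75471698, 0, 1.0000000, 1.0000000, 0.06514843,
  0.05660377, 1, 0.2452830, 0.3081761, 0.04176119,
  0.01000000, 3, 0.1320755, 0.1761006, 0.03231305
), ncol = 5, byrow = TRUE,
   dimnames = list(NULL, c("CP", "nsplit", "rel error", "xerror", "xstd")))

# hypothetical helper: apply the 1-SE rule to a cptable-like matrix
pick_cp_1se <- function(cpt) {
  i_min  <- which.min(cpt[, "xerror"])                # row with the smallest CV error
  cutoff <- cpt[i_min, "xerror"] + cpt[i_min, "xstd"] # xerror_min + 1 SE
  i_best <- which(cpt[, "xerror"] <= cutoff)[1]       # simplest tree within 1 SE
  cpt[i_best, "CP"]
}

pick_cp_1se(cptable)  # 0.01 here, matching the choice made below
```

In a real script you would pass dtree$cptable directly instead of retyping the matrix.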
# plot the cross-validation results to double-check the choice above
plotcp(dtree) # consistent with our reading: choose CP = 0.01
dtree.pruned <- prune(dtree, cp = 0.01) # prune the tree at the chosen CP
# plot the pruned tree with prp() from the rpart.plot package
library(rpart.plot)
prp(dtree.pruned,
    type = 2,
    extra = 104,
    fallen.leaves = TRUE,
    main = "Decision Tree")
# an alternative visualization via the partykit package
library(partykit)
plot(as.party(dtree.pruned))
# then check the model's reliability on the hold-out validation set
dtree.pred <- predict(dtree.pruned,
df.validata,
type = "class")
dtree.perf <- table(df.validata$class,
dtree.pred,
dnn = c("Actual", "Predicted"))
dtree.perf
## Predicted
## Actual benign malignant
## benign 124 4
## malignant 9 73
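Beyond eyeballing the confusion matrix, it may help to report a few summary statistics. A minimal sketch, with the counts hard-coded from the table above; perf_stats() is a hypothetical helper (in a real script you would pass dtree.perf to it directly):

```r
# confusion matrix reproduced from the output above
perf <- matrix(c(124, 4, 9, 73), nrow = 2, byrow = TRUE,
               dimnames = list(Actual    = c("benign", "malignant"),
                               Predicted = c("benign", "malignant")))

# hypothetical helper: accuracy, sensitivity, specificity for "malignant"
perf_stats <- function(tbl) {
  tp <- tbl["malignant", "malignant"]; tn <- tbl["benign", "benign"]
  fp <- tbl["benign", "malignant"];    fn <- tbl["malignant", "benign"]
  c(accuracy    = (tp + tn) / sum(tbl),
    sensitivity = tp / (tp + fn),   # recall for the malignant class
    specificity = tn / (tn + fp))
}

round(perf_stats(perf), 3)
# accuracy 0.938, sensitivity 0.890, specificity 0.969
```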
And that completes the classic decision tree model. It is not hard to build, yet quite competitive: after seeing so many risk-prediction models, closing a paper with a decision tree model adds a nice finishing touch, doesn't it? Can we go a step fancier? Of course: let's look at conditional inference trees.
Conditional inference trees are similar to classic decision trees; the difference lies in the algorithm. In a conditional inference tree, the variables and split points are selected based on significance tests rather than measures of homogeneity or purity. Again, let's go straight to the code, which is very simple for this part:
library(party)
fit.ctree <- ctree(class ~.,
data = df.train) # fit the model with ctree()
print(fit.ctree)
#
# Conditional inference tree with 7 terminal nodes
#
# Response: class
# Inputs: Tumor_Thickness, SizeUniformity, shapeUniformity, maginalAdhesion, singleEpiCellSize, bareNuc, blandChr, normalNuc, mitosis
# Number of observations: 489
#
# 1) bareNuc <= 2; criterion = 1, statistic = 333.722
# 2) SizeUniformity <= 3; criterion = 1, statistic = 226.573
# 3) mitosis <= 1; criterion = 1, statistic = 79.262
# 4)* weights = 295
# 3) mitosis > 1
# 5)* weights = 8
# 2) SizeUniformity > 3
# 6)* weights = 16
# 1) bareNuc > 2
# 7) blandChr <= 3; criterion = 1, statistic = 42.137
# 8) Tumor_Thickness <= 6; criterion = 1, statistic = 20.794
# 9)* weights = 26
# 8) Tumor_Thickness > 6
# 10)* weights = 15
# 7) blandChr > 3
# 11) bareNuc <= 8; criterion = 0.991, statistic = 10.744
# 12)* weights = 40
# 11) bareNuc > 8
# 13)* weights = 89
summary(fit.ctree)
# Length Class Mode
# 1 BinaryTree S4
plot(fit.ctree) # plot the conditional inference tree
# validate the model on the validation set
ctree.pred <- predict(fit.ctree,
df.validata,
type = "response")
ctree.perf <- table(df.validata$class,
ctree.pred,
dnn = c("Actual","Predicted"))
ctree.perf
## Predicted
## Actual benign malignant
## benign 124 4
## malignant 7 75
It's 2 a.m. and I've finished this post. They say late nights invite reflection, and it's true! I can't help remembering how, a few years ago, when another senior asked me whether I could build a decision tree, all I could do was ask him what a decision tree even was /(ㄒoㄒ)/~~ So maybe what I can't do today, I'll be able to do after the Spring Festival? Alright, enough late-night daydreaming and rambling. That's all for today, good night everyone. Reply "feng34" in the backend to get the code and data for this post, and see you next time! *^_^*
Ramblings column archive
Feng's gallery series archive
— END —
Written by丨风风
Layout by丨四金兄
On duty丨先锋宇
Chief editor丨小雪球
Welcome to follow the 解螺旋 bioinformatics channel, official account 挑圈联靠~