Predictors of Testing Positive for Diabetes
This analysis uses the Pima Indians Diabetes dataset to determine which factors are important predictors of testing positive for diabetes.
Data Structure
Let’s get a quick understanding of the variables included in our dataset and how they relate to each other.
str(diabetes)
plot(diabetes, col = "lightblue")
'data.frame': 768 obs. of 9 variables:
$ pregnant: int 11 6 2 4 1 5 8 5 0 2 ...
$ glucose : int 143 92 90 111 103 88 176 44 109 109 ...
$ pressure: int 94 92 68 72 80 66 90 62 88 92 ...
$ triceps : int 33 0 42 47 11 21 34 0 30 0 ...
$ insulin : int 146 0 0 207 82 23 300 0 0 0 ...
$ mass : num 36.6 19.9 38.2 37.1 19.4 24.4 33.7 25 32.5 42.7 ...
$ pedigree: num 0.254 0.188 0.503 1.39 0.491 0.342 0.467 0.587 0.855 0.845 ...
$ age : int 51 28 27 56 22 30 58 36 38 54 ...
$ diabetes: Factor w/ 2 levels "neg","pos": 2 1 2 2 1 1 2 1 2 1 ...
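The printed rows contain zeros in triceps and insulin. A zero skinfold thickness or insulin reading is not physiologically plausible, so these zeros may be standing in for missing measurements; that is an assumption about the data, not something this report states. The sketch below rebuilds only the ten rows shown by str() into a hypothetical data frame called head_rows and counts the zeros per column.

```r
# The ten rows printed by str(diabetes) above, re-entered by hand.
# head_rows is a hypothetical name; it is NOT the full 768-row dataset.
head_rows <- data.frame(
  glucose  = c(143, 92, 90, 111, 103, 88, 176, 44, 109, 109),
  pressure = c(94, 92, 68, 72, 80, 66, 90, 62, 88, 92),
  triceps  = c(33, 0, 42, 47, 11, 21, 34, 0, 30, 0),
  insulin  = c(146, 0, 0, 207, 82, 23, 300, 0, 0, 0),
  mass     = c(36.6, 19.9, 38.2, 37.1, 19.4, 24.4, 33.7, 25, 32.5, 42.7)
)

# Count how many of the ten values in each column are exactly zero.
zero_counts <- colSums(head_rows == 0)
zero_counts
```

Running the same colSums() call on the full diabetes data frame would show how widespread the zeros are before deciding whether to treat them as missing.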
Optimal Decision Tree Model
The optimal decision tree model, which achieves a hold-out set accuracy of 79.22%, has the following hyperparameters:
jsonlite::toJSON(best_tree$control, pretty = TRUE, auto_unbox = TRUE)
{
"minsplit": 20,
"minbucket": 7,
"cp": 0.01,
"maxcompete": 4,
"maxsurrogate": 5,
"usesurrogate": 2,
"surrogatestyle": 0,
"maxdepth": 18,
"xval": 10
}
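To see which of these settings were actually tuned, they can be compared with the defaults returned by rpart::rpart.control(). This is a minimal sketch, assuming the rpart package is installed and that best_tree was fitted with rpart (the report plots it with rpart.plot but does not say so explicitly). The reported list is copied by hand from the JSON above.

```r
library(rpart)

# Settings copied from the JSON output above.
reported <- list(
  minsplit = 20, minbucket = 7, cp = 0.01, maxcompete = 4,
  maxsurrogate = 5, usesurrogate = 2, surrogatestyle = 0,
  maxdepth = 18, xval = 10
)

# Default control settings as returned by rpart itself.
defaults <- rpart.control()

# Flag each reported setting that does not match its default.
is_default <- mapply(
  function(a, b) isTRUE(all.equal(a, b)),
  reported, defaults[names(reported)]
)
tuned <- names(reported)[!is_default]
tuned
```

Any name returned in tuned is a setting that departs from the package defaults; the remaining settings were presumably left untouched.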
We can also plot the tree structure using rpart.plot.
rpart.plot::rpart.plot(best_tree, roundint = FALSE)
Variable Importance
Finally, we can determine the most important predictors of a positive diabetes diagnosis, using the variable importance feature of our Decision Tree.
var_importance <- transform(
  data.frame(
    "Feature" = names(best_tree$variable.importance),
    "Importance" = best_tree$variable.importance
  ),
  Importance = Importance/sum(Importance)
)
ggplot2::ggplot(
  var_importance,
  ggplot2::aes(x = forcats::fct_reorder(Feature, Importance), y = Importance)
) +
  ggplot2::geom_bar(stat = "identity") +
  ggplot2::labs(x = "Feature") +
  ggplot2::theme_light() +
  ggplot2::theme(
    axis.title = ggplot2::element_text(face = "bold"),
    panel.border = ggplot2::element_blank()
  )
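The transform() step above rescales the raw importance scores so that they sum to 1, which lets each bar be read as a share of total importance. The sketch below reproduces that step in base R on a small hypothetical vector; the numbers in raw_importance are invented for illustration and are not the values stored in best_tree.

```r
# Hypothetical raw scores, for illustration only (NOT taken from best_tree).
raw_importance <- c(glucose = 60, mass = 25, age = 15)

# Same normalisation as in the report: divide each score by the total.
var_importance <- transform(
  data.frame(
    Feature = names(raw_importance),
    Importance = raw_importance
  ),
  Importance = Importance / sum(Importance)
)

# Sort from most to least important for easier reading.
var_importance <- var_importance[order(-var_importance$Importance), ]
var_importance
```

Because the shares are relative, they only rank the predictors within this one tree; they say nothing about how well the tree itself fits.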
Conclusion
Unsurprisingly, glucose level is by far the most important predictor of a positive diabetes test in this decision tree.