Predictors of Testing Positive for Diabetes

Author

Daniel Molitor

This analysis uses the Pima Indians Diabetes dataset to determine which factors are important predictors of testing positive for diabetes.

Data Structure

Let’s get a quick understanding of the variables included in our dataset and how they relate to each other.

str(diabetes)
plot(diabetes, col = "lightblue")
'data.frame':   768 obs. of  9 variables:
 $ pregnant: int  11 6 2 4 1 5 8 5 0 2 ...
 $ glucose : int  143 92 90 111 103 88 176 44 109 109 ...
 $ pressure: int  94 92 68 72 80 66 90 62 88 92 ...
 $ triceps : int  33 0 42 47 11 21 34 0 30 0 ...
 $ insulin : int  146 0 0 207 82 23 300 0 0 0 ...
 $ mass    : num  36.6 19.9 38.2 37.1 19.4 24.4 33.7 25 32.5 42.7 ...
 $ pedigree: num  0.254 0.188 0.503 1.39 0.491 0.342 0.467 0.587 0.855 0.845 ...
 $ age     : int  51 28 27 56 22 30 58 36 38 54 ...
 $ diabetes: Factor w/ 2 levels "neg","pos": 2 1 2 2 1 1 2 1 2 1 ...
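
The report does not show where the `diabetes` data frame comes from. As a minimal sketch, assuming the data is the `PimaIndiansDiabetes` dataset shipped with the mlbench package (its column names match the output above), it could be loaded like this. The source of the data and the class-balance check are our assumptions, not part of the original analysis.

```r
# Assumption: the data comes from mlbench::PimaIndiansDiabetes.
# The guard skips the example if mlbench is not installed.
if (requireNamespace("mlbench", quietly = TRUE)) {
  data("PimaIndiansDiabetes", package = "mlbench")
  diabetes <- PimaIndiansDiabetes

  # Same structural check as in the report.
  str(diabetes)

  # Class balance of the outcome, useful context for any accuracy figure.
  print(table(diabetes$diabetes))
}
```

The row order in the report may differ from the packaged data, so the first few values printed by `str()` need not match.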

Optimal Decision Tree Model

The optimal decision tree, which achieves a hold-out set accuracy of 79.22%, has the following hyperparameters:

jsonlite::toJSON(best_tree$control, pretty = TRUE, auto_unbox = TRUE)
{
  "minsplit": 20,
  "minbucket": 7,
  "cp": 0.01,
  "maxcompete": 4,
  "maxsurrogate": 5,
  "usesurrogate": 2,
  "surrogatestyle": 0,
  "maxdepth": 18,
  "xval": 10
} 
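
The tuning procedure behind `best_tree` is not shown in this report. The sketch below illustrates one way such a search could look: fit an rpart tree for each combination in a small grid and score each on a hold-out set. Everything here is an assumption for illustration: the grid values, the 80/20 split, and the use of rpart's bundled `kyphosis` data as a stand-in for the diabetes data.

```r
library(rpart)

set.seed(1)

# Stand-in data: rpart's bundled kyphosis dataset (NOT the diabetes data).
n <- nrow(kyphosis)
test_idx <- sample(n, size = round(0.2 * n))  # hold out 20% of the rows
train <- kyphosis[-test_idx, ]
test  <- kyphosis[test_idx, ]

# Hypothetical grid over two of the hyperparameters listed above.
grid <- expand.grid(cp = c(0.001, 0.01, 0.05), maxdepth = c(3, 6, 18))

# Fit one tree per grid row and record its hold-out accuracy.
grid$accuracy <- vapply(seq_len(nrow(grid)), function(i) {
  fit <- rpart(
    Kyphosis ~ Age + Number + Start,
    data = train,
    method = "class",
    control = rpart.control(cp = grid$cp[i], maxdepth = grid$maxdepth[i])
  )
  pred <- predict(fit, newdata = test, type = "class")
  mean(pred == test$Kyphosis)
}, numeric(1))

# Keep the first combination with the highest hold-out accuracy.
best <- grid[which.max(grid$accuracy), ]
print(best)
```

Note that choosing hyperparameters on the same hold-out set used to report accuracy tends to give an optimistic estimate; a separate validation set or cross-validation avoids this.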

We can also plot the tree structure using rpart.plot.

rpart.plot::rpart.plot(best_tree, roundint = FALSE)

Variable Importance

Finally, we can determine the most important predictors of a positive diabetes test using the variable importance scores of our decision tree. We rescale the scores to sum to 1, so each value is a feature's share of the total importance.

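# Build a data frame of features and their importance scores, then rescale
# the scores so that they sum to 1.
var_importance <- transform(
  data.frame(
    Feature = names(best_tree$variable.importance),
    Importance = best_tree$variable.importance
  ),
  Importance = Importance / sum(Importance)
)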

ggplot2::ggplot(
  var_importance,
  ggplot2::aes(x = forcats::fct_reorder(Feature, Importance), y = Importance)
) +
  ggplot2::geom_col() +
  ggplot2::labs(x = "Feature") +
  ggplot2::theme_light() +
  ggplot2::theme(
    axis.title = ggplot2::element_text(face = "bold"),
    panel.border = ggplot2::element_blank()
  )
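
For context, rpart computes a variable's importance as the sum of the goodness-of-split measures for every split where it is the primary variable, plus an adjusted contribution for splits where it acts as a surrogate. The sketch below shows how the normalized shares and the top-ranked feature can be read off programmatically. It uses rpart's bundled `kyphosis` data as a stand-in, since `best_tree` is not reproducible from this report; the names `fit`, `share`, and `top_feature` are ours.

```r
library(rpart)

# Stand-in model on rpart's bundled kyphosis data (NOT the diabetes model).
fit <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis, method = "class")

# Named numeric vector of raw importance scores, one entry per feature.
imp <- fit$variable.importance

# Rescale so the scores sum to 1, as in the plot above.
share <- imp / sum(imp)

# The feature with the largest share of total importance.
top_feature <- names(share)[which.max(share)]

print(round(share, 3))
print(top_feature)
```

If a fitted tree has no splits, `variable.importance` is `NULL`, so it is worth checking for that case before normalizing.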

Conclusion

Unsurprisingly, glucose level is by far the most important predictor of testing positive for diabetes in this decision tree model.