The goal of glmSparse is to provide common methods such as `coef`, `summary`, `predict`, `broom::tidy`, etc. for sparse GLMs implemented in the MatrixModels package. For more information on this particular implementation of GLMs, see the documentation for `MatrixModels::glm4`. Additionally, glmSparse extends `MatrixModels::glm4` so the user can both fit GLMs using the formula syntax of base R’s `glm` function and directly pass in a pre-formulated sparse model matrix and response vector (sparse matrices as implemented in the Matrix package).
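Schematically, the two interfaces look like this (a sketch, where `df`, `X_sparse`, `y`, `x1`, and `x2` are placeholder objects; a full worked example follows below):

```r
# Formula interface, mirroring stats::glm()
fit <- glmSparse(y ~ x1 + x2, family = binomial, data = df)

# Matrix interface, taking a pre-built sparse model matrix and response
fit <- glmSparse(x = X_sparse, y = y, family = binomial)
```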
The main motivation for this package is the times when I have wanted to live my life peaceably and just run a huge (pretty sparse) OLS or logit regression with no trouble, but haven’t had the memory on hand to run the regression with a standard dense matrix. Unfortunately, I’ve continued to be surprised by how little support there is for this within existing packages! (If I’m wrong about this, I would love to be proved wrong.) I discovered a potential solution in `MatrixModels::glm4`, which has been described as an experimental function since around 2016, with very little support in the way of methods and nothing seeming to have changed since then. So, I took it upon myself to extend this function’s functionality and write the methods that I wished it had.
Here are the alternatives I’m aware of:
- `glmnet::bigGlm`: This actually works fine for estimating model parameters and making predictions, but because it isn’t computing a closed-form solution it doesn’t come with features I would typically want out of the box (e.g. standard errors). It also doesn’t really have any of the standard methods you come to expect with `glm`, such as `fitted`, `resid`, `summary`, etc. (see the sketch just after this list).
- `speedglm::speedglm`: This option is confusing to me, particularly because it has a `sparse` argument with the following description: “logical. Is the model matrix sparse? By default is NULL, so a quickly sample survey will be made.” However, the function does not allow the user to input sparse matrices from the Matrix package. Additionally, under the hood it uses `model.matrix` to create the underlying model matrix, which is the exact step I’m trying to avoid (e.g. by using `Matrix::sparse.model.matrix`; the memory sketch below this list illustrates why). It’s possible that this function will accept sparse matrix classes from a different package (e.g. SparseM), but even if it does, that’s not a great solution IMO because the Matrix implementation of sparse matrices is by far the most ubiquitous and I don’t really want to veer from that.
- `glm`: Not meant to support sparse matrices.
- Others: I’m potentially missing a good option here out of ignorance?
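To make the `glmnet::bigGlm` point concrete, here’s a minimal sketch. It borrows the `starwars_x`/`starwars_y` objects built in the usage example below and is meant as an illustration rather than verified output:

```r
# bigGlm will happily fit an unpenalized GLM on a sparse matrix...
fit <- glmnet::bigGlm(starwars_x, starwars_y, family = "binomial")
coef(fit)  # point estimates are available
# ...but there's no summary()-style inference table (standard errors,
# z-statistics, p-values), nor fitted()/resid() in the glm sense
```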
Assuming I’m not missing anything obvious here, none of the options above provides a suitable solution to the issue, which is presumably why the `MatrixModels::glm4` function was created in the first place!
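As a rough illustration of why the dense `model.matrix` step matters, the sketch below compares the storage of a dense and a sparse model matrix built from the same formula. It reuses `starwars_clean` from the usage example further down; exact sizes will vary, and on a dataset this small the gap is modest, but for large designs full of factor dummies the dense version can exhaust memory while the sparse one stays manageable:

```r
# Dense vs. sparse storage of the same design matrix
dense_x  <- model.matrix(gender ~ . - 1, starwars_clean)
sparse_x <- Matrix::sparse.model.matrix(gender ~ . - 1, starwars_clean)
print(object.size(dense_x))   # grows with nrow * ncol
print(object.size(sparse_x))  # grows (roughly) with the number of non-zeros
```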
You can install the development version from GitHub with:
```r
# install.packages("devtools")
devtools::install_github("dmolitor/glmSparse")
```
In this example we’ll use the `dplyr::starwars` dataset, which has characteristics of a bunch of Star Wars characters, and we’ll run a simple logistic regression to predict the characters’ genders. This will showcase the alternate modeling interfaces as well as the standard modeling methods.
First, we’ll do a tiny bit of data cleaning and we’ll also create a sparse modeling matrix and response vector.
```r
library(dplyr)
library(glmSparse)
library(Matrix)

# Prep the data for modeling
starwars_clean <- starwars |>
  select(height:eye_color, gender) |>
  filter(!is.na(gender)) |>
  mutate(
    gender = as.integer(factor(gender)) - 1,
    across(where(is.character), ~ factor(.x, exclude = NULL)),
    across(contains("color"), ~ forcats::fct_lump_min(.x, 5)),
    across(c(height, mass), ~ replace(.x, is.na(.x), median(.x, na.rm = TRUE)))
  )

# Pre-create sparse modeling matrix
starwars_x <- sparse.model.matrix(gender ~ . - 1, starwars_clean)
starwars_y <- starwars_clean$gender
```
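Before fitting anything, it’s worth a quick sanity check that the design matrix really is sparse; `sparse.model.matrix` returns a `dgCMatrix`, the Matrix package’s compressed sparse column format:

```r
# Confirm the design matrix is stored in sparse form
class(starwars_x)  # "dgCMatrix"
dim(starwars_x)
```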
Now that we’ve created the modeling data, let’s estimate a logistic regression with `gender` as our response variable. We’ll use both the formula and matrix interfaces.
```r
## glmSparse
# Formula interface
logit_formula <- glmSparse(gender ~ ., family = binomial, data = starwars_clean)
# Matrix interface
logit_matrix <- glmSparse(x = starwars_x, y = starwars_y, family = binomial)
```
Let’s take a quick look at a few of the most helpful methods:
```r
# Print the model
logit_formula
#>
#> Call: glmSparse(formula = gender ~ ., family = binomial, data = starwars_clean)
#>
#> Coefficients:
#> (Intercept) height mass hair_colorbrown
#> -5.247662 -0.001529 0.063185 0.694687
#> hair_colornone hair_colorNA hair_colorOther skin_colorfair
#> 1.618806 14.583680 0.304483 0.181056
#> skin_colorgreen skin_colorgrey skin_colorlight skin_colorOther
#> 17.714744 0.204344 -2.307464 0.306911
#> eye_colorblue eye_colorbrown eye_colororange eye_colorred
#> 0.845282 2.647640 17.520499 14.293584
#> eye_coloryellow eye_colorOther
#> 2.007027 1.273756
#>
#> Degrees of Freedom: 82 Total (i.e. Null); 65 Residual
#> Null Deviance: 84.16
#> Residual Deviance: 56.47 AIC: 92.47
# Get the summary of the model
summary(logit_formula)
#>
#> Call:
#> glmSparse(formula = gender ~ ., family = binomial, data = starwars_clean)
#>
#> Deviance Residuals:
#> Min 1Q Median 3Q Max
#> -2.3899 0.0000 0.2656 0.6361 1.7203
#>
#> Coefficients:
#> Estimate Std. Error z value Pr(>|z|)
#> (Intercept) -5.248e+00 3.599e+00 -1.458 0.1448
#> height -1.529e-03 1.537e-02 -0.099 0.9208
#> mass 6.319e-02 2.778e-02 2.275 0.0229 *
#> hair_colorbrown 6.947e-01 1.297e+00 0.536 0.5922
#> hair_colornone 1.619e+00 1.348e+00 1.201 0.2296
#> hair_colorNA 1.458e+01 3.187e+03 0.005 0.9963
#> hair_colorOther 3.045e-01 1.442e+00 0.211 0.8327
#> skin_colorfair 1.811e-01 1.784e+00 0.101 0.9192
#> skin_colorgreen 1.771e+01 3.512e+03 0.005 0.9960
#> skin_colorgrey 2.043e-01 2.111e+00 0.097 0.9229
#> skin_colorlight -2.307e+00 1.749e+00 -1.319 0.1871
#> skin_colorOther 3.069e-01 1.513e+00 0.203 0.8392
#> eye_colorblue 8.453e-01 1.643e+00 0.514 0.6070
#> eye_colorbrown 2.648e+00 1.919e+00 1.380 0.1676
#> eye_colororange 1.752e+01 3.013e+03 0.006 0.9954
#> eye_colorred 1.429e+01 3.505e+03 0.004 0.9967
#> eye_coloryellow 2.007e+00 1.848e+00 1.086 0.2775
#> eye_colorOther 1.274e+00 1.567e+00 0.813 0.4163
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> (Dispersion parameter for binomial family taken to be 1)
#>
#> Null deviance: 84.164 on 82 degrees of freedom
#> Residual deviance: 56.472 on 65 degrees of freedom
#> AIC: 92.472
#>
#> Number of Fisher Scoring iterations: 17
# Tidy the summary of the model
tidy(logit_formula, exponentiate = TRUE)
#> # A tibble: 18 x 5
#> term estimate std.error statistic p.value
#> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 (Intercept) 5.26e-3 3.60 -1.46 0.145
#> 2 height 9.98e-1 0.0154 -0.0995 0.921
#> 3 mass 1.07e+0 0.0278 2.27 0.0229
#> 4 hair_colorbrown 2.00e+0 1.30 0.536 0.592
#> 5 hair_colornone 5.05e+0 1.35 1.20 0.230
#> 6 hair_colorNA 2.16e+6 3187. 0.00458 0.996
#> 7 hair_colorOther 1.36e+0 1.44 0.211 0.833
#> 8 skin_colorfair 1.20e+0 1.78 0.101 0.919
#> 9 skin_colorgreen 4.94e+7 3512. 0.00504 0.996
#> 10 skin_colorgrey 1.23e+0 2.11 0.0968 0.923
#> 11 skin_colorlight 9.95e-2 1.75 -1.32 0.187
#> 12 skin_colorOther 1.36e+0 1.51 0.203 0.839
#> 13 eye_colorblue 2.33e+0 1.64 0.514 0.607
#> 14 eye_colorbrown 1.41e+1 1.92 1.38 0.168
#> 15 eye_colororange 4.06e+7 3013. 0.00581 0.995
#> 16 eye_colorred 1.61e+6 3505. 0.00408 0.997
#> 17 eye_coloryellow 7.44e+0 1.85 1.09 0.278
#> 18 eye_colorOther 3.57e+0 1.57 0.813 0.416
```
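Beyond these, other standard methods work as well. For example, here’s a quick sketch of `predict` and `fitted`; I’m assuming glmSparse’s `predict` mirrors `stats::predict.glm`’s `type` argument:

```r
# Predicted probabilities for the training data (assumes a
# predict.glm-style `type = "response"` argument)
head(predict(logit_formula, type = "response"))
head(fitted(logit_formula))
```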
Finally, let’s confirm that the `glmSparse` and `glm` outputs are effectively identical.
```r
all.equal(
  tidy(logit_formula),
  tidy(
    suppressWarnings(
      glm(gender ~ ., family = binomial, data = starwars_clean)
    )
  ),
  tolerance = 0.05
)
#> [1] TRUE
```
As a last step, let’s run a quick benchmark of `glmSparse` against the alternatives. In the output below, the rows correspond, in order, to `glmSparse`, `speedglm`, and `glm`.
```r
starwars_big <- bind_rows(
  lapply(
    1:10000,
    function(i) starwars_clean
  )
)
starwars_big_x <- sparse.model.matrix(gender ~ . - 1, starwars_big)
starwars_big_y <- starwars_big$gender
bench::mark(
  glmSparse(x = starwars_big_x, y = starwars_big_y, family = binomial),
  speedglm::speedglm(gender ~ ., family = binomial(), data = starwars_big),
  suppressWarnings(glm(gender ~ ., family = binomial, data = starwars_big)),
  check = FALSE,
  iterations = 1
) |>
  suppressWarnings() |>
  select(min:total_time)
#> # A tibble: 3 x 5
#>        min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1    18.9s    18.9s    0.0528    3.94GB    0.686
#> 2    22.8s    22.8s    0.0438    4.49GB    0.526
#> 3    34.5s    34.5s    0.0290    8.89GB    0.871
```