---
title: "A History of glmnet"
author:
  - Trevor Hastie
  - Balasubramanian Narasimhan
date: "`r format(Sys.time(), '%B %d, %Y')`"
bibliography: assets/glmnet_refs.bib
link-citations: true
output:
  pdf_document:
    fig_caption: yes
    toc: yes
    toc_depth: 2
vignette: >
  %\VignetteIndexEntry{A History of glmnet}
  %\VignetteEngine{knitr::rmarkdown}
  \usepackage[utf8]{inputenc}
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = FALSE, message = FALSE, warning = FALSE,
                      fig.align = "center")
```

## 1. Introduction

`glmnet` is an R package for fitting the lasso and elastic-net
regularization paths of generalized linear models and related
regression problems. It has been on CRAN since 2008. Over the
intervening years it has grown from a single coordinate-descent
Fortran kernel for Gaussian, binomial, multinomial, Poisson, and Cox
responses into a layered system in which *any* GLM family can be
fitted, in which large-scale genomic problems are tractable, and in
which every numerical kernel is now written in modern C++. The
package reached its current shape in identifiable steps, each
due to contributions from several people.

This vignette is a history of those steps and an acknowledgement of
the contributions. Some historical information is in `NEWS.md`, but
it omits many details. We hope that this document goes beyond a
listing of enhancements and bug fixes to explain why the package has
its current structure, and to attribute contributions accurately,
with references to published work where appropriate.

## 2. Origins: coordinate descent and the first release (2008--2010)

The statistical backbone of `glmnet` is older than the package. The
lasso was introduced in @lasso and the elastic-net penalty, which
combines $\ell_1$ and $\ell_2$ regularization and smooths the variable
selection behavior of the lasso in the presence of correlated
features, was introduced in @elasticnet. The *algorithmic* insight,
worked out in @pathwise, is that cyclical coordinate descent is
extraordinarily effective when combined with warm starts along a
decreasing sequence of regularization parameters $\lambda_1 >
\lambda_2 > \cdots > \lambda_K$; this is what the package implemented
in its CRAN debut (version 1.1-1). The path is computed once: each
$\lambda_k$ solution starts from the previous one, and both the set of
active variables and the number of inner passes stay small.
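
A minimal sketch of the pathwise idiom as it appears to the user
(the data here are simulated, and the snippet is illustrative only):

```{r path-sketch, eval = FALSE, echo = TRUE}
library(glmnet)
set.seed(1)
x <- matrix(rnorm(100 * 50), 100, 50)
y <- rnorm(100)
fit <- glmnet(x, y)           # one call computes the whole path
length(fit$lambda)            # the decreasing lambda sequence (up to 100 values)
coef(fit, s = fit$lambda[20])[1:6, ]  # coefficients at one point on the path
```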

Coordinate descent has a long history, and our team was not the first
to use it in conjunction with the lasso. Tibshirani's Ph.D. student
Wenjiang Fu at U. Toronto invented a version of coordinate descent
called the *shooting algorithm* in 1997.  Ingrid Daubechies used
coordinate descent for $\ell_1$ image deblurring in a talk at Stanford
in 2002. In her Ph.D. thesis with Jacquie Meulman at U. Leiden in
2006, Anita van der Kooij used coordinate descent to fit elastic
net problems. Others have used it as well. The *pathwise* use of
coordinate descent in `glmnet` leads to great efficiency, and allows
for a fairly seamless transition to different loss functions.

The canonical reference for the package itself is
@glmnet, published in the *Journal of Statistical Software* the year
after the initial release. That paper covers Gaussian, two-class
logistic, multinomial, and Poisson regression. The Cox
proportional-hazards model for right-censored survival data was
added in version 1.2 (2010) by Noah Simon, then a PhD student at
Stanford, and documented in @coxnet. At that point the available
families were exactly those exposed through the `family` argument
today: `"gaussian"`, `"binomial"`, `"multinomial"`, `"poisson"`,
`"mgaussian"`, and `"cox"`. Each one had its own Fortran subroutine,
and the dispatch in `glmnet.R` was a chain of `if/else` branches
keyed on the family string:

![Family dispatch in `glmnet` before v4.0. User-facing R in blue,
internal R in orange, Fortran subroutines in
grey.](assets/figures/dispatch_v3.png){width=88%}

A word on the implementation language is warranted, because it
explains most of what came later. The Fortran sources were not written
by hand in Fortran but generated from *Mortran*, a macro language
layered over Fortran 77. In the spirit of open source, the full
Mortran sources live in `inst/mortran/` and were maintained by Jerome
Friedman for over a decade. The combination was extremely fast but it
was difficult to extend and hard for newcomers to read. Every new
feature that touched the inner loop had to be expressed in Mortran
(or patched directly into the generated Fortran), regenerated, and
then debugged under the limitations of that toolchain. Furthermore,
Fortran was dropping in popularity, and C/C++ was increasingly the
language of choice for fast code interfacing with R. `glmnet` would
eventually follow suit.

## 3. Filling out the matrix (2010--2018)

The years between the first JSS paper and the 2019 v3.0 release were
spent steadily broadening what the package could accept as input
and what it could do with it. We summarize the milestones
thematically; the version numbers come from `NEWS.md` and the CRAN
archive.
```{r timeline, fig.width=7, fig.height=3.8, out.width="\\textwidth", fig.cap="glmnet release milestones. Top row runs left-to-right; the path folds down at the right and the bottom row runs right-to-left, ending at v5.0. Evenly spaced for legibility rather than to reflect calendar time."}
par(mar = c(0.4, 0.4, 0.4, 0.4))
milestones <- data.frame(
  ver = c("Initial release (2008)", "1.2 (2010)", "1.6 (2011)",
          "2.0-1 (2015)", "3.0 (2019)",
          "4.0 (2020)", "4.1 (2021)", "4.1-3 (2021)", "5.0 (2026)"),
  lbl = c("coordinate descent;\nFortran kernel\n(Friedman, Hastie,\nTibshirani)",
          "coxnet\n(Simon)",
          "strong rules screen\n(Tibshirani et al.)",
          "CV rewrite\n(Hastie)",
          "relax; assess;\nbigGlm\n(Hastie,\nNarasimhan)",
          "GLM extension\n(Tay, Hastie,\nNarasimhan)",
          "(start, stop] Cox;\nstrata; sparse Cox\n(Tay, Narasimhan,\nHastie)",
          "Fortran to C++\n(Yang)",
          "coxdev submodules;\nCox unified\n(Taylor,\nNarasimhan, Hastie)")
)
n <- nrow(milestones); ntop <- 5; nbot <- n - ntop
top_y <- 1.4; bot_y <- -1.4
col_x <- seq_len(ntop)
px <- numeric(n); py <- numeric(n)
px[1:ntop]     <- col_x;                   py[1:ntop]     <- top_y
px[(ntop+1):n] <- rev(col_x[(ntop-nbot+1):ntop]); py[(ntop+1):n] <- bot_y

plot.new(); plot.window(xlim = c(0.4, ntop + 0.45), ylim = c(-2.3, 2.3))

seg <- function(x0, y0, x1, y1)
  segments(x0, y0, x1, y1, lwd = 1.4, col = "grey35")
arr <- function(x0, y0, x1, y1)
  arrows(x0, y0, x1, y1, length = 0.09, angle = 25, lwd = 1.4,
         col = "grey35")
pad <- 0.18
# top row: left -> right
for (i in 1:(ntop-1)) arr(px[i] + pad, py[i], px[i+1] - pad, py[i+1])
# fold down at right: tiny side step to clear labels on the shared column
fold_x <- ntop + 0.3
seg(px[ntop] + pad, py[ntop], fold_x, py[ntop])        # right from last top node
seg(fold_x, py[ntop], fold_x, py[ntop+1])              # down along margin
arr(fold_x, py[ntop+1], px[ntop+1] + pad, py[ntop+1])  # left into first bottom node
# bottom row: right -> left
for (i in (ntop+1):(n-1)) arr(px[i] - pad, py[i], px[i+1] + pad, py[i+1])

points(px, py, pch = 21, bg = "grey90", col = "grey20",
       cex = 2.0, lwd = 1.3)
text(px, py, labels = seq_len(n), cex = 0.72, col = "grey15",
     font = 2)

# version labels: above top row, below bottom row
text(px[1:ntop], py[1:ntop] + 0.28, milestones$ver[1:ntop],
     cex = 0.72, col = "grey25", pos = 3, offset = 0)
text(px[(ntop+1):n], py[(ntop+1):n] - 0.28, milestones$ver[(ntop+1):n],
     cex = 0.72, col = "grey25", pos = 1, offset = 0)

# description labels: below top row, above bottom row (interior)
text(px[1:ntop], py[1:ntop] - 0.28, milestones$lbl[1:ntop],
     cex = 0.66, pos = 1, offset = 0)
text(px[(ntop+1):n], py[(ntop+1):n] + 0.28, milestones$lbl[(ntop+1):n],
     cex = 0.66, pos = 3, offset = 0)
```

**More responses.** The `mgaussian` family for multivariate Gaussian
regression with a grouped-lasso penalty, and a `grouped` option for
the multinomial family, arrived in v1.8 (2012). Sparse `x` via
`Matrix::dgCMatrix` landed even earlier, in v1.6 (2011), a major
step that made it possible to use `glmnet` with large, sparse
design matrices.
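
A short sketch (simulated data, illustrative only) of how a sparse
design matrix passes straight through:

```{r sparse-sketch, eval = FALSE, echo = TRUE}
library(glmnet)
library(Matrix)
set.seed(2)
xs <- rsparsematrix(1000, 200, density = 0.05)  # a dgCMatrix, 5% nonzero
y  <- rnorm(1000)
fit_sparse <- glmnet(xs, y)   # same call as for a dense matrix
```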


**Strong rules.** Version 1.6 (April 2011) also introduced *strong
rules* screening [@strongrules]: at each $\lambda$ along the path a
cheap inner-product check predicts which variables are likely to
stay inactive, and the coordinate-descent sweep skips them, with a
KKT-condition re-check afterwards to catch any violations. For
sparse problems this screens out the vast majority of predictors and
was a step-change in speed for large $p$. Like several other `glmnet`
milestones, the code landed on CRAN a year before the paper appeared.

**Coefficient constraints.** Upper and lower limits on individual
coefficients, together with `glmnet.control()` for tuning numerical
parameters, appeared in v1.9-1 (2013). The ability to fix a
coefficient in an interval --- or pin it to zero from outside via
`exclude` --- is now routinely relied upon by downstream packages.
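
A minimal sketch of these controls, assuming a numeric matrix `x` and
response `y` as in the earlier sketches:

```{r constraints-sketch, eval = FALSE, echo = TRUE}
fit <- glmnet(x, y,
              lower.limits = 0,    # force all coefficients non-negative
              upper.limits = 2,    # and bounded above
              exclude = c(3, 7))   # pin variables 3 and 7 to zero
glmnet.control(fdev = 0)           # numerical tuning: never stop the path early
```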

**Cross-validation maturity.** The v2.0-1 (2015) release rewrote
`cv.glmnet`: each fold fits on its own $\lambda$ sequence and is then
evaluated at the original path. The alignment subtlety that motivated
this --- paths from different folds do not in general hit the same
$\lambda$ values --- was subsequently addressed again in v2.0-18
(2019) via the `alignment` argument.
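
A brief sketch of the user-level control this introduced, with `x`
and `y` as before:

```{r cv-sketch, eval = FALSE, echo = TRUE}
cvfit <- cv.glmnet(x, y, alignment = "lambda")  # evaluate every fold at the
plot(cvfit)                                     # master lambda sequence
cvfit$lambda.min                                # the CV-selected lambda
```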

**Cox refinements.** A bug affecting ties between the death set and
the risk set was fixed in v2.0-19, and `coxgrad` and an improved
`coxnet.deviance` were added in v2.0-20. These changes prepared the
ground for the larger Cox work that followed in v4.1.

**Fortran maintenance.** Substantial effort went into making the
Mortran-generated Fortran build cleanly on every CRAN platform:
portability across Fortran compilers (v2.0-15), consistent double
precision and `-Wall` cleanliness (v2.0-16, 2018), and native routine
registration (v2.0-8, 2017). These changes were behind the scenes,
largely invisible to users, but essential for keeping the package
passing CRAN checks for another decade.

## 4. Usability matures: v3.0 (2019)

Version 3.0, released in November 2019, was the first release in
which the feature set grew faster than the implementation underneath
it. The headline additions, contributed by Trevor Hastie and
Balasubramanian Narasimhan, were: the `relax` argument [@best_subset] to both
`glmnet` and `cv.glmnet`, which refits each active set in the path
without regularization and blends the two fits through a `gamma`
parameter (a dedicated `relax` vignette documents the usage);
`assess.glmnet`, `roc.glmnet`, and `confusion.glmnet` for evaluating
model performance; `makeX` for building input matrices from raw data
frames with one-hot encoding of factors and NA handling; and
`bigGlm` for unpenalized GLM fits using the same algorithmic
machinery. A `trace.it` progress bar made long paths less opaque,
and `print` methods for `cv.glmnet` and the new `relaxed` class were
tidied up.
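
A compact sketch of these additions in use (simulated binary data;
illustrative only):

```{r v3-sketch, eval = FALSE, echo = TRUE}
set.seed(3)
xb  <- matrix(rnorm(200 * 30), 200, 30)
yb  <- rbinom(200, 1, plogis(xb[, 1]))
cvr <- cv.glmnet(xb, yb, family = "binomial", relax = TRUE)  # gamma blends fits
assess.glmnet(cvr, newx = xb, newy = yb)   # deviance, class error, AUC, ...
big <- bigGlm(xb, yb, family = "binomial") # unpenalized GLM, same machinery
df  <- data.frame(f = factor(c("a", "b", "a")), z = c(1, NA, 3))
makeX(df, na.impute = TRUE)                # one-hot encodes f, imputes z
```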

Importantly, the family dispatch at this point was still entirely
character-based. A call `glmnet(x, y, family = "binomial")` passed
through an `if` branch into `lognet.R`, which called a Fortran
subroutine dedicated to logistic regression. Every family had its
own subroutine. That was about to change.

## 5. GLM extension: v4.0 (2020)

Before version 4.0, each family string such as `"binomial"` was
routed to a dedicated routine. Version 4.0 (May 2020) generalized the
objective to

$$
\sum_{i=1}^n w_i\,\text{NLL}_i(\beta_0,\beta) + \lambda\,P_\alpha(\beta),
$$

where $\text{NLL}_i$ is the negative log-likelihood contribution of
observation $i$ under *any* GLM family, and the elastic-net penalty is
$P_\alpha(\beta) = \alpha\|\beta\|_1 +
\tfrac{1-\alpha}{2}\|\beta\|_2^2$. As with classical GLMs, the model
is fit by iteratively reweighted least squares, here with a lasso or
elastic-net penalty, computed pathwise down the sequence of `lambda`
values using cyclical coordinate descent. That is, at each value of
`lambda` we run a *proximal Newton* iteration, each step of which
solves the penalized weighted least squares problem

$$
\min_{\beta_0,\,\beta}\;\frac{1}{2n}\sum_{i=1}^n w_i \bigl(z_i - \beta_0 - x_i^\top\beta\bigr)^2
 + \lambda\,P_\alpha(\beta),
$$

where $z_i$ is the working response and $w_i$ the working weight from
the current quadratic expansion of the log-likelihood. The
coordinate-descent kernel that serves this inner loop is the weighted
least squares routine `wls`, which becomes the engine for every
family the package does not handle with a dedicated subroutine.
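
The following R sketch conveys the shape of one such Newton step;
`pwls()` is a *hypothetical* stand-in for the internal `wls` kernel,
so this is an illustration of the algorithm, not the package's actual
code:

```{r irls-sketch, eval = FALSE, echo = TRUE}
irls_step <- function(x, y, family, beta0, beta, lambda, alpha) {
  eta <- beta0 + drop(x %*% beta)       # current linear predictor
  mu  <- family$linkinv(eta)            # fitted means
  mpe <- family$mu.eta(eta)             # d mu / d eta
  w   <- mpe^2 / family$variance(mu)    # working weights
  z   <- eta + (y - mu) / mpe           # working response
  ## one penalized weighted least squares solve (the `wls` kernel's job);
  ## `pwls` is hypothetical, standing in for coordinate descent on the
  ## elastic-net-penalized WLS objective above
  pwls(x, z, weights = w, lambda = lambda, alpha = alpha)
}
```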

The programming was done by Kenneth Tay, then a Ph.D. student at
Stanford, with supervision from Hastie and Narasimhan. The
user-visible payoff is that the `family` argument can now be an R
`family()` object: any object of class `"family"` in the sense of
`stats::glm`, including `MASS::negative.binomial`, `statmod::tweedie`,
or a user-defined family. The published reference is @glmnetFlex. Note
that the paper appeared three years after the release; the software
led the publication.
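
In use (simulated count data; `quasipoisson` has no character
shortcut, so it exercises the new generic path):

```{r family-sketch, eval = FALSE, echo = TRUE}
set.seed(4)
n <- 100; p <- 20
x <- matrix(rnorm(n * p), n, p)
y <- rpois(n, exp(x[, 1]))
fit_fast <- glmnet(x, y, family = "poisson")       # string: dedicated kernel
fit_flex <- glmnet(x, y, family = quasipoisson())  # family object: IRLS path
```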

![Family dispatch in `glmnet` v4.0. The new right-hand branch
(`glmnet.path`, `glmnet.fit`, `wls`) handles any `family()` object;
the original per-family subroutines remain for the built-in
string-named families, which stay on the fast
path.](assets/figures/dispatch_v4.png){width=92%}

Two design decisions deserve explicit mention. First, the pre-v4
character-named families were retained and continue to go through
their dedicated subroutines, because a hand-written kernel will
always beat a generic IRLS loop. So the flexibility was added alongside the
old code, not on top of it. Second, the warm-start convention evolved:
the earliest v4.0 form required passing a full `glmnetfit` object,
which was awkward; in v4.0.1 this was relaxed so that a warm start can
be a simple list containing the intercept and coefficient vector.
Consistency with the pre-v4 numerical results, verified through
KKT-condition checks against the Fortran output, was the correctness
criterion throughout.

## 6. Cox catches up: v4.1 (2021)

Version 4.1 (January 2021) brought the Cox model to something close
to feature parity with `survival::coxph`, which has served as the
reference implementation throughout `glmnet`'s Cox history. Four
things were added in a single release, all four combined in the
sketch following this list:

1. *Left truncation and time-varying covariates* via $(start, stop]$
   counting-process input, the same format `survival` has long
   accepted.
2. *Strata*, specified via `stratifySurv()`, with the interpretation
   of `survival::coxph`: a separate baseline hazard per stratum and
   a single shared coefficient vector.
3. *Sparse $x$* for Cox models, which had been a notable gap --- the
   pre-v4.1 Cox path required a dense matrix even when the rest of
   the package was happy with `dgCMatrix`.
4. A `survfit` method for `coxnet` objects so that fitted survival
   curves could be produced without leaving `glmnet`.
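
A hedged sketch with simulated data, combining all four additions
(illustrative only):

```{r cox-sketch, eval = FALSE, echo = TRUE}
library(glmnet)
library(survival)
library(Matrix)
set.seed(5)
n <- 200; p <- 10
xs     <- rsparsematrix(n, p, density = 0.3)   # sparse x for Cox (item 3)
start  <- runif(n); stop <- start + rexp(n)    # (start, stop] input (item 1)
status <- rbinom(n, 1, 0.6)
y   <- stratifySurv(Surv(start, stop, status),
                    strata = rep(1:2, n / 2))  # two strata (item 2)
fit <- glmnet(xs, y, family = "cox")
sf  <- survfit(fit, s = 0.05, x = xs, y = y)   # survival curves (item 4)
```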

Much of this work was also implemented by Tay, building on the v4.0 IRLS machinery
but targeting the Cox deviance specifically. In v4.1-2, following a
suggestion by Rob Tibshirani, the `exclude` argument was extended from
a fixed set of indices to *a function* that is handed `x`, `y`, and
`weights` and returns the indices to drop. This is the user-level hook
by which strong-rules screening can be plugged into the path; see
@strongrules and, for its application at GWAS scale, @snpnet.
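
A hedged sketch of the interface, assuming a numeric matrix `x` and
response `y`; the variance filter here is illustrative, not the
strong rules themselves:

```{r exclude-sketch, eval = FALSE, echo = TRUE}
## drop the least-variable tenth of the predictors before fitting
low_var_filter <- function(x, ...) {
  v <- apply(as.matrix(x), 2, var)
  which(v < quantile(v, 0.1))
}
fit <- glmnet(x, y, exclude = low_var_filter)
```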

## 7. Fortran → C++ port (2021--2022)

During the summer of 2021, first-year statistics Ph.D. student James
Yang was looking for a fun activity after completing the grueling
qualifying exams. He decided to rewrite the Fortran backbone of
`glmnet` in C++! This caught the attention of Hastie (who became his
advisor) and Narasimhan, because Yang is a very experienced
programmer.  Between v4.1-3 (November 2021) and v4.1-6 (November
2022), Yang replaced essentially all of the Mortran/Fortran in the
package with a header-only C++ implementation built on `Eigen`. This
implementation, `glmnetpp`, is a single template kernel that absorbs
dense and sparse matrices and all the built-in families through
parametrization rather than code duplication.

The staging was deliberate. In v4.1-3 the weighted least squares
kernels (`wls` and its sparse counterpart) and the `elnet` family of
routines were ported first --- these were the engines that the new
`glmnet.fit` path exercised the most, and porting them delivered
roughly 4--8$\times$ speedups on the generic GLM path. In v4.1-4 the
rest of the Fortran was rewritten, leaving only the Cox subroutine.
In v4.1-6 the legacy Fortran that could be removed was removed.

![Family dispatch after the C++ port (v4.1-3 onwards). Coloured
boxes indicate kernels rewritten in C++; only `coxnet()` remained in
Fortran.](assets/figures/dispatch_v4plus.png){width=92%}

Several thousand lines of template C++ replaced several thousand lines
of Fortran, with comparable or better performance, in part because
arguments are no longer copied at the interface: since R 3.1 the `dup
= FALSE` option of `.Fortran` has had no effect, so every call into
the old kernels paid for copies, while the new interface passes data
by reference. The port also provided a much lower barrier to future
extension, and removed the dependence on a generated-Fortran
toolchain. The Cox subroutine was left alone because its numerical
structure --- stratified risk sets, ties, left truncation --- is
harder to express cleanly as a template, and because James Yang's
priority was to get the rest of the package onto the new
footing. Resolving Cox would take another three years.

## 8. Applications feeding back into the package

`glmnet` is also a piece of infrastructure on which other statistical
software is built, and some of those applications have exerted
pressure on the package's interface. We mention three briefly.

*SNPnet* [@snpnet] applies the lasso at the scale of modern
genome-wide association studies --- hundreds of thousands of samples
by hundreds of thousands of variants --- by aggressively screening
predictors using the strong rules of @strongrules. The work
motivated the v4.1-2 generalization of `exclude` to accept a
function, so that strong-rules screening can be plugged into the
standard `glmnet` call rather than requiring a fork.

*The data-shared lasso* [@datasharedlasso] uses `glmnet` as its
optimization backend to share information across related datasets
while retaining per-dataset coefficients, an idiom close to multi-task
learning.

*Cooperative learning* [@coopLearning] is a framework for
multi-view regression in which multiple data modalities on the same
subjects (genomics, proteomics, imaging, and so on) are jointly
regressed against an outcome under a regularization term that
penalizes disagreement between the views. The companion CRAN
package `multiview` dispatches its fits to `glmnet`.

## 9. Streamlined Cox and v5.0 (2026)

On the eve of v5.0 the package was in an unusual state: every kernel
except the Cox subroutine had been rewritten in C++. The Cox kernel
had survived because of the combinatorial complexity its users
expect of it --- tie handling, stratified risk
sets, left truncation --- and because producing a numerically stable
deviance, gradient, and Hessian in all of these cases in a single
clean implementation is harder than it looks.

The v5.0 release resolves this. Jonathan Taylor joined the team, and contributed
`coxdev`, a standalone C++ library that computes the Cox partial
log-likelihood deviance together with its gradient and Hessian,
handling Breslow and Efron tie methods, strata, and $(start, stop]$
data through a small number of well-tested classes. `coxdev` is
separately developed (a nested git submodule of `glmnetpp`, which is
itself a submodule of `glmnet`), and the computational details are
the subject of a companion vignette. At the same time,
Narasimhan replaced the Fortran code for the Cox regularization path with
C++, making use of the `coxdev` library for the gradient and Hessian
computations, which automatically takes care of all the special cases.
From the point of
view of `glmnet`, the essential change is that Cox is now a
first-class GLM type inside `glmnetpp` --- `glm_type::cox` in the
type enumeration --- and the same IRLS loop that drives the
Gaussian, binomial, and Poisson paths drives the Cox path, with a
thin `CoxDevAdapter` handing the deviance and gradient queries off to
`coxdev`. The last of the Fortran leaves the package in this release.

![Full architecture of `glmnet` v5.0 with the `glmnetpp` and `coxdev`
submodules, the Rcpp boundary, and the `CoxDevAdapter` bridge from
the `glmnetpp` IRLS loop to the `coxdev` deviance and gradient
routines.](assets/figures/dispatch_v5.png){width=100%}

Two user-visible consequences of this unification are worth stating
plainly. First, the Cox model now takes a `cox.ties` argument
with choices `"breslow"` or `"efron"`. The v5.0 default is
`cox.ties = "breslow"`, preserving the pre-v5.0 numerical behavior for
upgraders; v5.1 will switch the default to `"efron"` to match
`survival::coxph`. Calls to `glmnet()` with `family = "cox"` that do
not set `cox.ties` explicitly emit a transition warning until the
switch occurs. Second, the performance picture is favourable: on the benchmark set
in `tests/benchmark_cox.R`, dense Cox paths are roughly 3 times
faster than the legacy Fortran, sparse Cox paths roughly 90 times
faster, and stratified Cox paths --- which had been the most
penalized case in the old implementation --- roughly 300 times
faster. Full benchmark numbers appear in the NEWS entry for v5.0.
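
Illustratively, reusing the simulated survival data from the v4.1
sketch (the `cox.ties` values are as described above):

```{r ties-sketch, eval = FALSE, echo = TRUE}
fit_b <- glmnet(xs, y, family = "cox", cox.ties = "breslow")  # v5.0 default
fit_e <- glmnet(xs, y, family = "cox", cox.ties = "efron")    # matches coxph;
                                                              # default from v5.1
```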

## 10. Looking forward

Development continues along several lines. A Python `glmnet`, sharing
a common C++ codebase with the R package so that the same kernels
serve both languages, is being prepared by Jonathan Taylor. The shared codebase
opens the possibility of compiling `glmnet` to WebAssembly for
in-browser use. Upstream work in `coxdev` is aimed at reducing
per-call allocation in the IRLS loop, which is the last remaining
place where the Cox path pays a measurable overhead for its
extensibility.

## 11. Contributors and acknowledgements

`glmnet` is the work of many hands. Jerome Friedman wrote the
original coordinate-descent kernel and the Mortran infrastructure,
and is a co-author of @glmnet, @coxnet, and @pathwise. Trevor
Hastie has been co-author throughout, as well as the package
maintainer. He also wrote the bulk of the R interface code, as well as
methods for plotting, printing, predicting and cross-validation, and
the corresponding code for the relaxed models (with help from Narasimhan).

Robert Tibshirani introduced the lasso [@lasso] and co-authored the
 strong-rules work on which the package depends. Noah
Simon led the original Cox path work and is the first author of
@coxnet. Kenneth Tay is responsible for the v4.0 GLM extension and
much of the v4.1 Cox expansion, and is first author of @glmnetFlex.
Balasubramanian Narasimhan has maintained the package through the
v4.x and v5.0 architectural transitions and is a co-author of
@glmnetFlex. James Yang carried out the v4.1-3 through v4.1-6
port from Fortran to C++ and is the author of `glmnetpp`. Junyang
Qian is first author of @snpnet, whose screening machinery
influenced the v4.1-2 `exclude`-as-function interface. Jonathan
Taylor contributed `coxdev` and with Narasimhan drove the v5.0 unification of the
Cox path onto the C++ engine. Many bug reports and fixes
contributed by users over the years are recorded in `NEWS.md`; the
work of Tomas Kalibera, David Keplinger, and others is gratefully
acknowledged.

## References
