% Generated by roxygen2: do not edit by hand
% Please edit documentation in R/auto_stratify.R
\name{auto_stratify}
\alias{auto_stratify}
\title{Auto Stratify}
\usage{
auto_stratify(data, treat, prognosis, outcome = NULL, size = 2500,
  pilot_fraction = 0.1, pilot_sample = NULL,
  group_by_covariates = NULL)
}
\arguments{
\item{data}{\code{data.frame} with observations as rows, features as columns}

\item{treat}{string giving the name of the column designating treatment
assignment}

\item{prognosis}{information on how to build prognostic scores.  Three
different input types are allowed: \enumerate{ \item vector of prognostic
scores for all individuals in the data set. Should be in the same order as
the rows of \code{data}. \item a \code{formula} for fitting a prognostic
model \item an already-fit prognostic score model}}

\item{outcome}{string giving the name of the column with outcome information.
Required if \code{prognosis} is a vector of prognostic scores or an
already-fit model.  If \code{prognosis} is a formula, the outcome is inferred
from the formula.}

\item{size}{numeric, desired size of strata (default = 2500)}

\item{pilot_fraction}{numeric between 0 and 1 giving the proportion of
controls to be allotted for building the prognostic score (default = 0.1)}

\item{pilot_sample}{a \code{data.frame} of held-aside samples for building
the prognostic score model (optional).}

\item{group_by_covariates}{character vector giving the names of covariates to
be grouped by (optional). If specified, the pilot set will be sampled in a
stratified manner, so that the composition of the pilot set reflects the
composition of the whole data set in terms of these covariates.  The
specified covariates must be categorical.}
}
\value{
Returns an \code{auto_strata} object.  This contains: \itemize{

  \item \code{outcome} - a string giving the name of the column where outcome
  information is stored

  \item \code{treat} - a string giving the name of the column encoding
  treatment assignment

  \item \code{analysis_set} - the data set with strata assignments

  \item \code{call} - the call to \code{auto_stratify} used to generate this
  object

  \item \code{issue_table} - a table of each stratum and potential issues of
  size and treat:control balance

  \item \code{strata_table} - a table of each stratum and the prognostic
  score quantile bin to which it corresponds

  \item \code{prognostic_scores} - a vector of prognostic scores.

  \item \code{prognostic_model} - a model for prognosis fit on a pilot data
  set.  Will be \code{NULL} if a vector of prognostic scores was provided as
  the \code{prognosis} argument to \code{auto_stratify} rather than a model
  or formula.

  \item \code{pilot_set} - the set of controls used to fit the prognostic
  model. These are excluded from subsequent analysis so that the prognostic
  score is not overfit to the data used to estimate the treatment effect.
  Will be \code{NULL} if a pre-fit model or a vector of prognostic scores was
  provided as the \code{prognosis} argument to \code{auto_stratify} rather
  than a formula.

  }
}
\description{
Automatically creates strata for matching based on a prognostic score
formula, an already-fit prognostic model, or a vector of prognostic scores
already estimated by the user. Creates an \code{auto_strata} object, which can
be passed to \code{\link{strata_match}} for stratified matching or unpacked by
the user to be matched by some other means.
}
\details{
Stratifying by prognostic score quantiles can be more effective than manually
stratifying a data set.  Because the prognostic score is continuous, the
strata produced tend to be of equal size with similar prognosis.

Automatic stratification requires information on how the prognostic scores
should be derived.  This is primarily determined by the specification of the
\code{prognosis} argument.  Three main forms of input for \code{prognosis}
are allowed: \enumerate{

\item A vector of prognostic scores. This vector should have the same length
and order as the rows of the data set.  If this method is used, the
\code{outcome} argument must also be specified; this is simply a string
giving the name of the column which contains outcome information.

\item A formula for prognosis (e.g. \code{outcome ~ X1 + X2}).  If this
method is used, \code{auto_stratify} will automatically split the data set
into a \code{pilot_set} and an \code{analysis_set}.  The pilot set will be
used to fit a logistic regression model for outcome in the absence of
treatment, and this model will be used to estimate prognostic scores on the
analysis set.  The analysis set will then be stratified based on the
estimated prognostic scores.  In this case the \code{outcome} argument need
not be specified since it can be inferred from the input formula.

\item A model for prognosis (e.g. a \code{glm} object).  If this method is
used, the \code{outcome} argument must also be specified.

}
}
\section{Troubleshooting}{


  This section suggests fixes for common errors that appear while fitting the
  prognostic model or using it to estimate prognostic scores on the analysis
  set.

  \itemize{

  \item \code{Encountered an error while fitting the prognostic model...
  numeric probabilities 0 or 1 produced}. This error means that the
  prognostic model can perfectly separate positive from negative outcomes.
  Estimating a treatment effect in this case is unwise since an individual's
  baseline characteristics perfectly determine their outcome, regardless of
  whether they receive the treatment.  This error may also appear on rare
  occasions when your pilot set is very small (number of observations
  approximately <= number of covariates in the prognostic model), so that
  perfect separation happens by chance.

  \item \code{Encountered an error while estimating prognostic scores ...
  factor X has new levels ... }. This may indicate that some value(s) of one
  or more categorical variables appear in the analysis set which were not
  seen in the pilot set. When prognostic scores are estimated for the
  analysis set, the prognostic model then encounters a value it was not
  fit to handle.  There are a few options for troubleshooting this
  problem: \itemize{

  \item \strong{Rejection sampling.}  Run \code{auto_stratify} again with the
  same arguments until this error does not occur (i.e. until some
  observations with the missing value are randomly selected into the pilot
  set).

  \item \strong{Eliminate this covariate from the prognostic formula.}

  \item \strong{Remove observations with the rare covariate value from the
  entire data set.} Consider carefully how this exclusion might affect your
  results.

  } }

  Other errors or warnings can occur if the pilot set is too small for the
  complexity of the prognostic formula.  Always make sure that the number of
  observations in the pilot set is large enough that you can confidently fit
  a prognostic model with the number of covariates you want.
}

\examples{
  # make sample data set
  set.seed(111)
  dat <- make_sample_data(n = 75)

  # construct a pilot set, build a prognostic score for `outcome` based on X2
  # and stratify the data set based on the scores into sets of about 25
  # observations
  a.strat_formula <- auto_stratify(dat, "treat", outcome ~ X2, size = 25)

  # stratify the data set based on a model for prognosis
  pilot_data <- make_sample_data(n = 30)
  prognostic_model <- glm(outcome ~ X2, pilot_data, family = "binomial")
  a.strat_model <- auto_stratify(dat, "treat", prognostic_model,
                                 outcome = "outcome", size = 25)

  # stratify the data set based on a vector of prognostic scores
  prognostic_scores <- predict(prognostic_model, newdata = dat,
                               type = "response")
  a.strat_scores <- auto_stratify(dat, "treat", prognostic_scores,
                                  outcome = "outcome", size = 25)
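
  # A sketch of supplying a held-aside pilot set through `pilot_sample`
  # instead of letting auto_stratify() sample one from `dat`.
  # Assumptions: `held_out` and `held_out_controls` are hypothetical names,
  # treat is coded 0/1 in the sample data, and the prognostic model should be
  # fit on controls only (outcome in the absence of treatment).
  held_out <- make_sample_data(n = 50)
  held_out_controls <- held_out[held_out$treat == 0, ]
  a.strat_pilot <- auto_stratify(dat, "treat", outcome ~ X2, size = 25,
                                 pilot_sample = held_out_controls)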

  # diagnostic plots
  plot(a.strat_formula)
  plot(a.strat_formula, type = "FM", propensity = treat ~ X1, stratum = 1)
  plot(a.strat_formula, type = "hist", propensity = treat ~ X1, stratum = 1)
  plot(a.strat_formula, type = "residual")
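
  # A sketch of unpacking the returned auto_strata objects by name, using only
  # the elements listed in the Value section
  head(a.strat_formula$analysis_set)  # data set with strata assignments
  a.strat_formula$strata_table        # prognostic score quantile bin per stratum
  a.strat_formula$issue_table         # size and balance issues per stratum
  # documented to be NULL when a vector of scores was supplied as `prognosis`
  is.null(a.strat_scores$prognostic_model)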
}
\seealso{
\code{\link{manual_stratify}}, \code{\link{new_auto_strata}}
}
