% Generated by roxygen2: do not edit by hand
% Please edit documentation in R/a.R
\name{fru}
\alias{fru}
\title{Train the fru model}
\usage{
fru(
  x,
  y,
  trees = 500L,
  tries,
  forest = FALSE,
  oob = TRUE,
  importance = FALSE,
  solidify = FALSE,
  threads = 0L
)
}
\arguments{
\item{x}{Data frame containing predictors; must only contain logical, numeric, integer or factor columns, without NAs.}

\item{y}{Decision; either a factor or logical, for classification, or numeric, for regression.
An integer decision is silently converted into a real vector and treated as numeric afterwards.
NA values are not accepted.}

\item{trees}{Number of trees to grow, a single number larger than zero.
Also called \code{ntree} in other software.
Defaults to 500; in principle, it should be large enough to stabilise the outputs of the forest, whether prediction accuracy or importance. Larger data sets generally need more trees, and overshooting the ensemble size is unlikely to hurt the model in a statistically significant way.}

\item{tries}{Number of features to try at each split, a single number larger than zero and not larger than the number of columns in \code{x}.
Also called \code{mtry} in other software.
By default, set to the rounded square root of the number of features.
It is unlikely that this needs tweaking; increasing this value leads to more accurate decision trees, but in turn makes them more correlated, spoiling the ensemble effect.}

\item{forest}{If set to \code{TRUE}, the forest object is returned and can be used for prediction.}

\item{oob}{If set to \code{TRUE}, out-of-bag (OOB) predictions will be calculated.}

\item{importance}{If set to \code{TRUE}, importance scores will be calculated.}

\item{solidify}{If set to \code{TRUE}, the forest object will use more memory but will survive serialisation, in particular when saved
by \code{save}, \code{saveRDS} or when sent between processes.
This can be done later with \code{solidify}, unless the model structure was already lost.}

\item{threads}{Number of threads to use; by default, or when set to 0, fru will try to use all available computing cores.}
}
\value{
The fitted model, an object of class \code{fru}.
}
\description{
Fru is an implementation of Leo Breiman's Random Forest (tm) method.
It fits an ensemble of decision trees built on bootstrap resamples of observations and additionally perturbed by constraining split optimisation to a random subset of features.
The ensemble prediction is then established from individual trees by voting.
Thanks to its construction, the model can also provide a cross-validation-like internal approximation of error, the so-called out-of-bag predictions, as well as importance scores for features.
}
\details{
In comparison to similar packages, fru is tailored towards stability, correctness, efficiency and scalability on modern multithreaded machines, providing a solid foundation for large data analysis, higher-level methods or production pipelines.
To this end, fru exposes only the original hyper-parameters and provides only the permutational importance, though calculated with a novel algorithm that alleviates its greater computational burden.

Fru accepts logical, numeric (including integer) and factor features; NAs are not allowed and will result in an error.
Logical features are always split into false/true groups without optimisation, yet are scored via weighted Gini impurity (for classification) or variance reduction (for regression), in order to be compared with splits on other features.
For numerical features, the threshold value is optimised by an exhaustive scan of the criterion above; for real values, the threshold is the mid-point between the two values around the split, while for integer values it is the smaller of the two.
In case of a tie in the score, a smaller threshold is used.
Ordered factors or factors with six or more levels are treated as numerical, and so follow the above procedure.
Unordered factors with five or fewer levels are split by finding a level partition into two subsets via an exhaustive scan of all possibilities, scored, as above, by Gini impurity or variance drop, depending on the forest type.

The maximal tree depth is hard-coded to 512; the critical sample size that triggers branch termination into a leaf is one for classification and four for regression; this means that regression needs at least ten objects to be practical.
Leaves may be formed from larger samples in some cases, for instance when no split can be found among the sampled features; in this case, for classification, random tie breaking is used.

Fru uses its own PRNG, the pcg32 method by Melissa E. O'Neill, for its capacity to produce reasonably decorrelated streams, which are used to provide reproducibility of the output in parallel scenarios, regardless of the number of threads.
Namely, fru guarantees that the same trees will be fit for the same input and random seed, although their order may differ.
Thus, OOB predictions and importance scores will be the same up to numerical errors.
The PRNG is used in training and in prediction on new data; the generator is seeded from the R generator, thus the standard R interface of \code{set.seed} should be used to control it.
}
\examples{
set.seed(1)
data(iris)
fru(iris[, -5], iris[, 5], threads = 2)
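
# The calls below are a hedged sketch that uses only the arguments documented
# above and base R; they are illustrations, not a definitive usage guide.

# The default value of tries is described as the rounded square root of the
# number of features, so passing that value explicitly is assumed to be
# equivalent to leaving the argument out.
x <- iris[, -5]
default_tries <- round(sqrt(ncol(x)))
set.seed(1)
fru(x, iris[, 5], trees = 100L, tries = default_tries, importance = TRUE, threads = 2)

# A numeric decision selects regression; here Sepal.Length is predicted from
# the remaining columns, including the three-level Species factor.
set.seed(1)
model <- fru(iris[, -1], iris[, 1], trees = 100L, threads = 2)
inherits(model, "fru")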
}
\references{
Breiman L. (2001). \emph{Random Forests}, Machine Learning 45, 5-32.

O'Neill M. E. (2014). \emph{PCG: A Family of Simple Fast Space-Efficient Statistically Good Algorithms for Random Number Generation}, HMC-CS-2014-0905.
}
