Outliers are observations within a dataset that seem not to belong with the rest of the data. They may be spurious entries that need to be eliminated before further analysis, or hard-to-detect signals that are of interest in their own right.

The **probout** package provides unsupervised estimates
of the probability of outlyingness for observations, based largely on
separation in terms of distance. It is intended for multivariate numeric
data with large numbers of sequentially accessible observations. The
dimensionality of the data should not be too large, so that distances
between individual observations can be computed efficiently.

The method relies on leader clustering (Hartigan, 1975) to reduce the
size of the data in an initial phase. Leader clustering partitions the
data into groups that are within a user-specified radius \(\rho\) of *leader* observations. The
leader observations are those that are not within \(\rho\) of an existing leader as the data is
processed sequentially. The leader observations, and hence the
associated groups, will typically vary with the order of the data. By
default, the data is normalized through min-max scaling, in which each
variable is mapped to the unit interval.
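The sequential pass described above can be sketched in a few lines of base R. This is an illustrative sketch only, not the **probout** implementation: the function names `minmax_scale` and `leader_sketch` are hypothetical, Euclidean distance is assumed, and assigning an observation to the first leader within the radius (rather than the nearest) is also an assumption.

```r
# Min-max scaling: map each column to the unit interval.
minmax_scale <- function(x) {
  x <- as.matrix(x)
  rng <- apply(x, 2, range)          # 2 x p matrix of column minima and maxima
  span <- rng[2, ] - rng[1, ]
  span[span == 0] <- 1               # avoid division by zero for constant columns
  sweep(sweep(x, 2, rng[1, ], "-"), 2, span, "/")
}

# One sequential pass over the rows: an observation joins the first existing
# leader within `radius`; otherwise it becomes a new leader.
leader_sketch <- function(x, radius) {
  x <- minmax_scale(x)
  leaders <- integer(0)              # row indices of the leader observations
  membership <- integer(nrow(x))     # group index for every observation
  for (i in seq_len(nrow(x))) {
    if (length(leaders) > 0) {
      # Euclidean distances from observation i to every current leader
      d <- sqrt(colSums((t(x[leaders, , drop = FALSE]) - x[i, ])^2))
      j <- which(d <= radius)
      if (length(j) > 0) {
        membership[i] <- j[1]
        next
      }
    }
    leaders <- c(leaders, i)
    membership[i] <- length(leaders)
  }
  list(leaders = leaders, membership = membership)
}

set.seed(1)
toy <- matrix(runif(200), ncol = 2)  # synthetic data, for illustration only
res <- leader_sketch(toy, radius = 0.2)
res_rev <- leader_sketch(toy[nrow(toy):1, ], radius = 0.2)
length(res$leaders)                  # number of groups in the original order
length(res_rev$leaders)              # number of groups with the rows reversed
```

Running the sketch on the same rows in reverse order will generally select different leaders, which illustrates the order dependence noted above.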

After leader clustering, an outlier probability is determined for each group. The computation uses the group centroids together with data simulated from a mixture model defined by the group proportions, centroids, and variances, which are accumulated as the data is processed sequentially. The centroids are included to ensure that every group is represented, including groups with proportions so small that a simulated observation is unlikely to be drawn from them.
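As a rough illustration of this step, the sketch below simulates from a mixture and then appends the centroids. The text does not specify the component distribution, so independent Gaussian components are assumed here. The group summaries (`props`, `centroids`, `variances`) are made-up values and `simulate_mixture` is a hypothetical helper, not a **probout** function.

```r
# Hypothetical group summaries (made up for illustration)
props <- c(0.70, 0.29, 0.01)                       # group proportions
centroids <- rbind(c(0.20, 0.30),
                   c(0.60, 0.70),
                   c(0.95, 0.05))                  # one row per group
variances <- rbind(c(0.010, 0.010),
                   c(0.020, 0.010),
                   c(0.001, 0.001))                # per-variable variances

# Draw n observations: pick a group by its proportion, then add Gaussian
# noise with that group's variances to the group centroid (an assumption).
simulate_mixture <- function(n, props, centroids, variances) {
  g <- sample(seq_along(props), size = n, replace = TRUE, prob = props)
  p <- ncol(centroids)
  noise <- matrix(rnorm(n * p), nrow = n) * sqrt(variances[g, , drop = FALSE])
  list(group = g, data = centroids[g, , drop = FALSE] + noise)
}

set.seed(7)
sim <- simulate_mixture(50, props, centroids, variances)

# Counts per group; the group with proportion 0.01 may not be drawn at all
table(factor(sim$group, levels = seq_along(props)))

# Appending the centroids guarantees every group appears at least once
augmented <- rbind(centroids, sim$data)
dim(augmented)
```

With only 50 draws, a group with proportion 0.01 is often absent from the simulated data, which is the situation the centroids are meant to cover.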

**probout** estimates outlier probabilities by fitting
an exponential distribution to a nonparametric outlier statistic from
robust statistics (Stahel 1981, Donoho 1982). This statistic is
essentially a robust \(z\)-score: for
each observation, the median is subtracted and the absolute value of the
result is divided by the median absolute deviation (MAD). For
multivariate data, the univariate statistic is repeatedly computed for
many random projections of the data, and the maximum value is retained
as the value of the multivariate statistic. Outliers correspond to
unusually large values of the outlier statistic.
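The statistic and the exponential fit can be sketched in base R as follows. This is a minimal illustration under stated assumptions, not the package's code: `robust_z` and `projection_outlyingness` are hypothetical names, `nproj` is a hypothetical parameter, the MAD is left unscaled, and using the fitted exponential CDF as the outlier probability is an assumption about how the fit is applied.

```r
# Univariate robust z-score: |v - median(v)| / MAD(v).
# constant = 1 gives the raw MAD (R's default rescales it by 1.4826).
robust_z <- function(v) {
  abs(v - median(v)) / mad(v, constant = 1)
}

# Multivariate statistic: maximum robust z-score over random unit directions.
projection_outlyingness <- function(x, nproj = 500) {
  x <- as.matrix(x)
  p <- ncol(x)
  stat <- rep(0, nrow(x))
  for (k in seq_len(nproj)) {
    u <- rnorm(p)
    u <- u / sqrt(sum(u^2))                 # random direction on the unit sphere
    stat <- pmax(stat, robust_z(drop(x %*% u)))
  }
  stat
}

set.seed(42)
# Synthetic data: 200 standard normal points plus one planted outlier (row 201)
toy <- rbind(matrix(rnorm(400), ncol = 2), c(8, 8))
s <- projection_outlyingness(toy)

# Exponential fit: the maximum-likelihood rate is 1 / mean of the statistic
rate <- 1 / mean(s)
prob <- pexp(s, rate = rate)                # fitted CDF used as an outlier score
which.max(s)                                # row with the largest statistic
```

In this toy example the planted point receives the largest value of the statistic, and hence the largest score under the fitted exponential distribution.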

We use the 100, 400, and 1500 meter timings from the *Decathlon*
dataset in the CRAN package **GDAdata**.

```
require(GDAdata)
data(Decathlon)
x <- Decathlon[,c("m100","m400","m1500")]
```

A projection of the data onto the first and third coordinates can be produced as follows:

```
plot(x[,1], x[,3], xlab = "100 meter timings", ylab = "1500 meter timings",
main = "", pch = 16, cex = .5)
```

To obtain outlier probabilities, first apply leader clustering:

```
require(probout)
require(FNN)
lead <- leader(x)
```

The *leader* function produces a list of leader clusterings, one
for each radius supplied as an argument. By default, it computes the
leader clustering for a single radius,
\(0.1 / \log(n)^{1/p}\)
from Wilkinson (2016), where \(n\) is the number of observations and
\(p\) the number of variables. This is the same default as in the
**HDoutliers** package (Fraley 2016). A plot of the leaders can be
produced as follows:

```
plot(x[,1], x[,3], xlab = "100 meter timings", ylab = "1500 meter timings",
main = "leader observations (blue)", pch = 16, cex = .5)
leads <- lead[[1]]$leaders
points(x[leads,1],x[leads,3],pch="+",cex=1.5,col="dodgerblue")
```
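To get a feel for the size of the default radius quoted earlier, the formula can be evaluated directly. The helper `default_radius` below is a hypothetical name, and it assumes that \(n\) is the number of observations, \(p\) the number of variables, and that the logarithm is the natural log.

```r
# Default radius 0.1 / log(n)^(1/p), evaluated with the natural logarithm
default_radius <- function(n, p) {
  0.1 / log(n)^(1 / p)
}

# Example values for a dataset with 1000 rows
default_radius(n = 1000, p = 3)
default_radius(n = 1000, p = 10)
```

For a fixed number of observations greater than \(e\), the radius grows toward 0.1 as the number of variables increases, since \(\log(n)^{1/p}\) shrinks toward 1.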