quickOutlier is a comprehensive toolkit for Data Mining
in R. It simplifies the process of detecting, visualizing, and treating
anomalies in your datasets using both statistical and machine learning
approaches.
The most common way to find outliers is to look at one variable at a time.

# Create dummy data with one obvious outlier (500)
df <- data.frame(
  id = 1:10,
  revenue = c(10, 12, 11, 10, 12, 11, 13, 10, 500, 11)
)
# Detect using Interquartile Range (IQR)
outliers <- detect_outliers(df, column = "revenue", method = "iqr")
print(outliers)
#>   id revenue  iqr_bounds
#> 9  9     500 [5 - 17.25]

Visual inspection is crucial. quickOutlier provides an
instant ggplot2 visualization to see where your anomalies
fall compared to the distribution.
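
The package's own plotting function is not shown in this section, so the following is only a rough sketch of the kind of plot described, written directly against ggplot2. The `is_outlier` column and the multiplier of 3 are assumptions made for illustration (the multiplier is inferred from the bounds printed above), not part of the quickOutlier API.

```r
library(ggplot2)

# Same dummy data as above
df <- data.frame(
  id = 1:10,
  revenue = c(10, 12, 11, 10, 12, 11, 13, 10, 500, 11)
)

# Flag values outside Q1 - 3 * IQR and Q3 + 3 * IQR (assumed multiplier)
q <- quantile(df$revenue, c(0.25, 0.75), names = FALSE)
iqr <- diff(q)
df$is_outlier <- df$revenue < q[1] - 3 * iqr | df$revenue > q[2] + 3 * iqr

# Scatter of revenue by id, with flagged points highlighted in red
p <- ggplot(df, aes(x = id, y = revenue, colour = is_outlier)) +
  geom_point(size = 3) +
  scale_colour_manual(values = c(`FALSE` = "grey40", `TRUE` = "red")) +
  labs(title = "Revenue with IQR-flagged outliers", colour = "Outlier")
print(p)
```

A boxplot or histogram would work equally well here; the point is simply to show the flagged value against the bulk of the data.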
Sometimes you don’t want to delete an outlier, but to “cap” it at a reasonable limit instead. This is called Winsorization.
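
As a minimal base R sketch of the idea (not the package's own function, which is not shown here), the values can be clamped to the IQR fences. The multiplier `k = 3` is an assumption: it is the value that reproduces the `[5 - 17.25]` bounds printed earlier when combined with R's default quantile method.

```r
revenue <- c(10, 12, 11, 10, 12, 11, 13, 10, 500, 11)

# Quartiles with R's default quantile method (type 7)
q <- quantile(revenue, c(0.25, 0.75), names = FALSE)
k <- 3  # assumed multiplier, inferred from the bounds shown above

lower <- q[1] - k * diff(q)  # 10.25 - 3 * 1.75 = 5
upper <- q[2] + k * diff(q)  # 12    + 3 * 1.75 = 17.25

# Winsorize: pull values outside the fences back to the nearest fence
capped <- pmin(pmax(revenue, lower), upper)
print(capped)
#> [1] 10.00 12.00 11.00 10.00 12.00 11.00 13.00 10.00 17.25 11.00
```

The row count is unchanged; only the extreme value of 500 is replaced by the upper fence.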
Some outliers are only visible when looking at two variables together (e.g., a person who is short but weighs a lot).
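
One way to quantify this is the Mahalanobis distance, which measures how far a point lies from the centre of the data after accounting for the correlation between variables. The sketch below uses base R's `stats::mahalanobis()` rather than the package; the seed, the `pts` data frame, and the 97.5% chi-square cutoff are illustrative assumptions.

```r
set.seed(42)  # arbitrary seed, only to make this sketch reproducible

x <- rnorm(50)
pts <- data.frame(x = x, y = 2 * x + rnorm(50, sd = 0.5))
# Add a point with an ordinary x but a y that breaks the x-y relationship
pts <- rbind(pts, data.frame(x = 0, y = 10))

# stats::mahalanobis() returns SQUARED distances from the centre
d2 <- mahalanobis(pts, center = colMeans(pts), cov = cov(pts))

# For roughly bivariate-normal data, d2 is approximately chi-square with 2 df
cutoff <- qchisq(0.975, df = 2)
flagged <- which(d2 > cutoff)
print(pts[flagged, ])
```

Neither coordinate of the added point is the most extreme value in its own column, so it would be easy to miss by checking one variable at a time; it stands out only when both are considered jointly.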
# Generate data: y correlates with x
df_multi <- data.frame(x = rnorm(50))
df_multi$y <- df_multi$x * 2 + rnorm(50, sd = 0.5)
# Add an anomaly: normal x, but impossible y given x
anomaly <- data.frame(x = 0, y = 10)
df_multi <- rbind(df_multi, anomaly)
# Detect using Mahalanobis Distance
detect_multivariate(df_multi, columns = c("x", "y"))
#>    x  y mahalanobis_dist
#> 51 0 10            43.42

For complex clusters where statistical methods fail, we use the Local Outlier Factor (LOF). This identifies points that are isolated from their local neighbors.
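
To illustrate what LOF measures, here is a small base R sketch of the standard algorithm. It is not the package's implementation, and the helper name `lof_scores`, the choice `k = 5`, and the simulated clusters are all assumptions for illustration. For simplicity it uses exactly `k` nearest neighbours (ties are not expanded) and assumes there are no duplicate points.

```r
# Hypothetical helper: LOF score for every row of a numeric matrix X
lof_scores <- function(X, k = 5) {
  D <- as.matrix(dist(X))  # pairwise Euclidean distances
  diag(D) <- Inf           # a point is not its own neighbour
  n <- nrow(D)

  # Row i holds the indices of the k nearest neighbours of point i
  nn <- t(apply(D, 1, function(d) order(d)[seq_len(k)]))
  # Distance from each point to its k-th nearest neighbour
  k_dist <- sapply(seq_len(n), function(i) D[i, nn[i, k]])

  # Local reachability density: inverse of the mean reachability distance
  lrd <- sapply(seq_len(n), function(i) {
    reach <- pmax(k_dist[nn[i, ]], D[i, nn[i, ]])
    1 / mean(reach)
  })

  # LOF: average neighbour density relative to the point's own density
  sapply(seq_len(n), function(i) mean(lrd[nn[i, ]]) / lrd[i])
}

set.seed(1)  # arbitrary seed for reproducibility
cluster_a <- cbind(rnorm(30, 0, 0.3), rnorm(30, 0, 0.3))
cluster_b <- cbind(rnorm(30, 5, 0.3), rnorm(30, 5, 0.3))
X <- rbind(cluster_a, cluster_b, c(2.5, 2.5))  # row 61 sits between clusters

scores <- lof_scores(X, k = 5)
which.max(scores)  # index of the most isolated point
```

Scores near 1 indicate a point whose local density matches that of its neighbours; values well above 1 indicate a point in a much sparser region than its neighbours. The point placed between the two clusters sits near the overall mean of the data, which is why a purely global, distance-from-centre method can miss it.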