The treeSS package implements the tree-spatial scan statistic of Cançado, Oliveira, Quadros and Duczmal (2025), which detects clusters that are simultaneously anomalous in geographic space and on a hierarchical classification tree. It generalises Kulldorff’s (1997) circular spatial scan by searching jointly over circular spatial zones \(z\) and tree branches \(g\), returning the (zone, branch) pair that maximises the likelihood ratio.
The package targets epidemiological and surveillance applications where the events to be clustered (deaths, hospitalisations, crimes, collisions) carry both a location and a categorical hierarchy (ICD-10 chapter \(\to\) subchapter \(\to\) leaf, NAICS sector, road type, etc.). Detecting clusters jointly in both dimensions can identify specific cause–area combinations that pure spatial or pure tree scans would miss.
The package provides:
| Function | Purpose |
|---|---|
treespatial_scan() |
Tree-spatial scan (main entry point) |
circular_scan() |
Kulldorff’s circular spatial scan (tree-free) |
tree_scan() |
Tree-based scan (space-free) |
filter_clusters() |
Distinct secondary clusters per Cançado et al. (2025), Sec. 5.1.1 |
sequential_scan() |
Sequential adjustment of Zhang, Assunção & Kulldorff (2010) |
get_cluster_regions() |
Tidy region-by-cluster table for plotting |
aggregate_tree() |
Roll counts up the tree |
All scans accept the Poisson and Binomial models of the original
paper and run the Monte Carlo step under OpenMP via the
n_cores argument.
The package ships four example datasets:
| Dataset | Country | Domain | Regions | Tree |
|---|---|---|---|---|
rj_mortality + rj_tree |
Brazil | Public health | 89 municipalities | ICD-10 (622 nodes) |
chicago_crimes +
chicago_tree |
USA | Crime | 77 community areas | Type × Description × Location (2841 nodes) |
fl_deaths |
USA | Public health | 65 counties | (built by user from raw data) |
london_collisions +
london_tree |
UK | Road safety | 33 boroughs | Light × Road × Junction (81 nodes) |
This vignette walks through rj_mortality end-to-end,
reproducing the published analysis of Cançado et al. (2025). A companion
vignette, vignette("florida", package = "treeSS"), is
pedagogical: it builds the inputs from raw long-form data and the ICD-10
tree from the codes that actually appear in the data. The Chicago and
London datasets are worked out in the companion software paper.
We reproduce Section 5.2 of Cançado et al. (2025): all live births and infant deaths in the 89 municipalities of Rio de Janeiro state in 2016, classified by the ICD-10 hierarchy. The pre-built dataset combines deaths, live births, ICD-10 leaf codes and centroid coordinates in long format — exactly the parallel-vector input contract that treeSS expects.
library(treeSS)
data(rj_mortality)
data(rj_tree)
c(rows = nrow(rj_mortality),
municipalities = length(unique(rj_mortality$region_id)),
tree_nodes = nrow(rj_tree),
total_deaths = sum(rj_mortality$cases),
total_births = sum(unique(
rj_mortality[, c("region_id", "live_births")]
)$live_births))
#> rows municipalities tree_nodes total_deaths total_births
#> 1440 89 622 2981 219124The scan is one function call:
result_rj <- treespatial_scan(
cases = rj_mortality$cases,
population = rj_mortality$live_births,
region_id = rj_mortality$region_id,
x = rj_mortality$x,
y = rj_mortality$y,
node_id = rj_mortality$node_id,
tree = rj_tree,
max_pop_pct = 0.50,
nsim = 999, seed = 2016,
n_cores = 4L
)
print(result_rj)The most likely cluster is the ICD-10 leaf code P209 (“unspecified intrauterine hypoxia”) concentrated in 18 municipalities of the north fluminense region. Twenty-seven deaths are observed against an expectation of 3.34 under uniform risk — a relative risk of \(\sim 8\) — with a Monte Carlo \(p\)-value of \(1/(\text{nsim} + 1) = 0.001\).
This closely matches Table 7 of Cançado et al. (2025): the same 18 municipalities (by IBGE code), 27 observed deaths, expected \(\approx 3.37\), log-likelihood ratio \(LR = 38.48\) (we obtain \(\approx 38.83\)). The minor LR discrepancy is explained by a SIM database update between the paper’s analysis and the version shipped here, and by the discontinuation of TCU population estimates after 2017; the conclusions are identical.
To make the value of the joint search concrete, it is worth running the two simpler scans on exactly the same data. The purely spatial scan, ignoring the tree, returns a single 8-municipality cluster around greater Rio de Janeiro driven by the aggregate death count (Cançado et al. 2025, Table 5); the P209 cluster in the north of the state does not appear at all, because aggregating over all causes dilutes the cause-specific signal. We can verify this directly by aggregating to one row per municipality and running the circular scan on the totals:
rj_agg <- unique(rj_mortality[, c("region_id", "x", "y", "live_births")])
rj_agg$cases <- tapply(rj_mortality$cases,
rj_mortality$region_id,
sum)[as.character(rj_agg$region_id)]
result_spatial <- circular_scan(
cases = rj_agg$cases,
population = rj_agg$live_births,
region_id = rj_agg$region_id,
x = rj_agg$x,
y = rj_agg$y,
max_pop_pct = 0.50,
nsim = 999, seed = 2016
)
result_spatial$most_likely_cluster$region_idsConversely, the purely tree-based scan, ignoring geography, returns more than fifty significant cuts at the 5 % level on this dataset (Cançado et al. 2025, Table 6); the analyst is then left to apply a multiple-testing correction across tree branches and decide which to follow up. The joint scan returns a single most-likely \((z, g)\) pair with one Monte Carlo \(p\)-value — the family-wise error rate is already controlled by the joint maximisation.
The paper’s Section 5.1.1 specifies a filtering rule: a candidate
(zone, branch) pair is retained only if its branch is disjoint from
every previously retained branch (no ancestor/descendant relation),
or its zone is disjoint from every previously retained zone (no
region in common). This is implemented in
filter_clusters():
filter_clusters(result_rj)[1:5,
c("node_id", "n_regions",
"cases", "expected", "llr")]No additional Monte Carlo simulation is performed; the secondary
clusters are read directly from
result_rj$secondary_clusters and filtered for distinctness
in \(O(k^2)\) time. The top filtered
candidates correspond to the rows of Table 7 of Cançado et
al. (2025).
get_cluster_regions() returns a tidy data.frame ready to
merge with any polygon layer:
# Most likely cluster only (single map)
cr1 <- get_cluster_regions(result_rj, n_clusters = 1, overlap = FALSE)
# Top-2 distinct clusters (faceted map)
cr2 <- get_cluster_regions(result_rj, n_clusters = 2, overlap = TRUE)The package does not ship sf polygons for Rio de Janeiro
— they are easy to fetch on demand from :
library(ggplot2)
library(geobr)
library(sf)
mun <- read_municipality(code_muni = "RJ", year = 2016)
mun$code6 <- as.integer(substr(mun$code_muni, 1, 6))
cr2 <- merge(get_cluster_regions(result_rj, n_clusters = 2,
overlap = TRUE),
unique(rj_mortality[, c("region_id", "ibge_code")]),
by = "region_id")
mun_facet <- merge(mun, cr2, by.x = "code6", by.y = "ibge_code")
ggplot(mun_facet) +
geom_sf(aes(fill = factor(cluster)), color = "gray40",
alpha = 0.75) +
scale_fill_manual(values = c("1" = "#C44E52", "2" = "#4C72B0"),
na.value = "gray95", na.translate = FALSE) +
facet_wrap(~ panel, nrow = 1) +
theme_minimal() +
theme(legend.position = "none")A complete annotated worked example, including the polygon download
and faceted plots for both filter_clusters() and
sequential_scan() results, is shipped at
system.file("examples", "example_brazil_rj.R", package = "treeSS").
filter_clusters() extracts distinct clusters from a
single run of the scan. It is fast (no additional Monte Carlo)
and faithful to the paper. When the most likely cluster is
overwhelmingly strong, however, it can shadow weaker secondary
clusters: the excess cases inside the MLC inflate the expected counts
everywhere else, weakening the likelihood ratios of any other genuine
cluster on the same dataset.
For that situation the package also offers
sequential_scan(), adapted from Zhang, Assunção and
Kulldorff (2010). It detects the MLC, removes the MLC’s regions
(optionally with a buffer of nearest neighbours) from the dataset, and
re-runs the scan on the reduced data with a fresh Monte Carlo
simulation. The procedure iterates until the current MLC is no longer
significant at the user-specified \(\alpha\) level. Because each iteration’s
\(p\)-value is computed against the
simulated null distribution of the current (reduced) dataset,
no multiple-testing correction is required.
seq_rj <- sequential_scan(
cases = rj_mortality$cases,
population = rj_mortality$live_births,
region_id = rj_mortality$region_id,
x = rj_mortality$x,
y = rj_mortality$y,
node_id = rj_mortality$node_id,
tree = rj_tree,
max_iter = 5L, alpha = 0.05,
nsim = 999, seed = 2016,
max_pop_pct = 0.50, n_cores = 4L
)
print(seq_rj)The clusters data.frame has one row per iteration with
the cluster identifier, log-likelihood ratio, fresh \(p\)-value, and a logical
significant flag:
seq_rj$clusters[, c("iteration", "node_id", "n_regions",
"llr", "pvalue", "significant")]Mapping the sequential result uses the same
get_cluster_regions() generic. The complete plotting
workflow is shipped in the same annotated example file.
filter_clusters() when you want to follow Cançado
et al.
sequential_scan() when the MLC may be hiding weaker
secondary clusters that filter_clusters() cannot detect on
the original (unreduced) data.The two answer different questions and are complementary; both are appropriate in different applied scenarios.
treespatial_scan() and sequential_scan()
both accept n_cores, which distributes the Monte Carlo step
across OpenMP threads in C++. Each thread carries its own
std::mt19937 generator seeded deterministically from the
user-supplied seed, so the result is reproducible for any
fixed pair (seed, n_cores). The serial path
(n_cores = 1L) uses R’s rmultinom and is
bit-identical to the pre-OpenMP implementation.
For publication-critical work where simulated \(p\)-values must be bit-identical between
machines, use n_cores = 1L. For exploratory work,
n_cores = 4 typically yields a 4–6× speed-up.
| File | Purpose |
|---|---|
inst/examples/example_brazil_rj.R |
Full plotting workflow for the RJ analysis (this vignette) |
inst/examples/example_florida.R |
Building the input from raw long-form data — see
vignette("florida") |
inst/examples/example_chicago.R |
Crime in Chicago, 2023 (discussed in the companion paper) |
inst/examples/example_london.R |
Road collisions in London, 2022 (discussed in the companion paper) |
Cançado, A. L. F., Oliveira, G. S., Quadros, A. V. C., & Duczmal, L. H. (2025). A tree-spatial scan statistic. Environmental and Ecological Statistics, 32, 953–978. doi:10.1007/s10651-025-00670-w
Kulldorff, M. (1997). A spatial scan statistic. Communications in Statistics — Theory and Methods, 26(6), 1481–1496.
Kulldorff, M., Fang, Z., & Walsh, S. J. (2003). A tree-based scan statistic for database disease surveillance. Biometrics, 59(2), 323–331.
Zhang, Z., Assunção, R., & Kulldorff, M. (2010). Spatial scan statistics adjusted for multiple clusters. Journal of Probability and Statistics, 2010, 642379.