---
title: "Introduction to treeSS: reproducing Cançado et al. (2025)"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Introduction to treeSS: reproducing Cançado et al. (2025)}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE, comment = "#>",
  eval = FALSE
)
```

## Overview

The **treeSS** package implements the tree-spatial scan statistic of
Cançado, Oliveira, Quadros and Duczmal (2025), which detects clusters
that are simultaneously anomalous in geographic space and on a
hierarchical classification tree. It generalises Kulldorff's (1997)
circular spatial scan by searching jointly over circular spatial
zones $z$ and tree branches $g$, returning the (zone, branch) pair
that maximises the likelihood ratio.

The package targets epidemiological and surveillance applications
where the events to be clustered (deaths, hospitalisations, crimes,
collisions) carry both a *location* and a *categorical hierarchy*
(ICD-10 chapter $\to$ subchapter $\to$ leaf, NAICS sector, road type,
etc.). Detecting clusters jointly in both dimensions can identify
specific cause–area combinations that pure spatial or pure tree scans
would miss.

The package provides:

| Function | Purpose |
|:---------|:--------|
| `treespatial_scan()` | Tree-spatial scan (main entry point) |
| `circular_scan()`    | Kulldorff's circular spatial scan (tree-free) |
| `tree_scan()`        | Tree-based scan (space-free) |
| `filter_clusters()`  | Distinct secondary clusters per Cançado et al. (2025), Sec. 5.1.1 |
| `sequential_scan()`  | Sequential adjustment of Zhang, Assunção & Kulldorff (2010) |
| `get_cluster_regions()` | Tidy region-by-cluster table for plotting |
| `aggregate_tree()`   | Roll counts up the tree |

All scans accept the Poisson and Binomial models of the original
paper and run the Monte Carlo step under OpenMP via the `n_cores`
argument.

The package ships four example datasets:

| Dataset | Country | Domain | Regions | Tree |
|:--------|:--------|:-------|:--------|:-----|
| `rj_mortality` + `rj_tree` | Brazil | Public health | 89 municipalities | ICD-10 (622 nodes) |
| `chicago_crimes` + `chicago_tree` | USA | Crime | 77 community areas | Type × Description × Location (2841 nodes) |
| `fl_deaths` | USA | Public health | 65 counties | (built by user from raw data) |
| `london_collisions` + `london_tree` | UK | Road safety | 33 boroughs | Light × Road × Junction (81 nodes) |

This vignette walks through `rj_mortality` end-to-end, reproducing
the published analysis of Cançado et al. (2025). A companion vignette,
`vignette("florida", package = "treeSS")`, is pedagogical: it builds
the inputs from raw long-form data and the ICD-10 tree from the codes
that actually appear in the data. The Chicago and London datasets are
worked out in the companion software paper.

---

## Infant mortality in Rio de Janeiro

We reproduce Section 5.2 of Cançado et al. (2025): all live births
and infant deaths in the 89 municipalities of Rio de Janeiro state in
2016, classified by the ICD-10 hierarchy. The pre-built dataset
combines deaths, live births, ICD-10 leaf codes and centroid
coordinates in long format — exactly the parallel-vector input
contract that **treeSS** expects.

### Run the scan

```{r rj-data}
library(treeSS)
data(rj_mortality)
data(rj_tree)

c(rows           = nrow(rj_mortality),
  municipalities = length(unique(rj_mortality$region_id)),
  tree_nodes     = nrow(rj_tree),
  total_deaths   = sum(rj_mortality$cases),
  total_births   = sum(unique(
    rj_mortality[, c("region_id", "live_births")]
  )$live_births))
#> rows           municipalities  tree_nodes  total_deaths  total_births
#> 1440           89              622         2981          219124
```

The scan is one function call:

```{r rj-scan}
result_rj <- treespatial_scan(
  cases       = rj_mortality$cases,
  population  = rj_mortality$live_births,
  region_id   = rj_mortality$region_id,
  x           = rj_mortality$x,
  y           = rj_mortality$y,
  node_id     = rj_mortality$node_id,
  tree        = rj_tree,
  max_pop_pct = 0.50,
  nsim        = 999, seed = 2016,
  n_cores     = 4L
)
print(result_rj)
```

The most likely cluster is the ICD-10 leaf code **P209** ("unspecified
intrauterine hypoxia") concentrated in 18 municipalities of the north
fluminense region. Twenty-seven deaths are observed against an
expectation of 3.34 under uniform risk — a relative risk of $\sim 8$
— with a Monte Carlo $p$-value of $1/(\text{nsim} + 1) = 0.001$.

This closely matches Table 7 of Cançado et al. (2025): the same 18
municipalities (by IBGE code), 27 observed deaths, expected $\approx
3.37$, log-likelihood ratio $LR = 38.48$ (we obtain $\approx 38.83$).
The minor LR discrepancy is explained by a SIM database update
between the paper's analysis and the version shipped here, and by the
discontinuation of TCU population estimates after 2017; the
conclusions are identical.

### Why the joint scan matters here

To make the value of the joint search concrete, it is worth running
the two simpler scans on exactly the same data. The purely *spatial*
scan, ignoring the tree, returns a single 8-municipality cluster
around greater Rio de Janeiro driven by the aggregate death count
(Cançado et al. 2025, Table 5); the P209 cluster in the north of the
state does not appear at all, because aggregating over all causes
dilutes the cause-specific signal. We can verify this directly by
aggregating to one row per municipality and running the circular scan
on the totals:

```{r rj-spatial-only}
rj_agg <- unique(rj_mortality[, c("region_id", "x", "y", "live_births")])
rj_agg$cases <- tapply(rj_mortality$cases,
                        rj_mortality$region_id,
                        sum)[as.character(rj_agg$region_id)]

result_spatial <- circular_scan(
  cases       = rj_agg$cases,
  population  = rj_agg$live_births,
  region_id   = rj_agg$region_id,
  x           = rj_agg$x,
  y           = rj_agg$y,
  max_pop_pct = 0.50,
  nsim        = 999, seed = 2016
)
result_spatial$most_likely_cluster$region_ids
```

Conversely, the purely *tree-based* scan, ignoring geography, returns
more than fifty significant cuts at the 5 % level on this dataset
(Cançado et al. 2025, Table 6); the analyst is then left to apply a
multiple-testing correction across tree branches and decide which to
follow up. The joint scan returns a *single* most-likely $(z, g)$
pair with one Monte Carlo $p$-value — the family-wise error rate is
already controlled by the joint maximisation.

### Multiple distinct clusters (paper-faithful)

The paper's Section 5.1.1 specifies a filtering rule: a candidate
(zone, branch) pair is retained only if its branch is disjoint from
every previously retained branch (no ancestor/descendant relation),
*or* its zone is disjoint from every previously retained zone (no
region in common). This is implemented in `filter_clusters()`:

```{r rj-filter}
filter_clusters(result_rj)[1:5,
                            c("node_id", "n_regions",
                              "cases", "expected", "llr")]
```

No additional Monte Carlo simulation is performed; the secondary
clusters are read directly from `result_rj$secondary_clusters` and
filtered for distinctness in $O(k^2)$ time. The top filtered
candidates correspond to the rows of Table 7 of Cançado et al. (2025).

### Visualising the cluster

`get_cluster_regions()` returns a tidy data.frame ready to merge with
any polygon layer:

```{r rj-getregions}
# Most likely cluster only (single map)
cr1 <- get_cluster_regions(result_rj, n_clusters = 1, overlap = FALSE)

# Top-2 distinct clusters (faceted map)
cr2 <- get_cluster_regions(result_rj, n_clusters = 2, overlap = TRUE)
```

The package does not ship `sf` polygons for Rio de Janeiro — they are
easy to fetch on demand from \CRANpkg{geobr}:

```{r rj-plot}
library(ggplot2)
library(geobr)
library(sf)

mun       <- read_municipality(code_muni = "RJ", year = 2016)
mun$code6 <- as.integer(substr(mun$code_muni, 1, 6))

cr2 <- merge(get_cluster_regions(result_rj, n_clusters = 2,
                                  overlap = TRUE),
             unique(rj_mortality[, c("region_id", "ibge_code")]),
             by = "region_id")
mun_facet <- merge(mun, cr2, by.x = "code6", by.y = "ibge_code")

ggplot(mun_facet) +
  geom_sf(aes(fill = factor(cluster)), color = "gray40",
          alpha = 0.75) +
  scale_fill_manual(values = c("1" = "#C44E52", "2" = "#4C72B0"),
                    na.value = "gray95", na.translate = FALSE) +
  facet_wrap(~ panel, nrow = 1) +
  theme_minimal() +
  theme(legend.position = "none")
```

A complete annotated worked example, including the polygon download
and faceted plots for both `filter_clusters()` and `sequential_scan()`
results, is shipped at
`system.file("examples", "example_brazil_rj.R", package = "treeSS")`.

---

## Secondary clusters when the MLC shadows weaker ones

`filter_clusters()` extracts distinct clusters from a *single* run of
the scan. It is fast (no additional Monte Carlo) and faithful to the
paper. When the most likely cluster is overwhelmingly strong, however,
it can *shadow* weaker secondary clusters: the excess cases inside the
MLC inflate the expected counts everywhere else, weakening the
likelihood ratios of any other genuine cluster on the same dataset.

For that situation the package also offers `sequential_scan()`,
adapted from Zhang, Assunção and Kulldorff (2010). It detects the MLC,
removes the MLC's regions (optionally with a buffer of nearest
neighbours) from the dataset, and re-runs the scan on the reduced
data with a *fresh* Monte Carlo simulation. The procedure iterates
until the current MLC is no longer significant at the user-specified
$\alpha$ level. Because each iteration's $p$-value is computed against
the simulated null distribution of the *current* (reduced) dataset, no
multiple-testing correction is required.

```{r rj-seq}
seq_rj <- sequential_scan(
  cases       = rj_mortality$cases,
  population  = rj_mortality$live_births,
  region_id   = rj_mortality$region_id,
  x           = rj_mortality$x,
  y           = rj_mortality$y,
  node_id     = rj_mortality$node_id,
  tree        = rj_tree,
  max_iter    = 5L, alpha = 0.05,
  nsim        = 999, seed = 2016,
  max_pop_pct = 0.50, n_cores = 4L
)
print(seq_rj)
```

The `clusters` data.frame has one row per iteration with the cluster
identifier, log-likelihood ratio, fresh $p$-value, and a logical
`significant` flag:

```{r rj-seq-table}
seq_rj$clusters[, c("iteration", "node_id", "n_regions",
                    "llr", "pvalue", "significant")]
```

Mapping the sequential result uses the same `get_cluster_regions()`
generic. The complete plotting workflow is shipped in the same
annotated example file.

---

## When to use which

  * Use `filter_clusters()` when you want to follow Cançado et al.
    (2025) exactly: one scan, then a non-overlap pass over the
    candidate pool. No extra computation.
  * Use `sequential_scan()` when the MLC may be hiding weaker
    secondary clusters that `filter_clusters()` cannot detect on the
    original (unreduced) data.

The two answer different questions and are complementary; both are
appropriate in different applied scenarios.

---

## Computational notes

`treespatial_scan()` and `sequential_scan()` both accept `n_cores`,
which distributes the Monte Carlo step across OpenMP threads in C++.
Each thread carries its own `std::mt19937` generator seeded
deterministically from the user-supplied `seed`, so the result is
reproducible for any fixed pair `(seed, n_cores)`. The serial path
(`n_cores = 1L`) uses R's `rmultinom` and is bit-identical to the
pre-OpenMP implementation.

For publication-critical work where simulated $p$-values must be
bit-identical between machines, use `n_cores = 1L`. For exploratory
work, `n_cores = 4` typically yields a 4–6× speed-up.

---

## Other examples shipped with the package

| File | Purpose |
|:-----|:--------|
| `inst/examples/example_brazil_rj.R` | Full plotting workflow for the RJ analysis (this vignette) |
| `inst/examples/example_florida.R` | Building the input from raw long-form data — see `vignette("florida")` |
| `inst/examples/example_chicago.R` | Crime in Chicago, 2023 (discussed in the companion paper) |
| `inst/examples/example_london.R` | Road collisions in London, 2022 (discussed in the companion paper) |

---

## References

Cançado, A. L. F., Oliveira, G. S., Quadros, A. V. C., & Duczmal, L. H.
(2025). A tree-spatial scan statistic. *Environmental and Ecological
Statistics*, 32, 953–978.
[doi:10.1007/s10651-025-00670-w](https://doi.org/10.1007/s10651-025-00670-w)

Kulldorff, M. (1997). A spatial scan statistic. *Communications in
Statistics — Theory and Methods*, 26(6), 1481–1496.

Kulldorff, M., Fang, Z., & Walsh, S. J. (2003). A tree-based scan
statistic for database disease surveillance. *Biometrics*, 59(2),
323–331.

Zhang, Z., Assunção, R., & Kulldorff, M. (2010). Spatial scan
statistics adjusted for multiple clusters. *Journal of Probability
and Statistics*, 2010, 642379.