Building treeSS inputs from raw data: Florida mortality 2016

What this vignette adds

The companion vignette (vignette("introduction", package = "treeSS")) walks through the Rio de Janeiro dataset, which is already pre-assembled in exactly the long format that treespatial_scan() expects: one row per (region, leaf), with parallel columns for cases, population, region centroids, and tree node identifiers.

In practice, most users will start one step earlier: with a raw case-level table, a separately maintained polygon layer (counties, boroughs, census tracts…), and a classification that needs to be turned into a tree. This vignette goes through exactly that workflow on the fl_deaths dataset shipped with the package, mirroring the annotated example file inst/examples/example_florida.R.

What you will see:

  1. How to build the ICD-10 tree directly from the codes that appear in the raw data.
  2. How to assemble parallel vectors of (cases, population, region_id, x, y, node_id) by joining the case table to a polygon layer fetched on demand.
  3. How to run treespatial_scan(), filter_clusters(), and sequential_scan() on the assembled inputs and visualise the results.

The dataset is the CDC WONDER Compressed Mortality File 1999–2016 restricted to Florida deaths in 2016. It covers 65 of Florida’s 67 counties, 253 ICD-10 codes, and ≈157,000 deaths.


1. Load the raw data

fl_deaths ships in raw long form — one row per (county_fips, icd10_code) combination — with population and an ICD-10 description column attached:

library(treeSS)
data(fl_deaths)

str(fl_deaths)
#> 'data.frame': N obs. of 5 variables:
#>  $ county_fips: chr  ...
#>  $ icd10_code : chr  "C56" "C61" "I10" ...
#>  $ icd10_desc : chr  ...
#>  $ deaths     : int  ...
#>  $ population : int  ...

c(rows         = nrow(fl_deaths),
  counties     = length(unique(fl_deaths$county_fips)),
  icd10_codes  = length(unique(fl_deaths$icd10_code)),
  total_deaths = sum(fl_deaths$deaths))

Note what fl_deaths does not carry:

We need to supply both before calling the scan.


2. Build the ICD-10 tree from the codes in the data

The codes in fl_deaths use two granularities:

We build a three-level tree (Root → ICD-10 chapter → 3-character node → 5-character leaf) directly from the codes that actually appear in fl_deaths, so the tree contains no unused nodes:

icd_codes <- sort(unique(fl_deaths$icd10_code))

parsed <- data.frame(
  code  = icd_codes,
  three = ifelse(grepl("\\.", icd_codes),
                  sub("\\..*", "", icd_codes),
                  icd_codes),
  is_three_leaf = !grepl("\\.", icd_codes) & nchar(icd_codes) == 3,
  stringsAsFactors = FALSE
)

# Map 3-character codes to ICD-10 chapters
icd_chapter_of <- function(three) {
  L   <- substr(three, 1, 1)
  num <- suppressWarnings(as.integer(substr(three, 2, 3)))
  switch(L,
    A = "I_AB", B = "I_AB",
    C = "II_C_D",
    D = if (!is.na(num) && num <= 48) "II_C_D" else "III_D",
    E = "IV_E",  F = "V_F",  G = "VI_G",
    H = if (!is.na(num) && num <= 59) "VII_H" else "VIII_H",
    I = "IX_I",  J = "X_J",  K = "XI_K", L = "XII_L",
    M = "XIII_M", N = "XIV_N", O = "XV_O", P = "XVI_P",
    Q = "XVII_Q", R = "XVIII_R",
    S = "XIX_S_T", T = "XIX_S_T",
    V = "XX_V_Y", W = "XX_V_Y", X = "XX_V_Y", Y = "XX_V_Y",
    "Other"
  )
}
parsed$chapter <- sapply(parsed$three, icd_chapter_of)

fl_tree <- unique(rbind(
  data.frame(node_id = "Root", parent_id = NA_character_,
             stringsAsFactors = FALSE),
  data.frame(node_id = unique(parsed$chapter), parent_id = "Root",
             stringsAsFactors = FALSE),
  unique(data.frame(node_id = parsed$three, parent_id = parsed$chapter,
                    stringsAsFactors = FALSE)),
  data.frame(
    node_id   = parsed$code[!parsed$is_three_leaf],
    parent_id = parsed$three[!parsed$is_three_leaf],
    stringsAsFactors = FALSE
  )
))

# Sanity check: every code in fl_deaths is a leaf of fl_tree
parents <- unique(fl_tree$parent_id[!is.na(fl_tree$parent_id)])
leaves  <- setdiff(fl_tree$node_id, parents)
stopifnot(length(setdiff(icd_codes, leaves)) == 0)

c(total_nodes = nrow(fl_tree),
  chapters    = length(unique(parsed$chapter)),
  leaves      = length(leaves))

The result is a two-column data.frame fl_tree with columns node_id and parent_id. The root has parent_id = NA. The tree is acyclic by construction.

Two notes on style:


3. Download polygons and centroids

fl_deaths carries county FIPS codes but not coordinates. We fetch the 2016 county boundaries from , project to WGS84, and compute centroids:

library(sf)
library(tigris)

fl_map <- counties(state = "FL", cb = TRUE, year = 2016)
fl_map <- st_transform(fl_map, 4326)

centroids   <- st_centroid(st_geometry(fl_map))
coords      <- st_coordinates(centroids)
fl_map$x    <- coords[, "X"]
fl_map$y    <- coords[, "Y"]

Two technical points worth stating explicitly:


4. Assemble parallel vectors

treespatial_scan() consumes six parallel vectors of identical length — one row per (region, leaf) observation. We join the case table to the population (which is constant per county in fl_deaths) and to the coordinates from fl_map:

# Per-county population: same value across all leaves, so take any one
fl_pop <- aggregate(population ~ county_fips, data = fl_deaths,
                     FUN = max)

# Per-county centroids from the polygon layer
xy <- data.frame(county_fips = fl_map$GEOID,
                  x = fl_map$x, y = fl_map$y,
                  stringsAsFactors = FALSE)

dat <- merge(fl_deaths[, c("county_fips", "icd10_code", "deaths")],
             fl_pop, by = "county_fips")
dat <- merge(dat, xy, by = "county_fips")

# Integer region_id is convenient downstream; here we use the rank
# of county_fips in alphabetical order
dat$region_id <- as.integer(factor(
  dat$county_fips, levels = sort(unique(dat$county_fips))
))

# Keep only rows with at least one death (sparse-by-default scan)
dat <- dat[dat$deaths > 0, ]

c(long_rows = nrow(dat),
  counties  = length(unique(dat$region_id)))

At this point dat is exactly the same long-format shape as rj_mortality in the companion vignette: one row per (county, ICD-10) combination, with parallel columns of cases, population, coordinates, and node identifier.


5. Run the tree-spatial scan

result_fl <- treespatial_scan(
  cases       = dat$deaths,
  population  = dat$population,
  region_id   = dat$region_id,
  x           = dat$x,
  y           = dat$y,
  node_id     = dat$icd10_code,
  tree        = fl_tree,
  max_pop_pct = 0.05,
  nsim        = 999, seed = 2016,
  n_cores     = 4L
)
print(result_fl)

The most likely cluster typically falls on a specific ICD-10 leaf code (5-character) concentrated in one or two adjacent counties. The Florida dataset gives a different pattern from RJ because the overall mortality counts in Florida are much higher and the geography is much larger; the joint scan picks out narrow, cause-specific excess rather than broad geographic concentrations.

max_pop_pct = 0.05 is more restrictive than the default of 0.50. Larger study areas like Florida benefit from a tighter cap because the default would otherwise allow zones covering half the state’s population, which is usually not the size of cluster the analyst is looking for.


6. Distinct secondary clusters (paper-faithful)

filter_clusters(result_fl)[1:5,
                            c("node_id", "n_regions",
                              "cases", "expected", "llr",
                              "pvalue")]

As in the RJ analysis, filter_clusters() performs no additional Monte Carlo; the secondary clusters are read from result_fl$secondary_clusters and filtered for distinctness in \(O(k^2)\) time.


7. Sequential scan when the MLC may shadow

seq_fl <- sequential_scan(
  cases       = dat$deaths,
  population  = dat$population,
  region_id   = dat$region_id,
  x           = dat$x,
  y           = dat$y,
  node_id     = dat$icd10_code,
  tree        = fl_tree,
  max_iter    = 5L, alpha = 0.05,
  nsim        = 999, seed = 2016,
  max_pop_pct = 0.05, n_cores = 4L
)
print(seq_fl)

Each iteration of sequential_scan() runs a fresh Monte Carlo on the dataset with the previous iteration’s cluster regions removed, so the \(p\)-value of each detected cluster is computed against the correct conditional null distribution. The procedure stops at the first non-significant iteration or max_iter, whichever comes first.


8. Visualise the clusters

get_cluster_regions() returns a tidy data.frame that can be joined to fl_map for plotting:

library(ggplot2)
library(dplyr)

region_info <- unique(dat[, c("region_id", "county_fips")])

cr_seq <- merge(
  get_cluster_regions(seq_fl, overlap = TRUE),
  region_info, by = "region_id"
)

# Cross-join the polygon layer with every iteration panel so all
# counties appear in every panel (those outside the analysis
# dataset show as NA-coloured rather than as an extra empty panel)
panels       <- unique(cr_seq$panel)
cr_seq_keys  <- unique(cr_seq[, c("county_fips", "cluster",
                                   "node_id", "llr", "pvalue",
                                   "panel")])
fl_seq_plot <- merge(
  do.call(rbind,
          lapply(panels, function(p) cbind(fl_map, panel = p))),
  cr_seq_keys,
  by.x = c("GEOID", "panel"),
  by.y = c("county_fips", "panel"),
  all.x = TRUE
)

ggplot(fl_seq_plot) +
  geom_sf(aes(fill = factor(cluster)), color = "gray40",
          linewidth = 0.12, alpha = 0.75) +
  scale_fill_manual(
    values = c("#C44E52", "#4C72B0", "#55A868",
               "#8172B2", "#CCB974"),
    na.value = "gray95", na.translate = FALSE
  ) +
  facet_wrap(~ panel, nrow = 1) +
  theme_minimal() +
  theme(legend.position = "none")

The cross-join trick on panels is what lets every county appear in every iteration panel; without it, a simple merge(fl_map, cr_seq, all.x = TRUE) would leave one extra “NA” facet for the two Florida counties not present in the mortality dataset.

The complete script with full plot annotations, file output and summary tables is at inst/examples/example_florida.R.


Recap

The Florida workflow demonstrates the three preprocessing steps that the RJ vignette skips because rj_mortality arrives pre-assembled:

  1. Build the tree from the codes in the data (no hand-coded hierarchy).
  2. Pull polygons and centroids on demand from a spatial source ( for US, for Brazil, etc.).
  3. Join everything into parallel long-form vectors.

After step 3, the rest of the analysis is identical to the RJ vignette: a single treespatial_scan() call, optional filter_clusters() or sequential_scan() post-processing, and visualisation via get_cluster_regions().


References

Cançado, A. L. F., Oliveira, G. S., Quadros, A. V. C., & Duczmal, L. H. (2025). A tree-spatial scan statistic. Environmental and Ecological Statistics, 32, 953–978. doi:10.1007/s10651-025-00670-w

Kulldorff, M. (1997). A spatial scan statistic. Communications in Statistics — Theory and Methods, 26(6), 1481–1496.

Zhang, Z., Assunção, R., & Kulldorff, M. (2010). Spatial scan statistics adjusted for multiple clusters. Journal of Probability and Statistics, 2010, 642379.

Centers for Disease Control and Prevention. CDC WONDER Compressed Mortality File 1999–2016. National Center for Health Statistics.