Building treeSS inputs from raw data: Florida mortality 2016

What this vignette adds

The companion vignette (vignette("introduction", package = "treeSS")) walks through the Rio de Janeiro dataset, which is already pre-assembled in exactly the long format that treespatial_scan() expects: one row per (region, leaf), with parallel columns for cases, population, region centroids, and tree node identifiers.

In practice, most users will start one step earlier: with a raw case-level table, a separately maintained polygon layer (counties, boroughs, census tracts…), and a classification that needs to be turned into a tree. This vignette goes through exactly that workflow on the fl_deaths dataset shipped with the package, mirroring the annotated example file inst/examples/example_florida.R.

What you will see:

How to build the ICD-10 tree directly from the codes that appear in the raw data.
How to assemble a long-format data.frame with (cases, population, region_id, x, y, node_id) columns by joining the case table to a polygon layer fetched on demand.
How to run treespatial_scan(), filter_clusters(), and sequential_scan() on the assembled inputs and visualise the results.

The dataset is the CDC WONDER Compressed Mortality File 1999–2016 restricted to Florida deaths in 2016. It covers 65 of Florida’s 67 counties, 253 ICD-10 codes, and ≈157,000 deaths.

1. Load the raw data

fl_deaths ships in raw long form — one row per (county_fips, icd10_code) combination — with population and an ICD-10 description column attached:

library(treeSS)
data(fl_deaths)

str(fl_deaths)
#> 'data.frame': N obs. of 5 variables:
#>  $ county_fips: chr  ...
#>  $ icd10_code : chr  "C56" "C61" "I10" ...
#>  $ icd10_desc : chr  ...
#>  $ deaths     : int  ...
#>  $ population : int  ...

c(rows         = nrow(fl_deaths),
  counties     = length(unique(fl_deaths$county_fips)),
  icd10_codes  = length(unique(fl_deaths$icd10_code)),
  total_deaths = sum(fl_deaths$deaths))

Note what fl_deaths does not carry:

No tree object — we have ICD-10 codes but no parent–child structure.
No spatial coordinates — we have county FIPS codes but no centroids.

We need to supply both before calling the scan.

2. Build the ICD-10 tree from the codes in the data

The codes in fl_deaths use two granularities:

3-character codes (e.g. "C56", "R54") treated as leaves at the 3-character level — the death certificate was coded to that precision and no further.
5-character codes with a dot (e.g. "A04.7") — leaves under a 3-character internal parent.

We build a three-level tree (Root → ICD-10 chapter → 3-character node → 5-character leaf) directly from the codes that actually appear in fl_deaths, so the tree contains no unused nodes:

icd_codes <- sort(unique(fl_deaths$icd10_code))

parsed <- data.frame(
  code  = icd_codes,
  three = ifelse(grepl("\\.", icd_codes),
                  sub("\\..*", "", icd_codes),
                  icd_codes),
  is_three_leaf = !grepl("\\.", icd_codes) & nchar(icd_codes) == 3,
  stringsAsFactors = FALSE
)

# Map 3-character codes to ICD-10 chapters
icd_chapter_of <- function(three) {
  L   <- substr(three, 1, 1)
  num <- suppressWarnings(as.integer(substr(three, 2, 3)))
  switch(L,
    A = "I_AB", B = "I_AB",
    C = "II_C_D",
    D = if (!is.na(num) && num <= 48) "II_C_D" else "III_D",
    E = "IV_E",  F = "V_F",  G = "VI_G",
    H = if (!is.na(num) && num <= 59) "VII_H" else "VIII_H",
    I = "IX_I",  J = "X_J",  K = "XI_K", L = "XII_L",
    M = "XIII_M", N = "XIV_N", O = "XV_O", P = "XVI_P",
    Q = "XVII_Q", R = "XVIII_R",
    S = "XIX_S_T", T = "XIX_S_T",
    V = "XX_V_Y", W = "XX_V_Y", X = "XX_V_Y", Y = "XX_V_Y",
    "Other"
  )
}
parsed$chapter <- sapply(parsed$three, icd_chapter_of)

fl_tree <- unique(rbind(
  data.frame(node_id = "Root", parent_id = NA_character_,
             stringsAsFactors = FALSE),
  data.frame(node_id = unique(parsed$chapter), parent_id = "Root",
             stringsAsFactors = FALSE),
  unique(data.frame(node_id = parsed$three, parent_id = parsed$chapter,
                    stringsAsFactors = FALSE)),
  data.frame(
    node_id   = parsed$code[!parsed$is_three_leaf],
    parent_id = parsed$three[!parsed$is_three_leaf],
    stringsAsFactors = FALSE
  )
))

# Sanity check: every code in fl_deaths is a leaf of fl_tree
parents <- unique(fl_tree$parent_id[!is.na(fl_tree$parent_id)])
leaves  <- setdiff(fl_tree$node_id, parents)
stopifnot(length(setdiff(icd_codes, leaves)) == 0)

c(total_nodes = nrow(fl_tree),
  chapters    = length(unique(parsed$chapter)),
  leaves      = length(leaves))

The result is a two-column data.frame fl_tree with columns node_id and parent_id. The root has parent_id = NA. The tree is acyclic by construction.

Two notes on style:

treeSS accepts the tree either as this two-column data.frame via the tree = argument, or as parallel vectors via the tree_node_id = and tree_parent_id = arguments. Both produce identical scans.
node_id is a string here, but treeSS is type-agnostic: any R type that supports == and can be coerced to character will work, including integers and factors.

3. Download polygons and centroids

fl_deaths carries county FIPS codes but not coordinates. We fetch the 2016 county boundaries from , project to WGS84, and compute centroids:

library(sf)
library(tigris)

fl_map <- counties(state = "FL", cb = TRUE, year = 2016)
fl_map <- st_transform(fl_map, 4326)

centroids   <- st_centroid(st_geometry(fl_map))
coords      <- st_coordinates(centroids)
fl_map$x    <- coords[, "X"]
fl_map$y    <- coords[, "Y"]

Two technical points worth stating explicitly:

Centroids of sf objects. st_centroid() is exact under sf and respects the polygon geometry; if you use st_centroid() after st_transform(), the centroid is computed in WGS84 coordinates, which is what treespatial_scan() expects for the small spatial extents of US counties / Brazilian municipalities / London boroughs.
tigris requires network access. A first run downloads cached shapefiles to a user-configurable cache directory; the second run is instant.

4. Assemble the long-format data.frame

treespatial_scan() consumes a single long-format data.frame with one row per (region, leaf) observation, and refers to its columns by name. We join the case table to the population (which is constant per county in fl_deaths) and to the coordinates from fl_map:

# Per-county population: same value across all leaves, so take any one
fl_pop <- aggregate(population ~ county_fips, data = fl_deaths,
                     FUN = max)

# Per-county centroids from the polygon layer
xy <- data.frame(county_fips = fl_map$GEOID,
                  x = fl_map$x, y = fl_map$y,
                  stringsAsFactors = FALSE)

dat <- merge(fl_deaths[, c("county_fips", "icd10_code", "deaths")],
             fl_pop, by = "county_fips")
dat <- merge(dat, xy, by = "county_fips")

# Integer region_id is convenient downstream; here we use the rank
# of county_fips in alphabetical order
dat$region_id <- as.integer(factor(
  dat$county_fips, levels = sort(unique(dat$county_fips))
))

# Keep only rows with at least one death (sparse-by-default scan)
dat <- dat[dat$deaths > 0, ]

c(long_rows = nrow(dat),
  counties  = length(unique(dat$region_id)))

At this point dat is exactly the same long-format shape as rj_mortality in the companion vignette: one row per (county, ICD-10) combination, with parallel columns of cases, population, coordinates, and node identifier.

5. Run the tree-spatial scan

result_fl <- treespatial_scan(
  dat,
  cases       = deaths,
  population  = population,
  region_id   = region_id,
  x           = x,
  y           = y,
  node_id     = icd10_code,
  tree        = fl_tree,
  max_pop_pct = 0.05,
  nsim        = 999, seed = 2016,
  n_cores     = 4L
)
print(result_fl)

The most likely cluster typically falls on a specific ICD-10 leaf code (5-character) concentrated in one or two adjacent counties. The Florida dataset gives a different pattern from RJ because the overall mortality counts in Florida are much higher and the geography is much larger; the joint scan picks out narrow, cause-specific excess rather than broad geographic concentrations.

max_pop_pct = 0.05 is more restrictive than the default of 0.50. Larger study areas like Florida benefit from a tighter cap because the default would otherwise allow zones covering half the state’s population, which is usually not the size of cluster the analyst is looking for.

6. Distinct secondary clusters (paper-faithful)

filter_clusters(result_fl)[1:5,
                            c("node_id", "n_regions",
                              "cases", "expected", "llr",
                              "pvalue")]

As in the RJ analysis, filter_clusters() performs no additional Monte Carlo; the secondary clusters are read from result_fl$secondary_clusters and filtered for distinctness in $O(k^2)$ time.

7. Sequential scan when the MLC may shadow

seq_fl <- sequential_scan(
  dat,
  cases       = deaths,
  population  = population,
  region_id   = region_id,
  x           = x,
  y           = y,
  node_id     = icd10_code,
  tree        = fl_tree,
  max_iter    = 5L, alpha = 0.05,
  nsim        = 999, seed = 2016,
  max_pop_pct = 0.05, n_cores = 4L
)
print(seq_fl)

Each iteration of sequential_scan() runs a fresh Monte Carlo on the dataset with the previous iteration’s cluster regions removed, so the $p$-value of each detected cluster is computed against the correct conditional null distribution. The procedure stops at the first non-significant iteration or max_iter, whichever comes first.

8. Visualise the clusters

get_cluster_regions() returns a tidy data.frame that can be joined to fl_map for plotting:

library(ggplot2)
library(dplyr)

region_info <- unique(dat[, c("region_id", "county_fips")])

cr_seq <- merge(
  get_cluster_regions(seq_fl, overlap = TRUE),
  region_info, by = "region_id"
)

# Cross-join the polygon layer with every iteration panel so all
# counties appear in every panel (those outside the analysis
# dataset show as NA-coloured rather than as an extra empty panel)
panels       <- unique(cr_seq$panel)
cr_seq_keys  <- unique(cr_seq[, c("county_fips", "cluster",
                                   "node_id", "llr", "pvalue",
                                   "panel")])
fl_seq_plot <- merge(
  do.call(rbind,
          lapply(panels, function(p) cbind(fl_map, panel = p))),
  cr_seq_keys,
  by.x = c("GEOID", "panel"),
  by.y = c("county_fips", "panel"),
  all.x = TRUE
)

ggplot(fl_seq_plot) +
  geom_sf(aes(fill = factor(cluster)), color = "gray40",
          linewidth = 0.12, alpha = 0.75) +
  scale_fill_manual(
    values = c("#C44E52", "#4C72B0", "#55A868",
               "#8172B2", "#CCB974"),
    na.value = "gray95", na.translate = FALSE
  ) +
  facet_wrap(~ panel, nrow = 1) +
  theme_minimal() +
  theme(legend.position = "none")

The cross-join trick on panels is what lets every county appear in every iteration panel; without it, a simple merge(fl_map, cr_seq, all.x = TRUE) would leave one extra “NA” facet for the two Florida counties not present in the mortality dataset.

The complete script with full plot annotations, file output and summary tables is at inst/examples/example_florida.R.

Recap

The Florida workflow demonstrates the three preprocessing steps that the RJ vignette skips because rj_mortality arrives pre-assembled:

Build the tree from the codes in the data (no hand-coded hierarchy).
Pull polygons and centroids on demand from a spatial source ( for US, for Brazil, etc.).
Join everything into parallel long-form vectors.

After step 3, the rest of the analysis is identical to the RJ vignette: a single treespatial_scan() call, optional filter_clusters() or sequential_scan() post-processing, and visualisation via get_cluster_regions().