Small adjustments to the DESCRIPTION file
Small adjustments to the vignettes
The package now ships two vignettes:
vignette("introduction", package = "treeSS") — Rio
de Janeiro end-to-end, reproducing Section 5.2 of Cançado et al. (2025).
This was the previous introduction vignette, trimmed to RJ
only.
vignette("florida", package = "treeSS")
(new) — a pedagogical walk-through of building the tree-spatial
scan inputs from raw data using the bundled fl_deaths
dataset: building the ICD-10 tree from the codes that actually appear in
the data, downloading county polygons + centroids from
tigris, and assembling the parallel-vector input contract
that treespatial_scan() expects.
The Chicago and London datasets, previously discussed inline in the
introduction vignette, are now reserved for the companion
software paper.
Three of the four bundled plotting examples for sequential_scan()
(example_brazil_rj.R, example_chicago.R,
example_florida.R) previously did a left join from the full
map polygon set onto the cluster table. When the shapefile contained
polygons not present in the analysis dataset (3 RJ municipalities
missing from the DATASUS/IBGE 89-municipality subset, for instance),
those polygons ended up with panel = NA, which
facet_wrap rendered as an extra empty panel labelled
“NA”.
The examples now cross-join the polygon set with the panel labels
first and then left-join the cluster information by
(id, panel), so every map polygon is drawn in every
iteration panel — those that fall outside the analysis dataset get the
na.value colour (a light grey), exactly as intended. No
extra “NA” panel is produced.
The london example uses leaflet rather than
facet_wrap and was not affected.
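The difference between the two join strategies can be sketched on toy data. This is only an illustration in base R, not the code of the bundled examples: the data frames polygons, panels and clusters, the column in_cluster, and the use of merge() are all assumptions made for the sketch.

```r
# Hypothetical toy data: four map polygons, two iteration panels.
polygons <- data.frame(id = c("A", "B", "C", "D"))
panels <- data.frame(panel = c("Iteration 1", "Iteration 2"))

# Hypothetical cluster table; polygon "D" is absent from the analysis dataset.
clusters <- data.frame(
  id = c("A", "B", "C"),
  panel = c("Iteration 1", "Iteration 1", "Iteration 2"),
  in_cluster = TRUE
)

# Old approach: left join from the polygons onto the cluster table.
# Polygon "D" has no match, so its panel is NA (the stray "NA" facet).
old <- merge(polygons, clusters, by = "id", all.x = TRUE)

# New approach, step 1: cross join polygons with panel labels.
# In base R, merge(..., by = NULL) returns the Cartesian product.
grid <- merge(polygons, panels, by = NULL)

# New approach, step 2: left join the cluster information by (id, panel).
# Every polygon now appears in every panel; unmatched rows only have
# in_cluster = NA, which a fill scale can map to its na.value colour.
new <- merge(grid, clusters, by = c("id", "panel"), all.x = TRUE)

anyNA(old$panel)  # the old join leaves a missing panel label
anyNA(new$panel)  # the new join does not
```

With the cross join in place, the number of rows is always the number of polygons times the number of panels, regardless of which polygons the analysis dataset covers.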
multicluster_scan() (added in 0.1.45 as an adaptation of
Li, Wang, Yang, Li and Lai 2011 to the tree-spatial setting) has been
removed. The function is gone, along with its C++ backend
(mc_multicluster_treespatial_cpp,
mc_multicluster_spatial_cpp), the
get_cluster_regions.multicluster_scan S3 method, the
corresponding print / summary methods, all
examples, and the vignette subsection.
Rationale:
On real datasets with a concentrated signal (e.g. infant
mortality in Rio de Janeiro: 622 tree nodes, 5358 zones), the top-K
candidate pool was dominated by overlapping variants of a single
geographic neighbourhood, so the fast top-K disjoint-pair search could
not find a valid pair. The full-pool rescue path was too slow to be
practical (timing out on nsim = 999 with 4 cores).
The factorisation of the joint LLR used by Li et al. (2011) is exact under the Poisson model for circular scans; its extension to the tree-spatial setting was not formally established.
filter_clusters() (Cançado et al. 2025) and
sequential_scan() (Zhang, Assunção and Kulldorff 2010)
together already cover the practical secondary-cluster use cases with
published, well-studied statistical properties.
Users who want joint-cluster detection in the circular case can use the original implementation from Li et al. (2011) outside this package.
The package now offers two clearly-bounded approaches:
filter_clusters() — paper-faithful non-overlap
criterion of Cançado et al. (2025), Sec. 5.1.1, applied to the
single-pass candidate pool.
sequential_scan() — sequential adjustment of Zhang,
Assunção and Kulldorff (2010): detect MLC, remove its regions (with
optional buffer of nearest neighbours), re-run the scan on the reduced
data with a fresh Monte Carlo simulation; iterate until the current MLC
is no longer significant. Each iteration’s p-value is correct under the
conditional argument in the paper, so no multiple-testing correction is
required.
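The detect, remove and re-scan loop behind the sequential approach can be sketched with a toy stand-in. This is not the package's implementation and does not call its API: poisson_llr(), toy_scan() and toy_sequential() are hypothetical helpers, each region is treated as its own candidate zone, there is no tree and no buffer, and the data are made up.

```r
# Poisson log-likelihood ratio for a high-rate zone (toy version).
poisson_llr <- function(c_in, e_in, c_tot) {
  c_out <- c_tot - c_in
  e_out <- c_tot - e_in
  ifelse(
    c_in > e_in,
    c_in * log(c_in / e_in) +
      ifelse(c_out > 0, c_out * log(c_out / e_out), 0),
    0
  )
}

# Toy scan: every single region is a candidate zone; the p-value comes from
# a fresh Monte Carlo simulation on the data passed in.
toy_scan <- function(cases, pop, nsim = 199) {
  c_tot <- sum(cases)
  expected <- c_tot * pop / sum(pop)
  llr <- poisson_llr(cases, expected, c_tot)
  best <- which.max(llr)
  # Under H0 the total cases are spread proportionally to population.
  sims <- rmultinom(nsim, size = c_tot, prob = pop)
  sim_max <- apply(sims, 2, function(x) max(poisson_llr(x, expected, c_tot)))
  p_value <- (1 + sum(sim_max >= llr[best])) / (nsim + 1)
  list(region = names(cases)[best], llr = unname(llr[best]), p_value = p_value)
}

# Sequential loop: detect the MLC, drop its region, re-scan the reduced data,
# and stop when the current MLC is not significant or max_iter is reached.
toy_sequential <- function(cases, pop, alpha = 0.05, max_iter = 5, nsim = 199) {
  found <- list()
  for (i in seq_len(max_iter)) {
    if (length(cases) < 2) break
    res <- toy_scan(cases, pop, nsim)
    if (res$p_value > alpha) break
    found[[length(found) + 1]] <- res
    keep <- names(cases) != res$region
    cases <- cases[keep]
    pop <- pop[keep]
  }
  found
}

set.seed(1)
pop <- setNames(rep(1000, 10), paste0("r", 1:10))
cases <- setNames(c(60, 10, 9, 11, 35, 10, 12, 8, 10, 9), names(pop))
out <- toy_sequential(cases, pop)
vapply(out, function(x) x$region, character(1))
```

The point of the sketch is that the expected counts and the Monte Carlo reference distribution are recomputed on the reduced data at every iteration, which is what lets each iteration's p-value be read on its own.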
Replaced the ad-hoc Holm-Bonferroni iterative_scan()
with two methods drawn directly from the published literature on
multi-cluster spatial scan statistics, adapted to the tree-spatial
setting. The package now offers three approaches to secondary-cluster
detection, with the choice driven by which type of shadowing the user
wants to remove:
filter_clusters() (unchanged) – the original
non-overlap criterion of Cançado et al. (2025), Sec. 5.1.1, applied to
the single-pass candidate pool.
sequential_scan() (new) – the sequential adjustment
of Zhang, Assunção and Kulldorff (2010), adapted to tree-spatial /
circular / tree-only inputs. Detects the MLC, removes its regions (and
an optional buffer_size of nearest neighbours) from the
dataset, and re-runs the scan on the reduced data with a fresh Monte
Carlo simulation. Iterates until the MLC of the current reduced data is
no longer significant or max_iter is reached. Each
iteration’s p-value is correct under the conditional argument of Section
3 of the paper – no post-hoc multiple-testing correction is applied or
required.
multicluster_scan() (new) – the two-cluster joint
statistic of Li, Wang, Yang, Li and Lai (2011), adapted to tree-spatial
and circular scans. Builds the alternative as a joint presence of two
region-disjoint clusters; the joint LLR factorises into the sum of the
two single-cluster LLRs under Poisson, so the observed maximum is found
by sweeping the candidate pool. The Monte Carlo for the joint statistic
runs in C++ (new exports mc_multicluster_treespatial_cpp
and mc_multicluster_spatial_cpp) with the same OpenMP
backend as the other scans, so performance is on par with
treespatial_scan(). The decision rule of Table 2 of the
paper is applied: 0, 1, or 2 significant clusters are reported based on
the joint p-value and a re-evaluation of the weaker cluster on the
reduced dataset.
iterative_scan() and its
print/summary/get_cluster_regions methods have been
removed. The Holm-Bonferroni “scan + zero cases + re-scan” procedure is
not part of the published methods we wanted to offer; the sequential and
multi-cluster scans above cover the intended use cases and are grounded
in the literature.
Internal helper .matrix_to_vectors() (previously
used only by iterative_scan) has been removed.
print.sequential_scan(), summary.sequential_scan()
print.multicluster_scan(), summary.multicluster_scan()
get_cluster_regions.sequential_scan(),
get_cluster_regions.multicluster_scan()
filter_clusters(), treespatial_scan(), and
circular_scan() cross-reference the new methods in
@seealso.
inst/examples/
(Brazil/RJ, Chicago, Florida, London) use sequential_scan()
in place of the removed iterative_scan() block.
tests/testthat/test-sequential-scan.R covering
structure, the max_iter stopping rule, the buffer
mechanism, behaviour under H0, and printing.
tests/testthat/test-multicluster-scan.R covering
structure, the stronger-versus-weaker ordering, region disjointness of
the returned pair, the significance decision rule, and printing.
tests/testthat/test-get-cluster-regions.R and
tests/testthat/test-binomial.R updated to drop their
references to iterative_scan().
Address the four items requested in the first-round CRAN review.
Single-quote software/API names per the CRAN cookbook:
OpenMP is now written as 'OpenMP' in the
package description. Reference: https://contributor.r-project.org/cran-cookbook/description_issues.html#formatting-software-names
Add DOI links to the two references that were previously cited
without a link, using the CRAN-mandated
authors (year) <doi:...> form (no space after
doi:, no space inside the angle brackets):
Add \value tags (and the corresponding
@return roxygen blocks) to the seven
print()/summary() method Rd files flagged by
CRAN. Each documents that the method invisibly returns its input object
unchanged and is called for its printing side effect, with a description
of the fields written to the console (and, for summary()
methods, the additional fields beyond those of the matching
print() method):
print.circular_scan.Rd
print.iterative_scan.Rd
print.tree_scan.Rd
print.treespatial_scan.Rd
summary.circular_scan.Rd
summary.tree_scan.Rd
summary.treespatial_scan.Rd
Reference: https://contributor.r-project.org/cran-cookbook/docs_issues.html#missing-value-tags-in-.rd-files
generate_example_data() no longer sets a hardcoded seed
within the function: the default of the seed argument is
now NULL (previously 123L). When the user does
not pass a seed, the function draws from the user’s session-level RNG
state without modifying it; when the user passes an explicit integer,
the existing save-and-restore logic (introduced in 0.1.43) still
applies. The \usage{} block and the
\item{seed}{...} description of the corresponding Rd file
have been updated to match. The roxygen example
(ex <- generate_example_data(seed = 42)) is unchanged:
it passes an explicit seed and so remains reproducible. Reference: https://contributor.r-project.org/cran-cookbook/code_issues.html#setting-a-specific-seed
Testing for a clean R CMD check --as-cran.
Added \source{} blocks to all three tree datasets,
pointing at the corresponding leaf-level dataset and at the
data-raw/ build script in the GitHub repo.
get_cluster_regions(): added @examples block.
@examples block to the roxygen comments.
Functions that take a seed = ... argument no
longer silently overwrite the user’s session-level RNG state.
Previously, calling treespatial_scan(..., seed = 42) after
a set.seed(2026) in the user’s session would leave the RNG
in a state determined by the internal Monte Carlo loop, so any
subsequent runif(), sample(), etc. was no
longer reproducible from the user’s set.seed(2026). Now the
user’s pre-existing RNG state is saved on entry and restored on exit
(whether the function returns normally or via an error), so the
seed argument affects only the result of the call.
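The save-on-entry, restore-on-exit pattern can be sketched as follows. This is a minimal illustration under assumptions, not the package's code: with_local_seed() is a hypothetical wrapper written for this example, and the bodies of the package's own helpers are not reproduced here.

```r
# Hypothetical wrapper: run fun() under a fixed seed without disturbing the
# caller's session-level RNG state.
with_local_seed <- function(seed, fun) {
  if (is.null(seed)) {
    # No seed given: draw from the session RNG stream as it is.
    return(fun())
  }
  # Save the caller's RNG state, if one exists yet in this session.
  had_state <- exists(".Random.seed", envir = globalenv(), inherits = FALSE)
  old_state <- if (had_state) get(".Random.seed", envir = globalenv()) else NULL
  # on.exit() runs on normal return and on error alike.
  on.exit({
    if (had_state) {
      assign(".Random.seed", old_state, envir = globalenv())
    } else if (exists(".Random.seed", envir = globalenv(), inherits = FALSE)) {
      rm(".Random.seed", envir = globalenv())
    }
  }, add = TRUE)
  set.seed(seed)
  fun()
}

# The draw that would follow set.seed(2026) with no seeded call in between.
set.seed(2026)
expected_next <- runif(1)

# Same session seed, but with a seeded call in between.
set.seed(2026)
sim <- with_local_seed(42, function() rnorm(3))
actual_next <- runif(1)

identical(expected_next, actual_next)
```

Registering the restore step with on.exit() rather than calling it at the end of the function is what covers the error path as well as the normal return.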
Implementation is in two new internal helpers
.seed_save_and_set() and .seed_restore() in
R/utils.R.
print.iterative_scan() now accepts
max_show for API consistency with the other three print
methods. The default behavior is unchanged (the table is printed without
the region_ids and leaf_ids columns to keep it
compact); pass max_show = -1L to include both columns.
cran-comments.md file.
remotes::install_github("allanvc/treeSS").
summary() methods for circular_scan,
tree_scan, and treespatial_scan now have
proper roxygen descriptions and explicitly document that the
max_show argument added in 0.1.39 is forwarded to the
corresponding print() method via the ... argument. Each summary doc points
to the matching print doc for the full details.
The print methods now truncate long Leaf IDs and
Regions lists by default, in the style of
tibble. The motivation is the Chicago example: the most
likely cluster turns out to be the root of the FBI
crime taxonomy (1900+ leaves), which under the previous policy printed
every single leaf, producing more than 10 pages of console output in the
rendered PDF.
New argument max_show on
print.treespatial_scan(), print.tree_scan()
and print.circular_scan(). Default is 10L.
When a vector field exceeds this length, only the first
max_show values are shown and a tail of
... and N more is appended. Pass
max_show = -1L (or any value at least as large as the
field) to recover the previous full-output behavior.
The internal .cat_wrapped() helper gained the same
max_show argument (default 10L) and propagates
it through the print methods.
No changes to the underlying scan results: only the console / PDF
rendering of the result objects is affected. The full leaf and region
IDs are always available on
result$most_likely_cluster$leaf_ids and
result$most_likely_cluster$region_ids for programmatic
use.
The choice of default mirrors tibble’s behavior: enough
to give the reader a sense of the cluster contents, but not so much that
a single print() call dominates the document.
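The truncation rule described above can be sketched with a small helper. This is a hypothetical function written for illustration, not the package's .cat_wrapped(); the separator and the exact wording of the tail are assumptions.

```r
# Hypothetical helper: show at most max_show values, then a summary tail.
# A negative max_show (for example -1L) disables truncation.
format_truncated <- function(x, max_show = 10L) {
  n <- length(x)
  if (max_show < 0L || n <= max_show) {
    return(paste(x, collapse = ", "))
  }
  shown <- paste(x[seq_len(max_show)], collapse = ", ")
  paste0(shown, ", ... and ", n - max_show, " more")
}

leaf_ids <- 1:12
short <- format_truncated(leaf_ids)          # first 10 values plus a tail
full <- format_truncated(leaf_ids, -1L)      # all 12 values
cat(short, "\n")
cat(full, "\n")
```

Only the printed string is shortened; the vector passed in is left untouched, which matches the note above that the full IDs remain available on the result object.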
treespatial_scan() for combined spatial and
hierarchical cluster detection.
circular_scan() for Kulldorff’s circular
spatial scan statistic.
tree_scan() for the tree-based scan
statistic.
build_zones(),
aggregate_tree(), filter_clusters().
print() and summary() methods for all
scan result classes.