Spitalfields: Comparison with known age-at-death

library(baytaAAR)

To show how the age estimations of baytaAAR can be compared with known age-at-death, we use the data set of Spitalfields (Buckberry and Chamberlain 2002). The individuals come from a crypt at Christ Church in Spitalfields, London and were identified in terms of age and sex by plates on their coffins. The following table gives an impression of the data:

data(spitalfields, package = "baytaAAR")
head(spitalfields)
#>   Age TO ST MI MA AP Sex
#> 1  16  1  1  1  1  2   F
#> 2  17  2  1  1  1  1   F
#> 3  19  1  2  1  1  1   F
#> 4  23  3  2  2  1  2   F
#> 5  27  1  1  3  1  1   F
#> 6  27  2  2  2  1  1   F

Buckberry and Chamberlain introduced a new scoring system of the pubic symphysis and thereby replaced a single score with multiple levels to several traits with fewer levels each.

For running the model, we use NIMBLE and the multivariate normal likelihood (mnorm). In a real-life setting, multicore should be set to TRUE to reduce computation time. Still, even on modern hardware, the model will run for up to several weeks until the usual quality criteria are met. However, already a shorter run demonstrates the principle (for the full run see the results in Müller-Scheeßel, Rinne, and Fuchs 2026).

spitalfields_res <- bay.ta(
  framework = "NIMBLE",
  algorithm = "mnorm",
  multicore = F,
  method = spitalfields[,c(2:6)],
  minimum_age = 16,
  parameters = c("b", "a", "beta0", "beta", "thresh", "age.s", "Ustar"),
  thinSteps = 200,
  numSavedSteps = 500,
  seed = 331
)

For the sake of illustration, we skip the diagnostic checks here and move directly to the comparison with known ages-at-death. For this, baytaAAR provides several custom functions which essentially work as wrappers around other R functions for convenience.

As the name suggests, the function age.comp.summary() summarizes standard quality measures of the comparison of known and estimated ages-at-death. These fall into two groups:

  1. Frequentist point estimates
  1. Bayesian density estimates

For nearly all measures, lower is better. The only exception is corrPearson where larger is better. Because the differences are squared with the RMSE, outliers influence this measure more strongly than this is the case for Inaccuracy, and therefore, the former tends to be higher than the latter. A similar relationship exists between CRPS and TMNLP. Overall, Müller-Scheeßel et al. found that the CRPS seems to be more informative than the TMNLP and, therefore, the CRPS is probably the most informative and single best measure to compare the results between different age estimation methods in terms of their quality to ‘predict’ age-at-death.

age.comp.summary() assumes that you choose one of the point estimates Mode, Median, Mean but in the code snippet below we demonstrate how all three measures can be computed and tabulated in one go. age.comp.summary() expects the raw output from bay.ta() as well as a vector of known age-at-death. The latter may contain NAs, these are subsequently simply ignored.

summary_list <- lapply(c("Mode", "Median", "Mean"), function(choice) {
  age.comp.summary(mcmc_list = spitalfields_res, 
                   known_age = spitalfields$Age,
                   mean_choice = choice)})
summary_mat <- do.call(rbind, summary_list)
rownames(summary_mat) <- c("Mode", "Median", "Mean")
summary_mat |> t() |> knitr::kable(digits = 2)
Mode Median Mean
Bias 2.64 4.33 5.42
corrPearson 0.63 0.67 0.67
corr_p 0.00 0.00 0.00
Residual_slope 0.31 0.39 0.47
Inaccuracy 12.52 11.21 11.19
RMSE 15.88 14.28 14.06
TMNLP 4.95 4.95 4.95
CRPS 8.03 8.03 8.03

As the above table shows, it can make quite a difference which mean measures Mean (= arithmetic mean), Median or Mode is chosen for the frequentist point estimate. In this case, the differences are even exaggerated because there are not enough iterations for stable estimates. On the other hand, the Bayesian estimates are not influenced by the choice of the mean measure because they are based on the probability density of the posterior.

Age estimation methods usually also supply minimum and maximum values for their estimates, so a range within which the true age-at-death should fall. Ideally, this range should adapt to the credibility/confidence level chosen, and the percentage of the true cases, the so-called coverage should be close to the chosen level. To test coverage formally, in the literature the cumulative binomial test has been proposed (Konigsberg et al. 2008). The wrapper sequential.binom.test() calculates this test for several given credibility levels:

sequential.binom.test(spitalfields_res,
                      HDImass = c(seq(0.5, 0.9, 0.1), 0.95),
                      known_age = spitalfields$Age) |>
  knitr::kable(digits = 3)
coverage n_in perc CI_low CI_up p_value
0.50 80 0.444 0.371 0.520 0.157
0.60 99 0.550 0.474 0.624 0.172
0.70 122 0.678 0.604 0.745 0.516
0.80 137 0.761 0.692 0.821 0.192
0.90 158 0.878 0.821 0.922 0.319
0.95 166 0.922 0.873 0.957 0.088

In the table above, for all levels the confidence intervals of the observed coverage include the expected coverage. Thus, this result suggests that the model is able to ‘predict’ age-at-death with the correct level of confidence.

Finally, age.comp.plot() shows the relationship between estimated and known ages-at-death graphically. Its four plots convey the following information:

Again, it can make quite a difference which mean measure is chosen for the point estimate for the two lower plots. The left top plot is not affected because the highest density intervals derive from the density distribution of the age estimates. For the right top plot, the Gompertz parameters are always taken from the arithmetic mean because only then the deterministic relationship between \(\alpha\) and \(\beta\) holds. As the code below shows, age.comp.plot() expects the output from diagnostic.summary() so the credibility level chosen (HDImass) there also determines what is shown in the top left plot. age.comp.plot() presupposes, of course, that there is a vector with known ages.

diagnostic.summary(spitalfields_res, HDImass = 0.95) |>
  age.comp.plot(known_age = spitalfields$Age)

The top left plot visualizes the result of the bottom line of the last table: Only eleven estimates, marked in red, do not contain the true age-at-death at the default credibility level 0.95. The top right plot illustrates the good agreement of the estimated Gompertz parameters fed into a Gompertz function with the density of the true ages-at-death. The bottom left plot is related to the correlation expressed in corrPearson, of course. The bottom-right plot shows the same relationship, relative to the dotted line in the bottom-left plot which is equivalent to the zero-line on the bottom right. The regression line in blue on the bottom right is nothing else than the quality measure Residual_slope. This plot also visualizes the spread of the difference between true and estimated age-at-death.


References

Buckberry, J. L., and A. T. Chamberlain. 2002. Age estimation from the auricular surface of the ilium: a revised method.” American Journal of Physical Anthropology 119 (3): 231–39. https://doi.org/10.1002/ajpa.10130.
Gneiting, Tilmann, Fadoua Balabdaoui, and Adrian E. Raftery. 2007. “Probabilistic Forecasts, Calibration and Sharpness.” Journal of the Royal Statistical Society Series B: Statistical Methodology 69 (2): 243–68.
Konigsberg, Lyle W., Nicholas P. Herrmann, Daniel J. Wescott, and Erin H. Kimmerle. 2008. “Estimation and Evidence in Forensic Anthropology: Age-at-Death.” Journal of Forensic Sciences 53 (3): 541–57.
Müller-Scheeßel, Nils, Christoph Rinne, and Katharina Fuchs. 2026. A Fully Bayesian Approach to Adult Skeletal Age Estimation: Multivariate Latent Trait Modeling with Markov Chain Monte Carlo Sampling.” American Journal of Biological Anthropology 190 (2): e70289. https://doi.org/10.1002/ajpa.70289.
Stull, Kyra E., Elaine Y. Chu, Louise K. Corron, and Michael H. Price. 2022. “Subadult Age Estimation Using the Mixed Cumulative Probit and a Contemporary United States Population.” Forensic Sciences 2 (4): 741–79.