rtrim by example

Jeroen Pannekoek, Arco van Strien and Patrick Bogaart

2020-04-21

Introduction

As an example of the use of rtrim, counts of the Skylark (Alauda arvensis) will be analysed (the data are obtained from the Breeding Bird Monitoring Scheme in the Netherlands of Sovon and Statistics Netherlands). A first view of the overall structure of the data can be obtained from base R functions.

rm(list=ls())
library(rtrim)
#> Welcome to rtrim 2.0.6 Type ?`rtrim-package` to get started.
#> 
#> Attaching package: 'rtrim'
#> The following object is masked from 'package:stats':
#> 
#>     heatmap
data(skylark2) # use extended version of the Skylark dataset
summary(skylark2)
#>       site          year          count         habitat      deposition   
#>  site_01:  8   Min.   :1984   Min.   :  1.00   dunes:232   Min.   :1.000  
#>  site_02:  8   1st Qu.:1986   1st Qu.:  2.00   heath:208   1st Qu.:2.000  
#>  site_03:  8   Median :1988   Median :  5.00               Median :3.000  
#>  site_04:  8   Mean   :1988   Mean   : 12.55               Mean   :2.818  
#>  site_05:  8   3rd Qu.:1989   3rd Qu.: 11.00               3rd Qu.:3.000  
#>  site_06:  8   Max.   :1991   Max.   :131.00               Max.   :4.000  
#>  (Other):392                  NA's   :238                                 
#>      weight      
#>  Min.   : 1.000  
#>  1st Qu.: 1.000  
#>  Median :10.000  
#>  Mean   : 5.745  
#>  3rd Qu.:10.000  
#>  Max.   :10.000  
#> 

A more specific overview of the data can be obained by running the rtrim command count_summary. This function expects the presence of columns names count, year and site. If one or more of the actual data columns have different names, these can be specified.

idx <- which(names(skylark2)=="year")      # rename year->season
names(skylark2)[idx] <- "season"
count_summary(skylark2, year_col="season") # show that it works
#> Total number of sites                   55
#> Sites without positive counts (0): 
#> Number of observed zero counts            0
#> Number of observed positive counts      202
#> Total number of observed counts         202
#> Number of missing counts                238
#> Total number of counts                  440
names(skylark2)[idx] <- "year"             # revert to original name

In this case, we find that the Skylark dataset contains counts for 55 sites in 8 years (1984–1991). Of these 440 Site by Year combinations 202 were observed and the other 238 were missing. Two covariates are included: Habitat, which distinguishes between Dunes and Heathland sites and Deposition, which indicates the amount of acidic aerial deposition (This dataset was collected in the 1990’s when acidification was a prominent theme in ecological research).

Initial model estimation

To analyse these data with rtrim, we start with a model with time effects (model 3), ignoring the Habitat covariate. Model 3 is chosen because it makes no assumption about how population changes over time. Year effects are strictly independent of each other. A quick overview of the model results can be obtained by running summary() and plot(overall()).

z1 <- trim(count ~ site + year, data=skylark2, model=3, serialcor=TRUE, overdisp=TRUE)
summary(z1)
#> Call:
#> tools::buildVignettes(dir = ".", tangle = TRUE)
#> 
#> Model  : 3
#> Method : GEE (Convergence reached after 11 iterations)
#> 
#> Coefficients:
#>   time         add    se_add       mul     se_mul
#> 1    1  0.00000000 0.0000000 1.0000000 0.00000000
#> 2    2 -0.32017834 0.1054679 0.7260195 0.07657174
#> 3    3 -0.16867813 0.1054130 0.8447808 0.08905090
#> 4    4 -0.18965668 0.1083411 0.8272431 0.08962440
#> 5    5 -0.08241227 0.1070220 0.9208922 0.09855569
#> 6    6  0.02080919 0.1058650 1.0210272 0.10809105
#> 7    7  0.09969167 0.1082296 1.1048302 0.11957536
#> 8    8  0.15580476 0.1107841 1.1685980 0.12946210
#> 
#>  Overdispersion     : 1.3672
#>  Serial Correlation : 0.3024
#> 
#> Goodness of fit:
#>               Chi-square = 191.40, df=140, p=0.0026
#>         Likelihood Ratio = 194.80, df=140, p=0.0015
#>   AIC (up to a constant) = -85.20
plot(overall(z1))

Output from summary() includes:

The goodness-of-fit test (Likelihood Ratio) for this model amounts 194.8, with 140 degrees of freedom and p<0.05, which implies that the model has to be rejected.

Covariates

A possible improvement of the model for a better fit might be the inclusion of the Habitat covariate.

z2 <- trim(count ~ site + year + habitat, data=skylark2, model=3, serialcor=TRUE, overdisp=TRUE)
summary(z2)
#> Call:
#> tools::buildVignettes(dir = ".", tangle = TRUE)
#> 
#> Model  : 3
#> Method : GEE (Convergence reached after 12 iterations)
#> 
#> Coefficients:
#>       covar cat time        add    se_add       mul    se_mul
#> 1  baseline   0    1  0.0000000 0.0000000 1.0000000 0.0000000
#> 2  baseline   0    2 -0.2165232 0.1991433 0.8053138 0.1603728
#> 3  baseline   0    3 -0.3781379 0.2303445 0.6851360 0.1578173
#> 4  baseline   0    4 -0.4981991 0.2437072 0.6076240 0.1480823
#> 5  baseline   0    5 -0.7391541 0.2565409 0.4775177 0.1225028
#> 6  baseline   0    6 -0.5212781 0.2408642 0.5937612 0.1430158
#> 7  baseline   0    7 -0.6366106 0.2633641 0.5290827 0.1393414
#> 8  baseline   0    8 -0.7215137 0.2651985 0.4860160 0.1288907
#> 9   habitat   2    1  0.0000000 0.0000000 1.0000000 0.0000000
#> 10  habitat   2    2 -0.1444846 0.2323601 0.8654682 0.2011003
#> 11  habitat   2    3  0.2771464 0.2553396 1.3193596 0.3368847
#> 12  habitat   2    4  0.3865051 0.2681110 1.4718279 0.3946133
#> 13  habitat   2    5  0.7747286 0.2789551 2.1700031 0.6053334
#> 14  habitat   2    6  0.6449194 0.2642595 1.9058334 0.5036346
#> 15  habitat   2    7  0.8588036 0.2856443 2.3603350 0.6742164
#> 16  habitat   2    8  1.0307734 0.2882726 2.8032330 0.8080952
#> 
#>  Overdispersion     : 1.1616
#>  Serial Correlation : 0.2265
#> 
#> Goodness of fit:
#>               Chi-square = 154.50, df=133, p=0.0979
#>         Likelihood Ratio = 159.64, df=133, p=0.0575
#>   AIC (up to a constant) = -106.36

Now, the \(p\)-value of the likelihood ratio is (just slightly) above the classical threshold value of 0.05, and we decide to accept this model.

Model simplification

The advantage of Model 3 is, as argued above, the absence of any assumptions regarding the temporal trend. This, however, comes at a price: Postive counts are required for all individual years to allow estimation of the model parameters. So, this model cannot be used for cases where one or more years are missing. Furthermore, the model is far from being parsimonious. Even if the Skylark population follows a perfectly theoretical trend with constant population increase or decrease, each year is assigned it’s own growth parameters, even if these are identical to last year’s.

For both reasons it may be preferable to replace model 3 by model 2 (piecewise linear), especially because in one extreme case these models are equivalent. This is the case when all years are treated as change points, and each year the trend changes into a different one. Let’s first check this.

z3 <- trim(count ~ site + year + habitat, data=skylark2, model=2, changepoints="all",
           serialcor=TRUE, overdisp=TRUE)
summary(z3)
#> Call:
#> tools::buildVignettes(dir = ".", tangle = TRUE)
#> 
#> Model  : 2
#> Method : GEE (Convergence reached after 11 iterations)
#> 
#> Coefficients:
#>       covar cat from upto         add    se_add       mul    se_mul
#> 1  baseline   0 1984 1985 -0.21652346 0.1991430 0.8053136 0.1603726
#> 2  baseline   0 1985 1986 -0.16161496 0.2207074 0.8507687 0.1877709
#> 3  baseline   0 1986 1987 -0.12006156 0.2194797 0.8868658 0.1946491
#> 4  baseline   0 1987 1988 -0.24095494 0.2260048 0.7858770 0.1776120
#> 5  baseline   0 1988 1989  0.21787599 0.2249234 1.2434329 0.2796772
#> 6  baseline   0 1989 1990 -0.11533261 0.2180151 0.8910697 0.1942667
#> 7  baseline   0 1990 1991 -0.08490346 0.2329769 0.9186010 0.2140128
#> 8   habitat   2 1984 1985 -0.14448463 0.2323599 0.8654682 0.2011001
#> 9   habitat   2 1985 1986  0.42163114 0.2480259 1.5244461 0.3781021
#> 10  habitat   2 1986 1987  0.10935908 0.2335696 1.1155628 0.2605616
#> 11  habitat   2 1987 1988  0.38822335 0.2389049 1.4743590 0.3522316
#> 12  habitat   2 1988 1989 -0.12980911 0.2366361 0.8782631 0.2078287
#> 13  habitat   2 1989 1990  0.21388409 0.2301313 1.2384791 0.2850127
#> 14  habitat   2 1990 1991  0.17197025 0.2458819 1.1876425 0.2920198
#> 
#>  Overdispersion     : 1.1616
#>  Serial Correlation : 0.2265
#> 
#> Goodness of fit:
#>               Chi-square = 154.50, df=133, p=0.0979
#>         Likelihood Ratio = 159.64, df=133, p=0.0575
#>   AIC (up to a constant) = -106.36

and indeed this results in a similar model fit (although parameter values are different, of course).

The graphical display of the time-totals suggests that after an initial decline in counts, Skylark population recovers with approximately the same rate. One could either just argue if this recovery starts in 1985 or in 1987, and to what extent the recovery rate is ‘constant’, or one can look at the model statistics tfor a more objective analysis. In this case, we look at the Wald statistics associated with the Habitat covariate, and the individual changepoints:

wald(z3)
#> Wald test for significance of covariates
#>  Covariate        W df           p
#>    habitat 21.54981  7 0.003036166
#> 
#> Wald test for significance of changes in slope
#>  Changepoint   Wald_test df           p
#>         1984 10.27482731  2 0.005872859
#>         1985  9.17718172  2 0.010167175
#>         1986  3.08043508  2 0.214334471
#>         1987  1.53703040  2 0.463701061
#>         1988  1.63613306  2 0.441284040
#>         1989  0.88658542  2 0.641919282
#>         1990  0.01468516  2 0.992684312

The first test shows that there is a significant (at the 5% level) effect of the Habitat covariate on slopes (or year indices), showing that the slopes (year indices) for Dunes are different from those for Heathland. The tests for the significance of changes in slopes show that the only significant changes are for the years 1984, which means that the slope between 1984 and 1985 is different from zero, and 1985, which means that the slope between 1985 and 1986 is different from the slope between 1984 and 1985. This suggests that it should be possible to describe these data with a model with less than the full set of seven changepoints. To investigate this possibility, the stepwise procedure for selection of changepoints can be used by including stepwise=TRUE in the call to trim():

z4 <- trim(count ~ site + year + habitat, data=skylark2, model=2, changepoints="all",
           stepwise=TRUE, serialcor=TRUE, overdisp=TRUE)
summary(z4)
#> Call:
#> tools::buildVignettes(dir = ".", tangle = TRUE)
#> 
#> Model  : 2
#> Method : GEE (Convergence reached after 11 iterations)
#> 
#> Coefficients:
#>      covar cat from upto         add     se_add       mul     se_mul
#> 1 baseline   0 1984 1985 -0.26905602 0.18234931 0.7641004 0.13933319
#> 2 baseline   0 1985 1991 -0.07758256 0.04105089 0.9253506 0.03798646
#> 3  habitat   2 1984 1985 -0.02040236 0.20676977 0.9798044 0.20259392
#> 4  habitat   2 1985 1991  0.17488185 0.04372458 1.1911055 0.05208058
#> 
#>  Overdispersion     : 1.1265
#>  Serial Correlation : 0.2283
#> 
#> Goodness of fit:
#>               Chi-square = 161.09, df=143, p=0.1431
#>         Likelihood Ratio = 160.76, df=143, p=0.1471
#>   AIC (up to a constant) = -125.24
wald(z4)
#> Wald test for significance of covariates
#>  Covariate        W df            p
#>    habitat 18.50555  2 9.584531e-05
#> 
#> Wald test for significance of changes in slope
#>  Changepoint Wald_test df            p
#>         1984  10.99440  2 0.0040982293
#>         1985  14.64858  2 0.0006593285

Not surprisingly, this results in a model with only two changepoints left, at 1984 and 1985.

The difference between the models of this run (z4) and the previous (z3) can be tested by comparing their Likelihood Ratio’s, see also Section 2.5.

gof(z3)
#> Goodness of fit:
#>               Chi-square = 154.50, df=133, p=0.0979
#>         Likelihood Ratio = 159.64, df=133, p=0.0575
#>   AIC (up to a constant) = -106.36
LR3 <- gof(z3)$LR$LR # Get raw LR info for run 4
df3 <- gof(z3)$LR$df

gof(z4)
#> Goodness of fit:
#>               Chi-square = 161.09, df=143, p=0.1431
#>         Likelihood Ratio = 160.76, df=143, p=0.1471
#>   AIC (up to a constant) = -125.24
LR4 <- gof(z4)$LR$LR # idem for run 3
df4 <- gof(z4)$LR$df

# Test the differece by using the fact that the difference of two LR measures is
# asymptotically Chi^2 distributed
LR <- abs(LR4 - LR3)
df <- abs(df4 - df3)
p  <- 1 - pchisq(LR, df=df) # Use Chi-squared distribution
p
#> [1] 0.9997119

Since \(p \gg 0.05\) the \(H_0\) hypothesis that model z4 is a submodel of z3 in the sense that z4 can be obtained from z3 by setting some of z4 parameters to 0, cannot be rejected at the \(alpha=0.05\) level. In other words, both models are practically equivalent. The model z4, however, is the most sparse model, as shown by Akaike’s Information Criterion.

Concerning model z4, the Wald-test for the significance of the effects of the covariate on the slope parameters shows that this effect is very significant (p=0.0001) and the Wald-tests for the significance of changes in slope shows that both changes (at 1984 and 1985) are, as expected, also very significant (p is 0.004 and 0.0007, respectively).

The slope (in the additive parameterization) for a site is the sum of the constant term and the effects for the covariate values for that site. The effect for the first category of a covariate is zero and omitted from the output of summary() and coefs(). Thus, sites with covariate value 1 (Dunes) have slope -0.269 between 1984 and 1985 and -0.078 from 1985 onwards. The corresponding multiplicative parameters show that for Dunes there is a sharp decrease (the multipicative coefficient of 0.76 corresponds to \((1-0.76)\times100=24\%\) decrease) between 1984 and 1985 and a much smaller annual decrease (0.93, equivalent to 7% decrease) from 1985 to 1991. For sites with covariate value 2 (Heathland) the slope between 1984 and 1985 is \(-0.269 - 0.020 = -0.289\) corresponding with a multiplicative effect of \(0.764 \times 0.980 = 0.75\) which is only slightly different from the effect for Dunes for this time period. Apparently, the significant effect of the covariate is mainly determined by the trend from 1985 onwards. The parameters show indeed that Skylark populations increase in Heathland, while they decrease in Dunes. The slope is 0.097 (additive) and 1.10 multiplicative, corresponding to an annual increase of 10%.

Indices and time-totals

Model-based and imputed overall indices (based on the time-totals for all sites) can be obtained from TRIM output by calling index() or totals(). By default, only imputed indices or time-totals are returned. Model-based indices or time-totals can be added by using the which="both" option.

index(z4, which="both")
#>   time    fitted     se_fit   imputed     se_imp
#> 1 1984 1.0000000 0.00000000 1.0000000 0.00000000
#> 2 1985 0.7530579 0.06553658 0.7372610 0.06663569
#> 3 1986 0.7915817 0.06477129 0.8303956 0.07292113
#> 4 1987 0.8369110 0.06696343 0.8179245 0.07314308
#> 5 1988 0.8895273 0.07203043 0.8859143 0.07976879
#> 6 1989 0.9499771 0.07998826 0.9628009 0.08812600
#> 7 1990 1.0188773 0.09098601 1.0268504 0.09643874
#> 8 1991 1.0969220 0.10530053 1.1098103 0.10876964

Indices can also be plotted:

plot(index(z4))

In this plot the solid red line connects the indices for the individual years. In this case, the first year, 1984, is chosen as base year. Standard errors for the indices are shown using a transparent band and white ‘error-bars’.

In the last trim run, habitat was used as a covariate. Indices for covariates can also be computed, by setting the covars flag:

index(z4, which="both", covars=TRUE)
#>    covariate category time    fitted     se_fit   imputed     se_imp
#> 1    Overall   (none) 1984 1.0000000 0.00000000 1.0000000 0.00000000
#> 2    Overall   (none) 1985 0.7530579 0.06553658 0.7372610 0.06663569
#> 3    Overall   (none) 1986 0.7915817 0.06477129 0.8303956 0.07292113
#> 4    Overall   (none) 1987 0.8369110 0.06696343 0.8179245 0.07314308
#> 5    Overall   (none) 1988 0.8895273 0.07203043 0.8859143 0.07976879
#> 6    Overall   (none) 1989 0.9499771 0.07998826 0.9628009 0.08812600
#> 7    Overall   (none) 1990 1.0188773 0.09098601 1.0268504 0.09643874
#> 8    Overall   (none) 1991 1.0969220 0.10530053 1.1098103 0.10876964
#> 9    habitat    dunes 1984 1.0000000 0.00000000 1.0000000 0.00000000
#> 10   habitat    dunes 1985 0.7641004 0.13933319 0.7791479 0.17017428
#> 11   habitat    dunes 1986 0.7070608 0.12049989 0.7022059 0.17788473
#> 12   habitat    dunes 1987 0.6542792 0.10988682 0.6448394 0.17652462
#> 13   habitat    dunes 1988 0.6054377 0.10615237 0.5533376 0.17018093
#> 14   habitat    dunes 1989 0.5602421 0.10724676 0.5814696 0.18256289
#> 15   habitat    dunes 1990 0.5184204 0.11109150 0.5270963 0.17883059
#> 16   habitat    dunes 1991 0.4797206 0.11609263 0.4820668 0.17468761
#> 17   habitat    heath 1984 1.0000000 0.00000000 1.0000000 0.00000000
#> 18   habitat    heath 1985 0.7486689 0.07298069 0.7203724 0.07707145
#> 19   habitat    heath 1986 0.8251756 0.07643505 0.8820811 0.09258138
#> 20   habitat    heath 1987 0.9095004 0.08191091 0.8877115 0.09474977
#> 21   habitat    heath 1988 1.0024425 0.09019392 1.0200076 0.10685876
#> 22   habitat    heath 1989 1.1048823 0.10206243 1.1165518 0.11809890
#> 23   habitat    heath 1990 1.2177904 0.11821918 1.2283488 0.12999561
#> 24   habitat    heath 1991 1.3422366 0.13928361 1.3629135 0.14869062

Indices are collected in a single dataframe, but can be easily separated by using e.g. subset().

Again, indices for the covariate categories can be plotted without much effort:

plot(index(z4,which="fitted",covars=TRUE))

The model based indices reflect the strong decrease from 1984 to 1985 and the smaller decrease from 1985 onwards for Dunes (habitat category 1) and the similar decrease from 1984 to 1985 and the increase from that year onwards for Heathland (category 2). The overall model based indices are between the indices for Dunes and Heathland and show much less change over time than when Dunes and Heathland are treated separately. The imputed indices are very similar to the model based indices with the exception that the imputed index for 1986 is larger than the model based index for that year.

Multiple covariates

One may try to extend the model further by also incorporating the second covariate ‘deposition’ in the model. This covariate is a measure for the amount of acidic aerial deposition. This time, the time-effects model (model 3) cannot be estimated due to lack of data in particular years:

check_observations(skylark2, model=3, covars=c("habitat","deposition"))
#> $errors
#> $errors$deposition
#>   year deposition
#> 1 1990          1
#> 2 1991          1
#> 
#> 
#> $sufficient
#> [1] FALSE

The linear trend model with covariates can still be estimated.

z5 <- trim(count ~ site + year + habitat+deposition, data=skylark2, model=2,
           serialcor=TRUE, overdisp=TRUE)
#> Warning: Serial correlation is very low (rho=0.162); consider disabling it.
summary(z5)
#> Call:
#> tools::buildVignettes(dir = ".", tangle = TRUE)
#> 
#> Model  : 2
#> Method : GEE (Convergence reached after 11 iterations)
#> 
#> Coefficients:
#>        covar cat from upto           add     se_add       mul     se_mul
#> 1   baseline   0 1984 1991 -0.0547486466 0.30356576 0.9467231 0.28739271
#> 2    habitat   2 1984 1991  0.1631245102 0.04752131 1.1771833 0.05594129
#> 3 deposition   2 1984 1991 -0.0399174384 0.30595598 0.9608688 0.29398355
#> 4 deposition   3 1984 1991 -0.0734776978 0.30615962 0.9291569 0.28447032
#> 5 deposition   4 1984 1991 -0.0002787989 0.30637534 0.9997212 0.30628993
#> 
#>  Overdispersion     : 1.1557
#>  Serial Correlation : 0.1620
#> 
#> Goodness of fit:
#>               Chi-square = 164.11, df=142, p=0.0987
#>         Likelihood Ratio = 147.99, df=142, p=0.3482
#>   AIC (up to a constant) = -136.01
wald(z5)
#> Wald test for significance of covariates
#>   Covariate         W df            p
#>     habitat 11.783157  1 0.0005976904
#>  deposition  8.013505  3 0.0457334254
#> 
#> Wald test for significance of changes in slope
#>  Changepoint Wald_test df            p
#>         1984  47.85308  5 3.805843e-09

Weighting

So far, the overall indices are the indices that correspond with the time totals summed over all sites. The next run shows the results if the sites in Dunes are weighted 10 times.

z6 <- trim(count ~ site + year + habitat, data=skylark2, model=2, changepoints="auto",
           serialcor=TRUE, overdisp=TRUE, weights="weight")
idx = index(z6, "fitted", covars=TRUE)
plot(idx)

The separate indices for Dunes and Heathland remain similar, of course, but due to the weighting the overall index decreases from 1985 onwards.