Introduction_to_PHEindicatormethods

Georgina Anderson

2019-05-15

Introduction

This vignette introduces the following functions from the PHEindicatormethods package and provides basic sample code to demonstrate their execution. The code included is based on the code provided within the ‘examples’ section of the function documentation. This vignette does not explain the methods applied in detail but these can (optionally) be output alongside the statistics or for a more detailed explanation, please see the references section of the function documentation.

The following packages must be installed and loaded if not already available

library(PHEindicatormethods)
library(dplyr)

Package functions

This vignette covers the following functions available within the first release of the package (v1.0.8) but has been updated to apply to these functions in their latest release versions. If further functions are added to the package in future releases these will be explained elsewhere.

Function Type Description
phe_proportion Non-aggregate Performs a calculation on each row of data (unless data is grouped)
phe_rate Non-aggregate Performs a calculation on each row of data (unless data is grouped)
phe_mean Aggregate Performs a calculation on each grouping set
phe_dsr Aggregate, standardised Performs a calculation on each grouping set and requires additional reference inputs
phe_smr Aggregate, standardised Performs a calculation on each grouping set and requires additional reference inputs
phe_isr Aggregate, standardised Performs a calculation on each grouping set and requires additional reference inputs

Non-aggregate functions

Create some test data for the non-aggregate functions

The following code chunk creates a data frame containing observed number of events and populations for 4 geographical areas over 2 time periods that is used later to demonstrate the PHEindicatormethods package functions:

df <- data.frame(
        area = rep(c("Area1","Area2","Area3","Area4"), 2),
        year = rep(2015:2016, each = 4),
        obs = sample(100, 2 * 4, replace = TRUE),
        pop = sample(100:200, 2 * 4, replace = TRUE))
df
#>    area year obs pop
#> 1 Area1 2015  20 178
#> 2 Area2 2015  34 139
#> 3 Area3 2015  99 110
#> 4 Area4 2015  89 129
#> 5 Area1 2016  71 104
#> 6 Area2 2016  24 108
#> 7 Area3 2016 100 199
#> 8 Area4 2016  77 196

Execute phe_proportion and phe_rate

INPUT: The phe_proportion and phe_rate functions take a single data frame as input with columns representing the numerators and denominators for the statistic. Any other columns present will be retained in the output.

OUTPUT: The functions output the original data frame with additional columns appended. By default the additional columns are the proportion or rate, the lower 95% confidence limit, the upper 95% confidence limit, the confidence level, the statistic name and the method.

OPTIONS: The functions also accept additional arguments to specify the level of confidence, the multiplier and a reduced level of detail to be output.

Here are some example code chunks to demonstrate these two functions and the arguments that can optionally be specified

# default proportion
phe_proportion(df, obs, pop)
#>    area year obs pop     value   lowercl   uppercl confidence
#> 1 Area1 2015  20 178 0.1123596 0.0739225 0.1671747        95%
#> 2 Area2 2015  34 139 0.2446043 0.1806468 0.3222986        95%
#> 3 Area3 2015  99 110 0.9000000 0.8297650 0.9432399        95%
#> 4 Area4 2015  89 129 0.6899225 0.6055856 0.7632751        95%
#> 5 Area1 2016  71 104 0.6826923 0.5881007 0.7642685        95%
#> 6 Area2 2016  24 108 0.2222222 0.1541255 0.3094008        95%
#> 7 Area3 2016 100 199 0.5025126 0.4336577 0.5712723        95%
#> 8 Area4 2016  77 196 0.3928571 0.3271730 0.4626604        95%
#>         statistic method
#> 1 proportion of 1 Wilson
#> 2 proportion of 1 Wilson
#> 3 proportion of 1 Wilson
#> 4 proportion of 1 Wilson
#> 5 proportion of 1 Wilson
#> 6 proportion of 1 Wilson
#> 7 proportion of 1 Wilson
#> 8 proportion of 1 Wilson

# specify confidence level for proportion
phe_proportion(df, obs, pop, confidence=99.8)
#>    area year obs pop     value    lowercl   uppercl confidence
#> 1 Area1 2015  20 178 0.1123596 0.05815251 0.2060419      99.8%
#> 2 Area2 2015  34 139 0.2446043 0.15080575 0.3712392      99.8%
#> 3 Area3 2015  99 110 0.9000000 0.77743894 0.9586576      99.8%
#> 4 Area4 2015  89 129 0.6899225 0.55469910 0.7989650      99.8%
#> 5 Area1 2016  71 104 0.6826923 0.53148183 0.8031739      99.8%
#> 6 Area2 2016  24 108 0.2222222 0.12416324 0.3654136      99.8%
#> 7 Area3 2016 100 199 0.5025126 0.39540552 0.6093895      99.8%
#> 8 Area4 2016  77 196 0.3928571 0.29244940 0.5032203      99.8%
#>         statistic method
#> 1 proportion of 1 Wilson
#> 2 proportion of 1 Wilson
#> 3 proportion of 1 Wilson
#> 4 proportion of 1 Wilson
#> 5 proportion of 1 Wilson
#> 6 proportion of 1 Wilson
#> 7 proportion of 1 Wilson
#> 8 proportion of 1 Wilson

# specify to output proportions as percentages
phe_proportion(df, obs, pop, multiplier=100)
#>    area year obs pop    value  lowercl  uppercl confidence  statistic
#> 1 Area1 2015  20 178 11.23596  7.39225 16.71747        95% percentage
#> 2 Area2 2015  34 139 24.46043 18.06468 32.22986        95% percentage
#> 3 Area3 2015  99 110 90.00000 82.97650 94.32399        95% percentage
#> 4 Area4 2015  89 129 68.99225 60.55856 76.32751        95% percentage
#> 5 Area1 2016  71 104 68.26923 58.81007 76.42685        95% percentage
#> 6 Area2 2016  24 108 22.22222 15.41255 30.94008        95% percentage
#> 7 Area3 2016 100 199 50.25126 43.36577 57.12723        95% percentage
#> 8 Area4 2016  77 196 39.28571 32.71730 46.26604        95% percentage
#>   method
#> 1 Wilson
#> 2 Wilson
#> 3 Wilson
#> 4 Wilson
#> 5 Wilson
#> 6 Wilson
#> 7 Wilson
#> 8 Wilson

# specify level of detail to output for proportion
phe_proportion(df, obs, pop, confidence=99.8, multiplier=100)
#>    area year obs pop    value   lowercl  uppercl confidence  statistic
#> 1 Area1 2015  20 178 11.23596  5.815251 20.60419      99.8% percentage
#> 2 Area2 2015  34 139 24.46043 15.080575 37.12392      99.8% percentage
#> 3 Area3 2015  99 110 90.00000 77.743894 95.86576      99.8% percentage
#> 4 Area4 2015  89 129 68.99225 55.469910 79.89650      99.8% percentage
#> 5 Area1 2016  71 104 68.26923 53.148183 80.31739      99.8% percentage
#> 6 Area2 2016  24 108 22.22222 12.416324 36.54136      99.8% percentage
#> 7 Area3 2016 100 199 50.25126 39.540552 60.93895      99.8% percentage
#> 8 Area4 2016  77 196 39.28571 29.244940 50.32203      99.8% percentage
#>   method
#> 1 Wilson
#> 2 Wilson
#> 3 Wilson
#> 4 Wilson
#> 5 Wilson
#> 6 Wilson
#> 7 Wilson
#> 8 Wilson

# specify level of detail to output for proportion and remove metadata columns
phe_proportion(df, obs, pop, confidence=99.8, multiplier=100, type="standard")
#>    area year obs pop    value   lowercl  uppercl
#> 1 Area1 2015  20 178 11.23596  5.815251 20.60419
#> 2 Area2 2015  34 139 24.46043 15.080575 37.12392
#> 3 Area3 2015  99 110 90.00000 77.743894 95.86576
#> 4 Area4 2015  89 129 68.99225 55.469910 79.89650
#> 5 Area1 2016  71 104 68.26923 53.148183 80.31739
#> 6 Area2 2016  24 108 22.22222 12.416324 36.54136
#> 7 Area3 2016 100 199 50.25126 39.540552 60.93895
#> 8 Area4 2016  77 196 39.28571 29.244940 50.32203

# default rate
phe_rate(df, obs, pop)
#>    area year obs pop    value   lowercl   uppercl confidence
#> 1 Area1 2015  20 178 11235.96  6860.353  17353.87        95%
#> 2 Area2 2015  34 139 24460.43 16936.943  34182.07        95%
#> 3 Area3 2015  99 110 90000.00 73145.834 109572.81        95%
#> 4 Area4 2015  89 129 68992.25 55404.874  84901.82        95%
#> 5 Area1 2016  71 104 68269.23 53316.611  86113.57        95%
#> 6 Area2 2016  24 108 22222.22 14234.001  33066.31        95%
#> 7 Area3 2016 100 199 50251.26 40885.480  61119.57        95%
#> 8 Area4 2016  77 196 39285.71 31002.547  49101.04        95%
#>         statistic method
#> 1 rate per 100000  Byars
#> 2 rate per 100000  Byars
#> 3 rate per 100000  Byars
#> 4 rate per 100000  Byars
#> 5 rate per 100000  Byars
#> 6 rate per 100000  Byars
#> 7 rate per 100000  Byars
#> 8 rate per 100000  Byars

# specify rate parameters
phe_rate(df, obs, pop, confidence=99.8, multiplier=100)
#>    area year obs pop    value  lowercl   uppercl confidence    statistic
#> 1 Area1 2015  20 178 11.23596  5.01281  21.39609      99.8% rate per 100
#> 2 Area2 2015  34 139 24.46043 13.49041  40.42569      99.8% rate per 100
#> 3 Area3 2015  99 110 90.00000 64.59850 121.62697      99.8% rate per 100
#> 4 Area4 2015  89 129 68.99225 48.56588  94.73311      99.8% rate per 100
#> 5 Area1 2016  71 104 68.26923 45.92149  97.22584      99.8% rate per 100
#> 6 Area2 2016  24 108 22.22222 10.75409  40.15731      99.8% rate per 100
#> 7 Area3 2016 100 199 50.25126 36.13251  67.81087      99.8% rate per 100
#> 8 Area4 2016  77 196 39.28571 26.87893  55.19594      99.8% rate per 100
#>   method
#> 1  Byars
#> 2  Byars
#> 3  Byars
#> 4  Byars
#> 5  Byars
#> 6  Byars
#> 7  Byars
#> 8  Byars

# specify rate parameters and reduce columns output and remove metadata columns
phe_rate(df, obs, pop, type="standard", confidence=99.8, multiplier=100)
#>    area year obs pop    value  lowercl   uppercl
#> 1 Area1 2015  20 178 11.23596  5.01281  21.39609
#> 2 Area2 2015  34 139 24.46043 13.49041  40.42569
#> 3 Area3 2015  99 110 90.00000 64.59850 121.62697
#> 4 Area4 2015  89 129 68.99225 48.56588  94.73311
#> 5 Area1 2016  71 104 68.26923 45.92149  97.22584
#> 6 Area2 2016  24 108 22.22222 10.75409  40.15731
#> 7 Area3 2016 100 199 50.25126 36.13251  67.81087
#> 8 Area4 2016  77 196 39.28571 26.87893  55.19594

These functions can also return aggregate data if the input dataframes are grouped:

# default proportion - grouped
df %>%
  group_by(year) %>%
  phe_proportion(obs, pop)
#> # A tibble: 2 x 9
#>    year   obs   pop value lowercl uppercl confidence statistic       method
#>   <int> <int> <int> <dbl>   <dbl>   <dbl> <chr>      <chr>           <chr> 
#> 1  2015   242   556 0.435   0.395   0.477 95%        proportion of 1 Wilson
#> 2  2016   272   607 0.448   0.409   0.488 95%        proportion of 1 Wilson

# default rate - grouped
df %>%
  group_by(year) %>%
  phe_rate(obs, pop)
#> # A tibble: 2 x 9
#>    year   obs   pop  value lowercl uppercl confidence statistic      method
#>   <int> <int> <int>  <dbl>   <dbl>   <dbl> <chr>      <chr>          <chr> 
#> 1  2015   242   556 43525.  38213.  49369. 95%        rate per 1000~ Byars 
#> 2  2016   272   607 44811.  39643.  50465. 95%        rate per 1000~ Byars



Aggregate functions

The remaining functions aggregate the rows in the input data frame to produce a single statistic. It is also possible to calculate multiple statistics in a single execution of these functions if the input data frame is grouped - for example by indicator ID, geographic area or time period (or all three). The output contains only the grouping variables and the values calculated by the function - any additional unused columns provided in the input data frame will not be retained in the output.

The df test data generated earlier can be used to demonstrate phe_mean:

Execute phe_mean

INPUT: The phe_mean function take a single data frame as input with a column representing the numbers to be averaged.

OUTPUT: By default, the function outputs one row per grouping set containing the grouping variable values (if applicable), the mean, the lower 95% confidence limit, the upper 95% confidence limit, the confidence level, the statistic name and the method.

OPTIONS: The function also accepts additional arguments to specify the level of confidence and a reduced level of detail to be output.

Here are some example code chunks to demonstrate the phe_mean function and the arguments that can optionally be specified

# default mean
phe_mean(df,obs)
#>   value_sum value_count    stdev value  lowercl  uppercl confidence
#> 1       514           8 33.37985 64.25 36.34375 92.15625        95%
#>   statistic                   method
#> 1      mean Student's t-distribution

# multiple means in a single execution with 99.8% confidence
df %>%
    group_by(year) %>%
        phe_mean(obs, confidence=0.998)
#> # A tibble: 2 x 10
#>    year value_sum value_count stdev value lowercl uppercl confidence
#>   <int>     <int>       <int> <dbl> <dbl>   <dbl>   <dbl> <chr>     
#> 1  2015       242           4  39.3  60.5  -140.     261. 99.8%     
#> 2  2016       272           4  31.9  68     -94.8    231. 99.8%     
#> # ... with 2 more variables: statistic <chr>, method <chr>

# multiple means in a single execution with 99.8% confidence and data-only output
df %>%
    group_by(year) %>%
        phe_mean(obs, type = "standard", confidence=0.998)
#> # A tibble: 2 x 7
#>    year value_sum value_count stdev value lowercl uppercl
#>   <int>     <int>       <int> <dbl> <dbl>   <dbl>   <dbl>
#> 1  2015       242           4  39.3  60.5  -140.     261.
#> 2  2016       272           4  31.9  68     -94.8    231.

Standardised Aggregate functions

Create some test data for the standardised aggregate functions

The following code chunk creates a data frame containing observed number of events and populations by age band for 4 areas, 5 time periods and 2 sexes:

df_std <- data.frame(
            area = rep(c("Area1", "Area2", "Area3", "Area4"), each = 19 * 2 * 5),
            year = rep(2006:2010, each = 19 * 2),
            sex = rep(rep(c("Male", "Female"), each = 19), 5),
            ageband = rep(c(0, 5,10,15,20,25,30,35,40,45,
                           50,55,60,65,70,75,80,85,90), times = 10),
            obs = sample(200, 19 * 2 * 5 * 4, replace = TRUE),
            pop = sample(10000:20000, 19 * 2 * 5 * 4, replace = TRUE))
head(df_std)
#>    area year  sex ageband obs   pop
#> 1 Area1 2006 Male       0   9 13434
#> 2 Area1 2006 Male       5  87 11372
#> 3 Area1 2006 Male      10 173 17767
#> 4 Area1 2006 Male      15 126 14290
#> 5 Area1 2006 Male      20 126 10457
#> 6 Area1 2006 Male      25  48 15674

Execute phe_dsr

INPUT: The minimum input requirement for the phe_dsr function is a single data frame with columns representing the numerators and denominators for each standardisation category. This is sufficient if the data is:

The 2013 European Standard Population is provided within the package in vector form (esp2013) and is used by default by this function. Alternative standard populations can be used but must be provided by the user. When the function joins a standard population vector to the input data frame it does this by position so it is important that the data is sorted accordingly. This is a user responsibility.

The function can also accept standard populations provided as a column within the input data frame.

OUTPUT: By default, the function outputs one row per grouping set containing the grouping variable values, the total count, the total population, the dsr, the lower 95% confidence limit, the upper 95% confidence limit, the confidence level, the statistic name and the method.

OPTIONS: If standard populations are being provided as a column within the input data frame then the user must specify this using the stdpoptype argument as the function expects a vector by default. The function also accepts additional arguments to specify the standard populations, the level of confidence, the multiplier and a reduced level of detail to be output.

Here are some example code chunks to demonstrate the phe_dsr function and the arguments that can optionally be specified

# calculate separate dsrs for each area, year and sex
df_std %>%
    group_by(area, year, sex) %>%
    phe_dsr(obs, pop)
#> # A tibble: 40 x 11
#> # Groups:   area, year [20]
#>    area   year sex   total_count total_pop value lowercl uppercl confidence
#>    <fct> <int> <fct>       <int>     <int> <dbl>   <dbl>   <dbl> <chr>     
#>  1 Area1  2006 Fema~        2370    285858  858.    821.    895. 95%       
#>  2 Area1  2006 Male         2120    283503  706.    673.    739. 95%       
#>  3 Area1  2007 Fema~        1738    277490  636.    605.    669. 95%       
#>  4 Area1  2007 Male         1799    268074  638.    605.    671. 95%       
#>  5 Area1  2008 Fema~        1529    298730  531.    504.    560. 95%       
#>  6 Area1  2008 Male         1870    294577  682.    650.    716. 95%       
#>  7 Area1  2009 Fema~        2156    278258  800.    764.    838. 95%       
#>  8 Area1  2009 Male         1978    272909  749.    714.    786. 95%       
#>  9 Area1  2010 Fema~        1961    306239  688.    656.    722. 95%       
#> 10 Area1  2010 Male         1808    287200  597.    567.    628. 95%       
#> # ... with 30 more rows, and 2 more variables: statistic <chr>,
#> #   method <chr>

# calculate separate dsrs for each area, year and sex and drop metadata fields from output
df_std %>%
    group_by(area, year, sex) %>%
    phe_dsr(obs, pop, type="standard")
#> # A tibble: 40 x 8
#> # Groups:   area, year [20]
#>    area   year sex    total_count total_pop value lowercl uppercl
#>    <fct> <int> <fct>        <int>     <int> <dbl>   <dbl>   <dbl>
#>  1 Area1  2006 Female        2370    285858  858.    821.    895.
#>  2 Area1  2006 Male          2120    283503  706.    673.    739.
#>  3 Area1  2007 Female        1738    277490  636.    605.    669.
#>  4 Area1  2007 Male          1799    268074  638.    605.    671.
#>  5 Area1  2008 Female        1529    298730  531.    504.    560.
#>  6 Area1  2008 Male          1870    294577  682.    650.    716.
#>  7 Area1  2009 Female        2156    278258  800.    764.    838.
#>  8 Area1  2009 Male          1978    272909  749.    714.    786.
#>  9 Area1  2010 Female        1961    306239  688.    656.    722.
#> 10 Area1  2010 Male          1808    287200  597.    567.    628.
#> # ... with 30 more rows

# calculate same specifying standard population in vector form
df_std %>%
    group_by(area, year, sex) %>%
    phe_dsr(obs, pop, stdpop = esp2013)
#> # A tibble: 40 x 11
#> # Groups:   area, year [20]
#>    area   year sex   total_count total_pop value lowercl uppercl confidence
#>    <fct> <int> <fct>       <int>     <int> <dbl>   <dbl>   <dbl> <chr>     
#>  1 Area1  2006 Fema~        2370    285858  858.    821.    895. 95%       
#>  2 Area1  2006 Male         2120    283503  706.    673.    739. 95%       
#>  3 Area1  2007 Fema~        1738    277490  636.    605.    669. 95%       
#>  4 Area1  2007 Male         1799    268074  638.    605.    671. 95%       
#>  5 Area1  2008 Fema~        1529    298730  531.    504.    560. 95%       
#>  6 Area1  2008 Male         1870    294577  682.    650.    716. 95%       
#>  7 Area1  2009 Fema~        2156    278258  800.    764.    838. 95%       
#>  8 Area1  2009 Male         1978    272909  749.    714.    786. 95%       
#>  9 Area1  2010 Fema~        1961    306239  688.    656.    722. 95%       
#> 10 Area1  2010 Male         1808    287200  597.    567.    628. 95%       
#> # ... with 30 more rows, and 2 more variables: statistic <chr>,
#> #   method <chr>

# calculate the same dsrs by appending the standard populations to the data frame
df_std %>%
    mutate(refpop = rep(esp2013,40)) %>%
    group_by(area, year, sex) %>%
    phe_dsr(obs,pop, stdpop=refpop, stdpoptype="field")
#> # A tibble: 40 x 11
#> # Groups:   area, year [20]
#>    area   year sex   total_count total_pop value lowercl uppercl confidence
#>    <fct> <int> <fct>       <int>     <int> <dbl>   <dbl>   <dbl> <chr>     
#>  1 Area1  2006 Fema~        2370    285858  858.    821.    895. 95%       
#>  2 Area1  2006 Male         2120    283503  706.    673.    739. 95%       
#>  3 Area1  2007 Fema~        1738    277490  636.    605.    669. 95%       
#>  4 Area1  2007 Male         1799    268074  638.    605.    671. 95%       
#>  5 Area1  2008 Fema~        1529    298730  531.    504.    560. 95%       
#>  6 Area1  2008 Male         1870    294577  682.    650.    716. 95%       
#>  7 Area1  2009 Fema~        2156    278258  800.    764.    838. 95%       
#>  8 Area1  2009 Male         1978    272909  749.    714.    786. 95%       
#>  9 Area1  2010 Fema~        1961    306239  688.    656.    722. 95%       
#> 10 Area1  2010 Male         1808    287200  597.    567.    628. 95%       
#> # ... with 30 more rows, and 2 more variables: statistic <chr>,
#> #   method <chr>

# calculate for under 75s by filtering out records for 75+ from input data frame and standard population
df_std %>%
    filter(ageband <= 70) %>%
    group_by(area, year, sex) %>%
    phe_dsr(obs, pop, stdpop = esp2013[1:15])
#> # A tibble: 40 x 11
#> # Groups:   area, year [20]
#>    area   year sex   total_count total_pop value lowercl uppercl confidence
#>    <fct> <int> <fct>       <int>     <int> <dbl>   <dbl>   <dbl> <chr>     
#>  1 Area1  2006 Fema~        2000    229818  896.    856.    937. 95%       
#>  2 Area1  2006 Male         1541    220982  688.    653.    724. 95%       
#>  3 Area1  2007 Fema~        1465    221012  660.    625.    695. 95%       
#>  4 Area1  2007 Male         1185    205276  614.    579.    650. 95%       
#>  5 Area1  2008 Fema~        1234    237581  519.    490.    549. 95%       
#>  6 Area1  2008 Male         1462    228084  666.    631.    702. 95%       
#>  7 Area1  2009 Fema~        1693    224768  793.    754.    833. 95%       
#>  8 Area1  2009 Male         1606    214973  777.    738.    816. 95%       
#>  9 Area1  2010 Fema~        1599    236802  711.    676.    748. 95%       
#> 10 Area1  2010 Male         1191    232198  558.    526.    591. 95%       
#> # ... with 30 more rows, and 2 more variables: statistic <chr>,
#> #   method <chr>
    
# calculate separate dsrs for persons for each area and year)
df_std %>%
    group_by(area, year, ageband) %>%
    summarise(obs = sum(obs),
              pop = sum(pop)) %>%
    group_by(area, year) %>%
    phe_dsr(obs,pop)
#> # A tibble: 20 x 10
#> # Groups:   area [4]
#>    area   year total_count total_pop value lowercl uppercl confidence
#>    <fct> <int>       <int>     <int> <dbl>   <dbl>   <dbl> <chr>     
#>  1 Area1  2006        4490    569361  781.    757.    806. 95%       
#>  2 Area1  2007        3537    545564  631.    609.    654. 95%       
#>  3 Area1  2008        3399    593307  602.    581.    623. 95%       
#>  4 Area1  2009        4134    551167  760.    736.    786. 95%       
#>  5 Area1  2010        3769    593439  623.    602.    645. 95%       
#>  6 Area2  2006        4032    576854  710.    687.    734. 95%       
#>  7 Area2  2007        3900    559042  666.    643.    689. 95%       
#>  8 Area2  2008        3920    575995  707.    684.    730. 95%       
#>  9 Area2  2009        3758    603377  580.    561.    600. 95%       
#> 10 Area2  2010        3618    556090  677.    655.    701. 95%       
#> 11 Area3  2006        4118    553110  766.    742.    791. 95%       
#> 12 Area3  2007        4035    578825  722.    699.    746. 95%       
#> 13 Area3  2008        3771    567526  700.    677.    724. 95%       
#> 14 Area3  2009        3681    572989  615.    595.    637. 95%       
#> 15 Area3  2010        3916    610865  649.    628.    672. 95%       
#> 16 Area4  2006        4041    529902  817.    790.    844. 95%       
#> 17 Area4  2007        3283    594363  518.    500.    538. 95%       
#> 18 Area4  2008        3924    548693  740.    715.    765. 95%       
#> 19 Area4  2009        3338    609596  539.    519.    559. 95%       
#> 20 Area4  2010        3355    599294  534.    515.    554. 95%       
#> # ... with 2 more variables: statistic <chr>, method <chr>

Execute phe_smr and phe_isr

INPUT: Unlike the phe_dsr function, there is no default standard or reference data for the phe_smr and phe_isr functions. These functions take a single data frame as input, with columns representing the numerators and denominators for each standardisation category, plus reference numerators and denominators for each standardisation category.

The reference data can either be provided in a separate data frame/vectors or as columns within the input data frame:

OUTPUT: By default, the functions output one row per grouping set containing the grouping variable values, the observed and expected counts, the reference rate (isr only), the smr or isr, the lower 95% confidence limit, and the upper 95% confidence limit, the confidence level, the statistic name and the method.

OPTIONS: If reference data are being provided as columns within the input data frame then the user must specify this as the function expects vectors by default. The function also accepts additional arguments to specify the level of confidence, the multiplier and a reduced level of detail to be output.

The following code chunk creates a data frame containing the reference data - this example uses the all area data for persons in the baseline year:

df_ref <- df_std %>%
    filter(year == 2006) %>%
    group_by(ageband) %>%
    summarise(obs = sum(obs),
              pop = sum(pop))
    
head(df_ref)
#> # A tibble: 6 x 3
#>   ageband   obs    pop
#>     <dbl> <int>  <int>
#> 1       0   614 107211
#> 2       5   824 123370
#> 3      10   943 118207
#> 4      15   736 110219
#> 5      20  1060 106109
#> 6      25   857 110917

Here are some example code chunks to demonstrate the phe_smr function and the arguments that can optionally be specified

# calculate separate smrs for each area, year and sex
df_std %>%
    group_by(area, year, sex) %>%
    phe_smr(obs, pop, df_ref$obs, df_ref$pop)
#> # A tibble: 40 x 11
#> # Groups:   area, year [20]
#>    area   year sex   observed expected value lowercl uppercl confidence
#>    <fct> <int> <fct>    <int>    <dbl> <dbl>   <dbl>   <dbl> <chr>     
#>  1 Area1  2006 Fema~     2370    2147. 1.10    1.06    1.15  95%       
#>  2 Area1  2006 Male      2120    2132. 0.994   0.952   1.04  95%       
#>  3 Area1  2007 Fema~     1738    2077. 0.837   0.798   0.877 95%       
#>  4 Area1  2007 Male      1799    2023. 0.889   0.849   0.931 95%       
#>  5 Area1  2008 Fema~     1529    2241. 0.682   0.649   0.717 95%       
#>  6 Area1  2008 Male      1870    2215. 0.844   0.807   0.884 95%       
#>  7 Area1  2009 Fema~     2156    2071. 1.04    0.998   1.09  95%       
#>  8 Area1  2009 Male      1978    2050. 0.965   0.923   1.01  95%       
#>  9 Area1  2010 Fema~     1961    2296. 0.854   0.817   0.893 95%       
#> 10 Area1  2010 Male      1808    2145. 0.843   0.804   0.883 95%       
#> # ... with 30 more rows, and 2 more variables: statistic <chr>,
#> #   method <chr>

# calculate the same smrs by appending the reference data to the data frame
df_std %>%
    mutate(refobs = rep(df_ref$obs,40),
           refpop = rep(df_ref$pop,40)) %>%
    group_by(area, year, sex) %>%
    phe_smr(obs, pop, refobs, refpop, refpoptype="field")
#> # A tibble: 40 x 11
#> # Groups:   area, year [20]
#>    area   year sex   observed expected value lowercl uppercl confidence
#>    <fct> <int> <fct>    <int>    <dbl> <dbl>   <dbl>   <dbl> <chr>     
#>  1 Area1  2006 Fema~     2370    2147. 1.10    1.06    1.15  95%       
#>  2 Area1  2006 Male      2120    2132. 0.994   0.952   1.04  95%       
#>  3 Area1  2007 Fema~     1738    2077. 0.837   0.798   0.877 95%       
#>  4 Area1  2007 Male      1799    2023. 0.889   0.849   0.931 95%       
#>  5 Area1  2008 Fema~     1529    2241. 0.682   0.649   0.717 95%       
#>  6 Area1  2008 Male      1870    2215. 0.844   0.807   0.884 95%       
#>  7 Area1  2009 Fema~     2156    2071. 1.04    0.998   1.09  95%       
#>  8 Area1  2009 Male      1978    2050. 0.965   0.923   1.01  95%       
#>  9 Area1  2010 Fema~     1961    2296. 0.854   0.817   0.893 95%       
#> 10 Area1  2010 Male      1808    2145. 0.843   0.804   0.883 95%       
#> # ... with 30 more rows, and 2 more variables: statistic <chr>,
#> #   method <chr>

# calculate separate smrs for each year and drop metadata columns from output
df_std %>%
    group_by(year, ageband) %>%
    summarise(obs = sum(obs),
              pop = sum(pop)) %>%
    group_by(year) %>%
    phe_smr(obs, pop, df_ref$obs, df_ref$pop, type="standard")
#> # A tibble: 5 x 6
#>    year observed expected value lowercl uppercl
#>   <int>    <int>    <dbl> <dbl>   <dbl>   <dbl>
#> 1  2006    16681   16681  1       0.985   1.02 
#> 2  2007    14755   17162. 0.860   0.846   0.874
#> 3  2008    15014   17156. 0.875   0.861   0.889
#> 4  2009    14911   17520. 0.851   0.837   0.865
#> 5  2010    14658   17714. 0.827   0.814   0.841

The phe_isr function works exactly the same way but instead of expressing the result as a ratio of the observed and expected rates the result is expressed as a rate and the reference rate is also provided. Here are some examples:

# calculate separate isrs for each area, year and sex
df_std %>%
    group_by(area, year, sex) %>%
    phe_isr(obs, pop, df_ref$obs, df_ref$pop)
#> # A tibble: 40 x 12
#> # Groups:   area, year [20]
#>    area   year sex   observed expected ref_rate value lowercl uppercl
#>    <fct> <int> <fct>    <int>    <dbl>    <dbl> <dbl>   <dbl>   <dbl>
#>  1 Area1  2006 Fema~     2370    2147.     748.  826.    793.    860.
#>  2 Area1  2006 Male      2120    2132.     748.  744.    713.    776.
#>  3 Area1  2007 Fema~     1738    2077.     748.  626.    597.    656.
#>  4 Area1  2007 Male      1799    2023.     748.  665.    635.    697.
#>  5 Area1  2008 Fema~     1529    2241.     748.  511.    485.    537.
#>  6 Area1  2008 Male      1870    2215.     748.  632.    604.    661.
#>  7 Area1  2009 Fema~     2156    2071.     748.  779.    747.    813.
#>  8 Area1  2009 Male      1978    2050.     748.  722.    690.    754.
#>  9 Area1  2010 Fema~     1961    2296.     748.  639.    611.    668.
#> 10 Area1  2010 Male      1808    2145.     748.  631.    602.    660.
#> # ... with 30 more rows, and 3 more variables: confidence <chr>,
#> #   statistic <chr>, method <chr>

# calculate the same isrs by appending the reference data to the data frame
df_std %>%
    mutate(refobs = rep(df_ref$obs,40),
           refpop = rep(df_ref$pop,40)) %>%
    group_by(area, year, sex) %>%
    phe_isr(obs, pop, refobs, refpop, refpoptype="field")
#> # A tibble: 40 x 12
#> # Groups:   area, year [20]
#>    area   year sex   observed expected ref_rate value lowercl uppercl
#>    <fct> <int> <fct>    <int>    <dbl>    <dbl> <dbl>   <dbl>   <dbl>
#>  1 Area1  2006 Fema~     2370    2147.     748.  826.    793.    860.
#>  2 Area1  2006 Male      2120    2132.     748.  744.    713.    776.
#>  3 Area1  2007 Fema~     1738    2077.     748.  626.    597.    656.
#>  4 Area1  2007 Male      1799    2023.     748.  665.    635.    697.
#>  5 Area1  2008 Fema~     1529    2241.     748.  511.    485.    537.
#>  6 Area1  2008 Male      1870    2215.     748.  632.    604.    661.
#>  7 Area1  2009 Fema~     2156    2071.     748.  779.    747.    813.
#>  8 Area1  2009 Male      1978    2050.     748.  722.    690.    754.
#>  9 Area1  2010 Fema~     1961    2296.     748.  639.    611.    668.
#> 10 Area1  2010 Male      1808    2145.     748.  631.    602.    660.
#> # ... with 30 more rows, and 3 more variables: confidence <chr>,
#> #   statistic <chr>, method <chr>

# calculate separate isrs for each year and drop metadata columns from output
df_std %>%
    group_by(year, ageband) %>%
    summarise(obs = sum(obs),
              pop = sum(pop)) %>%
    group_by(year) %>%
    phe_isr(obs, pop, df_ref$obs, df_ref$pop, type="standard")
#> # A tibble: 5 x 7
#>    year observed expected ref_rate value lowercl uppercl
#>   <int>    <int>    <dbl>    <dbl> <dbl>   <dbl>   <dbl>
#> 1  2006    16681   16681      748.  748.    737.    760.
#> 2  2007    14755   17162.     748.  643.    633.    654.
#> 3  2008    15014   17156.     748.  655.    644.    665.
#> 4  2009    14911   17520.     748.  637.    627.    647.
#> 5  2010    14658   17714.     748.  619.    609.    629.