Title: Datasets from the UK COVID-19 Outbreak
Version: 0.0.3
Description: Provides easy access to a curated selection of pre-processed data sets relevant to the COVID-19 outbreak in the UK for teaching and demonstration purposes.
License: MIT + file LICENSE
Encoding: UTF-8
RoxygenNote: 7.3.3.9007
Depends: R (≥ 3.5)
LazyData: true
Language: en-GB
URL: https://ai4ci.github.io/ukc19/, https://github.com/ai4ci/ukc19
Imports: dplyr
NeedsCompilation: no
Packaged: 2025-12-15 16:58:03 UTC; vp22681
Author: Robert Challen ORCID iD [aut, cre], AI4CI Hub; UKRI AI Programme and EPSRC (EP/Y028392/1) [fnd, cph] (url: https://gtr.ukri.org/projects?ref=EP%2FY028392%2F1)
Maintainer: Robert Challen <rob.challen@bristol.ac.uk>
Repository: CRAN
Date/Publication: 2025-12-19 15:20:02 UTC

ukc19: Datasets from the UK COVID-19 Outbreak

Description

Provides easy access to a curated selection of pre-processed data sets relevant to the COVID-19 outbreak in the UK for teaching and demonstration purposes.

Author(s)

Maintainer: Robert Challen rob.challen@bristol.ac.uk (ORCID)

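Other contributors:

AI4CI Hub; UKRI AI Programme and EPSRC (EP/Y028392/1) [funder, copyright holder]

See Also

Useful links:

https://ai4ci.github.io/ukc19/

https://github.com/ai4ci/ukc19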


COVID-19 viral load following challenge

Description

Viral load from nasal swabs of a subset of positive participants in a COVID-19 human challenge study, as detected by quantitative PCR. Values were extracted from the vector files of the published figures. The Y-axis values are approximate because they had to be read manually from the scale.

Usage

data("covid_challenge")

Format

An object of class tbl_df (inherits from tbl, data.frame) with 629 rows and 3 columns.

Details

Data extracted from Killingley et al, 2022, figure 2 "Viral shedding after a short incubation period peaks rapidly after human SARS-CoV-2 challenge". Panel A (middle left sub panel).

For datasets compiled from existing literature, Scientific Data’s policy is that compilers (creators of the secondary compilation dataset and authors of the associated Data Descriptor) are not required by the journal to ask permission from the original authors to extract small amounts of numerical information or other fields. Expected practice is to attribute the original work via citation.

id (chr)

a unique ID for the participant

log10_viral_load (dbl)

log10 viral load detected, in copies per millilitre

time (dbl)

time of the sample in days from exposure.
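As a minimal sketch of working with this long format (one row per participant and sample time), the snippet below finds the peak viral load and its timing for each participant. It uses a small invented data frame, toy_challenge, with the documented columns so that it runs without the package; the values are not taken from the study.

```r
# Toy stand-in with the documented columns (values invented for illustration)
toy_challenge <- data.frame(
  id = c("p1", "p1", "p1", "p2", "p2", "p2"),
  log10_viral_load = c(2.0, 6.5, 4.0, 3.0, 5.0, 7.5),
  time = c(1, 4, 8, 2, 5, 6),
  stringsAsFactors = FALSE
)

# For each participant keep the row with the highest viral load
peak_rows <- lapply(split(toy_challenge, toy_challenge$id), function(d) {
  d[which.max(d$log10_viral_load), ]
})
peaks <- do.call(rbind, peak_rows)
rownames(peaks) <- NULL
print(peaks)
```

With the package loaded, covid_challenge could be substituted for toy_challenge.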

Source

https://www.nature.com/articles/s41591-022-01780-9/figures/2

References

B. Killingley et al., ‘Safety, tolerability and viral kinetics during SARS-CoV-2 human challenge in young adults’, Nat Med, vol. 28, no. 5, pp. 1031–1041, May 2022, doi: 10.1038/s41591-022-01780-9.

Examples

dplyr::glimpse(covid_challenge)

COG-UK counts of genomic variants

Description

Weekly counts of identified variants for the whole of England.

Usage

data("covid_variants")

Format

An object of class grouped_df (inherits from tbl_df, tbl, data.frame) with 479 rows and 5 columns.

Details

Counts of COVID-19 variants from the COG-UK COVID-19 sequencing project. Positive samples were selected based on viral load on initial PCR testing and sent onward for sequencing. Cases with S-gene target failure were prioritised and over-sampled, so this data is biased.

From late March 2023 onward, due to the low number of sequenced samples, the UK SARS-CoV-2 sequencing surveillance data is no longer updated on the Wellcome Sanger Institute COVID-19 Genomic surveillance dashboard. Following the end of mass COVID-19 testing in the UK in April 2022, the dashboard only includes a subset of UK SARS-CoV-2 sequencing surveillance data and should not be used to estimate the frequency of circulating SARS-CoV-2 variants. Not all samples sequenced and deposited in public databases are presented here. This data is not de-duplicated at the patient level and may include targeted sequencing, which may introduce biases.

covid_variants dataframe with 479 rows and 5 columns

date (date)

The date; it is unclear whether this is the date of the sample or of the result

class (fct)

The variant description as a name and pango lineage

who_class (fct)

The WHO short name

count (dbl)

The number of sequences of this variant identified on this date

denom (dbl)

The total number of sequences of all variants identified on this date
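The count and denom columns give a variant proportion directly. The sketch below computes that proportion with an exact binomial confidence interval from stats::binom.test, using an invented toy_variants data frame with the documented columns. The interval reflects sampling variation only and does not correct for the selection biases described above.

```r
# Toy stand-in with the documented columns (values invented for illustration)
toy_variants <- data.frame(
  date = as.Date(c("2021-01-04", "2021-01-04")),
  who_class = c("Alpha", "Other"),
  count = c(30, 70),
  denom = c(100, 100),
  stringsAsFactors = FALSE
)

# Proportion of sequences belonging to each variant
toy_variants$proportion <- toy_variants$count / toy_variants$denom

# Exact (Clopper-Pearson) 95% interval for each row
ci <- t(mapply(
  function(x, n) binom.test(x, n)$conf.int,
  toy_variants$count, toy_variants$denom
))
toy_variants$lower <- ci[, 1]
toy_variants$upper <- ci[, 2]
print(toy_variants)
```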

Source

https://covid19.sanger.ac.uk/lineages/raw

Contains Ordnance Survey data © Crown copyright and database right 2019. Contains UK Health Security Agency data © Crown copyright and database right 2020. Office for National Statistics licensed under the Open Government Licence v3.0.

Examples

dplyr::glimpse(covid_variants)


COG-UK counts of genomic variants by lower tier local authority

Description

Counts of COVID-19 variants from the COG-UK COVID-19 sequencing project. Positive samples were selected based on viral load on initial PCR testing and sent onward for sequencing. Cases with S-gene target failure were prioritised and over-sampled, so this data is biased.

Usage

data("covid_variants_ltla")

Format

An object of class tbl_df (inherits from tbl, data.frame) with 55785 rows and 8 columns.

Details

Weekly counts of identified variants by lower tier local authority (2019 names). This dataset has implicit zeros. The full range of areas can be obtained from the geography data set with: geography %>% dplyr::filter(codeType == "LAD19")
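Because zero counts are implicit, an area, week and variant combination with no sequences simply has no row. One way to make the zeros explicit is sketched below in base R with an invented toy_counts data frame; the two area codes are hard-coded here as an assumption, whereas in practice they would come from the geography data set as described above.

```r
# Toy stand-in: only non-zero counts are present (values invented)
toy_counts <- data.frame(
  date = as.Date(c("2021-01-04", "2021-01-04", "2021-01-11")),
  code = c("E06000001", "E06000002", "E06000001"),
  who_class = c("Alpha", "Alpha", "Delta"),
  count = c(5, 3, 2),
  stringsAsFactors = FALSE
)

# Every combination of date, area and variant that could have been observed
full_grid <- expand.grid(
  date = unique(toy_counts$date),
  code = c("E06000001", "E06000002"),
  who_class = unique(toy_counts$who_class),
  stringsAsFactors = FALSE
)

# Left join onto the grid, then fill the missing counts with zero
completed <- merge(full_grid, toy_counts, all.x = TRUE)
completed$count[is.na(completed$count)] <- 0
print(completed)
```

The denom column is left out of this sketch; for rows added in this way it would need to be filled separately.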

From late March 2023 onward, due to the low number of sequenced samples, the UK SARS-CoV-2 sequencing surveillance data is no longer updated on the Wellcome Sanger Institute COVID-19 Genomic surveillance dashboard. Following the end of mass COVID-19 testing in the UK in April 2022, the dashboard only includes a subset of UK SARS-CoV-2 sequencing surveillance data and should not be used to estimate the frequency of circulating SARS-CoV-2 variants. Not all samples sequenced and deposited in public databases are presented here. This data is not de-duplicated at the patient level and may include targeted sequencing, which may introduce biases.

covid_variants_ltla dataframe with 55785 rows and 8 columns

date (date)

The date; it is unclear whether this is the date of the sample or of the result

code (chr)

The ONS geographical region code

codeType (chr)

The type of ONS geographical code

name (chr)

The ONS geographical region name

who_class (fct)

The WHO short name

count (dbl)

The number of sequences of this variant identified on this date

denom (dbl)

The total number of sequences of all variants identified on this date

Source

https://covid19.sanger.ac.uk/lineages/raw

Contains Ordnance Survey data © Crown copyright and database right 2019. Contains UK Health Security Agency data © Crown copyright and database right 2020. Office for National Statistics licensed under the Open Government Licence v3.0.

Examples

dplyr::glimpse(covid_variants_ltla)


Serial interval from publicly reported cases

Description

Data from which the initial serial interval estimates were made by Du et al, 2020.

Usage

data("du_serial_interval")

Format

An object of class tbl_df (inherits from tbl, data.frame) with 752 rows and 3 columns.

Details

"This is a publication of the U.S. Government. This publication is in the public domain and is therefore without copyright. All text from this work may be reprinted freely. Use of these materials should be properly cited."

du_serial_interval dataframe with 752 rows and 3 columns

id (dbl)

Unique case id

symptom_onset (dbl)

Time of symptom onset as an integer

infector_id (dbl)

Case id of infector where known
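The three columns are enough to reconstruct infector and infectee pairs. The sketch below joins each case to its infector and takes the difference in symptom onset as an empirical serial interval. It uses an invented toy_cases data frame and assumes that infector_id refers to values of id and that symptom_onset is measured in days.

```r
# Toy stand-in with the documented columns (values invented for illustration)
toy_cases <- data.frame(
  id = c(1, 2, 3, 4),
  symptom_onset = c(10, 14, 13, 20),
  infector_id = c(NA, 1, 1, 3)
)

# Join each case with a known infector to the infector's own onset time
pairs <- merge(
  toy_cases[!is.na(toy_cases$infector_id), ],
  toy_cases[, c("id", "symptom_onset")],
  by.x = "infector_id", by.y = "id",
  suffixes = c("", "_infector")
)

# Serial interval: infectee onset minus infector onset (can be negative)
pairs$serial_interval <- pairs$symptom_onset - pairs$symptom_onset_infector
print(pairs)
print(mean(pairs$serial_interval))
```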

Source

https://github.com/MeyersLabUTexas/COVID-19

References

Z. Du, X. Xu, Y. Wu, L. Wang, B. J. Cowling, and L. A. Meyers, ‘Serial Interval of COVID-19 among Publicly Reported Confirmed Cases’, Emerg Infect Dis, vol. 26, no. 6, pp. 1341–1343, Jun. 2020, doi: 10.3201/eid2606.200357.

Examples

dplyr::glimpse(du_serial_interval)

Johns Hopkins data from the early outbreak

Description

Mined from the commit history of the COVID-19 Data Repository by the Center for Systems Science and Engineering (CSSE) at Johns Hopkins University, this dataset has early outbreak trajectories (21st January 2020 up to 8th March 2020) for a wide range of geographies, for confirmed cases, deaths and recovered cases. These trajectories are based on reported date but were occasionally revised. The extent of revision varies from region to region, and possibly between statistics, and shows up as infrequent changes in the published estimates over time.

Usage

data("early_global_combined")

Format

An object of class tbl_df (inherits from tbl, data.frame) with 104036 rows and 9 columns.

Details

This data set is originally licensed under the Creative Commons Attribution 4.0 International (CC BY 4.0) by the Johns Hopkins University on behalf of its Center for Systems Science and Engineering. Copyright Johns Hopkins University 2020.

country (chr)

The country

province (chr)

Sub-national division

lat (dbl)

Latitude

long (dbl)

Longitude

reported_date (date)

Date of the observation based on reports of cases on this date.

total_cases (dbl)

Cumulative cases

published_date (date)

Date the observation was published on the JHU github.

total_deaths (dbl)

Cumulative deaths

total_recovered (dbl)

Cumulative recovered
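Since each reported date may appear several times with different published dates, a common first step is to keep only the most recently published figure for each geography and reported date. The sketch below does this in base R for an invented single-geography data frame, toy_jhu, and then differences the cumulative totals to get new cases. With several geographies the differencing step would need to be applied within each geography.

```r
# Toy stand-in: one geography, one reported date revised once (values invented)
toy_jhu <- data.frame(
  country = "A",
  province = NA_character_,
  reported_date = as.Date(c("2020-03-01", "2020-03-01", "2020-03-02")),
  published_date = as.Date(c("2020-03-02", "2020-03-05", "2020-03-03")),
  total_cases = c(100, 104, 150),
  stringsAsFactors = FALSE
)

# Sort newest publication first, then keep the first row per geography and date
newest_first <- toy_jhu[order(toy_jhu$published_date, decreasing = TRUE), ]
key <- newest_first[, c("country", "province", "reported_date")]
latest <- newest_first[!duplicated(key), ]
latest <- latest[order(latest$reported_date), ]

# Daily new cases from the cumulative series (first day is undefined)
latest$new_cases <- c(NA, diff(latest$total_cases))
print(latest)
```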

Source

https://github.com/CSSEGISandData/COVID-19

Examples

dplyr::glimpse(early_global_combined)

England only COVID-19 case counts stratified by 5-year age bands

Description

A dataset of the daily count of COVID-19 cases by age group in England downloaded from the UKHSA coronavirus API, and formatted for use in ggoutbreak. A denominator is calculated, which is the overall positive count for all age groups. This data set can be used to calculate group-wise incidence and absolute growth rates, and group-wise proportions and relative growth rates, by age group.

Usage

data("england_cases_by_5yr_age")

Format

An object of class tbl_df (inherits from tbl, data.frame) with 26790 rows and 8 columns.

Details

You may want england_covid_positivity instead which includes the test denominator. The denominator here is the total number of positive tests across all age groups and not the number of tests taken or population size.

england_cases_by_5yr_age dataframe with 26790 rows and 8 columns

name (chr)

The region name

code (chr)

The region code

codeType (chr)

The ONS geographical region code type (including year)

date (date)

The date

class (chr)

the age group in 5 year age bands

count (dbl)

the test positives for each age group

denom (dbl)

the test positives across all age groups

population (dbl)

the population size for this age group
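The two kinds of quantity mentioned in the description use different columns: a proportion uses denom (all positives that day), whereas an incidence rate uses population. A minimal sketch with an invented toy_age data frame follows; the age band labels are hypothetical and may not match the format used in the real data.

```r
# Toy stand-in with the documented columns (values and labels invented)
toy_age <- data.frame(
  date = as.Date("2021-01-01"),
  class = c("band_1", "band_2"),
  count = c(20, 60),
  denom = c(80, 80),
  population = c(100000, 200000),
  stringsAsFactors = FALSE
)

# Share of all positives that fall in each age group
toy_age$proportion <- toy_age$count / toy_age$denom

# Daily positives per 100,000 people in each age group
toy_age$per_100k <- toy_age$count / toy_age$population * 1e5
print(toy_age)
```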

Source

https://ukhsa-dashboard.data.gov.uk/covid-19-archive-data-download

Originally licensed under the Open Government Licence v3.0

Examples

dplyr::glimpse(england_cases_by_5yr_age)

England only COVID-19 case counts with total test numbers

Description

The daily count of new PCR-positive COVID-19 cases in England. The denominator is the overall number of PCR tests conducted. This gives the proportion of positive tests, which can be used to correct for testing effort.

Usage

data("england_covid_positivity")

Format

An object of class tbl_df (inherits from tbl, data.frame) with 1413 rows and 6 columns.

Details

england_covid_positivity dataframe with 1413 rows and 6 columns

name (chr)

The region name

code (chr)

The region code

codeType (chr)

The ONS geographical region code type (including year)

date (date)

The date

count (dbl)

the count of PCR test positives

denom (dbl)

the total count of PCR tests conducted on that day
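Test positivity is count divided by denom. Daily values can be noisy, so the sketch below also computes a trailing 7-day positivity as the ratio of 7-day sums, using stats::filter. The toy_tests data frame is invented and roll_sum is a hypothetical helper defined here, not part of the package.

```r
# Toy stand-in with the documented columns (values invented for illustration)
toy_tests <- data.frame(
  date = as.Date("2021-01-01") + 0:9,
  count = c(10, 12, 8, 15, 20, 18, 17, 25, 22, 30),
  denom = rep(200, 10)
)

# Daily positivity
toy_tests$positivity <- toy_tests$count / toy_tests$denom

# Trailing k-day sum; the first k - 1 values are NA
roll_sum <- function(x, k = 7) {
  as.numeric(stats::filter(x, rep(1, k), sides = 1))
}

# 7-day positivity as a ratio of sums, which weights days by tests done
toy_tests$positivity_7d <- roll_sum(toy_tests$count) / roll_sum(toy_tests$denom)
print(toy_tests)
```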

Source

https://ukhsa-dashboard.data.gov.uk/covid-19-archive-data-download

Originally licensed under the Open Government Licence v3.0

Examples

dplyr::glimpse(england_covid_positivity)

COVID-19 cluster outbreaks data from Tianjin and Singapore

Description

Data from which serial interval and generation time estimates were made by Ganyani et al, 2020.

Usage

data("ganyani_clusters")

Format

An object of class tbl_df (inherits from tbl, data.frame) with 196 rows and 6 columns.

Details

Original article licensed under Creative Commons 4.0. Data was cleansed and formatted for R.

ganyani_clusters dataframe with 196 rows and 6 columns

id (dbl)

a unique id for a person (unique within the source)

contacts (list dbl)

list of known contacts in the cluster

cluster_id (dbl)

id of a cluster (unique within the source)

symptom_onset (date)

symptom onset date

known_primary_case (lgl)

flag if this person is known to be the primary case in the cluster

source (chr)

geographical source of the data
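The contacts column is a list column, so it usually needs expanding before analysis. The sketch below turns it into one row per case and contact pair and computes the gap in symptom onset for each pair. The toy_clusters data frame is invented, and the direction of transmission between a case and its listed contacts is not assumed, so the gap is not necessarily a serial interval.

```r
# Toy stand-in with a list column of contact ids (values invented)
toy_clusters <- data.frame(
  id = c(1, 2, 3),
  symptom_onset = as.Date(c("2020-01-20", "2020-01-24", "2020-01-27"))
)
toy_clusters$contacts <- list(numeric(0), 1, c(1, 2))

# Expand the list column into one row per (case, contact) pair
edges <- do.call(rbind, lapply(seq_len(nrow(toy_clusters)), function(i) {
  contacts <- toy_clusters$contacts[[i]]
  if (length(contacts) == 0) return(NULL)
  data.frame(id = toy_clusters$id[i], contact_id = contacts)
}))

# Look up onset dates by id and take the difference in days
onset <- setNames(toy_clusters$symptom_onset, toy_clusters$id)
edges$onset_gap_days <- as.numeric(difftime(
  onset[as.character(edges$id)],
  onset[as.character(edges$contact_id)],
  units = "days"
))
print(edges)
```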

Source

https://github.com/cecilekremer/COVID19

References

Ganyani T, Kremer C, Chen D, Torneri A, Faes C, Wallinga J, Hens N. Estimating the generation interval for coronavirus disease (COVID-19) based on symptom onset data, March 2020. Euro Surveill. 2020 Apr;25(17):2000257. doi: 10.2807/1560-7917.ES.2020.25.17.2000257. PMID: 32372755; PMCID: PMC7201952.

Examples

dplyr::glimpse(ganyani_clusters)

UK geographic codes at CTRY, RGN and LAD level

Description

Geographic codes and names from the ONS for administrative regions of the UK relevant to the COVID-19 response. There are multiple entries for lower tier local authority codes as these changed during the course of the pandemic.

Usage

data("geography")

Format

An object of class tbl_df (inherits from tbl, data.frame) with 1512 rows and 3 columns.

Details

geography dataframe with 1512 rows and 3 columns

name (chr)

The region name

code (chr)

The region code

codeType (chr)

The ONS geographical region code type (including year)

Source

https://geoportal.statistics.gov.uk/

Originally licensed under the Open Government Licence v3.0

Examples

dplyr::glimpse(geography)

UK-wide COVID-19 case counts stratified by Lower tier local authority

Description

A dataset of the daily count of COVID-19 cases by Lower tier local authority in the UK downloaded from the UKHSA coronavirus API, and formatted for use in ggoutbreak.

Usage

data("ltla_cases")

Format

An object of class tbl_df (inherits from tbl, data.frame) with 512050 rows and 6 columns.

Details

ltla_cases dataframe with 512050 rows and 6 columns

name (chr)

The region name

code (chr)

The region code

codeType (chr)

The ONS geographical region code type (including year)

date (date)

The date

count (dbl)

the test positives for each LTLA

population (dbl)

the population size for this geography

Source

https://ukhsa-dashboard.data.gov.uk/covid-19-archive-data-download

Originally licensed under the Open Government Licence v3.0

Examples

dplyr::glimpse(ltla_cases)

NHS digital contact tracing activity

Description

Summary data collected as part of the NHS digital contact tracing app monitoring. This describes the number of alerts issued, and venue "check-ins".

Usage

data("nhs_app")

Format

An object of class tbl_df (inherits from tbl, data.frame) with 137 rows and 3 columns.

Details

date (date)

The date

alerts (int)

Number of alerts

visits (int)

Number of check-ins

Source

https://www.gov.uk/government/publications/nhs-covid-19-app-statistics

Originally licensed under the Open Government Licence v3.0

Examples

dplyr::glimpse(nhs_app)

ONS COVID-19 infection survey

Description

The COVID-19 ONS infection survey took a random sample of the population and provides an estimate of the prevalence of COVID-19 that is theoretically free from ascertainment bias. This data set is the output of the model based on underlying data.

Usage

data("ons_infection_survey")

Format

An object of class grouped_df (inherits from tbl_df, tbl, data.frame) with 9820 rows and 8 columns.

Details

code (chr)

The ONS geographical region code

codeType (chr)

The type of ONS geographical code

name (chr)

The ONS geographical region name

date (date)

A date

prevalence.0.5 (dbl)

the median proportion of people in the region testing positive for COVID-19

prevalence.0.025 (dbl)

the lower limit of the 95% CI of the proportion of people in the region testing positive for COVID-19

prevalence.0.975 (dbl)

the upper limit of the 95% CI of the proportion of people in the region testing positive for COVID-19

denom (int)

the sample size on which this estimate was made (daily rate inferred from weekly sample sizes)

Source

https://www.ons.gov.uk/peoplepopulationandcommunity/healthandsocialcare/conditionsanddiseases/datasets/coronaviruscovid19infectionsurveydata

Originally licensed under the Open Government Licence v3.0

Examples

dplyr::glimpse(ons_infection_survey)

COVID PCR test sensitivity over time

Description

Model output from Binny et al, 2023, describing the sensitivity of COVID PCR tests over the course of an infection.

Usage

data("pcr_test_sensitivity")

Format

An object of class list of length 2.

Details

pcr_test_sensitivity named list with 2 items

modelled (df modelled*)

Original model output from the supplementary material

resampled (df resampled*)

resampled and reformatted data

df modelled dataframe with 501 rows and 4 columns

days_since_infection (dbl)

days since infection

median (dbl)

median sensitivity

lower_95 (dbl)

lower 95% CI of sensitivity

upper_95 (dbl)

upper 95% CI of sensitivity

df resampled dataframe with 5100 rows and 3 columns

tau (dbl)

days since infection

probability (dbl)

the sensitivity as a probability of detection

boot (int)

a bootstrap identifier
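The resampled table has one row per bootstrap replicate and day, so a point estimate per day can be recovered by summarising across replicates. The sketch below takes the median probability for each value of tau using an invented toy_resampled data frame with the documented columns; with the package loaded, pcr_test_sensitivity$resampled could be used instead.

```r
# Toy stand-in: 3 bootstrap replicates over 3 days (values invented)
toy_resampled <- data.frame(
  tau = rep(c(0, 1, 2), times = 3),
  probability = c(0.0, 0.5, 0.9, 0.1, 0.6, 0.8, 0.2, 0.7, 0.7),
  boot = rep(1:3, each = 3)
)

# Median sensitivity across bootstrap replicates for each day since infection
summary_by_tau <- aggregate(probability ~ tau, data = toy_resampled, FUN = median)
print(summary_by_tau)
```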

Source

https://pmc.ncbi.nlm.nih.gov/articles/instance/9796165/bin/jiac317_supplementary_data.zip

References

Rachelle N Binny, Patricia Priest, Nigel P French, Matthew Parry, Audrey Lustig, Shaun C Hendy, Oliver J Maclaren, Kannan M Ridings, Nicholas Steyn, Giorgia Vattiato, Michael J Plank, Sensitivity of Reverse Transcription Polymerase Chain Reaction Tests for Severe Acute Respiratory Syndrome Coronavirus 2 Through Time, The Journal of Infectious Diseases, Volume 227, Issue 1, 1 January 2023, Pages 9–17, https://doi.org/10.1093/infdis/jiac317


SPI-M-O consensus reproduction number and growth rate estimates

Description

A set of consensus estimates for the reproduction number and growth rate of the COVID-19 epidemic in England, produced by the SPI-M-O subgroup of SAGE.

Usage

data("spim_consensus")

Format

An object of class tbl_df (inherits from tbl, data.frame) with 113 rows and 5 columns.

Details

spim_consensus dataframe with 113 rows and 5 columns

date (date)

the date

rt.low (dbl)

the lower estimate of the reproduction number

rt.high (dbl)

the upper estimate of the reproduction number

growth.low (dbl)

the lower estimate of the exponential growth rate

growth.high (dbl)

the higher estimate of the exponential growth rate
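An exponential growth rate r per day corresponds to a doubling time of log(2) / r days, with a negative value read as a halving time. The sketch below applies this to an invented toy_spim data frame. It assumes the growth columns are expressed as a proportion per day (for example 0.02 rather than 2); if the data hold percentages they would need dividing by 100 first. doubling_time is a hypothetical helper, not part of the package.

```r
# Toy stand-in with the documented growth columns (values invented)
toy_spim <- data.frame(
  date = as.Date(c("2020-10-02", "2021-02-05")),
  growth.low = c(0.01, -0.03),
  growth.high = c(0.03, -0.01)
)

# Doubling time in days; negative results are halving times
doubling_time <- function(r) log(2) / r

toy_spim$doubling.from.low <- doubling_time(toy_spim$growth.low)
toy_spim$doubling.from.high <- doubling_time(toy_spim$growth.high)
print(toy_spim)
```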

Source

https://www.gov.uk/guidance/the-r-value-and-growth-rate

Originally licensed under the Open Government Licence v3.0

Examples

dplyr::glimpse(spim_consensus)

Timeline of events

Description

Major events in the UK COVID-19 pandemic, limited to lock-downs, vaccination roll-out and first identification of major variants.

Usage

data("timeline")

Format

An object of class tbl_df (inherits from tbl, data.frame) with 19 rows and 3 columns.

Details

label (chr)

The event

start (date)

The start date

end (date)

The end date, if the event is a period

Source

https://en.wikipedia.org/wiki/Timeline_of_the_COVID-19_pandemic_in_the_United_Kingdom

Examples

dplyr::glimpse(timeline)

Country, regional, and sub-national total population estimates

Description

ONS National and sub-national mid-year population estimates for the UK and its constituent countries by administrative area, age and sex (including components of population change, median age and population density).

Usage

data("uk_population_2019")

Format

An object of class tbl_df (inherits from tbl, data.frame) with 398 rows and 4 columns.

Details

Mid-2019: April 2019 local authority district codes edition of this dataset. This is UK wide and covers country, regions and LTLA (2019 boundaries)

uk_population_2019 dataframe with 398 rows and 4 columns

name (chr)

The region name

code (chr)

The region code

codeType (chr)

The ONS geographical region code type (including year)

population (dbl)

the count of the population in that region

Source

https://www.ons.gov.uk/peoplepopulationandcommunity/populationandmigration/populationestimates

Originally licensed under the Open Government Licence v3.0

Examples

dplyr::glimpse(uk_population_2019)

Country, regional, and sub-national population estimates by 10 year age groups

Description

ONS National and sub-national mid-year population estimates for the UK and its constituent countries by administrative area, age and sex (including components of population change, median age and population density).

Usage

data("uk_population_2019_by_10yr_age")

Format

An object of class grouped_df (inherits from tbl_df, tbl, data.frame) with 3980 rows and 6 columns.

Details

Mid-2019: April 2019 local authority district codes edition of this dataset. This is UK wide and covers country, regions and LTLA (2019 boundaries)

Stratified by 10 year age groups

uk_population_2019_by_10yr_age dataframe with 3980 rows and 6 columns

name (chr)

The region name

code (chr)

The region code

codeType (chr)

The ONS geographical region code type (including year)

class (chr)

The age group in 10 year age bands

population (dbl)

the count of the population in that age group

baseline_proportion (dbl)

the proportion of the total regional population that is in an age group
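The baseline_proportion column can be understood as each age group's population divided by the total for its region. The sketch below reproduces that calculation on an invented toy_pop data frame using stats::ave; the region codes and age labels are hypothetical, and this is an illustration of the definition rather than the package's own code.

```r
# Toy stand-in: two regions with two age groups each (values invented)
toy_pop <- data.frame(
  code = c("R1", "R1", "R2", "R2"),
  class = c("young", "old", "young", "old"),
  population = c(100, 300, 50, 50),
  stringsAsFactors = FALSE
)

# Regional total repeated on every row of that region
regional_total <- ave(toy_pop$population, toy_pop$code, FUN = sum)

# Share of the regional population in each age group
toy_pop$baseline_proportion <- toy_pop$population / regional_total
print(toy_pop)
```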

Source

https://www.ons.gov.uk/peoplepopulationandcommunity/populationandmigration/populationestimates

Originally licensed under the Open Government Licence v3.0

Examples

dplyr::glimpse(uk_population_2019_by_10yr_age)

Country, regional, and sub-national population estimates by 5 year age groups

Description

ONS National and sub-national mid-year population estimates for the UK and its constituent countries by administrative area, age and sex (including components of population change, median age and population density).

Usage

data("uk_population_2019_by_5yr_age")

Format

An object of class grouped_df (inherits from tbl_df, tbl, data.frame) with 7562 rows and 6 columns.

Details

Mid-2019: April 2019 local authority district codes edition of this dataset. This is UK wide and covers country, regions and LTLA (2019 boundaries)

Stratified by 5 year age groups

uk_population_2019_by_5yr_age dataframe with 7562 rows and 6 columns

name (chr)

The region name

code (chr)

The region code

codeType (chr)

The ONS geographical region code type (including year)

class (chr)

The age group in 5 year age bands

population (dbl)

the count of the population in that age group

baseline_proportion (dbl)

the proportion of the total regional population that is in an age group

Source

https://www.ons.gov.uk/peoplepopulationandcommunity/populationandmigration/populationestimates

Originally licensed under the Open Government Licence v3.0

Examples

dplyr::glimpse(uk_population_2019_by_5yr_age)

COVID-19 Viral shedding data

Description

Data from van Kampen et al, 2021, describing duration of viral shedding from symptom onset in patients with COVID-19.

Usage

data("viral_shedding")

Format

An object of class list of length 2.

Details

viral_shedding named list with 2 items

original (df original*)

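the original data as supplied in the source spreadsheet

resampled (df resampled*)

bootstrap resamples of the probability of detected viral excretion by time from symptom onset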

df original dataframe with 690 rows and 4 columns

duration of symptoms in days (dbl)

duration of symptoms in days

RNA copies per mL (chr)

RNA copies per mL

PRNT titer (chr)

PRNT titer

virus culture result (chr)

virus culture result

df resampled dataframe with 2600 rows and 3 columns

tau (int)

time from symptom onset to measurement

probability (dbl)

probability of detected viral excretion

boot (int)

a bootstrap identifier

Source

https://static-content.springer.com/esm/art%3A10.1038%2Fs41467-020-20568-4/MediaObjects/41467_2020_20568_MOESM4_ESM.xlsx

References

van Kampen, J.J.A., van de Vijver, D.A.M.C., Fraaij, P.L.A. et al. Duration and key determinants of infectious virus shedding in hospitalized patients with coronavirus disease-2019 (COVID-19). Nat Commun 12, 267 (2021). https://doi.org/10.1038/s41467-020-20568-4