1 Introduction

Families are social networks of related individuals belonging to different generations. A social network may be approached from a holistic perspective, which concentrates on network characteristics, or an egocentric perspective, which concentrates on network ties of a reference person or focal individual, called ego. Ego may be a member of the oldest generation, the youngest generation, or an intermediate generation. That leads to different perspectives on population. If ego is a member of the oldest generation, ego’s perspective is that of a parent, a grandparent and a great-grandparent. If, on the other hand, ego is a member of the youngest generation, the perspective is that of a child, grandchild and great-grandchild. Members of an intermediate generation are both parents and children. Some of these members may go through a stage of life with old parents and young children and experience the double-burden of caring for parents and children. To study family ties from different perspectives, simulation is often used. Simulation produces a virtual population that mimics a real population. The multi-generation virtual population offers unique opportunities to study aspects of family demography that did not receive much attention until recently: (a) the study of the population from a child’s perspective, (b) the demography of grandparenthood (perspective of the elderly), and (c) the double burden experienced by the sandwich generation.

The three subjects listed are receiving a growing attention in the demographic literature. The dominant method of analysis is microsimulation and SOCSIM is the software package of choice. SOCSIM creates individual life histories and genealogies from age-specific mortality and fertility rates (Wachter et al. 1997). ‘VirtualPop’, a simulation package in R (Willekens2022package?), uses a similar method to produce individual life histories and genealogies in multigeneration populations. There is an important difference, however: simulation proceeds in continuous time. Microsimulation in continuous time has two important advantages (Willekens 2009). First, no two events can occur in the same interval, which resolves the issue of event sequence that requires the user to determine which event comes first and which second. In SOCSIM, the simulation proceeds month by month and events scheduled in a month are executed in random order (Wachter et al. 1997, p. 93)1. Second, the duration between events can be computed exactly, not approximately.

Microsimulation is often used to study subjects that cannot be studied easily otherwise. Wolf (1994), Arpino et al. (2018), Margolis (2016), Margolis and Verdery (2019) and others adopt the perspective of the elderly, and raise issues such as the presence of children and grandchildren, the number of grandchildren, the timing of grandparenthood, the duration of grandparenthood, and the age of grandchildren at a particular moment in a grandparent’s life. A second subject is population structure and demographic change from the perspective of children. A child interprets demography as the presence of parents, grandparents, siblings and peers. The perspective of children is rarely adopted in demography (Mills and Rahal 2021, p. 20). Lam and Marteleto (2008) and Verdery (2015) adopt the perspective of a child to interpret demographic change. Using survey data, Bumpass and Lu (2000) study the children’s family contexts in the United States. The third subject is the double burden of child care and elderly care. The double burden was recently studied by Alburez-Gutierrez et al. (2021). The authors used mortality and fertility rates of 198 countries and territories included in the 2019 Revision of the UNWPP and the period 1950-2100 to estimate the time spent caring for as child (and grandchild) and an elderly parent. The output is a complete kinship network for the 1970–2040 birth cohorts from which it is possible to determine the time each simulated individual spent sandwiched as a parent and grandparent. The authors conclude that “Demographic microsimulation allowed us to overcome the lack of international and comparable data on past, present, and future kinship structures.”

The purpose of the vignette is to shown how to navigate the multi-generation virtual population database to extract demographic information of direct interest to aspects of family demography listed above. The desired information is extracted using queries, formulated as R functions. The code uses subsetting, which is a feature of vectorised operations. Most of the functions in R are vectorised, which means that a function operates on all elements of a vector simultaneously. Vectorized operations prevent the loops that makes R slow2.

This document consists of eight sections in addition to the introduction. The next section is a very brief introduction to ‘VirtualPop’. Section three presents six basic queries. They are sufficient to retrieve families ties and produce kinship networks. Section four combines the basic functions in queries to obtain information on the kin of ego: siblings, cousins and aunts. The fifth section zooms in on grandparenthood. It illustrates how the basic queries may be used to answer questions about grandparenthood, such as age at onset and duration. Section six concentrates on the double burden. It addresses questions such as age at onset and termination, and the proportion of people who experience a double burden. The last section presents the fertility table, which is a multistate life table of fertility histories. The application of the basic functions, alone or in combination, to extract desired information on families and kinships in the virtual population may be labeled kinship calculus.

2 The ‘VirtualPop’ package

The ‘VirtualPop’ package generates a multi-generation virtual population. You select the number of generations and the size of the first generation (generation 1). The sizes of the other generations depend on the fertility rates. For each individual in the virtual population, the lifespan and the fertility history from death are generated by sampling age-at-death distributions implicit in the death rates by age (single years of age) and sex included in the Human Mortality Database (HMD) and the time-to-event (waiting time) distributions implicit in the fertility rates by age and birth order. Rates may be downloaded from the Human Fertility Database (HFD). ‘VirtualPop’ uses conditional fertility rates (in HFD parlance). They are more widely known as occurrence-exposure rate. For details, see the tutorial and the other vignettes included in the ‘VirtualPop’ package.

Note that the life histories of members of the virtual population produced by ‘VirtualPop’ are not projections or forecasts. They are the life histories and genealogies that would result if the individuals in successive generations experience the same mortality and fertility. The demographic indicators computed from the histories convey information on the actually observed mortality and fertility rates, unless the rates are programmed to change in time. A virtual population is analogous to a stable population. In demography, stable population theory is used to assess the long-term consequences of sets of age-specific fertility and mortality rates, and rates of transition between living states. In the long run, life history characteristics and population characteristics coincide (ergodicity theorem). The same theorem applies to a multi-generation virtual population that results from a set of age-specific fertility and mortality rates, and rates of transition between living states. In the long run, i.e. after a large number of generations, life history characteristics and population characteristics coincide.

The fertility and mortality rates of the United States in the calendar year 2019 and the virtual population generated by the ‘VirtualPop’ from these rates are included in the data folder of the ‘Families’ package. To list the datasets in the package, type \(data(package = 'Families')\). To attach the package and to load the data, the following code is used:

# Attach the package to the search path
library (Families)
# Load the data in your workspace (environment:R_GlobalEnv)
data(dataLH_F,package = 'Families')
data(rates,package = 'Families')

dLH <-  dataLH_F
# Number of generations in the dataset
ngen <- max(unique (dLH$gen))

Note that the virtual population in \(dLH\) differs from the data file distributed with “VirtualPop$ due to randomness. The virtual population is produced from the same rates, but with a different random seed. The virtual population consists of 2965 individuals belonging to 4 generations. The following table shows the number of individuals by generation and sex:

addmargins(table (Generation=dLH$gen,Sex=dLH$sex))
#>           Sex
#> Generation Male Female  Sum
#>        1    535    465 1000
#>        2    416    368  784
#>        3    324    343  667
#>        4    266    248  514
#>        Sum 1541   1424 2965

3 Basic functions to navigate the virtual population database

The key to retrieve information on a reference person or multiple reference persons is the individual identification number (ID). The ID of a reference person is denoted by \(IDego\) or \(idego\). Since R treats a scalar as a vector, a request for information on a single reference person or on a group of reference persons is treated similarly. \(IDego\) (and \(idego\)) is a vector of reference persons. In case of a single reference person, the vector consists of a single element. To retrieve data on an individual’s social network (kinship network), the ‘Families’ package includes four basic functions to navigate the virtual population database. They retrieve the IDs of mother, father, partner and children. A combination of basic functions retrieves the IDs of siblings, aunts, uncles and cousins. In addition to these four functions, two basic functions return dates. In multi-generation populations, working with dates is more convenient than working with ages. The six functions are:

Ego’s age at an event is the date of birth minus the date of the event. For instance, the age at death is the date of birth minus the date of death: \(Db(idego)-Dd(idego)\).

To retrieve information on kin other than members of the nuclear family, the basic functions are combined. For instance, the function call \(IDmother(IDmother(idego))\) gives the ID of ego’s grandmother and \(IDmother(IDmother(IDmother(idego)))\) retrieves the ID of ego’s great-grandmother. If an individual for which information is requested is not included in the database, the missing value indicator NA is returned. The query \(IDfather(IDmother(idego)))\) returns the ID of ego’s maternal grandfather and \(IDfather(IDfather(idego)))\) returns the ID of the paternal grandfather. The function call \(IDch(IDch(idego))\) returns the IDs of ego’s grandchildren, and \(IDch(IDch(IDch(idego)))\) the IDs of the great-grandchildren.

The names of the basic functions may not be unique to ‘Families’. Other packages included in the Comprehensive R Archive Network (CRAN) may have function by the same name. CRAN has about 20 thousand contributed packages. To ensure that \(IDmother\) and the other basic functions attached during the computations are from ‘Families’ and not from another package in the archive, the double colon operator is used. The second line of the code chunk that follows ensures that \(IDmother\) of the package ‘Families’ is attached and not a function by the same name in another package.

In the remainder of this section, information is retrieved on mothers, grandmothers, children and grandchildren.

3.1 Mothers and grandmothers

By way of illustration, let’s retrieve the data on three individuals, their mother and maternal grandmother. Before doing that, the functions of the ‘Families’ package, which are in the package environment, are copied to the workspace, which is the global environment.

# Create local copies of the functions (in workspace)
IDmother <- Families::IDmother
IDfather <- Families::IDfather
IDpartner <- Families::IDpartner
IDch <- Families::IDch
Db <- Families::Db
Dd <- Families::Dd

The individuals are selected at random from members of the third generation. The retrieval of information on mother and grandmother requires the retrieval of (a) the IDs of mother and grandmother, and (b) the individual records of these persons. The code is:

base::set.seed(30)
idego <- sample (dLH$ID[dLH$gen==3],3)
z <- dLH[c(idego,IDmother(idego),IDmother(IDmother(idego))),]
rownames(z) <- NULL
z[,1:9]

The three individuals selected have IDs 2114, 2037, 2090. The first three records contain data on the reference persons. Records 4-6 have data on the mothers of the reference persons and records 7-9 have data on the grandmothers. For each individual, the table includes ID of the individual, generation, sex, date of birth, date of death and age of death. In addition, it includes the ID of the partner, and the IDs of the mother and father (if their generation is included in the virtual population). Individual records also include the birth order of the individual and data on her or his children: the number of children (nch), the IDs of the children and the ages of the mother at childbearing. In this R R chunk, the information is limited to the first 3 children:

z[,c(10:14,21:23)]

The second individual has 2 children. The ID of the children are 2462, 2463 and the age of the mother at birth of the child is 29.43, 33.92 years.

To list the IDs of egos, their mothers and grandmothers, the function call is:

IDmother(IDmother(idego,keep_ego=TRUE))

with IDgm the ID of the grandmother of ego. The second argument of the function instructs the function to keep the IDs of ego and ego’s mother. For an explanation of the code, see the description of the function \(IDmother()\) (by typing ?IDmother into the console).

3.2 Children and grandchildren

Individual with ID 2 has 2 children and 2 grandchildren. The IDs of the grandchildren are:

# Children
IDch(id=2)
#> [1] 1663
# Grandchildren
IDch(IDch(id=2))
#> numeric(0)

Consider all women in generation 1 (4883 women) and let \(idego\) denote the vector of IDs of the women. The IDs of women with children are:

# Select all females in generation 1 
idego <- dLH$ID[dLH$gen==1 & dLH$sex=="Female"]
# IDs of children
idch <- IDch(idego)
# IDs of mother with children
idm <- unique(IDmother(idch))

with idm the IDs of women with children. 3751 women have children and 1132 remain childless. The IDs of women without children are retrieved by the typing \(idego\left[idego\%in\%idm==FALSE\right]\). The proportion remaining childless is \(1-length(idm)/length(idego)\), which gives 22.37 percent. The number of women in generation 1 by number of children is

addmargins (table(dLH$nch[dLH$gen==1 & dLH$sex=="Female"]))
#> 
#>   0   1   2   3   4   5   6 Sum 
#> 104 114 131  75  27   9   5 465

The IDs of the oldest and youngest child in a family are relatively easily determined. The oldest child is the child with the lowest date of birth. For each woman in the generation 1, the ID of the oldest child is obtained as follows:

idego <- dLH$ID[dLH$gen==1 & dLH$sex=="Female"]
# ID of children
idch <- IDch(idego,dLH)
# Date of birth of children
dbch <- dLH$bdated[idch]
# Create data frame
zz <-data.frame (ID=idch,dbch=dbch)
# Select, for each ego, child with lowest date of birth
ch_oldest=aggregate(zz,list(dLH$IDmother[idch]),function(x) min(x))
colnames(ch_oldest) <- c("idego","ID of oldest child","date of birth of oldest child")

The object \(ch\_oldest\) is a data frame with three columns. The first has the ID of the mother, the second the ID of the oldest child, and the third the date of birth of the oldest child. The first lines of \(ch\_oldest\) are

For each woman with children, the ID of the youngest child is

# Select, for each ego, child with highest date of birth
#    zz has ID of all children and dates of birth of children. 
#    They are used to select youngest child ever born (by mother)
ch_youngest=aggregate(zz,list(dLH$IDmother[idch]),function(x) max(x))
# Date of birth of mother
ch_youngest$db_mother <- Db(ch_youngest[,1])
# Age of mother at birth youngest child
ch_youngest$agem_chLast <- ch_youngest[,3] - ch_youngest$db_mother
colnames(ch_youngest) <- c("idego","ID_youngest_child","db_youngest_child","db_idego","agemLast")

\(ch\_youngest\) is a data frame with five columns: the ID of the mother, the ID of the youngest child, the date of birth of the youngest child, the date of birth of the mother and the age of the mother at birth of the youngest child. The density distribution of the ages of mothers at birth of youngest child is:

  xmin <- 10
  xmax <- 50
  library(ggplot2)
  p <- ggplot(ch_youngest, aes(agemLast)) +  
               geom_histogram(aes(y=..density..), alpha=0.5, position="identity",bins=50)+  
               geom_density(alpha=0.2) +
            scale_x_continuous(breaks=seq(xmin,xmax,by=5)) +
            scale_y_continuous (breaks=seq(0,0.07,by=0.01)) +
            labs(x="Age")
  title <- paste ("Age of mother at birth of youngest child; ",attr(dLH,"country"),attr(dLH,"year") )
  p <- p + ggtitle(title) +
  theme(plot.title = element_text(size = 10, face = "bold"))
  p

The mean age is 33.31 years and the standard deviation is 5.25 years. The median age at birth of the youngest child is \(r round(median(ch_youngest\)agemLast),2)` years.

3.3 Child’s perspective

The age of a newborn’s mother is the difference between the date of birth of the child and the date of birth of the mother. The age of a mother at a given reference age of a child is the age of the mother at birth of the child plus the reference age, provided the mother and the child are both alive at that age. Suppose we are interested in the age distribution of mothers of 10-year old children. A 10-year old is the reference person (ego). Consider females of generation 1. The IDs of mothers alive at ego’s 10 birthday is:

Status_refageEgo <- function(refage_ego)
{ # IDs of egos (children)
  idego <- IDch(dLH$ID[dLH$gen==1 & dLH$sex=="Female"])
  # Is ego alive at reference age?
  alive_ego <- Dd(idego) >= Db(idego) + refage_ego
  # Number of children (ego) alive (=TRUE) and dead (=FALSE) at reference age
  t1 <- table (alive_ego)
  # IDs of egos alive at reference age
  idegoRefage <- idego[alive_ego]
  # Is mother alive at ego's reference age?
  alive_m <- Dd(IDmother(idego)) >=  Db(idego) + refage_ego
  # Number of mothers alive (=TRUE) or dead (=FALSE) at reference ages of egos
  t2 <- table (alive_m)
  aa <- list (idego=idego,
              alive_ego=alive_ego,
              alive_m=alive_m,
              LivingStatus_ch_refage=t1,
              LivingStatus_m_refage=t2)
  return(aa)
}
refage_ego <- 10
out <- Status_refageEgo (refage_ego)

The proportion of egos with a living mother at their 10th birthday is

round (length(out$alive_m[out$alive_m]) / length(out$alive_ego),2)
#> [1] 0.99

The age distribution of mothers at reference age of children is:

# Age of living mothers at refage of ego
idegoRefage <-  out$idego[out$alive_ego]
age_m <- (Db(idegoRefage) + refage_ego - Db(IDmother(idegoRefage)))[out$alive_m]
mean(age_m)
#> [1] NA
sd(age_m)
#> [1] NA
hist(age_m,main=paste ("Age of living mother at age ",refage_ego," of ego",sep=""),breaks=40,xlab="Age of living mother")
box()

The age distribution of living mothers at age 65 of children is:

refage_ego <- 65
out <- Status_refageEgo (refage_ego)
# Age of living mothers at refage of ego
idegoRefage <-  out$idego[out$alive_ego]
age_m <- (Db(idegoRefage) + refage_ego - Db(IDmother(idegoRefage)))[out$alive_m]
mean(age_m)
#> [1] NA
sd(age_m)
#> [1] NA
hist(age_m,main=paste ("Age of living mother at age ",refage_ego," of ego",sep=""),breaks=40,xlab="Age of living mother")
box()

The age of a child at a given reference age of the mother is computed in a similar way. The age is the date of birth of the mother plus the reference age minus the date of birth of the mother, provided mother and child are alive. The age of ego at the 85th birthday of ego’s mother and the age of ego at death of ego’s mother (provided ego is alive at these ages) are:

idego <- dLH$ID[dLH$gen==3 & dLH$sex=="Female"]
age_m85 <- dLH$bdated[IDmother(idego)] + 85 - dLH$bdated[idego]
age_md <- dLH$ddated[IDmother(idego)] - dLH$bdated[idego]
d <- data.frame (idego,m85=age_m85,md=age_md)
library (ggplot2)
Plot_ages <- function (d)
{ # Age of ego at age 85 of mother and at death of mother
  dd <- reshape::melt.data.frame(d,id.vars="idego",measure.vars=c("m85","md"))
  colnames(dd)[2] <- "Age"
  xmin <- 10
  xmax <- 90
  p <- ggplot(dd, aes(x=value,color=Age,fill=Age)) +  
               geom_histogram(aes(y=..density..), alpha=0.5, position="identity",bins=50)+  
               geom_density(alpha=0.2) +
            scale_x_continuous(breaks=seq(xmin,xmax,by=10)) +
            scale_y_continuous (breaks=seq(0,0.07,by=0.01)) +
            xlab("Age")
  # Add median
  p <- p + theme(legend.position=c(0.76,0.99),legend.justification=c(0,1)) 
  title <- paste ("Age of ego at the 85th birthday of ego's mother and at death of mother; ",attr(dLH,"country"),attr(dLH,"year") )
  p <- p + ggtitle(title) +
  theme(plot.title = element_text(size = 10, face = "bold"))
 p
}
Plot_ages(d)