The life course or life history is conceptualized as a sequence of states and transitions between states [@willekens2014]. A state is a personal attribute, e.g. marital status or number of children ever born, or a combination of personal attributes. The set of possible states is the state space. It is the set of possible values of a personal attribute or combination of attributes. The individual life history is modelled as a multistate stochastic process. The transitions the individual experiences and the states occupied after the transitions depend on personal characteristics and chance. Transition rates, which are the parameters of the multistate process, are estimated by combining data from different individuals. As a result, the individual life history generated by the model that describes the stochastic process is synthetic. The population is referred to as a virtual population. An individual’s life history depends on several factors, including individual characteristics, social context and personal experiences. In this paper, the influencing factors are limited to personal characteristics, in particular age, sex and parity. The life course is also affected by chance. Individuals with identical characteristics have different life histories due to random factors ^[For a discussion of the distinction between individual heterogeneity and individual stochasticity, see @caswell2009stage.].
The vignette consists of four sections. Section 2 is a brief review of the theory of multistate modelling. For a more extensive discussion, see [@aalen2008survival; @cook2018multistate; and @willekens2014]. Section 3 covers the simulation of a single fertility history. Section 4 is an application. Fertility histories of women in the United States are simulated using data from the Human Fertility Database (HFD) (www.humanfertility.org) and the Human Mortality Database (HMD) (www.mortality.org).
At any point in time, an individual is in one of r possible states. Let \(_kY(x)\) denote the state individual k occupies at age x. \(_kY(x)\) is a discrete random variable, a Bernoulli random variable if r=2 and a multinomial random variable if r>2. Individual k may leave state i and enter another state j. In some cases, a state can be entered but not left. A state that can be entered and left is a transient state, while a state that can be entered but not left is an absorbing state.
The probability that k is in state i at age x is the state probability \(_kp_i(x)=Pr \left[ _kY(x) = i\right]\), with \(\sum_{i=1}^{r} {_kp_i(x)=1}\). The probability that k is in state i at age x and in state j at age \(x+\tau\) is the joint probability
\begin{equation} Pr \left[ _kY(x) = i,\text{ } _kY(x+\tau) = j \right] \end{equation}
The transition probability is the product of the probability that k is in i at x and the conditional probability that k is in j at \(x+\tau\), provided k is in i at x:
\begin{equation} Pr \left[ _kY (x)=i,_kY(x+\tau)=j \right] = Pr \left [ _kY(x+\tau)=j |_kY(x)=i \right] Pr \left[_kY(x)=i \right] \end{equation}
The conditional probability is the transition probability and is denoted by \(_kp_{j|i}(x,x+\tau)\), which is more conveniently written as \(_kp_{ij}(x,x+\tau)\) and, if \(\tau=1\), it is also written as \(_kp_{ij}(x)\). The probability that k is in j at \(x+\tau\) is the state probability
\begin{equation} kp_j(x+\tau)=\sum{i=1}^{r} {kp{ij}(x,x+\tau)} {_kp_i(x)} \end{equation}
It expresses the state probability at \(x+\tau\) in terms of the state probabilities at x and the transition probabilities.
State probabilities may be combined in a state vector \(_k\mathbf{S}(x)\) and transition probabilities in the transition matrix \(_k{\mathbf{P}}(x,x+\tau)\):
\begin{equation}
_k{\mathbf{P}}(x,x+\tau)=
\begin{bmatrix}
\text{ }k p{11}(x) & k p{12}(x) & \cdots & k p{1r}(x) \
k p{21}(x) & \text{ }k p{22}(x) & \cdots & k p{2r}(x) \
\vdots & \vdots & \ddots & \vdots \
k p{r1}(x) & k p{r2}(x) & \cdots & \text{ }k p{rr}(x)
\end{bmatrix}
\end{equation}
where \(p_{ii}(\tau)=1 - \sum_{j \ne i}{} _k p_{ij} (\tau)\).
The element \(_k p_{ij} (x,x+\tau)\) denotes the probability that k who occupies stata i at age x occupies state j at age \(x+\tau\). The diagonal element \(_kp_{ii} (x,x+\tau)\) is the probability that k, who is in i at x, is also in i at \(x+\tau\). It does not imply that k remains in i during the entire period from x to \(x+\tau\). Multiple transitions are allowed during the interval \((x,x+\tau)\).
The vector or state probabilities (state vector) at age x is
\begin{equation} _k\mathbf{S}(x+\tau)= \begin{bmatrix} \text{ }k p{1}(x) \ k p{2}(x) \ \vdots \ k p{r}(x) \end{bmatrix} \end{equation}
The state vector at \(x+\tau\) depends on the state vector at \(x\) and the transition probability matrix:
\begin{equation} _k\mathbf{S}(x+\tau)=\left[ _k \mathbf{P}(x,x+\tau \right]’ \text{ }_k\mathbf{S}(x) \end{equation}
where$_k\mathbf{S}(t)$ is a column vector and ’ denotes transpose. This matrix expression is a fundamental equation in multistate modelling. In this vignette, the state equation is specified at the individual level.
If \(\tau\) is small and tends to zero,
\begin{equation} \frac{1}{\tau} \displaystyle \lim_{\tau \to 0} Pr \left[_kY(x+\tau)=j | _kY(x)=i \right]= \text{ }k \mu{ij}(x) \end{equation}
is the instantaneous rate of transition at age x. Multistate models are fully specified by their instantaneous rates of transition. A common assumption in demography is that \(_k \mu_{ij}(x)\) is piecewise constant. They vary between age groups of 1 or 5 years but are constant within age groups.
The instantaneous transition rates are combined in a transition rate matrix: \begin{equation} _k\mathbf{\mu}(x)= \begin{bmatrix} \text{ }k\mu{11}(x) & -k\mu{12}(x) & \cdots & -k\mu{1r}(x) \ -k\mu{21}(x) & \text{ }k\mu{22}(x) & \cdots & -k\mu{2r}(x) \ \vdots & \vdots & \ddots & \vdots \ -k\mu{r1}(x) & -k\mu{r2}(x) & \cdots & \text{ }k\mu{rr}(x) \end{bmatrix} \end{equation} where \(\mu_{ii}(\tau)=\sum_{j=i}{} _k \mu_{ij} (\tau)\).
The cumulative transition rate during the age interval from 0 to x is the cumulative hazard: \begin{equation} k\Lambda{ij}(x)=\int_{0}^{x} {k\mu{ij}(x)} \end{equation}
The cumulative hazard forms the basis for calculating transition probabilities. The probability that k, who is in i at \(x_1\), is in j at age \(x_2\) is \(exp \left[- _k \Lambda_{ij}(x_1,x_2) \right]\). The probability that k, who is in i at \(x_1\), is still in i at x \((x>x_1)\) is \begin{equation} k S{i+}(x_1,x) =exp \left[-\int_{x_1}^{x} {\sum_{j \ne i} {k \mu{ij}(\tau) d\tau} } \right] = exp\left[-\int_{x_1}^{x} {k \mu{i+}(\tau) d\tau} \right]=exp\left[- k\Lambda{i+}(x_1,x) \right] \end{equation}
Recall that \(_kY(\tau)=i\) is the probability that k occupies state i at age \(\tau\). It is sometimes convenient to denote the state occupied differently. Let \(_kY_i(\tau)\) be an indicator variable which is one if k is in i at \(\tau\) and zero otherwise:$_kY_i(\tau)=I((_kY(\tau)=i)$. The quantity \(_kY_i(\tau) d\tau\) is the time k spends in i during the interval \((\tau,\tau+d\tau)\). It is also the duration of exposure to the risk (duration at risk) of leaving i. The total duration of exposure before age x is \(\int_{0}^{x} {_kY_i(\tau) d\tau}\). It is the duration of stay in i between 0 and x. The stay does not need to be on a continuous basis. Individual k may leave i and return to i between ages 0 and x.
The product of the instantaneous transition rate at \(\tau\) and the indicator variable is the transition intensity \(_k\lambda_{ij}(\tau)\):
\begin{equation} k\lambda{ij}(\tau) = k\mu{ij}(\tau) _kY_i(\tau) \end{equation}
The function \(_k \mu_{ij}(\tau)\) is the hazard function, \(_kY_i(\tau)\) the exposure function and \(_k\lambda_{ij}(\tau)\) the intensity function [@aalen2008survival,p. 31; @cook2018multistate, p. 29]. The expression \(_k\mu_{ij}(\tau) d \tau\) is the probability that a transition occurs during the interval \((\tau,\tau+d\tau)\), provided k is at risk of experiencing the transition; \(_kY_i(\tau) d\tau\) is the time spent in i during the interval \((\tau,\tau+d\tau)\) (which is the duration of exposure of leaving i); and \(_k\lambda_{ij}(\tau) d\tau\) is the expected number of transitions from i to j by individual k during the interval \((\tau,\tau+d\tau)\). The expected number of transitions between ages 0 and \(x\) is
\begin{equation} kN{ij}(x)=\int_{0}^{x} {k\lambda{ij}(\tau) d\tau} \end{equation}
The sequence of random variables \(\{ {_kN_{ij}(\tau); 0 \le \tau \le x}\}\) is a counting process [@aalen2008survival, p. 25]. The presentation of life history processes as counting processes is a logical approach in demography because event occurrences are related to exposures. @willekens2014 adopts the counting process perspective in multistate demographic modelling.
Throughout the interval from 0 to x, individual k may enter any state j from any state i unless \(_k \mu_{ij}(\tau)\) is zero. To be able to move from i to j during the infinitesimally small interval from \(\tau\) to \(\tau+d\tau\), individual k must be in i at age \(\tau\). The current state occupied determines possible destination states. The state k occupies at \(\tau\) is determined endogenously. As a consequence, the period of exposure is also endogenous starting (and ending) with the occurrence of another transition. For instance, the birth of a first child triggers a period at risk of a birth of a second child. During the interval, k may stop being at risk by moving to another destination or by censoring the observation. In the simulation, the cumulative hazard needs to be computed for a period during which individual k is exposed to the risk of a (i,j)-transition. The exposure function denotes whether an individual of a given age in a given state is at risk of a transition to a given other state. Suppose that between ages \(x_1\) and \(x_2\) (with \(0 \le x_1, x_2 \le x\)) individual k is in state i. The cumulative hazard may be written as
\begin{equation} k\Lambda{ij}(x_1,x_2)=\int_{x_1}^{x_2} {k\mu{ij}(\tau) d\tau} = \int_{0}^{x} {k\mu{ij}(\tau) kY_i(\tau)} d\tau = \int{0}^{x} {k\lambda{ij}(\tau) d\tau} \end{equation}
Suppose we know the transition rates and are requested to infer the waiting time to leaving i for individual k, who is in i at \(x_1\). To determine a plausible waiting time, a random number u is drawn from the standard uniform distribution and the root of the equation
\begin{equation} g(x-x_1)=k \Lambda{i+}(x_1,\infty)+ln(1-u)=0 \end{equation} is determined. It gives the duration between \(x_1\) and the age at transition, i.e.$x_u-x_1$, with \(x_u\) the age for which the above equation holds.
An individual leaving i moves to a destination state, j say. The destination j is determined by sampling the multinomial distribution with parameters \(\{ _kp_{i1}(x),_kp_{i2}(x), \cdots,_kp_{ir}(x) \}\) \begin{equation} k p{ij}^*(x)=\frac{k\mu{ij}(x)} {\sum_{j \ne i} {k\mu{ij}(x)} } \hspace{1.5cm} {j \ne i} \end{equation}
and \(\sum_{j \ne i} {_k}p_{ij}^*(x)=1\). Sampling a multinomial distribution consists of two steps. First, a random number u is drawn from the standard uniform distribution. Second, the position of u in the parameter space is determined. If \(u < {_kp_{i1}^*(x)}\), the destination is state 1. If \({_kp_{i1}^*(x)} \le u < {_kp_{i2}^*(x)}\), the destination is state 2, etc. The method is equivalent to creating a vector with endpoints 0 and 1 and breakpoints \({_kp_{i1}^*(x)},{_kp_{i1}^*(x)}+{_kp_{i2}^*(x)},{_kp_{i1}^*(x)}+{_kp_{i2}^*(x)}+{_kp_{i3}^*(x)}\), etc. and to determine the segment that contains u.
Some transitions may not be possible, leading to an adjustment of the matrix of instantaneous transition rates \(_k\mathbf{\mu}(x)\). Consider the reproductive process. Each female member of the population becomes at risk of conception at menarche and remains at risk until menopause (conditional on survival and in the absence of sterilization). Fertility rates are zero initially, turn positive during adolescence and become zero again at menopause. The age at birth of the first child is determined by drawing a random waiting time from a piecewise exponential distribution with the age-specific first birth rates as parameters. If the waiting time exceeds the maximum reproductive age, the woman remains childless. Assume singleton births (no twins or triplets). Women with a first child are exposed to the risk of a second child. Exposure starts at the woman’s age at birth of the first child and ends at the age at birth of the second child or the end of the reproductive period, whatever comes first. Women with two children are exposed to the risk of a third child, and so on. Formally, women with i children are at risk of an additional child. The risk period starts at the age of birth of the i-th child and ends at birth of the i+1-st child or the end of the reproductive period (age at menopause). If \(i=0\), exposure starts at the lowest age of the reproductive period (age at menarche). In the reproductive process the transition intensities \(_k\mu_{ij}(\tau)\) are 0 except if j=i+1, with i the current number of biological children ever born to k. The matrix of party-specific instantaneous fertility rates is: \begin{equation} _k\mathbf{\mu}(\tau)= \begin{bmatrix} \text{ }k\mu{11}(\tau) & -k\mu{12}(\tau) & 0 & \cdots & 0 \ 0 & \text{ }k\mu{22}(\tau) & -k\mu{23}(\tau) & \cdots & 0 \ \vdots & \vdots & \vdots & \ddots & \vdots \ 0 & 0 & 0 & \cdots & \text{ }k\mu{rr}(\tau) \end{bmatrix} \end{equation} with r the highest parity considered in the analysis and the diagonal element \(_k\mu_{rr}(x)\) the instantaneous rate that a woman with r children has another child between \(\tau\) and \(\tau+d\tau\). For instance, if r is five, then a woman with five children may have another child. Beyond parity five, the fertility rate is independent of the parity. The matrix of transition intensities defines a hierarchical model. A hierarchical model is characterized by having r living states connected by r-1 rates of transition [@schoen2016hierarchical;@schoen2019analytical].
In the presence of mortality, the reproductive life course may be interrupted by death. Birth of the next child and death are competing risks. In the model, it is assumed that the rate of death is independent of the number of children ever born. In that case, individual lifespans may be simulated first, followed by the simulation of fertility careers from birth to death or the end of the reproductive period, whatever comes first.
Two cases are distinguished. In the first, the transition rates do not vary with age. The function sim.msm() of the msm package is used to generate a sequence of transitions. In the second, the transition rates are age-specific. The function Sim_bio() of the VirtualPop package generates the transitions. The life history of an individual is a realization (sample path) of a continuous-time Markov process.
Consider a most simple case. An individual has one attribute with three possible values (A, B and C). The three possible values constitute the state space. The individual moves between the three states at constant rates. The transition rates are:
M <- matrix(c( 0.0,0.10,0.05,
0.07,0.0,0.03,
0.02,0.05,0.0),nrow=3,byrow=TRUE)
namstates <- c("A","B","C")
dimnames(M) <- list(origin=namstates,destination=namstates)
diag(M) <- -rowSums(M)
MM <- -M
out <- knitr::kable(MM,
caption = "Transition rate matrix",
format = 'latex', booktabs = T,linesep = "")
y <- kableExtra::kable_styling(out,latex_options="HOLD_position")
kableExtra::add_header_above(y,c("From"=1,
"To"=3))
\begin{table}[H]
\caption{\label{tab:unnamed-chunk-1}Transition rate matrix} \centering \begin{tabular}[t]{lrrr} \toprule \multicolumn{1}{c}{From} & \multicolumn{3}{c}{To} \ \cmidrule(l{3pt}r{3pt}){1-1} \cmidrule(l{3pt}r{3pt}){2-4} & A & B & C\ \midrule A & 0.15 & -0.10 & -0.05\ B & -0.07 & 0.10 & -0.03\ C & -0.02 & -0.05 & 0.07\ \bottomrule \end{tabular} \end{table}
The transition rate matrix MM has the format of the matrix in equation 5.
The constant transition rates are used to generate life histories during segments of life. The function sim.msm() of the msm package simulates one realisation from a continuous-time Markov process up to a given time. Note that sim.msm() requires that the states are numeric values. The function also requires that the transition matrix has origin in rows and destination in columns [@jackson2011multi, p. 6]. The following code chunk generates a history between ages 20 and 40 for an individual who starts in state A at age 20.
set.seed(33)
bio <- msm::sim.msm (qmatrix=-MM,mintime=20,maxtime=40,start=1)
bio
$states
[1] 1 3 2 2
$times
[1] 20.00000 26.44465 36.26984 40.00000
$qmatrix
destination
origin A B C
A -0.15 0.10 0.05
B 0.07 -0.10 0.03
C 0.02 0.05 -0.07
The function produces a list of three objects:
At the start of the simulation, the individual is in state A (state 1) and exposed to the risk of moving to state B (state 2) or C (state 3). A random draw from the exponential waiting time distribution gives 6.44 years (since the start of the simulation and 26.44 years since the start of the time scale). Since 6.44 is the duration at which the value of the cumulative probability distribution is equal to u, we may determine the value of u produced by the random draw:
The probability that an individual in state A at age 20 leaves A before the end of the observation at 40 is \(1-exp\left [-0.15*20\right]=0.95\). Hence, if a random number is drawn that exceeds 0.95, the waiting time to the transition exceeds the observation window. In that case, the transition does not occur due to censoring at age 40.
The destination state is determined by a Bernoulli trial with two possible outcomes (B and C) and the probability \(0.10/(0.10+0.05)=0.667\) that B is the outcome.
The function sim.msm() is also used by @cook2018multistate[p.339].
After this illustration of the mechanism, we return to fertility. Suppose a female member of the virtual population experiences throughout her life the age- and parity-specific fertility rates of the United States in 2021. The function Sim_bio() of VirtualPop is used to generate the fertility career. Unlike in sim.msm(), the transition rates used by Sim_bio() are age-specific. The fertility rates of the United States in 2021 are distributed with the \(VirtualPop\) package. They are loaded with the \(data\) function of R base:
rates <- NULL
data(rates,package="VirtualPop")
The object \(rates\) has three components: (a) the death rates by age and sex (ASDR), (b) the fertility rates by age and parity (ASFR), and © the fertility rates in a format required by multistate models (ratesM). The rates for ages 25 to 29 are:
rates$ratesM[26:29,,]
, , Origin = par0
Destination
Age par0 par1 par2 par3 par4 par5 par6
25 0.05096 -0.05096 0 0 0 0 0
26 0.05541 -0.05541 0 0 0 0 0
27 0.06092 -0.06092 0 0 0 0 0
28 0.06700 -0.06700 0 0 0 0 0
, , Origin = par1
Destination
Age par0 par1 par2 par3 par4 par5 par6
25 0 0.16837 -0.16837 0 0 0 0
26 0 0.16459 -0.16459 0 0 0 0
27 0 0.16296 -0.16296 0 0 0 0
28 0 0.15846 -0.15846 0 0 0 0
, , Origin = par2
Destination
Age par0 par1 par2 par3 par4 par5 par6
25 0 0 0.14037 -0.14037 0 0 0
26 0 0 0.13269 -0.13269 0 0 0
27 0 0 0.12488 -0.12488 0 0 0
28 0 0 0.11813 -0.11813 0 0 0
, , Origin = par3
Destination
Age par0 par1 par2 par3 par4 par5 par6
25 0 0 0 0.13098 -0.13098 0 0
26 0 0 0 0.12428 -0.12428 0 0
27 0 0 0 0.11592 -0.11592 0 0
28 0 0 0 0.10721 -0.10721 0 0
, , Origin = par4
Destination
Age par0 par1 par2 par3 par4 par5 par6
25 0 0 0 0 0.14481 -0.14481 0
26 0 0 0 0 0.13719 -0.13719 0
27 0 0 0 0 0.12908 -0.12908 0
28 0 0 0 0 0.12046 -0.12046 0
, , Origin = par5
Destination
Age par0 par1 par2 par3 par4 par5 par6
25 0 0 0 0 0 0.14481 -0.14481
26 0 0 0 0 0 0.13719 -0.13719
27 0 0 0 0 0 0.12908 -0.12908
28 0 0 0 0 0 0.12046 -0.12046
, , Origin = par6
Destination
Age par0 par1 par2 par3 par4 par5 par6
25 0 0 0 0 0 0 0
26 0 0 0 0 0 0 0
27 0 0 0 0 0 0 0
28 0 0 0 0 0 0 0
The following function call simulates the fertility career of a hypothetical person born on June 12th 1990.
popsim <- data.frame (ID=3,
born=1990.445,
start=0,
end=55,
st_start="par0")
set.seed(31)
ch <- suppressWarnings(VirtualPop::Sim_bio (datsim=popsim,
ratesM=rates$ratesM))
ch
$age_startSim
[1] 0
$age_endSim
[1] 56
$nstates
[1] 6
$states
[1] 1 2 3 4 5 6
$path
[1] "par0par1par2par3par4par5"
$ages_trans
[1] 19.36979 25.50009 31.11729 33.41448 38.56408
The individual has 5 children. The first child is born at age 19.37 and the second at age 25.5. Since the date of birth is known, the calendar dates at childbirth can be determined. The first child is born on Sun Oct 25 2009:
z <- format(lubridate::date_decimal(1990.445+ch$ages_trans[1]),
"%a %b %d %Y" )
The simulation window covers the entire reproductive period. The simulation may be interrupted at any age during the reproductive period, analogous to the censoring of a longitudinal observation. The argument \(end\) of the Sim_bio() function is an individual-level variable indicating the end of the simulation for the individual. In the presence of mortality, \(end\) is the age at death. The individual fertility career is simulated until \(end\) is reached. Sim_bio() is used in the function Children() of VirtualPop, used in the next section.
The model described in the previous section is used to simulate individual fertility histories in the presence of mortality.The data are mortality rates of the United States in 2021, by age and sex and fertility rates by age and parity. The rates are included in the VirtualPop package for illustrative purpose. They are retrieved using the code
rates <- NULL
data(rates,package="VirtualPop")
The creation of a virtual population involves several steps. The remainder of this section is a description of the steps to generate the virtual population. The steps are:
The simulation starts by determining the size of the virtual population. Each individual in the virtual population is assigned a unique identification number (ID) and a date of birth. If all births occur in the same year or same period, then the population consists of a single birth cohort. In this section, it is assumed that all births occur in a single year. Dates of birth are obtained by sampling from a probability distribution of births during the year. In this paper, a uniform distribution is assumed (no seasonal effects). The date of birth of an individual is the year of birth plus a fraction of a year, equal to a random number from the standard uniform distribution. The date of birth is expressed as a real number and is referred to as decimal date of birth. The format is particularly useful to compute ages and durations. Note that the computer does not store calendar dates, but stores dates as time (seconds or milliseconds) elapsed since a reference date and time, e.g. midnight, 1st January 1970.
At the start of the simulation, each individual is randomly assigned a sex using the empirical sex ratio at birth. Other personal attributes may be added, but they are omitted in this illustration.
The simulation includes a very simple partner formation process. Partners are selected at random from the opposite sex. Since the process is purely random, union formation may be simulated to occur at birth. When more boys are born than girls, some males remain without a partner.
For each individual in the virtual population, a data record is created. It contains the individual’s ID, date of birth and sex. It also contains a variable for the generation to which the individual belongs. The code is
cohort <- 2021
ncohort <- 1000
ID <- 1:ncohort
sex <- rbinom(ncohort,1,prob=1/2.05)
sex <- factor (sex,levels=c(0,1),labels=c("Male","Female"),ordered=TRUE)
# Population size by sex
nmales <- length(sex[sex=="Male"])
nfemales <- length(sex[sex=="Female"])
gen <- rep(1,ncohort) # generation 1
# Decimal date of birth
bdated <- cohort+runif(ncohort)
# Create data frame
d <- data.frame (ID=ID,
gen=gen,
cohort=cohort,
sex=sex,
bdated=bdated,
ddated=NA,
x_D=NA,
IDmother=NA,
IDfather=NA,
jch=NA,
IDpartner=NA,
udates=NA,
nch=NA)
# Ages at death, obtained by sampling a peicewise-exponential distribution,
# using the rpexp function of the msm package
ages <- as.numeric(rownames(rates$ASDR))
d$x_D[d$sex=="Male"] <- msm::rpexp(n=nmales,rate=rates$ASDR[,"Males"],
t=ages)
d$x_D[d$sex=="Female"] <- msm::rpexp(n=nfemales,rate=rates$ASDR[,"Females"],
t=ages)
# Decimal data of death
d$ddated <- d$bdated+d$x_D
Individual records include empty spaces for the ID an individual’s partner, mother and father. Their IDs are determined later. The spaces are included to facilitate the merging of data on multiple generations into a single data frame.
The simple PartnerSearch model is implemented in function PartnerSearch(). It accepts the data frame \(d\) as input and returns d augmented by the partner’s ID.
d <- VirtualPop::PartnerSearch(dLH=d)
The first lines of the data frame of individual records are:
out <- knitr::kable(head(d),
caption = "Data for selected individuals",
format = 'latex', booktabs = T,linesep = "")
kableExtra::kable_styling(out,latex_options=c("scale_down", "HOLD_position"))
\begin{table}[H]
\caption{\label{tab:unnamed-chunk-10}Data for selected individuals} \centering \resizebox{\linewidth}{!}{ \begin{tabular}[t]{rrrlrrrlllrll} \toprule ID & gen & cohort & sex & bdated & ddated & x_D & IDmother & IDfather & jch & IDpartner & udates & nch\ \midrule 1 & 1 & 2021 & Male & 2021.033 & 2094.614 & 73.58084 & NA & NA & NA & 122 & NA & NA\ 2 & 1 & 2021 & Male & 2021.041 & 2108.618 & 87.57696 & NA & NA & NA & 786 & NA & NA\ 3 & 1 & 2021 & Male & 2021.181 & 2089.131 & 67.95004 & NA & NA & NA & 999 & NA & NA\ 4 & 1 & 2021 & Male & 2021.857 & 2099.140 & 77.28293 & NA & NA & NA & 68 & NA & NA\ 5 & 1 & 2021 & Male & 2021.081 & 2111.526 & 90.44429 & NA & NA & NA & 548 & NA & NA\ 6 & 1 & 2021 & Male & 2021.645 & 2110.168 & 88.52360 & NA & NA & NA & 852 & NA & NA\ \bottomrule \end{tabular}} \end{table}
For women, ages at childbirth are determined by sampling from waiting time distributions with parameters equal to the fertility rates by age and parity. When a child is born, it receives a unique ID and a new individual data record is created. The children of the initial cohort constitute generation 2. The date of birth of a child is determined by the date of birth of the mother and her exact age at birth of the child. The ID of the mother is added to the child’s record and the child’s ID is added to the data record of the mother. The record linkages enable to track the descending line of offspring and ascending line of parentage. It significantly facilitates the study of genealogies and kinship networks. The new-born is also assigned a sex using the empirical sex ratio at birth. A partner is allocated by randomly drawing an ID of members of the same generation but of opposite sex. The lifespan and the fertility career of each member of the second generation is determined by sampling the appropriate probability distributions. The function \(Children()\) does the computations. The first argument is the individual data structure,$d$ in this illustration, and the second is the object with the transition rates. For instance,
dch1 <- VirtualPop::Children(dat0=d,rates=rates)
The object \(dch1\) has two components: \(data\) and \(dch\). The first components is comparable to the women’s file in demographic surveys and the second to the children’s file (see e.g. Demographic and Health Survey (DHS) [https://dhsprogram.com]). The object \(data\) is the input object \(d\) with, for each mother and father, information on number of children.
The second object \(dch\) is a data frame with individual records for the new-borns (second generation). It has the ID of the child, the generation, sex, date of birth, the birth order (variable \(jch\)), the ID of the mother, the ID of the father, and the date of death of the child. The ID of the father is the ID of the partner of the mother:
dch1$dch$IDfather <- dch1$data$IDpartner[dch1$dch$IDmother]
The first records are shown below.
out <- knitr::kable(head(dch1$dch),
caption = "Data for selected children of members of initial cohort",
format = 'latex', booktabs = T,linesep = "")
kableExtra::kable_styling(out,latex_options=c("scale_down","HOLD_position"))
\begin{table}[H]
\caption{\label{tab:unnamed-chunk-13}Data for selected children of members of initial cohort} \centering \resizebox{\linewidth}{!}{ \begin{tabular}[t]{rrrlrrrrrrlll} \toprule ID & gen & cohort & sex & bdated & ddated & x_D & IDmother & IDfather & jch & IDpartner & udates & nch\ \midrule 1001 & 2 & 2021 & Male & 2046.150 & 2128.191 & 82.04110 & 7 & 229 & 1 & NA & NA & NA\ 1002 & 2 & 2021 & Female & 2046.193 & 2124.670 & 78.47655 & 7 & 229 & 2 & NA & NA & NA\ 1003 & 2 & 2021 & Male & 2056.244 & 2122.886 & 66.64204 & 7 & 229 & 3 & NA & NA & NA\ 1004 & 2 & 2021 & Male & 2060.358 & 2108.437 & 48.07914 & 10 & 501 & 1 & NA & NA & NA\ 1005 & 2 & 2021 & Female & 2046.527 & 2116.963 & 70.43553 & 13 & 232 & 1 & NA & NA & NA\ 1006 & 2 & 2021 & Male & 2055.651 & 2156.033 & 100.38134 & 13 & 232 & 2 & NA & NA & NA\ \bottomrule \end{tabular}} \end{table}
The members of the second generation partner with individuals of the same generation. Partners are allocated randomly.
d2 <- VirtualPop::PartnerSearch (dLH=dch1$dch)
The object \(d2\) is \(dch1\\)dch$ augmented by the IDs of partners. The data frame contains the data on the second generation in a format that is consistent with the data frame of the first generation, which facilitates merging the data frames. The Children() function is used repeatedly to obtain data on the third and higher generations:
dch2 <- VirtualPop::Children(dat0=d2,rates=rates)
PartnerSearch in the third generation is the outcome of a random search:
d3 <- VirtualPop::PartnerSearch (dLH=dch2$dch)
dch3 <- VirtualPop::Children(dat0=d3,rates=rates)
d4 <- VirtualPop::PartnerSearch (dLH=dch3$dch)
dch4 <- VirtualPop::Children(dat0=d4,rates=rates)
d4 <- dch4$data[,1:which (colnames(dch4$data)=="nch")]
The object \(d4\) has data on the fourth generation, including on the children of members of the fourth generation.
Data on the four generations are merged to a single data frame.
dLH2 <- rbind(dch1$data,dch2$data,dch3$data,dch4$data)
The data frame that results is similar to the \(dLH\) data object included in the VirtualPop package.