In order to be computationally efficient and allow for multiallelic markers, with
rubias we boil most of the data down to a bunch of integer vectors in a data structure that we operate on with some compiled code.
This document is intended to document that data structure (mostly for Eric’s benefit, at this point. We should have had a document like this a long time ago).
The basic data structure is what we call a
param_list. it has the following named elements, which are briefly described here. We will describe each in detail in separate sections below.
L: the number of loci, an integer
N: the number of individuals (integer)
C: the number of collections in the reference data set (integer)
A: the number of alleles at each locus (integer vector of length L)
CA: the “cumulative number of alleles” at each locus. For each locus this gives the base-0 index of the first allele at the locus (if you were to line all the alleles at each locus up one after another.)
coll: an integer vector of length N that gives the index of the collection that each fish is in.
coll_N: an integer vector of length C that gives the number of fish in each of the collections
RU_vec: This is the hardest one to figure out / remember. Imagine that each collection has an index from 1 up to C, and imagine that each collection belongs to a single reporting unit. Each reporting unit is assigned an integer. Now, sort everything first by reporting unit index and then by collection index. The order that you get is the order of the collections in
RU_vec. This vector is a named integer vector. The collections are in the order as described above. The names are the collection names and the values are the base-1 index of each collection.
RU_starts: The base-0 index of the starting position of each reporting unit in the
RU_vec vector. This is a named integer vector. For example, the first few entries of the chinook data set are:
<- check_refmix(chinook, 5) ploidies <- tcf2param_list(chinook, 5, summ = FALSE, ploidies = ploidies) cpar $RU_starts[1:5] cpar#> CentralValleyfa CentralValleysp CentralValleywi CaliforniaCoast KlamathR #> 0 8 12 13 15
and if we look at the first 15 elements of
RU_vec it gives us the names and the indices of the collections in those first 4 listed reporting units:
$RU_vec[1:15] cpar#> Feather_H_sp Feather_H_fa Butte_Cr_fa #> 1 6 7 #> Mill_Cr_fa Deer_Cr_fa Mokelumne_R_fa #> 8 9 10 #> Battle_Cr Sacramento_R_lf Butte_Cr_Sp #> 11 12 2 #> Mill_Cr_sp Deer_Cr_sp UpperSacramento_R_sp #> 3 4 5 #> Sacramento_H Eel_R Russian_R #> 13 14 15
I: an integer vector giving the allelic type of each gene copy carried by each individual. For ploidy = 2 (the only case implemented so far) this vector is of length (N * L * 2). An entry of 0 denotes missing data, and the observed alleles are named 1, 2, …
AC: this is a flat integer vector of the counts of alleles of different types in the different populations. It has length C * sum(A) (i.e. the number of collections in the reference times that total number of alleles at all the loci.). This is created by a somewhat lengthy process: first the function
reference_allele_counts() makes a long data frame that has
counts. This then gets turned into a list of matrices in
a_freq_list(). One matrix for each collection. The rows are the different alleles and the columns are the different populations. Then in
list_diploid_params() that list of matrices gets flattened into one long integer vector.
One of the weaknesses as I see that now, is that the loci are arranged alphabetically, rather than by input order. We should at least include the names of the loci in the order in which they appear so that we can get back to the loci, if necessary. The order of the loci coming out of this process is used to make sure that it corresponds to the order of the loci in
I, which is good, but not super intuitive. At any rate, from the foregoing, it can be deduced that we can index into this vector thus (all indexes are base-0): if we want the count of the a-th allele at the l-th locus in the c-th collection then we get that by base-0 subscipting
[C * CA[l] + c * A[l] + a]. WhereC
is the number of collections,CA
is the cumulative number of alleles, andA
is the number of alleles at each locus. Now it should be clear why we storeCA`—this is where we use it!
sum_AC: the sum of the allele counts at each locus for each collection in the reference data set. (Basically the number of observed gene copies at the locus in the reference data set). This gets computed in
list_diploid_params() from the list of matrices returned by
a_freq_list(). It is of length L * C. It is a named vector with the names taking
Locus.Collection, but I don’t think those names get used at all. It gets indexed as
[l * C + c]
DP: this is a vector completely parallel to
AC but in which the prior weights have been added to each allele in each collection.
sum_DP: this is the sum of Dirichlet Parameters
DP for each locus and each collection. It is parallel to
Finally, we have some entries that we should have had from day one, but didn’t, so they aren’t consistently used throughout the code to access the names of entities ordered as they ended up ordered: -
This is a trickier question than it seems, because things are done slightly differently in the different top-level functions.
In both of these functions, the original data sets gets read in, collection and repunit get converted to factors, and then the
param_list is made inside a single function:
Same as above, this uses
tcf2param_list() after doing a few other steps on the original data frame.
tcf2param_list() unless it is using preCompiledParams so that it can run through stuff during infer_mixture to compute the locus-specific means and variances of the log-likelihoods.
This is the tough one. Because we end up doing multiple mixture collections, we couldn’t simply use
tcf2param_list() in the function. Rather, we create a summary for the reference sample (keeping track of alleles found in both the reference and the mixture), and then we split the mixture samples up by mixture collection and use
One problem with the current approach is that it is terribly slow when you start to get 10K+ SNPs. It would be much faster to read and store those data in an 012 matrix. Here is how I am thinking I could deal with that:
For the functions that use
tcf2param_list() I could just write another function,
tcf2param_list_012(), that took
D as just a data frame with
indiv, and then had an 012 matrix with the genetic data in it, with indiv names in the rownames and locus names in the colnames. Then we just have to deal with seeing AC_list and I_list correctly. Actually, looking at it now, I can just do it in the same
tcf2param_list() function. If the
d012 parameter is not NULL we would:
cleaned$longNULL, and set
AC_listdirectly from the 012 matrix. This should be super straightforward. I would probably want to drop monomorphic loci first.
Cool, in order to do all this I should make two new functions:
allelic_list_012. That might give me enough insight that I could easily do it for