Introduction to Perc Package

Introduction

Rank relationships amongst members of a social group are most commonly represented as an ordered list of individuals that are ranked from highest to lowest. Numerous methods exist to identify this rank order, and most methods fall under one of two approaches: (1) finding the appropriate ranking by reordering the rows and columns of a win/loss matrix or (2) calculating a cardinal dominance index which ranks individuals by the proportion of others dominated, both of which are based upon a data set of agonistic and/or submissive interactions. Such methods produce a linear rank order, whether or not the society is expected to have a linear hierarchy or whether the data fit the assumption of linearity (which is particularly problematic when calculating cardinal ranks (Shev et al 2012)). Most researchers agree that the dominance structure of animal societies is likely to be more complex than simple linearity, but few, if any, methods exist for quantifying nonlinear structure. We present a new network-based method, called Percolation and Conductance, that permits nonlinear structure to emerge via estimates of network-wide directional consistency in the flow of dominance interactions and detection of blocks of dominance ambiguity that are indicative of nonlinear segments of a hierarchy. An additional benefit of this method is the ability to quantify dyadic-level dominance certainty. Dyadic dominance potentials are calculated using multiple indirect dominance pathways in the network (via common third parties) to infer missing data in the win/loss matrix and enhance the calculation of dominance potentials from direct interactions. Greater consistency in the direction of transitive dominance pathways from A to B (e.g. via pathway through C, D, and/or E) result in higher certainty that A outranks B, whereas greater evidence of inconsistent direction (e.g. some directed pathways go from A to B whereas other directed pathways go from B to A) result in dominance ambiguity. Because the number of indirect pathways is typically exponentially larger than direct win/loss interactions, one can be relatively confident that dyads with ambiguous dominance potentials are truly ambiguous due to evidence of directional inconsistency in their network pathways rather than lack of data.

Rationale

Rank order using the percolation-conductance method is based on a matrix that combines information from direct win/loss interactions with information from indirect pathways as described above to create a matrix of probabilities that the row individual outranks the column individual. Percolation-conductance finds these multi-step, directed pathways between all pairs of nodes by performing a series of random walks through the network. A simulated annealing process explores possible rank orders (from the space of all possible rank orders) – the more simulated annealing runs that are performed, a larger number of possible rank orders are explored, meaning that the best possible rank order is more likely to be found. The starting location of simulated annealing is determined by the total wins per individual in the matrix. Multiple simulated annealing steps are required to arrive at the optimal rank order, keeping in mind that multiple rank orders may be equally good if there are nonlinear segments in the hierarchy. The heat map function (plot.conf.mat) will plot the ranking and can allow for the detection of non-linear dominance structures. A block of individuals with probabilities near 0.5 indicates a subgroup of individuals whose relationships are not clearly defined. A second metric yielded by this method, dominance uncertainty, reflects the directional consistency of information flow through the direct and indirect pathways in a directed and weighted network.

The Percolation-Conductance method of quantifying rank and dominance uncertainty is computation complicated and requires researchers to have a sophisticated computational background to process the data. Our group, in collaboration with collaboraters in UC Davis Statistics Deparment, developed an R package that implements the percolation-conductance algorithm on directed dyadic interactions. By applying simple R functions, researchers will be able to quantify rank order and dominance uncertainty with the help of the package. The goal of this tutorial is to show how to quantify rank and dominance uncertainty from win-loss raw data.

User Manual

Importing Data

The function as.conflictmat is used to import raw edgelists or matrices for further analysis. Raw edgelists or matrices are records for directed dyadic win-loss interactions. The prefered R data types for as.conflictmat are “matrix” and “data.frame”. Raw edgelists or matrices can be imported as R data type of either data.frame or matrix to be used in as.conflictmat.

Preparing raw data

The most simple and commonly used raw data is a edgelist of two columns consisting all dyadic win-loss interactions, with the first columns being the winner and the second column being the loser. Duplicates are allowed in this two-column edgelist because multiple interactions between a pair are represented by multiple rows of the same winner and loser. Below is an example of a two-column edgelist.

library(Perc)

# displaying the first 5 rows of the example data.
head(sampleEdgelist, 5)

##   Iname  Rname
## 1  Kale Kibitz
## 2  Kale Kibitz
## 3  Kale Kibitz
## 4  Kale Kibitz
## 5  Kale Kolyma

A weighted edgelist is also allowed as raw edgelist data. It is a “matrix” or “data.frame” of three columns, with the first two columns being winner IDs and loser IDs respectively and the third column being the frequency of the win-loss interaction between the winner and the loser. An example of a weighted edgelsit is shown below:

# displaying the first 5 rows of the example data.
head(sampleWeightedEdgelist, 5)

##   Initiator1   Recipient1 Freq
## 1   Angelina       Alexis    1
## 2   Angelina        Darcy    1
## 3   Angelina FirstTimeMom    1
## 4   Angelina          Mak    1
## 5   Angelina      Tabitha    3

In addition, a matrix representing dyadic win-loss interaction is also accepted as raw data. For example,

# displaying the first 5 rows and columns
sampleRawMatrix[1:5, 1:5]

##             Alexis Angelina Aristotle BaldSpotMom C19
## Alexis           0        0         0           0   0
## Angelina         1        0         0           0   0
## Aristotle        0        0         0           0   0
## BaldSpotMom      6        0         0           0   0
## C19              0        0         0           0   0

Importing your raw data

The function as.conflictmat is used to import raw data of any accepted format, transform raw data to a “conflict matrix” for further analysis. “conflict matrix” is very specifically refered to a class in R. If your raw data is an edgelist of either two- or three- column, it will transform the your raw edgelist to a matrix in which counts representing for how many times a row wins over a column. Running this on a win-loss matrix creates an identical conflict matrix. This is required for all data formats to create R object of class conflict matrix.

Below is an example on transforming a two-column edgelist:

# convert an two-column edgelist to conflict matrix
confmatrix <- as.conflictmat(sampleEdgelist)

# displaying the first 5 rows and columns of the converted matrix
confmatrix[1:5, 1:5]

##          Kalani Kale Kalleigh Keira Kibitz
## Kalani        0    0        0     0      0
## Kale          0    0        0     0      4
## Kalleigh      0    1        0     0      0
## Keira         0    0        0     0      0
## Kibitz        0    6        0     0      0

Below is an example on transforming a raw win-loss matrix:

# convert a win-loss matrix to conflict matrix
confmatrix2 <- as.conflictmat(sampleRawMatrix)

# displaying the first 5 rows and columns of the converted matrix
confmatrix2[1:5, 1:5]

##             Alexis Angelina Aristotle BaldSpotMom C19
## Alexis           0        0         0           0   0
## Angelina         1        0         0           0   0
## Aristotle        0        0         0           0   0
## BaldSpotMom      6        0         0           0   0
## C19              0        0         0           0   0

When importing weighted edgelist (three-column), you’ll need to specify weighted = TRUE if your raw data is a weighted edgelist. Below is an example:

confmatrix3 <- as.conflictmat(sampleWeightedEdgelist, weighted = TRUE)

Directionality is very importnat when using the function as.conflictmat to import raw data. By default, winners (sources) are in the first column and loses (targets) are in the second column for edgelists. Row names are assumed to be the winner for a raw win-loss matrix. Values in a raw win-loss matrix describe that for how many times the corresponding row ID win over the corresponding column ID. If the directionality in your raw data happens to be opposite from the default, you could specify swap.order = TRUE in as.conflictmat. It will tell R that the directionality in your raw data is the opposite of default setting. Running as.conflictmat on your raw data with swap.order = TRUE will automatically correct the directionality when transforming your raw data to a conflict matrix. Here is an example,

# displaying the first 5 rows and columns of the raw win-loss matrix
sampleRawMatrix[1:5, 1:5]

##             Alexis Angelina Aristotle BaldSpotMom C19
## Alexis           0        0         0           0   0
## Angelina         1        0         0           0   0
## Aristotle        0        0         0           0   0
## BaldSpotMom      6        0         0           0   0
## C19              0        0         0           0   0

# use swap.order = TRUE when running as.conflictmat.
confmatrix4 <- as.conflictmat(sampleRawMatrix, swap.order = TRUE)

# displaying the first 5 rows and columns of the converted matrix
confmatrix4[1:5, 1:5]

##             Alexis Angelina Aristotle BaldSpotMom C19
## Alexis           0        1         0           6   0
## Angelina         0        0         0           0   0
## Aristotle        0        0         0           0   0
## BaldSpotMom      0        0         0           0   0
## C19              0        0         0           0   0

Examining the network

Before computing the dominance probability and finding the best rank order, you want to know whether or not long pathways are present in your data, and if present, whether or not these long pathways can be useful for gathering additional information about rank relationships.

You could find pathways of particular length that starting at a particular node (ID). For example,

# Testing findIDpaths.R
pathKuai <- findIDpaths(confmatrix, "Kuai", len = 3)
# Displaying the first 5 rows of pathKuai
pathKuai[1:5, ]

##      [,1]   [,2]       [,3]     [,4]    
## [1,] "Kuai" "Kalleigh" "Kale"   "Kibitz"
## [2,] "Kuai" "Kalleigh" "Kale"   "Kolyma"
## [3,] "Kuai" "Kalleigh" "Kimora" "Kibitz"
## [4,] "Kuai" "Kalleigh" "Kimora" "Kioga" 
## [5,] "Kuai" "Kalleigh" "Kimora" "Koppy"

# When there's no pathway starting at a particular individual, you'll get an output like the example below: 
pathKalani <- findIDpaths(confmatrix, ID = "Kalani", len = 2)

## no pathways of length 2 found starting at Kalani

You could also use transitivity to find out information on transitivity, a measure of how consistent or inconsistent the flow of information is through the network as a whole. This procedure looks for the number of triangles (e.g. A→B→C→A) and determines whether or not they are transitive.

conftrans <- transitivity(confmatrix)
conftrans$transitive      # number of transitive triangles

## [1] 14

conftrans$intransitive    # number of intransitive triangles

## [1] 4

conftrans$transitivity    # transitivity

## [1] 0.7777778

conftrans$alpha           # alpha

## [1] 6.468627

The function transitivity allow estimation for alpha, which is a value used in the conductance procedure to weight the information from the indirect pathways. The higher the transitivity of a network (the higher the alpha), the greater the weight the conductance procedure will give the indirect pathways.

Finding Rank Order

After transforming your raw data in an R object of conflict matrix class, we will use the function conductance to find all indirect pathways of a particular length (defined by maxLength) and update the conflict matrix. The maximum length of indirect pathways should be decided based on the data. We use a max length of 4 for our dominance rank. This yields the dominance probability matrix which represents the probability the row outranks the column. Values in the matrix are from 0 - 1, with 0.5 being the most uncertain. If a value is less than 0.5, the corresponding row ID is more likely to lose against the corresponding column ID; if a value is greater than 0.5, the corresponding row ID is more likely to win over the corresponding column ID.

DominanceProbability <- conductance(confmatrix, maxLength = 2)

DominanceProbability is a list of 2 elements. The first element, named imputed.conf is the updated conflict matrix which incorporates information from indirect pathways from the original conflict matrix. Comparing the original conflict matrix with the updated conflict matrix will show the information that is gained through the indirect pathways:

# displaying the first 5 rows and columns of the original conflict matrix
confmatrix[1:5, 1:5]

##          Kalani Kale Kalleigh Keira Kibitz
## Kalani        0    0        0     0      0
## Kale          0    0        0     0      4
## Kalleigh      0    1        0     0      0
## Keira         0    0        0     0      0
## Kibitz        0    6        0     0      0

# displaying the first 5 rows and columns of the imputed conflict matrix
DominanceProbability$imputed.conf[1:5, 1:5]

##              Kalani     Kale Kalleigh      Keira    Kibitz
## Kalani   0.00000000 0.000000        0 0.00000000 0.0000000
## Kale     0.00000000 0.000000        0 0.00000000 4.0557534
## Kalleigh 0.05575338 1.000000        0 0.11150676 0.2230135
## Keira    0.00000000 0.000000        0 0.00000000 0.0000000
## Kibitz   0.00000000 6.055753        0 0.05575338 0.0000000

Examining information gained through indirect pathways will provide you information to decide what is the appropriate maxLength for your data. You could examine the information gained through indirect pathways by substracting the original conflict matrix from imputed conflict matrix. The information gained could visually examined by generating a corresponding heatmap using the function plot.conf.mat.

# substracting the original conflict matrix from imputed conflict matrix.
informationGain <- DominanceProbability$imputed.conf - confmatrix

# examine the first five rows and columns of the informationGain
informationGain[1:5, 1:5]

##              Kalani       Kale Kalleigh      Keira     Kibitz
## Kalani   0.00000000 0.00000000        0 0.00000000 0.00000000
## Kale     0.00000000 0.00000000        0 0.00000000 0.05575338
## Kalleigh 0.05575338 0.00000000        0 0.11150676 0.22301352
## Keira    0.00000000 0.00000000        0 0.00000000 0.00000000
## Kibitz   0.00000000 0.05575338        0 0.05575338 0.00000000

# generating a heatmap representing information gained by using informatio from indirect pathways.
plotConfmat(informationGain, ordering = NA, labels = TRUE)

As we could see from the heatmap, using maxLength = 2 does provide us more information than the original raw conflict matrix that we imported from the raw data. In real analysis, you may want to consider using maxLength = 3 or even more and repeat this diagnosis process to examine the information gain. This will allow you to have a better idea on what is the appropriate maxLength for your data.

For example, if you want to examine whether or not maxLength = 3 provide more information than maxLength = 2. Here is an example:

DominanceProbabilityLength3 <- conductance(confmatrix, maxLength = 3)
informationGain2 <- DominanceProbabilityLength3$imputed.conf - DominanceProbability$imputed.conf
plotConfmat(informationGain2, ordering = NA, labels = TRUE)

As we can see in this heatmap that using maxLength = 3 still gives you some more information. In real analysis, you may want to repeat the analysis and try maxLength = 4. In addition, you may also want to consider your alpha value to decide how much you want to trust those long pathways.

The second element of DominanceProbability the dominance probability matrix, named p.hat. Here’s how you could pull out the dominance probability matrix:

# displaying the first 5 rows and columns of the dominance probability matrix
DominanceProbability$p.hat[1:5, 1:5]

##             Kalani      Kale  Kalleigh     Keira    Kibitz
## Kalani   0.0000000 0.5000000 0.4236125 0.5000000 0.5000000
## Kale     0.5000000 0.0000000 0.1180829 0.5000000 0.4040371
## Kalleigh 0.5763875 0.8819171 0.0000000 0.6325280 0.7095211
## Keira    0.5000000 0.5000000 0.3674720 0.0000000 0.4236125
## Kibitz   0.5000000 0.5959629 0.2904789 0.5763875 0.0000000

When you want to use dominance probability matrix for further analysis, often a long-format representation of the matrix is easier. The function dyadicLongConverter convert win-loss uncertainty to long format.

# displaying the first 5 rows of the converted long format win-loss probability.
dyadicLongConverter(DominanceProbability$p.hat)[1:5, ]

##       ID1      ID2 ID1 Win Probability ID2 Win Probability RankingCertainty
## 12 Kalani     Kale           0.5000000           0.5000000        0.5000000
## 23 Kalani Kalleigh           0.4236125           0.5763875        0.5763875
## 24   Kale Kalleigh           0.1180829           0.8819171        0.8819171
## 34 Kalani    Keira           0.5000000           0.5000000        0.5000000
## 35   Kale    Keira           0.5000000           0.5000000        0.5000000

Simulated rank order is computed using a simulated annealing algorithm by running the function simRankOrder on DominanceProbability$p.hat. The argument num is the number of simulated annealing runs to generate best rank order. The argument kmax tells you how long (perhaps iterations) the simulated annealing function looks around for a local optimum to find the best rank order. By default, kmax = 1000 was used. But it takes a long time to run when using kmax = 1000. The example below uses kmax = 10 just because it runs quickly and it is enough to illustrate the process, however this value is not recommended when processing real data. Here is how to use simRankOrder to find the simulated rank order:

# find simRankOrder
s.rank <- simRankOrder(DominanceProbability$p.hat, num = 10, kmax = 10)

The function s.rank returns a list containing simulated rank order and Costs for each simulated annealing run. The best rank order will be the one with the lowest cost.

# displaying the first 5 rows of the simulated rank order
s.rank$BestSimulatedRankOrder[1:5, ]

##         ID ranking
## 1    Koppy       1
## 2    Kioga       2
## 3     Kuai       3
## 4 Kalleigh       4
## 5   Kyushu       5

# displaying cost for each simulated annealing run
s.rank$Costs

##    simAnnealRun     Cost
## 1             1 2.392065
## 2             2 2.321503
## 3             3 2.392065
## 4             4 2.392065
## 5             5 2.321503
## 6             6 2.321503
## 7             7 2.321503
## 8             8 2.321503
## 9             9 2.321503
## 10           10 2.392065

# rank orders generated by each simulated annealing run 
s.rank$AllSimulatedRankOrder

##    SimAnneal1 SimAnneal2 SimAnneal3 SimAnneal4 SimAnneal5 SimAnneal6 SimAnneal7
## 1       Koppy      Koppy      Koppy      Koppy      Koppy      Koppy      Koppy
## 2       Kioga      Kioga      Kioga      Kioga      Kioga      Kioga      Kioga
## 3    Kalleigh       Kuai   Kalleigh   Kalleigh       Kuai       Kuai       Kuai
## 4      Kyushu   Kalleigh     Kyushu     Kyushu   Kalleigh   Kalleigh   Kalleigh
## 5      Kimora     Kyushu     Kimora     Kimora     Kyushu     Kyushu     Kyushu
## 6        Kuai     Kimora       Kuai       Kuai     Kimora     Kimora     Kimora
## 7      Kibitz     Kalani     Kalani     Kibitz     Kibitz     Kalani     Kalani
## 8      Kalani     Kibitz     Kibitz     Kalani      Keira     Kibitz     Kibitz
## 9      Kolyma     Kolyma     Kolyma      Keira     Kalani     Kolyma     Kolyma
## 10       Kale       Kale       Kale     Kolyma     Kolyma      Keira       Kale
## 11      Keira      Keira      Keira       Kale       Kale       Kale      Keira
##    SimAnneal8 SimAnneal9 SimAnneal10
## 1       Koppy      Koppy       Koppy
## 2       Kioga      Kioga       Kioga
## 3        Kuai       Kuai    Kalleigh
## 4    Kalleigh   Kalleigh      Kyushu
## 5      Kyushu     Kyushu      Kimora
## 6      Kimora     Kimora        Kuai
## 7      Kibitz     Kalani      Kibitz
## 8       Keira     Kibitz       Keira
## 9      Kolyma      Keira      Kalani
## 10       Kale     Kolyma      Kolyma
## 11     Kalani       Kale        Kale

After finding the simulated rank order, you can apply the rank order to your dominance probability matrix in order to generate a heatmap with its rows and columns ordered by rank. This heatmap will highlight areas of non-linear dominance rank (i.e. areas that are brown in the example below). A block of individuals with probabilities near 0.5 indicates a subgroup of individuals whose relationships are not clearly defined.

plotConfmat(DominanceProbability$p.hat, ordering = s.rank[[1]]$ID, labels = TRUE)

Regarding Dominance Certainty

Besides ranking information, the undirected certainty information in dominance relationship is also important. The values in the dominance probability matrix range from 0-1 indicating both the direction of the relationship and the certainty of the relationship. The function valueConverter converts all values to values between 0.5 - 1.0, which, in essence, removes directionality from the relationship leaving a single metric of dominance certainty (0.5 indicates total uncertainty and 1.0 indicates total certainty). For example,

# displaying the first 5 rows and columns of the converted matrix
valueConverter(DominanceProbability$p.hat)[1:5, 1:5]

##             Kalani      Kale  Kalleigh     Keira    Kibitz
## Kalani   1.0000000 0.5000000 0.5763875 0.5000000 0.5000000
## Kale     0.5000000 1.0000000 0.8819171 0.5000000 0.5959629
## Kalleigh 0.5763875 0.8819171 1.0000000 0.6325280 0.7095211
## Keira    0.5000000 0.5000000 0.6325280 1.0000000 0.5763875
## Kibitz   0.5000000 0.5959629 0.7095211 0.5763875 1.0000000

When you want to understand dyadic level dominance uncertainty, you could get the information from dyadicLongConverter. For example,

# displaying the first 5 rows of Ranking Certainty
dyadicLongConverter(DominanceProbability$p.hat)[1:5, c("ID1", "ID2", "RankingCertainty")]

##       ID1      ID2 RankingCertainty
## 12 Kalani     Kale        0.5000000
## 23 Kalani Kalleigh        0.5763875
## 24   Kale Kalleigh        0.8819171
## 34 Kalani    Keira        0.5000000
## 35   Kale    Keira        0.5000000

When individual-level win-loss relationship uncertainty is the focus of your question, the function individualDomProb computes the mean and standard deviation for individual-level win-loss probability.

individualDomProb(DominanceProbability$p.hat)[1:10,]

##          ID      Mean        SD
## 1    Kalani 0.6079515 0.1705907
## 2      Kale 0.6403923 0.1628646
## 3  Kalleigh 0.7350389 0.1343705
## 4     Keira 0.6606999 0.1782413
## 5    Kibitz 0.6974806 0.1677272
## 6    Kimora 0.7269768 0.1736934
## 7     Kioga 0.7037738 0.1644045
## 8    Kolyma 0.6072364 0.1559790
## 9     Koppy 0.7157361 0.1433639
## 10     Kuai 0.6873502 0.1703318