Using Markov chain Monte Carlo (MCMC) to visualize and test the linearity assumption of the Bradley–Terry class of models (original) (raw)

. Author manuscript; available in PMC: 2013 Sep 17.

Abstract

The construction of dominance hierarchies for animal societies is an important aspect of understanding the nature of social relationships, and the models to calculate dominance ranks are many. However, choosing the appropriate model for a given data set may appear daunting to the average behaviourist, especially when many of these models assume linearity of dominance. Here, we present a method to test whether or not a data set fits the assumption of linearity using the Bradley–Terry model as a representative of the class of models that assume linearity. Our method uses the geometry of a posterior distribution of possible rankings given the data by using a random walk on this distribution. This test is intuitive, efficient, particularly for large number of individuals, and represents an improvement over previous linearity tests because it takes into account all information (i.e. both linear and apparently circular or nonlinear information) from the data with few restrictions due to high dimensionality. Such a test is not only useful in determining whether a linear hierarchy is relevant to a given animal society, but is necessary in justifying the results of any analysis for which the assumption of linearity is made, such as the Bradley–Terry model. If the assumption of linearity is not met, other methods for ranking, such as the beta random field method proposed by Fushing et al. (2011, PLoS One, 6, e17817) should be considered.

Keywords: Bradley–Terry, goodness of fit, linearity, paired comparisons, ranking, rhesus macaque

Consider a society or group of individuals vying for a position of dominance or stature in a series of pairwise conflicts. Each decisive interaction in which a winner and a loser are identified holds information about the hierarchical structure of this group. The intuitive approach is to use this information to order the individuals from greatest competitive ability to least, as is often done for many sports to determine placement in playoff tournaments. Ranks assigned to teams in this manner are certainly transitive if team A outranks team B and team B outranks team C, then team A outranks team C. But cyclic relations in the data, such as A > B > C > A, can cause problems in determining the appropriate ranking if such a ranking even exists. In animal societies, for example, decisive agonistic interactions among group members are used to reconstruct a social hierarchy. For instance, if one individual threatens another and the second individual runs away, this is an indication of dominance and carries information about the rank of each individual. As with the sports tournament example, all agonistic interactions between pairs of individuals are aggregated and used to determine a rank order. However, the dominance structure in animal societies may be more complex than a simple linear hierarchy. For example, rhesus macaques, Macaca mulatta, live in large multimale, multifemale social groups in which both individuals and family groups (matrilines) are constantly competing for status in the society, so conflicting information can cause misleading results in ranking when using a method that assumes transitivity in dominance. Indeed, several captive groups of rhesus macaques at the California National Primate Research Center (CNPRC) exhibit nonlinear, corporative hierarchical structure (Beisner et al. 2011; Fushing et al. 2011b). Here we develop a new approach for estimating ranks that does not require linearity (Fushing et al. 2011a, b).

Constructing a dominance hierarchy from observed agonistic interactions in a group of animals was first described by Schjelderup-Ebbe (1922) for chickens and, since then, there has been much discussion over how to best determine the order for a hierarchy. Most methods fall into one of two classes. The first class of methods finds the appropriate ranking by reordering the rows and columns of the win/loss matrix, (wij), where wij is the number of times individual i has prevailed over individual j. The reordering is done by minimizing some numerical criterion calculated for the win/loss matrix. This first class is perhaps best characterized by de Vries’s (1998) I&SI method, which orders the matrix by minimizing the number of times a pair of individuals has a rank ordering that does not match their empirical dominance, denoted as an inconsistency, and the absolute difference in ranks for an inconsistent pair, denoted as the strength of an inconsistency. More recently this class has been characterized by the beta random field method proposed by Fushing et al. (2011a), which provides a more sophisticated criterion for ordering the matrix than the I&SI method.

The second class of models are based on Thurstone’s (1927) method of paired comparison, which utilizes a cardinal dominance index to rank individuals. The method of paired comparisons ranks individuals by the proportion of individuals dominated, which fails to model the randomness in the data and does not use all the information provided. The Bradley–Terry model (Bradley & Terry 1952) solves the issues of the Thurstone model by linking the cardinal dominance indices to the binomial probabilities of dominance through the expit function,

Boyd & Silk (1983) modified the Bradley–Terry model by generalizing this structure and linking the binomial probabilities of dominance to any monotone increasing function, F, of the difference between cardinal dominance indices:

Thus the Bradley–Terry model becomes a special case of the Boyd & Silk model.

Of the two classes of models, we claim that the first is generally the most reasonable approach to finding an appropriate ordering for a hierarchy. The primary disadvantage of the second class of models is the assignment of a cardinal dominance index to every individual, which inherently assumes that dominance is transitive. Consider a circular relationship among three individuals in which, when laid out in a triangular formation, each individual will win with probability 1 against the competitor directly clockwise. If all dyads have played an equal number of games, it is impossible to determine a reasonable ranking. Complex relationships such as this circular triad are more likely to arise in larger data sets, and thus, the simple assumption of linear structure is less likely to fit the data.

As it stands, Bradley–Terry/Boyd & Silk approach is still the most widely used methodology for ranking inside and outside of academia. For instance, the very popular and widely applied Elorating is just another form of this approach (see the recent paper by Neumann et al. 2011). Its popularity is partly built on its easy interpretation and partly on its effective computation. Another important feature of this approach is that no matter whether the critical linearity assumption is satisfied or not, this approach always provides a ranking sequence. That is to say, it becomes extremely convenient because of its easy application, even though the resultant ranking sequence might be poorly supported by the data. We will show that, when the linearity assumption is violated, the Bradley–Terry model may produce counterintuitive orderings. Thus, a rigorous linearity check is essential, especially in animal behavioural researches, which typically contain ranking as a very common and important part of scientific tasks.

We propose a goodness-of-fit test for the Bradley–Terry model to test whether the assumption of dominance transitivity is violated in the data and to show the effect of circular dominance on rankings. We choose the Bradley–Terry model as a representative of the second (Thurstonian) class of models from which to draw conclusions for the entire class for a couple of reasons. First, the deterministic models (i.e. those models that do not incorporate the randomness of the data) are a poor choice because they do not properly weigh the outcomes of likely events (e.g. a strong individual winning) versus unlikely events (e.g. a weak individual winning). Second, precisely which model is chosen as representative is somewhat arbitrary because all models in this class make the assumption of dominance transitivity. For example, the Boyd & Silk model is a generalized version of other models in this class, and it assumes transitive dominance regardless of the monotone increasing function, F, that is chosen. Thus, the only effective difference between models of this type is sensitivity to the assumption. If it is shown that a violation of this assumption produces poor results for a given method, it must be true for all other methods with the same assumption, although the severity of the violation required to produce poor results may vary by method. Finally, given that the choice of the representative of the Thurstonian class of models does not affect the conclusion, we chose the Bradley–Terry model specifically as it is perhaps the most widely used case of Boyd & Silk’s model. It has been well studied and generalized in a number of ways including modifications to handle ties by Rao & Kupper (1967), to account for home field advantage by Agresti (1990), and for use in the context of classification by Hastie & Tibshirani (1998). In addition, Adams (2005) proposed a Bayesian extension to the Bradley– Terry model, which is utilized in this paper and described in more detail below.

Other tests have previously been proposed but each of these is limited in scope. Appleby (1983) proposed a test that counts the number of circular triads in the data and compares this to the expected number, but this method is limited because it only identifies circular dominance paths of length three and because empirical data almost never come close to the random expectation of cyclicity (Shizuka & McDonald 2012). Kasuya (1995) also proposed a test that counts all possible data sets possessing fewer violations of linearity, but this test is hindered by computation time when large numbers of individuals are competing. Here, we will visually demonstrate why the Bradley–Terry model fails when the assumption of linearity is not met, and propose a test that is more thorough in evaluation of the assumption, is broadly applicable to data sets of various sizes, and is relatively efficient and intuitive to use.

In the field of animal behaviour, researchers are now more aware of the potential flaws due to violation of the linearity assumption. Although simple linear ranking methods are appealing for reasons of parsimony, large data sets (i.e. those with many subjects) may often violate the assumption and may require the more complex nonparametric approaches (de Vries 1998; Fushing et al. 2011a).

THE BRADLEY–TERRY MODEL AND AN EXTENSION

The Bradley–Terry Model

Suppose we observe a series of independent pairwise competitions among N individuals. For a given dyad (a group of two competitors), denote nij as the number of competitions between competitors i and j, and denote wij as the number of competitions in which individual i won. The Bradley–Terry model offers an explicit probability model for estimation of the dominance indices through maximum likelihood. Let pij be the true probability that player i dominates player j, then the probability that individual i wins wij times against individual j is given by

This provides the basis for the likelihood function, but in order to estimate the dominance indices, the model relates pij to di and dj through the expit link shown in equation (1). The term (nijwij) can be dropped from the likelihood function as it does not depend on pij, so the likelihood function is proportional to

∏i<jpijwijpjiwji=∏i≠jpijwij=∏i≠j(ediedi+edj)wij.	(3)

Finally, maximizing this quantity for d, the vector of dominance indices, yields a maximum likelihood estimate. To ensure the identifiability of this estimate, we choose an individual prior to the analysis and set the corresponding dominance index to zero.

A Bayesian Extension of the Bradley–Terry Model

Adams (2005) proposed an extension of the Bradley–Terry model in which a prior probability distribution for d is introduced to form a Bayesian maximum likelihood estimate for a posterior distribution for the dominance indices. The prior distribution can be used to encapsulate any prior information about a competitor’s dominance ability, such as a team’s ranking from a prior season or a researcher’s expert knowledge of a social group of animals being studied. Alternatively, the prior can also be specified as non-informative by choosing a symmetric distribution with large variance centred at zero such as a normal distribution with mean zero and large σ2. Given the likelihood and prior distribution, this Bayesian model uses a MCMC method, specifically the Metropolis– Hastings algorithm, to create a long-sequence Markov chain that will converge to the stationary distribution, the posterior distribution for d. Ideally the mode of this approximate posterior distribution can be identified and taken as d̂, the Bayesian maximum likelihood estimate of d.

In principle, this posterior density should be unimodal, and thus it should be straightforward to obtain the estimate, d̂, but in practice this is not the case. If the dimension of d̂ is large, say greater than 20 or 30, the number of MCMC samples required to reveal the unimodal structure of the posterior distribution can be overwhelming. In many real world examples, the dimension is large because social groups comprise more than 20–30 individuals. For example, social groups of rhesus macaques at the CNPRC range in size from 80 to 200 individuals, so the task of identifying the posterior mode is impossible. One alternative to this difficulty is to take the MCMC sample average, d̄. Since the posterior distribution is unimodal, this average, d̄, could be very close to the posterior mode. Thus, we work with d̄ as a surrogate of the mode, d̂. Once we adopt d̄ it is seemingly natural to have an estimated ranking sequence based on it.

The reasoning behind the Bayesian extension of the Bradley– Terry model seems to be built on very sound logic, and this is the main reason for its widespread use in many scientific fields as well as in sports. However, its seemingly broad applicability is deceptive. The fact that the Bradley–Terry model and its Bayesian counterpart always provide an answer, regardless of what the data structure might be, appears to suggest that the model is immune from all conflicting information, such as a triangular dominance path, no matter how many pieces of such information are present in the data set. This is, however, counterintuitive because every model has a strict domain of applicability, and the Bradley–Terry model is no exception. In the next section we illustrate the ways in which the Bradley–Terry model and its extensions are very sensitive to conflicting information by examining the posterior distribution in the right perspective.

MOTIVATION AND ILLUSTRATING EXAMPLES

The Bradley–Terry model and its Bayesian extension rely on two assumptions that are often violated or taken as true without proper justification: (1) the dominance ability of any three subjects is transitive; (2) the choice of the baseline individual is independent of the maximum likelihood estimate (MLE).

The first assumption is made to ensure that the estimated dominance probabilities match a linear dominance hierarchy to rank individuals sensibly. If this assumption is not met, another method that does not require such a restriction must be used, such as the beta random field method proposed by Fushing et al. (2011a). The second assumption is a consequence of quantifying the competitive ability of each individual. The dominance index for each individual is estimated relative to all other individuals, so a baseline is required.

Figure 1 displays results from three contests involving individuals A, B, C and D. In Fig. 1a, it is clear that the MLE will suggest that competitor A should be ranked higher than B, while the MLE in Fig. 1b would suggest the opposite. Figure 1c combines the data from the contests in Fig. 1a, b, creating a data set with conflicting information that violates the linearity assumption. Using the Bayesian Bradley–Terry model to generate a sample from the posterior distribution will yield several different rankings, each with high likelihood, as opposed to a single most-accepted ranking. It then raises the question whether the ranking corresponding to the MLE is useful, or whether there is another ranking that is nearly as likely.

Figure 1.

Three examples of tournaments. (a, b) Results from two contests in which the maximum likelihood estimate (MLE) disagrees on the ranking of competitors A and B. (c) Results combining the conflicting information in (a) and (b). The combination is a simple example of a data set that exhibits strong nonlinearity.

To answer this question, we used the MCMC method based upon the Bayesian extension of the Bradley–Terry model (described above) to determine the posterior density function of the dominance indices _d_A, _d_B and _d_C (given _d_D = 0) and identify evidence of multiple disparate rankings of high likelihood. Figure 2 shows a three-dimensional scatterplot of a sample of 5000 draws from the posterior distribution using the baseline _d_D = 0. Each point represents one draw from the posterior distribution, which is a vector value of (_d_A, _d_B, _d_C) plotted in three-dimensional space. Thus, each data point produces one potential ranking among the individuals A, B, C and D. Figure 2a shows the three-dimensional distribution of all points drawn, and Fig. 2b shows each face of the cube (note: the three faces of the cube above the diagonal are mirror images of the those in the bottom half, rotated 90°). The origin is marked by a red dot and the first and second most likely rankings are coloured green and orange, respectively. The influence of the conflicting information depicted in Fig. 1c is shown in the three-dimensional scatterplot as ambiguity in the relative rankings of individuals A, B, C and D. This evidence of ambiguity is most prevalent in the plots of _d_A against _d_C. Individual A outranks individual C whenever _d_A > _d_C. If the relationship between A and C was clearly defined, we would expect the majority of the mass of the distribution to be above or below the line _d_A = _d_C, but this line nearly splits the mass in half. That is, for about half of the points drawn from the posterior distribution, the corresponding rankings suggest that individual A outranks individual C and the remaining points suggest the opposite. The same can be seen with the pairs AD and CD. Since _d_D is set equal to 0, we expect the values for _d_A and _d_C to be mostly positive or mostly negative, but again the mass is almost evenly split along the lines _d_A = 0 and _d_C = 0. Thus, ambiguity in the relative rankings of pairs of individuals translates into multiple, disparate rank orders that each have similarly high likelihood.

Figure 2.

(a) Scatterplot of all the draws from the posterior distribution of d. (b) Matrix of pairwise scatterplots. We set _d_D = 0, so any distribution centred at 0 indicates ambiguity in ranking. Green points correspond to the most likely rankings; orange points correspond to the second most likely rankings; red point marks the origin.

The most likely rank order of individuals A, B, C and D varies depending upon which individual is set as the baseline. Ideally, a middle-ranking individual should be chosen as the baseline, but because of conflicting information, it is not clear which individual this should be. Table 1 lists the five most likely rankings using the Bayesian extension of the Bradley–Terry model, alternately setting each of the four individuals as the baseline. These results represent a posterior distribution of ranks, as opposed to the posterior distribution of dominance indices shown in the scatterplot of Fig. 2. However, the interpretation of these results is the same: there are many separate, and contradictory, points (rankings) of high probability. Looking at the final column in Table 1, it is clear that the ambiguities seen in Fig. 2 have resulted in several rankings with relatively high posterior likelihood. The ordering of individuals A and C is the same as that shown in the scatterplot, with A being ranked higher than C about half the time. The scatterplot is also in agreement with Table 1 with regard to pair AD and pair CD. If the assumption of transitive dominance were reflected in the data, we should see a single point of high probability in the posterior distribution for ranks similar to a continuous unimodal distribution, but this is not the case. The Bayesian Bradley–Terry model produces multiple, contradictory rankings, which is analogous to a multimodal posterior distribution. Furthermore, it is worth noting that the posterior distribution is sensitive to the choice of the baseline when there is conflicting information.

Table 1.

Most likely rankings and corresponding probabilities fromthe posterior distribution of the data shown in Fig. 1c

Ranking	Posterior probability	Ranking	Posterior probability
Baseline: _d_A=0	Baseline: _d_B=0
C>B>A>D	0.123	C>B>A>D	0.126
C>A>D>B	0.107	D>B>A>C	0.114
C>A>B>D	0.086	C>A>D>B	0.082
B>A>D>C	0.080	D>A>B>C	0.076
D>B>A>C	0.080	C>A>B>D	0.070
Baseline: _d_C=0	Baseline: _d_D=0
D>A>B>C	0.100	C>A>D>B	0.117
C>A>B>D	0.094	D>B>A>C	0.110
C>B>A>D	0.090	C>B>A>D	0.094
C>A>D>B	0.087	D>A>B>C	0.093
D>A>C>B	0.087	D>A>C>B	0.079

In the above example, we show that conflicting dominance information can result in multiple, highly disparate rankings with high likelihood, indicating that the assumption of a linear dominance structure is invalid. Thus, a goodness-of-fit test for the Bradley–Terry model must (1) detect alternative rankings of high likelihood and (2) quantify the distance between these rankings and the single ranking produced by the posterior mode.

DETECTION OF ALTERNATIVE RANK ORDERS OF HIGH LIKELIHOOD

The goodness-of-fit test for the Bradley–Terry model takes the following steps.

Step 1. Use the Metropolis–Hastings MCMC algorithm to sample from the posterior distribution for rankings

As mentioned previously, the posterior distribution for d under the Bayesian Bradley–Terry model has a single mode, which is taken to be the rank order. To detect the presence of alternative rank orders of high likelihood, we used the Metropolis–Hastings algorithm MCMC method with a diffuse prior, specifically Inline graphic (0, 100I), to conduct a random walk to points around the posterior mode. Although the Metropolis–Hastings (M–H) algorithm might traditionally be used to approximate the posterior distribution of d, such an application to a large animal group (N = 100) is not practical because it would require a very large chain to obtain a sample large enough to provide an adequate estimate. In this paper, we instead take advantage of the M–H algorithms equilibrium trajectory: the M–H algorithm takes relatively small steps in a random manner that most likely circle around the unique posterior mode (under the Bayesian Bradley–Terry model) and fully cover all directions. Each MCMC run converges upon a ranking, and these stops are recorded. The number of different stops (ranking sequences) and their relative frequency are both measures of ranking heterogeneity. Discovery of heterogeneity in ranking sequences can be achieved with a relatively short MCMC trajectory. In sampling from the posterior we discarded the first 10 000 step to allow the chain to converge. Since the likelihood is unimodal and uncomplicated, it required few steps, compared with the total number of possible rankings, for the posterior to be well sampled. In the following section we show through simulations that the method described here is able to accurately evaluate the linearity with a chain of length of 1000 after burn in. The MCMC chain becomes a vehicle to identify how much influence conflicting information has on the result produced by the Bradley–Terry model.

Step 2. Calculate the distance between the sampled rankings and the testing distribution function, Gu1 (·)

Here we provide a method to identify how far apart these stops (rankings) are as well as a way to test whether the distances between them are larger than we might expect from data with no conflicting information. These two key components are defined as follows. First, we define a distance metric that appropriately represents how far apart two given rankings are. We accomplish this by identifying whether two given rankings agree on the ordering of a given pair of individuals for all pairs. The number of times the rankings disagree is the distance between them. Second, we use the pairwise distances from all the draws to identify whether the distances are larger than expected. This is achieved by creating two samples of distances, one from the data and one from the expected data if the linearity assumption was met, then identifying whether it appears that the two samples came from the same distribution (described in detail under Steps 3 and 4).

We first define the appropriate measure of distance. Let (_r_1, _r_2, …, rn) be a ranking of n individuals where ri is the rank of the _i_th individual. Define the pair (ri, rj) to be an inversion, if for i < _j_, _ri_ > rj. For example, the ordering (2,1,3,4,5) has one inversion (2,1); specifically, individuals 1 and 2 are out of rank order. The ordering (5,2,3,4,1) has seven inversions. Counting them, we have: (5, 2), (5, 3), (5, 4), (5, 1), (4, 1), (3, 1), (2, 1). The number of inversions can be thought of as a distance of a given permutation from the non-permuted, strictly increasing ordering. An efficient algorithm for counting inversions is based on a well-known algorithm for sorting a list (www.cp.eng.chula.ac.th/~piak/teaching/algo/algo2008/count-inv.htm). R code (R Development Core Team 2010) as well as a description is given in the Appendix.

A distance between two rankings can now be defined in two steps. Consider two rankings, r1=(r11,r21,…,rn1) and r2=(r12,r22,…,rn2), where rij is an integer representing the rank of the _i_th individual in the _j_th ranking, _i_∈(1, 2, …, n) and _j_∈(1, 2). First, sort r2 by the rank order of r1. That is keeping the ranks assigned in _r_2, arrange individuals in order of ranks in _r_1. This is given by,

r2(order(r1))=(r(i:ri1=1)2,r(i:ri1=2)2,…,r(i:ri1=n)2)

where r(i:ri1=j)2 indicates the rank from r2 of the individual whose rank in r1 equals j. Any inversions present in this new ordering indicate where r1 and r2 disagreed on the ranks of individuals. Thus, the second step is to count the number of inversions remaining in this new ranking. This number represents the discrepancy, with respect to inversions, between the two rankings. To illustrate this, consider the two rankings used in the previous example: r1 = (2, 1, 3, 4, 5) and r2 = (5, 2, 3, 4, 1). Then r2(order(r1)) = (2, 5, 3, 4, 1) with a total of six inversions: (4, 1), (3, 1), (5, 1), (2, 1), (5, 3), (5, 4).

It is also clear that r1(order(r1)) = r2(order(r2)) = (1, 2, 3, 4, 5). Indeed the last equality holds for all ranking sequences. That is, the distance between r2 and r1 is equal to 6, and denoted as _I_D(r1, r2) = 6, while _I_D(r1, r1) = ID(r2, r2) = 0.

After defining the distance _I_D(·,·), the testing statistics are constructed to determine whether the empirical distances differ from those expected under the null hypothesis that dominance linearity has been met (i.e. there is only one ranking order of high likelihood). First, we calculate the matrix of pairwise distances for all rankings, called Ω1. Consider an equilibrium segment of MCMC trajectory of length M: (d(1), d(2),·, d(M)), which corresponds to a vector of ranking sequences of the same length: (r1, r2, …, rM). A matrix of distances is further computed from this segment of the MCMC trajectory and is given by: Ω1 = (_I_D(ri, rj)|1 ≤ i, j ≤ M). To obtain a measure of how close a given ranking is to all other rankings in the sample, we sum the row of Ω1 corresponding to that given ranking. Define a vector of these closeness measures by: ω1={∑j=1MID(ri,rj)∣1≤i≤M}. When there is little heterogeneity present in the data, the elements of ω1 will be small, as the rankings will tend to be close together. When heterogeneity does exist, the sampled rankings will be farther apart and the elements of ω1 will be large. We finally define our test statistic, _G_ω1 (·), as the empirical distribution function corresponding to ω1.

Step 3. Create homogeneous data necessary to calculate the reference distribution function, Fω 0 (·)

To calculate _F_ω0 (·), the distribution of expected ranking closeness measures under the null hypothesis of dominance linearity, we must create a data set consisting of the expected number of wins for each dyad assuming homogeneity. Given that the posterior density is unimodal and the posterior mode is too expensive to compute, the average d¯=1M∑k=1Md(k) under the null hypothesis is an effective estimate of the true d and is used here in our calculation of FM(·). Using equation (1) we can construct a mean dominance probability matrix,

P¯ij=(p¯ij)=(ed¯ied¯i+ed¯j).

To hold the number of interactions constant for each dyad, nij constant, we take the number of wins of i over j as wij∗=(nijp¯ij)+1{pij>1/2}. We can then repeat step 1 and step 2 for this new data set to generate a vector of closeness measures, ω0, and an empirical distribution function of these closeness measures _F_ω0.

Step 4. Use a receiver operating characteristic curve to test whether Fω0 (·) and Gω1 (·) are the same distribution

Once _F_ω0 (·) and _G_ω1 (·) have been calculated, we need to test whether these distributions are statistically significantly different. The receiver operating characteristic (ROC) curve (Fushing & Turnbull 1996) is ideal for this task and is given by the function Fω0(Gω1-1(p)), where p is a percentile between 0 and 1. If the two distributions have the same percentile, that is, _F_ω0 (x) = _G_ω1 (x) = p for some x, then the ROC function returns Fω0(Gω1-1(p))=Fω0(x)=p. We can then conclude that, if FM(·) and _G_ω 1 (·) are equal for all percentiles, then the function Fω0(Gω1-1(·)) is a straight line with a slope of 1. If the distributions are not equal, for example, if _F_ω0 (x) = _p_′ > p = _G_ω1(x), then Fω0(Gω1-1(p))=p′>p and the ROC curve will bend upwards. This behaviour is expected under the alternative as the heterogeneous ranking sequence would render large values of _I_D(·). Hence, _G_ω1 (·) is expected to be located on the right of _F_ω0 (·). The area under the ROC curve when the distributions are equal is always 1/2; thus, testing for equality of distributions is equivalent to testing whether the area under the ROC curve is greater than 1/2.

As with all quantitative comparisons, observing that the area under the ROC curve is greater than 0.5, such as p = 0.55, is not sufficient to statewith certainty that the distributions of _F_ω0 (·) and _G_ω1 (·) are different. The degree to which the area under the curve varies due to randomness depends upon the function _F_ω0 (·) = _G_ω1 (·). We therefore need to calculate the 95% confidence bounds of the value of p under the null hypothesis _F_ω0 (·) = G_ω1 (·). First, we simulate values of the area under the ROC curve to approximate its distribution under the null hypothesis. To do this, first take a large sample (at least 2_N) from the approximate posterior density based on the simulated data. Next, calculate a distance matrix for this sample. By randomly selecting two samples of size N, an ROC curve can be constructed based on the corresponding distances for each sample in the distance matrix. By determining the area under the curve and repeating this process, say 5000 times, we obtain an approximate distribution for the area under the ROC curve under the null hypothesis. Thus, the area under the ROC curve is significantly greater than 1/2 if Fω0(Gω1-1(·)) is greater than the 95th percentile of the simulated distribution.

ANALYSIS USING SIMULATED AND REAL DATA SETS

To illustrate the efficacy of the inversion test, we simulate two data sets, one linear and one nonlinear. The linear data set consists of 100 competitors with 1000 dyads chosen at random to produce observations. The winner of each pairwise competition was determined using a 100 by 100 dominance probability matrix in which the probability that individual i dominates individual j for i < j is given by

pij=11+e(i-j)=e-ie-i+e-j.	(4)

Note that the consequence of equation (4) is that we are using di = −1*i, that is, we are using the negative of an individual’s index as the dominance index. Here, we would expect the ROC curve to be nearly straight and to follow the diagonal line closely, and we would expect the test to yield a large P value. Figure 3a shows that the results from the inversion linearity test did indeed match our expectations. The sample size required to identify linearity depends on the steepness of the curve in equation (4). By adding a constant to the exponential term, _e_−β(i_−_j), where β > 0, the signal will become stronger for β > 1 and can be picked up with fewer dyads observed, or the signal will become weaker for β < 1 and more dyads will be required. In contrast, Fig. 3b shows an example of simulated nonlinear data. This data set was generated as before, but using the dominance probability matrix from the previous example with random values between −0.1 and 0.1 added to every element. For a nonlinear data set, we expect the ROC curve to bend and to obtain a small P value, both of which are achieved by the test. The test works as it should in both cases.

Figure 3.

Receiver operating characteristic (ROC) curves of linearity tests performed on simulated data from (a) linear and (b) nonlinear hierarchies.

We now consider a real data set, the dominance interactions from several groups of captive rhesus macaques from the CNPRC. Researchers endeavour to use the records of aggressive interactions to reconstruct the dominance hierarchy within a given social group. An aggressive dyadic interaction is treated as a contest between two macaques, and only those interactions that resulted in a clear winner and loser were entered into the data. Macaque society is known to be very complex as both individuals and matrilines (families of related females) constantly compete for dominance and look for opportunities to improve their social rank. In fact, several rhesus groups at the CNPRC have previously been shown to have corporative dominance structure, in which some matrilines are placed on the same tier of the hierarchy because their relative ranks are ambiguous (Fushing et al. 2011a). Such a structure produces a nonlinear dominance hierarchy and violates the linearity assumption of the Bradley–Terry model. Here we look at four independent rhesus social groups whose group sizes range from 46 to 97 adults, with the number of recorded interactions ranging from over 1000 to over 4000. Figure 4 shows the results of the inversion test for these four groups. All four groups show similar violations of linearity, suggesting that the nonlinear structure of the dominance hierarchy is not an isolated phenomenon but is perhaps a characteristic of all such rhesus societies, and may be true of other animal societies with large group size and/or matrilineal structure. The nonlinearity detected in these data is unlikely to be due simply to sparse data. The number of interactions recorded is larger than that of the simulated data, which were shown to be amenable to evaluation of their linearity, even in the presence of noise. Some of the curvature seen could be due to chance effects, but this is taken into account by the simulated distribution of the area under the curve. The curvature seen here is primarily due to nonlinearities caused by many inconsistencies in the form of polygons (cycles), as illustrated in Fig. 1.

Figure 4.

Inversion linearity test performed on data from four independent groups of monkeys. Receiver operating characteristic (ROC) curves in all four plots indicate severe violations of the linearity assumption.

DISCUSSION

The construction of dominance hierarchies is a common practice when studying animal societies. Researchers have a number of different models at their disposal to accomplish this, and choosing the best model for a given data set is important, especially when many of these models assume linearity of dominance. We show that when the linearity assumption of the Bradley–Terry class of models is not met, the resultant rank order can be inaccurate and counterintuitive. We present a method to test whether or not a data set fits the assumption of linearity using the geometry of a posterior distribution of possible rankings given the data. This test is intuitive, efficient, especially for data involving a large number of individuals, and represents an improvement over previous linearity tests because it takes into account all information in the data with few restrictions due to high dimensionality. In a sense, this method lets the data set speak for itself by utilizing a random walk on the posterior distribution of the ranks. Such a test is not only useful in determining whether a linear hierarchy is relevant to a given animal society, but is necessary in justifying the results of any analysis for which the assumption of linearity is made, such as the Bradley–Terry model. If the assumption of linearity is not met, other methods for ranking, such as the beta random field method proposed by Fushing et al. (2011a) should be considered.

The test performs best for a large number of individuals and improves with larger draws from the posterior. The test, specifically the final step involving comparison of _G_ω1 and _F_ω0 distributions, is not recommended for data sets with fewer than 10 individuals. However, in these cases, steps 1 and 2 remain quite useful, as one may determine visually whether the assumption is met by inspecting pairwise scatterplots as in Fig. 2, generated using the Metropolis–Hastings algorithm MCMC method. The test has been shown to work well with the number of draws close to N = 1000. This number tends to yield accurate results without demanding too much computation time. A chain of this length, however, remains very small compared to the total number of possible rankings that could occur with even 20 individuals. The MCMC method described here provides a useful tool to illuminate the structure of the data (i.e. whether the data follow the linearity assumption) without requiring the massive computation that often accompanies such high-dimensional problems.

Acknowledgments

This research was supported in part by the National Science Foundation (grant DMS-1007219 to F.H.) and the National Institutes of Health (grant R24 RR024396 to B.M.). The project was conducted under Institutional Animal Care and Use Committee protocol no. 11843, and all research adhered to recommendations in the Guide for the Care and Use of Laboratory Animals of the National Institutes of Health and the laws of the United States government. The methodological approach was purely observational and involved no experimental or invasive treatment of the animals.

Appendix

Counting Inversions Algorithm

Here we provide R code for the inversion algorithm. The algorithm works by recursively separating a vector into two parts, then sorting the parts and merging them back together, while counting how many elements are out of order. This is done in two parts.

merge.and.count

This function takes two sorted vectors of positive integers that were previously split from a parent vector and iteratively moves the smallest element in either vector to a new vector C. If the smallest element is in B, then an element in the original vector was out of place and the number of inversions caused by this is the number of remaining elements in A.

merge.count = function(A, B){ n = length(A) + length(B) #Input sorted vectors of integers, A and B C = numeric(n) #Output C a sorted vector of integers i = 1 #Initialize indices to 1 j = 1 k = 1 count = 0 #Initialize inversion count to 0 while(sum(A) != 0 & sum(B) != 0){ #while A and B are not empty if(B[j] < A[i]){ #Move smallest element in A or B to C C[k] = B[j] #If this element was in B add a number B[j] = 0 #of inversions equal to the remaining count = count + sum(A != 0) #length of A j = j + 1 } else{ C[k] = A[i] A[i] = 0 i = i + 1 } k = k + 1 } if(sum(A) != 0){ #Append the remainder of the non-empty C = c(C[1:(k − 1)], A[A != 0]) #vector to C } else{ C = c(C[1:(k − 1)], B[B != 0]) } return (list (count, C)) #Output C }

sort.and.count

This function splits a vector of positive integers into two parts and counts the number of inversions in both parts, then merges the two parts back together and counts the inversions between the two parts. The sum of the three counts is the total number of inversions in the original vector.

sort.count = function(L){ #L, input, vector of positive integers if (length(L) = = 1){ #If L has length 1 return L and 0 return(list(0, L)) #inversion count } else{ n = length(L) A = L[1: floor(n/2)] #Divide L into two parts, A,B B = L[(floor(n/2) + 1):n] r.A = sort.count (A) #Count the inversions in A and sort rA = r.A[[1]] A = r.A[[2]] r.B = sort.count (B) #Count the inversions in B and sort rB = r.B[[1]] B = r.B[[2]] r.L = merge.count(A, B) #Count the inversions between A and B r = r.L[[1]] #and merge back into L L = r.L[[2]] #The total number of inversion is the r = rA + rB + r #sum of inversions in A, B and between return(list(r, L)) #Return inversion count and sorted L } }

References

Adams ES. Bayesian analysis of linear dominance hierarchies. Animal Behaviour. 2005;69:1191–1201. [Google Scholar]
Agresti A. Categorical Data Analysis. 2. Hoboken, New Jersey: J. Wiley; 1990. [Google Scholar]
Appleby MC. The probability of linearity in hierarchies. Animal Behaviour. 1983;31:600–608. [Google Scholar]
Beisner BA, McAssey M, Fushing H, McCowan B. Nonlinear dominance hierarchies in captive rhesus macaque (Macaca mulatta) societies. American Journal of Primatology Supplement. 2011;73:75. [Google Scholar]
Boyd R, Silk JB. A method for assigning cardinal dominance ranks. Animal Behaviour. 1983;31:45–58. [Google Scholar]
Bradley RA, Terry ME. Rank analysis of incomplete block designs. I. The method of paired comparisons. Biometrika. 1952;39:324–345. http://www.jstor.org/stable/2334029. [Google Scholar]
Fushing H, Turnbull BW. Nonparametric and semiparametric estimation of the receiver operating characteristic curve. Annals of Statistics. 1996;24:25–40. [Google Scholar]
Fushing H, McAssey MP, Beisner B, McCowan B. Ranking network of a captive rhesus macaque society: a sophisticated corporative kingdom. PLoS One. 2011a;6:e17817. doi: 10.1371/journal.pone.0017817. [DOI] [PMC free article] [PubMed] [Google Scholar]
Fushing H, McAssey MP, McCowan B. Computing a ranking network with confidence bounds from a graph-based beta random field. Proceedings of the Royal Society B. 2011b;467:3590–3612. doi: 10.1098/rspa.2011.0268. [DOI] [PMC free article] [PubMed] [Google Scholar]
Hastie T, Tibshirani R. Classification by pairwise comparison. Annals of Statistics. 1998;26:451–471. [Google Scholar]
Kasuya E. A randomization test for linearity of dominance hierarchies. Journal of Ethology. 1995;13:137–140. http://dx.doi.org/10.1007/BF02352574. [Google Scholar]
Neumann C, Duboscq J, Dubuc C, Ginting A, Irwan AM, Agil M, Widdig A, Engelhardt A. Assessing dominance hierarchies: validation and advantages of progressive evaluation with Elo-rating. Animal Behaviour. 2011;82:911–921. [Google Scholar]
R Development Core Team. R: a Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing; 2010. http://www.R-project.org. [Google Scholar]
Rao PV, Kupper LL. Ties in paired-comparison experiments: a generalization of the Bradley–Terry model. Journal of the American Statistical Association. 1967;62:194–204. [Google Scholar]
Schjelderup-Ebbe T. Beitrge zur sozialpsychologie des haushuhns. Zeitschrift für Psychologie. 1922;88:225–252. [Google Scholar]
Shizuka D, McDonald D. A social network perspective on measurements of dominance hierarchies. Animal Behaviour. 2012;83:925–934. [Google Scholar]
Thurstone LL. A law of comparative judgment. Psychological Review. 1927;34:273–286. [Google Scholar]
de Vries H. Finding a dominance order most consistent with a linear hierarchy: a new procedure and review. Animal Behaviour. 1998;55:827–843. doi: 10.1006/anbe.1997.0708. [DOI] [PubMed] [Google Scholar]