Need a supporting hand in manuscript formatting? Write to our team.
Assistance in presubmission | editorialoffice@clinmedjournals.org
Substitution rate matrices are used to correct multiple hits at the same sites, which requires the derivation of transition probabilities and evolutionary distances from substitution rate matrices. The derivation is essential in molecular phylogenetics and phylogenomics, and represents the only statistically sound way for developing scoring matrices used in sequence alignment and local string matching (e.g., BLAST and FASTA). Three different approaches are frequently used for deriving transition probabilities and evolutionary distances: 1) The probability reasoning, 2) Solving partial differential equations, and 3) Matrix exponential and logarithm. The first approach demands the least amount of mathematical skills but offers the best way for conceptual understanding, and can often generate nice mathematical expressions of transition probabilities and evolutionary distances. This review represents the most systematic and comprehensive numerical illustration of the first approach.
Substitution model, Substitution rate, Transition probability, Evolutionary distance
Substitutions occur over time and can overwrite each other at the same nucleotide or amino acid site. When we compare two homologous nucleotide sequences and find differences in N sites, the actual number of substitutions (designated by M) could be much greater than N because multiple substitutions could have happened at the same site, overwriting each other. Substitution models are used to infer the observable M from the observed substitutions from sequence comparisons.
Many substitution models have been proposed for nucleotide, amino acid and codon sequences. All substitution models used in molecular phylogenetics are Markov chain models characterized by 1) Either a transition probability matrix (P) with discrete time or a rate matrix (Q) in continuous time where P can be derived from Q, and 2) Equilibrium frequencies. The general form of an instantaneous rate matrix for nucleotide sequences is, in the order of A, G, C, and T:
$$Q\text{=}\begin{array}{c}A\\ G\\ C\\ T\end{array}\left[\begin{array}{cccc}-& a& b& c\\ g& -& d& e\\ h& i& -& f\\ j& k& l& -\end{array}\right]\text{(1)}$$
Transition probability matrix, often referred to as the P matrix, specifies the probability of a nucleotide or amino acid changing into another one after time t. It is needed to calculate likelihood and to derive evolutionary distances, and consequently is needed phylogenetics based on the maximum likelihood and distance-based methods as well as Bayesian inference. Whether a substitution model can be implemented for phylogenetic analysis essentially depends on whether the model’s transition probabilities can be calculated.
There are three ways to obtain transition probabilities from the Q matrix [1,2] :1) By probability reasoning, 2) By solving differential equations involving rates, and 3) By taking the matrix exponential of the rate matrix. The last two require some mathematical background in calculus and linear algebra. The first, in contrast, demands little mathematical skill except for careful book-keeping and solving simultaneous equations. This approach is particularly relevant to biological students not only for gaining a conceptual understanding of the substitution models, but also to deriving nice mathematical expressions for transition probabilities and evolutionary distances. New researchers often ask why we can derive evolutionary distances between two aligned sequences for the TN93 model [3] but cannot for the simpler HKY85 model [4] which is a special case of the TN93 model, yet another model, F84 (used in PHYLIP since 1984), which is also a special case of the TN93 model, can have its evolutionary distance readily derived. One can easily obtain answers to such questions by taking the first approach. However, the first approach is not of general purpose and cannot handle very complicated substitution models. In contrast, the last one can be used with any substitution models specified by a rate matrix from which the matrix exponential can be obtained. In short, all these approaches need to be learned by anyone wishing to become a molecular phylogeneticist, but this paper will focus only on the probability reasoning approach illustrated with JC69 [5], K80 [6], F84 (the model used in PHYLIP since 1984), HKY85 [4], and TN93 [3] models.
Felsenstein [1] presented nice examples of probability reasoning to derive transition probabilities and evolutionary distances from rate matrices. This section presents the approach in a more systematic and accessible way.
Consider nucleotide A in the JC69 model (Figure 1a). Imagine that the nucleotide has a rate α of changing into any of the four nucleotides, i.e., including changing to itself (Figure 1b). This is effectively the same specification as the JC69 model. After time t, the expected number of substitutions is 4αt and the probability of no substitution is p (x = 0, α, t) = e^{-4αt} according to the Poisson distribution, and the probability of having at least one change is then p (x ≥ 1, α, t) = 1-e^{-4αt} (Figure 1c). Because nucleotide A can change into any one of the four nucleotides (including nucleotide A itself), each nucleotide gets 1/4 of p (x ≥ 1, α, t). We therefore have in Figure 1.
$${p}_{ij}(t)\text{}=\text{}\frac{p(x\ge 1,\alpha ,t)}{4}\text{}=\text{}\frac{1}{4}-\frac{1}{4}{e}^{-4\alpha t}\text{(2)}$$
Figure 1: Derivation of transition probabilities and the evolutionary distance (D) based on the JC69 model. The d value in the diagonal of the rate matrix (a) is constrained by the row sum equal to 0, i.e., d = -3α. P(j|i,t) means the probability of changing from the original nucleotide i to nucleotide j after time t, and is synonymous to p_{ij}(t) or simply p_{ij} in this paper. View Figure 1
The transition probability p_{ii}(t) is the summation of two probabilities: the probability of no change (which is e ^{4αt}) and the probability of changing to itself which is the same as specified in Eq. (2), as shown in Figure 1e, i.e.,
$${p}_{ii}(t)\text{}=\text{}{e}^{-4\alpha t}+\frac{p(x\ge 1,\alpha ,t)}{4}\text{}=\text{}\frac{1}{4}+\frac{3}{4}{e}^{-4\alpha t}\text{(3)}$$
The transition probability matrix for the JC69 model has only two distinct elements. The four diagonal elements are the same as specified in Eq. (3) and all the off-diagonal elements are the same as specified in Eq. (2). Each row in P adds up to 1 as a nucleotide can either stay the same or change into some other nucleotides.
There are some quick ways to check the derived transition probabilities. First, we note that when t approaches infinity, then all entries in matrix P approaches ¼ if α > 0. This is what we have expected. Second, when t = 0, then all diagonal elements in matrix P are 1 and all off-diagonal elements are zero. This is again what we expected. Third, if α is zero, then no change is possible, and we again expect all diagonal elements in matrix P to be 1 and all off-diagonal elements to be zero, which is also true.
From p_{ij} in Eq. (2), the expected proportion of sites that are different between two aligned homologous sequences (p_{diff}) is 3^{*}p_{ij}(t), i.e.,
$${p}_{diff}\text{}=\text{}3{p}_{ij}(t)\text{}=\text{}\frac{3}{4}-\frac{3}{4}{e}^{-4\alpha t}\text{(4)}$$
Note that p_{diff} approaches ¾ when t is infinitely large, which means that multiple substitutions can no longer be corrected. Eq. (4) offers another way of deriving p_{ii}(t) in Eq. (3), i.e., p_{ii}(t) is simply 1-p_{diff}.
Eq. (4) allows us to derive the JC69 distance (D_{JC69}) because a distance is defined as μt where μ is the substitution rate which is equal to 3α in the JC69 model. This is the same as the distance that you have driven is the product of the speed (rate) and time. Given that D_{JC69} = 3αt, we can derive D_{JC69} (Figure 1g) by substituting αt = D_{JC69}/3 into Eq. (4), i.e.,
$${D}_{JC69}\text{}=\text{}-\frac{3}{4}\mathrm{ln}\left(1-\frac{4{p}_{diff}}{3}\right)\text{(5)}$$
Where p_{diff} (the expected number of sites that are different between the two homologous sequences) can be approximated by the observed proportion of sites (p_{diff.obs}) differing between the two aligned sequences. Note that p_{diff.obs} may differ from p_{diff} even when the underlying substitution model indeed follows JC69 because of 1) Stochastic factors due to limited aligned length of the two sequences, and 2) Distortion caused by suboptimal sequence alignment. Thus, although p_{diff} in Eq. (4) cannot be greater than 0.75, p_{diff.obs} could, even when sequences evolve strictly according to the JC69 model. D_{JC69} is not defined when p_{diff} ≥ 0.75 as there is no logarithm for 0 or negative values.
We can optionally show that D_{JC69} in Eq. (5) is a maximum likelihood distance. For two aligned sequences of length N, designate the number of sites that differ between the two sequences as N_{D} and the number of sites identical between the two sites as (N-N_{D}). Now the likelihood function is:
$L\text{}=\text{}{\left(\frac{1}{4}\right)}^{N}{p}_{ii}^{(N-{N}_{D})}{(1-{p}_{ii})}^{{N}_{D}}$
$\mathrm{ln}L\text{}=\text{}N\mathrm{ln}\left(\frac{1}{4}\right)+(N-{N}_{D})\mathrm{ln}({p}_{ii})+{N}_{D}\mathrm{ln}(1-{p}_{ii})$
$=N\mathrm{ln}\left(\frac{1}{4}\right)+(N-{N}_{D})\mathrm{ln}\left(\frac{1}{4}+\frac{3}{4}{e}^{-4{D}_{JC69}/3}\right)+{N}_{D}\mathrm{ln}\left(\frac{3}{4}-\frac{3}{4}{e}^{-4{D}_{JC69}/3}\right)\text{(6)}$
Where the constant term N^{*}ln(1/4) can be dropped in maximizing lnL to obtain the distance estimate, but needs to be kept when performing likelihood ratio test for comparing different substitution models (e.g., JC69 against TN93).
We take the derivative of lnL with respect to D_{JC69}, set the derivative to 0 and solve for D_{JC69}. The resulting D_{JC69} is exactly the same as that in Eq. (5). I used D instead of D_{JC69} in the equations below:
$$\frac{d\mathrm{ln}L}{dD}\text{}=\text{}-\frac{(N-{N}_{D}){e}^{-4D/3}}{\frac{1}{4}+\frac{3}{4}{e}^{-4D/3}}+\frac{{N}_{D}{e}^{-4D/3}}{\frac{3}{4}-\frac{3}{4}{e}^{-4D/3}}\text{}=\text{}0$$ $$D\text{}=\text{}-\frac{3}{4}\mathrm{ln}\left(\frac{3N-4{N}_{D}}{3N}\right)\text{}=\text{}-\frac{3}{4}\mathrm{ln}\left(1-\frac{4{p}_{diff}}{3}\right)\text{(7)}$$
The variance of D_{JC69} (designated as V_{JC69}) is obtained as the negative reciprocal of the second derivative of lnL:
$${V}_{JC69}\text{}=\text{}-\frac{1}{\frac{{d}^{2}\mathrm{ln}L}{d{D}_{JC69}^{2}}}\text{}=\text{}\frac{{p}_{diff}(1-{p}_{diff})}{L{\left(1-\frac{4{p}_{diff}}{3}\right)}^{2}}\text{(8)}$$
Note that V_{JC69} decreases with sequence length L as one would have expected. We illustrate the application of Eqs. (5) and (8) by using the aligned sequences in Figure 2 where N = 24, N_{D} = 6, and p_{diff} = 6/24 = 0.25. So D_{JC69} = 0.3041, and var (D_{JC69}) = 0.0176.
Figure 2: Two homologous sequences for illustrating computation of pairwise evolutionary distances. View Figure 2
The equilibrium frequencies of the π vector can be derived by set t = ∞ in Eqs. (2) and (3) which leads to p_{ii} = p_{ij} = ¼. This implies that equilibrium frequencies of the four nucleotides will be equal for the JC69 model. This is not surprising because the frequencies did not even appear in the rate matrix (Figure 1a).
The K80 model has a transition substitution rate α and a transversion rate β (Figure 3a). We will focus on nucleotide A and conceptualize the model with two events (Figure 3b), in contrast to only one event in the JC69 model. The first event (e_{1}) occurs when nucleotide A changes into any of the four nucleotides (including to itself). In other words, the original A is replace by a nucleotide randomly drawn from a nucleotide pool with equal nucleotide frequencies. This event occurs with a rate β. The second event (e_{2}) occurs when nucleotide A changes either to G or to itself, i.e., the original A is replace by a nucleotide randomly drawn from a purine pool with equal number of A and G. This e_{2} occurs with a rate γ. Thus the transition rate α equals β+γ according to this conceptualization. Note that, whenever e_{1} happens, the original nucleotide is replace by any one of the four nucleotides with equal probability, no matter how many e_{2} events has occur before or after the occurrence of e_{1}. It might help to think of a long sequence with L sites being A at time 0. If these L sites each have experienced at least one e_{2} event, then these sites will either be A or G with equal probability (i.e., 0.5), and we expect to have L/2 sites being A and the other L/2 sites being G. In contrast, if each of these L sites has experienced at least one e_{1} event, then the site will be replaced by either A, C, G, or T with equal probability, and we expect to observe A, C, G and T in L/4 sites each. Any e_{2} events occurring before or after the e_{1} event do not change this expectation. This means that e_{1} erases e_{2}, but not vice versa. The probability that an e_{2} event happened is informative only when no e_{1} event has happened.
Figure 3: Derivation of transition probabilities and the evolutionary distance (D) based on the K80 model. The rate matrix (a) has the diagonal elements (d) constrained by the row sum equal to 0, i.e., d = -α - 2β. P and Q are the observed proportion of transitional and transversional changes between two aligned homologous sequences. Equating them to their respective expected values, E(P) and E(Q), leads to the solution of αt and βt shown, and the evolutionary distance D. P(j|i,t) means the probability of changing from the original nucleotide i to nucleotide j after time t, and is synonymous to p_{ij}(t) or simply p_{ij} in this paper. View Figure 3
After time t, the expected number of substitutions is 2(α+β)t, i.e., the nucleotide A has two ways of change with a rate of α (to A and to G) and another two ways of change with a rate β (to C and to T), so the probability of no change, according to Poisson distribution, is
$$p({e}_{1}\text{}=\text{}0,{e}_{2}\text{}=\text{}0,t)\text{}=\text{}{e}^{-2(\alpha +\beta )t}\text{(9)}$$
Note that α is conceptualized as (β+γ) in Figure 3, so e^{-2(α+β)t} in Eq. (9) is equivalent to e^{-(4β+2γ)t}. The probability that at least one e_{1} event has occurred is
$$p({e}_{1}>0,t)\text{}=\text{}1-{e}^{-4\beta t}\text{(10)}$$
Thus, the probability that at least one e_{2} has occurred but e_{1} has not occurred is simply
$$p({e}_{2}>0,{e}_{1}\text{}=\text{}0,t)\text{}=\text{}1-p({e}_{1}\text{}=\text{}0,{e}_{2}\text{}=\text{}0,t)-p({e}_{1}0,t)$$ $$=1-{e}^{-2(\alpha +\beta )t}-(1-{e}^{-4\beta t})$$ $$={e}^{-4\beta t}-{e}^{-2(\alpha +\beta )t}\text{(11)}$$
These probabilities are also shown in Figure 3c. The reason for the condition that “e_{1} has not occurred” is because e_{1} event can erase e_{2} event (as we have discussed before).
Now the probability of the starting nucleotide A changing to G during time t, designated as p(G|A,t), is the summation of two probabilities. The first is 1/2 of the probability of p(e_{2} > 0, e_{1} = 0,t) in Eq. (11) because the other 1/2 is for A to itself. The second is 1/4 of p(e_{1} > 0,t) in Eq. (10) because A→A, A→G, A→C and A→T each get ¼, so only 1/4 of p(e_{1} > 0,t) is for A→G. The summation of these two probabilities (Figure 3d) is p(G|A,t). This probability is equal to p(A|G,t), p(C|T,t), and p(T|C,t) in the K80 model. In other words, the summation of these two probabilities is the probability of a transition (P_{s}) during time t. Thus,
$${P}_{s}\text{}=\text{}\frac{p({e}_{2}0,{e}_{1}=0,t)}{2}+\frac{p({e}_{1}0,t)}{4}$$ $$=\frac{{e}^{-4\beta t}-{e}^{-2(\alpha +\beta )t}}{2}+\frac{1-{e}^{-4\beta t}}{4}$$ $$=\frac{1}{4}+\frac{{e}^{-4\beta t}}{4}-\frac{{e}^{-2(\alpha +\beta )t}}{2}\text{}=\text{}P(G|A,t)\text{}=\text{}P(A|G,t)\text{}=\text{}P(T|C,t)\text{}=\text{}P(C|T,t)\text{(12)}$$
Similarly, the probability of the starting A changing to C (or to T) is 1/4 of p(e_{1} > 0,t) in Eq. (10) because 1/4 is for A→A, ¼ is for A→G and 1/4 is for A→T, so only 1/4 is for A→C (Figure 3d). This probability is the probability for a transversional change during time t,
$${P}_{v}\text{}=\text{}\frac{p({e}_{1}0,t)}{4}\text{}=\text{}\frac{1-{e}^{-4\beta t}}{4}\text{(13)}$$
As a quick check of the derived transition probabilities, we note that P_{s} and P_{v} are zero when t = 0 (or when α = 0 and β = 0). This also implies that all diagonal elements in the transition probability matrix are equal to 1, and is what we have expected. When t = ∞, with α > 0 and β > 0, all entries in matrix P approaches ¼ (the equilibrium frequency of the K80 model). This is also what we expected.
For two aligned homologous sequences, P_{s} can be approximated by the proportion of sites differing by a transition (P), and 2P_{v} by the portion of sites differing by a transversion (Q, Figure 3e). Note that the expected Q is equal to 2P_{v} because there are two ways of having a transversional change. Therefore,
$$P\text{}=\text{}\frac{1}{4}+\frac{{e}^{-4\beta t}}{4}-\frac{{e}^{-2(\alpha +\beta )t}}{2}\text{(14)}$$
$$Q\text{}=\text{}2{P}_{v}\text{}=\text{}2\left(\frac{1-{e}^{-4\beta t}}{4}\right)\text{}=\text{}\frac{1-{e}^{-4\beta t}}{2}\text{(15)}$$
We can now first solve for βt from Eq. (15), and then substitute the solution for βt into Eq. (14) to solve for αt. This leads to
$$\alpha t\text{}=\text{}-\frac{\mathrm{ln}(1-2P-Q)}{2}+\frac{\mathrm{ln}(1-2Q)}{4}$$ $$\beta t\text{}=\text{}-\frac{\mathrm{ln}(1-2Q)}{4}\text{(16)}$$
Recall that evolutionary distance is defined as μt, where μ is the substitution rate which is equal to (α+2β) in the K80 model. Thus, the evolutionary distance based on the K80 model (D_{K80}) is (α+2β)t, which comes to
$${D}_{K80}\text{}=\text{}\alpha t+2\beta t\text{}=\text{}-\frac{\mathrm{ln}(1-2P-Q)}{2}-\frac{\mathrm{ln}(1-2Q)}{4}\text{(17)}$$
Where P and Q can be approximated by the observed proportion of sites differing by a transition or a transversion from two aligned sequences, designated as P_{obs} and Q_{obs}. Similar to what I have mentioned with reference to D_{JC69}, P and Q may differ from P_{obs} and Q_{obs} even if the K80 model is followed during the sequence evolution. This is because 1) Limited aligned length of the two sequences may result in stochastic variation in P_{obs} and Q_{obs}, and 2) The two observed proportions may be distorted by alignment errors (i.e., misidentification of site homology). For example, two homologous sequences that have diverged for an infinite length of time according to the K80 model should have expected P and Q equal to 0.25 and 0.5, respectively. However, we may actually have P_{obs} > 0.25 or Q_{obs} > 0.5, which would render D_{K80} inapplicable. On the other hand, after sequence alignment and deletion of indels (because evolutionary distances are typically calculated without using sites with indels), P_{obs} and Q_{obs} may well be much smaller than the expected 0.25 and 0.5 leading to severer underestimation of the true distance. It is also possible to have P_{obs} and Q_{obs} values that, when used to replace P and Q in Eq. (16), result in negative αt or βt values that make no biological sense. The same applies to D_{JC69} (in fact to any evolutionary distances based on a substitution model). Methods for handling such situations are discussed later in the section on the GTR model.
We may optionally show D_{K80} in Eq. (17) to be a maximum likelihood estimator of the distance based on the K80 model, just like the K_{JC69} distance in Eq. (5). To see this, it is better to re-parameterize the K80 model by replacing αt and βt by D_{K80} and κ using the following relationship:
$$\begin{array}{l}{D}_{K80}\text{}=\text{}\alpha t+2\beta t\text{(18)}\\ \kappa \text{}=\text{}\alpha t/\beta t\end{array}$$
Solving these two equations gives us
$$\begin{array}{l}\alpha t\text{}=\text{}\frac{{D}_{K80}\kappa}{\kappa +2}\text{(19)}\\ \beta t\text{}=\text{}\frac{{D}_{K80}}{\kappa +2}\end{array}$$
Substituting αt and βt into Eqs. (14) and (15) so that P and Q will be functions of D_{K80} and κ, and the likelihood function for deriving D_{K80} and κ is
$$L\text{}=\text{}{\left(\frac{1}{4}\right)}^{N}{P}^{{N}_{s}}{Q}^{{N}_{v}}{(1-P-Q)}^{N-{N}_{s}-{N}_{v}}$$ $$\mathrm{ln}L\text{}=\text{}N\mathrm{ln}\left(\frac{1}{4}\right)+{N}_{s}\mathrm{ln}P+{N}_{v}\mathrm{ln}Q+(N-{N}_{s}-{N}_{v})\mathrm{ln}(1-P-Q)\text{(20)}$$
Where the constant term N^{*}ln(1/4) can be dropped in maximizing lnL to obtain the distance estimate, but need to be kept when performing likelihood ratio test for comparing different substitution models (e.g., K80 against TN93).
Taking partial derivatives with respect to D_{K80} and κ, setting them to zero and solving the simultaneous equations, we have
$${D}_{K80}\text{}=\text{}-\frac{\mathrm{ln}(1-2{P}_{obs}-{Q}_{obs})}{2}-\frac{\mathrm{ln}(1-2{Q}_{obs})}{4}\text{(21)}$$
$$\kappa \text{}=\text{}\frac{2\mathrm{ln}(1-2{P}_{obs}-{Q}_{obs})}{\mathrm{ln}(1-2{Q}_{obs})}-1\text{(22)}$$
Where P_{obs} = N_{s}/N, and Q_{obs} = N_{v}/N. Using the two aligned sequences in Figure 2, we have N = 24 and P_{obs} = 4/24 and Q_{obs} = 2/24. These lead to D_{K80} = 0.3151, and κ = 4.9126. It may be relevant to add that, while D_{JC69} and D_{K80} are maximum likelihood estimates, distance formulae for F84 and TN93 models, obtained in the same way by equating the observed substitutions to expected substitutions, are generally not maximum likelihood estimates. This will become clear when we deal with these models.
We have previously derived the variance of D_{JC69} as the negative reciprocal of the second derivative of lnL with respect to D_{JC69}. This can be used only when the log-likelihood function is used to estimate a single parameter. When there are multiple parameters (e.g., D_{K80} and κ), we cannot use the same approach unless the parameters are not correlated. There are two commonly used methods for deriving variances of parameters. The first is the delta method [7-9], and the second uses the Fisher information matrix to obtain the variances and covariance matrix for the parameters. The delta method, which often yields nice and clean mathematical expressions for the variance, is illustrated in the Appendix. The method using the Fisher information matrix is shown below.
To estimate variance involving multiple parameters such as D_{K80} and κ, we first take the second order partial derivatives of lnL with respective to D_{K80} and κ, substituting the estimated D_{K80} and κ in Eqs. (21) and (22) into the second-order partial derivatives, arranging them into what is called a Fisher information matrix (M_{FI}) below, and compute the matrix inverse of M_{FI} (designated by M_{FI} ^{-1}):
$${M}_{FI}\text{}=\text{}\left[\begin{array}{cc}-\frac{{\partial}^{2}\mathrm{ln}L}{\partial {\kappa}^{2}}& -\frac{{\partial}^{2}\mathrm{ln}L}{\partial \kappa \text{\hspace{0.17em}}\partial {D}_{K80}}\\ -\frac{{\partial}^{2}\mathrm{ln}L}{\partial {D}_{K80}\partial \kappa}& -\frac{{\partial}^{2}\mathrm{ln}L}{\partial {D}_{K80}^{2}}\end{array}\right]\text{(23)}$$
The diagonal elements of M_{FI}^{-1} are the variances for κ and D_{K80}, and the off-diagonal elements of M_{FI}^{-1} are covariances. The mathematical expression for the variance of κ is tedious, but that for the variance of D_{K80} is simpler:
$$V({D}_{K80})\text{}=\text{}\frac{{a}^{2}P+{c}^{2}Q-{(aP+cQ)}^{2}}{N}\text{,where}$$ $$a\text{}=\text{}\frac{1}{1-2P-Q}\text{,}b\text{}=\text{}\frac{1}{1-2Q},\text{\hspace{0.17em}}c\text{}=\text{}\frac{a+b}{2}\text{(24)}$$
With the aligned sequences in Figure 2, we have N = 24 and empirical P = 4/24 and Q = 2/24. These lead to D_{K80} = 0.3151, and κ = 4.9126. The M_{FI} and M_{FI} ^{-1} are
$${M}_{FI}\text{}=\text{}\left[\begin{array}{cc}0.0\text{47435}& -0.286795\\ -0.286795& 49.641105\end{array}\right]$$ $${M}_{FI}^{-1}\text{}=\text{}\left[\begin{array}{cc}\text{21}\text{.84451668}& \text{0}\text{.126204037}\\ \text{0}\text{.126204037}& \text{0}\text{.020873724}\end{array}\right]\text{(25)}$$
Where the two parameters are in the order of κ and D_{K80}, i.e., the variance is 21.8445 for κ and 0.0209 for D_{K80}. The off-diagonal elements are covariances between the two parameters.
The F84 and HKY85 model accommodate not only the differential substitution rates between transitions and transversions, but also different equilibrium nucleotide frequencies, in contrast to JC69 and K80 which assume equal equilibrium nucleotide frequencies. The same probabilistic reasoning used before can be applied to derive transition probabilities for the HKY80 model.
The rate matrix for the F84 model is
$${Q}_{F84}\text{=}\begin{array}{c}A\\ G\\ C\\ T\end{array}\left[\begin{array}{cccc}-& \beta {\pi}_{G}+\gamma {\pi}_{G}/{\pi}_{R}& \beta {\pi}_{C}& \beta {\pi}_{T}\\ \beta {\pi}_{A}+\gamma {\pi}_{A}/{\pi}_{R}& -& \beta {\pi}_{C}& \beta {\pi}_{T}\\ \beta {\pi}_{A}& \beta {\pi}_{G}& -& \beta {\pi}_{T}+\gamma {\pi}_{T}/{\pi}_{Y}\\ \beta {\pi}_{A}& \beta {\pi}_{G}& \beta {\pi}_{C}+\gamma {\pi}_{C}/{\pi}_{Y}& -\end{array}\right]\text{(26)}$$
Where π_{A}, π_{G}, π_{C} and π_{T} are equilibrium frequencies, π_{R} and π_{Y} are frequencies of purines and pyrimidines, and the diagonal elements are constrained by each row summing up to 0. The parameter γ in Eq. (26) is sometimes replaced by κβ, but it is easier to understand the F84 model by using Q_{F84} specified in Eq. (26).
We may view the F84 model as featuring two events (e_{1} and e_{2}). Suppose we start with a nucleotide A. Event e_{1} occurs with rate β. When it occurs, the original A will be replaced by a nucleotide drawn randomly from a nucleotide pool in which the nucleotide frequencies are the same as the equilibrium frequencies. This means that the original A has a rate of βπ_{A}, βπ_{G}, βπ_{C} and βπ_{T} to change to A, G, C and T, respectively, when e_{1} occurs. This is different from the K80 model where, when e_{1} occurs, the original A has a rate of 0.25 to change to any of the four nucleotides. Event e_{2} has a rate of γ to occur, and will result in the original A being replaced by a purine drawn randomly from a purine pool with A and G frequencies specified as π_{A}/R and π_{G}/R. Thus, the original A has a rate of γπ_{A}/π_{R} and γπ_{G}/π_{R} to change to A and G when e_{2} occurs. This again differs from K80 where the original A has a rate of 0.5 of changing to A or Gwhen e_{2} occurs. These events are illustrated in Figure 4a, where we use x to represent β+γ/π_{R}. Note that it is not a good idea to use α to represent β+γ/π_{R} for two reasons. First, if we had started with a nucleotide C or T instead of A, then we would have β+γ/π_{Y} instead of β+γ/π_{R} which would force us to use α_{1} and α_{2} to distinguish between the two. A casual reader will then be misled to think that F84 has three rate parameters (i.e., β, α_{1} and α_{2}) without knowing that α_{1} and α_{2} are used as different functions of the same rate parameter γ. Second, I have reserved α to represent β+γ in Figure 4b which simplifies the derivation of transition probabilities illustrated in Figure 4.
Figure 4: Derivation of transition probabilities based on the F84 model. π_{A}, π_{G},π_{C}, π_{T} and π_{R}, π_{Y} are equilibrium frequencies of A, C, G, T, purine (A + G) and pyrimidine (C + T). Event e_{1} occurs at a rate of β and leads to the original A being replaced by any of the four nucleotides according to their equilibrium frequencies, and event e_{2} occurs at a rate of γ and results in the original A being replaced by either A or G according to their frequencies in the purine pool, i.e., π_{A}/π_{R}, and π_{G}/π_{R}. I used x as a shorthand for β+γ/π_{R}. The rate γ does not appear in the final transition probabilities because it has been absorbed into α which equals β+γ shown in (b). P(j|i,t) means the probability of changing from the original nucleotide i to nucleotide j after time t, and is synonymous to p_{ij}(t) or simply p_{ij} in this paper. View Figure 4
Note that whenever event e_{1} happens, the original A is replace by A, C, G and T with probabilities π_{A}, π_{G}, π_{C} and π_{T}, no matter how many e_{2} events has occur before or after the occurrence of e_{1}. This is similar to the scenario involving the K80 model, except that the K80 model assumes equal nucleotide frequencies. It might help to think of a long sequence with L sites being A at time 0. If these L sites each have experienced at least one e_{2} event, then these sites will either be A or G with probabilities π_{A} and π_{G}, respectively, and we expect to have π_{A}L sites being A and π_{G}L sites being G. In contrast, if each of these L sites has experienced at least one e_{1} event, then the site will be replaced by A, C, G, or T with probabilities π_{A}, π_{G}, π_{C} and π_{T}, and we expect to observe A, C, G and T in π_{A}L, π_{G}L, π_{C}L, and π_{T}L sites, respectively. Any number of e_{2} events occurring before or after the e_{1} event does not change this expectation. This means that e_{1} erases e_{2}, but not vice versa. The occurrence of an e_{2} event is informative only when no e_{1} event has happened.
After time t, the total flow of the original A to the four nucleotides (including itself, Figure 4a and Figure 4b) is
$${\pi}_{A}x+{\pi}_{G}x+\beta {\pi}_{C}+\beta {\pi}_{T}\text{}=\text{}{\pi}_{R}x+{\pi}_{Y}\beta \text{}=\text{}{\pi}_{R}(\beta +\gamma /{\pi}_{R})+{\pi}_{Y}\beta \text{}=\text{}\beta +\gamma \text{}=\text{}\alpha \text{(27)}$$
So the probability that no substitution has happened during time t (Figure 4c), according to Poisson distribution, is
$$p({e}_{1},{e}_{2}\text{}=\text{}0,t)\text{}=\text{}{e}^{-\alpha t}\text{(28)}$$
The rate of A changing to A, G, C, and T through e_{1} is βπ_{A} + βπ_{G} + βπ_{C} + βπ_{T} = β, so the probability that at least one e_{1} has occurred during time t is
$$p({e}_{1}>0,t)\text{}=\text{}1-{e}^{-\beta t}\text{(29)}$$
The probability that e_{2} has happened but e_{1} has not is then
$$p({e}_{2}>0,{e}_{1}\text{}=0,t)\text{}=\text{}1-p({e}_{1},{e}_{2}\text{}=\text{}0,t)-p({e}_{1}0,t)\text{}=\text{}{e}^{-\beta t}-{e}^{-\alpha t}\text{(30)}$$
The reason for the condition that “e_{1} has not occurred” is because e_{1} event can erase e_{2} event. With these, it is easy to derive transition probability from A to G (Figure 4d) as the summation of 1) A fraction of π_{G} of p(e_{1} > 0,t), which is the probability of e_{1} event that results in the original A being replaced by A, C, G, and T with probabilities π_{A}, π_{G}, π_{C} and π_{T}, and 2) A fraction of π_{G}/π_{R} of p(e_{2} > 0,e_{1} = 0,t), which is the probability that e_{2} events not erased by e_{1}. That is,
$$p(G|A,t)\text{}=\text{}p({e}_{1}0,t){\pi}_{G}+\frac{p({e}_{2}0,{e}_{1}=0,t){\pi}_{G}}{{\pi}_{R}}\text{}={\pi}_{G}+\frac{{\pi}_{G}{\pi}_{Y}{e}^{-\beta t}}{{\pi}_{R}}-\frac{{\pi}_{G}{e}^{-\alpha t}}{{\pi}_{R}}\text{(31)}$$
From now on, p(j|i,t) will be written simply as p_{ij}, so p(G|A,t) is p_{AG}. With the same reasoning, we can derive transition probabilities for other A↔G and C↔T substitutions. Note that the two rate parameters in the F84 model (β and γ) have been re-parameterized into α (= β + γ) and β in Eq. (31). The transition probability from the original A to C (a transversion, Figure 4e) is simply
$${p}_{AC}\text{}=\text{}{\pi}_{C}p({e}_{1}0,t)\text{}=\text{}{\pi}_{C}(1-{e}^{-\beta t})\text{(32)}$$
For other transversions, e.g., p_{AT}, one just need to replace π_{C} by π_{T}. The complete transition probability matrix for the F84 model is
$${P}_{F84}\text{=}\begin{array}{c}A\\ G\\ C\\ T\end{array}\left[\begin{array}{cccc}{\pi}_{A}+{\pi}_{A}{\pi}_{Y}{x}_{1}+{\pi}_{G}{x}_{2}& {\pi}_{G}+{\pi}_{G}{\pi}_{Y}{x}_{1}-{\pi}_{G}{x}_{2}& {\pi}_{C}\left(1-{e}^{-\beta t}\right)& {\pi}_{T}\left(1-{e}^{-\beta t}\right)\\ {\pi}_{A}+{\pi}_{A}{\pi}_{Y}{x}_{1}-{\pi}_{A}{x}_{2}& {\pi}_{G}+{\pi}_{G}{\pi}_{Y}{x}_{1}+{\pi}_{A}{x}_{2}& {\pi}_{C}\left(1-{e}^{-\beta t}\right)& {\pi}_{T}\left(1-{e}^{-\beta t}\right)\\ {\pi}_{A}\left(1-{e}^{-\beta t}\right)& {\pi}_{G}\left(1-{e}^{-\beta t}\right)& {\pi}_{C}+{\pi}_{C}{\pi}_{R}{x}_{3}+{\pi}_{G}{x}_{4}& {\pi}_{T}+{\pi}_{T}{\pi}_{R}{x}_{3}-{\pi}_{T}{x}_{4}\\ {\pi}_{A}\left(1-{e}^{-\beta t}\right)& {\pi}_{G}\left(1-{e}^{-\beta t}\right)& {\pi}_{C}+{\pi}_{C}{\pi}_{R}{x}_{3}-{\pi}_{C}{x}_{4}& {\pi}_{T}+{\pi}_{T}{\pi}_{R}{x}_{3}+{\pi}_{C}{x}_{4}\end{array}\right]\text{(33)}$$
Where
$${x}_{1}\text{}=\text{}\frac{{e}^{-\beta t}}{{\pi}_{R}},{x}_{2}\text{}=\text{}\frac{{e}^{-\alpha t}}{{\pi}_{R}},{x}_{3}\text{}=\text{}\frac{{e}^{-\beta t}}{{\pi}_{Y}},{x}_{4}\text{}=\text{}\frac{{e}^{-\alpha t}}{{\pi}_{Y}}\text{(34)}$$
As a quick check of the transition probabilities, we first note that when t = 0 (or when α = 0 and β = 0), then the diagonal elements are 1 and all off-diagonal elements are 0, which is what we expected. Second, when t = ∞ with α > 0 and β > 0, then the transition probabilities will approach the equilibrium frequencies, which is also what we expected.
To obtain the distance for the F84 model (D_{F84}), recall that a distance is defined as μt where μ is the average substitution rate, i.e., substitution rates in Eq. (26) weighted by the equilibrium frequencies:
$${D}_{F84}\text{}=\text{}2{\pi}_{A}{\pi}_{G}(\beta t+\gamma t/{\pi}_{R})+2{\pi}_{T}{\pi}_{C}(\beta t+\gamma t/{\pi}_{Y})+2{\pi}_{Y}{\pi}_{R}\beta t\text{(35)}$$
Now we need to obtain βt and γt in order to calculate D_{F84}. We can obtain αt and βt, and then obtain γt = αt - βt, remembering that α = β + γ (Figure 4b and Eq. (27)]. The method we will use is the same as that for the K80 model, i.e., we obtain the expected transitions and transversions, designated E(S) and E(V), respectively, from transition probabilities and equate them to the observed S and V to solve for αt and βt. With the property of time reversibility (e.g., π_{A}•p_{AG} = π_{G}•pGA), we have
$$E(S)\text{}=\text{}2{\pi}_{A}{p}_{AG}+2{\pi}_{C}{p}_{CT}$$ $$E(V)\text{}=\text{}2{\pi}_{A}{p}_{AT}+2{\pi}_{A}{p}_{AC}+2{\pi}_{G}{p}_{GC}+2{\pi}_{G}{p}_{GT}\text{(36)}$$
Equating E(S) and E(V) to the observed S and V, and solving these two equations with the two unknowns (αt and βt), we have
$$\alpha t\text{}=\text{}\mathrm{ln}\left(\frac{-2({\pi}_{A}{\pi}_{G}{\pi}_{R}{\pi}_{Y}^{2}+{\pi}_{C}{\pi}_{T}{\pi}_{R}^{2}{\pi}_{Y})}{S{\pi}_{R}^{2}{\pi}_{Y}^{2}-2{\pi}_{A}{\pi}_{G}{\pi}_{R}{\pi}_{Y}^{2}-2{\pi}_{C}{\pi}_{T}{\pi}_{R}^{2}{\pi}_{Y}+\left({\pi}_{A}{\pi}_{G}{\pi}_{Y}^{2}+{\pi}_{C}{\pi}_{T}{\pi}_{R}^{2}\right)V}\right)\text{(37)}$$
$$\beta t\text{}=\text{}-\mathrm{ln}\left(1-\frac{V}{2{\pi}_{R}{\pi}_{Y}}\right)\text{(38)}$$
Substitute βt and γt (= αt - βt) into Eq. (35) and, after some algebraic manipulation, we have a more useful form of D_{F84}:
$${D}_{F84}\text{}=\text{}\frac{2}{{\pi}_{R}{\pi}_{Y}}[-({\pi}_{A}{\pi}_{G}+{\pi}_{C}{\pi}_{T}){\pi}_{R}{\pi}_{Y}\mathrm{ln}({x}_{1})+{\pi}_{C}{\pi}_{T}{\pi}_{R}\mathrm{ln}\left(\frac{{x}_{2}}{{x}_{3}}\right)+\text{}{\pi}_{A}{\pi}_{G}{\pi}_{Y}\mathrm{ln}\left(\frac{{x}_{2}}{{x}_{3}}\right)-{\pi}_{R}^{2}{\pi}_{Y}^{2}\mathrm{ln}\left({x}_{1}\right)]\text{(39)}$$
Where
${x}_{1}=\text{}1-\frac{V}{2{\pi}_{R}{\pi}_{Y}}$
${x}_{2}\text{}=\text{}({\pi}_{A}{\pi}_{G}{\pi}_{Y}+{\pi}_{C}{\pi}_{T}{\pi}_{R})(2{\pi}_{R}{\pi}_{Y}-V)$
${x}_{3}\text{}=\text{}-S{\pi}_{R}^{2}{\pi}_{Y}^{2}+2{\pi}_{A}{\pi}_{G}{\pi}_{R}{\pi}_{Y}^{2}+2{\pi}_{C}{\pi}_{T}{\pi}_{Y}{\pi}_{R}^{2}-{\pi}_{A}{\pi}_{G}{\pi}_{Y}^{2}V-{\pi}_{C}{\pi}_{T}{\pi}_{R}^{2}V\text{(40)}$
To illustrate the calculation of D_{F84}, we may use the two aligned sequences in Figure 2 which gives us π_{A} = 6/48, π_{C} = 12/48, π_{G} = 10/48, π_{T} = 20/48, S = 4/24, V = 2/24, αt = 0.5778363341, βt = 0.2076393648, γt = αt - βt = 0.3701969693, and D_{F84} = 0.3198867427. The variance of the D_{F84} can be obtained by either the delta method or the method using Fisher information matrix.
A substitution model similar to the F84 model is the HKY85 model, with its rate matrix specified as:
$${Q}_{HKY85}\text{=}\begin{array}{c}A\\ G\\ C\\ T\end{array}\left[\begin{array}{cccc}-& \left(\beta +\gamma \right){\pi}_{G}& \beta {\pi}_{C}& \beta {\pi}_{T}\\ \left(\beta +\gamma \right){\pi}_{A}& -& \beta {\pi}_{C}& \beta {\pi}_{T}\\ \beta {\pi}_{A}& \beta {\pi}_{G}& -& \left(\beta +\gamma \right){\pi}_{T}\\ \beta {\pi}_{A}& \beta {\pi}_{G}& \left(\beta +\gamma \right){\pi}_{C}& -\end{array}\right]\text{(41)}$$
Where (β+γ) is often written as α and the diagonal elements are constrained by each row summing up to 0. The HKY85 model and the F84 model differ only in the specification of rates involving transitions. qAG and qCT are π_{G}(β+γ) and π_{T}(β+γ) in the HKY85 model specified in Eq. (41), in contrast to π_{G}(β+γ/π_{R}) and π_{T}(β+γ/π_{Y}), respectively, in the F84 model specified in Eq. (26). By comparing these rates, it becomes obvious that the F84 model would be equivalent to the HKY85 model if π_{R} = π_{Y}.
We can obtain the transition probabilities for the HKY85 model in the same way as that for the F84 model. In short, we again start with a nucleotide A and envision two events e_{1} and e_{2}. Event e_{1} occurs with rate β, and results in the original A replaced by any of the four nucleotides with probabilities equal to their respective equilibrium frequencies. Event e_{2} occurs with a rate γ and results in the original A being replaced by either A or G with the probabilities equal to their respective equilibrium frequencies. Fictionalized in this way, the expected number of substitutions after time t is β(π_{A}+π_{G}+π_{C}+π_{T}) + γ(π_{A}+π_{G}) = β+γR. According to the Poisson distribution, the probability that no substitution has happened during time t is
$$p({e}_{1},{e}_{2}\text{}=\text{}0,t)\text{}=\text{}1-{e}^{-(\beta +\gamma R)}\text{(42)}$$
The probability that at least one e_{1} occurred after time t is
$$p({e}_{1}>0,t)\text{}=\text{}1-{\mathrm{e}}^{-\beta t}\text{(43)}$$
The probability that e_{2} has occurred but e_{1} has not is
$$p({e}_{2}>0,{e}_{1}\text{}=\text{}0,t)\text{}=\text{}1-p({e}_{1},{e}_{2}\text{}=\text{}0,t)-p({e}_{1}0,t)\text{}=\text{}{e}^{-\beta t}-{e}^{-(\beta +\gamma R)}\text{(44)}$$
The transition probability p(G|A,t), abbreviated as p_{AG}, is
$${p}_{AG}\text{}=\text{}{\pi}_{G}p({e}_{1}0,t)+\frac{{\pi}_{G}}{{\pi}_{R}}p({e}_{2}0,{e}_{1}\text{}=\text{}0,t)\text{}=\text{}{\pi}_{G}+\frac{{\pi}_{G}{\pi}_{Y}{e}^{-\beta t}}{{\pi}_{R}}-\frac{{\pi}_{G}{e}^{-(\beta +{\pi}_{R}\gamma )t}}{{\pi}_{R}}\text{(45)}$$
In the same way, we can derive other transition probabilities which are shown below:
$${P}_{HKY}\text{=}\begin{array}{c}A\\ G\\ C\\ T\end{array}\left[\begin{array}{cccc}{\pi}_{A}+{\pi}_{A}{x}_{1}+{\pi}_{G}{x}_{2}& {\pi}_{G}+{\pi}_{G}{x}_{1}-{\pi}_{G}{x}_{2}& {\pi}_{C}\left(1-{e}^{-\beta t}\right)& {\pi}_{T}\left(1-{e}^{-\beta t}\right)\\ {\pi}_{A}+{\pi}_{A}{x}_{1}-{\pi}_{A}{x}_{2}& {\pi}_{G}+{\pi}_{G}{x}_{1}+{\pi}_{A}{x}_{2}& {\pi}_{C}\left(1-{e}^{-\beta t}\right)& {\pi}_{T}\left(1-{e}^{-\beta t}\right)\\ {\pi}_{A}\left(1-{e}^{-\beta t}\right)& {\pi}_{G}\left(1-{e}^{-\beta t}\right)& {\pi}_{C}+{\pi}_{C}{x}_{3}+{\pi}_{T}{x}_{4}& {\pi}_{T}+{\pi}_{T}{x}_{3}-{\pi}_{T}{x}_{4}\\ {\pi}_{A}\left(1-{e}^{-\beta t}\right)& {\pi}_{G}\left(1-{e}^{-\beta t}\right)& {\pi}_{C}+{\pi}_{C}{x}_{3}-{\pi}_{C}{x}_{4}& {\pi}_{T}+{\pi}_{T}{x}_{3}+{\pi}_{C}{x}_{4}\end{array}\right]\text{(46)}$$
Where
$${x}_{1}\text{}=\text{}\frac{{\pi}_{Y}{e}^{-\beta t}}{{\pi}_{R}};{x}_{2}\text{}=\text{}\frac{{e}^{-(\beta +{\pi}_{R}\gamma )t}}{{\pi}_{R}};{x}_{3}\text{}=\text{}\frac{{\pi}_{R}{e}^{-\beta t}}{{\pi}_{Y}};{x}_{4}\text{}=\text{}\frac{{e}^{-(\beta +{\pi}_{Y}\gamma )t}}{{\pi}_{Y}}\text{(47)}$$
As a quick check of the transition probabilities, we first note that when t = 0 (or when α = 0 and β = 0), then the diagonal elements are 1 and all off-diagonal elements are 0, which is what we expected. Second, when t approaches infinity with β > 0 and γ > 0, then the transition probabilities will approach the equilibrium frequencies, which is also what we expected.
We cannot derive the distance for the HKY85 model by following the same approach as that for the F84 model. Hasegawa, et al. [4] has tried this approach but were not successful because there is no explicit solution for βt and γt. However, if we treat the A↔G transition and C↔T transition separate, then we can solve for βt and γt [10]. In other words, we obtain one set of βt and γt from observed A↔G transitions and transversions, and another set of βt and γt from observed C↔T transitions and transversions. βt in the two sets are the same as that in Eq. (38), but γt is different between the two sets of estimates. We can then take a weighted average of γt. Admittedly, this does sound mathematically clumsy and explains why HKY85, while commonly used in phylogenetic analysis involving a likelihood framework or Bayesian inference, is almost never used in distance-based phylogenetics.
Here is the somewhat circuitous protocol to get βt and γt from HKY85. The expected numbers of A↔G and C↔T transitions, designated S_{R} and S_{Y}, respectively, and transversions are
$E({S}_{R})\text{}=\text{}2{\pi}_{A}{p}_{AG}$
$E({S}_{Y})\text{}=\text{}2{\pi}_{C}{p}_{CT}$
$E(V)\text{}=\text{}2{\pi}_{A}{p}_{AT}+2{\pi}_{A}{p}_{AC}+2{\pi}_{G}{p}_{GC}+2{\pi}_{G}{p}_{GT}\text{(48)}$
Setting E(S_{R}) and E(V) to their the observed S_{R} and V, and solve for βt and γt, we have
$\beta t\text{}=\text{}-\mathrm{ln}\left(1-\frac{V}{2{\pi}_{R}{\pi}_{Y}}\right)$
${\gamma}_{R}t\text{}=\text{}\frac{1}{{\pi}_{R}}\mathrm{ln}\left(\frac{{\pi}_{A}{\pi}_{G}(2{\pi}_{R}{\pi}_{Y}-V)}{2{\pi}_{A}{\pi}_{G}{\pi}_{R}{\pi}_{Y}-{S}_{R}{\pi}_{Y}{\pi}_{R}^{2}-{\pi}_{A}{\pi}_{G}{\pi}_{Y}V}\right)\text{(49)}$
Where βt is the same as that in Eq. (38), and γ_{R}t in Eq. (49) is γt estimated from observed S_{R} and V.
Now we obtain another set of solutions for βt and γt by setting E(S_{Y}) and E(V) to their observed S_{Y} and V, and solve for βt and γt, we have the same βt but a different γt:
$${\gamma}_{Y}t\text{}=\text{}\frac{1}{{\pi}_{Y}}\mathrm{ln}\left(\frac{{\pi}_{C}{\pi}_{T}(2{\pi}_{R}{\pi}_{Y}-V)}{2{\pi}_{C}{\pi}_{T}{\pi}_{R}{\pi}_{Y}-{S}_{Y}{\pi}_{R}{\pi}_{Y}^{2}-{\pi}_{C}{\pi}_{T}{\pi}_{R}V}\right)\text{(50)}$$
A weighted average of γt could be
$$\gamma t\text{}=\text{}{\pi}_{R}{\gamma}_{R}t+{\pi}_{Y}{\gamma}_{Y}t\text{(51)}$$
The distance for the HKY model
$${D}_{HKY85}\text{}=\text{}\mu t\text{}=\text{}2{\pi}_{A}{\pi}_{G}(\beta t+\gamma t)+2{\pi}_{T}{\pi}_{C}(\beta t+\gamma t)+2{\pi}_{Y}{\pi}_{R}\beta t\text{(52)}$$
To compute D_{HY85} using the two aligned sequences in Figure 2, we have π_{A} = 6/48, π_{C} = 12/48, π_{G} = 10/48, π_{T} = 20/48, S_{Y} = 4/24, S_{R} = 0, V = 2/24, βt = 0.2076393648, γ_{R}t = -0.2223239164, γ_{Y}t = 1.047432870, weighted γt = 0.624180608, D_{HKY85} = 0.308904. I intentionally choose the aligned sequences in Figure 2 with S_{R} = 0 just to see if D_{HKY85} would behave strangely. It did not. For comparison, the same two sequences yield D_{F84} = 0.319887.
In general, D_{HKY85} is slightly smaller than D_{F84}. I used the eight vertebrate COI sequences in the FASTA file VertCOI.fas that comes with DAMBE [11] to compute both D_{HKY85} and D_{F84} (Figure 5). The difference is minor, although D_{HKY85} is consistently but slightly smaller than D_{F84}.
Figure 5: Evolutionary distances from the HKY85 and F84 models are nearly identical. View Figure 5
The HKY85 model itself may not carry much biological significance given the existence of the F84 model. However, the twists involved in computing the evolutionary distance, i.e., the separate estimation of γA↔G and γC↔T, lead very naturally to a very useful TN93 model that we will cover next.
We have come far, so far that we need hardly any extra effort to derive transition probabilities for the TN93 model. There are two equivalent specifications of the rate matrix for the TN93 model. The first is
$${Q}_{TN93}\text{=}\begin{array}{c}A\\ G\\ C\\ T\end{array}\left[\begin{array}{cccc}-& \beta {\pi}_{G}+{\gamma}_{R}{\pi}_{G}/{\pi}_{R}& \beta {\pi}_{C}& \beta {\pi}_{T}\\ \beta {\pi}_{A}+{\gamma}_{R}{\pi}_{A}/{\pi}_{R}& -& \beta {\pi}_{C}& \beta {\pi}_{T}\\ \beta {\pi}_{A}& \beta {\pi}_{G}& -& \beta {\pi}_{T}+{\gamma}_{Y}{\pi}_{T}/{\pi}_{Y}\\ \beta {\pi}_{A}& \beta {\pi}_{G}& \beta {\pi}_{C}+{\gamma}_{Y}{\pi}_{C}/{\pi}_{Y}& -\end{array}\right]\text{(53)}$$
Where the diagonal elements are constrained by each row summing up to 0. The second specification simply replaces (β + γR/π_{R}) by α_{1} and (β + γY/π_{Y}) by α_{2}. We see that TN93 is reduced to F84 if γR = γY, and to HKY85 if γR / π_{R} = γY / π_{Y}.
The similarity between TN93 and F84 allows us to re-use Figure 4 for deriving transition probabilities for TN93. We only need to add a subscript R to γ and α in Figure 4 so that we have γ_{R} and α_{R} as rates for purine, keeping everything else the same, and we instantly obtain the transition probabilities for transitional substitutions between purines and for transversional substitutions as shown in Figure 4. To get transition probabilities between pyrimidines, we can just replace the original nucleotide A in Figure 4 by nucleotide C or T and rename γ and α in Figure 4 to γ_{Y} and α_{Y}. Note that our α_{R} = β+γ_{R}, and α_{Y} = β+γ_{Y}.
The transition probability matrix for the TN93 model is
$${P}_{TN93}\text{=}\begin{array}{c}A\\ G\\ C\\ T\end{array}\left[\begin{array}{cccc}{\pi}_{A}+{\pi}_{A}{\pi}_{Y}{x}_{1}+{\pi}_{G}{x}_{2}& {\pi}_{G}+{\pi}_{G}{\pi}_{Y}{x}_{1}-{\pi}_{G}{x}_{2}& {\pi}_{C}\left(1-{e}^{-\beta t}\right)& {\pi}_{T}\left(1-{e}^{-\beta t}\right)\\ {\pi}_{A}+{\pi}_{A}{\pi}_{Y}{x}_{1}-{\pi}_{A}{x}_{2}& {\pi}_{G}+{\pi}_{G}{\pi}_{Y}{x}_{1}+{\pi}_{A}{x}_{2}& {\pi}_{C}\left(1-{e}^{-\beta t}\right)& {\pi}_{T}\left(1-{e}^{-\beta t}\right)\\ {\pi}_{A}\left(1-{e}^{-\beta t}\right)& {\pi}_{G}\left(1-{e}^{-\beta t}\right)& {\pi}_{C}+{\pi}_{C}{\pi}_{R}{x}_{3}+{\pi}_{G}{x}_{4}& {\pi}_{T}+{\pi}_{T}{\pi}_{R}{x}_{3}-{\pi}_{T}{x}_{4}\\ {\pi}_{A}\left(1-{e}^{-\beta t}\right)& {\pi}_{G}\left(1-{e}^{-\beta t}\right)& {\pi}_{C}+{\pi}_{C}{\pi}_{R}{x}_{3}-{\pi}_{C}{x}_{4}& {\pi}_{T}+{\pi}_{T}{\pi}_{R}{x}_{3}+{\pi}_{C}{x}_{4}\end{array}\right]\text{(54)}$$
Where x_{1} and x_{3} are the same as those in Eq. (34), but x_{2} has α replaced by α_{R} and x_{4} has α replaced by α_{Y}, i.e.,
$${x}_{1}\text{}=\text{}\frac{{e}^{-\beta t}}{{\pi}_{R}},{x}_{2}\text{}=\text{}\frac{{e}^{-{\alpha}_{R}t}}{{\pi}_{R}},{x}_{3}\text{}=\text{}\frac{{e}^{-\beta t}}{{\pi}_{Y}},{x}_{4}\text{}=\text{}\frac{{e}^{-{\alpha}_{Y}t}}{{\pi}_{Y}}\text{(55)}$$
To obtain the distance for the TN93 model (D_{TN93}), recall that a distance is defined as μt where μ is the average substitution rate, i.e., substitution rates in Eq. (53) weighted by the equilibrium frequencies, so:
$${D}_{TN93}\text{}=\text{}2{\pi}_{A}{\pi}_{G}(\beta t+{\gamma}_{R}t/{\pi}_{R})+2{\pi}_{T}{\pi}_{C}(\beta t+{\gamma}_{Y}t/{\pi}_{Y})+2{\pi}_{Y}{\pi}_{R}\beta t\text{(56)}$$
Now we need to obtain α_{R}t, &α_{Y}t, and βt. The method we will use is the same as that for the K80 and F84 models, i.e., we obtain the expected numbers of A↔G transitions, C↔T transitions, and transversions, designated E(S_{R}), E(S_{Y}) and E(V), respectively, from transition probabilities, and equate them to the observed S_{R}, S_{Y} and V to solve for α_{R}t, &α_{Y}t, and βt:
$E({S}_{R})\text{}=\text{}2{\pi}_{A}{p}_{AG}\text{}=\text{}{S}_{R}$
$E({S}_{Y})\text{}=\text{}2{\pi}_{C}{p}_{CT}\text{}=\text{}{S}_{Y}$
$E(V)\text{}=\text{}2{\pi}_{A}{p}_{AT}+2{\pi}_{A}{p}_{AC}+2{\pi}_{G}{p}_{GC}+2{\pi}_{G}{p}_{GT}\text{}=\text{}V\text{(57)}$
The resulting α_{R}t, &α_{Y}t, and βt are
$${a}_{R}t\text{}=\text{}\mathrm{ln}\left(\frac{2{\pi}_{A}{\pi}_{G}{\pi}_{R}{\pi}_{Y}}{2{\pi}_{A}{\pi}_{G}{\pi}_{R}{\pi}_{Y}-{\pi}_{R}^{2}{\pi}_{Y}{S}_{R}-{\pi}_{A}{\pi}_{G}{\pi}_{Y}V}\right)\text{(58)}$$
$${a}_{Y}t\text{}=\text{}\mathrm{ln}\left(\frac{2{\pi}_{C}{\pi}_{T}{\pi}_{R}{\pi}_{Y}}{2{\pi}_{C}{\pi}_{T}{\pi}_{R}{\pi}_{Y}-{\pi}_{Y}^{2}{\pi}_{R}{S}_{Y}-{\pi}_{C}{\pi}_{T}{\pi}_{R}V}\right)\text{(59)}$$
$$\beta t\text{}=\text{}-\mathrm{ln}\left(1-\frac{V}{2{\pi}_{R}{\pi}_{Y}}\right)\text{(60)}$$
If one wishes to express D_{TN93} in S_{R}, S_{Y} and V, then one may just substitute γ_{R}t, γ_{Y}t, and βt into Eq. (56), which yields:
$${D}_{TN93}\text{}=\text{}\frac{2{\pi}_{A}{\pi}_{G}\left[{\pi}_{Y}\mathrm{ln}({x}_{1})+\mathrm{ln}({x}_{2})\right]}{{\pi}_{R}}+\frac{2{\pi}_{C}{\pi}_{T}\left[{\pi}_{R}\mathrm{ln}({x}_{1})+\mathrm{ln}({x}_{3})\right]}{{\pi}_{Y}}-2{\pi}_{R}{\pi}_{Y}{x}_{1}\text{(61)}$$
Where
${x}_{1}\text{}=\text{}1-\frac{V}{2{\pi}_{R}{\pi}_{Y}}$
${x}_{2}\text{}=\text{}\frac{2{\pi}_{A}{\pi}_{G}{\pi}_{R}{\pi}_{Y}}{2{\pi}_{A}{\pi}_{G}{\pi}_{R}{\pi}_{Y}-{S}_{R}{\pi}_{R}^{2}{\pi}_{Y}-{\pi}_{A}{\pi}_{G}{\pi}_{Y}V}$
${x}_{3}\text{}=\text{}\frac{2{\pi}_{C}{\pi}_{T}{\pi}_{R}{\pi}_{Y}}{2{\pi}_{C}{\pi}_{T}{\pi}_{R}{\pi}_{Y}-{S}_{Y}{\pi}_{Y}^{2}{\pi}_{R}-{\pi}_{C}{\pi}_{T}{\pi}_{R}V}\text{(62)}$
To illustrate the application of D_{TN93} with the two aligned sequences in Figure 2, we have π_{A} = 6/48, π_{C} = 12/48, π_{G} = 10/48, π_{T} = 20/48, S_{Y} = 4/24, S_{R} = 0, V = 2/24, α_{R}t = 0.13353, &α_{Y}t = 0.90593, βt = 0.20764, γ_{R}t = α_{R}t – βt = -0.07411, γ_{Y}t = α_{R}t – βt = 0.69829, D_{TN93} = 0.35299. The variance of the D_{TN93} can be obtained by either the delta method or the method using Fisher information matrix. Note that S_{R} = 0 means no information for estimating α_{R}t properly.
I should mention that all distance formulations in this paper are known as Independently Estimated (IE) distances because they use information from only two aligned sequences and are independent of other pairs of sequences. Practical molecular phylogenetic analysis typically would use Simultaneously Estimated (SE) distances [12,13] which use information from all pairs of sequences. SE distances are implemented in MEGA [14] and DAMBE [11,15]. The PhyPA [16] function in DAMBE, which performs phylogenetic reconstruction base on pairwise alignment when reliable multiple sequence alignment is difficult to obtain for highly diverged sequences, uses SE distances only.
In short, the approach of deriving transition probabilities by probability reasoning can go a long way if one can do good bookkeeping. In particularly, the probability reasoning approach is very useful for conceptual understanding. However, the approach becomes increasing difficult with more complicated substitution models. Two alternative approaches, one involving solving differential equations and the other involving matrix exponential and logarithms, are often used in practical computation with the GTR model for nucleotide sequences and amino acid-based substitution models. They will be numerically illustrated elsewhere.
This study is funded by the Discovery Grant from Natural Science and Engineering Research Council of Canada (RGPIN/261252-2013). I thank C. Vlasschaert and S. Aris-Brosou for feedback.